nips nips2001 nips2001-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ilya Nemenman, F. Shafee, William Bialek
Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers $\{q_i\}$, $i = 1, 2, \ldots, K$, where $K$ is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any $q_i$ and $q_j$ should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times $n_i$ each possibility is observed in a set of $N = \sum_{i=1}^{K} n_i$ samples. This problem arises in the context of language, where the index $i$ might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index $i$ might label $n$–mers of the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are determined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit $N \gg K$. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching $N \gg K$ with $K$ the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case?
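The Occam factor described above can be computed in closed form in exactly the discrete setting of this paper: for a multinomial likelihood with a symmetric Dirichlet prior, the integral over the simplex reduces to ratios of Gamma functions. The following Python sketch is our own illustration (the bin counts and the padding to a larger $K$ are hypothetical, not taken from the paper); it shows how the phase-space volume automatically penalizes a model with many more bins than the data support:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, beta=1.0):
    """Log marginal likelihood of ordered count data under a symmetric
    Dirichlet(beta) prior: the 'Occam factor' integral over the simplex
    evaluated in closed form via Gamma functions."""
    counts = np.asarray(counts, dtype=float)
    K, N = counts.size, counts.sum()
    return (gammaln(K * beta) - gammaln(N + K * beta)
            + np.sum(gammaln(counts + beta) - gammaln(beta)))

# A model with many more bins than the data support is penalized
# automatically, even though it can fit the observed counts exactly.
data = np.array([7, 3, 5, 5])                  # counts in 4 bins
padded = np.concatenate([data, np.zeros(96)])  # same data, K = 100
print(log_evidence(data), log_evidence(padded))
```

The padded model can represent the observed counts just as well, yet its log evidence is much lower, because most of its prior volume lies over distributions the data rule out.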
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime $N \gg K$ requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when $N \leq K$, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by
$$\mathcal{P}_u(\{q_i\}) = \frac{1}{Z_u}\,\delta\!\left(1 - \sum_{i=1}^{K} q_i\right), \qquad Z_u = \int_{\mathcal{A}} dq_1\, dq_2 \cdots dq_K\; \delta\!\left(1 - \sum_{i=1}^{K} q_i\right), \qquad (1)$$
where the delta function imposes the normalization, $Z_u$ is the total volume in the space of models, and the integration domain $\mathcal{A}$ is such that each $q_i$ varies in the range $[0, 1]$. Note that, because of the normalization constraint, an individual $q_i$ chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one $q_i$ we constrain all the other $\{q_{j \neq i}\}$. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from $\{q_i\}$, then we calculate the probability of the model $\{q_i\}$ with the usual Bayes rule:
$$P(\{q_i\}|\{n_i\}) = \frac{P(\{n_i\}|\{q_i\})\,\mathcal{P}_u(\{q_i\})}{P_u(\{n_i\})}, \qquad P(\{n_i\}|\{q_i\}) = \prod_{i=1}^{K} (q_i)^{n_i}. \qquad (2)$$
If we want the best estimate of the probability $q_i$ in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7]
$$\langle q_i \rangle = \frac{n_i + 1}{N + K}. \qquad (3)$$
Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. (Footnote 1: If the data are unordered, extra combinatorial factors have to be included in $P(\{n_i\}|\{q_i\})$. However, these cancel immediately in later expressions.) A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so-called Dirichlet family of priors:
$$\mathcal{P}_\beta(\{q_i\}) = \frac{1}{Z(\beta)}\,\delta\!\left(1 - \sum_{i=1}^{K} q_i\right) \prod_{i=1}^{K} q_i^{\beta - 1}. \qquad (4)$$
It is interesting to see what typical distributions from these priors look like. Even though different $q_i$'s are not independent random variables due to the normalizing $\delta$–function, generation of random distributions is still easy: one can show that if $q_i$'s are generated successively (starting from $i = 1$ and proceeding up to $i = K$) from the Beta–distribution …
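As a concrete companion to Eqs. (3) and (4), here is a minimal Python sketch (our own; the bin counts and $\beta$ values are made up for illustration). It computes the posterior-mean estimate under a symmetric Dirichlet prior, which reduces to Laplace's rule at $\beta = 1$, and then samples typical distributions from the prior of Eq. (4) to show how sharply their entropies cluster for each fixed $\beta$, which is the singular entropy prior described above:

```python
import numpy as np

def posterior_mean(counts, beta=1.0):
    """Posterior mean of q_i under the symmetric Dirichlet(beta) prior of
    Eq. (4); beta = 1 recovers Laplace's 'extra count in every bin' rule,
    Eq. (3)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + beta) / (counts.sum() + beta * counts.size)

print(posterior_mean([3, 0, 1, 0]))  # unseen bins get nonzero probability

# Typical distributions drawn from the prior of Eq. (4): for each fixed
# beta, the entropies of the samples concentrate around a single value.
K, rng = 1000, np.random.default_rng(0)
for beta in (0.01, 0.1, 1.0):
    q = rng.dirichlet(beta * np.ones(K), size=200)
    H = -np.sum(q * np.log2(np.clip(q, 1e-300, None)), axis=1)
    print(f"beta={beta}: entropy = {H.mean():.2f} +/- {H.std():.2f} bits")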
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 68 nips-2001-Entropy and Inference, Revisited
2 0.11061838 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model $M_{\vec n}$ learned by the algorithm using $n_i$ examples in step $i$ ($\vec n = (n_1, \ldots, n_m)$), and the model $M_\infty$ that would be learned using infinite examples. Upper-bound the loss $L(M_{\vec n}, M_\infty)$ between them as a function of $\vec n$, and then minimize the algorithm's time complexity $f(\vec n)$ subject to the constraint that $L(M_\infty, M_{\vec n})$ be at most $\epsilon$ with probability at most $\delta$. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand, they require large computational resources to learn from. While in the past the factor limiting the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is the time required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable $x$. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most $\epsilon$ with probability at most $\delta$, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of $x$, only $n = \frac{1}{2}(R/\epsilon)^2 \ln(2/\delta)$ samples are needed, where $R$ is $x$'s range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps:
1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm.
2. Derive an upper bound on the time complexity of the learning algorithm, as a function of the number of samples used in each step.
3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss.
In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2] and k-means clustering [3]. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8]. Many approaches to speeding it up have been proposed (see Thiesson et al. [6] for a survey). Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5]: rather than minimize the total number of samples needed by the algorithm, we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 2 A Loss Bound for EM In a mixture of Gaussians model, each $D$-dimensional data point $x_j$ is assumed to have been independently generated by the following process: 1) randomly choose a mixture component $k$; 2) randomly generate a point from it according to a Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. In this paper we will restrict ourselves to the case where the number $K$ of mixture components and the probability of selection $P(\mu_k)$ and covariance matrix for each component are known. Given a training set $S = \{x_1, \ldots, x_N\}$, the learning goal is then to find the maximum-likelihood estimates of the means $\mu_k$. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability $p(\mu_k|x_j)$ that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means $\hat\mu_k = \sum_{j=1}^{N} w_{jk} x_j / \sum_{j=1}^{N} w_{jk}$ (the M step), where $w_{jk} = p(\mu_k|x_j)$ from the previous E step. In the basic EM algorithm, all $N$ examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only $n_i < N$ examples in the $i$th iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large $N$. Let $M_{\vec n} = (\hat\mu_1, \ldots, \hat\mu_K)$ be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using $n_i$ examples in iteration $i$), and let $M_\infty = (\mu_1, \ldots, \mu_K)$ be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between $M_{\vec n}$ and $M_\infty$. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones:
$$L(M_{\vec n}, M_\infty) = \sum_{k=1}^{K} \|\hat\mu_k - \mu_k\|^2 = \sum_{k=1}^{K} \sum_{d=1}^{D} |\hat\mu_{kd} - \mu_{kd}|^2 \qquad (1)$$
where $\hat\mu_{kd}$ is the $d$th coordinate of $\hat\mu_k$, and similarly for $\mu_{kd}$. After any given iteration of EM, $|\hat\mu_{kd} - \mu_{kd}|$ has two components. One, which we call the sampling error, derives from the fact that $\hat\mu_{kd}$ is estimated from a finite sample, while $\mu_{kd}$ is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights $w_{jk}$ used to compute the two estimates may differ. Let $\mu_{kdi}$ be the infinite-data estimate of the $d$th coordinate of the $k$th mean produced in iteration $i$, $\hat\mu_{kdi}$ be the corresponding finite-data estimate, and $\tilde\mu_{kdi}$ be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration $i$ is $|\tilde\mu_{kdi} - \mu_{kdi}|$, the weighting error is $|\hat\mu_{kdi} - \tilde\mu_{kdi}|$, and the total error is $|\hat\mu_{kdi} - \mu_{kdi}| \le |\hat\mu_{kdi} - \tilde\mu_{kdi}| + |\tilde\mu_{kdi} - \mu_{kdi}|$. Given bounds on the total error of each coordinate of each mean after iteration $i-1$, we can derive a bound on the weighting error after iteration $i$ as follows. Bounds on $\mu_{kd,i-1}$ for each $d$ imply bounds on $p(x_j|\mu_{ki})$ for each example $x_j$, obtained by substituting the maximum and minimum allowed distances between $x_{jd}$ and $\mu_{kd,i-1}$ into the expression of the Gaussian distribution. Let $p^+_{jki}$ be the upper bound on $p(x_j|\mu_{ki})$, and $p^-_{jki}$ be the lower bound. Then the weight of example $x_j$ in mean $\mu_{ki}$ can be bounded from below by $w^-_{jki} = p^-_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^+_{jk'i} P(\mu_{k'})$, and from above by $w^+_{jki} = \min\{p^+_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^-_{jk'i} P(\mu_{k'}),\ 1\}$. Let $w^{(+)}_{jki} = w^+_{jki}$ if $x_j \ge 0$ and $w^-_{jki}$ otherwise, and let $w^{(-)}_{jki} = w^-_{jki}$ if $x_j \ge 0$ and $w^+_{jki}$ otherwise. Then the weighting error $|\hat\mu_{kdi} - \tilde\mu_{kdi}|$ can be bounded in terms of these weight bounds …
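The Hoeffding bound quoted in this abstract, $n = \frac{1}{2}(R/\epsilon)^2 \ln(2/\delta)$, is easy to evaluate directly; a small sketch (the function name and the example figures are ours):

```python
import math

def hoeffding_sample_size(value_range, eps, delta):
    """Samples needed so the empirical mean of a variable with range R is
    within eps of the true mean with probability at least 1 - delta:
    n = (1/2) * (R / eps)**2 * ln(2 / delta)."""
    return math.ceil(0.5 * (value_range / eps) ** 2 * math.log(2 / delta))

# e.g., a [0, 1]-valued variable, 1% accuracy, 95% confidence
print(hoeffding_sample_size(value_range=1.0, eps=0.01, delta=0.05))  # 18445
```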
3 0.10812561 61 nips-2001-Distribution of Mutual Information
Author: Marcus Hutter
Abstract: The mutual information of two random variables $i$ and $j$ with joint probabilities $\{\pi_{ij}\}$ is commonly used in learning Bayesian nets as well as in many other fields. The chances $\pi_{ij}$ are usually estimated by the empirical sampling frequency $n_{ij}/n$, leading to a point estimate $I(n_{ij}/n)$ for the mutual information. To answer questions like …
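For reference, the point estimate $I(n_{ij}/n)$ mentioned here is the standard plug-in estimator; a minimal sketch follows (the contingency table is hypothetical, and note that the paper's actual contribution is the distribution of $I$, which this sketch does not compute):

```python
import numpy as np

def plugin_mutual_information(n_ij):
    """Point estimate I(n_ij / n) of mutual information (in bits) from a
    contingency table of counts, using empirical frequencies."""
    n_ij = np.asarray(n_ij, dtype=float)
    p = n_ij / n_ij.sum()
    pi, pj = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = p * np.log2(p / (pi * pj))
    return np.nansum(terms)  # empty cells contribute zero

print(plugin_mutual_information([[30, 10], [5, 55]]))
```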
4 0.087132119 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds
Author: B. D. Wright, Kamal Sen, William Bialek, A. J. Doupe
Abstract: In nature, animals encounter high dimensional sensory stimuli that have complex statistical and dynamical structure. Attempts to study the neural coding of these natural signals face challenges both in the selection of the signal ensemble and in the analysis of the resulting neural responses. For zebra finches, naturalistic stimuli can be defined as sounds that they encounter in a colony of conspecific birds. We assembled an ensemble of these sounds by recording groups of 10-40 zebra finches, and then analyzed the response of single neurons in the songbird central auditory area (field L) to continuous playback of long segments from this ensemble. Following methods developed in the fly visual system, we measured the information that spike trains provide about the acoustic stimulus without any assumptions about which features of the stimulus are relevant. Preliminary results indicate that large amounts of information are carried by spike timing, with roughly half of the information accessible only at time resolutions better than 10 ms; additional information is still being revealed as time resolution is improved to 2 ms. Information can be decomposed into that carried by the locking of individual spikes to the stimulus (or modulations of spike rate) vs. that carried by timing in spike patterns. Initial results show that in field L, temporal patterns give at least % extra information. Thus, single central auditory neurons can provide an informative representation of naturalistic sounds, in which spike timing may play a significant role.
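A toy version of the "discrete words" construction described here can be sketched as follows (Python; the time resolution, word length, and the fake spike train are our own illustrative choices, and a real analysis must also correct this naive entropy estimate for sampling bias):

```python
import numpy as np
from collections import Counter

def word_entropy(spike_times, t_max, dt=0.002, word_len=8):
    """Naive entropy (bits) of the distribution of binary 'words' obtained
    by discretizing a spike train at resolution dt and sliding a window of
    word_len letters."""
    bins = np.histogram(spike_times, bins=int(t_max / dt), range=(0, t_max))[0] > 0
    words = ["".join("1" if b else "0" for b in bins[i:i + word_len])
             for i in range(len(bins) - word_len + 1)]
    counts = np.array(list(Counter(words).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

rng = np.random.default_rng(1)
spikes = np.sort(rng.uniform(0, 10.0, size=400))  # fake 40 Hz spike train
print(word_entropy(spikes, t_max=10.0, dt=0.002))
```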
5 0.082752548 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
Author: Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: Estimating the parameters of sparse multinomial distributions is an important component of many statistical learning tasks. Recent approaches have used uncertainty over the vocabulary of symbols in a multinomial distribution as a means of accounting for sparsity. We present a Bayesian approach that allows weak prior knowledge, in the form of a small set of approximate candidate vocabularies, to be used to dramatically improve the resulting estimates. We demonstrate these improvements in applications to text compression and estimating distributions over words in newsgroup data.
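One way to read this idea in code: score a few candidate vocabulary sizes by their marginal likelihood and average the within-vocabulary Laplace estimates under the resulting posterior. This is only our rough sketch of the general approach, assuming a flat Dirichlet(1) prior inside each candidate vocabulary and equal prior weight on the candidates; the paper's actual estimator may differ in detail:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(counts, vocab_size):
    """Log marginal likelihood of observed counts under a uniform
    Dirichlet(1) prior on a vocabulary of the given size; unobserved
    symbols have zero counts and gammaln(1) = 0, so they drop out."""
    N = counts.sum()
    return (gammaln(vocab_size) - gammaln(N + vocab_size)
            + np.sum(gammaln(counts + 1)))

def mixture_estimate(counts, candidate_sizes, symbol):
    """Posterior-mean probability of one observed symbol, averaging
    Laplace estimates over candidate vocabulary sizes (each candidate
    size must be at least the number of observed symbols)."""
    counts = np.asarray(counts, dtype=float)
    log_post = np.array([log_evidence(counts, K) for K in candidate_sizes])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    ests = [(counts[symbol] + 1) / (counts.sum() + K) for K in candidate_sizes]
    return float(np.dot(post, ests))

counts = np.array([12, 7, 1])  # three observed symbols
print(mixture_estimate(counts, [3, 10, 100], symbol=0))
```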
6 0.081678115 43 nips-2001-Bayesian time series classification
7 0.070475824 3 nips-2001-ACh, Uncertainty, and Cortical Inference
8 0.058780491 107 nips-2001-Latent Dirichlet Allocation
9 0.057205461 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
10 0.057045918 183 nips-2001-The Infinite Hidden Markov Model
11 0.053729177 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
12 0.053682502 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images
13 0.051481996 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
14 0.051440213 58 nips-2001-Covariance Kernels from Bayesian Generative Models
15 0.050523099 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
16 0.048109666 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
17 0.046892375 13 nips-2001-A Natural Policy Gradient
18 0.046674576 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
19 0.046480525 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
20 0.045543917 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
simIndex simValue paperId paperTitle
same-paper 1 0.96821117 68 nips-2001-Entropy and Inference, Revisited
2 0.74434882 61 nips-2001-Distribution of Mutual Information
3 0.63303846 114 nips-2001-Learning from Infinite Data in Finite Time
4 0.50292879 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
5 0.48255485 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . 
The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. 
To define the similarity of the data distributions based on LogP scores in a purely non-parametric way, we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics use empirical cumulative distribution functions (CDFs) to estimate the distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set $D_t$ to that of the training set $D_0$ and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible, since we do not assume that the estimated model and the true data-generating process share the same functional form (see Section 3). Consequently, we need to consider the specific KS score in relation to the natural variability of the true data-generating distribution. In the situation with streaming data, the model is estimated over the in-sample data $D_0$. Then the individual in-sample data sets $D_1, D_2, \ldots, D_{t_0}$ are used to estimate the natural variability of the KS statistics. This variability needs to be quantified because the model may not truly match the data distribution. Once the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new data set $D_t$, $t > t_0$, are computed. Using principled heuristics, one can then determine whether or not the LogP signature for $D_t$ is significantly different from the LogP signatures for the in-sample data. To clarify the various steps, we provide an algorithmic description of the change detection process (a code sketch of both algorithms is given below).

Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an "in-sample" data set $D_0 = \{D_1, D_2, \ldots, D_{t_0}\}$, proceed as follows:

1. Estimate the parameters $\Theta_0$ of the mixture model $P(D|\Theta)$ over $D_0$ (see equation (1)).
2. Compute
$$\mathrm{LogP}(D_i) = \sum_{\hat{\imath}=1}^{n_i} \log P(x_{\hat{\imath}}|\Theta_0), \quad x_{\hat{\imath}} \in D_i, \; n_i = |D_i|, \; i = 1, \ldots, t_0. \qquad (2)$$
3. For $1 \le i, j \le t_0$, compute $L_{KS}(i,j) = \log\left[P_{KS}(D_i, D_j)\right]$. See [7] for details on the $P_{KS}$ computation.
4. For $1 \le i \le t_0$, compute the KS measure $M_{KS}(i)$ as
$$M_{KS}(i) = \frac{\sum_{j=1}^{t_0} L_{KS}(i,j)}{t_0}. \qquad (3)$$
5. Compute $\mu_M = \mathrm{Mean}[M_{KS}(i)]$ and $\sigma_M = \mathrm{STD}[M_{KS}(i)]$ to quantify the natural variability of $M_{KS}$ over the "in-sample" data.

Algorithm 2 (Evaluating New Data): Given a new data set $D_t$, $t > t_0$, and $\mu_M$, $\sigma_M$, proceed as follows:

1. Compute $\mathrm{LogP}(D_t)$ as in (2).
2. For $1 \le i \le t_0$, compute $L_{KS}(i,t)$.
3. Compute $M_{KS}(t)$ as in (3).
4. Apply the decision criterion using $M_{KS}(t)$, $\mu_M$, and $\sigma_M$ to determine whether or not $\Theta_0$ is a good fit for the new data. For example, if
$$\frac{|M_{KS}(t) - \mu_M|}{\sigma_M} > 3, \qquad (4)$$
then $\Theta_0$ is not a good fit any more.

Note, however, that the 3-σ interval can be interpreted as a confidence interval only in the limit as the number of data sets goes to infinity. In the applications presented in this paper we certainly do not have that condition satisfied, so we consider this approach an "educated heuristic" (gaining firm statistical grounds in the limit).
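The two algorithms are short enough to sketch directly. The Python outline below is our reading of the steps above, not the authors' implementation; `logp_fn` stands for any per-point scoring function (such as the mixture LogP sketch earlier), the function names are ours, and the small constant guards against `log(0)`:

```python
# Sketch of Algorithms 1 and 2, using scipy's two-sample KS test for P_KS.
import numpy as np
from scipy.stats import ks_2samp

def quantify_variance(logp_fn, in_sample):
    """Algorithm 1: return in-sample LogP signatures plus mu_M and sigma_M."""
    scores = [logp_fn(D) for D in in_sample]                   # step 2
    t0 = len(scores)
    lks = np.array([[np.log(ks_2samp(scores[i], scores[j]).pvalue + 1e-300)
                     for j in range(t0)] for i in range(t0)])  # step 3
    mks = lks.mean(axis=1)                                     # step 4, eq. (3)
    return scores, mks.mean(), mks.std()                       # step 5

def flag_change(logp_fn, D_t, in_scores, mu_M, sigma_M):
    """Algorithm 2: True if Theta_0 no longer fits D_t by criterion (4)."""
    s_t = logp_fn(D_t)
    m_t = np.mean([np.log(ks_2samp(s, s_t).pvalue + 1e-300) for s in in_scores])
    return abs(m_t - mu_M) / sigma_M > 3
```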
2.1 Space and Time Complexity of the Methodology

The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. To assess the time complexity, we use the following notation: $n_t = |D_t|$ is the number of data points in the data set $D_t$; $t_0$ is the index of the last in-sample data set, but also the number of in-sample data sets; $n_0 = |D_0| = \sum_{t=1}^{t_0} n_t$ is the total number of in-sample data points (over all the in-sample data sets); $n = n_0/t_0$ is the average number of data points per in-sample data set. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is, $n_t \approx n$. The analysis presented here does not take into account the time and space complexity needed to estimate the parameters $\Theta$ of the mixture model (1).

In the first phase of the methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions), which has time complexity $O(n_0)$. Calculation of the KS statistic for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted; hence it has time complexity $2n + 2\,O(n \log n) = O(n \log n)$. Since we must calculate all the pairwise KS measures, this step has time complexity $t_0(t_0-1)/2 \cdot O(n \log n) = O(t_0^2\, n \log n)$. The in-sample mean and variance of the KS measure are obtained in time linear in $t_0$, hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep the LogP distributions for each of the in-sample data sets, as well as several scalars (e.g., the mean and variance of the in-sample KS measure), which requires $O(n_0)$ memory.

To score an out-of-sample data set $D_t$, $t > t_0$, we must first obtain the LogP distribution of $D_t$, which has time complexity $O(n)$, and then calculate the KS measure relative to each of the in-sample data sets, which has time complexity $O(n \log n)$ per in-sample data set, or $t_0 \cdot O(n \log n) = O(t_0\, n \log n)$ for the full in-sample period. The LogP distribution for $D_t$ can be discarded once the pairwise KS measures are obtained.

Figure 1: Histograms of LogP scores for two data sets generated from the first model (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fitted on the in-sample period.

Overall, the proposed methodology requires $O(n_0)$ memory, $O(t_0^2\, n \log n)$ time for preprocessing, and $O(t_0\, n \log n)$ time for out-of-sample evaluation. Further, since $t_0$ is typically a small constant (e.g., $t_0 = 7$ or $t_0 = 30$), the out-of-sample evaluation practically has time complexity $O(n \log n)$.

3 Experimental Setup

The experiments presented consist of two parts: experiments on synthetic data and experiments on aggregations over real web-log data.

3.1 Experiments on Synthetic Data

Synthetic data is a valuable tool when determining both the applicability and the limitations of the proposed approach. Synthetic data was generated by sampling from a two-component CI model (the true model is not used in the evaluations). The data consist of a two-state discrete dimension and a continuous dimension.
The first 100 data sets were generated by sampling from a mixture model with parameters: weights $[\pi_1, \pi_2] = [0.6, 0.4]$; discrete state probabilities $\theta_1 = [0.8, 0.2]$ and $\theta_2 = [0.4, 0.6]$; and means and variances $[\mu_1, \sigma_1^2] = [10, 5]$ and $[\mu_2, \sigma_2^2] = [0, 7]$ for the (Gaussian) continuous variable. Then the discrete-dimension probability of the second cluster was changed from $\theta_2 = [0.4, 0.6]$ to $\theta_2 = [0.5, 0.5]$, keeping the remaining parameters fixed, and an additional 100 data sets were generated by sampling from this altered model (a sketch of this generator is given at the end of this subsection). This is a fairly small change in the distribution, and the underlying LogP scores appear very similar, as can be seen in Figure 1. The figure shows the LogP distributions for the first two data sets generated from the first model (top row) and the first two data sets generated from the second model (bottom row). Plots within each row should be more similar than plots from different rows, but this is difficult to discern by visual inspection.

Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that the y-axes have different scales in order to show the full variability of the data.

Algorithms 1 and 2 were evaluated by using the first 10 data sets to estimate a two-component model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows the average KS measures over the in-sample data sets (the first 10) for four experiments, varying the number of data points in each experiment. Note that the vertical axes differ across the plots to better show the range of values. As the number of data points per data set increases, the change that occurs at $t = 101$ becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered small compared to the number of data points in our real-life applications, we expect to be able to detect such slight distribution changes.
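For reference, here is a sketch of the synthetic generator described above (the function name and defaults are ours; the parameter values are those quoted in the text):

```python
# Sketch of the two-component CI generator: one two-state discrete dimension
# and one Gaussian continuous dimension, independent given the component.
import numpy as np

def sample_ci(n, rng, pi=(0.6, 0.4),
              disc=((0.8, 0.2), (0.4, 0.6)),      # theta_1, theta_2
              mu=(10.0, 0.0), var=(5.0, 7.0)):    # Gaussian means/variances
    k = rng.choice(2, size=n, p=pi)                           # component label
    d = np.array([rng.choice(2, p=disc[ki]) for ki in k])     # discrete dim
    c = rng.normal(np.take(mu, k), np.sqrt(np.take(var, k)))  # continuous dim
    return np.column_stack([d, c])

rng = np.random.default_rng(2)
before = [sample_ci(50_000, rng) for _ in range(100)]
# The altered model changes theta_2 from (0.4, 0.6) to (0.5, 0.5).
after = [sample_ci(50_000, rng, disc=((0.8, 0.2), (0.5, 0.5))) for _ in range(100)]
```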
3.2 Experiments on Real-Life Data

Figure 3 shows a distribution for a typical day from a content web site. There are almost 50,000 data points in the data set, with over 100 dimensions each. The LogP score distribution is similar to that of the synthetic data in Figure 1, which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over the in-sample data has much lower values (see Figure 3, right, and the plots in Figure 2). Another way to phrase this observation is to note that since the true data-generating distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower, since there are two factors of dissimilarity: 1) different data sets; 2) the inability of the CI model to capture all aspects of the true data distribution.

Figure 3: Left: distribution of 42,655 LogP scores from a mixture of conditional independence models. The data is a single day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31-day in-sample period for a content web site, showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology.

Nonetheless, the first 31 data sets (one month of data) that were used to build the initial model $\Theta_0$ can be used to define the natural variability of the KS measures against which additional data sets can be compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verified by the commercial web site operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond the scope of this paper.

4 Related Work

Automatic detection of various types of data changes appears in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]), where the goal is not only to find unusual data points, but ones that appear not to have been generated by the data-generating distribution. A related problem has been addressed by [2] in the context of time series modeling, where outliers and trends can contaminate the model estimation. More recently, mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addresses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change, but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as new data becomes available (for variants and additional references see, e.g., [6]). In that framework there is no such concept as a data distribution change, since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift.

5 Conclusions

In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature, we further showed how to compare different data sets relative to the model using KS statistics, and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases.
Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system that automatically re-estimates the parameters of a mixture model on an "as-needed" basis, i.e., when the model fails to adequately represent the data after a certain point in time.

References

[1] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, 1984.
[2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996.
[3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001.
[4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001.
[5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999.
[6] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.
[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992.
[8] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
6 0.46150574 3 nips-2001-ACh, Uncertainty, and Cortical Inference
7 0.43763968 43 nips-2001-Bayesian time series classification
8 0.39997682 42 nips-2001-Bayesian morphometry of hippocampal cells suggests same-cell somatodendritic repulsion
9 0.39010733 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds
10 0.38360974 117 nips-2001-MIME: Mutual Information Minimization and Entropy Maximization for Bayesian Belief Propagation
11 0.38303611 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
12 0.38204303 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA
13 0.37175554 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
14 0.36625108 107 nips-2001-Latent Dirichlet Allocation
15 0.35116664 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions
16 0.34600484 108 nips-2001-Learning Body Pose via Specialized Maps
17 0.34534854 31 nips-2001-Algorithmic Luckiness
18 0.34188488 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task
19 0.33865032 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines
20 0.32224652 143 nips-2001-PAC Generalization Bounds for Co-training
topicId topicWeight
[(14, 0.03), (17, 0.022), (19, 0.049), (27, 0.14), (30, 0.071), (36, 0.012), (38, 0.021), (59, 0.026), (72, 0.082), (79, 0.037), (91, 0.205), (94, 0.224)]
simIndex simValue paperId paperTitle
same-paper 1 0.89317894 68 nips-2001-Entropy and Inference, Revisited
Author: Ilya Nemenman, F. Shafee, William Bialek
Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of "maximum entropy inference" is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime $N \gg K$ requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when $N \le K$, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime.

At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural "uniform" prior is given by
$$P_u(\{q_i\}) = \frac{1}{Z_u}\,\delta\left(1 - \sum_{i=1}^{K} q_i\right), \qquad Z_u = \int_{\mathcal{A}} dq_1\, dq_2 \cdots dq_K\, \delta\left(1 - \sum_{i=1}^{K} q_i\right), \qquad (1)$$
where the delta function imposes the normalization, $Z_u$ is the total volume in the space of models, and the integration domain $\mathcal{A}$ is such that each $q_i$ varies in the range $[0,1]$. Note that, because of the normalization constraint, an individual $q_i$ chosen from this distribution in fact is not uniformly distributed; this is also an example of phase space effects, since in choosing one $q_i$ we constrain all the other $\{q_{j \ne i}\}$. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from $\{q_i\}$, then we calculate the probability of the model $\{q_i\}$ with the usual Bayes rule:¹
$$P(\{q_i\}|\{n_i\}) = \frac{P(\{n_i\}|\{q_i\})\, P_u(\{q_i\})}{P_u(\{n_i\})}, \qquad P(\{n_i\}|\{q_i\}) = \prod_{i=1}^{K} (q_i)^{n_i}. \qquad (2)$$
If we want the best estimate of the probability $q_i$ in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7]
$$q_i = \frac{n_i + 1}{N + K}. \qquad (3)$$
Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an "extra count" in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero.

¹ If the data are unordered, extra combinatorial factors have to be included in $P(\{n_i\}|\{q_i\})$. However, these cancel immediately in later expressions.

A natural generalization of these ideas is to consider priors that have a power-law dependence on the probabilities, the so-called Dirichlet family of priors:
$$P_\beta(\{q_i\}) = \frac{1}{Z(\beta)}\,\delta\left(1 - \sum_{i=1}^{K} q_i\right) \prod_{i=1}^{K} q_i^{\beta - 1}. \qquad (4)$$
It is interesting to see what typical distributions from these priors look like. Even though different $q_i$'s are not independent random variables due to the normalizing δ-function, generation of random distributions is still easy: one can show that the $q_i$'s can be generated successively (starting from $i = 1$ and proceeding up to $i = K$) from appropriately rescaled Beta distributions.
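Both Laplace's rule (3) and the behavior of typical distributions under the Dirichlet priors (4) are easy to see numerically. The short sketch below is our illustration (the example counts are arbitrary), and it makes the paper's central point concrete: a "uniform" prior on distributions is a sharply concentrated, nearly singular prior on the entropy.

```python
# Sketch only: Laplace's add-one estimator and entropies of distributions
# drawn from a symmetric Dirichlet prior.
import numpy as np

rng = np.random.default_rng(3)

# Equation (3): observed counts plus one extra count in every bin.
counts = np.array([3, 0, 1, 0, 0])                   # N = 4 samples, K = 5 bins
q_hat = (counts + 1) / (counts.sum() + len(counts))  # unseen bins get mass > 0

# Typical distributions under the prior of eq. (4) with K bins.
K, beta = 1000, 1.0                                  # beta = 1 is the uniform prior
q = rng.dirichlet(np.full(K, beta), size=200)
H = -np.sum(q * np.log(q + 1e-300), axis=1)
print(H.mean(), H.std())  # the entropies cluster tightly around their mean
```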
2 0.75911307 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.
3 0.7586804 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
Author: Jens Kohlmorgen, Steven Lemm
Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system.
4 0.75797755 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
5 0.75783551 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
Author: Roni Khardon, Dan Roth, Rocco A. Servedio
Abstract: We study online learning in Boolean domains using kernels which capture feature expansions equivalent to using conjunctions over basic features. We demonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Perceptron algorithm over an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. We also consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions. While known upper bounds imply that Winnow can learn DNF formulae with a polynomial mistake bound in this setting, we prove that it is computationally hard to simulate Winnow’s behavior for learning DNF over such a feature set, and thus that such kernel functions for Winnow are not efficiently computable.
7 0.75423777 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
8 0.75370193 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
9 0.75279659 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
10 0.75263208 89 nips-2001-Grouping with Bias
11 0.75130665 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
12 0.74997497 143 nips-2001-PAC Generalization Bounds for Co-training
13 0.74876082 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
14 0.74855137 84 nips-2001-Global Coordination of Local Linear Models
15 0.74748582 36 nips-2001-Approximate Dynamic Programming via Linear Programming
16 0.74748164 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
17 0.74743402 57 nips-2001-Correlation Codes in Neuronal Populations
18 0.74446046 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
19 0.7444461 144 nips-2001-Partially labeled classification with Markov random walks
20 0.74439615 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's