nips nips2001 nips2001-41 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. [sent-12, score-0.644]
2 In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. [sent-13, score-0.854]
3 We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. [sent-14, score-0.281]
4 We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. [sent-15, score-0.859]
5 1 Introduction Transaction data sets consist of records of pairs of individuals and events, e.g. [sent-16, score-0.268]
6 items purchased (market basket data), telephone calls made (call records), or Web pages visited (from Web logs). [sent-18, score-0.45]
7 Of significant practical interest in many applications is the ability to derive individual-specific (or personalized) models for each individual from the historical transaction data. [sent-19, score-0.73]
8 In this paper we propose a generative model based on mixture models and Bayesian estimation for learning predictive profiles. [sent-22, score-0.366]
9 The mixture model is used to address the heterogeneity problem: different individuals purchase different combinations of products on different visits to the store. [sent-23, score-0.474]
10 The Bayesian estimation framework is used to address the fact that we have different amounts of data for different individuals. [sent-24, score-0.05]
11 For an individual with very few transactions (e.g., only one) we can “shrink” our predictive profile for that individual towards a general population profile. [sent-27, score-0.435]
12 On the other hand, for an individual with many transactions, their predictive model can be more individualized. [sent-28, score-0.391]
13 Our goal is an accurate and computationally efficient modeling framework that smoothly adapts a profile to each individual based on both their own historical data and more general population patterns. [sent-29, score-0.441]
14 The idea of using mixture models as a flexible approach for modeling discrete and categorical data has been known for many years (e.g., the latent structure analysis of Lazarsfeld and Henry). [sent-32, score-0.32]
15 Traditionally these methods were only applied to relatively small low-dimensional data sets. [sent-35, score-0.05]
16 More recently there has been a resurgence of interest in mixtures of multinomials and mixtures of conditionally independent Bernoulli models for modeling high-dimensional document-term data in text analysis (e.g., McCallum, 1999). [sent-36, score-0.345]
17 The work of Heckerman et al. (2000) on probabilistic model-based collaborative filtering is also similar in spirit to the approach described in this paper except that we focus on explicitly extracting individual-level profiles rather than global models (i.e. [sent-40, score-0.204]
18 , we have explicit models for each individual in our framework). [sent-42, score-0.296]
19 Our work can be viewed as an extension of this broad family of probabilistic modeling ideas to the specific case of transaction data, where we deal directly with the problem of making inferences about specific individuals and handling multiple transactions per individual. [sent-43, score-0.763]
20 Other approaches have also been proposed in the data mining literature for clustering and exploratory analysis of transaction data, but typically in a non-probabilistic framework (e.g., association rules, Agrawal et al., 1993). [sent-44, score-0.516]
21 The lack of an underlying probabilistic model (e.g., for association rule techniques) can make it difficult for these models to fully leverage the data for individual-level forecasting. [sent-50, score-0.092]
22 2 Mixture-Basis Models for Profiles We have an observed data set D = {D1 , . . . , DN }, with one data set per individual. [sent-51, score-0.05]
23 Each individual data set Di consists of one or more transactions for that customer, i.e., Di = {yi1 , . . . , yini }, [sent-55, score-0.576]
24 where yij is the jth transaction for customer i and ni is the total number of transactions observed for customer i. [sent-63, score-1.002]
25 The jth transaction for individual i, yij , consists of a description of the set of products (or a “market basket”) that was purchased at a specific time by customer i (and yi will be used to denote an arbitrary transaction from individual i). [sent-64, score-1.821]
26 For the purposes of the experiments described in this paper, each individual transaction yij is represented as a vector of C counts, yij = (nij1 , . . . , nijC ), [sent-65, score-0.998]
27 where nijc indicates how many items of type c are in transaction yij , 1 ≤ c ≤ C. [sent-71, score-0.878]
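As a concrete illustration (a minimal sketch of this representation; the toy numbers and variable names are ours, not the paper's), the data can be held as one count vector per transaction, grouped by individual:

```python
import numpy as np

C = 5  # number of item categories; the paper's experiments use C = 50 departments

# D maps each individual i to a list of transactions; each transaction y_ij is a
# length-C vector of counts n_ijc (how many items of category c were purchased).
D = {
    0: [np.array([2, 0, 1, 0, 0]),   # individual 0, transaction 1
        np.array([0, 0, 3, 1, 0])],  # individual 0, transaction 2
    1: [np.array([0, 4, 0, 0, 1])],  # individual 1 has a single observed transaction
}

n = {i: len(trans) for i, trans in D.items()}  # n_i: transactions per individual
print(n)  # {0: 2, 1: 1}
```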
28 We define a predictive profile as a probabilistic model p(yi ), i.e. [sent-72, score-0.167]
29 , a probability distribution on the items that individual i will purchase during a store-visit. [sent-74, score-0.563]
30 We propose a simple generative mixture model for an individual’s purchasing behavior, namely that a randomly selected transaction yi from individual i is generated by one of K components in a K-component mixture model. [sent-75, score-1.14]
31 The kth mixture component, 1 ≤ k ≤ K, is a specific model for generating the counts, and we can think of each of the K models as “basis functions” describing prototype transactions. [sent-76, score-0.292]
32 For example, in a clothing store, one might have a mixture component that acts as a prototype for suit-buying behavior, where the expected counts for items such as suits, ties, shirts, etc., would be relatively high. [sent-77, score-0.628]
33 There are several modeling choices for the component transaction models for generating item counts. [sent-79, score-0.596]
34 In this paper we choose a particularly simple memoryless multinomial model that operates as follows. [sent-80, score-0.11]
35 Conditioned on nij (the total number of items in the basket) each of the individual items is selected in a memoryless fashion by nij draws from a multinomial distribution Pk = (θk1 , . . . , θkC ). [sent-81, score-0.91]
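A small sketch of this generative process (hypothetical code, not the authors' implementation; the component-selection weights used here are generic mixture weights, which Section 2.1 below makes individual-specific):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transaction(weights, theta, n_ij):
    """Generate one transaction with n_ij items from a K-component mixture.

    weights : (K,) mixture weights over components (sums to 1)
    theta   : (K, C) multinomial parameters; row k is Pk = (theta_k1, ..., theta_kC)
    n_ij    : total number of items in the basket (conditioned on, as in the text)
    """
    k = rng.choice(len(weights), p=weights)   # which prototype behavior generated this visit
    return rng.multinomial(n_ij, theta[k])    # n_ij memoryless draws from Pk

# toy example: K = 2 prototype behaviors over C = 4 departments
theta = np.array([[0.70, 0.20, 0.05, 0.05],
                  [0.05, 0.05, 0.30, 0.60]])
print(sample_transaction(np.array([0.8, 0.2]), theta, n_ij=5))
```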
36 2 0 10 20 30 Department 40 50 0 0 10 20 30 Department 40 50 Figure 1: An example of 6 “basis” mixture components fit to retail transaction data. [sent-103, score-0.833]
37 Figure 1 shows an example of K = 6 such basis mixture components that have been learned from a large retail transaction data set (more details on learning will be discussed below). [sent-104, score-0.883]
38 Each window shows a different set of component probabilities Pk , each modeling a different type of transaction. [sent-105, score-0.125]
39 The components show a striking bimodal pattern in that the multinomial models appear to involve departments that are either above or below department 25, but there is very little probability mass that crosses over. [sent-106, score-0.251]
40 In fact the models are capturing the fact that departments numbered lower than 25 correspond to men’s clothing and those above 25 correspond to women’s clothing, and that baskets tend to be “tuned” to one set or the other. [sent-107, score-0.227]
41 2.1 Individual-Specific Weights We further assume that for each individual i there exists a set of K weights, and in the general case these weights are individual-specific, denoted by αi = (αi1 , . . . , αiK ). [sent-109, score-0.438]
42 Weight αik represents the probability that when individual i enters the store their transactions will be generated by component k. [sent-113, score-0.495]
43 Or, in other words, the αik ’s govern individual i’s propensity to engage in “shopping behavior” k (again, there are numerous possible generalizations, such as making the αik ’s have dependence over time, that we will not discuss here). [sent-114, score-0.254]
44 The αik ’s are in effect the profile coefficients for individual i, relative to the K component models. [sent-115, score-0.338]
45 This idea of individual-specific weights (or profiles) is a key component of our proposed approach. [sent-116, score-0.268]
46 The mixture component models Pk are fixed and shared across all individuals, providing a mechanism for borrowing of strength across individual data. [sent-117, score-0.567]
47 The individual weights are in principle allowed to freely vary for each individual within a K-dimensional simplex. [sent-118, score-0.692]
48 In effect the K weights can be thought of as basis coefficients that represent the location of individual i within the space spanned by the K basis functions (the component Pk multinomials). [sent-119, score-0.522]
49 This approach is quite similar in spirit to the recent probabilistic latent semantic analysis work of Hofmann (1999) on mixture models for text documents, where he proposes a general mixture model framework that represents documents as existing within a K-dimensional simplex of multinomial component models. [sent-120, score-0.637]
50 05 0 0 5 10 15 20 25 30 35 Figure 3: Inferred “effective” profiles from global weights, smoothed histograms, and individual-specific weights for the individual whose data was shown in Figure 2. [sent-134, score-0.636]
51 Given the component models and an individual's weights, the probability of a transaction yij is a mixture whose weights are specific to individual i: p(yij ) = Σ_{k=1}^{K} αik p(yij |k) = Σ_{k=1}^{K} αik Π_{c=1}^{C} (θkc)^{nijc} , [sent-135, score-0.438]
52 where θkc is the probability that the cth item is purchased given component k and nijc is the number of items of category c purchased by individual i during transaction ij. [sent-136, score-1.381]
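A hedged sketch of how this probability could be computed (working in the log domain for numerical stability; the function and variable names are ours, and the multinomial coefficient is omitted just as it is in the expression above):

```python
import numpy as np

def log_p_transaction(y_ij, alpha_i, theta):
    """log p(y_ij) = log sum_k alpha_ik * prod_c theta_kc ** n_ijc.

    y_ij    : (C,) count vector n_ij1..n_ijC for one transaction
    alpha_i : (K,) individual-specific weights
    theta   : (K, C) component multinomial parameters
    """
    log_p_given_k = y_ij @ np.log(theta).T   # (K,) log p(y_ij | k), up to a constant
    a = np.log(alpha_i) + log_p_given_k
    m = a.max()
    return m + np.log(np.exp(a - m).sum())   # log-sum-exp over the K components
```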
53 As an example of the application of these ideas, in Figure 2 the training data and test data for a particular individual are displayed. [sent-137, score-0.421]
54 Note that there is some predictability from training to test data, although the test data contains (for example) a purchase in department 14 (which was not seen in the training data). [sent-138, score-0.255]
55 Figure 3 compares the profiles inferred by three methods: (1) global mixture weights, where every individual is assigned the same marginal weights, i.e., αik = αk , (2) a maximum a posteriori (MAP) technique that smooths each individual’s training histogram with a population-based histogram, and (3) individual weights estimated in a Bayesian fashion that are “tuned” to the individual’s specific behavior. [sent-141, score-0.545]
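A minimal sketch of the kind of smoothing described in method (2); the exact prior strength and parameterization here are our assumptions, not the paper's:

```python
import numpy as np

def map_histogram_profile(ind_counts, pop_profile, strength=10.0):
    """Smooth an individual's training histogram toward a population histogram.

    ind_counts  : (C,) total item counts for this individual in the training period
    pop_profile : (C,) population-level item probabilities (sums to 1)
    strength    : equivalent sample size of the population prior; the result is the
                  MAP estimate under a Dirichlet prior with parameters
                  1 + strength * pop_profile.
    """
    return (ind_counts + strength * pop_profile) / (ind_counts.sum() + strength)
```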
56 One can see in Figure 3 that the global weight profile reflects broad population-based purchasing patterns and is not representative of this individual. [sent-144, score-0.177]
57 The smoothed histogram is somewhat better, but the smoothing parameter has “blurred” the individual’s focus on departments below 25. [sent-145, score-0.209]
58 The individual-weight profile appears to be a better representation of this individual’s behavior and indeed it does provide the best predictive score (of the 3 methods) on the test data in Figure 2. [sent-146, score-0.254]
59 Note that the individual-weights profile in Figure 3 “borrows strength” from the purchases of other similar customers, i.e. [sent-147, score-0.094]
60 , it allows for small but non-zero probabilities of the individual making purchases in departments (such as 6 through 9) if he or she has not purchased there in the past. [sent-149, score-0.571]
61 The individual’s estimated weight vector αi corresponds to the six component models shown in Figure 1. [sent-157, score-0.126]
62 The most weight is placed on components 2, 3 and 6, which agrees with our intuition given the individual’s training data. [sent-158, score-0.112]
63 2.2 Learning the Model Parameters The unknown parameters in our model consist of both the parameters of the K multinomials, θkc , 1 ≤ k ≤ K, 1 ≤ c ≤ C, and the vectors of individual-specific profile weights αi , 1 ≤ i ≤ N . [sent-160, score-0.184]
64 We investigate two different approaches to learning individual-specific weights: • Mixture-Based Maximum Likelihood (ML) Weights: We treat the weights αi and component parameters θ as unknown and use expectation-maximization (EM) to learn both simultaneously. [sent-161, score-0.268]
65 • Mixture-Based Empirical Bayes (EB) Weights: We first use EM to learn a mixture of K transaction models (ignoring individuals). [sent-163, score-0.611]
66 We then use a second EM algorithm in weight-space to estimate individual-specific weights αi for each individual. [sent-164, score-0.184]
67 In effect, we are learning how best to represent each individual within the K-dimensional simplex of basis components. [sent-166, score-0.254]
68 The empirical prior uses the marginal weights (α’s) from the first run for the mean of the Dirichlet, and an equivalent sample size of n = 10 transactions is used in the results reported in the paper. [sent-167, score-0.307]
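The following sketch shows what the second, weight-space EM might look like for a single individual, with the empirical Dirichlet prior folded into the update; this is our reading of the description above, not the authors' code:

```python
import numpy as np

def fit_individual_weights(Y_i, theta, prior_mean, prior_n=10.0, n_iter=50):
    """Estimate alpha_i for one individual by EM, holding theta fixed.

    Y_i        : (n_i, C) count vectors for the individual's transactions
    theta      : (K, C) component multinomials learned in the first EM run
    prior_mean : (K,) marginal mixture weights from the first run (Dirichlet mean)
    prior_n    : equivalent sample size of the empirical prior (n = 10 in the paper)
    """
    K = theta.shape[0]
    log_lik = Y_i @ np.log(theta).T        # (n_i, K) log p(y_ij | k), up to a constant
    alpha = np.full(K, 1.0 / K)
    pseudo = prior_n * prior_mean          # pseudo-counts contributed by the prior
    for _ in range(n_iter):
        # E-step: responsibilities r_jk = p(component k | y_ij, current alpha)
        log_r = np.log(alpha) + log_lik
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: smoothed weight update (expected counts plus prior pseudo-counts)
        alpha = r.sum(axis=0) + pseudo
        alpha /= alpha.sum()
    return alpha
```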
69 We also compare two other approaches for profiling: (1) Global Mixture Weights: instead of individualized weights we set each individual’s weights to the same global marginal mixture weights. (A footnote notes that the resulting predictive distribution is not a multinomial that can be plotted as a bar chart; however, we can approximate it, and one such approximation is plotted here.) [sent-171, score-0.411]
Figure 4: Plot of the negative log probability scores per item (predictive entropy) on out-of-sample transactions, for various weight models (individualized ML weights, global mixture weights, and others) as a function of the number of mixture components K. [sent-181, score-0.441]
72 This provides an (optimistic) baseline of using multinomial profiles directly, without use of any mixture models. [sent-183, score-0.262]
73 3 Experimental Results To evaluate our approach we used a real-world transaction data set. [sent-184, score-0.432]
74 The data consists of transactions collected at a chain of retail stores over a two-year period. [sent-185, score-0.385]
75 We analyze the transactions here at the store department level (50 categories of items). [sent-186, score-0.157]
76 We separate the data into two time periods (all transactions are timestamped), with approximately 70% of the data being in the first time period (the training data) and the remainder in the test period. [sent-187, score-0.352]
77 We train our mixture and weight models on the first period and evaluate our models in terms of their ability to predict transactions that occur in the subsequent out-of-sample test period. [sent-188, score-0.494]
78 The training data contains data on 4339 individuals, 58,866 transactions, and 164,000 items purchased. [sent-189, score-0.367]
79 The test data consists of 4040 individuals, 25,292 transactions, and 69,103 items purchased. [sent-190, score-0.326]
80 Not all individuals in the test data set appear in the training data set (and vice-versa): individuals in the test data set with no training data are assigned a global population model for scoring purposes. [sent-191, score-0.885]
81 To evaluate the predictive power of each model, we calculate the log-probability (“logp scores”) of the out-of-sample transactions under each model. [sent-192, score-0.26]
82 Higher logp scores mean that the model assigned higher probability to events that actually occurred. [sent-193, score-0.226]
83 Note that the mean negative logp score over a set of transactions, divided by the total number of items, can be interpreted as a predictive entropy term in bits. [sent-194, score-0.315]
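A sketch of this evaluation as we understand it (the helper log_p_transaction is the hypothetical scoring function sketched earlier; individuals unseen in training would be scored with a population profile instead of alphas[i]):

```python
import numpy as np

def predictive_entropy_bits(test_sets, alphas, theta):
    """Mean negative log2-probability per item over out-of-sample transactions.

    test_sets : dict mapping individual i -> (n_i_test, C) array of test count vectors
    alphas    : dict mapping individual i -> (K,) profile weights for that individual
    theta     : (K, C) component multinomials
    """
    total_logp, total_items = 0.0, 0
    for i, Y in test_sets.items():
        for y in Y:
            total_logp += log_p_transaction(y, alphas[i], theta)
            total_items += int(y.sum())
    return -total_logp / (total_items * np.log(2))  # natural log converted to bits per item
```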
84 Figure 4 plots this predictive entropy as a function of the number of mixture components K for the mixture-based ML weights, the mixture-based Global weights (where all individuals are assigned the same marginal mixture weights), the mixture-based Empirical Bayes weights, and the non-mixture MAP histogram method (as a baseline). [sent-198, score-0.909]
85 The mixture-based approaches generally outperform the non-mixture MAP histogram approach (solid line). [sent-199, score-0.107]
86 The ML-based mixture weights start to overfit after about 6 mixture components (as expected). [sent-200, score-0.61]
87 The Global mixture weights and individualized mixture weights improve up to about K = 50 components and then show some evidence of overfitting. [sent-201, score-0.917]
88 The mixture-based individual weights method is systematically the best predictor, providing a 15% decrease in predictive entropy compared to the MAP histogram method, and a roughly 3% decrease compared to non-individualized global mixture weights. [sent-202, score-0.978]
89 Figure 5 shows a more detailed comparison of the difference between individual mixtures and the Global profiles, on a subset of individuals. [sent-203, score-0.303]
90 We can see that the Global profiles are systematically worse than the individual weights model. [sent-204, score-0.438]
91 For individuals with the lowest likelihood (lower left of the left plot) the individual weight model is consistently better: typically lower total likelihood individuals are those with more transactions and items. [sent-207, score-0.85]
92 In Cadez et al. (2001) we report more detailed results on both this data set and a second retail data set involving 15 million items and 300,000 individuals. [sent-209, score-0.55]
93 On both data sets the individual-level models were found to be consistently more accurate out-of-sample compared to both non-mixture and non-Bayesian approaches. [sent-210, score-0.129]
94 4 Conclusions In this paper we investigated the use of mixture models and approximate Bayesian estimation for automatically inferring individual-level profiles from transaction data records. [sent-212, score-0.69]
95 On a real-world retail data set the proposed framework consistently outperformed alternative approaches in terms of accuracy of predictions on future unseen customer behavior. [sent-213, score-0.448]
96 Agrawal, R., Imielinski, T., and Swami, A. (1993) Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD’93), New York: ACM Press, pp. [sent-219, score-0.238]
97 (2001) Predictive profiles for transaction data using finite mixture models, Technical Report UCI-ICS-01-67, Information and Computer Science, University of California, Irvine (available online at www. [sent-226, score-0.619]
98 Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000) Dependency networks for inference, collaborative filtering, and data visualization, Journal of Machine Learning Research, 1. [sent-236, score-0.083]
99 McCallum, A. (1999) Multi-label text classification with a mixture model trained by EM, in AAAI’99 Workshop on Text Learning. [sent-257, score-0.029]
100 Strehl, A., and Ghosh, J. (2000) Value-based customer grouping from large retail datasets, Proc. [sent-260, score-0.361]
wordName wordTfidf (topN-words)
[('transaction', 0.382), ('pro', 0.298), ('individual', 0.254), ('items', 0.238), ('retail', 0.212), ('mixture', 0.187), ('individuals', 0.187), ('weights', 0.184), ('les', 0.167), ('yij', 0.164), ('customer', 0.149), ('cadez', 0.143), ('ik', 0.143), ('purchased', 0.141), ('predictive', 0.137), ('transactions', 0.123), ('individualized', 0.123), ('le', 0.112), ('logp', 0.11), ('di', 0.108), ('global', 0.099), ('nijc', 0.094), ('purchases', 0.094), ('erent', 0.085), ('component', 0.084), ('scores', 0.082), ('departments', 0.082), ('kc', 0.082), ('multinomials', 0.082), ('histogram', 0.078), ('multinomial', 0.075), ('basket', 0.071), ('purchase', 0.071), ('smyth', 0.061), ('pk', 0.059), ('clothing', 0.056), ('components', 0.052), ('historical', 0.052), ('bayesian', 0.051), ('map', 0.05), ('data', 0.05), ('smoothed', 0.049), ('profile', 0.049), ('irvine', 0.049), ('mixtures', 0.049), ('mining', 0.047), ('agrawal', 0.047), ('baskets', 0.047), ('ghosh', 0.047), ('henry', 0.047), ('lazarsfeld', 0.047), ('personalization', 0.047), ('purchasing', 0.047), ('strehl', 0.047), ('swami', 0.047), ('item', 0.047), ('acm', 0.045), ('population', 0.044), ('em', 0.043), ('ective', 0.043), ('models', 0.042), ('modeling', 0.041), ('igor', 0.041), ('sigmod', 0.041), ('spie', 0.041), ('entropy', 0.039), ('dirichlet', 0.039), ('test', 0.038), ('ect', 0.037), ('exploratory', 0.037), ('speci', 0.037), ('consistently', 0.037), ('ml', 0.035), ('jth', 0.035), ('market', 0.035), ('coe', 0.035), ('memoryless', 0.035), ('nij', 0.035), ('counts', 0.034), ('store', 0.034), ('web', 0.034), ('assigned', 0.034), ('heckerman', 0.033), ('collaborative', 0.033), ('text', 0.032), ('period', 0.031), ('ho', 0.031), ('records', 0.031), ('weight', 0.031), ('yi', 0.031), ('probabilistic', 0.03), ('training', 0.029), ('score', 0.029), ('products', 0.029), ('tuned', 0.029), ('plotting', 0.029), ('inferring', 0.029), ('prototype', 0.029), ('outperform', 0.029), ('mccallum', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
2 0.17901328 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . 
The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. 
To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. 
In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. 
First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. 
Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. 
Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
3 0.11405436 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
4 0.1011268 107 nips-2001-Latent Dirichlet Allocation
Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model , also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. 1
5 0.087664373 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables: (1) where the observed variable represents the high dimensional data; the discrete hidden variable indexes different neighborhoods on the manifold; and the continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood, the data's local coordinates are normally distributed.
6 0.073827855 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
7 0.063547783 43 nips-2001-Bayesian time series classification
8 0.062925346 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
9 0.062531039 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
10 0.060959037 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
11 0.056801446 58 nips-2001-Covariance Kernels from Bayesian Generative Models
12 0.053079821 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
13 0.052941907 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
14 0.052772924 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding
15 0.051001869 114 nips-2001-Learning from Infinite Data in Finite Time
16 0.050854102 35 nips-2001-Analysis of Sparse Bayesian Learning
17 0.050484847 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task
18 0.049930889 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
19 0.048430104 149 nips-2001-Probabilistic Abstraction Hierarchies
20 0.04714423 172 nips-2001-Speech Recognition using SVMs
topicId topicWeight
[(0, -0.164), (1, -0.004), (2, -0.036), (3, -0.035), (4, -0.103), (5, -0.05), (6, 0.01), (7, 0.01), (8, -0.045), (9, -0.06), (10, 0.125), (11, 0.022), (12, -0.073), (13, -0.171), (14, -0.049), (15, -0.065), (16, 0.056), (17, -0.016), (18, 0.012), (19, 0.074), (20, -0.023), (21, 0.101), (22, -0.017), (23, -0.019), (24, -0.05), (25, -0.145), (26, 0.04), (27, 0.08), (28, 0.231), (29, -0.172), (30, 0.033), (31, -0.189), (32, 0.007), (33, -0.015), (34, -0.033), (35, -0.011), (36, 0.119), (37, 0.057), (38, -0.099), (39, -0.147), (40, -0.082), (41, -0.031), (42, -0.185), (43, 0.118), (44, -0.019), (45, 0.003), (46, 0.043), (47, -0.035), (48, 0.013), (49, -0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.97071677 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
2 0.81934386 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . 
The criterion for building a new model is loosely defined as "the model does not adequately fit the data anymore".

2 Model Based Population Similarity

In this section we formulate the problem of model-based population similarity and tracking. In the case of mixture models we start with the following observations:
• The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood, or sum of LogP scores).
• The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space.
• If the model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of the data space that are populated by data points solely due to the details of the specific data sample used to build the model.
• Dense regions in the data space discovered by a non-overfitting model are an intrinsic property of the true data-generating distribution, even if the functional form of the model is not well matched with the true data-generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data.
• If there is confidence that the model is not overfitting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), then new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model.
(The estimation above is maximum-likelihood; if we included parameter priors we could equally well apply the results in this paper to maximum a posteriori estimation.)
Given these observations, we seek to define a measure of data-distribution similarity based on how well the dense regions of the data space are preserved when new data is introduced. In model-based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data-distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which the distribution tails off toward smaller LogP values, so simple summary scores fail to efficiently summarize the LogP distribution. For example, the log-likelihood is proportional to the mean of the LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confirmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and a tail toward the left.
To define the similarity of the data distributions based on LogP scores in a purely non-parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate the distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set D_t to that of the training set D_0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data-generating process share the same functional form (see Section 3). Consequently, we need to consider the specific KS score in relation to the natural variability of the true data-generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D_0. Then the individual in-sample data sets D_1, D_2, . . . , D_{t_0} are used to estimate the natural variability of the KS statistics. This variability needs to be quantified due to the fact that the model may not truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset D_t, t > t_0, are computed. Using principled heuristics, one can then determine whether or not the LogP signature for D_t is significantly different than the LogP signatures for the in-sample data. To clarify the various steps, we provide an algorithmic description of the change detection process.

Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an "in-sample" dataset D_0 = {D_1, D_2, . . . , D_{t_0}}, proceed as follows:
1. Estimate the parameters Θ_0 of the mixture model P(D|Θ) over D_0 (see equation (1)).
2. Compute LogP(D_i) = ∑_{î=1}^{n_i} log P(x_î|Θ_0), x_î ∈ D_i, n_i = |D_i|, i = 1, . . . , t_0.   (2)
3. For 1 ≤ i, j ≤ t_0, compute L_KS(i, j) = log [P_KS(D_i, D_j)]. See [7] for details on the P_KS computation.
4. For 1 ≤ i ≤ t_0, compute the KS measure M_KS(i) as M_KS(i) = (1/t_0) ∑_{j=1}^{t_0} L_KS(i, j).   (3)
5. Compute µ_M = Mean[M_KS(i)] and σ_M = STD[M_KS(i)] to quantify the natural variability of M_KS over the "in-sample" data.

Algorithm 2 (Evaluating New Data): Given a new dataset D_t, t > t_0, and µ_M and σ_M, proceed as follows:
1. Compute LogP(D_t) as in (2).
2. For 1 ≤ i ≤ t_0, compute L_KS(i, t).
3. Compute M_KS(t) as in (3).
4. Apply decision criteria using M_KS(t), µ_M, σ_M to determine whether or not Θ_0 is a good fit for the new data. For example, if |M_KS(t) − µ_M| / σ_M > 3,   (4) then Θ_0 is not a good fit any more.

Note, however, that the 3-σ interval can be interpreted as a confidence interval only in the limit when the number of data sets goes to infinity. In the applications presented in this paper we certainly do not have that condition satisfied, and we consider this approach an "educated heuristic" (gaining firm statistical grounds in the limit).

2.1 Space and Time Complexity of the Methodology

The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well.
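To make Algorithms 1 and 2 concrete, here is a rough Python sketch (not from the paper) that operates on lists of per-dataset LogP score arrays and uses SciPy's two-sample KS test as the P_KS computation; the function names, the small constant guarding log(0), and the choice of scipy.stats.ks_2samp are assumptions of this sketch rather than details fixed by the original text.

    import numpy as np
    from scipy.stats import ks_2samp

    def pairwise_mks(logp_sets):
        """M_KS(i): average over j of L_KS(i, j) = log p-value of a two-sample KS test (cf. (2)-(3))."""
        t0 = len(logp_sets)
        L = np.empty((t0, t0))
        for i in range(t0):
            for j in range(t0):
                L[i, j] = np.log(ks_2samp(logp_sets[i], logp_sets[j]).pvalue + 1e-300)
        return L.mean(axis=1)

    def natural_variability(in_sample_logp):
        """Algorithm 1 (steps 3-5): mu_M and sigma_M of M_KS over the in-sample LogP distributions."""
        m = pairwise_mks(in_sample_logp)
        return m.mean(), m.std()

    def flags_change(new_logp, in_sample_logp, mu_m, sigma_m, n_sigma=3.0):
        """Algorithm 2: True if |M_KS(t) - mu_M| / sigma_M exceeds n_sigma (cf. (4))."""
        l_ks = [np.log(ks_2samp(s, new_logp).pvalue + 1e-300) for s in in_sample_logp]
        return abs(np.mean(l_ks) - mu_m) / sigma_m > n_sigma

A typical use would be to call natural_variability on the t_0 in-sample LogP arrays once, then call flags_change on each new day's LogP array as it arrives.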
In order to assess the time complexity, we use the following notation: n_t = |D_t| is the number of data points in the data set D_t; t_0 is the index of the last in-sample data set, but is also the number of in-sample data sets; n_0 = |D_0| = ∑_{t=1}^{t_0} n_t is the total number of in-sample data points (in all the in-sample data sets); n = n_0/t_0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is n_t ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimate the parameters Θ of the mixture model (1). In the first phase of the methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions), which has time complexity of O(n_0). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t_0(t_0 − 1)/2 · O(n log n) = O(t_0^2 n log n). The in-sample mean and variance of the KS measure are obtained in time linear in t_0, hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep the LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure), which requires O(n_0) memory. To score an out-of-sample data set D_t, t > t_0, we must first obtain the LogP distribution of D_t, which has time complexity of O(n), and then calculate the KS measure relative to each of the in-sample data sets, which has time complexity O(n log n) per in-sample data set, or t_0 · O(n log n) = O(t_0 n log n) for the full in-sample period. The LogP distribution for D_t can be discarded once the pairwise KS measures are obtained.

Figure 1: Histograms of LogP scores for two data sets generated from the first model (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fitted on the in-sample period.

Overall, the proposed methodology requires O(n_0) memory, O(t_0^2 n log n) time for preprocessing and O(t_0 n log n) time for out-of-sample evaluation. Further, since t_0 is typically a small constant (e.g., t_0 = 7 or t_0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n).

3 Experimental Setup

The experiments presented consist of two parts: experiments on synthetic data and experiments on aggregations over real web-log data.

3.1 Experiments on Synthetic Data

Synthetic data is a valuable tool when determining both the applicability and the limitations of the proposed approach. Synthetic data was generated by sampling from a two-component CI model (the true model is not used in the evaluations). The data consist of a two-state discrete dimension and a continuous dimension.
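As a concreteness check only, the following rough sketch shows how such a two-component CI sample could be drawn; the parameter values match those listed in the next paragraph, while the helper name sample_ci, the 1,000-points-per-set choice, and the fixed random seed are assumptions of this sketch rather than details from the paper.

    import numpy as np

    def sample_ci(n, weights, disc_probs, means, variances, rng):
        """Draw n points from a two-component CI model: one binary and one Gaussian dimension."""
        z = rng.choice(len(weights), size=n, p=weights)                       # component labels
        disc = np.array([rng.choice(2, p=disc_probs[k]) for k in z])          # discrete dimension
        cont = rng.normal(np.take(means, z), np.sqrt(np.take(variances, z)))  # continuous dimension
        return np.column_stack([disc, cont])

    rng = np.random.default_rng(0)
    shared = dict(weights=[0.6, 0.4], means=[10.0, 0.0], variances=[5.0, 7.0])
    # 100 data sets from the original model, then 100 after the small change in theta_2.
    datasets  = [sample_ci(1000, disc_probs=[[0.8, 0.2], [0.4, 0.6]], rng=rng, **shared) for _ in range(100)]
    datasets += [sample_ci(1000, disc_probs=[[0.8, 0.2], [0.5, 0.5]], rng=rng, **shared) for _ in range(100)]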
The first 100 data sets were generated by sampling from a mixture model with parameters: [π_1, π_2] = [0.6, 0.4] as weights, θ_1 = [0.8, 0.2] and θ_2 = [0.4, 0.6] as discrete state probabilities, and [µ_1, σ_1^2] = [10, 5] and [µ_2, σ_2^2] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete-dimension probability of the second cluster was changed from θ_2 = [0.4, 0.6] to θ_2 = [0.5, 0.5], keeping the remaining parameters fixed, and an additional 100 data sets were generated by sampling from this altered model. This is a fairly small change in the distribution, and the underlying LogP scores appear to be very similar, as can be seen in Figure 1. The figure shows LogP distributions for the first two data sets generated from the first model (top row) and the first two data sets generated from the second model (bottom row). Plots within each row should be more similar than plots from different rows, but this is difficult to discern by visual inspection.

Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that the y-axes have different scales in order to show the full variability of the data.

Algorithms 1 and 2 were evaluated by using the first 10 data sets to estimate a two-component model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over the in-sample data sets (first 10) for four experiments varying the number of data points in each experiment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real-life applications, we expect to be able to detect such slight distribution changes.

3.2 Experiments on Real Life Data

Figure 3 shows a distribution for a typical day from a content web site. There are almost 50,000 data points in the data set, with over 100 dimensions each. The LogP score distribution is similar to that of the synthetic data in Figure 1, which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over in-sample data has much lower values (see Figure 3 right, and the plots in Figure 2). Another way to phrase this observation is to note that since the true data-generating distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower, since there are two factors of dissimilarity: 1) different data sets; 2) the inability of the CI model to capture all the aspects of the true data distribution.
Nonetheless, the first 31 data sets (one month of data) that were used to build the initial model Θ_0 can be used to define the natural variability of the KS measures against which additional data sets can be compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verified by the commercial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond the scope of this paper.

Figure 3: Left: distribution of 42,655 LogP scores from a mixture of conditional independence models. The data is a single day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31-day in-sample period for a content website, showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology.

4 Related Work

Automatic detection of various types of data changes appears in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]), where the goal is not only to find unusual data points, but the ones that appear not to have been generated by the data-generating distribution. A related problem has been addressed by [2] in the context of time series modeling, where outliers and trends can contaminate the model estimation. More recently, mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addresses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change, but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as new data becomes available (for variants and additional references see, e.g., [6]). In that framework there is no such concept as a data distribution change, since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift.

5 Conclusions

In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature, we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases.
Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate the parameters of a mixture model on an "as-needed" basis – when the model fails to adequately represent the data after a certain point in time.

References
[1] V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, 1984.
[2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996.
[3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001.
[4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001.
[5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999.
[6] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.
[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992.
[8] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
3 0.62866908 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority of zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large database of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machine regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows the median premium to be reduced by charging more to the most risky customers. 1
4 0.61225903 107 nips-2001-Latent Dirichlet Allocation
Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. 1
5 0.51345378 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
6 0.44607192 84 nips-2001-Global Coordination of Local Linear Models
7 0.41070169 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
8 0.40196598 68 nips-2001-Entropy and Inference, Revisited
9 0.39821881 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task
10 0.3950758 149 nips-2001-Probabilistic Abstraction Hierarchies
11 0.38228276 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
12 0.35291731 114 nips-2001-Learning from Infinite Data in Finite Time
13 0.34318757 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
14 0.32101449 43 nips-2001-Bayesian time series classification
15 0.3198235 155 nips-2001-Quantizing Density Estimators
16 0.31685126 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
17 0.3089695 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
18 0.30641407 35 nips-2001-Analysis of Sparse Bayesian Learning
19 0.30443606 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
20 0.30430666 140 nips-2001-Optimising Synchronisation Times for Mobile Devices
topicId topicWeight
[(14, 0.014), (17, 0.013), (19, 0.019), (27, 0.118), (30, 0.072), (38, 0.018), (59, 0.037), (61, 0.335), (72, 0.045), (79, 0.044), (83, 0.024), (91, 0.176)]
simIndex simValue paperId paperTitle
1 0.91118544 177 nips-2001-Switch Packet Arbitration via Queue-Learning
Author: Timothy X. Brown
Abstract: In packet switches, packets queue at switch inputs and contend for outputs. The contention arbitration policy directly affects switch performance. The best policy depends on the current state of the switch and current traffic patterns. This problem is hard because the state space, possible transitions, and set of actions all grow exponentially with the size of the switch. We present a reinforcement learning formulation of the problem that decomposes the value function into many small independent value functions and enables an efficient action selection.
2 0.8877933 145 nips-2001-Perceptual Metamers in Stereoscopic Vision
Author: B. T. Backus
Abstract: Theories of cue combination suggest the possibility of constructing visual stimuli that evoke different patterns of neural activity in sensory areas of the brain, but that cannot be distinguished by any behavioral measure of perception. Such stimuli, if they exist, would be interesting for two reasons. First, one could know that none of the differences between the stimuli survive past the computations used to build the percepts. Second, it can be difficult to distinguish stimulus-driven components of measured neural activity from top-down components (such as those due to the interestingness of the stimuli). Changing the stimulus without changing the percept could be exploited to measure the stimulusdriven activity. Here we describe stimuli in which vertical and horizontal disparities trade during the construction of percepts of slanted surfaces, yielding stimulus equivalence classes. Equivalence class membership changed after a change of vergence eye posture alone, without changes to the retinal images. A formal correspondence can be drawn between these “perceptual metamers” and more familiar “sensory metamers” such as color metamers. 1
same-paper 3 0.8213374 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
4 0.55937517 183 nips-2001-The Infinite Hidden Markov Model
Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen
Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.
5 0.55879211 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
Author: Jonathan L. Shapiro, J. Wearden
Abstract: Animal data on delayed-reward conditioning experiments shows a striking property - the data for different time intervals collapses into a single curve when the data is scaled by the time interval. This is called the scalar property of interval timing. Here a simple model of a neural clock is presented and shown to give rise to the scalar property. The model is an accumulator consisting of noisy, linear spiking neurons. It is analytically tractable and contains only three parameters. When coupled with reinforcement learning it simulates peak procedure experiments, producing both the scalar property and the pattern of single trial covariances. 1
6 0.55831307 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
7 0.55791599 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
8 0.55791336 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
9 0.55774581 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
10 0.55641353 89 nips-2001-Grouping with Bias
11 0.55524313 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
12 0.55216449 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
13 0.55162585 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
14 0.55117285 135 nips-2001-On Spectral Clustering: Analysis and an algorithm
15 0.55110526 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
16 0.55091619 68 nips-2001-Entropy and Inference, Revisited
17 0.55054331 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
18 0.54974395 169 nips-2001-Small-World Phenomena and the Dynamics of Information
19 0.54904699 84 nips-2001-Global Coordination of Local Linear Models
20 0.54806077 144 nips-2001-Partially labeled classification with Markov random walks