nips nips2001 nips2001-122 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
Reference: text
sentIndex sentText sentNum sentScore
1 com Abstract Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. [sent-9, score-0.265]
2 Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. [sent-10, score-0.323]
3 In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. [sent-11, score-0.502]
4 However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. [sent-12, score-0.4]
5 In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. [sent-13, score-0.39]
6 In this paper we formulate the problem of change detection and propose a principled solution. [sent-14, score-0.191]
7 Empirical results over both synthetic and real-life data sets are presented. [sent-15, score-0.333]
8 1 Introduction and Notation Consider a data set D = {x1 , x2 , . [sent-16, score-0.148]
9 , xn } consisting of n independent, identically distributed (iid) data points. [sent-19, score-0.148]
10 In context of this paper the data points could be vectors, sequences, etc. [sent-20, score-0.218]
11 Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . [sent-21, score-0.566]
12 |θk ) represents a mixture component, while πi represents mixture weights. [sent-29, score-0.234]
13 to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. [sent-32, score-0.518]
14 The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. [sent-33, score-0.296]
15 For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. [sent-34, score-0.349]
16 Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e. [sent-35, score-0.408]
17 Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). [sent-43, score-0.685]
18 For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. [sent-44, score-0.403]
19 In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . [sent-45, score-0.193]
20 We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . [sent-49, score-0.681]
21 The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. [sent-50, score-0.302]
22 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. [sent-51, score-0.144]
23 In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). [sent-52, score-0.636]
24 • The optimal mixture model puts more PDF mass over dense regions in the data space. [sent-53, score-0.577]
25 Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. [sent-54, score-0.537]
26 More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. [sent-55, score-0.581]
27 , there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. [sent-58, score-0.632]
28 • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. [sent-59, score-0.672]
29 In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. [sent-60, score-0.428]
30 , cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. [sent-65, score-0.697]
31 Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. [sent-66, score-0.632]
32 In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). [sent-67, score-0.748]
33 It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. [sent-69, score-0.311]
34 In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. [sent-71, score-0.162]
35 It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. [sent-72, score-0.413]
36 To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. [sent-73, score-0.485]
37 In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. [sent-75, score-0.205]
38 In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). [sent-76, score-0.278]
39 Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. [sent-77, score-0.408]
40 In the situation with streaming data, the model is estimated over the in-sample data D0 . [sent-78, score-0.193]
41 Then the individual in-sample data sets D1 , D2 , . [sent-79, score-0.244]
42 This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. [sent-83, score-0.307]
43 When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. [sent-84, score-0.436]
44 To clarify various steps, we provide an algorithmic description of the change detection process. [sent-86, score-0.158]
45 Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). [sent-91, score-0.162]
46 Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . [sent-93, score-0.154]
47 t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. [sent-113, score-0.272]
48 1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. [sent-116, score-0.373]
49 For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. [sent-118, score-0.297]
50 The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). [sent-119, score-0.274]
51 In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). [sent-120, score-0.469]
52 Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). [sent-121, score-0.761]
53 Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). [sent-122, score-0.347]
54 In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. [sent-123, score-0.216]
55 In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e. [sent-124, score-0.526]
56 5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). [sent-145, score-0.846]
57 Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. [sent-148, score-0.348]
58 , t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). [sent-151, score-0.212]
59 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. [sent-152, score-0.271]
60 1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. [sent-154, score-0.219]
61 Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). [sent-155, score-0.304]
62 The data consist of a two-state discrete dimension and a continuous dimension. [sent-156, score-0.148]
63 First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0. [sent-157, score-0.439]
64 5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. [sent-168, score-0.277]
65 This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. [sent-169, score-0.299]
66 The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). [sent-170, score-0.736]
67 Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. [sent-175, score-0.244]
68 Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. [sent-176, score-0.22]
69 Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. [sent-177, score-0.499]
70 As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. [sent-179, score-0.433]
71 At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. [sent-180, score-0.342]
72 Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. [sent-181, score-0.663]
73 2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. [sent-183, score-0.176]
74 There are almost 50,000 data points in the data set with over 100 dimensions each. [sent-184, score-0.366]
75 The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. [sent-185, score-0.406]
76 Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. [sent-186, score-0.315]
77 Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). [sent-187, score-0.322]
78 The data is a single-day of click-stream data from a commercial web site. [sent-190, score-0.296]
79 Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. [sent-191, score-0.567]
80 data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. [sent-192, score-0.678]
81 The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. [sent-193, score-0.52]
82 Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. [sent-194, score-0.168]
83 Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. [sent-195, score-0.198]
84 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. [sent-196, score-0.285]
85 For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. [sent-197, score-0.47]
86 This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. [sent-198, score-0.601]
87 A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. [sent-199, score-0.157]
88 More recently mixture models have been applied more directly to outlier detection [3]. [sent-200, score-0.267]
89 We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. [sent-202, score-0.218]
90 An outlier or two do not necessarily mean that the underlying data distribution has changed. [sent-203, score-0.264]
91 Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. [sent-204, score-0.475]
92 There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e. [sent-205, score-0.266]
93 In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. [sent-208, score-0.314]
94 For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. [sent-209, score-0.307]
95 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. [sent-210, score-0.486]
96 Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. [sent-211, score-0.828]
97 Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. [sent-212, score-0.463]
98 Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. [sent-213, score-0.469]
99 This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. [sent-214, score-0.495]
100 Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. [sent-234, score-0.148]
wordName wordTfidf (topN-words)
[('logp', 0.564), ('ks', 0.416), ('mks', 0.215), ('dt', 0.19), ('scores', 0.175), ('dense', 0.149), ('data', 0.148), ('day', 0.119), ('mixture', 0.117), ('log', 0.1), ('sets', 0.096), ('detection', 0.091), ('ci', 0.089), ('synthetic', 0.089), ('pdf', 0.085), ('variability', 0.081), ('lks', 0.08), ('regions', 0.078), ('methodology', 0.076), ('complexity', 0.076), ('population', 0.075), ('online', 0.073), ('points', 0.07), ('unusual', 0.07), ('similarity', 0.069), ('change', 0.067), ('score', 0.067), ('adequately', 0.064), ('count', 0.064), ('outlier', 0.059), ('distribution', 0.057), ('bellevue', 0.054), ('digimine', 0.054), ('glitch', 0.054), ('logpi', 0.054), ('pks', 0.054), ('novelty', 0.053), ('nt', 0.053), ('generating', 0.052), ('automatic', 0.05), ('detect', 0.05), ('slight', 0.049), ('signature', 0.047), ('cadez', 0.047), ('permanent', 0.047), ('sigkdd', 0.047), ('changes', 0.046), ('outliers', 0.045), ('model', 0.045), ('row', 0.043), ('tails', 0.042), ('constantly', 0.042), ('website', 0.042), ('di', 0.041), ('mass', 0.04), ('measure', 0.04), ('monitoring', 0.04), ('tting', 0.039), ('histogram', 0.038), ('acm', 0.038), ('shift', 0.038), ('determining', 0.038), ('distributions', 0.038), ('measures', 0.037), ('peaks', 0.037), ('business', 0.037), ('life', 0.037), ('time', 0.036), ('pairwise', 0.035), ('plots', 0.034), ('stream', 0.034), ('tted', 0.034), ('real', 0.034), ('true', 0.033), ('generated', 0.033), ('principled', 0.033), ('truly', 0.033), ('discovered', 0.033), ('applicability', 0.033), ('ect', 0.032), ('trends', 0.031), ('histograms', 0.03), ('statistics', 0.03), ('pages', 0.03), ('probabilistic', 0.029), ('certainly', 0.028), ('heuristics', 0.028), ('compute', 0.028), ('dataset', 0.028), ('variance', 0.028), ('dence', 0.028), ('natural', 0.027), ('rst', 0.027), ('proceed', 0.027), ('detecting', 0.027), ('ni', 0.027), ('mining', 0.027), ('discovery', 0.027), ('con', 0.026), ('purely', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
2 0.18515263 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
3 0.17901328 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
4 0.096795551 139 nips-2001-Online Learning with Kernels
Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson
Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss. ¡
5 0.09127903 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
6 0.087674327 107 nips-2001-Latent Dirichlet Allocation
7 0.069241777 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
8 0.067998081 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
9 0.060552031 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
10 0.06029034 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
11 0.060134582 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
12 0.058112793 35 nips-2001-Analysis of Sparse Bayesian Learning
13 0.05728044 96 nips-2001-Information-Geometric Decomposition in Spike Analysis
14 0.056342509 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
15 0.055978592 46 nips-2001-Categorization by Learning and Combining Object Parts
16 0.053982232 43 nips-2001-Bayesian time series classification
17 0.052847818 114 nips-2001-Learning from Infinite Data in Finite Time
18 0.051291399 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
19 0.051064797 21 nips-2001-A Variational Approach to Learning Curves
20 0.050577845 84 nips-2001-Global Coordination of Local Linear Models
topicId topicWeight
[(0, -0.2), (1, 0.019), (2, -0.04), (3, -0.018), (4, -0.074), (5, -0.013), (6, 0.03), (7, 0.0), (8, -0.036), (9, -0.045), (10, 0.082), (11, -0.027), (12, -0.014), (13, -0.05), (14, -0.04), (15, -0.05), (16, 0.031), (17, -0.03), (18, 0.005), (19, 0.002), (20, 0.001), (21, 0.117), (22, 0.024), (23, -0.094), (24, -0.029), (25, -0.15), (26, -0.08), (27, 0.072), (28, 0.176), (29, -0.186), (30, 0.005), (31, -0.251), (32, 0.025), (33, -0.021), (34, -0.02), (35, 0.014), (36, 0.093), (37, 0.101), (38, -0.104), (39, -0.139), (40, -0.233), (41, 0.155), (42, -0.144), (43, 0.119), (44, 0.129), (45, -0.074), (46, 0.07), (47, -0.075), (48, -0.009), (49, -0.05)]
simIndex simValue paperId paperTitle
same-paper 1 0.93943828 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
2 0.79945904 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
3 0.53565484 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are .discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1
4 0.45702407 68 nips-2001-Entropy and Inference, Revisited
Author: Ilya Nemenman, F. Shafee, William Bialek
Abstract: We study properties of popular near–uniform (Dirichlet) priors for learning undersampled probability distributions on discrete nonmetric spaces and show that they lead to disastrous results. However, an Occam–style phase space argument expands the priors into their infinite mixture and resolves most of the observed problems. This leads to a surprisingly good estimator of entropies of discrete distributions. Learning a probability distribution from examples is one of the basic problems in data analysis. Common practical approaches introduce a family of parametric models, leading to questions about model selection. In Bayesian inference, computing the total probability of the data arising from a model involves an integration over parameter space, and the resulting “phase space volume” automatically discriminates against models with larger numbers of parameters—hence the description of these volume terms as Occam factors [1, 2]. As we move from finite parameterizations to models that are described by smooth functions, the integrals over parameter space become functional integrals and methods from quantum field theory allow us to do these integrals asymptotically; again the volume in model space consistent with the data is larger for models that are smoother and hence less complex [3]. Further, at least under some conditions the relevant degree of smoothness can be determined self–consistently from the data, so that we approach something like a model independent method for learning a distribution [4]. The results emphasizing the importance of phase space factors in learning prompt us to look back at a seemingly much simpler problem, namely learning a distribution on a discrete, nonmetric space. Here the probability distribution is just a list of numbers {q i }, i = 1, 2, · · · , K, where K is the number of bins or possibilities. We do not assume any metric on the space, so that a priori there is no reason to believe that any q i and qj should be similar. The task is to learn this distribution from a set of examples, which we can describe as the number of times ni each possibility is observed in a set of N = K ni i=1 samples. This problem arises in the context of language, where the index i might label words or phrases, so that there is no natural way to place a metric on the space, nor is it even clear that our intuitions about similarity are consistent with the constraints of a metric space. Similarly, in bioinformatics the index i might label n–mers of the the DNA or amino acid sequence, and although most work in the field is based on metrics for sequence comparison one might like an alternative approach that does not rest on such assumptions. In the analysis of neural responses, once we fix our time resolution the response becomes a set of discrete “words,” and estimates of the information content in the response are de- termined by the probability distribution on this discrete space. What all of these examples have in common is that we often need to draw some conclusions with data sets that are not in the asymptotic limit N K. Thus, while we might use a large corpus to sample the distribution of words in English by brute force (reaching N K with K the size of the vocabulary), we can hardly do the same for three or four word phrases. In models described by continuous functions, the infinite number of “possibilities” can never be overwhelmed by examples; one is saved by the notion of smoothness. Is there some nonmetric analog of this notion that we can apply in the discrete case? Our intuition is that information theoretic quantities may play this role. If we have a joint distribution of two variables, the analog of a smooth distribution would be one which does not have too much mutual information between these variables. Even more simply, we might say that smooth distributions have large entropy. While the idea of “maximum entropy inference” is common [5], the interplay between constraints on the entropy and the volume in the space of models seems not to have been considered. As we shall explain, phase space factors alone imply that seemingly sensible, more or less uniform priors on the space of discrete probability distributions correspond to disastrously singular prior hypotheses about the entropy of the underlying distribution. We argue that reliable inference outside the asymptotic regime N K requires a more uniform prior on the entropy, and we offer one way of doing this. While many distributions are consistent with the data when N ≤ K, we provide empirical evidence that this flattening of the entropic prior allows us to make surprisingly reliable statements about the entropy itself in this regime. At the risk of being pedantic, we state very explicitly what we mean by uniform or nearly uniform priors on the space of distributions. The natural “uniform” prior is given by Pu ({qi }) = 1 δ Zu K 1− K qi i=1 , Zu = A dq1 dq2 · · · dqK δ 1− qi (1) i=1 where the delta function imposes the normalization, Zu is the total volume in the space of models, and the integration domain A is such that each qi varies in the range [0, 1]. Note that, because of the normalization constraint, an individual q i chosen from this distribution in fact is not uniformly distributed—this is also an example of phase space effects, since in choosing one qi we constrain all the other {qj=i }. What we mean by uniformity is that all distributions that obey the normalization constraint are equally likely a priori. Inference with this uniform prior is straightforward. If our examples come independently from {qi }, then we calculate the probability of the model {qi } with the usual Bayes rule: 1 K P ({qi }|{ni }) = P ({ni }|{qi })Pu ({qi }) , P ({ni }|{qi }) = (qi )ni . Pu ({ni }) i=1 (2) If we want the best estimate of the probability qi in the least squares sense, then we should compute the conditional mean, and this can be done exactly, so that [6, 7] qi = ni + 1 . N +K (3) Thus we can think of inference with this uniform prior as setting probabilities equal to the observed frequencies, but with an “extra count” in every bin. This sensible procedure was first introduced by Laplace [8]. It has the desirable property that events which have not been observed are not automatically assigned probability zero. 1 If the data are unordered, extra combinatorial factors have to be included in P ({ni }|{qi }). However, these cancel immediately in later expressions. A natural generalization of these ideas is to consider priors that have a power–law dependence on the probabilities, the so called Dirichlet family of priors: K Pβ ({qi }) = 1 δ 1− qi Z(β) i=1 K β−1 qi , (4) i=1 It is interesting to see what typical distributions from these priors look like. Even though different qi ’s are not independent random variables due to the normalizing δ–function, generation of random distributions is still easy: one can show that if q i ’s are generated successively (starting from i = 1 and proceeding up to i = K) from the Beta–distribution j
5 0.44679108 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , ... ,nm )) , and the model Moo that would be learned using infinite examples. Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand , they require large computational resources to learn from. While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2J and k-means clustering [3J. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8J. Many approaches to speeding it up have been proposed (see Thiesson et al. [6J for a survey) . Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 2 A Loss Bound for EM In a mixture of Gaussians model, each D-dimensional data point Xj is assumed to have been independently generated by the following process: 1) randomly choose a mixture component k; 2) randomly generate a point from it according to a Gaussian distribution with mean f-Lk and covariance matrix ~k. In this paper we will restrict ourselves to the case where the number K of mixture components and the probability of selection P(f-Lk) and covariance matrix for each component are known. Given a training set S = {Xl, ... , XN }, the learning goal is then to find the maximumlikelihood estimates of the means f-Lk. The EM algorithm [IJ accomplishes this by, starting from some set of initial means , alternating until convergence between estimating the probability p(f-Lk IXj) that each point was generated by each Gaussian (the Estep), and computing the ML estimates of the means ilk = 2::;':1 WjkXj / 2::f=l Wjk (the M step), where Wjk = p(f-Lklxj) from the previous E step. In the basic EM algorithm, all N examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only ni < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. Let Mii = (ill , . . . , ilK) be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using ni examples in iteration i), and let Moo = (f-L1, ... ,f-LK) be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between Mii and Moo . A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones: K L(Mii' Moo ) = L k=l K Ililk - f-Lkl12 = D LL lilkd - f-Lkdl 2 (1) k=ld=l where ilkd is the dth coordinate of il, and similarly for f-Lkd. After any given iteration of EM, lilkd - f-Lkdl has two components. One, which we call the sampling error, derives from the fact that ilkd is estimated from a finite sample, while J-Lkd is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that , due to sampling errors in previous iterations, the weights Wjk used to compute the two estimates may differ. Let J-Lkdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, flkdi be the corresponding finite-data estimate, and flkdi be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration i is Iflkdi - J-Lkdi I, the weighting error is Iflkdi - flkdi I, and the total error is Iflkdi - J-Lkdi 1 :::; Iflkdi - flkdi 1 + Iflkdi - J-Lkdi I衯 Given bounds on the total error of each coordinate of each mean after iteration i-I, we can derive a bound on the weighting error after iteration i as follows. Bounds on J-Lkd ,i-l for each d imply bounds on p(XjlJ-Lki ) for each example Xj, obtained by substituting the maximum and minimum allowed distances between Xjd and J-Lkd ,i-l into the expression of the Gaussian distribution. Let P}ki be the upper bound on P(XjlJ-Lki) , and Pjki be the lower bound. Then the weight of example Xj in mean J-Lki can be bounded from below by by W}ki W (+) jki = min{p}kiP(J-Lk)/ wjki = PjkiP(J-Lk)/ ~~= l P}k'iP(J-LU, and from above ~~=l Pjk'iP(J-LU, I}. Let w;t: = W}ki if Xj ::::: 0 and (- - 1 > (- + . W jki) - W jki' f Xj _ 0 an d W jki) - W jki 0 th erWlse. ot h ' erWlse, an d 1et W- jki Then Iflkdi - flkdi , IJ-Lkdi 1 < max - ~7~1 Wjk i Xj I
6 0.40338933 149 nips-2001-Probabilistic Abstraction Hierarchies
7 0.40159044 107 nips-2001-Latent Dirichlet Allocation
8 0.39377269 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
9 0.37432215 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
10 0.34479097 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
11 0.3278223 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
12 0.32360193 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
13 0.32250497 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task
14 0.31444618 172 nips-2001-Speech Recognition using SVMs
15 0.31239226 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
16 0.31084138 185 nips-2001-The Method of Quantum Clustering
17 0.30929589 84 nips-2001-Global Coordination of Local Linear Models
18 0.30583218 108 nips-2001-Learning Body Pose via Specialized Maps
19 0.30510089 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
20 0.30413565 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
topicId topicWeight
[(4, 0.23), (14, 0.065), (17, 0.027), (19, 0.019), (27, 0.134), (30, 0.1), (38, 0.037), (59, 0.025), (61, 0.015), (72, 0.106), (79, 0.028), (83, 0.017), (91, 0.13)]
simIndex simValue paperId paperTitle
1 0.92740697 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz
Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider. When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/ or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which is interpreted as a “reject”—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 0 0 correct rejection rate 20 40 60 80 100 100 (b) correct rejection rate 20 40 60 80 (a) 0 20 40 60 80 100 correct acceptance rate 0 20 40 60 80 100 correct acceptance rate FIGURE 1. (a) two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) two plausible ROC curves, neither of which is clearly superior to the other. and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA αCA and CR αCR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve. 2 FOUR ALGORITHMS TO PROD THE ROC CURVE In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of αCA and a target CR rate of αCR. 2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES Suppose we train a classifier on a set of positive and negative examples from a class— churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θCA and θCR, which yield CA and CR rates of αCA and αCR, respectively. If the classifier’s discrimination performance fails to achieve the target CA and CR rates, then θCA will be lower than θCR, as depicted in the Figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θCA and θCR; the positive examples must be pushed above θCR and the negative examples must be pushed below θCA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the Figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier’s training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of emphasis, or λ in + 1 = κ e λ in otherwise, where κe is a constant, κe > 1. 2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θCA—such as e2—are clearly the most dif ficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θCR does not depend on examples such as e2, and threshold θCA allows a fraction (1–αCA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θCR—such as e10 and e11—need not be of concern. Essentially , the second algorithm, which we term the eemd phasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct clasθCA e1 e2 e3 0 e4 θCR e5 e6 e7 e8 churn probability e9 e10 e11 e12 e13 1 FIGURE 2. A schematic depiction of all training examples arranged by the classifier’s posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). sification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of deemphasis, or λ in + 1 = κ d λ in otherwise, where κd is a constant, κd <1. 2.3 CONSTRAINED OPTIMIZATION The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to αCA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to αCR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions: 1 1 CA ( w, t ) = ----- ∑ σ β ( f ( x i, w ) – t ) CR ( w, t ) = ------ ∑ σ β ( t – f ( x i, w ) ) , P i∈P N i∈N where P and N are the set of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σβ –1 is a sigmoid function with scaling parameter β: σ β ( y ) = ( 1 + exp ( – βy ) ) . The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function 2 µ A ( w, t ) = CR ( w, t ) + ν CA ( w, t ) – α CA + -- CA ( w, t ) – α CA 2 with a fixed Lagrangian multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ CA ( w *, t * ) – α CA , where w * and t * are the values found by the optimization step. We initialize ν = 1 and fix µ = 1 and β = 10 and iterate until ν converges. 2.4 GENETIC ALGORITHM The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θCA and θCR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task. 3 METHODOLOGY For our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). We chose these two large, real world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α CR = 0.90 and α CA = 0.50 , values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a no hidden unit neural network) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κe = 1.3 and κd = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set. 4 RESULTS Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right show the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (αCA, αCR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information. 5 DISCUSSION In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the impro vement results in significant dollar savings. Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real world data collection (Mozer et 0.845 0.84 0.39 0.835 0.385 CR rate CA rate 0.4 0.395 0.38 0.83 0.825 0.375 0.82 0.37 0.815 0.365 0.81 0 5 10 15 20 25 30 35 40 45 50 Iteration 0 5 10 15 20 25 30 35 40 45 50 Iteration FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/–1 standard error of the mean. Data set 1 0.835 0.380 0.830 0.375 0.825 CR rate ISP Test Set 0.840 0.385 CA rate 0.390 0.370 0.820 0.365 0.815 0.360 0.810 0.355 0.805 0.350 0.800 std emph deemph constr GA std emph deemph constr GA std emph deemph constr GA 0.900 0.375 0.350 CR rate Data set 2 0.875 CA rate Wireless Test Set 0.850 0.325 0.825 0.300 0.800 std emph deemph constr GA FIGURE 4. Cross-validation performance on the two data sets for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for CR rate .9; the right column shows the CR rate for CA rate .5. The error bar indicates one standard error of the mean over the 10 data splits. al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrate the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches. 6 ACKNOWLEDGEMENTS No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549–01A1. 7 REFERENCES Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic. Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (1019–1026). San Mateo, CA: Morgan Kaufmann. Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696. Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.
2 0.85927248 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance
Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli
Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities. White noise analysis has emerged as a powerful technique for characterizing response properties of spiking neurons. A sequence of stimuli are drawn randomly from an ensemble and presented in rapid succession, and one examines the subset that elicit action potentials. This “spike-triggered” stimulus ensemble can provide information about the neuron’s response characteristics. In the most widely used form of this analysis, one estimates an excitatory linear kernel by computing the spike-triggered average (STA); that is, the mean stimulus that elicited a spike [e.g., 1, 2]. Under the assumption that spikes are generated by a Poisson process with instantaneous rate determined by linear projection onto a kernel followed by a static nonlinearity, the STA provides an unbiased estimate of this kernel [3]. Recently, a number of authors have developed interesting extensions of white noise analysis. Some have examined spike-triggered averages in a reduced linear subspace of input stimuli [e.g., 4]. Others have recovered excitatory subspaces, by computing the spiketriggered covariance (STC), followed by an eigenvector analysis to determine the subspace axes [e.g., 5, 6]. Sensory neurons exhibit striking nonlinear behaviors that are not explained by fundamentally linear mechanisms. For example, the response of a neuron typically saturates for large amplitude stimuli; the response to the optimal stimulus is often suppressed by the presence of a non-optimal mask [e.g., 7]; and the kernel recovered from STA analysis may change shape as a function of stimulus amplitude [e.g., 8, 9]. A variety of these nonlinear behaviors can be attributed to gain control [e.g., 8, 10, 11, 12, 13, 14], in which neural responses are suppressively modulated by a gain signal derived from the stimulus. Although the underlying mechanisms and time scales associated with such gain control are current topics of research, the basic functional properties appear to be ubiquitous, occurring throughout the nervous system. a b 0 k0 0 Figure 1: Geometric depiction of spike-triggered analyses. a, Spike-triggered averaging with two-dimensional stimuli. Black points indicate raw stimuli. White points indicate stimuli eliciting a spike, and the STA (black vector), which provides an estimate of , corresponds to their center of mass. b, Spike-triggered covariance analysis of suppressive axes. Shown are a set of stimuli lying on a plane perpendicular to the excitatory kernel, . Within the plane, stimuli eliciting a spike are concentrated in an elliptical region. The minor axis of the ellipse corresponds to a suppressive stimulus direction: stimuli with a significant component along this axis are less likely to elicit spikes. The stimulus component along the major axis of the ellipse has no influence on spiking. ¢ £ ¡ ¢ ¡ Here we develop a white noise methodology for characterizing a neuron with gain control. We show that a set of suppressive kernels may be recovered by finding the eigenvectors of the spike-triggered covariance matrix associated with smallest variance. We apply the technique to electrophysiological data obtained from ganglion cells in salamander and macaque retina, and recover a set of axes that are shown to reduce responses in the neuron. Moreover, when we fit a gain control model to the data using a maximum likelihood procedure within this subspace, the model accounts for changes in the STA as a function of contrast. 1 Characterizing suppressive axes ¤¥ As in all white noise approaches, we assume that stimuli correspond to vectors, , in some finite-dimensional space (e.g., a neighborhood of pixels or an interval of time samples). We assume a gain control model in which the probability of a stimulus eliciting a spike grows monotonically with the halfwave-rectified projection onto an excitatory linear kernel, , and is suppressively modulated by the fullwave-rectified projection onto a set of . linear kernels, ¨ ¤§ ¤¥ ©¤ § ¨ ¤ ¥ ©¤ § ¦ First, we recover the excitatory kernel, . This is achieved by presenting spherically symmetric input stimuli (e.g., Gaussian white noise) to the neuron and computing the STA (Fig. 1a). STA correctly recovers the excitatory kernel, under the assumption that each of the gain control kernels are orthogonal (or equal) to the excitatory kernel. The proof is essentially the same as that given for recovering the kernel of a linear model followed by a monotonic nonlinearity [3]. In particular, any stimulus can be decomposed into a component in the direction of the excitatory kernel, and a component in a perpendicular direction. This can be paired with another stimulus that is identical, except that its component in the perpendicular direction is negated. The two stimuli are equally likely to occur in a spherically Gaussian stimulus set (since they are equidistant from the origin), and they are equally likely to elicit a spike (since their excitatory components are equal, and their rectified perpendicular components are equal). Their vector average lies in the direction of the excitatory kernel. Thus, the STA (which is an average over all such stimuli, or all such stimulus pairs) must also lie in that direction. In a subsequent section we explain how to Model: Retrieved: Excitatory: Excitatory: Eigenvalues: Suppressive: Suppressive: Weights Variance (eigenvalue) { 1.5 { 2{ 2.5 { 3{ 1 1 Arbitrary 0 Axis number 350 Figure 2: Estimation of kernels from a simulated model (equation 2). Left: Model kernels. Right: Sorted eigenvalues of covariance matrix of stimuli eliciting spikes (STC). Five eigenvalues fall significantly below the others. Middle: STA (excitatory kernel) and eigenvectors (suppressive kernels) associated with the lowest eigenvalues. recover the excitatory kernel when it is not orthogonal to the suppressive kernels. Next, we recover the suppressive subspace, assuming the excitatory kernel is known. Consider the stimuli lying on a plane perpendicular to this kernel. These stimuli all elicit the same response in the excitatory kernel, but they may produce different amounts of suppression. Figure 1b illustrates the behavior in a three-dimensional stimulus space, in which one axis is assumed to be suppressive. The distribution of raw stimuli on the plane is spherically symmetric about the origin. But the distribution of stimuli eliciting a spike is narrower along the suppressive direction: these stimuli have a component along the suppressive axis and are therefore less likely to elicit a spike. This behavior is easily generalized from this plane to the entire stimulus space. If we assume that the suppressive axes are fixed, then we expect to see reductions in variance in the same directions for any level of numerator excitation. Given this behavior of the spike-triggered stimulus ensemble, we can recover the suppressive subspace using principal component analysis. We construct the sample covariance matrix of the stimuli eliciting a spike: £ §¥
same-paper 3 0.81297851 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
4 0.72823131 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
5 0.71472371 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schćžšlkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1
6 0.71403229 60 nips-2001-Discriminative Direction for Kernel Classifiers
7 0.71300489 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.71279174 185 nips-2001-The Method of Quantum Clustering
10 0.71263301 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
11 0.71121061 13 nips-2001-A Natural Policy Gradient
12 0.7106694 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
13 0.71013618 46 nips-2001-Categorization by Learning and Combining Object Parts
14 0.70912439 22 nips-2001-A kernel method for multi-labelled classification
15 0.70894408 8 nips-2001-A General Greedy Approximation Algorithm with Applications
16 0.70842761 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
17 0.70801443 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
18 0.70662159 56 nips-2001-Convolution Kernels for Natural Language
19 0.70345247 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
20 0.70330679 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems