nips nips2001 nips2001-70 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are [sent-5, score-1.005]
2 discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. [sent-6, score-0.145]
3 We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. [sent-7, score-0.952]
4 Compared methods include decision trees and generalized linear models. [sent-9, score-0.026]
5 The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. [sent-10, score-0.768]
6 1 Introduction The main mathematical problem faced by actuaries is that of estimating how much each insurance contract is expected to cost. [sent-11, score-0.601]
7 This conditional expected claim amount is called the pure premium and it is the basis of the gross premium charged to the insured. [sent-12, score-1.109]
8 This expected value is conditioned on information available about the insured and about the contract, which we call input profile here. [sent-13, score-0.029]
9 This distribution has a mass at zero: the vast majority of the insurance contracts do not yield any claim. [sent-16, score-0.548]
10 This distribution is also strongly asymmetric and it has fat tails (on one side only, corresponding to the large claims). [sent-17, score-0.221]
11 In this paper we study and compare several learning algorithms along with methods traditionally used by actuaries for setting insurance premia. [sent-18, score-0.459]
12 The study is performed on a large database of automobile insurance policies. [sent-19, score-0.464]
13 In a variety of practical applications, we often find data distributions with an asymmetric heavy tail extending out towards more positive values. [sent-21, score-0.15]
14 Modeling data with such an asymmetric heavy-tail distribution is essentially difficult because outliers, which are sampled from the tail of the distribution, have a strong influence on parameter estimation. [sent-22, score-0.174]
15 When the distribution is symmetric (around the mean), the problems caused by outliers can be reduced using robust estimation techniques (Huber, 1982; F. [sent-23, score-0.145]
16 Note that these techniques do not work for an asymmetric distribution: most outliers are on the same side of the mean, so downweighting them introduces a strong bias on its estimation: the conditional expectation would be systematically underestimated. [sent-27, score-0.338]
17 There is another statistical difficulty, due to the large number of variables (mostly discrete) and the fact that many interactions exist between them. [sent-28, score-0.024]
18 Thus the traditional actuarial methods based on tabulating average claim amounts for combinations of values are quickly hurt by the curse of dimensionality, unless they make hurtful independence assumptions (Bailey and Simon, 1960). [sent-29, score-0.34]
19 We then highlight our most important experimental results (section 4), and in view of them conclude with an examination of the prospects for applying statistical learning algorithms to insurance modeling (section 5). [sent-32, score-0.446]
20 2 Mathematical Objectives The first goal of insurance premia modeling is to estimate the expected claim amount for a given insurance contract for a future one-year period (here we consider that the amount is 0 when no claim is filed). [sent-33, score-2.106]
21 Let X ∈ R^m denote the customer and contract input profile, a vector representing all the information known about the customer and the proposed insurance policy before the beginning of the contract. [sent-34, score-0.709]
22 Let A ∈ R_+ denote the amount that the customer claims during the contract period; we shall assume that A is non-negative. [sent-35, score-0.373]
23 Our objective is to estimate this claim amount, which is the pure premium Ppure of a given contract x: Ppure(x) = E[A | X = x]. (1) [sent-36, score-0.737]
24 One possible criterion is to seek the most precise estimator, which minimizes the mean-squared error (MSE) over a data set D = {(x_i, a_i), i = 1, ..., n}. [sent-39, score-0.084]
25 Let P = {p(·; θ)} be a function class parametrized by θ. (The pure premium is distinguished from the premium actually charged to the customer, which must account for the risk remaining with the insurer, the administrative overhead, desired profit, and other business costs.) [sent-40, score-0.701]
26 The MSE criterion produces the most precise function (on average) within the class, as measured with respect to D: θ* = argmin_θ (1/n) Σ_{i=1}^{n} (p(x_i; θ) − a_i)². (2) [sent-42, score-0.084]
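As an illustration of the MSE criterion (2), the sketch below fits a premium model by gradient descent on the empirical squared error. This is not the authors' code; the linear parametrization p(x; θ) = w·x + b, the learning rate, and the variable names are assumptions made purely for the example.

```python
import numpy as np

def fit_mse_linear(X, a, lr=1e-3, n_steps=1000):
    """Fit p(x; theta) = X.w + b by minimizing (1/n) sum_i (p(x_i) - a_i)^2,
    i.e. the MSE criterion of equation (2)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_steps):
        resid = X @ w + b - a              # p(x_i; theta) - a_i
        w -= lr * (2.0 / n) * (X.T @ resid)
        b -= lr * (2.0 / n) * resid.sum()
    return w, b
```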
27 Is it an appropriate criterion and why? [sent-43, score-0.055]
28 However, in insurance policy pricing, the precision criterion is not the sole part of the picture; just as important is that the estimated premia do not systematically discriminate against specific segments of the population. [sent-46, score-0.988]
29 We define the bias of the premia b(P) to be the difference between the average premium and the average incurred amount, in a given population P: b(P) = (1/|P|) Σ_{(x_i, a_i) ∈ P} (p(x_i) − a_i), (3) where |P| denotes the cardinality of the set P, and p(·) [sent-48, score-0.889]
30 A possible fairness criterion would be based on minimizing the norm of the bias over every subpopulation Q of P. [sent-50, score-0.211]
31 From a practical standpoint, such a minimization would be extremely difficult to carry out. [sent-51, score-0.037]
32 Furthermore, the bias over small subpopulations is hard to estimate with statistical significance. [sent-52, score-0.095]
33 After training a model to minimize the MSE criterion (2), we define a finite number of disjoint subsets (subpopulations) of the test set P, P_k ⊂ P, P_k ∩ P_{j≠k} = ∅, and verify that the absolute bias is not significantly different from zero. [sent-54, score-0.182]
34 The subsets P_k can be chosen at convenience; in our experiments, we considered 10 equal-size subsets delimited by the deciles of the test set premium distribution. [sent-55, score-0.381]
35 In this way, we verify that, for example, for the group of contracts with a premium between the 5th and the 6th decile, the average premium matches the average claim amount. [sent-56, score-0.983]
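The decile-based fairness check described above can be sketched as follows; this is our illustration, not the paper's code, and the equal-size split by sorted predicted premium is an assumption about the grouping.

```python
import numpy as np

def decile_bias(pred_premium, claim_amount, n_groups=10):
    """Per-group bias b(P_k): average predicted premium minus average
    incurred claim, for groups delimited by premium deciles."""
    order = np.argsort(pred_premium)
    groups = np.array_split(order, n_groups)
    return [float(pred_premium[g].mean() - claim_amount[g].mean()) for g in groups]
```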
36 3 Models Evaluated An important requirement for any model of insurance premia is that it should produce positive premia: the company does not want to charge negative money to its customers! [sent-57, score-0.946]
37 To obtain positive outputs from neural networks, we have considered using an exponential activation function at the output layer, but this created numerical difficulties (when the argument of the exponential is large, the gradient is huge). [sent-58, score-0.099]
38 Instead, we have successfully used the "softplus" activation function (Dugas et al. [sent-59, score-0.026]
39 , 2001): softplus(s) = log(1 + e^s), where s is the weighted sum of an output neuron, and softplus(s) is the corresponding predicted premium. [sent-60, score-0.023]
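For reference, the softplus output activation can be computed in a numerically stable way; the stabilization below is a standard implementation trick, not something specified in the paper.

```python
import numpy as np

def softplus(s):
    """softplus(s) = log(1 + exp(s)), written to avoid overflow for large s."""
    return np.maximum(s, 0.0) + np.log1p(np.exp(-np.abs(s)))
```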
40 The best model that we obtained is a mixture of experts in which the experts are positive-output neural networks. [sent-62, score-0.306]
41 , 1991) has softmax outputs to obtain positive weights summing to one. [sent-64, score-0.073]
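A minimal sketch of such a mixture of experts at prediction time: softmax gates weight positive (softplus) expert outputs. The single-layer experts and gater below are simplifications for illustration only, not the authors' architecture.

```python
import numpy as np

def mixture_premium(x, expert_params, gater_W, gater_b):
    """Predicted premium = sum_k gate_k(x) * softplus(expert_k(x));
    the softmax gates are positive and sum to one."""
    g = gater_W @ x + gater_b
    g = np.exp(g - g.max())                  # numerically stable softmax
    gates = g / g.sum()
    raw = np.array([w @ x + b for (w, b) in expert_params])
    experts = np.log1p(np.exp(raw))          # softplus keeps outputs positive
    return float(gates @ experts)
```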
42 Figure 1 (plot panels): top, distribution of (claim − prediction) in each prediction quintile; bottom, proportion of non-zero claims in each prediction quintile. [sent-65, score-0.117]
44 Figure 1: A view of the conditional distribution of the claim amounts in the out-of-sample test set. [sent-89, score-0.556]
45 Top: probability density of (claim amount - conditional expectation) for 5 quintiles of the conditional expectation, excluding zero-claim records. [sent-90, score-0.181]
46 The mode moves left for increasing conditional expectation quintiles. [sent-91, score-0.102]
47 Bottom: proportion of non-zero claim records per quintile of the prediction. [sent-92, score-0.377]
48 The linear model corresponds to a ridge linear regression (with weight decay chosen with the validation set). [sent-95, score-0.141]
49 Generalized linear models (GLM) estimate the conditional expectation from f(x) = e^{b + w·x} with parameters b and w. [sent-96, score-0.126]
50 Again weight decay is used and tuned on the validation set. [sent-97, score-0.08]
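A rough sketch of fitting the GLM form f(x) = e^{b + w·x} by gradient descent on squared error with weight decay; actual GLM estimation would normally use iteratively reweighted least squares under a specific noise model, which is not shown, and all hyperparameter values here are placeholders.

```python
import numpy as np

def fit_glm_exp(X, a, lr=1e-4, weight_decay=1e-3, n_steps=2000):
    """Fit f(x) = exp(b + x.w) to claim amounts a, with weight decay
    (tuned on the validation set in the paper)."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_steps):
        f = np.exp(b + X @ w)
        g = (2.0 / n) * (f - a) * f          # gradient of (f - a)^2 w.r.t. b + x.w
        w -= lr * (X.T @ g + weight_decay * w)
        b -= lr * g.sum()
    return w, b
```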
51 There are many variants of GLMs and they are popular for building insurance models, since they provide positive outputs, interpretable parameters, and can be associated with parametric models of the noise. [sent-98, score-0.485]
52 Decision trees are also used by practitioners in the insurance industry, in particular the CHAID-type models (Kass, 1980; Biggs, Ville and Suen, 1991), which use statistical criteria for deciding how to split nodes and when to stop growing the tree. [sent-99, score-0.47]
53 We have compared our models with a CHAID implementation based on (Biggs, Ville and Suen, 1991), adapted for regression purposes using a MANOVA analysis. [sent-100, score-0.085]
54 The threshold parameters were selected based on validation set MSE. [sent-101, score-0.08]
55 Models have been sorted in ascending order of test results. [sent-233, score-0.04]
56 The training, validation and test curves have been shifted closer together for visualization purposes (the significant differences in MSE between the 3 sets are due to "outliers"). [sent-234, score-0.12]
57 The out-of-sample test performance of the Mixture model is significantly better than that of any of the other models. [sent-235, score-0.075]
58 Validation-based model selection is confirmed on test results. [sent-236, score-0.04]
59 Since the median is actually 0 for our data, we tried to train the SVM using only the cases with positive claim amounts, and compared the performance to that obtained with the GLM and the neural network. [sent-239, score-0.415]
60 Figure 1 (top) illustrates the fat tails and asymmetry of the conditional distribution of the claim amounts. [sent-241, score-0.498]
61 Finally, we compared the best statistical model with a proprietary table-based and rule-based premium estimation method that was provided to us as the benchmark against which to judge improvements. [sent-243, score-0.391]
62 4 Experimental Results Data from five kinds of losses were included in the study (Le. [sent-244, score-0.109]
63 The input variables contain information about the policy (e. [sent-246, score-0.026]
64 For most models except CHAID, the discrete variables are one-hot encoded. [sent-256, score-0.024]
65 Table 1: Statistical comparison of the prediction accuracy difference between several individual learning models and the best Mixture model. [sent-258, score-0.09]
66 The p-value is given under the null hypothesis of no difference between Model #1 and the best Mixture model. [sent-259, score-0.033]
67 Table 2: MSE difference between benchmark and Mixture models across the 5 claim categories (kinds of losses) and the total claim amount. [sent-285, score-0.715]
68 In all cases except category 1, the Mixture model is statistically significantly (p < 0. [sent-286, score-0.105]
69 Table 2 layout: rows for Category 1 through Category 5 (kinds of losses) and the total claim amount; column: MSE Difference, Benchmark minus Mixture; first entry 20669.24. [sent-288, score-0.368]
70 An overall data set containing about 8 million examples is randomly permuted and split into a training set, validation set and test set, respectively of size 50%, 25% and 25% of the total. [sent-306, score-0.12]
71 The validation set is used to select among models (including the choice of capacity), and the test set is used for final statistical comparisons. [sent-307, score-0.168]
72 Sample-wise paired statistical tests are used to reduce the effect of huge per-sample variability. [sent-308, score-0.068]
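One way to carry out such a sample-wise paired test is a paired t-test on per-example squared-error differences; the use of scipy and of the t statistic here is our choice for illustration, not something stated in the paper.

```python
import numpy as np
from scipy import stats

def paired_mse_test(pred_1, pred_2, claims):
    """Paired test on per-example squared-error differences between two models;
    pairing removes much of the huge per-sample variability."""
    d = (pred_1 - claims) ** 2 - (pred_2 - claims) ** 2
    t_stat, p_value = stats.ttest_1samp(d, 0.0)   # H0: no MSE difference
    return float(d.mean()), float(p_value)
```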
73 Figure 1 is an attempt at capturing the shape of the conditional distribution of claim amounts given input profiles, by considering the distributions of claim amounts in different quantiles of the prediction (pure premium), on the test set. [sent-309, score-0.845]
74 The top figure excludes the point mass of zero claims and rather shows the difference between the claim amount and the estimated conditional expectation (obtained with the mixture model). [sent-310, score-0.767]
75 The bottom histogram shows that the fraction of claims increases nicely for the higher predicted pure premia. [sent-311, score-0.226]
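The bottom panel of Figure 1 can in principle be reproduced with a computation like the following (our sketch; variable names are hypothetical): bucket test contracts by quintile of the predicted pure premium and measure the fraction of non-zero claims in each bucket.

```python
import numpy as np

def nonzero_claim_rate_by_quintile(pred_premium, claim_amount, n_quantiles=5):
    """Fraction of contracts with a non-zero claim in each prediction quintile."""
    order = np.argsort(pred_premium)
    buckets = np.array_split(order, n_quantiles)
    return [float((claim_amount[b] > 0).mean()) for b in buckets]
```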
76 Table 1 and Figure 2 summarize the comparison between the test MSE of the different tested models. [sent-312, score-0.04]
77 NN is a neural network with linear output activation whereas Softplus NN has softplus output activations. [sent-313, score-0.195]
78 The Mixture is the mixture of softplus neural networks. [sent-314, score-0.312]
79 This result identifies the mixture model with softplus neural networks as the best-performing of the tested statistical models. [sent-315, score-0.363]
80 Our conjecture is that the mixture model works better because it is more robust to the effect of "outliers" (large claims). [sent-316, score-0.179]
81 Classical robust regression methods (Rousseeuw and Leroy, 1987) work by discarding or downweighting outliers: they cannot be applied here because the claims distribution is highly asymmetric (the extreme values are always large ones, the claims being all non-negative). [sent-317, score-0.541]
82 Note that the capacity of each model has been tuned on the validation set. [sent-318, score-0.08]
83 Figure 3: The premia difference distribution is negatively skewed, but has a positive median for a mean of zero (x-axis: difference between premia, in $). [sent-351, score-1.11]
84 This implies that the benchmark model (current pricing) undercharges risky customers, while overcharging typical customers. [sent-352, score-0.182]
85 The improvements are shown across the five types of losses. [sent-354, score-0.033]
86 In all cases the mixture improves, and the improvement is significant in four out of the five as well as across the sum of the five. [sent-355, score-0.176]
87 A qualitative analysis of the resulting predicted premia shows that the mixture model has smoother and more spread-out premia than the benchmark. [sent-356, score-1.136]
88 The analysis (figure 3) also reveals that the difference between the mixture premia and the benchmark premia is negatively skewed, with a positive median, i.e. [sent-357, score-1.299]
89 the typical customer will pay less under the new mixture model, but the "bad" (risky) customers will pay much more. [sent-359, score-0.343]
90 To evaluate fairness, as discussed in the previous section, the distribution of premia computed by the best model is analyzed, splitting the contracts in 10 groups according to their premium level. [sent-360, score-0.906]
91 Figure 4 shows that the premia charged are fair for each sub-population. [sent-361, score-0.568]
92 5 Conclusion This paper illustrates a successful data-mining application in the insurance industry. [sent-362, score-0.422]
93 It shows that a specialized model (the mixture model), that was designed taking into consideration the specific problem posed by the data (outliers, asymmetric distribution, positive outputs), performs significantly better than existing and popular learning algorithms. [sent-363, score-0.299]
94 It also shows that such models can significantly improve over the current practice, making it possible to compute premia that are lower for less risky contracts and higher for more risky contracts, thereby reducing the cost of the median contract. [sent-364, score-0.947]
95 Future work should investigate in more detail the role of temporal non-stationarity, how to optimize fairness (rather than just test for it afterwards), and how to further increase the robustness of the model with respect to large claim amounts. [sent-365, score-0.46]
96 Figure 4: Difference with incurred claims (sum of all KOL-groups). [sent-366, score-0.194]
97 The comparison is for the sum of claim amounts over all 5 kinds of losses (KOL). [sent-404, score-0.416]
wordName wordTfidf (topN-words)
[('premia', 0.485), ('insurance', 0.422), ('premium', 0.295), ('claim', 0.293), ('mse', 0.176), ('softplus', 0.169), ('claims', 0.147), ('mixture', 0.143), ('fairness', 0.127), ('risky', 0.11), ('aix', 0.11), ('chaid', 0.105), ('contracts', 0.1), ('contract', 0.093), ('quintile', 0.084), ('customer', 0.084), ('outliers', 0.083), ('median', 0.083), ('asymmetric', 0.082), ('validation', 0.08), ('benchmark', 0.072), ('category', 0.07), ('conditional', 0.066), ('decile', 0.063), ('dugas', 0.063), ('fat', 0.063), ('leroy', 0.063), ('rousseeuw', 0.063), ('suen', 0.063), ('ville', 0.063), ('regression', 0.061), ('pure', 0.056), ('charged', 0.055), ('biggs', 0.055), ('kass', 0.055), ('criterion', 0.055), ('tails', 0.05), ('glm', 0.05), ('customers', 0.05), ('amount', 0.049), ('amounts', 0.047), ('incurred', 0.047), ('experts', 0.045), ('huge', 0.044), ('car', 0.044), ('automobile', 0.042), ('bailey', 0.042), ('downweighting', 0.042), ('negatively', 0.042), ('neider', 0.042), ('ppure', 0.042), ('subpopulations', 0.042), ('nn', 0.042), ('losses', 0.042), ('test', 0.04), ('positive', 0.039), ('difficult', 0.037), ('pricing', 0.037), ('actuaries', 0.037), ('expectation', 0.036), ('robust', 0.036), ('significantly', 0.035), ('outputs', 0.034), ('kinds', 0.034), ('mccullagh', 0.033), ('pay', 0.033), ('prediction', 0.033), ('difference', 0.033), ('five', 0.033), ('montreal', 0.031), ('huber', 0.031), ('svm', 0.03), ('bias', 0.029), ('precise', 0.029), ('tail', 0.029), ('profile', 0.029), ('fair', 0.028), ('jacobs', 0.028), ('vapnik', 0.027), ('identifies', 0.027), ('skewed', 0.027), ('generalized', 0.026), ('activation', 0.026), ('reasons', 0.026), ('distribution', 0.026), ('policy', 0.026), ('bengio', 0.026), ('minus', 0.026), ('simon', 0.026), ('wiley', 0.025), ('mathematical', 0.025), ('models', 0.024), ('statistical', 0.024), ('estimating', 0.024), ('pi', 0.023), ('difficulty', 0.023), ('cd', 0.023), ('confidence', 0.023), ('predicted', 0.023), ('subsets', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are .discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1
2 0.11155788 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
3 0.083134301 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
4 0.074826688 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
5 0.062925346 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
6 0.05890014 8 nips-2001-A General Greedy Approximation Algorithm with Applications
7 0.057936441 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.055737901 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
9 0.048204798 139 nips-2001-Online Learning with Kernels
10 0.045984264 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
11 0.043659892 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
12 0.043012977 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
13 0.042913176 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
14 0.041659974 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
15 0.041208412 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
16 0.040984165 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
17 0.04015483 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
18 0.04010817 62 nips-2001-Duality, Geometry, and Support Vector Regression
19 0.038146194 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
20 0.037897985 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
topicId topicWeight
[(0, -0.139), (1, 0.031), (2, 0.004), (3, 0.044), (4, -0.033), (5, -0.005), (6, 0.032), (7, -0.03), (8, -0.035), (9, -0.032), (10, 0.054), (11, 0.027), (12, -0.004), (13, -0.075), (14, 0.042), (15, 0.003), (16, 0.036), (17, -0.033), (18, 0.005), (19, 0.048), (20, 0.008), (21, -0.034), (22, 0.024), (23, 0.027), (24, -0.02), (25, -0.043), (26, 0.011), (27, -0.001), (28, 0.177), (29, -0.111), (30, 0.05), (31, -0.065), (32, 0.089), (33, -0.081), (34, 0.089), (35, 0.005), (36, -0.006), (37, 0.104), (38, -0.1), (39, -0.103), (40, 0.038), (41, 0.034), (42, 0.028), (43, 0.001), (44, -0.054), (45, -0.023), (46, 0.1), (47, 0.049), (48, 0.045), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.91440243 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are .discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1
2 0.70397961 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
3 0.62835979 41 nips-2001-Bayesian Predictive Profiles With Applications to Retail Transaction Data
Author: Igor V. Cadez, Padhraic Smyth
Abstract: Massive transaction data sets are recorded in a routine manner in telecommunications, retail commerce, and Web site management. In this paper we address the problem of inferring predictive individual profiles from such historical transaction data. We describe a generative mixture model for count data and use an an approximate Bayesian estimation framework that effectively combines an individual’s specific history with more general population patterns. We use a large real-world retail transaction data set to illustrate how these profiles consistently outperform non-mixture and non-Bayesian techniques in predicting customer behavior in out-of-sample data. 1
4 0.53621358 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
Author: Igor V. Cadez, P. S. Bradley
Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available) models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented. 1 Introduction and Notation Consider a data set D = {x1 , x2 , . . . , xn } consisting of n independent, identically distributed (iid) data points. In context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set: n P (D|Θ) = n K P (xi |Θ) = i=1 πk P (xi |θk ), (1) i=1 k=1 where the model is parameterized by Θ = {π1 , . . . , πK , θ1 , . . . , θK }. Each P (.|θk ) represents a mixture component, while πi represents mixture weights. It is often more convenient ∗ Work was done while author was at digiMine, Inc., Bellevue, WA. to operate with the log of the probability and define the log-likelihood function as: n l(Θ|D) = log P (D|Θ) = n log P (xi |Θ) = i=1 LogPi i=1 which is additive over data points rather than multiplicative. The LogPi terms we introduce in the notation represent each data point’s contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing probability1 of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in linear time in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood. Consider an online data source where there are data sets Dt available at certain time intervals t (not necessarily equal time periods or number of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ0 that was built (optimized, fitted) on some in-sample data D0 = {D1 , D2 , . . . , Dt0 }. We would like to be able to detect a change in the underlying distribution of data points within data sets Dt that would be sufficient to require building of a new model Θ1 . 
The criterion for building a new model is loosely defined as “the model does not adequately fit the data anymore”. 2 Model Based Population Similarity In this section we formulate the problem of model-based population similarity and tracking. In case of mixture models we start with the following observations: • The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores). • The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space. • If model is to generalize well (e.g., there is no significant overfitting) it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model. • Dense regions in the data space discovered by a non-overfitting model are the intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data. 1 This approach is called maximum-likelihood estimation. If we included parameter priors we could equally well apply results in this paper to the maximum a posteriori estimation. • If there is confi dence that the model is not overfi tting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model. Given these observations, we seek to defi a measure of data-distribution similarity based ne on how well the dense regions of the data space are preserved when new data is introduced. In model based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which distribution tails off toward smaller LogP values, therefore simple summary scores fail to effi ciently summarize the LogP distribution. For example, log-likelihood is proportional to the mean of LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confi rmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and tail toward the left. 
To defi the similarity of the data distributions based on LogP scores in a purely nonne parametric way we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set Dt to that of the training set D0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specifi KS score c in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D0 . Then the individual in-sample data sets D1 , D2 , . . . , Dt0 are used to estimate the natural variability of the KS statistics. This variability needs to be quantifi due to the fact that the model may not ed truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new dataset Dt , t > t0 are computed. Using principled heuristics, one can then determine whether or not the LogP signature for Dt is signifi cantly different than the LogP signatures for the in-sample data. To clarify various steps, we provide an algorithmic description of the change detection process. Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an “in-sample” dataset D0 = {D1 , D2 , . . . , Dt0 }, proceed as follows: 1. Estimate the parameters Θ0 of the mixture model P (D|Θ) over D0 (see equation (1)). 2. Compute ni log P (xˆ|Θ0 ), xˆ ∈ Di , ni = |Di |, i = 1, . . . , t0 . i i LogP (Di ) = (2) ˆ i=1 3. For 1 ≤ i, j ≤ t0 , compute LKS (i, j) = log [PKS (Di , Dj )]. See [7] for details on PKS computation. 4. For 1 ≤ i ≤ t0 , compute the KS measure MKS (i) as MKS (i) = t0 j=1 LKS (i, j) t0 . (3) 5. Compute µM = M ean[MKS (i)] and σM = ST D[MKS (i)] to quantify the natural variability of MKS over the “in-sample” data. Algorithm 2 (Evaluating New Data): Given a new dataset Dt , t > t0 , µM and σM proceed as follows: 1. 2. 3. 4. Compute LogP (Dt ) as in (2). For 1 ≤ i ≤ t0 , compute LKS (i, t). Compute MKS (t) as in (3). Apply decision criteria using MKS (t), µM , σM to determine whether or not Θ0 is a good fi for the new data. For example, if t |MKS (t) − µM | > 3, σM then Θ0 is not a good fi any more. t (4) Note, however, that the 3-σ interval be interpreted as a confi dence interval only in the limit when number of data sets goes to infi nity. In applications presented in this paper we certainly do not have that condition satisfi and we consider this approach as an “educated ed heuristic” (gaining fi statistical grounds in the limit). rm 2.1 Space and Time Complexity of the Methodology The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. 
In order to assess the time complexity, we use the following notation: nt = |Dt | is the number of data points in the data set Dt ; t0 is the index of the last in-sample data set, but is also the t0 number of in-sample data sets; n0 = |D0 | = t=1 nt is the total number of in-sample data points (in all the in-sample data sets); n = n0 /t0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is nt ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimated the parameters Θ of the mixture model (1). In the fi phase of the rst methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions) which has time complexity of O(n0 ). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t0 (t0 − 1)/2 O(n log n) = O(t2 n log n). In-sample mean and variance of the KS measure 0 are obtained in time which is linear in t0 hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure) which requires O(n0 ) memory. To score an out-of-sample data set Dt , t > t0 , we must fi obtain the LogP distribution rst of Dt which has time complexity of O(n) and then calculate the KS measure relative to each of the in-sample data sets which has time complexity O(n log n) per in-sample data set, or t0 O(n log n) = O(t0 n log n) for the full in-sample period. The LogP distribution for Dt can be discarded once the pairwise KS measures are obtained. 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 0 −2.5 −5.5 −5 −4.5 LogP −4 −3.5 −3 −2.5 −4 −3.5 −3 −2.5 LogP 3000 3000 2500 2500 2000 2000 Count 3500 Count 3500 1500 1500 1000 1000 500 500 0 −5.5 −5 −4.5 −4 −3.5 −3 LogP −2.5 0 −5.5 −5 −4.5 LogP Figure 1: Histograms of LogP scores for two data sets generated from the fi model rst (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fi tted on the in-sample period. Overall, the proposed methodology requires O(n0 ) memory, O(t2 n log n) time for prepro0 cessing and O(t0 n log n) time for out-of-sample evaluation. Further, since t0 is typically a small constant (e.g., t0 = 7 or t0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n). 3 Experimental Setup Experiments presented consist of two parts: experiments on synthetic data and experiments on the aggregations over real web-log data. 3.1 Experiments on Synthetic Data Synthetic data is a valuable tool when determining both applicability and limitations of the proposed approach. Synthetic data was generated by sampling from a a two component CI model (the true model is not used in evaluations). The data consist of a two-state discrete dimension and a continuous dimension. 
First 100 data sets where generated by sampling from a mixture model with parameters: [π1 , π2 ] = [0.6, 0.4] as weights, θ1 = [0.8, 0.2] 2 2 and θ2 = [0.4, 0.6] as discrete state probabilities, [µ1 , σ1 ] = [10, 5] and [µ2 , σ2 ] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ2 = [0.4, 0.6] to θ 2 = [0.5, 0.5] keeping the remaining parameters fi and an additional 100 data sets were generated by xed sampling from this altered model. This is a fairly small change in the distribution and the underlying LogP scores appear to be very similar as can be seen in Figure 1. The fi gure shows LogP distributions for the fi two data sets generated from the fi model (top row) rst rst and the fi two data sets generated from the second model (bottom row). Plots within each rst 0 0 −1 −5 −2 −3 −4 −10 −5 (b) (a) −6 0 20 40 60 80 100 Data set Dt 120 140 160 180 −15 200 0 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 40 60 80 100 Data set Dt 120 140 160 180 200 0 −5 −2 −10 −15 −4 −6 −8 −20 −25 −30 −35 −10 −40 −12 −45 (c) −14 0 20 40 60 80 100 Data set Dt 120 140 160 180 200 −50 (d) 0 20 Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that y-axes have different scales in order to show full variability of the data. row should be more similar than plots from different rows, but this is diffi cult to discern by visual inspection. Algorithms 1 and 2 were evaluated by using the fi 10 data sets to estimate a two comrst ponent model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over in-sample data sets (fi 10) for four experiments varying the number of data points in each experirst ment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real life applications we expect to be able to detect such slight distribution changes. 3.2 Experiments on Real Life Data Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set with over 100 dimensions each. The LogP score distribution is similar to that of synthetic data in Figure 1 which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over insample data has much lower values (see Figure 3 right, and plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower since there are two factors of dissimilarity: 1) different data sets; 2) inability of the CI model to capture all the aspects of the true data distribution. 
Nonetheless, the fi 31 rst −100 5000 −200 4500 4000 −300 3500 Count 3000 2500 2000 −400 −500 −600 1500 1000 −700 500 0 −100 −800 −80 −60 −40 −20 LogP 0 20 40 60 0 10 20 30 40 50 Data set D 60 70 80 90 100 t Figure 3: Left: distribution of 42655 LogP scores from mixture of conditional independence models. The data is a single-day of click-stream data from a commercial web site. Right: Average log(KS probability) over the 31 day in-sample period for a content website showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology. data sets (one month of data) that were used to build the initial model Θ0 can be used to defi the natural variability of the KS measures against which additional data sets can be ne compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verifi by the commered cial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond scope of this paper. 4 Related Work Automatic detection of various types of data changes appear in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]) where the goal is not only to fi unusual data points, but the ones that nd appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling where outliers and trends can contaminate the model estimation. More recently mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addesses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not-novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, e.g. [6]). In that framework there is no such concept as a data distribution change since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift. 5 Conclusions In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases. 
Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an “ as-needed” basis – when the model fails to adequately represent the data after a certain point in time. References [1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984. [2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996. [3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profi ling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001. [4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001. [5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999. [6] R. Neal and G. Hinton. A view of the em algorithm that justifi incremental, sparse and other es variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998. [7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992. [8] B. Sch¨ lkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector o method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Mller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.
5 0.52214432 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
Author: Lawrence K. Saul, Daniel D. Lee
Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1
6 0.45485565 154 nips-2001-Products of Gaussians
7 0.44991508 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
8 0.44360581 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
9 0.3856191 45 nips-2001-Boosting and Maximum Likelihood for Exponential Models
10 0.38502982 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
11 0.38454878 17 nips-2001-A Quantitative Model of Counterfactual Reasoning
12 0.38415927 108 nips-2001-Learning Body Pose via Specialized Maps
13 0.38300166 114 nips-2001-Learning from Infinite Data in Finite Time
14 0.37065798 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
15 0.36713722 149 nips-2001-Probabilistic Abstraction Hierarchies
16 0.36110237 47 nips-2001-Causal Categorization with Bayes Nets
17 0.35005295 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
18 0.32870644 140 nips-2001-Optimising Synchronisation Times for Mobile Devices
19 0.32662401 107 nips-2001-Latent Dirichlet Allocation
20 0.32632124 35 nips-2001-Analysis of Sparse Bayesian Learning
topicId topicWeight
[(8, 0.337), (14, 0.051), (17, 0.031), (19, 0.019), (20, 0.01), (27, 0.115), (30, 0.077), (38, 0.016), (44, 0.015), (59, 0.026), (72, 0.064), (79, 0.032), (83, 0.013), (91, 0.095)]
simIndex simValue paperId paperTitle
same-paper 1 0.77008152 70 nips-2001-Estimating Car Insurance Premia: a Case Study in High-Dimensional Data Inference
Author: Nicolas Chapados, Yoshua Bengio, Pascal Vincent, Joumana Ghosn, Charles Dugas, Ichiro Takeuchi, Linyan Meng
Abstract: Estimating insurance premia from data is a difficult regression problem for several reasons: the large number of variables, many of which are .discrete, and the very peculiar shape of the noise distribution, asymmetric with fat tails, with a large majority zeros and a few unreliable and very large values. We compare several machine learning methods for estimating insurance premia, and test them on a large data base of car insurance policies. We find that function approximation methods that do not optimize a squared loss, like Support Vector Machines regression, do not work well in this context. Compared methods include decision trees and generalized linear models. The best results are obtained with a mixture of experts, which better identifies the least and most risky contracts, and allows to reduce the median premium by charging more to the most risky customers. 1
2 0.75546503 166 nips-2001-Self-regulation Mechanism of Temporally Asymmetric Hebbian Plasticity
Author: N. Matsumoto, M. Okada
Abstract: Recent biological experimental findings have shown that the synaptic plasticity depends on the relative timing of the pre- and postsynaptic spikes which determines whether Long Term Potentiation (LTP) occurs or Long Term Depression (LTD) does. The synaptic plasticity has been called “Temporally Asymmetric Hebbian plasticity (TAH)”. Many authors have numerically shown that spatiotemporal patterns can be stored in neural networks. However, the mathematical mechanism for storage of the spatio-temporal patterns is still unknown, especially the effects of LTD. In this paper, we employ a simple neural network model and show that interference of LTP and LTD disappears in a sparse coding scheme. On the other hand, it is known that the covariance learning is indispensable for storing sparse patterns. We also show that TAH qualitatively has the same effect as the covariance learning when spatio-temporal patterns are embedded in the network. 1
3 0.4875856 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
4 0.48367944 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schćžšlkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1
5 0.48312736 60 nips-2001-Discriminative Direction for Kernel Classifiers
Author: Polina Golland
Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.
6 0.48305517 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
7 0.48241049 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.48240858 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
9 0.48186854 8 nips-2001-A General Greedy Approximation Algorithm with Applications
10 0.48107558 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
11 0.48053968 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
13 0.47858661 56 nips-2001-Convolution Kernels for Natural Language
14 0.47857198 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
15 0.47761589 185 nips-2001-The Method of Quantum Clustering
16 0.47715896 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
17 0.47688347 121 nips-2001-Model-Free Least-Squares Policy Iteration
18 0.476668 197 nips-2001-Why Neuronal Dynamics Should Control Synaptic Learning Rules
19 0.4758583 46 nips-2001-Categorization by Learning and Combining Object Parts
20 0.47580442 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms