nips nips2001 nips2001-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: B. Zadrozny
Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.
Reference: text
sentIndex sentText sentNum sentScore
1 Reducing multiclass to binary by coupling probability estimates Bianca Zadrozny Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92093-0114 zadrozny@cs.ucsd.edu [sent-1, score-0.712]
2 Abstract This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. [sent-3, score-1.327]
3 This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. [sent-4, score-0.67]
4 Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates. [sent-5, score-0.71]
5 1 Introduction The two most well-known approaches for reducing a multiclass classification problem to a set of binary classification problems are known as one-against-all and all-pairs. [sent-6, score-0.422]
6 In the one-against-all approach, we train a classifier for each of the classes using as positive examples the training examples that belong to that class, and as negatives all the other training examples. [sent-7, score-0.232]
7 In the all-pairs approach, we train a classifier for each possible pair of classes ignoring the examples that do not belong to the classes in question. [sent-8, score-0.178]
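For concreteness, a minimal sketch (not taken from the paper) of how these two decompositions carve a multiclass training set (X, y) into binary problems; the helper names and the numpy representation are assumptions of this illustration.

```python
import numpy as np

def one_against_all_problems(X, y, classes):
    # One binary problem per class: positives are that class, negatives are all other examples.
    problems = []
    for c in classes:
        labels = np.where(y == c, 1, -1)
        problems.append((c, X, labels))
    return problems

def all_pairs_problems(X, y, classes):
    # One binary problem per unordered pair of classes; examples of other classes are dropped.
    problems = []
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            mask = (y == classes[a]) | (y == classes[b])
            labels = np.where(y[mask] == classes[a], 1, -1)
            problems.append(((classes[a], classes[b]), X[mask], labels))
    return problems
```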
Allwein et al. [Allwein et al., 2000] have shown that there are many other ways in which a multiclass problem can be decomposed into a number of binary classification problems. [sent-11, score-0.388]
We can represent each such decomposition by a code matrix M in {-1, 0, +1}^(k x l), where k is the number of classes and l is the number of binary classification problems. [sent-12, score-0.522]
If M(c, b) = 1 then the examples belonging to class c are considered to be positive examples for the binary classification problem b. [sent-13, score-0.4]
Similarly, if M(c, b) = -1 the examples belonging to c are considered to be negative examples for b, and if M(c, b) = 0 the examples belonging to c are not used in training b. [sent-14, score-0.301]
12 The ECOC scheme does not allow zeros in the code matrix, meaning that all examples are used in each binary classification problem. [sent-17, score-0.461]
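As an illustration, the one-against-all and all-pairs decompositions can themselves be written as k x l code matrices with entries in {-1, 0, +1}; a hypothetical numpy construction (the all-pairs matrix contains zeros, so it is not a valid ECOC matrix under the restriction above):

```python
import numpy as np

def one_against_all_matrix(k):
    # k columns; column c is +1 for class c and -1 for every other class (no zeros).
    return 2 * np.eye(k, dtype=int) - 1

def all_pairs_matrix(k):
    # k*(k-1)/2 columns; the column for pair (i, j) is +1 for class i, -1 for class j,
    # and 0 for all other classes.
    cols = []
    for i in range(k):
        for j in range(i + 1, k):
            col = np.zeros(k, dtype=int)
            col[i], col[j] = 1, -1
            cols.append(col)
    return np.stack(cols, axis=1)
```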
13 Orthogonal to the problem of choosing a code matrix for reducing multiclass to binary is the problem of classifying an example given the labels assigned by each binary classifier. [sent-18, score-0.95]
To classify an example x, Allwein et al. [Allwein et al., 2000] first create a vector v of length l containing the -1/+1 labels assigned to x by each binary classifier. [sent-21, score-0.133]
15 Then, they compute the Hamming distance between v and each row of M, and find the row c that is closest to v according to this metric. [sent-22, score-0.101]
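A minimal sketch of this Hamming decoding step, assuming a vector v of +/-1 predictions and a code matrix M; counting zero entries of M as half a disagreement is a common convention and an assumption here, since the text above describes only the zero-free case.

```python
import numpy as np

def hamming_decode(v, M):
    """Predict the class whose row of M is closest in Hamming distance to the label vector v.

    v : length-l vector of -1/+1 labels from the binary classifiers.
    M : k x l code matrix; zero entries (if any) are counted as half a disagreement.
    """
    agree = M * v                              # +1 agree, -1 disagree, 0 where M is 0
    dist = np.sum((1 - agree) / 2.0, axis=1)   # distance of v to each row of M
    return int(np.argmin(dist))
```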
For the case in which the binary classifiers output a score whose magnitude is a measure of confidence in the prediction, they use a loss-based decoding approach that takes into account the scores to calculate the distance between v and each row of M, instead of using the Hamming distance. [sent-25, score-0.467]
17 However, both of these methods simply assign a class label to each example. [sent-30, score-0.153]
They do not output class membership probability estimates P̂(C = c | X = x) for an example x. [sent-31, score-0.54]
19 These probability estimates are important when the classification outputs are not used in isolation and must be combined with other sources of information, such as misclassification costs [Zadrozny and Elkan, 2001a] or the outputs of another classifier. [sent-32, score-0.341]
Given a code matrix M and a binary classification learning algorithm that outputs probability estimates, we would like to couple the estimates given by each binary classifier in order to obtain class membership probability estimates for the multiclass problem. [sent-33, score-1.688]
Hastie and Tibshirani [Hastie and Tibshirani, 1998] describe a solution for obtaining probability estimates P̂(C = c | X = x) in the all-pairs case by coupling the pairwise probability estimates, which we describe in Section 2. [sent-34, score-0.423]
22 In Section 3, we extend the method to arbitrary code matrices. [sent-35, score-0.402]
23 In Section 4 we discuss the loss-based decoding approach in more detail and compare it mathematically to the method by Hastie and Tibshirani. [sent-36, score-0.229]
2 Coupling pairwise probability estimates. We are given pairwise probability estimates r_ij(x) for every pair of classes i and j, obtained by training a classifier using the examples belonging to class i as positives and the examples belonging to class j as negatives. [sent-38, score-1.371]
We would like to couple these estimates to obtain a set of class membership probabilities p_i(x) = P(C = c_i | X = x) for each example x. [sent-39, score-0.64]
The r_ij are related to the p_i according to r_ij(x) = P(C = c_i | C = c_i or C = c_j, X = x) = p_i(x) / (p_i(x) + p_j(x)). Since we additionally require that ∑_i p_i(x) = 1, there are k - 1 free parameters and k(k-1)/2 constraints. [sent-40, score-0.966]
27 This implies that there may not exist pi satisfying these constraints. [sent-41, score-0.164]
Let n_ij be the number of training examples used to train the binary classifier that predicts r_ij. To find estimates p̂_i(x) whose implied pairwise probabilities r̂_ij(x) = p̂_i(x) / (p̂_i(x) + p̂_j(x)) fit the observed r_ij(x), Hastie and Tibshirani propose the following procedure: [sent-42, score-0.744]
1. Start with an initial guess for the p̂_i(x) and compute the corresponding r̂_ij(x).
2. Repeat until convergence:
   (a) For each i = 1, ..., k: p̂_i(x) ← p̂_i(x) · (∑_{j≠i} n_ij r_ij(x)) / (∑_{j≠i} n_ij r̂_ij(x))
   (b) Renormalize the p̂_i(x).
   (c) Recompute the r̂_ij(x). [sent-45, score-0.174]
Hastie and Tibshirani [Hastie and Tibshirani, 1998] prove that the Kullback-Leibler distance between r_ij(x) and r̂_ij(x) decreases at each step. [sent-48, score-0.363]
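The following sketch implements this iterative coupling procedure, as reconstructed above, for a single example (the function name, the uniform initial guess, and the convergence tolerance are choices of this illustration, not specified by the paper):

```python
import numpy as np

def couple_pairwise(r, n, tol=1e-6, max_iter=1000):
    """Couple pairwise estimates r[i, j] ~ p_i / (p_i + p_j) into class probabilities.

    r : k x k array of pairwise estimates for one example, with r[j, i] = 1 - r[i, j].
    n : k x k array, n[i, j] = number of training examples of the (i, j) classifier.
    """
    k = r.shape[0]
    off = ~np.eye(k, dtype=bool)           # ignore the diagonal
    p = np.full(k, 1.0 / k)                # initial guess (uniform)
    for _ in range(max_iter):
        r_hat = p[:, None] / (p[:, None] + p[None, :] + 1e-300)   # implied pairwise estimates
        num = np.where(off, n * r, 0.0).sum(axis=1)
        den = np.where(off, n * r_hat, 0.0).sum(axis=1)
        p_new = p * num / np.maximum(den, 1e-300)
        p_new /= p_new.sum()               # renormalize
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p
```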
At convergence, the r̂_ij are consistent with the p̂_i. [sent-50, score-0.164]
The class predicted for each example x is ĉ(x) = argmax_i p̂_i(x). [sent-51, score-0.313]
Hastie and Tibshirani also prove that the p̂_i(x) are in the same order as the non-iterative estimates p̃_i(x) ∝ ∑_{j≠i} r_ij(x) for each x. [sent-52, score-0.708]
Thus, the p̃_i(x) are sufficient for predicting the most likely class for each example. [sent-53, score-0.273]
However, as shown by Hastie and Tibshirani, they are not accurate probability estimates because they tend to underestimate the differences between the p_i(x) values. [sent-54, score-0.423]
We would like to obtain a set of class membership probabilities p_i(x) for each example x compatible with the r_b(x) and subject to ∑_i p_i(x) = 1, where r_b(x) is the probability estimate produced by the binary classifier for column b (the probability that x belongs to one of the classes labeled +1 in column b, given that it belongs to a class used in column b). [sent-56, score-0.408]
In this case, the number of free parameters is k - 1 and the number of constraints is l, where l is the number of columns of the code matrix. [sent-57, score-0.853]
Since for most code matrices l is greater than k - 1, in general there is no exact solution to this problem. [sent-58, score-0.376]
Let n_b be the number of training examples used to train the binary classifier that corresponds to column b of the code matrix. [sent-60, score-0.592]
By analogy with the pairwise case, we start with an initial guess for the p̂_i(x), compute the corresponding implied column estimates r̂_b(x), and repeat until convergence:
   (a) For each i = 1, ..., k: p̂_i(x) ← p̂_i(x) · (∑_b n_b [I(M(i,b)=1) r_b(x) + I(M(i,b)=-1)(1 - r_b(x))]) / (∑_b n_b [I(M(i,b)=1) r̂_b(x) + I(M(i,b)=-1)(1 - r̂_b(x))])
   (b) Renormalize the p̂_i(x).
   (c) Recompute the r̂_b(x). [sent-64, score-0.24]
Let B+(i) be the set of matrix columns for which M(i, b) = 1 and B-(i) be the set of matrix columns for which M(i, b) = -1. [sent-67, score-0.248]
By analogy with the non-iterative estimates suggested by Hastie and Tibshirani, we can define non-iterative estimates p̃_i(x) ∝ ∑_{b ∈ B+(i)} r_b(x) + ∑_{b ∈ B-(i)} (1 - r_b(x)). [sent-68, score-1.278]
43 For the all-pairs code matrix, these estimates are the same as the ones suggested by Hastie and Tibshirani. [sent-69, score-0.49]
44 However, for arbitrary matrices, we cannot prove that the non-iterative estimates predict the same class as the iterative estimates. [sent-70, score-0.471]
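A sketch of the coupling procedure for an arbitrary code matrix, together with the non-iterative estimates, under the reading of the update rule reconstructed above; the implied column estimate r̂_b(x) is taken to be the total p̂ mass of the classes labeled +1 in column b divided by the mass of all classes used in column b, which is an assumption consistent with the pairwise case rather than a formula quoted from the paper.

```python
import numpy as np

def couple_code_matrix(r, n, M, tol=1e-6, max_iter=1000):
    """Couple the per-column estimates r[b] into class probabilities for one example.

    r : length-l array, r[b] ~ P(class has M[., b] = +1 | class is used in column b).
    n : length-l array of training-set sizes n_b for each binary classifier.
    M : k x l code matrix with entries in {-1, 0, +1}.
    """
    pos, neg = (M == 1), (M == -1)
    k = M.shape[0]
    p = np.full(k, 1.0 / k)                                # initial guess (uniform)
    for _ in range(max_iter):
        # implied column estimates under the current p (assumed form, see text)
        used_mass = ((pos | neg) * p[:, None]).sum(axis=0)
        r_hat = (pos * p[:, None]).sum(axis=0) / np.maximum(used_mass, 1e-300)
        num = (pos * (n * r) + neg * (n * (1 - r))).sum(axis=1)
        den = (pos * (n * r_hat) + neg * (n * (1 - r_hat))).sum(axis=1)
        p_new = p * num / np.maximum(den, 1e-300)
        p_new /= p_new.sum()                               # renormalize
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

def noniterative_estimates(r, M):
    """Non-iterative estimates: r_b summed over B+(i) plus (1 - r_b) over B-(i), normalized."""
    pos, neg = (M == 1), (M == -1)
    p = (pos * r).sum(axis=1) + (neg * (1 - r)).sum(axis=1)
    return p / p.sum()
```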
45 ¦ ¥ ¨ ¦ © £ ¥ § £ © 4 Loss-based decoding In this section, we discuss how to apply the loss-based decoding method to classifiers that output class membership probability estimates. [sent-71, score-0.726]
46 We also study the conditions under which this method predicts the same class as the Hastie-Tibshirani method, in the all-pairs case. [sent-72, score-0.223]
The loss-based decoding method [Allwein et al., 2000] requires that each binary classifier output a margin score satisfying two requirements. [sent-74, score-0.22]
Let f(x, b) be the margin score predicted by the classifier corresponding to column b of the code matrix for example x. [sent-78, score-0.51]
For each row c of the code matrix M and for each example x, we compute the distance between f and M(c) as d_L(x, c) = ∑_{b=1}^{l} L(M(c, b) · f(x, b))   (1), where L is a loss function that depends on the nature of the binary classifier and M(c, b) = 0, +1 or -1. [sent-79, score-0.587]
50 We then label each example x with the label c for which dL is minimized. [sent-80, score-0.088]
If the binary classification learning algorithm outputs scores that are probability estimates, they do not satisfy the first requirement because the probability estimates are all between 0 and 1. [sent-81, score-0.5]
However, we can transform the probability estimates r_b(x) output by each classifier b into margin scores by subtracting 1/2 from the scores, so that we consider as positives the examples x for which r_b(x) is above 1/2, and as negatives the examples x for which r_b(x) is below 1/2. [sent-82, score-1.629]
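A sketch of loss-based decoding applied to these transformed scores; the linear loss shown as the default corresponds to the L(y) = -ay family discussed next, and the exponential loss is mentioned in the comment only as a plausible alternative, not as the paper's choice.

```python
import numpy as np

def loss_based_decode(r, M, loss=lambda y: -y):
    """Assign the class c minimizing sum_b L(M[c, b] * f_b), with margins f_b = r_b - 1/2.

    r : length-l vector of probability estimates from the binary classifiers.
    M : k x l code matrix with entries in {-1, 0, +1}.
    loss : margin loss applied elementwise; lambda y: -y is a linear loss of the
           L(y) = -a*y family, and np.exp(-y) would be the exponential loss.
    """
    f = r - 0.5                         # probabilities -> margin scores
    d = loss(M * f).sum(axis=1)         # distance d_L(x, c) of Equation (1), one entry per row c
    return int(np.argmin(d))
```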
We now prove a theorem that relates the loss-based decoding method to the Hastie-Tibshirani method, for a particular class of loss functions. [sent-83, score-0.413]
Theorem 1. The loss-based decoding method for all-pairs code matrices predicts the same class label as the iterative estimates p̂_i(x) given by Hastie and Tibshirani, if the loss function is of the form L(y) = -ay, for any a > 0. [sent-84, score-1.284]
Proof: We first show that, if the loss function is of the form L(y) = -ay, the loss-based decoding method predicts the same class label as the non-iterative estimates p̃_i(x), for the all-pairs code matrix. [sent-85, score-0.952]
Dataset     #Training Examples   #Test Examples   #Attributes   #Classes
satimage    4435                 2000             36            7
pendigits   7494                 3498             16            10
soybean     307                  376              35            9
Table 1: Characteristics of the datasets used in the experiments. [sent-86, score-0.379]
So, eliminating the terms for which M(c, b) = 0, the distance is d(x, c) = -a ∑_{b: M(c,b) ≠ 0} M(c, b) (r_b(x) - 1/2); it is now easy to see that the class c(x) which minimizes d(x, c) for example x also maximizes p̃_c(x). [sent-88, score-0.625]
For loss functions not of this form (such as the exponential loss), the class predicted by loss-based decoding may differ from the one predicted by the method of Hastie and Tibshirani. [sent-93, score-0.455]
This theorem applies only to the all-pairs code matrix. [sent-94, score-0.289]
For other matrices such that |B+(c) ∪ B-(c)| is a linear function of |B+(c)| (such as the one-against-all matrix), we can prove that loss-based decoding (with L(y) = -ay) predicts the same class as the non-iterative estimates. [sent-95, score-0.435]
61 However, in this case, the non-iterative estimates do not necessarily predict the same class as the iterative ones. [sent-96, score-0.391]
5 Experiments. We performed experiments using the following multiclass datasets from the UCI Machine Learning Repository [Blake and Merz, 1998]: satimage, pendigits and soybean. [sent-97, score-0.436]
63 The binary learning algorithm used in the experiments is boosted naive Bayes [Elkan, 1997], since this is a method that cannot be easily extended to handle multiclass problems directly. [sent-99, score-0.602]
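To make the experimental pipeline concrete, a purely illustrative sketch of how the binary learner, the code matrix, and the coupling step fit together for one test example; `train_binary` stands in for whatever probability-producing binary learner is used (boosted naive Bayes in these experiments), and `couple_code_matrix` refers to the earlier sketch.

```python
import numpy as np

def predict_class_probabilities(x, X_train, y_train, M, train_binary):
    """Train one probabilistic binary classifier per column of M and couple its outputs.

    y_train is assumed to hold class indices 0..k-1 matching the rows of M, and
    train_binary(X, labels) is assumed to return a function p(x) estimating the
    probability that x belongs to the positive super-class of that column.
    """
    k, l = M.shape
    r, n = np.empty(l), np.empty(l)
    for b in range(l):
        in_pos = np.isin(y_train, np.flatnonzero(M[:, b] == 1))
        in_neg = np.isin(y_train, np.flatnonzero(M[:, b] == -1))
        used = in_pos | in_neg
        n[b] = used.sum()                                  # n_b for column b
        p_b = train_binary(X_train[used], in_pos[used].astype(int))
        r[b] = p_b(x)                                      # r_b(x)
    return couple_code_matrix(r, n, M)                     # coupling sketch from Section 3
```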
64 0651 ¤ £ ¢ ¤ £ ¢ ¤ £ ¢ ¤ £ ¢ ¤ £ ¢ ¤ £ ¢ ¥ ¥ ¥ Table 2: Test set results on the satimage dataset. [sent-122, score-0.133]
65 We use three different code matrices for each dataset: all-pairs, one-against-all and a sparse random matrix. [sent-123, score-0.471]
66 The sparse random matrices have 15 log2 k columns, and each element is 0 with probability 1/2 and -1 or +1 with probability 1/4 each. [sent-124, score-0.298]
67 This is the same type of sparse random matrix used by Allwein et al. [sent-125, score-0.217]
68 In order to have good error correcting properties, the Hamming distance ρ between each pair of rows in the matrix must be large. [sent-128, score-0.149]
We select the matrix by generating 10,000 random matrices and selecting the one for which ρ is maximized, checking that each column has at least one +1 and one -1, and that the matrix does not have two identical columns. [sent-129, score-0.274]
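A sketch of this selection procedure; the number of candidates and the entry probabilities follow the text, while the ceiling on the column count and reading ρ as the minimum over row pairs are assumptions of this illustration.

```python
import numpy as np

def sparse_random_code_matrix(k, n_candidates=10000, rng=None):
    """Generate candidate sparse matrices and keep the one whose row-pair Hamming
    distance rho (taken here as the minimum over pairs of rows) is largest."""
    rng = np.random.default_rng(rng)
    l = int(np.ceil(15 * np.log2(k)))      # "15 log2 k" columns; the ceiling is an assumption
    best, best_rho = None, -1
    for _ in range(n_candidates):
        M = rng.choice([0, 1, -1], size=(k, l), p=[0.5, 0.25, 0.25])
        # every column must contain at least one +1 and at least one -1
        if not (np.any(M == 1, axis=0).all() and np.any(M == -1, axis=0).all()):
            continue
        # no two identical columns
        if len({tuple(col) for col in M.T}) < l:
            continue
        rho = min(np.sum(M[i] != M[j]) for i in range(k) for j in range(i + 1, k))
        if rho > best_rho:
            best, best_rho = M, rho
    return best
```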
70 The first metric is the error rate obtained when we assign each example to the most likely class predicted by the method. [sent-131, score-0.235]
71 This metric is sufficient if we are only interested in classifying the examples correctly and do not need accurate probability estimates of class membership. [sent-132, score-0.492]
The second metric is squared error, defined for one example x as SE(x) = ∑_j (t_j(x) - p_j(x))^2, where p_j(x) is the probability estimated by the method for example x and class j, and t_j(x) is the true probability of class j for x. [sent-133, score-0.495]
Since for most real-world datasets true labels are known, but not probabilities, t_j(x) is defined to be 1 if the label of x is j and 0 otherwise. [sent-134, score-0.09]
We average the squared error over all examples x to obtain the mean squared error (MSE). [sent-135, score-0.106]
The mean squared error is an adequate metric for assessing the accuracy of probability estimates [Zadrozny and Elkan, 2001b]. [sent-136, score-0.312]
76 This metric cannot be applied to the loss-based decoding method, since it does not produce probability estimates. [sent-137, score-0.251]
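For reference, a small sketch of the two evaluation metrics as defined above (hypothetical helper; labels are assumed to be integers 0..k-1 and t_j(x) is one-hot as described):

```python
import numpy as np

def error_rate_and_mse(P, y, k):
    """P : n x k matrix of estimated class membership probabilities; y : labels in 0..k-1."""
    error_rate = np.mean(np.argmax(P, axis=1) != y)
    T = np.eye(k)[y]                               # t_j(x): 1 if the label of x is j, else 0
    mse = np.mean(np.sum((T - P) ** 2, axis=1))    # mean over examples of SE(x)
    return error_rate, mse
```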
77 Table 2 shows the results of the experiments on the satimage dataset for each type of code matrix. [sent-138, score-0.453]
78 As a baseline for comparison, we also show the results of applying multiclass Naive Bayes to this dataset. [sent-139, score-0.279]
79 We can see that the iterative Hastie-Tibshirani procedure (and its extension to arbitrary code matrices) succeeds in lowering the MSE significantly compared to the non-iterative estimates, which indicates that it produces probability estimates that are more accurate. [sent-140, score-0.671]
80 For one-against-all matrices, the iterative method performs consistently worse, while for sparse random matrices, it performs consistently better. [sent-142, score-0.247]
81 Figure 1 shows how the MSE is lowered at each iteration of the Hastie-Tibshirani algorithm, for the three types of code matrices. [sent-143, score-0.317]
82 Table 3 shows the results of the same experiments on the datasets pendigits and soybean. [sent-144, score-0.157]
83 Again, the MSE is significantly lowered by the iterative procedure, in all cases. [sent-145, score-0.109]
84 For the soybean dataset, using the sparse random matrix, the iterative method again has a lower error rate than the other methods, which is even lower than the error rate using the all-pairs matrix. [sent-146, score-0.438]
85 This is an interesting result, since in this case the all-pairs matrix has 171 columns (corresponding to 171 classifiers), while the sparse matrix has only 64 columns. [sent-147, score-0.299]
86 03 0 5 10 15 20 25 30 35 Iteration Figure 1: Convergence of the MSE for the satimage dataset. [sent-158, score-0.133]
Table 3: Test set results (error rate and MSE for multiclass Naive Bayes and for the all-pairs, one-against-all and sparse code matrices) on the pendigits and soybean datasets. [sent-207, score-0.2]
6 Conclusions. We have presented a method for producing class membership probability estimates for multiclass problems, given probability estimates for a series of binary problems determined by an arbitrary code matrix. [sent-208, score-1.552]
90 Since research in designing optimal code matrices is still on-going [Utschick and Weichselberger, 2001] [Crammer and Singer, 2000], it is important to be able to obtain class membership probability estimates from arbitrary code matrices. [sent-209, score-1.21]
91 In current research, the effectiveness of a code matrix is determined primarily by the classification accuracy. [sent-210, score-0.369]
However, since many applications require accurate class membership probability estimates for each of the classes, it is important to also compare the different types of code matrices according to their ability to produce such estimates. [sent-211, score-0.879]
93 Our method relies on the probability estimates given by the binary classifiers to produce the multiclass probability estimates. [sent-213, score-0.776]
94 However, the probability estimates produced by Boosted Naive Bayes are not calibrated probability estimates. [sent-214, score-0.375]
An interesting direction for future work is to determine whether calibrating the probability estimates given by the binary classifiers improves the calibration of the multiclass probabilities. [sent-215, score-0.725]
E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000. [sent-223, score-0.307]
R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs, I: The method of paired comparisons. Biometrika, 39:324-345, 1952. [sent-240, score-0.097]
K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000. [sent-245, score-0.346]
W. Utschick and W. Weichselberger. Stochastic organization of output codes in multiclass learning problems. Neural Computation, 2001. [sent-266, score-0.346]
B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001. [sent-277, score-0.411]
wordName wordTfidf (topN-words)
[('rb', 0.356), ('code', 0.289), ('multiclass', 0.279), ('allwein', 0.267), ('classi', 0.207), ('tibshirani', 0.201), ('estimates', 0.201), ('hastie', 0.2), ('elkan', 0.2), ('zadrozny', 0.178), ('pi', 0.164), ('decoding', 0.158), ('ri', 0.141), ('mse', 0.139), ('membership', 0.135), ('satimage', 0.133), ('pendigits', 0.111), ('binary', 0.109), ('class', 0.109), ('sparse', 0.095), ('naive', 0.094), ('bakiri', 0.089), ('soybean', 0.089), ('matrices', 0.087), ('pc', 0.084), ('iterative', 0.081), ('matrix', 0.08), ('er', 0.079), ('nb', 0.077), ('ay', 0.075), ('hamming', 0.073), ('method', 0.071), ('merz', 0.067), ('terry', 0.067), ('utschick', 0.067), ('weichselberger', 0.067), ('coupling', 0.065), ('cation', 0.064), ('examples', 0.063), ('probability', 0.058), ('calibrated', 0.058), ('crammer', 0.058), ('pairwise', 0.058), ('belonging', 0.056), ('bradley', 0.053), ('boosted', 0.049), ('singer', 0.047), ('blake', 0.046), ('ers', 0.046), ('score', 0.046), ('datasets', 0.046), ('scores', 0.045), ('ecoc', 0.044), ('columns', 0.044), ('classes', 0.044), ('label', 0.044), ('ib', 0.043), ('distance', 0.043), ('predicts', 0.043), ('arbitrary', 0.042), ('et', 0.042), ('obtaining', 0.041), ('predicted', 0.04), ('calibration', 0.039), ('renormalize', 0.039), ('bayes', 0.038), ('prove', 0.038), ('loss', 0.037), ('output', 0.037), ('negatives', 0.035), ('recompute', 0.035), ('dietterich', 0.035), ('metric', 0.035), ('reducing', 0.034), ('ni', 0.033), ('xc', 0.033), ('dataset', 0.031), ('dl', 0.031), ('couple', 0.031), ('positives', 0.031), ('codes', 0.03), ('uci', 0.029), ('repository', 0.029), ('outputs', 0.029), ('row', 0.029), ('table', 0.028), ('margin', 0.028), ('pj', 0.028), ('lowered', 0.028), ('squared', 0.027), ('train', 0.027), ('column', 0.027), ('classifying', 0.026), ('paired', 0.026), ('guess', 0.026), ('error', 0.026), ('rate', 0.025), ('california', 0.025), ('assigned', 0.024), ('costs', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
Author: B. Zadrozny
Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.
2 0.13623668 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
3 0.12300797 97 nips-2001-Information-Geometrical Significance of Sparsity in Gallager Codes
Author: Toshiyuki Tanaka, Shiro Ikeda, Shun-ichi Amari
Abstract: We report a result of perturbation analysis on decoding error of the belief propagation decoder for Gallager codes. The analysis is based on information geometry, and it shows that the principal term of decoding error at equilibrium comes from the m-embedding curvature of the log-linear submanifold spanned by the estimated pseudoposteriors, one for the full marginal, and K for partial posteriors, each of which takes a single check into account, where K is the number of checks in the Gallager code. It is then shown that the principal error term vanishes when the parity-check matrix of the code is so sparse that there are no two columns with overlap greater than 1. 1
4 0.12024499 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
5 0.11363334 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).
6 0.11313887 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
7 0.10545636 144 nips-2001-Partially labeled classification with Markov random walks
8 0.10322934 8 nips-2001-A General Greedy Approximation Algorithm with Applications
9 0.097780116 22 nips-2001-A kernel method for multi-labelled classification
10 0.097050168 60 nips-2001-Discriminative Direction for Kernel Classifiers
11 0.096618935 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder
12 0.093541346 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
13 0.093518905 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
14 0.090235583 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
15 0.082997099 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
16 0.082953528 139 nips-2001-Online Learning with Kernels
17 0.080692574 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
18 0.072798401 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
19 0.071028151 43 nips-2001-Bayesian time series classification
20 0.068478674 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
topicId topicWeight
[(0, -0.198), (1, 0.099), (2, -0.046), (3, 0.168), (4, -0.008), (5, -0.061), (6, -0.004), (7, -0.106), (8, 0.129), (9, 0.042), (10, -0.006), (11, -0.122), (12, -0.139), (13, 0.037), (14, -0.043), (15, -0.069), (16, 0.144), (17, -0.029), (18, 0.088), (19, -0.037), (20, -0.04), (21, 0.08), (22, 0.085), (23, 0.015), (24, 0.056), (25, -0.022), (26, 0.005), (27, 0.045), (28, 0.05), (29, 0.026), (30, 0.069), (31, 0.006), (32, -0.105), (33, -0.021), (34, 0.059), (35, 0.069), (36, -0.076), (37, 0.004), (38, 0.031), (39, 0.113), (40, -0.032), (41, 0.08), (42, -0.024), (43, -0.068), (44, -0.107), (45, 0.046), (46, -0.083), (47, 0.086), (48, -0.0), (49, 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.96154577 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
Author: B. Zadrozny
Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.
2 0.62252003 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz
Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider. When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/ or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which is interpreted as a “reject”—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 0 0 correct rejection rate 20 40 60 80 100 100 (b) correct rejection rate 20 40 60 80 (a) 0 20 40 60 80 100 correct acceptance rate 0 20 40 60 80 100 correct acceptance rate FIGURE 1. (a) two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) two plausible ROC curves, neither of which is clearly superior to the other. and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. 
However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. 
The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. 
Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA αCA and CR αCR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve. 2 FOUR ALGORITHMS TO PROD THE ROC CURVE In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of αCA and a target CR rate of αCR. 2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES Suppose we train a classifier on a set of positive and negative examples from a class— churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θCA and θCR, which yield CA and CR rates of αCA and αCR, respectively. If the classifier’s discrimination performance fails to achieve the target CA and CR rates, then θCA will be lower than θCR, as depicted in the Figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θCA and θCR; the positive examples must be pushed above θCR and the negative examples must be pushed below θCA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the Figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier’s training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of emphasis, or λ in + 1 = κ e λ in otherwise, where κe is a constant, κe > 1. 2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θCA—such as e2—are clearly the most dif ficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θCR does not depend on examples such as e2, and threshold θCA allows a fraction (1–αCA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θCR—such as e10 and e11—need not be of concern. Essentially , the second algorithm, which we term the eemd phasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct clasθCA e1 e2 e3 0 e4 θCR e5 e6 e7 e8 churn probability e9 e10 e11 e12 e13 1 FIGURE 2. 
A schematic depiction of all training examples arranged by the classifier’s posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). sification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of deemphasis, or λ in + 1 = κ d λ in otherwise, where κd is a constant, κd <1. 2.3 CONSTRAINED OPTIMIZATION The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to αCA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to αCR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions: 1 1 CA ( w, t ) = ----- ∑ σ β ( f ( x i, w ) – t ) CR ( w, t ) = ------ ∑ σ β ( t – f ( x i, w ) ) , P i∈P N i∈N where P and N are the set of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σβ –1 is a sigmoid function with scaling parameter β: σ β ( y ) = ( 1 + exp ( – βy ) ) . The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function 2 µ A ( w, t ) = CR ( w, t ) + ν CA ( w, t ) – α CA + -- CA ( w, t ) – α CA 2 with a fixed Lagrangian multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ CA ( w *, t * ) – α CA , where w * and t * are the values found by the optimization step. We initialize ν = 1 and fix µ = 1 and β = 10 and iterate until ν converges. 2.4 GENETIC ALGORITHM The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θCA and θCR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task. 3 METHODOLOGY For our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). 
We chose these two large, real world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α CR = 0.90 and α CA = 0.50 , values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a no hidden unit neural network) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κe = 1.3 and κd = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set. 4 RESULTS Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right show the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (αCA, αCR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information. 5 DISCUSSION In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the impro vement results in significant dollar savings. 
Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real world data collection (Mozer et 0.845 0.84 0.39 0.835 0.385 CR rate CA rate 0.4 0.395 0.38 0.83 0.825 0.375 0.82 0.37 0.815 0.365 0.81 0 5 10 15 20 25 30 35 40 45 50 Iteration 0 5 10 15 20 25 30 35 40 45 50 Iteration FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/–1 standard error of the mean. Data set 1 0.835 0.380 0.830 0.375 0.825 CR rate ISP Test Set 0.840 0.385 CA rate 0.390 0.370 0.820 0.365 0.815 0.360 0.810 0.355 0.805 0.350 0.800 std emph deemph constr GA std emph deemph constr GA std emph deemph constr GA 0.900 0.375 0.350 CR rate Data set 2 0.875 CA rate Wireless Test Set 0.850 0.325 0.825 0.300 0.800 std emph deemph constr GA FIGURE 4. Cross-validation performance on the two data sets for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for CR rate .9; the right column shows the CR rate for CA rate .5. The error bar indicates one standard error of the mean over the 10 data splits. al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrate the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches. 6 ACKNOWLEDGEMENTS No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549–01A1. 7 REFERENCES Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic. Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (1019–1026). San Mateo, CA: Morgan Kaufmann. Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696. Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. 
Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.
3 0.58468175 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).
4 0.55516988 60 nips-2001-Discriminative Direction for Kernel Classifiers
Author: Polina Golland
Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.
5 0.5520519 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
6 0.52465034 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
7 0.51648575 97 nips-2001-Information-Geometrical Significance of Sparsity in Gallager Codes
8 0.51422906 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
9 0.49709764 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
10 0.48769948 46 nips-2001-Categorization by Learning and Combining Object Parts
11 0.46857464 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
12 0.46168444 98 nips-2001-Information Geometrical Framework for Analyzing Belief Propagation Decoder
13 0.43906206 22 nips-2001-A kernel method for multi-labelled classification
14 0.40875232 144 nips-2001-Partially labeled classification with Markov random walks
15 0.40327916 43 nips-2001-Bayesian time series classification
16 0.38215709 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
17 0.37780577 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
18 0.35390428 8 nips-2001-A General Greedy Approximation Algorithm with Applications
19 0.3528516 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
20 0.35225081 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
topicId topicWeight
[(14, 0.023), (17, 0.012), (19, 0.011), (27, 0.127), (30, 0.56), (72, 0.051), (79, 0.02), (91, 0.092)]
simIndex simValue paperId paperTitle
Author: Takashi Morie, Tomohiro Matsuura, Makoto Nagata, Atsushi Iwata
Abstract: This paper describes a clustering algorithm for vector quantizers using a “stochastic association model”. It offers a new simple and powerful softmax adaptation rule. The adaptation process is the same as the on-line K-means clustering method except for adding random fluctuation in the distortion error evaluation process. Simulation results demonstrate that the new algorithm can achieve efficient adaptation as high as the “neural gas” algorithm, which is reported as one of the most efficient clustering methods. It is a key to add uncorrelated random fluctuation in the similarity evaluation process for each reference vector. For hardware implementation of this process, we propose a nanostructure, whose operation is described by a single-electron circuit. It positively uses fluctuation in quantum mechanical tunneling processes.
2 0.97459662 173 nips-2001-Speech Recognition with Missing Data using Recurrent Neural Nets
Author: S. Parveen, P. Green
Abstract: In the ‘missing data’ approach to improving the robustness of automatic speech recognition to added noise, an initial process identifies spectraltemporal regions which are dominated by the speech source. The remaining regions are considered to be ‘missing’. In this paper we develop a connectionist approach to the problem of adapting speech recognition to the missing data case, using Recurrent Neural Networks. In contrast to methods based on Hidden Markov Models, RNNs allow us to make use of long-term time constraints and to make the problems of classification with incomplete data and imputing missing values interact. We report encouraging results on an isolated digit recognition task.
3 0.97264218 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections
Author: Xiaohui Xie, Martin A. Giese
Abstract: Asymmetric lateral connections are one possible mechanism that can account for the direction selectivity of cortical neurons. We present a mathematical analysis for a class of these models. Contrasting with earlier theoretical work that has relied on methods from linear systems theory, we study the network’s nonlinear dynamic properties that arise when the threshold nonlinearity of the neurons is taken into account. We show that such networks have stimulus-locked traveling pulse solutions that are appropriate for modeling the responses of direction selective cortical neurons. In addition, our analysis shows that outside a certain regime of stimulus speeds the stability of this solutions breaks down giving rise to another class of solutions that are characterized by specific spatiotemporal periodicity. This predicts that if direction selectivity in the cortex is mainly achieved by asymmetric lateral connections lurching activity waves might be observable in ensembles of direction selective cortical neurons within appropriate regimes of the stimulus speed.
same-paper 4 0.96442413 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
Author: B. Zadrozny
Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension for arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while having similar classification accuracy as loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.
5 0.93067455 163 nips-2001-Risk Sensitive Particle Filters
Author: Sebastian Thrun, John Langford, Vandi Verma
Abstract: We propose a new particle filter that incorporates a model of costs when generating particles. The approach is motivated by the observation that the costs of accidentally not tracking hypotheses might be significant in some areas of state space, and next to irrelevant in others. By incorporating a cost model into particle filtering, states that are more critical to the system performance are more likely to be tracked. Automatic calculation of the cost model is implemented using an MDP value function calculation that estimates the value of tracking a particular state. Experiments in two mobile robot domains illustrate the appropriateness of the approach.
6 0.92317522 151 nips-2001-Probabilistic principles in unsupervised learning of visual structure: human data and a model
7 0.80192995 149 nips-2001-Probabilistic Abstraction Hierarchies
8 0.79198551 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
9 0.78265184 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity
10 0.74580294 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
11 0.70856392 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
12 0.67460132 46 nips-2001-Categorization by Learning and Combining Object Parts
13 0.67351627 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
14 0.66795081 60 nips-2001-Discriminative Direction for Kernel Classifiers
15 0.66783917 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
16 0.65322369 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
17 0.64350557 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines
18 0.64217478 34 nips-2001-Analog Soft-Pattern-Matching Classifier using Floating-Gate MOS Technology
19 0.63737065 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field
20 0.63227665 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition