nips nips2001 nips2001-60 knowledge-graph by maker-knowledge-mining

60 nips-2001-Discriminative Direction for Kernel Classifiers


Source: pdf

Author: Polina Golland

Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. [sent-3, score-0.174]

2 In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. [sent-4, score-0.473]

3 However, such analysis is crucial if we want to understand and visualize the detected differences. [sent-5, score-0.099]

4 We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. [sent-6, score-0.318]

5 For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. [sent-7, score-1.027]

6 We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work. [sent-8, score-0.774]

7 Statistical learning algorithms are also used in scientific studies to detect and analyze differences between the two classes when the “correct answer” is unknown, and the information we have on the differences is represented implicitly by the training set. [sent-10, score-0.283]

8 Example applications include morphological analysis of anatomical organs (comparing organ shape in patients vs. normal controls). [sent-11, score-0.373]

9 In such applications, interpretation of the resulting classifier in terms of the original feature vectors can provide an insight into the nature of the differences detected by the learning algorithm and is therefore a crucial step in the analysis. [sent-13, score-0.33]

10 Furthermore, we would argue that studying the spatial structure of the data captured by the classification function is important in any application, as it leads to a better understanding of the data and can potentially help in improving the technique. [sent-14, score-0.091]

11 This paper addresses the problem of translating a classifier into a different representation that allows us to visualize and study the differences between the classes. [sent-15, score-0.176]

12 We introduce and derive a so-called discriminative direction at every point in the original feature space with respect to a given classifier. [sent-16, score-0.743]

13 Informally speaking, the discriminative direction tells us how to change any input example to make it look more like an example from another class without introducing any irrelevant changes that possibly make it more similar to other examples from the same class. [sent-17, score-0.879]

14 It allows us to characterize differences captured by the classifier and to express them as changes in the original input examples. [sent-18, score-0.357]

15 We start with a brief background section on kernel-based classification, stating without proof the main facts on kernel-based SVMs necessary for the derivation of the discriminative direction. [sent-20, score-0.356]

16 In Section 3, we provide a formal definition of the discriminative direction and explain how it can be estimated from the classification function. [sent-22, score-0.599]

17 Section 4 demonstrates the discriminative direction for different kernels, followed by an example from the problem of statistical analysis of shape differences that originally motivated this work. [sent-24, score-0.849]

18 The normal to the resulting separating hyperplane is a linear combination of the training data: w = Σk αk yk ΦK(xk), (1) where the coefficients αk are computed by solving a constrained quadratic optimization problem. [sent-27, score-0.29]

19 The resulting classifier fK(x) = ΦK(x) · w + b = Σk αk yk ΦK(x) · ΦK(xk) + b = Σk αk yk K(x, xk) + b (2) defines a nonlinear separating boundary in the original feature space. [sent-28, score-0.593]
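To make (2) concrete, here is a minimal sketch (not from the paper) of evaluating the kernel classifier from trained SVM quantities; the names `alphas`, `ys`, `support_x`, `b`, and the choice of the RBF kernel are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # Gaussian RBF kernel K(u, v) = exp(-||u - v||^2 / gamma)
    return np.exp(-np.sum((np.asarray(u) - np.asarray(v)) ** 2) / gamma)

def decision_function(x, support_x, alphas, ys, b, kernel=rbf_kernel):
    # Equation (2): f_K(x) = sum_k alpha_k y_k K(x, x_k) + b
    return sum(a * y * kernel(x, xk) for a, y, xk in zip(alphas, ys, support_x)) + b
```

The sign of the returned value gives the predicted label, and its magnitude is proportional to the signed distance to the separating boundary in F, which is the property exploited in the next paragraph.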

20 3 Discriminative Direction Equations (1) and (2) imply that the classification function fK (x) is directly proportional to the signed distance from the input point to the separating boundary computed in the higher dimensional space defined by the mapping ΦK . [sent-29, score-0.385]

21 In other words, the function output depends only on the projection of vector ΦK (x) onto w and completely ignores the component of ΦK (x) that is perpendicular to w. [sent-30, score-0.175]

22 This suggests that in order to create a displacement of ΦK (x) that corresponds to the differences between the two classes, one should change the vector’s projection onto w while keeping its perpendicular component the same. [sent-31, score-0.403]

23 This is similar to visualization techniques typically used in linear generative modeling, where the data variation is captured using PCA, and new samples are generated by changing a single principal component at a time. [sent-33, score-0.187]

24 However, this approach is infeasible in the non-linear case, because we do not have access to the image vectors ΦK (x). [sent-34, score-0.104]

25 Furthermore, the resulting image vector might not even have a source in the original feature space, i.e., [sent-35, score-0.184]

26 there might be no vector in the original space Rn that maps into the resulting vector in the space F. [sent-37, score-0.252]

27 Our solution is to search for the direction around the feature vector x in the original space (Figure 1: Kernel-based classification (a) and the discriminative direction (b)). [sent-38, score-1.508]

28 Specifically, we seek the direction that minimizes the divergence of the image ΦK(x) from the direction of the projection vector w¹. [sent-39, score-0.591]

29 We call it a discriminative direction, as it represents the direction that affects the output of the classifier while introducing as little irrelevant change as possible into the input vector. [sent-40, score-0.784]

30 Formally, as we move from x to x + dx in Rn, the image vector in the space F changes by dz = ΦK(x + dx) − ΦK(x) (Fig. 1b). [sent-41, score-0.861]

31 This displacement can be thought of as a vector sum of its projection onto w and its deviation from w: p = ((dz · w)/(w · w)) w and e = dz − p = dz − ((dz · w)/(w · w)) w. (3) [sent-43, score-1.597]

32 The discriminative direction minimizes the divergence component e, leading to the following optimization problem: minimize over dx E(dx) = ‖e‖², s.t. ‖dx‖² = ε (a fixed infinitesimal step). [sent-44, score-0.953]
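As a purely numerical illustration (not from the paper) of the decomposition in (3), the sketch below splits a displacement dz into its projection p onto w and the residual e; in the kernel setting w and dz live in the feature space F and are never formed explicitly, so this only serves to make the geometry concrete.

```python
import numpy as np

def decompose(dz, w):
    # Equation (3): p is the component of dz along w, e is what the classifier ignores.
    p = ((dz @ w) / (w @ w)) * w
    e = dz - p
    return p, e

dz = np.array([1.0, 2.0, 0.5])
w = np.array([0.0, 1.0, 1.0])
p, e = decompose(dz, w)
print(np.isclose(e @ w, 0.0))  # True: the divergence component is orthogonal to w
```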

33 ¹ A similar complication arises in kernel-based generative modeling, e.g. [sent-51, score-0.314]

34 Constructing linear combinations of vectors in the space F leads to a global search in the original space [6, 7]. [sent-54, score-0.249]

35 Since we are interested in the direction that best approximates w, we use infinitesimal analysis that results in a different optimization problem. [sent-55, score-0.243]

36 The solution to this problem is the smallest eigenvector of the matrix QK(x) = HK(x) − ‖w‖⁻² ∇fK(x) ∇fK(x)ᵀ. (12) [sent-56, score-0.174]

37 Note that in general, the matrix QK(x) and its smallest eigenvector are not the same for different points in the original space and must be estimated separately for every input vector x. [sent-57, score-0.356]
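A minimal sketch (ours, not the paper's code) of the eigenvector computation in (12); it assumes HK(x), the gradient ∇fK(x), and ‖w‖² (which in the kernel setting equals Σjk αj αk yj yk K(xj, xk)) have already been evaluated, and it orients the result so that the example moves towards the opposite class, as described in the text.

```python
import numpy as np

def discriminative_direction(H, grad_f, w_norm_sq, label):
    # Equation (12): Q_K(x) = H_K(x) - ||w||^{-2} grad f_K(x) grad f_K(x)^T
    Q = H - np.outer(grad_f, grad_f) / w_norm_sq
    eigvals, eigvecs = np.linalg.eigh(Q)        # Q is symmetric
    dx = eigvecs[:, np.argmin(eigvals)]         # eigenvector of the smallest eigenvalue
    # Increase f_K for examples with label -1, decrease it for examples with label +1.
    if label * (grad_f @ dx) > 0:
        dx = -dx
    return dx
```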

38 Furthermore, each solution defines two opposite directions in the input space, corresponding to the positive and the negative projections onto w. [sent-58, score-0.164]

39 We want to move the input example towards the opposite class and therefore assign the direction of increasing function values to the examples with label −1 and the direction of decreasing function values to the examples with label 1. [sent-59, score-0.682]

40 Obtaining a closed-form solution of this minimization problem could be desirable, or even necessary, if the dimensionality of the input space is high and computing the smallest eigenvector is computationally expensive and numerically challenging. [sent-60, score-0.248]

41 In the next section, we demonstrate how a particular form of the matrix HK(x) leads to an analytical solution for a large family of kernel functions.² [sent-61, score-0.2]

42 We will show in this section that both for the linear kernel and, more surprisingly, for RBF kernels, the matrix HK (x) is of the right form to yield an analytical solution of this form. [sent-64, score-0.224]

43 In the case of the linear and the RBF kernels, the gradient also corresponds to the direction that distinguishes between the two classes while ignoring intra-class variability. [sent-66, score-0.395]

44 Thus the linear kernel is the only dot product kernel for which this simplification is relevant. [sent-71, score-0.288]

45 In the linear case, HK(x) = I, and the discriminative direction is defined as dx* = ∇fK(x) = w = Σk αk yk xk; E(dx*) = 0. (14) [sent-72, score-0.816]

46 This is not entirely surprising, as the classifier is a linear function in the original space and we can move precisely along w. [sent-73, score-0.174]
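In code this reduces to a one-liner; the sketch below is an illustration under assumed names (with `support_x` holding the support vectors as rows) of recovering w per (14).

```python
import numpy as np

def linear_discriminative_direction(support_x, alphas, ys):
    # Equation (14): for the linear kernel dx* = grad f_K(x) = w = sum_k alpha_k y_k x_k,
    # independent of x, and the divergence error E(dx*) is exactly zero.
    return (np.asarray(alphas) * np.asarray(ys)) @ np.asarray(support_x)
```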

47 Polynomial kernels are a special case of dot product kernels. [sent-74, score-0.103]

48 For polynomial kernels, HK(x), given by (15) evaluated at u = x, v = x, is not necessarily diagonal for all x, and we have to solve the general eigenvector problem to identify the discriminative direction. [sent-76, score-0.439]

49 ² While a very specialized structure of HK(x) in the next section is sufficient for simplifying the solution significantly, it is by no means necessary, and other kernel families might exist for which estimating the discriminative direction does not require solving the full eigenvector problem. [sent-77, score-0.822]

50 For a distance kernel, ∂²K(u,v)/∂ui∂vj |(u=x, v=x) = −2k′(0) δij, (16) and therefore the discriminative direction can be determined analytically: dx* = ∇fK(x); E(dx*) = −2k′(0) − ‖w‖⁻² ‖∇fK(x)‖². (17) [sent-79, score-0.624]

51 The Gaussian kernels are a special case of the distance kernel family and yield a closed-form solution for the discriminative direction: dx* = −(2/γ) Σk αk yk e^(−‖x−xk‖²/γ) (x − xk); E(dx*) = 2/γ − ‖∇fK(x)‖²/‖w‖². (18) [sent-80, score-0.678]

52 Unlike the linear case, we cannot achieve zero error, and the discriminative direction is only an approximation. [sent-81, score-0.623]
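A hedged sketch of the closed-form RBF case in (18); the parameter names mirror the earlier sketches (`support_x` as rows, `gamma` the kernel width) and are assumptions, not the paper's notation.

```python
import numpy as np

def rbf_discriminative_direction(x, support_x, alphas, ys, gamma):
    # Equation (18): dx* = -(2/gamma) sum_k alpha_k y_k exp(-||x - x_k||^2 / gamma) (x - x_k),
    # which is exactly the gradient of f_K at x for the Gaussian kernel.
    diffs = np.asarray(x) - np.asarray(support_x)                 # shape (n_sv, d)
    weights = np.asarray(alphas) * np.asarray(ys) * np.exp(-np.sum(diffs ** 2, axis=1) / gamma)
    return -(2.0 / gamma) * (weights @ diffs)
```

The sign convention from Section 3 still applies: the direction is flipped for examples with label +1 so that every example moves towards the opposite class.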

53 The exact solution is unattainable in this case, as it has no corresponding direction in the original space. [sent-82, score-0.346]

54 Geometric Interpretation. We start by noting that the image vectors ΦK(x) do not populate the entire space F, but rather form a manifold of lower dimensionality whose geometry is fully defined by the kernel function K (Fig. 1a). [sent-84, score-0.311]

55 We will refer to this manifold as the target manifold in this discussion. [sent-86, score-0.108]

56 We cannot explicitly manipulate elements of the space F, but can only explore the target manifold through search in the original space. [sent-87, score-0.17]

57 We perform the search in the original space by considering all points on an infinitesimally small sphere centered at the original input vector x. [sent-88, score-0.314]

58 In the range space of the mapping function ΦK, the images of points x + dx form an ellipsoid defined by the quadratic form dzᵀdz = dxᵀ HK(x) dx. [sent-89, score-0.785]

59 For HK (x) ∼ I, the ellipsoid becomes a sphere, all dz’s are of the same length, and the minimum of error in the displacement vector dz corresponds to the maximum of the projection of dz onto w. [sent-90, score-0.987]

60 Therefore, the discriminative direction is parallel to the gradient of the classifier function. [sent-91, score-0.659]
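As a quick numerical check of this geometry (our own sketch, not part of the paper), the code below estimates HK(x) for a Gaussian kernel by finite differences of ∂²K/∂ui∂vj at u = v = x and verifies that it is proportional to the identity, so the image ellipsoid is a sphere and the discriminative direction is indeed parallel to the gradient.

```python
import numpy as np

def kernel_hessian(kernel, x, h=1e-3):
    # H_K(x)_{ij} = d^2 K(u, v) / du_i dv_j at u = v = x, by central finite differences.
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = h * np.eye(d)[i], h * np.eye(d)[j]
            H[i, j] = (kernel(x + ei, x + ej) - kernel(x + ei, x - ej)
                       - kernel(x - ei, x + ej) + kernel(x - ei, x - ej)) / (4 * h * h)
    return H

gamma = 2.0
rbf = lambda u, v: np.exp(-np.sum((u - v) ** 2) / gamma)
H = kernel_hessian(rbf, np.array([0.3, -1.2, 0.7]))
print(np.allclose(H, (2.0 / gamma) * np.eye(3), atol=1e-4))   # True: H_K(x) = (2/gamma) I
```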

61 If HK (x) is of any other form, the length of the displacement vector dz changes as we vary dx, and the minimum of the error in the displacement is not necessarily aligned with the direction that maximizes the projection. [sent-92, score-0.938]

62 Curvature and other properties of target manifolds have been studied extensively for different kernel functions [1, 4]. [sent-96, score-0.107]

63 In particular, one can show that the kernel function implies a metric on the original space. [sent-97, score-0.204]

64 Similarly to the natural gradient [2] that maximizes the change in the function value under an arbitrary metric, we minimize the changes that do not affect the function under the metric implied by the kernel. [sent-98, score-0.23]

65 Selecting Inputs. Given any input example, we can compute the discriminative direction that represents the differences between the two classes captured by the classifier in the neighborhood of the example. [sent-100, score-0.864]

66 But how should we choose the input examples for which to compute the discriminative direction? [sent-101, score-0.443]

67 We argue that in order to study the differences between the classes, one has to examine the input vectors that are close to the separating boundary, namely, the support vectors. [sent-102, score-0.406]

68 In the discriminative framework, we are more interested in the examples that lie close to the opposite class, as they define the differences between the two classes and the optimal separating boundary. [sent-104, score-0.677]

69 The black solid line is the separating boundary, and the dotted lines indicate the margin corridor. [sent-107, score-0.098]

70 The length of the vectors is proportional to the magnitude of the classifier gradient. [sent-109, score-0.102]

71 Support vectors define a margin corridor whose shape is determined by the kernel type used for training. [sent-110, score-0.264]

72 We can estimate the distance from any support vector to the separating boundary by examining the gradient of the classification function for that vector. [sent-111, score-0.398]

73 A large gradient indicates that the support vector is close to the separating boundary and can therefore provide more information about the spatial structure of the boundary. [sent-112, score-0.373]

74 This provides a natural heuristic for assigning importance weighting to different support vectors in the analysis of the discriminative direction. [sent-113, score-0.498]
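A sketch of this weighting heuristic (ours, with illustrative names and the Gaussian kernel assumed): support vectors are ranked by the magnitude of the classifier gradient, which serves as a proxy for closeness to the boundary.

```python
import numpy as np

def rbf_gradient(x, support_x, alphas, ys, gamma):
    # Gradient of f_K at x for the Gaussian kernel (same expression as in (18)).
    diffs = np.asarray(x) - np.asarray(support_x)
    weights = np.asarray(alphas) * np.asarray(ys) * np.exp(-np.sum(diffs ** 2, axis=1) / gamma)
    return -(2.0 / gamma) * (weights @ diffs)

def rank_support_vectors(support_x, alphas, ys, gamma):
    # Support vectors with a larger gradient magnitude lie closer to the separating
    # boundary and receive more weight in the analysis of the discriminative direction.
    mags = np.array([np.linalg.norm(rbf_gradient(x, support_x, alphas, ys, gamma))
                     for x in support_x])
    return np.argsort(-mags), mags   # indices from most to least informative, plus the weights
```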

75 We show the estimated discriminative direction for all points that are close to the separating boundary, not just support vectors. [sent-117, score-0.752]

76 While the magnitude of the discriminative direction vector is irrelevant in our infinitesimal analysis, we scaled the vectors in the figure according to the magnitude of the classifier gradient to illustrate the importance ranking. [sent-118, score-0.937]

77 Note that for the RBF support vectors far away from the boundary (Fig. 2c), [sent-119, score-0.233]

78 the magnitude of the gradient is so small (a tenth of the magnitude at the boundary) that the corresponding vectors are barely visible at this scale. (Figure 3: Right hippocampus in schizophrenia study; panels labeled Normal Control and Patient.) [sent-120, score-0.302]

79 The first support vector from each group is shown, with four views per shape (front, medial, back, lateral). [sent-121, score-0.219]

80 The color coding is used to visualize the amount and the direction of the deformation that corresponds to the discriminative direction, changing from blue (moving inwards) to green (zero deformation) to red (moving outwards). [sent-122, score-0.881]

81 We can see that in the areas where there is enough evidence to estimate the boundary reliably, all three classifiers agree on the boundary and the discriminative direction (lower cluster of arrows). [sent-124, score-0.829]

82 However, if the boundary location is reconstructed based on the regularization defined by the kernel, the classifiers suggest different answers (the upper cluster of arrows), stressing the importance of model selection for classification. [sent-125, score-0.139]

83 The classifiers also provide an indication of the reliability of the differences represented by each arrow, which was repeatedly demonstrated in other experiments we performed. [sent-126, score-0.135]

84 Morphological Studies. Morphological studies of anatomical organs motivated the analysis presented in this paper. [sent-127, score-0.126]

85 In this study, MRI scans of the brain were acquired for schizophrenia patients and a matched group of normal control subjects. [sent-129, score-0.171]

86 Using the shape information (positions of the outline points), we trained a Gaussian RBF classifier to discriminate between schizophrenia patients and normal controls. [sent-131, score-0.24]

87 However, the classifier in its original form does not provide the medical researchers with information on how the hippocampal shape varies between the two groups. [sent-132, score-0.187]

88 Our goal was to translate the information captured by the classifier into anatomically meaningful terms of organ development and deformation. [sent-133, score-0.133]

89 In this application, the coordinates in the input space correspond to the surface point locations for any particular example shape. [sent-134, score-0.131]

90 The discriminative direction vector corresponds to displacements of the surface points and can be conveniently represented by a deformation of the original shape, yielding an intuitive description of shape differences for visualization and further analysis. [sent-135, score-1.211]
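A small sketch (ours; it assumes the input vector is a flattened list of 3-D surface point coordinates and that outward unit normals are available) of how the discriminative direction can be turned into the signed per-point deformation used for the blue/green/red color coding described below.

```python
import numpy as np

def deformation_coloring(direction, normals):
    # direction: discriminative direction as a flat vector of xyz displacements, shape (3 * n,)
    # normals:   outward unit normals at the n surface points, shape (n, 3)
    disp = np.asarray(direction).reshape(-1, 3)
    signed = np.sum(disp * np.asarray(normals), axis=1)   # > 0 moves outwards, < 0 inwards
    return signed   # map with a blue-green-red colormap for display
```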

91 We show the deformation that corresponds to the discriminative direction, omitting the details of shape extraction (see [5] for more information). [sent-136, score-0.639]

92 Fig. 3 displays the first support vector from each group with the discriminative direction “painted” on it. [sent-138, score-0.724]

93 Each row shows four snapshots of the same shape from different viewpoints.³ [sent-139, score-0.094]

94 The color at every node of the surface encodes the corresponding component of the discriminative direction. [sent-140, score-0.441]

95 Note that the deformation represented by the two vectors is very similar in nature, but of opposite signs, as expected from the analysis in Section 3. [sent-141, score-0.293]

96 We can see that the main deformation represented by this pair of vectors is localized in the bulbous “head” of the hippocampus. (³ An alternative way to visualize the same information is to generate an animation of the example shape undergoing the detected deformation.) [sent-143, score-0.444]

97 The next four support vectors in each group represent a virtually identical deformation to the one shown here. [sent-145, score-0.304]

98 Starting with such visualization, the medical researchers can explore the organ deformation and interaction caused by the disease. [sent-146, score-0.254]

99 We introduced the notion of the discriminative direction, which corresponds to the maximal change in the classifier’s response while minimizing irrelevant changes in the input. [sent-148, score-0.576]

100 For kernel-based classifiers, the discriminative direction is determined by minimizing the divergence between the infinitesimal displacement vector and the normal to the separating hyperplane in the higher dimensional kernel space. [sent-149, score-0.816]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('discriminative', 0.356), ('dz', 0.342), ('hk', 0.342), ('dx', 0.291), ('classi', 0.251), ('fk', 0.245), ('direction', 0.243), ('er', 0.186), ('deformation', 0.161), ('boundary', 0.115), ('displacement', 0.11), ('differences', 0.108), ('kernel', 0.107), ('xk', 0.104), ('separating', 0.098), ('shape', 0.094), ('yk', 0.089), ('eigenvector', 0.083), ('original', 0.07), ('rbf', 0.07), ('organ', 0.07), ('qk', 0.07), ('visualize', 0.068), ('irrelevant', 0.068), ('ers', 0.067), ('captured', 0.063), ('vectors', 0.063), ('changes', 0.062), ('morphological', 0.06), ('nitesimal', 0.06), ('gradient', 0.06), ('vj', 0.056), ('anatomical', 0.055), ('schizophrenia', 0.055), ('support', 0.055), ('input', 0.054), ('manifold', 0.054), ('kernels', 0.053), ('ui', 0.052), ('dot', 0.05), ('patients', 0.048), ('visualization', 0.048), ('space', 0.046), ('rn', 0.046), ('dxt', 0.046), ('ellipsoid', 0.046), ('golland', 0.046), ('organs', 0.046), ('polina', 0.046), ('hippocampus', 0.046), ('vector', 0.045), ('cation', 0.043), ('normal', 0.043), ('opposite', 0.042), ('image', 0.041), ('classes', 0.04), ('magnitude', 0.039), ('projection', 0.039), ('introducing', 0.036), ('quadratic', 0.036), ('onto', 0.035), ('divergence', 0.034), ('analytical', 0.034), ('lkopf', 0.034), ('move', 0.034), ('solution', 0.033), ('examples', 0.033), ('sch', 0.032), ('smallest', 0.032), ('de', 0.032), ('surface', 0.031), ('detected', 0.031), ('interpretation', 0.03), ('curvature', 0.029), ('sphere', 0.029), ('component', 0.029), ('scienti', 0.028), ('implied', 0.028), ('corresponds', 0.028), ('argue', 0.028), ('feature', 0.028), ('metric', 0.027), ('perpendicular', 0.027), ('change', 0.027), ('represented', 0.027), ('maximizes', 0.026), ('matrix', 0.026), ('color', 0.025), ('ij', 0.025), ('distance', 0.025), ('group', 0.025), ('motivated', 0.025), ('mapping', 0.024), ('importance', 0.024), ('linear', 0.024), ('arrows', 0.023), ('medical', 0.023), ('generative', 0.023), ('dimensional', 0.023), ('originally', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 60 nips-2001-Discriminative Direction for Kernel Classifiers

Author: Polina Golland

Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.

2 0.21033612 46 nips-2001-Categorization by Learning and Combining Object Parts

Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio

Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.

3 0.17589962 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

Author: Koji Tsuda, Motoaki Kawanabe, Gunnar Rätsch, Sören Sonnenburg, Klaus-Robert Müller

Abstract: Recently, Jaakkola and Haussler proposed a method for constructing kernel functions from probabilistic models. Their so called

4 0.16646616 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).

5 0.15694183 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.

6 0.14144085 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

7 0.13876902 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

8 0.12713414 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

9 0.12467556 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

10 0.11539348 144 nips-2001-Partially labeled classification with Markov random walks

11 0.11245688 164 nips-2001-Sampling Techniques for Kernel Methods

12 0.11218452 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

13 0.11032802 58 nips-2001-Covariance Kernels from Bayesian Generative Models

14 0.10735427 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

15 0.10317636 105 nips-2001-Kernel Machines and Boolean Functions

16 0.09797097 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

17 0.097050168 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

18 0.096351184 139 nips-2001-Online Learning with Kernels

19 0.092493325 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

20 0.090078749 170 nips-2001-Spectral Kernel Methods for Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.269), (1, 0.146), (2, -0.097), (3, 0.111), (4, 0.005), (5, 0.172), (6, -0.081), (7, -0.059), (8, 0.112), (9, 0.091), (10, 0.024), (11, -0.153), (12, -0.07), (13, -0.041), (14, -0.004), (15, -0.046), (16, -0.032), (17, -0.034), (18, 0.059), (19, -0.117), (20, 0.017), (21, 0.019), (22, 0.049), (23, -0.015), (24, -0.062), (25, 0.052), (26, -0.054), (27, 0.096), (28, -0.147), (29, 0.092), (30, -0.062), (31, 0.049), (32, -0.156), (33, -0.077), (34, 0.033), (35, 0.031), (36, -0.017), (37, 0.022), (38, -0.091), (39, -0.034), (40, -0.052), (41, -0.017), (42, -0.06), (43, 0.022), (44, -0.049), (45, 0.055), (46, -0.024), (47, 0.057), (48, -0.043), (49, 0.104)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96711588 60 nips-2001-Discriminative Direction for Kernel Classifiers

Author: Polina Golland

Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.

2 0.68186498 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).

3 0.65186012 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

Author: Ji Zhu, Trevor Hastie

Abstract: The support vector machine (SVM) is known for its good performance in binary classification, but its extension to multi-class classification is still an on-going research issue. In this paper, we propose a new approach for classification, called the import vector machine (IVM), which is built on kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in binary classification, but also can naturally be generalized to the multi-class case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the “support points” of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a computational advantage over the SVM, especially when the size of the training data set is large.

4 0.63224024 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz

Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider. When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/ or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which is interpreted as a “reject”—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 0 0 correct rejection rate 20 40 60 80 100 100 (b) correct rejection rate 20 40 60 80 (a) 0 20 40 60 80 100 correct acceptance rate 0 20 40 60 80 100 correct acceptance rate FIGURE 1. (a) two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) two plausible ROC curves, neither of which is clearly superior to the other. and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. 
However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. 
The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. 
Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA αCA and CR αCR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve. 2 FOUR ALGORITHMS TO PROD THE ROC CURVE In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of αCA and a target CR rate of αCR. 2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES Suppose we train a classifier on a set of positive and negative examples from a class— churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θCA and θCR, which yield CA and CR rates of αCA and αCR, respectively. If the classifier’s discrimination performance fails to achieve the target CA and CR rates, then θCA will be lower than θCR, as depicted in the Figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θCA and θCR; the positive examples must be pushed above θCR and the negative examples must be pushed below θCA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the Figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier’s training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of emphasis, or λ in + 1 = κ e λ in otherwise, where κe is a constant, κe > 1. 2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θCA—such as e2—are clearly the most dif ficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θCR does not depend on examples such as e2, and threshold θCA allows a fraction (1–αCA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θCR—such as e10 and e11—need not be of concern. Essentially , the second algorithm, which we term the eemd phasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct clasθCA e1 e2 e3 0 e4 θCR e5 e6 e7 e8 churn probability e9 e10 e11 e12 e13 1 FIGURE 2. 
A schematic depiction of all training examples arranged by the classifier’s posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). sification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of deemphasis, or λ in + 1 = κ d λ in otherwise, where κd is a constant, κd <1. 2.3 CONSTRAINED OPTIMIZATION The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to αCA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to αCR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions: 1 1 CA ( w, t ) = ----- ∑ σ β ( f ( x i, w ) – t ) CR ( w, t ) = ------ ∑ σ β ( t – f ( x i, w ) ) , P i∈P N i∈N where P and N are the set of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σβ –1 is a sigmoid function with scaling parameter β: σ β ( y ) = ( 1 + exp ( – βy ) ) . The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function 2 µ A ( w, t ) = CR ( w, t ) + ν CA ( w, t ) – α CA + -- CA ( w, t ) – α CA 2 with a fixed Lagrangian multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ CA ( w *, t * ) – α CA , where w * and t * are the values found by the optimization step. We initialize ν = 1 and fix µ = 1 and β = 10 and iterate until ν converges. 2.4 GENETIC ALGORITHM The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θCA and θCR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task. 3 METHODOLOGY For our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). 
We chose these two large, real world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α CR = 0.90 and α CA = 0.50 , values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a no hidden unit neural network) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κe = 1.3 and κd = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set. 4 RESULTS Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right show the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (αCA, αCR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information. 5 DISCUSSION In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the impro vement results in significant dollar savings. 
FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate ±1 standard error of the mean.

FIGURE 4. Cross-validation performance on the two data sets (data set 1, ISP test set; data set 2, wireless test set) for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for CR rate .9; the right column shows the CR rate for CA rate .5. Error bars indicate one standard error of the mean over the 10 data splits.

Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real-world data collection (Mozer et al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in the profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all the algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrates the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches.

6 ACKNOWLEDGEMENTS

No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549-01A1.

7 REFERENCES

Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. New York: Academic Press.

Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (pp. 1019–1026). San Mateo, CA: Morgan Kaufmann.

Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.

Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696.

Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.
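As a side note on the operating-point numbers reported above (the CA rate at a fixed CR rate of .9 and the CR rate at a fixed CA rate of .5), the sketch below shows one way such point metrics can be computed from model scores. It assumes, as an interpretation of this excerpt, that the CA rate is the fraction of churners correctly accepted and the CR rate the fraction of nonchurners correctly rejected, and that higher scores mean higher predicted churn risk; the function names are our own, not the paper's.

```python
import numpy as np

def ca_at_fixed_cr(scores, labels, target_cr=0.90):
    """CA rate at the score threshold that rejects a fraction
    `target_cr` of the nonchurners (labels: 1 = churn, 0 = nonchurn)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thr = np.quantile(scores[labels == 0], target_cr)
    return float(np.mean(scores[labels == 1] > thr))

def cr_at_fixed_ca(scores, labels, target_ca=0.50):
    """CR rate at the score threshold that accepts a fraction
    `target_ca` of the churners."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thr = np.quantile(scores[labels == 1], 1.0 - target_ca)
    return float(np.mean(scores[labels == 0] <= thr))

# Example with synthetic scores:
# rng = np.random.default_rng(0)
# labels = rng.integers(0, 2, 1000)
# scores = 0.5 * rng.random(1000) + 0.5 * labels
# print(ca_at_fixed_cr(scores, labels), cr_at_fixed_ca(scores, labels))
```

Evaluating these two quantities on each of the 10 cross-validation splits and averaging corresponds to the bars plotted in Figure 4.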

5 0.61903936 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

Author: B. Zadrozny

Abstract: This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension to arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with Boosted Naive Bayes show that our method produces calibrated class membership probability estimates, while achieving classification accuracy similar to that of loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.

6 0.61486137 46 nips-2001-Categorization by Learning and Combining Object Parts

7 0.60670573 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

8 0.59662467 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

9 0.5691877 15 nips-2001-A New Discriminative Kernel From Probabilistic Models

10 0.53827381 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

11 0.53144783 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions

12 0.50836599 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines

13 0.50392753 144 nips-2001-Partially labeled classification with Markov random walks

14 0.49207789 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

15 0.49137628 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

16 0.47103685 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes

17 0.4663727 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

18 0.45732179 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

19 0.43327564 74 nips-2001-Face Recognition Using Kernel Methods

20 0.42005429 164 nips-2001-Sampling Techniques for Kernel Methods


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.204), (14, 0.063), (17, 0.03), (19, 0.036), (27, 0.155), (30, 0.128), (38, 0.029), (59, 0.037), (72, 0.086), (79, 0.02), (83, 0.016), (91, 0.115)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88939583 60 nips-2001-Discriminative Direction for Kernel Classifiers

Author: Polina Golland


2 0.76398182 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama

Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition, such as speech recognition, is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation on sequential patterns is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments on hand-segmented phoneme recognition. Preliminary experimental results show recognition performance comparable to hidden Markov models (HMMs).

3 0.76067269 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

Author: Paul Viola, Michael Jones

Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g., face detection or database retrieval). In such domains, a cascade of simple classifiers, each trained to achieve high detection rates and modest false positive rates, can yield a final detector with many desirable features, including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show that the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and has a false positive rate of 1 in 1,000,000.

4 0.75860208 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

Author: Olivier Chapelle, Bernhard Schölkopf

Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance, it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice.

5 0.75738573 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

Author: Mário Figueiredo

Abstract: In this paper we introduce a new sparseness-inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.

6 0.75543582 46 nips-2001-Categorization by Learning and Combining Object Parts

7 0.75512892 102 nips-2001-KLD-Sampling: Adaptive Particle Filters

8 0.75384313 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

9 0.75088036 13 nips-2001-A Natural Policy Gradient

10 0.75008988 185 nips-2001-The Method of Quantum Clustering

11 0.74950576 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

12 0.7479409 149 nips-2001-Probabilistic Abstraction Hierarchies

13 0.74678868 8 nips-2001-A General Greedy Approximation Algorithm with Applications

14 0.74444449 56 nips-2001-Convolution Kernels for Natural Language

15 0.74330348 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

16 0.74271679 190 nips-2001-Thin Junction Trees

17 0.74138469 22 nips-2001-A kernel method for multi-labelled classification

18 0.74125177 74 nips-2001-Face Recognition Using Kernel Methods

19 0.74081671 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

20 0.74065757 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing