nips nips2001 nips2001-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. [sent-11, score-0.32]
2 It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. [sent-12, score-0.122]
3 Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. [sent-13, score-0.285]
4 Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. [sent-14, score-0.661]
5 1 Introduction We study the problem of automatically synthesizing hierarchical classifiers by learning discriminative object parts in images. [sent-16, score-0.483]
6 Many objects (e.g. faces, cars) seem to be naturally described by a few characteristic parts or components and their geometrical relation. [sent-19, score-0.27]
7 Greater invariance to viewpoint changes and robustness against partial occlusions are the two main potential advantages of component-based approaches compared to a global approach. [sent-20, score-0.153]
8 The first challenge in developing component-based systems is how to automatically choose a set of discriminative object components. [sent-21, score-0.417]
9 Instead of manually selecting the components, it is desirable to learn the components from a set of examples based on their discriminative power and their robustness against pose and illumination changes. [sent-22, score-0.468]
10 The second challenge is to combine the component-based experts to perform the final classification. [sent-23, score-0.107]
11 2 Background Global approaches in which the whole pattern of an object is used as input to a single classifier were successfully applied to tasks where the pose of the object was fixed. [sent-24, score-0.638]
12 In [6] Haar wavelet features are used to detect frontal and back views of pedestrians with an SVM classifier. [sent-25, score-0.18]
13 Learning-based systems for detecting frontal faces based on gray value features are described in [14, 13, 10, 2]. [sent-26, score-0.436]
14 Component-based techniques promise to provide more invariance since the individual components vary less under pose changes than the whole object. [sent-27, score-0.314]
15 Variations induced by pose changes occur mainly in the locations of the components. [sent-28, score-0.096]
16 A component-based method for detecting faces based on the empirical probabilities of overlapping rectangular image parts is proposed in [11]. [sent-29, score-0.508]
17 Another probabilistic approach which detects small parts of faces is proposed in [4]. [sent-30, score-0.307]
18 It uses local feature extractors to detect the eyes, the corner of the mouth, and the tip of the nose. [sent-31, score-0.122]
19 The geometrical configuration of these features is matched with a model configuration by conditional search. [sent-32, score-0.076]
20 Local features are extracted by applying multi-scale and multi-orientation filters to the input image. [sent-34, score-0.061]
21 The responses of the filters on the training set are modeled as Gaussian distributions. [sent-35, score-0.073]
22 In [5] pedestrian detection is performed by a set of SVM classifiers each of which was trained to detect a specific part of the human body. [sent-36, score-0.249]
23 In this paper we present a technique for learning relevant object components. [sent-37, score-0.231]
24 The technique starts with a set of small seed regions which are gradually grown by minimizing a bound on the expected error probability of an SVM. [sent-38, score-0.113]
25 Once the components have been determined, we train a system consisting of a two-level hierarchy of SVM classifiers. [sent-39, score-0.207]
26 Second, a combination classifier learns the geometrical relation between the components and performs the final detection of the object. [sent-41, score-0.4]
27 3.1 Linear Support Vector Machines Linear SVMs [15] perform pattern recognition for two-class problems by determining the separating hyperplane with maximum distance to the closest points in the training set. [sent-43, score-0.182]
28 In fact, the expected error probability of the SVM, $EP_{err}$, satisfies the following bound [15]: $EP_{err} \le \frac{1}{\ell}\,E\!\left[\frac{D^2}{M^2}\right]$ (3), where $D$ is the diameter of the smallest sphere containing all data points in the training set, $M$ is the margin of the separating hyperplane, and $\ell$ is the number of training examples. [sent-49, score-0.073]
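For a trained linear SVM the quantities in this bound are directly accessible: the margin M is 1/||w|| for the canonical hyperplane, and D can be approximated from the training data. A minimal sketch of how the bound could be estimated for one candidate component, assuming scikit-learn and a cheap centroid-based approximation of the enclosing sphere (the function name and the choice of C are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.svm import SVC

def svm_error_bound(X, y):
    """Estimate the D^2 / (M^2 * l) bound of eq. (3) for a linear SVM.

    X: (l, d) array of training patterns, y: (l,) labels in {-1, +1}.
    A lower value means the candidate component is expected to separate
    faces from background more reliably.
    """
    clf = SVC(kernel="linear", C=10.0).fit(X, y)        # C is an arbitrary illustrative choice
    margin = 1.0 / np.linalg.norm(clf.coef_.ravel())    # M of the canonical hyperplane
    # Approximate the smallest enclosing sphere by the ball centered at the
    # data centroid; this slightly overestimates the true diameter D.
    center = X.mean(axis=0)
    diameter = 2.0 * np.max(np.linalg.norm(X - center, axis=1))
    return (diameter ** 2) / (margin ** 2) / X.shape[0]
```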
29 3.2 Learning Components Our method automatically determines rectangular components from a set of object images. [sent-51, score-0.534]
30 The algorithm starts with a small rectangular component located around a pre-selected point in the object image (e.g. [sent-52, score-0.542]
31 for faces this could be the center of the left eye). [sent-54, score-0.292]
32 The component is extracted from each object image to build a training set of positive examples. [sent-55, score-0.531]
33 We also generate a training set of background patterns that have the same rectangular shape as the component. [sent-56, score-0.289]
34 After training an SVM on the component data we estimate the performance of the SVM based on the upper bound on the error probability. [sent-57, score-0.18]
35 After determining the bound we enlarge the component by expanding the rectangle by one pixel into one of the four directions (up, down, left, right). [sent-60, score-0.206]
36 Again, we generate training data, train an SVM, and determine the bound. [sent-61, score-0.107]
37 We do this for expansions into all four directions and finally keep the expansion which decreases the bound the most. [sent-62, score-0.069]
38 This process is continued until the expansions into all four directions lead to an increase of the bound. [sent-63, score-0.069]
39 In order to learn a set of components this process can be applied to different seed regions. [sent-64, score-0.205]
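The growing procedure just described amounts to a greedy loop over the four expansion directions, scored by the bound. A schematic sketch of that loop, reusing the svm_error_bound helper from above; the background-sampling routine is a hypothetical placeholder, since the paper does not spell out how the negative patches are drawn:

```python
import numpy as np

def grow_component(images, backgrounds, seed_rect):
    """Greedily grow a rectangular component from a small seed region.

    images: object images aligned so the same rectangle can be cut from each;
    seed_rect: (top, left, height, width), e.g. a 5x5 region around the left eye.
    """
    def bound_for(rect):
        t, l, h, w = rect
        pos = np.array([img[t:t + h, l:l + w].ravel() for img in images])
        neg = sample_background_patches(backgrounds, rect)   # hypothetical helper
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
        return svm_error_bound(X, y)

    rect, best = seed_rect, bound_for(seed_rect)
    while True:
        t, l, h, w = rect
        candidates = [(t - 1, l, h + 1, w),    # expand up
                      (t, l, h + 1, w),        # expand down
                      (t, l - 1, h, w + 1),    # expand left
                      (t, l, h, w + 1)]        # expand right
        new_best, new_rect = min((bound_for(c), c) for c in candidates)
        if new_best >= best:                   # every expansion increases the bound
            return rect
        best, rect = new_best, new_rect
```

Applied to each seed region in turn, this loop yields the set of learned components.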
40 4 Learning Facial Components Extracting face patterns is usually tedious and time-consuming work that has to be done manually. [sent-65, score-0.299]
41 Taking the component-based approach we would have to manually extract each single component from all images in the training set. [sent-66, score-0.325]
42 For this reason we used textured 3-D head models [16] to generate the training data. [sent-68, score-0.272]
43 By rendering the 3-D head models we could automatically generate large numbers of faces in arbitrary poses and with arbitrary illumination. [sent-69, score-0.465]
44 In addition to the 3-D information we also knew the 3-D correspondences for a set of reference points shown in Fig. 1a. [sent-70, score-0.095]
45 These correspondences allowed us to automatically extract facial components located around the reference points. [sent-72, score-0.476]
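Because the 2-D image position of each 3-D reference point is known for every rendered face, extracting a component reduces to cutting a fixed rectangle around that projected point. A small illustrative sketch (the projection itself comes from the renderer and is assumed given):

```python
def extract_component(image, ref_point, height, width):
    """Cut a height x width patch centered on the projected reference point.

    ref_point: (row, col) image coordinates of the 3-D reference point,
    known for every synthetic face because the 3-D correspondence is known.
    """
    r, c = int(round(ref_point[0])), int(round(ref_point[1]))
    top, left = r - height // 2, c - width // 2
    return image[top:top + height, left:left + width]
```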
46 Originally we had seven textured head models acquired by a 3-D scanner. [sent-73, score-0.165]
47 Additional head models were generated by 3-D morphing between all pairs of the original head models. [sent-74, score-0.281]
48 The faces were illuminated by ambient light and a single directional light pointing towards the center of the face. [sent-76, score-0.384]
49 The position of the light varied in azimuth and in elevation. [sent-77, score-0.109]
50 Overall, we generated 2,457 face images of size 58x58. [sent-78, score-0.313]
51 Some examples of synthetic face images used for training are shown in Fig. 1b. [sent-79, score-0.428]
52 The negative training set initially consisted of 10,209 58x58 non-face patterns randomly extracted from 502 non-face images. [sent-81, score-0.257]
53 We then applied bootstrapping to enlarge the training data by non-face patterns that look similar to faces. [sent-82, score-0.193]
54 To do so we trained a single linear SVM classifier and applied it to the previously used set of 502 non-face images. [sent-83, score-0.079]
55 The false positives (FPs) were added to the non-face training data to build the final training set of size 13,654. [sent-84, score-0.146]
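The bootstrapping step can be pictured as one pass of the initial classifier over the non-face images, keeping every window it wrongly accepts as an additional negative example. A sketch under the assumption that a sliding-window helper exists (it is not described in the paper):

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_negatives(clf, nonface_images, window=(58, 58)):
    """Collect non-face windows that the current classifier accepts as faces."""
    false_positives = []
    for img in nonface_images:
        for patch in sliding_windows(img, window):            # hypothetical helper
            if clf.decision_function(patch.reshape(1, -1))[0] > 0:
                false_positives.append(patch.ravel())
    return np.array(false_positives)

# Usage: train on the initial positive/negative sets, then enlarge the
# negative set with the false positives and retrain.
# clf = LinearSVC().fit(X_initial, y_initial)
# extra_negatives = bootstrap_negatives(clf, nonface_images)
```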
56 We started with fourteen manually selected seed regions of size 5x5. [sent-85, score-0.249]
57 Figure 1: a) Reference points on the head models which were used for 3-D morphing and automatic extraction of facial components. [sent-87, score-0.291]
58 On the first level the component classifiers independently detect components of the face. [sent-91, score-0.316]
59 Each classifier was trained on a set of facial components and on a set of non-face patterns generated from the training set described in Section 4. [sent-92, score-0.419]
60 On the second level the combination classifier performs the detection of the face based on the outputs of the component classifiers. [sent-93, score-0.579]
61 The maximum real-valued outputs of each component classifier within rectangular search regions around the expected positions of the components are used as inputs to the combination classifier. [sent-94, score-0.469]
62 The size of the search regions was estimated from the mean and the standard deviation of the locations of the components in the training images. [sent-95, score-0.243]
63 The maximum operation is performed both during training and at run-time. [sent-96, score-0.073]
64 Interestingly it turns out to be similar to the key pooling mechanism postulated in a recent model of object recognition in the visual cortex [8]. [sent-97, score-0.299]
65 We also provide the combination classifier with the precise positions of the detected components relative to the upper left corner of the 58x58 window. [sent-98, score-0.242]
66 Overall we have three values per component classifier that are propagated to the combination classifier: the maximum output of the component classifier and the x and y image coordinates of the maximum. [sent-99, score-0.34]
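Putting the two levels together, the combination classifier sees three numbers per component: the maximum expert output inside that component's search region and the image coordinates where the maximum occurs. A sketch of how this 3*K-dimensional input might be assembled, assuming the expert response maps over the 58x58 window have already been computed:

```python
import numpy as np

def combination_features(response_maps, search_regions):
    """Build the combination-classifier input from K component experts.

    response_maps: list of K 2-D arrays with the real-valued expert output
        at every window position inside the 58x58 face window.
    search_regions: list of K (top, left, height, width) tuples, estimated
        from the mean and standard deviation of component locations.
    """
    features = []
    for resp, (top, left, h, w) in zip(response_maps, search_regions):
        region = resp[top:top + h, left:left + w]
        r, c = np.unravel_index(np.argmax(region), region.shape)
        o_k = region[r, c]                 # maximum expert output O_k
        x_k, y_k = left + c, top + r       # its location (X_k, Y_k)
        features.extend([o_k, x_k, y_k])
    return np.array(features)
```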
67 *Outputs of component experts: bright intensities indicate high confidence. [sent-100, score-0.107]
68 The component experts (e.g. a left-eye expert, each a linear SVM) are shifted over the 58x58 window; their outputs, e.g. (O14, X14, Y14) for the fourteenth component, are passed to the combination classifier, also a linear SVM. [sent-112, score-0.178]
69 For each component k, determine its maximum output within a search region and its location: (Ok, Xk, Yk). [sent-113, score-0.107]
70 Final decision: face / background. Figure 2: System overview of the component-based classifier. [sent-114, score-0.319]
71 6 Experiments In our experiments we compared the component-based system to global classifiers. [sent-115, score-0.104]
72 The component system consisted of fourteen linear SVM classifiers for detecting the components and a single linear SVM as combination classifier. [sent-116, score-0.593]
73 The global classifiers were a single linear SVM and a single second-degree polynomial SVM both trained on the gray values of the whole face pattern. [sent-117, score-0.521]
74 The training data for these three classifiers consisted of 2,457 synthetic gray face images and 13,654 non-face gray images of size 58x58. [sent-118, score-0.755]
75 The positive test set consisted of 1,834 faces rotated in depth. [sent-119, score-0.356]
76 The faces were manually extracted from the CMU PIE database [12]. [sent-120, score-0.383]
77 The negative test set consisted of 24,464 difficult non-face patterns that were collected by a fast face detector [3] from web images. [sent-121, score-0.369]
78 For an indirect comparison, we used a second-degree polynomial SVM [2] which was trained on a large set of 19x19 real face images. [sent-124, score-0.291]
79 This classifier performed amongst the best face detection systems on the MIT-CMU test set. [sent-125, score-0.372]
80 The ROC curves in Fig. 3 show that the component-based classifier is significantly better than the three global classifiers. [sent-127, score-0.062]
81 Some detection results generated by the component system are shown in Fig. 4. [sent-128, score-0.275]
82 Figure 3: Comparison between global classifiers and the component-based classifier. [sent-130, score-0.062]
83 A natural question that arises concerns the role of geometrical information. [sent-132, score-0.112]
84 To answer this question–which has relevant implications for models of cortex–we tested another system in which the combination classifier receives as inputs only the output of each component classifier but not the position of its maximum. [sent-133, score-0.279]
85 As shown in Fig. 5, this system still outperforms the whole-face systems but it is worse than the system with position information. [sent-135, score-0.432]
86 Figure 5: Comparison between a component-based classifier trained with position information and a component-based classifier without position information. [sent-136, score-0.171]
87 7 Open Questions An extension of the component-based approach to face identification is under way and is already showing good performance [1]. [sent-137, score-0.246]
88 Another natural generalization of the work described here involves the application of our system to various classes of objects such as cars, animals, and people. [sent-138, score-0.081]
89 Still another extension regards the question of view-invariant object detection. [sent-139, score-0.267]
90 As suggested by [7] in a biological context and demonstrated recently by [11] in machine vision, full pose invariance in recognition tasks can be achieved by combining view-dependent classifiers. [sent-140, score-0.329]
91 It is interesting to ask whether the approach described here could also be used to learn which views are most discriminative and how to combine them optimally. [sent-141, score-0.125]
92 Finally, the role of geometry and in particular how to compute and represent position information in biologically plausible networks, is an important open question at the interface between machine and biological vision. [sent-142, score-0.184]
93 Face recognition with support vector machines: global versus component-based approach. [sent-147, score-0.13]
94 Feature reduction and hierarchy of classifiers for fast object detection in video images. [sent-163, score-0.391]
95 Finding faces in cluttered scenes using random labeled graph matching. [sent-172, score-0.244]
96 International Conference on Computer Vision, pages 637–644, Cambridge, MA, 1995. [sent-174, score-0.037]
97 In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 23, pages 349–361, April 2001. [sent-180, score-0.037]
98 In International Journal of Computer Vision, volume 38, 1, pages 15–33, 2000. [sent-185, score-0.037]
99 A statistical method for 3d object detection applied to faces and cars. [sent-215, score-0.601]
100 An original approach for the localisation of objects in images. [sent-233, score-0.039]
wordName wordTfidf (topN-words)
[('classi', 0.372), ('svm', 0.282), ('face', 0.246), ('faces', 0.244), ('object', 0.231), ('er', 0.194), ('mouth', 0.168), ('ers', 0.139), ('heisele', 0.134), ('nose', 0.134), ('components', 0.131), ('pixels', 0.128), ('expert', 0.127), ('detection', 0.126), ('facial', 0.117), ('component', 0.107), ('head', 0.107), ('cmu', 0.099), ('pose', 0.096), ('gray', 0.095), ('rectangular', 0.092), ('vision', 0.089), ('biological', 0.085), ('automatically', 0.08), ('manually', 0.078), ('detect', 0.078), ('geometrical', 0.076), ('seed', 0.074), ('training', 0.073), ('experts', 0.071), ('discriminative', 0.07), ('consisted', 0.07), ('recognition', 0.068), ('enlarge', 0.067), ('freiburg', 0.067), ('ggf', 0.067), ('morphing', 0.067), ('siena', 0.067), ('vetter', 0.067), ('images', 0.067), ('combination', 0.067), ('poggio', 0.067), ('position', 0.063), ('parts', 0.063), ('global', 0.062), ('extracted', 0.061), ('image', 0.059), ('cars', 0.058), ('fourteen', 0.058), ('papageorgiou', 0.058), ('pontil', 0.058), ('serre', 0.058), ('textured', 0.058), ('views', 0.055), ('located', 0.053), ('pie', 0.053), ('patterns', 0.053), ('eye', 0.05), ('detecting', 0.05), ('correspondences', 0.05), ('illumination', 0.05), ('ok', 0.05), ('center', 0.048), ('invariance', 0.048), ('frontal', 0.047), ('lters', 0.047), ('light', 0.046), ('trained', 0.045), ('reference', 0.045), ('corner', 0.044), ('robustness', 0.043), ('rotated', 0.042), ('system', 0.042), ('synthetic', 0.042), ('pattern', 0.041), ('eyes', 0.041), ('guration', 0.041), ('ma', 0.041), ('regions', 0.039), ('whole', 0.039), ('computer', 0.039), ('objects', 0.039), ('hierarchical', 0.039), ('pages', 0.037), ('background', 0.037), ('international', 0.037), ('expansions', 0.037), ('yk', 0.037), ('question', 0.036), ('challenge', 0.036), ('overview', 0.036), ('generate', 0.034), ('linear', 0.034), ('conference', 0.034), ('hierarchy', 0.034), ('outputs', 0.033), ('nal', 0.033), ('combining', 0.032), ('directions', 0.032), ('published', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.36324865 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
3 0.23627912 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
4 0.23285052 54 nips-2001-Contextual Modulation of Target Saliency
Author: Antonio Torralba
Abstract: The most popular algorithms for object detection require the use of exhaustive spatial and scale search procedures. In such approaches, an object is defined by means of local features. fu this paper we show that including contextual information in object detection procedures provides an efficient way of cutting down the need for exhaustive search. We present results with real images showing that the proposed scheme is able to accurately predict likely object classes, locations and sizes. 1
5 0.21033612 60 nips-2001-Discriminative Direction for Kernel Classifiers
Author: Polina Golland
Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.
6 0.20423438 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
7 0.18043257 74 nips-2001-Face Recognition Using Kernel Methods
8 0.17197716 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
9 0.16782357 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
10 0.16529503 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
11 0.16088513 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
12 0.15210479 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.13186419 144 nips-2001-Partially labeled classification with Markov random walks
14 0.13115445 84 nips-2001-Global Coordination of Local Linear Models
15 0.13088034 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
16 0.12822123 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
17 0.12243807 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
18 0.12024499 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
19 0.11536678 105 nips-2001-Kernel Machines and Boolean Functions
20 0.10839516 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
topicId topicWeight
[(0, -0.312), (1, 0.14), (2, -0.174), (3, 0.264), (4, -0.189), (5, 0.191), (6, -0.298), (7, -0.087), (8, 0.16), (9, 0.05), (10, 0.087), (11, -0.115), (12, 0.116), (13, 0.003), (14, -0.067), (15, -0.052), (16, 0.152), (17, -0.064), (18, 0.061), (19, -0.16), (20, -0.037), (21, -0.112), (22, 0.042), (23, -0.09), (24, -0.06), (25, -0.049), (26, -0.083), (27, -0.111), (28, 0.023), (29, -0.013), (30, 0.003), (31, -0.016), (32, -0.011), (33, -0.046), (34, 0.03), (35, -0.01), (36, 0.003), (37, -0.064), (38, 0.005), (39, -0.021), (40, 0.03), (41, 0.028), (42, 0.017), (43, -0.047), (44, 0.051), (45, 0.024), (46, 0.009), (47, 0.001), (48, 0.065), (49, -0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.98364532 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.82083762 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
3 0.67458779 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz
Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider. When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/ or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which is interpreted as a “reject”—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 0 0 correct rejection rate 20 40 60 80 100 100 (b) correct rejection rate 20 40 60 80 (a) 0 20 40 60 80 100 correct acceptance rate 0 20 40 60 80 100 correct acceptance rate FIGURE 1. (a) two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) two plausible ROC curves, neither of which is clearly superior to the other. and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. 
However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. 
The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. 
Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA αCA and CR αCR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve. 2 FOUR ALGORITHMS TO PROD THE ROC CURVE In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of αCA and a target CR rate of αCR. 2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES Suppose we train a classifier on a set of positive and negative examples from a class— churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θCA and θCR, which yield CA and CR rates of αCA and αCR, respectively. If the classifier’s discrimination performance fails to achieve the target CA and CR rates, then θCA will be lower than θCR, as depicted in the Figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θCA and θCR; the positive examples must be pushed above θCR and the negative examples must be pushed below θCA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the Figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier’s training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of emphasis, or λ in + 1 = κ e λ in otherwise, where κe is a constant, κe > 1. 2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θCA—such as e2—are clearly the most dif ficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θCR does not depend on examples such as e2, and threshold θCA allows a fraction (1–αCA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θCR—such as e10 and e11—need not be of concern. Essentially , the second algorithm, which we term the eemd phasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct clasθCA e1 e2 e3 0 e4 θCR e5 e6 e7 e8 churn probability e9 e10 e11 e12 e13 1 FIGURE 2. 
A schematic depiction of all training examples arranged by the classifier’s posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). sification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of deemphasis, or λ in + 1 = κ d λ in otherwise, where κd is a constant, κd <1. 2.3 CONSTRAINED OPTIMIZATION The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to αCA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to αCR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions: 1 1 CA ( w, t ) = ----- ∑ σ β ( f ( x i, w ) – t ) CR ( w, t ) = ------ ∑ σ β ( t – f ( x i, w ) ) , P i∈P N i∈N where P and N are the set of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σβ –1 is a sigmoid function with scaling parameter β: σ β ( y ) = ( 1 + exp ( – βy ) ) . The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function 2 µ A ( w, t ) = CR ( w, t ) + ν CA ( w, t ) – α CA + -- CA ( w, t ) – α CA 2 with a fixed Lagrangian multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ CA ( w *, t * ) – α CA , where w * and t * are the values found by the optimization step. We initialize ν = 1 and fix µ = 1 and β = 10 and iterate until ν converges. 2.4 GENETIC ALGORITHM The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θCA and θCR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task. 3 METHODOLOGY For our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). 
We chose these two large, real world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α CR = 0.90 and α CA = 0.50 , values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a no hidden unit neural network) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κe = 1.3 and κd = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set. 4 RESULTS Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right show the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (αCA, αCR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information. 5 DISCUSSION In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the impro vement results in significant dollar savings. 
Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real world data collection (Mozer et 0.845 0.84 0.39 0.835 0.385 CR rate CA rate 0.4 0.395 0.38 0.83 0.825 0.375 0.82 0.37 0.815 0.365 0.81 0 5 10 15 20 25 30 35 40 45 50 Iteration 0 5 10 15 20 25 30 35 40 45 50 Iteration FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/–1 standard error of the mean. Data set 1 0.835 0.380 0.830 0.375 0.825 CR rate ISP Test Set 0.840 0.385 CA rate 0.390 0.370 0.820 0.365 0.815 0.360 0.810 0.355 0.805 0.350 0.800 std emph deemph constr GA std emph deemph constr GA std emph deemph constr GA 0.900 0.375 0.350 CR rate Data set 2 0.875 CA rate Wireless Test Set 0.850 0.325 0.825 0.300 0.800 std emph deemph constr GA FIGURE 4. Cross-validation performance on the two data sets for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for CR rate .9; the right column shows the CR rate for CA rate .5. The error bar indicates one standard error of the mean over the 10 data splits. al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrate the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches. 6 ACKNOWLEDGEMENTS No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549–01A1. 7 REFERENCES Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic. Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (1019–1026). San Mateo, CA: Morgan Kaufmann. Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696. Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. 
Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.
4 0.61823434 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
5 0.60950029 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).
6 0.60720134 60 nips-2001-Discriminative Direction for Kernel Classifiers
7 0.56934714 54 nips-2001-Contextual Modulation of Target Saliency
8 0.56283855 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
9 0.54743534 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
10 0.50980198 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
11 0.47762001 182 nips-2001-The Fidelity of Local Ordinal Encoding
12 0.46730462 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
13 0.45947522 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
14 0.43368235 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
15 0.42956501 25 nips-2001-Active Learning in the Drug Discovery Process
16 0.40931344 144 nips-2001-Partially labeled classification with Markov random walks
17 0.40332642 74 nips-2001-Face Recognition Using Kernel Methods
18 0.39473438 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
19 0.38365731 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
20 0.37824166 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
topicId topicWeight
[(14, 0.032), (17, 0.016), (19, 0.034), (27, 0.121), (30, 0.134), (38, 0.06), (59, 0.056), (72, 0.091), (76, 0.197), (79, 0.03), (83, 0.011), (91, 0.128)]
simIndex simValue paperId paperTitle
same-paper 1 0.8886686 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.77273905 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
Author: Dieter Fox
Abstract: Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
3 0.76408517 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
4 0.75911355 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
5 0.75743026 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
Author: Gregor Wenning, Klaus Obermayer
Abstract: Cortical neurons might be considered as threshold elements integrating in parallel many excitatory and inhibitory inputs. Due to the apparent variability of cortical spike trains this yields a strongly fluctuating membrane potential, such that threshold crossings are highly irregular. Here we study how a neuron could maximize its sensitivity w.r.t. a relatively small subset of excitatory input. Weak signals embedded in fluctuations is the natural realm of stochastic resonance. The neuron's response is described in a hazard-function approximation applied to an Ornstein-Uhlenbeck process. We analytically derive an optimality criterium and give a learning rule for the adjustment of the membrane fluctuations, such that the sensitivity is maximal exploiting stochastic resonance. We show that adaptation depends only on quantities that could easily be estimated locally (in space and time) by the neuron. The main results are compared with simulations of a biophysically more realistic neuron model. 1
6 0.75354934 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
7 0.75186962 60 nips-2001-Discriminative Direction for Kernel Classifiers
8 0.7517494 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
9 0.74701571 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
10 0.74562383 185 nips-2001-The Method of Quantum Clustering
11 0.74414545 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
12 0.74042821 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
13 0.73980451 182 nips-2001-The Fidelity of Local Ordinal Encoding
14 0.73850644 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
16 0.73731339 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
17 0.73555928 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
18 0.73537171 56 nips-2001-Convolution Kernels for Natural Language
19 0.73499179 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
20 0.73378026 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing