nips nips2001 nips2001-46 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. [sent-11, score-0.32]
2 It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. [sent-12, score-0.122]
3 Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. [sent-13, score-0.285]
4 Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. [sent-14, score-0.661]
5 1 Introduction We study the problem of automatically synthesizing hierarchical classifiers by learning discriminative object parts in images. [sent-16, score-0.483]
6 Many objects (e.g. faces, cars) seem to be naturally described by a few characteristic parts or components and their geometrical relation. [sent-19, score-0.27]
7 Greater invariance to viewpoint changes and robustness against partial occlusions are the two main potential advantages of component-based approaches compared to a global approach. [sent-20, score-0.153]
8 The first challenge in developing component-based systems is how to automatically choose a set of discriminative object components. [sent-21, score-0.417]
9 Instead of manually selecting the components, it is desirable to learn the components from a set of examples based on their discriminative power and their robustness against pose and illumination changes. [sent-22, score-0.468]
10 The second challenge is to combine the component-based experts to perform the final classification. [sent-23, score-0.107]
11 2 Background Global approaches in which the whole pattern of an object is used as input to a single classifier were successfully applied to tasks where the pose of the object was fixed. [sent-24, score-0.638]
12 In [6] Haar wavelet features are used to detect frontal and back views of pedestrians with an SVM classifier. [sent-25, score-0.18]
13 Learning-based systems for detecting frontal faces based on gray value features are described in [14, 13, 10, 2]. [sent-26, score-0.436]
14 Component-based techniques promise to provide more invariance since the individual components vary less under pose changes than the whole object. [sent-27, score-0.314]
15 Variations induced by pose changes occur mainly in the locations of the components. [sent-28, score-0.096]
16 A component-based method for detecting faces based on the empirical probabilities of overlapping rectangular image parts is proposed in [11]. [sent-29, score-0.508]
17 Another probabilistic approach which detects small parts of faces is proposed in [4]. [sent-30, score-0.307]
18 It uses local feature extractors to detect the eyes, the corner of the mouth, and the tip of the nose. [sent-31, score-0.122]
19 The geometrical configuration of these features is matched with a model configuration by conditional search. [sent-32, score-0.076]
20 Local features are extracted by applying multi-scale and multi-orientation filters to the input image. [sent-34, score-0.061]
21 The responses of the filters on the training set are modeled as Gaussian distributions. [sent-35, score-0.073]
22 In [5] pedestrian detection is performed by a set of SVM classifiers each of which was trained to detect a specific part of the human body. [sent-36, score-0.249]
23 In this paper we present a technique for learning relevant object components. [sent-37, score-0.231]
24 The technique starts with a set of small seed regions which are gradually grown by minimizing a bound on the expected error probability of an SVM. [sent-38, score-0.113]
25 Once the components have been determined, we train a system consisting of a two-level hierarchy of SVM classifiers. [sent-39, score-0.207]
26 Second, a combination classifier learns the geometrical relation between the components and performs the final detection of the object. [sent-41, score-0.4]
27 3.1 Linear Support Vector Machines Linear SVMs [15] perform pattern recognition for two-class problems by determining the separating hyperplane with maximum distance to the closest points in the training set. [sent-43, score-0.182]
28 In fact, the expected error probability of the SVM, $EP_{err}$, satisfies the following bound [15]: $EP_{err} \le \frac{1}{\ell}\,E\!\left[\frac{D^2}{M^2}\right]$ (3), where $D$ is the diameter of the smallest sphere containing all data points in the training set, $M$ is the margin of the separating hyperplane, and $\ell$ is the number of training examples. [sent-49, score-0.073]
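For a trained linear SVM the quantities in this bound are directly accessible: the margin M is 1/||w|| for the canonical hyperplane, and D can be approximated from the training data. A minimal sketch of how the bound could be estimated for one candidate component, assuming scikit-learn and a cheap centroid-based approximation of the enclosing sphere (the function name and the choice of C are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.svm import SVC

def svm_error_bound(X, y):
    """Estimate the D^2 / (M^2 * l) bound of eq. (3) for a linear SVM.

    X: (l, d) array of training patterns, y: (l,) labels in {-1, +1}.
    A lower value means the candidate component is expected to separate
    faces from background more reliably.
    """
    clf = SVC(kernel="linear", C=10.0).fit(X, y)        # C is an arbitrary illustrative choice
    margin = 1.0 / np.linalg.norm(clf.coef_.ravel())    # M of the canonical hyperplane
    # Approximate the smallest enclosing sphere by the ball centered at the
    # data centroid; this slightly overestimates the true diameter D.
    center = X.mean(axis=0)
    diameter = 2.0 * np.max(np.linalg.norm(X - center, axis=1))
    return (diameter ** 2) / (margin ** 2) / X.shape[0]
```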
29 3.2 Learning Components Our method automatically determines rectangular components from a set of object images. [sent-51, score-0.534]
30 The algorithm starts with a small rectangular component located around a pre-selected point in the object image (e.g. [sent-52, score-0.542]
31 for faces this could be the center of the left eye). [sent-54, score-0.292]
32 The component is extracted from each object image to build a training set of positive examples. [sent-55, score-0.531]
33 We also generate a training set of background patterns that have the same rectangular shape as the component. [sent-56, score-0.289]
34 After training an SVM on the component data we estimate the performance of the SVM based on the upper bound on the error probability. [sent-57, score-0.18]
35 After determining the bound we enlarge the component by expanding the rectangle by one pixel into one of the four directions (up, down, left, right). [sent-60, score-0.206]
36 Again, we generate training data, train an SVM, and determine the bound. [sent-61, score-0.107]
37 We do this for expansions into all four directions and finally keep the expansion which decreases the bound the most. [sent-62, score-0.069]
38 This process is continued until the expansions into all four directions lead to an increase of the bound. [sent-63, score-0.069]
39 In order to learn a set of components this process can be applied to different seed regions. [sent-64, score-0.205]
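The growing procedure just described amounts to a greedy loop over the four expansion directions, scored by the bound. A schematic sketch of that loop, reusing the svm_error_bound helper from above; the background-sampling routine is a hypothetical placeholder, since the paper does not spell out how the negative patches are drawn:

```python
import numpy as np

def grow_component(images, backgrounds, seed_rect):
    """Greedily grow a rectangular component from a small seed region.

    images: object images aligned so the same rectangle can be cut from each;
    seed_rect: (top, left, height, width), e.g. a 5x5 region around the left eye.
    """
    def bound_for(rect):
        t, l, h, w = rect
        pos = np.array([img[t:t + h, l:l + w].ravel() for img in images])
        neg = sample_background_patches(backgrounds, rect)   # hypothetical helper
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
        return svm_error_bound(X, y)

    rect, best = seed_rect, bound_for(seed_rect)
    while True:
        t, l, h, w = rect
        candidates = [(t - 1, l, h + 1, w),    # expand up
                      (t, l, h + 1, w),        # expand down
                      (t, l - 1, h, w + 1),    # expand left
                      (t, l, h, w + 1)]        # expand right
        new_best, new_rect = min((bound_for(c), c) for c in candidates)
        if new_best >= best:                   # every expansion increases the bound
            return rect
        best, rect = new_best, new_rect
```

Applied to each seed region in turn, this loop yields the set of learned components.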
40 4 Learning Facial Components Extracting face patterns is usually tedious and time-consuming work that has to be done manually. [sent-65, score-0.299]
41 Taking the component-based approach we would have to manually extract each single component from all images in the training set. [sent-66, score-0.325]
42 For this reason we used textured 3-D head models [16] to generate the training data. [sent-68, score-0.272]
43 By rendering the 3-D head models we could automatically generate large numbers of faces in arbitrary poses and with arbitrary illumination. [sent-69, score-0.465]
44 In addition to the 3-D information we also knew the 3-D correspondences for a set of reference points shown in Fig. 1a. [sent-70, score-0.095]
45 These correspondences allowed us to automatically extract facial components located around the reference points. [sent-72, score-0.476]
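Because the 2-D image position of each 3-D reference point is known for every rendered face, extracting a component reduces to cutting a fixed rectangle around that projected point. A small illustrative sketch (the projection itself comes from the renderer and is assumed given):

```python
def extract_component(image, ref_point, height, width):
    """Cut a height x width patch centered on the projected reference point.

    ref_point: (row, col) image coordinates of the 3-D reference point,
    known for every synthetic face because the 3-D correspondence is known.
    """
    r, c = int(round(ref_point[0])), int(round(ref_point[1]))
    top, left = r - height // 2, c - width // 2
    return image[top:top + height, left:left + width]
```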
46 Originally we had seven textured head models acquired by a 3-D scanner. [sent-73, score-0.165]
47 Additional head models were generated by 3-D morphing between all pairs of the original head models. [sent-74, score-0.281]
48 The faces were illuminated by ambient light and a single directional light pointing towards the center of the face. [sent-76, score-0.384]
49 The position of the light varied in azimuth and in elevation. [sent-77, score-0.109]
50 Overall, we generated 2,457 face images of size 58x58. [sent-78, score-0.313]
51 Some examples of synthetic face images used for training are shown in Fig. 1b. [sent-79, score-0.428]
52 The negative training set initially consisted of 10,209 58x58 non-face patterns randomly extracted from 502 non-face images. [sent-81, score-0.257]
53 We then applied bootstrapping to enlarge the training data by non-face patterns that look similar to faces. [sent-82, score-0.193]
54 To do so we trained a single linear SVM classifier and applied it to the previously used set of 502 non-face images. [sent-83, score-0.079]
55 The false positives (FPs) were added to the non-face training data to build the final training set of size 13,654. [sent-84, score-0.146]
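The bootstrapping step can be pictured as one pass of the initial classifier over the non-face images, keeping every window it wrongly accepts as an additional negative example. A sketch under the assumption that a sliding-window helper exists (it is not described in the paper):

```python
import numpy as np
from sklearn.svm import LinearSVC

def bootstrap_negatives(clf, nonface_images, window=(58, 58)):
    """Collect non-face windows that the current classifier accepts as faces."""
    false_positives = []
    for img in nonface_images:
        for patch in sliding_windows(img, window):            # hypothetical helper
            if clf.decision_function(patch.reshape(1, -1))[0] > 0:
                false_positives.append(patch.ravel())
    return np.array(false_positives)

# Usage: train on the initial positive/negative sets, then enlarge the
# negative set with the false positives and retrain.
# clf = LinearSVC().fit(X_initial, y_initial)
# extra_negatives = bootstrap_negatives(clf, nonface_images)
```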
56 We started with fourteen manually selected seed regions of size 5x5. [sent-85, score-0.249]
57 Figure 1: a) Reference points on the head models which were used for 3-D morphing and automatic extraction of facial components. [sent-87, score-0.291]
58 On the first level the component classifiers independently detect components of the face. [sent-91, score-0.316]
59 Each classifier was trained on a set of facial components and on a set of non-face patterns generated from the training set described in Section 4. [sent-92, score-0.419]
60 On the second level the combination classifier performs the detection of the face based on the outputs of the component classifiers. [sent-93, score-0.579]
61 The maximum real-valued outputs of each component classifier within rectangular search regions around the expected positions of the components are used as inputs to the combination classifier. [sent-94, score-0.469]
62 The size of the search regions was estimated from the mean and the standard deviation of the locations of the components in the training images. [sent-95, score-0.243]
63 The maximum operation is performed both during training and at run-time. [sent-96, score-0.073]
64 Interestingly it turns out to be similar to the key pooling mechanism postulated in a recent model of object recognition in the visual cortex [8]. [sent-97, score-0.299]
65 We also provide the combination classifier with the precise positions of the detected components relative to the upper left corner of the 58x58 window. [sent-98, score-0.242]
66 Overall we have three values per component classifier that are propagated to the combination classifier: the maximum output of the component classifier and the x and y image coordinates of the maximum. [sent-99, score-0.34]
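Putting the two levels together, the combination classifier sees three numbers per component: the maximum expert output inside that component's search region and the image coordinates where the maximum occurs. A sketch of how this 3*K-dimensional input might be assembled, assuming the expert response maps over the 58x58 window have already been computed:

```python
import numpy as np

def combination_features(response_maps, search_regions):
    """Build the combination-classifier input from K component experts.

    response_maps: list of K 2-D arrays with the real-valued expert output
        at every window position inside the 58x58 face window.
    search_regions: list of K (top, left, height, width) tuples, estimated
        from the mean and standard deviation of component locations.
    """
    features = []
    for resp, (top, left, h, w) in zip(response_maps, search_regions):
        region = resp[top:top + h, left:left + w]
        r, c = np.unravel_index(np.argmax(region), region.shape)
        o_k = region[r, c]                 # maximum expert output O_k
        x_k, y_k = left + c, top + r       # its location (X_k, Y_k)
        features.extend([o_k, x_k, y_k])
    return np.array(features)
```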
67 *Outputs of component experts: bright intensities indicate high confidence. [sent-100, score-0.107]
68 The component experts (e.g. a left-eye expert, each a linear SVM) are shifted over the 58x58 window; their outputs, e.g. (O14, X14, Y14) for the fourteenth component, are passed to the combination classifier, also a linear SVM. [sent-112, score-0.178]
69 For each component k, determine its maximum output within a search region and its location: (Ok, Xk, Yk). [sent-113, score-0.107]
70 Final decision: face / background. Figure 2: System overview of the component-based classifier. [sent-114, score-0.319]
71 6 Experiments In our experiments we compared the component-based system to global classifiers. [sent-115, score-0.104]
72 The component system consisted of fourteen linear SVM classifiers for detecting the components and a single linear SVM as combination classifier. [sent-116, score-0.593]
73 The global classifiers were a single linear SVM and a single second-degree polynomial SVM both trained on the gray values of the whole face pattern. [sent-117, score-0.521]
74 The training data for these three classifiers consisted of 2,457 synthetic gray face images and 13,654 non-face gray images of size 58x58. [sent-118, score-0.755]
75 The positive test set consisted of 1,834 faces rotated in depth. [sent-119, score-0.356]
76 The faces were manually extracted from the CMU PIE database [12]. [sent-120, score-0.383]
77 The negative test set consisted of 24,464 difficult non-face patterns that were collected by a fast face detector [3] from web images. [sent-121, score-0.369]
78 For an indirect comparison, we used a second-degree polynomial SVM [2] which was trained on a large set of 19x19 real face images. [sent-124, score-0.291]
79 This classifier performed amongst the best face detection systems on the MIT-CMU test set. [sent-125, score-0.372]
80 The ROC curves in Fig. 3 show that the component-based classifier is significantly better than the three global classifiers. [sent-127, score-0.062]
81 Some detection results generated by the component system are shown in Fig. 4. [sent-128, score-0.275]
82 Figure 3: Comparison between global classifiers and the component-based classifier. [sent-130, score-0.062]
83 A natural question that arises concerns the role of geometrical information. [sent-132, score-0.112]
84 To answer this question–which has relevant implications for models of cortex–we tested another system in which the combination classifier receives as inputs only the output of each component classifier but not the position of its maximum. [sent-133, score-0.279]
85 As shown in Fig. 5, this system still outperforms the whole-face systems but it is worse than the system with position information. [sent-135, score-0.432]
86 Figure 5: Comparison between a component-based classifier trained with position information and a component-based classifier without position information. [sent-136, score-0.171]
87 7 Open Questions An extension of the component-based approach to face identification is under way and is already showing good performance [1]. [sent-137, score-0.246]
88 Another natural generalization of the work described here involves the application of our system to various classes of objects such as cars, animals, and people. [sent-138, score-0.081]
89 Still another extension regards the question of view-invariant object detection. [sent-139, score-0.267]
90 As suggested by [7] in a biological context and demonstrated recently by [11] in machine vision, full pose invariance in recognition tasks can be achieved by combining view-dependent classifiers. [sent-140, score-0.329]
91 It is interesting to ask whether the approach described here could also be used to learn which views are most discriminative and how to combine them optimally. [sent-141, score-0.125]
92 Finally, the role of geometry and in particular how to compute and represent position information in biologically plausible networks, is an important open question at the interface between machine and biological vision. [sent-142, score-0.184]
93 Face recognition with support vector machines: global versus component-based approach. [sent-147, score-0.13]
94 Feature reduction and hierarchy of classifiers for fast object detection in video images. [sent-163, score-0.391]
95 Finding faces in cluttered scenes using random labeled graph matching. [sent-172, score-0.244]
96 International Conference on Computer Vision, pages 637–644, Cambridge, MA, 1995. [sent-174, score-0.037]
97 In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 23, pages 349–361, April 2001. [sent-180, score-0.037]
98 In International Journal of Computer Vision, volume 38, 1, pages 15–33, 2000. [sent-185, score-0.037]
99 A statistical method for 3d object detection applied to faces and cars. [sent-215, score-0.601]
100 An original approach for the localisation of objects in images. [sent-233, score-0.039]
wordName wordTfidf (topN-words)
[('classi', 0.372), ('svm', 0.282), ('face', 0.246), ('faces', 0.244), ('object', 0.231), ('er', 0.194), ('mouth', 0.168), ('ers', 0.139), ('heisele', 0.134), ('nose', 0.134), ('components', 0.131), ('pixels', 0.128), ('expert', 0.127), ('detection', 0.126), ('facial', 0.117), ('component', 0.107), ('head', 0.107), ('cmu', 0.099), ('pose', 0.096), ('gray', 0.095), ('rectangular', 0.092), ('vision', 0.089), ('biological', 0.085), ('automatically', 0.08), ('manually', 0.078), ('detect', 0.078), ('geometrical', 0.076), ('seed', 0.074), ('training', 0.073), ('experts', 0.071), ('discriminative', 0.07), ('consisted', 0.07), ('recognition', 0.068), ('enlarge', 0.067), ('freiburg', 0.067), ('ggf', 0.067), ('morphing', 0.067), ('siena', 0.067), ('vetter', 0.067), ('images', 0.067), ('combination', 0.067), ('poggio', 0.067), ('position', 0.063), ('parts', 0.063), ('global', 0.062), ('extracted', 0.061), ('image', 0.059), ('cars', 0.058), ('fourteen', 0.058), ('papageorgiou', 0.058), ('pontil', 0.058), ('serre', 0.058), ('textured', 0.058), ('views', 0.055), ('located', 0.053), ('pie', 0.053), ('patterns', 0.053), ('eye', 0.05), ('detecting', 0.05), ('correspondences', 0.05), ('illumination', 0.05), ('ok', 0.05), ('center', 0.048), ('invariance', 0.048), ('frontal', 0.047), ('lters', 0.047), ('light', 0.046), ('trained', 0.045), ('reference', 0.045), ('corner', 0.044), ('robustness', 0.043), ('rotated', 0.042), ('system', 0.042), ('synthetic', 0.042), ('pattern', 0.041), ('eyes', 0.041), ('guration', 0.041), ('ma', 0.041), ('regions', 0.039), ('whole', 0.039), ('computer', 0.039), ('objects', 0.039), ('hierarchical', 0.039), ('pages', 0.037), ('background', 0.037), ('international', 0.037), ('expansions', 0.037), ('yk', 0.037), ('question', 0.036), ('challenge', 0.036), ('overview', 0.036), ('generate', 0.034), ('linear', 0.034), ('conference', 0.034), ('hierarchy', 0.034), ('outputs', 0.033), ('nal', 0.033), ('combining', 0.032), ('directions', 0.032), ('published', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.36324865 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
3 0.23627912 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
Author: Hiroshi Shimodaira, Ken-ichi Noma, Mitsuru Nakai, Shigeki Sagayama
Abstract: A new class of Support Vector Machine (SVM) that is applicable to sequential-pattern recognition such as speech recognition is developed by incorporating an idea of non-linear time alignment into the kernel function. Since the time-alignment operation of sequential pattern is embedded in the new kernel function, standard SVM training and classification algorithms can be employed without further modifications. The proposed SVM (DTAK-SVM) is evaluated in speaker-dependent speech recognition experiments of hand-segmented phoneme recognition. Preliminary experimental results show comparable recognition performance with hidden Markov models (HMMs). 1
4 0.23285052 54 nips-2001-Contextual Modulation of Target Saliency
Author: Antonio Torralba
Abstract: The most popular algorithms for object detection require the use of exhaustive spatial and scale search procedures. In such approaches, an object is defined by means of local features. fu this paper we show that including contextual information in object detection procedures provides an efficient way of cutting down the need for exhaustive search. We present results with real images showing that the proposed scheme is able to accurately predict likely object classes, locations and sizes. 1
5 0.21033612 60 nips-2001-Discriminative Direction for Kernel Classifiers
Author: Polina Golland
Abstract: In many scientific and engineering applications, detecting and understanding differences between two groups of examples can be reduced to a classical problem of training a classifier for labeling new examples while making as few mistakes as possible. In the traditional classification setting, the resulting classifier is rarely analyzed in terms of the properties of the input data captured by the discriminative model. However, such analysis is crucial if we want to understand and visualize the detected differences. We propose an approach to interpretation of the statistical model in the original feature space that allows us to argue about the model in terms of the relevant changes to the input vectors. For each point in the input space, we define a discriminative direction to be the direction that moves the point towards the other class while introducing as little irrelevant change as possible with respect to the classifier function. We derive the discriminative direction for kernel-based classifiers, demonstrate the technique on several examples and briefly discuss its use in the statistical shape analysis, an application that originally motivated this work.
6 0.20423438 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
7 0.18043257 74 nips-2001-Face Recognition Using Kernel Methods
8 0.17197716 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
9 0.16782357 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
10 0.16529503 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
11 0.16088513 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
12 0.15210479 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
13 0.13186419 144 nips-2001-Partially labeled classification with Markov random walks
14 0.13115445 84 nips-2001-Global Coordination of Local Linear Models
15 0.13088034 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
16 0.12822123 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
17 0.12243807 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
18 0.12024499 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
19 0.11536678 105 nips-2001-Kernel Machines and Boolean Functions
20 0.10839516 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
topicId topicWeight
[(0, -0.312), (1, 0.14), (2, -0.174), (3, 0.264), (4, -0.189), (5, 0.191), (6, -0.298), (7, -0.087), (8, 0.16), (9, 0.05), (10, 0.087), (11, -0.115), (12, 0.116), (13, 0.003), (14, -0.067), (15, -0.052), (16, 0.152), (17, -0.064), (18, 0.061), (19, -0.16), (20, -0.037), (21, -0.112), (22, 0.042), (23, -0.09), (24, -0.06), (25, -0.049), (26, -0.083), (27, -0.111), (28, 0.023), (29, -0.013), (30, 0.003), (31, -0.016), (32, -0.011), (33, -0.046), (34, 0.03), (35, -0.01), (36, 0.003), (37, -0.064), (38, 0.005), (39, -0.021), (40, 0.03), (41, 0.028), (42, 0.017), (43, -0.047), (44, 0.051), (45, 0.024), (46, 0.009), (47, 0.001), (48, 0.065), (49, -0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.98364532 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.82083762 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
3 0.67458779 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz
Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier’s ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it achieves to the service provider. When designing a classifier, one must specify an objective measure by which the classifier’s performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/ or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier’s ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier’s probability estimate is interpreted as an “accept,” and below which is interpreted as a “reject”—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 0 0 correct rejection rate 20 40 60 80 100 100 (b) correct rejection rate 20 40 60 80 (a) 0 20 40 60 80 100 correct acceptance rate 0 20 40 60 80 100 correct acceptance rate FIGURE 1. (a) two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) two plausible ROC curves, neither of which is clearly superior to the other. and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing. The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions. 
However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. 
The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. 
Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA αCA and CR αCR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve. 2 FOUR ALGORITHMS TO PROD THE ROC CURVE In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of αCA and a target CR rate of αCR. 2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES Suppose we train a classifier on a set of positive and negative examples from a class— churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θCA and θCR, which yield CA and CR rates of αCA and αCR, respectively. If the classifier’s discrimination performance fails to achieve the target CA and CR rates, then θCA will be lower than θCR, as depicted in the Figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θCA and θCR; the positive examples must be pushed above θCR and the negative examples must be pushed below θCA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the Figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier’s training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of emphasis, or λ in + 1 = κ e λ in otherwise, where κe is a constant, κe > 1. 2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θCA—such as e2—are clearly the most dif ficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θCR does not depend on examples such as e2, and threshold θCA allows a fraction (1–αCA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θCR—such as e10 and e11—need not be of concern. Essentially , the second algorithm, which we term the eemd phasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct clasθCA e1 e2 e3 0 e4 θCR e5 e6 e7 e8 churn probability e9 e10 e11 e12 e13 1 FIGURE 2. 
A schematic depiction of all training examples arranged by the classifier’s posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). sification is unnecessary to achieve the target CA and CR rates for classifier n. As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ in . For classifier 1, λ i1 = 1 . For subsequent classifiers, λ in + 1 = λ in if example i is not in the region of deemphasis, or λ in + 1 = κ d λ in otherwise, where κd is a constant, κd <1. 2.3 CONSTRAINED OPTIMIZATION The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to αCA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to αCR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions: 1 1 CA ( w, t ) = ----- ∑ σ β ( f ( x i, w ) – t ) CR ( w, t ) = ------ ∑ σ β ( t – f ( x i, w ) ) , P i∈P N i∈N where P and N are the set of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σβ –1 is a sigmoid function with scaling parameter β: σ β ( y ) = ( 1 + exp ( – βy ) ) . The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing an objective function 2 µ A ( w, t ) = CR ( w, t ) + ν CA ( w, t ) – α CA + -- CA ( w, t ) – α CA 2 with a fixed Lagrangian multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ CA ( w *, t * ) – α CA , where w * and t * are the values found by the optimization step. We initialize ν = 1 and fix µ = 1 and β = 10 and iterate until ν converges. 2.4 GENETIC ALGORITHM The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θCA and θCR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task. 3 METHODOLOGY For our tests, we studied two large data bases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each data base, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000). 
We chose these two large, real world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α CR = 0.90 and α CA = 0.50 , values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a no hidden unit neural network) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κe = 1.3 and κd = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set. 4 RESULTS Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right show the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (αCA, αCR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information. 5 DISCUSSION In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the impro vement results in significant dollar savings. 
Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real world data collection (Mozer et 0.845 0.84 0.39 0.835 0.385 CR rate CA rate 0.4 0.395 0.38 0.83 0.825 0.375 0.82 0.37 0.815 0.365 0.81 0 5 10 15 20 25 30 35 40 45 50 Iteration 0 5 10 15 20 25 30 35 40 45 50 Iteration FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/–1 standard error of the mean. Data set 1 0.835 0.380 0.830 0.375 0.825 CR rate ISP Test Set 0.840 0.385 CA rate 0.390 0.370 0.820 0.365 0.815 0.360 0.810 0.355 0.805 0.350 0.800 std emph deemph constr GA std emph deemph constr GA std emph deemph constr GA 0.900 0.375 0.350 CR rate Data set 2 0.875 CA rate Wireless Test Set 0.850 0.325 0.825 0.300 0.800 std emph deemph constr GA FIGURE 4. Cross-validation performance on the two data sets for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for CR rate .9; the right column shows the CR rate for CA rate .5. The error bar indicates one standard error of the mean over the 10 data splits. al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrate the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches. 6 ACKNOWLEDGEMENTS No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549–01A1. 7 REFERENCES Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic. Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (1019–1026). San Mateo, CA: Morgan Kaufmann. Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696. Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D. 
Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.
4 0.61823434 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition
Author: William M. Campbell
Abstract: A novel approach for comparing sequences of observations using an explicit-expansion kernel is demonstrated. The kernel is derived using the assumption of the independence of the sequence of observations and a mean-squared error training criterion. The use of an explicit expansion kernel reduces classifier model size and computation dramatically, resulting in model sizes and computation one-hundred times smaller in our application. The explicit expansion also preserves the computational advantages of an earlier architecture based on mean-squared error training. Training using standard support vector machine methodology gives accuracy that significantly exceeds the performance of state-of-the-art mean-squared error training for a speaker recognition task.
5 0.60950029 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).
6 0.60720134 60 nips-2001-Discriminative Direction for Kernel Classifiers
7 0.56934714 54 nips-2001-Contextual Modulation of Target Saliency
8 0.56283855 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
9 0.54743534 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
10 0.50980198 159 nips-2001-Reducing multiclass to binary by coupling probability estimates
11 0.47762001 182 nips-2001-The Fidelity of Local Ordinal Encoding
12 0.46730462 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates
13 0.45947522 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
14 0.43368235 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
15 0.42956501 25 nips-2001-Active Learning in the Drug Discovery Process
16 0.40931344 144 nips-2001-Partially labeled classification with Markov random walks
17 0.40332642 74 nips-2001-Face Recognition Using Kernel Methods
18 0.39473438 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
19 0.38365731 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
20 0.37824166 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
topicId topicWeight
[(14, 0.032), (17, 0.016), (19, 0.034), (27, 0.121), (30, 0.134), (38, 0.06), (59, 0.056), (72, 0.091), (76, 0.197), (79, 0.03), (83, 0.011), (91, 0.128)]
simIndex simValue paperId paperTitle
same-paper 1 0.8886686 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
2 0.77273905 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
Author: Dieter Fox
Abstract: Over the last years, particle filters have been applied with great success to a variety of state estimation problems. We present a statistical approach to increasing the efficiency of particle filters by adapting the size of sample sets on-the-fly. The key idea of the KLD-sampling method is to bound the approximation error introduced by the sample-based representation of the particle filter. The name KLD-sampling is due to the fact that we measure the approximation error by the Kullback-Leibler distance. Our adaptation approach chooses a small number of samples if the density is focused on a small part of the state space, and it chooses a large number of samples if the state uncertainty is high. Both the implementation and computation overhead of this approach are small. Extensive experiments using mobile robot localization as a test application show that our approach yields drastic improvements over particle filters with fixed sample set sizes and over a previously introduced adaptation technique.
3 0.76408517 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
4 0.75911355 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
5 0.75743026 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
Author: Gregor Wenning, Klaus Obermayer
Abstract: Cortical neurons might be considered as threshold elements integrating in parallel many excitatory and inhibitory inputs. Due to the apparent variability of cortical spike trains this yields a strongly fluctuating membrane potential, such that threshold crossings are highly irregular. Here we study how a neuron could maximize its sensitivity w.r.t. a relatively small subset of excitatory input. Weak signals embedded in fluctuations is the natural realm of stochastic resonance. The neuron's response is described in a hazard-function approximation applied to an Ornstein-Uhlenbeck process. We analytically derive an optimality criterium and give a learning rule for the adjustment of the membrane fluctuations, such that the sensitivity is maximal exploiting stochastic resonance. We show that adaptation depends only on quantities that could easily be estimated locally (in space and time) by the neuron. The main results are compared with simulations of a biophysically more realistic neuron model. 1
6 0.75354934 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
7 0.75186962 60 nips-2001-Discriminative Direction for Kernel Classifiers
8 0.7517494 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
9 0.74701571 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
10 0.74562383 185 nips-2001-The Method of Quantum Clustering
11 0.74414545 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
12 0.74042821 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
13 0.73980451 182 nips-2001-The Fidelity of Local Ordinal Encoding
14 0.73850644 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes
16 0.73731339 4 nips-2001-ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition
17 0.73555928 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
18 0.73537171 56 nips-2001-Convolution Kernels for Natural Language
19 0.73499179 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
20 0.73378026 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing