nips nips2001 nips2001-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between positive and negative. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.
Reference: text
sentIndex sentText sentNum sentScore
1 In each iteration a comparatively small batch of compounds is screened for binding to the target. [sent-15, score-0.526]
2 One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. [sent-17, score-0.857]
3 Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between positive and negative. [sent-19, score-0.492]
4 For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. [sent-20, score-0.535]
5 We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. [sent-21, score-0.737]
6 We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing. [sent-22, score-0.413]
7 1 Introduction: Two of the most important goals in Computational Drug Design are to find active compounds in large databases quickly and (usually along the way) to obtain an interpretable model for what makes a specific subset of compounds active. [sent-23, score-0.426]
8 That is, in each iteration a batch of unlabeled compounds is screened against the target using some sort of biological assay [MGST97]. [sent-28, score-0.702]
9 The desired goal is that many active hits show up in the assays of the selected batches. [sent-29, score-0.29]
10 In each iteration the learner selects a batch of unlabeled examples to be labeled as positive (active) or negative (inactive). [sent-31, score-0.521]
11 Note that our classification problem is fundamentally asymmetric in that the data sets typically have many more negative examples, and the Chemists are more interested in the positive hits because these compounds might lead to new drugs. [sent-36, score-0.574]
12 However, we simulate the real-life situation by initially hiding all labels and only giving the algorithm the labels for the requested batches of examples (virtual screening). [sent-40, score-0.273]
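The virtual-screening setup above amounts to a simple active-learning loop. The sketch below is illustrative only: the function and variable names (simulate_virtual_screening, select_batch, y_hidden) are ours rather than from the paper's software, and the compound descriptors are assumed to be rows of a feature matrix X with hidden labels in {-1, +1}.

    import numpy as np

    def simulate_virtual_screening(X, y_hidden, select_batch, batch_frac=0.05, seed=0):
        # Hide all labels, reveal them batch by batch, and record the hits per round.
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        batch_size = max(1, int(round(batch_frac * n)))
        labeled = list(rng.choice(n, size=batch_size, replace=False))  # random initial batch
        hits_per_round = [int(np.sum(y_hidden[labeled] == 1))]
        while len(labeled) < n:
            unlabeled = np.setdiff1d(np.arange(n), labeled)
            # the strategy sees only the labels revealed so far and must return a
            # non-empty subset of `unlabeled` with at most batch_size elements
            chosen = select_batch(X, np.array(labeled), y_hidden[labeled], unlabeled, batch_size)
            labeled.extend(int(i) for i in chosen)
            hits_per_round.append(int(np.sum(y_hidden[labeled] == 1)))
        return hits_per_round

Any of the selection strategies sketched later in this section (random, SVM, VoPerc, VolEst and their "+" variants) can be plugged in as select_batch.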
13 The long-term goal of this type of research is to provide a computer program to the Chemists which will do the following interactive job: At any point new unlabeled examples may be added. [sent-41, score-0.388]
14 Whenever a new test needs to be set up, the Chemist asks the program to suggest a batch of unlabeled compounds. [sent-43, score-0.486]
15 The suggested batch might be “edited” and augmented using the invaluable knowledge and intuition of the medicinal Chemist. [sent-44, score-0.297]
16 Even though compound descriptors can be computed, the compounds have not been synthesized yet. (Figure 1: Three types of compounds/points: active, inactive, and unlabeled.) [sent-47, score-0.262]
17 In other words it is comparatively easy to generate lots of unlabeled data. [sent-48, score-0.248]
18 Our selection strategies do much better than choosing random batches indicating that the long-term goal outlined above may be feasible. [sent-54, score-0.356]
19 Thus from the Machine Learning point of view we have a fixed set of points in $\mathbb{R}^d$ that are either unlabeled or labeled positive or negative. [sent-55, score-0.382]
20 The binary descriptors of the compounds are rather “complete” and the data is always linearly separable. [sent-57, score-0.238]
21 In the next section we describe different selection strategies on the basis of these hyperplanes in detail and provide an experimental comparison. [sent-60, score-0.275]
22 2 Different Selection Criteria and their Performance A selection algorithm is specified in three parts: a batch size, an initialization and a selection strategy. [sent-64, score-0.468]
23 We always chose 5% of the total data set as our batch size, which matches reasonably with typical experimental constraints. [sent-66, score-0.341]
24 All further batches are chosen using the selection strategy. [sent-69, score-0.202]
25 As we mentioned in the introduction, all our selection strategies are based on linear classifiers of the data labeled so far. [sent-70, score-0.315]
26 All examples are normalized to unit-length and we consider homogeneous hyperplanes where the normal direction is again unit-length. [sent-71, score-0.186]
27 Once we specify how the weight vector is found, the next batch is formed by selecting the unlabeled examples closest to the corresponding hyperplane. [sent-73, score-0.856]
28 The simplest way to obtain such a weight vector is to run a perceptron over the labeled data until it produces a consistent weight vector (Perc). [sent-74, score-0.251]
29 Our second selection strategy (called SVM) uses the maximum margin hyperplane [BGV92] produced by a Support Vector Machine. [sent-75, score-0.458]
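A hedged sketch of this SVM selection step, using scikit-learn's linear SVC as a stand-in for the paper's maximum margin solver; the large C value approximating a hard margin and the name svm_select_batch are our assumptions, not the paper's settings:

    import numpy as np
    from sklearn.svm import SVC

    def svm_select_batch(X, labeled_idx, y_labeled, unlabeled_idx, batch_size):
        # requires both classes to be present among the labeled examples
        clf = SVC(kernel="linear", C=1e6)   # large C approximates the hard-margin SVM
        clf.fit(X[labeled_idx], y_labeled)
        dist = clf.decision_function(X[unlabeled_idx])
        order = np.argsort(np.abs(dist))    # smallest |distance to hyperplane| first
        return unlabeled_idx[order[:batch_size]]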
30 For the voted-perceptron strategy (VoPerc) we remember all weight vectors produced during each pass, and we do this for 100 random permutations of the labeled examples. [sent-79, score-0.192]
31 The prediction on an example is positive if the total vote is larger than zero, and we select the unlabeled examples whose total vote is closest to zero. [sent-81, score-0.492]
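A rough sketch of this voted-perceptron (VoPerc) committee; storing one weight vector per mistake and the particular permutation and epoch counts are our simplifications of the scheme described above:

    import numpy as np

    def voperc_select_batch(X, labeled_idx, y_labeled, unlabeled_idx, batch_size,
                            n_perm=100, max_epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        Xl, yl = X[labeled_idx], np.asarray(y_labeled)
        committee = []                            # all intermediate weight vectors
        for _ in range(n_perm):
            w = np.zeros(X.shape[1])
            perm = rng.permutation(len(yl))       # one random permutation per run
            for _ in range(max_epochs):
                mistakes = 0
                for i in perm:
                    if yl[i] * np.dot(w, Xl[i]) <= 0:
                        w = w + yl[i] * Xl[i]
                        committee.append(w.copy())
                        mistakes += 1
                if mistakes == 0:                 # consistent on the labeled data
                    break
        votes = np.sign(X[unlabeled_idx] @ np.stack(committee).T).sum(axis=1)
        order = np.argsort(np.abs(votes))         # most evenly split committee vote first
        return unlabeled_idx[order[:batch_size]]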
32 So when $\mathbf{w} \cdot \mathbf{x} > 0$, the point $\mathbf{x}$ lies on the positive side of the hyperplane $\{\tilde{\mathbf{x}} : \mathbf{w} \cdot \tilde{\mathbf{x}} = 0\}$. [sent-85, score-0.265]
33 In a dual view, the point $\mathbf{w}$ lies on the positive side of the hyperplane $\{\tilde{\mathbf{w}} : \mathbf{x} \cdot \tilde{\mathbf{w}} = 0\}$ (recall that all instances and weight vectors have unit length). [sent-86, score-0.349]
34 A weight vector $\mathbf{w}$ that is consistent with all labeled examples must, for every example $\mathbf{x}$ with label $y$, lie on the $y$-side of the plane $\{\tilde{\mathbf{w}} : \mathbf{x} \cdot \tilde{\mathbf{w}} = 0\}$. [sent-87, score-0.232]
35 The set of all consistent weight vectors is called the version space which is a section of the unit hypersphere bounded by the planes corresponding to the labeled examples. [sent-88, score-0.198]
36 For our third selection strategy (VolEst), a billiard is bounced 1000 times inside the version space and, for each unlabeled example, the fraction $p$ of bounce points on the positive side of its hyperplane is computed. [sent-90, score-0.679]
37 The prediction for an unlabeled example is positive if $p$ is larger than one half, and the strategy selects the unlabeled points whose fraction $p$ is closest to one half. [sent-91, score-0.686]
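The billiard itself is not reproduced here; the sketch below only shows the VolEst selection step, assuming W holds (approximately) uniformly sampled unit-length weight vectors from the version space, e.g. the roughly 1000 billiard bounce points mentioned above:

    import numpy as np

    def volest_select_batch(X, unlabeled_idx, W, batch_size):
        # W: (n_samples, d) array of sampled consistent weight vectors
        p = (X[unlabeled_idx] @ W.T > 0).mean(axis=1)  # estimated positive-side fraction
        order = np.argsort(np.abs(p - 0.5))            # most even split of the version space first
        return unlabeled_idx[order[:batch_size]]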
38 In Figure 2 (left) we plot the true positives and false positives w.r.t. the fraction of examples selected. [sent-92, score-0.454]
39 Figure 3 (left) shows the averaged true positives and false positives of VoPerc, SVM, and VolEst. [sent-97, score-0.428]
40 We also plotted ROC curves after each batch has been added (not shown). [sent-99, score-0.264]
41 The three strategies VoPerc, SVM, and VolEst all perform much better than the corresponding strategies where the selection criterion is to select random unlabeled examples instead of using a “closest” criterion. [sent-101, score-0.782]
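For reference, the random baseline these strategies are compared against simply draws each batch uniformly from the unlabeled pool (a trivial sketch, with illustrative names):

    import numpy as np

    def random_select_batch(unlabeled_idx, batch_size, seed=0):
        rng = np.random.default_rng(seed)
        return rng.choice(unlabeled_idx, size=min(batch_size, len(unlabeled_idx)), replace=False)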
42 The reason is that the Round0 data set has a smaller fraction of positive examples (3%). [sent-104, score-0.34]
43 Instead of voting the predictions of all weight vectors [FS98], one can also average all the weight vectors after normalizing them and select the unlabeled examples closest to the resulting single weight vector. [sent-106, score-0.774]
46 Figure 2: (left) Average (over 10 runs) of true positives and false positives on the entire Round1 data set after each 5% batch for Perc and VoPerc; the right panel shows the corresponding standard deviations. [sent-116, score-1.059]
49 Figure 3: (left) Average (over 10 runs) of true and false positives on the entire Round1 data set after each 5% batch for VoPerc, SVM, and VolEst. [sent-126, score-0.943]
50 (right) Comparison of 5% batch size and 1 example batch size for VoPerc on Round1 data. [sent-127, score-0.528]
51 Recall that the Round1 data was preselected by the Chemists for actives, which raised the fraction of positives to about 25%. [sent-128, score-0.198]
52 This suggests that our methods are particularly suitable when few positive examples are hidden in a large set of negative examples. [sent-129, score-0.197]
53 The simple strategy SVM of choosing unlabeled examples closest to the maximum margin hyperplane has been investigated by other authors (in [CCS00] for character recognition and in [TK00] for text categorization). [sent-130, score-0.864]
54 The labeled points that are closest to the hyperplane are called the support vectors because if all other points are removed then the maximum margin hyperplane remains unchanged. [sent-131, score-0.748]
55 For each 5% batch the locations of the points are scattered onto a thin stripe. [sent-134, score-0.31]
56 In the right plot the geometric distance to the hyperplane is plotted. [sent-137, score-0.276]
57 Recall that we pick unlabeled points closest to the hyperplane (center of the stripe). [sent-138, score-0.582]
58 As soon as the “window” between the support vectors has been cleared, most positive examples have been found (compare with the SVM curves given in Figure 3 (left)). [sent-139, score-0.261]
59 So far our three selection strategies VoPerc, SVM and VolEst have shown similar performance. [sent-141, score-0.233]
60 Here the goal is to label/verify many positive compounds quickly. [sent-143, score-0.236]
61 We therefore think that the total number of positives (hits) among all examples tested so far is the best performance criterion. [sent-144, score-0.315]
62 Figure 4: Comparisons of SVM using random batch selection and closest batch selection. [sent-192, score-1.485]
63 Figure 5: (left) Scatter plot of the distance of examples to the maximum margin hyperplane, normalized so that support vectors are at 1. [sent-204, score-0.77]
64 (right) Scatter plot of the geometric distance of examples to the hyperplane. [sent-205, score-0.238]
65 Each stripe shows location of a random sub-sample of points (Round1 data) after an additional 5% batch has been labeled by SVM. [sent-206, score-0.436]
66 Selected examples are black x, unselected positives are red plus, unselected negatives are blue square. [sent-207, score-0.326]
67 In contrast, the total number of hits of VoPerc, SVM and VolEst matches random selection in the first 5% batch (since that batch is chosen at random) but grows much faster thereafter (see Figure 6). [sent-209, score-0.491]
68 Since the positive examples are much more valuable in our application, we also changed the selection strategy SVM to select the unlabeled examples of largest positive distance to the maximum margin hyperplane (SVM+) rather than smallest absolute distance. [sent-211, score-1.204]
69 Correspondingly, VoPerc+ picks the unlabeled examples with the highest vote and VolEst+ picks the unlabeled examples with the largest fraction $p$. [sent-212, score-0.768]
70 The total hit plots of the resulting modified strategies SVM+, VoPerc+ and VolEst+ are improved (see Figure 7 versus Figure 6). [sent-213, score-0.273]
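The "+" variants change only the final ranking step: instead of the unlabeled examples with the smallest |score| (exploration), they take the ones with the largest score (exploitation). In this sketch, score stands for the SVM distance, the VoPerc vote, or the VolEst fraction (with center 0.5 for the latter); the function names are illustrative:

    import numpy as np

    def select_explore(unlabeled_idx, score, batch_size, center=0.0):
        order = np.argsort(np.abs(score - center))  # closest to the decision boundary first
        return unlabeled_idx[order[:batch_size]]

    def select_exploit(unlabeled_idx, score, batch_size):
        order = np.argsort(-score)                  # most confidently positive first
        return unlabeled_idx[order[:batch_size]]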
71 Thus in some sense the original strategies are better at “exploration” (giving better generalization on the entire data set) while the modified strategies are better at “exploitation” (a higher total number of hits). [sent-217, score-0.339]
72 Finally we investigate the effect of batch size on performance. [sent-220, score-0.264]
73 Note that for our data a batch size of 5% (31 examples for Round1) performs not much worse than the experimentally unrealistic batch size of only 1 example. [sent-222, score-0.694]
74 Only when the results for batch size 1 are much better than the results for larger batch sizes would more sophisticated selection strategies be worth exploring, e.g. ones that pick a batch that is “close” and at the same time “diverse”. (Footnote 5: in Figure 5 this means we are selecting from right to left.) [sent-223, score-0.583]
76 Figure 6: Total hit performance on Round0 (left) and Round1 (right) data of SVM, VoPerc and VolEst with 5% batch size. [sent-231, score-0.697]
79 Figure 7: Total hit performance on Round0 (left) and Round1 (right) data of SVM+, VoPerc+ and VolEst+ with 5% batch size. [sent-240, score-0.697]
81 After this preprocessing, one pass of a perceptron costs at most $O(\ell M)$ operations, where $\ell$ is the number of labeled examples and $M$ the number of mistakes. [sent-243, score-0.255]
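To illustrate why a perceptron pass is cheap once the pairwise dot products are precomputed, here is a kernel (dual) perceptron pass over a precomputed Gram matrix; with the dual coefficients kept sparse, each prediction costs one lookup per mistake made so far (the dense dot product below is written out for clarity only):

    import numpy as np

    def perceptron_pass_dual(K, y, alpha):
        # K: precomputed Gram matrix over the labeled examples, y: labels in {-1, +1},
        # alpha: dual coefficients (signed mistake counts), updated in place
        mistakes = 0
        for i in range(len(y)):
            margin = np.dot(alpha, K[:, i])
            if y[i] * margin <= 0:
                alpha[i] += y[i]
                mistakes += 1
        return mistakes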
82 Finding the maximum margin hyperplane is more expensive; its estimated running time grows superlinearly in the number of labeled examples. [sent-244, score-0.277]
83 Given the internal hypothesis of the algorithm, applying the selection criterion requires evaluating the hypothesis on each unlabeled point. [sent-247, score-0.324]
84 3 Theoretical Justifications: As we see in Figure 5 (right), the geometric margin of the support vectors (half the width of the window) shrinks as more examples are labeled. [sent-252, score-0.361]
85 Thus the following goal is reasonable for designing selection strategies: pick unlabeled examples that cause the margin to shrink the most. [sent-253, score-0.601]
86 The simplest such strategy is to pick examples closest to the maximum margin hyperplane, since these examples are expected to change the maximum margin hyperplane the most [TK00, CCS00]. [sent-254, score-1.193]
88 Figure 8: Exploitation versus exploration: (left) total hit performance and (right) true and false positives performance of SVM+ and SVM on Round1 data. [sent-262, score-0.704]
89 The maximum margin hyperplane is (in the dual view) the point in version space at the center of the largest sphere that is completely contained in the version space, and the maximum margin is the radius of this sphere, i.e. the distance of this point to the closest bounding plane [Ruj97, RM00]. [sent-267, score-0.509]
90 So this is a second justification for selecting unlabeled examples closest to the maximum margin hyperplane. [sent-270, score-0.665]
91 Our selection strategy VolEst starts from any point inside the version space and then bounces a billiard 1000 times. [sent-271, score-0.411]
92 Thus the fraction $p$ of bounces on the positive side of the hyperplane corresponding to an unlabeled example is an estimate of the fraction of version-space volume on the positive side of that hyperplane. [sent-273, score-0.902]
93 We tried to improve our estimate of the volume by replacing $p$ by the fraction of the total billiard trajectory located on the positive side of that hyperplane. [sent-277, score-0.316]
94 The resulting weight vector (an approximation to the center of mass of the version space) approximates the so-called Bayes point [Ruj97], which has the following property: any hyperplane corresponding to an unlabeled example that passes through the Bayes point cuts the version space roughly in half. [sent-280, score-0.597]
95 We thus tested a selection strategy which picks unlabeled points closest to the estimated center of mass. [sent-281, score-0.635]
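A sketch of this center-of-mass selection, reusing the sampled weight vectors W from the VolEst sketch above; averaging the normalized samples is only an approximation to the center of mass (and hence to the Bayes point):

    import numpy as np

    def bayes_point_select_batch(X, unlabeled_idx, W, batch_size):
        W_unit = W / np.linalg.norm(W, axis=1, keepdims=True)
        w_center = W_unit.mean(axis=0)
        w_center = w_center / np.linalg.norm(w_center)   # approximate center of mass / Bayes point
        dist = X[unlabeled_idx] @ w_center
        order = np.argsort(np.abs(dist))                 # closest to the corresponding hyperplane first
        return unlabeled_idx[order[:batch_size]]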
96 This strategy was again indistinguishable from the other two strategies based on bouncing the billiard. [sent-282, score-0.236]
97 After some deliberations we concluded that the total number of positive examples (hits) among the tested examples is the best performance criterion for the drug design application. [sent-285, score-0.509]
98 We showed that a number of different selection strategies achieve comparable performance. [sent-287, score-0.233]
99 The variants that select the unlabeled examples with the highest score (i.e. SVM+, VoPerc+ and VolEst+) yield the largest total number of hits. [sent-288, score-0.417]
100 Overall the selection strategies based on the Voted Perceptron were the most versatile and showed slightly better performance. [sent-291, score-0.233]
wordName wordTfidf (topN-words)
[('voperc', 0.45), ('volest', 0.35), ('pos', 0.343), ('batch', 0.264), ('unlabeled', 0.222), ('compounds', 0.183), ('svm', 0.175), ('hits', 0.172), ('hyperplane', 0.155), ('false', 0.15), ('examples', 0.144), ('fraction', 0.143), ('closest', 0.142), ('strategies', 0.131), ('perc', 0.117), ('positives', 0.116), ('selection', 0.102), ('batches', 0.1), ('billiard', 0.1), ('margin', 0.095), ('drug', 0.087), ('strategy', 0.079), ('bounce', 0.067), ('hit', 0.066), ('vote', 0.066), ('labeled', 0.06), ('active', 0.06), ('selected', 0.058), ('total', 0.055), ('version', 0.054), ('positive', 0.053), ('perceptron', 0.051), ('chemists', 0.05), ('weight', 0.049), ('true', 0.046), ('compound', 0.046), ('picks', 0.046), ('stripe', 0.043), ('dupont', 0.043), ('hyperplanes', 0.042), ('justi', 0.042), ('classi', 0.042), ('plane', 0.039), ('split', 0.039), ('pick', 0.038), ('distance', 0.036), ('vectors', 0.035), ('selecting', 0.035), ('side', 0.035), ('bounces', 0.033), ('chemist', 0.033), ('descriptors', 0.033), ('medicinal', 0.033), ('preselected', 0.033), ('ruj', 0.033), ('screened', 0.033), ('unselected', 0.033), ('query', 0.033), ('geometric', 0.032), ('left', 0.03), ('volume', 0.03), ('bisects', 0.029), ('chemistry', 0.029), ('requested', 0.029), ('select', 0.029), ('support', 0.029), ('right', 0.027), ('maximum', 0.027), ('comparatively', 0.026), ('bouncing', 0.026), ('shrinking', 0.026), ('plot', 0.026), ('design', 0.026), ('points', 0.025), ('inactive', 0.025), ('exploitation', 0.025), ('permutations', 0.025), ('bayes', 0.024), ('random', 0.023), ('largest', 0.023), ('diverse', 0.023), ('recall', 0.023), ('variants', 0.022), ('point', 0.022), ('half', 0.022), ('data', 0.022), ('plots', 0.021), ('location', 0.021), ('inside', 0.021), ('passes', 0.021), ('sphere', 0.021), ('ers', 0.021), ('workshop', 0.02), ('simplest', 0.02), ('voting', 0.02), ('binding', 0.02), ('window', 0.02), ('scatter', 0.019), ('queries', 0.019), ('center', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 25 nips-2001-Active Learning in the Drug Discovery Process
Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between positive and negative. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.
2 0.15396106 144 nips-2001-Partially labeled classification with Markov random walks
Author: Martin Szummer, Tommi Jaakkola
Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.
3 0.1099907 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
Author: Shai Fine, Katya Scheinberg
Abstract: We propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework, can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The later method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solve the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. 1
4 0.10633516 46 nips-2001-Categorization by Learning and Combining Object Parts
Author: Bernd Heisele, Thomas Serre, Massimiliano Pontil, Thomas Vetter, Tomaso Poggio
Abstract: We describe an algorithm for automatically learning discriminative components of objects with SVM classifiers. It is based on growing image parts by minimizing theoretical bounds on the error probability of an SVM. Component-based face classifiers are then combined in a second stage to yield a hierarchical SVM classifier. Experimental results in face classification show considerable robustness against rotations in depth and suggest performance at significantly better level than other face detection systems. Novel aspects of our approach are: a) an algorithm to learn component-based classification experts and their combination, b) the use of 3-D morphable models for training, and c) a maximum operation on the output of each component classifier which may be relevant for biological models of visual recognition.
5 0.10356828 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
Author: Paul Viola, Michael Jones
Abstract: This paper develops a new approach for extremely fast detection in domains where the distribution of positive and negative examples is highly skewed (e.g. face detection or database retrieval). In such domains a cascade of simple classifiers each trained to achieve high detection rates and modest false positive rates can yield a final detector with many desirable features: including high detection rates, very low false positive rates, and fast performance. Achieving extremely high detection rates, rather than low error, is not a task typically addressed by machine learning algorithms. We propose a new variant of AdaBoost as a mechanism for training the simple classifiers used in the cascade. Experimental results in the domain of face detection show the training algorithm yields significant improvements in performance over conventional AdaBoost. The final face detection system can process 15 frames per second, achieves over 90% detection, and a false positive rate of 1 in a 1,000,000.
6 0.10325724 167 nips-2001-Semi-supervised MarginBoost
7 0.096254088 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.091621414 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
9 0.091393813 139 nips-2001-Online Learning with Kernels
10 0.088928908 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
11 0.087624259 143 nips-2001-PAC Generalization Bounds for Co-training
12 0.080995321 62 nips-2001-Duality, Geometry, and Support Vector Regression
13 0.075068742 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
14 0.073439933 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA
15 0.073201455 105 nips-2001-Kernel Machines and Boolean Functions
16 0.071718462 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
17 0.070741855 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
18 0.069299355 58 nips-2001-Covariance Kernels from Bayesian Generative Models
19 0.067368671 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
20 0.063597016 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
topicId topicWeight
[(0, -0.175), (1, 0.113), (2, -0.035), (3, 0.13), (4, 0.025), (5, 0.067), (6, -0.049), (7, -0.028), (8, -0.034), (9, 0.035), (10, 0.077), (11, -0.064), (12, -0.017), (13, -0.001), (14, 0.166), (15, 0.011), (16, 0.063), (17, -0.019), (18, 0.03), (19, -0.037), (20, -0.118), (21, 0.049), (22, -0.108), (23, -0.008), (24, 0.087), (25, 0.104), (26, 0.069), (27, -0.14), (28, 0.017), (29, -0.102), (30, -0.028), (31, 0.133), (32, 0.117), (33, 0.119), (34, -0.096), (35, -0.048), (36, -0.003), (37, -0.023), (38, 0.016), (39, -0.021), (40, 0.011), (41, -0.016), (42, 0.036), (43, 0.085), (44, 0.133), (45, 0.087), (46, 0.014), (47, -0.177), (48, -0.015), (49, -0.137)]
simIndex simValue paperId paperTitle
same-paper 1 0.95486432 25 nips-2001-Active Learning in the Drug Discovery Process
Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between positive and negative. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.
2 0.68616927 144 nips-2001-Partially labeled classification with Markov random walks
Author: Martin Szummer, Tommi Jaakkola
Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.
3 0.51694775 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
Author: Shai Fine, Katya Scheinberg
Abstract: We propose a framework based on a parametric quadratic programming (QP) technique to solve the support vector machine (SVM) training problem. This framework, can be specialized to obtain two SVM optimization methods. The first solves the fixed bias problem, while the second starts with an optimal solution for a fixed bias problem and adjusts the bias until the optimal value is found. The later method can be applied in conjunction with any other existing technique which obtains a fixed bias solution. Moreover, the second method can also be used independently to solve the complete SVM training problem. A combination of these two methods is more flexible than each individual method and, among other things, produces an incremental algorithm which exactly solve the 1-Norm Soft Margin SVM optimization problem. Applying Selective Sampling techniques may further boost convergence. 1
4 0.48938137 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in the input space.
5 0.48134595 167 nips-2001-Semi-supervised MarginBoost
Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise
Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1
6 0.4679586 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
7 0.43251127 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
8 0.41849944 165 nips-2001-Scaling Laws and Local Minima in Hebbian ICA
9 0.40715978 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
10 0.36716425 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
11 0.35879704 46 nips-2001-Categorization by Learning and Combining Object Parts
12 0.35558563 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
13 0.34955421 143 nips-2001-PAC Generalization Bounds for Co-training
14 0.34509906 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
15 0.33329147 105 nips-2001-Kernel Machines and Boolean Functions
16 0.33130842 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
17 0.32836956 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
18 0.32665998 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
19 0.32663867 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
20 0.3101882 139 nips-2001-Online Learning with Kernels
topicId topicWeight
[(14, 0.043), (17, 0.031), (19, 0.033), (27, 0.127), (30, 0.044), (38, 0.02), (49, 0.286), (59, 0.014), (72, 0.118), (79, 0.029), (83, 0.012), (91, 0.135)]
simIndex simValue paperId paperTitle
1 0.89379293 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation
Author: Christophe Andrieu, Nando D. Freitas, Arnaud Doucet
Abstract: In this paper, we extend the Rao-Blackwellised particle filtering method to more complex hybrid models consisting of Gaussian latent variables and discrete observations. This is accomplished by augmenting the models with artificial variables that enable us to apply Rao-Blackwellisation. Other improvements include the design of an optimal importance proposal distribution and being able to swap the sampling an selection steps to handle outliers. We focus on sequential binary classifiers that consist of linear combinations of basis functions , whose coefficients evolve according to a Gaussian smoothness prior. Our results show significant improvements. 1
2 0.8408795 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables
Author: Brendan J. Frey, Anitha Kannan, Nebojsa Jojic
Abstract: Factor analysis and principal components analysis can be used to model linear relationships between observed variables and linearly map high-dimensional data to a lower-dimensional hidden space. In factor analysis, the observations are modeled as a linear combination of normally distributed hidden variables. We describe a nonlinear generalization of factor analysis, called product analysis, in which observations are modeled as products of hidden variables.
same-paper 3 0.82141119 25 nips-2001-Active Learning in the Drug Discovery Process
Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen
Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between positive and negative. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.
4 0.62280512 84 nips-2001-Global Coordination of Local Linear Models
Author: Sam T. Roweis, Lawrence K. Saul, Geoffrey E. Hinton
Abstract: High dimensional data that lies on or near a low dimensional manifold can be described by a collection of local linear models. Such a description, however, does not provide a global parameterization of the manifold—arguably an important goal of unsupervised learning. In this paper, we show how to learn a collection of local linear models that solves this more difficult problem. Our local linear models are represented by a mixture of factor analyzers, and the “global coordination” of these models is achieved by adding a regularizing term to the standard maximum likelihood objective function. The regularizer breaks a degeneracy in the mixture model’s parameter space, favoring models whose internal coordinate systems are aligned in a consistent way. As a result, the internal coordinates change smoothly and continuously as one traverses a connected path on the manifold—even when the path crosses the domains of many different local models. The regularizer takes the form of a Kullback-Leibler divergence and illustrates an unexpected application of variational methods: not to perform approximate inference in intractable probabilistic models, but to learn more useful internal representations in tractable ones. 1 Manifold Learning Consider an ensemble of images, each of which contains a face against a neutral background. Each image can be represented by a point in the high dimensional vector space of pixel intensities. This representation, however, does not exploit the strong correlations between pixels of the same image, nor does it support many useful operations for reasoning about faces. If, for example, we select two images with faces in widely different locations and then average their pixel intensities, we do not obtain an image of a face at their average location. Images of faces lie on or near a low-dimensional, curved manifold, and we can represent them more usefully by the coordinates on this manifold than by pixel intensities. Using these “intrinsic coordinates”, the average of two faces is another face with the average of their locations, poses and expressions. To analyze and manipulate faces, it is helpful to imagine a “magic black box” with levers or dials corresponding to the intrinsic coordinates on this manifold. Given a setting of the levers and dials, the box generates an image of a face. Given an image of a face, the box deduces the appropriate setting of the levers and dials. In this paper, we describe a fairly general way to construct such a box automatically from an ensemble of high-dimensional vectors. We assume only that there exists an underlying manifold of low dimensionality and that the relationship between the raw data and the manifold coordinates is locally linear and smoothly varying. Thus our method applies not only to images of faces, but also to many other forms of highly distributed perceptual and scientific data (e.g., spectrograms of speech, robotic sensors, gene expression arrays, document collections). 2 Local Linear Models The global structure of perceptual manifolds (such as images of faces) tends to be highly nonlinear. Fortunately, despite their complicated global structure, we can usually characterize these manifolds as locally linear. Thus, to a good approximation, they can be represented by collections of simpler models, each of which describes a locally linear neighborhood[3, 6, 8]. For unsupervised learning tasks, a probabilistic model that nicely captures this intuition is a mixture of factor analyzers (MFA)[5]. 
The model is used to describe high dimensional data that lies on or near a lower dimensional manifold. MFAs parameterize a joint distribution over observed and hidden variables, where the observed variable represents the high dimensional data, a discrete hidden variable indexes different neighborhoods on the manifold, and continuous hidden variables represent low dimensional local coordinates. The model assumes that data is sampled from different neighborhoods on the manifold with prior probabilities, and that within each neighborhood the data's local coordinates are normally distributed.
5 0.62259185 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on t he assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
7 0.61906195 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
8 0.61577916 13 nips-2001-A Natural Policy Gradient
9 0.61393338 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
10 0.61332637 40 nips-2001-Batch Value Function Approximation via Support Vectors
11 0.61302644 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes
12 0.61262465 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
13 0.61167121 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
14 0.61137974 143 nips-2001-PAC Generalization Bounds for Co-training
15 0.61110353 58 nips-2001-Covariance Kernels from Bayesian Generative Models
16 0.61095035 8 nips-2001-A General Greedy Approximation Algorithm with Applications
17 0.60734916 61 nips-2001-Distribution of Mutual Information
18 0.60734218 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
19 0.60649812 1 nips-2001-(Not) Bounding the True Error
20 0.60549599 132 nips-2001-Novel iteration schemes for the Cluster Variation Method