nips nips2001 nips2001-64 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Qi Zhang, Sally A. Goldman
Abstract: We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. [sent-8, score-0.307]
2 EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. [sent-9, score-0.247]
3 On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. [sent-10, score-0.147]
4 EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. [sent-11, score-0.529]
5 Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag. [sent-12, score-0.157]
6 In this model, each training example is a set (or bag) of instances along with a single label equal to the maximum label among all instances in the bag. [sent-14, score-0.36]
7 The individual instances within the bag are not given labels. [sent-15, score-0.555]
8 The goal is to learn to accurately predict the label of previously unseen bags. [sent-16, score-0.073]
9 Standard supervised learning can be viewed as a special case of MI learning where each bag holds a single instance. [sent-17, score-0.529]
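As a small illustration of the bag-labeling rule described above, here is a minimal Python sketch (the function name is illustrative, not from the paper) that computes a bag's label as the maximum over its instance labels; with boolean labels this reduces to a logical OR.

def bag_label(instance_labels):
    # MI model: the bag label equals the maximum label among its instances;
    # with boolean (0/1) labels this is just an OR over the instances.
    return max(instance_labels)

# Example: bag_label([0, 0, 1]) == 1, and bag_label([0.2, 0.7, 0.1]) == 0.7.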
10 The MI learning model was originally motivated by the drug activity prediction problem where each instance is a possible conformation (or shape) of a molecule and each bag contains all likely low-energy conformations for the molecule. [sent-18, score-0.795]
11 A molecule is active if it binds strongly to the target protein in at least one of its conformations and is inactive if no conformation binds to the protein. [sent-19, score-0.327]
12 The problem is to predict the label (active or inactive) of molecules based on their conformations. [sent-20, score-0.154]
13 The MI model was introduced by Dietterich et al. in their seminal paper [4], in which they developed MI algorithms for learning axis-parallel rectangles (APRs) and also provided two benchmark "Musk" data sets. [sent-22, score-0.086]
14 Maron and Ratan [7] applied the multiple-instance model to the task of recognizing a person from a series of images that are labeled positive if they contain the person and negative otherwise. [sent-24, score-0.068]
15 While the musk data sets have boolean labels, algorithms that can handle real-value labels are often desirable in real-world applications. [sent-27, score-0.554]
16 For example, the binding affinity between a molecule and a receptor is quantitative, and hence a real-value classification of binding strength is preferable to a binary one. [sent-28, score-0.283]
17 Most prior research on MI learning is restricted to concept learning (i.e., learning with boolean labels). [sent-29, score-0.056]
18 Recently, MI learning with real-value labels has been performed using extensions of the diverse density (DD) and k-NN algorithms [1] and using MI regression [10]. [sent-32, score-0.409]
19 In this paper, we present a general-purpose MI learning technique (EM-DD) that combines EM [3] with the extended DD [1] algorithm. [sent-33, score-0.078]
20 The algorithm is applied to both boolean and real-value labeled data and the results are compared with corresponding MI learning algorithms from previous work. [sent-34, score-0.27]
21 In addition, the effects of the number of instances per bag and the number of relevant features on the performance of the EM-DD algorithm are also evaluated using artificial data sets. [sent-35, score-0.811]
22 Their best-performing algorithm (iterated-discrim) starts with a point in the feature space and "grows" a box with the goal of finding the smallest box that covers at least one instance from each positive bag and no instances from any negative bag. [sent-40, score-0.759]
23 More recently, Wang and Zucker [11] proposed a lazy learning approach by applying two variants of the k-nearest neighbor algorithm (k-NN), which they refer to as citation-kNN and Bayesian k-NN. [sent-45, score-0.093]
24 When describing the shape of a molecule by n features, one can view each conformation of the molecule as a point in an n-dimensional feature space. [sent-48, score-0.309]
25 The diverse density at a point p in the feature space is a probabilistic measure of both how many different positive bags have an instance near p, and how far the negative instances are from p. [sent-49, score-0.807]
26 Intuitively, the diverse density of a hypothesis h is just the likelihood (with respect to the data) that h is the target. [sent-50, score-0.129]
27 A high diverse density indicates a good candidate for a "true" concept. [sent-51, score-0.229]
28 We now formally define the general MI problem (with boolean or real-value labels) and the DD likelihood measurement originally defined in [6] and extended to real-value labels in [1]. [sent-52, score-0.21]
29 Let D be the labeled data, which consists of a set of m bags B = {B_1, ..., B_m}. [sent-53, score-0.422]
30 Assume the labels of the instances in B_i are ℓ_i1, ℓ_i2, .... [sent-69, score-0.207]
31 The diverse density of a hypothesized target point h is defined as DD(h) = Pr(h | B, L) = Pr(B, L | h) Pr(h) / Pr(B, L). [sent-83, score-0.255]
32 When the labels are boolean (0 or 1), this formulation is exactly the most-likely-cause estimator used in the original DD algorithm [5]. [sent-87, score-0.242]
33 For most applications the influence each feature has on the label varies greatly. [sent-88, score-0.098]
34 This variation is modeled in the DD algorithm by associating with each attribute an (unknown) scale factor. [sent-89, score-0.102]
35 Hence the target concept really consists of two values per dimension: the ideal attribute value and the scale value. [sent-90, score-0.117]
36 Using the assumption that binding strength drops exponentially as the distance between the conformation and the ideal shape increases, the following generative model was introduced by Maron and Lozano-Perez [6] for estimating the label of bag B_i for hypothesis h = {h_1, ..., h_n, s_1, ..., s_n}: [sent-91, score-0.624]
37 Label(B_i | h) = max_j exp[ -\sum_{d=1}^{n} (s_d (B_ijd - h_d))^2 ]   (1), where s_d is a scale factor indicating the importance of feature d, h_d is the feature value for dimension d, and B_ijd is the feature value of instance B_ij on dimension d. [sent-97, score-0.163]
38 Let NLDD(h, D) = \sum_{i=1}^{m} (-log Pr(ℓ_i | h, B_i)), where NLDD denotes the negative logarithm of DD. [sent-98, score-0.063]
39 The DD algorithm [6] uses a two-step gradient descent search to find a value of h that minimizes NLDD (and hence maximizes DD). [sent-99, score-0.09]
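The generative model of Equation (1) and the NLDD objective can be sketched as follows; this is a minimal NumPy sketch assuming boolean bag labels and the most-likely-cause estimator Pr(l_i | h, B_i) = Label(B_i | h) for positive bags and 1 - Label(B_i | h) for negative bags, with illustrative function and variable names.

import numpy as np

def label_estimate(bag, h, s):
    # Eq. (1): Label(B_i | h) = max_j exp(-sum_d (s_d * (B_ijd - h_d))^2)
    # bag: (n_instances, n_features) array; h, s: (n_features,) arrays.
    d2 = np.sum((s * (bag - h)) ** 2, axis=1)
    return float(np.max(np.exp(-d2)))

def nldd(h, s, bags, labels, eps=1e-12):
    # Negative log diverse density under the most-likely-cause estimator.
    total = 0.0
    for bag, label in zip(bags, labels):
        p = label_estimate(bag, h, s)
        pr = p if label == 1 else 1.0 - p
        total -= np.log(max(pr, eps))  # clip to avoid log(0)
    return total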
40 Ray and Page [10] developed a multiple-instance regression algorithm, which can also handle real-value labeled data. [sent-100, score-0.121]
41 They assumed an underlying linear model for the hypothesis and applied the algorithm to some artificial data. [sent-101, score-0.151]
42 Similar to the current work, they also used EM to select one instance from each bag so that multiple regression can be applied to MI learning. [sent-102, score-0.534]
43 The basic idea behind EM-DD is to view the knowledge of which instance corresponds to the label of the bag as a missing attribute, which can be estimated using the EM approach in a way similar to how EM is used in MI regression [10]. [sent-105, score-0.672]
44 EM-DD starts with some initial guess of a target point h obtained in the standard way by trying points from positive bags, then repeatedly performs the following two steps, which combine EM with DD, to search for the maximum likelihood hypothesis. [sent-106, score-0.08]
45 In the first step (E-step), the current hypothesis h is used to pick from each bag the instance that is most likely (given our generative model) to be the one responsible for the label given to the bag. [sent-107, score-0.674]
46 In the second step (M-step), we use the two-step gradient ascent search (quasi-Newton search dfpmin in [8]) of the standard DD algorithm to find a new hypothesis h' that maximizes DD(h'). [sent-108, score-0.161]
47 Once this maximization step is completed, we reset the proposed target h to h' and return to the first step, repeating until the algorithm converges. [sent-109, score-0.104]
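The loop just described can be sketched as below; this is an assumed reading rather than the authors' implementation, with scipy.optimize.minimize (BFGS) standing in for the dfpmin quasi-Newton routine cited in the text and boolean labels as in the earlier sketch.

import numpy as np
from scipy.optimize import minimize

def em_dd(bags, labels, h0, s0, max_iter=10):
    # bags: list of (n_i, d) arrays; labels: boolean bag labels;
    # h0, s0: initial target point and per-feature scale factors (length d).
    labels = np.asarray(labels)
    d = len(h0)
    theta = np.concatenate([np.asarray(h0, float), np.asarray(s0, float)])
    best = np.inf  # lowest NLDD seen so far (minimizing NLDD maximizes DD)
    for _ in range(max_iter):
        h, s = theta[:d], theta[d:]
        # E-step: keep from each bag the instance with the largest Label estimate,
        # i.e. the instance closest to h under the scaled distance of Eq. (1).
        picked = np.array([bag[np.argmin(np.sum((s * (bag - h)) ** 2, axis=1))]
                           for bag in bags])
        # M-step: minimize the single-instance NLDD over (h, s).
        def single_nldd(t):
            hh, ss = t[:d], t[d:]
            p = np.exp(-np.sum((ss * (picked - hh)) ** 2, axis=1))
            pr = np.where(labels == 1, p, 1.0 - p)
            return -np.sum(np.log(np.clip(pr, 1e-12, 1.0)))
        res = minimize(single_nldd, theta, method="BFGS")
        if res.fun >= best:  # no further improvement: stop
            break
        best, theta = res.fun, res.x
    return theta[:d], theta[d:], best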
48 In every search step, the DD algorithm uses all points in each bag and hence the maximum that occurs in Equation (1) must be computed. [sent-113, score-0.515]
49 The prior diverse density algorithms [1,5,6,7] used a softmax approximation for the maximum (so that it will be differentiable), which dramatically increases the computational complexity and introduces additional error that depends on the parameter selected for the softmax. [sent-114, score-0.304]
50 In comparison, EM-DD converts the multiple-instance data to single-instance data by removing all but one point per bag in the E-step, which greatly simplifies the search step since the maximum that occurs in Equation (1) is removed in the E-step. [sent-115, score-0.566]
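For contrast, one common form of the softmax approximation of the maximum is sketched below; the exact parameterization used in [1,5,6,7] may differ, and EM-DD's E-step replaces this smooth surrogate with a hard selection of a single instance per bag.

import numpy as np

def soft_max_approx(values, alpha=10.0):
    # Differentiable surrogate for max(values); larger alpha tracks the true
    # maximum more closely but makes the objective harder to optimize.
    values = np.asarray(values, dtype=float)
    w = np.exp(alpha * (values - values.max()))  # shift for numerical stability
    return float(np.sum(values * w) / np.sum(w))

def hard_max(values):
    # What EM-DD effectively uses once one instance per bag has been committed to.
    return float(np.max(values))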
51 In addition, we believe that EM-DD helps avoid getting caught in a local minimum since it makes major changes in the hypothesis when it switches which point is selected from a bag. [sent-117, score-0.119]
52 Note that at each iteration t, given the set of instances selected in the E-step, the M-step will find a unique hypothesis (h_t) and corresponding DD value (dd_t). [sent-119, score-0.223]
53 At iteration t + 1, if dd_{t+1} ≤ dd_t, the algorithm will terminate. [sent-120, score-0.541]
54 Otherwise, dd_{t+1} > dd_t, which means that a different set of instances is selected. [sent-121, score-0.611]
55 For the iteration to continue, the DD must increase monotonically, and so the set of instances selected cannot repeat. [sent-122, score-0.131]
56 Since there are only a finite number of sets of instances that can be selected at the E-step, the algorithm will terminate after a finite number of iterations. [sent-123, score-0.201]
57 From empirical tests we found that it is often beneficial to allow NLDD to increase slightly to escape a local minimum, and thus we used the less restrictive termination condition that |dd_1 - dd_0| is smaller than a small fraction of dd_0, or that the number of iterations is greater than 10. [sent-126, score-0.448] [sent-128, score-0.448]
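The relaxed stopping rule can be written as a one-line check; the tolerance constant below is purely illustrative, since the exact fraction used by the authors is not recoverable from this text.

def should_stop(dd_prev, dd_new, iteration, max_iter=10, rel_tol=0.01):
    # Stop when the objective changes by less than a small fraction of its
    # previous value, or when the iteration cap is reached (rel_tol is a placeholder).
    return abs(dd_new - dd_prev) < rel_tol * abs(dd_prev) or iteration >= max_iter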
59 We begin by reporting our results for the two musk benchmark data sets provided by Dietterich et al. [sent-132, score-0.195]
60 These data sets contain feature vectors with 166 dimensions describing the surface of low-energy conformations of 92 molecules for Musk1 and 102 molecules for Musk2, where roughly half of the molecules are known to smell musky and the remainder are not. [sent-134, score-0.388]
61 The Musk1 data set is smaller both in having fewer bags (i.e. molecules) and many fewer instances per bag (an average of 6. [sent-135, score-0.354] [sent-136, score-0.555]
63 // Main (Figure 1), 10-fold cross validation over D = {D_1, ..., D_10}: for (i = 1; i <= 10; i++) { D_t = D - D_i; // D_t training data, D_i validation data; pick k random positive bags B_1, ..., B_k from D_t; [sent-144, score-0.489]
64 let H_0 be the union of all instances from the selected bags; for every instance I_j in H_0: h_j = EM-DD(I_j, D_t); e_i = min_{0 <= j <= ||H_0||} { error(h_j, D_i) }; } return avg(e_1, e_2, ..., e_10); [sent-147, score-0.192]
65 // EM-DD is seeded with initial hypothesis h = {h_1, ..., h_n, s_1, ..., s_n}; for each dimension d = 1, ... [sent-156, score-0.069]
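A Python sketch of the Main cross-validation loop reconstructed above, reusing the em_dd sketch from earlier; error() is an assumed helper that scores a hypothesis on the held-out fold, and all names are illustrative.

import numpy as np

def main_cv(folds, k=3, seed=0):
    # folds: list of 10 (bags, labels) pairs; each bag is an (n_i, d) array.
    rng = np.random.default_rng(seed)
    errors = []
    for i, (val_bags, val_labels) in enumerate(folds):
        train = [f for j, f in enumerate(folds) if j != i]
        tr_bags = [b for bags, _ in train for b in bags]
        tr_labels = [l for _, labels in train for l in labels]
        # Pick k random positive training bags and pool their instances as H_0.
        pos = [b for b, l in zip(tr_bags, tr_labels) if l == 1]
        chosen = rng.choice(len(pos), size=k, replace=False)
        H0 = [inst for j in chosen for inst in pos[j]]
        # Run EM-DD from every instance in H_0 and keep the best hypothesis,
        # scored by error() on the validation fold D_i.
        hyps = [em_dd(tr_bags, tr_labels, h0=np.asarray(inst),
                      s0=np.ones(len(inst))) for inst in H0]
        errors.append(min(error(h, s, val_bags, val_labels) for h, s, _ in hyps))
    return float(np.mean(errors))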
66 summarize the generally held belief that "The performance reported for iterated-discrim APR involves choosing parameters to maximize the test set performance and so probably represents an upper bound for accuracy on this (Musk1) data set." [sent-163, score-0.069]
67 To be consistent with the way in which past results have been reported for the musk benchmarks, we report the average accuracy of 10-fold cross-validation (which is the value returned by Main in Figure 1). [sent-165, score-0.195]
68 A summary of the performance of different algorithms on the Musk1 and Musk2 data sets is given in Table 1. [sent-169, score-0.091]
69 As compared to the standard DD algorithm, EM-DD only used three random bags for Musk1 and two random bags for Musk2 (versus all positive bags used in DD) as the starting point of the algorithm. [sent-171, score-1.125]
70 Table 1: Comparison of performance on the Musk1 and Musk2 data sets, measured by the average accuracy across 10 runs of 10-fold cross validation. [sent-175, score-0.127]
71 Table 1 lists Musk1 and Musk2 accuracy for EM-DD, Iterated-discrim [4], Citation-kNN [11], Bayesian-kNN [11], Diverse density [6], Multi-instance neural network [9], and Multinst [2]; EM-DD scores highest on both benchmarks, at roughly 96% on Musk1. [sent-176, score-0.098]
72 In addition to its superior performance on the musk data sets, EM-DD can handle real-value labeled data and produces real-value predictions. [sent-190, score-0.275]
73 We present results using one real data set (Affinity) 1 that has real-value labels and several artificial data sets generated using the technique of our earlier work [1]. [sent-191, score-0.266]
74 For these data sets, we used as our starting points the points from the bag with the highest DD value. [sent-192, score-0.505]
75 The Affinity data set has 283 features and 139 bags with an average of 32. [sent-194, score-0.419]
76 Only 29 bags have labels that were high enough to be considered as "positive." [sent-196, score-0.454]
77 In contrast, using the standard diverse density algorithm, the loss was 0. [sent-200, score-0.266]
78 EM-DD also achieved much better performance than DD on two artificial data sets (160. [sent-202, score-0.076]
79 The best result on the Affinity data was obtained using a version of citation-kNN [1] that works with real-value data, with a loss of 0. [sent-207, score-0.062]
80 We think that the Affinity data set is well-suited for a nearest neighbor approach in that all of the negative bags have labels between 0. [sent-209, score-0.634]
81 42 and so the actual predictions for the negative bags are better with citation-kNN. [sent-211, score-0.428]
82 To study the sensitivity of EM-DD to the number of relevant attributes and the size of the bags, tests were performed on artificial data sets with different numbers of relevant features and bag sizes. [sent-212, score-0.641]
83 As shown in Table 2, similar to the DD algorithm [1], the performance of EM-DD degrades as the number of relevant features decreases. [sent-213, score-0.121]
84 This behavior is expected since all scale factors are initialized to the same value; when most of the features are relevant, less adjustment is needed and hence the algorithm is more likely to succeed. [sent-214, score-0.121]
85 For example, as shown in Figure 2, when the number of relevant features is 160 out of 166, both the EM-DD and DD algorithms perform well, with good correlation between the actual and predicted labels. [sent-216, score-0.254]
86 However, when the number of relevant features decreases to 80, almost no correlation between the actual and predicted labels is found using DD, while EM-DD can still provide good predictions of the labels. [sent-217, score-0.227]
87 Intuitively, as the size of the bags increases, more ambiguity is introduced into the data and the performance of the algorithms is expected to go down. [sent-218, score-0.467]
88 Table 2: Performance on data with real-value labels measured as squared loss. [sent-223, score-0.131]
89 Table 2 columns: number of relevant features (160, 80, or 40), number of points per bag, and the squared loss of each algorithm on the corresponding artificial data set. [sent-243, score-0.508]
90 Somewhat surprisingly, the performance of EM-DD actually improves as the number of examples per bag increases. [sent-257, score-0.474]
91 We believe that this is partly due to the fact that with few points per bag the chance that a bad starting point has the highest diverse density is much higher than when the bags are large. [sent-258, score-1.109]
92 The fact that EM-DD scales up well to large bag sizes in both performance and running time is very important for real drug-discovery applications in which the bags can be quite large. [sent-260, score-0.802]
93 We believe that EM-DD can be refined to obtain better performance by finding alternate ways to select the initial hypothesis and scale factors. [sent-262, score-0.095]
94 One option would be to use the result from a different learning algorithm as the starting point and then use EM-DD to refine the hypothesis. [sent-263, score-0.091]
95 Since our algorithm is based on the diverse density likelihood measurement, we believe that it will perform well on all applications in which the standard diverse density algorithm has worked well. [sent-265, score-0.558]
96 In addition, EM-DD and MI regression [10] provide a framework for converting multiple-instance data to single-instance data, to which supervised learning algorithms can be applied. [sent-266, score-0.159]
97 We are currently working on using this general methodology to develop new MI learning techniques based on supervised learning algorithms and EM. [sent-267, score-0.108]
98 Figure 2: Comparison of EM-DD and DD on real-value labeled artificial data with different numbers of relevant features. [sent-356, score-0.163]
99 The x-axis corresponds to the actual label and the y-axis gives the predicted label. [sent-357, score-0.116]
100 Learning single and multiple instance decision trees for computer security applications. [sent-421, score-0.085]
wordName wordTfidf (topN-words)
[('dd', 0.448), ('bag', 0.448), ('bags', 0.354), ('mi', 0.263), ('nldd', 0.261), ('diverse', 0.169), ('pr', 0.142), ('maron', 0.131), ('musk', 0.131), ('affinity', 0.118), ('boolean', 0.11), ('instances', 0.107), ('labels', 0.1), ('molecule', 0.097), ('molecules', 0.081), ('bi', 0.077), ('muskl', 0.075), ('label', 0.073), ('hypothesis', 0.069), ('bij', 0.065), ('attribute', 0.065), ('instance', 0.061), ('density', 0.06), ('conformation', 0.056), ('conformations', 0.056), ('ddt', 0.056), ('emdd', 0.056), ('heh', 0.056), ('ray', 0.056), ('sd', 0.055), ('em', 0.051), ('relevant', 0.05), ('drug', 0.049), ('artificial', 0.045), ('hd', 0.044), ('actual', 0.043), ('hi', 0.041), ('francisco', 0.04), ('accuracy', 0.038), ('amar', 0.037), ('apr', 0.037), ('aprs', 0.037), ('auer', 0.037), ('bijd', 0.037), ('dooly', 0.037), ('goldman', 0.037), ('louis', 0.037), ('multinst', 0.037), ('ramon', 0.037), ('algorithm', 0.037), ('labeled', 0.037), ('dietterich', 0.036), ('fi', 0.035), ('pi', 0.034), ('binding', 0.034), ('features', 0.034), ('sets', 0.033), ('binds', 0.032), ('erformance', 0.032), ('zucker', 0.032), ('greene', 0.032), ('hm', 0.032), ('arg', 0.032), ('page', 0.032), ('negative', 0.031), ('data', 0.031), ('search', 0.03), ('doctoral', 0.03), ('learning', 0.028), ('san', 0.028), ('mo', 0.028), ('inactive', 0.028), ('lazy', 0.028), ('ij', 0.027), ('morgan', 0.027), ('algorithms', 0.027), ('target', 0.026), ('jonathan', 0.026), ('wang', 0.026), ('benchmarks', 0.026), ('per', 0.026), ('starting', 0.026), ('believe', 0.026), ('technique', 0.026), ('feature', 0.025), ('box', 0.025), ('supervised', 0.025), ('cross', 0.025), ('regression', 0.025), ('ca', 0.024), ('selected', 0.024), ('validation', 0.024), ('combines', 0.024), ('softmax', 0.024), ('modification', 0.024), ('addition', 0.023), ('ambiguity', 0.023), ('responsible', 0.023), ('find', 0.023), ('handle', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
Author: Qi Zhang, Sally A. Goldman
Abstract: We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag.
2 0.15784353 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
3 0.087968171 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
Author: Kari Torkkola
Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as a Gaussian Mixture Models (GMM). This allows all of the computation to take place in the low dimensional target space, and it reduces computational complexity proportional to square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.
4 0.052096084 137 nips-2001-On the Convergence of Leveraging
Author: Gunnar Rätsch, Sebastian Mika, Manfred K. Warmuth
Abstract: We give an unified convergence analysis of ensemble learning methods including e.g. AdaBoost, Logistic Regression and the Least-SquareBoost algorithm for regression. These methods have in common that they iteratively call a base learning algorithm which returns hypotheses that are then linearly combined. We show that these methods are related to the Gauss-Southwell method known from numerical optimization and state non-asymptotical convergence results for all these methods. Our analysis includes -norm regularized cost functions leading to a clean and general way to regularize ensemble learning.
5 0.048663579 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
Author: Mikio L. Braun, Joachim M. Buhmann
Abstract: We consider noisy Euclidean traveling salesman problems in the plane, which are random combinatorial problems with underlying structure. Gibbs sampling is used to compute average trajectories, which estimate the underlying structure common to all instances. This procedure requires identifying the exact relationship between permutations and tours. In a learning setting, the average trajectory is used as a model to construct solutions to new instances sampled from the same source. Experimental results show that the average trajectory can in fact estimate the underlying structure and that overfitting effects occur if the trajectory adapts too closely to a single instance. 1
6 0.047185041 144 nips-2001-Partially labeled classification with Markov random walks
7 0.046157062 139 nips-2001-Online Learning with Kernels
8 0.045903377 143 nips-2001-PAC Generalization Bounds for Co-training
9 0.045252375 171 nips-2001-Spectral Relaxation for K-means Clustering
10 0.043895178 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
11 0.043885183 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
12 0.043458749 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
13 0.04322296 22 nips-2001-A kernel method for multi-labelled classification
14 0.043192875 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
15 0.042078037 114 nips-2001-Learning from Infinite Data in Finite Time
16 0.041617207 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
17 0.041143741 8 nips-2001-A General Greedy Approximation Algorithm with Applications
18 0.040540002 25 nips-2001-Active Learning in the Drug Discovery Process
19 0.039967529 170 nips-2001-Spectral Kernel Methods for Clustering
20 0.039606784 105 nips-2001-Kernel Machines and Boolean Functions
topicId topicWeight
[(0, -0.144), (1, 0.026), (2, -0.001), (3, -0.009), (4, 0.014), (5, -0.038), (6, -0.018), (7, 0.015), (8, -0.049), (9, -0.002), (10, 0.041), (11, 0.001), (12, 0.004), (13, -0.049), (14, -0.026), (15, 0.013), (16, 0.01), (17, 0.026), (18, -0.042), (19, 0.014), (20, -0.041), (21, 0.131), (22, -0.074), (23, 0.037), (24, 0.141), (25, 0.025), (26, 0.111), (27, 0.027), (28, -0.063), (29, -0.005), (30, -0.128), (31, 0.115), (32, 0.014), (33, -0.079), (34, -0.025), (35, 0.026), (36, 0.022), (37, 0.034), (38, -0.146), (39, -0.031), (40, 0.048), (41, 0.161), (42, -0.058), (43, 0.039), (44, 0.028), (45, 0.067), (46, -0.097), (47, 0.105), (48, 0.135), (49, -0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.90807593 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
Author: Qi Zhang, Sally A. Goldman
Abstract: We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag.
2 0.52283615 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
Author: Kari Torkkola
Abstract: The marriage of Renyi entropy with Parzen density estimation has been shown to be a viable tool in learning discriminative feature transforms. However, it suffers from computational complexity proportional to the square of the number of samples in the training data. This sets a practical limit to using large databases. We suggest immediate divorce of the two methods and remarriage of Renyi entropy with a semi-parametric density estimation method, such as a Gaussian Mixture Models (GMM). This allows all of the computation to take place in the low dimensional target space, and it reduces computational complexity proportional to square of the number of components in the mixtures. Furthermore, a convenient extension to Hidden Markov Models as commonly used in speech recognition becomes possible.
3 0.49595687 58 nips-2001-Covariance Kernels from Bayesian Generative Models
Author: Matthias Seeger
Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1
4 0.43389985 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
Author: Andrew Y. Ng, Michael I. Jordan
Abstract: We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation, which is borne out in repeated experiments, that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
5 0.41344953 167 nips-2001-Semi-supervised MarginBoost
Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise
Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1
6 0.38952377 93 nips-2001-Incremental A*
7 0.33433428 143 nips-2001-PAC Generalization Bounds for Co-training
8 0.33349878 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
9 0.32571882 114 nips-2001-Learning from Infinite Data in Finite Time
10 0.32343176 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms
11 0.32140991 35 nips-2001-Analysis of Sparse Bayesian Learning
12 0.31345826 89 nips-2001-Grouping with Bias
13 0.29964978 144 nips-2001-Partially labeled classification with Markov random walks
14 0.29390076 25 nips-2001-Active Learning in the Drug Discovery Process
15 0.29329842 193 nips-2001-Unsupervised Learning of Human Motion Models
16 0.28867668 149 nips-2001-Probabilistic Abstraction Hierarchies
17 0.28823054 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
18 0.28759781 31 nips-2001-Algorithmic Luckiness
19 0.28552887 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
20 0.28468975 76 nips-2001-Fast Parameter Estimation Using Green's Functions
topicId topicWeight
[(14, 0.036), (17, 0.022), (19, 0.062), (27, 0.113), (30, 0.057), (38, 0.025), (49, 0.014), (59, 0.018), (60, 0.279), (72, 0.073), (79, 0.038), (83, 0.019), (88, 0.017), (91, 0.121)]
simIndex simValue paperId paperTitle
same-paper 1 0.80748057 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
Author: Qi Zhang, Sally A. Goldman
Abstract: We present a new multiple-instance (MI) learning technique (EM-DD) that combines EM with the diverse density (DD) algorithm. EM-DD is a general-purpose MI algorithm that can be applied with boolean or real-value labels and makes real-value predictions. On the boolean Musk benchmarks, the EM-DD algorithm without any tuning significantly outperforms all previous algorithms. EM-DD is relatively insensitive to the number of relevant attributes in the data set and scales up well to large bag sizes. Furthermore, EM-DD provides a new framework for MI learning, in which the MI problem is converted to a single-instance setting by using EM to estimate the instance responsible for the label of the bag.
2 0.59029782 13 nips-2001-A Natural Policy Gradient
Author: Sham M. Kakade
Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. 1
Author: Gregory Z. Grudic, Lyle H. Ungar
Abstract: We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, . Theory is presented showing that linear function approximation representations of can degrade the rate of convergence of performance gradient estimates by a factor of relative to when no function approximation of is used, where is the number of possible actions and is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by , where is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.
4 0.57681584 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
5 0.57403934 8 nips-2001-A General Greedy Approximation Algorithm with Applications
Author: T. Zhang
Abstract: Greedy approximation algorithms have been frequently used to obtain sparse solutions to learning problems. In this paper, we present a general greedy algorithm for solving a class of convex optimization problems. We derive a bound on the rate of approximation for this algorithm, and show that our algorithm includes a number of earlier studies as special cases.
6 0.57342947 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
7 0.57273698 1 nips-2001-(Not) Bounding the True Error
8 0.57165504 192 nips-2001-Tree-based reparameterization for approximate inference on loopy graphs
9 0.57105416 95 nips-2001-Infinite Mixtures of Gaussian Process Experts
10 0.56966484 132 nips-2001-Novel iteration schemes for the Cluster Variation Method
11 0.56949258 143 nips-2001-PAC Generalization Bounds for Co-training
12 0.56929839 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
13 0.56919187 58 nips-2001-Covariance Kernels from Bayesian Generative Models
14 0.5691269 36 nips-2001-Approximate Dynamic Programming via Linear Programming
15 0.56862652 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
16 0.568488 25 nips-2001-Active Learning in the Drug Discovery Process
17 0.56847417 185 nips-2001-The Method of Quantum Clustering
18 0.56690216 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data
19 0.56624025 57 nips-2001-Correlation Codes in Neuronal Populations
20 0.56613123 121 nips-2001-Model-Free Least-Squares Policy Iteration