nips nips2002 nips2002-68 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. [sent-7, score-0.63]
2 The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. [sent-8, score-0.548]
3 We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. [sent-11, score-0.304]
4 Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. [sent-12, score-0.427]
5 In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). [sent-13, score-0.316]
6 We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. [sent-14, score-0.099]
7 With $p(\mathbf{x} \mid y)$ being the class-conditional probability density functions (PDFs) and $P(y)$ denoting the corresponding a priori probabilities of [sent-18, score-0.19]
8 class-membership, we have the risk (1) of a classifier as its expected misclassification probability. [sent-20, score-0.081]
9 It is well known (e.g. [3]) that the expected risk is minimized if one chooses the classifier (2) according to $c(\mathbf{x}) = \arg\max_y P(y)\, p(\mathbf{x} \mid y)$; the resulting lower bound on the risk is known as the Bayes risk, which limits the average performance of any classifier. [sent-23, score-0.162]
10 Because the class-conditional densities are usually unknown, one way to realize the above classifier is to use estimates of these densities instead. [sent-24, score-0.636]
11 This leads to the so-called plug-in classifiers, which are Bayes-consistent if the density estimators are consistent. [sent-25, score-0.149]
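To make the plug-in idea concrete, the following is a minimal sketch (not taken from the paper) of a plug-in classifier that replaces the unknown class-conditional densities by kernel density estimates; the unnormalized Gaussian kernel, the bandwidth value and the two-class toy data are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(X, centers, h):
    # unnormalized Gaussian kernel k(x, c) = exp(-||x - c||^2 / (2 h^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def plugin_classify(X, X_train, y_train, h=1.0):
    """Plug-in Bayes rule: argmax_y P(y) * p_hat(x | y), with p_hat a kernel density
    estimate; the common normalization constant of the kernels cancels in the argmax."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = Xc.shape[0] / X_train.shape[0]        # empirical a-priori probability
        dens = gauss_kernel(X, Xc, h).mean(axis=1)    # kernel density estimate (up to a constant)
        scores.append(prior * dens)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]

# toy usage: two overlapping isotropic Gaussians, loosely mimicking the 2-d example in Sec. 4
rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 1.0, size=(150, 2))
X1 = rng.normal([2.0, 0.0], 1.0, size=(150, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 150 + [1] * 150)
print(plugin_classify(np.array([[0.0, 0.0], [2.0, 0.0]]), X_train, y_train))
```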
12 We recently proposed a method for the design of density-based classifiers without resorting to the usual density estimation schemes of the plug-in approach [6]. [sent-29, score-0.216]
13 Instead we utilized discriminative densities with parameters optimized to solve the classification problem. [sent-30, score-0.511]
14 The approach requires maximization of the average bounded difference between class (discriminative) densities, which we refer to as the contrast; the bounded contrast is the expectation of this difference with respect to the underlying “true” distributions. [sent-31, score-0.791]
15 In this paper we show that with some slight modification the contrast can be viewed as an approximation of the negative Bayes risk (up to some constant shift and scaling) which is valid for the binary as well as for the general multiclass case. [sent-34, score-0.372]
16 Therefore, for certain parametrizations of the discriminative densities, MCCs make it possible to find an optimal trade-off between the classical plug-in Bayes-consistency and the consistency which arises from direct minimization of the approximate Bayes risk. [sent-35, score-0.61]
17 Furthermore, for a particular parametrization of the PDFs, we obtain certain kinds of Linear Programming Machines (LPMs) [4] as (in general) approximate solutions of maximum contrast estimation. [sent-36, score-0.329]
18 In that way MCCs provide a Bayes-consistent approach to realize multiclass LPMs / SVMs and they suggest an interpretation of the magnitude of the LPM / SVM classification function in terms of density differences which provide a probabilistic measure of confidence. [sent-37, score-0.317]
19 For the case of LPMs we propose an extended optimization procedure for maximization of the contrast via iteration of linear optimizations. [sent-38, score-0.337]
20 Inspired by the MCC-framework, for the resulting Sequential Linear Programming Machines (SLPM) we propose a new regularizer which allows one to find an optimal trade-off between the two above-mentioned approaches to Bayes consistency. [sent-39, score-0.084]
21 2 Maximum Contrast Estimation For the design of MCCs the first step, which is the same as for the plug-in concept, requires replacing the unknown class-conditional densities of the Bayes classifier (2) by suitably parametrized PDFs. [sent-41, score-0.385]
22 Then, instead of choosing the parameters for an approximation of the original (true) densities (e.g. [sent-42, score-0.296]
23 by maximum likelihood estimation) as with the plug-in scheme, the density parameters are chosen to maximize the so-called contrast, which is the expected value of the bounded density differences as defined in (3). [sent-44, score-0.551]
24 Obviously, these are not the best discriminative densities we may think of and therefore we require an appropriate bound. [sent-52, score-0.511]
25 For a finite bound, maximization of the contrast enforces a redistribution of the estimated probability mass and gives rise to a constrained linear optimization problem in the space of discriminative densities which may be solved by variational methods in some cases. [sent-53, score-0.927]
26 The relation between contrast and Bayes risk becomes more convenient when we slightly modify the above definition (3) by a unit upper bound and by adding a lower bound on the scaled density differences. [sent-54, score-0.392]
27 Therefore, for an infinite scale factor the (expected) contrast approaches the negative Bayes risk up to a constant shift and scaling, as stated in (4) and (5). [sent-56, score-0.201]
28 Thus the scale factor defines a subset of the input space which includes the decision boundary and which becomes increasingly focused in its vicinity as the scale factor tends to infinity. [sent-58, score-0.379]
29 The extent of the region is defined by the bounds on the difference between discriminative densities. [sent-59, score-0.244]
30 In terms of the contrast function this region can be defined as in (6). [sent-60, score-0.194]
31 Since for MCC-training we maximize the empirical contrast, i.e. the corresponding sample average, the scale factor then defines a subset of the training data which has an impact on learning the decision boundary. [sent-62, score-0.16]
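As an illustration of the quantity being maximized, here is a small sketch of one plausible form of the empirical bounded contrast for the binary case; since the exact definitions (3)-(5) are not fully legible in this dump, the clipping bounds, the sign convention and the role of the scale factor are assumptions rather than the paper's formulas.

```python
import numpy as np

def empirical_contrast(dens_own, dens_other, scale=1.0, lower=-1.0):
    """Sample average of the scale-weighted density difference, clipped to [lower, 1].
    dens_own[i]   : discriminative density of example i under its own class
    dens_other[i] : discriminative density of example i under the competing class
    Only examples whose clipped value lies strictly between the bounds (i.e. points
    near the decision boundary) influence the gradient of this objective."""
    diff = scale * (np.asarray(dens_own) - np.asarray(dens_other))
    return np.clip(diff, lower, 1.0).mean()
```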
32 Thus, for an increasing scale factor, the relative size of that subset shrinks. [sent-63, score-0.144]
33 However, for an increasing size of the training set the scale factor can be gradually increased and then, for suitable classes of PDFs, MCCs can approach the Bayes rule. [sent-64, score-0.265]
34 In other words, the scale factor acts as a regularization parameter such that, for particular choices of the PDF class, convergence to the Bayes classifier can be achieved if the quality of the approximation of the loss function is gradually increased for increasing sample sizes. [sent-65, score-0.146]
35 In the following section we shall consider such a class of PDFs which is flexible enough and which turns out to include a certain kind of SVMs. [sent-66, score-0.081]
36 3 MCC-Realizations In the following we shall first consider a particularly useful parametrization of the discriminative densities, which gives rise to classifiers that, in the binary case, have the same functional form as SVMs up to a “missing” bias term in the MCC case. [sent-67, score-0.762]
37 For training of these MCCs we derive a suitable objective function which can be maximized by sequential linear programming, where we show the close relation to the training of Linear Programming Machines. [sent-68, score-0.294]
38 On the other hand if we allow for local variation of the bandwidth we get a complicated contrast which is difficult to maximize due to nonlinear dependencies on the parameters. [sent-72, score-0.278]
39 The same is true if we treat the kernel centers as free parameters. [sent-73, score-0.094]
40 Thus we have class-specific densities with mixing weights which control the contribution of a single training example to the PDF. [sent-75, score-0.497]
41 With that choice we achieve plug-in Bayes-consistency for the case of equal mixing weights, since then we have the usual kernel density estimator (KDE), which, besides some mild assumptions about the distributions, requires a vanishing kernel bandwidth for increasing sample size. [sent-77, score-0.641]
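The sketch below spells out this parametrization as it is described here (the helper names and the binary restriction are illustrative assumptions): each class-conditional discriminative density is a weighted sum of unnormalized Gaussian kernels centred on that class's training points, equal weights recover the kernel density estimator, and in the binary case the sign of the density difference has the SVM-like form sum_i alpha_i y_i k(x, x_i) without a bias term.

```python
import numpy as np

def kernel_matrix(X, centers, h):
    # unnormalized Gaussian kernel exp(-||x - c||^2 / (2 h^2))
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * h ** 2))

def discriminative_density(X, centers, alpha, h):
    """Class-conditional density as a kernel mixture centred on training points;
    `alpha` holds the mixing weights (equal weights give the usual KDE up to scaling)."""
    return kernel_matrix(X, centers, h) @ np.asarray(alpha)

def mcc_decision(X, X_pos, a_pos, X_neg, a_neg, h):
    """Binary MCC decision: sign of the density difference, which has the same
    functional form as an SVM, sum_i alpha_i y_i k(x, x_i), with no bias term;
    the magnitude of `diff` can be read as a density-based confidence measure."""
    diff = (discriminative_density(X, X_pos, a_pos, h)
            - discriminative_density(X, X_neg, a_neg, h))
    return np.sign(diff), diff
```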
42 For notational simplicity, in the following we shall incorporate the scale factor and the mixing weights into a common parameter vector, so that we can write the empirical contrast over the training examples accordingly. [sent-85, score-0.45]
43 Further we define the scaled density difference (10), where the assignment variables realize the maximum function in (4). [sent-86, score-0.323]
44 With fixed assignment variables the objective is concave and maximization with respect to the weights gives rise to a linear optimization problem. [sent-87, score-0.281]
45 On the other hand, for fixed weights, maximization with respect to the assignment variables is achieved by setting them to zero for negative terms. [sent-88, score-0.142]
46 This suggests a sequential linear optimization strategy for overall maximization of the contrast which shall be introduced in detail in the following section. [sent-89, score-0.441]
47 Since we have already incorporated the scale factor into the parameter vector, it is now identified with the norm of that vector. [sent-90, score-0.071]
48 Therefore the scale factor can be adjusted implicitly by a regularization term which penalizes some suitable norm of the parameter vector. [sent-91, score-0.09]
49 Thus a suitable objective function can be defined as in (11), with a constant determining the weight of the penalty. [sent-92, score-0.209]
50 We now consider several instances of the case where the penalty corresponds to some norm of the weight vector. [sent-95, score-0.046]
51 With the 1-norm, for a sufficiently large penalty weight the probability mass of the discriminative densities is concentrated on those two kernel functions which yield the highest average density difference. [sent-96, score-0.753]
52 Although that property forces the sparsest solution for a large enough penalty weight, that solution clearly isn't Bayes-consistent in general because, as pointed out in Sec. [sent-97, score-0.127]
53 2, in that limit all probability mass of the discriminative densities is concentrated at the two points with maximum average density difference. [sent-98, score-0.794]
54 Conversely, taking the 2-norm, which resembles the standard SVM regularizer [10], yields the KDE with equal mixing weights in the limit of a large penalty weight. [sent-99, score-0.055]
55 Indeed, it is easy to see that all p-norm penalties with p > 1 share this convenient property, which guarantees “plug-in” Bayes consistency in the case where the solution is totally determined by the regularizer. [sent-100, score-0.242]
56 In that case kernel density estimators are achieved as the “default” solution. [sent-101, score-0.243]
57 For that kind of penalty we achieve an equal distribution of the weights in the limiting case, which corresponds to the kernel density estimator (KDE) solution. [sent-103, score-0.381]
58 By a suitable choice of the kernel width and the scale of the weights, e.g. [sent-105, score-0.193]
59 Dividing the objective by the penalty weight, subtracting a constant, fixing the scale factor and turning minimization into maximization of the negative objective shows that LPM training corresponds to a special case of MCC training with a fixed scale factor and a 1-norm regularizer. [sent-109, score-0.367]
60 3 Sequential Linear Programming Estimation of the mixing weights is now achieved by maximizing the sample contrast with respect to the weights and the assignment variables. [sent-112, score-0.369]
61 If convergence in contrast, then stop, else proceed with step 2. [sent-125, score-0.162]
62 Here the slack variables measure the part of the density difference which can be charged to the objective function. [sent-126, score-0.269]
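The following sketch illustrates the alternation described above for the binary case, using scipy.optimize.linprog: a linear program over the mixing weights with the current assignment variables fixed, followed by deactivation of examples whose scaled density difference falls below the lower bound. The exact program (11), its slack-variable constraints, the bound values and the penalty weight C are only partially recoverable from this dump, so they are stated here as assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def slpm_binary(K, y, C=0.1, lower=-1.0, max_iter=20, tol=1e-6):
    """Sequential linear programming sketch for a binary MCC.
    K     : n x n Gaussian kernel matrix on the training set
    y     : labels in {-1, +1}
    C     : weight of the (assumed) 1-norm penalty on the mixing weights
    lower : assumed lower bound on the scaled density difference"""
    y = np.asarray(y, dtype=float)
    n = len(y)
    M = (y[:, None] * y[None, :]) * K      # d_i(alpha) = sum_j M_ij alpha_j, the signed density difference
    active = np.ones(n)                    # assignment variables (1 = example charged to the objective)
    alpha = np.zeros(n)
    prev_contrast = -np.inf
    for _ in range(max_iter):
        # LP over z = [alpha, s]: maximize sum_i active_i * s_i - C * sum_j alpha_j
        c = np.concatenate([C * np.ones(n), -active])
        A_ub = np.hstack([-M, np.eye(n)])  # slack constraint: s_i - sum_j M_ij alpha_j <= 0
        b_ub = np.zeros(n)
        bounds = [(0, None)] * n + [(lower, 1.0)] * n
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        alpha = res.x[:n]
        # assignment step: deactivate examples whose density difference hits the lower bound
        d = M @ alpha
        active = (d >= lower).astype(float)
        contrast = np.clip(d, lower, 1.0).mean()
        if abs(contrast - prev_contrast) < tol:   # convergence in contrast
            break
        prev_contrast = contrast
    return alpha, active, contrast
```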
63 Since we used unnormalized Gaussian kernel functions, i.e. [sent-128, score-0.094]
64 we excluded all multiplicative density constants, that constraint doesn't exclude any useful solutions for the weights. [sent-130, score-0.149]
65 4 Experiments In the following section we consider the task of solving binary classification problems within the MCC-framework, using the above SLPM with a Gaussian kernel function. [sent-131, score-0.128]
66 The first experiment illustrates the behaviour of the MCC for different values of the regularization parameter by means of a simple two-dimensional toy dataset. [sent-132, score-0.094]
67 The second experiment compares the classification performance of the MCC with that of the SVM and the Kernel-Density-Classifier (KDC), which is a special case of the MCC with equal weighting of each kernel function. [sent-133, score-0.094]
68 The two-dimensional toy dataset consists of 300 data points, sampled from two overlapping isotropic normal distributions with a fixed mutual distance and standard deviation. [sent-135, score-0.119]
69 Figure 1 shows the solution of the MCC for two different values of the regularization parameter (only data points with non-zero weights according to the criterion are marked by symbols). [sent-136, score-0.131]
70 In both figures, data points with large mixing weights are located near the decision border. [sent-137, score-0.246]
71 In particular for small regularization there are regions of high contrast alongside the decision function (illustrated by isolines). [sent-138, score-0.199]
72 For increasing regularization the number of data points with non-zero weights increases. [sent-139, score-0.095]
73 This illustrates that for increasing regularization the quality of the approximation of the loss function decreases. [sent-143, score-0.054]
74 In both figures, several data points are misclassified with a negative contrast. [sent-144, score-0.203]
75 The MCC identified those data points as outliers and deactivated them during the training (encircled symbols). [sent-145, score-0.185]
76 For this experiment we selected the Pima Indian Diabetes, Breast-Cancer, Heart and Thyroid datasets from the UCI Machine Learning repository. [sent-148, score-0.08]
77 Figure 1: Two MCC solutions for the two-dimensional toy dataset for two different values of the regularization parameter (left and right). [sent-167, score-0.119]
78 The symbols depict the positions of the data points with non-zero weights. [sent-168, score-0.104]
79 Encircled symbols have been deactivated during the training (symbols for deactivated data points are not scaled according to their weight, since in most cases it is zero). [sent-170, score-0.388]
80 The absolute value of the contrast is illustrated by the isolines while the sign of the contrast depicts the binary classification of the classifier. [sent-171, score-0.413]
81 The region which corresponds to the set defined in (6) is colored white and its complement is colored gray. [sent-172, score-0.088]
82 The percentage of data points that define the solution is (left figure) and (right figure) of the dataset. [sent-173, score-0.149]
83 Since we used the Gaussian kernel function for all classifiers, all three algorithms are parametrized by the bandwidth. [sent-180, score-0.236]
84 Additionally, for the SVM and MCC the regularization value had to be chosen. [sent-181, score-0.055]
85 The optimal parametrization was chosen by estimating the generalization performance for different values of bandwidth and regularization by means of the average test error on the first five dataset partitions. [sent-182, score-0.383]
86 More precisely, a first coarse scan was performed, followed by a fine scan in the interval near the optimal values of the first one. [sent-183, score-0.176]
87 Each scan considered 1600 different combinations of bandwidth and regularization value. [sent-184, score-0.058]
88 For parameter pairs with identical test error, the pair yielding the sparsest solution was kept. [sent-186, score-0.091]
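A minimal sketch of this two-stage model selection is given below; the callback train_eval, the geometric grids and the width of the fine interval around the coarse optimum are hypothetical placeholders, not the settings used in the paper.

```python
import numpy as np

def scan(grid_h, grid_C, partitions, train_eval):
    """Evaluate every (bandwidth, regularization) pair by its average test error on the
    given partitions; train_eval(h, C, part) is assumed to return (test_error, sparsity),
    where sparsity is the fraction of non-zero weights.  Ties in the error are broken
    in favour of the sparser solution."""
    best = None
    for h in grid_h:
        for C in grid_C:
            errs, sparsities = zip(*(train_eval(h, C, p) for p in partitions))
            key = (np.mean(errs), np.mean(sparsities))
            if best is None or key < best[0]:
                best = (key, h, C)
    return best[1], best[2]

def coarse_to_fine(h_range, C_range, partitions, train_eval, n=40):
    # coarse scan over n x n = 1600 combinations, then a fine scan around the coarse optimum
    h0, C0 = scan(np.geomspace(h_range[0], h_range[1], n),
                  np.geomspace(C_range[0], C_range[1], n), partitions, train_eval)
    return scan(np.geomspace(h0 / 2, h0 * 2, n),
                np.geomspace(C0 / 2, C0 * 2, n), partitions, train_eval)
```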
89 Table 1 shows the optimal parametrization of the MCC in combination with the classification rate and the sparseness of the solution (measured as the percentage of non-zero weights). [sent-190, score-0.272]
90 The last two columns show the absolute number of iterations and the final number of deactivated examples. [sent-192, score-0.111]
91 In particular for the Heart, Breast-Cancer and Diabetes datasets the solution of the MCC is significantly sparser than that of the SVM (see Tab. [sent-194, score-0.116]
92 5 Conclusion The MCC approach provides an understanding of SVMs / LPMs in terms of generative modelling using discriminative densities. [sent-198, score-0.215]
93 While usual unsupervised density estimation schemes try to minimize some distance criterion (e.g. the Kullback-Leibler divergence) between the models and the true densities, MC-estimation aims at learning densities which represent the differences of the underlying distributions in a way that is optimal for classification. [sent-199, score-0.185]
94 Table 1: Optimal parametrization, classification rate, percentage of non-zero weights, number of iterations of the MCC and number of deactivated examples. [sent-201, score-0.207]
95 The results are averaged over all 100 dataset partitions. [sent-202, score-0.08]
96 For the classification rate and the percentage of non-zero coefficients the corresponding value after the first MCC iteration is given in brackets. [sent-203, score-0.072]
97 Given are the classification rates with the percentage of non-zero weights (in brackets). [sent-239, score-0.072]
99 Future work will address the investigation of the general multiclass performance and the capability to cope with mislabeled data. [sent-263, score-0.065]
100 Fast training of support vector machines using sequential minimal optimization. [sent-315, score-0.142]
wordName wordTfidf (topN-words)
[('mcc', 0.526), ('densities', 0.296), ('discriminative', 0.215), ('mccs', 0.194), ('classi', 0.184), ('contrast', 0.162), ('density', 0.149), ('bielefeld', 0.144), ('maximization', 0.142), ('parametrization', 0.135), ('bayes', 0.121), ('kde', 0.12), ('mixing', 0.114), ('deactivated', 0.111), ('lpms', 0.111), ('hy', 0.096), ('kernel', 0.094), ('svm', 0.091), ('pdfs', 0.088), ('bandwidth', 0.084), ('kdc', 0.083), ('slpm', 0.083), ('diabetes', 0.082), ('risk', 0.081), ('dataset', 0.08), ('thyroid', 0.072), ('percentage', 0.072), ('er', 0.07), ('heart', 0.07), ('programming', 0.07), ('cation', 0.067), ('multiclass', 0.065), ('symbols', 0.063), ('isn', 0.061), ('differences', 0.059), ('parametrized', 0.058), ('neuroinformatics', 0.058), ('scan', 0.058), ('encircled', 0.055), ('isolines', 0.055), ('keerthi', 0.055), ('sparsest', 0.055), ('sequential', 0.055), ('regularizer', 0.055), ('regularization', 0.055), ('increasing', 0.054), ('weights', 0.054), ('machines', 0.054), ('objective', 0.052), ('suitable', 0.051), ('ers', 0.05), ('shall', 0.049), ('scale', 0.048), ('svms', 0.048), ('lpm', 0.048), ('helge', 0.048), ('meinicke', 0.048), ('twellmann', 0.048), ('concentrated', 0.047), ('penalty', 0.046), ('mass', 0.046), ('realize', 0.044), ('colored', 0.044), ('benchmark', 0.043), ('factor', 0.042), ('points', 0.041), ('apriori', 0.041), ('datapoints', 0.041), ('assignment', 0.039), ('toy', 0.039), ('xu', 0.039), ('unbounded', 0.039), ('doesn', 0.039), ('slack', 0.039), ('estimator', 0.038), ('consistency', 0.038), ('decision', 0.037), ('gradually', 0.037), ('solution', 0.036), ('usual', 0.036), ('notational', 0.035), ('tsch', 0.035), ('platt', 0.035), ('binary', 0.034), ('concave', 0.034), ('rise', 0.033), ('germany', 0.033), ('optimization', 0.033), ('training', 0.033), ('maximize', 0.032), ('certain', 0.032), ('coarse', 0.031), ('design', 0.031), ('misclassi', 0.03), ('shift', 0.03), ('delta', 0.03), ('scaled', 0.029), ('difference', 0.029), ('optimal', 0.029), ('incorporated', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
2 0.1584864 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
Author: Yves Grandvalet, Stéphane Canu
Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.
3 0.15370266 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
4 0.14948842 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.14211817 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
Author: Peter Meinicke, Matthias Kaper, Florian Hoppe, Manfred Heumann, Helge Ritter
Abstract: In this paper we present results of a study on brain computer interfacing. We adopted an approach of Farwell & Donchin [4], which we tried to improve in several aspects. The main objective was to improve the transfer rates based on offline analysis of EEG-data but within a more realistic setup closer to an online realization than in the original studies. The objective was achieved along two different tracks: on the one hand we used state-of-the-art machine learning techniques for signal classification and on the other hand we augmented the data space by using more electrodes for the interface. For the classification task we utilized SVMs and, as motivated by recent findings on the learning of discriminative densities, we accumulated the values of the classification function in order to combine several classifications, which finally lead to significantly improved rates as compared with techniques applied in the original work. In combination with the data space augmentation, we achieved competitive transfer rates at an average of 50.5 bits/min and with a maximum of 84.7 bits/min.
6 0.12981148 119 nips-2002-Kernel Dependency Estimation
7 0.11854792 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
8 0.11819243 45 nips-2002-Boosted Dyadic Kernel Discriminants
9 0.11708593 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
10 0.11405857 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
11 0.10979089 114 nips-2002-Information Regularization with Partially Labeled Data
12 0.1091973 92 nips-2002-FloatBoost Learning for Classification
13 0.10878997 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
14 0.10097966 120 nips-2002-Kernel Design Using Boosting
15 0.0998137 156 nips-2002-On the Complexity of Learning the Kernel Matrix
16 0.099199116 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
17 0.09816356 106 nips-2002-Hyperkernels
18 0.097290777 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
19 0.092736721 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
20 0.090322383 138 nips-2002-Manifold Parzen Windows
topicId topicWeight
[(0, -0.254), (1, -0.159), (2, 0.097), (3, -0.067), (4, 0.179), (5, -0.031), (6, -0.018), (7, -0.01), (8, 0.032), (9, 0.026), (10, -0.016), (11, 0.036), (12, 0.063), (13, 0.013), (14, 0.136), (15, -0.05), (16, -0.029), (17, 0.042), (18, -0.004), (19, -0.008), (20, 0.008), (21, 0.021), (22, -0.044), (23, -0.044), (24, -0.005), (25, -0.064), (26, -0.078), (27, -0.012), (28, -0.067), (29, -0.004), (30, -0.02), (31, -0.109), (32, 0.013), (33, -0.047), (34, -0.007), (35, 0.075), (36, -0.06), (37, 0.103), (38, -0.003), (39, -0.043), (40, -0.018), (41, 0.008), (42, 0.156), (43, 0.039), (44, 0.126), (45, 0.122), (46, 0.024), (47, -0.01), (48, -0.116), (49, -0.142)]
simIndex simValue paperId paperTitle
same-paper 1 0.94800621 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
2 0.76770991 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
Author: Peter Meinicke, Matthias Kaper, Florian Hoppe, Manfred Heumann, Helge Ritter
Abstract: In this paper we present results of a study on brain computer interfacing. We adopted an approach of Farwell & Donchin [4], which we tried to improve in several aspects. The main objective was to improve the transfer rates based on offline analysis of EEG-data but within a more realistic setup closer to an online realization than in the original studies. The objective was achieved along two different tracks: on the one hand we used state-of-the-art machine learning techniques for signal classification and on the other hand we augmented the data space by using more electrodes for the interface. For the classification task we utilized SVMs and, as motivated by recent findings on the learning of discriminative densities, we accumulated the values of the classification function in order to combine several classifications, which finally lead to significantly improved rates as compared with techniques applied in the original work. In combination with the data space augmentation, we achieved competitive transfer rates at an average of 50.5 bits/min and with a maximum of 84.7 bits/min.
3 0.64700872 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
4 0.61104184 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
5 0.60453635 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
6 0.59883869 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
7 0.58949733 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
8 0.58887136 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
9 0.56442434 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
10 0.56285346 138 nips-2002-Manifold Parzen Windows
11 0.5585897 114 nips-2002-Information Regularization with Partially Labeled Data
12 0.55636412 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
13 0.54735011 55 nips-2002-Combining Features for BCI
14 0.53193128 111 nips-2002-Independent Components Analysis through Product Density Estimation
15 0.52256256 92 nips-2002-FloatBoost Learning for Classification
16 0.52021599 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
17 0.47470462 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
18 0.47397894 67 nips-2002-Discriminative Binaural Sound Localization
19 0.44176254 119 nips-2002-Kernel Dependency Estimation
20 0.43845475 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
topicId topicWeight
[(11, 0.023), (23, 0.044), (42, 0.093), (54, 0.137), (55, 0.053), (67, 0.012), (68, 0.032), (74, 0.112), (80, 0.22), (87, 0.011), (92, 0.07), (98, 0.123)]
simIndex simValue paperId paperTitle
same-paper 1 0.83717322 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
2 0.73433852 3 nips-2002-A Convergent Form of Approximate Policy Iteration
Author: Theodore J. Perkins, Doina Precup
Abstract: We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a “policy improvement operator” to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces -soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
3 0.73197347 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
4 0.73143232 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.73069143 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
Author: Peter Sykacek, Stephen J. Roberts
Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
6 0.73060226 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
7 0.72974718 10 nips-2002-A Model for Learning Variance Components of Natural Images
8 0.72951055 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
9 0.72384477 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
10 0.72377002 53 nips-2002-Clustering with the Fisher Score
11 0.72280586 27 nips-2002-An Impossibility Theorem for Clustering
12 0.72134483 2 nips-2002-A Bilinear Model for Sparse Coding
13 0.72009563 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
14 0.71940041 169 nips-2002-Real-Time Particle Filters
15 0.71933049 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
16 0.71907711 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
17 0.71810013 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
18 0.71725428 124 nips-2002-Learning Graphical Models with Mercer Kernels
19 0.71546072 46 nips-2002-Boosting Density Estimation
20 0.71465737 74 nips-2002-Dynamic Structure Super-Resolution