nips nips2002 nips2002-24 knowledge-graph by maker-knowledge-mining

24 nips-2002-Adaptive Scaling for Feature Selection in SVMs


Source: pdf

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. [sent-5, score-0.287]

2 Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. [sent-6, score-0.62]

3 The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. [sent-7, score-0.641]

4 Feature selection is achieved by constraints encouraging the sparsity of scale factors. [sent-8, score-0.35]

5 The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem. [sent-9, score-0.759]

6 1 Introduction In pattern recognition, the problem of selecting relevant variables is difficult. [sent-10, score-0.148]

7 Optimal subset selection is attractive as it yields simple and interpretable models, but it is a combinatorial and notoriously unstable procedure [2]. [sent-11, score-0.186]

8 In some problems, it may be better to resort to stable procedures penalizing irrelevant variables. [sent-12, score-0.165]

9 The relevance of input features may be measured by continuous weights or scale factors, which define a diagonal metric in input space. [sent-14, score-0.528]

10 Feature selection then consists in determining a sparse diagonal metric, and sparsity can be encouraged by constraining an appropriate norm on the scale factors. [sent-15, score-0.394]

11 Our approach can be summarized by the setting of a global optimization problem pertaining to 1) the parameters of the SVM classifier, and 2) the parameters of the feature space mapping defining the metric in input space. [sent-16, score-0.63]

12 As in standard SVMs, only two tunable hyper-parameters are to be set: the penalization of training errors, and the magnitude of kernel bandwidths. [sent-17, score-0.181]

13 In this formalism we derive an efficient algorithm to monitor slack variables when optimizing the metric. [sent-18, score-0.219]

14 After presenting previous approaches to hard and soft feature selection procedures in the context of SVMs, we present our algorithm. [sent-20, score-0.381]

15 2 Feature Selection via adaptive scaling Scaling is a usual preprocessing step, which has important outcomes in many classification methods including SVM classifiers [9, 3]. [sent-22, score-0.263]

16 It is defined by a linear transformation within the input space, x → diag(σ)x, where diag(σ) is the diagonal matrix built from the vector σ of scale factors. [sent-23, score-0.145]

17  ¢  %$#" ¤ ¡ ¤ ¢ ¦¥£ ¡ ¢ §¤  ©¨ Adaptive scaling consists in letting to be adapted during the estimation process with the explicit aim of achieving a better recognition rate. [sent-25, score-0.243]

18 According to the structural risk minimization principle [8], the scale factors can be tuned in two ways: 1. [sent-27, score-0.222]

19 estimate the parameters of the classifier by empirical risk minimization for several values of the scale factors, producing a structure of classifiers multi-indexed by these values. [sent-28, score-0.402]

20 Select one element of the structure by finding the set of scale factors minimizing some estimate of generalization error. [sent-29, score-0.226]

21 estimate the parameters of the classifier and the scaling hyper-parameters by empirical risk minimization, while a second-level hyper-parameter constrains the latter in order to avoid overfitting. [sent-31, score-0.332]

22 This procedure produces a structure of classifiers indexed by the second-level hyper-parameter, whose value is computed by minimizing some estimate of generalization error. [sent-32, score-0.226]

23 & 7 2%3( 6 ' 1 0  2%0"( 5 ' © 1  & 2 1 0 ( %")  ' & © 2 10 4%3(   ' 2 10 4%3(   ' 7 The usual paradigm consists in computing the estimate of generalization error for regularly spaced hyper-parameter values and picking the best solution among all trials. [sent-33, score-0.319]

24 Several authors suggested addressing this problem by optimizing an estimate of generalization error with respect to the hyper-parameters. [sent-35, score-0.397]

25 [4] first proposed to apply an iterative optimization scheme to estimate a single kernel width hyper-parameter. [sent-37, score-0.224]

26 [3] generalized this approach to multiple hyper-parameters in order to perform adaptive scaling and variable selection. [sent-40, score-0.159]

27 However, relying on the optimization of generalization error estimates over many hyper-parameters is hazardous. [sent-42, score-0.231]

28 Once optimized, the unbiased estimates become down-biased, and the bounds provided by VC-theory usually hold for kernels defined a priori (see the proviso on the radius/margin bound in [8]). [sent-43, score-0.115]

29 In the second solution considered here, the estimate of generalization error is minimized with respect to a single (second-level) hyper-parameter, which constrains the scale factors. [sent-45, score-0.327]

30 The role of this constraint is twofold: control the complexity of the classifier, and encourage variable selection in input space. [sent-46, score-0.356]

31 Note that this type of optimization procedure has been proposed for linear SVMs in both frequentist [1] and Bayesian [6] frameworks. [sent-48, score-0.166]

32 The parameters of the SVM classifier are obtained by solving the standard soft-margin problem: minimize (1/2)‖w‖² + C Σᵢ ξᵢ, subject to yᵢ(⟨w, φ(xᵢ)⟩ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, where φ denotes the feature space mapping and the ξᵢ are the slack variables. [sent-52, score-0.266]

33 In this problem setting, the penalization parameter C and the parameters of the feature space mapping (typically a kernel bandwidth) are tunable hyper-parameters which need to be determined by the user. [sent-53, score-0.332]

34 2 A global optimization problem In [9, 3], adaptive scaling is performed by iteratively finding the parameters of the SVM classifier for a fixed value of the scale factors and minimizing a bound on the estimate of generalization error with respect to the scale factors, treated as hyper-parameters. [sent-55, score-0.786]

35 The algorithm minimizes 1) the SVM empirical criterion with respect to parameters and 2) an estimate of generalization error with respect to hyper-parameters. [sent-56, score-0.476]

36 In the present approach, we avoid the enlargement of the set of hyper-parameters by letting the scale factors be standard parameters of the classifier. [sent-57, score-0.1]

37 The constraint on the scale factors defines the single hyper-parameter of the learning process related to the scaling variables. [sent-59, score-0.104]

38 In the limit where the exponent of the constraint goes to zero, the constraint counts the number of non-zero scale parameters, resulting in a hard selection procedure. [sent-65, score-0.345]

39 This choice might seem appropriate for our purpose, but it amounts to attempting to solve a highly non-convex optimization problem, where the number of local minima grows exponentially with the input dimension. [sent-66, score-0.142]

40 To avoid this problem, we suggest using the smallest value of the exponent for which the problem remains convex with the linear mapping. [sent-67, score-0.105]

41 Indeed, for linear kernels, the constraint on the scale factors amounts to minimizing the standard SVM criterion with the penalization of the ℓ2 norm of the weights replaced by a penalization of their ℓ1 norm. [sent-68, score-0.347]
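
For the linear-kernel special case described above, the net effect (an ℓ1-type penalty on the effective weights, hence exact zeros) can be illustrated with an off-the-shelf ℓ1-penalized linear SVM; the sketch below uses scikit-learn and synthetic data, both assumptions of this note rather than the authors' setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(200, 52)                               # 52 features, few of them relevant
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(200))

# the l1 penalty on the weight vector drives most coefficients to exactly zero
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(np.abs(clf.coef_.ravel()) > 1e-8)
print("features kept:", selected)
```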

42 For non-linear kernels however, the two solutions differ notably since the present algorithm modifies the metric in input space, while the SVM classifier modifies the metric in feature space. [sent-70, score-0.573]

43 3 An alternated optimization scheme Problem (3) is complex; we propose to iteratively solve a series of simpler problems. [sent-73, score-0.2]

44 The cost function is first optimized with respect to the classifier parameters for a fixed mapping (the standard SVM problem). [sent-74, score-0.216]

45 Then, the parameters of the feature space mapping are optimized while some characteristics of the classifier are kept fixed: at each step, starting from the current value of the scale factors, the optimal classifier parameters are computed. [sent-75, score-0.278]

46 In this scheme, the parameters of the SVM classifier are computed by solving the standard quadratic optimization problem (2). [sent-77, score-0.21]
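
The alternated scheme can be sketched as follows. This is a deliberately simplified illustration: it uses scikit-learn's SVC with a precomputed kernel, a numerical gradient, and a plain projected-gradient step with a crude rescaling standing in for the paper's norm constraint and reduced conjugate gradient; `scaled_rbf_kernel` is the sketch given earlier, and `sigma0`, `step`, `n_outer` are illustrative parameters.

```python
import numpy as np
from sklearn.svm import SVC

def primal_cost(clf, K, y, C):
    """0.5 * ||w||^2 + C * sum of slacks for kernel K, with the dual
    coefficients of clf held fixed (labels y assumed to be in {-1, +1})."""
    sv = clf.support_
    dc = clf.dual_coef_.ravel()                      # y_i * alpha_i on support vectors
    w_norm2 = dc @ K[np.ix_(sv, sv)] @ dc
    slacks = np.maximum(0.0, 1.0 - y * clf.decision_function(K))
    return 0.5 * w_norm2 + C * slacks.sum()

def adaptive_scaling(X, y, C=1.0, sigma0=2.0, n_outer=20, step=0.1, eps=1e-4):
    d = X.shape[1]
    sigma = np.full(d, sigma0 / d)                   # uniform admissible starting point
    clf = None
    for _ in range(n_outer):
        K = scaled_rbf_kernel(X, X, sigma)
        clf = SVC(C=C, kernel="precomputed").fit(K, y)   # step 1: standard SVM
        base = primal_cost(clf, K, y, C)
        grad = np.empty(d)                               # step 2: move the scale factors,
        for k in range(d):                               # multipliers kept fixed
            s = sigma.copy()
            s[k] += eps
            grad[k] = (primal_cost(clf, scaled_rbf_kernel(X, X, s), y, C) - base) / eps
        sigma = np.clip(sigma - step * grad, 0.0, None)  # keep the scale factors >= 0
        if sigma.sum() > sigma0:                         # crude stand-in for the constraint
            sigma *= sigma0 / sigma.sum()
    return sigma, clf
```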

47 For solving the minimization problem with respect to the scale factors, we use a reduced conjugate gradient technique. [sent-80, score-0.418]

48 The optimization problem was simplified by assuming that some of the other variables are fixed. [sent-81, score-0.197]

49 We tried several versions, keeping fixed respectively: 1) the classifier parameters; 2) the Lagrange multipliers; 3) the set of support vectors. [sent-82, score-0.34]

50 For the three versions, the optimal solution, or at least the optimal value of the slack variables, can be obtained by solving a linear program, whose optimum is computed directly (in a single iteration). [sent-83, score-0.208]
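
When the classifier is held fixed, the optimal slack variables indeed have a direct solution (the linear program collapses to an elementwise maximum); a minimal sketch:

```python
import numpy as np

def optimal_slacks(decision_values, y):
    """Smallest xi_i >= 0 satisfying y_i * f(x_i) >= 1 - xi_i for fixed decision values."""
    return np.maximum(0.0, 1.0 - y * decision_values)

# example: the second and third points violate the margin
xi = optimal_slacks(np.array([1.5, 0.2, -0.3]), np.array([1, 1, -1]))
# -> array([0. , 0.8, 0.7])
```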

51 4 Scaling parameters update Starting from an initial solution, our goal is to update the scale factors by solving a simple intermediate problem providing an improved solution to the global problem (3). [sent-87, score-0.392]

52 We first assume that the Lagrange multipliers defining the classifier are not affected by the updates, so that the updated classifier is defined by the current multipliers together with the new scale factors. [sent-88, score-0.179]

53 Regarding problem (3), this solution is sub-optimal when the scale factors vary; nevertheless it is guaranteed to be an admissible solution. [sent-89, score-0.153]

54 Hence, we minimize an upper bound of the original primal cost, which guarantees that any admissible update of the intermediate problem (one providing a decrease of its cost) also provides a decrease of the cost of the original problem. [sent-90, score-0.565]

55 The intermediate optimization problem is then stated formally as problem (4). [sent-91, score-0.249]

56 Hence, as stated above, the scale factors will be updated by a descent algorithm. [sent-93, score-0.155]

57 The latter requires the evaluation of the cost and of its gradient with respect to the scale factors. [sent-94, score-0.189]

58 The scale parameters are then updated by a reduced conjugate gradient technique, i.e. [sent-102, score-0.165]

59 a conjugate gradient algorithm ensuring that the constraints on the scale factors are always verified. [sent-104, score-0.12]
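
For the version that keeps the Lagrange multipliers fixed, the gradient of the margin term with respect to each scale factor follows from the derivative of the kernel; the sketch below assumes the scaled Gaussian kernel defined earlier and only covers the regularization term (the slack term contributes a similar expression through the decision values), so it illustrates the gradient computation rather than the paper's full reduced conjugate gradient.

```python
import numpy as np

def margin_term_gradient(X_sv, dual_coef, sigma, gamma=1.0):
    """Gradient of 0.5 * dc^T K dc w.r.t. the scale factors, for
    K_ij = exp(-gamma * sum_k sigma_k^2 (x_ik - x_jk)^2), with the dual
    coefficients dc (= y_i * alpha_i on the support vectors) held fixed."""
    K = scaled_rbf_kernel(X_sv, X_sv, sigma, gamma)
    grad = np.empty_like(sigma)
    for k, s_k in enumerate(sigma):
        D2 = (X_sv[:, k, None] - X_sv[None, :, k]) ** 2   # pairwise squared differences
        dK = -2.0 * gamma * s_k * D2 * K                   # elementwise kernel derivative
        grad[k] = 0.5 * dual_coef @ dK @ dual_coef
    return grad
```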

60 5 Updating Lagrange multipliers Assume now that only the set of support vectors remains fixed while optimizing the scale factors. [sent-106, score-0.414]

61 This assumption is used to derive a rule to update, at reasonable computing cost, the Lagrange multipliers together with the scale factors, by computing the derivatives of the multipliers with respect to the scale factors: 1. [sent-107, score-0.27]

62 for support vectors of the first category, the margin constraint is satisfied with equality (7); 2. [sent-110, score-0.321]

63 for support vectors of the second category (such that the corresponding multiplier sits at its upper bound). [sent-111, score-0.321]

64 From these equations, and the assumption that support vectors remain support vectors (and that their category does not change), one derives a system of linear equations defining the derivatives of the Lagrange multipliers and of the bias with respect to the scale factors [3]: 1. [sent-112, score-0.583]

65 for support vectors of the first category; 2. [sent-113, score-0.321]

66 for support vectors of the second category (8); 3. [sent-114, score-0.321]

67 Finally, the system is completed by stating that the Lagrange multipliers should obey the equality constraint (9). The solution is updated from these equations, and the step size is limited to ensure that the bounds on the multipliers remain satisfied for support vectors of the first category. [sent-115, score-0.442]

68 Hence, in this version, the updated solution is also an admissible sub-optimal solution regarding problem (3). [sent-116, score-0.229]
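
A hedged sketch of the derivative computation behind this update, in the spirit of [3]: assuming all support vectors are of the first category and that the set does not change, the optimality conditions K beta + b 1 = y and 1^T beta = 0 (with beta_i = y_i alpha_i) can be differentiated with respect to one scale factor, giving a linear system for the derivatives; bounded support vectors, handled separately in the paper, are ignored here.

```python
import numpy as np

def multiplier_derivatives(K_sv, dK_sv, beta):
    """Solve the linearized optimality conditions
        K_sv d(beta) + dK_sv beta + d(b) 1 = 0,   1^T d(beta) = 0
    for d(beta)/d(sigma_k) and d(b)/d(sigma_k), where dK_sv is the derivative
    of the kernel matrix on the support vectors w.r.t. the chosen scale factor."""
    n = K_sv.shape[0]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = K_sv
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    rhs = np.concatenate([-dK_sv @ beta, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:n], sol[n]
```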

69 4 Experiments In the experiments reported below, we used the exponent advocated above for the constraint on the scale factors in (3). [sent-117, score-0.124]

70 The scale parameters were optimized with the last version, where the set of support vectors is assumed to be fixed. [sent-118, score-0.413]

71 Although the value of the bound itself was not a faithful estimate of test error, the average loss induced by using the minimizer of these bounds was quite small. [sent-120, score-0.173]

72 The authors of [9] compared two versions of their feature selection algorithm to standard SVMs and to filter methods (i.e. [sent-123, score-0.37]

73 preprocessing methods selecting features either based on Pearson correlation coefficients, Fisher criterion score, or the Kolmogorov-Smirnov statistic). [sent-125, score-0.149]

74 Their artificial data benchmarks provide a basis for comparing our approach with theirs, which is based on the minimization of error bounds. [sent-126, score-0.234]

75 In the nonlinear problem, two features out of 52 are relevant. [sent-129, score-0.104]

76 For each distribution, 30 experiments are conducted, and the average test recognition rate measures the performance of each method. [sent-130, score-0.13]

77 Our test performances are qualitatively similar to the ones obtained by gradient descent on the radius/margin bound in [9], which are only improved by the forward selection algorithm minimizing the span bound. [sent-146, score-0.512]

78 These figures show that although our feature selection scheme is effective, it should be more stringent: a smaller value of the scaling hyper-parameter would be more appropriate for this type of problem. [sent-153, score-0.355]

79 Regarding training times, the optimization of the scale factors required on average over 100 times more computing time than standard SVM fitting for the linear problem, and 40 times more for the nonlinear problem. [sent-156, score-0.197]

80 These increases scale less than linearly with the number of variables, and certainly leave room for improvement. [sent-157, score-0.097]

81 2 Expression recognition We also tested our algorithm on a more demanding task to test its ability to handle a large number of features. [sent-159, score-0.19]

82 The considered problem consists in recognizing the happiness expression among the five other facial expressions corresponding to universal emotions (disgust, sadness, fear, anger, and surprise). [sent-160, score-0.469]

83 The data sets are made of gray level images of frontal faces, with standardized positions of eyes, nose and mouth. [sent-161, score-0.123]

84 The recognition rate is not significantly affected by our feature selection scheme (10 errors), but more than 1300 pixels are considered completely irrelevant at the end of the iterative procedure (estimating the scale factors required about 80 times more computing time than standard SVM fitting). [sent-167, score-0.574]

85 This selection brings some important clues for building relevant attributes for the facial expression recognition task. [sent-168, score-0.549]

86 Figure 2 represents the scaling factors, where black is zero and white represents the highest value. [sent-169, score-0.189]
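
Since each pixel carries its own scale factor, the learned relevance map can be displayed directly as a gray-level mask; a minimal sketch (the image height and width are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

def show_relevance_mask(sigma, height, width):
    """Scale factors shown as an image: black = 0 (irrelevant pixel), white = largest value."""
    mask = np.asarray(sigma).reshape(height, width)
    plt.imshow(mask, cmap="gray", vmin=0.0, vmax=mask.max())
    plt.axis("off")
    plt.show()
```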

87 We see that, according to the classifier, the relevant areas for recognizing the happiness expression are mainly in the mouth area, especially on the mouth wrinkles, and to a lesser extent in the white of the eyes (which detects open eyes) and the outer eyebrows. [sent-170, score-0.575]

88 On the right-hand side of this figure, we displayed masked support faces, i.e. support faces filtered by the expression mask. [sent-171, score-0.249]

89 Although we lost many important features regarding the identity of people, the expression is still visible on these faces. [sent-174, score-0.247]

90 Areas irrelevant for the recognition task (forehead, nose, and upper cheeks) have been erased or softened by the expression mask. [sent-175, score-0.29]

91 5 Conclusion We have introduced a method to perform automatic relevance determination and feature selection in nonlinear SVMs. [sent-176, score-0.542]

92 Our approach considers that the metric in input space defines a set of parameters of the SVM classifier. [sent-177, score-0.267]

93 The update of the scale factors is performed by iteratively minimizing an approximation of the SVM cost. [sent-178, score-0.355]

94 The latter is efficiently minimized with respect to slack variables when the metric varies. [sent-179, score-0.372]

95 The approximation of the cost function is tight enough to allow large updates of the metric when necessary. [sent-180, score-0.309]

96 Figure 2: Left: expression mask of happiness provided by the scaling factors; Right, top row: the two positive masked support faces; Right, bottom row: four negative masked support faces. [sent-182, score-0.94]

97 © Preliminary experimental results show that the method provides sensible results in a reasonable time, even in very high dimensional spaces, as illustrated on a facial expression recognition task. [sent-183, score-0.318]

98 In terms of test recognition rates, our method is comparable with [9, 3]. [sent-184, score-0.13]

99 The resulting algorithm would differ from [9, 3], since the relative relevance of each feature (as measured by the scale factors) would be estimated by empirical risk minimization, instead of being driven by an estimate of generalization error. [sent-188, score-0.544]

100 Feature selection via concave minimization and support vector machines. [sent-194, score-0.454]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svm', 0.296), ('selection', 0.186), ('metric', 0.172), ('classi', 0.154), ('support', 0.141), ('happiness', 0.136), ('multipliers', 0.133), ('minimization', 0.127), ('ae', 0.12), ('penalization', 0.118), ('expression', 0.117), ('feature', 0.117), ('facial', 0.115), ('category', 0.114), ('weston', 0.11), ('relevance', 0.109), ('lagrange', 0.108), ('masked', 0.108), ('scaling', 0.104), ('admissible', 0.1), ('scale', 0.097), ('risk', 0.095), ('eyes', 0.095), ('slack', 0.095), ('optimization', 0.094), ('er', 0.092), ('compi', 0.09), ('gne', 0.09), ('generalization', 0.09), ('chapelle', 0.09), ('irrelevant', 0.087), ('recognition', 0.086), ('factors', 0.085), ('determination', 0.08), ('svms', 0.079), ('nose', 0.079), ('estimate', 0.078), ('procedures', 0.078), ('ning', 0.076), ('cost', 0.076), ('regarding', 0.076), ('faces', 0.075), ('optimizing', 0.074), ('frequentist', 0.072), ('springer', 0.071), ('ers', 0.067), ('mouth', 0.067), ('france', 0.067), ('sparsity', 0.067), ('versions', 0.067), ('vectors', 0.066), ('kernels', 0.064), ('cristianini', 0.064), ('tunable', 0.063), ('solving', 0.063), ('optimized', 0.062), ('conjugate', 0.062), ('descent', 0.062), ('constraint', 0.062), ('subject', 0.062), ('update', 0.061), ('encourage', 0.06), ('demanding', 0.06), ('benchmarks', 0.06), ('et', 0.06), ('gradient', 0.058), ('usual', 0.058), ('minimizing', 0.058), ('completed', 0.057), ('constrains', 0.057), ('adaptive', 0.055), ('empirical', 0.055), ('respect', 0.055), ('features', 0.054), ('intermediate', 0.054), ('iteratively', 0.054), ('letting', 0.053), ('performances', 0.053), ('problem', 0.053), ('mapping', 0.052), ('scheme', 0.052), ('bound', 0.051), ('variables', 0.05), ('nonlinear', 0.05), ('criterion', 0.049), ('stated', 0.048), ('recognizing', 0.048), ('input', 0.048), ('parameters', 0.047), ('decrease', 0.047), ('error', 0.047), ('affected', 0.046), ('picking', 0.046), ('preprocessing', 0.046), ('updated', 0.045), ('relevant', 0.045), ('de', 0.045), ('test', 0.044), ('constraining', 0.044), ('images', 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

2 0.23481631 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

Author: Sepp Hochreiter, Klaus Obermayer

Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1

3 0.21258283 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger

Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), with d ≪ n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1

4 0.18237764 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

Author: M.S. Bartlett, G.C. Littlewort, T.J. Sejnowski, J.R. Movellan

Abstract: We present ongoing work on a project for automatic recognition of spontaneous facial actions. Spontaneous facial expressions differ substantially from posed expressions, similar to how continuous, spontaneous speech differs from isolated words produced on command. Previous methods for automatic facial expression recognition assumed images were collected in controlled environments in which the subjects deliberately faced the camera. Since people often nod or turn their heads, automatic recognition of spontaneous facial behavior requires methods for handling out-of-image-plane head rotations. Here we explore an approach based on 3-D warping of images into canonical views. We evaluated the performance of the approach as a front-end for a spontaneous expression recognition system using support vector machines and hidden Markov models. This system employed general purpose learning mechanisms that can be applied to recognition of any facial movement. The system was tested for recognition of a set of facial actions defined by the Facial Action Coding System (FACS). We showed that 3D tracking and warping followed by machine learning techniques directly applied to the warped images, is a viable and promising technology for automatic facial expression recognition. One exciting aspect of the approach presented here is that information about movement dynamics emerged out of filters which were derived from the statistics of images.

5 0.1584864 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter

Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.

6 0.15786323 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

7 0.15182546 156 nips-2002-On the Complexity of Learning the Kernel Matrix

8 0.14910606 90 nips-2002-Feature Selection in Mixture-Based Clustering

9 0.13782567 92 nips-2002-FloatBoost Learning for Classification

10 0.13742982 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

11 0.13655664 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

12 0.13651419 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

13 0.13190666 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

14 0.13150051 89 nips-2002-Feature Selection by Maximum Marginal Diversity

15 0.1310288 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging

16 0.13056242 45 nips-2002-Boosted Dyadic Kernel Discriminants

17 0.12941515 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

18 0.12875979 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

19 0.12442579 106 nips-2002-Hyperkernels

20 0.11606421 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.359), (1, -0.173), (2, 0.093), (3, -0.037), (4, 0.153), (5, -0.017), (6, -0.012), (7, -0.045), (8, 0.008), (9, 0.092), (10, -0.026), (11, 0.09), (12, 0.041), (13, 0.057), (14, 0.155), (15, -0.114), (16, -0.029), (17, 0.001), (18, 0.008), (19, 0.078), (20, -0.06), (21, -0.037), (22, 0.114), (23, -0.129), (24, -0.12), (25, -0.0), (26, 0.055), (27, -0.111), (28, 0.132), (29, 0.047), (30, 0.088), (31, -0.083), (32, 0.028), (33, 0.019), (34, 0.003), (35, 0.139), (36, -0.01), (37, 0.034), (38, 0.011), (39, -0.02), (40, -0.089), (41, -0.062), (42, 0.093), (43, 0.038), (44, -0.012), (45, -0.093), (46, -0.01), (47, 0.073), (48, -0.086), (49, 0.117)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97173768 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

2 0.74509692 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer

Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1

3 0.71055591 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging

Author: Anton Schwaighofer, Volker Tresp, Peter Mayer, Alexander K. Scheel, Gerhard A. Müller

Abstract: We describe the RA scanner, a novel system for the examination of patients suffering from rheumatoid arthritis. The RA scanner is based on a novel laser-based imaging technique which is sensitive to the optical characteristics of finger joint tissue. Based on the laser images, finger joints are classified according to whether the inflammatory status has improved or worsened. To perform the classification task, various linear and kernel-based systems were implemented and their performances were compared. Special emphasis was put on measures to reliably perform parameter tuning and evaluation, since only a very small data set was available. Based on the results presented in this paper, it was concluded that the RA scanner permits a reliable classification of pathological finger joints, thus paving the way for a further development from prototype to product stage.

4 0.69018662 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger

Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), with d ≪ n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d²), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1

5 0.68984079 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

Author: M.S. Bartlett, G.C. Littlewort, T.J. Sejnowski, J.R. Movellan

Abstract: We present ongoing work on a project for automatic recognition of spontaneous facial actions. Spontaneous facial expressions differ substantially from posed expressions, similar to how continuous, spontaneous speech differs from isolated words produced on command. Previous methods for automatic facial expression recognition assumed images were collected in controlled environments in which the subjects deliberately faced the camera. Since people often nod or turn their heads, automatic recognition of spontaneous facial behavior requires methods for handling out-of-image-plane head rotations. Here we explore an approach based on 3-D warping of images into canonical views. We evaluated the performance of the approach as a front-end for a spontaneous expression recognition system using support vector machines and hidden Markov models. This system employed general purpose learning mechanisms that can be applied to recognition of any facial movement. The system was tested for recognition of a set of facial actions defined by the Facial Action Coding System (FACS). We showed that 3D tracking and warping followed by machine learning techniques directly applied to the warped images, is a viable and promising technology for automatic facial expression recognition. One exciting aspect of the approach presented here is that information about movement dynamics emerged out of filters which were derived from the statistics of images.

6 0.65715808 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

7 0.64360172 89 nips-2002-Feature Selection by Maximum Marginal Diversity

8 0.64094537 55 nips-2002-Combining Features for BCI

9 0.63285577 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

10 0.61815274 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

11 0.60365313 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA

12 0.54617214 45 nips-2002-Boosted Dyadic Kernel Discriminants

13 0.54348683 92 nips-2002-FloatBoost Learning for Classification

14 0.53393453 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations

15 0.5331322 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

16 0.52837688 90 nips-2002-Feature Selection in Mixture-Based Clustering

17 0.48040482 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

18 0.4795492 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

19 0.47903448 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

20 0.47322664 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.037), (14, 0.014), (23, 0.029), (42, 0.061), (52, 0.153), (54, 0.171), (55, 0.05), (67, 0.026), (68, 0.043), (74, 0.102), (87, 0.014), (92, 0.06), (98, 0.158)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97750735 58 nips-2002-Conditional Models on the Ranking Poset

Author: Guy Lebanon, John D. Lafferty

Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.

2 0.92555541 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

Author: Ofer Dekel, Yoram Singer

Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.

same-paper 3 0.91917467 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

4 0.91485518 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

Author: Kenji Fukumizu, Shotaro Akaho, Shun-ichi Amari

Abstract: We show the existence of critical points as lines for the likelihood function of mixture-type models. They are given by embedding of a critical point for models with less components. A sufficient condition that the critical line gives local maxima or saddle points is also derived. Based on this fact, a component-split method is proposed for a mixture of Gaussian components, and its effectiveness is verified through experiments. 1

5 0.85588801 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray

Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA. 1 Automatic Derivation of Statistical Algorithms Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way. The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data,and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards. 1 The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance. 
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas. 2 Combining Schema-based Synthesis and Bayesian Networks Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The figure shows the specification of the mixture of Gaussians example used throughout this paper: 1 model mog as 'Mixture of Gaussians'; 2 const int n_points as 'nr. of data points' 3 with 0 < n_points; 4 const int n_classes := 3 as 'nr. classes' 5 with 0 < n_classes 6 with n_classes << n_points; 7 double phi(1..n_classes) as 'weights' 8 with 1 = sum(I := 1..n_classes, phi(I)); 9 double mu(1..n_classes); 9 double sigma(1..n_classes); 10 int c(1..n_points) as 'class labels'; 11 c ˜ disc(vec(I := 1..n_classes, phi(I))); 12 data double x(1..n_points) as 'data'; 13 x(I) ˜ gauss(mu(c(I)), sigma(c(I))); 14 max pr(x| phi,mu,sigma ) wrt phi,mu,sigma ; Note the constraint that the sum of the class probabilities must equal one (line 8) along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem. Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal. 1 Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which require both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model). 2 Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d. Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task. Schema-based synthesis. 
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM. First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen's framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node (Lemma 1). 

6 0.85504937 10 nips-2002-A Model for Learning Variance Components of Natural Images

7 0.85435575 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

8 0.8538909 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

9 0.85183936 27 nips-2002-An Impossibility Theorem for Clustering

10 0.8509202 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

11 0.8495304 53 nips-2002-Clustering with the Fisher Score

12 0.84613687 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

13 0.84536344 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks

14 0.84500563 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

15 0.84437704 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

16 0.83937138 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

17 0.83917868 124 nips-2002-Learning Graphical Models with Mercer Kernels

18 0.83886886 3 nips-2002-A Convergent Form of Approximate Policy Iteration

19 0.83619118 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals

20 0.83609265 2 nips-2002-A Bilinear Model for Sparse Coding