nips nips2007 nips2007-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf
Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 com Bernhard Schölkopf, Max Planck Institute for Biological Cybernetics, Spemannstrasse 38, 72076 Tübingen, Germany, bs@tuebingen. [sent-8, score-0.028]
2 We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). [sent-12, score-0.097]
3 1 Introduction Learning algorithms need to make assumptions about the problem domain in order to generalise well. [sent-14, score-0.029]
4 These assumptions are usually encoded in the regulariser or the prior. [sent-15, score-0.048]
5 More elaborate prior knowledge, often needed for good performance, can be hard to encode in a regulariser or a prior that also remains computationally efficient. [sent-18, score-0.048]
6 A prominent example of data-dependent regularisation is semisupervised learning [1], where an additional set of unlabelled data, assumed to follow the same distribution as the training inputs, is tied to the regulariser using the so-called cluster assumption. [sent-20, score-0.105]
7 A novel form of data-dependent regularisation was recently proposed by [11]. [sent-21, score-0.036]
8 The additional dataset for this approach is explicitly not from the same distribution as the labelled data, but represents a third — neither — class. [sent-22, score-0.127]
9 This kind of dataset was first proposed by Vapnik [10] under the name Universum, owing its name to the intuition that the Universum captures a general backdrop against which a problem at hand is solved. [sent-23, score-0.088]
10 According to Vapnik, a suitable set for this purpose can be thought of as a set of examples that belong to the same problem framework, but about which the resulting decision function should not make a strong statement. [sent-24, score-0.039]
11 Although the idea was initially proposed for transductive inference, the authors of [11] introduced an inductive classifier where the decision surface is chosen such that the Universum examples are located close to it. [sent-25, score-0.039]
12 Although the authors showed that different choices of Universa and loss functions lead to certain known regularisers as special cases of their implementation, there are still a few unanswered questions. [sent-27, score-0.056]
13 On the other hand, except in special cases, the influence of the Universum data on the resulting decision hyperplane, and therefore the criteria for a good choice of Universum, are not known. [sent-29, score-0.062]
14 In the present paper we would like to address the second question by analysing the influence of the Universum data on the resulting function in the implementation of [11] as well as in a least squares version of it which we derive in section 2. [sent-30, score-0.06]
15 Clarifying the regularising influence of the Universum on the solution of the SVM can give valuable insight into which set of data points might be a helpful Universum and how to obtain it. [sent-31, score-0.038]
16 After briefly deriving the algorithms in section 2 we show in section 3 that the algorithm of [11] pushes the normal of the hyperplane into the orthogonal complement of the subspace spanned by the principal directions of the Universum set. [sent-33, score-0.306]
17 Furthermore, we demonstrate that the least squares version of the Universum algorithm is equivalent to a hybrid between kernel Fisher Discriminant Analysis and kernel Oriented Principal Component Analysis. [sent-34, score-0.123]
18 In section 4, we validate our analysis on toy experiments and give an example of how to use the geometric and algorithmic intuition gained from the analysis to construct a Universum set for a real world problem. [sent-35, score-0.029]
19 , (xm , ym )} be the set of labelled examples and let U = {z1 , . [sent-41, score-0.118]
20 Using the hinge loss $H_a[t] = \max\{0, a - t\}$ and $f_{w,b}(x) = \langle w, x \rangle + b$, a standard SVM can compactly be formulated as $\min_{w,b} \frac{1}{2}\|w\|^2 + C_L \sum_{i=1}^{m} H_1[y_i f_{w,b}(x_i)]$. [sent-45, score-0.069]
21 The Least Squares U-SVM: The derivation of the least squares U-SVM starts with the same general regularised error minimisation problem $\min_{w,b} \frac{1}{2}\|w\|^2 + \frac{C_L}{2}\sum_{i=1}^{m} Q_{y_i}[f_{w,b}(x_i)] + \frac{C_U}{2}\sum_{j=1}^{q} Q_0[f_{w,b}(z_j)]$. (2) [sent-49, score-0.083]
22 Instead of using the hinge loss, we employ the quadratic loss $Q_a[t] = \|t - a\|^2$, which is used in the least squares versions of SVMs [9]. [sent-50, score-0.11]
23 $\min_{w,b} \frac{1}{2}\|w\|^2 + \frac{C_L}{2}\sum_{i=1}^{m} \xi_i^2 + \frac{C_U}{2}\sum_{j=1}^{q} \vartheta_j^2$ (3) subject to $\langle w, x_i \rangle + b = y_i - \xi_i$ for $i = 1, \ldots$ [sent-53, score-0.059]
24 In the remaining part of this paper we denote the least squares U-SVM by Uls-SVM. [sent-63, score-0.041]
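As a brief illustration, the equality-constrained problem (3) can be solved in closed form in the linear case by eliminating the slack variables, which turns it into a ridge-type least squares problem. The sketch below is not from the paper; it assumes numpy arrays X (labelled inputs), y (±1 labels) and Z (Universum points), and all names are illustrative.

```python
import numpy as np

def uls_svm_primal(X, y, Z, C_L=1.0, C_U=1.0):
    """Minimal linear Uls-SVM sketch: minimise
    0.5*||w||^2 + 0.5*C_L*sum_i (y_i - <w,x_i> - b)^2 + 0.5*C_U*sum_j (<w,z_j> + b)^2
    by solving the normal equations for (w, b)."""
    m, d = X.shape
    q = Z.shape[0]
    # Stack labelled and Universum points; each block is weighted by sqrt(C).
    A = np.vstack([np.sqrt(C_L) * np.hstack([X, np.ones((m, 1))]),
                   np.sqrt(C_U) * np.hstack([Z, np.ones((q, 1))])])
    t = np.concatenate([np.sqrt(C_L) * y, np.zeros(q)])  # targets: y_i for labelled points, 0 for Universum points
    D = np.eye(d + 1)
    D[d, d] = 0.0                                         # no ridge penalty on the bias b
    wb = np.linalg.solve(A.T @ A + D, A.T @ t)
    return wb[:d], wb[d]

# Prediction would then be sign(X_test @ w + b).
```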
25 The authors of [12] describe an algorithm for the one-vs-one strategy in multiclass learning that additionally minimises the distance of the separating hyperplane to the examples that are in neither of the classes. [sent-66, score-0.092]
26 There are also two Bayesian algorithms that refer to non-examples or neither class in the binary classification setting. [sent-69, score-0.029]
27 [8] gives a probabilistic interpretation for a standard hinge loss SVM by establishing the connection between the MAP estimate of a Gaussian process with a Gaussian prior using a covariance function k and a hinge loss based noise model. [sent-70, score-0.182]
28 To circumvent this problem, the authors of [4] introduce an additional — neither — class to introduce a stochastic dependence between the parameter and the unobserved label in the discriminative model. [sent-74, score-0.049]
29 However, neither of the Bayesian approaches actually assigns an observed example to the introduced third class. [sent-75, score-0.029]
30 3 Analysis of the Algorithm The following two sections analyse the geometrical relation of the decision hyperplane learnt with one of the Universum SVMs to the Universum set. [sent-76, score-0.083]
31 It will turn out that in both cases the optimal solutions tend to make the normal vector orthogonal to the principal directions of the Universum. [sent-77, score-0.151]
32 In the extreme case, where w is completely orthogonal to U, the decision function defined by w is invariant to transformations that act on the subspace spanned by the elements of U. [sent-78, score-0.194]
33 Therefore, the Universum should contain the directions against which the resulting function should be invariant. [sent-79, score-0.075]
34 However, our results generalise to the case where the xi and zj live in an RKHS spanned by some kernel. [sent-81, score-0.284]
35 After showing the equivalence to using a standard SVM trained on the orthogonal complement of the subspace spanned by the zj , we extend the result to the cases with soft margin on U. [sent-85, score-0.344]
36 Lemma A U-SVM with CU = ∞, ε = 0 is equivalent to training a standard SVM with the training points projected onto the orthogonal complement of span{zj −z0 , zj ∈ U}, where z0 is an arbitrary element of U. [sent-86, score-0.324]
37 Proof: Since CU = ∞ and ε = 0, any w yielding a finite value of (1) must fulfil $\langle w, z_j \rangle + b = 0$ for all $j = 1, \ldots$ [sent-87, score-0.17]
38 So $\langle w, z_j - z_0 \rangle = 0$ and w is orthogonal to span{z_j − z_0, z_j ∈ U}. [sent-91, score-0.402]
39 Let $P_{U^\perp}$ denote the projection operator onto the orthogonal complement of that set. [sent-92, score-0.118]
40 From the previous argument, we can replace $\langle w, x_i \rangle$ by $\langle P_{U^\perp} w, x_i \rangle$ in the solution of (1) without changing it. [sent-93, score-0.068]
41 Since $P_{U^\perp}$ is an orthogonal projection we have that $P_{U^\perp}^\top = P_{U^\perp}$ and hence $\langle P_{U^\perp} w, x_i \rangle = \langle w, P_{U^\perp}^\top x_i \rangle = \langle w, P_{U^\perp} x_i \rangle$. [sent-95, score-0.164]
42 Therefore, the optimisation problem in (1) is the same as a standard SVM where the $x_i$ have been replaced by $P_{U^\perp} x_i$. [sent-96, score-0.068]
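The lemma suggests a simple sanity check: project the labelled points onto the orthogonal complement of span{z_j − z_0} and train an ordinary SVM on the result; for very large C_U and ε = 0 this should coincide with the U-SVM. A minimal sketch, assuming numpy arrays X, y, Z and scikit-learn for the standard SVM (the helper name is ours, not the paper's):

```python
import numpy as np
from sklearn.svm import SVC  # any off-the-shelf SVM solver works here

def project_out_universum(X, Z):
    """Project the rows of X onto the orthogonal complement of
    span{z_j - z_0 : z_j in U}, as in the lemma (hard-constraint case)."""
    V = Z[1:] - Z[0]                               # directions z_j - z_0
    U_, s, _ = np.linalg.svd(V.T, full_matrices=False)
    B = U_[:, s > 1e-10]                           # orthonormal basis of span{z_j - z_0}
    return X - (X @ B) @ B.T                       # P_{U-perp} applied row-wise

# Usage sketch (X, y, Z assumed given):
# clf = SVC(kernel='linear', C=1.0).fit(project_out_universum(X, Z), y)
# clf then approximates a U-SVM with C_U -> infinity and epsilon = 0.
```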
43 Since the resulting w is orthogonal to an affine space spanned by the Universum points, it is invariant against features implicitly specified by directions of large variance in that affine space. [sent-98, score-0.188]
44 Picturing the $\langle \cdot, z_j \rangle$ as filters that extract certain features from given labelled or test examples x, using the Universum algorithms means suppressing the features specified by the $z_j$. [sent-99, score-0.458]
45 Finally, we generalise the result of the lemma by dropping the hard constraint assumption on the Universum examples, i. [sent-100, score-0.05]
46 $C_U \sum_{j=1}^{q} |\langle w^*, z_j \rangle + b^*| \ge C_U \min_b \sum_{j=1}^{q} |\langle w^*, z_j \rangle + b|$. The right hand side can be interpreted as an "L1 variance". [sent-105, score-0.17]
47 3.2 Uls-SVM, Fisher Discriminant Analysis and Principal Component Analysis: In this section we present the relation of the Uls-SVM to two classic learning algorithms: (kernel) oriented Principal Component Analysis (koPCA) and (kernel) Fisher discriminant analysis (kFDA) [5]. [sent-109, score-0.076]
48 Since koPCA and kFDA can both be written as maximisation of a Rayleigh quotient, we start with the Rayleigh quotient of the hybrid: $\max_w \frac{w^\top (c^+ - c^-)(c^+ - c^-)^\top w}{w^\top \left( C_L \sum_{k=\pm}\sum_{i \in I^k} (x_i - c^k)(x_i - c^k)^\top + C_U \sum_{j=1}^{q} (z_j - \tilde c)(z_j - \tilde c)^\top \right) w}$, where the numerator and the within-class scatter term stem from FDA and the Universum term stems from oPCA. [sent-111, score-0.086]
49 Here, $c^\pm$ denote the class means of the labelled examples and $\tilde c = \frac{1}{2}(c^+ + c^-)$ is the point between them. [sent-112, score-0.118]
50 For optimising the quotient, the numerator can be fixed to an arbitrary value while the denominator is minimised. [sent-119, score-0.059]
51 Since the denominator might not have full rank it needs to be regularised [6]. [sent-120, score-0.045]
52 Choosing the regulariser to be $\|w\|^2$, the problem can be rephrased as $\min_w \|w\|^2 + w^\top \Big( C_L \sum_{k=\pm}\sum_{i \in I^k} (x_i - c^k)(x_i - c^k)^\top + C_U \sum_{j=1}^{q} (z_j - \tilde c)(z_j - \tilde c)^\top \Big) w$ (4) subject to $w^\top (c^+ - c^-) = 2$. [sent-121, score-0.048]
53 As we will see below, this problem can further be transformed into the quadratic program $\min_{w,b} \|w\|^2 + C_L\|\xi\|^2 + C_U\|\vartheta\|^2$ (5) subject to $\langle w, x_i \rangle + b = y_i + \xi_i$ for all $i = 1, \ldots$ [sent-123, score-0.059]
54 The following lemma establishes the relation of the Uls -SVM to kFDA and koPCA. [sent-133, score-0.042]
55 Let us choose $b = -w^\top \tilde c$, $\xi_i = w^\top x_i + b - y_i$ and $\vartheta_j = w^\top z_j + b$. [sent-141, score-0.229]
56 The above lemma establishes a relation of the Uls -SVM to two classic learning algorithms. [sent-145, score-0.042]
57 Since the noise covariance matrix of koPCA is given by the covariance of the Universum points centered on the average of the labelled class means, the role of the Universum as a data-dependent specification of principal directions of invariance is affirmed. [sent-147, score-0.271]
58 The koPCA term also shows that both the position and the covariance structure are crucial to a good Universum. [sent-148, score-0.044]
59 To see this, we rewrite $\sum_{j=1}^{q} (z_j - \tilde c)(z_j - \tilde c)^\top$ as $\sum_{j=1}^{q} (z_j - \bar z)(z_j - \bar z)^\top + q\,(\bar z - \tilde c)(\bar z - \tilde c)^\top$, where $\bar z = \frac{1}{q}\sum_{j=1}^{q} z_j$ is the Universum mean. [sent-149, score-0.17]
60 The additive relationship between the covariance of the Universum about its mean and the distance between the Universum and training sample means projected onto w shows that either quantity can dominate, depending on the data at hand. [sent-150, score-0.081]
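For completeness, the decomposition used above follows by inserting the Universum mean $\bar z$ and noting that the cross terms cancel:

```latex
\sum_{j=1}^{q} (z_j - \tilde c)(z_j - \tilde c)^\top
  = \sum_{j=1}^{q} \bigl[(z_j - \bar z) + (\bar z - \tilde c)\bigr]
                   \bigl[(z_j - \bar z) + (\bar z - \tilde c)\bigr]^\top
  = \sum_{j=1}^{q} (z_j - \bar z)(z_j - \bar z)^\top
    + q\,(\bar z - \tilde c)(\bar z - \tilde c)^\top ,
```
since $\sum_{j=1}^{q} (z_j - \bar z) = 0$ makes both mixed terms vanish.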
61 In the next section, we demonstrate the theoretical results of this section on toy problems and give an example how to use the insight gained from this section to construct an appropriate Universum. [sent-151, score-0.029]
62 4 Experiments, 4.1 Toy Experiments: The theoretical results of section 3 show that the covariance structure of the Universum as well as its absolute position influence the result of the learning process. [sent-153, score-0.044]
63 To validate this insight on toy data, we sample ten labelled sets of size 20, 50, 100 and 500 from two fifty-dimensional Gaussians. [sent-154, score-0.127]
64 Both Gaussians have a diagonal covariance that has low standard deviation (σ1,2 = 0. [sent-155, score-0.044]
65 We construct two kinds of Universa for this toy problem. [sent-166, score-0.029]
66 For the first kind we use a mean zero Gaussian with the same covariance structure as the Gaussians for the labelled data (σ3,. [sent-167, score-0.168]
67 For the second kind of Universa we use the same covariance as the labelled classes but shift it along the line between the means of the labelled Gaussians. [sent-174, score-0.266]
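A rough sketch of this toy construction (the exact standard deviations and class separation are truncated in the text above, so the numbers below are placeholders, not the paper's values):

```python
import numpy as np

def toy_universum(d=50, m=100, q=500, sigma=1.0, t=0.0, seed=0):
    """Two d-dimensional Gaussian classes plus a Universum with matching
    covariance, optionally shifted by t/2 along the line between the class
    means. All concrete numbers are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    std = np.full(d, sigma)
    std[:2] = 0.1                          # low-variance (discriminative) directions, placeholder
    mu_pos, mu_neg = np.zeros(d), np.zeros(d)
    mu_pos[0], mu_neg[0] = 1.0, -1.0       # placeholder class separation
    X = np.vstack([rng.normal(mu_pos, std, (m // 2, d)),
                   rng.normal(mu_neg, std, (m // 2, d))])
    y = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
    Z = rng.normal(np.zeros(d), std, (q, d)) + 0.5 * t * (mu_pos - mu_neg)
    return X, y, Z
```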
68 In the top row, the degree of isotropy increases from left to right, where σ = 10 corresponds to the completely isotropic case. [sent-177, score-0.055]
69 As expected, performance converges to the performance of an SVM for high isotropy σ and large translations t. [sent-179, score-0.074]
70 However, this might be due to the fact that the variance along the principal components of the Universum is much larger in magnitude than the applied shift. [sent-181, score-0.032]
71 [Figure 1 compares an SVM (q = 0) with U-SVMs using q = 100 and q = 500 Universum points; the vertical axes show the mean error and the horizontal axes the number m of labelled examples.] [sent-215, score-0.04]
72 Figure 1: Learning curves of linear U-SVMs for different degrees of isotropy σ and different amounts t of translation z → z + (t/2) · (c+ − c−). [sent-220, score-0.074]
73 With increasing isotropy and translation the performance of the U-SVMs converges to the performance of a normal SVM. [sent-221, score-0.095]
74 Note, for instance, that digits 3 and 6 are the best Universum and they are also the closest to the decision boundary. [sent-255, score-0.05]
75 4.2 Results on MNIST: Following the experimental work from [11], we took up the task of distinguishing between the digits 5 and 8 on MNIST data. [sent-257, score-0.031]
76 Training sets of size 1000 were used, and other digits served as Universum data. [sent-258, score-0.031]
77 Using different digits as universa, we recorded the test error (in percentage) of U-SVM. [sent-259, score-0.051]
78 $\langle w, x \rangle + b$) of a normal SVM trained for binary classification between the digits 5 and 8, measured on the points from the Universum class. [sent-262, score-0.069]
79 Another quantity of interest measured was the angle between covariance matrices of training and Universum data in the feature space. [sent-263, score-0.068]
80 Note that for two covariance matrices $C_X$ and $C_Y$ corresponding to matrices X and Y (centered about their means), the cosine of the angle is defined as $\operatorname{trace}(C_X C_Y) / \sqrt{\operatorname{trace}(C_X^2)\,\operatorname{trace}(C_Y^2)}$. [sent-264, score-0.068]
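In the linear case this quantity can be computed directly; a small sketch assuming X and Y are numpy arrays with one example per row (the kernelised variant used in the paper would operate on kernel matrices instead):

```python
import numpy as np

def covariance_cosine(X, Y):
    """Cosine of the angle between the covariance matrices of X and Y:
    trace(C_X C_Y) / sqrt(trace(C_X^2) * trace(C_Y^2))."""
    C_X = np.cov(X, rowvar=False)   # np.cov centres the data about its mean
    C_Y = np.cov(Y, rowvar=False)
    num = np.trace(C_X @ C_Y)
    den = np.sqrt(np.trace(C_X @ C_X) * np.trace(C_Y @ C_Y))
    return num / den
```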
81 4.3 Classification of Imagined Movements in Brain Computer Interfaces: Brain computer interfaces (BCI) are devices that allow a user to control a computer by merely using his brain activity [3]. [sent-270, score-0.06]
82 In this paradigm the patient imagines the movement of his left or right hand to indicate the respective state. [sent-274, score-0.045]
83 In order to reverse the spatial blurring of the brain activity by the intermediate tissue of the skull, the signals from all sensors are demixed via ... [Table residue: error rates of SVM, U-SVM, LS-SVM and Uls-SVM with Universum sets ∅, UC3 and Unm, per subject.] [sent-275, score-0.071]
84 For the first set (DATA I) we recorded the EEG activity from three healthy subjects for an imagined movement paradigm as described by [3]. [sent-376, score-0.108]
85 The second set (DATA II) contains EEG signals from a similar paradigm [7]. [sent-377, score-0.054]
86 The first Universum, UC3, consists of recordings from a third condition in the experiments that is not related to imagined movements. [sent-379, score-0.044]
87 Since variations in signals from this condition should not carry any useful information about the imagined movement task, the classifier should be invariant against them. [sent-380, score-0.134]
88 Unfortunately, signals in the α-band are also related to visual activity and independent components can be found that have a strong influence from sensors over the visual cortex. [sent-383, score-0.049]
89 In order to make the learning algorithm prefer the signals from the motor cortex, we construct a Universum Unm by projecting the labelled data onto the independent components that have a strong influence from the visual cortex. [sent-385, score-0.146]
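One possible realisation of this construction, sketched with scikit-learn's FastICA; the exact demixing and component-selection procedure of the paper may differ, and the indices of the "visual" components are assumed to be known (for example from inspecting scalp topographies):

```python
import numpy as np
from sklearn.decomposition import FastICA

def visual_universum(X, visual_idx, n_components=None, seed=0):
    """Project labelled data X (trials x features) onto the independent
    components indexed by visual_idx (assumed to reflect visual-cortex
    activity) and map them back; the result serves as Universum points."""
    ica = FastICA(n_components=n_components, random_state=seed)
    S = ica.fit_transform(X)                     # estimated sources per trial
    S_kept = np.zeros_like(S)
    S_kept[:, visual_idx] = S[:, visual_idx]     # keep only the visual components
    return ica.inverse_transform(S_kept)         # map the selected sources back to feature space
```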
90 As already observed for the DATA I dataset, the Uls-SVM performs consistently worse than its hinge loss counterpart. [sent-396, score-0.069]
91 The regularisation constant CU for the Universum points was chosen C = CU = 0. [sent-398, score-0.053]
92 This means that the non-orthogonality of w on the Universum points was only weakly penalised, but had equal priority to classifying the labelled examples correctly. [sent-400, score-0.135]
93 We demonstrated that the U-SVM as implemented in [11] is equivalent to searching for a hyperplane which has its normal lying in the orthogonal complement of the space spanned by Universum examples. [sent-404, score-0.215]
94 We also showed that the corresponding least squares Uls -SVM can be seen as a hybrid between the two well known learning algorithms kFDA and koPCA where the Universum points, centered between the means of the labelled classes, play the role of the noise covariance in koPCA. [sent-405, score-0.211]
95 Ideally the covariance matrix of the Universum should thus contain some important invariant directions for the problem at hand. [sent-406, score-0.119]
96 The position of the Universum set also plays an important role, and both our theoretical and experimental analyses show that the behaviour of the algorithm depends on the difference between the means of the labelled set and of the Universum set. [sent-407, score-0.098]
97 The question of whether the main influence of the Universum comes from the position or the covariance does not have a clear answer and is probably problem dependent. [sent-408, score-0.044]
98 From a practical point, the main contribution of this paper is to suggest how to select a good Universum set: it should be such that it contains invariant directions and is positioned “in between” the two classes. [sent-409, score-0.075]
99 Therefore, as can be partly seen from the BCI experiments, a good Universum dataset needs to be carefully chosen and cannot be an arbitrary backdrop as the name might suggest. [sent-410, score-0.045]
100 Classifying EEG and ECoG signals without subject training for fast BCI implementation: Comparison of non-paralysed and completely paralysed subjects. [sent-433, score-0.076]
wordName wordTfidf (topN-words)
[('universum', 0.844), ('uls', 0.18), ('cu', 0.174), ('zj', 0.17), ('unm', 0.138), ('svm', 0.112), ('universa', 0.111), ('pu', 0.103), ('labelled', 0.098), ('kfda', 0.084), ('kopca', 0.083), ('cl', 0.079), ('orthogonal', 0.062), ('isotropy', 0.055), ('spanned', 0.051), ('vapnik', 0.049), ('regulariser', 0.048), ('bci', 0.046), ('imagined', 0.044), ('covariance', 0.044), ('hyperplane', 0.043), ('kxy', 0.041), ('hinge', 0.041), ('squares', 0.041), ('invariant', 0.039), ('complement', 0.038), ('jh', 0.036), ('quotient', 0.036), ('regularisation', 0.036), ('directions', 0.036), ('xi', 0.034), ('eeg', 0.034), ('principal', 0.032), ('digits', 0.031), ('zien', 0.031), ('signals', 0.03), ('trace', 0.03), ('discriminant', 0.03), ('neither', 0.029), ('generalise', 0.029), ('fisher', 0.029), ('toy', 0.029), ('lkopf', 0.028), ('uence', 0.028), ('chapelle', 0.028), ('hybrid', 0.028), ('loss', 0.028), ('alekh', 0.028), ('backdrop', 0.028), ('regularisers', 0.028), ('resp', 0.028), ('sch', 0.027), ('classi', 0.027), ('kernel', 0.027), ('kind', 0.026), ('yi', 0.025), ('oriented', 0.025), ('angle', 0.024), ('fda', 0.024), ('sinz', 0.024), ('paradigm', 0.024), ('dimensions', 0.023), ('subspace', 0.023), ('ica', 0.023), ('denominator', 0.023), ('maximisation', 0.022), ('regularised', 0.022), ('spemannstrasse', 0.022), ('brain', 0.022), ('normal', 0.021), ('helpful', 0.021), ('lemma', 0.021), ('relation', 0.021), ('movement', 0.021), ('cx', 0.021), ('cy', 0.021), ('mika', 0.021), ('rayleigh', 0.021), ('unlabelled', 0.021), ('error', 0.02), ('examples', 0.02), ('cybernetics', 0.02), ('discriminative', 0.02), ('clari', 0.019), ('interfaces', 0.019), ('minimising', 0.019), ('translations', 0.019), ('translation', 0.019), ('activity', 0.019), ('decision', 0.019), ('projected', 0.019), ('implementation', 0.019), ('jl', 0.018), ('bernhard', 0.018), ('planck', 0.018), ('af', 0.018), ('onto', 0.018), ('name', 0.017), ('points', 0.017), ('fs', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 24 nips-2007-An Analysis of Inference with the Universum
Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf
Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1
2 0.072089054 106 nips-2007-Invariant Common Spatial Patterns: Alleviating Nonstationarities in Brain-Computer Interfacing
Author: Benjamin Blankertz, Motoaki Kawanabe, Ryota Tomioka, Friederike Hohlefeld, Klaus-Robert Müller, Vadim V. Nikulin
Abstract: Brain-Computer Interfaces can suffer from a large variance of the subject conditions within and across sessions. For example vigilance fluctuations in the individual, variable task involvement, workload etc. alter the characteristics of EEG signals and thus challenge a stable BCI operation. In the present work we aim to define features based on a variant of the common spatial patterns (CSP) algorithm that are constructed invariant with respect to such nonstationarities. We enforce invariance properties by adding terms to the denominator of a Rayleigh coefficient representation of CSP such as disturbance covariance matrices from fluctuations in visual processing. In this manner physiological prior knowledge can be used to shape the classification engine for BCI. As a proof of concept we present a BCI classifier that is robust to changes in the level of parietal α -activity. In other words, the EEG decoding still works when there are lapses in vigilance.
3 0.07204251 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
4 0.071848989 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
Author: Krishnan Kumar, Chiru Bhattacharya, Ramesh Hariharan
Abstract: This paper investigates the application of randomized algorithms for large scale SVM learning. The key contribution of the paper is to show that, by using ideas random projections, the minimal number of support vectors required to solve almost separable classification problems, such that the solution obtained is near optimal with a very high probability, is given by O(log n); if on removal of properly chosen O(log n) points the data becomes linearly separable then it is called almost separable. The second contribution is a sampling based algorithm, motivated from randomized algorithms, which solves a SVM problem by considering subsets of the dataset which are greater in size than the number of support vectors for the problem. These two ideas are combined to obtain an algorithm for SVM classification problems which performs the learning by considering only O(log n) points at a time. Experiments done on synthetic and real life datasets show that the algorithm does scale up state of the art SVM solvers in terms of memory required and execution time without loss in accuracy. It is to be noted that the algorithm presented here nicely complements existing large scale SVM learning approaches as it can be used to scale up any SVM solver. 1
5 0.05978306 112 nips-2007-Learning Monotonic Transformations for Classification
Author: Andrew Howard, Tony Jebara
Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1
6 0.055981923 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
7 0.051794272 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
8 0.051484015 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
9 0.050103761 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
10 0.049205706 62 nips-2007-Convex Learning with Invariances
11 0.047581844 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
12 0.044862658 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
13 0.044472672 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
14 0.037181545 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
15 0.03546112 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.035452411 160 nips-2007-Random Features for Large-Scale Kernel Machines
17 0.033670701 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
18 0.032424487 7 nips-2007-A Kernel Statistical Test of Independence
19 0.030601723 97 nips-2007-Hidden Common Cause Relations in Relational Learning
20 0.02935441 134 nips-2007-Multi-Task Learning via Conic Programming
topicId topicWeight
[(0, -0.115), (1, 0.031), (2, -0.035), (3, 0.057), (4, -0.008), (5, 0.033), (6, 0.044), (7, 0.041), (8, -0.077), (9, 0.029), (10, 0.034), (11, -0.074), (12, -0.011), (13, -0.004), (14, 0.028), (15, -0.026), (16, 0.019), (17, -0.087), (18, -0.009), (19, 0.033), (20, -0.016), (21, -0.018), (22, 0.031), (23, 0.061), (24, -0.014), (25, -0.057), (26, -0.024), (27, -0.028), (28, 0.007), (29, 0.034), (30, 0.033), (31, -0.045), (32, 0.03), (33, -0.085), (34, -0.06), (35, -0.019), (36, -0.003), (37, -0.052), (38, 0.012), (39, -0.031), (40, -0.069), (41, -0.01), (42, -0.083), (43, 0.033), (44, -0.067), (45, -0.035), (46, 0.078), (47, 0.043), (48, 0.088), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.87466121 24 nips-2007-An Analysis of Inference with the Universum
Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf
Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1
2 0.63101035 112 nips-2007-Learning Monotonic Transformations for Classification
Author: Andrew Howard, Tony Jebara
Abstract: A discriminative method is proposed for learning monotonic transformations of the training data while jointly estimating a large-margin classifier. In many domains such as document classification, image histogram classification and gene microarray experiments, fixed monotonic transformations can be useful as a preprocessing step. However, most classifiers only explore these transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first solves a convergent alternating sequence of quadratic and linear programs until it obtains a locally optimal solution. An improved algorithm is then derived using a convex semidefinite relaxation that overcomes initialization issues in the greedy optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated. 1
3 0.60840422 106 nips-2007-Invariant Common Spatial Patterns: Alleviating Nonstationarities in Brain-Computer Interfacing
Author: Benjamin Blankertz, Motoaki Kawanabe, Ryota Tomioka, Friederike Hohlefeld, Klaus-Robert Müller, Vadim V. Nikulin
Abstract: Brain-Computer Interfaces can suffer from a large variance of the subject conditions within and across sessions. For example vigilance fluctuations in the individual, variable task involvement, workload etc. alter the characteristics of EEG signals and thus challenge a stable BCI operation. In the present work we aim to define features based on a variant of the common spatial patterns (CSP) algorithm that are constructed invariant with respect to such nonstationarities. We enforce invariance properties by adding terms to the denominator of a Rayleigh coefficient representation of CSP such as disturbance covariance matrices from fluctuations in visual processing. In this manner physiological prior knowledge can be used to shape the classification engine for BCI. As a proof of concept we present a BCI classifier that is robust to changes in the level of parietal α -activity. In other words, the EEG decoding still works when there are lapses in vigilance.
4 0.5864982 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
5 0.57575238 118 nips-2007-Learning with Transformation Invariant Kernels
Author: Christian Walder, Olivier Chapelle
Abstract: This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thinplate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale). 1
6 0.5452776 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
7 0.5364868 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
8 0.49261087 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
9 0.49069047 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
10 0.48808289 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
11 0.43903819 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers
12 0.42020845 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
13 0.38015762 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
14 0.37673315 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.36179042 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
16 0.3572183 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection
17 0.35545433 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
18 0.35385942 40 nips-2007-Bundle Methods for Machine Learning
19 0.35122842 70 nips-2007-Discriminative K-means for Clustering
20 0.3365216 160 nips-2007-Random Features for Large-Scale Kernel Machines
topicId topicWeight
[(0, 0.028), (5, 0.025), (13, 0.065), (16, 0.043), (18, 0.013), (19, 0.077), (21, 0.067), (34, 0.016), (35, 0.026), (47, 0.093), (49, 0.031), (62, 0.226), (83, 0.088), (85, 0.039), (87, 0.015), (90, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.73606777 24 nips-2007-An Analysis of Inference with the Universum
Author: Olivier Chapelle, Alekh Agarwal, Fabian H. Sinz, Bernhard Schölkopf
Abstract: We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results. 1
Author: Ping Li, Trevor J. Hastie
Abstract: Many tasks (e.g., clustering) in machine learning only require the lα distances instead of the original data. For dimension reductions in the lα norm (0 < α ≤ 2), the method of stable random projections can efficiently compute the lα distances in massive datasets (e.g., the Web or massive data streams) in one pass of the data. The estimation task for stable random projections has been an interesting topic. We propose a simple estimator based on the fractional power of the samples (projected data), which is surprisingly near-optimal in terms of the asymptotic variance. In fact, it achieves the Cramér-Rao bound when α = 2 and α = 0+. This new result will be useful when applying stable random projections to distance-based clustering, classifications, kernels, massive data streams etc.
3 0.5902006 106 nips-2007-Invariant Common Spatial Patterns: Alleviating Nonstationarities in Brain-Computer Interfacing
Author: Benjamin Blankertz, Motoaki Kawanabe, Ryota Tomioka, Friederike Hohlefeld, Klaus-Robert Müller, Vadim V. Nikulin
Abstract: Brain-Computer Interfaces can suffer from a large variance of the subject conditions within and across sessions. For example vigilance fluctuations in the individual, variable task involvement, workload etc. alter the characteristics of EEG signals and thus challenge a stable BCI operation. In the present work we aim to define features based on a variant of the common spatial patterns (CSP) algorithm that are constructed invariant with respect to such nonstationarities. We enforce invariance properties by adding terms to the denominator of a Rayleigh coefficient representation of CSP such as disturbance covariance matrices from fluctuations in visual processing. In this manner physiological prior knowledge can be used to shape the classification engine for BCI. As a proof of concept we present a BCI classifier that is robust to changes in the level of parietal α -activity. In other words, the EEG decoding still works when there are lapses in vigilance.
4 0.58863258 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
Author: Christoforos Christoforou, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electroencephalography (EEG) focus on two types of paradigms: phase locked methods, in which the amplitude of the signal is used as the feature for classification, e.g. event related potentials; and second order methods, in which the feature of interest is the power of the signal, e.g. event related (de)synchronization. The procedure for deciding which paradigm to use is ad hoc and is typically driven by knowledge of the underlying neurophysiology. Here we propose a principled method, based on a bilinear model, in which the algorithm simultaneously learns the best first and second order spatial and temporal features for classification of EEG. The method is demonstrated on simulated data as well as on EEG taken from a benchmark data used to test classification algorithms for brain computer interfaces. 1 1.1
Author: Lars Buesing, Wolfgang Maass
Abstract: We show that under suitable assumptions (primarily linearization) a simple and perspicuous online learning rule for Information Bottleneck optimization with spiking neurons can be derived. This rule performs on common benchmark tasks as well as a rather complex rule that has previously been proposed [1]. Furthermore, the transparency of this new learning rule makes a theoretical analysis of its convergence properties feasible. A variation of this learning rule (with sign changes) provides a theoretically founded method for performing Principal Component Analysis (PCA) with spiking neurons. By applying this rule to an ensemble of neurons, different principal components of the input can be extracted. In addition, it is possible to preferentially extract those principal components from incoming signals X that are related or are not related to some additional target signal YT . In a biological interpretation, this target signal YT (also called relevance variable) could represent proprioceptive feedback, input from other sensory modalities, or top-down signals. 1
7 0.55523992 100 nips-2007-Hippocampal Contributions to Control: The Third Way
8 0.55462033 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.55330217 86 nips-2007-Exponential Family Predictive Representations of State
10 0.55148059 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
11 0.55058497 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
12 0.54998469 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
13 0.54870892 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
14 0.54866636 195 nips-2007-The Generalized FITC Approximation
15 0.54829979 164 nips-2007-Receptive Fields without Spike-Triggering
16 0.54735321 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
17 0.54654866 158 nips-2007-Probabilistic Matrix Factorization
18 0.54635185 154 nips-2007-Predicting Brain States from fMRI Data: Incremental Functional Principal Component Regression
19 0.54604238 63 nips-2007-Convex Relaxations of Latent Variable Training
20 0.54515553 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes