nips nips2009 nips2009-237 knowledge-graph by maker-knowledge-mining

237 nips-2009-Subject independent EEG-based BCI decoding


Source: pdf

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. [sent-2, score-0.21]

2 In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. [sent-3, score-0.192]

3 In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. [sent-4, score-0.303]

4 Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss. [sent-5, score-0.129]

5 1 Introduction The last years in BCI research have seen drastically reduced training and calibration times due to the use of machine learning and adaptive signal processing techniques (see [9] and references therein) and novel dry electrodes [18]. [sent-6, score-0.294]

6 Initial BCI systems were based on operant conditioning and could easily require months of training on the subject side before it was possible to use them [1, 10]. [sent-7, score-0.131]

7 Second generation BCI systems require recording a brief calibration session during which a subject assumes a fixed number of brain states, say, movement imagination, after which the subject-specific spatio-temporal filters (e.g. CSP) are computed. [sent-8, score-0.599]

8 Recently, first steps to transfer a BCI user’s filters and classifiers between sessions were studied [14] and a further online study confirmed that indeed such transfer is possible without significant performance loss [16]. [sent-11, score-0.182]

9 In the present paper we will go even one step further in this spirit and propose a subject-independent zero-training BCI that enables both experienced and novice BCI subjects to use BCI immediately without calibration. [sent-12, score-0.19]

10 As expected, we find that a distribution of different alpha band features in combination with a number of characteristic common spatial patterns (CSPs) is highly predictive for all users. [sent-18, score-0.172]

11 What is found as the outcome of a machine learning experiment can also be viewed as a compact quantitative description of the characteristic variability between individuals in the large subject group. [sent-19, score-0.162]

12 Note that it is not the best subjects that characterize the variance necessary for a subject-independent algorithm; rather, the spread over existing physiology is to be represented concisely. [sent-20, score-0.321]

13 Clearly, our procedure may also be of use apart from BCI in other scientific fields, where complex characteristic features need to be homogenized into one overall inference model. [sent-21, score-0.13]

14 The paper first provides an overview of the data used, then the ensemble learning algorithm is outlined, consisting of the procedure for building the filters, the classifiers and the gating function, where we apply various machine learning methods. [sent-22, score-0.376]

15 Interestingly, we are able to successfully classify trials of novel subjects with zero training, suffering only a small loss in performance. [sent-23, score-0.283]

16 2 Available Data and Experiments We used 83 BCI datasets (sessions), each consisting of 150 trials from 83 individual subjects. [sent-25, score-0.09]

17 Each trial consists of one of two predefined movement imaginations, being left and right hand, i.e. [sent-26, score-0.151]

18 data was chosen such that it relies only on these 2 classes, although originally three classes were cued during the calibration session, being left hand (L), right hand (R) and foot (F). [sent-28, score-0.161]

19 45 EEG channels, which are in accordance with the 10-20 system, were identified to be common in all sessions considered. [sent-29, score-0.082]

20 The data were recorded while subjects were immobile, seated on a comfortable chair with arm rests. [sent-30, score-0.216]

21 The cues for performing a movement imagination were given by visual stimuli, and occurred every 4. [sent-31, score-0.214]

22 In the third training paradigm the subject was instructed to follow a moving target on the screen. [sent-36, score-0.131]

23 Within this target the edges lit up to indicate the type of movement imagination required. [sent-37, score-0.214]

24 The experimental procedure was designed to closely follow [3]. [sent-38, score-0.063]

25 Electromyograms (EMG) on both forearms and the foot were recorded, as well as the electrooculogram (EOG), to ensure there were no real movements of the arms and that the movements of the eyes were not correlated with the required mental tasks. [sent-39, score-0.152]

26 3 Generation of the Ensemble The ensemble consists of a large redundant set of subject-dependent common spatial pattern filters (CSP cf. [sent-40, score-0.381]

27 A corresponding spatial filter and linear classifier are obtained for every dataset and temporal filter. [sent-45, score-0.111]

28 We employed different forms of regression and classification in order to find an optimal weighting for predicting the movement imagination data of unseen subjects [2, 4]. [sent-48, score-0.306]

29 The session of a particular subject was removed, the algorithm was trained on the remaining trials (of the other subjects) and then applied to this subject’s data (see lower panel of Figure 1). [sent-51, score-0.336]

30 1 Temporal Filters The µ-rhythm (9-14 Hz) and synchronized components in the β-band (16-22 Hz) are macroscopic idle rhythms that prevail over the postcentral somatosensory cortex and precentral motor cortex, when a given subject is at rest. [sent-53, score-0.269]

31 Imaginations of movements as well as actual movements are known to suppress these idle rhythms contralaterally. [sent-54, score-0.204]

32 However, there are not only subject-specific differences of the most discriminative frequency range of the mentioned idle-rhythms, but also session differences thereof. [sent-55, score-0.142]

33 We identified 18 neurophysiologically relevant temporal filters, of which 12 lie within the µ-band, 3 in the β-band, two in between µ- and β-band and one broadband 7–30 Hz. [sent-56, score-0.103]

34 W is a matrix of projections, where the i-th row has a relative variance of di for trials of class 1 and a relative variance of 1 − di for trials of class 2. (Figure 1: flowcharts of the ensemble method.) [sent-62, score-0.417]

35 The red patches in the top panel illustrate the inactive nodes of the ensemble after sparsification. [sent-63, score-0.389]

36 We use Linear Discriminant Analysis (LDA) [5]; each temporally filtered session corresponds to a CSP set and a matched LDA. [sent-66, score-0.086]

37 3 Final gating function The final gating function combines the outputs of the individual ensemble members into a single one. [sent-68, score-0.523]

38 For a number of ensemble methods the mean has proven to be a surprisingly good choice [17]. [sent-70, score-0.331]

39 As a baseline for our ensemble we simply averaged all outputs of our individual classifiers. [sent-71, score-0.406]

40 The numbers on the vertical axis represent the subject index as well as the Error Rate (%). [sent-75, score-0.131]

41 The red line depicts the baseline error of individual subjects (classical auto-band CSP). [sent-76, score-0.219]

42 Note that some of the features are useful in predicting the data of most other subjects, while some are rarely or never used. [sent-78, score-0.072]

43 of sessions, X the complete data set, Sk the set of sessions of subject k, Xk the dataset for subject k, y(x) is the class label of trial x and wij in equation (3) are the weights given to the LDA outputs. [sent-79, score-0.48]

44 The dataset scaling factor is computed using cij (x), for all x ∈ X \ Xk . [sent-81, score-0.077]

45 The L1-regularized regression with this choice of α was then applied to all subjects, with results (in terms of feature sparsification) shown in Figure 2. [sent-83, score-0.123]

46 The most predictive subjects show smooth monopolar patterns, while subjects with a higher self-prediction loss slowly move from dipolar to rather ragged maps. [sent-85, score-0.412]

47 From the point of view of approximation even the latter make sense for capturing the overall ensemble variance. [sent-86, score-0.331]

48 We coupled an L2 loss with an L1 penalty term on a linear voting scheme ensemble. [sent-88, score-0.089]

49 4 Validation The subject-specific CSP-based classification methods with automatically, subject-dependent tuned temporal filters (termed reference methods) are validated by an 8-fold cross-validation, splitting the data chronologically. [sent-91, score-0.138]

50 To validate the quality of the ensemble learning we employed a leave-one-subject-out cross-validation (LOSO-CV) procedure, i.e. [sent-93, score-0.331]

51 for predicting the labels of a particular subject we only use data from other subjects. [sent-95, score-0.167]

52 4 Results Overall performance of the reference methods, other baseline methods and of the ensemble method is presented in Table 2. [sent-96, score-0.375]

53 Reference method performances of subject-specific CSP-based classification are presented with heuristically tuned frequency bands [6]. [sent-97, score-0.136]

54 For the simple zero-training methods we chose a broad-band filter of 7–30 Hz, since it is the least restrictive and scored one of the best performances on a subject level. [sent-157, score-0.163]

55 The bias b in equation (3) can be tuned broadly for all sessions or corrected individually by session, and implemented for online experiments in multiple ways [16, 20, 15]. [sent-158, score-0.115]

56 In our case we chose to adapt b without label information, operating under the assumption that class frequency is balanced. [sent-159, score-0.088]

57 While the SVM scores the best median loss over all subjects (see Table 1), L1-regularized regression scores better results for well-performing BCI subjects (Figure 3, column 1, row 3). [sent-164, score-0.535]

58 In Figure 3 and Table 2 we furthermore show the results of the L1 regularized regression and SVM versus the auto-band reference method (zero-training versus subject-dependent training) as well as vs. [sent-165, score-0.167]

59 Figure 4 shows all individual temporal filters used to generate the ensemble, where their color codes the frequency with which they were used to predict labels of previously unseen data. [sent-167, score-0.146]

60 However, there is a strong correlation between subjects with a low self-prediction loss and the generalizability of their features for predicting other subjects, as can be seen on the right part of Figure 4. [sent-170, score-0.294]

61 1 Focusing on a particular subject In order to give an intuition of how the ensemble works in detail we will focus on a particular subject. [sent-172, score-0.462]

62 We chose to use the subject with the lowest reference method cross-validation error (10%). [sent-173, score-0.207]

63 Given the non-linearity in the band-power estimation (see Figure 1) it is impossible to picture the resulting ensemble spatial filter exactly. [sent-174, score-0.381]

64 Another way to exemplify the ensemble performance is to refer to a transfer function. [sent-177, score-0.365]

65 Figure 4: On the left: the temporal filters used and, in color code, their contribution to the final L1-regularized regression classification (the scale is normalized from 0 to 1); axis labels: number of active features, frequency [Hz]. [sent-200, score-0.276]

66 and processing it by the four CSP filters, estimating the band-power of the resulting signal and finally combining the four outputs by the LDA classifier, we obtain a response for the particular channel where the sinusoid was injected. [sent-205, score-0.083]

67 This procedure can be applied for a single CSP/LDA pair; however, we may also repeat the given method as many times as features were chosen for a given subject by the ensemble and hence obtain an accurate description of how the ensemble processes the given EEG data. [sent-207, score-0.829]

68 As a third way of visualizing how the ensemble works, we show the primary projections of the CSP filters that were given the 6 highest weights by the ensemble on the left panel (F) and the distribution of all weights in panel D. [sent-210, score-0.879]

69 The spatial positions of the highest channel weightings differ slightly for each of the CSP filters given; however, the maxima of the projection matrices are clearly positioned around the primary motor cortex. [sent-211, score-0.26]

70 While dry electrodes provide a first step to avoid the time for placing the cap, calibration still remained, and it is here that we contribute by dispensing with calibration sessions. [sent-213, score-0.423]

71 Our present study is an offline analysis providing a positive answer to the question whether a subject independent classifier could become reality for a BCI-naïve user. [sent-214, score-0.131]

72 We have taken great care in this work to exclude data from a given subject when predicting his/her performance by using the previously described LOSO-CV. [sent-215, score-0.167]

73 In contrast with previous work on ensemble approaches to BCI classification based on simple majority voting and Adaboost [21, 8], which utilized only a limited dataset, we have profited greatly from a large body of high quality experimental data accumulated over the years. [sent-216, score-0.388]

74 This has enabled us to choose by means of machine learning technology a very sparse set of voting classifiers which performed as well as standard, state-of-the-art subject calibrated methods. [sent-217, score-0.188]

75 L1-regularized regression in this case performed better than other methods (such as majority voting) that we also tested. [sent-218, score-0.123]

76 most subjects with high auto-band reference method performance were selected. [sent-222, score-0.234]

77 Interestingly, some subjects with very high BCI performance are not selected at all, while others generalize well in the sense that their models are able to predict other subjects’ data. [sent-223, score-0.19]

78 No single frequency band dominated classification accuracy – see Figure 4. [sent-224, score-0.111]

79 For very able subjects our zero-training method exhibits a slight performance decrease, which however will not prevent them from performing successfully in BCI. [sent-228, score-0.19]

80 It identifies relevant cortical locations and frequency bands of neuronal population activity which are in agreement with general neuroscientific knowledge. [sent-230, score-0.128]

81 So what we have found from our machine learning algorithm should be interpreted as representing the characteristic neurophysiological variation across a large subject group, which in itself is a highly relevant topic that goes beyond the scope of this technical study. [sent-234, score-0.191]

82 We plan to adopt the ensemble approach in combination with a recently developed EEG cap having dry electrodes [18] and thus to be able to reduce the required preparation time for setting up a running BCI system to essentially zero. [sent-236, score-0.533]

83 The generic ensemble classifier derived here is also an excellent starting point for a subsequent coadaptive learning procedure in the spirit of [7]. [sent-237, score-0.331]

84 Figure 5: A: primary projections for classical auto-band CSP. [sent-238, score-0.115]

85 C: transfer function for classical auto-band and ensemble CSP’s. [sent-240, score-0.41]

86 D: weightings of 28 ensemble members, the six highest components are shown in F. [sent-241, score-0.413]

87 E: linear average ensemble temporal filter (red), heuristic (blue). [sent-242, score-0.392]

88 F: primary projections of the 6 ensemble members that received highest weights. [sent-243, score-0.459]

89 G: Broad-band version of the ensemble for a single subject. [sent-244, score-0.331]

90 The outputs of all basis classifiers are applied to each trial of one subject. [sent-245, score-0.101]

91 The top row (broad) gives the label, the second row (broad) gives the output of the classical auto-band CSP, and each of the following rows (thin) gives the outputs of the individual classifiers of other subjects. [sent-246, score-0.12]

92 The individual classifier outputs are sorted by their correlation coefficient with respect to the class labels. [sent-247, score-0.1]

93 The trials (columns) are sorted by true labels with primary key and by mean ensemble output as a secondary key. [sent-248, score-0.449]

94 The row at the bottom gives the sign of the average ensemble output. [sent-249, score-0.331]

95 Classifying single trial EEG: Towards brain computer interfacing. [sent-267, score-0.094]

96 The Berlin Brain-Computer Interface: EEG-based communication without subject training. [sent-284, score-0.131]

97 Towards a cure for BCI illiteracy: Machine-learning based co-adaptive learning. [sent-323, score-0.449]

98 Reducing calibration time for brain-computer interfaces: A clustering approach. [sent-378, score-0.115]

99 Single trial classification of motor imagination using 6 dry EEG electrodes. [sent-416, score-0.332]

100 Optimal spatial filtering of single trial EEG during imagined hand movement. [sent-422, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bci', 0.449), ('csp', 0.443), ('ensemble', 0.331), ('subjects', 0.19), ('blankertz', 0.19), ('lters', 0.175), ('eeg', 0.153), ('subject', 0.131), ('calibration', 0.129), ('krauledat', 0.126), ('imagination', 0.118), ('dry', 0.105), ('movement', 0.096), ('classi', 0.087), ('session', 0.086), ('sessions', 0.082), ('wij', 0.081), ('lda', 0.08), ('ers', 0.078), ('cij', 0.077), ('regularized', 0.067), ('dornhege', 0.063), ('dashes', 0.063), ('lpm', 0.063), ('proceedure', 0.063), ('lter', 0.063), ('svm', 0.062), ('trials', 0.061), ('temporal', 0.061), ('electrodes', 0.06), ('movements', 0.06), ('panel', 0.058), ('voting', 0.057), ('regression', 0.056), ('frequency', 0.056), ('trial', 0.055), ('band', 0.055), ('motor', 0.054), ('channels', 0.053), ('sk', 0.052), ('weightings', 0.051), ('spatial', 0.05), ('ller', 0.049), ('bands', 0.047), ('outputs', 0.046), ('gating', 0.045), ('sparsi', 0.045), ('classical', 0.045), ('reference', 0.044), ('laplace', 0.044), ('channel', 0.042), ('cvx', 0.042), ('emphazising', 0.042), ('fazli', 0.042), ('hinterberger', 0.042), ('idle', 0.042), ('individualized', 0.042), ('interfacing', 0.042), ('kunzmann', 0.042), ('losch', 0.042), ('neurophysiologically', 0.042), ('rhythms', 0.042), ('knn', 0.041), ('brain', 0.039), ('projections', 0.038), ('popescu', 0.037), ('lsr', 0.037), ('cap', 0.037), ('lemm', 0.037), ('sinusoid', 0.037), ('eng', 0.037), ('predicting', 0.036), ('features', 0.036), ('er', 0.035), ('xk', 0.035), ('imagined', 0.034), ('imaginations', 0.034), ('transfer', 0.034), ('tuned', 0.033), ('hz', 0.032), ('loss', 0.032), ('chose', 0.032), ('primary', 0.032), ('magazine', 0.032), ('curio', 0.032), ('foot', 0.032), ('characteristic', 0.031), ('highest', 0.031), ('ine', 0.03), ('shenoy', 0.03), ('individual', 0.029), ('neurophysiological', 0.029), ('discriminant', 0.028), ('members', 0.027), ('filters', 0.026), ('arm', 0.026), ('sorted', 0.025), ('di', 0.025), ('cortical', 0.025), ('towards', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

2 0.18275015 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.

3 0.1067018 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

4 0.10543811 81 nips-2009-Ensemble Nystrom Method

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

5 0.10489466 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

Author: Daniel Zoran, Yair Weiss

Abstract: We propose a new model for natural image statistics. Instead of minimizing dependency between components of natural images, we maximize a simple form of dependency in the form of tree-dependencies. By learning filters and tree structures which are best suited for natural images we observe that the resulting filters are edge filters, similar to the famous ICA on natural images results. Calculating the likelihood of an image patch using our model requires estimating the squared output of pairs of filters connected in the tree. We observe that after learning, these pairs of filters are predominantly of similar orientations but different phases, so their joint energy resembles models of complex cells.

6 0.089425057 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

7 0.088325292 246 nips-2009-Time-Varying Dynamic Bayesian Networks

8 0.084053092 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

9 0.077523209 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

10 0.077123024 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning

11 0.075829551 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

12 0.075448811 47 nips-2009-Boosting with Spatial Regularization

13 0.06230551 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

14 0.062008642 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

15 0.061980899 70 nips-2009-Discriminative Network Models of Schizophrenia

16 0.061643954 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording

17 0.051122513 235 nips-2009-Structural inference affects depth perception in the context of potential occlusion

18 0.050619636 260 nips-2009-Zero-shot Learning with Semantic Output Codes

19 0.049603768 71 nips-2009-Distribution-Calibrated Hierarchical Classification

20 0.048816781 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.173), (1, -0.057), (2, 0.01), (3, 0.066), (4, 0.001), (5, 0.037), (6, -0.01), (7, -0.047), (8, -0.065), (9, 0.004), (10, -0.027), (11, 0.014), (12, -0.028), (13, 0.032), (14, 0.097), (15, -0.002), (16, -0.01), (17, 0.03), (18, -0.112), (19, 0.066), (20, 0.174), (21, -0.057), (22, -0.035), (23, -0.14), (24, 0.049), (25, 0.112), (26, -0.129), (27, 0.094), (28, -0.065), (29, -0.1), (30, 0.027), (31, -0.049), (32, 0.164), (33, -0.048), (34, 0.154), (35, -0.011), (36, -0.012), (37, 0.093), (38, -0.023), (39, 0.209), (40, -0.074), (41, -0.019), (42, 0.001), (43, -0.112), (44, -0.019), (45, -0.039), (46, -0.061), (47, 0.114), (48, -0.107), (49, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93542445 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

2 0.69295704 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.

3 0.61908609 81 nips-2009-Ensemble Nystrom Method

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation.

4 0.51704478 6 nips-2009-A Biologically Plausible Model for Rapid Natural Scene Identification

Author: Sennay Ghebreab, Steven Scholte, Victor Lamme, Arnold Smeulders

Abstract: Contrast statistics of the majority of natural images conform to a Weibull distribution. This property of natural images may facilitate efficient and very rapid extraction of a scene's visual gist. Here we investigated whether a neural response model based on the Weibull contrast distribution captures visual information that humans use to rapidly identify natural scenes. In a learning phase, we measured EEG activity of 32 subjects viewing brief flashes of 700 natural scenes. From these neural measurements and the contrast statistics of the natural image stimuli, we derived an across-subject Weibull response model. We used this model to predict the EEG responses to 100 new natural scenes and estimated which scene the subject viewed by finding the best match between the model predictions and the observed EEG responses. In almost 90 percent of the cases our model accurately predicted the observed scene. Moreover, in most failed cases, the scene mistaken for the observed scene was visually similar to the observed scene itself. Similar results were obtained in a separate experiment in which 16 other subjects were presented with artificial occlusion models of natural images. Together, these results suggest that Weibull contrast statistics of natural images contain a considerable amount of visual gist information to warrant rapid image identification.

5 0.46677181 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks

Author: Yoshua Bengio, James S. Bergstra

Abstract: We introduce a new type of neural network activation function based on recent physiological rate models for complex cells in visual area V1. A single-hidden-layer neural network of this kind of model achieves 1.50% error on MNIST. We also introduce an existing criterion for learning slow, decorrelated features as a pretraining strategy for image models. This pretraining strategy results in orientation-selective features, similar to the receptive fields of complex cells. With this pretraining, the same single-hidden-layer model achieves 1.34% error, even though the pretraining sample distribution is very different from the fine-tuning distribution. To implement this pretraining strategy, we derive a fast algorithm for online learning of decorrelated features such that each iteration of the algorithm runs in linear time with respect to the number of features.

6 0.43205121 231 nips-2009-Statistical Models of Linear and Nonlinear Contextual Interactions in Early Visual Processing

7 0.39521793 241 nips-2009-The 'tree-dependent components' of natural scenes are edge filters

8 0.39389262 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

9 0.39216566 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

10 0.37630832 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

11 0.36207336 47 nips-2009-Boosting with Spatial Regularization

12 0.35588464 216 nips-2009-Sequential effects reflect parallel learning of multiple environmental regularities

13 0.35415497 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

14 0.35218272 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording

15 0.34294796 124 nips-2009-Lattice Regression

16 0.34069806 13 nips-2009-A Neural Implementation of the Kalman Filter

17 0.32829565 246 nips-2009-Time-Varying Dynamic Bayesian Networks

18 0.31258568 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

19 0.29630044 260 nips-2009-Zero-shot Learning with Semantic Output Codes

20 0.2887606 182 nips-2009-Optimal Scoring for Unsupervised Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.034), (24, 0.031), (25, 0.062), (35, 0.042), (36, 0.087), (39, 0.057), (58, 0.096), (61, 0.018), (71, 0.069), (77, 0.28), (81, 0.02), (86, 0.107), (91, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.81605291 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

same-paper 2 0.79033494 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration procedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-naïve users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

3 0.76670599 136 nips-2009-Learning to Rank by Optimizing NDCG Measure

Author: Hamed Valizadegan, Rong Jin, Ruofei Zhang, Jianchang Mao

Abstract: Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets. 1

4 0.73739207 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

5 0.63502818 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li

Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learningto-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measurebased ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis. 1

6 0.59295416 87 nips-2009-Exponential Family Graph Matching and Ranking

7 0.58502078 230 nips-2009-Statistical Consistency of Top-k Ranking

8 0.5770148 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

9 0.57605785 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

10 0.57570291 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

11 0.57429075 104 nips-2009-Group Sparse Coding

12 0.57294708 70 nips-2009-Discriminative Network Models of Schizophrenia

13 0.57256573 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

14 0.57227504 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

15 0.57131559 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

16 0.57033414 137 nips-2009-Learning transport operators for image manifolds

17 0.56995094 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning

18 0.56981927 112 nips-2009-Human Rademacher Complexity

19 0.56974155 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

20 0.56863904 3 nips-2009-AUC optimization and the two-sample problem