nips nips2008 nips2008-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity, which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel outperform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
Reference: text
sentIndex sentText sentNum sentScore
1 We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. [sent-8, score-0.381]
2 This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity, which destroys the geometric invariance of the kernel. [sent-10, score-0.185]
3 Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. [sent-11, score-0.283]
4 Experiments on detecting cognitive decline show that classifiers based on the proposed kernel outperform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information. [sent-12, score-0.655]
5 This paper develops new techniques based on hierarchical Bayesian generative models and the Fisher kernel derived from them. [sent-14, score-0.381]
6 The latter strategy, common in the biological sequence literature [4], destroys the geometrical invariance of the kernel. [sent-17, score-0.104]
7 Our construction retains the proper geometric structure, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. [sent-18, score-0.325]
8 This work was motivated by the need to classify clinical longitudinal data on human motor and psychometric test performance. [sent-19, score-0.276]
9 Clinical studies show that, at the population level, progressive slowing of walking and of the rate at which a subject can tap their fingers is predictive of cognitive decline years before its manifestation [1]. [sent-20, score-0.47]
10 Similarly, performance on psychometric tests such as delayed recall of a story or word lists (tests not used in diagnosis) is predictive of cognitive decline [8]. [sent-21, score-0.449]
11 An early predictor of cognitive decline for individual patients based on such longitudinal data would improve medical care and planning for assistance. [sent-22, score-0.379]
12 1 Our new Fisher kernels use mixed-effects models [6] as the generative process. [sent-23, score-0.194]
13 These are hierarchical models that describe the population (consisting of many individuals) as a whole, and variations between individuals in the population. [sent-24, score-0.218]
14 The overall population model together with the covariance of the random effects comprise a set of parameters for the prior on an individual subject model, so the fitting scheme is a hierarchical empirical Bayesian procedure. [sent-26, score-0.221]
15 Data Description The data for this study was drawn from the Oregon Brain Aging Study (OBAS) [2], a longitudinal study spanning up to fifteen years with roughly yearly assessment of subjects. [sent-27, score-0.128]
16 For our work, we grouped the subjects into two classes: those who remain cognitively healthy through the course of the study (denoted normal), and those who progress to mild cognitive impairment (MCI) or further to dementia (denoted impaired). [sent-28, score-0.149]
17 We use 97 subjects from the normal group and 46 from the group that becomes impaired. [sent-30, score-0.121]
18 Motor task data included the time (denoted as seconds) and the number of steps (denoted as steps) to walk 9 meters, and the number of times the subject can tap their forefinger with the dominant (tappingD) and nondominant (tappingN) hand in 10 seconds. [sent-31, score-0.08]
19 Psychometric test data include delayed-recall, which measures the number of words from a list of 10 that the subject can recall one minute after hearing the list, and logical memory II in which the subject is graded on recall of a story told 15-20 minutes earlier. [sent-32, score-0.333]
20 Suppose there are k individuals (indexed by i = 1, . . . , k). [sent-35, score-0.088]
21 The superscript on the model parameters $\gamma^i$ indicates that the regression parameters are different for each individual contributing to the population. [sent-43, score-0.085]
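Written out, the per-subject regression implied by these quantities is the following (a reconstruction consistent with the marginal likelihood given below; it assumes the same basis Φ for the fixed and random effects, as noted in the footnote there):

    \[
      y^i_n = (\gamma^i)^T \Phi(t^i_n) + \epsilon^i_n,
      \qquad \gamma^i = \alpha + \beta^i, \quad
      \beta^i \sim \mathcal{N}(0, D), \quad
      \epsilon^i_n \sim \mathcal{N}(0, \sigma^2).
    \]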
22 The feature values, observation times, and observation noise are $\mathbf{y}^i \equiv [y^i_1, \cdots, y^i_{N^i}]^T$, $\mathbf{t}^i \equiv [t^i_1, \cdots, t^i_{N^i}]^T$, $\boldsymbol{\epsilon}^i \equiv [\epsilon^i_1, \cdots, \epsilon^i_{N^i}]^T$. [sent-51, score-0.662]
23 Maximum Likelihood Fitting: Model fitting uses the entire collection of data $\{\mathbf{t}^i, \mathbf{y}^i\}$, $i = 1, \ldots, k$. [sent-53, score-0.11]
24 The likelihood of the data $(\mathbf{t}^i, \mathbf{y}^i)$ given $\mathcal{M}$ is $p(\mathbf{y}^i; \mathbf{t}^i, \mathcal{M}) = \int p(\mathbf{y}^i \mid \beta^i; \mathbf{t}^i, \sigma)\, p(\beta^i \mid \mathcal{M})\, d\beta^i = (2\pi)^{-N^i/2} |\Sigma^i|^{-1/2} \exp\!\big(-\tfrac{1}{2}(\mathbf{y}^i - \alpha^T \Phi(\mathbf{t}^i))^T (\Sigma^i)^{-1} (\mathbf{y}^i - \alpha^T \Phi(\mathbf{t}^i))\big)$, where $\Sigma^i$ is the marginal covariance $\Phi(\mathbf{t}^i)^T D\, \Phi(\mathbf{t}^i) + \sigma^2 I_{N^i}$. (More generally, the fixed and random effects can be associated with different basis functions.) [sent-57, score-0.701]
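A minimal numerical sketch of this marginal likelihood (a hypothetical helper, not the authors' code; it assumes a polynomial basis, with alpha, D, sigma standing for the parameters of M):

    import numpy as np

    def marginal_loglik(y, t, alpha, D, sigma, degree=2):
        # log p(y; t, M) for one subject, with the random effects integrated out.
        Phi = np.vstack([t ** p for p in range(degree + 1)])   # d x N basis matrix at the observation times
        N = len(y)
        mean = Phi.T @ alpha                                   # population trend
        Sigma = Phi.T @ D @ Phi + sigma ** 2 * np.eye(N)       # marginal covariance Sigma^i
        resid = y - mean
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (N * np.log(2 * np.pi) + logdet + resid @ np.linalg.solve(Sigma, resid))

Because the random effects are integrated out in closed form, sequences of any length and any sampling times are handled by the same expression.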
25 (Figure 1 panels: logical memory II, Normal group; logical memory II, Impaired group.) [sent-59, score-0.384]
26 The data likelihood for $Y = \{\mathbf{y}^1, \mathbf{y}^2, \cdots, \mathbf{y}^k\}$ with $T = \{\mathbf{t}^1, \mathbf{t}^2, \cdots, \mathbf{t}^k\}$ is then $p(Y; T, \mathcal{M}) = \prod_{i=1}^{k} p(\mathbf{y}^i \mid \mathbf{t}^i; \mathcal{M})$. The maximum likelihood values of $\{\alpha, D, \sigma\}$ are found using the Expectation-Maximization (EM) algorithm. [sent-76, score-0.078]
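The per-subject posterior over the random effects, which such an EM fit needs in its E-step, has the standard Bayesian linear-regression form; a sketch under the same assumptions as above (not the authors' code):

    import numpy as np

    def posterior_random_effects(y, t, alpha, D, sigma, degree=2):
        # Posterior mean and covariance of beta^i given (t^i, y^i) and the current {alpha, D, sigma}.
        Phi = np.vstack([t ** p for p in range(degree + 1)])
        resid = y - Phi.T @ alpha
        precision = Phi @ Phi.T / sigma ** 2 + np.linalg.inv(D)   # d x d posterior precision
        cov = np.linalg.inv(precision)
        mean = cov @ Phi @ resid / sigma ** 2
        return mean, cov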
27 For the four motor behavior measurements, we use the logarithm of the data to reduce the skew of the residuals. [sent-83, score-0.075]
28 Figure 1 shows the fitted models for seconds and logical memory II, as representatives of the six measurements. [sent-84, score-0.267]
29 The plots confirm that subjects that become impaired deteriorate faster than those who remain healthy. [sent-87, score-0.097]
30 We have two components: one fit on the normal group (denoted M0 ) and one fit on the impaired group (denoted M1 ), with Mm = {αm , Dm , σm }, m = 0, 1. [sent-89, score-0.218]
31 The overall generative process for any individual (ti , yi ) is summarized in Figure 2. [sent-92, score-0.268]
32 Here z i ∈ {0, 1} is the latent variable indicating which model component is used to generate yi . [sent-93, score-0.171]
33 Fisher Kernel Background: The Fisher kernel [4] provides a way to extract discriminative features from the generative model. [sent-95, score-0.321]
34 For any θ-parameterized model p(x; θ), the Fisher kernel between $x_i$ and $x_j$ is defined as $K(x_i, x_j) = (\nabla_\theta \log p(x_i; \theta))^T\, I^{-1}\, \nabla_\theta \log p(x_j; \theta)$, (6) where I is the Fisher information matrix with (n, m) entry $I_{n,m} = \int \frac{\partial \log p(x; \theta)}{\partial \theta_n}\, \frac{\partial \log p(x; \theta)}{\partial \theta_m}\, p(x; \theta)\, dx$. (7) [sent-96, score-0.781]
35 The kernel entry $K(x_i, x_j)$ can be viewed as the inner product of the natural gradients $I^{-1}\nabla_\theta \log p(x; \theta)$ at $x_i$ and $x_j$ under the metric I, and is invariant to re-parametrization of θ. [sent-97, score-0.55]
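A schematic implementation of Equations (6)-(7), with the Fisher information estimated by a Monte Carlo average of score outer products over draws from the model (a sketch; the score function and the model sampler are assumed to be supplied):

    import numpy as np

    def fisher_kernel(score, samples_from_model, x_i, x_j):
        # score(x) returns the gradient of log p(x; theta) with respect to theta.
        S = np.stack([score(x) for x in samples_from_model])   # M x P matrix of Fisher scores
        I = S.T @ S / len(samples_from_model)                  # Monte Carlo estimate of Eq. (7)
        s_i, s_j = score(x_i), score(x_j)
        return s_i @ np.linalg.solve(I, s_j)                   # Eq. (6)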
36 Jaakkola and Haussler [4] prove that a linear classifier based on the Fisher kernel performs at least as well as the generative model. [sent-98, score-0.321]
37 Retaining the Fisher Information Matrix: In the bioinformatics literature [3] and for longitudinal data such as ours, $p(x_i; \theta)$ is different for each individual, owing to different sequence lengths and (for longitudinal data) different sampling times $\mathbf{t}^i$. [sent-100, score-0.58]
38 The integral in Equation (7) must therefore include the distribution of sequence lengths and observation times. [sent-101, score-0.105]
39 However, where observation times are non-uniform and vary considerably between individuals (as is the case here), there is insufficient data to form an estimate by empirical averaging. [sent-103, score-0.088]
40 This spoils the geometric structure, in particular the invariance of the kernel K(xi , xj ) under change of coordinates in the model parameter space (model re-parameterization). [sent-105, score-0.442]
41 This is a significant flaw: the coordinate system used to describe the model is immaterial and should not influence the value of K(xi , xj ). [sent-106, score-0.147]
42 For probabilistic kernel regression, the choice of metric is immaterial in the limit of large training sets [4]. [sent-107, score-0.257]
43 Without the proper normalization provided by the Fisher information matrix, the kernel will be dominated by higher order entries. [sent-111, score-0.253]
44 A principled extension of the Fisher kernel provided by our hierarchical model allows proper calculation of the Fisher information matrix. [sent-112, score-0.313]
45 Hierarchical Fisher Kernel: Our kernel design is based on the generative hierarchy of the mixture of mixed-effect models shown in Figure 2. [sent-114, score-0.493]
46 We notice that the individual-specific information $\mathbf{t}^i$ enters this generative process only at the last step, while the latent variables $\gamma^i$ and $z^i$ are drawn from the Gaussian mixture model (GMM) $\tilde{\Theta} = \{\pi_0, \alpha_0, D_0, \pi_1, \alpha_1, D_1\}$, with $p(z^i, \gamma^i; \tilde{\Theta}) = \pi_{z^i}\, p(\gamma^i; \alpha_{z^i}, D_{z^i})$. [sent-115, score-0.424]
47 We can thus build a standard Fisher kernel for the latent variables, and use it to induce a kernel on the observed data. [sent-116, score-0.483]
48 Denoting the latent variables by $v^i$, the Fisher kernel between $v^i$ and $v^j$ is $K(v^i, v^j) = (\nabla_{\tilde\Theta} \log p(v^i; \tilde\Theta))^T (I_v)^{-1} \nabla_{\tilde\Theta} \log p(v^j; \tilde\Theta)$. (Footnote: our experiments on the OBAS data show that replacing the Fisher information with the identity compromises classifier performance.) [sent-117, score-0.473]
49 Here the Fisher score $\nabla_{\tilde\Theta} \log p(v^i; \tilde\Theta)$ is the column vector $\big[\frac{\partial \log p}{\partial \pi_0}; \frac{\partial \log p}{\partial \alpha_0}; \frac{\partial \log p}{\partial D_0}; \frac{\partial \log p}{\partial \pi_1}; \frac{\partial \log p}{\partial \alpha_1}; \frac{\partial \log p}{\partial D_1}\big]^T$, and $I_v$ is the well-defined Fisher information matrix for v: $(I_v)_{n,m} = \int \frac{\partial \log p(v; \tilde\Theta)}{\partial \tilde\Theta_n}\, \frac{\partial \log p(v; \tilde\Theta)}{\partial \tilde\Theta_m}\, p(v \mid \tilde\Theta)\, dv$. (8) [sent-118, score-0.77]
50 The kernel for $\mathbf{y}^i$ and $\mathbf{y}^j$ is the expectation of $K(v^i, v^j)$ given the observations $\mathbf{y}^i$ and $\mathbf{y}^j$. [sent-119, score-0.715]
51 $K(\mathbf{y}^i, \mathbf{y}^j) = E_{v^i, v^j}[K(v^i, v^j) \mid \mathbf{y}^i, \mathbf{y}^j; \mathbf{t}^i, \mathbf{t}^j, \mathcal{M}] = \int\!\!\int K(v^i, v^j)\, p(v^i \mid \mathbf{y}^i; \mathbf{t}^i, \mathcal{M})\, p(v^j \mid \mathbf{y}^j; \mathbf{t}^j, \mathcal{M})\, dv^i\, dv^j$. With different choices of the latent variable v, we have three kernel design strategies, described in the following subsections. [sent-120, score-1.784]
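Because K(v^i, v^j) is bilinear in the Fisher scores, this expectation reduces to an inner product of posterior-averaged scores; a Monte Carlo sketch (hypothetical helpers: score(v) is the Fisher score of the latent variables under Θ̃, posterior_sampler draws v from p(v | y; t, M), and I_v is the Fisher information of Eq. (8)):

    import numpy as np

    def hfk_entry(score, posterior_sampler, I_v, y_i, t_i, y_j, t_j, n_samples=200):
        def avg_score(y, t):
            draws = posterior_sampler(y, t, n_samples)         # draws of the latent v for one subject
            return np.mean([score(v) for v in draws], axis=0)  # posterior-averaged Fisher score
        g_i, g_j = avg_score(y_i, t_i), avg_score(y_j, t_j)
        return g_i @ np.linalg.solve(I_v, g_j)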
52 This extension to the Fisher kernel, named hierarchical Fisher kernel (HFK), enables us to deal with time series with irregular sampling and different sequence lengths. [sent-121, score-0.271]
53 Design A: $v^i = \gamma^i$. This kernel design marginalizes out the higher-level variable $\{z^i\}$ and constructs a Fisher kernel between the $\{\gamma^i\}$. [sent-123, score-0.556]
54 This generative process is illustrated in Figure 3 (left panel), which is the same graphical model as in Figure 2 with the latent variable $z^i$ marginalized out. [sent-124, score-0.235]
55 The Fisher kernel for γ is $K(\gamma^i, \gamma^j) = (\nabla_{\tilde\Theta} \log p(\gamma^i; \tilde\Theta))^T (I_\gamma)^{-1} \nabla_{\tilde\Theta} \log p(\gamma^j; \tilde\Theta)$. (9) [sent-125, score-0.365]
56 The kernel between $\mathbf{y}^i$ and $\mathbf{y}^j$ is the expectation of $K(\gamma^i, \gamma^j)$: $K(\mathbf{y}^i, \mathbf{y}^j) = E_{\gamma^i, \gamma^j}(K(\gamma^i, \gamma^j) \mid \mathbf{y}^i, \mathbf{y}^j; \mathbf{t}^i, \mathbf{t}^j, \mathcal{M})$ (10) $= \big(\int \nabla_{\tilde\Theta} \log p(\gamma^i; \tilde\Theta)\, p(\gamma^i \mid \mathbf{y}^i; \mathbf{t}^i, \mathcal{M})\, d\gamma^i\big)^T (I_\gamma)^{-1} \int \nabla_{\tilde\Theta} \log p(\gamma^j; \tilde\Theta)\, p(\gamma^j \mid \mathbf{y}^j; \mathbf{t}^j, \mathcal{M})\, d\gamma^j$. (11) [sent-126, score-1.995]
57 The computational drawback is that the integrals required to evaluate $\int \nabla_{\tilde\Theta} \log p(\gamma^j; \tilde\Theta)\, p(\gamma^j \mid \mathbf{y}^j; \mathbf{t}^j, \mathcal{M})\, d\gamma^j$ and $I_\gamma$ do not have an analytical solution. [sent-127, score-0.115]
58 Design B: $v^i = (z^i, \gamma^i)$. This design strategy takes both $\gamma^i$ and $z^i$ as the joint latent variable and builds a Fisher kernel for them. [sent-129, score-0.406]
59 The generative process, as summarized in Figure 3 (middle panel), gives the probability of the latent variables as $p(z^i, \gamma^i; \tilde\Theta) = \pi_{z^i}\, p(\gamma^i; \alpha_{z^i}, D_{z^i})$. [sent-130, score-0.171]
60 The Fisher kernel for the joint variable $(z^i, \gamma^i)$ is $K((z^i, \gamma^i), (z^j, \gamma^j)) = (\nabla_{\tilde\Theta} \log p(z^i, \gamma^i; \tilde\Theta))^T (I_{z,\gamma})^{-1} \nabla_{\tilde\Theta} \log p(z^j, \gamma^j; \tilde\Theta)$, (12) where $I_{z,\gamma}$ is the Fisher information matrix associated with the distribution $p(z, \gamma; \tilde\Theta)$. [sent-131, score-0.365]
61 Design C: $\mathcal{M} = \mathcal{M}_m$, m = 0, 1. This design uses a single mixed-effect component instead of the mixture as the generative model, as illustrated in Figure 3 (right panel). [sent-135, score-0.282]
62 Although a single $\mathcal{M}_m$ is not a satisfactory generative model for the whole population, the resulting kernel is still useful for classification, as follows. [sent-136, score-0.321]
63 For either model, m = 0, 1, the Fisher score for the ith individual, $\nabla_{\Theta_m} \log p(\gamma^i; \Theta_m)$, describes how the probability $p(\gamma^i; \Theta_m)$ responds to changes of the parameters $\Theta_m$. [sent-137, score-0.125]
64 This is a discriminative feature vector, since the likelihoods of $\gamma^i$ for individuals from different groups are likely to respond differently to changes of the parameters $\Theta_m$. [sent-138, score-0.171]
65 The kernel between $\gamma^i$ and $\gamma^j$ is $K_m(\gamma^i, \gamma^j)$, defined in Equation (13). (Figure 3: the graphical models used in Design A, Design B, and Design C.) [sent-139, score-0.249]
66 The kernel for $\mathbf{y}^i$ and $\mathbf{y}^j$ is then $K(\mathbf{y}^i, \mathbf{y}^j) = E_{\gamma^i, \gamma^j}(K(\gamma^i, \gamma^j) \mid \mathbf{y}^i, \mathbf{y}^j; \mathbf{t}^i, \mathbf{t}^j, \mathcal{M}_m)$. (15) Our experiments show that the kernel based on the impaired group is significantly better than the others; we therefore use this kernel as the representative of Design C. [sent-141, score-1.786]
67 It is easy to see that the designed kernel is a special case of Design A or Design B when π0 = 1 and π1 = 0. [sent-142, score-0.211]
68 Related Models, Marginalized Kernel: Our HFK is related to the marginalized kernel (MK) proposed by Tsuda et al. [10]. [sent-144, score-0.275]
69 MK uses a distribution with a discrete latent variable h (indicating the generating component) and an observable x, which together form a complete data pair $\mathbf{x} = (h, x)$. [sent-147, score-0.098]
70 The kernel for observables $x_i$ and $x_j$ is defined as $K(x_i, x_j) = \sum_{h_i} \sum_{h_j} P(h_i \mid x_i)\, P(h_j \mid x_j)\, K(\mathbf{x}_i, \mathbf{x}_j)$, (16) where $K(\mathbf{x}_i, \mathbf{x}_j)$ is the joint kernel for the complete data. [sent-148, score-1.047]
71 Tsuda et al. [10] use the form $K(\mathbf{x}_i, \mathbf{x}_j) = \delta(h_i, h_j)\, K_{h_i}(x_i, x_j)$, (17) where $K_{h_i}(x_i, x_j)$ is a pre-defined kernel for observables associated with the $h_i$-th generative component. [sent-151, score-0.589]
72 Equation (17) says that $K(x_i, x_j)$ takes the value of the kernel defined for the mth component model if $x_i$ and $x_j$ are generated from the same component $h_i = h_j = m$; otherwise, $K(x_i, x_j) = 0$. [sent-152, score-0.698]
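A small sketch of Equations (16)-(17) for a finite set of components (hypothetical arguments: post_i[h] = P(h | x_i), and component_kernels[h] is a pre-defined kernel K_h on the observables):

    def marginalized_kernel(post_i, post_j, component_kernels, x_i, x_j):
        # With the joint kernel of Eq. (17), only terms with h_i = h_j survive the double sum of Eq. (16).
        return sum(post_i[h] * post_j[h] * component_kernels[h](x_i, x_j)
                   for h in range(len(component_kernels)))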
73 HFK can be viewed as a special case of the generalized marginalized kernel that allows continuous latent variables h. [sent-153, score-0.336]
74 This is clear if we re-write Equation (16) as $K(x_i, x_j) = E_{h_i, h_j}(K(\mathbf{x}_i, \mathbf{x}_j) \mid x_i, x_j)$ and view $K(\mathbf{x}_i, \mathbf{x}_j)$ as a generalization of a kernel between $h_i$ and $h_j$. [sent-154, score-0.638]
75 Nevertheless, HFK differs from the original work in [10] in that MK requires existing kernels for the observables, such as $K_h(x_i, x_j)$ in Equation (17). [sent-155, score-0.185]
76 In our problem setting, this kernel does not exist due to the different lengths of time series. [sent-156, score-0.278]
77 Probability Product Kernel: We can get a family of kernels by employing various designs of the kernel $K(v^i, v^j)$. [sent-157, score-0.339]
78 We tested the classifiers on the five features: steps, seconds, tappingD, tappingN, and logical memory II. [sent-163, score-0.192]
79 The results for delayed-recall are omitted; they are very close to those for logical memory II. [sent-164, score-0.192]
80 For each feature, the kernels are used in support vector machines (SVM) for classification, and the ROC is obtained by thresholding the classifier output with varying values. [sent-166, score-0.084]
81 The classifiers are evaluated by leave-one-out cross-validation, the left-out sample consisting of an individual subject’s complete time series (which is also held out of the fitting of the generative model). [sent-167, score-0.158]
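A sketch of this evaluation protocol for a precomputed kernel matrix K and binary labels (it omits the refitting of the generative model for each held-out subject; scikit-learn is assumed):

    import numpy as np
    from sklearn.svm import SVC

    def loo_decision_values(K, labels):
        # Leave-one-subject-out decision values; thresholding them at varying values traces out the ROC.
        labels = np.asarray(labels)
        scores = np.empty(len(labels))
        for i in range(len(labels)):
            train = np.delete(np.arange(len(labels)), i)
            clf = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], labels[train])
            scores[i] = clf.decision_function(K[i, train][None, :])[0]
        return scores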
82 Second, we consider a feature extraction routine independent of any generative model. [sent-172, score-0.11]
83 We summarize each individual i with the least-squares fit coefficients of a degree-d polynomial regression model, denoted $\hat{p}^i$. [sent-173, score-0.082]
84 To get a reliable fit, we only consider the case d = 1, since many individuals have only four or five observations. [sent-174, score-0.088]
85 We use the fit coefficients, denoted $\hat{p}^i$, as the feature vector, and build an RBF kernel $G_{ij} = \exp\!\big(-\|\hat{p}^i - \hat{p}^j\|_2^2 / (2 s^2)\big)$, where s is the kernel width, estimated with leave-one-out cross validation in our experiment. [sent-178, score-0.456]
86 The obtained kernel matrix G will be referred to as the LSQ kernel. [sent-179, score-0.211]
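A compact sketch of this baseline (a hypothetical helper; series is a list of (t^i, y^i) arrays and s is the kernel width):

    import numpy as np

    def lsq_kernel(series, s):
        # Degree-1 least-squares coefficients per subject, then an RBF kernel on those coefficients.
        P = np.array([np.polyfit(t, y, deg=1) for t, y in series])        # (slope, intercept) per subject
        sq_dists = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)    # pairwise squared distances
        return np.exp(-sq_dists / (2 * s ** 2))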
87 On logical memory II (story recall), the three designs have comparable performance. [sent-184, score-0.236]
88 On the four motor tests, the classifier based on HFK clearly outperforms the other two classifiers, and on logical memory II the three classifiers have comparable performance. [sent-186, score-0.267]
89 Discussion: Fisher kernels derived from mixed-effect generative models retain the Fisher information matrix, and hence the proper invariance of the kernel under change of coordinates in the model parameter space. [sent-187, score-0.543]
90 In additional experiments, classifiers constructed with the proper kernel out-perform those constructed with the identity matrix in place of the Fisher information on our data. [sent-188, score-0.3]
91 (On this comparison the kernel computed with the proper Fisher information achieves AUC = 0.7333, while the Fisher kernel computed with the identity matrix as the metric on $p(\mathbf{y}^i; \mathbf{t}^i, \mathcal{M})$ achieves a lower AUC.) [sent-190, score-0.534]
92 Our classifiers built with Fisher kernels derived from mixed-effect models outperform those based solely on the generative model (using likelihood ratio tests) for the motor task data, and are comparable on the psychometric tests. [sent-193, score-0.381]
93 The hierarchical kernels also produce better classifiers than a standard SVM using the coefficients of a least squares fit to the individual’s data. [sent-194, score-0.144]
94 This shows that the generative model provides a real advantage for classification. [sent-195, score-0.11]
95 The mixed-effect models capture both the population behavior (through α) and the statistical variability of the individual subject models (through the covariance of β). [sent-196, score-0.161]
The statistics of the subject variability are extremely important for classification: although not discussed here, classifiers based only on the population model (α) perform far worse than those presented here [7]. [sent-357, score-0.113]
97 Motor slowing precedes cognitive impairment in the oldest old. [sent-368, score-0.198]
98 The Oregon brain aging study: Neuropathology accompanying healthy aging in the oldest old. [sent-374, score-0.363]
99 Using the fisher kernel method to detect remote protein homologies. [sent-380, score-0.211]
100 Independent predictors of cognitive decline in healthy elderly persons. [sent-416, score-0.246]
wordName wordTfidf (topN-words)
[('fisher', 0.456), ('ti', 0.276), ('designc', 0.228), ('kernel', 0.211), ('hfk', 0.205), ('alarm', 0.162), ('yj', 0.142), ('aging', 0.137), ('decline', 0.137), ('design', 0.134), ('logical', 0.128), ('longitudinal', 0.128), ('designa', 0.114), ('designb', 0.114), ('lkhd', 0.114), ('lsqk', 0.114), ('yi', 0.11), ('generative', 0.11), ('false', 0.101), ('xj', 0.101), ('oregon', 0.1), ('impaired', 0.097), ('classi', 0.093), ('kaye', 0.091), ('tappingd', 0.091), ('tj', 0.09), ('individuals', 0.088), ('kernels', 0.084), ('mm', 0.082), ('detection', 0.08), ('log', 0.077), ('seconds', 0.075), ('motor', 0.075), ('ers', 0.074), ('psychometric', 0.073), ('rate', 0.071), ('population', 0.07), ('auc', 0.069), ('lsq', 0.068), ('tappingn', 0.068), ('mg', 0.068), ('lengths', 0.067), ('hj', 0.066), ('cognitive', 0.066), ('marginalized', 0.064), ('memory', 0.064), ('latent', 0.061), ('hierarchical', 0.06), ('xi', 0.06), ('tests', 0.059), ('invariance', 0.058), ('hi', 0.058), ('story', 0.055), ('age', 0.053), ('tsuda', 0.051), ('individual', 0.048), ('identity', 0.047), ('er', 0.047), ('alzheimer', 0.046), ('destroys', 0.046), ('dzi', 0.046), ('evi', 0.046), ('hkf', 0.046), ('howieson', 0.046), ('immaterial', 0.046), ('khi', 0.046), ('layton', 0.046), ('obas', 0.046), ('ohsu', 0.046), ('oldest', 0.046), ('sexton', 0.046), ('slowing', 0.046), ('group', 0.044), ('designs', 0.044), ('km', 0.043), ('healthy', 0.043), ('subject', 0.043), ('proper', 0.042), ('zi', 0.042), ('iz', 0.04), ('impairment', 0.04), ('likelihood', 0.039), ('mixture', 0.038), ('integral', 0.038), ('coordinates', 0.038), ('dm', 0.038), ('observable', 0.037), ('laird', 0.037), ('contributing', 0.037), ('tap', 0.037), ('mk', 0.036), ('jaakkola', 0.035), ('denoted', 0.034), ('iv', 0.034), ('roc', 0.034), ('ii', 0.034), ('geometric', 0.034), ('normal', 0.033), ('panel', 0.032), ('neurology', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
2 0.12645495 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
3 0.12326555 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh
Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.
4 0.12248254 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
5 0.10285579 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
6 0.097247161 178 nips-2008-Performance analysis for L\ 2 kernel classification
7 0.086126156 228 nips-2008-Support Vector Machines with a Reject Option
8 0.085402943 138 nips-2008-Modeling human function learning with Gaussian processes
9 0.082496375 143 nips-2008-Multi-label Multiple Kernel Learning
10 0.082180806 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
11 0.081378229 226 nips-2008-Supervised Dictionary Learning
12 0.07552205 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
13 0.075193502 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
14 0.073610432 44 nips-2008-Characteristic Kernels on Groups and Semigroups
15 0.070470929 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
16 0.069481581 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
17 0.069046624 78 nips-2008-Exact Convex Confidence-Weighted Learning
18 0.069033027 119 nips-2008-Learning a discriminative hidden part model for human action recognition
19 0.068330385 32 nips-2008-Bayesian Kernel Shaping for Learning Control
20 0.067684449 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
topicId topicWeight
[(0, -0.225), (1, -0.069), (2, 0.005), (3, 0.021), (4, 0.052), (5, -0.028), (6, 0.053), (7, 0.005), (8, 0.073), (9, 0.081), (10, 0.19), (11, -0.027), (12, 0.025), (13, -0.006), (14, 0.035), (15, 0.061), (16, 0.047), (17, -0.167), (18, -0.118), (19, -0.001), (20, 0.086), (21, -0.043), (22, 0.056), (23, 0.034), (24, -0.029), (25, -0.041), (26, 0.062), (27, 0.003), (28, 0.144), (29, -0.028), (30, -0.014), (31, 0.04), (32, 0.093), (33, 0.003), (34, 0.03), (35, 0.039), (36, -0.099), (37, -0.046), (38, -0.03), (39, 0.004), (40, -0.058), (41, 0.109), (42, 0.043), (43, 0.023), (44, -0.081), (45, 0.002), (46, 0.09), (47, 0.046), (48, -0.017), (49, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.96179336 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
2 0.72095692 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
3 0.66794574 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
4 0.66225332 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
5 0.66185153 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh
Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.
6 0.62374526 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
7 0.58160585 196 nips-2008-Relative Margin Machines
8 0.55022198 78 nips-2008-Exact Convex Confidence-Weighted Learning
9 0.53301126 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
10 0.51988965 226 nips-2008-Supervised Dictionary Learning
11 0.51362908 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
12 0.50461179 143 nips-2008-Multi-label Multiple Kernel Learning
13 0.4771823 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
14 0.47319609 32 nips-2008-Bayesian Kernel Shaping for Learning Control
15 0.47262695 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
16 0.46665734 228 nips-2008-Support Vector Machines with a Reject Option
17 0.46391451 44 nips-2008-Characteristic Kernels on Groups and Semigroups
18 0.45823222 138 nips-2008-Modeling human function learning with Gaussian processes
19 0.44107842 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
20 0.43643886 72 nips-2008-Empirical performance maximization for linear rank statistics
topicId topicWeight
[(6, 0.077), (7, 0.071), (12, 0.017), (15, 0.023), (18, 0.317), (28, 0.168), (57, 0.085), (59, 0.018), (63, 0.019), (71, 0.011), (77, 0.022), (78, 0.018), (83, 0.06)]
simIndex simValue paperId paperTitle
1 0.93251503 33 nips-2008-Bayesian Model of Behaviour in Economic Games
Author: Debajyoti Ray, Brooks King-casas, P. R. Montague, Peter Dayan
Abstract: Classical game theoretic approaches that make strong rationality assumptions have difficulty modeling human behaviour in economic games. We investigate the role of finite levels of iterated reasoning and non-selfish utility functions in a Partially Observable Markov Decision Process model that incorporates game theoretic notions of interactivity. Our generative model captures a broad class of characteristic behaviours in a multi-round Investor-Trustee game. We invert the generative process for a recognition model that is used to classify 200 subjects playing this game against randomly matched opponents. 1
2 0.80199301 186 nips-2008-Probabilistic detection of short events, with application to critical care monitoring
Author: Norm Aleks, Stuart Russell, Michael G. Madden, Diane Morabito, Kristan Staudenmayer, Mitchell Cohen, Geoffrey T. Manley
Abstract: We describe an application of probabilistic modeling and inference technology to the problem of analyzing sensor data in the setting of an intensive care unit (ICU). In particular, we consider the arterial-line blood pressure sensor, which is subject to frequent data artifacts that cause false alarms in the ICU and make the raw data almost useless for automated decision making. The problem is complicated by the fact that the sensor data are averaged over fixed intervals whereas the events causing data artifacts may occur at any time and often have durations significantly shorter than the data collection interval. We show that careful modeling of the sensor, combined with a general technique for detecting sub-interval events and estimating their duration, enables detection of artifacts and accurate estimation of the underlying blood pressure values. Our model’s performance identifying artifacts is superior to two other classifiers’ and about as good as a physician’s. 1
same-paper 3 0.78397185 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
4 0.65751386 101 nips-2008-Human Active Learning
Author: Rui M. Castro, Charles Kalish, Robert Nowak, Ruichen Qian, Tim Rogers, Xiaojin Zhu
Abstract: We investigate a topic at the interface of machine learning and cognitive science. Human active learning, where learners can actively query the world for information, is contrasted with passive learning from random examples. Furthermore, we compare human active learning performance with predictions from statistical learning theory. We conduct a series of human category learning experiments inspired by a machine learning task for which active and passive learning error bounds are well understood, and dramatically distinct. Our results indicate that humans are capable of actively selecting informative queries, and in doing so learn better and faster than if they are given random training data, as predicted by learning theory. However, the improvement over passive learning is not as dramatic as that achieved by machine active learning algorithms. To the best of our knowledge, this is the first quantitative study comparing human category learning in active versus passive settings. 1
5 0.5728603 62 nips-2008-Differentiable Sparse Coding
Author: J. A. Bagnell, David M. Bradley
Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1
6 0.5696919 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
7 0.56885433 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
8 0.56847131 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.5679394 226 nips-2008-Supervised Dictionary Learning
10 0.56775564 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
11 0.56609911 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
12 0.56561702 194 nips-2008-Regularized Learning with Networks of Features
13 0.56467879 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
14 0.56399143 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
15 0.56347173 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
16 0.56330818 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
17 0.56181568 231 nips-2008-Temporal Dynamics of Cognitive Control
18 0.56129783 4 nips-2008-A Scalable Hierarchical Distributed Language Model
19 0.56111759 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
20 0.56105703 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models