nips nips2005 nips2005-77 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fan Li, Yiming Yang, Eric P. Xing
Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.
Reference: text
sentIndex sentText sentNum sentScore
1 From Lasso regression to Feature vector machine 1 Fan Li1 , Yiming Yang1 and Eric P. [sent-1, score-0.175]
2 edu 2 Abstract Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. [sent-4, score-0.432]
3 Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. [sent-6, score-0.534]
4 It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. [sent-9, score-0.764]
5 FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. [sent-10, score-0.625]
6 Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete. [sent-11, score-0.211]
7 1 Introduction Finding a small subset of most predictive features in a high dimensional feature space is an interesting problem with many important applications, e. [sent-12, score-0.331]
8 , 1996]) is often an effective technique for shrinkage and feature selection. [sent-16, score-0.231]
9 The loss function of Lasso regression is defined as: L= βp xip )2 + λ (yi − i p ||βp ||1 p where xip denotes the pth predictor (feature) in the ith datum, yi denotes the value of the response in this datum, and βp denotes the regression coefficient of the pth feature. [sent-17, score-1.021]
10 The norm-1 regularizer p ||βp ||1 in Lasso regression typically leads to a sparse solution in the feature space, which means that the regression coefficients for most irrelevant or redundant features are shrunk to zero. [sent-18, score-0.734]
11 , 2003] indicates that Lasso regression is particularly effective when there are many irrelevant features and only a few training examples. [sent-20, score-0.278]
12 One of the limitations of standard Lasso regression is its assumption of linearity in the feature space. [sent-21, score-0.389]
13 Hence it is inadequate to capture non-linear dependencies from features to responses (output variables). [sent-22, score-0.134]
14 This loss function typically yields a sparse solution in the instance space, but not in feature space where data was originally represented. [sent-26, score-0.34]
15 Thus GLR does not lead to compression of data in the feature space. [sent-27, score-0.212]
16 They introduced feature scaling kernels in the form of: Kθ (xi , xj ) = φ(xi ∗ θ)φ(xj ∗ θ) = K(xi ∗ θ, xj ∗ θ) where xi ∗ θ denotes the component-wise product between two vectors: x i ∗ θ = (xi1 θ1 , . [sent-32, score-0.4]
17 , 2003] used a feature scaling polynomial kernel: Kγ (xi , xj ) = (1 + γp xip xjp )k , p 2 where γp = θp . [sent-37, score-0.539]
18 With a norm-1 or norm-0 penalizer on γ in the loss function of a feature scaling kernel machine, a sparse solution is supposed to identify the most influential features. [sent-38, score-0.425]
19 Notice that in this formalism the feature scaling vector θ is inside the kernel function, which means that the solution space of θ could be non-convex. [sent-39, score-0.425]
20 Thus, estimating θ in feature scaling kernel machines is a much harder problem than the convex optimization problem in conventional SVM of which the weight parameters to be estimated are outside of the kernel functions. [sent-40, score-0.5]
21 The last property is particularly desirable in the sense that it will allow us to leverage many existing works in kernel machines which have been very successful in SVM-related research. [sent-42, score-0.134]
22 We propose a new approach where the key idea is to re-formulate and extend Lasso regression into a form that is similar to SVM except that it generates a sparse solution in the feature space rather than in the instance space. [sent-43, score-0.482]
23 We call our newly formulated and extended Lasso regression ”Feature Vector Machine” (FVM). [sent-44, score-0.212]
24 The concepts of support vectors, kernels and slack variables can be easily adapted in FVM. [sent-46, score-0.139]
25 Most importantly, all the parameters we need to estimate for FVM are outside of the kernel functions, ensuring the convexity of the solution space, which is the same as in SVM. [sent-47, score-0.149]
26 1 When a linear kernel is put to use with no slack variables, FVM reduces to the standard Lasso regression. [sent-48, score-0.144]
27 1 Notice that we can not only use FVM to select important features from training data, but also use it to predict the values of response variables for test data (see section 5). [sent-49, score-0.179]
28 This only involves with a one-dimensional optimization problem with respect to the response variable for the test example. [sent-52, score-0.127]
29 , 2004] has recently developed an interesting feature selection technique named ”potential SVM”, which has the same form as the basic version of FVM (with linear kernel and no slack variables). [sent-56, score-0.419]
30 Furthermore, their method does not work for feature selection tasks with non-linear models since they did not introduce the concepts of kernels defined on feature vectors. [sent-58, score-0.585]
31 In section 2, we analyze some geometric similarities between the solution hyper-planes in the standard Lasso regression and in SVM. [sent-59, score-0.239]
32 In section 3, we re-formulate Lasso regression in a SVM style form. [sent-60, score-0.16]
33 In this form, all the operations on the training data can be expressed by dot products between feature vectors. [sent-61, score-0.236]
34 In section 4, we introduce kernels (defined for feature vectors) to FVM so that it can be used for feature selection with non-linear models. [sent-62, score-0.568]
35 2 Geometric parity between the solution hyper-planes of Lasso regression and SVM Formally, let X = [x1 , . [sent-65, score-0.202]
36 A feature vector can be defined as a transposed row in the sample matrix, i. [sent-72, score-0.252]
37 Now consider an example space of which each basis is represented by an x i in our sample matrix (note that this is different from the space “spanned” by the sample vectors). [sent-86, score-0.121]
38 Under the example space, both the features fq and the response vector y can be regarded as a point in this space. [sent-87, score-0.544]
39 These feature vectors, together with the response variable, determine the directions of these two hyper-planes. [sent-89, score-0.285]
40 This geometric view can be drawn from the following recast of the Lasso regression due to [Perkins et al. [sent-90, score-0.24]
41 , 2003]: λ | (yi − βp xip )xiq | ≤ , ∀q 2 p i λ , ∀q. [sent-91, score-0.257]
42 It can be shown that equality only holds for non-zero weighted features, and all the zero weighted feature vectors are between the hyper-planes with λ/2 margin (Fig. [sent-96, score-0.323]
43 It’s well known that the classification hyper-planes in SVM can be extended to hyper-surfaces by introducing kernels defined for example vectors. [sent-103, score-0.109]
44 Given the similarity of the feature a X1 response variable feature a feature f X3 X5 X1 feature b feature e feature b X2 feature c X4 feature d X2 X6 (a) X8 (b) Figure 1: Lasso regression vs. [sent-105, score-1.947]
45 (a) The solution of Lasso regression in the example space. [sent-107, score-0.202]
46 Only feature a and d have non-zero weights, and hence the support features. [sent-109, score-0.212]
47 geometric structures of Lasso regression and SVM, it is nature to pursue in parallel how one can apply similar “kernel tricks” to the feature vectors in Lasso regression, so that its feature selection power can be extended to non-linear models. [sent-115, score-0.735]
48 3 A re-formulation of Lasso regression akin to SVM [Hochreiter et al. [sent-117, score-0.22]
49 1 2 βp xip )2 i (yi − p βp xip )xiq | ≤ i( | p λ 2 ∀q. [sent-120, score-0.514]
50 , to relate the objection function to that of the Lasso regression, and to extend the objective function using kernel tricks in a way similar to SVM. [sent-135, score-0.114]
51 In other words, Lasso regression can be re-formulated as Eq. [sent-138, score-0.16]
52 Then, based on this re-formulation, we show how to introduce kernels to allow feature selection under a non-linear Lasso regression. [sent-140, score-0.356]
53 (3), and its kernelized extensions, as feature vector machine (FVM). [sent-142, score-0.227]
54 Proposition 2: For Problem (3), its solution β satisfies the Lasso sandwich Sketch of proof: Following the equivalence between feature matrix F and sample matrix X (see the begin of §2), Problem (3) can be re-written as: 1 T 2 2 ||X β|| T minβ s. [sent-147, score-0.309]
55 The optimizer satisfies: βL = XX T β − XX T (α+ − α− ) = 0 Suppose the data matrix X has been pre-processed so that the feature vectors are centered and normalized. [sent-152, score-0.288]
56 In this case the elements of XX T reflect the correlation coefficients of feature pairs and XX T is non-singular. [sent-153, score-0.212]
57 From the KKT condition, we know i (yi − p βp xip )xiq = − λ holds at this time. [sent-156, score-0.257]
58 For 2 the same reason we can get when βq < 0, α−q should be larger than zero thus i (yi − λ p βp xip )xiq = 2 holds. [sent-157, score-0.257]
59 When βq = 0, α+q and α−q must both be zero (it’s easy to see they can not be both non-zero from KKT condition), thus from KKT condition, both λ λ i (yi − p βp xip )xiq > − 2 and i (yi − p βp xip )xiq < 2 hold now, which means | i (yi − p βp xip )xiq | < λ at this time. [sent-158, score-0.771]
60 4 Feature kernels In many cases, the dependencies between feature vectors are non-linear. [sent-162, score-0.368]
61 Note that unlike SVM, our kernels are defined on feature vectors instead of the sampled vectors (i. [sent-164, score-0.403]
62 Suppose that two feature vectors fp and fq have a non-linear dependency relationship. [sent-168, score-0.783]
63 In the absence of linear interaction between fp and fq in the the original space, we assume that they can be mapped to some (higher dimensional, possibly infinite-dimensional) space via transformation φ(·), so that φ(fq ) and φ(fq ) interact linearly, i. [sent-169, score-0.538]
64 We introduce kernel K(fq , fp ) = φ(fp )T φ(fq ) to represent the outcome of this operation. [sent-172, score-0.283]
65 1 2 p,q ∀q, | βp βq K(fp , fp ) p βp K(fq , fp ) − K(fq , y)| ≤ λ 2 (5) Now, in Problem 5, we no longer have φ(·), which means we do not have to work in the transformed feature space, which could be high or infinite dimensional, to capture nonlinearity of features. [sent-175, score-0.558]
66 One possible kernel is the mutual information [Cover et al. [sent-182, score-0.177]
67 , 1991] between two feature vectors: K(fp , fq ) = M I(fp , fq ). [sent-183, score-0.886]
68 This kernel requires a pre-processing step to discritize the elements of features vectors because they are continuous in general. [sent-184, score-0.289]
69 It’s easy to see that in this way, we can guarantee that for any two features p and q, K(fp , fp ) = K(fq , fq ), which means the feature vectors are normalized and have the same length in the φ space (residing on a unit sphere centered at the origin). [sent-191, score-0.902]
70 , K(fp , fq ) = K(fq , fp ), non-negative, and can be normalized. [sent-195, score-0.51]
71 Therefore, a non-linear feature selection using generalized Lasso regression with this kernel yields human interpretable results. [sent-197, score-0.515]
72 It turns out that the same procedure also seemlessly leads to a Lasso-style regularized nonlinear regression capable of predicting the response given data in the original space. [sent-199, score-0.253]
73 Specifically, given a new sample xt of unknown response, our sample matrix X grows by one column X → [X, xt ], which means all our feature vectors gets one more dimension. [sent-201, score-0.39]
74 We denote the newly elongated features by F = {fq }q∈A (note that A is the pruned index set corresponding to features whose weight βq is non-zero). [sent-202, score-0.243]
75 Let y denote the elongated response vector due to the newly given sample: y = (y1 , . [sent-203, score-0.149]
76 , yN , yt )T , it can be shown that the optimum response yt can be obtained by solving the following optimization problem 2 : minyt K(y , y ) − 2 βp K(y , fp ) (6) p∈A When we replace the kernel function K with a linear dot product, FVM reduces to Lasso regression. [sent-206, score-0.445]
77 (6) that y t = p∈A βp xtp , which is exactly how Lasso regression would predict the response. [sent-208, score-0.16]
78 In general, to predict y t , we need not only xt and β, but also the non-zero weight features extracted from the training data. [sent-212, score-0.117]
79 As in SVM, we can introduce slack variables into FVM to define a “soft” feature surface. [sent-218, score-0.286]
80 Essentially, most of the methodologies developed for SVM can be easily adapted to FVM for nonlinear feature selection. [sent-220, score-0.232]
81 6 Experiments We test FVM on a simulated dataset with 100 features and 500 examples. [sent-221, score-0.117]
82 The response variable y in the simulated data is generated by a highly nonlinear rule: 2 1 − f2 − 3 ∗ f3 + ξ. [sent-222, score-0.137]
83 y = sin(10 ∗ f1 − 5) + 4 ∗ Here feature f1 and f3 are random variables following a uniform distribution in [0, 1]; feature f2 is a random variable uniformly distributed in [−1, 1]; and ξ represents Gaussian noise. [sent-223, score-0.476]
84 , 2003] has proposed a fast grafting algorithm to solve Lasso regression, which is a special case of FVM when linear kernel is used. [sent-244, score-0.153]
85 The only difference is that, each time when we need to calculate i xpi xqi , we calculate K(fp , fq ) instead. [sent-246, score-0.337]
86 We found that fast grafting algorithm is very efficient in our case because it uses the sparse property of the solution of FVM. [sent-247, score-0.131]
87 We apply both standard Lasso regression and FVM with mutual information kernel on this dataset. [sent-248, score-0.294]
88 In one case, we set λ such that only 3 non-zero weighted features are selected; in another case, we relaxed a bit and allowed 10 features. [sent-251, score-0.146]
89 (3), under stringent λ, FVM successfully identified the three correct features, f1 , f2 and f3 , whereas Lasso regression has missed f1 and f2 , which are non-linearly correlated with y. [sent-254, score-0.21]
90 Even when λ was relaxed, Lasso regression still missed the right features, whereas FVM was very robust. [sent-255, score-0.192]
91 6 4 response variable y response variable y 5 3 2 1 0 −1 −2 −3 1 6 5 4 3 2 1 1 0 0. [sent-256, score-0.182]
92 5 −1 f1 0 Figure 2: The responses y and the two features f1 and f2 in our simulated data. [sent-268, score-0.134]
93 7 Conclusions In this paper, we proposed a novel non-linear feature selection approach named FVM, which extends standard Lasso regression by introducing kernels on feature vectors. [sent-270, score-0.774]
94 5 x 10 Lasso (10 features) 0 −1 Weight assigned to features −5 −1. [sent-272, score-0.118]
95 2 0 0 20 40 60 Feature id 80 100 0 0 20 40 60 80 100 Feature id Figure 3: Results of FVM and the standard Lasso regression on this dataset. [sent-283, score-0.217]
96 The X axis represents the feature IDs and the Y axis represents the weights assigned to features. [sent-284, score-0.335]
97 The two left graphs show the case when 3 features are selected by each algorithm and the two right graphs show the case when 10 features are selected. [sent-285, score-0.224]
98 From the up left graph, we can see that Lasso regression missed f 1 and f2 , which are non-linearly correlated with y. [sent-287, score-0.21]
99 Our experiments with FVM on highly nonlinear and noisy simulated data show encouraging results, in which it can correctly identify the small number of dominating features that are non-linearly correlated to the response variable, a task the standard Lasso fails to complete. [sent-290, score-0.304]
100 Joint classifier and feature optimization for cancer diagnosis using gene expression data. [sent-303, score-0.265]
wordName wordTfidf (topN-words)
[('lasso', 0.522), ('fvm', 0.514), ('fq', 0.337), ('xip', 0.257), ('feature', 0.212), ('fp', 0.173), ('xiq', 0.16), ('regression', 0.16), ('svm', 0.137), ('features', 0.091), ('kernel', 0.089), ('hochreiter', 0.08), ('perkins', 0.08), ('response', 0.073), ('kernels', 0.069), ('grafting', 0.064), ('vectors', 0.061), ('et', 0.06), ('krishnapuram', 0.056), ('selection', 0.054), ('canu', 0.048), ('discritize', 0.048), ('glr', 0.048), ('yi', 0.046), ('kkt', 0.042), ('solution', 0.042), ('fk', 0.042), ('xx', 0.042), ('scaling', 0.039), ('slack', 0.038), ('optimization', 0.036), ('newly', 0.036), ('missed', 0.032), ('xj', 0.031), ('tibshirani', 0.029), ('leverage', 0.028), ('regarded', 0.028), ('space', 0.028), ('mutual', 0.028), ('weston', 0.027), ('assigned', 0.027), ('phase', 0.027), ('irrelevant', 0.027), ('limitation', 0.027), ('named', 0.026), ('roth', 0.026), ('proposition', 0.026), ('dependencies', 0.026), ('xt', 0.026), ('simulated', 0.026), ('elongated', 0.025), ('dominating', 0.025), ('pth', 0.025), ('tricks', 0.025), ('sample', 0.025), ('yt', 0.025), ('weighted', 0.025), ('sparse', 0.025), ('introducing', 0.024), ('dot', 0.024), ('cover', 0.022), ('graphs', 0.021), ('mirror', 0.021), ('coef', 0.021), ('axis', 0.021), ('introduce', 0.021), ('datum', 0.02), ('id', 0.02), ('geometric', 0.02), ('nonlinear', 0.02), ('shrinkage', 0.019), ('represents', 0.019), ('variable', 0.018), ('xi', 0.018), ('correlated', 0.018), ('loss', 0.018), ('min', 0.018), ('outside', 0.018), ('fj', 0.017), ('encouraging', 0.017), ('redundant', 0.017), ('gene', 0.017), ('ng', 0.017), ('standard', 0.017), ('concepts', 0.017), ('fails', 0.017), ('machines', 0.017), ('responses', 0.017), ('weights', 0.016), ('extended', 0.016), ('sketch', 0.016), ('notice', 0.016), ('matrix', 0.015), ('variables', 0.015), ('bit', 0.015), ('sort', 0.015), ('relaxed', 0.015), ('vector', 0.015), ('instance', 0.015), ('biology', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 77 nips-2005-From Lasso regression to Feature vector machine
Author: Fan Li, Yiming Yang, Eric P. Xing
Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.
2 0.16752616 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
Author: Jorge Silva, Jorge Marques, João Lemos
Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1
3 0.13890038 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
Author: Jo-anne Ting, Aaron D'souza, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato
Abstract: An increasing number of projects in neuroscience requires the statistical analysis of high dimensional data sets, as, for instance, in predicting behavior from neural firing or in operating artificial devices from brain recordings in brain-machine interfaces. Linear analysis techniques remain prevalent in such cases, but classical linear regression approaches are often numerically too fragile in high dimensions. In this paper, we address the question of whether EMG data collected from arm movements of monkeys can be faithfully reconstructed with linear approaches from neural activity in primary motor cortex (M1). To achieve robust data analysis, we develop a full Bayesian approach to linear regression that automatically detects and excludes irrelevant features in the data, regularizing against overfitting. In comparison with ordinary least squares, stepwise regression, partial least squares, LASSO regression and a brute force combinatorial search for the most predictive input features in the data, we demonstrate that the new Bayesian method offers a superior mixture of characteristics in terms of regularization against overfitting, computational efficiency and ease of use, demonstrating its potential as a drop-in replacement for other linear regression techniques. As neuroscientific results, our analyses demonstrate that EMG data can be well predicted from M1 neurons, further opening the path for possible real-time interfaces between brains and machines. 1
4 0.10434314 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.090201244 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia
Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.
6 0.089806363 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
7 0.089566894 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
8 0.083479151 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
9 0.071643509 47 nips-2005-Consistency of one-class SVM and related algorithms
10 0.067930624 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
11 0.065310523 114 nips-2005-Learning Rankings via Convex Hull Separation
12 0.064752765 50 nips-2005-Convex Neural Networks
13 0.064296231 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
14 0.062214892 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
15 0.062002528 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
16 0.060604367 98 nips-2005-Infinite latent feature models and the Indian buffet process
17 0.054426193 45 nips-2005-Conditional Visual Tracking in Kernel Space
18 0.050348472 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
19 0.047850739 103 nips-2005-Kernels for gene regulatory regions
20 0.047683448 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
topicId topicWeight
[(0, 0.163), (1, 0.068), (2, -0.062), (3, -0.016), (4, 0.035), (5, 0.073), (6, 0.115), (7, -0.045), (8, 0.099), (9, 0.01), (10, 0.027), (11, -0.055), (12, -0.047), (13, 0.069), (14, -0.041), (15, 0.203), (16, -0.055), (17, 0.069), (18, -0.132), (19, -0.121), (20, 0.094), (21, -0.044), (22, 0.013), (23, -0.314), (24, 0.06), (25, -0.123), (26, 0.021), (27, -0.014), (28, 0.055), (29, -0.048), (30, -0.004), (31, -0.107), (32, -0.058), (33, -0.073), (34, -0.058), (35, -0.174), (36, 0.064), (37, 0.074), (38, 0.032), (39, 0.107), (40, -0.023), (41, 0.079), (42, 0.014), (43, 0.144), (44, 0.04), (45, -0.054), (46, -0.014), (47, 0.085), (48, 0.037), (49, -0.146)]
simIndex simValue paperId paperTitle
same-paper 1 0.95401996 77 nips-2005-From Lasso regression to Feature vector machine
Author: Fan Li, Yiming Yang, Eric P. Xing
Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.
2 0.69653887 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
Author: Jorge Silva, Jorge Marques, João Lemos
Abstract: There has been a surge of interest in learning non-linear manifold models to approximate high-dimensional data. Both for computational complexity reasons and for generalization capability, sparsity is a desired feature in such models. This usually means dimensionality reduction, which naturally implies estimating the intrinsic dimension, but it can also mean selecting a subset of the data to use as landmarks, which is especially important because many existing algorithms have quadratic complexity in the number of observations. This paper presents an algorithm for selecting landmarks, based on LASSO regression, which is well known to favor sparse approximations because it uses regularization with an l1 norm. As an added benefit, a continuous manifold parameterization, based on the landmarks, is also found. Experimental results with synthetic and real data illustrate the algorithm. 1
3 0.58196962 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia
Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.
4 0.5816521 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares
Author: Jo-anne Ting, Aaron D'souza, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato
Abstract: An increasing number of projects in neuroscience requires the statistical analysis of high dimensional data sets, as, for instance, in predicting behavior from neural firing or in operating artificial devices from brain recordings in brain-machine interfaces. Linear analysis techniques remain prevalent in such cases, but classical linear regression approaches are often numerically too fragile in high dimensions. In this paper, we address the question of whether EMG data collected from arm movements of monkeys can be faithfully reconstructed with linear approaches from neural activity in primary motor cortex (M1). To achieve robust data analysis, we develop a full Bayesian approach to linear regression that automatically detects and excludes irrelevant features in the data, regularizing against overfitting. In comparison with ordinary least squares, stepwise regression, partial least squares, LASSO regression and a brute force combinatorial search for the most predictive input features in the data, we demonstrate that the new Bayesian method offers a superior mixture of characteristics in terms of regularization against overfitting, computational efficiency and ease of use, demonstrating its potential as a drop-in replacement for other linear regression techniques. As neuroscientific results, our analyses demonstrate that EMG data can be well predicted from M1 neurons, further opening the path for possible real-time interfaces between brains and machines. 1
5 0.52052516 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
Author: Larry Wasserman, John D. Lafferty
Abstract: We present a method for nonparametric regression that performs bandwidth selection and variable selection simultaneously. The approach is based on the technique of incrementally decreasing the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, our approach avoids the curse of dimensionality, achieving the optimal minimax rate of convergence, up to logarithmic factors, as if the relevant variables were known in advance. The method—called rodeo (regularization of derivative expectation operator)—conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces hard with soft thresholding effectively solves a sequence of lasso problems. 1
6 0.49578854 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
7 0.43785462 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression
8 0.42477041 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
9 0.31473738 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
10 0.3119657 103 nips-2005-Kernels for gene regulatory regions
11 0.30912545 114 nips-2005-Learning Rankings via Convex Hull Separation
12 0.30158657 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
13 0.29841533 98 nips-2005-Infinite latent feature models and the Indian buffet process
14 0.29279274 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
15 0.27197987 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
16 0.2679503 185 nips-2005-Subsequence Kernels for Relation Extraction
17 0.2675457 205 nips-2005-Worst-Case Bounds for Gaussian Process Models
18 0.26360792 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
19 0.26144332 84 nips-2005-Generalization in Clustering with Unobserved Features
20 0.25847647 104 nips-2005-Laplacian Score for Feature Selection
topicId topicWeight
[(3, 0.042), (10, 0.049), (11, 0.01), (27, 0.028), (31, 0.026), (34, 0.137), (41, 0.016), (50, 0.012), (52, 0.294), (55, 0.021), (69, 0.052), (73, 0.064), (77, 0.013), (88, 0.084), (91, 0.034)]
simIndex simValue paperId paperTitle
1 0.79226637 61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches
Author: Anna Levina, Michael Herrmann
Abstract: There is experimental evidence that cortical neurons show avalanche activity with the intensity of firing events being distributed as a power-law. We present a biologically plausible extension of a neural network which exhibits a power-law avalanche distribution for a wide range of connectivity parameters. 1
same-paper 2 0.77498257 77 nips-2005-From Lasso regression to Feature vector machine
Author: Fan Li, Yiming Yang, Eric P. Xing
Abstract: Lasso regression tends to assign zero weights to most irrelevant or redundant features, and hence is a promising technique for feature selection. Its limitation, however, is that it only offers solutions to linear models. Kernel machines with feature scaling techniques have been studied for feature selection with non-linear models. However, such approaches require to solve hard non-convex optimization problems. This paper proposes a new approach named the Feature Vector Machine (FVM). It reformulates the standard Lasso regression into a form isomorphic to SVM, and this form can be easily extended for feature selection with non-linear models by introducing kernels defined on feature vectors. FVM generates sparse solutions in the nonlinear feature space and it is much more tractable compared to feature scaling kernel machines. Our experiments with FVM on simulated data show encouraging results in identifying the small number of dominating features that are non-linearly correlated to the response, a task the standard Lasso fails to complete.
3 0.74287385 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
Author: Bhaskar D. Rao, David P. Wipf
Abstract: Given a redundant dictionary of basis vectors (or atoms), our goal is to find maximally sparse representations of signals. Previously, we have argued that a sparse Bayesian learning (SBL) framework is particularly well-suited for this task, showing that it has far fewer local minima than other Bayesian-inspired strategies. In this paper, we provide further evidence for this claim by proving a restricted equivalence condition, based on the distribution of the nonzero generating model weights, whereby the SBL solution will equal the maximally sparse representation. We also prove that if these nonzero weights are drawn from an approximate Jeffreys prior, then with probability approaching one, our equivalence condition is satisfied. Finally, we motivate the worst-case scenario for SBL and demonstrate that it is still better than the most widely used sparse representation algorithms. These include Basis Pursuit (BP), which is based on a convex relaxation of the ℓ0 (quasi)-norm, and Orthogonal Matching Pursuit (OMP), a simple greedy strategy that iteratively selects basis vectors most aligned with the current residual. 1
4 0.71021056 45 nips-2005-Conditional Visual Tracking in Kernel Space
Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas
Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efÄ?Ĺš ciency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm positively compares with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences. 1 Introduction and Related Work We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions form other people or objects in the scene. A robust system has to necessarily deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classiÄ?Ĺš ed as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions) and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to Ä?Ĺš‚exibly reconstruct complex unknown human motions and to naturally handle problem constraints. However it is difÄ?Ĺš cult to construct reliable observation likelihoods due to the complexity of modeling human appearance. This varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continuing tracking. These difÄ?Ĺš culties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2], that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difÄ?Ĺš cult to learn (see e.g. Ä?Ĺš g. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because the method operates in the originally selected state and observation spaces that can be task generic, therefore redundant and often high-dimensional, inference is more expensive and can be less robust. To summarize, reconstructing 3D human motion in a Figure 1: (a, Left) Example of 180o ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F 1 and F2 ) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(yt |yt−1 , zt ) or p(yt |zt ) are learned as in Ä?Ĺš g. 2. For inference, the predicted local state conditional is recursively combined with the Ä?Ĺš ltered prior c.f . (1). conditional temporal framework poses the following difÄ?Ĺš culties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal which makes efÄ?Ĺš cient Kalman Ä?Ĺš ltering algorithms inapplicable. General inference methods (particle Ä?Ĺš lters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal. In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (Ä?Ĺš g. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over a non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artiÄ?Ĺš cial image sequences) that show how the proposed framework positively compares with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost. 2 Probabilistic Inference in a Kernel Induced State Space We work with conditional graphical models with a chain structure [9], as shown in Ä?Ĺš g. 1b, These have continuous temporal states yt , t = 1 . . . T , observations zt . For compactness, we denote joint states Yt = (y1 , y2 , . . . , yt ) or joint observations Zt = (z1 , . . . , zt ). Learning and inference are based on local conditionals: p(yt |zt ) and p(yt |yt−1 , zt ), with yt and zt being low-dimensional, kernel induced representations of some initial model having state xt and observation rt . We obtain zt , yt from rt , xt using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see Ä?Ĺš g. 1b and Ä?Ĺš g. 2 and (1)). For display or error reporting, we compute the original conditional p(x|r), or a temporally Ä?Ĺš ltered version p(xt |Rt ), Rt = (r1 , r2 , . . . , rt ), using a learned pre-image state map [3]. 2.1 Density Propagation for Continuous Conditional Chains For online Ä?Ĺš ltering, we compute the optimal distribution p(yt |Zt ) for the state yt , conditioned by observations Zt up to time t. The Ä?Ĺš ltered density can be recursively derived as: p(yt |Zt ) = p(yt |yt−1 , zt )p(yt−1 |Zt−1 ) (1) yt−1 We compute using a conditional mixture for p(yt |yt−1 , zt ) (a Bayesian mixture of experts c.f . §2.2) and the prior p(yt−1 |Zt−1 ), each having, say M components. We integrate M 2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced M -component Kullback-Leibler approximation that is reÄ?Ĺš ned using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (Ä?Ĺš g. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman Ä?Ĺš ltering or dynamic programming [9]. 2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME) In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension. Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables xi are projected onto the column vectors of the principal space in order to obtain their principal coordinates y i . A z ∈ P(Fr ) O p(y|z) kP CA ĂŽĹšr (r) ⊂ Fr O / y ∈ P(Fx ) O QQQ QQQ QQQ kP CA QQQ Q( ĂŽĹšx (x) ⊂ Fx x ≈ PreImage(y) O ĂŽĹšr ĂŽĹšx r ∈ R ⊂ Rr x ∈ X ⊂ Rx p(x|r) ≈ p(x|y) Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) â‰Ä„ p(xt |rt ), ∀t. (We similarly learn p(xt |xt−1 , rt ), with input (x, r) instead of r – here we illustrate only p(x|r) for clarity.) The input r and the output x are decorrelated using Kernel PCA to obtain z and y respectively. The kernels used for the input and output are ĂŽĹš r and ĂŽĹšx , with induced feature spaces Fr and Fx , respectively. Their principal subspaces obtained by kernel PCA are denoted by P(Fr ) and P(Fx ), respectively. A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(yt |zt ) or p(yt |yt−1 , zt ), temporal inference can be efÄ?Ĺš ciently performed in a low-dimensional kernel induced state space (see e.g. (1) and Ä?Ĺš g. 1b). For visualization and error measurement, the Ä?Ĺš ltered density, e.g. p(yt |Zt ), can be mapped back to p(xt |Rt ) using the pre-image c.f . (3). similar procedure is performed on the inputs ri to obtain zi . In order to relate the reduced feature spaces of z and y (P(Fr ) and P(Fx )), we estimate a probability distribution over mappings from training pairs (zi , yi ). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as common in our problem (Ä?Ĺš g. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME): M g(z|δ j )N (y|Wj z, ĂŽĹ j ) p(y|z) = (2) j=1 where g(z|δ j ) is a softmax function parameterized by δ j and (Wj , ĂŽĹ j ) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, Wj and δ j , are controlled by hierarchical priors, typically Gaussians with 0 mean, and having inverse variance hyperparameters controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15]. Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from it’s image y ∈ P(Fx ). This is a closed form computation for polynomial kernels of odd degree. For more general kernels optimization or learning (regression based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (xi , yi ): p(x|y) = N (x|AĂŽĹšy (y), â„Ĺš) (3) with parameters and covariances (A, â„Ĺš). Since temporal inference is performed in the low-dimensional kernel induced state space, the pre-image function needs to be calculated only for visualizing results or for the purpose of error reporting. Propagating the result from the reduced feature space P(Fx ) to the output space X pro- duces a Gaussian mixture with M elements, having coefÄ?Ĺš cients g(z|δ j ) and components N (x|AĂŽĹšy (Wj z), AJĂŽĹšy ĂŽĹ j JĂŽĹšy A + â„Ĺš), where JĂŽĹšy is the Jacobian of the mapping ĂŽĹšy . 3 Experiments We run experiments on both real image sequences (Ä?Ĺš g. 5 and Ä?Ĺš g. 6) and on sequences where silhouettes were artiÄ?Ĺš cially rendered. The prediction error is reported in degrees (for mixture of experts, this is w.r.t. the most probable one, but see also Ä?Ĺš g. 4a), and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have average error 1.7o . Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f. including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime. Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180o ambiguities in the global azimuthal orientation for frontal views, e.g. Ä?Ĺš g. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial, 12 angular bins, size range 1/8 to 3 on log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we ‘project’ onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette we regularly sample 200 points on it and add all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation the elements of the descriptor are not independent. However, a conditional temporal framework (Ä?Ĺš g. 1b) Ä?Ĺš‚exibly accommodates this. For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(yt |zt ) and p(yt |yt−1 , zt ) using 6 dimensions for the joint angle kernel induced state space and 25 dimensions for the observation induced feature space, respectively. In Ä?Ĺš g. 3b) we show an evaluation of the efÄ?Ĺš cacy of our kBME predictor for different dimensions in the joint angle kernel induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, that involves complex motions of the arms and the legs, the non-linear model signiÄ?Ĺš cantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models.1 In tables 1 and 2, as well as Ä?Ĺš g. 4, we perform quantitative experiments on artiÄ?Ĺš cially rendered silhouettes. 3D ground truth joint angles are available and this allows a more 1 Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(xt |xt−1 , rt ), whereas a kBME (including the pre-image) takes 95s to train p(yt |yt−1 , zt ). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is signiÄ?Ĺš cant and likely to increase with original models having higher dimensionality. Prediction Error Number of Clusters 100 1000 100 10 1 1 2 3 4 5 6 7 8 Degree of Multimodality kBME KDE_RVM PCA_BME PCA_RVM 10 1 0 20 40 Number of Dimensions 60 Figure 3: (a, Left) Analysis of ‘multimodality’ for a training set. The input zt dimension is 25, the output yt dimension is 6, both reduced using kPCA. We cluster independently in (yt−1 , zt ) and yt using many clusters (2100) to simulate small input perturbations and we histogram the yt clusters falling within each cluster in (yt−1 , zt ). This gives intuition on the degree of ambiguity in modeling p(yt |yt−1 , zt ), for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artiÄ?Ĺš cial dancing sequence (models trained on 300 samples). The kBME is our model §2.2, whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mappings between feature spaces (obtained using PCA) is learned using a BME and a RVM. The non-linearity is signiÄ?Ĺš cant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models. systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive to the ones of high-dimensional BME predictors, while being lower-dimensional and therefore signiÄ?Ĺš cantly less expensive for inference, e.g. the integral in (1). In Ä?Ĺš g. 5 and Ä?Ĺš g. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(yt |zt ), whereas for inference we use p(yt |yt−1 , zt ) within (1). In the second sequence in Ä?Ĺš g. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results, using only 5 hypotheses. Notice however, that the results are not perfect, there are small errors in the elbow and the bending of the knee for the subject at the l.h.s., and in the different wrist orientations for the subject at the r.h.s. This reÄ?Ĺš‚ects the bias of our training set. Walk and turn Conversation Run and turn left KDE-RR 10.46 7.95 5.22 RVM 4.95 4.96 5.02 KDE-RVM 7.57 6.31 6.25 BME 4.27 4.15 5.01 kBME 4.69 4.79 4.92 Table 1: Comparison of average joint angle prediction error for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence, the inputs are artiÄ?Ĺš cially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCAbased methods use kernel regressors to compute pre-images. Expert Prediction Frequency − Closest to Ground truth Frequency − Close to ground truth 30 25 20 15 10 5 0 1 2 3 4 Expert Number 14 10 8 6 4 2 0 5 1st Probable Prev Output 2nd Probable Prev Output 3rd Probable Prev Output 4th Probable Prev Output 5th Probable Prev Output 12 1 2 3 4 Current Expert 5 Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert indeed is the most accurate most frequently), but occasionally less probable experts are better. (b, Right) Histograms show the dynamics of p(yt |yt−1 , zt ), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence. Walk and turn back Run and turn KDE-RR 7.59 17.7 RVM 6.9 16.8 KDE-RVM 7.15 16.08 BME 3.6 8.2 kBME 3.72 8.01 Table 2: Joint angle prediction error computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has signiÄ?Ĺš cantly lower error. Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint. 4 Conclusion We have presented a probabilistic framework for conditional inference in latent kernelinduced low-dimensional state spaces. Our approach has the following properties: (a) Figure 6: Reconstructing the activities of 2 people operating in an 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint. Accounts for non-linear correlations among input or output variables, by using kernel nonlinear dimensionality reduction (kPCA); (b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented. (c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing the 3D human motion in monocular video. Future work we will investigate scaling aspects for large training sets and alternative structured prediction methods. References [1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003. [2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004. [3] G. Bakir, J. Weston, and B. Scholkopf. Learning to Ä?Ĺš nd pre-images. In NIPS, 2004. [4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002. [5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003. [6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000. [7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181–214, 1994. [8] D. Mackay. Bayesian interpolation. Neural Computation, 4(5):720–736, 1992. [9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000. [10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002. [11] B. Sch¨ lkopf, A. Smola, and K. M¨ ller. Nonlinear component analysis as a kernel eigenvalue o u problem. Neural Computation, 10:1299–1319, 1998. [12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003. [13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004. [14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004. [15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005. [16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69–76, Madison, 2003. [17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001. [18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classiÄ?Ĺš cation + interpolation. In ICCV, 2003. [19] J. Weston, O. Chapelle, A. Elisseeff, B. Scholkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.
5 0.53471208 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
6 0.53381234 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
7 0.53077215 184 nips-2005-Structured Prediction via the Extragradient Method
8 0.52867955 105 nips-2005-Large-Scale Multiclass Transduction
9 0.52792388 177 nips-2005-Size Regularized Cut for Data Clustering
10 0.52571392 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
11 0.52513611 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
12 0.52482772 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
13 0.52142417 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
14 0.51915544 102 nips-2005-Kernelized Infomax Clustering
15 0.51729536 50 nips-2005-Convex Neural Networks
16 0.51728797 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
17 0.51717067 23 nips-2005-An Application of Markov Random Fields to Range Sensing
18 0.51679963 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
19 0.51624101 195 nips-2005-Transfer learning for text classification
20 0.51592481 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning