nips nips2012 nips2012-225 knowledge-graph by maker-knowledge-mining

225 nips-2012-Multi-task Vector Field Learning


Source: pdf

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces, which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible, which is expected to make it span a low-dimensional subspace. (3) The vector fields from all tasks share a low-dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. [sent-6, score-0.138]

2 A vector field is a smooth mapping from the manifold to the tangent spaces, which can be viewed as a directional derivative of functions on the manifold. [sent-9, score-0.417]

3 We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. [sent-10, score-0.231]

4 In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. [sent-11, score-0.198]

5 (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. [sent-13, score-0.18]

6 (2) Within each task, the vector field is required to be as parallel as possible, which is expected to make it span a low-dimensional subspace. [sent-14, score-0.206]

7 (3) The vector fields from all tasks share a low dimensional subspace. [sent-15, score-0.228]

8 1 Introduction In many applications, labeled data are expensive and time consuming to obtain while unlabeled data are abundant. [sent-18, score-0.118]

9 The manifold assumption [15, 5], which has been widely adopted in the last decade, states that the predictor function lives on a low-dimensional manifold of the marginal distribution. [sent-21, score-0.458]

10 [12] proposed a regularization MTL framework which assumes all tasks are related and close to each other. [sent-26, score-0.1]

11 Ando and Zhang [2] proposed a structural learning framework, which assumed multiple predictors for different tasks shared a common structure on the underlying predictor space. [sent-27, score-0.283]

12 An alternating structure optimization (ASO) method was proposed for linear predictors which assumed the task parameters shared a low dimensional subspace. [sent-28, score-0.267]

13 [1] generalized the idea of sharing a subspace by assuming that all task parameters lie on a manifold. [sent-30, score-0.106]

14 Figure 1: Examples of parallel fields: (a) a parallel field on $\mathbb{R}^2$; (b) a parallel field on the Swiss roll. [sent-31, score-0.344]

15 The parallel field on $\mathbb{R}^2$ spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace. [sent-32, score-0.477]

16 However they require that related tasks share similar representations [9]. [sent-37, score-0.107]

17 The cluster structure is characterized by task parameters of linear predictor functions. [sent-40, score-0.166]

18 For linear predictors, the task parameters they used are actually the constant gradients of the predictor functions, which form a first-order differential structure. [sent-41, score-0.235]

19 For general nonlinear predictor functions, we show it is more natural to capture the shared differential structure using vector fields. [sent-42, score-0.293]

20 A vector field is a smooth mapping from the manifold to the tangent spaces, which can be viewed as a directional derivative of functions on the manifold. [sent-44, score-0.417]

21 In this way, a vector field naturally characterizes the differential structure of functions while also providing a natural way to exploit the geometric structure of data; these are the two most important aspects for SSMTL. [sent-45, score-0.165]

22 Based on this idea, we develop the multi-task vector field learning (MTVFL) method which learns the prediction functions and the vector fields simultaneously. [sent-46, score-0.11]

23 The vector fields we learn are forced to be close to the gradient fields of the predictor functions. [sent-47, score-0.18]

24 In each task, the vector field is required to be as parallel as possible. [sent-48, score-0.14]

25 We say that a vector field is parallel if the vectors are parallel along the geodesics on the manifold. [sent-49, score-0.25]

26 In extreme cases, when the manifold is a linear (or an affine) space, the geodesics of the manifold are straight lines. [sent-50, score-0.329]

27 When the manifold is flat (i.e., with zero curvature) or the curvature is small, it is expected that these parallel vectors concentrate on a low-dimensional subspace. [sent-54, score-0.183]

28 It can be seen in Fig. 1 that the parallel field on the plane spans a one-dimensional subspace and the parallel field on the Swiss roll spans a two-dimensional subspace. [sent-56, score-0.389]
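
To make the low-rank observation above concrete, here is a small numerical sketch (not from the paper; all names and values are illustrative) showing that a parallel field on the plane, i.e. the same vector attached to every point, has only one non-zero singular value and hence spans a one-dimensional subspace:

```python
# Illustrative sketch: a parallel field on the plane assigns the same tangent
# vector to every point, so the collection of field vectors has rank one.
import numpy as np

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(500, 2))      # samples on R^2 (only their count is used)
direction = np.array([1.0, 2.0]) / np.sqrt(5.0)     # unit direction of the field

vectors = np.tile(direction, (points.shape[0], 1))  # one copy of the vector per point
singular_values = np.linalg.svd(vectors, compute_uv=False)
print(singular_values)                              # ~[22.36, 0.0]: a single dominant value
```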

29 For the multi-task case, these vector fields share a low dimensional subspace. [sent-57, score-0.156]

30 In addition, we assume these vector fields share a low dimensional subspace among all tasks. [sent-58, score-0.21]

31 In essence, we use a first-order differential structure to characterize the shared structure of tasks and use a second-order differential structure to characterize the specific parts of tasks. [sent-59, score-0.332]

32 2 Multi-task Learning: A Vector Field Approach In this section, we first introduce vector fields and then present multi-task learning via exploring shared structure using vector fields. [sent-62, score-0.202]

33 We are given $m$ tasks, with $n_l$ samples $x_i^l$, $i = 1, \ldots, n_l$. [sent-65, score-0.533]

34 For the $l$-th task, we assume the data $x_i^l$ are on a $d_l$-dimensional manifold $\mathcal{M}_l$. [sent-70, score-0.505]

35 Without loss of generality, we assume the first $\hat{n}_l$ ($\hat{n}_l < n_l$) samples are labeled, with $y_j^l \in \mathbb{R}$ for regression and $y_j^l \in \{-1, 1\}$ for classification, $j = 1, \ldots, \hat{n}_l$. [sent-73, score-0.662]

36 The total number of labeled samples is $n = \sum_l \hat{n}_l$. [sent-77, score-0.393]

37 Given the $l$-th task, we first construct a nearest neighbor graph by either $\epsilon$-neighborhood or $k$ nearest neighbors. [sent-82, score-0.11]

38 Let $w_{ij}^l$ denote the weight which measures the similarity between $x_i^l$ and $x_j^l$. [sent-84, score-0.46]

39 For each point $x_i^l$, we estimate its tangent space $T_{x_i^l}\mathcal{M}$ by performing PCA on its neighborhood. [sent-86, score-0.334]

40 We choose the eigenvectors corresponding to the $d_l$ largest eigenvalues as the bases, since the tangent space $T_{x_i^l}\mathcal{M}$ has the same dimension as the manifold $\mathcal{M}_l$. [sent-87, score-0.435]

41 It is easy to show that $P_i^l = T_i^l (T_i^l)^T$ is the unique orthogonal projection from $\mathbb{R}^D$ onto the tangent space $T_{x_i^l}\mathcal{M}$ [13]. [sent-89, score-0.132]
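
Sentences 37-41 describe the preprocessing: a nearest neighbor graph per task and tangent spaces estimated by local PCA, with $P_i^l = T_i^l (T_i^l)^T$ the orthogonal projection onto the tangent space. The sketch below is a minimal NumPy illustration of that recipe, not the authors' code; the function name, the 0/1 edge weights, and the brute-force distance computation are assumptions made for brevity:

```python
# Minimal sketch of the preprocessing: k-NN graph plus local-PCA tangent spaces.
import numpy as np

def estimate_tangent_spaces(X, k=5, d=2):
    """X: (n, D) samples of one task. Returns 0/1 weights W, bases T_i (D x d),
    and orthogonal projections P_i = T_i T_i^T onto the estimated tangent spaces."""
    n, D = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn = np.argsort(dists, axis=1)[:, 1:k + 1]     # k nearest neighbors, self excluded

    W = np.zeros((n, n))
    T, P = [], []
    for i in range(n):
        W[i, knn[i]] = 1.0
        W[knn[i], i] = 1.0                          # symmetrize the graph
        local = X[knn[i]] - X[i]                    # neighborhood centered at x_i
        _, _, Vt = np.linalg.svd(local, full_matrices=False)
        Ti = Vt[:d].T                               # d leading principal directions (D x d)
        T.append(Ti)
        P.append(Ti @ Ti.T)
    return W, T, P
```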

42 A vector field $X$ on the manifold $\mathcal{M}$ is a continuous map $X: \mathcal{M} \to T\mathcal{M}$, where $T\mathcal{M}$ is the set of tangent spaces, written as $p \mapsto X_p$, with the property that for each $p \in \mathcal{M}$, $X_p$ is an element of $T_p\mathcal{M}$. [sent-94, score-0.339]

43 We can think of a vector field on the manifold as an arrow in the same way as we think of the vector field in the Euclidean space, with a given magnitude and direction attached to each point on the manifold, and chosen to be tangent to the manifold. [sent-95, score-0.394]

44 A vector field $V$ on the manifold is called a gradient field if there exists a function $f$ on the manifold such that $\nabla f = V$, where $\nabla$ is the covariant derivative on the manifold. [sent-96, score-0.454]

45 For each point $x_i^l$, let $V_{x_i^l}$ denote the value of the vector field $V_l$ at $x_i^l$. [sent-100, score-0.459]

46 Recalling the definition of a vector field, $V_{x_i^l}$ should be a vector in the tangent space $T_{x_i^l}\mathcal{M}_l$. [sent-101, score-0.242]

47 Therefore, we can represent it by the coordinates of the tangent space $T_{x_i^l}\mathcal{M}_l$ as $V_{x_i^l} = T_i^l v_i^l$, where $v_i^l \in \mathbb{R}^{d_l}$ is the local representation of $V_{x_i^l}$ with respect to $T_i^l$. [sent-102, score-0.328]

48 By abusing the notation without confusion, we also use $f_l$ to denote the vector $f_l = (f_l(x_1^l), \ldots, f_l(x_{n_l}^l))^T$ [sent-104, score-0.465]

49 and use $V_l$ to denote the vector $V_l = ((v_1^l)^T, \ldots, (v_{n_l}^l)^T)^T$. [sent-107, score-0.26]

50 That is, $V_l$ is a $d_l n_l$-dimensional big column vector which concatenates all the $v_i^l$ for a fixed $l$. [sent-111, score-0.635]

51 Then for each task, we aim to compute the vector fl and the vector Vl . [sent-112, score-0.315]

52 Many existing MTL methods capture the task relatedness by sharing task parameters. [sent-115, score-0.127]

53 For linear predictors, the task parameters they used are actually the constant gradient vectors of the predictor functions. [sent-116, score-0.177]

54 For general nonlinear predictor functions, we show it is natural to capture the shared differential structure using vector fields. [sent-117, score-0.293]

55 We propose to learn $f$ and $V$ simultaneously: • The vector field $V_l$ should be close to the gradient field $\nabla f_l$ of $f_l$, which can be formulated as $\min_{f,V} R_1(f, V) = \sum_{l=1}^{m} R_1(f_l, V_l) := \sum_{l=1}^{m} \int_{\mathcal{M}_l} \|\nabla f_l - V_l\|^2$. (1) [sent-128, score-0.707]

56 • The vector field $V_l$ should be as parallel as possible: $\min_{V} R_2(V) = \sum_{l=1}^{m} R_2(V_l) := \sum_{l=1}^{m} \int_{\mathcal{M}_l} \|\nabla V_l\|_{HS}^2$, (2) where $\nabla$ is the covariant derivative on the manifold and $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt tensor norm [11]. [sent-129, score-0.198]

57 $\nabla V_l$ measures the change of the vector field; therefore minimizing $\int_{\mathcal{M}_l} \|\nabla V_l\|_{HS}^2$ enforces the vector field $V_l$ to be parallel. [sent-130, score-0.11]

58 • All vector fields share an $h$-dimensional subspace, where $h$ is a predefined parameter: $T_i^l v_i^l = u_i^l + \Theta^T w_i^l$, s.t. $\Theta \Theta^T = I$. (3) [sent-131, score-0.358]

59 Since these vector fields are assumed to share a low-dimensional space, it is expected that the residual vector $u_i^l$ is small. [sent-134, score-0.295]

60 We define another term $R_3$ to control the complexity as follows: $R_3(v_i^l, w_i^l, \Theta) = \sum_{l=1}^{m} \sum_{i=1}^{n_l} \big( \alpha \|u_i^l\|^2 + \beta \|T_i^l v_i^l\|^2 \big)$ (4) $= \sum_{l=1}^{m} \sum_{i=1}^{n_l} \big( \alpha \|T_i^l v_i^l - \Theta^T w_i^l\|^2 + \beta \|T_i^l v_i^l\|^2 \big)$. [sent-135, score-1.104]

61 Since we would like the vector field to be parallel, the vector norm is not expected to be too small. [sent-137, score-0.11]

62 Besides, since we assume the vector fields share a low-dimensional subspace, the residual vector $u_i^l$ is expected to be small. [sent-138, score-0.295]

63 By setting $\beta = 0$, $R_3$ will reduce to the regularization term proposed in ASO if we also replace the tangent vectors by the task parameters. [sent-140, score-0.212]

64 It can be verified that $w_i^{l*} = \Theta T_i^l v_i^l = \arg\min_{w_i^l} R_3(v_i^l, w_i^l, \Theta)$. [sent-142, score-0.197]

65 Thus we have $u_i^l = T_i^l v_i^l - \Theta^T w_i^l = (I - \Theta^T \Theta) T_i^l v_i^l$. [sent-143, score-0.312]
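
For completeness, a short derivation of the two identities above, assuming (as in ASO) that $\Theta$ has orthonormal rows, i.e. $\Theta\Theta^T = I$; this is a sketch reconstructed from context, not quoted from the paper:

```latex
% Only the first term of R_3 depends on w_i^l; setting its gradient to zero:
\frac{\partial}{\partial w_i^l}\,\alpha\,\bigl\| T_i^l v_i^l - \Theta^T w_i^l \bigr\|^2
  = -2\alpha\,\Theta\bigl( T_i^l v_i^l - \Theta^T w_i^{l*} \bigr) = 0
\;\Longrightarrow\;
w_i^{l*} = \bigl(\Theta\Theta^T\bigr)^{-1}\Theta\, T_i^l v_i^l = \Theta\, T_i^l v_i^l
\quad (\text{using } \Theta\Theta^T = I).
% Substituting back gives the residual:
u_i^l = T_i^l v_i^l - \Theta^T w_i^{l*} = \bigl(I - \Theta^T\Theta\bigr)\, T_i^l v_i^l .
```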

66 For simplicity, we use the quadratic loss function $R_0(f) = \sum_{l=1}^{m} \sum_{i=1}^{\hat{n}_l} (f_l(x_i^l) - y_i^l)^2$. [sent-148, score-0.331]

67 Using the discrete methods in [17], we have the following discrete form equations: $R_1(f_l, V_l) = \sum_{i \sim j} w_{ij}^l \big( (x_j^l - x_i^l)^T T_i^l v_i^l - f_j^l + f_i^l \big)^2$, (8) and $R_2(f_l, V_l) = \sum_{i \sim j} w_{ij}^l \| P_i^l T_j^l v_j^l - T_i^l v_i^l \|^2$. (9) [sent-152, score-0.51]
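
A minimal sketch (illustrative only, not the authors' implementation) of evaluating the discrete energies (8) and (9) for a single task, given the graph weights, tangent bases, predictions and local coordinates, e.g. as produced by the helper sketched earlier:

```python
# Illustrative evaluation of the discrete objectives (8)-(9) for one task.
import numpy as np

def discrete_R1_R2(X, W, T, f, v):
    """X: (n, D) points; W: (n, n) graph weights; T: list of D x d tangent bases;
    f: (n,) predictions; v: (n, d) local field coordinates."""
    n = X.shape[0]
    R1 = R2 = 0.0
    for i in range(n):
        Pi = T[i] @ T[i].T                          # projection onto T_{x_i} M
        Vi = T[i] @ v[i]                            # field vector at x_i
        for j in np.nonzero(W[i])[0]:
            # (8): the field at x_i should predict the change of f along the edge
            R1 += W[i, j] * ((X[j] - X[i]) @ Vi - f[j] + f[i]) ** 2
            # (9): the field should change as little as possible (parallelism)
            R2 += W[i, j] * np.sum((Pi @ (T[j] @ v[j]) - Vi) ** 2)
    return R1, R2
```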

68 (9) i∼j Interestingly, with some algebraic transformations, we have the following matrix forms for our objective functions: R1 (fl , Vl ) = 2flT Ll fl + VlT Gl Vl − 2VlT Cl fl , (10) 4 where Ll is the graph Laplacian matrix, Gl is a dl nl × dl nl block diagonal matrix, and Cl = lT lT [C1 , . [sent-153, score-1.492]

69 And $R_2$ becomes $R_2(V_l) = V_l^T B_l V_l$, (12) where $B_l$ is a $d_l n_l \times d_l n_l$ sparse block matrix. [sent-158, score-1.027]

70 If we index each $d_l \times d_l$ block by $B_{ij}^l$, then we have $B_{ii}^l = \sum_{j \sim i} w_{ij}^l (Q_{ij}^l (Q_{ij}^l)^T + I)$, (13) and $B_{ij}^l = -2 w_{ij}^l Q_{ij}^l$ if $x_i \sim x_j$ and $0$ otherwise, (14) where $Q_{ij}^l = (T_i^l)^T T_j^l$. [sent-159, score-0.508]
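
The block structure in (13)-(14) can be assembled directly; the sketch below (again an illustration with assumed names, using dense matrices for clarity rather than a sparse format) builds $B_l$ and, as a sanity check, the quadratic form reproduces the edge-wise sum for $R_2$ from the previous sketch:

```python
# Illustrative assembly of the block matrix B_l from (13)-(14), dense for clarity.
import numpy as np

def assemble_B(W, T):
    n, d = len(T), T[0].shape[1]
    B = np.zeros((n * d, n * d))
    for i in range(n):
        for j in np.nonzero(W[i])[0]:
            Q = T[i].T @ T[j]                                          # Q_ij = T_i^T T_j
            B[i*d:(i+1)*d, i*d:(i+1)*d] += W[i, j] * (Q @ Q.T + np.eye(d))
            B[i*d:(i+1)*d, j*d:(j+1)*d] = -2.0 * W[i, j] * Q
    return B

# Usage: with v of shape (n, d) and V = v.reshape(-1), the value V @ assemble_B(W, T) @ V
# matches the R2 returned by discrete_R1_R2 above (both sums run over ordered edges).
```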

71 It is worth noting that both $R_1$ and $R_2$ depend on the tangent spaces $T_i^l$. [sent-160, score-0.183]

72 C is a column block matrix with the l-th block matrix being Cl . [sent-162, score-0.17]

73 Let I denote an n × n diagonal matrix where Iii = 1 if the corresponding i-th data is labeled and Iii = 0 otherwise. [sent-163, score-0.117]

74 And let $y \in \mathbb{R}^n$ be a column vector whose $i$-th element is the corresponding label of the $i$-th labeled data point and 0 otherwise. [sent-164, score-0.117]

75 Firstly, when constructing the nearest neighbor graph, data points from different tasks are disconnected. [sent-183, score-0.14]

76 Therefore when estimating tangent spaces, data points from different tasks are independent. [sent-184, score-0.204]

77 Secondly, we do not require the dimension of tangent spaces from each task to be the same. [sent-185, score-0.214]

78 Using Eq. (6), we rewrite $R_3(V, \Theta)$ as follows: $\hat{\Theta} = \arg\min_{\Theta} \sum_{l=1}^{m} \sum_{i=1}^{n_l} \alpha \| (I - \Theta^T \Theta) T_i^l v_i^l \|^2 + \beta \| T_i^l v_i^l \|^2 = \arg\min_{\Theta} \alpha \, \mathrm{tr}\big( \mathbf{V}^T ((1 + \tfrac{\beta}{\alpha}) I - \Theta^T \Theta) \mathbf{V} \big) = \arg\max_{\Theta} \mathrm{tr}(\Theta \mathbf{V} \mathbf{V}^T \Theta^T)$, (23) [sent-192, score-0.595]

79 where $\mathbf{V} = (T_1^1 v_1^1, \ldots, T_{n_m}^m v_{n_m}^m)$ is a $D \times n$ matrix with each column being a tangent vector. [sent-195, score-0.154]
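
Under the orthonormality constraint assumed above, the maximizer of $\mathrm{tr}(\Theta \mathbf{V}\mathbf{V}^T\Theta^T)$ is given by the top-$h$ eigenvectors of $\mathbf{V}\mathbf{V}^T$, equivalently the top-$h$ left singular vectors of $\mathbf{V}$. A minimal sketch (not the authors' implementation; the function name is an assumption):

```python
# Illustrative computation of the shared-subspace basis: the top-h left singular
# vectors of the D x n matrix whose columns are the tangent vectors T_i^l v_i^l.
import numpy as np

def estimate_shared_subspace(tangent_vectors, h):
    """tangent_vectors: (D, n) matrix V; returns Theta of shape (h, D) with orthonormal rows."""
    U, _, _ = np.linalg.svd(tangent_vectors, full_matrices=False)
    return U[:, :h].T
```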

80 To convert Eq. (23) into a convex formulation, we replace $\Theta^T \Theta$ with $M$ and naturally relax its feasible domain into a convex set based on the relationship between $\mathcal{M}_e$ and $\mathcal{M}_c$ presented above; this results in an optimization problem of the form $\arg\min R_3(V, M)$, s.t. $M$ lies in the relaxed convex set. [sent-210, score-0.095]

81 The kernel constructed in KASO uses both labeled data and unlabeled data. [sent-218, score-0.118]

82 We generate two data sets, the Swiss roll and the Swiss roll with a hole, both embedded in 3-dimensional Euclidean space. [sent-228, score-0.208]

83 The Swiss roll with hole excludes points with $t_1 \in [9, 12]$ and $t_2 \in [9, 14]$. [sent-230, score-0.119]

84 We randomly select a number of labeled data points in each task and try to predict the values on the remaining unlabeled data. [sent-233, score-0.17]

85 The number of nearest neighbors is set to 5 and the manifold dimension is set to 2, since both data sets are 2-dimensional manifolds. [sent-236, score-0.238]

86 We also show the singular value distribution of the ground truth gradient fields. [sent-245, score-0.204]

87 Given the ground truth f , we can compute the gradient field V by taking derivatives of R1 (f, V ) with respect to V . [sent-246, score-0.118]

88 After obtaining $V$, the gradient vector $V_{x_i^l}$ at each point can be obtained as $V_{x_i^l} = T_i^l v_i^l$. [sent-248, score-0.19]

89 Then we perform PCA on these vectors, and the singular values of the covariance matrix of $V_{x_i^l}$ are shown in Fig. 2(b). [sent-249, score-0.108]

90 The number of dominant singular values is 2, which indicates that the ground truth gradient fields concentrate on a 2-dimensional subspace. [sent-252, score-0.236]
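
The subspace check described in sentences 87-90 amounts to PCA on the stacked gradient vectors followed by counting dominant singular values. A minimal sketch (illustrative; the energy threshold and the function name are assumptions):

```python
# Illustrative version of the check: PCA on the stacked gradient vectors and a
# count of how many directions are needed to capture most of the energy.
import numpy as np

def dominant_directions(gradient_vectors, energy=0.99):
    """gradient_vectors: (n, D) array of V_{x_i}. Returns (#dominant dims, singular values)."""
    centered = gradient_vectors - gradient_vectors.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(ratio, energy)) + 1, s
```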

91 Each data example is represented by a 9-dimensional vector with a binary label, which is either 1 for landmine or 0 for clutter. [sent-256, score-0.146]

92 The number of nearest neighbors is set to 10 and the manifold dimension is set to 4 empirically. [sent-273, score-0.194]

93 The shared subspace dimension is set to 5 for both MTVFL and ASO, and the shared subspace dimension of KASO is set to 10. [sent-275, score-0.24]

94 ASO does not improve much when the amount of labeled data increases, which is probably because the data have severely unbalanced labels and the ground truth predictor function is nonlinear. [sent-288, score-0.255]

95 We also show the singular value distribution of the ground truth gradient fields in Fig. [sent-289, score-0.204]

96 34%, which indicates that the ground truth gradient fields concentrate on a 5-dimensional subspace. [sent-297, score-0.15]

97 We show that vector fields can naturally capture the shared differential structure among tasks as well as the structure of the data manifolds. [sent-299, score-0.303]

98 One is the relation between learning on task parameters and learning on vector fields. [sent-302, score-0.107]

99 A framework for learning predictive structures from multiple tasks and unlabeled data. [sent-320, score-0.128]

100 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-340, score-0.118]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('til', 0.491), ('nl', 0.331), ('vl', 0.306), ('mtvfl', 0.218), ('fl', 0.205), ('xl', 0.202), ('manifold', 0.152), ('dl', 0.151), ('kaso', 0.146), ('eld', 0.138), ('tangent', 0.132), ('aso', 0.127), ('txl', 0.127), ('vxl', 0.127), ('mtl', 0.125), ('stvfl', 0.109), ('elds', 0.108), ('vi', 0.098), ('landmine', 0.091), ('ih', 0.089), ('roll', 0.089), ('predictor', 0.088), ('singular', 0.086), ('parallel', 0.085), ('gl', 0.084), ('ul', 0.084), ('pil', 0.08), ('swiss', 0.076), ('ssmtl', 0.073), ('ml', 0.072), ('tasks', 0.072), ('shared', 0.066), ('block', 0.063), ('labeled', 0.062), ('ql', 0.059), ('differential', 0.058), ('unlabeled', 0.056), ('wij', 0.056), ('vector', 0.055), ('subspace', 0.054), ('task', 0.052), ('auc', 0.051), ('mc', 0.049), ('cl', 0.047), ('gv', 0.044), ('dimensional', 0.044), ('nearest', 0.042), ('field', 0.041), ('truth', 0.041), ('ground', 0.04), ('spans', 0.038), ('mse', 0.037), ('gradient', 0.037), ('bl', 0.037), ('lf', 0.037), ('rdl', 0.036), ('sigular', 0.036), ('tjl', 0.036), ('vlt', 0.036), ('vnl', 0.036), ('share', 0.035), ('arg', 0.035), ('hs', 0.033), ('tr', 0.033), ('diagonal', 0.033), ('covariant', 0.032), ('wi', 0.032), ('concentrate', 0.032), ('predictors', 0.031), ('convex', 0.03), ('cf', 0.03), ('spaces', 0.03), ('ando', 0.03), ('liao', 0.03), ('hole', 0.03), ('ij', 0.029), ('regularization', 0.028), ('rewrite', 0.028), ('china', 0.027), ('ssl', 0.026), ('derivative', 0.026), ('structure', 0.026), ('neighbor', 0.026), ('alternating', 0.026), ('synthetic', 0.026), ('geodesics', 0.025), ('bv', 0.025), ('vt', 0.025), ('unbalanced', 0.024), ('relatedness', 0.023), ('bij', 0.023), ('low', 0.022), ('effectiveness', 0.022), ('directional', 0.022), ('vanish', 0.022), ('micchelli', 0.022), ('matrix', 0.022), ('sl', 0.022), ('ll', 0.021), ('worth', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 225 nips-2012-Multi-task Vector Field Learning

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces, which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible, which is expected to make it span a low-dimensional subspace. (3) The vector fields from all tasks share a low-dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

2 0.13530903 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinational problem, we relax it into continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

3 0.12961946 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

4 0.096264146 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable, however computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of 1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable. 1

5 0.092118941 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convegence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods. 1

6 0.091620184 9 nips-2012-A Geometric take on Metric Learning

7 0.075406522 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

8 0.071132362 179 nips-2012-Learning Manifolds with K-Means and K-Flats

9 0.067720935 187 nips-2012-Learning curves for multi-task Gaussian process regression

10 0.0586797 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

11 0.055331428 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

12 0.051620141 330 nips-2012-Supervised Learning with Similarity Functions

13 0.050745387 206 nips-2012-Majorization for CRFs and Latent Likelihoods

14 0.046820719 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

15 0.045623168 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

16 0.04554015 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

17 0.045344662 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

18 0.044529192 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

19 0.04449936 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

20 0.044438843 242 nips-2012-Non-linear Metric Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.131), (1, 0.038), (2, 0.016), (3, -0.039), (4, 0.041), (5, 0.009), (6, -0.004), (7, 0.042), (8, 0.009), (9, -0.058), (10, -0.01), (11, -0.076), (12, -0.034), (13, -0.015), (14, 0.011), (15, 0.04), (16, 0.003), (17, 0.02), (18, 0.129), (19, 0.005), (20, 0.045), (21, -0.006), (22, 0.002), (23, 0.014), (24, -0.014), (25, -0.025), (26, 0.066), (27, -0.034), (28, -0.037), (29, -0.032), (30, -0.099), (31, 0.004), (32, -0.026), (33, 0.039), (34, -0.076), (35, -0.049), (36, 0.011), (37, 0.039), (38, -0.027), (39, -0.089), (40, 0.033), (41, -0.072), (42, 0.011), (43, 0.084), (44, 0.02), (45, -0.039), (46, -0.009), (47, 0.005), (48, -0.124), (49, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89901114 225 nips-2012-Multi-task Vector Field Learning

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces, which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL), which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible, which is expected to make it span a low-dimensional subspace. (3) The vector fields from all tasks share a low-dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

2 0.62731886 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

Author: Ben Calderhead, Mátyás A. Sustik

Abstract: One of the enduring challenges in Markov chain Monte Carlo methodology is the development of proposal mechanisms to make moves distant from the current point, that are accepted with high probability and at low computational cost. The recent introduction of locally adaptive MCMC methods based on the natural underlying Riemannian geometry of such models goes some way to alleviating these problems for certain classes of models for which the metric tensor is analytically tractable, however computational efficiency is not assured due to the necessity of potentially high-dimensional matrix operations at each iteration. In this paper we firstly investigate a sampling-based approach for approximating the metric tensor and suggest a valid MCMC algorithm that extends the applicability of Riemannian Manifold MCMC methods to statistical models that do not admit an analytically computable metric tensor. Secondly, we show how the approximation scheme we consider naturally motivates the use of 1 regularisation to improve estimates and obtain a sparse approximate inverse of the metric, which enables stable and sparse approximations of the local geometry to be made. We demonstrate the application of this algorithm for inferring the parameters of a realistic system of ordinary differential equations using a biologically motivated robust Student-t error model, for which the Expected Fisher Information is analytically intractable. 1

3 0.62158316 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convegence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods. 1

4 0.60161477 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. 
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight. 7 Shoulder to wrist distance Shoulder breadth Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. 4.2 Scalability The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which captures 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material. 5 Discussion This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. 
5 Discussion

This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, and is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first with a solid theoretical backing. For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which is essentially a linear approach. Another approach was suggested by Hong et al. [23], who simply compute the principal components according to each metric separately, such that one low-dimensional model is learned per metric.

The proposed approach is, however, not free of difficulties in its current implementation. At present we use off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we were able to analyze medium-sized datasets, we believe the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks, as this is all the current literature provides. We expect improvements from learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.
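For contrast with the Riemannian treatment, the averaged-metric strategy attributed to Hastie and Tibshirani above can be sketched in a few lines of Python: average the local metrics into a single global metric, whiten the data with it, and run ordinary PCA. This is a hedged illustration of the general idea rather than their exact procedure; the placeholder metrics and helper names are assumptions.

```python
import numpy as np

def averaged_metric_pca(X, local_metrics, n_components=2):
    """Linear baseline: PCA under a single globally averaged metric.

    X             : (N, D) data
    local_metrics : (K, D, D) learned local metrics
    Returns the data projected onto the leading components of the
    M-whitened space, i.e. one linear model for the whole data set.
    """
    M = local_metrics.mean(axis=0)                     # average the local metrics
    # whiten with the averaged metric: x -> M^{1/2} x (via eigendecomposition)
    vals, vecs = np.linalg.eigh(M)
    M_sqrt = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    Z = (X - X.mean(axis=0)) @ M_sqrt
    # ordinary PCA in the whitened space
    Zc = Z - Z.mean(axis=0)
    U, S, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    local_metrics = np.stack([np.eye(5), 2 * np.eye(5)])
    print(averaged_metric_pca(X, local_metrics).shape)   # (100, 2)
```

Because the metric is collapsed to a single matrix before any projection, this baseline cannot bend the principal directions the way the per-point metric tensor does, which is the distinction the discussion above draws.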
Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, October 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

5 0.51404446 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.
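As background for the abstract above, the sketch below shows a linear (primal) Laplacian-regularized least squares fit in Python. It is a hedged illustration only: it is not that paper's kernel formulation or its out-of-sample bound, and the regularization weights, graph construction, and helper names are assumptions made for this example.

```python
import numpy as np

def laprls_linear(X, y, labeled_idx, L, lam_ridge=1e-2, lam_lap=1e-1):
    """Linear Laplacian-regularized least squares (a simplified primal variant).

    Minimizes ||X_L w - y_L||^2 + lam_ridge ||w||^2 + lam_lap w^T X^T L X w,
    where X contains labeled and unlabeled points and L is a graph Laplacian.
    """
    XL, yL = X[labeled_idx], y[labeled_idx]
    D = X.shape[1]
    A = XL.T @ XL + lam_ridge * np.eye(D) + lam_lap * X.T @ L @ X
    return np.linalg.solve(A, XL.T @ yL)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    # toy neighborhood graph Laplacian (dense, for illustration only)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = (d2 < np.quantile(d2, 0.1)).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W
    w = laprls_linear(X, y, labeled_idx=np.arange(10), L=L)
    print(w)
```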

6 0.5115006 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means

7 0.5089587 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

8 0.50436229 187 nips-2012-Learning curves for multi-task Gaussian process regression

9 0.4992578 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction

10 0.49685487 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

11 0.48192859 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

12 0.48156956 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

13 0.46339875 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

14 0.4623377 222 nips-2012-Multi-Task Averaging

15 0.45879531 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression

16 0.45700654 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

17 0.45377871 86 nips-2012-Convex Multi-view Subspace Learning

18 0.44058868 22 nips-2012-A latent factor model for highly multi-relational data

19 0.43736655 221 nips-2012-Multi-Stage Multi-Task Feature Learning

20 0.42528242 32 nips-2012-Active Comparison of Prediction Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.043), (11, 0.359), (17, 0.023), (21, 0.036), (38, 0.12), (39, 0.016), (42, 0.017), (54, 0.033), (74, 0.034), (76, 0.11), (80, 0.079), (92, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.75915295 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

Author: Jaldert Rombouts, Pieter Roelfsema, Sander M. Bohte

Abstract: A key function of brains is undoubtedly the abstraction and maintenance of information from the environment for later use. Neurons in association cortex play an important role in this process: by learning these neurons become tuned to relevant features and represent the information that is required later as a persistent elevation of their activity [1]. It is however not well known how such neurons acquire these task-relevant working memories. Here we introduce a biologically plausible learning scheme grounded in Reinforcement Learning (RL) theory [2] that explains how neurons become selective for relevant information by trial and error learning. The model has memory units which learn useful internal state representations to solve working memory tasks by transforming partially observable Markov decision problems (POMDP) into MDPs. We propose that synaptic plasticity is guided by a combination of attentional feedback signals from the action selection stage to earlier processing levels and a globally released neuromodulatory signal. Feedback signals interact with feedforward signals to form synaptic tags at those connections that are responsible for the stimulus-response mapping. The neuromodulatory signal interacts with tagged synapses to determine the sign and strength of plasticity. The learning scheme is generic because it can train networks in different tasks, simply by varying inputs and rewards. It explains how neurons in association cortex learn to 1) temporarily store task-relevant information in non-linear stimulus-response mapping tasks [1, 3, 4] and 2) learn to optimally integrate probabilistic evidence for perceptual decision making [5, 6]. 1

2 0.70734972 151 nips-2012-High-Order Multi-Task Feature Learning to Identify Longitudinal Phenotypic Markers for Alzheimer's Disease Progression Prediction

Author: Hua Wang, Feiping Nie, Heng Huang, Jingwen Yan, Sungeun Kim, Shannon Risacher, Andrew Saykin, Li Shen

Abstract: Alzheimer’s disease (AD) is a neurodegenerative disorder characterized by progressive impairment of memory and other cognitive functions. Regression analysis has been studied to relate neuroimaging measures to cognitive status. However, whether these measures have further predictive power to infer a trajectory of cognitive performance over time is still an under-explored but important topic in AD research. We propose a novel high-order multi-task learning model to address this issue. The proposed model explores the temporal correlations existing in imaging and cognitive data by structured sparsity-inducing norms. The sparsity of the model enables the selection of a small number of imaging measures while maintaining high prediction accuracy. The empirical studies, using the longitudinal imaging and cognitive data of the ADNI cohort, have yielded promising results.

same-paper 3 0.70702928 225 nips-2012-Multi-task Vector Field Learning

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

4 0.6402474 305 nips-2012-Selective Labeling via Error Bound Minimization

Author: Quanquan Gu, Tong Zhang, Jiawei Han, Chris H. Ding

Abstract: In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study a problem as follows: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian regularized Least Squares (LapRLS). In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

5 0.63744581 327 nips-2012-Structured Learning of Gaussian Graphical Models

Author: Karthik Mohan, Mike Chung, Seungyeop Han, Daniela Witten, Su-in Lee, Maryam Fazel

Abstract: We consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the perturbed-node joint graphical lasso, a convex optimization problem that is based upon the use of a row-column overlap norm penalty. We then solve the convex problem using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data. 1

6 0.4963766 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

7 0.48702726 363 nips-2012-Wavelet based multi-scale shape features on arbitrary surfaces for cortical thickness discrimination

8 0.47807342 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

9 0.47646397 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

10 0.47428676 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

11 0.47350383 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.47294062 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.4725385 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

14 0.47244009 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

15 0.47192273 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

16 0.47185943 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

17 0.47119468 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

18 0.47088477 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

19 0.47080973 38 nips-2012-Algorithms for Learning Markov Field Policies

20 0.47076502 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models