nips nips2012 nips2012-212 knowledge-graph by maker-knowledge-mining

212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL


Source: pdf

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. [sent-9, score-0.209]

2 One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. [sent-10, score-0.355]

3 Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. [sent-11, score-0.586]

4 The full paradigm itself is loss-compositional, operating on the vector of empirical risks. [sent-12, score-0.149]

5 It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. [sent-13, score-0.289]

6 We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. [sent-14, score-0.468]

7 A multi-task learning (MTL) algorithm learns an inductive bias to learn several tasks together. [sent-17, score-0.143]

8 But if we see examples from a random set of tasks today, which of these tasks will matter tomorrow? [sent-19, score-0.192]

9 Consider a simple learning scenario: A music preference prediction company is in the business of predicting what 5-star ratings different users would assign to songs. [sent-21, score-0.16]

10 At training time, the company learns a shared representation for predicting the users’ song ratings by pooling together the company’s limited data on each user’s preferences. [sent-22, score-0.12]

11 At test time, the environment draws a user according to some (possibly randomized) rule and solicits from the company a prediction of that user’s preference for a particular song. [sent-24, score-0.156]

12 The environment may also ask for predictions about new users, described by a few ratings each, and so the company must leverage its existing representation to rapidly learn new predictors and produce ratings for these new users. [sent-25, score-0.163]

13 Classically, multi-task learning has sought to minimize the (regularized) sum of the empirical risks over a set of tasks. [sent-26, score-0.205]

14 In this way, classical MTL implicitly assumes that once the learner has been trained, it will be tested on test tasks drawn uniformly at random from the empirical task distribution of the training tasks. [sent-27, score-0.334]

15 • Whereas minimizing the average prediction error is very much a teleological endeavor, typically at the expense of some locally egregious outcomes, minimizing the worst-case prediction error respects a notion of fairness to all tasks (or people). [sent-31, score-0.174]

16 This work introduces minimax multi-task learning as a response to the above scenario. [sent-32, score-0.213]

17 At one end of the spectrum lies minimax MTL, and departing from this point progressively relaxes the “hardness” of the maximum until full relaxation reaches the second endpoint and recovers classical MTL. [sent-34, score-0.338]

18 We further sculpt a generalized loss-compositional paradigm for MTL which includes this spectrum and several other new MTL formulations. [sent-35, score-0.133]

19 This paradigm equally applies to the problem of learning to learn (LTL), in which the goal is to learn a hypothesis space from a set of training tasks such that this representation admits good hypotheses on future tasks. [sent-36, score-0.32]

20 The first contribution of this work is to introduce minimax MTL and a continuum of relaxations. [sent-39, score-0.259]

21 Second, we introduce a generalized loss-compositional paradigm for MTL which admits a number of new MTL formulations and also includes classical MTL as a special case. [sent-40, score-0.243]

22 Third, we empirically evaluate the performance of several MTL formulations from this paradigm in the multi-task learning and learning to learn settings, under the task-wise maximum test risk and task-wise mean test risk criteria, on four datasets (one synthetic, three real). [sent-41, score-0.559]

23 Hence, if the goal is to minimize the worst case outcome over new tasks, the theory suggests minimizing the maximum of the empirical risks across the training tasks rather than their mean. [sent-43, score-0.396]

24 In the next section, we recall the settings of multi-task learning and learning to learn, formally introduce minimax MTL, and motivate it theoretically. [sent-44, score-0.227]

25 In Section 3, we introduce a continuously parametrized family of minimax MTL relaxations and the new generalized loss-compositional paradigm. [sent-45, score-0.256]

26 Section 4 presents an empirical evaluation of various MTL/LTL formulations with different models on four datasets. [sent-46, score-0.131]

27 MTL and LTL often are framed as applying an inductive bias to learn a common hypothesis space, selected from a fixed family of hypothesis spaces, and thereafter learning from this hypothesis space a hypothesis for each task observed at training time. [sent-53, score-0.265]

28 1 Note that minimax MTL does not refer to the minimax estimators of statistical decision theory. [sent-57, score-0.426]

29 The goal is to find the optimal representation via the objective inf EP ∼Q inf E(x,y)∼P (y, h(x)). [sent-61, score-0.281]

30 , PT ∈ P are drawn iid from Q, and from each task t a set of m examples are drawn iid from Pt . [sent-65, score-0.196]

31 Classically, the goal of MTL is to minimize the expected loss at test time under the uniform distribution on [T ]: inf H∈H 1 T inf E(x,y)∼Pt (y, h(x)). [sent-72, score-0.341]

32 In terms of the data generation model, MTL differs from LTL since the tasks are fixed; however, just as in LTL, from each task t a set of m examples are drawn iid from Pt . [sent-77, score-0.22]

33 1 Minimax MTL A natural generalization of classical MTL results by introducing a prior distribution π over the index set of tasks [T ]. [sent-79, score-0.134]

34 Given π, the (idealized) objective of this generalized MTL is inf Et∼π inf E(x,y)∼Pt (y, h(x)), H∈H h∈H (3) given only the training data {(xt,1 , yt,1 ), . [sent-80, score-0.313]

35 We propose to minimize the maximum error over tasks under an adversarial choice of π, yielding the objective: inf sup Et∼π inf E(x,y)∼Pt (y, h(x)), H∈H π h∈H where the supremum is taken over the T -dimensional simplex. [sent-86, score-0.427]

36 As the supremum (assuming it is attained) is attained at an extreme point of the simplex, this objective is equivalent to inf max inf E(x,y)∼Pt (y, h(x)). [sent-87, score-0.298]

37 H∈H t∈[T ] h∈H In practice, we approximate the true objective via a regularized form of the empirical objective m inf max inf H∈H t∈[T ] h∈H (yt,i , h(xt,i )). [sent-88, score-0.401]

38 i=1 In the next section, we motivate minimax MTL theoretically by showing that the worst-case performance on future tasks likely will not be much higher than the maximum of the empirical risks for the training tasks. [sent-89, score-0.546]

39 2 A learning to learn bound for the maximum risk In this subsection, we use the following notation. [sent-92, score-0.185]

40 , P (T ) be probability measures drawn iid from Q, and for t ∈ [T ] let z(t) be an m-sample (a sample of m points) from P (t) with corre(t) sponding empirical measure Pm . [sent-96, score-0.127]

41 Our focus is the learning to learn setting with a minimax lens: when one learns a representation H ∈ H from multiple training tasks and observes maximum empirical risk γ, we would like to 3 guarantee that H’s true risk on a newly drawn test task will be bounded by roughly γ. [sent-98, score-0.856]

42 h∈H (4) The expected true risk is not of primary interest for controlling the tail of the (random) true risk, and a more direct approach yields a much better bound. [sent-101, score-0.16]

43 , P (T ) are drawn iid from Q and from each task P (t) an iid m-sample (t) z(t) is drawn. [sent-108, score-0.164]

44 Let h be the empirical risk minimizer over the test m-sample. [sent-111, score-0.209]

45 Critically, in (5) the probability of observing a task with high true risk decays with T , whereas in (4) the decay is independent of T . [sent-115, score-0.208]

46 Hence, when the goal is to minimize the probability of bad performance on future tasks uniformly, this theorem motivates minimizing the maximum of the empirical risks as opposed to their mean. [sent-116, score-0.362]

47 Suppose that for γ fixed a (t) priori, the maximum of the empirical risks is bounded by γ, i. [sent-118, score-0.219]

48 Then the probability that γ bounds all T empirical risks is at most (1 − ε)T ≤ e−T ε . [sent-123, score-0.181]

49 h∈H T The bound in the lemma states a 1/T rate of decay for the probability that the empirical risk obtained by H on a new task exceeds γ. [sent-127, score-0.233]

50 Next, we relate this empirical risk to the true risk obtained by the empirical risk minimizer. [sent-128, score-0.505]

51 In particular, with high probability the true risk of the empirical risk minimizer is not much larger than its empirical risk. [sent-133, score-0.379]

52 , B }; note that mapping the observed maximum empirical risk γ to 1 min{γ ∈ Γ | γ ≤ γ } picks up the additional T term in (5). [sent-137, score-0.219]

53 In the next section, we introduce a loss-compositional paradigm for multi-task learning which includes as special cases minimax MTL and classical MTL. [sent-138, score-0.345]

54 4 3 A generalized loss-compositional paradigm for MTL The paradigm can benefit from a bit of notation. [sent-139, score-0.202]

55 Given a set of T tasks, we represent the empirical m risk for hypothesis ht ∈ H (∈ H) on task t ∈ [T ] as ˆt (ht ) := i=1 (yt,i , ht (xt,i )). [sent-140, score-0.576]

56 , hT ) ∈ HT and the vector of empirical risks ˆ(h) := ( ˆ1 (h1 ), . [sent-144, score-0.181]

57 With this notation set, the proposed loss-compositional paradigm encompasses any regularized minimization of a (typically convex) function φ : RT → R+ of the empirical risks: + inf φ ˆ(h) + Ω (H, h) , inf (6) H∈H h∈HT where Ω(·) : H × ∪H∈H HT → R+ is a regularizer. [sent-148, score-0.441]

58 A natural question is why one might consider minimizing p -norms of the empirical risks vector for 1 < p < ∞, as in 2 MTL. [sent-154, score-0.204]

59 The contour of the 1 -norm of the empirical risks evenly trades off empirical risks between different tasks; however, it has been observed that overfitting often happens near the end of learning, rather than the beginning [14]. [sent-155, score-0.362]

60 More precisely, when the empirical risk is high, the gradient of the empirical risk (taken with respect to the parameter (H, h)) is likely to have positive inner product with the gradient of the true risk. [sent-156, score-0.379]

61 Therefore, given a candidate solution with a corresponding vector of empirical risks, a sensible strategy is to take a step in solution space which places more emphasis on tasks with higher empirical risk. [sent-157, score-0.236]

62 In this work, we also discuss an interesting subset of the loss-compositional paradigm which does not fit into p MTL; this subfamily embodies a continuum of relaxations of minimax MTL. [sent-160, score-0.406]

63 In some cases, minimizing the maximum loss can exhibit certain disadvantages because the maximum loss is not robust to situations when a small fraction of the tasks are fundamentally harder than the remaining tasks. [sent-162, score-0.265]

64 Consider the case when the empirical risk for each task in this small fraction can not be reduced below a level u. [sent-163, score-0.233]

65 Intuitively, the idea is to ensure that most tasks have low empirical risk, but a small fraction of tasks are permitted to have higher loss. [sent-165, score-0.247]

66 t∈[T ] In the above, φ from the loss-compositional paradigm (6) is a variational function of the empirical risks vector. [sent-167, score-0.275]

67 (2) Suppose one task is much harder than all the other tasks (e. [sent-173, score-0.148]

68 an outlier task), and its empirical risk is separated from the maximum empirical risk of the other tasks by ρ. [sent-175, score-0.512]

69 Thus, the objective can focus on minimizing the maximum risk of the set of T − 1 easier tasks. [sent-177, score-0.214]

70 (3) As α approaches 0, we recover the hard maximum case of minimax MTL. [sent-179, score-0.251]

71 Evgeniou and Pontil presented this model as minv0 ,{vt }t∈[T ] t∈[T ] m i=1 (yt,i , v0 + vt , xt,i ) + λ0 v0 2 + λ1 T t∈[T ] vt 2 , for the hinge loss or squared loss. [sent-201, score-0.172]

72 This can be set in the new paradigm via H = {Hv0 | v0 ∈ Rd }, m 1 Hv0 = {h : x → v0 + vt , x | vt ∈ Rd }, and ˆt (ht ) = m i=1 yt,i , v0 + vt , xt,i . [sent-202, score-0.262]

73 The AEP model minimizes the task-wise average loss with the trace norm (nuclear norm) penalty: m minW t i=1 (yt,i , Wt , xt,i ) + λ W tr , where · tr : W → i σi (W ) is the trace norm. [sent-203, score-0.17]

74 Each ht = Wt in some H ∈ H lives ˆt (ht ) = 1 m yt,i , ht , xt,i . [sent-205, score-0.31]

75 Also, i=1 m For easy empirical comparison between the various MTL formulations from the paradigm, at times it will be convenient to use constrained formulations of the EP and AEP model. [sent-207, score-0.207]

76 minimax MTL), model (EP or AEP), and choice of regularized versus constrained. [sent-218, score-0.234]

77 6 250 250 200 200 400 350 100 squared−loss risk squared−loss risk squared−loss risk 300 150 150 100 250 200 150 100 50 50 50 0 −1 0 1 σnoise 2 3 0 −1 4 0 1 σnoise 2 3 0 −0. [sent-222, score-0.378]

78 The two modes regression problem consists of 50 linear prediction tasks for the first type of task and 5 linear prediction tasks for the second task type. [sent-237, score-0.328]

79 The true parameter for the first task type is a vector µ drawn uniformly from the sphere of radius 5; the true parameter for the second task type is −2µ. [sent-238, score-0.17]

80 Each task is drawn from an isotropic Gaussian with mean taken from the task type and the standard deviation of all dimensions set to σtask . [sent-239, score-0.158]

81 For each LTL experiment, 55 new test tasks were drawn using the same µ as from the training tasks. [sent-246, score-0.174]

82 Figure 1 shows a tradeoff: when each task group is fairly homogeneous (left and center plots), minimax is better at minimizing the maximum of the test risks while 1 is better at minimizing the mean of the test risks. [sent-247, score-0.553]

83 As task homogeneity decreases (right plot), the gap in performance closes with respect to the maximum of the test risks and remains roughly the same with respect to the mean. [sent-248, score-0.244]

84 The results (see Figure 2) demonstrate that when the learner has moderate shared capacity τ0 and high task-specific capacity τ1 , minimax MTL outperforms 1 MTL for the max objective; additionally, for the max objective in almost all parameter settings (0. [sent-286, score-0.386]

85 2T )-minimax MTL outperform 1 MTL, and they also outperform minimax MTL when the task-specific capacity τ1 is not too large. [sent-288, score-0.291]

86 We hypothesize that minimax MTL performs the best in the high−τ1 regime because stopping learning once the maximum of the empirical risks cannot be improved invokes early stopping and its built-in regularization properties (see e. [sent-289, score-0.432]

87 Interestingly, for the normalized mean RMSE objective, both minimax relaxations are competitive with 1 MTL; however, when the 7 shared capacity τ0 is high (right section, right plot), 1 MTL performs the best. [sent-292, score-0.331]

88 For high task-specific capacity τ1 , minimax MTL and its relaxations again seem to resist overfitting compared to 1 MTL. [sent-293, score-0.292]

89 For the mean RMSE, 1 and Mean 2 risk (Right) vs bound on W tr . [sent-325, score-0.191]

90 LTL used 10-fold MTL obtains the lowest risk for al- cross-validation (10% of tasks left out in each fold). [sent-326, score-0.222]

91 1 , solid blue • is minimax, dashed green CV−mean Mean squared loss CV−mean Maximum squared loss 0. [sent-329, score-0.203]

92 Minimax MTL outperforms 1 MTL Figure 4: Test multiclass 0-1 loss vs when the capacity W tr is somewhat limited, with the W tr . [sent-352, score-0.153]

93 1T )-minimax, every capacity minimax MTL is competitive with 1 MTL. [sent-355, score-0.263]

94 5 Discussion We have established a continuum of formulations for MTL which recovers as special cases classical MTL and the newly formulated minimax MTL. [sent-359, score-0.407]

95 In between these extreme points lies a continuum of relaxed minimax MTL formulations. [sent-360, score-0.277]

96 More generally, we introduced a loss-compositional paradigm that operates on the vector of empirical risks, inducing the additional p MTL paradigms. [sent-361, score-0.149]

97 The empirical evaluations indicate that α-minimax MTL at either the 10% or 20% level often outperforms 1 MTL in terms of the maximum test risk objective and sometimes even in the mean test risk objective. [sent-362, score-0.45]

98 All the minimax or α-minimax MTL formulations exhibit a built-in safeguard against overfitting in the case of learning with a model that is very complex relative to the available data. [sent-363, score-0.289]

99 A good direction for the future is to obtain efficient algorithms for minimax and α-minimax MTL. [sent-365, score-0.213]

100 Another area ripe for exploration is to establish more general learning bounds for minimax MTL and to extend these bounds to α-minimax MTL. [sent-367, score-0.213]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mtl', 0.853), ('minimax', 0.213), ('ht', 0.155), ('ltl', 0.139), ('inf', 0.127), ('risk', 0.126), ('risks', 0.126), ('aep', 0.121), ('tasks', 0.096), ('paradigm', 0.094), ('formulations', 0.076), ('pm', 0.069), ('rmse', 0.059), ('vt', 0.056), ('empirical', 0.055), ('task', 0.052), ('capacity', 0.05), ('pt', 0.05), ('continuum', 0.046), ('ratings', 0.043), ('company', 0.042), ('evgeniou', 0.041), ('iid', 0.04), ('maximum', 0.038), ('classical', 0.038), ('ep', 0.038), ('minh', 0.037), ('user', 0.036), ('conjoint', 0.036), ('tournament', 0.036), ('loss', 0.035), ('newly', 0.034), ('dashed', 0.034), ('hypothesis', 0.033), ('trace', 0.032), ('solid', 0.032), ('drawn', 0.032), ('pontil', 0.031), ('relaxations', 0.029), ('test', 0.028), ('theodoros', 0.028), ('objective', 0.027), ('massimiliano', 0.026), ('cv', 0.026), ('inductive', 0.026), ('tr', 0.025), ('spectrum', 0.025), ('squared', 0.025), ('users', 0.024), ('endpoint', 0.024), ('subfamily', 0.024), ('minimize', 0.024), ('minimizing', 0.023), ('mean', 0.022), ('regularized', 0.021), ('norm', 0.021), ('learn', 0.021), ('admits', 0.021), ('preference', 0.02), ('personal', 0.019), ('baxter', 0.019), ('minw', 0.019), ('georgia', 0.019), ('classically', 0.019), ('heterogeneity', 0.019), ('training', 0.018), ('vs', 0.018), ('relaxed', 0.018), ('wt', 0.017), ('true', 0.017), ('encompasses', 0.017), ('cvx', 0.017), ('normals', 0.017), ('marketing', 0.017), ('yoshua', 0.017), ('green', 0.017), ('shared', 0.017), ('pr', 0.017), ('attained', 0.017), ('worst', 0.016), ('prediction', 0.016), ('hypotheses', 0.016), ('outlier', 0.016), ('idealized', 0.016), ('framed', 0.016), ('learner', 0.015), ('adversarial', 0.015), ('sensible', 0.015), ('music', 0.015), ('emphasis', 0.015), ('generalized', 0.014), ('outperform', 0.014), ('andreas', 0.014), ('tting', 0.014), ('environment', 0.014), ('settings', 0.014), ('convex', 0.013), ('decays', 0.013), ('complexities', 0.013), ('peter', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

2 0.12961946 225 nips-2012-Multi-task Vector Field Learning

Author: Binbin Lin, Sen Yang, Chiyuan Zhang, Jieping Ye, Xiaofei He

Abstract: Multi-task learning (MTL) aims to improve generalization performance by learning multiple related tasks simultaneously and identifying the shared information among tasks. Most of existing MTL methods focus on learning linear models under the supervised setting. We propose a novel semi-supervised and nonlinear approach for MTL using vector fields. A vector field is a smooth mapping from the manifold to the tangent spaces which can be viewed as a directional derivative of functions on the manifold. We argue that vector fields provide a natural way to exploit the geometric structure of data as well as the shared differential structure of tasks, both of which are crucial for semi-supervised multi-task learning. In this paper, we develop multi-task vector field learning (MTVFL) which learns the predictor functions and the vector fields simultaneously. MTVFL has the following key properties. (1) The vector fields MTVFL learns are close to the gradient fields of the predictor functions. (2) Within each task, the vector field is required to be as parallel as possible which is expected to span a low dimensional subspace. (3) The vector fields from all tasks share a low dimensional subspace. We formalize our idea in a regularization framework and also provide a convex relaxation method to solve the original non-convex problem. The experimental results on synthetic and real data demonstrate the effectiveness of our proposed approach. 1

3 0.081828415 222 nips-2012-Multi-Task Averaging

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

4 0.081664667 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

5 0.075596519 195 nips-2012-Learning visual motion in recurrent neural networks

Author: Marius Pachitariu, Maneesh Sahani

Abstract: We present a dynamic nonlinear generative model for visual motion based on a latent representation of binary-gated Gaussian variables. Trained on sequences of images, the model learns to represent different movement directions in different variables. We use an online approximate inference scheme that can be mapped to the dynamics of networks of neurons. Probed with drifting grating stimuli and moving bars of light, neurons in the model show patterns of responses analogous to those of direction-selective simple cells in primary visual cortex. Most model neurons also show speed tuning and respond equally well to a range of motion directions and speeds aligned to the constraint line of their respective preferred speed. We show how these computations are enabled by a specific pattern of recurrent connections learned by the model. 1

6 0.071960501 293 nips-2012-Relax and Randomize : From Value to Algorithms

7 0.068667866 187 nips-2012-Learning curves for multi-task Gaussian process regression

8 0.068451323 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

9 0.064124331 208 nips-2012-Matrix reconstruction with the local max norm

10 0.062848255 16 nips-2012-A Polynomial-time Form of Robust Regression

11 0.060322095 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

12 0.06027583 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

13 0.059914477 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

14 0.05957086 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

15 0.052983217 296 nips-2012-Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds

16 0.052867215 168 nips-2012-Kernel Latent SVM for Visual Recognition

17 0.052403178 275 nips-2012-Privacy Aware Learning

18 0.049042255 221 nips-2012-Multi-Stage Multi-Task Feature Learning

19 0.046144269 295 nips-2012-Risk-Aversion in Multi-armed Bandits

20 0.045661256 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.114), (1, -0.002), (2, 0.044), (3, 0.002), (4, 0.055), (5, 0.008), (6, 0.002), (7, 0.101), (8, -0.048), (9, -0.015), (10, 0.016), (11, 0.026), (12, -0.055), (13, -0.036), (14, 0.011), (15, -0.004), (16, -0.007), (17, -0.001), (18, -0.016), (19, 0.01), (20, -0.034), (21, 0.015), (22, -0.007), (23, 0.011), (24, -0.043), (25, -0.04), (26, 0.079), (27, -0.003), (28, 0.03), (29, -0.02), (30, -0.069), (31, -0.011), (32, -0.135), (33, 0.019), (34, 0.009), (35, 0.118), (36, -0.033), (37, 0.004), (38, -0.014), (39, -0.097), (40, -0.03), (41, -0.022), (42, -0.034), (43, 0.139), (44, 0.058), (45, 0.034), (46, 0.052), (47, -0.025), (48, -0.146), (49, -0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91997117 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

2 0.57907796 32 nips-2012-Active Comparison of Prediction Models

Author: Christoph Sawade, Niels Landwehr, Tobias Scheffer

Abstract: We address the problem of comparing the risks of two given predictive models—for instance, a baseline model and a challenger—as confidently as possible on a fixed labeling budget. This problem occurs whenever models cannot be compared on held-out training data, possibly because the training data are unavailable or do not reflect the desired test distribution. In this case, new test instances have to be drawn and labeled at a cost. We devise an active comparison method that selects instances according to an instrumental sampling distribution. We derive the sampling distribution that maximizes the power of a statistical test applied to the observed empirical risks, and thereby minimizes the likelihood of choosing the inferior model. Empirically, we investigate model selection problems on several classification and regression tasks and study the accuracy of the resulting p-values. 1

3 0.56568271 271 nips-2012-Pointwise Tracking the Optimal Regression Function

Author: Yair Wiener, Ran El-Yaniv

Abstract: This paper examines the possibility of a ‘reject option’ in the context of least squares regression. It is shown that using rejection it is theoretically possible to learn ‘selective’ regressors that can ǫ-pointwise track the best regressor in hindsight from the same hypothesis class, while rejecting only a bounded portion of the domain. Moreover, the rejected volume vanishes with the training set size, under certain conditions. We then develop efficient and exact implementation of these selective regressors for the case of linear regression. Empirical evaluation over a suite of real-world datasets corroborates the theoretical analysis and indicates that our selective regressors can provide substantial advantage by reducing estimation error.

4 0.54468304 222 nips-2012-Multi-Task Averaging

Author: Sergey Feldman, Maya Gupta, Bela Frigyik

Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1

5 0.54235375 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

Author: Koby Crammer, Yishay Mansour

Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1

6 0.51917493 208 nips-2012-Matrix reconstruction with the local max norm

7 0.51003122 247 nips-2012-Nonparametric Reduced Rank Regression

8 0.49414545 319 nips-2012-Sparse Prediction with the $k$-Support Norm

9 0.48646882 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

10 0.4654431 225 nips-2012-Multi-task Vector Field Learning

11 0.45297399 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

12 0.43910423 187 nips-2012-Learning curves for multi-task Gaussian process regression

13 0.42971319 361 nips-2012-Volume Regularization for Binary Classification

14 0.41826248 221 nips-2012-Multi-Stage Multi-Task Feature Learning

15 0.41607568 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

16 0.40623516 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

17 0.40407327 46 nips-2012-Assessing Blinding in Clinical Trials

18 0.40342376 86 nips-2012-Convex Multi-view Subspace Learning

19 0.39404359 16 nips-2012-A Polynomial-time Form of Robust Regression

20 0.38505045 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (21, 0.041), (27, 0.019), (32, 0.189), (36, 0.014), (38, 0.173), (39, 0.02), (42, 0.055), (53, 0.014), (54, 0.016), (55, 0.035), (74, 0.041), (76, 0.125), (80, 0.083), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90366352 23 nips-2012-A lattice filter model of the visual pathway

Author: Karol Gregor, Dmitri B. Chklovskii

Abstract: Early stages of visual processing are thought to decorrelate, or whiten, the incoming temporally varying signals. Motivated by the cascade structure of the visual pathway (retina → lateral geniculate nucelus (LGN) → primary visual cortex, V1) we propose to model its function using lattice filters - signal processing devices for stage-wise decorrelation of temporal signals. Lattice filter models predict neuronal responses consistent with physiological recordings in cats and primates. In particular, they predict temporal receptive fields of two different types resembling so-called lagged and non-lagged cells in the LGN. Moreover, connection weights in the lattice filter can be learned using Hebbian rules in a stage-wise sequential manner reminiscent of the neuro-developmental sequence in mammals. In addition, lattice filters can model visual processing in insects. Therefore, lattice filter is a useful abstraction that captures temporal aspects of visual processing. Our sensory organs face an ongoing barrage of stimuli from the world and must transmit as much information about them as possible to the rest of the brain [1]. This is a formidable task because, in sensory modalities such as vision, the dynamic range of natural stimuli (more than three orders of magnitude) greatly exceeds the dynamic range of relay neurons (less than two orders of magnitude) [2]. The reason why high fidelity transmission is possible at all is that the continuity of objects in the physical world leads to correlations in natural stimuli, which imply redundancy. In turn, such redundancy can be eliminated by compression performed by the front end of the visual system leading to the reduction of the dynamic range [3, 4]. A compression strategy appropriate for redundant natural stimuli is called predictive coding [5, 6, 7]. In predictive coding, a prediction of the incoming signal value is computed from past values delayed in the circuit. This prediction is subtracted from the actual signal value and only the prediction error is transmitted. In the absence of transmission noise such compression is lossless as the original signal could be decoded on the receiving end by inverting the encoder. If predictions are accurate, the dynamic range of the error is much smaller than that of the natural stimuli. Therefore, minimizing dynamic range using predictive coding reduces to optimizing prediction. Experimental support for viewing the front end of the visual system as a predictive encoder comes from the measurements of receptive fields [6, 7]. In particular, predictive coding suggests that, for natural stimuli, the temporal receptive fields should be biphasic and the spatial receptive fields center-surround. These predictions are born out by experimental measurements in retinal ganglion cells, [8], lateral geniculate nucleus (LGN) neurons [9] and fly second order visual neurons called large monopolar cells (LMCs) [2]. In addition, the experimentally measured receptive fields vary with signal-to-noise ratio as would be expected from optimal prediction theory [6]. Furthermore, experimentally observed whitening of the transmitted signal [10] is consistent with removing correlated components from the incoming signals [11]. As natural stimuli contain correlations on time scales greater than hundred milliseconds, experimentally measured receptive fields of LGN neurons are equally long [12]. Decorrelation over such long time scales requires equally long delays. How can such extended receptive field be produced by 1 biological neurons and synapses whose time constants are typically less than hundred milliseconds [13]? The field of signal processing offers a solution to this problem in the form of a device called a lattice filter, which decorrelates signals in stages, sequentially adding longer and longer delays [14, 15, 16, 17]. Motivated by the cascade structure of visual systems [18], we propose to model decorrelation in them by lattice filters. Naturally, visual systems are more complex than lattice filters and perform many other operations. However, we show that the lattice filter model explains several existing observations in vertebrate and invertebrate visual systems and makes testable predictions. Therefore, we believe that lattice filters provide a convenient abstraction for modeling temporal aspects of visual processing. This paper is organized as follows. First, we briefly summarize relevant results from linear prediction theory. Second, we explain the operation of the lattice filter in discrete and continuous time. Third, we compare lattice filter predictions with physiological measurements. 1 Linear prediction theory Despite the non-linear nature of neurons and synapses, the operation of some neural circuits in vertebrates [19] and invertebrates [20] can be described by a linear systems theory. The advantage of linear systems is that optimal circuit parameters may be obtained analytically and the results are often intuitively clear. Perhaps not surprisingly, the field of signal processing relies heavily on the linear prediction theory, offering a convenient framework [15, 16, 17]. Below, we summarize the results from linear prediction that will be used to explain the operation of the lattice filter. Consider a scalar sequence y = {yt } where time t = 1, . . . , n. Suppose that yt at each time point depends on side information provided by vector zt . Our goal is to generate a series of linear predictions, yt from the vector zt , yt = w · zt . We define a prediction error as: ˆ ˆ et = yt − yt = yt − w · zt ˆ (1) and look for values of w that minimize mean squared error: e2 = 1 nt e2 = t t 1 nt (yt − w · zt )2 . (2) t The weight vector, w is optimal for prediction of sequence y from sequence z if and only if the prediction error sequence e = y − w · z is orthogonal to each component of vector z: ez = 0. (3) When the whole series y is given in advance, i.e. in the offline setting, these so-called normal equations can be solved for w, for example, by Gaussian elimination [21]. However, in signal processing and neuroscience applications, another setting called online is more relevant: At every time step t, prediction yt must be made using only current values of zt and w. Furthermore, after a ˆ prediction is made, w is updated based on the prediction yt and observed yt , zt . ˆ In the online setting, an algorithm called stochastic gradient descent is often used, where, at each time step, w is updated in the direction of negative gradient of e2 : t w →w−η w (yt − w · zt ) 2 . (4) This leads to the following weight update, known as least mean square (LMS) [15], for predicting sequence y from sequence z: w → w + ηet zt , (5) where η is the learning rate. The value of η represents the relative influence of more recent observations compared to more distant ones. The larger the learning rate the faster the system adapts to recent observations and less past it remembers. In this paper, we are interested in predicting a current value xt of sequence x from its past values xt−1 , . . . , xt−k restricted by the prediction order k > 0: xt = wk · (xt−1 , . . . , xt−k )T . ˆ 2 (6) This problem is a special case of the online linear prediction framework above, where yt = xt , zt = (xt−1 , . . . , xt−k )T . Then the gradient update is given by: w → wk + ηet (xt−1 , . . . , xt−k )T . (7) While the LMS algorithm can find the weights that optimize linear prediction (6), the filter wk has a long temporal extent making it difficult to implement with neurons and synapses. 2 Lattice filters One way to generate long receptive fields in circuits of biological neurons is to use a cascade architecture, known as the lattice filter, which calculates optimal linear predictions for temporal sequences and transmits prediction errors [14, 15, 16, 17]. In this section, we explain the operation of a discrete-time lattice filter, then adapt it to continuous-time operation. 2.1 Discrete-time implementation The first stage of the lattice filter, Figure 1, calculates the error of the first order optimal prediction (i.e. only using the preceding element of the sequence), the second stage uses the output of the first stage and calculates the error of the second order optimal prediction (i.e. using only two previous values) etc. To make such stage-wise error computations possible the lattice filter calculates at every stage not only the error of optimal prediction of xt from past values xt−1 , . . . , xt−k , called forward error, ftk = xt − wk · (xt−1 , . . . , xt−k )T , (8) but, perhaps non-intuitively, also the error of optimal prediction of a past value xt−k from the more recent values xt−k+1 , . . . , xt , called backward error: bk = xt−k − w k · (xt−k+1 , . . . , xt )T , t k where w and w k (9) are the weights of the optimal prediction. For example, the first stage of the filter calculates the forward error ft1 of optimal prediction of xt from xt−1 : ft1 = xt − u1 xt−1 as well as the backward error b1 of optimal prediction of xt−1 from t xt : b1 = xt−1 − v 1 xt , Figure 1. Here, we assume that coefficients u1 and v 1 that give optimal linear t prediction are known and return to learning them below. Each following stage of the lattice filter performs a stereotypic operation on its inputs, Figure 1. The k-th stage (k > 1) receives forward, ftk−1 , and backward, bk−1 , errors from the previous stage, t delays backward error by one time step and computes a forward error: ftk = ftk−1 − uk bk−1 t−1 (10) of the optimal linear prediction of ftk−1 from bk−1 . In addition, each stage computes a backward t−1 error k−1 k bt = bt−1 − v k ftk−1 (11) of the optimal linear prediction of bk−1 from ftk−1 . t−1 As can be seen in Figure 1, the lattice filter contains forward prediction error (top) and backward prediction error (bottom) branches, which interact at every stage via cross-links. Operation of the lattice filter can be characterized by the linear filters acting on the input, x, to compute forward or backward errors of consecutive order, so called prediction-error filters (blue bars in Figure 1). Because of delays in the backward error branch the temporal extent of the filters grows from stage to stage. In the next section, we will argue that prediction-error filters correspond to the measurements of temporal receptive fields in neurons. For detailed comparison with physiological measurements we will use the result that, for bi-phasic prediction-error filters, such as the ones in Figure 1, the first bar of the forward prediction-error filter has larger weight, by absolute value, than the combined weights of the remaining coefficients of the corresponding filter. Similarly, in backward predictionerror filters, the last bar has greater weight than the rest of them combined. This fact arises from the observation that forward prediction-error filters are minimum phase, while backward predictionerror filters are maximum phase [16, 17]. 3 Figure 1: Discrete-time lattice filter performs stage-wise computation of forward and backward prediction errors. In the first stage, the optimal prediction of xt from xt−1 is computed by delaying the input by one time step and multiplying it by u1 . The upper summation unit subtracts the predicted xt from the actual value and outputs prediction error ft1 . Similarly, the optimal prediction of xt−1 from xt is computed by multiplying the input by v 1 . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error b1 . In each following stage t k, the optimal prediction of ftk−1 from bk−1 is computed by delaying bk−1 by one time step and t t multiplying it by uk . The upper summation unit subtracts the prediction from the actual ftk−1 and outputs prediction error ftk . Similarly, the optimal prediction of bk−1 from ftk−1 is computed by t−1 multiplying it by uk . The lower summation unit subtracts the optimal prediction from the actual value and outputs backward error bk . Black connections have unitary weights and red connections t have learnable negative weights. One can view forward and backward error calculations as applications of so-called prediction-error filters (blue) to the input sequence. Note that the temporal extent of the filters gets longer from stage to stage. Next, we derive a learning rule for finding optimal coefficients u and v in the online setting. The uk is used for predicting ftk−1 from bk−1 to obtain error ftk . By substituting yt = ftk−1 , zt = bk−1 and t−1 t−1 et = ftk into (5) the update of uk becomes uk → uk + ηftk bk−1 . t−1 (12) Similarly, v k is updated by v k → v k + ηbk ftk−1 . (13) t Interestingly, the updates of the weights are given by the product of the activities of outgoing and incoming nodes of the corresponding cross-links. Such updates are known as Hebbian learning rules thought to be used by biological neurons [22, 23]. Finally, we give a simple proof that, in the offline setting when the entire sequence x is known, f k and bk , given by equations (10, 11), are indeed errors of optimal k-th order linear prediction. Let D be one step time delay operator (Dx)t = xt−1 . The induction statement at k is that f k and bk are k-th order forward and backward errors of optimal linear prediction which is equivalent to f k and bk k k being of the form f k = x−w1 Dx−. . .−wk Dk x and bk = Dk x−w1k Dk−1 x−. . .−wkk x and, from k i normal equations (3), satisfying f D x = 0 and Dbk Di x = bk Di−1 x = 0 for i = 1, . . . , k. That this is true for k = 1 directly follows from the definition of f 1 and b1 . Now we assume that this is true for k − 1 ≥ 1 and show it is true for k. It is easy to see from the forms of f k−1 and bk−1 k k and from f k = f k−1 − uk Dbk−1 that f k has the correct form f k = x − w1 Dx − . . . − wk Dk x. k i k−1 k k−1 Regarding orthogonality for i = 1, . . . , k − 1 we have f D x = (f − u Db )Di x = f k−1 Di x − uk (Dbk−1 )Di x = 0 using the induction assumptions of orhogonality at k − 1. For the remaining i = k we note that f k is the error of the optimal linear prediction of f k−1 from Dbk−1 k−1 and therefore 0 = f k Dbk−1 = f k (Dk x − w1k−1 Dk−1 x − . . . + wk−1 Dx) = f k Dk x as desired. The bk case can be proven similarly. 2.2 Continuous-time implementation The last hurdle remaining for modeling neuronal circuits which operate in continuous time with a lattice filter is its discrete-time operation. To obtain a continuous-time implementation of the lattice 4 filter we cannot simply take the time step size to zero as prediction-error filters would become infinitesimally short. Here, we adapt the discrete-time lattice filter to continous-time operation in two steps. First, we introduce a discrete-time Laguerre lattice filter [24, 17] which uses Laguerre polynomials rather than the shift operator to generate its basis functions, Figure 2. The input signal passes through a leaky integrator whose leakage constant α defines a time-scale distinct from the time step (14). A delay, D, at every stage is replaced by an all-pass filter, L, (15) with the same constant α, which preserves the magnitude of every Fourier component of the input but shifts its phase in a frequency dependent manner. Such all-pass filter reduces to a single time-step delay when α = 0. The optimality of a general discrete-time Laguerre lattice filter can be proven similarly to that for the discrete-time filter, simply by replacing operator D with L in the proof of section 2.1. Figure 2: Continuous-time lattice filter using Laguerre polynomials. Compared to the discretetime version, it contains a leaky integrator, L0 ,(16) and replaces delays with all-pass filters, L, (17). Second, we obtain a continuous-time formulation of the lattice filter by replacing t − 1 → t − δt, defining the inverse time scale γ = (1 − α)/δt and taking the limit δt → 0 while keeping γ fixed. As a result L0 and L are given by: Discrete time L0 (x)t L(x)t Continuous time = αL0 (x)t−1 + xt (14) = α(L(x)t−1 − xt ) + xt−1 (15) dL0 (x)/dt = −γL0 (x) + x L(x) = x − 2γL0 (x) (16) (17) Representative impulse responses of the continuous Laguerre filter are shown in Figure 2. Note that, similarly to the discrete-time case, the area under the first (peak) phase is greater than the area under the second (rebound) phase in the forward branch and the opposite is true in the backward branch. Moreover, the temporal extent of the rebound is greater than that of the peak not just in the forward branch like in the basic discrete-time implementation but also in the backward branch. As will be seen in the next section, these predictions are confirmed by physiological recordings. 3 Experimental evidence for the lattice filter in visual pathways In this section we demonstrate that physiological measurements from visual pathways in vertebrates and invertebrates are consistent with the predictions of the lattice filter model. For the purpose of modeling visual pathways, we identify summation units of the lattice filter with neurons and propose that neural activity represents forward and backward errors. In the fly visual pathway neuronal activity is represented by continuously varying graded potentials. In the vertebrate visual system, all neurons starting with ganglion cells are spiking and we identify their firing rate with the activity in the lattice filter. 3.1 Mammalian visual pathway In mammals, visual processing is performed in stages. In the retina, photoreceptors synapse onto bipolar cells, which in turn synapse onto retinal ganglion cells (RGCs). RGCs send axons to the LGN, where they synapse onto LGN relay neurons projecting to the primary visual cortex, V1. In addition to this feedforward pathway, at each stage there are local circuits involving (usually inhibitory) inter-neurons such as horizontal and amacrine cells in the retina. Neurons of each class 5 come in many types, which differ in their connectivity, morphology and physiological response. The bewildering complexity of these circuits has posed a major challenge to visual neuroscience. Alonso et al. • Connections between LGN and Cortex J. Neurosci., June 1, 2001, 21(11):4002–4015 4009 Temporal Filter 1 0.5 0 -0.5 -1 RGC LGN 0 100 Time (ms) 200 Figure 7. Distribution of geniculate cells and simple cells with respect to the timing of their responses. The distribution of three parameters derived from impulse responses of geniculate and cortical neurons is shown. A, Peak time. B, Zero-crossing time. C, Rebound index. Peak time is the time with the strongest response in the first phase of the impulse response. Zero-crossing time is the time between the first and second phases. Rebound index is the area of the impulse response after the zero crossing divided by the area before the zero crossing. Only impulse responses with good signal to noise were included (Ͼ5 SD above baseline; n ϭ 169). Figure 3: Electrophysiologically measured temporal receptive fields get progressively longer along the cat visual pathway. Left: A cat LGN cell (red) has a longer receptive field than a corresponding RGC cell (blue) (adapted from [12] which also reports population data). Right (A,B): Extent of the temporal receptive fields of simple cells in cat V1 is greater than that of corresponding LGN cells as quantified by the peak (A) and zero-crossing (B) times. Right (C): In the temporal receptive fields of cat LGN and V1 cells the peak can be stronger or weaker than the rebound (adapted from [25]). simple cells and geniculate cells differed for all temporal parameters measured, there was considerable overlap between the distributions (Fig. 7). This overlap raises the following question: does connectivity depend on how well geniculate and cortical responses are matched with respect to time? For instance, do simple cells with fast subregions (early times to peak and early zero crossings) receive input mostly from geniculate cells with fast centers? Figure 8 illustrates the visual responses from a geniculate cell and a simple cell that were monosynaptically connected. A strong positive peak was observed in the correlogram (shown with a 10 msec time window to emphasize its short latency and fast rise time). In this case, an ON central subregion was well overlapped with an ON geniculate center (precisely at the peak of the subregion). Moreover, the timings of the visual responses from the overlapped subregion and the geniculate center were very similar (same onset, ϳ0 –25 msec; same peak, ϳ25–50 msec). It is worth noting that the two central subregions of the simple cell were faster and stronger than the two lateral subregions. The responses of the central subregions matched the timing of the geniculate center. In contrast, the timing of the lateral subregions resembled more closely the timing of the geniculate surround (both peaked at 25–50 msec). Unlike the example shown in Figure 8, a considerable number of geniculocortical pairs produced responses with different timing. For example, Figure 9 illustrates a case in which a geniculate center fully overlapped a strong simple-cell subregion of the same sign, but with slower timing (LGN onset, ϳ0 –25 msec; peak, ϳ25–50 msec; simple-cell onset, ϳ25–50 msec; peak, ϳ50 –75 msec). The cross-correlogram between this pair of neurons was flat, which indicates the absence of a monosynaptic connection (Fig. 9, top right). To examine the role of timing in geniculocortical connectivity, we measured the response time course from all cell pairs that met two criteria. First, the geniculate center overlapped a simple-cell subregion of the same sign (n ϭ 104). Second, the geniculate center overlapped the cortical subregion in a near-optimal position (relative overlap Ͼ 50%, n ϭ 47; see Materials and Methods; Fig. 5A). All these cell pairs had a high probability of being monosynaptically connected because of the precise match in receptive-field position and sign (31 of 47 were connected). The distributions of peak time, zero-crossing time, and rebound index from these cell pairs were very similar to the distributions from the entire sample (Fig. 7; see also Fig. 10 legend). The selected cell pairs included both presumed directional (predicted DI Ͼ 0.3, see Materials and Methods; 12/20 connected) and nondirectional (19/27 connected) simple cells. Most geniculate cells had small receptive fields (less than two simple-cell subregion widths; see Receptive-field sign), although five cells with larger receptive fields were also included (three connected). From the 47 cell pairs used in this analysis, those with similar response time courses had a higher probability of being connected (Fig. 10). In particular, cell pairs that had both similar peak time and zero-crossing time were all connected (n ϭ 12; Fig. 10 A). Directionally selective simple cells were included in all timing groups. For example, in Figure 10 A there were four, five, two, and one directionally selective cells in the time groups Ͻ20, 40, 60, and Ͼ60 msec, respectively. Similar results were obtained if we restricted our sample to geniculate centers overlapped with the dominant subregion of the simple cell (n ϭ 31). Interestingly, the efficacy and contributions of the connections seemed to depend little on the relative timing of the visual responses (Fig. 10, right). Although our sample of them was quite small, lagged cells are of considerable interest and therefore deserve comment. We recorded from 13 potentially lagged LGN cells whose centers were superimposed with a simple-cell subregion (eight with rebound indices between 1.2 and 1.5; five with rebound indices Ͼ1.9). Only seven of these pairs could be used for timing comparisons (in one pair the baseline of the correlogram had insufficient spikes; in three pairs the geniculate receptive fields were Here, we point out several experimental observations related to temporal processing in the visual system consistent with the lattice filter model. First, measurements of temporal receptive fields demonstrate that they get progressively longer at each consecutive stage: i) LGN neurons have longer receptive fields than corresponding pre-synaptic ganglion cells [12], Figure 3left; ii) simple cells in V1 have longer receptive fields than corresponding pre-synaptic LGN neurons [25], Figure 3rightA,B. These observation are consistent with the progressively greater temporal extent of the prediction-error filters (blue plots in Figure 2). Second, the weight of the peak (integrated area under the curve) may be either greater or less than that of the rebound both in LGN relay cells [26] and simple cells of V1 [25], Figure 3right(C). Neurons with peak weight exceeding that of rebound are often referred to as non-lagged while the others are known as lagged found both in cat [27, 28, 29] and monkey [30]. The reason for this becomes clear from the response to a step stimulus, Figure 4(top). By comparing experimentally measured receptive fields with those of the continuous lattice filter, Figure 4, we identify non-lagged neurons with the forward branch and lagged neurons with the backward branch. Another way to characterize step-stimulus response is whether the sign of the transient is the same (non-lagged) or different (lagged) relative to sustained response. Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [30]. This is consistent with backward pathway circuit of the Laguerre lattice filter, Figure 2, being more complex then that of the forward path (which results in different transfer function). ” (or replacing ”more complex” with ”different”) Third, measurements of cross-correlation between RGCs and LGN cell spikes in lagged and nonlagged neurons reveals a difference of the transfer function indicative of the difference in underlying circuitry [31]. This is consistent with the backward branch circuit of the Laguerre lattice filter, Figure 2, being different then that of the forward branch (which results in different transfer function). In particular, a combination of different glutamate receptors such as AMPA and NMDA, as well as GABA receptors are thought to be responsible for observed responses in lagged cells [32]. However, further investigation of the corresponding circuitry, perhaps using connectomics technology, is desirable. Fourth, the cross-link weights of the lattice filter can be learned using Hebbian rules, (12,13) which are biologically plausible [22, 23]. Interestingly, if these weights are learned sequentially, starting from the first stage, they do not need to be re-learned when additional stages are added or learned. This property maps naturally on the fact that in the course of mammalian development the visual pathway matures in a stage-wise fashion - starting with the retina, then LGN, then V1 - and implying that the more peripheral structures do not need to adapt to the maturation of the downstream ones. 6 Figure 4: Comparison of electrophysiologically measured responses of cat LGN cells with the continuous-time lattice filter model. Top: Experimentally measured temporal receptive fields and step-stimulus responses of LGN cells (adapted from [26]). Bottom: Typical examples of responses in the continuous-time lattice filter model. Lattice filter coefficients were u1 = v 1 = 0.4, u2 = v 2 = 0.2 and 1/γ = 50ms to model the non-lagged cell and u1 = v 1 = u2 = v 2 = 0.2 and 1/γ = 60ms to model the lagged cell. To model photoreceptor contribution to the responses, an additional leaky integrator L0 was added to the circuit of Figure 2. While Hebbian rules are biologically plausible, one may get an impression from Figure 2 that they must apply to inhibitory cross-links. We point out that this circuit is meant to represent only the computation performed rather than the specific implementation in terms of neurons. As the same linear computation can be performed by circuits with a different arrangement of the same components, there are multiple implementations of the lattice filter. For example, activity of non-lagged OFF cells may be seen as representing minus forward error. Then the cross-links between the non-lagged OFF pathway and the lagged ON pathway would be excitatory. In general, classification of cells into lagged and non-lagged seems independent of their ON/OFF and X/Y classification [31, 28, 29], but see[33]. 3.2 Insect visual pathway In insects, two cell types, L1 and L2, both post-synaptic to photoreceptors play an important role in visual processing. Physiological responses of L1 and L2 indicate that they decorrelate visual signals by subtracting their predictable parts. In fact, receptive fields of these neurons were used as the first examples of predictive coding in neuroscience [6]. Yet, as the numbers of synapses from photoreceptors to L1 and L2 are the same [34] and their physiological properties are similar, it has been a mystery why insects, have not just one but a pair of such seemingly redundant neurons per facet. Previously, it was suggested that L1 and L2 provide inputs to the two pathways that map onto ON and OFF pathways in the vertebrate retina [35, 36]. Here, we put forward a hypothesis that the role of L1 and L2 in visual processing is similar to that of the two branches of the lattice filter. We do not incorporate the ON/OFF distinction in the effectively linear lattice filter model but anticipate that such combined description will materialize in the future. As was argued in Section 2, in forward prediction-error filters, the peak has greater weight than the rebound, while in backward prediction-error filters the opposite is true. Such difference implies that in response to a step-stimulus the signs of sustained responses compared to initial transients are different between the branches. Indeed, Ca2+ imaging shows that responses of L1 and L2 to step-stimulus are different as predicted by the lattice filter model [35], Figure 5b. Interestingly, the activity of L1 seems to represent minus forward error and L2 - plus backward error, suggesting that the lattice filter cross-links are excitatory. To summarize, the predictions of the lattice filter model seem to be consistent with the physiological measurements in the fly visual system and may help understand its operation. 7 Stimulus 1 0.5 0 0 5 10 15 20 5 10 15 20 5 10 time 15 20 − Forward Error 1 0 −1 0 Backward Error 1 0 −1 0 Figure 5: Response of the lattice filter and fruit fly LMCs to a step-stimulus. Left: Responses of the first order discrete-time lattice filter to a step stimulus. Right: Responses of fly L1 and L2 cells to a moving step stimulus (adapted from [35]). Predicted and the experimentally measured responses have qualitatively the same shape: a transient followed by sustained response, which has the same sign for the forward error and L1 and the opposite sign for the backward error and L2. 4 Discussion Motivated by the cascade structure of the visual pathway, we propose to model its operation with the lattice filter. We demonstrate that the predictions of the continuous-time lattice filter model are consistent with the course of neural development and the physiological measurement in the LGN, V1 of cat and monkey, as well as fly LMC neurons. Therefore, lattice filters may offer a useful abstraction for understanding aspects of temporal processing in visual systems of vertebrates and invertebrates. Previously, [11] proposed that lagged and non-lagged cells could be a result of rectification by spiking neurons. Although we agree with [11] that LGN performs temporal decorrelation, our explanation does not rely on non-linear processing but rather on the cascade architecture and, hence, is fundamentally different. Our model generates the following predictions that are not obvious in [11]: i) Not only are LGN receptive fields longer than RGC but also V1 receptive fields are longer than LGN; ii) Even a linear model can generate a difference in the peak/rebound ratio; iii) The circuit from RGC to LGN should be different for lagged and non-lagged cells consistent with [31]; iv) The lattice filter circuit can self-organize using Hebbian rules, which gives a mechanistic explanation of receptive fields beyond the normative framework of [11]. In light of the redundancy reduction arguments given in the introduction, we note that, if the only goal of the system were to compress incoming signals using a given number of lattice filter stages, then after the compression is peformed only one kind of prediction errors, forward or backward needs to be transmitted. Therefore, having two channels, in the absence of noise, may seem redundant. However, transmitting both forward and backward errors gives one the flexibility to continue decorrelation further by adding stages performing relatively simple operations. We are grateful to D.A. Butts, E. Callaway, M. Carandini, D.A. Clark, J.A. Hirsch, T. Hu, S.B. Laughlin, D.N. Mastronarde, R.C. Reid, H. Rouault, A. Saul, L. Scheffer, F.T. Sommer, X. Wang for helpful discussions. References [1] F. Rieke, D. Warland, R.R. van Steveninck, and W. Bialek. Spikes: exploring the neural code. MIT press, 1999. [2] S.B. Laughlin. Matching coding, circuits, cells, and molecules to signals: general principles of retinal design in the fly’s eye. Progress in retinal and eye research, 13(1):165–196, 1994. [3] F. Attneave. Some informational aspects of visual perception. Psychological review, 61(3):183, 1954. [4] H. Barlow. Redundancy reduction revisited. Network: Comp in Neural Systems, 12(3):241–253, 2001. [5] R.M. Gray. Linear Predictive Coding and the Internet Protocol. Now Publishers, 2010. [6] MV Srinivasan, SB Laughlin, and A. Dubs. Predictive coding: a fresh view of inhibition in the retina. Proceedings of the Royal Society of London. Series B. Biological Sciences, 216(1205):427–459, 1982. [7] T. Hosoya, S.A. Baccus, and M. Meister. Dynamic predictive coding by the retina. Nature, 436:71, 2005. 8 [8] HK Hartline, H.G. Wagner, and EF MacNichol Jr. The peripheral origin of nervous activity in the visual system. Studies on excitation and inhibition in the retina: a collection of papers from the laboratories of H. Keffer Hartline, page 99, 1974. [9] N.A. Lesica, J. Jin, C. Weng, C.I. Yeh, D.A. Butts, G.B. Stanley, and J.M. Alonso. Adaptation to stimulus contrast and correlations during natural visual stimulation. Neuron, 55(3):479–491, 2007. [10] Y. Dan, J.J. Atick, and R.C. Reid. Efficient coding of natural scenes in the lateral geniculate nucleus: experimental test of a computational theory. The Journal of Neuroscience, 16(10):3351–3362, 1996. [11] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6(3):345–358, 1995. [12] X. Wang, J.A. Hirsch, and F.T. Sommer. Recoding of sensory information across the retinothalamic synapse. The Journal of Neuroscience, 30(41):13567–13577, 2010. [13] C. Koch. Biophysics of computation: information processing in single neurons. Oxford Univ Press, 2005. [14] F. Itakura and S. Saito. On the optimum quantization of feature parameters in the parcor speech synthesizer. In Conference Record, 1972 International Conference on Speech Communication and Processing, Boston, MA, pages 434–437, 1972. [15] B. Widrow and S.D. Stearns. Adaptive signal processing. Prentice-Hall, Inc. Englewood Cliffs, NJ, 1985. [16] S. Haykin. Adaptive filter theory. Prentice-Hall, Englewood-Cliffs, NJ, 2003. [17] A.H. Sayed. Fundamentals of adaptive filtering. Wiley-IEEE Press, 2003. [18] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex, 1(1):1–47, 1991. [19] X. Wang, F.T. Sommer, and J.A. Hirsch. Inhibitory circuits for visual processing in thalamus. Current Opinion in Neurobiology, 2011. [20] SB Laughlin, J. Howard, and B. Blakeslee. Synaptic limitations to contrast coding in the retina of the blowfly calliphora. Proceedings of the Royal society of London. Series B. Biological sciences, 231(1265):437–467, 1987. [21] D.C. Lay. Linear Algebra and Its Applications. Addison-Wesley/Longman, New York/London, 2000. [22] D.O. Hebb. The organization of behavior: A neuropsychological theory. Lawrence Erlbaum, 2002. [23] O. Paulsen and T.J. Sejnowski. Natural patterns of activity and long-term synaptic plasticity. Current opinion in neurobiology, 10(2):172–180, 2000. [24] Z. Fejzo and H. Lev-Ari. Adaptive laguerre-lattice filters. Signal Processing, IEEE Transactions on, 45(12):3006–3016, 1997. [25] J.M. Alonso, W.M. Usrey, and R.C. Reid. Rules of connectivity between geniculate cells and simple cells in cat primary visual cortex. The Journal of Neuroscience, 21(11):4002–4015, 2001. [26] D. Cai, G.C. Deangelis, and R.D. Freeman. Spatiotemporal receptive field organization in the lateral geniculate nucleus of cats and kittens. Journal of Neurophysiology, 78(2):1045–1061, 1997. [27] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. i. receptive-field properties and classification of cells. Journal of Neurophysiology, 57(2):357–380, 1987. [28] J. Wolfe and L.A. Palmer. Temporal diversity in the lateral geniculate nucleus of cat. Visual neuroscience, 15(04):653–675, 1998. [29] AB Saul and AL Humphrey. Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. Journal of Neurophysiology, 64(1):206–224, 1990. [30] A.B. Saul. Lagged cells in alert monkey lateral geniculate nucleus. Visual neurosci, 25:647–659, 2008. [31] D.N. Mastronarde. Two classes of single-input x-cells in cat lateral geniculate nucleus. ii. retinal inputs and the generation of receptive-field properties. Journal of Neurophysiology, 57(2):381–413, 1987. [32] P. Heggelund and E. Hartveit. Neurotransmitter receptors mediating excitatory input to cells in the cat lateral geniculate nucleus. i. lagged cells. Journal of neurophysiology, 63(6):1347–1360, 1990. [33] J. Jin, Y. Wang, R. Lashgari, H.A. Swadlow, and J.M. Alonso. Faster thalamocortical processing for dark than light visual targets. The Journal of Neuroscience, 31(48):17471–17479, 2011. [34] M. Rivera-Alba, S.N. Vitaladevuni, Y. Mischenko, Z. Lu, S. Takemura, L. Scheffer, I.A. Meinertzhagen, D.B. Chklovskii, and G.G. de Polavieja. Wiring economy and volume exclusion determine neuronal placement in the drosophila brain. Current Biology, 21(23):2000–5, 2011. [35] D.A. Clark, L. Bursztyn, M.A. Horowitz, M.J. Schnitzer, and T.R. Clandinin. Defining the computational structure of the motion detector in drosophila. Neuron, 70(6):1165–1177, 2011. [36] M. Joesch, B. Schnell, S.V. Raghu, D.F. Reiff, and A. Borst. On and off pathways in drosophila motion vision. Nature, 468(7321):300–304, 2010. 9

2 0.89159417 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

Author: James Hensman, Magnus Rattray, Neil D. Lawrence

Abstract: We present a general method for deriving collapsed variational inference algorithms for probabilistic models in the conjugate exponential family. Our method unifies many existing approaches to collapsed variational inference. Our collapsed variational inference leads to a new lower bound on the marginal likelihood. We exploit the information geometry of the bound to derive much faster optimization methods based on conjugate gradients for these models. Our approach is very general and is easily applied to any model where the mean field update equations have been derived. Empirically we show significant speed-ups for probabilistic inference using our bound. 1

3 0.89108789 309 nips-2012-Semi-supervised Eigenvectors for Locally-biased Learning

Author: Toke Hansen, Michael W. Mahoney

Abstract: In many applications, one has side information, e.g., labels that are provided in a semi-supervised manner, about a specific target region of a large data set, and one wants to perform machine learning and data analysis tasks “nearby” that pre-specified target region. Locally-biased problems of this sort are particularly challenging for popular eigenvector-based machine learning and data analysis tools. At root, the reason is that eigenvectors are inherently global quantities. In this paper, we address this issue by providing a methodology to construct semi-supervised eigenvectors of a graph Laplacian, and we illustrate how these locally-biased eigenvectors can be used to perform locally-biased machine learning. These semi-supervised eigenvectors capture successively-orthogonalized directions of maximum variance, conditioned on being well-correlated with an input seed set of nodes that is assumed to be provided in a semi-supervised manner. We also provide several empirical examples demonstrating how these semi-supervised eigenvectors can be used to perform locally-biased learning. 1

same-paper 4 0.84099412 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

Author: Nishant Mehta, Dongryeol Lee, Alexander G. Gray

Abstract: Since its inception, the modus operandi of multi-task learning (MTL) has been to minimize the task-wise mean of the empirical risks. We introduce a generalized loss-compositional paradigm for MTL that includes a spectrum of formulations as a subfamily. One endpoint of this spectrum is minimax MTL: a new MTL formulation that minimizes the maximum of the tasks’ empirical risks. Via a certain relaxation of minimax MTL, we obtain a continuum of MTL formulations spanning minimax MTL and classical MTL. The full paradigm itself is loss-compositional, operating on the vector of empirical risks. It incorporates minimax MTL, its relaxations, and many new MTL formulations as special cases. We show theoretically that minimax MTL tends to avoid worst case outcomes on newly drawn test tasks in the learning to learn (LTL) test setting. The results of several MTL formulations on synthetic and real problems in the MTL and LTL test settings are encouraging. 1

5 0.83435094 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

Author: Kenji Fukumizu, Chenlei Leng

Abstract: We propose a novel kernel approach to dimension reduction for supervised learning: feature extraction and variable selection; the former constructs a small number of features from predictors, and the latter finds a subset of predictors. First, a method of linear feature extraction is proposed using the gradient of regression function, based on the recent development of the kernel method. In comparison with other existing methods, the proposed one has wide applicability without strong assumptions on the regressor or type of variables, and uses computationally simple eigendecomposition, thus applicable to large data sets. Second, in combination of a sparse penalty, the method is extended to variable selection, following the approach by Chen et al. [2]. Experimental results show that the proposed methods successfully find effective features and variables without parametric models. 1

6 0.83176875 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

7 0.77256155 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

8 0.77167904 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

9 0.77116889 227 nips-2012-Multiclass Learning with Simplex Coding

10 0.77004802 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

11 0.76990503 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

12 0.7698068 187 nips-2012-Learning curves for multi-task Gaussian process regression

13 0.7694521 358 nips-2012-Value Pursuit Iteration

14 0.7687782 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

15 0.76855379 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

16 0.7682215 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

17 0.7681064 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

18 0.76704669 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

19 0.7668435 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

20 0.76616597 114 nips-2012-Efficient coding provides a direct link between prior and likelihood in perceptual Bayesian inference