nips nips2006 nips2006-152 knowledge-graph by maker-knowledge-mining

152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections


Source: pdf

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. [sent-7, score-0.605]

2 We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. [sent-8, score-0.481]

3 These constraints are tied together through a single slack parameter. [sent-9, score-0.166]

4 We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. [sent-10, score-0.217]

5 We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. [sent-11, score-0.162]

6 We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. [sent-12, score-0.395]

7 We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. [sent-13, score-0.175]

8 Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. [sent-14, score-0.258]

9 1 Introduction In this paper we discuss and analyze a framework for devising efficient online learning algorithms for complex prediction problems such as multiclass categorization. [sent-15, score-0.48]

10 In the settings we cover, a complex prediction problem is cast as the task of simultaneously coping with multiple simplified subproblems which are nonetheless tied together. [sent-16, score-0.418]

11 Our simultaneous projection approach is based on the fact that we can retrospectively (after making a prediction) cast the problem as the task of making k − 1 binary decisions each of which involves the correct label and one of the competing labels. [sent-18, score-0.441]

12 Such an approach was used for instance by Weston and Watkins [1] for batch learning of multiclass support vector machines. [sent-22, score-0.179]

13 While this approach adheres to the original structure of the problem, the resulting update mechanism is by construction sub-optimal as it overlooks almost all of the constraints imposed by the complex prediction problem. [sent-25, score-0.262]

14 The third approach for dealing with complex problems is to tailor a specific efficient solution for the problem at hand. [sent-27, score-0.225]

15 While this approach yielded efficient learning algorithms for multiclass categorization problems [2] and aesthetic solutions for structured output problems [3, 4], devising these algorithms required dedicated efforts. [sent-28, score-0.303]

16 In contrast to previously studied approaches, we propose a simple, general, and efficient framework for online learning of a wide variety of complex problems. [sent-30, score-0.19]

17 We do so by casting the online update task as an optimization problem in which the newly devised hypothesis is required to be similar to the current hypothesis while attaining a small loss on multiple binary prediction problems. [sent-31, score-0.605]

18 Casting the online learning task as a sequence of instantaneous optimization problems was first suggested and analyzed by Kivinen and Warmuth [12] for binary classification and regression problems. [sent-32, score-0.342]

19 In our optimization-based approach, the complex decision problem is cast as an optimization problem that consists of multiple linear constraints each of which represents a simplified sub-problem. [sent-33, score-0.32]

20 These constraints are tied through a single slack variable whose role is to assess the overall prediction quality for the complex problem. [sent-34, score-0.353]

21 The end result is an approach whose time complexity and mistake bounds are equivalent to approaches which solely deal with the worst-violating constraint [9]. [sent-40, score-0.173]

22 In practice, though, the performance of the simultaneous projection framework is much better than single-constraint update schemes. [sent-41, score-0.265]

23 At trial t the algorithm receives a matrix $X^t$ of size $k_t \times n$, where each row of $X^t$ is an instance, and is required to make a prediction on the label associated with each instance. [sent-55, score-0.903]

24 We allow $\hat{y}^t_j$ to take any value in $\mathbb{R}$, where the actual label being predicted is $\mathrm{sign}(\hat{y}^t_j)$ and $|\hat{y}^t_j|$ is the confidence in the prediction. [sent-57, score-0.403]

25 After making a prediction $\hat{y}^t$ the algorithm receives the correct labels $y^t$, where $y^t_j \in \{-1, +1\}$ for all $j \in [k_t]$. [sent-58, score-1.129]

26 In this paper we assume that the predictions in each trial are formed by calculating the inner product of a weight vector $\omega^t \in \mathbb{R}^n$ with each instance in $X^t$, thus $\hat{y}^t = X^t \omega^t$. [sent-59, score-0.562]

27 Our goal is to perfectly predict the entire vector $y^t$. [sent-60, score-0.327]

28 We thus say that the vector $y^t$ was imperfectly predicted if there exists an outcome $j$ such that $y^t_j \neq \mathrm{sign}(\hat{y}^t_j)$. [sent-61, score-0.68]

29 That is, we suffer a unit loss on trial t if there exists $j$ such that $\mathrm{sign}(\hat{y}^t_j) \neq y^t_j$. [sent-62, score-0.803]

30 Directly minimizing this combinatorial error is a computationally difficult task. [sent-63, score-0.295]

31 Therefore, we use an adaptation of the hinge loss, defined as $\ell(\hat{y}^t, y^t) = \max_{j \in [k_t]} \left[ 1 - y^t_j \hat{y}^t_j \right]_+$, as a proxy for the combinatorial error. [sent-64, score-0.909]

32 The quantity $y^t_j \hat{y}^t_j$ is often referred to as the (signed) margin of the prediction and ties the correctness and the confidence of the prediction. [sent-65, score-0.746]

33 We use $\ell(\omega; (X^t, y^t))$ to denote $\ell(\hat{y}^t, y^t)$ where $\hat{y}^t = X^t \omega$. [sent-66, score-0.918]

34 We also denote the set of instances whose labels were predicted incorrectly by $\mathcal{M}^t = \{ j \mid \mathrm{sign}(\hat{y}^t_j) \neq y^t_j \}$, and similarly the set of instances whose hinge losses are greater than zero by $\Gamma^t = \{ j \mid [1 - y^t_j \hat{y}^t_j]_+ > 0 \}$. [sent-67, score-1.585]
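
To make these definitions concrete, the following is a minimal NumPy sketch (function and variable names are illustrative, not from the paper) that computes the predictions, the adapted hinge loss, and the sets $\mathcal{M}^t$ and $\Gamma^t$ for a single trial:

```python
import numpy as np

def trial_quantities(X_t, y_t, w_t):
    """Quantities of a single online trial (illustrative sketch).

    X_t : (k_t, n) matrix whose rows are the instances of trial t
    y_t : (k_t,) vector of labels in {-1, +1}
    w_t : (n,) current weight vector
    """
    y_hat = X_t @ w_t                          # predictions, one per instance
    margins = y_t * y_hat                      # signed margins y_j * y_hat_j
    hinge = np.maximum(0.0, 1.0 - margins)     # per-instance hinge losses
    loss = hinge.max()                         # adapted hinge loss: max over j in [k_t]
    M_t = np.where(np.sign(y_hat) != y_t)[0]   # incorrectly predicted instances
    Gamma_t = np.where(hinge > 0.0)[0]         # instances with positive hinge loss
    unit_loss = int(M_t.size > 0)              # combinatorial (0/1) loss of the trial
    return y_hat, loss, M_t, Gamma_t, unit_loss
```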

35 3 Derived Problems In this section we further explore the motivation for our problem setting by describing two different complex decision tasks and showing how they can be cast as special cases of our setting. [sent-68, score-0.18]

36 We also would like to note that our approach can be employed in other prediction problems (see Sec. [sent-69, score-0.163]

37 Multilabel Categorization In the multilabel categorization task each instance is associated with a set of relevant labels from the set [k]. [sent-71, score-0.361]

38 The multilabel categorization task can be cast as a special case of a ranking task in which the goal is to rank the relevant labels above the irrelevant ones. [sent-72, score-0.436]

39 A learner is required to build a vector ω that successfully ranks the labels according to their relevance, namely for each pair of classes (r, s) such that r is relevant while s is not, the class r should be ranked higher than the class s. [sent-78, score-0.168]

40 We say that a label ranking is imperfect if there exists any pair (r, s) which violates this requirement. [sent-80, score-0.181]

41 The loss associated with each such violation is [1 − (ω · φ(x, r) − ω · φ(x, s))]+ and the loss of the categorizer is defined as the maximum over the losses induced by the violated pairs. [sent-81, score-0.273]

42 In order to map the problem to our setting, we define a virtual instance for every pair (r, s) such that r is relevant and s is not. [sent-82, score-0.179]

43 The label associated with all of the instances is set to 1. [sent-84, score-0.175]

44 It is clear that an imperfect categorizer makes a prediction mistake on at least one of the instances, and that the losses defined by both problems are the same. [sent-85, score-0.488]
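
As a rough illustration of this reduction, the sketch below builds the virtual instances (it assumes a class-dependent feature map phi(x, r), which this summary only references through the pairwise loss above; all names are illustrative):

```python
import numpy as np

def multilabel_to_instances(phi, x, relevant, k):
    """Reduce a multilabel example to virtual binary instances (sketch).

    phi      : assumed class-dependent feature map, phi(x, r) -> (n,) array
    relevant : set of relevant labels, a subset of range(k)
    One virtual instance phi(x, r) - phi(x, s) is created for every relevant r
    and irrelevant s, all labeled +1, so the induced constraint
    w . (phi(x, r) - phi(x, s)) >= 1 asks r to be ranked above s.
    """
    rows, labels = [], []
    for r in relevant:
        for s in range(k):
            if s not in relevant:
                rows.append(phi(x, r) - phi(x, s))
                labels.append(1.0)
    return np.asarray(rows), np.asarray(labels)
```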

45 Figure 1: Illustration of the simultaneous projections algorithm: each instance casts a constraint on $\omega$ (e.g., $y^t_j(\omega \cdot x^t_j) \geq 1$) and each such constraint defines a halfspace of feasible solutions. [sent-86, score-1.616]

46 We project on each halfspace in parallel and the new vector is a weighted average of these projections. Ordinal Regression In the problem of ordinal regression, an instance $x$ is a vector of $n$ features that is associated with a target rank $y \in [k]$. [sent-87, score-0.474]

47 The value of $\omega \cdot x$ provides a score from which the prediction value can be defined as the smallest index $i$ for which $\omega \cdot x < b_i$, that is, $\hat{y} = \min\{ i \mid \omega \cdot x < b_i \}$. [sent-89, score-0.254]

48 In order to obtain a correct prediction, an ordinal regressor is required to ensure that $\omega \cdot x \geq b_i$ for all $i < y$ and that $\omega \cdot x < b_i$ for $i \geq y$. [sent-90, score-0.284]

49 It is considered a prediction mistake if any of these constraints is violated. [sent-91, score-0.301]

50 The label of the first y − 1 instances is 1, while the remaining k − y instances are labeled as −1. [sent-96, score-0.244]

51 Once we have learned an expanded vector in $\mathbb{R}^{n+k-1}$, the regressor $\omega$ is obtained by taking the first $n$ components of the expanded vector, and the thresholds $b_1, \ldots, b_{k-1}$ are obtained from the remaining $k-1$ components. [sent-97, score-0.208]

52 A prediction mistake of any of the instances corresponds to an incorrect rank in the original problem. [sent-101, score-0.39]
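
The expansion described above can be written down directly; the following sketch (with illustrative names, and assuming ranks in {1, ..., k}) builds the k-1 binary instances in $\mathbb{R}^{n+k-1}$ and implements the threshold-based prediction rule:

```python
import numpy as np

def ordinal_to_instances(x, y, k):
    """Map an ordinal example (x, y) to k-1 binary instances in R^{n+k-1}.

    Instance i (i = 1, ..., k-1) is (x, -e_i), so its inner product with the
    expanded vector (w, b_1, ..., b_{k-1}) equals w . x - b_i.  The first y-1
    instances are labeled +1 and the remaining k-y instances are labeled -1.
    """
    n = x.shape[0]
    rows = np.zeros((k - 1, n + k - 1))
    rows[:, :n] = x
    rows[:, n:] = -np.eye(k - 1)
    labels = np.where(np.arange(1, k) < y, 1.0, -1.0)
    return rows, labels

def ordinal_predict(x, w, b):
    """Predict the smallest i with w . x < b_i, or k if no threshold is crossed."""
    below = np.where(w @ x < b)[0]
    return int(below[0]) + 1 if below.size > 0 else b.shape[0] + 1
```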

53 4 Simultaneous Projection Algorithms Recall that on trial t the algorithm receives a matrix, $X^t$, of $k_t$ instances, and predicts $\hat{y}^t = X^t \omega^t$. [sent-102, score-0.959]

54 After performing its prediction, the algorithm receives the corresponding labels $y^t$. [sent-103, score-0.395]

55 Each such instance-label pair casts a constraint on $\omega^t$, namely $y^t_j\, \omega^t \cdot x^t_j \geq 1$. [sent-104, score-0.761]

56 Thus, we allow some of the constraints to remain violated by introducing a tradeoff between the change to $\omega^t$ and the loss attained on $(X^t, y^t)$. [sent-108, score-0.437]

57 Formally, we would like to set $\omega^{t+1}$ to be the solution of the following optimization problem, $\min_{\omega \in \mathbb{R}^n} \frac{1}{2}\|\omega - \omega^t\|^2 + C\, \ell(\omega; (X^t, y^t))$, where $C$ is a tradeoff parameter. [sent-109, score-0.451]

58 We denote the problem in Eq. (1) by $P^t$ and refer to it as the instantaneous primal problem to be solved on trial t. [sent-115, score-0.359]

59 The dual optimization problem of $P^t$ is the maximization problem
$$\max_{\alpha^t_1, \ldots, \alpha^t_{k_t}}\ \sum_{j=1}^{k_t} \alpha^t_j \;-\; \frac{1}{2}\Big\|\,\omega^t + \sum_{j=1}^{k_t} \alpha^t_j\, y^t_j\, x^t_j\,\Big\|^2 \quad \text{s.t.}\quad \sum_{j=1}^{k_t} \alpha^t_j \leq C,\ \ \forall j:\ \alpha^t_j \geq 0. \qquad (2)$$ [sent-116, score-0.843]

61 Each dual variable corresponds to a single constraint of the primal problem. [sent-121, score-0.339]

62 The minimizer of the primal problem is calculated from the optimal dual solution as follows: $\omega^{t+1} = \omega^t + \sum_{j=1}^{k_t} \alpha^t_j\, y^t_j\, x^t_j$. [sent-122, score-1.545]
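
For completeness, the update rule above can be traced back to the Lagrangian of the single-slack form of the instantaneous problem; this is a standard derivation sketch rather than a quotation of the paper:
$$\mathcal{L}(\omega, \xi, \alpha, \beta) \;=\; \tfrac{1}{2}\|\omega - \omega^t\|^2 + C\,\xi + \sum_{j=1}^{k_t} \alpha^t_j \big(1 - \xi - y^t_j\, \omega \cdot x^t_j\big) - \beta\, \xi .$$
Setting the gradient with respect to $\omega$ to zero gives $\omega = \omega^t + \sum_j \alpha^t_j y^t_j x^t_j$, the derivative with respect to $\xi$ gives $\sum_j \alpha^t_j + \beta = C$ and hence $\sum_j \alpha^t_j \leq C$, and substituting $\omega$ back recovers the dual objective of Eq. (2) up to the additive constant $\tfrac{1}{2}\|\omega^t\|^2$.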

63 Unfortunately, in the common case, where each $x^t_j$ is in an arbitrary orientation, there does not exist an analytic solution for the dual problem (Eq. (2)). [sent-123, score-0.696]

64 We tackle the problem by breaking it down into $k_t$ reduced problems, each of which focuses on a single dual variable. [sent-125, score-0.849]

65 Each reduced optimization problem amounts to the following problem:
$$\max_{\alpha^t_j}\ \alpha^t_j - \frac{1}{2}\big\|\omega^t + \alpha^t_j\, y^t_j\, x^t_j\big\|^2 \quad \text{s.t.}\quad 0 \leq \alpha^t_j \leq C. \qquad (3)$$ [sent-128, score-0.816]

66 We next obtain an exact or approximate solution for each reduced problem as if it were independent of the rest. [sent-131, score-0.158]

67 Since $\mu \in \Delta_{k_t}$, this yields a feasible solution to the dual problem defined in Eq. (2). [sent-133, score-0.442]

68 Finally, the algorithm uses the combined solution and sets $\omega^{t+1} = \omega^t + \sum_{j=1}^{k_t} \mu_j\, \alpha^t_j\, y^t_j\, x^t_j$. We next present three schemes to obtain a solution for the reduced problem (Eq. (3)). [sent-136, score-0.229]

69 Simultaneous Perceptron: The simplest of the update forms generalizes the famous Perceptron algorithm from [8] by setting $\alpha^t_j$ to $C$ if the $j$'th instance is incorrectly labeled, and to $0$ otherwise. [sent-138, score-0.162]

70 Setting $\alpha^t_j = \min\big\{C,\ \ell(\omega^t; (x^t_j, y^t_j)) / \|x^t_j\|^2\big\}$, we independently assign each $\alpha^t_j$ this optimal solution. [sent-150, score-0.628]
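
This closed form can be verified directly; a short derivation sketch for the reduced problem of Eq. (3): the unconstrained maximizer of $\alpha^t_j - \tfrac{1}{2}\|\omega^t + \alpha^t_j y^t_j x^t_j\|^2$ satisfies
$$1 - y^t_j\, \omega^t \cdot x^t_j - \alpha^t_j \|x^t_j\|^2 = 0 \;\;\Longrightarrow\;\; \alpha^t_j = \frac{1 - y^t_j\, \omega^t \cdot x^t_j}{\|x^t_j\|^2},$$
restricting to $\alpha^t_j \geq 0$ replaces the numerator by $[1 - y^t_j\, \omega^t \cdot x^t_j]_+ = \ell(\omega^t; (x^t_j, y^t_j))$, and capping at $C$ gives $\alpha^t_j = \min\{C,\ \ell(\omega^t; (x^t_j, y^t_j)) / \|x^t_j\|^2\}$.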

71 We would like to comment that this solution may update $\alpha^t_j$ also for instances which were correctly classified, as long as the margin they attain is not sufficiently large. [sent-152, score-0.204]

72 Conservative Simultaneous Projections: Combining ideas from both methods, the conservative simultaneous projections scheme optimally sets $\alpha^t_j$ according to the analytic solution. [sent-154, score-0.407]

73 In the conservative scheme only the instances which were incorrectly predicted ($j \in \mathcal{M}^t$) are assigned a positive weight. [sent-156, score-0.286]
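
Putting the pieces together, here is a minimal sketch of one simultaneous-projections update for the two schemes named above (the averaging weights mu are taken to be uniform over the updated instances, which is one natural choice; the summary does not pin down the exact choice of mu, so treat this as an illustration):

```python
import numpy as np

def sim_proj_update(w_t, X_t, y_t, C, scheme="conservative"):
    """One simultaneous-projections update (illustrative sketch)."""
    y_hat = X_t @ w_t
    hinge = np.maximum(0.0, 1.0 - y_t * y_hat)
    active = np.sign(y_hat) != y_t            # update only mistaken instances
    if not active.any():                      # perfect prediction: no update
        return w_t

    if scheme == "perceptron":                # SimPerc: alpha_j = C on mistakes
        alpha = np.where(active, C, 0.0)
    else:                                     # ConProj: analytic per-constraint step
        sq_norms = np.einsum("ij,ij->i", X_t, X_t)
        alpha = np.where(active,
                         np.minimum(C, hinge / np.maximum(sq_norms, 1e-12)),
                         0.0)

    mu = active / active.sum()                # uniform averaging weights on the simplex
    # Combined solution: w_{t+1} = w_t + sum_j mu_j * alpha_j * y_j * x_j
    return w_t + (mu * alpha * y_t) @ X_t
```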

74 To recap, on each trial t we obtain a feasible solution for the instantaneous dual given in Eq. (2). [sent-159, score-0.639]

75 While this solution may not be optimal, it does constitute an infrastructure for obtaining a mistake bound and, as we demonstrate in Sec. [sent-162, score-0.233]

76 5 Analysis The algorithms described in the previous section perform updates in order to increase the objective of the instantaneous dual problem defined in Eq. (2). [sent-164, score-0.369]

77 We now use the mistake bound model to derive an upper bound on the number of trials on which the predictions of SimPerc and ConProj algorithms are imperfect. [sent-166, score-0.187]

78 Following [6], the first step in the analysis is to tie the instantaneous dual problems to a global loss function. [sent-167, score-0.448]

79 To do so, we introduce a primal optimization problem defined over the entire sequence of examples as follows: $\min_{\omega \in \mathbb{R}^n} \frac{1}{2}\|\omega\|^2 + C \sum_{t=1}^{T} \ell(\omega; (X^t, y^t))$. [sent-168, score-0.179]

80 We rewrite the optimization problem as the following equivalent constrained optimization problem:
$$\min_{\omega \in \mathbb{R}^n,\, \xi \in \mathbb{R}^T}\ \frac{1}{2}\|\omega\|^2 + C \sum_{t=1}^{T} \xi_t \quad \text{s.t.}\quad \forall t \in [T],\ \forall j \in [k_t]:\ y^t_j\, \omega \cdot x^t_j \geq 1 - \xi_t, \qquad \forall t:\ \xi_t \geq 0. \qquad (4)$$ [sent-169, score-0.193]

82 We denote the value of the objective function at $(\omega, \xi)$ for this optimization problem by $\mathcal{P}(\omega, \xi)$. [sent-172, score-0.163]

83 Standard usage of Lagrange multipliers yields that the dual of Eq. (4) is
$$\max_{\lambda}\ \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j} \;-\; \frac{1}{2}\Big\|\sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j}\, y^t_j\, x^t_j\Big\|^2 \quad \text{s.t.}\quad \forall t:\ \sum_{j=1}^{k_t} \lambda_{t,j} \leq C,\ \ \lambda \geq 0. $$ [sent-174, score-0.218]

85 Throughout our derivation we use the fact that any set of dual variables $\lambda_1, \cdots, \lambda_T$ defines a feasible solution $\omega = \sum_{t=1}^{T} \sum_{j=1}^{k_t} \lambda_{t,j}\, y^t_j\, x^t_j$ with a corresponding assignment of the slack variables. [sent-180, score-1.568]

86 We note, however, that if we ensure that $\lambda_{s,j} = 0$ for all $s > t$, then the dual function no longer depends on instances occurring on rounds following round t. [sent-183, score-0.346]

87 2 by finding a new feasible solution for the dual problem on every trial. [sent-185, score-0.442]

88 The instantaneous dual problem of Eq. (2) is equivalent (after omitting an additive constant) to the following constrained optimization problem, subject to $\lambda \geq 0$ and $\sum_{j=1}^{k_t} \lambda_j \leq C$ (Eq. (6)). [sent-187, score-0.588]

89 That is, the instantaneous dual problem is obtained from $D(\lambda_1, \cdots, \lambda_T)$ by fixing $\lambda_1, \ldots, \lambda_{t-1}$, and it is straightforward to show that the prediction vector used on trial t is $\omega^t = \sum_{s=1}^{t-1} \sum_{j} \lambda_{s,j}\, y^s_j\, x^s_j$. [sent-197, score-0.601]

91 Eq. (6) can be rewritten as $\max_{\lambda^t_1, \ldots, \lambda^t_{k_t}} \sum_{j=1}^{k_t} \lambda^t_j - \frac{1}{2}\big\|\omega^t + \sum_{j=1}^{k_t} \lambda^t_j\, y^t_j\, x^t_j\big\|^2$. [sent-199, score-1.11]

92 $\mu^t$ also yields a feasible solution for the problem defined in Eq. [sent-214, score-0.224]

93 Let $X^1, y^1, \ldots, X^T, y^T$ be a sequence of examples where $X^t$ is a matrix of $k_t$ examples and $y^t$ are the associated labels. [sent-222, score-0.805]

94 Assume that for all $t$ and $j$ the norm of an instance $x^t_j$ is at most $R$. [sent-223, score-0.399]

95 Then, for any $\omega \in \mathbb{R}^n$ the number of trials on which the prediction of SimPerc is imperfect is at most $\frac{1}{2}\|\omega\|^2 + C \sum_{t=1}^{T} \ell(\omega; (X^t, y^t))$. [sent-224, score-0.527]

96 Recall that any dual feasible solution induces a value for the dual's objective function which is upper bounded by the optimum value of the primal problem, $\mathcal{P}(\omega^\star, \xi^\star)$. [sent-227, score-0.507]

97 In particular, the solution obtained at the end of trial $T$ is dual feasible, and thus $D(\lambda_1, \ldots, \lambda_T) \leq \mathcal{P}(\omega^\star, \xi^\star)$. [sent-228, score-0.419]

98 Therefore, denoting by $\Delta_t$ the difference between two consecutive dual objective values, $D(\lambda_1, \ldots, \lambda_T) = \sum_{t=1}^{T} \Delta_t$. [sent-252, score-0.247]

99 First, note that if the prediction on trial t is perfect ($\mathcal{M}^t = \emptyset$) then SimPerc sets $\lambda_t$ to the zero vector and thus $\Delta_t = 0$. [sent-266, score-0.294]

100 We can thus focus on trials for which the algorithm’s prediction is imperfect. [sent-267, score-0.162]
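
Schematically, the argument sketched in these fragments is the usual dual-ascent telescoping bound (here $\Delta_{\min}$ is an illustrative placeholder for the smallest dual increase on an imperfect trial, whose exact value depends on $C$ and $R$):
$$M \cdot \Delta_{\min} \;\leq\; \sum_{t=1}^{T} \Delta_t \;=\; D(\lambda_1, \ldots, \lambda_T) \;\leq\; \mathcal{P}(\omega^\star, \xi^\star) \;=\; \tfrac{1}{2}\|\omega^\star\|^2 + C \sum_{t=1}^{T} \ell(\omega^\star; (X^t, y^t)),$$
where $M$ is the number of imperfect trials; dividing by $\Delta_{\min}$ yields a mistake bound of the form stated in the theorem.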


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kt', 0.482), ('xt', 0.321), ('yj', 0.307), ('yt', 0.295), ('dual', 0.218), ('simultaneous', 0.17), ('simperc', 0.147), ('prediction', 0.132), ('mistake', 0.13), ('trial', 0.13), ('mt', 0.121), ('feasible', 0.111), ('instantaneous', 0.109), ('online', 0.106), ('projections', 0.098), ('instances', 0.097), ('categorizer', 0.088), ('cast', 0.083), ('ordinal', 0.082), ('primal', 0.078), ('th', 0.078), ('instance', 0.078), ('multilabel', 0.077), ('rkt', 0.077), ('solution', 0.071), ('imperfect', 0.07), ('tied', 0.069), ('multiclass', 0.069), ('categorization', 0.065), ('abbreviate', 0.065), ('rn', 0.065), ('bi', 0.061), ('projection', 0.059), ('optimization', 0.059), ('casts', 0.059), ('conproj', 0.059), ('simproj', 0.059), ('slack', 0.058), ('complex', 0.055), ('receives', 0.052), ('conservative', 0.052), ('devising', 0.051), ('regressor', 0.051), ('halfspace', 0.051), ('bk', 0.05), ('label', 0.05), ('incorrectly', 0.048), ('labels', 0.048), ('tie', 0.047), ('omitting', 0.047), ('predicted', 0.046), ('reduced', 0.045), ('analytic', 0.044), ('constraint', 0.043), ('scheme', 0.043), ('loss', 0.043), ('problem', 0.042), ('casting', 0.041), ('sign', 0.04), ('constraints', 0.039), ('formally', 0.038), ('losses', 0.037), ('task', 0.037), ('update', 0.036), ('analyze', 0.036), ('perceptron', 0.036), ('tackle', 0.036), ('violated', 0.034), ('thresholds', 0.033), ('denote', 0.033), ('rewrite', 0.033), ('bold', 0.032), ('constitutes', 0.032), ('vector', 0.032), ('rounds', 0.031), ('rank', 0.031), ('problems', 0.031), ('pair', 0.031), ('expanded', 0.03), ('trials', 0.03), ('ranking', 0.03), ('studied', 0.029), ('objective', 0.029), ('required', 0.029), ('associated', 0.028), ('suffer', 0.028), ('relevant', 0.028), ('receive', 0.027), ('hypothesis', 0.027), ('predictions', 0.027), ('solutions', 0.027), ('xing', 0.027), ('algorithmic', 0.026), ('minimizer', 0.026), ('tradeoff', 0.026), ('letters', 0.026), ('focuses', 0.026), ('attaining', 0.026), ('tailor', 0.026), ('remind', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

2 0.35655382 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1

3 0.28272367 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi

Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1

4 0.22534202 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1

5 0.21995172 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. 
Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable one-dimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time.

Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation.

Methods. We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as:

λt = E(rt) = f( Σ_i Σ_{l=1}^{tk} k_{i,t−l} x_{i,t−l} + Σ_{j=1}^{ta} a_j r_{t−j} )    (1)

In the above summation the filter coefficients k_{i,t−l} capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]).
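For concreteness, the model of Eq. (1) with an exponential link can be simulated in a few lines. The sketch below is illustrative only: the array shapes, the Poisson emission used to realize E(rt) = λt, and all variable names are our own assumptions rather than notation fixed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def glm_intensity(k, a, stim_hist, spike_hist):
    """Conditional intensity lambda_t = f( sum_{i,l} k_{i,t-l} x_{i,t-l} + sum_j a_j r_{t-j} ), f = exp.

    k          : (t_k, n) stimulus filter coefficients
    a          : (t_a,)   spike-history coefficients
    stim_hist  : (t_k, n) stimuli x_t, ..., x_{t-t_k+1}
    spike_hist : (t_a,)   responses r_{t-1}, ..., r_{t-t_a}
    """
    drive = np.sum(k * stim_hist) + np.dot(a, spike_hist)
    return np.exp(drive)

def sample_response(k, a, stim_hist, spike_hist, dt=1.0):
    """Draw a spike count from a Poisson with rate lambda_t * dt (one way to realize Eq. 1)."""
    lam = glm_intensity(k, a, stim_hist, spike_hist)
    return rng.poisson(lam * dt)

# toy usage: a 2-lag, 5-dimensional stimulus filter and 3 inhibitory spike-history terms
k = rng.normal(scale=0.1, size=(2, 5))
a = np.array([-1.0, -0.5, -0.25])
x_hist = rng.normal(size=(2, 5))
r_hist = np.array([1.0, 0.0, 0.0])
print(sample_response(k, a, x_hist, r_hist))
```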
For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt, and the resulting responses rt. We chose the nonlinear stage of the GLM, the link function f(), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9].

[Figure 1 (plot panels omitted in this extraction). Caption: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt. The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std.]

Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt. In general, computing these terms directly requires O(td² + d³) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d³ term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large.
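To make the cost of the direct approach explicit, here is a minimal sketch of a naive Laplace approximation for the exponential-link GLM, assuming a Gaussian prior N(0, prior_var·I); the stacked regressor matrix X, the prior variance, and the use of SciPy's L-BFGS are our own choices. Recomputing this from scratch after every trial is exactly the O(td² + d³) bottleneck described above.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_posterior(X, r, prior_var=1.0):
    """Naive Laplace approximation to p(theta | data) for the exponential-link GLM.

    X : (t, d) matrix whose rows are the regressors {x_t; r_{t-1}, ...}
    r : (t,)   observed responses
    Returns the posterior mean (MAP estimate) and covariance (negative inverse Hessian).
    Cost is O(t d^2 + d^3) per call, which is what the recursive update below avoids.
    """
    t, d = X.shape
    prior_prec = np.eye(d) / prior_var

    def neg_log_post(theta):
        eta = X @ theta
        return -(r @ eta - np.exp(eta).sum()) + 0.5 * theta @ prior_prec @ theta

    def grad(theta):
        eta = X @ theta
        return -(X.T @ (r - np.exp(eta))) + prior_prec @ theta

    res = minimize(neg_log_post, np.zeros(d), jac=grad, method="L-BFGS-B")
    mu = res.x
    lam = np.exp(X @ mu)
    hessian = prior_prec + X.T @ (lam[:, None] * X)   # negative Hessian of the log-posterior at mu
    return mu, np.linalg.inv(hessian)
```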
Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior:

log p(θ|rt , xt ) = −½ (θ − µt−1)ᵀ Ct−1⁻¹ (θ − µt−1) − exp({xt ; rt−1}ᵀ θ) + rt {xt ; rt−1}ᵀ θ + const

d log p(θ|rt , xt )/dθ = −(θ − µt−1)ᵀ Ct−1⁻¹ + ( −exp({xt ; rt−1}ᵀ θ) + rt ) {xt ; rt−1}ᵀ    (2)

d² log p(θ|rt , xt )/dθi dθj = −Ct−1⁻¹ − exp({xt ; rt−1}ᵀ θ) {xt ; rt−1}{xt ; rt−1}ᵀ    (3)

Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that the likelihood only varies along a single direction, {xt ; rt−1}, as a function of θ. At the peak of the posterior, µt, the first term in the gradient must be parallel to {xt ; rt−1} because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1}. Therefore we just need to solve a one-dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1}; this requires only O(d²) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1, which can be evaluated in O(d²) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal.
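A minimal sketch of the two O(d²) steps just derived — a one-dimensional search for µt along Ct−1{xt; rt−1} and a rank-one Woodbury update of Ct — assuming a Poisson response with exponential link. The scalar minimizer and all names are our own choices, not the authors' code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rank_one_posterior_update(mu_prev, C_prev, v, r):
    """One recursive Gaussian-posterior update for the exponential-link GLM.

    mu_prev, C_prev : previous posterior mean (d,) and covariance (d, d)
    v               : regressor {x_t; r_{t-1}, ...}, shape (d,)
    r               : observed response on this trial
    Both steps cost O(d^2).
    """
    direction = C_prev @ v           # mu_t - mu_{t-1} is parallel to C_{t-1} v
    vCv = v @ direction              # v^T C_{t-1} v

    def neg_log_post_1d(alpha):
        # negative log-posterior restricted to theta = mu_{t-1} + alpha * C_{t-1} v
        eta = v @ mu_prev + alpha * vCv
        return 0.5 * alpha**2 * vCv + np.exp(eta) - r * eta

    alpha = minimize_scalar(neg_log_post_1d).x
    mu = mu_prev + alpha * direction

    # rank-one Woodbury update of the covariance (second-derivative term of Eq. 3 at mu)
    lam = np.exp(v @ mu)
    C = C_prev - np.outer(direction, direction) * (lam / (1.0 + lam * vCv))
    return mu, C
```

On a stream of trials this replaces the full Laplace recomputation sketched earlier and keeps both the mean and covariance updates at O(d²).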
Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information

I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 )    (4)

with respect to the stimulus xt+1. The first term does not depend on xt+1, so maximizing the information requires minimizing the conditional entropy

H(θ|xt+1 , rt+1 ) = Σ_{rt+1} p(rt+1 |xt+1 ) ∫ −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 ) dθ = E_{rt+1 |xt+1} log det[Ct+1 ] + const.    (5)

We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize E_{rt+1 |xt+1} log det[Ct+1 ] with respect to xt+1. Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have:

Ct+1 = ( Ct⁻¹ + Jobs (rt+1 , xt+1 ) )⁻¹ ,    (6)

where Jobs is the observed Fisher information,

Jobs (rt+1 , xt+1 ) = −∂² log p(rt+1 |ε = xt+1ᵀ θ)/∂ε² · xt+1 xt+1ᵀ .    (7)

Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt+1ᵀ θ. We can use the Woodbury lemma to evaluate the inverse:

Ct+1 = Ct [ I + D(rt+1 , ε)(1 − D(rt+1 , ε) xt+1ᵀ Ct xt+1 )⁻¹ xt+1 xt+1ᵀ Ct ]    (8)

where D(rt+1 , ε) = ∂² log p(rt+1 |ε)/∂ε². Using some basic matrix identities,

log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε) xt+1ᵀ Ct xt+1 )    (9)
= log det[Ct ] + D(rt+1 , ε) xt+1ᵀ Ct xt+1 + o(D(rt+1 , ε) xt+1ᵀ Ct xt+1 )    (10)

Ignoring the higher order terms, we need to minimize E_{rt+1 |xt+1} D(rt+1 , ε) xt+1ᵀ Ct xt+1. In our case, with f(θᵀ xt+1 ) = exp(θᵀ xt+1 ), we can use the moment-generating function of the multivariate Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize

F (xt+1 ) = exp(xt+1ᵀ µt ) exp(½ xt+1ᵀ Ct xt+1 ) xt+1ᵀ Ct xt+1 .    (11)

Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||² increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||² ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a one-dimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields:

F (xt+1 ) = exp( Σ_i ui yi + ½ Σ_i ci yi² ) Σ_i ci yi²    (12)
= g( Σ_i ui yi ) h( Σ_i ci yi² )    (13)

where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of Σ_i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of Σ_i ui yi will then find the global maximum of F (·). This approach is summarized by the equation:

max_{y:||y||²=e} F (y) = max_b g(b) · max_{y:||y||²=e, yᵀu=b} h( Σ_i ci yi² )

Since h() is increasing, to solve the inner problem we only need to solve:

max_{y:||y||²=e, yᵀu=b} Σ_i ci yi²    (14)

This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of equations for the optimal yi as a function of the Lagrange multiplier λ1:

yi (λ1 ) = (√e / ||y||₂) · ui / (2(ci − λ1 ))    (15)

Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (·). It is possible to show (details omitted) that the maximum of F (·) must occur on the interval λ1 ≥ c0, where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rank-one matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d³), while the other steps, as discussed above, are O(d²). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1, and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d³), we found the average running time in our case to be O(d²) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d²) on average.

Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients, a. The results of our simulations with spike history terms are shown in Fig. 2.

[Figure 2 (image panels omitted in this extraction). Caption: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a.]

Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion:

θt+1 = θt + wt    (16)

Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q.

[Figure 3 (image panels omitted in this extraction). Caption: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively.]
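The one-dimensional search over λ1 described in Eqs. (11)–(15) can be sketched as follows. This is an illustrative simplification only: we use a fixed grid over λ1 > c0 in place of a proper line search, and we resolve the sign left free by the normalization in Eq. (15) by evaluating both candidates; none of this is taken from the authors' implementation.

```python
import numpy as np

def optimal_stimulus(mu, C, e=1.0, n_grid=200):
    """Approximate info-max stimulus: maximize F(x) = exp(x.mu + 0.5 x^T C x) * (x^T C x)
    subject to ||x||^2 <= e, via an eigendecomposition of C and a grid over lambda_1 > c_0."""
    c, V = np.linalg.eigh(C)          # eigenvalues ascending; c[-1] is the largest, c_0
    u = V.T @ mu                      # projection of mu onto the eigenvectors
    c0 = c[-1]

    def F(y):
        quad = np.sum(c * y * y)
        return np.exp(u @ y + 0.5 * quad) * quad

    best_y, best_val = None, -np.inf
    for lam1 in c0 + np.geomspace(1e-6, 10.0 * (abs(c0) + 1.0), n_grid):
        y_tilde = u / (2.0 * (c - lam1))          # Eq. (15) before normalization
        nrm = np.linalg.norm(y_tilde)
        if nrm == 0.0:
            continue
        y = np.sqrt(e) * y_tilde / nrm            # rescale to the power constraint
        for cand in (y, -y):                      # sign left free by the normalization
            val = F(cand)
            if val > best_val:
                best_val, best_y = val, cand
    if best_y is None:                            # e.g. mu = 0: fall back to the top eigenvector
        best_y = np.zeros(len(c))
        best_y[-1] = np.sqrt(e)
    return V @ best_y                             # back from eigencoordinates to stimulus space
```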
To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. 
Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods.

Acknowledgments. JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations.

References
[1] I. Nelken, et al., Hearing Research 72, 237 (1994).
[2] P. Foldiak, Neurocomputing 38–40, 1217 (2001).
[3] K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004).
[4] R. C. deCharms, et al., Science 280, 1439 (1998).
[5] C. Machens, et al., Neuron 47, 447 (2005).
[6] A. Watson, et al., Perception and Psychophysics 33, 113 (1983).
[7] L. Paninski, Neural Computation 17, 1480 (2005).
[8] P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989).
[9] L. Paninski, Network: Computation in Neural Systems 15, 243 (2004).
[10] E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn.
[11] P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001).
[12] E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001).
[13] F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001).
[14] L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004).
[15] M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994).
[16] N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).

6 0.19420302 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

7 0.18888482 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

8 0.18218315 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

9 0.12948419 79 nips-2006-Fast Iterative Kernel PCA

10 0.12060589 156 nips-2006-Ordinal Regression by Extended Binary Classification

11 0.11918677 103 nips-2006-Kernels on Structured Objects Through Nested Histograms

12 0.11710759 186 nips-2006-Support Vector Machines on a Budget

13 0.11330687 146 nips-2006-No-regret Algorithms for Online Convex Programs

14 0.11097538 117 nips-2006-Learning on Graph with Laplacian Regularization

15 0.1052302 154 nips-2006-Optimal Change-Detection and Spiking Neurons

16 0.099894889 163 nips-2006-Prediction on a Graph with a Perceptron

17 0.083383575 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

18 0.078911796 98 nips-2006-Inferring Network Structure from Co-Occurrences

19 0.070205458 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

20 0.07012102 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.242), (1, 0.051), (2, -0.315), (3, 0.171), (4, -0.191), (5, -0.014), (6, -0.06), (7, -0.285), (8, -0.24), (9, -0.237), (10, 0.002), (11, -0.028), (12, -0.07), (13, 0.038), (14, -0.147), (15, 0.067), (16, -0.028), (17, -0.078), (18, 0.07), (19, -0.044), (20, -0.04), (21, 0.005), (22, -0.124), (23, -0.035), (24, 0.047), (25, 0.037), (26, -0.016), (27, 0.088), (28, 0.06), (29, -0.07), (30, 0.028), (31, -0.038), (32, -0.069), (33, -0.078), (34, 0.009), (35, -0.004), (36, 0.118), (37, 0.001), (38, -0.087), (39, 0.003), (40, 0.056), (41, -0.006), (42, 0.107), (43, 0.027), (44, -0.115), (45, 0.042), (46, 0.085), (47, 0.007), (48, -0.063), (49, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96710742 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

2 0.79908848 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1

3 0.56013131 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
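To fix ideas, here is a small sketch of the repeated-game protocol just described, together with one simple strategy for the first player (plain online subgradient descent on the sequence of losses). The strategy, the toy hinge losses, the step size 1/c, and all names are our own illustration; the general framework developed below recovers this kind of update as a special case (with f(w) = ½||w||²).

```python
import numpy as np

def play_convex_repeated_game(loss_fns, d, c):
    """Convex repeated game from the first player's viewpoint, with f(w) = 0.5*||w||^2
    over S = R^d, so grad f*(theta) = theta and the conservative dual update reduces to
    subgradient descent with step size 1/c.

    loss_fns : list of callables; loss_fns[t](w) returns (g_t(w), a subgradient at w)
    """
    theta = np.zeros(d)                 # theta accumulates -(1/c) * sum of the dual variables
    cumulative_loss = 0.0
    for loss_fn in loss_fns:
        w = theta                       # first player's prediction w_t
        loss, subgrad = loss_fn(w)      # second player reveals g_t; player suffers g_t(w_t)
        cumulative_loss += loss
        theta = theta - subgrad / c     # set lambda_t to a subgradient of g_t at w_t
    return cumulative_loss

# toy usage: hinge losses g_t(w) = [1 - y_t <w, x_t>]_+ on random data, c = sqrt(T)
rng = np.random.default_rng(1)
T, d = 100, 5
data = [(rng.normal(size=d), rng.choice([-1.0, 1.0])) for _ in range(T)]

def make_hinge(x, y):
    def fn(w):
        margin = 1.0 - y * (w @ x)
        if margin > 0:
            return margin, -y * x
        return 0.0, np.zeros_like(x)
    return fn

print(play_convex_repeated_game([make_hinge(x, y) for x, y in data], d, c=np.sqrt(T)))
```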
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
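As a concrete illustration of this game, the sketch below plays the booster's side with the entropic complexity function of Example 2, so the example weights are a softmax of θ = −Σt αt vt, and each αt is capped at 2γ as described in the following paragraph. The decision-stump weak learner, the toy data, and all names are our own; this is a schematic of the variant discussed here, not the authors' code.

```python
import numpy as np

def capped_adaboost(X, y, weak_learner, T, gamma):
    """Boosting game of Sec. 7: example weights w_t = grad f*(theta_t) (a softmax), with the
    hypothesis weights alpha_t capped at 2*gamma. `weak_learner(X, y, w)` is assumed to
    return a vector of +/-1 predictions h_t(x_i) on the training set."""
    m = len(y)
    theta = np.zeros(m)
    alphas, preds = [], []
    for _ in range(T):
        w = np.exp(theta - theta.max())
        w /= w.sum()                          # distribution over examples
        h = weak_learner(X, y, w)
        eps = w @ ((1.0 - y * h) / 2.0)       # weighted error, as in Eq. (13)
        alpha = min(2.0 * gamma, 0.5 * np.log((1.0 - eps) / max(eps, 1e-12)))
        theta -= alpha * (y * h)              # v_t,i = y_i h_t(x_i); theta = -sum alpha_t v_t
        alphas.append(alpha)
        preds.append(h)
    margin = sum(a * h for a, h in zip(alphas, preds))
    return np.sign(margin)                    # strong-hypothesis predictions on the training set

def stump_learner(X, y, w):
    """Toy weak learner: best single-coordinate sign stump, possibly flipped."""
    best_h, best_err = None, np.inf
    for j in range(X.shape[1]):
        for s in (1.0, -1.0):
            h = s * np.sign(X[:, j] + 1e-12)
            err = w @ ((1.0 - y * h) / 2.0)
            if err < best_err:
                best_err, best_h = err, h
    return best_h

# toy usage: labels given by the sign of the first coordinate
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = np.sign(X[:, 0] + 1e-12)
print(np.mean(capped_adaboost(X, y, stump_learner, T=10, gamma=0.1) == y))
```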
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. [14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. [15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. [16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006. [17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. [18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. [19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
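To make the boosting game described above concrete, the following is a minimal sketch of the capped-AdaBoost variant: the booster maintains θt = −Σi αi vi, plays the weights wt = ∇f*(θt) induced by the entropic complexity function (which is exactly the AdaBoost example distribution), and caps each αt at 2γ. This is only an illustration, not the authors' implementation; the `weak_learner` interface and the numerical guard on εt are assumptions made for this sketch.

```python
import numpy as np

def capped_adaboost(X, y, weak_learner, T, gamma):
    """Sketch of the capped-AdaBoost variant: dual ascent on the boosting game with
    the entropic complexity function, so the example weights are the usual AdaBoost
    distribution and every alpha_t is capped at 2*gamma.
    weak_learner(X, y, w) is assumed to return a hypothesis h with h(X) in {+1, -1}."""
    m = len(y)
    theta = np.zeros(m)                  # theta_t = -sum_i alpha_i v_i, with v_{t,i} = y_i h_t(x_i)
    hypotheses, alphas = [], []
    for _ in range(T):
        w = np.exp(theta - theta.max())
        w /= w.sum()                     # w_t = grad f*(theta_t): softmax, i.e. the AdaBoost weights
        h = weak_learner(X, y, w)
        v = y * h(X)
        eps = np.sum(w * (1.0 - v) / 2.0)          # weighted error of h_t (Eq. (13))
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)    # numerical guard (assumption)
        alpha = min(2.0 * gamma, 0.5 * np.log((1.0 - eps) / eps))   # capped update
        theta -= alpha * v               # dual step: increases the dual objective
        hypotheses.append(h)
        alphas.append(alpha)
    return lambda Xnew: np.sign(sum(a * h(Xnew) for a, h in zip(alphas, hypotheses)))
```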

4 0.53552878 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can significantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efficient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efficiently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efficiently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efficiency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be defined a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in finding a neuron’s “preferred stimulus,” is the stimulus which maximizes the firing rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to define the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identification approach: we define the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in which the probability p(rt | {xt, xt−1, ..., xt−tk}, {rt−1, . . . , rt−ta}) of the neural response rt given the current and past stimuli {xt, xt−1, ..., xt−tk}, and the observed recent history of the neuron’s activity, {rt−1, . . . , rt−ta}, can be described by a model p(rt | {xt}, {rt−1}, θ), specified by a finite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. 
Two inconvenient facts make it difficult to realize this goal in a computationally efficient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt | {xt}, {rt−1}, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ | {rt}, {xt}) given new stimulus-response data, and 2) find the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt | {xt}, {rt−1}, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on finding solutions which scale well in high dimensions. We solve problem (1) by using efficient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable one-dimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efficient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt | {xt, ..., xt−tk}, {rt−1, ..., rt−ta}). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous firing rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; briefly, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-fire cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: $\lambda_t = E(r_t) = f\Big(\sum_i \sum_{l=1}^{t_k} k_{i,t-l}\, x_{i,t-l} + \sum_{j=1}^{t_a} a_j\, r_{t-j}\Big) \qquad (1)$ In the above summation the filter coefficients ki,t−l capture the dependence of the neuron’s instantaneous firing rate λt on the ith component of the vector stimulus at time t − l, xt−l; the model therefore allows for spatiotemporal receptive fields. For convenience, we arrange all the stimulus coefficients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive field. The coefficients aj model the dependence on the observed recent activity r at time t − j (these terms may reflect e.g. refractory effects, burstiness, firing-rate adaptation, etc., depending on the value of the vector a [9]). 
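As a small illustration of the GLM in Eq. (1), the sketch below evaluates the conditional intensity for one time bin using the exponential link that the text adopts below. The array layout and the Poisson draw in the second function are assumptions made only for this example, not the authors' code.

```python
import numpy as np

def glm_intensity(k, a, x_hist, r_hist):
    """lambda_t = f( sum_{i,l} k_{i,l} x_{i,t-l} + sum_j a_j r_{t-j} ), Eq. (1), with f = exp.
    k and x_hist are (stimulus dims) x (t_k lags) arrays aligned lag-for-lag;
    a and r_hist hold the t_a most recent spike-history terms."""
    drive = np.sum(k * x_hist) + np.dot(a, r_hist)
    return np.exp(drive)

def sample_response(rng, k, a, x_hist, r_hist, dt=1.0):
    # one Poisson draw from the conditional intensity (assumed observation model)
    return rng.poisson(glm_intensity(k, a, x_hist, r_hist) * dt)
```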
For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown filter coefficients, θ, given knowledge of the stimuli, xt, and the resulting responses rt. We chose the nonlinear stage of the GLM, the link function f(), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our first key task is to efficiently update the posterior distribution of θ after t trials, p(θt | xt, rt), as new stimulus-response pairs are observed. (We use xt and rt to abbreviate the sequences {xt, . . . , x0} and {rt, . . . , r0}.) Figure 1: A) Plots of the estimated receptive field for a simulated visual neuron. The neuron’s receptive field θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt. The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-fit by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justified by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt | xt, rt): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt. In general, computing these terms directly requires O(td2 + d3) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. 
Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simplifies matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to find the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the first term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efficiency; our simulations (data not shown) showed that, despite this improved efficiency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the firing rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The first term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 
1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefficients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefficients, k. B) The estimated spike history coefficients, a. C) The final estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we find that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undefined, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly defined by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by fixing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then find the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive field when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. 
In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to find the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch significantly faster and more stable. To summarize, updating the posterior and finding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modification of Ct−1 , and there exist efficient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modification of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deflation which reduces the cost of matrix multiplications associated with finding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we fixed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to find the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefficients, k, and the spike history coefficients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. 
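The three steps summarized above (rank-one Laplace update of the posterior, eigendecomposition of the covariance, and a one-dimensional search over λ1 ≥ c0), together with the Ct + Q diffusion term, can be sketched as follows for the spike-history-free case θ = {k}. This is only an illustration under the exponential-link/Poisson assumptions; in particular, the normalization of Eq. (15) is garbled in the extracted text, so the candidate stimulus y(λ1) below is simply rescaled to the power budget (both signs are tried) before evaluating the objective of Eq. (11).

```python
import numpy as np

def update_posterior(mu, C, u, r, n_newton=25):
    """Rank-one Laplace update of the Gaussian posterior N(mu, C) after observing
    response r to the regressor u = {x_t; r_{t-1}}.  The mean moves only along
    C @ u, so the mode is found by a 1-d Newton search; the covariance is a
    rank-one Woodbury update (exp link, Poisson likelihood assumed)."""
    d = C @ u
    uCu = u @ d
    delta = 0.0
    for _ in range(n_newton):
        eps = u @ mu + delta * uCu
        grad = (r - np.exp(eps)) * uCu - delta * uCu     # d/d delta of the log posterior
        hess = -np.exp(eps) * uCu**2 - uCu
        delta -= grad / hess
    mu_new = mu + delta * d
    D = -np.exp(u @ mu_new)                              # d^2 log p(r|eps) / d eps^2
    C_new = C + (D / (1.0 - D * uCu)) * np.outer(d, d)   # Woodbury rank-one update
    return mu_new, C_new

def choose_stimulus(mu, C, e, Q=None, n_grid=200):
    """Eigendecompose the (possibly diffused) covariance and line-search over the
    Lagrange multiplier lambda_1 > c_0 for a stimulus maximizing Eq. (11) under
    the power constraint ||x||^2 <= e."""
    if Q is not None:
        C = C + Q                    # dynamic-theta case: predictive covariance C_t + Q
    c, V = np.linalg.eigh(C)         # eigenvalues ascending; c[-1] is the largest, c_0
    u = V.T @ mu
    best_F, best_x = -np.inf, None
    for lam in c[-1] + np.logspace(-6, 2, n_grid):
        y = u / (2.0 * (c - lam))
        norm = np.linalg.norm(y)
        if norm == 0.0:
            continue
        y *= np.sqrt(e) / norm       # enforce the power budget
        for s in (1.0, -1.0):
            x = V @ (s * y)
            F = np.exp(x @ mu) * np.exp(0.5 * x @ C @ x) * (x @ C @ x)   # Eq. (11)
            if F > best_F:
                best_F, best_x = F, x
    return best_x
```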
To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our first simulation considered the use of our algorithm for learning the receptive field of a visually sensitive neuron. We took the neuron’s receptive field to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efficient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate fitted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive field of a neuron whose response depends on its past spiking. Here, the stimulus coefficients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modification of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modification. the spike history coefficients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefficients and the spike history coefficients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefficients. Our final set of results, Fig. 3, considers a neuron whose receptive field drifts non-systematically with time. We take the receptive field to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efficient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. 
Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides significant improvements over random sampling; indeed, the differences in efficiency are large enough that the information-optimization strategy may permit robust system identification in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique significantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).

5 0.52055991 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

Author: Manfred K. Warmuth, Dima Kuzmin

Abstract: We design an on-line algorithm for Principal Component Analysis. In each trial the current instance is projected onto a probabilistically chosen low-dimensional subspace. The total expected quadratic approximation error equals the total quadratic approximation error of the best subspace chosen in hindsight plus some additional term that grows linearly in the dimension of the subspace but logarithmically in the dimension of the instances. 1

6 0.51137501 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

7 0.47136718 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

8 0.38860667 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

9 0.38484573 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

10 0.37682459 146 nips-2006-No-regret Algorithms for Online Convex Programs

11 0.37525797 117 nips-2006-Learning on Graph with Laplacian Regularization

12 0.35745427 156 nips-2006-Ordinal Regression by Extended Binary Classification

13 0.3546654 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

14 0.35210696 79 nips-2006-Fast Iterative Kernel PCA

15 0.31043178 154 nips-2006-Optimal Change-Detection and Spiking Neurons

16 0.30986035 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

17 0.30714819 103 nips-2006-Kernels on Structured Objects Through Nested Histograms

18 0.29177281 186 nips-2006-Support Vector Machines on a Budget

19 0.28984049 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

20 0.28958264 163 nips-2006-Prediction on a Graph with a Perceptron


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.145), (3, 0.018), (7, 0.09), (9, 0.057), (20, 0.017), (22, 0.127), (44, 0.062), (57, 0.078), (64, 0.01), (65, 0.128), (69, 0.027), (71, 0.014), (98, 0.156)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94370902 22 nips-2006-Adaptive Spatial Filters with predefined Region of Interest for EEG based Brain-Computer-Interfaces

Author: Moritz Grosse-wentrup, Klaus Gramann, Martin Buss

Abstract: The performance of EEG-based Brain-Computer-Interfaces (BCIs) critically depends on the extraction of features from the EEG carrying information relevant for the classification of different mental states. For BCIs employing imaginary movements of different limbs, the method of Common Spatial Patterns (CSP) has been shown to achieve excellent classification results. The CSP algorithm, however, suffers from a lack of robustness, requiring training data without artifacts for good performance. To overcome this lack of robustness, we propose an adaptive spatial filter that replaces the training data in the CSP approach by a-priori information. More specifically, we design an adaptive spatial filter that maximizes the ratio of the variance of the electric field originating in a predefined region of interest (ROI) to the overall variance of the measured EEG. Since it is known that the component of the EEG used for discriminating imaginary movements originates in the motor cortex, we design two adaptive spatial filters with the ROIs centered in the hand areas of the left and right motor cortex. We then use these to classify EEG data recorded during imaginary movements of the right and left hand of three subjects, and show that the adaptive spatial filters outperform the CSP algorithm, enabling classification rates of up to 94.7 % without artifact rejection. 1
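The core operation stated in this abstract — a spatial filter w maximizing the ratio of ROI-attributed variance to the overall EEG variance — is a Rayleigh-quotient problem, which the sketch below solves as a generalized eigenvalue problem. How the ROI covariance is obtained (e.g. from a leadfield/forward model) is not specified here; `cov_roi`, `cov_eeg`, and the SciPy call are assumptions of this illustration, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def roi_variance_ratio_filter(cov_roi, cov_eeg):
    """Return w maximizing (w' cov_roi w) / (w' cov_eeg w): the leading vector of
    the generalized symmetric eigenproblem cov_roi w = lambda * cov_eeg w."""
    vals, vecs = eigh(cov_roi, cov_eeg)   # eigenvalues in ascending order
    w = vecs[:, -1]
    return w / np.linalg.norm(w)

# usage sketch: project the raw channels onto the filter to obtain one ROI signal
# roi_signal = eeg_data @ roi_variance_ratio_filter(cov_roi, cov_eeg)
```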

same-paper 2 0.8791576 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

3 0.84369934 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . 
t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. 
For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . 
, λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. 
Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. 
For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). 
t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. 
The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. 
8 Related Work and Discussion

We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms in both the online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is threefold, as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed "online convex programming" by Zinkevich [19] and was first presented by Gordon in [9]. There is a voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically carry over well-known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework by setting $f(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|^2$. In parallel with and independently of this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential-based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality used in [17] enables us to analyze boosting algorithms based on the framework. Many authors have derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscore an interesting perspective on both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17], for two reasons. (i) The usage of Lagrange duality in [17] is somewhat restricted, while the notion of generalized Fenchel duality is more appropriate for the general and broader problems we consider in this paper. (ii) The strong convexity property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instance, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities.

References

[1] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[4] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002.
[5] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. JMLR, 7, Mar 2006.
[6] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995.
[7] Y. Freund and R. E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996.
[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000.
[9] G. Gordon. Regret bounds for prediction problems. In COLT, 1999.
[10] G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006.
[11] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001.
[12] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3), 2001.
[13] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[14] Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005.
[15] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999.
[16] S. Shalev-Shwartz and Y. Singer. Convex repeated games and Fenchel duality. Technical report, The Hebrew University, 2006.
[17] S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006.
[18] J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999.
[19] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

4 0.83544248 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

Author: Hamed Valizadegan, Rong Jin

Abstract: Maximum margin clustering was proposed recently and has shown promising performance in recent studies [1, 2]. It extends the theory of support vector machines to unsupervised learning. Despite its good performance, there are three major problems with maximum margin clustering that limit its usefulness for real-world applications. First, it is computationally expensive and difficult to scale to large datasets because the number of parameters in maximum margin clustering is quadratic in the number of examples. Second, it requires data preprocessing to ensure that any clustering boundary passes through the origin, which makes it unsuitable for clustering unbalanced datasets. Third, it is sensitive to the choice of kernel function and requires an external procedure to determine the appropriate values of the kernel parameters. In this paper, we propose a "generalized maximum margin clustering" framework that addresses these three problems simultaneously. The new framework generalizes the maximum margin clustering algorithm by allowing clustering boundaries that need not pass through the origin. It significantly improves computational efficiency by reducing the number of parameters. Furthermore, the new framework is able to automatically determine an appropriate kernel matrix without any labeled data. Finally, we show a formal connection between maximum margin clustering and spectral clustering. We demonstrate the efficiency of the generalized maximum margin clustering algorithm on both synthetic datasets and real datasets from the UCI repository. 1

5 0.82527298 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1
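The abstract does not spell out the ILK/SILK updates, so the following Python sketch is only a generic point of reference rather than the paper's method: a standard budgeted online kernel learner with a hinge-loss check and removal of the oldest support vector, which conveys the flavor of bounded-memory online learning in a reproducing kernel Hilbert space. All names and parameter choices here are our own illustrative assumptions.

```python
import numpy as np

class BudgetedOnlineKernelLearner:
    """Generic budgeted online kernel learner (illustrative only, not ILK/SILK)."""

    def __init__(self, kernel, eta=0.1, budget=100):
        self.kernel, self.eta, self.budget = kernel, eta, budget
        self.support, self.coeffs = [], []

    def predict(self, x):
        # Kernel expansion f(x) = sum_j c_j k(s_j, x) over stored support vectors.
        return sum(c * self.kernel(s, x) for s, c in zip(self.support, self.coeffs))

    def update(self, x, y):
        # Hinge-loss check: add a support vector only on a margin error.
        if y * self.predict(x) < 1.0:
            self.support.append(x)
            self.coeffs.append(self.eta * y)
            if len(self.support) > self.budget:      # crude bounded-memory rule
                self.support.pop(0)
                self.coeffs.pop(0)

rbf = lambda a, b: np.exp(-np.linalg.norm(np.asarray(a) - np.asarray(b)) ** 2)
learner = BudgetedOnlineKernelLearner(rbf, eta=0.5, budget=50)
learner.update([0.0, 1.0], +1)
print(learner.predict([0.0, 0.9]))
```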

6 0.82075208 65 nips-2006-Denoising and Dimension Reduction in Feature Space

7 0.81974292 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

8 0.81393725 126 nips-2006-Logistic Regression for Single Trial EEG Classification

9 0.81341279 79 nips-2006-Fast Iterative Kernel PCA

10 0.81175959 20 nips-2006-Active learning for misspecified generalized linear models

11 0.80755633 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

12 0.80371827 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

13 0.80265647 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

14 0.80006516 150 nips-2006-On Transductive Regression

15 0.79796135 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

16 0.79722059 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

17 0.79622185 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

18 0.79526889 168 nips-2006-Reducing Calibration Time For Brain-Computer Interfaces: A Clustering Approach

19 0.79499453 163 nips-2006-Prediction on a Graph with a Perceptron

20 0.79272366 117 nips-2006-Learning on Graph with Laplacian Regularization