nips nips2006 nips2006-203 knowledge-graph by maker-knowledge-mining

203 nips-2006-implicit Online Learning with Kernels

Source: pdf

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our ﬁrst algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 au Abstract We present two new algorithms for online learning in reproducing kernel Hilbert spaces. [sent-17, score-0.092]

2 Our ﬁrst algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. [sent-18, score-0.23]

3 We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. [sent-19, score-0.052]

4 We prove loss bounds and analyze the convergence rate of both. [sent-20, score-0.096]

5 1 Introduction Online learning refers to a paradigm where, at each time t, an instance xt ∈ X is presented to a learner, which uses its parameter vector ft to predict a label. [sent-22, score-0.776]

6 This predicted label is then compared to the true label yt , via a non-negative, piecewise differentiable, convex loss function L(xt , yt , ft ). [sent-23, score-1.67]

7 The learner then updates its parameter vector to minimize a risk functional, and the process repeats. [sent-24, score-0.109]

8 The parameter updates are then derived via the principle ft+1 = argmin Jt (f ) := argmin{∆G (f, ft ) + ηt R(xt , yt , f )}, (1) f f where ηt is the learning rate. [sent-26, score-1.042]

9 Using the fact that ∂f ∆G (f, ft ) = ∂f G(f ) − ∂f G(ft ), one obtains ∂f G(ft+1 ) = ∂f G(ft ) − ηt ∂f R(xt , yt , ft+1 ). [sent-28, score-0.986]

10 (3) 1 In particular, if we set G(f ) = 2 ||f ||2 , then ∆G (f, ft ) = 1 ||f − ft ||2 and ∂f G(f ) = f , and we 2 obtain the familiar stochastic gradient descent update ft+1 = ft − ηt ∂f R(xt , yt , ft ). [sent-30, score-2.21]

11 (4) We are interested in applying online learning updates in a reproducing kernel Hilbert space (RKHS). [sent-31, score-0.142]

12 Since ∂f R(xt , yt , ft ) = λft + C · ∂f L(xt , yt , ft ), one can use (4) to obtain an ˜ explicit update ft+1 = (1 − ηt λ)ft − ηt C · ∂f L(xt , yt , ft ), which combined with (6) shows that there must exist coefﬁcients αi,˜ fully specifying ft+1 via y t αi,˜k((xi , y ), ·). [sent-35, score-2.951]

13 ˜ y ft+1 = (7) i=1 y ∈Y ˜ In this paper we propose an algorithm ILK (implicit online learning with kernels) that solves (2) directly, while still expressing updates in the form (7). [sent-36, score-0.106]

14 That is, we derive a technique for computing the implicit update that can be applied to many popular loss functions, including quadratic, hinge, and logistic losses, as well as their extensions to structured domains (see e. [sent-37, score-0.185]

15 We also provide a general recipe to check if a new convex loss function is amenable to these implicit updates. [sent-40, score-0.171]

16 Furthermore, to reduce the memory requirement of ILK, which grows linearly with the number of observations (instance-label pairs), we propose a sparse variant SILK (sparse ILK) that approximates the decision function f by truncating past observations with insigniﬁcant weights. [sent-41, score-0.102]

17 2 Implicit Updates in an RKHS As shown in (1), to perform an implicit update one needs to minimize ∆G (f, ft ) + R(xt , yt , f ). [sent-42, score-1.054]

18 By replacing R(xt , yt , f ) with (5), and setting G(f ) = 1 ||f ||2 , one obtains H 2 ft+1 = arg min J(f ) = argmin f f 1 ||f − ft ||2 + ηt H 2 λ ||f ||2 + C · L(xt , yt , f ) . [sent-43, score-1.58]

19 H 2 (8) Since L is assumed convex with respect to f , setting ∂f J = 0 and using an auxiliary variable ηt τt = 1+ηλ λ yields t ft+1 = (1 − τt )ft − (1 − τt )ηt C∂f L(xt , yt , ft+1 ). [sent-44, score-0.594]

20 Since y βt,˜k((xt , y ), ·), ˜ y ∂f L(xt , yt , ft+1 ) = y ∈Y ˜ and for ease of exposition, we assume a ﬁxed step size (learning rate) ηt = 1, consequently τt = τ , it follows from (9) and (10) that αi,˜ = (1 − τ )αi,˜ y y αt,˜ = −(1 − τ )Cβt,˜ y y for i = 1, . [sent-49, score-0.589]

21 We now elucidate the details for some y well-known loss functions. [sent-61, score-0.076]

22 Square Loss In this case, k((xt , yt ), ·) = k(xt , ·). [sent-62, score-0.576]

23 Furthermore, we assume that Y = R, and write 1 1 L(xt , yt , f ) = (f (xt ) − yt )2 = ( f (·), k(xt , ·) H − yt )2 , 2 2 which yields ∂f L(xt , yt , f ) = (f (xt ) − yt ) k(xt , ·). [sent-64, score-2.88]

24 (13) Substituting into (12) and using (9) we have αt = −(1 − τ )C((1 − τ )ft (xt ) + αt k(xt , xt ) − yt ). [sent-65, score-0.954]

25 1 + C(1 − τ )k(xt , xt ) Binary Hinge Loss As before, we assume k((xt , yt ), ·) = k(xt , ·), and set Y = {±1}. [sent-67, score-0.954]

26 The hinge loss for binary classiﬁcation can be written as L(xt , yt , f ) = (ρ − yt f (xt ))+ = (ρ − yt f, k(xt , ·) H )+ , (14) where ρ > 0 is the margin parameter, and (·)+ := max(0, ·). [sent-68, score-1.895]

27 Recall that the subgradient is a set, and the function is said to be differentiable at a point if this set is a singleton [4]. [sent-69, score-0.051]

28 The binary hinge loss is not differentiable at the hinge point, but its subgradient exists everywhere. [sent-70, score-0.255]

29 Writing ∂f L(xt , yt , f ) = βt k(xt , ·) we have: yt f (xt ) > ρ =⇒ βt = 0; yt f (xt ) = ρ =⇒ βt ∈ [0, −yt ]; yt f (xt ) < ρ =⇒ βt = −yt . [sent-71, score-2.304]

30 On one hand we want the loss to be zero, which can be achieved by setting ρ − yt ft+1 (xt ) = 0. [sent-73, score-0.652]

31 On the other hand, the gradient of the loss at the new point ∂f L(xt , yt , ft+1 ) must satisfy (15). [sent-74, score-0.665]

32 Let αt denote the optimal estimate of αt which leads to ρ − yt ft+1 (xt ) = 0. [sent-76, score-0.576]

33 Using (9) we have ˆ ρ − yt ((1 − τ )ft (xt ) + αt k(xt , xt )) = 0, which yields ˆ αt = ˆ yt (ρ − (1 − τ )yt ft (xt )) ρ − (1 − τ )yt ft (xt ) = . [sent-77, score-2.326]

34 yt k(xt , xt ) k(xt , xt ) On the other hand, by using (15) and (12) we scenarios, we arrive at the ﬁnal update  ˆ αt αt = 0  yt (1 − τ )C have αt yt ∈ [0, (1 − τ )C]. [sent-78, score-2.524]

35 By combining the two if yt αt ∈ [0, (1 − τ )C]; ˆ if yt αt < 0; ˆ if yt αt > (1 − τ )C. [sent-79, score-1.728]

36 ˆ (16) The updates for the hinge loss used in novelty detection are very similar. [sent-80, score-0.243]

37 Graph Structured Loss The graph-structured loss on label domain can be written as L(xt , yt , f ) = ˜ ˜ −f (xt , yt ) + max(∆(yt , y ) + f (xt , y )) y =yt ˜ . [sent-81, score-1.241]

38 This a very general loss, which includes binary and multiclass hinge loss as special cases (see e. [sent-83, score-0.176]

39 ˜ ˜ Let y ∗ = argmaxy=yt {∆(yt , y ) + ft (xt , y )} denote the best runner-up label for current instance ˜ xt . [sent-87, score-0.789]

40 Then set αt,yt = −αt,y∗ = αt , use kt (y, y ) to denote k((xt , y), (xt , y )) and write αt = ˆ −(1 − τ )ft (xt , yt ) + ∆(yt , y ∗ ) + (1 − τ )ft (xt , y ∗ ) . [sent-88, score-0.646]

41 (kt (yt , yt ) + kt (y ∗ , y ∗ ) − 2kt (yt , y ∗ )) The updates are now given by  0 αt = αt ˆ  (1 − τ )C if αt < 0; ˆ if αt ∈ [0, (1 − τ )C]; ˆ if αt > (1 − τ )C. [sent-89, score-0.696]

42 ˆ (18) Logisitic Regression Loss The logistic regression loss and its gradient can be written as −yt k(xt , ·) . [sent-90, score-0.103]

43 L(xt , yt , f ) = log (1 + exp(−yt f (xt ))) , ∂f L(xt , yt , f ) = 1 + exp(yt f (xt )) respectively. [sent-91, score-1.152]

44 1 + exp(yt (1 − τ )ft (xt ) + αt yt k(xt , xt )) Although this equation does not give a closed-form solution, the value of αt can still be obtained by using a numerical root-ﬁnding routine, such as those described in [5]. [sent-93, score-0.954]

45 1 ILK and SILK Algorithms We refer to the algorithm that performs implicit updates as ILK, for “implicit online learning with kernels”. [sent-95, score-0.157]

46 ˜ y ft+1 = (19) i=1 y ∈Y ˜ Intuitively, the parameter τ ∈ (0, 1) (determined by λ and η) trades off between the regularizer and the loss on the current sample. [sent-98, score-0.076]

47 In the case of hinge losses—both binary and graph structured—the weight |αt | is always upper bounded by (1 − τ )C, which ensures limited inﬂuence from outliers (cf. [sent-99, score-0.095]

48 A major drawback of the ILK algorithm described above is that the size of the kernel expansion grows linearly with the number of data points up to time t (see (10)). [sent-101, score-0.059]

49 In many practical domains, where real time prediction is important (for example, video surveillance), storing all the past observations and their coefﬁcients is prohibitively expensive. [sent-102, score-0.089]

50 [3] one can truncate the function expansion by storing only a few relevant past observations. [sent-105, score-0.058]

51 Speciﬁcally, the SILK algorithm maintains a buffer of size ω. [sent-107, score-0.137]

52 Each new point is inserted into the buffer with coefﬁcient αt . [sent-108, score-0.113]

53 Once the buffer limit ω is exceeded, the point with the lowest coefﬁcient value is discarded to maintain a bound on memory usage. [sent-109, score-0.136]

54 It is relatively straightforward to show that the difference between the true predictor and its truncated version obtained by storing only ω expansion coefﬁcients decreases exponentially as the buffer size ω increases [2]. [sent-113, score-0.225]

55 3 Theoretical Analysis In this section we will primarily focus on analyzing the graph-structured loss (17), establishing relative loss bounds and analyzing the rate of convergence of ILK and SILK. [sent-114, score-0.172]

56 Although the bounds we obtain are similar to those obtained in [2], our experimental results clearly show that ILK and SILK are stronger than the NORMA strategy of [2] and its truncated variant. [sent-118, score-0.059]

57 , fT ) : ft ∈ H} is said to be (T, B, D1 , D2 ) bounded if it satisﬁes ||ft ||2 ≤ B 2 ∀t ∈ {1, . [sent-124, score-0.426]

58 Given a ﬁxed sequence of observations {(x1 , y1 ), . [sent-129, score-0.048]

59 , (xT , yT )}, and a sequence of hypotheses {(f1 , . [sent-132, score-0.069]

60 , fT ) ∈ F}, the number of errors M is deﬁned as ∗ M := |{t : ∆f (xt , yt , yt ) ≤ 0}| , ∗ ∗ ∗ where ∆f (xt , yt , yt ) = f (xt , yt ) − f (xt , yt ) and yt is the best runner-up label. [sent-135, score-4.032]

61 To keep the equations succinct, we denote ∆kt ((yt , y), ·) := k((xt , yt ), ·)−k((xt , y), ·), and ∆kt ((yt , y), (yt , y)) := ∆kt ((yt , y), ·) 2 = kt (yt , yt ) − 2kt (yt , y) + kt (y, y). [sent-136, score-1.292]

62 In the following we bound the number H of mistakes M made by ILK by the cumulative loss of an arbitrary sequence of hypotheses from F(T, B, D1 , D2 ). [sent-137, score-0.259]

63 , (xT , yT )} be an arbitrary sequence of observations such that ∆kt ((yt , y), (yt , y)) ≤ X 2 holds for any t, any y, and for some X > 0. [sent-141, score-0.048]

64 In particular, if we assume the sequence of hypotheses (g1 , · · · , gT ) ∈ F(T, B, D1 = 0, D2 = 0) and the cumulative loss K = 0, we obtain a bound on the number of mistakes B2X 2 . [sent-144, score-0.259]

65 , (xT , yT )} be an arbitrary sequence of observations such that ∆kt ((yt , yt ), (yt , yt )) ≤ X 2 holds for any t, any y. [sent-150, score-1.2]

66 , fT ) the sequence of hypotheses T produced by ILK with learning rate ηt = ηt−1/2 , t=1 R(xt , yt , ft ) the cumulative risk of this T sequence, and t=1 R(xt , yt , g) the batch cumulative risk of (g, . [sent-154, score-1.808]

67 Then T T R(xt , yt , g) + aT 1/2 + b, R(xt , yt , ft ) ≤ t=1 where U = CX λ , 2 t=1 2 a = 4ηC X + 2U 2 η , and b = U2 2η are constants. [sent-158, score-1.55]

68 In particular, if T g ∗ = arg min g∈H we obtain 1 T T R(xt , yt , ft ) ≤ t=1 1 T R(xt , yt , g), t=1 T R(xt , yt , g ∗ ) + O(T −1/2 ). [sent-159, score-2.126]

69 In addition, denote g ∗ the minimizer of the batch learning cumulaη tive risk t R(xt , yt , g), and f ∗ the minimizer of the minimum expected risk with R(f ∗ ) := minf E(x,y)∼P (x,y) R(x, y, f ). [sent-161, score-0.691]

70 As stated in [6] for the structured risk minimization framework, as 3500 3000 NORMA vs. [sent-162, score-0.063]

71 NORMA(0) 160 Mistakes of ILK 180 Figure 1: The left panel depicts a synthetic data sequence containing two classes (blue crosses and red diamonds, see the zoomed-in portion in bottom-left corner), with each class being sampled from a mixture of two drifting Gaussian distributions. [sent-164, score-0.062]

72 Performance comparison of ILK vs NORMA and truncated NORMA on this data: Average cumulative error over 100 trials (middle), and average cumulative error each trial (right). [sent-165, score-0.124]

73 This subsequently guarantees the convergence of the average regularized risk of ILK and SILK to R(f ∗ ). [sent-167, score-0.048]

74 Let f denote the average hypothesis produced by averaging over all hypotheses f1 , . [sent-170, score-0.057]

75 Then for any δ ∈ (0, 1), with probability ¯ at least 1 − δ, the expected risk of f is upper bounded by the risk of the best hypothesis chosen in hindsight plus a term which grows as O 4 1 T . [sent-174, score-0.149]

76 In addition, we ﬁxed the margin to ρ = 1 for all our loss functions. [sent-178, score-0.089]

77 Here, ILK and SILK make fewer mistakes than NORMA and truncated NORMA. [sent-182, score-0.105]

78 As expected, both these variants make more mistakes because they are unable to forget the past, which is crucial for obtaining good performance in a non-stationary environment. [sent-184, score-0.066]

79 We formulate the problem as a novelty detection task using a network of classiﬁers, one for each pixel. [sent-190, score-0.056]

80 A constant buffer size ω = 20 is used for both algorithms in this application. [sent-194, score-0.126]

81 The ﬁrst task is to identify people, under varying lighting conditions, in an indoor video sequence taken with a static camera. [sent-196, score-0.096]

82 As can be seen, SILK is able to recover from the change in lighting condition better than NORMA, and is able to identify foreground objects reasonably close to the ground truth. [sent-200, score-0.045]

83 4 0 1 2 3 4 False Positive 5 6 −3 x 10 NORMA SILK Figure 2: Performance comparison of SILK vs truncated NORMA on a background subtraction (moving object detection) task, with varying lighting conditions. [sent-207, score-0.097]

84 Our second experiment is a trafﬁc sequence taken by a camera that shakes irregularly, which creates a challenging problem for any novelty detection algorithm. [sent-211, score-0.086]

85 As seen from the randomly chosen frames plotted in Figure 3 SILK manages to obtain a visually plausible detection result. [sent-212, score-0.059]

86 A polynomial kernel of degree 9 and a buffer size of ω = 128 is employed for all three algorithms. [sent-217, score-0.141]

87 Figure 4 (b) examines the effect of buffer size on SILK. [sent-222, score-0.126]

88 As expected, smaller buffer sizes result in larger truncation error and hence worse performance. [sent-223, score-0.113]

89 With increasing buffer size the asymptotic average error decreases. [sent-224, score-0.126]

90 Figure 4 (c) shows that SILK consistently outperforms NORMA and SVMD, while the trend with the increasing buffer size is repeated, as shown in Figure 4 (d). [sent-226, score-0.137]

91 5 Outlook and Discussion In this paper we presented a general recipe for performing implicit online updates in an RKHS. [sent-229, score-0.183]

92 Speciﬁcally, we showed that for many popular loss functions these updates can be computed efﬁciently. [sent-230, score-0.126]

93 For graph-structured loss we also showed loss bounds and rates of convergence. [sent-232, score-0.172]

94 For the binary hinge loss, when τ = 0 the proposed update formula for αt (16) reduces to the PA-I algorithm of Crammer et al. [sent-234, score-0.122]

95 Curiously enough, the motivation for the updates in both cases seems completely different. [sent-236, score-0.05]

96 While we use an implicit update formula Crammer et al. [sent-237, score-0.095]

97 (d) Performance of SILK on three different buffer sizes. [sent-242, score-0.113]

98 Furthermore, the loss functions they handle are generally linear (hinge loss and its various generalizations) while our updates can handle other non-linear losses such as quadratic or logistic loss. [sent-244, score-0.239]

99 Our analysis of loss bounds is admittedly straightforward given current results. [sent-245, score-0.11]

100 The use of more sophisticated analysis and extending our bounds to deal with other non-linear loss functions is ongoing. [sent-246, score-0.11]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('yt', 0.576), ('ft', 0.398), ('xt', 0.378), ('ilk', 0.345), ('silk', 0.309), ('norma', 0.238), ('buffer', 0.113), ('loss', 0.076), ('kt', 0.07), ('mistakes', 0.066), ('hinge', 0.061), ('svmd', 0.059), ('kivinen', 0.057), ('online', 0.056), ('implicit', 0.051), ('updates', 0.05), ('risk', 0.048), ('hypotheses', 0.039), ('truncated', 0.039), ('cumulative', 0.037), ('rkhs', 0.033), ('vishwanathan', 0.03), ('novelty', 0.03), ('ict', 0.03), ('sequence', 0.03), ('update', 0.029), ('lighting', 0.029), ('jt', 0.028), ('recipe', 0.026), ('video', 0.026), ('detection', 0.026), ('australia', 0.025), ('gt', 0.025), ('past', 0.024), ('ocr', 0.024), ('subgradient', 0.023), ('losses', 0.023), ('multiclass', 0.022), ('crammer', 0.022), ('frames', 0.022), ('storing', 0.021), ('reproducing', 0.021), ('bounds', 0.02), ('panel', 0.02), ('batch', 0.019), ('hypothesis', 0.018), ('argmin', 0.018), ('convex', 0.018), ('grows', 0.018), ('observations', 0.018), ('subtraction', 0.018), ('bounded', 0.017), ('differentiable', 0.017), ('schraudolph', 0.017), ('dale', 0.017), ('binary', 0.017), ('foreground', 0.016), ('switched', 0.016), ('adaptation', 0.016), ('hilbert', 0.016), ('structured', 0.015), ('australian', 0.015), ('mistake', 0.015), ('kernel', 0.015), ('et', 0.015), ('princeton', 0.014), ('traf', 0.014), ('coef', 0.014), ('straightforward', 0.014), ('sophisticated', 0.014), ('logistic', 0.014), ('moving', 0.014), ('margin', 0.013), ('size', 0.013), ('gradient', 0.013), ('instantaneous', 0.013), ('mnist', 0.013), ('expansion', 0.013), ('label', 0.013), ('sparse', 0.012), ('theorem', 0.012), ('obtains', 0.012), ('predictor', 0.012), ('excellence', 0.012), ('requirements', 0.012), ('memory', 0.012), ('synthetic', 0.012), ('kernels', 0.012), ('learner', 0.011), ('said', 0.011), ('static', 0.011), ('outperforms', 0.011), ('national', 0.011), ('bound', 0.011), ('maintains', 0.011), ('seen', 0.011), ('vs', 0.011), ('arrive', 0.011), ('decay', 0.011), ('roc', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

2 0.35655382 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

Abstract: We describe and analyze an algorithmic framework for online classiﬁcation where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by deﬁning a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1

3 0.27930862 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

Author: Gloria Haro, Gregory Randall, Guillermo Sapiro

Abstract: The study of point cloud data sampled from a stratiﬁcation, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artiﬁcial and real examples demonstrating the importance of extending manifold learning to stratiﬁcation learning. 1

4 0.23474219 79 nips-2006-Fast Iterative Kernel PCA

Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan

Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems conﬁrm that our methods converge substantially faster than conventional KHA. 1

5 0.22067302 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

Abstract: Adaptively optimizing experiments can signiﬁcantly reduce the number of trials needed to characterize neural responses using parametric statistical models. However, the potential for these methods has been limited to date by severe computational challenges: choosing the stimulus which will provide the most information about the (typically high-dimensional) model parameters requires evaluating a high-dimensional integration and optimization in near-real time. Here we present a fast algorithm for choosing the optimal (most informative) stimulus based on a Fisher approximation of the Shannon information and specialized numerical linear algebra techniques. This algorithm requires only low-rank matrix manipulations and a one-dimensional linesearch to choose the stimulus and is therefore efﬁcient even for high-dimensional stimulus and parameter spaces; for example, we require just 15 milliseconds on a desktop computer to optimize a 100-dimensional stimulus. Our algorithm therefore makes real-time adaptive experimental design feasible. Simulation results show that model parameters can be estimated much more efﬁciently using these adaptive techniques than by using random (nonadaptive) stimuli. Finally, we generalize the algorithm to efﬁciently handle both fast adaptation due to spike-history effects and slow, non-systematic drifts in the model parameters. Maximizing the efﬁciency of data collection is important in any experimental setting. In neurophysiology experiments, minimizing the number of trials needed to characterize a neural system is essential for maintaining the viability of a preparation and ensuring robust results. As a result, various approaches have been developed to optimize neurophysiology experiments online in order to choose the “best” stimuli given prior knowledge of the system and the observed history of the cell’s responses. The “best” stimulus can be deﬁned a number of different ways depending on the experimental objectives. One reasonable choice, if we are interested in ﬁnding a neuron’s “preferred stimulus,” is the stimulus which maximizes the ﬁring rate of the neuron [1, 2, 3, 4]. Alternatively, when investigating the coding properties of sensory cells it makes sense to deﬁne the optimal stimulus in terms of the mutual information between the stimulus and response [5]. Here we take a system identiﬁcation approach: we deﬁne the optimal stimulus as the one which tells us the most about how a neural system responds to its inputs [6, 7]. We consider neural systems in † ‡ http://www.prism.gatech.edu/∼gtg120z http://www.stat.columbia.edu/∼liam which the probability p(rt |{xt , xt−1 , ..., xt−tk }, {rt−1 , . . . , rt−ta }) of the neural response rt given the current and past stimuli {xt , xt−1 , ..., xt−tk }, and the observed recent history of the neuron’s activity, {rt−1 , . . . , rt−ta }, can be described by a model p(rt |{xt }, {rt−1 }, θ), speciﬁed by a ﬁnite vector of parameters θ. Since we estimate these parameters from experimental trials, we want to choose our stimuli so as to minimize the number of trials needed to robustly estimate θ. Two inconvenient facts make it difﬁcult to realize this goal in a computationally efﬁcient manner: 1) model complexity — we typically need a large number of parameters to accurately model a system’s response p(rt |{xt }, {rt−1 }, θ); and 2) stimulus complexity — we are typically interested in neural responses to stimuli xt which are themselves very high-dimensional (e.g., spatiotemporal movies if we are dealing with visual neurons). In particular, it is computationally challenging to 1) update our a posteriori beliefs about the model parameters p(θ|{rt }, {xt }) given new stimulus-response data, and 2) ﬁnd the optimal stimulus quickly enough to be useful in an online experimental context. In this work we present methods for solving these problems using generalized linear models (GLM) for the input-output relationship p(rt |{xt }, {rt−1 }, θ) and certain Gaussian approximations of the posterior distribution of the model parameters. Our emphasis is on ﬁnding solutions which scale well in high dimensions. We solve problem (1) by using efﬁcient rank-one update methods to update the Gaussian approximation to the posterior, and problem (2) by a reduction to a highly tractable onedimensional optimization problem. Simulation results show that the resulting algorithm produces a set of stimulus-response pairs which is much more informative than the set produced by random sampling. Moreover, the algorithm is efﬁcient enough that it could feasibly run in real-time. Neural systems are highly adaptive and more generally nonstatic. A robust approach to optimal experimental design must be able to cope with changes in θ. We emphasize that the model framework analyzed here can account for three key types of changes: stimulus adaptation, spike rate adaptation, and random non-systematic changes. Adaptation which is completely stimulus dependent can be accounted for by including enough stimulus history terms in the model p(rt |{xt , ..., xt−tk }, {rt−1 , ..., rt−ta }). Spike-rate adaptation effects, and more generally spike history-dependent effects, are accounted for explicitly in the model (1) below. Finally, we consider slow, non-systematic changes which could potentially be due to changes in the health, arousal, or attentive state of the preparation. Methods We model a neuron as a point process whose conditional intensity function (instantaneous ﬁring rate) is given as the output of a generalized linear model (GLM) [8, 9]. This model class has been discussed extensively elsewhere; brieﬂy, this class is fairly natural from a physiological point of view [10], with close connections to biophysical models such as the integrate-and-ﬁre cell [9], and has been applied in a wide variety of experimental settings [11, 12, 13, 14]. The model is summarized as: tk λt = E(rt ) = f ta aj rt−j ki,t−l xi,t−l + i l=1 (1) j=1 In the above summation the ﬁlter coefﬁcients ki,t−l capture the dependence of the neuron’s instantaneous ﬁring rate λt on the ith component of the vector stimulus at time t − l, xt−l ; the model therefore allows for spatiotemporal receptive ﬁelds. For convenience, we arrange all the stimulus coefﬁcients in a vector, k, which allows for a uniform treatment of the spatial and temporal components of the receptive ﬁeld. The coefﬁcients aj model the dependence on the observed recent activity r at time t − j (these terms may reﬂect e.g. refractory effects, burstiness, ﬁring-rate adaptation, etc., depending on the value of the vector a [9]). For convenience we denote the unknown parameter vector as θ = {k; a}. The experimental objective is the estimation of the unknown ﬁlter coefﬁcients, θ, given knowledge of the stimuli, xt , and the resulting responses rt . We chose the nonlinear stage of the GLM, the link function f (), to be the exponential function for simplicity. This choice ensures that the log likelihood of the observed data is a concave function of θ [9]. Representing and updating the posterior. As emphasized above, our ﬁrst key task is to efﬁciently update the posterior distribution of θ after t trials, p(θt |xt , rt ), as new stimulus-response pairs are trial 100 trial 500 trial 2500 trial 5000 θ true 1 info. max. trial 0 0 random −1 (a) random info. max. 2000 Time(Seconds) Entropy 1500 1000 500 0 −500 0 1000 2000 3000 Iteration (b) 4000 5000 0.1 total time diagonalization posterior update 1d line Search 0.01 0.001 0 200 400 Dimensionality 600 (c) Figure 1: A) Plots of the estimated receptive ﬁeld for a simulated visual neuron. The neuron’s receptive ﬁeld θ has the Gabor structure shown in the last panel (spike history effects were set to zero for simplicity here, a = 0). The estimate of θ is taken as the mean of the posterior, µt . The images compare the accuracy of the estimates using information maximizing stimuli and random stimuli. B) Plots of the posterior entropies for θ in these two cases; note that the information-maximizing stimuli constrain the posterior of θ much more effectively than do random stimuli. C) A plot of the timing of the three steps performed on each iteration as a function of the dimensionality of θ. The timing for each step was well-ﬁt by a polynomial of degree 2 for the diagonalization, posterior update and total time, and degree 1 for the line search. The times are an average over many iterations. The error-bars for the total time indicate ±1 std. observed. (We use xt and rt to abbreviate the sequences {xt , . . . , x0 } and {rt , . . . , r0 }.) To solve this problem, we approximate this posterior as a Gaussian; this approximation may be justiﬁed by the fact that the posterior is the product of two smooth, log-concave terms, the GLM likelihood function and the prior (which we assume to be Gaussian, for simplicity). Furthermore, the main theorem of [7] indicates that a Gaussian approximation of the posterior will be asymptotically accurate. We use a Laplace approximation to construct the Gaussian approximation of the posterior, p(θt |xt , rt ): we set µt to the peak of the posterior (i.e. the maximum a posteriori (MAP) estimate of θ), and the covariance matrix Ct to the negative inverse of the Hessian of the log posterior at µt . In general, computing these terms directly requires O(td2 + d3 ) time (where d = dim(θ); the time-complexity increases with t because to compute the posterior we must form a product of t likelihood terms, and the d3 term is due to the inverse of the Hessian matrix), which is unfortunately too slow when t or d becomes large. Therefore we further approximate p(θt−1 |xt−1 , rt−1 ) as Gaussian; to see how this simpliﬁes matters, we use Bayes to write out the posterior: 1 −1 log p(θ|rt , xt ) = − (θ − µt−1 )T Ct−1 (θ − µt−1 ) + − exp {xt ; rt−1 }T θ 2 + rt {xt ; rt−1 }T θ + const d log p(θ|rt , xt ) −1 = −(θ − µt−1 )T Ct−1 + (2) − exp({xt ; rt−1 }T θ) + rt {xt ; rt−1 }T dθ d2 log p(θ|rt , xt ) −1 = −Ct−1 − exp({xt ; rt−1 }T θ){xt ; rt−1 }{xt ; rt−1 }T dθi dθj (3) Now, to update µt we only need to ﬁnd the peak of a one-dimensional function (as opposed to a d-dimensional function); this follows by noting that that the likelihood only varies along a single direction, {xt ; rt−1 }, as a function of θ. At the peak of the posterior, µt , the ﬁrst term in the gradient must be parallel to {xt ; rt−1 } because the gradient is zero. Since Ct−1 is non-singular, µt − µt−1 must be parallel to Ct−1 {xt ; rt−1 }. Therefore we just need to solve a one dimensional problem now to determine how much the mean changes in the direction Ct−1 {xt ; rt−1 }; this requires only O(d2 ) time. Moreover, from the second derivative term above it is clear that computing Ct requires just a rank-one matrix update of Ct−1 , which can be evaluated in O(d2 ) time via the Woodbury matrix lemma. Thus this Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) provides a large gain in efﬁciency; our simulations (data not shown) showed that, despite this improved efﬁciency, the loss in accuracy due to this approximation was minimal. Deriving the (approximately) optimal stimulus. To simplify the derivation of our maximization strategy, we start by considering models in which the ﬁring rate does not depend on past spiking, so θ = {k}. To choose the optimal stimulus for trial t + 1, we want to maximize the conditional mutual information I(θ; rt+1 |xt+1 , xt , rt ) = H(θ|xt , rt ) − H(θ|xt+1 , rt+1 ) (4) with respect to the stimulus xt+1 . The ﬁrst term does not depend on xt+1 , so maximizing the information requires minimizing the conditional entropy H(θ|xt+1 , rt+1 ) = p(rt+1 |xt+1 ) −p(θ|rt+1 , xt+1 ) log p(θ|rt+1 , xt+1 )dθ = Ert+1 |xt+1 log det[Ct+1 ] + const. rt+1 (5) We do not average the entropy of p(θ|rt+1 , xt+1 ) over xt+1 because we are only interested in the conditional entropy for the particular xt+1 which will be presented next. The equality above is due to our Gaussian approximation of p(θ|xt+1 , rt+1 ). Therefore, we need to minimize Ert+1 |xt+1 log det[Ct+1 ] with respect to xt+1 . Since we set Ct+1 to be the negative inverse Hessian of the log-posterior, we have: −1 Ct+1 = Ct + Jobs (rt+1 , xt+1 ) −1 , (6) Jobs is the observed Fisher information. Jobs (rt+1 , xt+1 ) = −∂ 2 log p(rt+1 |ε = xt θ)/∂ε2 xt+1 xt t+1 t+1 (7) Here we use the fact that for the GLM, the likelihood depends only on the dot product, ε = xt θ. t+1 We can use the Woodbury lemma to evaluate the inverse: Ct+1 = Ct I + D(rt+1 , ε)(1 − D(rt+1 , ε)xt Ct xt+1 )−1 xt+1 xt Ct t+1 t+1 (8) where D(rt+1 , ε) = ∂ 2 log p(rt+1 |ε)/∂ε2 . Using some basic matrix identities, log det[Ct+1 ] = log det[Ct ] − log(1 − D(rt+1 , ε)xt Ct xt+1 ) t+1 = log det[Ct ] + D(rt+1 , ε)xt Ct xt+1 t+1 + o(D(rt+1 , ε)xt Ct xt+1 ) t+1 (9) (10) Ignoring the higher order terms, we need to minimize Ert+1 |xt+1 D(rt+1 , ε)xt Ct xt+1 . In our case, t+1 with f (θt xt+1 ) = exp(θt xt+1 ), we can use the moment-generating function of the multivariate Trial info. max. i.i.d 2 400 −10−4 0 0.05 −10−1 −2 ai 800 2 0 −2 −7 −10 i i.i.d k info. max. 1 1 50 i 1 50 i 1 10 1 i (a) i 100 0 −0.05 10 1 1 (b) i 10 (c) Figure 2: A comparison of parameter estimates using information-maximizing versus random stimuli for a model neuron whose conditional intensity depends on both the stimulus and the spike history. The images in the top row of A and B show the MAP estimate of θ after each trial as a row in the image. Intensity indicates the value of the coefﬁcients. The true value of θ is shown in the second row of images. A) The estimated stimulus coefﬁcients, k. B) The estimated spike history coefﬁcients, a. C) The ﬁnal estimates of the parameters after 800 trials: dashed black line shows true values, dark gray is estimate using information maximizing stimuli, and light gray is estimate using random stimuli. Using our algorithm improved the estimates of k and a. Gaussian p(θ|xt , rt ) to evaluate this expectation. After some algebra, we ﬁnd that to maximize I(θ; rt+1 |xt+1 , xt , rt ), we need to maximize 1 F (xt+1 ) = exp(xT µt ) exp( xT Ct xt+1 )xT Ct xt+1 . t+1 t+1 2 t+1 (11) Computing the optimal stimulus. For the GLM the most informative stimulus is undeﬁned, since increasing the stimulus power ||xt+1 ||2 increases the informativeness of any putatively “optimal” stimulus. To obtain a well-posed problem, we optimize the stimulus under the usual power constraint ||xt+1 ||2 ≤ e < ∞. We maximize Eqn. 11 under this constraint using Lagrange multipliers and an eigendecomposition to reduce our original d-dimensional optimization problem to a onedimensional problem. Expressing Eqn. 11 in terms of the eigenvectors of Ct yields: 1 2 2 F (xt+1 ) = exp( u i yi + ci yi ) ci yi (12) 2 i i i = g( 2 ci yi ) ui yi )h( i (13) i where ui and yi represent the projection of µt and xt+1 onto the ith eigenvector and ci is the corresponding eigenvalue. To simplify notation we also introduce the functions g() and h() which are monotonically strictly increasing functions implicitly deﬁned by Eqn. 12. We maximize F (xt+1 ) by breaking the problem into an inner and outer problem by ﬁxing the value of i ui yi and maximizing h() subject to that constraint. A single line search over all possible values of i ui yi will then ﬁnd the global maximum of F (.). This approach is summarized by the equation: max F (y) = max g(b) · y:||y||2 =e b max y:||y||2 =e,y t u=b 2 ci yi ) h( i Since h() is increasing, to solve the inner problem we only need to solve: 2 ci yi max y:||y||2 =e,y t u=b (14) i This last expression is a quadratic function with quadratic and linear constraints and we can solve it using the Lagrange method for constrained optimization. The result is an explicit system of 1 true θ random info. max. info. max. no diffusion 1 0.8 0.6 trial 0.4 0.2 400 0 −0.2 −0.4 800 1 100 θi 1 θi 100 1 θi 100 1 θ i 100 −0.6 random info. max. θ true θ i 1 0 −1 Entropy θ i 1 0 −1 random info. max. 250 200 i 1 θ Trial 400 Trial 200 Trial 0 (a) 0 −1 20 40 (b) i 60 80 100 150 0 200 400 600 Iteration 800 (c) Figure 3: Estimating the receptive ﬁeld when θ is not constant. A) The posterior means µt and true θt plotted after each trial. θ was 100 dimensional, with its components following a Gabor function. To simulate nonsystematic changes in the response function, the center of the Gabor function was moved according to a random walk in between trials. We modeled the changes in θ as a random walk with a white covariance matrix, Q, with variance .01. In addition to the results for random and information-maximizing stimuli, we also show the µt given stimuli chosen to maximize the information under the (mistaken) assumption that θ was constant. Each row of the images plots θ using intensity to indicate the value of the different components. B) Details of the posterior means µt on selected trials. C) Plots of the posterior entropies as a function of trial number; once again, we see that information-maximizing stimuli constrain the posterior of θt more effectively. equations for the optimal yi as a function of the Lagrange multiplier λ1 . ui e yi (λ1 ) = ||y||2 2(ci − λ1 ) (15) Thus to ﬁnd the global optimum we simply vary λ1 (this is equivalent to performing a search over b), and compute the corresponding y(λ1 ). For each value of λ1 we compute F (y(λ1 )) and choose the stimulus y(λ1 ) which maximizes F (). It is possible to show (details omitted) that the maximum of F () must occur on the interval λ1 ≥ c0 , where c0 is the largest eigenvalue. This restriction on the optimal λ1 makes the implementation of the linesearch signiﬁcantly faster and more stable. To summarize, updating the posterior and ﬁnding the optimal stimulus requires three steps: 1) a rankone matrix update and one-dimensional search to compute µt and Ct ; 2) an eigendecomposition of Ct ; 3) a one-dimensional search over λ1 ≥ c0 to compute the optimal stimulus. The most expensive step here is the eigendecomposition of Ct ; in principle this step is O(d3 ), while the other steps, as discussed above, are O(d2 ). Here our Gaussian approximation of p(θt−1 |xt−1 , rt−1 ) is once again quite useful: recall that in this setting Ct is just a rank-one modiﬁcation of Ct−1 , and there exist efﬁcient algorithms for rank-one eigendecomposition updates [15]. While the worst-case running time of this rank-one modiﬁcation of the eigendecomposition is still O(d3 ), we found the average running time in our case to be O(d2 ) (Fig. 1(c)), due to deﬂation which reduces the cost of matrix multiplications associated with ﬁnding the eigenvectors of repeated eigenvalues. Therefore the total time complexity of our algorithm is empirically O(d2 ) on average. Spike history terms. The preceding derivation ignored the spike-history components of the GLM model; that is, we ﬁxed a = 0 in equation (1). Incorporating spike history terms only affects the optimization step of our algorithm; updating the posterior of θ = {k; a} proceeds exactly as before. The derivation of the optimization strategy proceeds in a similar fashion and leads to an analogous optimization strategy, albeit with a few slight differences in detail which we omit due to space constraints. The main difference is that instead of maximizing the quadratic expression in Eqn. 14 to ﬁnd the maximum of h(), we need to maximize a quadratic expression which includes a linear term due to the correlation between the stimulus coefﬁcients, k, and the spike history coefﬁcients,a. The results of our simulations with spike history terms are shown in Fig. 2. Dynamic θ. In addition to fast changes due to adaptation and spike-history effects, animal preparations often change slowly and nonsystematically over the course of an experiment [16]. We model these effects by letting θ experience diffusion: θt+1 = θt + wt (16) Here wt is a normally distributed random variable with mean zero and known covariance matrix Q. This means that p(θt+1 |xt , rt ) is Gaussian with mean µt and covariance Ct + Q. To update the posterior and choose the optimal stimulus, we use the same procedure as described above1 . Results Our ﬁrst simulation considered the use of our algorithm for learning the receptive ﬁeld of a visually sensitive neuron. We took the neuron’s receptive ﬁeld to be a Gabor function, as a proxy model of a V1 simple cell. We generated synthetic responses by sampling Eqn. 1 with θ set to a 25x33 Gabor function. We used this synthetic data to compare how well θ could be estimated using information maximizing stimuli compared to using random stimuli. The stimuli were 2-d images which were rasterized in order to express x as a vector. The plots of the posterior means µt in Fig. 1 (recall these are equivalent to the MAP estimate of θ) show that the information maximizing strategy converges an order of magnitude more rapidly to the true θ. These results are supported by the conclusion of [7] that the information maximization strategy is asymptotically never worse than using random stimuli and is in general more efﬁcient. The running time for each step of the algorithm as a function of the dimensionality of θ is plotted in Fig. 1(c). These results were obtained on a machine with a dual core Intel 2.80GHz XEON processor running Matlab. The solid lines indicate ﬁtted polynomials of degree 1 for the 1d line search and degree 2 for the remaining curves; the total running time for each trial scaled as O(d2 ), as predicted. When θ was less than 200 dimensions, the total running time was roughly 50 ms (and for dim(θ) ≈ 100, the runtime was close to 15 ms), well within the range of tolerable latencies for many experiments. In Fig. 2 we apply our algorithm to characterize the receptive ﬁeld of a neuron whose response depends on its past spiking. Here, the stimulus coefﬁcients k were chosen to follow a sine-wave; 1 The one difference is that the covariance matrix of p(θt+1 |xt+1 , rt+1 ) is in general no longer just a rankone modiﬁcation of the covariance matrix of p(θt |xt , rt ); thus, we cannot use the rank-one update to compute the eigendecomposition. However, it is often reasonable to take Q to be white, Q = cI; in this case the eigenvectors of Ct + Q are those of Ct and the eigenvalues are ci + c where ci is the ith eigenvalue of Ct ; thus in this case, our methods may be applied without modiﬁcation. the spike history coefﬁcients a were inhibitory and followed an exponential function. When choosing stimuli we updated the posterior for the full θ = {k; a} simultaneously and maximized the information about both the stimulus coefﬁcients and the spike history coefﬁcients. The information maximizing strategy outperformed random sampling for estimating both the spike history and stimulus coefﬁcients. Our ﬁnal set of results, Fig. 3, considers a neuron whose receptive ﬁeld drifts non-systematically with time. We take the receptive ﬁeld to be a Gabor function whose center moves according to a random walk (we have in mind a slow random drift of eye position during a visual experiment). The results demonstrate the feasibility of the information-maximization strategy in the presence of nonstationary response properties θ, and emphasize the superiority of adaptive methods in this context. Conclusion We have developed an efﬁcient implementation of an algorithm for online optimization of neurophysiology experiments based on information-theoretic criterion. Reasonable approximations based on a GLM framework allow the algorithm to run in near-real time even for high dimensional parameter and stimulus spaces, and in the presence of spike-rate adaptation and time-varying neural response properties. Despite these approximations the algorithm consistently provides signiﬁcant improvements over random sampling; indeed, the differences in efﬁciency are large enough that the information-optimization strategy may permit robust system identiﬁcation in cases where it is simply not otherwise feasible to estimate the neuron’s parameters using random stimuli. Thus, in a sense, the proposed stimulus-optimization technique signiﬁcantly extends the reach and power of classical neurophysiology methods. Acknowledgments JL is supported by the Computational Science Graduate Fellowship Program administered by the DOE under contract DE-FG02-97ER25308 and by the NSF IGERT Program in Hybrid Neural Microsystems at Georgia Tech via grant number DGE-0333411. LP is supported by grant EY018003 from the NEI and by a Gatsby Foundation Pilot Grant. We thank P. Latham for helpful conversations. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] I. Nelken, et al., Hearing Research 72, 237 (1994). P. Foldiak, Neurocomputing 38–40, 1217 (2001). K. Zhang, et al., Proceedings (Computational and Systems Neuroscience Meeting, 2004). R. C. deCharms, et al., Science 280, 1439 (1998). C. Machens, et al., Neuron 47, 447 (2005). A. Watson, et al., Perception and Psychophysics 33, 113 (1983). L. Paninski, Neural Computation 17, 1480 (2005). P. McCullagh, et al., Generalized linear models (Chapman and Hall, London, 1989). L. Paninski, Network: Computation in Neural Systems 15, 243 (2004). E. Simoncelli, et al., The Cognitive Neurosciences, M. Gazzaniga, ed. (MIT Press, 2004), third edn. P. Dayan, et al., Theoretical Neuroscience (MIT Press, 2001). E. Chichilnisky, Network: Computation in Neural Systems 12, 199 (2001). F. Theunissen, et al., Network: Computation in Neural Systems 12, 289 (2001). L. Paninski, et al., Journal of Neuroscience 24, 8551 (2004). M. Gu, et al., SIAM Journal on Matrix Analysis and Applications 15, 1266 (1994). N. A. Lesica, et al., IEEE Trans. On Neural Systems And Rehabilitation Engineering 13, 194 (2005).

6 0.18746601 61 nips-2006-Convex Repeated Games and Fenchel Duality

7 0.18383594 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

8 0.14917222 146 nips-2006-No-regret Algorithms for Online Convex Programs

9 0.13421503 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

10 0.12476881 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

11 0.11309222 154 nips-2006-Optimal Change-Detection and Spiking Neurons

12 0.10984364 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

13 0.10033215 163 nips-2006-Prediction on a Graph with a Perceptron

14 0.095348626 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

15 0.075123101 98 nips-2006-Inferring Network Structure from Co-Occurrences

16 0.066299908 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

17 0.054504409 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

18 0.053450961 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

19 0.049327217 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts

20 0.042964209 44 nips-2006-Bayesian Policy Gradient Algorithms

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.166), (1, 0.016), (2, -0.352), (3, 0.156), (4, -0.207), (5, -0.072), (6, -0.026), (7, -0.374), (8, -0.263), (9, -0.159), (10, 0.041), (11, 0.048), (12, -0.006), (13, -0.023), (14, -0.081), (15, 0.02), (16, -0.075), (17, -0.05), (18, 0.096), (19, 0.049), (20, -0.063), (21, 0.014), (22, -0.024), (23, 0.101), (24, 0.028), (25, -0.099), (26, 0.006), (27, -0.006), (28, -0.063), (29, 0.012), (30, 0.079), (31, 0.011), (32, 0.008), (33, 0.043), (34, -0.021), (35, -0.025), (36, 0.054), (37, -0.064), (38, 0.104), (39, 0.033), (40, -0.04), (41, -0.02), (42, -0.031), (43, -0.009), (44, -0.078), (45, 0.003), (46, 0.022), (47, -0.036), (48, -0.052), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98733073 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

2 0.76169896 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer

3 0.64429849 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

Author: Gloria Haro, Gregory Randall, Guillermo Sapiro

4 0.59222859 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

Author: Jeremy Lewi, Robert Butera, Liam Paninski

5 0.54126549 79 nips-2006-Fast Iterative Kernel PCA

Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan

6 0.5373702 61 nips-2006-Convex Repeated Games and Fenchel Duality

7 0.52608246 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension

8 0.47468448 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

9 0.45358452 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems

10 0.42015624 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

11 0.40487784 146 nips-2006-No-regret Algorithms for Online Convex Programs

12 0.32492062 154 nips-2006-Optimal Change-Detection and Spiking Neurons

13 0.31342354 98 nips-2006-Inferring Network Structure from Co-Occurrences

14 0.30141523 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers

15 0.22206262 163 nips-2006-Prediction on a Graph with a Perceptron

16 0.22068441 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

17 0.19081463 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions

18 0.16698518 44 nips-2006-Bayesian Policy Gradient Algorithms

19 0.16054517 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

20 0.15805778 96 nips-2006-In-Network PCA and Anomaly Detection

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.083), (3, 0.029), (7, 0.068), (9, 0.043), (20, 0.014), (22, 0.125), (38, 0.238), (44, 0.05), (57, 0.044), (65, 0.121), (69, 0.023), (83, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83692402 203 nips-2006-implicit Online Learning with Kernels

Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan

2 0.6957323 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure

Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Rachel Puckrin, Sean Cutler

Abstract: We present a hierarchical Bayesian model for sets of related, but different, classes of time series data. Our model performs alignment simultaneously across all classes, while detecting and characterizing class-speciﬁc differences. During inference the model produces, for each class, a distribution over a canonical representation of the class. These class-speciﬁc canonical representations are automatically aligned to one another — preserving common sub-structures, and highlighting differences. We apply our model to compare and contrast solenoid valve current data, and also, liquid-chromatography-ultraviolet-diode array data from a study of the plant Arabidopsis thaliana. 1 Aligning Time Series From Different Classes Many practical problems over a wide range of domains require synthesizing information from several noisy examples of one or more categories in order to build a model which captures common structure and also learns the patterns of variability between categories. In time series analysis, these modeling goals manifest themselves in the tasks of alignment and difference detection. These tasks have diverse applicability, spanning speech & music processing, equipment & industrial plant diagnosis/monitoring, and analysis of biological time series such as microarray & liquid/gas chromatography-based laboratory data (including mass spectrometry and ultraviolet diode arrays). Although alignment and difference detection have been extensively studied as separate problems in the signal processing and statistical pattern recognition communities, to our knowledge, no existing model performs both tasks in a uniﬁed way. Single class alignment algorithms attempt to align a set of time series all together, assuming that variability across different time series is attributable purely to noise. In many real-world situations, however, we have time series from multiple classes (categories) and our prior belief is that there is both substantial shared structure between the class distributions and, simultaneously, systematic (although often rare) differences between them. While in some circumstances (if differences are small and infrequent), single class alignment can be applied to multi-class data, it is much more desirable to have a model which performs true multi-class alignment in a principled way, allowing for more reﬁned and accurate modeling of the data. In this paper, we introduce a novel hierarchical Bayesian model which simultaneously solves the multi-class alignment and difference detection tasks in a uniﬁed manner, as illustrated in Figure 1. The single-class alignment shown in this ﬁgure coerces the feature in region A for class 1 to be inappropriately collapsed in time, and the overall width of the main broad peak in class 2 to be inappropriately narrowed. In contrast, our multi-class model handles these features correctly. Furthermore, because our algorithm does inference for a fully probabilistic model, we are able to obtain quantitative measures of the posterior uncertainty in our results, which, unlike the point estimates produced by most current approaches, allow us to assess our relative conﬁdence in differences learned by the model. Our basic setup for multi-class alignment assumes the class labels are known for each time series, as is the case for most difference detection problems. However, as we discuss at the end of the paper, our model can be extended to the completely unsupervised case. normal abnormal 3 common structure 3 1 1 20 0 3 −20 class−specific differences 20 1 0 −20 3 class−specific models 3 A 1 1 0 50 100 150 200 250 0 50 100 150 200 Figure 1: Nine time series from the NASA valve solenoid current data set [4]. Four belong to a ‘normal’ class, and ﬁve to an ‘abnormal’ class. On all ﬁgures, the horizontal axis is time, or latent time for ﬁgures of latent traces and observed time series aligned to latent traces. The vertical axis is current amplitude. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data (over MCMC samples) within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM. Top right: Parent trace. Middle right: Class-speciﬁc energy impulses with the topmost showing the class impulses for the less smooth class. Bottom right: Child traces superimposed. Note that if one generates more HB-CPM MCMC samples, the parent cycles between the two classes since the model has no preference for which class is seen as a modiﬁcation of the other; the child classes remain stable however. 2 A Hierarchical Bayesian Continuous Proﬁle Model Building on our previous Continuous Proﬁle Model (CPM) [7], we propose a Hierarchical Bayesian Continuous Proﬁle Model (HB-CPM) to address the problems of multi-class alignment and difference detection, together, for sets of sibling time series data — that is, replicate time series from several distinct, but related classes. The HB-CPM is a generative model that allows simultaneous alignment of time series and also provides aligned canonical representations of each class along with measures of uncertainty on these representations. Inference in the model can be used, for example, to detect and quantify similarities and differences in class composition. The HB-CPM extends the basic CPM in two signiﬁcant ways: i) it addresses the multi-class rather than the single-class alignment problem, and ii) it uses a fully Bayesian framework rather than a maximum likelihood approach, allowing us to estimate uncertainty in both the alignments and the canonical representations. Our model, depicted in Figure 2, assumes that each observed time series is generated as a noisy transformation of a single, class-speciﬁc latent trace. Each latent trace is an underlying, noiseless representation of the set of replicated, observable time series belonging to a single class. An observed time series is generated from this latent trace exactly as in the original CPM, by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as with an HMM. Each hidden state corresponds to a ‘latent time’ in the latent trace. Thus different choices of hidden state sequences result in different nonlinear transformations of the underlying trace. The HB-CPM uses a separate latent trace for each class, which we call child traces. Crucially, each of these child traces is generated from a single parent trace (also unobserved), which 250 captures the common structure among all of the classes. The joint prior distribution for the child traces in the HB-CPM model can be realized by ﬁrst sampling a parent trace, and then, for each class, sampling a sparse ‘difference vector’ which dictates how and where each child trace should differ from the common parent. z Figure 2: Core elements of the HB-CPM, illustrated with two-class data (hidden and observed) drawn from the model’s prior. parent r1 r2 z1 impulse z2 child trace child trace x1 x2 x3 class 1 observed time series 2.1 impulse x4 x5 x6 class 2 observed time series The Prior on Latent Traces Let the vector xk = (xk , xk , ..., xk ) represent the k th observed scalar time series, and w k ∈ 1..C 1 2 N be the class label of this time series. Also, let z = (z1 , z2 , ..., zM ) be the parent trace, and c c c z c = (z1 , z2 , ..., zM ) be the child trace for the cth class. During inference, posterior samples of c z form a canonical representation of the observed times series in class c, and z contains their common sub-structure. Ideally, the length of the latent traces, M , would be very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overﬁtting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments, and found this to be sufﬁcient with < 0.2. Because the resolution of the latent traces is higher than that of the observed time series, experimental time can be made to effectively speed up or slow down by advancing along the latent trace in larger or smaller jumps. As mentioned previously, the child traces in the HB-CPM inherit most of their structure from a common parent trace. The differences between child and parent are encoded in a difference vector for each class, dc = (dc , dc , ..., dc ); normally, most elements of dc are close to zero. Child traces 1 2 M are obtained by adding this difference vector to the parent trace: z c = z + dc . We model both the parent trace and class-speciﬁc difference vectors with what we call an energy impulse chain, which is an undirected Markov chain in which neighbouring nodes are encouraged to be similar (i.e., smooth), and where this smoothness is perturbed by a set of marginally independent energy impulse nodes, with one energy impulse node attached to each node in the chain. For the difc c c ference vector of the cth class, the corresponding energy impulses are denoted r c = (r1 , r2 , ..., rM ), and for the parent trace the energy impulses are denoted r = (r1 , r2 , ..., rM ). Conditioned on the energy impulses, the probability of a difference vector is p(dc |r c , αc , ρc ) = 1 1 exp − Zr c 2 M −1 i=1 M c (dc − dc )2 (dc − ri )2 i i+1 i + c c α ρ i=1 . (1) Here, Zrc is the normalizing constant for this probability density, αc controls the smoothness of the chain, and ρc controls the inﬂuence of the energy impulses. Together, αc and ρc also control the overall tightness of the distribution for dc . Presently, we set all αc = α , and similarly ρc = ρ — that is, these do not differ between classes. Similarly, the conditional probability of the parent trace is p(z|r, α, ρ) = 1 1 exp − Zr 2 M −1 i=1 M (zi − zi+1 )2 (zi − ri )2 + α ρ i=1 . (2) These probability densities are each multivariate Gaussian with tridiagonal precision matrixes (corresponding to the Markov nature of the interactions). Each component of each energy impulse for the parent, rj , is drawn independently from a single univariate Gaussian, N (ri |µpar , spar ), whose mean and variance are in turn drawn from a Gaussian and inverse-gamma, respectively. The class-speciﬁc difference vector impulses, however, are drawn from a mixture of two zero-mean Gaussians — one ‘no difference’ (inlier) Gaussian, and one ‘classdifference’ (outlier) Gaussian. The means are zero so as to encourage difference vectors to be near c zero (and thus child traces to be similar to the parent trace). Letting δi denote the binary latent c mixture component indicator variables for each rj , c c c c p(δj ) = Multinomial(δj |mc , mc ) = (mc )δj (mc )1−δj in out in out c c p(rj |δj ) = c N (rj |0, s2 ), in c N (rj |0, s2 ), out if if c δj c δj =1 . =0 (3) (4) Each Gaussian mixture variance has an Inverse-Gamma prior, which for the ‘no difference’ variance, s2 , is set to have very low mean (and not overly dispersed) so that ‘no difference’ regions truly have in little difference from the parent class, while for the ‘class-difference’ variance, s 2 , the prior is out set to have a larger mean, so as to model our belief that substantial class-speciﬁc differences do occasionally exist. The priors for αc , ρc , α, ρ are each log-normal (inverse-gamma priors would not be conjugate in this model, so we use log-normals which are easier to specify). Additionally, the mixing proportions, mc , mc , have a Dirichlet prior, which typically encodes our belief that the out in proportion that are ‘class differences’ is likely to be small. 2.2 The HMM Portion of the Model Each observed xk is modeled as being generated by an HMM conditioned on the appropriate child k trace, z w . The probability of an observed time series conditioned on a path of hidden time states, k N wk τ k , and the child trace, is given by p(xk |z w , τ k ) = i=1 N (xk |zτ k uk , ξ k ), where ξ k is the i i emission variance for time series k, and the scale factor, uk , allows for constant, global, multiplicak tive rescaling. The HMM transition probabilities T k (τi−1 → τik ) are multinomial within a limited k k range, with p (τi = a|τi−1 = b) = κ(a−b) for (a − b) ∈ [1, Jτ ] and pk (τi = a|τi−1 = b) = 0 for (a − b) < 1 or (a − b) > Jτ where Jτ is the maximum allowable number of consecutive time Jτ states that can be advanced in a single transition. (Of course, i=1 κk = 1.) This multinomial disi tribution, in turn, has a Dirichlet prior. The HMM emission variances, ξ k , have an inverse-gamma prior. Additionally, the prior over the ﬁrst hidden time state is a uniform distribution over a constant number of states, 1..Q, where Q deﬁnes how large a shift can exist between any two observed time series. The prior over each global scaling parameter, uk , is a log-normal with ﬁxed variance and mean of zero, which encourages the scaling factors to remain near unity. 3 Posterior Inference of Alignments and Parameters by MCMC Given a set of observed time series (and their associated class labels), the main computational operation to be performed in the HB-CPM is inference of the latent traces, alignment state paths and other model parameters. Exact inference is analytically intractable, but we are able to use Markov Chain Monte Carlo (MCMC) methods to create an iterative algorithm which, if run for sufﬁciently long, produces samples from the correct posterior distribution. This posterior provides simultaneous alignments of all observed time series in all classes, and also, crucially, aligned canonical representations of each class, along with error bars on these representations, allowing for a principled approach to difference detection in time series data from different classes. We may also wish to obtain a posterior estimate of some of our parameters conditioned on the data, and marginalized over the other parameters. In particular, we might be interested in obtaining the posterior over hidden time state vectors for each time series, τ k , which together provide a simultaneous, multi-class alignment of our data. We may, in addition, or, alternatively, be interested in the posterior of the child traces, z c , which together characterize how the classes agree and disagree. The former may be more of interest for visualizing aligned observed time series, or in expanding out aligned scalar time series to a related vector time series, while the latter would be more of interest when looking to characterize differences in multi-class, scalar time series data. We group our parameters into blocks, and sample these blocks conditioned on the values of the other parameters (as in Gibbs sampling) — however, when certain conditional distributions are not amenable to direct sampling, we use slice sampling [8]. The scalar conditional distributions for each c of µpar , spar , mc , mc , δj , κk are known distributions, amenable to direct sampling. The conditional out i in distributions for the scalars αc , ρc , α, ρ and uk are not tractable, and for each of these we use slice sampling (doubling out and shrinking). The conditional distribution for each of r and r c is multivariate Gaussian, and we sample directly from each using a Cholesky decomposition of the covariance matrix. 1 p(r|z, α, ρ) = p(z|r, α, ρ)p(r) = N (r|c, C) (5) Z 1 p(r c |dc , αc , ρc ) = p(dc |r, αc , ρc )p(r) = N (r c |b, B), (6) Z where, using I to denote the identity matrix, −1 µpar z S + Ispar −1 +I (7) , c=C C= ρ2 ρ spar B= S† 2 (ρc ) −1 +v c −1 , b=B dc . ρc (8) −1 The diagonal matrix v c consists of mixture component variances (s2 or s2 ). S −1 [or S † ] is the out in tridiagonal precision matrix of the multivariate normal distribution p(z|r, α, ρ) [or p(d c |r c , αc , ρc )], −1 −1 2 1 1 1 and has entries Sj,j = α + ρ for j = 2..(M − 1), Sj,j = α + ρ for j = 1, M , and −1 −1 −1 1 Sj,j+1 = Sj+1,j = − α [or analogously for S † ]. The computation of C and B can be made more efﬁcient by using the Sherman-Morrison-Woodbury matrix inversion lemma. For example, −1 −1 −1 −1 −1 B = (ρ1)2 (S † − S † (v c + S † )−1 S † ), and we have S −1 [or S † ] almost for free, and c no longer need to invert S [or S † ] to obtain it. The conditional distributions of each of z, z c are also multivariate Gaussians. However, because of the underlying Markov dependencies, their precision matrixes are tridiagonal, and hence we can use belief propagation, in the style of Kalman ﬁltering, followed by a stochastic traceback to sample from them efﬁciently. Thus each can be sampled in time proportional to M rather than M 3 , as required for a general multivariate Gaussian. Lastly, to sample from the conditional distribution of the hidden time vectors for each sample, τ k , we run belief propagation (analogous to the HMM forward-backward algorithm) followed by a stochastic traceback. In our experiments, the parent trace was initialized by averaging one smoothed example from each class. The child traces were initialized to the initial parent trace. The HMM states were initialized by a Viterbi decoding with respect to the initial values of the other parameters. The scaling factors were initialized to unity, and the child energy impulses to zero. MCMC was run for 5000 iterations, with convergence generally realized in less than 1000 iterations. 4 Experiments and Results We demonstrate use of the HB-CPM on two data sets. The ﬁrst data set is the part of the NASA shuttle valve data [4], which measures valve solenoid current against time for some ‘normal’ runs and some ‘abnormal’ runs. Measurements were taken at a rate of 1ms per sample, with 1000 samples per time series. We subsampled the data by a factor of 7 in time since it was extremely dense. The results of performing posterior inference in our model on this two-class data set are shown in Figure 1. They nicely match our intuition of what makes a good solution. In our experiments, we also compared our model to a simple “single-class” version of the HB-CPM in which we simply remove the child trace level of the model, letting all observed data in both classes depend directly on one single parent trace. The single-class alignment, while doing a reasonable job, does so by coercing the two classes to look more similar than they should. This is evident in one particular region labeled on the graph and discussed in the legend. Essentially a single class alignment causes us to lose class-speciﬁc ﬁne detail — the precise information we seek to retain for difference detection. The second data set is from a botany study which uses reverse-phase HPLC (high performance liquid chromatography) as a high-throughput screening method to identify genes involved in xenobiotic uptake and metabolism in the model plant Arabidopsis thaliana. Liquid-chromatography (LC) techniques are currently being developed and reﬁned with the aim of providing a robust platform with which to detect differences in biological organisms — be they plants, animals or humans. Detected differences can reveal new fundamental biological insight, or can be applied in more clinical settings. LC-mass spectrometry technology has recently undergone explosive growth in tackling the problem of biomarker discovery — for example, detecting biological markers that can predict treatment outcome or severity of disease, thereby providing the potential for improved health care and better understanding of the mechanisms of drug and disease. In botany, LC-UV data is used to help understand the uptake and metabolism of compounds in plants by looking for differences across experimental conditions, and it is this type of data that we use here. LC separates mixtures of analytes on the basis of some chemical property — hydrophobicity, for reverse-phase LC, used to generate our data. Components of the analyte in our data set were detected as they came off the LC column with a Diode Array Detector (DAD), yielding UV-visible spectra collected at 540 time points (we used the 280 nm band, which is informative for these experiments). We performed background subtraction [2] and then subsampled this data by a factor of four. This is a three-class data set, where the ﬁrst class is untreated plant extract, followed by two classes consisting of this same plant treated with compounds that were identiﬁed as possessing robust uptake in vivo, and, hence, when metabolized, provide a differential LC-UV signal of interest. Figure 3 gives an overview of the LC-UV results, while Figure 4 zooms in on a particular area of interest to highlight how subtle differences can be detected by the HB-CPM, but not by a singleclass alignment scheme. As with the NASA data set, a single-class alignment coerces features across classes that are in fact different to look the same, thereby preventing us from detecting them. Recall that this data set consists of a ‘no treatment’ plant extract, and two ‘treatments’ of this same plant. Though our model was not informed of these special relationships, it nevertheless elegantly captures this structure by giving almost no energy impulses to the ‘no treatment’ class, meaning that this class is essentially the parent trace, and allowing the ‘treatment’ classes to diverge from it, thereby nicely matching the reality of the situation. All averaging over MCMC runs shown is over 4000 samples, after a 1000 burn in period, which took around 3 hours for the NASA data, and 5 hours for the LC data set, on machines with dual 3 GHz Pentium 4 processors. 5 Related Work While much work has been done on time series alignment, and on comparison/clustering of time series, none of this work, to our knowledge, directly addresses the problem presented in this paper — simultaneously aligning and comparing sets of related time series in order to characterize how they differ from one another. The classical algorithm for aligning time series is Dynamic Time Warping (DTW) [10]. DTW works on pairs of time series, aligning one time series to a speciﬁed reference time, in a non-probabilistic way, without explicit allowance for differences in related time series. More recently, Gaffney et al [5] jointly clustered and aligned time series data from different classes. However, their model does not attempt to put time series from different classes into correspondence with one another — only time series within a class are aligned to one another. Ziv Bar-Joseph et al [1] use a similar approach to cluster and align microarray time series data. Ramsay et al [9] have introduced a curve clustering model, in which a time warping function, h(t), for each time series is learned by way of learning its relative curvature, parameterized with order one B-spline coefﬁcients. This model accounts for 5 5 3 3 9 0 5 −9 9 0 −9 9 3 0 −9 5 5 3 3 0 50 100 150 200 250 300 0 50 100 150 200 250 Figure 3: Seven time series from each of three classes of LC-UV data. On all ﬁgures, the horizontal axis is time, or latent time for ﬁgures of latent traces and observed time series aligned to latent traces. The vertical axis is log of UV absorbance. Top left: The raw, unaligned data. Middle left: Average of the unaligned data within each class in thick line, with the thin lines showing one standard deviation on either side. Bottom left: Average of the aligned data within each class, using the single-class alignment version of the model (no child traces), again with one standard deviation lines shown in the thinner style line. Right: Mean and one standard deviation over MCMC samples using the HB-CPM model. Top right: Parent trace. Middle right: Class-speciﬁc energy impulses, with the top-most showing the class impulses for the ‘no treatment’ class. Bottom right: Child traces superimposed. See Figure 4 for a zoom-in in around the arrow. systematic changes in the range and domain of time series in a way that aligns curves with the same fundamental shape. However, their method does not allow for class-speciﬁc differences between shapes to be taken into account. The anomaly detection (AD) literature deals with related, yet distinct problems. For example, Chan et al [3] build a model of one class of time series data (they use the same NASA valve data as in this paper), and then match test data, possibly belonging to another class (e.g. ‘abnormal’ shuttle valve data) to this model to obtain an anomaly score. Emphasis in the AD community is on detecting abnormal events relative to a normal baseline, in an on-line manner, rather than comparing and contrasting two or more classes from a dataset containing examples of all classes. The problem of ‘elastic curve matching‘ is addressed in [6], where a target time series that best matches a query series is found, by mapping the problem of ﬁnding the best matching subsequence to the problem of ﬁnding the cheapest path in a DAG (directed acyclic graph). 6 Discussion and Conclusion We have introduced a hierarchical, Bayesian model to perform detection of rare differences between sets of related time series, a problem which arises across a wide range of domains. By training our model, we obtain the posterior distribution over a set of class-speciﬁc canonical representations of each class, which are aligned in a way that preserves their common sub-structures, yet retains and highlights important differences. This model can be extended in several interesting and useful ways. One small modiﬁcation could be useful for the LC-UV data set presented in this paper, in which one of the classes was ‘no treatment’, while the other two were each a different ‘treatment’. We might model the ‘no treatment’ as the parent trace, and each of the treatments as a child trace, so that the direct comparison of interest would be made more explicit. Another direction would be to apply the HB-CPM in a completely 300 4 5 0 3 −5 4 5 0 3 −5 4 5 0 3 −5 100 105 110 115 120 125 130 135 140 145 150 155 100 105 110 115 120 125 130 135 140 145 150 155 Figure 4: Left: A zoom in of data displayed in Figure 3, from the region of time 100-150 (labeled in that ﬁgure in latent time, not observed time). Top left: mean and standard deviation of the unaligned data. Middle left: mean and standard deviation of the single-class alignment. Bottom left: mean and standard deviation of the child traces from the HB-CPM. A case in point of a difference that could be detected with the HB-CPM and not in the raw or single-class aligned data, is the difference occurring at time point 127. Right: The mean and standard deviation of the child energy impulses, with dashed lines showing correspondences with the child traces in the bottom left panel. unsupervised setting where we learn not only the canonical class representations, but also obtain the posterior over the class labels by introducing a latent class indicator variable. Lastly, one could use a model with cyclical latent traces to model cyclic data such as electrocardiogram (ECG) and climate data. In such a model, an observed trace being generated by the model would be allowed to cycle back to the start of the latent trace, and the smoothness constraints on the trace would be extended to apply to beginning and end of the traces, coercing these to be similar. Such a model would allow one to do anomaly detection in cyclic data, as well as segmentation. Acknowledgments: Thanks to David Ross and Roland Memisevic for useful discussions, and Ben Marlin for his Matlab slice sampling code. References [1] Z. Bar-Joseph, G. Gerber, D. K. Gifford, T. Jaakkola, and I. Simon. A new approach to analyzing gene expression time series data. In RECOMB, pages 39–48, 2002. [2] H. Boelens, R. Dijkstra, P. Eilers, F. Fitzpatrick, and J. Westerhuis. New background correction method for liquid chromatography with diode array detection, infrared spectroscopic detection and raman spectroscopic detection. Journal of Chromatography A, 1057:21–30, 2004. [3] P. K. Chan and M. V. Mahoney. Modeling multiple time series for anomaly detection. In ICDM, 2005. [4] B. Ferrell and S. Santuro. NASA shuttle valve data. http://www.cs.fit.edu/∼pkc/nasa/data/, 2005. [5] S. J. Gaffney and P. Smyth. Joint probabilistic curve clustering and alignment. In Advances in Neural Information Processing Systems 17, 2005. [6] L. Latecki, V. Megalooikonomou, Q. Wang, R. Lakaemper, C. Ratanamahatana, and E. Keogh. Elastic partial matching of time series, 2005. [7] J. Listgarten, R. M. Neal, S. T. Roweis, and A. Emili. Multiple alignment of continuous time series. In Advances in Neural Information Processing Systems 17, 2005. [8] R. M. Neal. Slice sampling. Annals of Statistics, 31:705–767, 2003. [9] J. Ramsay and X. Li. Curve registration. Journal of the Royal Statistical Society(B), 60, 1998. [10] H. Sakoe and S. Chiba. Dynamic programming algorithm for spoken word recognition. Readings in Speech Recognition, pages 159–165, 1990.

3 0.63222498 61 nips-2006-Convex Repeated Games and Fenchel Duality

Author: Shai Shalev-shwartz, Yoram Singer

Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This uniﬁed view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the ﬁrst player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the ﬁrst player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the ﬁrst player. The goal of the ﬁrst player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us ﬁrst cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner ﬁrst receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classiﬁers, which is used for answering yes/no questions, is deﬁned as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redeﬁne the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which deﬁnes a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the ﬁrst player using the notion of regret. Given a number of rounds T and a ﬁxed vector u ∈ S, we deﬁne the regret of the ﬁrst player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . t=1 Our main result is an algorithmic framework for the ﬁrst player which guarantees low regret with respect to any vector u ∈ S. Speciﬁcally, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Speciﬁcally, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the ﬁrst player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the ﬁrst player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the ﬁrst player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Speciﬁcally, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is deﬁned as λ = sup{ x, λ : x ≤ 1}. For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few deﬁnitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is deﬁned as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the deﬁnition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the ﬁrst player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best ﬁxed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The ﬁrst update scheme simply ﬁnds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update deﬁnes (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the ﬁnal value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisﬁes the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its ﬁnal sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The deﬁnition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the deﬁnition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classiﬁcation, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the conﬁdence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are deﬁned to be yt = Xt wt . Finally, let us deﬁne two loss functions for ranking, both generalize the hinge-loss used in binary classiﬁcation problems. Denote by Et the set {(i, j) : yt,i > yt,j }. For all (i, j) ∈ Et we deﬁne a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufﬁcient conﬁdence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufﬁcient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster ﬁrst deﬁnes a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a ﬁnal, so called strong, hypothesis whose error is small. The ﬁnal hypothesis combines linearly the T hypotheses returned by the weak learner with coefﬁcients α1 , . . . , αT , and is deﬁned to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefﬁcients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the ﬁnal strong hypothesis is at most exp(−2 γ 2 T ). t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note ﬁrst that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, deﬁned as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms deﬁne weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally ﬁts the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us ﬁrst denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us deﬁne αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: ﬁnd a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of ﬁnding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster ﬁrst constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different deﬁnition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The deﬁnition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the deﬁnition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now brieﬂy discuss. First, we generalize the applicability of the framework beyond the speciﬁc setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was ﬁrst presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of deﬁning appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simpliﬁes the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classiﬁers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using conﬁdence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized inﬁnitesimal gradient ascent. In ICML, 2003.

4 0.62530357 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm

Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus

Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.

5 0.61988974 146 nips-2006-No-regret Algorithms for Online Convex Programs

Author: Geoffrey J. Gordon

Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classiﬁer, dynamically rebalancing a binary search tree, ﬁnding the shortest path in a graph with unknown edge lengths, solving a structured classiﬁcation problem, or ﬁnding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or ﬂexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1

6 0.61363238 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning

7 0.60550702 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

8 0.60319048 79 nips-2006-Fast Iterative Kernel PCA

9 0.60126328 62 nips-2006-Correcting Sample Selection Bias by Unlabeled Data

10 0.59874338 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

11 0.59732544 156 nips-2006-Ordinal Regression by Extended Binary Classification

12 0.59454274 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections

13 0.59161216 20 nips-2006-Active learning for misspecified generalized linear models

14 0.59066576 65 nips-2006-Denoising and Dimension Reduction in Feature Space

15 0.58917695 163 nips-2006-Prediction on a Graph with a Perceptron

16 0.58619314 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition

17 0.58505052 150 nips-2006-On Transductive Regression

18 0.58396918 170 nips-2006-Robotic Grasping of Novel Objects

19 0.58193403 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

20 0.58185995 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis