nips nips2003 nips2003-57 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Dynamical Modeling with Kernels for Nonlinear Time Series Prediction Liva Ralaivola Laboratoire d’Informatique de Paris 6 Universit´ Pierre et Marie Curie e 8, rue du capitaine Scott F-75015 Paris, FRANCE liva. [sent-1, score-0.169]
2 fr Florence d’Alch´ –Buc e Laboratoire d’Informatique de Paris 6 Universit´ Pierre et Marie Curie e 8, rue du capitaine Scott F-75015 Paris, FRANCE florence. [sent-3, score-0.169]
3 fr Abstract We consider the question of predicting nonlinear time series. [sent-5, score-0.155]
4 Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. [sent-6, score-0.163]
5 The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. [sent-7, score-0.56]
6 Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. [sent-8, score-0.198]
7 Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. [sent-9, score-0.236]
8 1 Introduction Prediction, smoothing and filtering are traditional tasks applied to time series. [sent-10, score-0.124]
9 The machine learning community has recently paid a lot of attention to these problems and especially to nonlinear time series prediction in various areas such as biological signals, speech or financial markets. [sent-11, score-0.519]
10 To cope with non linearities, extensions of the Kalman filter [5, 4] have been proposed for filtering and smoothing while recurrent artificial neural networks [2] and support vector regressors [7, 8] have been developed for prediction purposes. [sent-12, score-0.339]
11 In this paper, we focus on prediction tasks and introduce a powerful method based on the kernel trick [1], which has been successfully used in tasks ranging from classification and regression to data analysis (see [13, 15] for details). [sent-13, score-0.392]
12 Time series modeling is addressed by extending the framework of observable linear dynamical systems [12] to the feature space defined by a kernel. [sent-14, score-0.476]
13 The predictions are realized in the feature space and are then transformed to obtain the corresponding preimages in the input space. [sent-15, score-0.198]
14 While the proposed model could be used for smoothing as well as filtering, we here focus on the prediction task. [sent-16, score-0.253]
15 A link to the Kalman filter can be drawn by noticing that given the efficiency of our model for the prediction task it can be used as a hidden transition process in the Kalman filter setting. [sent-17, score-0.261]
16 In the next section, we describe how the modeling of a time series can take place in the feature space and explain how to solve the preimage problem by a learning strategy. [sent-19, score-0.494]
17 In the third section, we present prediction results achieved by our model In the fourth section, the estimation algorithm is discussed and its link to the Kalman filter is highlighted. [sent-20, score-0.239]
18 1 Basic Formulation The problem we address is that of modeling d-dimensional nonlinear real-valued time series defined as xt+1 = h(xt ) + u (1) from an observed sequence x1:T = {x1 , . [sent-23, score-0.419]
19 , xT } produced by this model, where h is a (possibly unknown) nonlinear function and u a noise vector. [sent-26, score-0.108]
20 Modeling such a series can be done with the help of recurrent neural networks [2] or support vector machines [7]. [sent-27, score-0.305]
21 In this work, we instead propose to deal with this problem by extending linear dynamical modeling thanks to the kernel trick. [sent-28, score-0.389]
22 , φ(xT )}, where φ is a mapping from Rd to H and k its associated 1:T kernel function [15] such that k(v1 , v2 ) = φ(v1 ), φ(v2 ) ∀v1 , v2 ∈ Rd , ·, · being the inner product of H. [sent-35, score-0.146]
23 The Kernel Dynamical Model (KDM) obtained can be written as: xφ = A φ xφ + µ φ + ν φ t t+1 (2) where Aφ is the process transition matrix, µφ an offset vector, ν φ ∈ H a gaussian isotropic noise of magnitude σ 2 and xφ stands for φ(xt ). [sent-36, score-0.067]
24 t We are going to show that it is possible to apply the maximum likelihood principle to identify σ 2 , Aφ and µφ and come back to the input space thanks to preimages determination. [sent-37, score-0.159]
25 Indeed, performing this task leads to the equations: A φ = T X xφ xφ t t−1 t=2 T X t=2 µφ = σ2 = T T 1 X φX φ − xt−1 xt T − 1 t=2 t=2 xφ xφ t−1 t−1 ! [sent-40, score-0.23]
26 −1 T ” 1 X“ φ xt − A φ xφ t−1 T − 1 t=2 T X φ 1 x − A φ xφ − µ φ t−1 p(T − 1) t=2 t (4) 2 θ := {Aφ , µφ , σ 2 , µφ , Σφ }, and µφ and Σφ are the parameters of the gaussian vector xφ . [sent-42, score-0.303]
27 , if a gaussian kernel is used) and/or singular (equation (3)) and making a division by the dimension of the feature space (p in equation (5)). [sent-45, score-0.257]
28 Once such a set of vectors is available, trying to find good parameters for the model (2) is equivalent to finding an m-dimensional linear dynamical model for the sequence z1:T = {z1 , . [sent-51, score-0.223]
29 , zT } where zt is the vector of coordinates of xφ t with respect to U , i. [sent-54, score-0.205]
30 : zt = xφ , u φ t 1 xφ , uφ · · · x φ , uφ t t m 2 ∀t = 1, . [sent-56, score-0.143]
31 (6) Given z1:T , the following linear dynamical model has to be considered: zt+1 = Az zt + µz + ν z (7) 2 where ν z is again a gaussian noise vector of variance σ . [sent-60, score-0.402]
32 Determining a basis of Hx allows to learn the linear dynamical model (7). [sent-61, score-0.212]
33 As it is based on the coordinates of the observed vectors xφ , . [sent-62, score-0.071]
34 The parameters 1 T are estimated thanks to equations (3), (4) and (5) where xφ is replaced with zt and p with t m. [sent-66, score-0.187]
35 3 Back to the Input Space: the Preimage Problem The problem Predicting the future observations with model (7) gives vectors in the feature space H while vectors from the input space Rd are needed. [sent-71, score-0.143]
36 Given a vector zφ in H, finding a good vector x in Rd such that φ(x) is as close as possible to zφ is known as the preimage problem. [sent-72, score-0.197]
37 Nevertheless, it may require several optimization phases with different starting points to be ran when other kernels are used (e. [sent-76, score-0.096]
38 Here, we propose to use Support Vector Regression (SVR) to solve the preimage problem. [sent-79, score-0.139]
39 In addition, using this strategy, there is no need to solve an optimization problem each time a preimage has to be computed. [sent-81, score-0.181]
40 , (z , y )} with pairs in Z × R, the SVR algorithm assumes a structure on Z given by a kernel k z and its associated mapping φ and feature space H (see [15]). [sent-85, score-0.189]
41 1 1 −ε (α∗ + α) + y (α∗ − α) − ((α∗ − α)KZ (α∗ − α) + (α∗ α∗ + α α)) 2 C 1 (α∗ − α) = 0 α∗ ≥ 0, α ≥ 0 The vectors involved in this program are of dimension , with 1 = [1 · · · 1] , 0 = [0 · · · 0] , ε = [ε · · · ε] , y = [y1 · · · y ] and KZ is the Gram matrix KZij = kz (zi , zj ). [sent-90, score-0.175]
42 Here, ε is the parameter of the Vapnik’s ε-insensitive quadratic loss function and C is a user-defined constant penalizing data points which fail to meet the ε-deviation constraint. [sent-91, score-0.062]
43 In order to learn this mapping, we construct d (the dimension of input space) SVR machines f1 , . [sent-93, score-0.092]
44 Each fi is trained to estimate the ith coordinate of the vector xt given the coordinates vector zt of xt with respect to U . [sent-97, score-0.694]
45 Denoting by zu the function which maps a vector x to its coordinate vector z in U , the d machines provide the mapping ψ: ψ : H x → Rd x → [f1 (zu (x)) · · · fd (zu (x))] (8) which can be used to estimate the preimages. [sent-98, score-0.229]
46 Using ψ, and noting that the program involved by the SVR algorithm is convex, the estimation of the preimages does not have to deal with any problem of local minima. [sent-99, score-0.138]
47 3 Numerical Results In this section we present experiments on highly nonlinear time series prediction with Kernel Dynamical Modeling. [sent-100, score-0.519]
48 As the two series we consider are one dimensional we use the following setup. [sent-101, score-0.194]
49 In order to model it, we introduce an embedding dimension d and a step size κ such that vectors xt = (xt , xt−κ , . [sent-103, score-0.339]
50 We compare the perfomances of KDM to the performances achieved by an SVR for nonlinear time series analysis [7, 8], where the mapping associating xt to xt+κ is learned. [sent-107, score-0.622]
51 The hyperparameters (kernel parameter and SVR penalization constant C) are computed with respect to the one-step prediction error measured on a test set, while the value of ε is set to 1e-4. [sent-108, score-0.262]
52 Prediction quality is assessed on an independent validation sequence on which root mean squared error (RMSE) is computed. [sent-109, score-0.077]
53 The first one is a one-step prediction when after a prediction has been made, the true value is used to estimate the next time series output. [sent-111, score-0.632]
54 The second one is a multi-step or trajectory prediction, where the prediction made by a model serves as a basis for the future predictions. [sent-112, score-0.302]
55 In order to make a prediction for a time t > T , we suppose that we are provided with the vector xt−1 , which may have been observed or computed. [sent-113, score-0.269]
56 We determine the coordinates zt−1 of xφ with respect to U and infer the value of zt by zt = Az zt−1 + µz (see t−1 equation (7)); ψ is then used to recover an estimation of xt+1 (cf. [sent-114, score-0.342]
57 In all our experiments we have made the crude –yet efficient– choice of the linear kernel for k z . [sent-116, score-0.106]
58 4 0 0 20 40 60 80 100 0 50 100 150 200 250 300 350 Figure 1: (left) 100 points of the Mackey-Glass time series M G17 , (right) the first 350 points of the Laser time series. [sent-126, score-0.34]
59 Table 1: Error (RMSE) of one-step and trajectory predictions for gaussian and polynomial kernels for the time series M G17 . [sent-127, score-0.511]
60 1744 Mackey-Glass Time Series Prediction The Mackey-Glass time series comes from the modeling of blood cells production evolution. [sent-156, score-0.312]
61 8, shows some highly nonlinear chaotic behavior (see Figure 1 left). [sent-160, score-0.125]
62 We focus on M G17 , for which τ = 17, and construct embedding vectors of size d = 6 and step size κ = 6. [sent-161, score-0.068]
63 As xt is used to predict xt+κ , the whole dataset can be divided into six “independent” datasets, the first one S1 containing x1+(d−1)κ , the second one S2 , x2+(d−1)κ , . [sent-162, score-0.252]
64 The first 100 points of S1 are used to learning, while the first 100 points of S2 serve to choose the hyperparameters. [sent-167, score-0.062]
65 The prediction error is measured with respect to the points in the range 201 to 300 of S1 . [sent-168, score-0.256]
66 Table 1 reports the RMSE error obtained with gaussian and polynomial kernels, where 1S and 100S respectively stand for one-step prediction and multi-step prediction over the 100 future observations. [sent-169, score-0.583]
67 SVR one-step prediction with gaussian kernel gives the best RMSE. [sent-170, score-0.348]
68 None of the tested regularizers allows KDM to perform better, even if the prediction error obtained with them is never more than 10% away from SVR error. [sent-171, score-0.262]
69 Table 2: Error (RMSE) of one-step and trajectory predictions for gaussian and polynomial kernels for the time series Laser. [sent-172, score-0.511]
70 84 KDM trajectory prediction with gaussian kernel and regularizer γ = 100 leads to the best error. [sent-200, score-0.558]
71 It is around 17% lower than that of SVR multi-step prediction while KDM with no regularizer gives the poorest prediction, emphasizing the importance of the regularizer. [sent-201, score-0.351]
72 Regarding one-step prediction with polynomial kernel, there is no significant difference between the performance achieved by SVR and that of KDM, when regularizer is 0, 0. [sent-202, score-0.42]
73 For a regularizer γ = 100, KDM however leads to the best one-step prediction error, around 16% lower than that obtained by SVR prediction. [sent-204, score-0.351]
74 The dash ’-’ appearing in the first line of the table means that the trajectory prediction made by the SVR with a polynomial kernel has failed to give finite predictions. [sent-205, score-0.457]
75 For a regularizer value of γ = 100, it even gives the best trajectory prediction error. [sent-207, score-0.408]
76 2 Laser Time Series Prediction The Laser time series is the dataset A from the Santa Fe competition. [sent-209, score-0.258]
77 It is a univariate time series from an experiment conducted in a physics laboratory (Figure 1 (right) represents the first 350 points of the series). [sent-210, score-0.267]
78 An embedding dimension d = 3 and a step size κ = 1 are used. [sent-211, score-0.071]
79 The first 100 points are used for training, whereas the points in the range 201 to 300 provide a test set to select hyperparameters. [sent-213, score-0.062]
80 The validation error (RMSE) is evaluated on the points in the range 101 to 200. [sent-214, score-0.086]
81 The most striking information provided by this table is the large error archieved by KDM with no regularizer when a gaussian kernel is used. [sent-216, score-0.357]
82 Besides, when the regularizer γ is appropriately chosen, we see that KDM with a gaussian kernel can achieve very good predictions, for the one-step prediction and the multi-step prediction as well. [sent-218, score-0.699]
83 KDM one-step best prediction error is however not as far from SVR one-step prediction (about 10% lower) than KDM multi-step is from its SVR counterpart (around 16% lower). [sent-219, score-0.423]
84 When a polynomial kernel is used, we observe that KDM with no regularizer provides poor results with regards to the one-step prediction error. [sent-220, score-0.526]
85 Contrary to what occurs with the use of a gaussian kernel, KDM with no regularization does not show bad multi-step prediction ability. [sent-221, score-0.242]
86 Looking at the other entries of this table once again shows that KDM can give very good predictions when a well-suited regularizer is chosen. [sent-222, score-0.22]
87 Hence, we notice that the best multi-step prediction error of KDM is above 19% better than that obtained by SVR multi-step prediction. [sent-223, score-0.225]
88 1 Discussion Another Way of Choosing the Parameters The introduction of a basis U allows to find the parameters of KDM without computing any inversion of infinite dimensional matrices or division by the dimension of H. [sent-225, score-0.087]
89 Aφ T X = xφ xφ − t t−1 t=2 γI + T X T T 1 X φX φ xt xt−1 T − 1 t=2 t=2 xφ xφ t−1 t−1 t=2 T T 1 X φ X φ − xt−1 xt−1 T − 1 t=2 t=2 ! [sent-229, score-0.23]
90 2 Link to Kalman Filtering The usual way to recover a noisy nonlinear signal is to use the Extended Kalman Filter (EKF) or the Unscented Kalman Filter (UKF) [4]. [sent-234, score-0.085]
91 Given a noisy time series from the same driving process h, EKF and UKF then process that series by respectively a first-order linearization of h and an efficient ’sampling’ method to determine the clean signal. [sent-239, score-0.457]
92 5 Conclusion and Future Work Three main results are presented: first, we introduce KDM, a kernel extension of linear dynamical models and show how the kernel trick allows to learn a linear model in a feature space associated to a kernel. [sent-244, score-0.478]
93 Second, an original and efficient solution based on learning has been applied for the preimage problem. [sent-245, score-0.139]
94 Third, Kernel Dynamical Model can be linked to the Kalman filter model with a hidden state process living in the feature space. [sent-246, score-0.065]
95 In the framework of time series prediction, KDM proves to work very well and to compete with the best time series predictors particularly on long time range prediction. [sent-247, score-0.514]
96 , time series denoising) can be tackled by our approach and have to be tested. [sent-253, score-0.236]
97 Besides, the use of kernel opens the door to dealing with structured data, making KDM a very attractive tool in many areas such as bioinformatics, texts and video application. [sent-255, score-0.106]
98 Lastly, from the theoretical point of view, a very interesting issue is that of the actual noise corresponding to a gaussian noise in a feature space. [sent-256, score-0.133]
99 Nonlinear prediction of chaotic time series using support vector machines. [sent-310, score-0.535]
100 In Actes eaire ` du 19eme Symposium GRETSI sur le traitement du signal et des images, 2003. [sent-336, score-0.147]
wordName wordTfidf (topN-words)
[('kdm', 0.577), ('svr', 0.361), ('xt', 0.23), ('kalman', 0.21), ('prediction', 0.198), ('series', 0.194), ('dynamical', 0.163), ('az', 0.16), ('regularizer', 0.153), ('zt', 0.143), ('preimage', 0.139), ('preimages', 0.115), ('rmse', 0.11), ('kernel', 0.106), ('nonlinear', 0.085), ('paris', 0.08), ('modeling', 0.076), ('kz', 0.073), ('ukf', 0.069), ('zu', 0.069), ('polynomial', 0.069), ('rd', 0.066), ('kernels', 0.065), ('du', 0.061), ('ekf', 0.06), ('trajectory', 0.057), ('ltering', 0.056), ('hx', 0.055), ('smoothing', 0.055), ('lter', 0.053), ('laser', 0.048), ('france', 0.048), ('alch', 0.046), ('capitaine', 0.046), ('curie', 0.046), ('pierre', 0.046), ('ralaivola', 0.046), ('gaussian', 0.044), ('thanks', 0.044), ('feature', 0.043), ('filter', 0.043), ('universit', 0.042), ('time', 0.042), ('dimension', 0.041), ('link', 0.041), ('predictions', 0.04), ('chaotic', 0.04), ('informatique', 0.04), ('marie', 0.04), ('mapping', 0.04), ('vectors', 0.038), ('fd', 0.037), ('km', 0.037), ('laboratoire', 0.037), ('penalization', 0.037), ('regularizers', 0.037), ('rue', 0.037), ('regularizing', 0.034), ('trick', 0.034), ('sch', 0.033), ('zi', 0.033), ('coordinates', 0.033), ('support', 0.032), ('points', 0.031), ('penalizing', 0.031), ('scott', 0.031), ('performances', 0.031), ('embedding', 0.03), ('contrary', 0.029), ('gram', 0.029), ('ller', 0.029), ('vector', 0.029), ('validation', 0.028), ('predicting', 0.028), ('mika', 0.028), ('table', 0.027), ('classic', 0.027), ('clean', 0.027), ('error', 0.027), ('tasks', 0.027), ('smola', 0.027), ('em', 0.026), ('lkopf', 0.026), ('pa', 0.026), ('learn', 0.026), ('recurrent', 0.025), ('machines', 0.025), ('et', 0.025), ('future', 0.024), ('posteriori', 0.024), ('equation', 0.023), ('basis', 0.023), ('noise', 0.023), ('reports', 0.023), ('involved', 0.023), ('matrices', 0.023), ('filtering', 0.022), ('sequence', 0.022), ('dataset', 0.022), ('hidden', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
2 0.1426018 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
Author: Roland Vollgraf, Michael Scholz, Ian A. Meinertzhagen, Klaus Obermayer
Abstract: Nonlinear filtering can solve very complex problems, but typically involve very time consuming calculations. Here we show that for filters that are constructed as a RBF network with Gaussian basis functions, a decomposition into linear filters exists, which can be computed efficiently in the frequency domain, yielding dramatic improvement in speed. We present an application of this idea to image processing. In electron micrograph images of photoreceptor terminals of the fruit fly, Drosophila, synaptic vesicles containing neurotransmitter should be detected and labeled automatically. We use hand labels, provided by human experts, to learn a RBF filter using Support Vector Regression with Gaussian kernels. We will show that the resulting nonlinear filter solves the task to a degree of accuracy, which is close to what can be achieved by human experts. This allows the very time consuming task of data evaluation to be done efficiently. 1
3 0.13901684 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.13077235 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis
Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1
5 0.12115408 148 nips-2003-Online Passive-Aggressive Algorithms
Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer
Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1
6 0.095942281 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
7 0.094760768 112 nips-2003-Learning to Find Pre-Images
8 0.094549179 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
9 0.092061691 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
10 0.081992038 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
11 0.07700289 102 nips-2003-Large Scale Online Learning
12 0.072955318 145 nips-2003-Online Classification on a Budget
13 0.061194319 176 nips-2003-Sequential Bayesian Kernel Regression
14 0.060981285 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
15 0.058134113 41 nips-2003-Boosting versus Covering
16 0.056324273 114 nips-2003-Limiting Form of the Sample Covariance Eigenspectrum in PCA and Kernel PCA
17 0.055721004 171 nips-2003-Semi-Definite Programming by Perceptron Learning
18 0.05566489 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
19 0.054509785 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
20 0.04993961 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
topicId topicWeight
[(0, -0.19), (1, 0.02), (2, -0.004), (3, -0.129), (4, 0.048), (5, 0.095), (6, 0.088), (7, 0.141), (8, 0.16), (9, -0.054), (10, 0.031), (11, 0.03), (12, -0.018), (13, -0.022), (14, 0.001), (15, -0.029), (16, 0.085), (17, 0.025), (18, -0.051), (19, 0.032), (20, 0.016), (21, 0.114), (22, 0.054), (23, -0.022), (24, -0.048), (25, -0.036), (26, -0.073), (27, 0.117), (28, -0.031), (29, 0.062), (30, -0.0), (31, 0.123), (32, 0.054), (33, 0.074), (34, -0.08), (35, -0.031), (36, -0.02), (37, -0.007), (38, -0.114), (39, -0.107), (40, -0.133), (41, -0.187), (42, -0.067), (43, -0.026), (44, 0.119), (45, 0.111), (46, 0.004), (47, -0.106), (48, 0.12), (49, 0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.94017619 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
2 0.64714593 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
Author: Roland Vollgraf, Michael Scholz, Ian A. Meinertzhagen, Klaus Obermayer
Abstract: Nonlinear filtering can solve very complex problems, but typically involve very time consuming calculations. Here we show that for filters that are constructed as a RBF network with Gaussian basis functions, a decomposition into linear filters exists, which can be computed efficiently in the frequency domain, yielding dramatic improvement in speed. We present an application of this idea to image processing. In electron micrograph images of photoreceptor terminals of the fruit fly, Drosophila, synaptic vesicles containing neurotransmitter should be detected and labeled automatically. We use hand labels, provided by human experts, to learn a RBF filter using Support Vector Regression with Gaussian kernels. We will show that the resulting nonlinear filter solves the task to a degree of accuracy, which is close to what can be achieved by human experts. This allows the very time consuming task of data evaluation to be done efficiently. 1
3 0.51678377 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis
Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1
4 0.48292816 112 nips-2003-Learning to Find Pre-Images
Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces. 1
5 0.47079438 44 nips-2003-Can We Learn to Beat the Best Stock
Author: Allan Borodin, Ran El-Yaniv, Vincent Gogan
Abstract: A novel algorithm for actively trading stocks is presented. While traditional universal algorithms (and technical trading heuristics) attempt to predict winners or trends, our approach relies on predictable statistical relations between all pairs of stocks in the market. Our empirical results on historical markets provide strong evidence that this type of technical trading can “beat the market” and moreover, can beat the best stock in the market. In doing so we utilize a new idea for smoothing critical parameters in the context of expert learning. 1 Introduction: The Portfolio Selection Problem The portfolio selection (PS) problem is a challenging problem for machine learning, online algorithms and, of course, computational finance. As is well known (e.g. see Lugosi [1]) sequence prediction under the log loss measure can be viewed as a special case of portfolio selection, and perhaps more surprisingly, from a certain worst case minimax criterion, portfolio selection is not essentially any harder (than prediction) as shown in [2] (see also [1], Thm. 20 & 21). But there seems to be a qualitative difference between the practical utility of “universal” sequence prediction and universal portfolio selection. Simply stated, universal sequence prediction algorithms under various probabilistic and worst-case models work very well in practice whereas the known universal portfolio selection algorithms do not seem to provide any substantial benefit over a naive investment strategy (see Sec. 4). A major pragmatic question is whether or not a computer program can consistently outperform the market. A closer inspection of the interesting ideas developed in information theory and online learning suggests that a promising approach is to exploit the natural volatility in the market and in particular to benefit from simple and rather persistent statistical relations between stocks rather than to try to predict stock prices or “winners”. We present a non-universal portfolio selection algorithm1 , which does not try to predict winners. The motivation behind our algorithm is the rationale behind constant rebalancing algorithms and the worst case study of universal trading introduced by Cover [3]. Not only does our proposed algorithm substantially “beat the market” on historical markets, it also beats the best stock. So why are we presenting this algorithm and not just simply making money? There are, of course some caveats and obstacles to utilizing the algorithm. But for large investors the possibility of a goose laying silver (if not golden) eggs is not impossible. 1 Any PS algorithm can be modified to be universal by investing any fixed fraction of the initial wealth in a universal algorithm. Assume a market with m stocks. Let vt = (vt (1), . . . , vt (m)) be the closing prices of the m stocks for the tth day, where vt (j) is the price of the jth stock. It is convenient to work with relative prices xt (j) = vt (j)/vt−1 (j) so that an investment of $d in the jth stock just before the tth period yields dxt (j) dollars. We let xt = (xt (1), . . . , xt (m)) denote the market vector of relative prices corresponding to the tth day. A portfolio b is an allocation of wealth in the stocks, specified by the proportions b = (b(1), . . . , b(m)) of current dollar wealth invested in each of the stocks, where b(j) ≥ 0 and j b(j) = 1. The daily return of a portfolio b w.r.t. a market vector x is b · x = j b(j)x(j) and the (compound) total return, retX (b1 , . . . , bn ), of a sequence of portfolios b1 , . . . , bn w.r.t. a market sequence n X = x1 , . . . , xn is t=1 bt · xt . A portfolio selection algorithm is any deterministic or randomized rule for specifying a sequence of portfolios. The simplest strategy is to “buy-and-hold” stocks using some portfolio b. We denote this strategy by BAHb and let U-BAH denote the uniform buy-and-hold when b = (1/m, . . . , 1/m). We say that a portfolio selection algorithm “beats the market” when it outperforms U-BAH on a given market sequence although in practice “the market” can be represented by some non-uniform BAH (e.g. DJIA). Buy-and-hold strategies rely on the tendency of successful markets to grow. Much of modern portfolio theory focuses on how to choose a good b for the buy-and-hold strategy. The seminal ideas of Markowitz in [4] yield an algorithmic procedure for choosing the weights of the portfolio b so as to minimize the variance for any feasible expected return. This variance minimization is possible by placing appropriate larger weights on subsets of anti-correlated stocks, an idea which we shall also utilize. We denote the optimal in hindsight buy-and-hold strategy (i.e. invest only in the best stock) by BAH∗ . An alternative approach to the static buy-and-hold is to dynamically change the portfolio during the trading period. This approach is often called “active trading”. One example of active trading is constant rebalancing; namely, fix a portfolio b and (re)invest your dollars each day according to b. We denote this constant rebalancing strategy by CBALb and let CBAL∗ denote the optimal (in hindsight) CBAL. A constant rebalancing strategy can often take advantage of market fluctuations to achieve a return significantly greater than that of BAH∗ . CBAL∗ is always at least as good as the best stock BAH∗ and in some real market sequences a constant rebalancing strategy will take advantage of market fluctuations and significantly outperform the best stock (see Table 1). For now, consider Cover and Gluss’ [5] classic (but contrived) example of a market consisting of cash and one stock and 1 1 the market sequence of price relatives 1/2 , 1 , 1/2 , 1 , . . . Now consider the CBALb 2 2 3 1 1 1 11 with b = ( 2 , 2 ). On each odd day the daily return of CBALb is 2 1 + 2 2 = 4 and on each even day, it is 3/2. The total return over n days is therefore (9/8)n/2 , illustrating how a constant rebalancing strategy can yield exponential returns in a “no-growth market”. Under the assumption that the daily market vectors are observations of identically and independently distributed (i.i.d) random variables, it is shown in [6] that CBAL∗ performs at least as good (in the sense of expected total return) as the best online portfolio selection algorithm. However, many studies (see e.g. [7]) argue that stock price sequences do have long term memory and are not i.i.d. A non-traditional objective (in computational finance) is to develop online trading strategies that are in some sense always guaranteed to perform well. Within a line of research pioneered by Cover [5, 3, 2] one attempts to design portfolio selection algorithms that can provably do well (in terms of their total return) with respect to some online or offline benchmark algorithms. Two natural online benchmark algorithms are the uniform buy and hold U-BAH, and the uniform constant rebalancing strategy U-CBAL, which is CBALb with 1 1 b = ( m , . . . , m ). A natural offline benchmark is BAH∗ and a more challenging offline benchmark is CBAL∗ . Cover and Ordentlich’s Universal Portfolios algorithm [3, 2], denoted here by UNIVERSAL, was proven to be universal against CBAL∗ , in the sense that for every market sequence X of m stocks over n days, it guarantees a sub-exponential (indeed polynomial) ratio in n, retX (CBAL∗ )/retX (UNIVERSAL) ≤ O n m−1 2 (1) From a theoretical perspective this is surprising as the ratio is a polynomial in n (for fixed m) whereas CBAL∗ is capable of exponential returns. From a practical perspective, while the m−1 ratio n 2 is not very useful, the motivation that underlies the potential of CBAL algorithms is useful! We follow this motivation and develop a new algorithm which we call ANTICOR. By attempting to systematically follow the constant rebalancing philosophy, ANTICOR is capable of some extraordinary performance in the absence of transaction costs, or even with very small transaction costs. 2 Trying to Learn the Winners The most direct approach to expert learning and portfolio selection is a “(reward based) weighted average prediction” algorithm which adaptively computes a weighted average of experts by gradually increasing (by some multiplicative or additive update rule) the relative weights of the more successful experts. For example, in the context of the PS problem consider the “exponentiated gradient” EG(η) algorithm proposed by Helmbold et al. [8]. The EG(η) algorithm computes the next portfolio to be bt+1 (j) = bt (j) exp {ηxt (j)/(bt · xt )} m j=1 bt (j) exp {ηxt (j)/(bt · xt )} where η is a “learning rate” parameter. EG was designed to greedily choose the best portfolio for yesterday’s market xt while at the same time paying a penalty from moving far from yesterday’s portfolio. For a universal bound on EG, Helmbold et al. set η = 2xmin 2(log m)/n where xmin is a lower bound on any price relative.2 It is easy to see that as n increases, η decreases to 0 so that we can think of η as being very small in order to achieve universality. When η = 0, the algorithm EG(η) degenerates to the uniform CBAL which is not a universal algorithm. It is also the case that if each day the price relatives for all stocks were identical, then EG (as well as other PS algorithms) will converge to the uniform CBAL. Combining a small learning rate with a “reasonably balanced” market we expect the performance of EG to be similar to that of the uniform CBAL and this is confirmed by our experiments (see Table1).3 Cover’s universal algorithms adaptively learn each day’s portfolio by increasing the weights of successful CBALs. The update rule for these universal algorithms is bt+1 = b · rett (CBALb )dµ(b) , rett (CBALb )dµ(b) where µ(·) is some prior distribution over portfolios. Thus, the weight of a possible portfolio is proportional to its total return rett (b) thus far times its prior. The particular universal algorithm we consider in our experiments uses the Dirichlet prior (with parameters 1 1 ( 2 , . . . , 2 )) [2]. Within a constant factor, this algorithm attains the optimal ratio (1) with respect to CBAL∗ .4 The algorithm is equivalent to a particular static distribution over the 2 Helmbold et al. show how to eliminate the need to know xmin and n. While EG can be made universal, its performance ratio is only sub-exponential (and not polynomial) in n. 3 Following Helmbold et al. we fix η = 0.01 in our experiments. 4 Experimentally (on our datasets) there is a negligible difference between the uniform universal algorithm in [3] and the above Dirichlet universal algorithm. class of all CBALs. This equivalence helps to demystify the universality result and also shows that the algorithm can never outperform CBAL∗ . A different type of “winner learning” algorithm can be obtained from any sequence prediction strategy. For each stock, a (soft) sequence prediction algorithm provides a probability p(j) that the next symbol will be j ∈ {1, . . . , m}. We view this as a prediction that stock j will have the best price relative for the next day and set bt+1 (j) = pj . We consider predictions made using the prediction component of the well-known Lempel-Ziv (LZ) lossless compression algorithm [9]. This prediction component is nicely described in Langdon [10] and in Feder [11]. As a prediction algorithm, LZ is provably powerful in various senses. First it can be shown that it is asymptotically optimal with respect to any stationary and ergodic finite order Markov source (Rissanen [12]). Moreover, Feder shows that LZ is also universal in a worst case sense with respect to the (offline) benchmark class of all finite state prediction machines. To summarize, the common approach to devising PS algorithms has been to attempt and learn winners using winner learning schemes. 3 The Anticor Algorithm We propose a different approach, motivated by the CBAL “philosophy”. How can we interpret the success of the uniform CBAL on the Cover and Gluss example of Sec. 1? Clearly, the uniform CBAL here is taking advantage of price fluctuation by constantly transferring wealth from the high performing stock to the anti-correlated low performing stock. Even in a less contrived market, we should be able to take advantage when a stock is currently outperforming other stocks especially if this strong performance is anti-correlated with the performance of these other stocks. Our ANTICORw algorithm considers a short market history (consisting of two consecutive “windows”, each of w trading days) so as to model statistical relations between each pair of stocks. Let LX1 = log(xt−2w+1 ), . . . , log(xt−w )T and LX2 = log(xt−w+1 ), . . . , log(xt )T , where log(xk ) denotes (log(xk (1)), . . . , log(xk (m))). Thus, LX1 and LX2 are the two vector sequences (equivalently, two w × m matrices) constructed by taking the logarithm over the market subsequences corresponding to the time windows [t − 2w + 1, t − w] and [t − w + 1, t], respectively. We denote the jth column of LXk by LXk (j). Let µk = (µk (1), . . . , µk (m)), be the vectors of averages of columns of LXk (that is, µk (j) = E{LXk (j)}). Similarly, let σk , be the vector of standard deviations of columns of LXk . The cross-correlation matrix (and its normalization) between column vectors in LX1 and LX2 are defined as: Mcov (i, j) = (LX1 (i) − µ1 (i))T (LX2 (j) − µ2 (j)); Mcov (i,j) σ1 (i)σ2 (j) σ1 (i), σ2 (j) = 0; 0 otherwise. Mcor (i, j) ∈ [−1, 1] measures the correlation between log-relative prices of stock i over the first window and stock j over the second window. For each pair of stocks i and j we compute claimi→j , the extent to which we want to shift our investment from stock i to stock j. Namely, there is such a claim iff µ2 (i) > µ2 (j) and Mcor (i, j) > 0 in which case claimi→j = Mcor (i, j) + A(i) + A(j) where A(h) = |Mcor (h, h)| if Mcor (h, h) < 0, else 0. Following our interpretation for the success of a CBAL, Mcor (i, j) > 0 is used to predict that stocks i and j will be correlated in consecutive windows (i.e. the current window and the next window based on the evidence for the last two windows) and Mcor (h, h) < 0 predicts that stock h will be anti-correlated with itself over consec˜ utive windows. Finally, bt+1 (i) = bt (i) + j=i [transferj→i − transferi→j ] where ˜ t (i) · claimi→j / ˜ transferi→j = b j claimi→j and bt is the resulting portfolio just after market closing (on day t). Mcor (i, j) SP500: Anticor vs. window size NYSE: Anticor vs. window size w BAH(Anticor ) w Anticor 12 8 w Best Stock Market Return 10 Total Return Total Return (log−scale) 10 Anticorw 5 10 BAH(Anticorw) Anticorw Best Stock Market Best Stock 8 Anticorw 6 4 2 10 2 1 10 Best Stock 1 0 10 2 5 10 15 20 25 5 30 10 15 20 25 30 Window Size (w) Window Size (w) Figure 1: ANTICORw ’s total return (per $1 investment) vs. window size 2 ≤ w ≤ 30 for NYSE (left) and SP500 (right). Our ANTICORw algorithm has one critical parameter, the window size w. In Figure 1 we depict the total return of ANTICORw on two historical datasets as a function of the window size w = 2, . . . , 30. As we might expect, the performance of ANTICORw depends significantly on the window size. However, for all w, ANTICORw beats the uniform market and, moreover, it beats the best stock using most window sizes. Of course, in online trading we cannot choose w in hindsight. Viewing the ANTICORw algorithms as experts, we can try to learn the best expert. But the windows, like individual stocks, induce a rather volatile set of experts and standard expert combination algorithms [13] tend to fail. Alternatively, we can adaptively learn and invest in some weighted average of all ANTICORw algorithms with w less than some maximum W . The simplest case is a uniform investment on all the windows; that is, a uniform buy-and-hold investment on the algorithms ANTICORw , w ∈ [2, W ], denoted by BAHW (ANTICOR). Figure 2 (left) graphs the total return of BAHW (ANTICOR) as a function of W for all values of 2 ≤ W ≤ 50 with respect to the NYSE dataset (see details below). Similar graphs for the other datasets we consider appear qualitatively the same and the choice W = 30 is clearly not optimal. However, for all W ≥ 3, BAHW (ANTICOR) beats the best stock in all our experiments. DJIA: Dec 14, 2002 − Jan 14, 2003 NYSE: Total Return vs. Max Window 10 1.1 BAHW(Anticor) 10 Best Stock MArket 4 10 3 10 Best Stock 2.8 Anticor2 2.2 2.6 1 BAHW(Anticor) 5 Total Return Total Return (log−scale) 10 6 Anticor1 Stocks 7 2 0.9 2.4 1.8 0.8 2.2 1.6 0.7 2 1.4 0.6 2 10 1.8 1.2 0.5 1 10 1.6 1 0.4 0 10 5 10 15 20 25 30 35 Maximal Window size (W) 40 45 50 5 10 15 20 25 Days 5 10 15 20 25 Days 5 10 15 20 25 Days Figure 2: Left: BAHW (ANTICOR)’s total return (per $1 investment) as a function of the maximal window W . Right: Cumulative returns for last month of the DJIA dataset: stocks (left panel); ANTICORw algorithms trading the stocks (denoted ANTICOR1 , middle panel); ANTICORw algorithms trading the ANTICOR algorithms (right panel). Since we now consider the various algorithms as stocks (whose prices are determined by the cumulative returns of the algorithms), we are back to our original portfolio selection problem and if the ANTICOR algorithm performs well on stocks it may also perform well on algorithms. We thus consider active investment in the various ANTICORw algorithms using ANTICOR. We again consider all windows w ≤ W . Of course, we can continue to compound the algorithm any number of times. Here we compound twice and then use a buy-and-hold investment. The resulting algorithm is denoted BAHW (ANTICOR(ANTICOR)). One impact of this compounding, depicted in Figure 2 (right), is to smooth out the anti-correlations exhibited in the stocks. It is evident that after compounding twice the returns become almost completely correlated thus diminishing the possibility that additional compounding will substantially help.5 This idea for eliminating critical parameters may be applicable in other learning applications. The challenge is to understand the conditions and applications in which the process of compounding algorithms will have this smoothing effect! 4 Experimental Results We present an experimental study of the the ANTICOR algorithm and the three online learning algorithms described in Sec. 2. We focus on BAH30 (ANTICOR), abbreviated by ANTI1 and BAH30 (ANTICOR(ANTICOR)), abbreviated by ANTI2 . Four historical datasets are used. The first NYSE dataset, is the one used in [3, 2, 8, 14]. This dataset contains 5651 daily prices for 36 stocks in the New York Stock Exchange (NYSE) for the twenty two year period July 3rd , 1962 to Dec 31st , 1984. The second TSE dataset consists of 88 stocks from the Toronto Stock Exchange (TSE), for the five year period Jan 4th , 1994 to Dec 31st , 1998. The third dataset consists of the 25 stocks from SP500 which (as of Apr. 2003) had the largest market capitalization. This set spans 1276 trading days for the period Jan 2nd , 1998 to Jan 31st , 2003. The fourth dataset consists of the thirty stocks composing the Dow Jones Industrial Average (DJIA) for the two year period (507 days) from Jan 14th , 2001 to Jan 14th , 2003.6 These four datasets are quite different in nature (the market returns for these datasets appear in the first row of Table 1). While every stock in the NYSE increased in value, 32 of the 88 stocks in the TSE lost money, 7 of the 25 stocks in the SP500 lost money and 25 of the 30 stocks in the “negative market” DJIA lost money. All these sets include only highly liquid stocks with huge market capitalizations. In order to maximize the utility of these datasets and yet present rather different markets, we also ran each market in reverse. This is simply done by reversing the order and inverting the relative prices. The reverse datasets are denoted by a ‘-1’ superscript. Some of the reverse markets are particularly challenging. For example, all of the NYSE−1 stocks are going down. Note that the forward and reverse markets (i.e. U-BAH) for the TSE are both increasing but that the TSE−1 is also a challenging market since so many stocks (56 of 88) are declining. Table 1 reports on the total returns of the various algorithms for all eight datasets. We see that prediction algorithms such as LZ can do quite well but the more aggressive ANTI1 and 2 ANTI have excellent and sometimes fantastic returns. Note that these active strategies beat the best stock and even CBAL∗ in all markets with the exception of the TSE−1 in which they still significantly outperform the market. The reader may well be distrustful of what appears to be such unbelievable returns for ANTI1 and ANTI2 especially when applied to the NYSE dataset. However, recall that the NYSE dataset consists of n = 5651 trading days and the y such that y n = the total NYSE return is approximately 1.0029511 for ANTI1 (respectively, 1.0074539 for ANTI2 ); that is, the average daily increase is less than .3% 5 This smoothing effect also allows for the use of simple prediction algorithms such as “expert advice” algorithms [13], which can now better predict a good window size. We have not explored this direction. 6 The four datasets, including their sources and individual stock compositions can be downloaded from http://www.cs.technion.ac.il/∼rani/portfolios. (respectively, .75%). Thus a transaction cost of 1% can present a significant challenge to such active trading strategies (see also Sec. 5). We observe that UNIVERSAL and EG have no substantial advantage over U-CBAL. Some previous expositions of these algorithms highlighted particular combinations of stocks where the returns significantly outperformed UNIVERSAL and the best stock. But the same can be said for U-CBAL. Algorithm M ARKET (U-BAH) B EST S TOCK CBAL∗ U-CBAL ANTI1 ANTI2 LZ EG UNIVERSAL NYSE 14.49 54.14 250.59 27.07 17,059,811.56 238,820,058.10 79.78 27.08 26.99 TSE 1.61 6.27 6.77 1.59 26.77 39.07 1.32 1.59 1.59 SP500 1.34 3.77 4.06 1.64 5.56 5.88 1.67 1.64 1.62 DJIA 0.76 1.18 1.23 0.81 1.59 2.28 0.89 0.81 0.80 NYSE−1 0.11 0.32 2.86 0.22 246.22 1383.78 5.41 0.22 0.22 TSE−1 1.67 37.64 58.61 1.18 7.12 7.27 4.80 1.19 1.19 SP500−1 0.87 1.65 1.91 1.09 6.61 9.69 1.20 1.09 1.07 DJIA−1 1.43 2.77 2.97 1.53 3.67 4.60 1.83 1.53 1.53 Table 1: Monetary returns in dollars (per $1 investment) of various algorithms for four different datasets and their reversed versions. The winner and runner-up for each market appear in boldface. All figures are truncated to two decimals. 5 Concluding Remarks When handling a portfolio of m stocks our algorithm may perform up to m transactions per day. A major concern is therefore the commissions it will incur. Within the proportional commission model (see e.g. [14] and [15], Sec. 14.5.4) there exists a fraction γ ∈ (0, 1) such that an investor pays at a rate of γ/2 for each buy and for each sell. Therefore, the return of a sequence b1 , . . . , bn of portfolios with re˜ spect to a market sequence x1 , . . . , xn is t bt · xt (1 − j γ |bt (j) − bt (j)|) , where 2 1 ˜ (bt (1)xt (1), . . . , bt (m)xt (m)). Our investment algorithm in its simplest form bt = bt ·xt can tolerate very small proportional commission rates and still beat the best stock.7 We note that Blum and Kalai [14] showed that the performance guarantee of UNIVERSAL still holds (and gracefully degrades) in the case of proportional commissions. Many current online brokers only charge a small per share commission rate. A related problem that one must face when actually trading is the difference between bid and ask prices. These bid-ask spreads (and the availability of stocks for both buying and selling) are typically functions of stock liquidity and are typically smaller for large market capitalization stocks. We consider here only very large market cap stocks. As a final caveat, we note that we assume that any one portfolio selection algorithm has no impact on the market! But just like any goose laying golden eggs, widespread use will soon lead to the end of the goose; that is, the market will quickly react. Any report of abnormal returns using historical markets should be suspected of “data snooping”. In particular, when a dataset is excessively mined by testing many strategies there is a substantial chance that one of the strategies will be successful by simple overfitting. Another data snooping hazard is stock selection. For example, the 36 stocks selected for the NYSE dataset were all known to have survived for 22 years. Our ANTICOR algorithms were fully developed using only the NYSE and TSE datasets. The DJIA and SP500 sets were obtained (from public domain sources) after the algorithms were fixed. Finally, our algorithm has one parameter (the maximal window size W ). Our experiments indicate that the algorithm’s performance is robust with respect to W (see Figure 2). 7 For example, with γ = 0.1% we can typically beat the best stock. These results will be presented in the full paper. A number of well-respected works report on statistically robust “abnormal” returns for simple “technical analysis” heuristics, which slightly beat the market. For example, the landmark study of Brock et al. [16] apply 26 simple trading heuristics to the DJIA index from 1897 to 1986 and provide strong support for technical analysis heuristics. While consistently beating the market is considered a great (if not impossible) challenge, our approach to portfolio selection indicates that beating the best stock is an achievable goal. What is missing at this point of time is an analytical model which better explains why our active trading strategies are so successful. In this regard, we are investigating various “statistical adversary” models along the lines suggested by [17, 18]. Namely, we would like to show that an algorithm performs well (relative to some benchmark) for any market sequence that satisfies certain constraints on its empirical statistics. References [1] G. Lugosi. Lectures on prediction URL:http://www.econ.upf.es/∼lugosi/ihp.ps, 2001. of individual sequences. [2] T.M. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348–363, 1996. [3] T.M. Cover. Universal portfolios. Mathematical Finance, 1:1–29, 1991. [4] H. Markowitz. Portfolio Selection: Efficient Diversification of Investments. John Wiley and Sons, 1959. [5] T.M. Cover and D.H. Gluss. Empirical bayes stock market portfolios. Advances in Applied Mathematics, 7:170–181, 1986. [6] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991. [7] A. Lo and C. MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999. [8] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. Portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998. [9] J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, 24:530–536, 1978. [10] G.G. Langdon. A note on the lempel-ziv model for compressing individual sequences. IEEE Transactions on Information Theory, 29:284–287, 1983. [11] M. Feder. Gambling using a finite state machine. IEEE Transactions on Information Theory, 37:1459–1465, 1991. [12] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29:656–664, 1983. [13] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997. [14] A. Blum and A. Kalai. Universal portfolios with and without transaction costs. Machine Learning, 30(1):23–30, 1998. [15] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998. [16] L. Brock, J. Lakonishok, and B. LeBaron. Simple technical trading rules and the stochastic properties of stock returns. Journal of Finance, 47:1731–1764, 1992. [17] P. Raghavan. A statistical adversary for on-line algorithms. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 7:79–83, 1992. [18] A. Chou, J.R. Cooperstock, R. El-Yaniv, M. Klugerman, and T. Leighton. The statistical adversary allows optimal money-making trading strategies. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995.
6 0.43547916 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
7 0.42081568 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
8 0.39346671 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis
9 0.38654128 126 nips-2003-Measure Based Regularization
10 0.38638854 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
11 0.38222778 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
12 0.37670439 173 nips-2003-Semi-supervised Protein Classification Using Cluster Kernels
13 0.36415574 76 nips-2003-GPPS: A Gaussian Process Positioning System for Cellular Networks
14 0.33517176 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
15 0.33441538 145 nips-2003-Online Classification on a Budget
16 0.33130696 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering
17 0.3265 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
18 0.3228687 171 nips-2003-Semi-Definite Programming by Perceptron Learning
19 0.30876675 176 nips-2003-Sequential Bayesian Kernel Regression
20 0.3076902 148 nips-2003-Online Passive-Aggressive Algorithms
topicId topicWeight
[(0, 0.02), (1, 0.227), (6, 0.019), (11, 0.03), (24, 0.013), (29, 0.015), (30, 0.036), (35, 0.048), (53, 0.089), (71, 0.053), (76, 0.071), (82, 0.021), (85, 0.123), (91, 0.121)]
simIndex simValue paperId paperTitle
1 0.8610608 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
Author: Clayton Scott, Robert Nowak
Abstract: This paper reports on a family of computationally practical classifiers that converge to the Bayes error at near-minimax optimal rates for a variety of distributions. The classifiers are based on dyadic classification trees (DCTs), which involve adaptively pruned partitions of the feature space. A key aspect of DCTs is their spatial adaptivity, which enables local (rather than global) fitting of the decision boundary. Our risk analysis involves a spatial decomposition of the usual concentration inequalities, leading to a spatially adaptive, data-dependent pruning criterion. For any distribution on (X, Y ) whose Bayes decision boundary behaves locally like a Lipschitz smooth function, we show that the DCT error converges to the Bayes error at a rate within a logarithmic factor of the minimax optimal rate. We also study DCTs equipped with polynomial classification rules at each leaf, and show that as the smoothness of the boundary increases their errors converge to the Bayes error at a rate approaching n−1/2 , the parametric rate. We are not aware of any other practical classifiers that provide similar rate of convergence guarantees. Fast algorithms for tree pruning are discussed. 1
same-paper 2 0.8415072 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
Author: Liva Ralaivola, Florence D'alché-buc
Abstract: We consider the question of predicting nonlinear time series. Kernel Dynamical Modeling (KDM), a new method based on kernels, is proposed as an extension to linear dynamical models. The kernel trick is used twice: first, to learn the parameters of the model, and second, to compute preimages of the time series predicted in the feature space by means of Support Vector Regression. Our model shows strong connection with the classic Kalman Filter model, with the kernel feature space as hidden state space. Kernel Dynamical Modeling is tested against two benchmark time series and achieves high quality predictions. 1
3 0.67180622 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling
Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1
4 0.65866733 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
Author: Jianxin Wu, James M. Rehg, Matthew D. Mullin
Abstract: Face detection is a canonical example of a rare event detection problem, in which target patterns occur with much lower frequency than nontargets. Out of millions of face-sized windows in an input image, for example, only a few will typically contain a face. Viola and Jones recently proposed a cascade architecture for face detection which successfully addresses the rare event nature of the task. A central part of their method is a feature selection algorithm based on AdaBoost. We present a novel cascade learning algorithm based on forward feature selection which is two orders of magnitude faster than the Viola-Jones approach and yields classifiers of equivalent quality. This faster method could be used for more demanding classification tasks, such as on-line learning. 1
5 0.65754658 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
Author: Kevin P. Murphy, Antonio Torralba, William T. Freeman
Abstract: Standard approaches to object detection focus on local patches of the image, and try to classify them as background or not. We propose to use the scene context (image as a whole) as an extra source of (global) information, to help resolve local ambiguities. We present a conditional random field for jointly solving the tasks of object detection and scene classification. 1
6 0.65528119 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
7 0.65442097 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning
8 0.65206093 124 nips-2003-Max-Margin Markov Networks
9 0.65103793 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
10 0.64850473 3 nips-2003-AUC Optimization vs. Error Rate Minimization
11 0.6459353 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
12 0.64566869 125 nips-2003-Maximum Likelihood Estimation of a Stochastic Integrate-and-Fire Neural Model
13 0.64511734 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
14 0.6432333 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
15 0.64209753 78 nips-2003-Gaussian Processes in Reinforcement Learning
16 0.6413548 126 nips-2003-Measure Based Regularization
17 0.64077473 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
18 0.64046025 49 nips-2003-Decoding V1 Neuronal Activity using Particle Filtering with Volterra Kernels
19 0.63931626 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
20 0.63783115 68 nips-2003-Eye Movements for Reward Maximization