nips nips2003 nips2003-146 knowledge-graph by maker-knowledge-mining

146 nips-2003-Online Learning of Non-stationary Sequences


Source: pdf

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. [sent-3, score-0.186]

2 We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. [sent-4, score-0.702]

3 On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. [sent-5, score-0.399]

4 We demonstrate the new algorithm in the context of wireless networks. [sent-6, score-0.203]

5 1 Introduction We focus on the online learning framework in which the learner has access to a set of experts but possesses no other a priori information relating to the observation sequence. [sent-7, score-0.466]

6 In such a scenario the learner may choose to quickly identify a single best expert to rely on [12], or switch from one expert to another in response to perceived changes in the observation sequence [8], thus making assumptions about the switching dynamics. [sent-8, score-0.838]

7 The ability to shift emphasis from one “expert” to another, in response to changes in the observations, is valuable in many applications, including energy management in wireless networks. [sent-9, score-0.266]

8 Many algorithms developed for universal prediction on the basis of a set of experts have clear performance guarantees. [sent-10, score-0.38]

9 The performance bounds characterize the regret relative to the best expert, or best sequence of experts, chosen in hindsight. [sent-13, score-0.689]

10 Algorithms with such relative loss guarantees have also been developed for adaptive game playing [5], online portfolio management [7], paging [3] and the k-armed bandit problem [1]. [sent-14, score-0.647]

11 Other relative performance measures for universal prediction involve comparing across systematic variations in the sequence [4]. [sent-15, score-0.247]

12 We provide upper and lower performance bounds, and demonstrate the utility of these algorithms in the context of wireless networks. [sent-18, score-0.315]

13 2 Algorithms and performance guarantees The learner has access to n experts, $a_1, \ldots, a_n$. [sent-19, score-0.168]

14 Each expert makes a prediction at each time-step over a finite (known) time period $t = 1, \ldots, T$. [sent-22, score-0.308]

15 We denote the ith expert at time t as $a_{i,t}$ to suppress any details about how the experts arrive at their predictions and what information is available to facilitate the predictions. [sent-26, score-0.461]

16 These details may vary from one expert to another and may change over time. [sent-27, score-0.264]

17 We denote the non-negative prediction loss of expert i at time t as $L(i, t)$, where the loss, a function of t, naturally depends on the observation $y_t \in Y$ at time t. [sent-28, score-0.742]

18 The prediction loss of such an algorithm is denoted by $L(p_t, t)$. [sent-33, score-0.348]

19 For the purpose of deriving learning algorithms such as Static-expert and Fixed-share described in [8], we associate the loss of each expert with a predictive probability so that $-\log p(y_t \mid y_{t-1}, \ldots, y_1, i) = L(i, t)$. [sent-34, score-0.684]

20 We define the loss of any probabilistic prediction to be the log-loss: $L(p_t, t) = -\log \sum_{i=1}^{n} p_t(i)\, p(y_t \mid i, y_1, \ldots, y_{t-1}) = -\log \sum_{i=1}^{n} p_t(i)\, e^{-L(i,t)}$ (1). [sent-38, score-0.544]

21 Many other definitions of the loss corresponding to $p_t(\cdot)$ can be bounded by a scaled log-loss [6, 8]. [sent-41, score-0.635]
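As a concrete illustration of the log-loss (1), here is a minimal numpy sketch (function and variable names are our own, not from the paper), using log-sum-exp for numerical stability:

    import numpy as np

    def mixture_log_loss(p_t, losses):
        # eq. (1): L(p_t, t) = -log sum_i p_t(i) * exp(-L(i, t))
        a = np.log(p_t) - losses            # log of each term in the sum
        m = np.max(a)
        return -(m + np.log(np.sum(np.exp(a - m))))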

22 The algorithms combining expert predictions can now be derived as simple Bayesian estimation methods calculating the distribution $p_t(i) = P(i \mid y_1, \ldots, y_{t-1})$ over the experts on the basis of the observations seen so far. [sent-43, score-0.459] [sent-46, score-0.165]

24 Updating $p_t(\cdot)$ involves assumptions about how the optimal choice of expert can change with time. [sent-48, score-0.297]

25 The Bayesian algorithm updating $p_t(\cdot)$ is defined analogously to forward propagation in generalized HMMs (allowing observation dependence on the past): $p_t(i; \alpha) = \frac{1}{Z_t} \sum_{j=1}^{n} p_{t-1}(j; \alpha)\, e^{-L(j, t-1)}\, p(i \mid j; \alpha)$ (3), where $Z_t$ normalizes the distribution. [sent-51, score-0.437]
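A minimal sketch of one step of update (3), under the standard Fixed-share switching dynamics of [8] ($p(i \mid j; \alpha) = 1-\alpha$ if $i = j$, else $\alpha/(n-1)$); names are illustrative and numpy is assumed, as in the previous sketch:

    def fixed_share_update(p_prev, losses_prev, alpha):
        # eq. (3): p_t(i) = (1/Z_t) sum_j p_{t-1}(j) e^{-L(j, t-1)} p(i|j; alpha)
        n = len(p_prev)
        w = p_prev * np.exp(-losses_prev)   # Bayesian reweighting by the last losses
        w = w / w.sum()                     # Z_t normalization
        # stay put with probability 1 - alpha, else switch uniformly to another expert
        return (1.0 - alpha) * w + alpha * (1.0 - w) / (n - 1)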

26 While statistical assumptions (e.g., conditional independence of expert predictions) are made in deriving the algorithm, the algorithms can be used in a context where no such statistical assumptions about the observation sequence or the experts are warranted. [sent-54, score-0.628]

27 2.1 Relative loss bounds The existing upper bound on the relative loss of the Fixed-share algorithm [8] is expressed in terms of the loss of the algorithm relative to the loss of the best k-partition of the observation sequence, where the best expert is assigned to each segment. [sent-57, score-1.933]

28 We start by providing here a similar guarantee, but characterizing the regret relative to the best Fixed-share algorithm, parameterized by $\alpha^*$, where $\alpha^*$ is chosen in hindsight after having seen the observation sequence. [sent-58, score-0.623]

29 Our proof technique is different from [8] and gives rise to simple guarantees for a wider class of prediction methods, along with a lower bound on this regret. [sent-59, score-0.262]

30 Lemma 1: Let $L_T(\alpha) = \sum_{t=1}^{T} L(p_{t;\alpha}, t)$, $\alpha \in [0, 1]$, be the cumulative loss of the Fixed-share algorithm on an arbitrary sequence of observations. [sent-61, score-0.51]

31 Then for any $\alpha, \alpha^*$: $L_T(\alpha) - L_T(\alpha^*) = -\log E_{\hat\alpha \sim Q}\, e^{(T-1)\left[ D(\alpha^* \| \hat\alpha) - D(\alpha \| \hat\alpha) \right]}$ (4). Proof: The cumulative log-loss of the Bayesian algorithm can be expressed in terms of the negative log-probability of all the observations: $L_T(\alpha) = -\log \left[ \sum_s \phi(s)\, p(s; \alpha) \right]$ (5), where $s = \{i_1, \ldots, i_T\}$, $\phi(s) = \prod_{t=1}^{T} e^{-L(i_t, t)}$, and $p(s; \alpha) = p_1(i_1) \prod_{t=2}^{T} p(i_t \mid i_{t-1}; \alpha)$. [sent-62, score-0.296]

32 Consequently, $L_T(\alpha) - L_T(\alpha^*) = -\log \frac{\sum_s \phi(s)\, p(s; \alpha)}{\sum_r \phi(r)\, p(r; \alpha^*)} = -\log \sum_s \frac{\phi(s)\, p(s; \alpha^*)}{\sum_r \phi(r)\, p(r; \alpha^*)} \cdot \frac{p(s; \alpha)}{p(s; \alpha^*)} = -\log \sum_s Q(s; \alpha^*)\, \frac{p(s; \alpha)}{p(s; \alpha^*)}$. □ [sent-65, score-0.47]

33 We obtain upper and lower bounds on the regret by optimizing the expression for the regret over $\hat{Q} \in \mathcal{Q}$, the set of all distributions over $\alpha \in [0, 1]$. [sent-68, score-0.567]

34 2.1.1 Upper bound The upper bound follows from solving $\max_{\hat{Q} \in \mathcal{Q}}\; -\log E_{\hat\alpha \sim \hat{Q}}\, e^{(T-1)\left[ D(\alpha^* \| \hat\alpha) - D(\alpha \| \hat\alpha) \right]}$, subject to the constraint that $\alpha^*$ has to be the hindsight-optimal switching-rate, i.e. [sent-71, score-0.315]

35 that: (C1) $\frac{d}{d\alpha} \left( L_T(\alpha) - L_T(\alpha^*) \right) \big|_{\alpha = \alpha^*} = 0$. Theorem 1: Let $L_T(\alpha^*) = \min_\alpha L_T(\alpha)$ be the loss of the best Fixed-share algorithm chosen in hindsight. [sent-73, score-0.339]

36 While the regret appears proportional to T, this dependence vanishes for any reasonable learning algorithm that is guaranteed to find α such that $D(\alpha^* \| \alpha) \le O(1/T)$, as we will show in Section 3. [sent-77, score-0.486]

37 In the general case, the regret bound is $(T-1) \max_i D\big( P(j \mid i; \alpha^*) \,\|\, P(j \mid i; \alpha) \big)$, where $\alpha, \alpha^*$ are now transition matrices and $D(\cdot \| \cdot)$ is the relative entropy between discrete distributions. [sent-79, score-0.625]
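For the scalar switching-rate case, the Theorem 1 bound is just $(T-1)$ times a Bernoulli relative entropy; a small illustrative computation (the numbers are hypothetical, not from the paper):

    import numpy as np

    def bernoulli_kl(a_star, a):
        # D(alpha* || alpha) between Bernoulli switching-rates, in nats
        return (a_star * np.log(a_star / a)
                + (1.0 - a_star) * np.log((1.0 - a_star) / (1.0 - a)))

    def regret_bound(T, a_star, a):
        # Theorem 1: L_T(alpha) - L_T(alpha*) <= (T - 1) * D(alpha* || alpha)
        return (T - 1) * bernoulli_kl(a_star, a)

    print(regret_bound(10000, 0.01, 0.02))   # roughly 31 nats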

38 2.1.2 Lower bound The relative losses obviously satisfy $L_T(\alpha) - L_T(\alpha^*) \ge 0$, providing a trivial lower bound. [sent-85, score-0.241]

39 Any non-trivial lower bound on the regret cannot be expressed only in terms of α and $\alpha^*$, but needs to incorporate some additional information about the losses along the observation sequence. [sent-86, score-0.645]

40 We express the lower bound on the regret as a function of the relative quality $\beta^*$ of the minimum $\alpha^*$: $\beta^* = \frac{\alpha^*(1 - \alpha^*)}{T - 1}\, \frac{d^2}{d\alpha^2} L_T(\alpha) \Big|_{\alpha = \alpha^*}$ (6), where the normalization guarantees that $\beta^* \le 1$. [sent-87, score-0.673]

41 The upper and lower bounds agree for all $\alpha, \alpha^* \in (0, 1)$ when $\beta^* \to 1$. [sent-92, score-0.165]

42 Thus there may exist observation sequences on which Fixed-share, using $\alpha \neq \alpha^*$, must incur regret linear in T. [sent-93, score-0.511]

43 Since the cumulative loss $L_t(\alpha)$ of each Fixed-share algorithm running with switching parameter α can be interpreted as a negative log-probability, the posterior distribution over the switching-rate becomes $p_t(\alpha) = P(\alpha \mid y_{t-1}, \ldots, y_1)$. [sent-96, score-0.708]

44 For a sufficiently large m and appropriately chosen values $\{\alpha_j\}$, we expect to be able to always find $\alpha_j \approx \alpha^*$ and suffer only a minimal additional loss due to not being able to represent the hindsight-optimal value exactly. [sent-106, score-0.271]

45 Let $p_{t,j}(i)$ be the distribution over experts defined by the jth Fixed-share algorithm corresponding to $\alpha_j$, and let $p^{top}_t(j)$ be the top-level algorithm producing a weighting over such Fixed-share experts. [sent-107, score-0.37]

46 The top-level algorithm is given by $p^{top}_t(j) = \frac{1}{Z_t}\, p^{top}_{t-1}(j)\, e^{-L(p_{t-1,j},\, t-1)}$ (9), where $p^{top}_1(j) = 1/m$, and the loss per time-step becomes $L^{top}(p^{top}_t, t) = -\log \sum_{j=1}^{m} p^{top}_t(j)\, e^{-L(p_{t,j},\, t)} = -\log \sum_{j=1}^{m} \sum_{i=1}^{n} p^{top}_t(j)\, p_{t,j}(i)\, e^{-L(i,t)}$ (10), as is appropriate for a hierarchical Bayesian method. [sent-108, score-0.936]

47 3 Relative loss and optimal discretization We derive here the optimal choice of the discrete set $\{\alpha_1, \ldots, \alpha_m\}$ on the basis of the upper bound on relative loss. [sent-109, score-0.64] [sent-112, score-0.213]
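A minimal sketch of the hierarchical update in (9)-(10) above, reusing numpy and the mixture_log_loss and fixed_share_update helpers sketched earlier (all names illustrative, not the authors' code):

    def learn_alpha_step(p_top, P_sub, alphas, losses):
        # p_top: weights over the m Fixed-share sub-algorithms (eq. 9)
        # P_sub: m x n matrix, row j = p_{t,j}, expert weights of sub-algorithm j
        # losses: per-expert losses L(i, t) observed this epoch
        m = len(alphas)
        sub_losses = np.array([mixture_log_loss(P_sub[j], losses) for j in range(m)])
        top_loss = mixture_log_loss(p_top, sub_losses)       # eq. (10)
        p_top = p_top * np.exp(-sub_losses)                  # eq. (9) reweighting
        p_top = p_top / p_top.sum()                          # Z_t normalization
        P_sub = np.stack([fixed_share_update(P_sub[j], losses, alphas[j])
                          for j in range(m)])                # each sub-algorithm updates
        return p_top, P_sub, top_loss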

49 Corollary to Theorem 1: Let $L^{top}_T$ be the cumulative loss of the hierarchical Learn-α algorithm using $\{\alpha_1, \ldots, \alpha_m\}$. Then $L^{top}_T - L_T(\alpha^*) \le \log(m) + (T-1) \min_{j=1,\ldots,m} D(\alpha^* \| \alpha_j)$ (11). [sent-114, score-0.5]

50 The hierarchical algorithm involves two competing goals that manifest themselves in the regret: 1) the ability to identify the best Fixed-share expert, which degrades for larger m, and 2) the ability to find $\alpha_j$ whose loss is close to the optimal α for that sequence, which improves for larger m. [sent-121, score-0.399]

51 The additional regret arising from having to consider a number of non-optimal values of the parameter in the search comes from the relative loss bound for the Static-expert algorithm, i.e. [sent-122, score-0.841]

52 the relative loss associated with tracking the best single expert [8, 12]. [sent-124, score-0.65]

53 More precisely, the corollary follows directly from successive application of that single-expert relative loss bound and then our Fixed-share relative loss bound (Theorem 1): $L^{top}_T - L_T(\alpha^*) \le \log(m) + \min_{j=1,\ldots,m} L_T(\alpha_j) - L_T(\alpha^*)$ (12). [sent-126, score-1.087]

54 3.1 Optimal discretization We start by finding the smallest discrete set of switching-rate parameters so that any additional regret due to discretization does not exceed $(T-1)\delta$, for some threshold δ. [sent-133, score-0.73]

55 The resulting discretization is not uniform but is denser towards the edges; the spacing around the edges is $O(\delta)$, and $O(\sqrt{\delta})$ around 1/2. [sent-148, score-0.225]

56 For small values of δ, the logarithm of the number of resulting discretization levels, $\log m(\delta)$, closely approximates $-\frac{1}{2} \log \delta$. [sent-149, score-0.352]

57 We can then optimize the regret bound (11), $-\frac{1}{2} \log \delta + (T-1)\delta$, yielding $\delta^* = 1/(2T)$ and $m(\delta^*) = \sqrt{2T}$. [sent-150, score-0.584]
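The paper's closed-form grid is not reproduced here, but a numerical sketch of the same idea is possible: greedily cover the switching-rates so that every α* is within δ in relative entropy of some grid point, which produces the denser-at-the-edges spacing described above (all names illustrative; bernoulli_kl reusable from earlier):

    import numpy as np

    def kl(p, q):
        # relative entropy D(p || q) between Bernoulli(p) and Bernoulli(q)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def solve(f, lo, hi, iters=100):
        # bisection for f(x) = 0, assuming f(lo) < 0 < f(hi)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return 0.5 * (lo + hi)

    def build_grid(delta, lo=1e-4, hi=0.5):
        # greedy delta-cover of [lo, hi]: every alpha* has some alpha_j with
        # D(alpha* || alpha_j) <= delta; mirrored about 1/2 for the full range
        grid, b = [], lo                    # b: leftmost point not yet covered
        while True:
            if kl(b, hi) <= delta:          # one final point at hi covers the rest
                grid.append(hi)
                break
            a = solve(lambda x: kl(b, x) - delta, b, hi)   # center whose cover starts at b
            grid.append(a)
            if kl(hi, a) <= delta:          # a already covers up to hi
                break
            b = solve(lambda x: kl(x, a) - delta, a, hi)   # right edge of a's cover
        full = grid + [1.0 - g for g in grid]
        return np.unique(np.asarray(full))

    T = 10000
    alphas = build_grid(1.0 / (2 * T))      # delta* = 1/(2T)
    print(len(alphas))                      # on the order of m(delta*) = sqrt(2T)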

58 The uniform discretization would not, however, possess the same regret guarantee, resulting in a higher than necessary loss due to discretization. [sent-152, score-0.861]

59 3.1.1 Optimized regret bound for Learn-α The optimized regret bound for Learn-α(δ*) is thus (approximately) $\frac{1}{2} \log T + c$, which is comparable to the analysis of universal coding for word-length T [11]. [sent-155, score-1.134]

60 The optimal discretization for learning the parameter is not affected by n, the number of original experts. [sent-156, score-0.197]

61 Unlike regret bounds for Fixed-share, the value of the bound does not depend on the observation sequence. [sent-157, score-0.646]

62 And notably, in comparison to the lower bound on Fixed-share’s performance, Learn-α’s regret is at most logarithmic in T . [sent-158, score-0.536]

63 4 Application to wireless networks We applied the Learn-α algorithm to an open problem in computer networks: managing the tradeoff between energy consumption and performance in wireless nodes of the IEEE 802.11 standard. [sent-159, score-0.493]

64 Since a node cannot receive packets while asleep, yet maintaining the awake state drains energy, the existing standard uses a fixed polling time at which a node should wake from the sleep state to poll its neighbors for buffered packets. [sent-161, score-0.493]

65 Previous work includes Krashinsky and Balakrishnan’s [10] Bounded Slowdown algorithm which uses an adaptive control loop to change polling time based on network conditions. [sent-166, score-0.251]

66 We instantiate the experts as deterministic algorithms assuming constant polling times. [sent-169, score-0.378]

67 Thus we use n experts, each corresponding to a different but fixed polling time in milliseconds (ms): $T_i$, $i \in \{1, \ldots, n\}$. [sent-170, score-0.185]

68 The experts form a discretization over the range of possible polling times. [sent-173, score-0.514]

69 We then apply the Learn-α algorithm exactly as in our previous exposition, using the discretization defined by $\delta^*$, and thus running $m(\delta^*)$ sub-algorithms, each running Fixed-share with a different $\alpha_j$. [sent-174, score-0.255]
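Putting the pieces together, a hypothetical driver loop over wake epochs, reusing the sketches above (loss_stream stands in for the application-specific per-epoch loss vectors):

    T_est = 10000                                  # prior estimate of T
    alphas = build_grid(1.0 / (2 * T_est))         # the m(delta*) switching-rates
    m, n = len(alphas), 10
    p_top = np.full(m, 1.0 / m)                    # uniform prior over sub-algorithms
    P_sub = np.full((m, n), 1.0 / n)               # each sub-algorithm starts uniform
    total_loss = 0.0
    for losses in loss_stream:                     # length-n loss vector per wake epoch
        p_top, P_sub, step_loss = learn_alpha_step(p_top, P_sub, alphas, losses)
        total_loss += step_loss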

70 So our subscript t here signifies only wake times, not every time epoch at which bytes might arrive. [sent-176, score-0.231]

71 We define the loss function, L, to reflect the tradeoff inherent in the conflicting goals of minimizing both the energy usage of the node, and the network latency it introduces by sleeping. [sent-177, score-0.467]

72 We propose one such loss function, one of many possible functions proportional to this tradeoff. [sent-178, score-0.271]

73 We define the loss per expert i as: $\mathrm{Loss}(i, t) = \gamma\, \frac{I_t\, T_i^2}{2\, T_t} + \frac{1}{T_i}$ (17), where $I_t$ is the observation the node receives, of how many bytes arrived upon awakening at time t, and $T_t$ is the length of time that the node just slept. [sent-179, score-0.993]

74 The first term models the average latency introduced into the network by buffering those bytes, and scales $I_t$ to the number of bytes that would have arrived had the node slept for time $T_i$ instead of $T_t$, under the assumption that the bytes arrived at a uniform rate. [sent-180, score-0.656]

75 The second term models the energy consumption of the node, based on the design that the node wakes only after an interval $T_i$ to poll for buffered bytes, and the fact that it consumes less energy when asleep than awake. [sent-181, score-0.386]

76 The parameter γ > 0 allows for scaling between the units of information and time, and for encoding the user's preferred tradeoff between energy and latency. [sent-183, score-0.128]
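A minimal sketch of this loss as we read it from (17); the variable names and example values are illustrative only:

    import numpy as np

    def polling_loss(T_i, I_t, T_t, gamma):
        # eq. (17): gamma * I_t * T_i^2 / (2 * T_t)  +  1 / T_i
        latency = gamma * I_t * T_i**2 / (2.0 * T_t)  # scaled average buffering delay
        energy = 1.0 / T_i                            # more frequent polling costs energy
        return latency + energy

    # e.g. experts polling every 100ms..1000ms, 5000 bytes found after a 500ms sleep
    experts = np.arange(100, 1100, 100)
    losses = np.array([polling_loss(Ti, 5000.0, 500.0, 1e-7) for Ti in experts])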

77-80 [Figure 1 plot residue; only the curve labels were legible: Learn-alpha(delta*), Fixed-share(alpha), Static-expert, IEEE 802.11 Protocol, best expert (100ms), arbitrary expert (500ms).]

81 Figure 1: a) Cumulative loss of Fixed-share(α) as a function of α, compared to the cumulative loss on the same trace of the 802.11 protocol, the Static-expert algorithm, the best expert, and Learn-α(δ*), when n = 10, and b) when n = 5. [sent-203, score-0.927]

82 c) Cumulative loss of Learn-α(δ), as a function of 1/δ, when n = 10, and d) when n = 5. [sent-207, score-0.271]

83 Multiple overlapping connections, passing through a collection node over several days, were recorded by start and end times, and number of bytes transferred. [sent-212, score-0.253]

84 Per connection we smoothed the total number of bytes uniformly over 10ms intervals spanning its duration. [sent-213, score-0.278]

85 The scaling parameter γ, of order $10^{-7}$, was calibrated to attain polling times within the range of the existing protocol. [sent-215, score-0.217]

86 Figures 1a) and 1b) compare the cumulative loss of the various algorithms on a 4-hour trace, with observation epochs every 10ms. [sent-216, score-0.604]

87 T, the time horizon parameter in the loss bounds, is just the number of observation epochs. [sent-220, score-0.353]

88 In this application, the number of training epochs need not match the number of observation epochs, since the application involves sleeping during many observation epochs, and learning is only done upon awakening. [sent-221, score-0.218]

89 In these experiments the three learning algorithms are compared with each algorithm using n experts spanning the range of 1000ms at regularly spaced intervals of 100ms; to obtain a prior estimate of T, we assume a mean sleep interval of 550ms, the mean of the experts. [sent-222, score-0.354]

90 The Static-expert algorithm achieved lower cumulative loss than the best expert, since it can attain the optimal smoothed value over the desired range of polling times, whereas the expert values just form a discretization. [sent-223, score-1.093]

91 Figure 1c) and d) show the cumulative loss of Learn-α as a function of 1/δ. [sent-228, score-0.44]

92 We see that choosing $\delta = \frac{1}{2T}$ matches the point in the curve beyond which one cannot significantly reduce cumulative loss by decreasing δ. [sent-229, score-0.44]

93 Our results also verify that the optimal δ is not significantly affected by the number of experts n. [sent-231, score-0.198]

94 5 Conclusion We proved upper and lower bounds on the regret for a class of online learning algorithms, applicable to any sequence of observations. [sent-232, score-0.673]

95 The bounds extend to richer models of non-stationary sequences, allowing the switching dynamics to be governed by an arbitrary transition matrix. [sent-233, score-0.231]

96 We derived the regret-optimal discretization (including the overall resolution) for learning the switching-rate parameter in a simple switching dynamics, yielding an algorithm with stronger guarantees than previous algorithms. [sent-234, score-0.325]

97 We exemplified the approach in the context of energy management in wireless networks. [sent-235, score-0.291]

98 In future work, we hope to extend the online estimation of α and the optimal discretization to learning a full transition matrix. [sent-236, score-0.296]

99 Sequential prediction of individual sequences under general loss functions. [sent-276, score-0.342]

100 Minimizing energy for wireless web access with bounded slowdown. [sent-302, score-0.258]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lt', 0.418), ('regret', 0.402), ('loss', 0.271), ('expert', 0.264), ('alpha', 0.185), ('bytes', 0.185), ('polling', 0.185), ('cumulative', 0.169), ('experts', 0.165), ('discretization', 0.164), ('alg', 0.161), ('wireless', 0.145), ('ptop', 0.139), ('pt', 0.135), ('fixedshare', 0.116), ('log', 0.094), ('ltop', 0.093), ('bound', 0.088), ('observation', 0.082), ('yt', 0.081), ('relative', 0.08), ('energy', 0.077), ('bounds', 0.074), ('switching', 0.071), ('online', 0.069), ('node', 0.068), ('universal', 0.06), ('guarantees', 0.057), ('arrived', 0.055), ('epochs', 0.054), ('latency', 0.051), ('learner', 0.049), ('buffered', 0.046), ('krashinsky', 0.046), ('poll', 0.046), ('wake', 0.046), ('lower', 0.046), ('upper', 0.045), ('management', 0.044), ('prediction', 0.044), ('home', 0.04), ('portfolio', 0.04), ('asleep', 0.04), ('lan', 0.04), ('intervals', 0.04), ('zt', 0.039), ('protocol', 0.037), ('foundations', 0.037), ('tt', 0.037), ('spacing', 0.037), ('sequence', 0.037), ('theorem', 0.036), ('access', 0.036), ('learn', 0.036), ('scenario', 0.036), ('best', 0.035), ('tradeoff', 0.035), ('bandit', 0.034), ('possesses', 0.034), ('sleep', 0.034), ('min', 0.033), ('network', 0.033), ('optimal', 0.033), ('algorithm', 0.033), ('predictions', 0.032), ('consumption', 0.032), ('iee', 0.032), ('entropies', 0.032), ('attain', 0.032), ('trace', 0.031), ('levels', 0.031), ('priori', 0.031), ('economic', 0.031), ('transition', 0.03), ('running', 0.029), ('allowing', 0.029), ('algorithms', 0.028), ('spanning', 0.028), ('updating', 0.028), ('symposium', 0.028), ('proof', 0.027), ('dynamics', 0.027), ('sequences', 0.027), ('hierarchical', 0.027), ('deriving', 0.027), ('losses', 0.027), ('vanishes', 0.027), ('playing', 0.026), ('game', 0.026), ('performance', 0.026), ('smoothed', 0.025), ('uc', 0.025), ('entropy', 0.025), ('context', 0.025), ('games', 0.025), ('guarantee', 0.024), ('uniform', 0.024), ('dependence', 0.024), ('laboratory', 0.024), ('freund', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

2 0.27138132 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo

Abstract: The so-called “experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A new experts algorithm is presented and analyzed in the context of repeated games. It is shown that asymptotically, under certain conditions, it performs as well as the best available expert. This algorithm is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. 1

3 0.14903213 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

4 0.12085062 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

5 0.098805115 148 nips-2003-Online Passive-Aggressive Algorithms

Author: Shai Shalev-shwartz, Koby Crammer, Ofer Dekel, Yoram Singer

Abstract: We present a unified view for online classification, regression, and uniclass problems. This view leads to a single algorithmic framework for the three problems. We prove worst case loss bounds for various algorithms for both the realizable case and the non-realizable case. A conversion of our main online algorithm to the setting of batch learning is also discussed. The end result is new algorithms and accompanying loss bounds for the hinge-loss. 1

6 0.074161045 102 nips-2003-Large Scale Online Learning

7 0.065205641 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

8 0.064186178 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

9 0.060298175 32 nips-2003-Approximate Expectation Maximization

10 0.06000163 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

11 0.05964965 55 nips-2003-Distributed Optimization in Adaptive Networks

12 0.058720984 44 nips-2003-Can We Learn to Beat the Best Stock

13 0.057500731 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

14 0.056809809 94 nips-2003-Information Maximization in Noisy Channels : A Variational Approach

15 0.056412239 87 nips-2003-Identifying Structure across Pre-partitioned Data

16 0.055272818 145 nips-2003-Online Classification on a Budget

17 0.053911135 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

18 0.052937627 1 nips-2003-1-norm Support Vector Machines

19 0.051858753 176 nips-2003-Sequential Bayesian Kernel Regression

20 0.051416129 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.184), (1, 0.056), (2, -0.06), (3, -0.089), (4, 0.143), (5, 0.0), (6, 0.014), (7, 0.092), (8, -0.132), (9, 0.033), (10, -0.011), (11, 0.12), (12, -0.077), (13, 0.172), (14, -0.03), (15, -0.109), (16, -0.071), (17, 0.052), (18, 0.176), (19, -0.028), (20, -0.08), (21, -0.092), (22, 0.063), (23, -0.09), (24, 0.003), (25, -0.164), (26, -0.123), (27, -0.103), (28, -0.092), (29, 0.008), (30, 0.015), (31, 0.02), (32, 0.062), (33, -0.057), (34, -0.063), (35, 0.05), (36, 0.036), (37, 0.042), (38, -0.1), (39, 0.03), (40, 0.025), (41, -0.02), (42, -0.04), (43, 0.01), (44, 0.042), (45, -0.098), (46, -0.048), (47, 0.072), (48, 0.012), (49, 0.243)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95381606 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

2 0.75404203 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

Author: Daniela Pucci de Farias, Nimrod Megiddo

Abstract: The so-called “experts algorithms” constitute a methodology for choosing actions repeatedly, when the rewards depend both on the choice of action and on the unknown current state of the environment. An experts algorithm has access to a set of strategies (“experts”), each of which may recommend which action to choose. The algorithm learns how to combine the recommendations of individual experts so that, in the long run, for any fixed sequence of states of the environment, it does as well as the best expert would have done relative to the same sequence. This methodology may not be suitable for situations where the evolution of states of the environment depends on past chosen actions, as is usually the case, for example, in a repeated non-zero-sum game. A new experts algorithm is presented and analyzed in the context of repeated games. It is shown that asymptotically, under certain conditions, it performs as well as the best available expert. This algorithm is quite different from previously proposed experts algorithms. It represents a shift from the paradigms of regret minimization and myopic optimization to consideration of the long-term effect of a player’s actions on the opponent’s actions or the environment. The importance of this shift is demonstrated by the fact that this algorithm is capable of inducing cooperation in the repeated Prisoner’s Dilemma game, whereas previous experts algorithms converge to the suboptimal non-cooperative play. 1

3 0.50831097 44 nips-2003-Can We Learn to Beat the Best Stock

Author: Allan Borodin, Ran El-Yaniv, Vincent Gogan

Abstract: A novel algorithm for actively trading stocks is presented. While traditional universal algorithms (and technical trading heuristics) attempt to predict winners or trends, our approach relies on predictable statistical relations between all pairs of stocks in the market. Our empirical results on historical markets provide strong evidence that this type of technical trading can “beat the market” and moreover, can beat the best stock in the market. In doing so we utilize a new idea for smoothing critical parameters in the context of expert learning. 1 Introduction: The Portfolio Selection Problem The portfolio selection (PS) problem is a challenging problem for machine learning, online algorithms and, of course, computational finance. As is well known (e.g. see Lugosi [1]) sequence prediction under the log loss measure can be viewed as a special case of portfolio selection, and perhaps more surprisingly, from a certain worst case minimax criterion, portfolio selection is not essentially any harder (than prediction) as shown in [2] (see also [1], Thm. 20 & 21). But there seems to be a qualitative difference between the practical utility of “universal” sequence prediction and universal portfolio selection. Simply stated, universal sequence prediction algorithms under various probabilistic and worst-case models work very well in practice whereas the known universal portfolio selection algorithms do not seem to provide any substantial benefit over a naive investment strategy (see Sec. 4). A major pragmatic question is whether or not a computer program can consistently outperform the market. A closer inspection of the interesting ideas developed in information theory and online learning suggests that a promising approach is to exploit the natural volatility in the market and in particular to benefit from simple and rather persistent statistical relations between stocks rather than to try to predict stock prices or “winners”. We present a non-universal portfolio selection algorithm1 , which does not try to predict winners. The motivation behind our algorithm is the rationale behind constant rebalancing algorithms and the worst case study of universal trading introduced by Cover [3]. Not only does our proposed algorithm substantially “beat the market” on historical markets, it also beats the best stock. So why are we presenting this algorithm and not just simply making money? There are, of course some caveats and obstacles to utilizing the algorithm. But for large investors the possibility of a goose laying silver (if not golden) eggs is not impossible. 1 Any PS algorithm can be modified to be universal by investing any fixed fraction of the initial wealth in a universal algorithm. Assume a market with m stocks. Let vt = (vt (1), . . . , vt (m)) be the closing prices of the m stocks for the tth day, where vt (j) is the price of the jth stock. It is convenient to work with relative prices xt (j) = vt (j)/vt−1 (j) so that an investment of $d in the jth stock just before the tth period yields dxt (j) dollars. We let xt = (xt (1), . . . , xt (m)) denote the market vector of relative prices corresponding to the tth day. A portfolio b is an allocation of wealth in the stocks, specified by the proportions b = (b(1), . . . , b(m)) of current dollar wealth invested in each of the stocks, where b(j) ≥ 0 and j b(j) = 1. The daily return of a portfolio b w.r.t. a market vector x is b · x = j b(j)x(j) and the (compound) total return, retX (b1 , . . . , bn ), of a sequence of portfolios b1 , . . . 
, bn w.r.t. a market sequence n X = x1 , . . . , xn is t=1 bt · xt . A portfolio selection algorithm is any deterministic or randomized rule for specifying a sequence of portfolios. The simplest strategy is to “buy-and-hold” stocks using some portfolio b. We denote this strategy by BAHb and let U-BAH denote the uniform buy-and-hold when b = (1/m, . . . , 1/m). We say that a portfolio selection algorithm “beats the market” when it outperforms U-BAH on a given market sequence although in practice “the market” can be represented by some non-uniform BAH (e.g. DJIA). Buy-and-hold strategies rely on the tendency of successful markets to grow. Much of modern portfolio theory focuses on how to choose a good b for the buy-and-hold strategy. The seminal ideas of Markowitz in [4] yield an algorithmic procedure for choosing the weights of the portfolio b so as to minimize the variance for any feasible expected return. This variance minimization is possible by placing appropriate larger weights on subsets of anti-correlated stocks, an idea which we shall also utilize. We denote the optimal in hindsight buy-and-hold strategy (i.e. invest only in the best stock) by BAH∗ . An alternative approach to the static buy-and-hold is to dynamically change the portfolio during the trading period. This approach is often called “active trading”. One example of active trading is constant rebalancing; namely, fix a portfolio b and (re)invest your dollars each day according to b. We denote this constant rebalancing strategy by CBALb and let CBAL∗ denote the optimal (in hindsight) CBAL. A constant rebalancing strategy can often take advantage of market fluctuations to achieve a return significantly greater than that of BAH∗ . CBAL∗ is always at least as good as the best stock BAH∗ and in some real market sequences a constant rebalancing strategy will take advantage of market fluctuations and significantly outperform the best stock (see Table 1). For now, consider Cover and Gluss’ [5] classic (but contrived) example of a market consisting of cash and one stock and 1 1 the market sequence of price relatives 1/2 , 1 , 1/2 , 1 , . . . Now consider the CBALb 2 2 3 1 1 1 11 with b = ( 2 , 2 ). On each odd day the daily return of CBALb is 2 1 + 2 2 = 4 and on each even day, it is 3/2. The total return over n days is therefore (9/8)n/2 , illustrating how a constant rebalancing strategy can yield exponential returns in a “no-growth market”. Under the assumption that the daily market vectors are observations of identically and independently distributed (i.i.d) random variables, it is shown in [6] that CBAL∗ performs at least as good (in the sense of expected total return) as the best online portfolio selection algorithm. However, many studies (see e.g. [7]) argue that stock price sequences do have long term memory and are not i.i.d. A non-traditional objective (in computational finance) is to develop online trading strategies that are in some sense always guaranteed to perform well. Within a line of research pioneered by Cover [5, 3, 2] one attempts to design portfolio selection algorithms that can provably do well (in terms of their total return) with respect to some online or offline benchmark algorithms. Two natural online benchmark algorithms are the uniform buy and hold U-BAH, and the uniform constant rebalancing strategy U-CBAL, which is CBALb with 1 1 b = ( m , . . . , m ). A natural offline benchmark is BAH∗ and a more challenging offline benchmark is CBAL∗ . 
Cover and Ordentlich’s Universal Portfolios algorithm [3, 2], denoted here by UNIVERSAL, was proven to be universal against CBAL∗ , in the sense that for every market sequence X of m stocks over n days, it guarantees a sub-exponential (indeed polynomial) ratio in n, retX (CBAL∗ )/retX (UNIVERSAL) ≤ O n m−1 2 (1) From a theoretical perspective this is surprising as the ratio is a polynomial in n (for fixed m) whereas CBAL∗ is capable of exponential returns. From a practical perspective, while the m−1 ratio n 2 is not very useful, the motivation that underlies the potential of CBAL algorithms is useful! We follow this motivation and develop a new algorithm which we call ANTICOR. By attempting to systematically follow the constant rebalancing philosophy, ANTICOR is capable of some extraordinary performance in the absence of transaction costs, or even with very small transaction costs. 2 Trying to Learn the Winners The most direct approach to expert learning and portfolio selection is a “(reward based) weighted average prediction” algorithm which adaptively computes a weighted average of experts by gradually increasing (by some multiplicative or additive update rule) the relative weights of the more successful experts. For example, in the context of the PS problem consider the “exponentiated gradient” EG(η) algorithm proposed by Helmbold et al. [8]. The EG(η) algorithm computes the next portfolio to be bt+1 (j) = bt (j) exp {ηxt (j)/(bt · xt )} m j=1 bt (j) exp {ηxt (j)/(bt · xt )} where η is a “learning rate” parameter. EG was designed to greedily choose the best portfolio for yesterday’s market xt while at the same time paying a penalty from moving far from yesterday’s portfolio. For a universal bound on EG, Helmbold et al. set η = 2xmin 2(log m)/n where xmin is a lower bound on any price relative.2 It is easy to see that as n increases, η decreases to 0 so that we can think of η as being very small in order to achieve universality. When η = 0, the algorithm EG(η) degenerates to the uniform CBAL which is not a universal algorithm. It is also the case that if each day the price relatives for all stocks were identical, then EG (as well as other PS algorithms) will converge to the uniform CBAL. Combining a small learning rate with a “reasonably balanced” market we expect the performance of EG to be similar to that of the uniform CBAL and this is confirmed by our experiments (see Table1).3 Cover’s universal algorithms adaptively learn each day’s portfolio by increasing the weights of successful CBALs. The update rule for these universal algorithms is bt+1 = b · rett (CBALb )dµ(b) , rett (CBALb )dµ(b) where µ(·) is some prior distribution over portfolios. Thus, the weight of a possible portfolio is proportional to its total return rett (b) thus far times its prior. The particular universal algorithm we consider in our experiments uses the Dirichlet prior (with parameters 1 1 ( 2 , . . . , 2 )) [2]. Within a constant factor, this algorithm attains the optimal ratio (1) with respect to CBAL∗ .4 The algorithm is equivalent to a particular static distribution over the 2 Helmbold et al. show how to eliminate the need to know xmin and n. While EG can be made universal, its performance ratio is only sub-exponential (and not polynomial) in n. 3 Following Helmbold et al. we fix η = 0.01 in our experiments. 4 Experimentally (on our datasets) there is a negligible difference between the uniform universal algorithm in [3] and the above Dirichlet universal algorithm. class of all CBALs. 
This equivalence helps to demystify the universality result and also shows that the algorithm can never outperform CBAL∗ . A different type of “winner learning” algorithm can be obtained from any sequence prediction strategy. For each stock, a (soft) sequence prediction algorithm provides a probability p(j) that the next symbol will be j ∈ {1, . . . , m}. We view this as a prediction that stock j will have the best price relative for the next day and set bt+1 (j) = pj . We consider predictions made using the prediction component of the well-known Lempel-Ziv (LZ) lossless compression algorithm [9]. This prediction component is nicely described in Langdon [10] and in Feder [11]. As a prediction algorithm, LZ is provably powerful in various senses. First it can be shown that it is asymptotically optimal with respect to any stationary and ergodic finite order Markov source (Rissanen [12]). Moreover, Feder shows that LZ is also universal in a worst case sense with respect to the (offline) benchmark class of all finite state prediction machines. To summarize, the common approach to devising PS algorithms has been to attempt and learn winners using winner learning schemes. 3 The Anticor Algorithm We propose a different approach, motivated by the CBAL “philosophy”. How can we interpret the success of the uniform CBAL on the Cover and Gluss example of Sec. 1? Clearly, the uniform CBAL here is taking advantage of price fluctuation by constantly transferring wealth from the high performing stock to the anti-correlated low performing stock. Even in a less contrived market, we should be able to take advantage when a stock is currently outperforming other stocks especially if this strong performance is anti-correlated with the performance of these other stocks. Our ANTICORw algorithm considers a short market history (consisting of two consecutive “windows”, each of w trading days) so as to model statistical relations between each pair of stocks. Let LX1 = log(xt−2w+1 ), . . . , log(xt−w )T and LX2 = log(xt−w+1 ), . . . , log(xt )T , where log(xk ) denotes (log(xk (1)), . . . , log(xk (m))). Thus, LX1 and LX2 are the two vector sequences (equivalently, two w × m matrices) constructed by taking the logarithm over the market subsequences corresponding to the time windows [t − 2w + 1, t − w] and [t − w + 1, t], respectively. We denote the jth column of LXk by LXk (j). Let µk = (µk (1), . . . , µk (m)), be the vectors of averages of columns of LXk (that is, µk (j) = E{LXk (j)}). Similarly, let σk , be the vector of standard deviations of columns of LXk . The cross-correlation matrix (and its normalization) between column vectors in LX1 and LX2 are defined as: Mcov (i, j) = (LX1 (i) − µ1 (i))T (LX2 (j) − µ2 (j)); Mcov (i,j) σ1 (i)σ2 (j) σ1 (i), σ2 (j) = 0; 0 otherwise. Mcor (i, j) ∈ [−1, 1] measures the correlation between log-relative prices of stock i over the first window and stock j over the second window. For each pair of stocks i and j we compute claimi→j , the extent to which we want to shift our investment from stock i to stock j. Namely, there is such a claim iff µ2 (i) > µ2 (j) and Mcor (i, j) > 0 in which case claimi→j = Mcor (i, j) + A(i) + A(j) where A(h) = |Mcor (h, h)| if Mcor (h, h) < 0, else 0. Following our interpretation for the success of a CBAL, Mcor (i, j) > 0 is used to predict that stocks i and j will be correlated in consecutive windows (i.e. 
the current window and the next window based on the evidence for the last two windows) and Mcor (h, h) < 0 predicts that stock h will be anti-correlated with itself over consec˜ utive windows. Finally, bt+1 (i) = bt (i) + j=i [transferj→i − transferi→j ] where ˜ t (i) · claimi→j / ˜ transferi→j = b j claimi→j and bt is the resulting portfolio just after market closing (on day t). Mcor (i, j) SP500: Anticor vs. window size NYSE: Anticor vs. window size w BAH(Anticor ) w Anticor 12 8 w Best Stock Market Return 10 Total Return Total Return (log−scale) 10 Anticorw 5 10 BAH(Anticorw) Anticorw Best Stock Market Best Stock 8 Anticorw 6 4 2 10 2 1 10 Best Stock 1 0 10 2 5 10 15 20 25 5 30 10 15 20 25 30 Window Size (w) Window Size (w) Figure 1: ANTICORw ’s total return (per $1 investment) vs. window size 2 ≤ w ≤ 30 for NYSE (left) and SP500 (right). Our ANTICORw algorithm has one critical parameter, the window size w. In Figure 1 we depict the total return of ANTICORw on two historical datasets as a function of the window size w = 2, . . . , 30. As we might expect, the performance of ANTICORw depends significantly on the window size. However, for all w, ANTICORw beats the uniform market and, moreover, it beats the best stock using most window sizes. Of course, in online trading we cannot choose w in hindsight. Viewing the ANTICORw algorithms as experts, we can try to learn the best expert. But the windows, like individual stocks, induce a rather volatile set of experts and standard expert combination algorithms [13] tend to fail. Alternatively, we can adaptively learn and invest in some weighted average of all ANTICORw algorithms with w less than some maximum W . The simplest case is a uniform investment on all the windows; that is, a uniform buy-and-hold investment on the algorithms ANTICORw , w ∈ [2, W ], denoted by BAHW (ANTICOR). Figure 2 (left) graphs the total return of BAHW (ANTICOR) as a function of W for all values of 2 ≤ W ≤ 50 with respect to the NYSE dataset (see details below). Similar graphs for the other datasets we consider appear qualitatively the same and the choice W = 30 is clearly not optimal. However, for all W ≥ 3, BAHW (ANTICOR) beats the best stock in all our experiments. DJIA: Dec 14, 2002 − Jan 14, 2003 NYSE: Total Return vs. Max Window 10 1.1 BAHW(Anticor) 10 Best Stock MArket 4 10 3 10 Best Stock 2.8 Anticor2 2.2 2.6 1 BAHW(Anticor) 5 Total Return Total Return (log−scale) 10 6 Anticor1 Stocks 7 2 0.9 2.4 1.8 0.8 2.2 1.6 0.7 2 1.4 0.6 2 10 1.8 1.2 0.5 1 10 1.6 1 0.4 0 10 5 10 15 20 25 30 35 Maximal Window size (W) 40 45 50 5 10 15 20 25 Days 5 10 15 20 25 Days 5 10 15 20 25 Days Figure 2: Left: BAHW (ANTICOR)’s total return (per $1 investment) as a function of the maximal window W . Right: Cumulative returns for last month of the DJIA dataset: stocks (left panel); ANTICORw algorithms trading the stocks (denoted ANTICOR1 , middle panel); ANTICORw algorithms trading the ANTICOR algorithms (right panel). Since we now consider the various algorithms as stocks (whose prices are determined by the cumulative returns of the algorithms), we are back to our original portfolio selection problem and if the ANTICOR algorithm performs well on stocks it may also perform well on algorithms. We thus consider active investment in the various ANTICORw algorithms using ANTICOR. We again consider all windows w ≤ W . Of course, we can continue to compound the algorithm any number of times. Here we compound twice and then use a buy-and-hold investment. 
The resulting algorithm is denoted BAHW (ANTICOR(ANTICOR)). One impact of this compounding, depicted in Figure 2 (right), is to smooth out the anti-correlations exhibited in the stocks. It is evident that after compounding twice the returns become almost completely correlated thus diminishing the possibility that additional compounding will substantially help.5 This idea for eliminating critical parameters may be applicable in other learning applications. The challenge is to understand the conditions and applications in which the process of compounding algorithms will have this smoothing effect! 4 Experimental Results We present an experimental study of the the ANTICOR algorithm and the three online learning algorithms described in Sec. 2. We focus on BAH30 (ANTICOR), abbreviated by ANTI1 and BAH30 (ANTICOR(ANTICOR)), abbreviated by ANTI2 . Four historical datasets are used. The first NYSE dataset, is the one used in [3, 2, 8, 14]. This dataset contains 5651 daily prices for 36 stocks in the New York Stock Exchange (NYSE) for the twenty two year period July 3rd , 1962 to Dec 31st , 1984. The second TSE dataset consists of 88 stocks from the Toronto Stock Exchange (TSE), for the five year period Jan 4th , 1994 to Dec 31st , 1998. The third dataset consists of the 25 stocks from SP500 which (as of Apr. 2003) had the largest market capitalization. This set spans 1276 trading days for the period Jan 2nd , 1998 to Jan 31st , 2003. The fourth dataset consists of the thirty stocks composing the Dow Jones Industrial Average (DJIA) for the two year period (507 days) from Jan 14th , 2001 to Jan 14th , 2003.6 These four datasets are quite different in nature (the market returns for these datasets appear in the first row of Table 1). While every stock in the NYSE increased in value, 32 of the 88 stocks in the TSE lost money, 7 of the 25 stocks in the SP500 lost money and 25 of the 30 stocks in the “negative market” DJIA lost money. All these sets include only highly liquid stocks with huge market capitalizations. In order to maximize the utility of these datasets and yet present rather different markets, we also ran each market in reverse. This is simply done by reversing the order and inverting the relative prices. The reverse datasets are denoted by a ‘-1’ superscript. Some of the reverse markets are particularly challenging. For example, all of the NYSE−1 stocks are going down. Note that the forward and reverse markets (i.e. U-BAH) for the TSE are both increasing but that the TSE−1 is also a challenging market since so many stocks (56 of 88) are declining. Table 1 reports on the total returns of the various algorithms for all eight datasets. We see that prediction algorithms such as LZ can do quite well but the more aggressive ANTI1 and 2 ANTI have excellent and sometimes fantastic returns. Note that these active strategies beat the best stock and even CBAL∗ in all markets with the exception of the TSE−1 in which they still significantly outperform the market. The reader may well be distrustful of what appears to be such unbelievable returns for ANTI1 and ANTI2 especially when applied to the NYSE dataset. 
However, recall that the NYSE dataset consists of n = 5651 trading days and the y such that y n = the total NYSE return is approximately 1.0029511 for ANTI1 (respectively, 1.0074539 for ANTI2 ); that is, the average daily increase is less than .3% 5 This smoothing effect also allows for the use of simple prediction algorithms such as “expert advice” algorithms [13], which can now better predict a good window size. We have not explored this direction. 6 The four datasets, including their sources and individual stock compositions can be downloaded from http://www.cs.technion.ac.il/∼rani/portfolios. (respectively, .75%). Thus a transaction cost of 1% can present a significant challenge to such active trading strategies (see also Sec. 5). We observe that UNIVERSAL and EG have no substantial advantage over U-CBAL. Some previous expositions of these algorithms highlighted particular combinations of stocks where the returns significantly outperformed UNIVERSAL and the best stock. But the same can be said for U-CBAL. Algorithm M ARKET (U-BAH) B EST S TOCK CBAL∗ U-CBAL ANTI1 ANTI2 LZ EG UNIVERSAL NYSE 14.49 54.14 250.59 27.07 17,059,811.56 238,820,058.10 79.78 27.08 26.99 TSE 1.61 6.27 6.77 1.59 26.77 39.07 1.32 1.59 1.59 SP500 1.34 3.77 4.06 1.64 5.56 5.88 1.67 1.64 1.62 DJIA 0.76 1.18 1.23 0.81 1.59 2.28 0.89 0.81 0.80 NYSE−1 0.11 0.32 2.86 0.22 246.22 1383.78 5.41 0.22 0.22 TSE−1 1.67 37.64 58.61 1.18 7.12 7.27 4.80 1.19 1.19 SP500−1 0.87 1.65 1.91 1.09 6.61 9.69 1.20 1.09 1.07 DJIA−1 1.43 2.77 2.97 1.53 3.67 4.60 1.83 1.53 1.53 Table 1: Monetary returns in dollars (per $1 investment) of various algorithms for four different datasets and their reversed versions. The winner and runner-up for each market appear in boldface. All figures are truncated to two decimals. 5 Concluding Remarks When handling a portfolio of m stocks our algorithm may perform up to m transactions per day. A major concern is therefore the commissions it will incur. Within the proportional commission model (see e.g. [14] and [15], Sec. 14.5.4) there exists a fraction γ ∈ (0, 1) such that an investor pays at a rate of γ/2 for each buy and for each sell. Therefore, the return of a sequence b1 , . . . , bn of portfolios with re˜ spect to a market sequence x1 , . . . , xn is t bt · xt (1 − j γ |bt (j) − bt (j)|) , where 2 1 ˜ (bt (1)xt (1), . . . , bt (m)xt (m)). Our investment algorithm in its simplest form bt = bt ·xt can tolerate very small proportional commission rates and still beat the best stock.7 We note that Blum and Kalai [14] showed that the performance guarantee of UNIVERSAL still holds (and gracefully degrades) in the case of proportional commissions. Many current online brokers only charge a small per share commission rate. A related problem that one must face when actually trading is the difference between bid and ask prices. These bid-ask spreads (and the availability of stocks for both buying and selling) are typically functions of stock liquidity and are typically smaller for large market capitalization stocks. We consider here only very large market cap stocks. As a final caveat, we note that we assume that any one portfolio selection algorithm has no impact on the market! But just like any goose laying golden eggs, widespread use will soon lead to the end of the goose; that is, the market will quickly react. Any report of abnormal returns using historical markets should be suspected of “data snooping”. 
In particular, when a dataset is excessively mined by testing many strategies there is a substantial chance that one of the strategies will be successful by simple overfitting. Another data snooping hazard is stock selection. For example, the 36 stocks selected for the NYSE dataset were all known to have survived for 22 years. Our ANTICOR algorithms were fully developed using only the NYSE and TSE datasets. The DJIA and SP500 sets were obtained (from public domain sources) after the algorithms were fixed. Finally, our algorithm has one parameter (the maximal window size W ). Our experiments indicate that the algorithm’s performance is robust with respect to W (see Figure 2). 7 For example, with γ = 0.1% we can typically beat the best stock. These results will be presented in the full paper. A number of well-respected works report on statistically robust “abnormal” returns for simple “technical analysis” heuristics, which slightly beat the market. For example, the landmark study of Brock et al. [16] apply 26 simple trading heuristics to the DJIA index from 1897 to 1986 and provide strong support for technical analysis heuristics. While consistently beating the market is considered a great (if not impossible) challenge, our approach to portfolio selection indicates that beating the best stock is an achievable goal. What is missing at this point of time is an analytical model which better explains why our active trading strategies are so successful. In this regard, we are investigating various “statistical adversary” models along the lines suggested by [17, 18]. Namely, we would like to show that an algorithm performs well (relative to some benchmark) for any market sequence that satisfies certain constraints on its empirical statistics. References [1] G. Lugosi. Lectures on prediction URL:http://www.econ.upf.es/∼lugosi/ihp.ps, 2001. of individual sequences. [2] T.M. Cover and E. Ordentlich. Universal portfolios with side information. IEEE Transactions on Information Theory, 42(2):348–363, 1996. [3] T.M. Cover. Universal portfolios. Mathematical Finance, 1:1–29, 1991. [4] H. Markowitz. Portfolio Selection: Efficient Diversification of Investments. John Wiley and Sons, 1959. [5] T.M. Cover and D.H. Gluss. Empirical bayes stock market portfolios. Advances in Applied Mathematics, 7:170–181, 1986. [6] T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., 1991. [7] A. Lo and C. MacKinlay. A Non-Random Walk Down Wall Street. Princeton University Press, 1999. [8] D.P. Helmbold, R.E. Schapire, Y. Singer, and M.K. Warmuth. Portfolio selection using multiplicative updates. Mathematical Finance, 8(4):325–347, 1998. [9] J. Ziv and A. Lempel. Compression of individual sequences via variable rate coding. IEEE Transactions on Information Theory, 24:530–536, 1978. [10] G.G. Langdon. A note on the lempel-ziv model for compressing individual sequences. IEEE Transactions on Information Theory, 29:284–287, 1983. [11] M. Feder. Gambling using a finite state machine. IEEE Transactions on Information Theory, 37:1459–1465, 1991. [12] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29:656–664, 1983. [13] N. Cesa-Bianchi, Y. Freund, D. Haussler, D.P. Helmbold, R.E. Schapire, and M.K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997. [14] A. Blum and A. Kalai. Universal portfolios with and without transaction costs. Machine Learning, 30(1):23–30, 1998. [15] A. Borodin and R. El-Yaniv. 
[16] W. Brock, J. Lakonishok, and B. LeBaron. Simple technical trading rules and the stochastic properties of stock returns. Journal of Finance, 47:1731–1764, 1992.
[17] P. Raghavan. A statistical adversary for on-line algorithms. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 7:79–83, 1992.
[18] A. Chou, J.R. Cooperstock, R. El-Yaniv, M. Klugerman, and T. Leighton. The statistical adversary allows optimal money-making trading strategies. In Proceedings of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, 1995.

4 0.50155091 19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unknown-abstract

5 0.44167092 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines.
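To make the vanishing-regularization claim above concrete, here is a small numerical sketch (my own illustration on toy data, not the authors' code): it fits L2-regularized logistic regression by gradient descent on separable data and reports the minimum margin of the normalized solution, which should grow toward the max-margin value as the regularization shrinks.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 0.4, (20, 2)),   # separable toy data
               rng.normal(-2.0, 0.4, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])

def fit_reg_logistic(X, y, lam, steps=50000, lr=0.5):
    """Gradient descent on (1/n) sum_i log(1+exp(-y_i w.x_i)) + lam ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        m = y * (X @ w)                                # functional margins
        grad = -(X * (y * expit(-m))[:, None]).mean(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

for lam in [1e-1, 1e-3, 1e-5]:
    w = fit_reg_logistic(X, y, lam)
    w_hat = w / np.linalg.norm(w)
    print(f"lambda={lam:.0e}  min margin of normalized w: {min(y @ np.diag(X @ w_hat) if False else y*(X@w_hat)):.4f}")
```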

6 0.4391678 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

7 0.36625782 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

8 0.3309139 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

9 0.3112832 148 nips-2003-Online Passive-Aggressive Algorithms

10 0.31123146 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering

11 0.2957691 55 nips-2003-Distributed Optimization in Adaptive Networks

12 0.27480346 102 nips-2003-Large Scale Online Learning

13 0.27301824 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

14 0.26916412 38 nips-2003-Autonomous Helicopter Flight via Reinforcement Learning

15 0.26599121 32 nips-2003-Approximate Expectation Maximization

16 0.25772172 99 nips-2003-Kernels for Structured Natural Language Data

17 0.25443056 196 nips-2003-Wormholes Improve Contrastive Divergence

18 0.25346678 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters

19 0.25214392 153 nips-2003-Parameterized Novelty Detectors for Environmental Sensor Monitoring

20 0.25181383 176 nips-2003-Sequential Bayesian Kernel Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.033), (11, 0.029), (20, 0.296), (29, 0.017), (30, 0.019), (35, 0.107), (53, 0.079), (71, 0.057), (76, 0.032), (85, 0.07), (91, 0.115), (99, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78661561 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.
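As a concrete companion to this abstract, the sketch below implements the classic fixed-share style update that such switching algorithms build on. This is a generic illustration with hypothetical names, not the paper's exact algorithm, loss model, or parameter discretization scheme.

```python
import numpy as np

def fixed_share_weights(expert_losses, alpha, eta=1.0):
    """Exponential-weights update with a switching ('share') step.

    expert_losses: (T, n) array; entry [t, i] is the loss of expert i at t.
    alpha: probability of switching to a different expert between steps.
    Returns the (T, n) array of weight vectors used at each time step.
    """
    T, n = expert_losses.shape
    w = np.full(n, 1.0 / n)
    history = np.empty((T, n))
    for t in range(T):
        history[t] = w
        w = w * np.exp(-eta * expert_losses[t])   # multiplicative loss update
        w /= w.sum()
        # Share step: each expert keeps 1-alpha of its weight; the rest
        # is redistributed uniformly among the other experts.
        w = (1 - alpha) * w + alpha * (1.0 - w) / (n - 1)
        w /= w.sum()
    return history
```

With alpha = 0 this degenerates to the static exponential-weights algorithm; a nonzero alpha lets the weight mass migrate quickly when the best expert changes.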

2 0.78453338 168 nips-2003-Salient Boundary Detection using Ratio Contour

Author: Song Wang, Toshiro Kubota, Jeffrey M. Siskind

Abstract: This paper presents a novel graph-theoretic approach, named ratio contour, to extract perceptually salient boundaries from a set of noisy boundary fragments detected in real images. The boundary saliency is defined using the Gestalt laws of closure, proximity, and continuity. This paper first constructs an undirected graph with two different sets of edges: solid edges and dashed edges. The weights of solid and dashed edges measure the local saliency in and between boundary fragments, respectively. Then the most salient boundary is detected by searching for an optimal cycle in this graph with minimum average weight. The proposed approach guarantees the global optimality without introducing any biases related to region area or boundary length. We collect a variety of images for testing the proposed approach with encouraging results.
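The paper's ratio-contour construction is specialized, but the generic primitive it invokes, finding a cycle of minimum average weight, has a classical dynamic-programming solution due to Karp for directed graphs. The sketch below is my own illustration of that primitive, not the authors' method.

```python
import math

def min_mean_cycle(n, edges):
    """Karp's algorithm: minimum mean edge weight over all directed cycles.
    n: number of vertices (labeled 0..n-1); edges: list of (u, v, weight).
    Returns math.inf if the graph has no cycle."""
    INF = math.inf
    # D[k][v] = min weight of a walk with exactly k edges ending at v,
    # starting anywhere (D[0][v] = 0 acts as a zero-cost virtual source).
    D = [[0.0] * n] + [[INF] * n for _ in range(n)]
    for k in range(1, n + 1):
        for u, v, w in edges:
            if D[k - 1][u] + w < D[k][v]:
                D[k][v] = D[k - 1][u] + w
    best = INF
    for v in range(n):
        if D[n][v] < INF:
            mean_v = max((D[n][v] - D[k][v]) / (n - k)
                         for k in range(n) if D[k][v] < INF)
            best = min(best, mean_v)
    return best

# A 3-cycle of mean weight 1 plus a heavier 2-cycle: the answer is 1.0.
print(min_mean_cycle(4, [(0, 1, 1), (1, 2, 1), (2, 0, 1), (2, 3, 5), (3, 2, 5)]))
```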

3 0.55266106 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis

Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers.
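The heart of the update is an ordinary forward-backward pass over pool indices. The sketch below is a simplified illustration with hypothetical names; it omits the correction terms the full method needs when pool candidates are drawn from a non-uniform proposal density.

```python
import numpy as np
from scipy.special import logsumexp

def sample_pool_path(pools, log_trans, log_emit, rng):
    """Forward-filter, backward-sample over an embedded HMM whose states
    at time t are the candidate states in pools[t].
    log_trans(s, s2): log transition score from state s to state s2.
    log_emit(t, s):   log observation likelihood at time t in state s."""
    T = len(pools)
    alpha = [np.array([log_emit(0, s) for s in pools[0]])]
    for t in range(1, T):
        alpha.append(np.array([
            log_emit(t, s2) +
            logsumexp(alpha[t - 1] +
                      np.array([log_trans(s, s2) for s in pools[t - 1]]))
            for s2 in pools[t]]))

    def draw(log_w):
        p = np.exp(log_w - logsumexp(log_w))
        p /= p.sum()                      # guard against float round-off
        return rng.choice(len(p), p=p)

    # Backward stochastic pass through the pools.
    path = [None] * T
    j = draw(alpha[T - 1])
    path[T - 1] = pools[T - 1][j]
    for t in range(T - 2, -1, -1):
        scores = alpha[t] + np.array([log_trans(s, path[t + 1]) for s in pools[t]])
        j = draw(scores)
        path[t] = pools[t][j]
    return path
```

Because the dynamic program runs over pool indices rather than the raw state space, its cost is quadratic in the pool size per time step, independent of the (possibly continuous) underlying state space.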

4 0.54591644 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy.
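One way to realize the idea summarized above is a scalar Kalman filter that tracks the slowly drifting contribution of "everyone else" to the global reward and credits the residual to the learning agent. The following is a minimal sketch under that model; the names and noise parameters are hypothetical, not taken from the paper.

```python
def filter_personal_reward(global_rewards, var_r=1.0, var_drift=0.1):
    """Model the observed global reward as g_t = r_t + b_t, where b_t
    (other agents plus noise) follows a random walk with step variance
    var_drift, and the agent's own reward r_t is treated as zero-mean
    observation noise with variance var_r. Returns estimates of r_t."""
    b_hat, p = 0.0, 1.0              # estimate of b_t and its variance
    estimates = []
    for g in global_rewards:
        p += var_drift               # predict: the random walk widens variance
        k = p / (p + var_r)          # Kalman gain
        b_hat += k * (g - b_hat)     # update the drift estimate toward g
        p *= (1.0 - k)
        estimates.append(g - b_hat)  # residual credited to this agent
    return estimates

# Example: the agent's own 0/1 rewards buried in a drifting global signal.
import random
random.seed(0)
r_true = [random.choice([0.0, 1.0]) for _ in range(200)]
g = [r + 0.05 * t for t, r in enumerate(r_true)]
est = filter_personal_reward(g)
# Residuals separate the +1 steps from the 0 steps (up to a common offset).
print([round(e, 2) for e in est[190:]])
print(r_true[190:])
```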

5 0.54558617 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe

Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
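The "variational transformation" mentioned above can be evaluated numerically from its standard definition. The sketch below is my own construction for convex margin losses, not the authors' code; it checks the result against the known closed forms for the hinge and exponential losses.

```python
import numpy as np

def psi(phi, theta, alphas=np.linspace(-20.0, 20.0, 4001)):
    """Numerical psi-transform of a convex margin loss phi at theta in [0,1]:
    with eta = (1+theta)/2 and conditional risk
        C_eta(a) = eta*phi(a) + (1-eta)*phi(-a),
    psi(theta) = inf{C_eta(a) : a*(2*eta-1) <= 0} - inf_a C_eta(a)."""
    eta = (1.0 + theta) / 2.0
    risk = eta * phi(alphas) + (1.0 - eta) * phi(-alphas)
    restricted = risk[alphas * (2.0 * eta - 1.0) <= 0.0]
    return restricted.min() - risk.min()

hinge = lambda a: np.maximum(0.0, 1.0 - a)   # SVM loss: psi(theta) = theta
exp_loss = lambda a: np.exp(-a)              # AdaBoost: 1 - sqrt(1 - theta^2)

for theta in [0.0, 0.25, 0.5, 0.75]:
    print(f"theta={theta:.2f}  hinge psi={psi(hinge, theta):.4f} "
          f"(exact {theta:.4f})  exp psi={psi(exp_loss, theta):.4f} "
          f"(exact {1 - np.sqrt(1 - theta**2):.4f})")
```

A larger psi value at a given theta means the surrogate risk gives a tighter handle on the excess 0-1 risk there, which is the comparison the abstract's quantitative relationship formalizes.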

6 0.54456663 158 nips-2003-Policy Search by Dynamic Programming

7 0.53912067 78 nips-2003-Gaussian Processes in Reinforcement Learning

8 0.53557819 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

9 0.53365189 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

10 0.52848214 68 nips-2003-Eye Movements for Reward Maximization

11 0.527753 189 nips-2003-Tree-structured Approximations by Expectation Propagation

12 0.52698654 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

13 0.52552587 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

14 0.52547383 107 nips-2003-Learning Spectral Clustering

15 0.52523679 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

16 0.52491593 73 nips-2003-Feature Selection in Clustering Problems

17 0.52333361 122 nips-2003-Margin Maximizing Loss Functions

18 0.52286947 30 nips-2003-Approximability of Probability Distributions

19 0.52209091 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

20 0.52127743 100 nips-2003-Laplace Propagation