nips nips2009 nips2009-220 knowledge-graph by maker-knowledge-mining

220 nips-2009-Slow Learners are Fast


Source: pdf

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. [sent-6, score-0.504]

2 This view, however, is slightly deceptive for several reasons: current online algorithms process one instance at a time. [sent-11, score-0.132]

3 It is therefore very wasteful if only one of these cores is actually used for estimation. [sent-17, score-0.204]

4 This means that current algorithms reach their limit at problems of size 1TB whenever the algorithm is I/O bound (this amounts to a training time of 3 hours), or at even smaller problems whenever the model parametrization makes the algorithm CPU bound. [sent-20, score-0.165]
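A rough back-of-the-envelope check of the 3-hour figure, assuming a sustained sequential read rate of about 100 MB/s (that rate is my assumption, not stated above): scanning 1 TB then takes roughly 10^6 MB / 100 MB/s = 10^4 s, which is a little under 3 hours.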

5 Finally, distributed and cloud computing are unsuitable for today’s online learning algorithms. [sent-21, score-0.162]

6 In a nutshell, we propose the following two variants: several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector which is updated asynchronously. [sent-28, score-0.403]

7 This allows us to accelerate computationally intensive problems whenever gradient computations are relatively expensive. [sent-29, score-0.132]

8 A common feature of both algorithms is that the update occurs with some delay: in the first case, other cores may have updated the parameter vector in the meantime; in the second case, other cores may have already computed parts of the function for subsequent examples before an update. [sent-32, score-0.498]

9 1 Algorithm Platforms We begin with an overview of three platforms which are available for parallelization of algorithms. [sent-34, score-0.165]

10 They differ in their structural parameters, such as synchronization ability, latency, and bandwidth and consequently they are better suited to different styles of algorithms. [sent-35, score-0.177]

11 They are general purpose processors which operate on a joint memory space where each of the processors can execute arbitrary pieces of code independently of other processors. [sent-39, score-0.255]

12 There the number of processing elements is vastly higher (1024 on high-end consumer graphics cards), although they tend to be bundled into groups of 8 cores (also referred to as multiprocessing elements), each of which can execute a given piece of code in a data-parallel fashion. [sent-42, score-0.293]

13 An issue is that explicit synchronization between multiprocessing elements is difficult: it requires the compute kernels running on the processing elements to complete. [sent-43, score-0.133]

14 A clear limit here is bandwidth constraints and latency for inter-computer communication. [sent-45, score-0.156]

15 On Gigabit Ethernet the TCP/IP latency can be on the order of 100µs, the equivalent of 10^5 clock cycles on a processor, and network bandwidth tends to be a factor of 100 slower than memory bandwidth. [sent-46, score-0.232]

16 Grid Computing: Computational paradigms such as MapReduce [4] and Hadoop are well suited for the parallelization of batch-style algorithms [17]. [sent-47, score-0.161]

17 In comparison to cluster configurations communication and latency are further constrained. [sent-48, score-0.144]

18 We consider only the first two platforms since latency plays a critical role in the analysis of the class of algorithms we propose. [sent-51, score-0.172]

19 While we do not exclude the possibility of devising parallel online algorithms suited to grid computing, we believe that the family of algorithms proposed in this paper is unsuitable and that a significantly different synchronization paradigm would be needed. [sent-52, score-0.299]

20 It is our goal to find some parameter vector x (which is drawn from some Banach space X with associated norm ‖·‖) such that the sum over convex functions fi : X → R takes on the smallest value possible. [sent-55, score-0.225]

21 At the outset we make no special assumptions on the order or form of the functions fi . [sent-58, score-0.179]

22 In other cases, the functions fi may be drawn from some distribution (e. [sent-60, score-0.179]

23 It is our goal to find a sequence of xi such that the cumulative loss Σi fi(xi) is minimized. [sent-63, score-0.212]

24 Denote by f∗(x) := (1/|F|) Σi fi(x) or f∗(x) := E_{f∼p(f)}[f(x)] (1) the average risk, and correspondingly x∗ := argmin_{x∈X} f∗(x) (2). [sent-66, score-0.15]

25 We propose the following algorithm. Algorithm 1 (Delayed Stochastic Gradient Descent). Input: feasible space X ⊆ R^n, annealing schedule ηt, and delay τ ∈ N. Initialization: set x1, … [sent-68, score-0.605]

26 …, xτ = 0 and compute the corresponding gradients gt = ∂x ft(xt). [sent-71, score-0.59]

27 for t = τ + 1 to T + τ do: obtain ft and incur loss ft(xt); compute gt := ∂x ft(xt); update xt+1 = argmin_{x∈X} ‖x − (xt − ηt gt−τ)‖ (gradient step and projection); end for. Figure 1: Data-parallel stochastic gradient descent with shared parameter vector. [sent-72, score-1.658]
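To make the delayed update concrete, here is a minimal Python sketch of Algorithm 1 as summarized above. The ball projection, the toy quadratic loss in the usage example, and all function names are illustrative assumptions rather than details taken from the paper.

    import numpy as np

    def project(x, radius=10.0):
        # Euclidean projection onto a ball, standing in for the feasible set X.
        norm = np.linalg.norm(x)
        return x if norm <= radius else x * (radius / norm)

    def delayed_sgd(sample_ft, grad_ft, eta, tau, dim, T):
        """Delayed stochastic gradient descent: at step t the update uses the
        gradient computed tau steps earlier (tau = 0 recovers plain SGD)."""
        x = np.zeros(dim)
        pending = []                      # gradients waiting to be applied
        for t in range(1, T + tau + 1):
            g = grad_ft(sample_ft(), x)   # g_t, computed at the current x
            pending.append(g)
            if len(pending) > tau:        # after a warm-up of tau steps
                g_old = pending.pop(0)    # this is g_{t - tau}
                x = project(x - eta(t) * g_old)
        return x

    # Hypothetical usage with a toy quadratic loss f_t(x) = 0.5 * ||x - z_t||^2:
    rng = np.random.default_rng(0)
    x_hat = delayed_sgd(sample_ft=lambda: rng.normal(1.0, 0.1, size=5),
                        grad_ft=lambda z, x: x - z,
                        eta=lambda t: 1.0 / np.sqrt(t),
                        tau=10, dim=5, T=5000)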

28 Each of them computes its own gradient gt = ∂x ft (xt ). [sent-74, score-0.674]

29 Since each computer updates x in a round-robin fashion, there is a delay of τ = n − 1 between when a gradient is computed and when it is applied to x. [sent-75, score-0.739]

30 In this paper the annealing schedule will be either ηt = σ/(t−τ) or ηt = σ/√(t−τ). [sent-76, score-0.206]
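The two schedules above can be written as small factories that plug into a routine like the delayed_sgd sketch given earlier; only the formulas themselves are taken from the text, the helper names are mine.

    import math

    def eta_inverse(sigma, tau):
        # eta_t = sigma / (t - tau), valid for t > tau
        return lambda t: sigma / (t - tau)

    def eta_inverse_sqrt(sigma, tau):
        # eta_t = sigma / sqrt(t - tau), valid for t > tau
        return lambda t: sigma / math.sqrt(t - tau)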

31 If we set τ = 0, algorithm 1 becomes an entirely standard stochastic gradient descent algorithm. [sent-78, score-0.228]

32 The only difference in delayed stochastic gradient descent is that we do not update the parameter vector xt with the current gradient gt but rather with a delayed gradient gt−τ that was computed τ steps earlier. [sent-79, score-1.161]

33 Moreover, assume that computing the gradient of ft (x) is at least n times as expensive as it is to update x (read, add, write). [sent-87, score-0.495]

34 The rationale for delayed updates can be seen in the following setting: assume that we have n cores performing stochastic gradient descent on different instances ft while sharing one common parameter vector x. [sent-89, score-1.095]

35 If we allow each core to update x one at a time in a round-robin fashion, then there will be a delay of τ = n − 1 between when we see ft and when we get to update xt+τ. [sent-90, score-1.048]

36 The delay arises since updates by different cores cannot happen simultaneously. [sent-91, score-0.821]

37 This setting is preferable whenever computation of ft itself is time consuming. [sent-92, score-0.399]
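A sketch of the shared-parameter variant described in the last few sentences: several Python threads read a common vector x, compute gradients against a possibly stale snapshot, and take turns applying them, so each gradient lands roughly n − 1 updates after it was computed. The locking scheme and names are illustrative assumptions, and Python's GIL means this shows the update pattern rather than a real speedup.

    import threading
    import numpy as np

    def shared_sgd(sample_ft, grad_ft, eta, dim, n_workers, steps_per_worker):
        x = np.zeros(dim)
        lock = threading.Lock()
        step = [0]                                  # global update counter

        def worker():
            for _ in range(steps_per_worker):
                with lock:
                    snapshot = x.copy()             # read the shared parameters
                g = grad_ft(sample_ft(), snapshot)  # slow part: gradient of f_t
                with lock:                          # apply the now-stale gradient
                    step[0] += 1
                    x[:] -= eta(step[0]) * g

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        return x

Because the gradient is computed outside the lock, other workers may already have modified x by the time it is applied, which is exactly the delayed update analysed above.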

38 On a multi-computer cluster we can use a similar mechanism by having one server act as a state keeper which retains an up-to-date copy of x, while the clients that compute loss gradients can retrieve a copy of x at any time and send gradient update messages to the state keeper. [sent-95, score-0.265]
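For the cluster variant just described, a toy single-process stand-in: queue.Queue plays the role of the network channel, a plain dict is the published copy of x, and the helper names are assumptions for illustration only.

    import queue
    import numpy as np

    def state_keeper(dim, eta, grad_queue, published, n_updates):
        # The keeper owns the up-to-date x, applies incoming (delayed) gradient
        # messages, and republishes a copy that clients may fetch at any time.
        x = np.zeros(dim)
        published["x"] = x.copy()
        for t in range(1, n_updates + 1):
            g = grad_queue.get()          # gradient computed from an older copy of x
            x -= eta(t) * g
            published["x"] = x.copy()
        return x

    def client(sample_ft, grad_ft, grad_queue, published, n_steps):
        for _ in range(n_steps):
            x_stale = published["x"]      # retrieve a (possibly stale) copy of x
            grad_queue.put(grad_ft(sample_ft(), x_stale))

Running the keeper in one thread and several clients in others reproduces the delayed-update pattern in miniature; in a real deployment the queue and the published copy would be network messages.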

39 This can be addressed by explicitly parallelizing the computation of the function value fi(x) rather than attempting to compute several instances of fi(x) simultaneously. [sent-97, score-0.35]

40 E.g. when fi(x) = g(⟨φ(zi), x⟩) for high-dimensional φ(zi). [sent-100, score-0.132]

41 The only communication required is to combine the partial values and to compute gradients with respect to ⟨φ(zi), x⟩. [sent-102, score-0.152]
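A sketch of the decomposition just described: φ(z) and x are split into pieces, each worker contributes a partial inner product, only the scalar partial sums and the scalar loss derivative need to be exchanged, and each worker can then form its own slice of the gradient. The chunking and the generic dloss_dmargin callable are assumptions for illustration.

    import numpy as np

    def pipelined_gradient(phi, x, dloss_dmargin, n_parts):
        phi_parts = np.array_split(phi, n_parts)     # one piece per worker
        x_parts = np.array_split(x, n_parts)
        # stage 1: each worker computes a partial value; only scalars are combined
        margin = sum(p @ w for p, w in zip(phi_parts, x_parts))
        scale = dloss_dmargin(margin)                # scalar derivative of the loss
        # stage 2: each worker scales its own piece of phi(z) for its gradient slice
        return [scale * p for p in phi_parts]

The delay discussed in the next sentence arises because stage 2 for one example overlaps with stage 1 for later examples.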

42 This causes delay since the second stage is processing results of the first stage while the latter has already moved on to processing ft+1 or further. [sent-103, score-0.545]

43 While the architecture is quite different, the effects are identical: the parameter vector x is updated with some delay τ . [sent-104, score-0.545]

44 Note that here τ can be much smaller than the number of processors and mainly depends on the latency of the communication channel. [sent-105, score-0.217]

45 Randomization: Order of observations matters for delayed updates: imagine that an adversary, aware of the delay τ, bundles each of the τ most similar instances ft together. [sent-107, score-0.137]

46 The reason is that only after seeing τ instances of ft will we be able to respond to the data. [sent-109, score-0.442]

47 We begin with a simple game-theoretic analysis that only requires ft to be convex and the subdifferentials to be bounded, ‖∂x ft(x)‖ ≤ L, for some L > 0. [sent-114, score-0.776]

48 It is our goal to bound the regret R associated with a sequence X = {x1, …, xT}. [sent-116, score-0.171]

49 If all terms are convex we obtain R[X] := Σ_{t=1}^T [ft(xt) − ft(x∗)] ≤ Σ_{t=1}^T ⟨∂x ft(xt), xt − x∗⟩ = Σ_{t=1}^T ⟨gt, xt − x∗⟩. [sent-120, score-1.572]

50 The key difference is that we now have an additional term characterizing the correlation between successive gradients which needs to be bounded. [sent-125, score-0.141]

51 In the worst case all we can do is bound ⟨gt−τ−j, gt−τ⟩ ≤ L² whenever the gradients are highly correlated, which yields the following: Theorem 2: Suppose all the cost functions are Lipschitz continuous with a constant L and max_{x,x′∈X} D(x‖x′) ≤ F². [sent-126, score-0.221]

52 This is similar to what we would expect in the worst case: an adversary may reorder instances such as to maximally slow down progress. [sent-128, score-0.137]

53 This result may appear overly pessimistic but the following example shows that such worst-case scaling behavior is to be expected: Lemma 3 Assume that an optimal online algorithm with regard to a convex game achieves regret R[m] after seeing m instances. [sent-130, score-0.345]

54 Then any algorithm which may only use information that is at least τ instances old has a worst-case regret bound of τ·R[m/τ]. [sent-131, score-0.221]

55 Our construction works by designing a sequence of functions fi where, for a fixed n ∈ N, all fnτ+j are identical (for j ∈ {1, …, τ}). [sent-132, score-0.179]

56 Hence, even an algorithm knowing that we will see τ identical instances in a row but being disallowed to respond to them for τ instances will do no better than one which sees every instance once but is allowed to respond instantly. [sent-137, score-0.217]

57 The useful consequence of Theorem 2 is that we are guaranteed to converge at all even if we encounter delay (the latter is not trivial; after all, we could end up with an oscillating parameter vector for overly aggressive learning rates). [sent-138, score-0.545]

58 While such extreme cases hardly occur in practice, we need to make stronger assumptions in terms of correlation of ft and the degree of smoothness in ft to obtain tighter bounds. [sent-139, score-0.762]

59 We conclude this section by studying a particularly convenient case: the setting when the functions fi are strongly convex, satisfying fi(x∗) ≥ fi(x) + ⟨x∗ − x, ∂x fi(x)⟩ + (hi/2)·‖x − x∗‖² (6). Here we can get rid of the D(x∗‖x1) dependency in the loss bound. [sent-140, score-0.467]

60 Theorem 4 Suppose that the functions fi are strongly convex with parameter λ > 0. [sent-141, score-0.255]

61 As before, we pay a linear price in the delay τ . [sent-144, score-0.545]

62 4 Decorrelating Gradients: To improve our bounds beyond the most pessimistic case we need to assume that the adversary is not acting in the most hostile fashion possible. [sent-145, score-0.191]

63 In the following we study the opposite case — namely that the adversary is drawing the functions fi iid from an arbitrary (but fixed) distribution. [sent-146, score-0.266]

64 The key reason for this requirement is that we need to control the value of ⟨gt, gt′⟩ for adjacent gradients. [sent-147, score-0.478]

65 The flavor of the bounds we use will be in terms of the expected regret rather than an actual regret. [sent-148, score-0.204]

66 In some cases we will only be able to obtain bounds on the expected regret rather than the actual regret. [sent-156, score-0.204]

67 Our first strategy is to assume that ft arises from a scalar function of a linear function class. [sent-158, score-0.351]

68 The second strategy makes stringent smoothness assumptions on ft , namely it assumes that the gradients themselves are Lipschitz continuous. [sent-160, score-0.461]

69 This will lead to guarantees for which the delay becomes increasingly irrelevant as the algorithm progresses. [sent-161, score-0.545]

70 4.1 Covariance bounds for linear function classes: Many functions ft(x) depend on x only via an inner product. [sent-163, score-0.447]

71 They can be expressed as ft(x) = l(yt, ⟨zt, x⟩) and hence gt(x) = ∂x ft(x) = zt · ∂_{⟨zt,x⟩} l(yt, ⟨zt, x⟩) (8). Now assume that |∂_{⟨zt,x⟩} l(yt, ⟨zt, x⟩)| ≤ Λ for all x and all t. [sent-164, score-1.445]
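As one concrete instance of (8), with the logistic loss (my choice of example, not the paper's): the scalar derivative is bounded by Λ = 1, and the gradient is that scalar times zt.

    import numpy as np

    def logistic_grad(y, z, x):
        # f_t(x) = log(1 + exp(-y * <z, x>)); here l(y, m) = log(1 + exp(-y * m))
        margin = z @ x
        dl_dmargin = -y / (1.0 + np.exp(y * margin))  # |dl/dmargin| <= 1, so Lambda = 1
        return z * dl_dmargin                         # g_t(x) = z_t * dl/d<z_t, x>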

72 For such problems it is possible to bound the correlation between subsequent gradients via the following lemma: Lemma 5: Denote by (y, z), (y′, z′) ∼ Pr(y, z) random variables which are drawn independently of x, x′ ∈ X. [sent-170, score-0.205]

73 Corollary 6: Given ηt = σ/√(t−τ) and the conditions of Lemma 5, the regret of the delayed update algorithm is bounded by R[X] ≤ σL² … Hence for σ² = 4 … [sent-173, score-0.386]

74 4.2 Bounds for smooth gradients: The key to improving the rate (rather than just the constant) with which the bounds depend on τ is to impose further smoothness constraints on ft. [sent-175, score-0.528]

75 This is precisely what we need in order to show that a small delay (which amounts to small changes in x) will not significantly impact the update that is carried out. [sent-177, score-0.64]

76 Nonetheless, since this discontinuity only occurs on a set of measure 0, delayed stochastic gradient descent still works very well on them in practice. [sent-181, score-0.388]

77 Theorem 7: In addition to the conditions of Theorem 2, assume that the functions fi are i.i.d. [sent-182, score-0.179]

78 Initially, a delay of τ can be quite harmful since subsequent gradients are highly correlated. [sent-190, score-0.685]

79 At a later stage, when optimization becomes increasingly an averaging process, a delay of τ in the updates proves to be essentially harmless. [sent-191, score-0.617]

80 The key difference from the bounds of Theorem 2 is that now the rate of convergence has improved dramatically and is essentially as good as in sequential online learning. [sent-192, score-0.253]

81 In this case averaging becomes the dominant force, hence parallelization does not degrade convergence further. [sent-196, score-0.184]

82 4.3 Bounds for smooth gradients with strong convexity: We conclude this section with the tightest of all bounds, the setting where the losses are all strongly convex and smooth. [sent-199, score-0.291]

83 In both cases a delay of 10 has no effect on the convergence whatsoever and even a delay of 100 is still quite acceptable. [sent-204, score-1.132]

84 (Figure axis: Threads, 1–7.) Theorem 8: Under the assumptions of Theorem 4, in particular, assuming that all functions fi are i.i.d. [sent-207, score-0.179]

85 In particular, we used two different training sets that were based on e-mails: the TREC dataset [5], consisting of 75,419 e-mail messages, and a proprietary (significantly harder) dataset of which we took 100,000 e-mails. [sent-216, score-0.132]

86 In particular, for the TREC dataset we used 2^18 feature bins and for the proprietary dataset we used 2^24 bins. [sent-221, score-0.132]
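A minimal sketch of the hashing trick referred to above, assuming a bag-of-tokens representation of an e-mail; md5 is only a deterministic stand-in for whatever hash function VW actually uses, and unsigned counts are a simplification.

    import hashlib
    import numpy as np

    def hash_features(tokens, n_bins=2**18):
        # Map each token to one of n_bins coordinates; collisions are tolerated.
        x = np.zeros(n_bins)
        for tok in tokens:
            h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
            x[h % n_bins] += 1.0
        return x

    # e.g. hash_features("free offer click here".split(), n_bins=2**18)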

87 Note that hashing comes with performance guarantees which state that the canonical distortion due to hashing is sufficiently small for the dimensionality we picked. [sent-222, score-0.152]

88 We checked convergence for a system where the delay is given by τ ∈ {0, 10, 100, 1000}. [sent-226, score-0.615]

89 Unlike the previous check, this one includes issues such as memory contention, thread synchronization, and the general feasibility of a delayed updating architecture. [sent-229, score-0.291]

90 Implementation: The code was written in Java, although several of the fundamentals, namely hashing and the choice of loss function, were based upon VW [10]. [sent-230, score-0.138]

91 We rescale the updates and occasionally rescale the parameter. [sent-233, score-0.132]

92 In general, at least 6 of the cores were free at any given time. [sent-237, score-0.204]

93 Each slave is given both the weights for its pieces and the corresponding pieces of the examples. [sent-242, score-0.133]

94 As a safeguard we limited the maximum delay to 100 examples. [sent-249, score-0.545]

95 The first experiment that we ran was a simulation where we artificially added a delay between the update and the product (Figure 2a). [sent-251, score-0.605]

96 We ran this experiment using linear features, and observed that the performance did not noticeably degrade with a delay of 10 examples, did not significantly degrade with a delay of 100, but with a delay of 1000, the performance became much worse. [sent-252, score-1.729]

97 In fact, even a delay of 1000 does not result in particularly bad performance. [sent-255, score-0.545]

98 Since even the sequential version already handled 150,000 examples per second, we tested parallelization only for quadratic features, where throughput would be on the order of 1000 examples per second. [sent-256, score-0.231]

99 However, intuitively, having a delay of τ is like having a learning rate that is τ times larger. [sent-260, score-0.545]

100 Accelerated training of conditional random fields with stochastic gradient methods. [sent-405, score-0.139]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('delay', 0.545), ('ft', 0.351), ('gt', 0.239), ('cores', 0.204), ('delayed', 0.189), ('fi', 0.15), ('trec', 0.149), ('regret', 0.137), ('xt', 0.117), ('gradients', 0.11), ('latency', 0.102), ('online', 0.097), ('parallelization', 0.095), ('synchronization', 0.087), ('adversary', 0.087), ('gradient', 0.084), ('zt', 0.084), ('hashing', 0.076), ('processors', 0.073), ('updates', 0.072), ('platforms', 0.07), ('proprietary', 0.07), ('slave', 0.07), ('bounds', 0.067), ('disk', 0.063), ('pieces', 0.063), ('loss', 0.062), ('update', 0.06), ('descent', 0.06), ('incur', 0.056), ('thread', 0.056), ('hurt', 0.056), ('stochastic', 0.055), ('bandwidth', 0.054), ('langford', 0.051), ('instances', 0.05), ('parallel', 0.049), ('lemma', 0.048), ('whenever', 0.048), ('yt', 0.048), ('quadratic', 0.048), ('send', 0.047), ('degrade', 0.047), ('sequential', 0.047), ('multiprocessing', 0.046), ('ratliff', 0.046), ('slaves', 0.046), ('convex', 0.046), ('memory', 0.046), ('master', 0.045), ('graphics', 0.043), ('communication', 0.042), ('convergence', 0.042), ('respond', 0.041), ('bottou', 0.041), ('bagnell', 0.041), ('desktop', 0.041), ('throughput', 0.041), ('pipelined', 0.041), ('dot', 0.039), ('smola', 0.039), ('editors', 0.038), ('losses', 0.038), ('copy', 0.037), ('pessimistic', 0.037), ('amounted', 0.037), ('suited', 0.036), ('harder', 0.036), ('instance', 0.035), ('novelty', 0.035), ('threads', 0.035), ('cloud', 0.035), ('amounts', 0.035), ('bound', 0.034), ('risk', 0.034), ('platt', 0.034), ('lipschitz', 0.034), ('singer', 0.033), ('core', 0.032), ('cards', 0.032), ('schedule', 0.032), ('dataset', 0.031), ('correlation', 0.031), ('theorem', 0.031), ('zi', 0.031), ('rationale', 0.03), ('unsuitable', 0.03), ('paradigms', 0.03), ('rescale', 0.03), ('processor', 0.03), ('strongly', 0.03), ('subsequent', 0.03), ('occur', 0.029), ('hazan', 0.029), ('randomization', 0.029), ('entirely', 0.029), ('functions', 0.029), ('game', 0.028), ('annealing', 0.028), ('checked', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

2 0.27284831 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings. 1

3 0.22038807 178 nips-2009-On Stochastic and Worst-case Models for Investing

Author: Elad Hazan, Satyen Kale

Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1

4 0.21492775 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

5 0.15591118 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

Author: Chonghai Hu, Weike Pan, James T. Kwok

Abstract: Regularized risk minimization often involves non-smooth optimization, either because of the loss function (e.g., hinge loss) or the regularizer (e.g., ℓ1 -regularizer). Gradient methods, though highly scalable and easy to implement, are known to converge slowly. In this paper, we develop a novel accelerated gradient method for stochastic optimization while still preserving their computational simplicity and scalability. The proposed algorithm, called SAGE (Stochastic Accelerated GradiEnt), exhibits fast convergence rates on stochastic composite optimization with convex or strongly convex objectives. Experimental results show that SAGE is faster than recent (sub)gradient methods including FOLOS, SMIDAS and SCD. Moreover, SAGE can also be extended for online learning, resulting in a simple algorithm but with the best regret bounds currently known for these problems. 1

6 0.12931842 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

7 0.1266979 181 nips-2009-Online Learning of Assignments

8 0.12514625 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

9 0.10261431 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

10 0.098401427 14 nips-2009-A Parameter-free Hedging Algorithm

11 0.089133523 27 nips-2009-Adaptive Regularization of Weight Vectors

12 0.089042217 76 nips-2009-Efficient Learning using Forward-Backward Splitting

13 0.086896658 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

14 0.079941921 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

15 0.079684481 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

16 0.076797545 123 nips-2009-Large Scale Nonparametric Bayesian Inference: Data Parallelisation in the Indian Buffet Process

17 0.075596735 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

18 0.073562749 101 nips-2009-Generalization Errors and Learning Curves for Regression with Multi-task Gaussian Processes

19 0.07109873 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

20 0.06986846 24 nips-2009-Adapting to the Shifting Intent of Search Queries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.224), (1, 0.217), (2, 0.135), (3, -0.129), (4, 0.241), (5, 0.197), (6, -0.05), (7, 0.1), (8, -0.035), (9, 0.029), (10, -0.023), (11, -0.006), (12, -0.06), (13, 0.005), (14, -0.062), (15, 0.068), (16, -0.037), (17, 0.018), (18, 0.041), (19, -0.03), (20, -0.011), (21, 0.031), (22, 0.058), (23, 0.024), (24, -0.032), (25, -0.009), (26, 0.005), (27, 0.075), (28, 0.061), (29, 0.028), (30, 0.001), (31, -0.024), (32, 0.007), (33, 0.02), (34, 0.097), (35, -0.016), (36, -0.046), (37, -0.044), (38, -0.03), (39, 0.085), (40, 0.033), (41, -0.031), (42, 0.052), (43, -0.08), (44, -0.046), (45, 0.083), (46, 0.039), (47, -0.058), (48, -0.077), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93832564 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

2 0.79574537 178 nips-2009-On Stochastic and Worst-case Models for Investing

Author: Elad Hazan, Satyen Kale

Abstract: In practice, most investing is done assuming a probabilistic model of stock price returns known as the Geometric Brownian Motion (GBM). While often an acceptable approximation, the GBM model is not always valid empirically. This motivates a worst-case approach to investing, called universal portfolio management, where the objective is to maximize wealth relative to the wealth earned by the best fixed portfolio in hindsight. In this paper we tie the two approaches, and design an investment strategy which is universal in the worst-case, and yet capable of exploiting the mostly valid GBM model. Our method is based on new and improved regret bounds for online convex optimization with exp-concave loss functions. 1

3 0.75780052 45 nips-2009-Beyond Convexity: Online Submodular Minimization

Author: Elad Hazan, Satyen Kale

Abstract: We consider an online decision problem over a discrete space in which the loss function is submodular. We give algorithms which are computationally efficient and are Hannan-consistent in both the full information and bandit settings. 1

4 0.68193501 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

5 0.62963724 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

6 0.62356192 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

7 0.61205184 73 nips-2009-Dual Averaging Method for Regularized Stochastic Learning and Online Optimization

8 0.60441834 76 nips-2009-Efficient Learning using Forward-Backward Splitting

9 0.59160966 63 nips-2009-DUOL: A Double Updating Approach for Online Learning

10 0.54576743 181 nips-2009-Online Learning of Assignments

11 0.54113919 177 nips-2009-On Learning Rotations

12 0.51647538 14 nips-2009-A Parameter-free Hedging Algorithm

13 0.50920951 24 nips-2009-Adapting to the Shifting Intent of Search Queries

14 0.49675462 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

15 0.46136689 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

16 0.42545372 11 nips-2009-A General Projection Property for Distribution Families

17 0.417308 154 nips-2009-Modeling the spacing effect in sequential category learning

18 0.38472965 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control

19 0.38145438 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

20 0.37371707 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.014), (10, 0.247), (24, 0.083), (25, 0.056), (35, 0.064), (36, 0.114), (39, 0.038), (58, 0.077), (61, 0.024), (71, 0.061), (81, 0.028), (86, 0.073), (91, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93326461 186 nips-2009-Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units

Author: Feng Yan, Ningyi Xu, Yuan Qi

Abstract: The recent emergence of Graphics Processing Units (GPUs) as general-purpose parallel computing devices provides us with new opportunities to develop scalable learning methods for massive data. In this work, we consider the problem of parallelizing two inference methods on GPUs for latent Dirichlet Allocation (LDA) models, collapsed Gibbs sampling (CGS) and collapsed variational Bayesian (CVB). To address limited memory constraints on GPUs, we propose a novel data partitioning scheme that effectively reduces the memory cost. This partitioning scheme also balances the computational cost on each multiprocessor and enables us to easily avoid memory access conflicts. We use data streaming to handle extremely large datasets. Extensive experiments showed that our parallel inference methods consistently produced LDA models with the same predictive power as sequential training methods did but with 26x speedup for CGS and 196x speedup for CVB on a GPU with 30 multiprocessors. The proposed partitioning scheme and data streaming make our approach scalable with more multiprocessors. Furthermore, they can be used as general techniques to parallelize other machine learning models. 1

same-paper 2 0.8051343 220 nips-2009-Slow Learners are Fast

Author: Martin Zinkevich, John Langford, Alex J. Smola

Abstract: Online learning algorithms have impressive convergence properties when it comes to risk minimization and convex games on very large problems. However, they are inherently sequential in their design which prevents them from taking advantage of modern multi-core architectures. In this paper we prove that online learning with delayed updates converges well, thereby facilitating parallel online learning. 1

3 0.79870814 177 nips-2009-On Learning Rotations

Author: Raman Arora

Abstract: An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the von Neumann divergence. The multiplicative updates are exponentiated skew-symmetric matrices which comprise the Lie algebra of the rotation group. The orthonormality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to intuitive interpretation in terms of the differential geometry of the manifold associated with the rotation group. A complexity reduction result is presented that exploits the eigenstructure of the matrix updates to simplify matrix exponentiation to a quadratic form. 1

4 0.63257903 27 nips-2009-Adaptive Regularization of Weight Vectors

Author: Koby Crammer, Alex Kulesza, Mark Dredze

Abstract: We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data. 1

5 0.63118553 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black

Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1

6 0.62895405 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

7 0.62787354 129 nips-2009-Learning a Small Mixture of Trees

8 0.6277324 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

9 0.62747473 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

10 0.62721896 72 nips-2009-Distribution Matching for Transduction

11 0.62677109 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

12 0.62506151 239 nips-2009-Submodularity Cuts and Applications

13 0.6245153 260 nips-2009-Zero-shot Learning with Semantic Output Codes

14 0.62255973 71 nips-2009-Distribution-Calibrated Hierarchical Classification

15 0.62222546 3 nips-2009-AUC optimization and the two-sample problem

16 0.62060165 97 nips-2009-Free energy score space

17 0.62034827 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

18 0.61991292 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

19 0.61773479 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

20 0.61665457 16 nips-2009-A Smoothed Approximate Linear Program