jmlr jmlr2011 jmlr2011-19 knowledge-graph by maker-knowledge-mining

19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms


Source: pdf

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. [sent-4, score-0.591]

2 A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. [sent-5, score-0.385]

3 Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). [sent-6, score-0.579]

4 A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. [sent-8, score-0.445]

5 Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. [sent-9, score-0.363]

6 Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus 1. [sent-10, score-0.194]

7 Building such large scale algorithms requires tackling several problems in a distributed framework, such as communication delays in the network or the numerous problems caused by the lack of shared memory. [sent-14, score-0.231]

8 The competitive learning vector quantization (CLVQ) algorithm (see Gersho and Gray, 1992) provides a technique for building reliable clusters characterized by their prototypes. [sent-24, score-0.219]

9 The CLVQ also belongs to the class of stochastic gradient descent algorithms (for more information on stochastic gradient descent procedures we refer the reader to Benveniste et al. [sent-26, score-0.368]

10 In the present paper, we go further by introducing a model that brings together the original CLVQ algorithm and the comprehensive theory of asynchronous parallel linear algorithms developed by Tsitsiklis (1984), Tsitsiklis et al. [sent-31, score-0.322]

11 The resulting model will be called distributed asynchronous learning vector quantization (DALVQ for short). [sent-33, score-0.579]

12 At a high level, the DALVQ algorithm parallelizes several executions of the CLVQ method concurrently on different processors while the results of these algorithms are broadcast through the distributed framework asynchronously and efficiently. [sent-34, score-0.167]

13 Here, the term processor refers to any computing instance in a distributed architecture (see Bullo et al. [sent-35, score-0.228]

14 (2009), a coverage control problem is formulated as an optimization problem where the functional cost to be minimized is the same as that of the quantization problem stated in this manuscript. [sent-41, score-0.219]

15 The first technique computes a quantization scheme for $d$-dimensional samples $z_1, z_2, \ldots$ [sent-43, score-0.219]

16 Therefore, the DALVQ algorithms are defined by the M iterations $\{w^i(t)\}_{t=0}^{\infty}$, called versions, satisfying (with slight simplifications)
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big) - \varepsilon^{i}_{t+1}\, H\big(z^{i}_{t+1}, w^i(t)\big), \qquad i \in \{1, \ldots, M\}.$$ [sent-56, score-0.288]
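
As a rough illustration of this iteration, the NumPy sketch below performs one such update for a single processor, under invented mixing weights a[i, j], integer delays tau[i, j], and the usual nearest-prototype term H; every name here (dalvq_step, H, the toy dimensions) is ours and not taken from the paper.

```python
import numpy as np

def H(z, w):
    """Nearest-prototype term: only the winning prototype is pulled towards z."""
    g = np.zeros_like(w)                                  # w has shape (kappa, d)
    winner = np.argmin(((w - z) ** 2).sum(axis=1))
    g[winner] = w[winner] - z                             # gradient of 0.5 * min_l ||z - w_l||^2
    return g

def dalvq_step(history, a, tau, eps, z, i):
    """One simplified DALVQ update for processor i: mix delayed versions, then descend."""
    M = a.shape[0]
    mix = sum(a[i, j] * history[j][tau[i, j]] for j in range(M))
    return mix - eps * H(z, history[i][-1])

# Toy usage: M = 2 processors, kappa = 3 prototypes in dimension d = 2.
rng = np.random.default_rng(0)
M, kappa, d = 2, 3, 2
history = [[rng.normal(size=(kappa, d))] for _ in range(M)]   # history[j][s] = w^j(s)
a = np.full((M, M), 1.0 / M)                                  # mixing weights a^{i,j}(t)
tau = np.zeros((M, M), dtype=int)                             # no communication delay here
z = rng.normal(size=d)                                        # fresh sample for processor 0
w_new = dalvq_step(history, a, tau, eps=0.05, z=z, i=0)
```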

17 j=1 As a striking result, we prove that multiple versions of the quantizers, distributed among the processors in a parallel architecture, asymptotically reach a consensus almost surely. [sent-62, score-0.349]

18 Using the material introduced above, this reads $w^i(t) - w^j(t) \xrightarrow[t \to \infty]{} 0$ for all $(i, j) \in \{1, \ldots, M\}^2$. [sent-63, score-0.162]

19 Furthermore, we also show that these versions converge almost surely towards (the same) nearly optimal value for the quantization criterion. [sent-69, score-0.363]

20 These convergence results are similar in spirit to the most satisfactory almost sure convergence theorem for the CLVQ algorithm obtained by Pagès (1997). [sent-70, score-0.233]

21 For a given time span, our parallel DALVQ algorithm is able to process much more data than a single processor execution of the CLVQ procedure. [sent-71, score-0.164]

22 This allows some processors to compute faster and execute more iterations than others, and it also allows communication delays to be substantial and unpredictable. [sent-74, score-0.308]

23 First, a reduction of the synchronization penalty, which could bring a speed advantage over a synchronous execution. [sent-77, score-0.18]

24 In Section 3 we give a brief exposition of the mathematical framework for parallel asynchronous gradient methods introduced by Tsitsiklis (1984), Tsitsiklis et al. [sent-86, score-0.37]

25 (2005) on the asymptotic consensus in asynchronous parallel averaging problems are also recalled. [sent-89, score-0.407]

26 Quantization and CLVQ Algorithm. In this section, we describe the mathematical quantization problem and the CLVQ algorithm. [sent-92, score-0.219]

27 The quantization problem consists in finding a “good approximation” of µ by a set of κ vectors of Rd called a quantizer. [sent-96, score-0.219]

28 Throughout the document, the κ quantization points (or prototypes) will be seen as the components of an $(\mathbb{R}^d)^{\kappa}$-valued vector $w = (w_1, \ldots, w_\kappa)$. [sent-97, score-0.219]

29 To measure the correctness of a quantization scheme given by w, one introduces a cost function called the distortion, defined by
$$C_\mu(w) = \frac{1}{2} \int_{\mathbb{R}^d} \min_{1 \le \ell \le \kappa} \big\| z - w_\ell \big\|^2 \, d\mu(z).$$ [sent-101, score-0.219]
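
When µ is only accessible through samples, this distortion can be estimated empirically; the short sketch below is an illustration under that assumption (the function name distortion and the synthetic data are ours).

```python
import numpy as np

def distortion(w, z):
    """Empirical counterpart of C_mu(w) = 0.5 * E[ min_l ||Z - w_l||^2 ],
    estimated from samples z of shape (n, d); w has shape (kappa, d)."""
    sq_dists = ((z[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)   # shape (n, kappa)
    return 0.5 * sq_dists.min(axis=1).mean()

rng = np.random.default_rng(1)
z = rng.normal(size=(10_000, 2))      # i.i.d. samples standing in for mu
w = rng.normal(size=(5, 2))           # kappa = 5 prototypes
print(distortion(w, z))
```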

30 Much attention has been devoted to the convergence study of the quantization scheme provided by the empirical minimizers $w^{\circ}_n \in \operatorname{argmin}_{w \in (\mathbb{R}^d)^{\kappa}} C_{\mu_n}(w)$, where $\mu_n$ is the empirical measure $\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{z_i \in A\}}$. [sent-111, score-0.347]

31 The almost sure convergence of $C_\mu(w^{\circ}_n)$ towards $\min_{w \in (\mathbb{R}^d)^{\kappa}} C_\mu(w)$ was proved by Pollard (1981, 1982a) and Abaya and Wise (1984). [sent-112, score-0.183]
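
The empirical minimizer above is what batch k-means (Lloyd's algorithm) attempts to approximate; the generic sketch below illustrates this on invented data, with no claim of reaching the global minimizer.

```python
import numpy as np

def lloyd(z, kappa, n_iter=50, seed=0):
    """Batch k-means: alternate nearest-prototype assignment and centroid updates,
    which (locally) decreases the empirical distortion C_{mu_n}."""
    rng = np.random.default_rng(seed)
    w = z[rng.choice(len(z), size=kappa, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((z[:, None, :] - w[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for l in range(kappa):
            if np.any(labels == l):           # leave a prototype unchanged if its cell is empty
                w[l] = z[labels == l].mean(axis=0)
    return w

rng = np.random.default_rng(6)
z = np.vstack([rng.normal(loc=c, size=(300, 2)) for c in (-4.0, 0.0, 4.0)])
w_hat = lloyd(z, kappa=3)             # a local minimizer of the empirical distortion
```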

32 Another celebrated quantization algorithm is the competitive learning vector quantization (CLVQ), also called on-line k-means. [sent-126, score-0.438]

33 The CLVQ procedure can be seen as a stochastic gradient descent algorithm. [sent-130, score-0.184]

34 In the more general context of gradient descent methods, one cannot hope for convergence of the procedure towards global minimizers with a non-convex objective function (see for instance Benveniste et al. [sent-131, score-0.267]

35 In our quantization context, the distortion mapping Cµ is not convex (see for instance Graf and Luschgy 2000). [sent-133, score-0.368]

36 Assuming that the distribution µ has a compact support and a bounded density with respect to the Lebesgue measure, Pagès (1997) states a result regarding the almost sure consistency of the CLVQ algorithm towards critical points of the distortion Cµ. [sent-135, score-0.287]

37 The main difficulties in the proof arise from the fact that the gradient of the distortion is singular on κ-tuples having equal components and the distortion function Cµ is not convex. [sent-137, score-0.346]

38 The next proposition states the differentiability of the distortion C, and provides an explicit formula for the gradient ∇C whenever the distortion is differentiable. [sent-160, score-0.385]

39 This leads to the following stochastic gradient descent procedure:
$$w(t+1) = w(t) - \varepsilon_{t+1}\, H\big(z_{t+1}, w(t)\big), \qquad t \ge 0, \qquad (3)$$
where $w(0) \in \mathring{G}^{\kappa} \cap D^{*}_{\kappa}$ and $z_1, z_2, \ldots$ [sent-187, score-0.184]
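
A minimal sketch of procedure (3), that is, on-line k-means with a decreasing step, is given below; the initialization from data points and the step-size schedule eps0/t are placeholder choices, not the paper's exact assumptions.

```python
import numpy as np

def clvq(samples, kappa, eps0=0.5, seed=0):
    """Competitive learning vector quantization (on-line k-means):
    w(t+1) = w(t) - eps_{t+1} * H(z_{t+1}, w(t)), moving only the winning prototype."""
    rng = np.random.default_rng(seed)
    w = samples[rng.choice(len(samples), size=kappa, replace=False)].copy()
    for t, z in enumerate(samples, start=1):
        eps = eps0 / t                                     # one admissible-looking schedule
        winner = np.argmin(((w - z) ** 2).sum(axis=1))
        w[winner] -= eps * (w[winner] - z)                 # only the closest prototype moves
    return w

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(loc=c, size=(500, 2)) for c in (-3.0, 0.0, 3.0)])
prototypes = clvq(rng.permutation(data), kappa=3)
```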

40 As outlined in Pagès (1997), this algorithm belongs to the class of stochastic gradient descent methods. [sent-193, score-0.184]

41 The following assumption set is standard in a gradient descent context. [sent-196, score-0.185]

42 , (5) The next theorem is, as far as we know, the first almost sure convergence theorem for the stochastic algorithm CLVQ. [sent-241, score-0.269]

43 Recall that, without more assumptions than $w(0) \in \mathring{G}^{\kappa} \cap D^{*}_{\kappa}$, we have already discussed the fact that the components of w(t) are almost surely parted for all t ≥ 0. [sent-248, score-0.232]

44 Thus, it is easily seen that the two following events only differ on a set of zero probability:
$$\Big\{ \liminf_{t \to \infty} \operatorname{dist}\big(w(t), \complement D^{*}_{\kappa}\big) > 0 \Big\} \quad \text{and} \quad \Big\{ \inf_{t \ge 0} \operatorname{dist}\big(w(t), \complement D^{*}_{\kappa}\big) > 0 \Big\}.$$ [sent-249, score-0.166]

45 General Distributed Asynchronous Algorithm. In this section, we present some materials and results from the theory of asynchronous parallel linear algorithms. [sent-252, score-0.322]

46 The aim of this section is to discuss a precise mathematical description of a distributed asynchronous model for the iterations (6). [sent-258, score-0.396]

47 Assume that we have at our disposal a distributed architecture with M computing entities called processors (or agents; see for instance Bullo et al. [sent-262, score-0.217]

48 Throughout the paper, we will add the superscript i to the variables possessed by processor i. [sent-268, score-0.169]

49 Thus, for agent i, such iterations are represented by the $(\mathbb{R}^d)^{\kappa}$-valued sequence $\{w^i(t)\}_{t=0}^{\infty}$. [sent-270, score-0.239]

50 We define the general distributed asynchronous algorithm by the following iterations:
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big) + s^i(t), \qquad i \in \{1, \ldots, M\}. \qquad (7)$$ [sent-287, score-0.522]
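
The sketch below simulates a few rounds of iteration (7) with zero descent terms, random bounded delays, and uniform mixing weights; these modelling choices are illustrative assumptions only.

```python
import numpy as np

def async_step(history, a, tau, s, i):
    """w^i(t+1) = sum_j a[i, j] * w^j(tau[i, j]) + s^i(t), cf. iteration (7)."""
    M = a.shape[0]
    return sum(a[i, j] * history[j][tau[i, j]] for j in range(M)) + s[i]

rng = np.random.default_rng(3)
M, dim, T = 3, 4, 20
history = [[rng.normal(size=dim)] for _ in range(M)]       # the initial versions w^i(0)
for t in range(T):
    a = np.full((M, M), 1.0 / M)                           # stochastic mixing weights
    tau = rng.integers(low=max(0, t - 2), high=t + 1, size=(M, M))   # bounded, outdated times
    s = [np.zeros(dim) for _ in range(M)]                  # zero descent terms in this toy run
    new = [async_step(history, a, tau, s, i) for i in range(M)]
    for i in range(M):
        history[i].append(new[i])
# With zero descent terms this is the agreement algorithm: the versions get close to a consensus.
print([np.round(history[i][-1], 3) for i in range(M)])
```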

51 The model can be interpreted as follows: at time t ≥ 0, processor i receives messages from other processors containing $w^j(\tau^{i,j}(t))$. [sent-291, score-0.243]

52 In Section 4, when we define the distributed asynchronous learning vector quantization (DALVQ), the definition of the descent terms will be made more explicit. [sent-302, score-0.669]

53 (2005), for a natural simplification of the general distributed asynchronous algorithm (7). [sent-305, score-0.36]

54 (2005) define two sets of assumptions that enforce some weak properties on the communication delays and the network topology. [sent-314, score-0.168]

55 The agreement algorithm (9) is essentially driven by the communication times $\tau^{i,j}(t)$, which are assumed to be deterministic but need not be known a priori by the processors. [sent-321, score-0.174]

56 [Figure 3 (global time reference): illustration of the time delays introduced in the general distributed asynchronous algorithm.] The following Assumption [sent-322, score-0.44]

57 3 essentially ensures, in its third statement, that the communication delays $t - \tau^{i,j}(t)$ are bounded. [sent-326, score-0.168]

58 This assumption prevents a processor from taking into account arbitrarily old values computed by other processors. [sent-327, score-0.186]

59 The set $\{ j \in \{1, \ldots, M\} \mid a^{i,j}(t) \neq 0 \}$ corresponds to the set of agents whose version is taken into account by processor i at time t. [sent-372, score-0.168]

60 The communication patterns, sometimes referred to as the network communication topology, can be expressed in terms of a directed graph. [sent-380, score-0.2]

61 Definition 5 (Communication graph) Let us fix t ≥ 0; the communication graph at time t, (V, E(t)), is defined by • the set of vertices V is formed by the set of processors, V = {1, . . . , M}; [sent-382, score-0.192]
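
As a small illustration of Definition 5, the edge set E(t) can be read off the weight matrix a(t): under the natural reading of the surrounding sentences, there is an edge from j to i whenever a^{i,j}(t) is nonzero. The helper below is a hypothetical utility, not code from the paper.

```python
import numpy as np

def communication_graph(a, tol=0.0):
    """Directed edge set E(t) = {(j, i) : a[i, j] != 0, j != i} read off the weight matrix."""
    M = a.shape[0]
    return {(j, i) for i in range(M) for j in range(M)
            if j != i and abs(a[i, j]) > tol}

a_t = np.array([[0.5, 0.5, 0.0],
                [0.0, 1.0, 0.0],
                [0.3, 0.0, 0.7]])
print(communication_graph(a_t))       # {(1, 0), (0, 2)}: 0 listens to 1, 2 listens to 0
```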

62 Theorem 6 expresses the fact that, for the agreement algorithm, a consensus is asymptotically reached by the agents. [sent-410, score-0.196]

63 3 Asymptotic Consensus. This subsection is devoted to the analysis of the general distributed asynchronous algorithm (7). [sent-421, score-0.389]

64 The following lemma states that the version possessed by agent i ∈ {1, . [sent-423, score-0.16]

65 , M} at time t ≥ 0, namely $w^i(t)$, depends linearly on the other initialization vectors $w^j(0)$ and the descent subsequences $\{ s^j(\tau) \}_{\tau=-1}^{t-1}$, where $j \in \{1, \ldots, M\}$. [sent-426, score-0.216]

66 , M}^2 and t ≥ 0, the real-valued sequences $\{ \phi^{i,j}(t, \tau) \}_{\tau=-1}^{t-1}$ do not depend on the values taken by the descent terms $s^i(t)$. [sent-437, score-0.162]

67 Distributed Asynchronous Learning Vector Quantization. This section is devoted to the distributed asynchronous learning vector quantization techniques. [sent-477, score-0.608]

68 [Figure 4 (global time reference): the agreement vector at time t′, $w^{\star}(t')$, corresponds to the common value asymptotically achieved by all processors if computations integrating descent terms have stopped after t′, that is, $s^j(t) = 0$ for all $t \ge t'$.] [sent-479, score-0.305]
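
Numerically, the agreement vector can be approximated by switching the descent terms off and iterating the mixing step until the versions stop moving; the sketch below does this under made-up, doubly stochastic weights.

```python
import numpy as np

def agreement_vector(versions, a, n_iter=200):
    """Approximate the common limit reached once the descent terms are switched off:
    iterate w^i <- sum_j a[i, j] * w^j until (numerical) consensus."""
    w = [v.copy() for v in versions]
    for _ in range(n_iter):
        w = [sum(a[i, j] * w[j] for j in range(len(w))) for i in range(len(w))]
    return w[0]                                            # all versions coincide in the limit

rng = np.random.default_rng(4)
versions = [rng.normal(size=3) for _ in range(4)]          # the w^i(t') of M = 4 processors
a = np.full((4, 4), 0.25)                                  # doubly stochastic mixing weights
print(agreement_vector(versions, a))                       # here: the average of the w^i(t')
```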

69 (1986) and Bertsekas and Tsitsiklis (1989) studied distributed asynchronous stochastic gradient optimization algorithms. [sent-485, score-0.454]

70 In this series of publications, for the distributed minimization of a cost function $F \colon (\mathbb{R}^d)^{\kappa} \to \mathbb{R}$, the authors considered the general distributed asynchronous algorithm defined by Equation (7) with specific choices for the stochastic descent terms $s^i$. [sent-486, score-0.596]

71 Using the notation of Section 3, the algorithm writes
$$w^i(t+1) = \sum_{j=1}^{M} a^{i,j}(t)\, w^j\big(\tau^{i,j}(t)\big) + s^i(t), \qquad i \in \{1, \ldots, M\}$$ [sent-487, score-0.199]

72 and t ≥ 0, with stochastic descent terms $s^i(t)$ satisfying $\mathbb{E}\big[ s^i(t) \,\big|\, s^j(\tau),\ j \in \{1, \ldots$ [sent-490, score-0.21]

73 As discussed in Section 2, the CLVQ algorithm is also a stochastic gradient descent procedure. [sent-505, score-0.184]

74 to the context of vector quantization and on-line clustering. [sent-509, score-0.219]

75 We first introduce the distributed asynchronous learning vector quantization (DALVQ) algorithm. [sent-510, score-0.579]

76 To prove its almost sure consistency, we will need an asynchronous G-lemma, which is inspired by the G-lemma, Theorem 3, presented in Section 2. [sent-511, score-0.405]

77 This theorem may be seen as an easy-to-apply tool for the almost sure consistency of a distributed asynchronous system where the average function is not necessarily regular. [sent-512, score-0.503]

78 Our approach also sheds some new light on the convergence of distributed asynchronous stochastic gradient descent algorithms. [sent-513, score-0.589]

79 Similarly to Pagès (1997), who assumes that the trajectory of the CLVQ algorithm has almost surely asymptotically parted components (see Theorem 4 in Section 2), we will suppose that the agreement vector sequence has, almost surely, asymptotically parted component trajectories. [sent-518, score-0.439]

80 The data sets (or streams of data) are distributed among several queues sending data to the different processors of our distributed framework. [sent-520, score-0.23]

81 Thus, the DALVQ procedure is defined by Equation (7) with the following choice for the descent term $s^i$:
$$s^i(t) = \begin{cases} -\varepsilon^{i}_{t+1}\, H\big(z^{i}_{t+1}, w^i(t)\big) & \text{if } t \in T^i; \\ 0 & \text{otherwise;} \end{cases} \qquad (12)$$
where $\{\varepsilon^{i}_t\}_{t=0}^{\infty}$ are (0, 1)-valued sequences. [sent-543, score-0.29]
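
A hedged sketch of this descent-term choice is given below; active_times stands in for the sets T i, the nearest-prototype term H is spelled out inline, and all names are ours.

```python
import numpy as np

def descent_term(t, i, w_i, z, eps, active_times):
    """Descent term s^i(t) of Equation (12): a CLVQ step when t is in T^i, zero otherwise."""
    kappa, d = w_i.shape
    s = np.zeros((kappa, d))
    if t not in active_times[i]:                  # processor i is idle at time t
        return s
    winner = np.argmin(((w_i - z) ** 2).sum(axis=1))
    s[winner] = -eps * (w_i[winner] - z)          # -eps^i_{t+1} * H(z^i_{t+1}, w^i(t))
    return s

rng = np.random.default_rng(5)
active_times = {0: {0, 2, 3}, 1: {1, 2}}          # toy stand-ins for the sets T^i
w0 = rng.normal(size=(3, 2))
print(descent_term(t=1, i=0, w_i=w0, z=rng.normal(size=2),
                   eps=0.1, active_times=active_times))    # all zeros: processor 0 idle at t = 1
```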

82 The sets T i contain the time instants where the version wi , kept by processor i, is updated with the descent terms. [sent-544, score-0.39]

83 This fine grain description of the algorithm allows some processors to be idle for computing descent terms (when $t \notin T^i$). [sent-545, score-0.194]

84 This reflects the fact that the computing operations might not take the same time for all processors, which is precisely the core of the analysis of asynchronous algorithms. [sent-546, score-0.297]

85 This assumption is satisfied, for example, by taking the current value of $\varepsilon^{i}_{t}$ proportional to $1/n^{i}_{t}$, where $n^{i}_{t}$ is the number of times that processor i has performed an update, that is, the cardinality of the set $T^{i} \cap \{0, \ldots, t\}$. [sent-566, score-0.21]
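
One simple way to realize such a schedule is to let every processor count its own updates, as in the toy helper below (an illustration, not the paper's implementation).

```python
class LocalStepSize:
    """eps^i_t proportional to 1 / n^i_t, where n^i_t counts the updates made by processor i."""

    def __init__(self, M, scale=1.0):
        self.scale = scale
        self.n_updates = [0] * M          # n^i_t for each processor i

    def step(self, i):
        """Record one update of processor i and return the step size to use for it."""
        self.n_updates[i] += 1
        return self.scale / self.n_updates[i]

sched = LocalStepSize(M=2, scale=0.5)
print([sched.step(0) for _ in range(3)])  # [0.5, 0.25, 0.166...]
print(sched.step(1))                      # 0.5: processor 1 runs on its own counter
```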

86 2 The Asynchronous G-lemma The aim of this subsection is to state a useful theorem similar to Theorem 3, but adapted to our asynchronous distributed context. [sent-579, score-0.395]

87 The reader should keep in mind that the vector w⋆(t) is also the asymptotic consensus if the descent terms are zero after time t. [sent-581, score-0.175]

88 Nevertheless, the agreement vector w⋆ (t) can be interpreted as a “probabilistic state” of the whole distributed quantization scheme at time t. [sent-583, score-0.368]

89 We are now in a position to state our most useful tool, which is similar in spirit to the G-lemma, but adapted to the context of distributed asynchronous stochastic gradient descent algorithms. [sent-599, score-0.544]

90 Thus, Equation (7) takes the form
$$w^i(t+1) = \begin{cases} w^i(t) - \varepsilon^{i}_{t+1}\big(w^i(t) - z^{i}_{t+1}\big) = \big(1 - \varepsilon^{i}_{t+1}\big)\, w^i(t) + \varepsilon^{i}_{t+1}\, z^{i}_{t+1} & \text{if } t \in T^i; \\ w^i(t) & \text{otherwise.} \end{cases}$$ [sent-631, score-0.63]
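
In this form, each active update is a convex combination of the current version and the fresh sample, so the iterate cannot leave a convex compact set containing the data; the toy check below (a single prototype and invented uniform data on the unit square) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(7)
samples = rng.uniform(low=0.0, high=1.0, size=(1000, 2))   # data supported on the unit square
w = samples[0].copy()
for t, z in enumerate(samples[1:], start=1):
    eps = 1.0 / (t + 1)
    w = (1 - eps) * w + eps * z            # convex combination of the old version and the sample
assert np.all((0.0 <= w) & (w <= 1.0))     # the iterate never leaves the support
```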

91 The next Lemma 11 provides a deterministic upper bound on the differences between the distributed versions $w^i$ and the agreement vector. [sent-636, score-0.212]

92 $$w^i(t) - w^j(t) \xrightarrow[t \to \infty]{} 0. \qquad (18)$$ This shows that the trajectories of the distributed versions of the quantizers asymptotically reach a consensus with probability 1. [sent-656, score-0.369]
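
In simulations, property (18) can be monitored by tracking the largest pairwise distance between the versions; the small helper below (our naming) sketches such a diagnostic.

```python
import numpy as np
from itertools import combinations

def consensus_gap(versions):
    """max over pairs (i, j) of ||w^i(t) - w^j(t)||: should tend to 0 under consensus."""
    return max(np.linalg.norm(wi - wj) for wi, wj in combinations(versions, 2))

versions = [np.array([1.0, 2.0]), np.array([1.1, 2.0]), np.array([0.9, 1.9])]
print(consensus_gap(versions))             # the largest disagreement among the three versions
```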

93 In other words, if one of the sequences $\{w^i(t)\}_{t=0}^{\infty}$ converges, then they all converge towards the same value. [sent-657, score-0.191]

94 Its proof is based on the asynchronous G-lemma, Theorem 10. [sent-667, score-0.297]

95 We are now in a position to state the main theorem of this paper, which expresses the convergence of the distributed version towards some zero of the gradient of the distortion. [sent-686, score-0.221]

96 5 Annex. Sketch of the proof of the asynchronous G-lemma, Theorem 10. [sent-711, score-0.297]

97 Consequently,
$$\big\| w^{\star}(t) - w^{i}(t) \big\| = \sqrt{\sum_{\ell=1}^{\kappa} \big\| w^{i}_{\ell}(t) - w^{\star}_{\ell}(t) \big\|^{2}} \;\le\; \sqrt{\kappa}\, M \operatorname{diam}(G)\, A K_{2} \sum_{\tau=-1}^{t-1} \frac{\rho^{\,t-\tau}}{\tau \vee 1}.$$ [sent-750, score-0.252]

98 For any $t \ge t_{\delta}$,
$$\begin{aligned}
\big\| \alpha w^{i}_{k}(t) + (1-\alpha) w^{\star}_{k}(t) - \alpha w^{i}_{\ell}(t) - (1-\alpha) w^{\star}_{\ell}(t) \big\|
&= \big\| w^{\star}_{k}(t) - w^{\star}_{\ell}(t) + \alpha \big( w^{i}_{k}(t) - w^{\star}_{k}(t) \big) + \alpha \big( w^{\star}_{\ell}(t) - w^{i}_{\ell}(t) \big) \big\| \\
&\ge \big\| w^{\star}_{k}(t) - w^{\star}_{\ell}(t) \big\| - \alpha \big\| w^{i}_{k}(t) - w^{\star}_{k}(t) \big\| - \alpha \big\| w^{\star}_{\ell}(t) - w^{i}_{\ell}(t) \big\| \\
&\ge \delta - 2\alpha \frac{\delta}{4} \;\ge\; \delta/2.
\end{aligned}$$ [sent-784, score-0.244]

99 For all $z \in G$ and all $t \ge 0$, we have $\big\| H\big(z, w^{\star}(t)\big) \big\| \le \sqrt{\kappa}\, \operatorname{diam}(G)$, almost surely, whereas $\{ h(w^{\star}(t)) \}_{t=0}^{\infty}$ satisfies $h\big(w^{\star}(t)\big) = \mathbb{E}\big\{ H\big(z, w^{\star}(t)\big) \mid \mathcal{F}_t \big\}$, $t \ge 0$, a.s. [sent-845, score-0.22]

100 The distortion of vector quantizers trained on n vectors decreases to the optimum at $O_p(1/n)$. [sent-948, score-0.219]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('clvq', 0.318), ('asynchronous', 0.297), ('pag', 0.26), ('asy', 0.24), ('patra', 0.235), ('quantization', 0.219), ('dalvq', 0.212), ('uantization', 0.2), ('synchronous', 0.18), ('diam', 0.173), ('istributed', 0.17), ('tsitsiklis', 0.154), ('distortion', 0.149), ('ector', 0.14), ('processor', 0.139), ('rd', 0.135), ('wi', 0.126), ('mt', 0.114), ('fort', 0.106), ('onvergence', 0.106), ('processors', 0.104), ('zt', 0.098), ('descent', 0.09), ('ft', 0.089), ('communication', 0.088), ('agreement', 0.086), ('consensus', 0.085), ('lgorithms', 0.084), ('dist', 0.083), ('delays', 0.08), ('blondel', 0.08), ('agent', 0.077), ('parted', 0.071), ('lloyd', 0.07), ('quantizers', 0.07), ('surely', 0.067), ('distributed', 0.063), ('ai', 0.063), ('sure', 0.061), ('kohonen', 0.059), ('clock', 0.054), ('minimizers', 0.054), ('lemma', 0.053), ('vorono', 0.05), ('earning', 0.05), ('gradient', 0.048), ('assumption', 0.047), ('almost', 0.047), ('benveniste', 0.047), ('stochastic', 0.046), ('ltration', 0.045), ('convergence', 0.045), ('bertsekas', 0.042), ('linder', 0.041), ('proposition', 0.039), ('si', 0.037), ('iterations', 0.036), ('statement', 0.036), ('writes', 0.036), ('antos', 0.035), ('bullo', 0.035), ('frasca', 0.035), ('inft', 0.035), ('instants', 0.035), ('sabin', 0.035), ('sequences', 0.035), ('theorem', 0.035), ('ru', 0.033), ('quantizer', 0.033), ('wk', 0.033), ('stands', 0.031), ('graf', 0.03), ('possessed', 0.03), ('towards', 0.03), ('agents', 0.029), ('devoted', 0.029), ('bottou', 0.028), ('architecture', 0.026), ('wt', 0.026), ('asymptotically', 0.025), ('parallel', 0.025), ('pollard', 0.025), ('zi', 0.025), ('clustering', 0.024), ('abaya', 0.024), ('asynchronism', 0.024), ('beno', 0.024), ('cardinal', 0.024), ('carli', 0.024), ('chou', 0.024), ('communicates', 0.024), ('dispose', 0.024), ('gossip', 0.024), ('homothety', 0.024), ('inaba', 0.024), ('luschgy', 0.024), ('organizing', 0.024), ('refereed', 0.024), ('tessellation', 0.024), ('martingale', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

2 0.11001399 36 jmlr-2011-Generalized TD Learning

Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii

Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function

3 0.096956827 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

4 0.070286557 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

5 0.059048586 14 jmlr-2011-Better Algorithms for Benign Bandits

Author: Elad Hazan, Satyen Kale

Abstract: The online multi-armed bandit problem and its generalizations are repeated decision making problems, where the goal is to select one of several possible decisions in every round, and incur a cost associated with the decision, in such a way that the total cost incurred over all iterations is close to the cost of the best fixed decision in hindsight. The difference in these costs is known as the regret of the algorithm. The term bandit refers to the setting where one only obtains the cost of the decision used in a given iteration and no other information. A very general form of this problem is the non-stochastic bandit linear optimization problem, where the set of decisions is a convex set in some√ Euclidean space, and the cost functions are linear. ˜ Only recently an efficient algorithm attaining O( T ) regret was discovered in this setting. In this paper we propose a new algorithm for the bandit linear optimization problem which √ ˜ obtains a tighter regret bound of O( Q), where Q is the total variation in the cost functions. This regret bound, previously conjectured to hold in the full information case, shows that it is possible to incur much less regret in a slowly changing environment even in the bandit setting. Our algorithm is efficient and applies several new ideas to bandit optimization such as reservoir sampling. Keywords: multi-armed bandit, regret minimization, online learning

6 0.05143236 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

7 0.049079008 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization

8 0.047464404 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

9 0.044235874 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

10 0.040891789 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

11 0.036952738 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.034483958 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

13 0.033271533 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

14 0.032950565 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

15 0.0328997 104 jmlr-2011-X-Armed Bandits

16 0.032282244 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

17 0.031762902 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

18 0.031714749 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

19 0.031586323 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

20 0.030788705 16 jmlr-2011-Clustering Algorithms for Chains


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.172), (1, 0.133), (2, -0.059), (3, -0.022), (4, -0.009), (5, 0.009), (6, 0.002), (7, 0.016), (8, -0.056), (9, 0.073), (10, -0.013), (11, 0.012), (12, -0.01), (13, 0.08), (14, -0.021), (15, -0.078), (16, 0.107), (17, -0.081), (18, -0.028), (19, 0.011), (20, 0.093), (21, -0.166), (22, -0.061), (23, -0.027), (24, 0.026), (25, -0.39), (26, -0.311), (27, 0.066), (28, 0.094), (29, 0.215), (30, -0.313), (31, 0.047), (32, 0.173), (33, 0.003), (34, -0.046), (35, -0.049), (36, 0.028), (37, -0.01), (38, 0.01), (39, -0.045), (40, 0.04), (41, 0.037), (42, -0.008), (43, -0.006), (44, -0.011), (45, -0.045), (46, 0.018), (47, 0.072), (48, -0.032), (49, -0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94444746 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

2 0.61266834 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure

Author: Yoshinori Tamada, Seiya Imoto, Satoru Miyano

Abstract: We present a parallel algorithm for the score-based optimal structure search of Bayesian networks. This algorithm is based on a dynamic programming (DP) algorithm having O(n · 2n ) time and space complexity, which is known to be the fastest algorithm for the optimal structure search of networks with n nodes. The bottleneck of the problem is the memory requirement, and therefore, the algorithm is currently applicable for up to a few tens of nodes. While the recently proposed algorithm overcomes this limitation by a space-time trade-off, our proposed algorithm realizes direct parallelization of the original DP algorithm with O(nσ ) time and space overhead calculations, where σ > 0 controls the communication-space trade-off. The overall time and space complexity is O(nσ+1 2n ). This algorithm splits the search space so that the required communication between independent calculations is minimal. Because of this advantage, our algorithm can run on distributed memory supercomputers. Through computational experiments, we confirmed that our algorithm can run in parallel using up to 256 processors with a parallelization efficiency of 0.74, compared to the original DP algorithm with a single processor. We also demonstrate optimal structure search for a 32-node network without any constraints, which is the largest network search presented in literature. Keywords: optimal Bayesian network structure, parallel algorithm

3 0.45525765 36 jmlr-2011-Generalized TD Learning

Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii

Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function

4 0.28484491 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

5 0.27779055 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms

Author: Adam D. Bull

Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization

6 0.2622529 10 jmlr-2011-Anechoic Blind Source Separation Using Wigner Marginals

7 0.21715991 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

8 0.21382917 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

9 0.19796965 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation

10 0.19494405 98 jmlr-2011-Universality, Characteristic Kernels and RKHS Embedding of Measures

11 0.18230243 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

12 0.1710663 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

13 0.15890202 104 jmlr-2011-X-Armed Bandits

14 0.15750821 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

15 0.15592502 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

16 0.15384436 14 jmlr-2011-Better Algorithms for Benign Bandits

17 0.14922409 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

18 0.1480529 16 jmlr-2011-Clustering Algorithms for Chains

19 0.14660107 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

20 0.14024191 91 jmlr-2011-The Sample Complexity of Dictionary Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.041), (9, 0.026), (10, 0.06), (24, 0.035), (31, 0.079), (32, 0.019), (41, 0.039), (57, 0.41), (60, 0.014), (65, 0.014), (67, 0.012), (71, 0.011), (73, 0.023), (78, 0.094), (90, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.6726864 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

Author: Benoît Patra

Abstract: Motivated by the problem of effectively executing clustering algorithms on very large data sets, we address a model for large scale distributed clustering methods. To this end, we briefly recall some standards on the quantization problem and some results on the almost sure convergence of the competitive learning vector quantization (CLVQ) procedure. A general model for linear distributed asynchronous algorithms well adapted to several parallel computing architectures is also discussed. Our approach brings together this scalable model and the CLVQ algorithm, and we call the resulting technique the distributed asynchronous learning vector quantization algorithm (DALVQ). An indepth analysis of the almost sure convergence of the DALVQ algorithm is performed. A striking result is that we prove that the multiple versions of the quantizers distributed among the processors in the parallel architecture asymptotically reach a consensus almost surely. Furthermore, we also show that these versions converge almost surely towards the same nearly optimal value for the quantization criterion. Keywords: k-means, vector quantization, distributed, asynchronous, stochastic optimization, scalability, distributed consensus

2 0.33365116 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir

Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the “local budget” setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier “global budget” setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, “prediction on a budget” setting, when the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm has a good performance against both partial information and full information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory

3 0.33295417 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

4 0.32836318 104 jmlr-2011-X-Armed Bandits

Author: Sébastien Bubeck, Rémi Munos, Gilles Stoltz, Csaba Szepesvári

Abstract: We consider a generalization of stochastic bandits where the set of arms, X , is allowed to be a generic measurable space and the mean-payoff function is “locally Lipschitz” with respect to a dissimilarity function that is known to the decision maker. Under this condition we construct an arm selection policy, called HOO (hierarchical optimistic optimization), with improved regret bounds compared to previous results for a large class of problems. In particular, our results imply that if X is the unit hypercube in a Euclidean space and the mean-payoff function has a finite number of global maxima around which the behavior of the function is locally continuous with a known √ smoothness degree, then the expected regret of HOO is bounded up to a logarithmic factor by n, that is, the rate of growth of the regret is independent of the dimension of the space. We also prove the minimax optimality of our algorithm when the dissimilarity is a metric. Our basic strategy has quadratic computational complexity as a function of the number of time steps and does not rely on the doubling trick. We also introduce a modified strategy, which relies on the doubling trick but runs in linearithmic time. Both results are improvements with respect to previous approaches. Keywords: bandits with infinitely many arms, optimistic online optimization, regret bounds, minimax rates

5 0.32742018 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

6 0.32690525 12 jmlr-2011-Bayesian Co-Training

7 0.32556587 91 jmlr-2011-The Sample Complexity of Dictionary Learning

8 0.32472655 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation

9 0.32368997 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

10 0.32276309 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

11 0.32255465 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.32087722 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

13 0.32063392 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

14 0.32025453 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

15 0.31992397 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

16 0.31971973 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling

17 0.31966752 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

18 0.31965211 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices

19 0.31905904 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

20 0.31889766 59 jmlr-2011-Learning with Structured Sparsity