nips nips2013 nips2013-325 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Wouter M. Koolen
Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. [sent-5, score-0.745]
2 In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. [sent-6, score-0.57]
3 We study which such regret trade-offs can be achieved, and how. [sent-7, score-0.293]
4 We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. [sent-11, score-0.337]
5 We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. [sent-12, score-0.411]
6 1 Introduction One of the central problems studied in online learning is prediction with expert advice. [sent-13, score-0.265]
7 The learner needs to make a sequence of T decisions with the objective of performing as well as the best expert in hindsight. [sent-15, score-0.229]
8 Hedge [1] with learning rate $\eta = \sqrt{(8/T)\ln K}$ guarantees $L_T - L_T^k \le \sqrt{(T/2)\ln K}$ for each expert k. (1) [sent-19, score-0.489]
9 In (1), $L_T$ and $L_T^k$ are the cumulative losses of the learner and of expert k after all T rounds. [sent-20, score-0.379]
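As an illustration of the bound, the following is a minimal sketch of Hedge under absolute loss with the learning rate from (1); it is a hedged reconstruction from the bound above, not the paper's code, and the function name and inputs are illustrative placeholders:

    import math

    def hedge_regrets(expert_preds, outcomes):
        """Run Hedge and return the regret w.r.t. each expert.
        expert_preds[t][k] in [0, 1], outcomes[t] in {0, 1}."""
        T, K = len(outcomes), len(expert_preds[0])
        eta = math.sqrt(8 * math.log(K) / T)   # learning rate from (1)
        w = [1.0 / K] * K                      # uniform initial weights
        learner_loss, expert_loss = 0.0, [0.0] * K
        for t in range(T):
            p = sum(w[k] * expert_preds[t][k] for k in range(K))  # weighted prediction
            x = outcomes[t]
            learner_loss += abs(p - x)
            losses = [abs(expert_preds[t][k] - x) for k in range(K)]
            expert_loss = [expert_loss[k] + losses[k] for k in range(K)]
            z = sum(w[k] * math.exp(-eta * losses[k]) for k in range(K))
            w = [w[k] * math.exp(-eta * losses[k]) / z for k in range(K)]
        # (1) promises that the maximum of these is at most sqrt((T/2) * ln K)
        return [learner_loss - lk for lk in expert_loss]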
10 For it is not always desirable to have a uniform regret bound w.r.t. all experts. [sent-22, score-0.293]
11 Instead, we may want to single out a few special experts and demand to be really close to them, at the cost of increased overhead compared to the rest. [sent-26, score-0.218]
12 When the number of experts K is large or infinite, such favouritism even seems unavoidable for non-trivial regret bounds. [sent-27, score-0.463]
13 The typical proof of the regret bound (1) suggests that the following can be guaranteed as well. [sent-28, score-0.293]
14 For each choice of probability distribution q on experts, there is an algorithm that guarantees $L_T - L_T^k \le \sqrt{(T/2)(-\ln q(k))}$ for each expert k. (2) [sent-29, score-0.391]
15 In this paper we study the Pareto (achievable and non-dominated) regret trade-offs. [sent-34, score-0.293]
16 A trade-off $\langle r_1, \ldots, r_K \rangle \in \mathbb{R}^K$ is T-realisable if there is an algorithm that guarantees $L_T - L_T^k \le r_k \sqrt{T}$ for each expert k. [sent-38, score-0.331]
17 And what is the strategy that witnesses these realisable trade-offs? [sent-41, score-0.46]
18 We first obtain an exact characterisation of the set of realisable trade-offs. [sent-44, score-0.38]
19 We then construct for each realisable profile a witnessing strategy. [sent-45, score-0.361]
20 We also give a randomised procedure for optimal play that extends the randomised procedures for balanced regret profiles from [3] and later [4, 5]. [sent-46, score-0.461]
21 We then focus on the relation between priors and regret bounds, to see if the particular form (2) is achievable, and if so, whether it is optimal. [sent-47, score-0.293]
22 To this end, we characterise the asymptotic Pareto frontier as T → ∞. [sent-48, score-0.296]
23 This is of philosophical interest as it hints that approaching absolute loss by essentially reducing it to information theory (including Bayesian and Minimum Description Length methods, relative entropy based optimisation (an instance of Mirror Descent), Defensive Forecasting, etc.) is suboptimal. [sent-50, score-0.26]
24 Finally, we show that our solution for absolute loss equals that of K = 2 experts with bounded linear loss. [sent-52, score-0.421]
25 On the one hand there are the game-theoretic (minimax) approaches to prediction with expert advice. [sent-58, score-0.265]
26 In [6] Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth analysed the minimax strategy for absolute loss with a known time horizon T. [sent-59, score-0.443]
27 In [5] Cesa-Bianchi and Shamir used random walks to implement it efficiently for K = 2 experts or K ≥ 2 static experts. [sent-60, score-0.236]
28 In [7] Abernethy, Langford and Warmuth obtained the optimal strategy for absolute loss with experts that issue binary predictions, now controlling the game complexity by imposing a bound on the loss of the best expert. [sent-62, score-0.676]
29 Even-Dar, Kearns, Mansour and Wortman characterise the achievable tradeoffs when we desire especially small regret compared to a fixed average of the experts’ losses in [10]. [sent-69, score-0.589]
30 An at least tangentially related problem is to ensure smaller regret when there are several good experts. [sent-71, score-0.293]
31 2 Setup The absolute loss game is one of the core decision problems studied in online learning [14]. [sent-73, score-0.255]
32 In each round t ∈ {1, . . . , T} the learner assigns a probability $p_t \in [0, 1]$ to the next outcome being a 1, after which the actual outcome $x_t \in \{0, 1\}$ is revealed, and the learner suffers absolute loss $|p_t - x_t|$. [sent-78, score-0.724]
33 Note that absolute loss equals expected 0/1 loss, that is, the probability of a mistake if a “hard” prediction in {0, 1} is sampled with bias p on 1. [sent-79, score-0.287]
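To make the protocol concrete, here is a minimal sketch of one run of the game; the strategy and adversary callbacks are hypothetical placeholders, not part of the paper:

    def play_absolute_loss_game(strategy, adversary, T):
        """strategy(history) -> p_t in [0, 1]; adversary(history, p) -> x_t in {0, 1}."""
        history = []
        regret0 = regret1 = 0.0  # regrets w.r.t. expert 0 (always 0) and expert 1 (always 1)
        for t in range(T):
            p = strategy(history)        # the learner commits to p_t first
            x = adversary(history, p)    # then the outcome is revealed
            loss = abs(p - x)
            regret0 += loss - x          # expert 0 suffers |0 - x| = x
            regret1 += loss - (1 - x)    # expert 1 suffers |1 - x| = 1 - x
            history.append((p, x))
        return regret0, regret1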
34 Realising that the learner cannot avoid high cumulative loss without assumptions on the origin of the outcomes, the learner’s objective is defined to ensure low cumulative loss compared to a fixed set of baseline strategies. [sent-80, score-0.364]
35 The easier the outcome sequence (i.e. one for which some reference strategy has low loss), the lower the cumulative loss incurred by the learner. [sent-83, score-0.246]
36-38 [Figure 1: Exact regret trade-off profile. Pareto frontiers of $G_T$ for horizons T = 1, . . . , 10, plotting regret w.r.t. expert 1 ($r_1$) against regret w.r.t. expert 0 ($r_0$).] [sent-97, score-0.586]
39 The classical approach is to “scalarise” it into the single objective $R_T := \max_k R_T^k$, that is, to ensure small regret compared to the best expert in hindsight. [sent-102, score-0.522]
40 A candidate trade-off $\langle r_0, r_1 \rangle \in \mathbb{R}^2$ is called T-realisable for the T-round absolute loss game if there is a strategy that keeps the regret w.r.t. each expert k below $r_k$, i.e. [sent-105, score-0.684]
41 if $\exists p_1 \forall x_1 \cdots \exists p_T \forall x_T : R_T^0 \le r_0 \text{ and } R_T^1 \le r_1$, where $p_t \in [0, 1]$ and $x_t \in \{0, 1\}$ in each round t. [sent-110, score-0.246]
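A brute-force sketch of this quantifier formula for tiny horizons; the per-round budget updates are derived from the protocol (on outcome x the learner pays |p − x|, expert 0 pays x, expert 1 pays 1 − x), and the existential over p ∈ [0, 1] is approximated on a grid, so this is illustrative only:

    def realisable(r0, r1, T, grid=51):
        """Approximate check of whether <r0, r1> is T-realisable (exponential in T)."""
        if T == 0:
            return r0 >= 0 and r1 >= 0
        for i in range(grid):
            p = i / (grid - 1)
            # outcome x = 0 grows the regrets by <p, p - 1>;
            # outcome x = 1 grows them by <-p, 1 - p>
            if (realisable(r0 - p, r1 - p + 1, T - 1, grid)
                    and realisable(r0 + p, r1 + p - 1, T - 1, grid)):
                return True
        return False

    # T = 1: the frontier is the segment from <0, 1> to <1, 0>, so
    # realisable(0.5, 0.5, 1) is True while realisable(0.4, 0.4, 1) is False.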
42 3 The exact regret trade-off profile In this section we characterise the set $G_T \subset \mathbb{R}^2$ of T-realisable trade-offs. [sent-116, score-0.408]
43 We show that it is a convex polygon, which we subsequently characterise by its vertices and edges. [sent-117, score-0.209]
44 We also exhibit the optimal strategy witnessing each Pareto optimal trade-off and discuss the connection with random walks. [sent-118, score-0.238]
45 As the absolute loss is linear in the prediction, this strategy guarantees $L_T = \alpha L_T^A + (1-\alpha) L_T^B \le L_T^k + \alpha r_k^A + (1-\alpha) r_k^B$ for each $k \in \{0, 1\}$. [sent-128, score-0.352]
46 A strategy that guarantees $R_T^k \le r_k$ must maintain $R_t^k \le r_k$ for all $0 \le t \le T$. [sent-131, score-0.275]
47 One could define the regret $R_T^k$ for all static reference probabilities $k \in [0, 1]$, but as the loss is minimised by either k = 0 or k = 1, we immediately restrict to only comparing against these two. [sent-132, score-0.431]
48 The static strategy $p_t = 0$ witnesses $\langle 0, T \rangle \in G_T$ for every horizon T. [sent-145, score-0.443]
49 To ensure $R_T^1 < T$, any strategy will have to play $p_t > 0$ at some time $t \le T$. [sent-146, score-0.306]
50 It is also intuitive that maintaining low regret becomes progressively harder with T . [sent-148, score-0.323]
51 The Pareto frontier of $G_T$ is the piecewise linear curve through the T + 1 vertices $\langle f_T(i), f_T(T-i) \rangle$ for $i \in \{0, \ldots, T\}$. [sent-160, score-0.286]
52 Moreover, for T > 0 the optimal strategy at vertex i assigns to the outcome x = 1 the probability $p_T(0) := 0$, $p_T(T) := 1$, and $p_T(i) := \frac{f_{T-1}(i) - f_{T-1}(i-1)}{2}$ for $0 < i < T$, and the optimal probability interpolates linearly in between consecutive vertices. [sent-164, score-0.317]
53 By the induction hypothesis we know that the south-west frontier curve for $G_{T-1}$ is piecewise linear. [sent-171, score-0.259]
54 We will characterise $G_T$ via its frontier as well. [sent-172, score-0.296]
55 For $r_0 = 0$ we find $r_1(0) = T$, with witness $p(0) = 0$ and rear/front contact points $\langle 0, T+1 \rangle$ and $\langle 0, T-1 \rangle$, and for $r_0 = T$ we find $r_1(T) = 0$ with witness $p(T) = 1$ and rear/front contact points $\langle T-1, 0 \rangle$ and $\langle T+1, 0 \rangle$. [sent-175, score-0.492]
56 Initially at $r_0 = 0$ the rear contact point lies on the edge of $G_{T-1}$ entering vertex i = 0 of $G_{T-1}$, while the front contact point lies on the edge emanating from that same vertex. [sent-177, score-0.488]
57 Once we increase $r_0$ enough, both the rear and front contact point will hit the vertex at the end of their edges simultaneously (a fortunate fact that greatly simplifies our analysis), as shown in Lemma 12 (supplementary material). [sent-180, score-0.34]
58 Given that at each such transition $\langle r_0, r_1(r_0) \rangle$ is the midpoint between both contact points, this implies that all midpoints between successive vertices of $G_{T-1}$ are vertices of $G_T$. [sent-183, score-0.264]
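The midpoint observation, together with the endpoints $\langle 0, T \rangle$ and $\langle T, 0 \rangle$ witnessed by the static strategies, yields a direct recursion for the frontier vertices; a small sketch under that reading of the construction:

    from fractions import Fraction

    def frontier_vertices(T):
        """Vertices <f_T(i), f_T(T-i)> of the Pareto frontier of G_T, via the
        midpoint recursion: keep midpoints of successive vertices of G_{T-1}
        and append the endpoints <0, T> and <T, 0>."""
        verts = [(Fraction(0), Fraction(0))]  # G_0: only zero regret is needed
        for t in range(1, T + 1):
            mids = [((a0 + b0) / 2, (a1 + b1) / 2)
                    for (a0, a1), (b0, b1) in zip(verts, verts[1:])]
            verts = [(Fraction(0), Fraction(t))] + mids + [(Fraction(t), Fraction(0))]
        return verts

    # frontier_vertices(2) gives <0,2>, <1/2,1/2>, <2,0>: the promised T + 1 vertices.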
59-62 [Figure 2: Pareto frontier of G, the asymptotically realisable trade-off rates, shown (a) on a normal scale and (b) on a log-log scale to highlight the tail behaviour; curves: T=10000, asymptotically realisable, and sqrt min-log-prior; axes: normalised regret w.r.t. experts 0 and 1. There is no noticeable difference with the normalised regret trade-off profile $G_T/\sqrt{T}$ for T = 10000.] [sent-202, score-0.403]
63 3.1 The optimal strategy and random walks In this section we describe how to follow the optimal strategy. [sent-205, score-0.218]
64 First suppose we desire to witness a T-realisable trade-off that happens to be a vertex of $G_T$, say vertex i at $\langle f_T(i), f_T(T-i) \rangle$. [sent-206, score-0.33]
65 If x = 0, we need to witness in the remaining T − 1 rounds the trade-off $\langle f_T(i), f_T(T-i) \rangle - \langle p_T(i), p_T(i) - 1 \rangle = \langle f_{T-1}(i-1), f_{T-1}(T-i) \rangle$, which is vertex i − 1 of $G_{T-1}$. [sent-209, score-0.216]
66 Second, if we desire to witness a T -realisable trade-off that is a convex combination of successive vertices, we simply follow the mixture strategy as constructed in Lemma 2. [sent-213, score-0.255]
67 Third, if we desire to witness a sub-optimal element of GT , we may follow any strategy that witnesses a Pareto optimal dominating trade-off. [sent-214, score-0.347]
68 The random walks considered in [3] for K ≥ 2 experts still take T steps, whereas direct evaluation of the optimal strategy scales rather badly with K. [sent-224, score-0.349]
69 4 The asymptotic regret rate trade-off profile In the previous section we obtained for each time horizon T a combinatorial characterisation of the set $G_T$ of T-realisable trade-offs. [sent-225, score-0.454]
70 Let us define the set G of asymptotically realisable regret rate trade-offs by $G := \lim_{T \to \infty} G_T/\sqrt{T}$. [sent-229, score-0.701]
71 Each achievable regret rate trade-off $\langle \rho_0, \rho_1 \rangle \in G$ may be witnessed by a different strategy for each T. [sent-231, score-0.518]
72 The literature [2] suggests that, for some constant c, $\langle \sqrt{-c \ln q}, \sqrt{-c \ln(1-q)} \rangle$ should be asymptotically realisable for each $q \in [0, 1]$. [sent-234, score-0.361]
73 We now come to our second main result, the characterisation of the asymptotically realisable tradeoff rates. [sent-237, score-0.437]
74 The Pareto frontier is graphed in Figure 2 both on normal axes for comparison to Figure 1a, and on a log-log scale to show its tails. [sent-238, score-0.224]
75 The Pareto frontier of the set G of asymptotically realisable trade-offs is the curve $\langle f(u), f(-u) \rangle$ for $u \in \mathbb{R}$, where $f(u) := u \operatorname{erf}(\sqrt{2}u) + \frac{e^{-2u^2}}{\sqrt{2\pi}} + u$ and $\operatorname{erf}(u) = \frac{2}{\sqrt{\pi}} \int_0^u e^{-v^2}\,dv$ is the error function. [sent-241, score-0.732]
76 Moreover, the optimal strategy converges to $p(u) = \frac{1 - \operatorname{erf}(\sqrt{2}u)}{2}$. [sent-242, score-0.237]
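A small numeric sketch of the frontier and limiting strategy as stated above (math.erf is Python's standard error function; the formulas are the ones from the theorem):

    import math

    def f(u):
        # f(u) = u * erf(sqrt(2) u) + exp(-2 u^2) / sqrt(2 pi) + u
        return (u * math.erf(math.sqrt(2) * u)
                + math.exp(-2 * u * u) / math.sqrt(2 * math.pi) + u)

    def p(u):
        # limiting optimal probability p(u) = (1 - erf(sqrt(2) u)) / 2
        return (1 - math.erf(math.sqrt(2) * u)) / 2

    # The symmetric point u = 0 gives f(0) = 1/sqrt(2 pi) on both axes and p(0) = 1/2,
    # matching the optimal symmetric regret rate sqrt(T/(2 pi)) quoted in the
    # multi-expert discussion below.
    for u in (-1.0, 0.0, 1.0):
        print(u, f(u), f(-u), p(u))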
77 To obtain the optimal strategy, we observe the following relation between the slope of the Pareto curve and the optimal strategy for each horizon T . [sent-248, score-0.361]
78 Since slope is invariant under normalisation, this relation between slope and optimal strategy becomes exact as T tends to infinity, and we find $\frac{p(u)}{1-p(u)} = \frac{1 - \operatorname{erf}(\sqrt{2}u)}{1 + \operatorname{erf}(\sqrt{2}u)}$, confirming $p(u) = \frac{1 - \operatorname{erf}(\sqrt{2}u)}{2}$. [sent-251, score-0.333]
79 So for large $r_0 \gg 0$, the least realisable $r_1$ is approximately $r_1 \approx \frac{e^{-2r_0^2 - 2\ln r_0}}{\sqrt{2\pi}} = \frac{e^{-2r_0^2}}{r_0^2 \sqrt{2\pi}}$. (3) [sent-268, score-0.434]
80 With the candidate relation $r_0 = \sqrt{-c \ln q}$ and $r_1 = \sqrt{-c \ln(1-q)}$, still for large $r_0 \gg 0$ so that q is small and $-\ln(1-q) \approx q$, we would instead find the least realisable $r_1$ approximately equal to $r_1 \approx \sqrt{c}\, e^{-r_0^2/(2c)}$. (4) [sent-269, score-0.337]
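For completeness, (4) follows in one line from the candidate relation (assuming the square-root form of the rates as written above, and $-\ln(1-q) \approx q$ for small q):

    % derivation of (4)
    r_0 = \sqrt{-c \ln q} \;\Longrightarrow\; q = e^{-r_0^2/c},
    \qquad
    r_1 = \sqrt{-c \ln(1-q)} \approx \sqrt{c\, q} = \sqrt{c}\, e^{-r_0^2/(2c)}.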
81 Even though the sqrt-min-log-prior trade-off is realisable, we see that its tail (4) exceeds the actual tail (3) by the factor $r_0^2 \sqrt{2\pi}$, which gets progressively worse with the extremity of the tail $r_0$. [sent-273, score-0.231]
82 For certain log loss games, each Pareto regret trade-off is witnessed uniquely by the Bayesian mixture of expert predictions w.r.t. some prior. [sent-278, score-0.664]
83 Be that as it may, we conclude that the world of absolute loss is not information theory: simply putting a prior is not the definitive answer to non-uniform guarantees. [sent-283, score-0.217]
84 Conversely, by taking the limit of the game protocol, which involves the absolute loss function, we might obtain an interesting protocol and “asymptotic” loss function, for which u is the natural state, p(u) is the optimal strategy, and u is updated in a certain way. [sent-292, score-0.435]
85 When the Hedge algorithm with learning rate η plays weights w and faces loss vector $\ell$, its dot loss is given by $w^\top \ell$. [sent-296, score-0.256]
86 In the limit of $n \to \infty$, the resulting loss becomes the mix loss $-\frac{1}{\eta} \ln \sum_k w(k) e^{-\eta \ell_k}$. [sent-298, score-0.38]
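A tiny numeric sketch contrasting the two losses; by Jensen's inequality the mix loss never exceeds the dot loss (the names and example numbers are illustrative):

    import math

    def dot_loss(w, losses):
        return sum(wk * lk for wk, lk in zip(w, losses))

    def mix_loss(w, losses, eta):
        return -math.log(sum(wk * math.exp(-eta * lk)
                             for wk, lk in zip(w, losses))) / eta

    w, losses, eta = [0.5, 0.5], [0.0, 1.0], 1.0
    print(dot_loss(w, losses))       # 0.5
    print(mix_loss(w, losses, eta))  # ~0.3799, strictly below the dot loss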
87 5.1 Extension beyond absolute loss In this section we consider the general setting with K = 2 experts, which we still refer to as 0 and 1. [sent-300, score-0.217]
88 Here the learner plays $p \in [0, 1]$, which is now interpreted as the weight allocated to expert 1; the adversary chooses a loss vector $\ell = \langle \ell^0, \ell^1 \rangle \in [0, 1]^2$, and the learner incurs dot loss $(1-p)\ell^0 + p\ell^1$. [sent-301, score-0.641]
89 The regrets are now redefined as $R_T^k := \sum_{t=1}^{T} \big( p_t \ell_t^1 + (1-p_t) \ell_t^0 - \ell_t^k \big)$ for each expert $k \in \{0, 1\}$. [sent-302, score-0.575]
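In code form, the redefinition is a direct transcription of the display above (the function name and inputs are illustrative):

    def dot_loss_regrets(plays, loss_vectors):
        """R_T^k = sum_t (p_t l_t^1 + (1 - p_t) l_t^0 - l_t^k) for k in {0, 1}."""
        r0 = r1 = 0.0
        for p, (l0, l1) in zip(plays, loss_vectors):
            learner = p * l1 + (1 - p) * l0
            r0 += learner - l0
            r1 += learner - l1
        return r0, r1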
90 The T -realisable trade-offs for absolute loss and K = 2 expert dot loss coincide. [sent-304, score-0.593]
91 We recover the absolute loss case by restricting δ to {−1, 1}. [sent-309, score-0.217]
92 5.2 More than 2 experts In the general experts problem we compete with K instead of 2 experts. [sent-312, score-0.34]
93 The intuitive approach, combining the K experts in a balanced binary tree of two-expert predictors, does not achieve this goal: each internal node contributes the optimal symmetric regret of $\sqrt{T/(2\pi)}$. [sent-317, score-0.531]
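On a back-of-the-envelope reading (an assumption here: regrets add along the root-to-leaf path of the tree), the balanced tree therefore yields, w.r.t. every expert,

    \log_2(K)\,\sqrt{T/(2\pi)}
    \quad \text{versus the target} \quad
    \sqrt{(T/2)\,\ln K},

an overshoot by a factor of order $\sqrt{\ln K}$.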
94 At each level we combine K experts one-vs-all, permitting large regret w.r.t. the expert that is split off. [sent-320, score-0.463]
95 Combining the expert with the smallest prior with the recursive combination of the rest guarantees regret $\sqrt{-cT \ln q(k)}$ w.r.t. each expert k. [sent-332, score-0.716]
96 6 Conclusion We studied asymmetric regret guarantees for the fundamental online learning setting of the absolute loss game. [sent-336, score-0.542]
97 We obtained exactly the achievable skewed regret guarantees, and the corresponding optimal algorithm. [sent-337, score-0.421]
98 We then showed how our results transfer from absolute loss to general linear losses, and to more than two experts. [sent-340, score-0.217]
99 An open problem is to find the Pareto frontier for horizon-free strategies maintaining $R_T^k \le \rho_k \sqrt{T}$ at any T. [sent-342, score-0.225]
100 Improved second-order bounds for prediction with expert advice. [sent-446, score-0.265]
wordName wordTfidf (topN-words)
[('pareto', 0.346), ('gt', 0.327), ('realisable', 0.304), ('regret', 0.293), ('expert', 0.229), ('frontier', 0.181), ('pt', 0.173), ('experts', 0.17), ('contact', 0.148), ('rt', 0.133), ('ln', 0.13), ('lk', 0.129), ('characterise', 0.115), ('ft', 0.111), ('normalised', 0.11), ('loss', 0.109), ('absolute', 0.108), ('pro', 0.108), ('strategy', 0.103), ('lt', 0.099), ('witness', 0.098), ('erf', 0.095), ('hedge', 0.09), ('achievable', 0.089), ('vertex', 0.089), ('horizon', 0.085), ('abernethy', 0.081), ('learner', 0.078), ('characterisation', 0.076), ('rk', 0.07), ('warmuth', 0.07), ('tail', 0.067), ('rear', 0.065), ('freund', 0.063), ('le', 0.058), ('vertices', 0.058), ('nicol', 0.058), ('wouter', 0.057), ('witnessing', 0.057), ('asymptotically', 0.057), ('desire', 0.054), ('witnesses', 0.053), ('editors', 0.051), ('manfred', 0.05), ('dv', 0.048), ('slope', 0.048), ('overhead', 0.048), ('lim', 0.047), ('curve', 0.047), ('outcome', 0.047), ('strategies', 0.044), ('optimisation', 0.043), ('chernov', 0.043), ('graphed', 0.043), ('xt', 0.042), ('gr', 0.042), ('frontiers', 0.041), ('yoav', 0.041), ('les', 0.041), ('schapire', 0.039), ('optimal', 0.039), ('game', 0.038), ('dot', 0.038), ('losses', 0.038), ('peter', 0.038), ('front', 0.038), ('kapralov', 0.038), ('nwald', 0.038), ('sqrt', 0.038), ('analysed', 0.038), ('lemma', 0.038), ('walks', 0.037), ('behaviour', 0.036), ('prediction', 0.036), ('subsequently', 0.036), ('helmbold', 0.035), ('koolen', 0.035), ('randomised', 0.035), ('pereira', 0.035), ('jacob', 0.034), ('cumulative', 0.034), ('equals', 0.034), ('mansour', 0.034), ('candidate', 0.033), ('witnessed', 0.033), ('hutter', 0.033), ('unbalanced', 0.033), ('guarantees', 0.032), ('shamir', 0.032), ('limit', 0.032), ('recursive', 0.032), ('round', 0.031), ('zemel', 0.031), ('induction', 0.031), ('weinberger', 0.03), ('play', 0.03), ('progressively', 0.03), ('static', 0.029), ('balanced', 0.029), ('rounds', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 325 nips-2013-The Pareto Regret Frontier
Author: Wouter M. Koolen
Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1
2 0.35255304 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
Author: Brendan McMahan, Jacob Abernethy
Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1
3 0.29054117 230 nips-2013-Online Learning with Costly Features and Labels
Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari
Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1
4 0.27819011 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour
Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1
5 0.25422397 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
Author: Sasha Rakhlin, Karthik Sridharan
Abstract: We provide several applications of Optimistic Mirror Descent, an online learning algorithm based on the idea of predictable sequences. First, we recover the Mirror Prox algorithm for offline optimization, prove an extension to Hölder-smooth functions, and apply the results to saddle-point type problems. Next, we prove that a version of Optimistic Mirror Descent (which has a close relation to the Exponential Weights algorithm) can be used by two strongly-uncoupled players in a finite zero-sum matrix game to converge to the minimax equilibrium at the rate of O((log T)/T). This addresses a question of Daskalakis et al [6]. Further, we consider a partial information version of the problem. We then apply the results to convex programming and exhibit a simple algorithm for the approximate Max Flow problem. 1
7 0.19359773 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
8 0.18995295 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
9 0.17387985 26 nips-2013-Adaptive Market Making via Online Learning
10 0.16860309 89 nips-2013-Dimension-Free Exponentiated Gradient
11 0.1446752 137 nips-2013-High-Dimensional Gaussian Process Bandits
12 0.12998481 233 nips-2013-Online Robust PCA via Stochastic Optimization
13 0.12894784 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
14 0.11646269 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
15 0.11310329 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
16 0.1075009 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.10618111 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
18 0.10164367 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
19 0.098217651 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives
20 0.095740989 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
topicId topicWeight
[(0, 0.211), (1, -0.177), (2, 0.279), (3, -0.265), (4, 0.004), (5, -0.11), (6, 0.034), (7, -0.012), (8, 0.04), (9, -0.045), (10, -0.028), (11, -0.106), (12, 0.1), (13, -0.007), (14, 0.097), (15, -0.111), (16, 0.022), (17, -0.133), (18, -0.051), (19, -0.016), (20, -0.129), (21, -0.034), (22, -0.112), (23, 0.063), (24, -0.009), (25, 0.031), (26, -0.028), (27, 0.063), (28, 0.07), (29, 0.021), (30, -0.05), (31, 0.038), (32, 0.029), (33, -0.103), (34, 0.004), (35, 0.054), (36, 0.043), (37, -0.042), (38, 0.016), (39, -0.025), (40, 0.001), (41, -0.015), (42, 0.025), (43, -0.062), (44, -0.049), (45, 0.006), (46, 0.024), (47, 0.009), (48, 0.017), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.97079682 325 nips-2013-The Pareto Regret Frontier
Author: Wouter M. Koolen
Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1
2 0.85229349 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
Author: Brendan McMahan, Jacob Abernethy
Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1
3 0.83199304 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
Author: Nicolò Cesa-Bianchi, Ofer Dekel, Ohad Shamir
Abstract: We study the power of different types of adaptive (nonoblivious) adversaries in the setting of prediction with expert advice, under both full-information and bandit feedback. We measure the player’s performance using a new notion of regret, also known as policy regret, which better captures the adversary’s adaptiveness to the player’s behavior. In a setting where losses are allowed to drift, we characterize, in a nearly complete manner, the power of adaptive adversaries with bounded memories and switching costs. In particular, we show that with switching costs, the attainable rate with bandit feedback is $\Theta(T^{2/3})$. Interestingly, this rate is significantly worse than the $\Theta(\sqrt{T})$ rate attainable with switching costs in the full-information case. Via a novel reduction from experts to bandits, we also show that a bounded memory adversary can force $\Theta(T^{2/3})$ regret even in the full information case, proving that switching costs are easier to control than bounded memory adversaries. Our lower bounds rely on a new stochastic adversary strategy that generates loss processes with strong dependencies. 1
4 0.81848425 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour
Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1
5 0.73561782 230 nips-2013-Online Learning with Costly Features and Labels
Author: Navid Zolghadr, Gabor Bartok, Russell Greiner, András György, Csaba Szepesvari
Abstract: This paper introduces the online probing problem: In each round, the learner is able to purchase the values of a subset of feature values. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss function that he is evaluated against. Either way, the learner pays for both the errors of his predictions and also whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem. 1
6 0.72567874 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
8 0.68973267 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
9 0.65450573 26 nips-2013-Adaptive Market Making via Online Learning
10 0.54871154 89 nips-2013-Dimension-Free Exponentiated Gradient
11 0.5394513 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
12 0.51476717 139 nips-2013-How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal
13 0.51223141 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search
14 0.47746596 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes
15 0.43088749 137 nips-2013-High-Dimensional Gaussian Process Bandits
16 0.42824835 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.41608587 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
18 0.39323047 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
19 0.38488826 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
20 0.37554336 223 nips-2013-On the Relationship Between Binary Classification, Bipartite Ranking, and Binary Class Probability Estimation
topicId topicWeight
[(2, 0.108), (16, 0.03), (33, 0.167), (34, 0.074), (41, 0.018), (49, 0.026), (56, 0.127), (70, 0.052), (85, 0.045), (89, 0.022), (93, 0.028), (94, 0.187), (95, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.86422861 325 nips-2013-The Pareto Regret Frontier
Author: Wouter M. Koolen
Abstract: Performance guarantees for online learning algorithms typically take the form of regret bounds, which express that the cumulative loss overhead compared to the best expert in hindsight is small. In the common case of large but structured expert sets we typically wish to keep the regret especially small compared to simple experts, at the cost of modest additional overhead compared to more complex others. We study which such regret trade-offs can be achieved, and how. We analyse regret w.r.t. each individual expert as a multi-objective criterion in the simple but fundamental case of absolute loss. We characterise the achievable and Pareto optimal trade-offs, and the corresponding optimal strategies for each sample size both exactly for each finite horizon and asymptotically. 1
2 0.79428428 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
Author: Brendan McMahan, Jacob Abernethy
Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1
3 0.78794825 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
Author: Alfredo Kalaitzis, Ricardo Silva
Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. 1 Contribution There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17]. A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F (y1 , y2 ) with marginals −1 −1 F1 (y1 ) and F2 (y2 ). This CDF can then be rewritten as F (F1 (F1 (y1 )), F2 (F2 (y2 ))). The func−1 −1 tion C(·, ·) given by F (F1 (·), F2 (·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi (·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent 1 machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. 
In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common ˆ trick goes as follows: transform variables by defining ai ≡ Fi (yi ) for an estimate of Fi (·) and then fit a copula density c(·, . . . , ·) to the resulting ai [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space. Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section. 2 Gaussian copulas and the extended rank likelihood It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C): 1. Sample Z from a zero mean Gaussian with covariance matrix C 2. For each Zj , set Uj = Φ(zj ), where Φ(·) is the CDF of the standard Gaussian It is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1 , y2 , . . . , yp } with marginal distributions Fj (·) and copula GC(C), one can add the deterministic (n) (1) (1) (2) step yj = Fj−1 (uj ). Now, given n samples of observed data Y ≡ {y1 , . . . , yp , y1 , . . . , yp }, one is interested on inferring C via a Bayesian approach and the posterior distribution p(C, θF | Y) ∝ pGC (Y | C, θF )π(C, θF ) where π(·) is a prior distribution, θF are marginal parameters for each Fj (·), which in general might need to be marginalized since they will be unknown, and pGC (·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF . Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj (·) has jumps. Instead, the reverse mapping only enforces the constraints where (i ) (i ) (i ) (i ) yj 1 < yj 2 implies zj 1 < zj 2 . Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in Rn×p obeying those constraints, that is (k) (k) D(y) ≡ Z ∈ Rn×p : max zj s.t. yj (i) < yj (i) (k) (i) (k) < zj < min zj s.t. yj < yj . 
Since {Y = y} ⇒ Z(y) ∈ D(y), we have pGC (Y | C, θF ) = pGC (Z ∈ D(y), Y | C, θF ) = pN (Z ∈ D(y) | C) × pGC (Y| Z ∈ D(y), C, θF ), (1) the first factor of the last line being that of a zero-mean a Gaussian density function marginalized over D(y). 2 The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing p(C, Z | Y) ∝ I(Z ∈ D(y)) pN (Z| C) π(C), (2) the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13]. It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi (·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF . Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi (·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling (i) Z is straightforward if for each variable j and data point i we sample Zj conditioned on all other variables. The corresponding distribution is an univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference. 3 Exact HMC for truncated Gaussian distributions (i) Hoff’s algorithm modifies the positions of all Zj associated with a particular discrete value of Yj , conditioned on the remaining points. As the number of data points increases, the spread of the hard (i) boundaries on Zj , given by data points of Zj associated with other levels of Yj , increases. This (i) reduces the space in which variables Zj can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables (i) Zj , i = 1 . . . n , conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. 
In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from log p(x) ∝ − 1 x Mx + r x , where x, r ∈ Rn and M is positive 2 definite, with linear constraints of the form fj x ≤ gj , where fj ∈ Rn , j = 1 . . . m, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form xi1 ≤ xi2 . We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n2 ) (in the number of observations) to O(n) in practice. 3.1 Hamiltonian Monte Carlo for the Gaussian distribution Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space brings the sampler to 3 potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case. The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is: 1 1 H = x Mx − r x + s M−1 s , (3) 2 2 where M is also known in the present context as the mass matrix, r = Mµ and s is the velocity. Both x and s are Gaussian distributed so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations: ˙ x= sH = M−1 s , ˙ s=− xH = −Mx + r . (4) These are solved exactly by x(t) = µ + a sin(t) + b cos(t) , where a and b can be identified at initial conditions (t = 0) : ˙ a = x(0) = M−1 s , b = x(0) − µ . (5) Therefore, the exact HMC algorithm can be summarised as follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b and store the new position at the end of the trajectory xk+1 = x(T ) as an HMC sample. It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N µ, M−1 [17]. 3.2 Sampling with linear constraints Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g . Here, F ∈ Rm×n is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls. The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples on 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. 
Once the wall to be hit has been found, then position and velocity at impact time are computed and the velocity is reflected about the boundary normal1 . The constrained HMC sampler is summarized follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b . 1 Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane. 4 1 2 3 4 1 2 3 4 Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2 . Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color. 3. Reset remaining travel time Tleft ← T . Until no travel time is left or no walls can be reached (no solutions exist), do: (a) Compute impact times with all walls and pick the smallest one, th (if a solution exists). (b) Compute v(th ) and reflect it about the hyperplane fh . This is the updated velocity after impact. The updated position is x(th ) . (c) Tleft ← Tleft − th 4. Store the new position at the end of the trajectory xk+1 as an HMC sample. In general, all walls are candidates for impact, so the runtime of the sampler is linear in m , the number of constraints. This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T . Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally so the mixing of the chain can suffer. What constitutes a given travel time “large” or “small” is relative to the dimensionality, the number of constraints and the structure of the constraints. Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n2 ) . Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n) . 4 HMC for the Gaussian Copula extended rank likelihood model Given some discrete data Y ∈ Rn×p , the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff’s sampling algorithm proceeds by alternating between sampling the continu(i) (i) ous latent representation Zj of each Yj , for i = 1 . . . n, j = 1 . . . p , and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ Rn×p , which is then renormalized as a correlation matrix. From here on, we use matrix notation for the samples, as opposed to the random variables – with (i) Zi,j replacing Zj , Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column. In a similar vein to Hoff’s sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j . 
This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach, is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T . Larger travel times potentially allow for larger moves in the sample space, but it comes at a cost as explained in the sequel. 5 4.1 The Hough envelope algorithm The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j , all entries are organised into a partition L(j) of |L(j) | levels, the number of unique values observed for the discrete or ordinal variable Y (j) . Thereby, for any two adjacent levels lk , lk+1 ∈ L(j) and any pair i1 ∈ lk , i2 ∈ lk+1 , it must be true that Zli ,j < Zli+1 ,j . Equivalently, a constraint f exists where fi1 = 1, fi2 = −1 and g = 0 . It is easy to see that O(n2 ) of such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in {Z:,j }, in practically linear time. Recall in HMC (section 3.2) that the trajectory of the particle, x(t), is decomposed as xi (t) = ai sin(t) + bi cos(t) + µi , (6) and there are n such functions, grouped into a partition of levels as described above. The Hough envelope2 is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generalization to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves. The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary. 5 3 1 2 Figure 2: The trajectories xj (t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being xblue2 = xred2 . 4 4 5 1 2 3 0.2 0.4 0.6 t 0.8 1 1.2 1.4 1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component. 2. To find the next largest component, compute the roots of xbluemax (t) − xi (t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time. 3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found. In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume an fixed upper bound h on the number of blue crossings before a inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. 
Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1. 2 The name is inspired from the fact that each xi (t) is the sinusoid representation, in angle-distance space, of all lines that pass from the (ai , bi ) point in a − b space. A representation known in image processing as the Hough transform [3]. 6 Algorithm 1 HMC for GCERL # Notation: T MN (µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0 . # IW(df, V0 ) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0 . Input: Y ∈ Rn×p , allocated travel time T , a starting Z and variable covariance V ∈ Rp×p , df = p + 2, V0 = df Ip and chain size N . Generate constraints F(j) from Y:,j , for j = 1 . . . p . for samples k = 1 . . . N do # Resample Z as follows: for variables j = 1 . . . p do −1 −1 2 Compute parameters: σj = Vjj − Vj,\j V\j,\j V\j,j , µj = Z:,\j V\j,\j V\j,j . 2 Get one sample Z:,j ∼ T MN µj , σj I, F(j) efficiently by using the Hough Envelope algorithm, see section 4.1. end for Resample V ∼ IW(df + n, V0 + Z Z) . Compute correlation matrix C, s.t. Ci,j = Vi,j / Vi,i Vj,j and store sample, C(k) ← C . end for 5 An application on the Bayesian Gausian copula factor model In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full-advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead. Experimental setup In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff’s model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity. We show that unlike the HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models, even for small samples. Worse, PX fails in ways one cannot easily detect. Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski’s [17, sec.3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such ”interweaving” schemes has been addressed in [19]. 
Results. We perform simulations of 10000 iterations with n = 1000 observations (rows of Y) and travel time π/2 for HMC, with the setups listed in the following table, along with the elapsed time of each sampling scheme. These experiments were run on Intel Core i7 desktops with 4 cores and 8 GB of RAM. Both methods were parallelised across the observed variables (p).

Figure   p (vars)   k (latent factors)   M (ordinal levels)   elapsed (mins): HMC   PX
3(a)     20         5                    2                    115                   8
3(b)     10         3                    2                    80                    6
3(c)     10         3                    5                    203                   16

Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm's outcome with the root mean squared error (RMSE) of the differences between the entries of the ground-truth correlation matrix and those of the implied correlation matrix at each iteration of an MCMC scheme (so each of the following plots looks like a time series of 10000 time points); see Figures 3(a), 3(b) and 3(c). A small sketch of this summary computation is given at the end of this section.

Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula, and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU time) than PX (red). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region than those of HMC, even after 10000 iterations. This illustrates that PX explores the true distribution poorly. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviour similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviour similar to Figures 3(a) and 3(b), but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000).

Main message. HMC reaches a better mode faster (in iterations/CPU time). Even more importantly, the RMSE posterior samples of PX are concentrated in a much smaller region than those of HMC, even after 10000 iterations, which illustrates that PX explores the true distribution poorly. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal's slice sampling paper [14]: if there were no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Neal's statement opening Section 8 of that paper: "a wrong answer is obtained without any obvious indication that something is amiss". The concentration of PX on the posterior mode in these simulations is misleading. PX may seem a bit simpler to implement, but it appears one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much the credible intervals of functionals of interest have been overconfident. Whether the effect is trivial or severe, our algorithm offers the first principled approach for checking it.
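For completeness, here is a small sketch of the RMSE summary used in Figure 3, assuming the stored samples C(k) are stacked into a single array; the text does not specify whether the (constant) diagonal entries are included in the average, so we average over all entries here.

```python
import numpy as np

def rmse_trace(C_samples, C_true):
    """Per-iteration RMSE between sampled and ground-truth correlations.

    C_samples : (N, p, p) array stacking the stored C(k); C_true : (p, p).
    Returns an (N,) series, one value per MCMC iteration, as in Figure 3.
    """
    diff = C_samples - C_true[None, :, :]
    return np.sqrt((diff ** 2).mean(axis=(1, 2)))
```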
6 Conclusion

Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method developed here will be useful not only for data analysis problems within the large family of Gaussian copula extended rank likelihood models, but also that the method itself and its behaviour may provide new insights on MCMC sampling in constrained spaces in general. Another direction of future work is to explore methods for elliptical copulas, and related possible extensions of general HMC to non-Gaussian copula models.

Acknowledgements. The quality of this work has benefited greatly from comments by our anonymous reviewers and from useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1.

References

[1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975.
[2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011.
[3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.
[4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013.
[5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012.
[6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007.
[8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973.
[9] H. Joe. Multivariate Models and Dependence Concepts. Chapman & Hall, 1997.
[10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007.
[11] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010.
[13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013.
[14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.
[15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010.
[16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007.
[17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012.
[18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] Y. Yu and X. L. Meng. To center or not to center: That is not the question. An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011.
4 0.78456068 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
Author: Ryota Tomioka, Taiji Suzuki
Abstract: We study a new class of structured Schatten norms for tensors that includes two recently proposed norms (“overlapped” and “latent”) for convex-optimizationbased tensor decomposition. We analyze the performance of “latent” approach for tensor decomposition, which was empirically found to perform better than the “overlapped” approach in some settings. We show theoretically that this is indeed the case. In particular, when the unknown true tensor is low-rank in a specific unknown mode, this approach performs as well as knowing the mode with the smallest rank. Along the way, we show a novel duality result for structured Schatten norms, which is also interesting in the general context of structured sparsity. We confirm through numerical simulations that our theory can precisely predict the scaling behaviour of the mean squared error. 1
5 0.78430384 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
Author: Gunnar Kedenburg, Raphael Fonteneau, Remi Munos
Abstract: This paper addresses the problem of online planning in Markov decision processes using a randomized simulator, under a budget constraint. We propose a new algorithm which is based on the construction of a forest of planning trees, where each tree corresponds to a random realization of the stochastic environment. The trees are constructed using a “safe” optimistic planning strategy combining the optimistic principle (in order to explore the most promising part of the search space first) with a safety principle (which guarantees a certain amount of uniform exploration). In the decision-making step of the algorithm, the individual trees are aggregated and an immediate action is recommended. We provide a finite-sample analysis and discuss the trade-off between the principles of optimism and safety. We also report numerical results on a benchmark problem. Our algorithm performs as well as state-of-the-art optimistic planning algorithms, and better than a related algorithm which additionally assumes the knowledge of all transition distributions. 1
6 0.78249401 171 nips-2013-Learning with Noisy Labels
7 0.78142619 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
8 0.77696967 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
9 0.77664202 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
10 0.76984674 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
11 0.76201844 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
12 0.75258875 66 nips-2013-Computing the Stationary Distribution Locally
13 0.74869835 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
14 0.74846625 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
15 0.74748391 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
16 0.7395851 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
17 0.73629349 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
18 0.73519707 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
19 0.73203057 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation
20 0.73164344 355 nips-2013-Which Space Partitioning Tree to Use for Search?