nips nips2013 nips2013-270 knowledge-graph by maker-knowledge-mining

270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes


Source: pdf

Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet

Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). [sent-8, score-0.181]

2 Most robust optimization approaches for these problems have focussed on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. [sent-9, score-0.291]

3 Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. [sent-10, score-0.472]

4 However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. [sent-11, score-0.326]

5 We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. [sent-12, score-0.689]

6 Introduction: Motivated by the difficulty of exactly specifying reward and transition models, researchers have proposed the uncertain Markov Decision Process (MDP) model and robustness objectives for solving these models. [sent-15, score-0.231]

7 Given the uncertainty over the reward and transition models, a robust solution can typically provide some guarantees on the worst case performance. [sent-16, score-0.274]

8 Most of the research in computing robust solutions has assumed a maximin objective, where one computes a policy that maximizes the value corresponding to the worst case realization [8, 4, 3, 1, 7]. [sent-17, score-0.352]
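To make the maximin objective concrete, the following toy sketch (the nested-list T/R layout and the names `policy_value` and `maximin_policy` are our own illustrative assumptions, not the paper's formulation) enumerates deterministic time-dependent policies over sampled (T, R) models and keeps the one whose worst-case value is largest:

```python
import itertools

def policy_value(T, R, policy, H, s0=0):
    """Value of a deterministic time-dependent policy by backward induction.
    T[t][s][a][s2]: transition probability; R[t][s][a]: reward; policy[t][s]: action."""
    nS = len(T[0])
    v = [0.0] * nS
    for t in reversed(range(H)):
        v = [R[t][s][policy[t][s]]
             + sum(T[t][s][policy[t][s]][s2] * v[s2] for s2 in range(nS))
             for s in range(nS)]
    return v[s0]

def maximin_policy(samples, H, nS, nA, s0=0):
    """Brute-force maximin: maximize the worst-case value over sampled (T, R) models."""
    best, best_pi = float("-inf"), None
    for assignment in itertools.product(range(nA), repeat=H * nS):
        pi = [list(assignment[t * nS:(t + 1) * nS]) for t in range(H)]
        worst = min(policy_value(T, R, pi, H, s0) for T, R in samples)
        if worst > best:
            best, best_pi = worst, pi
    return best, best_pi
```

On a single-state toy with two sampled reward models (action 0 pays 1 or 0 depending on the sample, action 1 pays 0.5 in both), maximin prefers the safe action 1.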

9 This line of work has developed scalable algorithms by exploiting independence of uncertainties across states and convexity of uncertainty sets. [sent-18, score-0.293]

10 Recently, techniques have been proposed to deal with dependence of uncertainties [15, 6]. [sent-19, score-0.164]

11 [16] have proposed minimax regret criterion [13] as a suitable alternative to maximin objective for uncertain MDPs. [sent-22, score-0.531]

12 We focus on this minimax notion of robustness and also provide a new myopic variant of regret, called Cumulative Expected Regret (CER), that allows for the development of scalable algorithms. [sent-23, score-0.331]

13 Due to the complexity of computing optimal minimax regret policies [16], existing algorithms [12] are restricted to handling uncertainty only in reward models, and the uncertainties are independent across states. [sent-24, score-0.675]

14 Recent research has shown that sampling-based techniques [5, 9] are not only efficient but also provide a priori (Chernoff-Hoeffding bounds) and a posteriori [14] quality bounds for planning under uncertainty. [sent-25, score-0.08]

15 In this paper, we also employ sampling-based approaches to address restrictions of existing approaches for obtaining regret-based solutions for uncertain MDPs. [sent-26, score-0.107]

16 More specifically, we make the following contributions: (a) An approximate Mixed Integer Linear Programming (MILP) formulation with error bounds for computing minimum-regret solutions for uncertain MDPs, where the uncertainties across states are dependent. [sent-27, score-0.525]

17 We further provide enhancements and error bounds to improve applicability. [sent-28, score-0.046]

18 Preliminaries: We now formally define the two regret criteria that will be employed in this paper. [sent-31, score-0.234]

19 In the definitions below, we assume an underlying MDP, M = ⟨S, A, T, R, H⟩, where a policy is represented as π^t = {π^t, π^{t+1}, …, π^{H−1}}. [sent-32, score-0.121]

20 The optimal policy is denoted π^∗ and the optimal expected value v^0(π^∗). [sent-35, score-0.169]

21 The maximum reward in any state s is denoted as R^∗(s) = max_a R(s, a). [sent-36, score-0.123]

22 Proposition 1: For a policy π^0, 0 ≤ reg(π^0) − creg(π^0) ≤ (max_s R^∗(s) − min_s R^∗(s)) · (1 − γ^H)/(1 − γ). Proof sketch¹: By rewriting Equation (1) as creg(π^0) = v^{0,#}(π^0) − v^0(π^0), we provide the proof. [sent-40, score-0.181]

23 Uncertain MDP: A finite-horizon uncertain MDP is defined as the tuple ⟨S, A, T, R, H⟩. [sent-44, score-0.123]

24 T = ∆^τ(T) denotes a distribution over the set of transition functions T, where T_k^t(s, a, s′) denotes the probability of transitioning from state s ∈ S to state s′ ∈ S on taking action a ∈ A at time step t according to the kth element in T. [sent-46, score-0.178]

25 Similarly, R = ∆^ρ(R) denotes the distribution over the set of reward functions R, where R_k^t(s, a, s′) is the reinforcement obtained on taking action a in state s and transitioning to state s′ at time t according to the kth element in R. [sent-47, score-0.209]

26 In the above representation, every element of T and R represents uncertainty over the entire horizon, and hence this representation captures dependence in uncertainty distributions across states. [sent-50, score-0.187]
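One way to read this dependence in code (a sketch; the finite model set and the `random.choices` sampler are our assumptions, not the paper's construction): each sample fixes an entire-horizon (T, R) pair at once, so correlations across states, actions and epochs survive automatically.

```python
import random

def draw_dependent_samples(models, probs, Q, seed=0):
    """Draw Q whole-horizon (T, R) model samples from a finite uncertainty
    distribution.  Sampling complete models, rather than per-(s, a, t) factors,
    keeps any dependence across states and decision epochs intact."""
    rng = random.Random(seed)
    return rng.choices(models, weights=probs, k=Q)
```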

27 We now provide a formal definition for the independence of uncertainty distributions that is equivalent to the rectangularity property introduced in Iyengar et al. [sent-51, score-0.107]

28 1 Detailed proof provided in supplement under Proposition 1. [sent-53, score-0.054]

29 Definition 3: An uncertainty distribution ∆^τ over the set of transition functions T is independent over state-action pairs at various decision epochs if ∆^τ(T) = ×_{s∈S, a∈A, t≤H} ∆^{τ,t}_{s,a}(T^t_{s,a}), i.e. [sent-54, score-0.202]

30 ∀k: Pr_{∆^τ}(T^k) = ∏_{s,a,t} Pr_{∆^{τ,t}_{s,a}}(T^{t,k}_{s,a}), where T = ×_{s,a,t} T^t_{s,a}, T^t_{s,a} is the set of transition functions for (s, a, t), ∆^{τ,t}_{s,a} is the distribution over the set T^t_{s,a}, and Pr_{∆^τ}(T^k) is the probability of the transition function T^k given the distribution ∆^τ. [sent-56, score-0.11]

31 We can provide a similar definition for the independence of uncertainty distributions over the reward functions. [sent-57, score-0.193]
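Under the independence (rectangularity) assumption, by contrast, each T^t_{s,a} factor can be drawn from its own marginal. A minimal sketch (the dictionary layout and function name are hypothetical):

```python
import random

def draw_rectangular_T(supports, probs, H, seed=0):
    """Rectangular (independent) sampling: each T^t_{s,a} is drawn from its own
    marginal distribution, independently across (s, a) pairs and epochs t."""
    rng = random.Random(seed)
    return [{sa: rng.choices(supports[sa], weights=probs[sa], k=1)[0]
             for sa in supports}
            for _ in range(H)]
```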

32 In the following definitions, we include transition, T and reward, R models as subscripts to indicate value (v), regret (reg) and CER (creg) functions corresponding to a specific MDP. [sent-58, score-0.22]

33 Existing works on the computation of maximin policies have the following objective: π^maximin = arg max_{π^0} min_{T∈T, R∈R} Σ_s α(s) · v^0_{T,R}(s, π^0). Our goal is to compute policies that minimize the maximum regret or cumulative regret over possible models of transitional and reward uncertainty. [sent-59, score-1.113]

34 π^reg = arg min_{π^0} max_{T∈T, R∈R} reg_{T,R}(π^0); π^creg = arg min_{π^0} max_{T∈T, R∈R} creg_{T,R}(π^0). Regret Minimizing Solution: We will first consider the more general case of dependent uncertainty distributions. [sent-60, score-1.016]

35 Our approach to obtaining regret minimizing solution relies on sampling the uncertainty distributions over the transition and reward models. [sent-61, score-0.475]

36 We formulate the regret minimization problem over the sample set as an optimization problem and then approximate it as a Mixed Integer Linear Program (MILP). [sent-62, score-0.242]

37 We now describe the representation of a sample and the definition of optimal expected value for a sample, a key component in the computation of regret. [sent-63, score-0.053]

38 Thus, a sample is: ξ_q = {⟨T_q^0, R_q^0⟩, ⟨T_q^1, R_q^1⟩, …, ⟨T_q^{H−1}, R_q^{H−1}⟩}, where T_q^t and R_q^t refer to the transition and reward model respectively at time step t in sample q. [sent-65, score-0.185]

39 Let π^t represent the policy for each time step from t to H − 1, and let the set of samples be denoted by ξ. [sent-66, score-0.164]

40 Intuitively, this corresponds to |ξ| discrete MDPs, and our goal is to compute one policy that minimizes the regret over all |ξ| MDPs, i.e. [sent-67, score-0.341]

41 π^reg = arg min_{π^0} max_{ξ_q∈ξ} Σ_s α(s) · [v^∗_{ξ_q}(s) − v^0_{ξ_q}(s, π^0)], where v^∗_{ξ_q}(s) and v^0_{ξ_q}(s, π^0) denote the optimal expected value and the expected value for policy π^0, respectively, on the sample ξ_q. [sent-69, score-0.34]
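For tiny instances, this sampled objective can be evaluated exactly by enumeration (a sketch only: the paper's MILP optimizes over randomized policies, whereas this brute force searches deterministic ones; all names and the list layout are ours):

```python
import itertools

def backward_values(T, R, H, policy=None):
    """v*(s) when policy is None (max over actions), else v^pi(s), by backward
    induction.  Layout: T[t][s][a][s2], R[t][s][a], policy[t][s]."""
    nS, nA = len(T[0]), len(T[0][0])
    v = [0.0] * nS
    for t in reversed(range(H)):
        def q(s, a):
            return R[t][s][a] + sum(T[t][s][a][s2] * v[s2] for s2 in range(nS))
        if policy is None:
            v = [max(q(s, a) for a in range(nA)) for s in range(nS)]
        else:
            v = [q(s, policy[t][s]) for s in range(nS)]
    return v

def min_regret_policy(samples, alpha, H, nS, nA):
    """Brute force over deterministic policies: minimize max-over-samples regret."""
    opt = [backward_values(T, R, H) for T, R in samples]
    best, best_pi = float("inf"), None
    for assignment in itertools.product(range(nA), repeat=H * nS):
        pi = [list(assignment[t * nS:(t + 1) * nS]) for t in range(H)]
        worst = max(
            sum(alpha[s] * (opt[k][s] - backward_values(T, R, H, pi)[s])
                for s in range(nS))
            for k, (T, R) in enumerate(samples))
        if worst < best:
            best, best_pi = worst, pi
    return best, best_pi
```

On a one-state toy where each of two actions is optimal in exactly one sample, every deterministic policy incurs regret 0.5 in its worst sample, hinting at why the optimal minimax-regret policy is randomized in general.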

42 3 Mixed Integer Linear Program The optimal policy for minimizing maximum regret in the general case is randomized. [sent-74, score-0.383]

43 We provide a mixed integer linear approximation to address the product terms above. [sent-79, score-0.075]

44 Because of the convexity of the x² function, the maximum approximation error in any interval [br_{w−1}, br_w] occurs at its midpoint³. [sent-96, score-0.196]

45 Hence, the approximation error δ is given by: δ ≤ ((br_w)² + (br_{w−1})²)/2 − ((br_w + br_{w−1})/2)² = (br_w − br_{w−1})²/4 < ε/4 (for breakpoint spacing with (br_w − br_{w−1})² ≤ ε). Using CPLEX SOS-2 considerably improves runtime compared to a binary-variables formulation. [sent-97, score-0.392]
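The midpoint error of the piecewise-linear approximation of x² is easy to check numerically (`pw_linear_sq` is our own name for an SOS-2-style interpolation):

```python
def pw_linear_sq(x, breaks):
    """SOS-2-style piecewise-linear interpolation of x**2 on a breakpoint grid."""
    for a, b in zip(breaks, breaks[1:]):
        if a <= x <= b:
            lam = (x - a) / (b - a)
            return (1 - lam) * a * a + lam * b * b
    raise ValueError("x outside breakpoint range")

# On [br_{w-1}, br_w] the error peaks at the midpoint, with value (br_w - br_{w-1})**2 / 4.
a, b = 1.0, 3.0
mid = (a + b) / 2.0
err = pw_linear_sq(mid, [a, b]) - mid * mid  # 5.0 - 4.0 = 1.0 = (3 - 1)**2 / 4
```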

46 (Proposition and proof provided in supplement as footnote 3.) Proposition 3: Let v̂^t_{ξ_q}(s, π^t) denote the approximation of v^t_{ξ_q}(s, π^t). [sent-98, score-0.054]

47 Corollary 2: The positive and negative errors in regret are bounded by |A| · ε · (1 − γ^{H−1}) / (4 · (1 − γ)). Proof. [sent-100, score-0.22]

48 Since the break points are fixed beforehand, we can find tighter bounds (refer to the proof of Proposition 2). [sent-102, score-0.057]

49 Also, we can further improve on the performance (both run-time and solution quality) of the MILP by pruning out dominated actions and adopting clever sampling strategies as discussed in the next subsections. [sent-103, score-0.116]

50 Pruning dominated actions: We now introduce a pruning approach⁵ to remove actions that will never be assigned a positive probability in a regret minimization strategy. [sent-104, score-0.338]

51 It should be noted that an action that is not optimal for any of the samples cannot be pruned. [sent-106, score-0.113]
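As bookkeeping for such a pruning check, one can tabulate which actions are greedily optimal in at least one sample (a sketch with our own names and layout; per the caveat above, this table alone does not decide prunability):

```python
def optimal_action_table(samples, H):
    """For each (t, s), the set of actions that are greedily optimal (w.r.t. the
    optimal value function) in at least one sampled (T, R) model."""
    nS, nA = len(samples[0][0][0]), len(samples[0][0][0][0])
    table = {(t, s): set() for t in range(H) for s in range(nS)}
    for T, R in samples:
        v = [0.0] * nS
        for t in reversed(range(H)):
            qs = [[R[t][s][a] + sum(T[t][s][a][s2] * v[s2] for s2 in range(nS))
                   for a in range(nA)] for s in range(nS)]
            for s in range(nS):
                best = max(qs[s])
                for a in range(nA):
                    if abs(qs[s][a] - best) < 1e-12:
                        table[(t, s)].add(a)
            v = [max(qs[s]) for s in range(nS)]
    return table
```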

52 Greedy sampling: The scalability of the MILP formulation above is constrained by the number of samples Q. [sent-107, score-0.084]

53 So, instead of generating only the fixed set of Q samples from the uncertainty distribution over models, we generate more than Q samples and then pick a set of size Q so that samples are “as far apart” as possible. [sent-108, score-0.195]

54 The key intuition in selecting the samples is to consider distance among samples as being equivalent to entropy in the optimal policies for the MDPs in the samples. [sent-109, score-0.163]

55 For each decision epoch t, each state s and action a, we define Pr_ξ(π^{∗,t}_ξ(s, a) = 1) to be the probability that a is the optimal action in state s at time t. [sent-110, score-0.227]

56 We iteratively add samples that maximize the entropy of the sample set in that iteration. [sent-113, score-0.065]
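That entropy-driven selection can be sketched as follows (one reading of the heuristic; representing each candidate sample by its optimal-policy map and the name `greedy_select` are our assumptions):

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a Counter viewed as an empirical distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def greedy_select(candidate_policies, Q):
    """Greedily pick Q candidate samples whose optimal policies maximize the
    entropy of the empirical (state-time, action) distribution -- one reading of
    choosing samples 'as far apart as possible'."""
    chosen, counts = [], Counter()
    remaining = list(range(len(candidate_policies)))
    for _ in range(Q):
        def gain(i):
            trial = counts.copy()
            trial.update(candidate_policies[i].items())  # counts ((s, t), a) pairs
            return entropy(trial)
        best = max(remaining, key=gain)
        remaining.remove(best)
        chosen.append(best)
        counts.update(candidate_policies[best].items())
    return chosen
```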

57 It is possible to provide bounds on the number of samples required for a given error using the methods suggested by Shapiro et al. [sent-114, score-0.089]

58 ⁴Detailed proof in supplement under Proposition 3. ⁵Pseudocode provided in the supplement under the “Pruning dominated actions” section. [sent-117, score-0.119]

59 CER Minimizing Solution: The MILP-based approach mentioned in the previous section can easily be adapted to minimize the maximum cumulative regret over all samples when uncertainties across states are dependent: min_{π^0} creg(π^0), s.t. [sent-118, score-0.502]

60 While we were unable to exploit the independence of uncertainty distributions across states with minimax regret, we are able to exploit the independence with minimax CER. [sent-121, score-0.246]

61 In fact, a key advantage of the CER robustness concept in the context of independent uncertainties is that it has optimal substructure over time steps, and hence a Dynamic Programming (DP) algorithm can be used to solve it. [sent-122, score-0.221]

62 In the case of independent uncertainties, samples at each time step can be drawn independently and we now introduce a formal notation to account for samples drawn at each time step. [sent-123, score-0.086]

63 Further, we use ξ^t to indicate the cross product of samples from t to H − 1, i.e. [sent-125, score-0.057]

64 To indicate the entire-horizon samples corresponding to a sample p from time step t, we have ξ^t_p = ξ^t_p × ξ^{t+1}. [sent-129, score-0.098]

65 At each stage t, we calculate the creg for each state-action pair corresponding to each of the samples. ⁶Detailed proof in supplement under Proposition 4. [sent-136, score-0.791]

66 Once these are computed, we obtain the maximum creg and the policy corresponding to it (line 10) using the GetCER() function. [sent-140, score-0.858]

67 In the next iteration, creg computed at t is then used in the computation of creg at t − 1 using the same update step (lines 6-9). [sent-141, score-1.474]
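Since Equation (1)'s exact creg update is not reproduced in this extract, the sketch below shows only the shape of such a stagewise DP on a single sampled model, assuming, per Proposition 1's rewriting, a per-step regret of R^∗(s) − R(s, a) accumulated along the policy's own dynamics, with rewards simplified to R(s, a); the layout and names are ours:

```python
def creg_policy(T, R, H, gamma=1.0):
    """Backward-induction sketch of a CER-style DP on ONE sampled model.
    Layout: T[t][s][a][s2], R[t][s][a]."""
    nS, nA = len(T[0]), len(T[0][0])
    creg_next = [0.0] * nS
    policy = []
    for t in reversed(range(H)):
        r_star = [max(R[t][s]) for s in range(nS)]  # best immediate reward R*(s)
        q = [[r_star[s] - R[t][s][a]
              + gamma * sum(T[t][s][a][s2] * creg_next[s2] for s2 in range(nS))
              for a in range(nA)] for s in range(nS)]
        pi_t = [min(range(nA), key=lambda a, s=s: q[s][a]) for s in range(nS)]
        policy.insert(0, pi_t)
        creg_next = [q[s][pi_t[s]] for s in range(nS)]
    return creg_next, policy
```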

68 Let creg^{∗,H−1}(s, a) denote the optimal cumulative regret at time step H − 1 for taking action a in state s, and creg^{∗,H−1}_ξ(s, a) denote the optimal cumulative regret over the sample set ξ. [sent-144, score-1.381]

69 Let the indicator random variable X be defined as follows: X = 1 if creg^{∗,H−1}(s, a) − creg^{∗,H−1}_ξ(s, a) ≤ λ, and X = 0 otherwise. By using Chernoff and Hoeffding bounds on X, it is possible to provide bounds on the deviation from the mean and on the number of samples at H − 1. [sent-145, score-0.855]
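As a concrete illustration of the Hoeffding part (the generic two-sided bound for a variable with range `value_range`; the function name is ours and this is not the paper's tightened bound):

```python
import math

def hoeffding_samples(lam, delta, value_range=1.0):
    """Samples needed so that the empirical mean of a bounded variable deviates
    from its mean by more than lam with probability at most delta (two-sided)."""
    return math.ceil((value_range ** 2) * math.log(2.0 / delta) / (2.0 * lam ** 2))
```

For example, a deviation of 0.1 with failure probability 0.05 needs 185 samples for a [0, 1]-bounded variable.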

70 However, these bounds can be very loose and they do not exploit the properties of creg functions. [sent-147, score-0.781]

71 Bounds developed on spacings of order statistics can help exploit the properties of creg functions. [sent-148, score-0.752]

72 MILP-Regret refers to the randomized policy variant of the MILP approximation algorithm for solving uncertain MDPs with dependent uncertainties. [sent-151, score-0.266]

73 We refer to the dynamic programming algorithm for minimizing CER in the independent uncertainty case as DP-CER and finally, we refer to the maximin value algorithm as “Maximin”. [sent-153, score-0.287]

74 We provide the following results in this section: (1) Performance comparison of Greedy sampling and Random sampling strategies in the context of MILP-Regret as we increase the number of samples. [sent-156, score-0.063]

75 (3) Comparison of MILP-Regret and MILP-CER policies with respect to simulated regret. [sent-158, score-0.077]

76 The first three comparisons correspond to the dependent uncertainties case and the results are based on a path planning problem that is motivated by disaster rescue and is a modification of the one employed in Bagnell et al. [sent-160, score-0.22]

77 On top of normal transitional uncertainty, we have uncertainty over transition and reward models due to random obstacles and random reward cells. [sent-162, score-0.345]

78 Furthermore, these uncertainties are dependent on each other due to patterns in terrains. [sent-163, score-0.188]

79 Each sample of the various uncertainties represents an individual map and can be modelled as an MDP. [sent-164, score-0.186]

80 We used a grid world of size 4x4 while varying the numbers of obstacles, reward cells, horizon and the number of break points employed (3-6). [sent-196, score-0.161]

81 In Figure 1a, we show the effect of using greedy sampling strategy on the MILP-Regret policy. [sent-197, score-0.049]

82 On the X-axis, we represent the number of samples used for the computation of the policy (learning set). [sent-198, score-0.164]

83 We then obtained the policies using MILP-Regret corresponding to the sample sets (referred to as learning set) generated by using the two sampling strategies. [sent-200, score-0.105]

84 On the Y-axis, we show the percentage difference between simulated regret values on the test and learning sample sets. [sent-201, score-0.259]

85 We observe that for a fixed difference, the number of samples required by greedy is significantly lower in comparison to random. [sent-202, score-0.069]

86 A key result from this graph is that even with just 15 samples, the difference with actual regret is less than 10%. [sent-204, score-0.22]

87 We have shown the gap and the variance of the gap over three different settings of uncertainty, labeled 1, 2 and 3. [sent-207, score-0.108]

88 Setting 3 has the highest uncertainty over the models and Setting 1 has the least uncertainty. [sent-208, score-0.066]

89 The variance over the gap was higher for higher uncertainty settings. [sent-209, score-0.087]

90 While MILP-CER obtained a simulated regret value (over 250 samples) within the bound provided in Proposition 1, we were unable to find any correlation in the simulated regret values of MILP-Regret and MILP-CER policies as the samples were increased. [sent-210, score-0.577]

91 In the last result, shown in Figure 1c, we employ the well-known single-product finite-horizon stochastic inventory control problem [10]. [sent-212, score-0.084]

92 The demand values at each decision epoch were taken from a normal distribution. [sent-214, score-0.083]

93 As expected, the DP-CER approach provides much higher values than maximin, and the difference between the two decreased as the cost-to-revenue ratio increased. [sent-216, score-0.181]

94 Conclusions: We have introduced scalable sampling-based mechanisms for optimizing regret and a new variant of regret called CER in uncertain MDPs with dependent and independent uncertainties across states. [sent-218, score-0.771]

95 We have provided a variety of theoretical results that indicate the connection between regret and CER, quality bounds on regret in the case of MILP-Regret, optimal substructure in optimizing CER for the independent-uncertainty case, and run-time performance for MinimizeCER. [sent-219, score-0.592]

96 In the future, we hope to better understand the correlation between regret and CER, while also understanding the properties of CER policies. [sent-220, score-0.22]

97 A sparse sampling algorithm for nearoptimal planning in large markov decision processes. [sent-242, score-0.137]

98 Loss bounds for uncertain transition probabilities in markov decision processes. [sent-248, score-0.27]

99 : Robust control of markov decision processes with uncertain transition matrices. [sent-252, score-0.241]

100 Robust policy computation in reward-uncertain MDPs using nondominated policies. [sent-265, score-0.121]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('creg', 0.737), ('cer', 0.376), ('regret', 0.22), ('brw', 0.196), ('maximin', 0.181), ('uncertainties', 0.164), ('policy', 0.121), ('reg', 0.115), ('milp', 0.09), ('saa', 0.09), ('uncertain', 0.09), ('reward', 0.086), ('tqt', 0.08), ('mdps', 0.075), ('decision', 0.066), ('uncertainty', 0.066), ('policies', 0.06), ('proposition', 0.06), ('rt', 0.057), ('transition', 0.055), ('action', 0.053), ('brr', 0.045), ('inimize', 0.045), ('regan', 0.045), ('samples', 0.043), ('pruning', 0.042), ('substructure', 0.04), ('minimax', 0.04), ('supplement', 0.039), ('cumulative', 0.038), ('singapore', 0.034), ('horizon', 0.033), ('robust', 0.031), ('markov', 0.03), ('tpt', 0.03), ('transitional', 0.03), ('yossiri', 0.03), ('bounds', 0.029), ('mdp', 0.029), ('break', 0.028), ('jaillet', 0.027), ('dominated', 0.026), ('greedy', 0.026), ('actions', 0.025), ('minimizing', 0.025), ('dependent', 0.024), ('independence', 0.024), ('integer', 0.023), ('maxs', 0.023), ('nished', 0.023), ('myopic', 0.023), ('craig', 0.023), ('sampling', 0.023), ('max', 0.022), ('obstacles', 0.022), ('bagnell', 0.022), ('shie', 0.022), ('sample', 0.022), ('across', 0.022), ('andrew', 0.021), ('gap', 0.021), ('mixed', 0.021), ('cplex', 0.021), ('cdc', 0.02), ('inventory', 0.02), ('mins', 0.02), ('huan', 0.02), ('equation', 0.02), ('management', 0.02), ('payoff', 0.019), ('worst', 0.019), ('state', 0.019), ('maxa', 0.018), ('planning', 0.018), ('scalability', 0.018), ('transitioning', 0.018), ('kevin', 0.018), ('pseudo', 0.018), ('randomized', 0.017), ('scalable', 0.017), ('rq', 0.017), ('simulated', 0.017), ('epoch', 0.017), ('provide', 0.017), ('optimal', 0.017), ('employ', 0.017), ('posteriori', 0.016), ('massachusetts', 0.016), ('patrick', 0.016), ('exploit', 0.015), ('proof', 0.015), ('dynamic', 0.015), ('epochs', 0.015), ('min', 0.015), ('operations', 0.015), ('product', 0.014), ('employed', 0.014), ('variant', 0.014), ('kth', 0.014), ('expected', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes


2 0.17189975 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1

3 0.14318742 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Yasin Abbasi, Peter Bartlett, Varun Kanade, Yevgeny Seldin, Csaba Szepesvari

Abstract: We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves p O( T log |⇧| + log |⇧|) regret with respect to a comparison set of policies ⇧. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set ⇧ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. 1

4 0.1163988 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

5 0.11240064 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

Author: Sebastien Bubeck, Che-Yu Liu

Abstract: We consider the stochastic multi-armed bandit problem with a prior distribution on the reward distributions. We are interested in studying prior-free and priordependent regret bounds, very much in the same spirit than the usual distributionfree and distribution-dependent bounds for the non-Bayesian stochastic bandit. We first show that Thompson Sampling attains an optimal prior-free bound in the sense that for any prior distribution its Bayesian regret is bounded from above by √ 14 nK. This result is unimprovable in the sense that there exists a prior distribution such that any algorithm has a Bayesian regret bounded from below by √ 1 20 nK. We also study the case of priors for the setting of Bubeck et al. [2013] (where the optimal mean is known as well as a lower bound on the smallest gap) and we show that in this case the regret of Thompson Sampling is in fact uniformly bounded over time, thus showing that Thompson Sampling can greatly take advantage of the nice properties of these priors. 1

6 0.11194322 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

7 0.11148386 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

8 0.11049367 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

9 0.10580028 137 nips-2013-High-Dimensional Gaussian Process Bandits

10 0.098527826 347 nips-2013-Variational Planning for Graph-based MDPs

11 0.095362715 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

12 0.09243755 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

13 0.091899529 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards

14 0.089683883 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

15 0.088948026 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

16 0.088636152 230 nips-2013-Online Learning with Costly Features and Labels

17 0.087659553 325 nips-2013-The Pareto Regret Frontier

18 0.086152583 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

19 0.086043984 348 nips-2013-Variational Policy Search via Trajectory Optimization

20 0.085049219 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.147), (1, -0.224), (2, 0.017), (3, -0.007), (4, 0.001), (5, -0.04), (6, 0.025), (7, -0.026), (8, -0.032), (9, 0.0), (10, 0.004), (11, 0.034), (12, 0.011), (13, 0.015), (14, 0.06), (15, -0.007), (16, 0.036), (17, -0.016), (18, -0.004), (19, -0.015), (20, -0.028), (21, -0.004), (22, -0.0), (23, 0.026), (24, 0.043), (25, 0.028), (26, -0.054), (27, -0.026), (28, 0.008), (29, 0.017), (30, -0.022), (31, 0.005), (32, 0.011), (33, 0.038), (34, -0.038), (35, 0.038), (36, -0.008), (37, 0.011), (38, 0.049), (39, 0.05), (40, 0.018), (41, 0.001), (42, -0.041), (43, 0.023), (44, -0.037), (45, 0.004), (46, 0.037), (47, 0.02), (48, -0.037), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93451405 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes


2 0.88233197 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

Author: Ian Osband, Dan Russo, Benjamin Van Roy

Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an √ agent to encode prior knowledge ˜ in a natural way. We establish an O(τ S AT ) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1

3 0.88035625 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

Author: Shiau Hong Lim, Huan Xu, Shie Mannor

Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case.

4 0.77815992 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems

Author: Zheng Wen, Benjamin Van Roy

Abstract: We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class Q, OCP selects optimal actions over all but at most dim_E(Q) episodes, where dim_E denotes the eluder dimension. We establish further efficiency and asymptotic performance guarantees that apply even if Q* does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets.

5 0.72713107 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search

Author: Alexander Zimin, Gergely Neu

Abstract: We study the problem of online learning in finite episodic Markov decision processes (MDPs) where the loss function is allowed to change between episodes. The natural performance measure in this learning problem is the regret defined as the difference between the total loss of the best stationary policy and the total loss suffered by the learner. We assume that the learner is given access to a finite action space A and the state space X has a layered structure with L layers, so that state transitions are only possible between consecutive layers. We describe a variant of the recently proposed Relative Entropy Policy Search algorithm and show that its regret after T episodes is 2√(L|X||A|T log(|X||A|/L)) in the bandit setting and 2L√(T log(|X||A|/L)) in the full information setting, given that the learner has perfect knowledge of the transition probabilities of the underlying MDP. These guarantees largely improve previously known results under much milder assumptions and cannot be significantly improved under general assumptions.

6 0.72298807 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

7 0.67808717 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling

8 0.6673938 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

9 0.66581744 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

10 0.66280895 347 nips-2013-Variational Planning for Graph-based MDPs

11 0.63935512 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

12 0.63854462 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration

13 0.6266098 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

14 0.57824051 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

15 0.56967318 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

16 0.56719863 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

17 0.54821175 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

18 0.54708737 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries

19 0.54591417 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

20 0.54001468 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.02), (16, 0.037), (21, 0.286), (33, 0.112), (34, 0.092), (41, 0.029), (49, 0.019), (56, 0.132), (66, 0.017), (70, 0.019), (85, 0.046), (89, 0.043), (93, 0.027), (95, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74489713 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

Author: Asrar Ahmed, Pradeep Varakantham, Yossiri Adulyasak, Patrick Jaillet

Abstract: In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focused on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from the literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.

2 0.74226081 57 nips-2013-Beyond Pairwise: Provably Fast Algorithms for Approximate $k$-Way Similarity Search

Author: Anshumali Shrivastava, Ping Li

Abstract: We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R^{3way} = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|, S1, S2, S3 ∈ C, where C is a size-n collection of sets (or binary vectors). We show that approximate R^{3way} similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend the traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R^{3way} search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R^{3way} resemblance over the pairwise case in improving retrieval quality.

1 Introduction and Motivation

Similarity search (near neighbor search) is one of the fundamental problems in Computer Science. The task is to identify a small set of data points which are “most similar” to a given input query. Similarity search algorithms have been one of the basic building blocks in numerous applications including search, databases, learning, recommendation systems, computer vision, etc. One widely used notion of similarity on sets is the Jaccard similarity or resemblance [5, 10, 18, 20]. Given two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, the resemblance R^{2way} between S1 and S2 is defined as R^{2way} = |S1 ∩ S2| / |S1 ∪ S2|. Existing notions of similarity in search problems mainly work with pairwise similarity functions. In this paper, we go beyond this notion and look at the problem of k-way similarity search, where the similarity function of interest involves k sets (k ≥ 2). Our work exploits the fact that resemblance can be naturally extended to k-way resemblance similarity [18, 21], defined over k sets {S1, S2, ..., Sk} as R^{k-way} = |S1 ∩ S2 ∩ ... ∩ Sk| / |S1 ∪ S2 ∪ ... ∪ Sk|.

Binary high-dimensional data: The current web datasets are typically binary, sparse, and extremely high-dimensional, largely due to the wide adoption of the “Bag of Words” (BoW) representations for documents and images. It is often the case, in BoW representations, that just the presence or absence (0/1) of specific feature words captures sufficient information [7, 16, 20], especially with (e.g.) 3-grams or higher-order models. And so, the web can be imagined as a giant storehouse of ultra high-dimensional sparse binary vectors. Of course, binary vectors can also be equivalently viewed as sets (containing locations of the nonzero features). We list four practical scenarios where k-way resemblance search would be a natural choice.

(i) Google Sets: (http://googlesystem.blogspot.com/2012/11/google-sets-still-available.html) Google Sets is among the earliest google projects, which allows users to generate a list of similar words by typing only a few related keywords. For example, if the user types “mazda” and “honda” the application will automatically generate related words like “bmw”, “ford”, “toyota”, etc. This application is currently available in google spreadsheet. If we assume the term-document binary representation of each word w in the database, then given query w1 and w2, we show that |w1 ∩ w2 ∩ w| / |w1 ∪ w2 ∪ w| turns out to be a very good similarity measure for this application (see Section 7.1).

(ii) Joint recommendations: Users A and B would like to watch a movie together. The profile of each person can be represented as a sparse vector over a giant universe of attributes. For example, a user profile may be the set of actors, actresses, genres, directors, etc, which she/he likes. On the other hand, we can represent a movie M in the database over the same universe based on attributes associated with the movie. If we have to recommend movie M, jointly to users A and B, then a natural measure to maximize is |A ∩ B ∩ M| / |A ∪ B ∪ M|. The problem of group recommendation [3] is applicable in many more settings such as recommending people to join circles, etc.

(iii) Improving retrieval quality: We are interested in finding images of a particular type of object, and we have two or three (possibly noisy) representative images. In such a scenario, a natural expectation is that retrieving images simultaneously similar to all the representative images should be more refined than just retrieving images similar to any one of them. In Section 7.2, we demonstrate that in cases where we have more than one element to search for, we can refine our search quality using k-way resemblance search. In a dynamic feedback environment [4], we can improve subsequent search quality by using k-way similarity search on the pages already clicked by the user.

(iv) Beyond pairwise clustering: While machine learning algorithms often utilize the data through pairwise similarities (e.g., inner product or resemblance), there are natural scenarios where the affinity relations are not pairwise, but rather triadic, tetradic or higher [2, 30]. The computational cost, of course, will increase exponentially if we go beyond pairwise similarity.

Efficiency is crucial: With the data explosion in modern applications, the brute force way of scanning all the data for searching is prohibitively expensive, specially in user-facing applications like search. The need for k-way similarity search can only be fulfilled if it admits efficient algorithms. This paper fulfills this requirement for k-way resemblance and its derived similarities. In particular, we show fast algorithms with provable query time guarantees for approximate k-way resemblance search. Our algorithms and analysis naturally provide a framework to extend the classical LSH framework [14, 13] to handle higher-order similarities, which could be of independent theoretical interest.
Organization: In Section 2, we review approximate near neighbor search and classical Locality Sensitive Hashing (LSH). In Section 3, we formulate the 3-way similarity search problems. Sections 4, 5, and 6 describe provable fast algorithms for several search problems. Section 7 demonstrates the applicability of 3-way resemblance search in real applications.

2 Classical c-NN and Locality Sensitive Hashing (LSH)

Initial attempts at finding efficient (sub-linear time) algorithms for exact near neighbor search, based on space partitioning, turned out to be a disappointment with the massive dimensionality of current datasets [11, 28]. Approximate versions of the problem were proposed [14, 13] to break the linear query time bottleneck. One widely adopted formalism is the c-approximate near neighbor (c-NN).

Definition 1 (c-Approximate Near Neighbor or c-NN). Consider a set of points, denoted by P, in a D-dimensional space R^D, and parameters R0 > 0, δ > 0. The task is to construct a data structure which, given any query point q, if there exists an R0-near neighbor of q in P, reports some cR0-near neighbor of q in P with probability 1 − δ.

The usual notion of c-NN is for distance. Since we deal with similarities, we define an R0-near neighbor of point q as a point p with Sim(q, p) ≥ R0, where Sim is the similarity function of interest. Locality sensitive hashing (LSH) [14, 13] is a popular framework for c-NN problems. LSH is a family of functions, with the property that similar input objects in the domain of these functions have a higher probability of colliding in the range space than non-similar ones. In formal terms, consider H a family of hash functions mapping R^D to some set S.

Definition 2 (Locality Sensitive Hashing (LSH)).
A family H is called (R0, cR0, p1, p2)-sensitive if for any two points x, y ∈ R^D and h chosen uniformly from H, the following is satisfied:

• if Sim(x, y) ≥ R0 then Pr_H(h(x) = h(y)) ≥ p1;
• if Sim(x, y) ≤ cR0 then Pr_H(h(x) = h(y)) ≤ p2.

For approximate nearest neighbor search typically p1 > p2 and c < 1 is needed. Note, c < 1 as we are defining neighbors in terms of similarity. Basically, LSH trades off query time with extra preprocessing time and space which can be accomplished off-line.

Fact 1: Given a family of (R0, cR0, p1, p2)-sensitive hash functions, one can construct a data structure for c-NN with O(n^ρ log_{1/p2} n) query time and space O(n^{1+ρ}), where ρ = log(1/p1) / log(1/p2).

Minwise Hashing for Pairwise Resemblance: One popular choice of LSH family of functions associated with resemblance similarity is the Minwise Hashing family [5, 6, 13]. The Minwise Hashing family applies an independent random permutation π : Ω → Ω on the given set S ⊆ Ω, and looks at the minimum element under π, i.e. min(π(S)). Given two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1}, it can be shown by an elementary probability argument that

    Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R^{2way}.   (1)

The recent work on b-bit minwise hashing [20, 23] provides an improvement by storing only the lowest b bits of the hashed values min(π(S1)), min(π(S2)). [26] implemented the idea of building hash tables for near neighbor search, by directly using the bits from b-bit minwise hashing.

3 3-way Similarity Search Formulation

Our focus will remain on binary vectors which can also be viewed as sets. We illustrate our method using the 3-way resemblance similarity function Sim(S1, S2, S3) = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|. The algorithm and guarantees naturally extend to k-way resemblance. Given a size-n collection C ⊆ 2^Ω of sets (or binary vectors), we are particularly interested in the following three problems:

1. Given two query sets S1 and S2, find S3 ∈ C that maximizes Sim(S1, S2, S3).
2.
Given a query set S1, find two sets S2, S3 ∈ C maximizing Sim(S1, S2, S3).
3. Find three sets S1, S2, S3 ∈ C maximizing Sim(S1, S2, S3).

The brute force way of enumerating all possibilities leads to worst case query times of O(n), O(n^2) and O(n^3) for problems 1, 2 and 3, respectively. In a hope to break this barrier, just like the case of pairwise near neighbor search, we define the c-approximate (c < 1) versions of the above three problems. As in the case of c-NN, we are given two parameters R0 > 0 and δ > 0. For each of the following three problems, the guarantee is with probability at least 1 − δ:

1. (3-way c-Near Neighbor or 3-way c-NN) Given two query sets S1 and S2, if there exists S3 ∈ C with Sim(S1, S2, S3) ≥ R0, then we report some S3′ ∈ C so that Sim(S1, S2, S3′) ≥ cR0.
2. (3-way c-Close Pair or 3-way c-CP) Given a query set S1, if there exists a pair of sets S2, S3 ∈ C with Sim(S1, S2, S3) ≥ R0, then we report sets S2′, S3′ ∈ C so that Sim(S1, S2′, S3′) ≥ cR0.
3. (3-way c-Best Cluster or 3-way c-BC) If there exist sets S1, S2, S3 ∈ C with Sim(S1, S2, S3) ≥ R0, then we report sets S1′, S2′, S3′ ∈ C so that Sim(S1′, S2′, S3′) ≥ cR0.

4 Sub-linear Algorithm for 3-way c-NN

The basic philosophy behind sub-linear search is bucketing, which allows us to preprocess the dataset in a fashion so that we can filter many bad candidates without scanning all of them. LSH-based techniques rely on randomized hash functions to create buckets that probabilistically filter bad candidates. This philosophy is not restricted to binary similarity functions and is much more general. Here, we first focus on the 3-way c-NN problem for binary data.

Theorem 1: For R^{3way} c-NN one can construct a data structure with O(n^ρ log_{1/cR0} n) query time and O(n^{1+ρ}) space, where ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)).

The argument for 2-way resemblance can be naturally extended to k-way resemblance.
Specifically, given three sets S1, S2, S3 ⊆ Ω and an independent random permutation π : Ω → Ω, we have:

    Pr(min(π(S1)) = min(π(S2)) = min(π(S3))) = R^{3way}.   (2)

Eq. (2) shows that minwise hashing, although it operates on sets individually, preserves all 3-way (in fact k-way) similarity structure of the data. The existence of such a hash function is the key requirement behind the existence of efficient approximate search. For the pairwise case, the probability event was a simple hash collision, and the min-hash itself serves as the bucket index. In the case of the 3-way (and higher) c-NN problem, we have to take care of a more complicated event to create an indexing scheme. In particular, during preprocessing we need to create buckets for each individual S3, and while querying we need to associate the query sets S1 and S2 to the appropriate bucket. We need extra mechanisms to manipulate these minwise hashes to obtain a bucketing scheme.

Proof of Theorem 1: We use two additional functions: f1 : Ω → N for manipulating min(π(S3)), and f2 : Ω × Ω → N for manipulating both min(π(S1)) and min(π(S2)). Let a ∈ N+ such that |Ω| = D < 10^a. We define f1(x) = (10^a + 1) × x and f2(x, y) = 10^a x + y. This choice ensures that given query S1 and S2, for any S3 ∈ C, f1(min(π(S3))) = f2(min(π(S1)), min(π(S2))) holds if and only if min(π(S1)) = min(π(S2)) = min(π(S3)), and thus we get a bucketing scheme. To complete the proof, we introduce two integer parameters K and L. Define a new hash function by concatenating K events. To be more precise, while preprocessing, for every element S3 ∈ C create buckets g1(S3) = [f1(h1(S3)); ...; f1(hK(S3))] where hi is chosen uniformly from the minwise hashing family. For given query points S1 and S2, retrieve only points in the bucket g2(S1, S2) = [f2(h1(S1), h1(S2)); ...; f2(hK(S1), hK(S2))]. Repeat this process L times independently.
Any S3 ∈ C with Sim(S1, S2, S3) ≥ R0 is then retrieved with probability at least 1 − (1 − R0^K)^L. Using K = ⌈log n / log(1/cR0)⌉ and L = ⌈n^ρ log(1/δ)⌉, where ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)), the proof can be obtained using the standard concentration arguments used to prove Fact 1, see [14, 13]. It is worth noting that the probability guarantee parameter δ gets absorbed in the constants as log(1/δ). Note, the process is stopped as soon as we find some element with R^{3way} ≥ cR0. Theorem 1 can be easily extended to k-way resemblance with the same query time and space guarantees. Note that k-way c-NN is at least as hard as k*-way c-NN for any k* ≤ k, because we can always choose (k − k* + 1) identical query sets in k-way c-NN, and it reduces to the k*-way c-NN problem. So, any improvement in R^{3way} c-NN implies improvement in the classical min-hash LSH for Jaccard similarity. The proposed analysis is thus tight in this sense.

The above observation makes it possible to also perform the traditional pairwise c-NN search using the same hash tables deployed for 3-way c-NN. In the query phase we have an option: if we have two different queries S1, S2, then we retrieve from bucket g2(S1, S2) and that is the usual 3-way c-NN search. If we are just interested in pairwise near neighbor search given one query S1, then we look into bucket g2(S1, S1), and we know that the 3-way resemblance between S1, S1, S3 boils down to the pairwise resemblance between S1 and S3. So, the same hash tables can be used for both purposes. This property generalizes, and hash tables created for k-way c-NN can be used for any k*-way similarity search so long as k* ≤ k. The approximation guarantees still hold. This flexibility makes the k-way c-NN bucketing scheme more advantageous over the pairwise scheme.
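Both ingredients of the construction above — the k-way collision probability of Eq. (2) and the f1/f2 bucketing trick — can be sketched concretely. This is a toy illustration only (small universe, fixed K and L rather than the values Theorem 1 prescribes from n, c and R0; all names are hypothetical):

```python
import random

D = 100          # universe size; a = 3 works since D < 10^3
A = 10 ** 3

def f1(m3):                      # index key built from a database set's min-hash
    return (A + 1) * m3

def f2(m1, m2):                  # probe key built from the query pair's min-hashes
    return A * m1 + m2

# f1(m3) == f2(m1, m2) exactly when m1 == m2 == m3 (the 3-way collision event):
assert f1(7) == f2(7, 7) and f1(7) != f2(7, 8) and f1(7) != f2(8, 7)

rng = random.Random(0)

def make_perm():
    ranks = list(range(D))
    rng.shuffle(ranks)
    return ranks                 # element x -> rank ranks[x]

def minhash(s, perm):
    return min(perm[x] for x in s)

# Monte Carlo check of Eq. (2): collision probability == 3-way resemblance.
S1, S2, S3 = {0, 1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4, 5}
r3 = len(S1 & S2 & S3) / len(S1 | S2 | S3)            # = 2/6
hits = sum(minhash(S1, p) == minhash(S2, p) == minhash(S3, p)
           for p in (make_perm() for _ in range(20000)))
est = hits / 20000

# Toy index with K concatenated min-hashes and L independent tables.
K, L = 2, 10
tables = [[make_perm() for _ in range(K)] for _ in range(L)]

def build(collection):
    index = []
    for perms in tables:
        buckets = {}
        for name, s in collection.items():
            key = tuple(f1(minhash(s, p)) for p in perms)
            buckets.setdefault(key, []).append(name)
        index.append(buckets)
    return index

def query(index, s1, s2):
    found = set()
    for perms, buckets in zip(tables, index):
        key = tuple(f2(minhash(s1, p), minhash(s2, p)) for p in perms)
        found.update(buckets.get(key, ()))
    return found

data = {"close": {0, 1, 2, 3, 4}, "far": {50, 51, 52, 53}}
retrieved = query(build(data), {0, 1, 2, 3}, {1, 2, 3, 4})
```

With these toy sizes, "close" (3-way resemblance 3/5 with the query pair) collides per table with probability (3/5)^2 = 0.36, so it is retrieved with probability about 1 − 0.64^10 ≈ 0.99; "far" can never be retrieved, since it is disjoint from both queries and a permutation's minima over disjoint sets are always distinct.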
One of the peculiarities of LSH based techniques is that the query complexity exponent ρ < 1 is dependent on the choice of the threshold R0 we are interested in and the value of c, which is the approximation ratio that we will tolerate. Figure 1 plots ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)) with respect to c, for selected R0 values from 0.01 to 0.99. For instance, if we are interested in highly similar pairs, i.e. R0 ≈ 1, then we are looking at near O(log n) query complexity for the c-NN problem as ρ ≈ 0. On the other hand, for a very low threshold R0, there is not much hope of time-saving because ρ is close to 1.

[Figure 1: ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)) plotted against c, for R0 ranging from 0.01 to 0.99.]

5 Other Efficient k-way Similarities

We refer to the k-way similarities for which there exist sub-linear algorithms for c-NN search with query and space complexity exactly as given in Theorem 1 as efficient. We have demonstrated the existence of one such example of efficient similarities, which is the k-way resemblance. This leads to a natural question: “Are there more of them?”. [9] analyzed all the transformations on similarities that preserve the existence of efficient LSH search. In particular, they showed that if S is a similarity for which there exists an LSH family, then there also exists an LSH family for any similarity which is a probability generating function (PGF) transformation on S. A PGF transformation on S is defined as PGF(S) = Σ_{i≥1} p_i S^i, where S ∈ [0, 1] and p_i ≥ 0 satisfies Σ_{i≥1} p_i = 1. A similar theorem can also be shown in the case of 3-way resemblance.

Theorem 2: Any PGF transformation on the 3-way resemblance R^{3way} is efficient.
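The tradeoff depicted in Figure 1 is easy to tabulate directly; a small sketch of the query exponent from Theorem 1:

```python
import math

def rho(c, r0):
    """Query exponent for 3-way c-NN: rho = 1 - log(1/c) / (log(1/c) + log(1/R0))."""
    return 1 - math.log(1 / c) / (math.log(1 / c) + math.log(1 / r0))

# High similarity threshold -> near-logarithmic query time (rho ~ 0);
# low threshold -> rho approaches 1 and little is saved over a linear scan.
table = {r0: round(rho(0.5, r0), 3) for r0 in (0.99, 0.5, 0.01)}
```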
Recall in the proof of Theorem 1, we created hash assignments f1(min(π(S3))) and f2(min(π(S1)), min(π(S2))), which lead to a bucketing scheme for the 3-way resemblance search, where the collision event E = {f1(min(π(S3))) = f2(min(π(S1)), min(π(S2)))} happens with probability Pr(E) = R^{3way}. To prove the above Theorem 2, we will need to create hash events having probability PGF(R^{3way}) = Σ_{i≥1} p_i (R^{3way})^i. Note that 0 ≤ PGF(R^{3way}) ≤ 1. We will make use of the following simple lemma.

Lemma 1: (R^{3way})^n is efficient for all n ∈ N.

Proof: Define new hash assignments g1^n(S3) = [f1(h1(S3)); ...; f1(hn(S3))] and g2^n(S1, S2) = [f2(h1(S1), h1(S2)); ...; f2(hn(S1), hn(S2))]. The collision event g1^n(S3) = g2^n(S1, S2) has probability (R^{3way})^n. We now use the pair <g1^n, g2^n> instead of <f1, f2> and obtain the same guarantees, as in Theorem 1, for (R^{3way})^n as well.

Proof of Theorem 2: From Lemma 1, let <g1^i, g2^i> be the hash pair corresponding to (R^{3way})^i as used in the above lemma. We sample one hash pair from the set {<g1^i, g2^i> : i ∈ N}, where the probability of sampling <g1^i, g2^i> is proportional to p_i. Note that p_i ≥ 0 and satisfies Σ_{i≥1} p_i = 1, and so the above sampling is valid. It is not difficult to see that the collision of the sampled hash pair has probability exactly Σ_{i≥1} p_i (R^{3way})^i.

Theorem 2 can be naturally extended to k-way similarity for any k ≥ 2. Thus, we now have infinitely many k-way similarity functions admitting efficient sub-linear search. One, that might be interesting because of its radial basis kernel like nature, is shown in the following corollary.

Corollary 1: e^{R^{k-way} − 1} is efficient.

Proof: Use the expansion of e^{R^{k-way}}, normalized by e, to see that e^{R^{k-way} − 1} is a PGF on R^{k-way}.

6 Fast Algorithms for 3-way c-CP and 3-way c-BC Problems

For the 3-way c-CP and 3-way c-BC problems, using a bucketing scheme with the minwise hashing family will save even more computations.
Theorem 3: For the R^{3way} c-Close Pair Problem (or c-CP) one can construct a data structure with O(n^{2ρ} log_{1/cR0} n) query time and O(n^{1+2ρ}) space, where ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)).

Note that we can switch the roles of f1 and f2 in the proof of Theorem 1. We are thus left with a c-NN problem with search space O(n^2) (all pairs) instead of n. A bit of analysis, similar to Theorem 1, will show that this procedure achieves the required query time O(n^{2ρ} log_{1/cR0} n), but uses a lot more space, O(n^{2(1+ρ)}), than shown in the above theorem. It turns out that there is a better way of doing c-CP that saves us space.

Proof of Theorem 3: We again start with constructing hash tables. For every element Sc ∈ C, we create a hash-table and store Sc in bucket B(Sc) = [h1(Sc); h2(Sc); ...; hK(Sc)], where hi is chosen uniformly from the minwise independent family of hash functions H. We create L such hash-tables. For a query element Sq we look for all pairs in bucket B(Sq) = [h1(Sq); h2(Sq); ...; hK(Sq)] and repeat this for each of the L tables. Note, we do not form pairs of elements retrieved from different tables as they do not satisfy Eq. (2). If there exists a pair S1, S2 ∈ C with Sim(Sq, S1, S2) ≥ R0, using Eq. (2), we can see that we will find that pair in bucket B(Sq) with probability 1 − (1 − R0^K)^L. Here, we cannot use the traditional choice of K and L, similar to what we did in Theorem 1, as there are O(n^2) instead of O(n) possible pairs. We instead use K = ⌈2 log n / log(1/cR0)⌉ and L = ⌈n^{2ρ} log(1/δ)⌉, with ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)). With this choice of K and L, the result follows. Note, the process is stopped as soon as we find pairs S1 and S2 with Sim(Sq, S1, S2) ≥ cR0. The key argument that saves space from O(n^{2(1+ρ)}) to O(n^{1+2ρ}) is that we hash the n points individually. Eq.
(2) makes it clear that hashing all possible pairs is not needed when every point can be processed individually, and pairs formed within each bucket itself filter out most of the unnecessary combinations.

Theorem 4: For the R^{3way} c-Best Cluster Problem (or c-BC) there exists an algorithm with running time O(n^{1+2ρ} log_{1/cR0} n), where ρ = 1 − log(1/c) / (log(1/c) + log(1/R0)).

An argument similar to the one used in the proof of Theorem 3 leads to a running time of O(n^{1+3ρ} log_{1/cR0} n), as we need L = O(n^{3ρ}) and we have to process all points at least once.

Proof of Theorem 4: Repeat the c-CP problem n times, with every element in collection C acting as the query once. We use the same set of hash tables and hash functions every time. The preprocessing time is O(n^{1+2ρ} log_{1/cR0} n) evaluations of hash functions and the total querying time is O(n × n^{2ρ} log_{1/cR0} n), which makes the total running time O(n^{1+2ρ} log_{1/cR0} n). For the k-way c-BC Problem, we can achieve O(n^{1+(k−1)ρ} log_{1/cR0} n) running time.

If we are interested in very high similarity clusters, with R0 ≈ 1, then ρ ≈ 0 and the running time is around O(n log n). This is a huge saving over the brute force O(n^k). In most practical cases, specially in the big data regime where we have an enormous amount of data, we can expect the k-way similarity of good clusters to be high and finding them should be efficient. We can see that with increasing k, hashing techniques save more computations.

7 Experiments

In this section, we demonstrate the usability of 3-way and higher-order similarity search using (i) Google Sets, and (ii) improving retrieval quality.

7.1 Google Sets: Generating Semantically Similar Words

Here, the task is to retrieve words which are “semantically” similar to the given set of query words. We collected 1.2 million random documents from Wikipedia and created a standard term-doc binary vector representation of each term present in the collected documents after removing standard stop words and punctuation marks.
More specifically, every word is represented as a 1.2-million-dimensional binary vector indicating its presence or absence in the corresponding document. The total number of terms (or words) was around 60,000 in this experiment. Since there is no standard benchmark available for this task, we show qualitative evaluations. For querying, we used the following four pairs of semantically related words: (i) “jaguar” and “tiger”; (ii) “artificial” and “intelligence”; (iii) “milky” and “way”; (iv) “finger” and “lakes”. Given the query words w1 and w2, we compare the results obtained by the following four methods.

• Google Sets: We use Google’s algorithm and report 5 words from Google spreadsheets [1]. This is Google’s algorithm which uses its own data.
• 3-way Resemblance (3-way): We use the 3-way resemblance |w1 ∩ w2 ∩ w| / |w1 ∪ w2 ∪ w| to rank every word w and report the top 5 words based on this ranking.
• Sum Resemblance (SR): Another intuitive method is to use the sum of pairwise resemblances |w1 ∩ w| / |w1 ∪ w| + |w2 ∩ w| / |w2 ∪ w| and report the top 5 words based on this ranking.
• Pairwise Intersection (PI): We first retrieve the top 100 words based on pairwise resemblance for each of w1 and w2 independently. We then report the words common to both. If there is no word in common we do not report anything.

The results in Table 1 demonstrate that using 3-way resemblance retrieves reasonable candidates for these four queries. An interesting query is “finger” and “lakes”. Finger Lakes is a region in upstate New York. Google could only relate it to New York, while 3-way resemblance could even retrieve the names of cities and lakes in the region. Also, for the query “milky” and “way”, we can see some (perhaps) unrelated words like “dance” returned by Google. We do not see such random behavior with 3-way resemblance. Although we are not aware of the algorithm and the dataset used by Google, we can see that 3-way resemblance appears to be the right measure for this application.
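The 3-way ranking used in this experiment reduces to a few set operations over term-document incidence sets. A minimal sketch with hypothetical toy data (the real experiment used 1.2 million Wikipedia documents and ~60,000 terms):

```python
def three_way_resemblance(w1, w2, w):
    """|w1 ∩ w2 ∩ w| / |w1 ∪ w2 ∪ w| over term-document incidence sets."""
    return len(w1 & w2 & w) / len(w1 | w2 | w)

# Hypothetical term-doc sets: each word -> ids of documents containing it.
docs = {
    "jaguar":  {1, 2, 3, 4, 8},
    "tiger":   {2, 3, 4, 5, 9},
    "leopard": {2, 3, 4, 6},      # co-occurs with both query words
    "bmw":     {1, 8, 10},        # co-occurs with "jaguar" (the car) only
}
q1, q2 = docs["jaguar"], docs["tiger"]
ranked = sorted((w for w in docs if w not in ("jaguar", "tiger")),
                key=lambda w: three_way_resemblance(q1, q2, docs[w]),
                reverse=True)     # -> ["leopard", "bmw"]
```

The denominator is what keeps globally frequent words from dominating the ranking, which is the point made about 3-way resemblance versus the sum-of-pairwise heuristic.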
The above results also illustrate the problem with using the sum of pairwise similarity method. The similarity value with one of the words dominates the sum, and hence we see for the query “artificial” and “intelligence” that all the retrieved words are mostly related to the word “intelligence”. The same is the case with the query “finger” and “lakes” as well as “jaguar” and “tiger”. Note that “jaguar” is also a car brand. In addition, for all 4 queries, there was no common word in the top 100 words similar to each query word individually, and so the PI method never returns anything.

Table 1: Top five words retrieved using various methods for different queries.

“jaguar” and “tiger”:
  Google: LION, LEOPARD, CHEETAH, CAT, DOG
  3-way:  LEOPARD, CHEETAH, LION, PANTHER, CAT
  SR:     CAT, LEOPARD, LITRE, BMW, CHASIS
  PI:     (nothing returned)

“milky” and “way”:
  Google: DANCE, STARS, SPACE, THE, UNIVERSE
  3-way:  GALAXY, STARS, EARTH, LIGHT, SPACE
  SR:     EVEN, ANOTHER, STILL, BACK, TIME
  PI:     (nothing returned)

“artificial” and “intelligence”:
  Google: COMPUTER, PROGRAMMING, INTELLIGENT, SCIENCE, ROBOTICS
  3-way:  COMPUTER, SCIENCE, SECRET, ROBOT, TECHNOLOGY
  SR:     SECURITY, WEAPONS, HUMAN, ATTACKS, HUMAN
  PI:     (nothing returned)

“finger” and “lakes”:
  Google: NEW, YORK, NY, PARK, CITY
  3-way:  SENECA, CAYUGA, ERIE, ROCHESTER, IROQUOIS
  SR:     RIVERS, FRESHWATER, FISH, STREAMS, FORESTED
  PI:     (nothing returned)

We should note the importance of the denominator term in 3-way resemblance, without which frequent words would be blindly favored. The exciting contribution of this paper is that 3-way resemblance similarity search admits provable sub-linear guarantees, making it an ideal choice. On the other hand, no such provable guarantees are known for SR and other heuristic based search methods.

7.2 Improving Retrieval Quality in Similarity Search

We also demonstrate how the retrieval quality of traditional similarity search can be boosted by utilizing more query candidates instead of just one.
For the evaluations we chose two public datasets, MNIST and WEBSPAM, which were used in a recent related paper [26] on near-neighbor search with binary data using b-bit minwise hashing [20, 23]. The two datasets reflect the diversity, in both task and scale, that is encountered in practice. The MNIST dataset consists of handwritten digit samples; each sample is a 28 × 28 pixel image, yielding a 784-dimensional vector with an associated class label (digit 0–9). We binarize the data by setting all nonzero entries to 1. We used the standard partition of MNIST, which consists of 10,000 samples in one set and 60,000 in the other. The WEBSPAM dataset, with 16,609,143 features, consists of sparse vector representations of emails labeled as spam or not. We randomly sampled 70,000 data points and partitioned them into two independent sets of 35,000 each.

Table 2: Percentage of top candidates with the same label as the query, retrieved using various similarity criteria. Higher indicates better retrieval quality.

             MNIST                        WEBSPAM
Top          1      10     20     50      1      10     20     50
Pairwise     94.20  92.33  91.10  89.06   98.45  96.94  96.46  95.12
3-way NNbor  96.90  96.13  95.36  93.78   99.75  98.68  97.80  96.11
4-way NNbor  97.70  96.89  96.28  95.10   99.90  98.87  98.15  96.45

For evaluation, we need to generate potential query candidates for k-way search: it makes no sense to search for an object simultaneously similar to two very different objects. To generate such candidates, we took one independent set of the data and partitioned it according to the class labels. We then ran a cheap k-means clustering on each class and randomly sampled triplets < x1, x2, x3 > from each cluster for evaluating 2-way, 3-way, and 4-way similarity search. For the MNIST dataset, the standard 10,000-sample test set was partitioned by label into 10 sets; each partition was then clustered into 10 clusters, and we chose 10 triplets randomly from each cluster.
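The candidate-generation step above can be sketched as follows. This is a simplified, hypothetical snippet: the vectors are toy data, and triplets are drawn within a class label rather than within the per-class k-means clusters the paper uses to keep triplet members close.

```python
import random

def binarize(vec):
    """Set every nonzero feature to 1, as done for MNIST here."""
    return [1 if v else 0 for v in vec]

def to_set(bvec):
    """View a binary vector as the set of indices of its 1-entries."""
    return {i for i, v in enumerate(bvec) if v}

def sample_triplets(points, labels, n_triplets, rng):
    """Sample query triplets <x1, x2, x3> whose members share a class label."""
    by_label = {}
    for idx, y in enumerate(labels):
        by_label.setdefault(y, []).append(idx)
    classes = sorted(by_label)
    triplets = []
    for _ in range(n_triplets):
        y = rng.choice(classes)
        i, j, k = rng.sample(by_label[y], 3)  # three distinct indices
        triplets.append((points[i], points[j], points[k]))
    return triplets

rng = random.Random(0)
points = [to_set(binarize(v)) for v in
          [[0, 2, 5, 0], [0, 1, 3, 0], [0, 4, 1, 1],   # class 0
           [7, 0, 0, 1], [2, 0, 0, 3], [9, 0, 0, 2]]]  # class 1
labels = [0, 0, 0, 1, 1, 1]
triplets = sample_triplets(points, labels, 4, rng)
```

Each generated triplet draws all three members from the same class, mirroring the requirement that query candidates not be "very different objects".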
In all we had 100 such triplets for each class, and thus 1,000 query triplets overall. For WEBSPAM, which has only 2 classes, we chose one of the independent sets and performed the same procedure, selecting 100 triplets from each cluster. This gives 1,000 triplets per class, for a total of 2,000 query triplets. The above procedure ensures that the elements in each triplet < x1, x2, x3 > are not very far from each other and share the same class label. For each triplet < x1, x2, x3 >, we sort all points x in the other independent set by the following criteria:

• Pairwise: We use only the information in x1 and rank x by the resemblance |x1 ∩ x| / |x1 ∪ x|.
• 3-way NN: We rank x by the 3-way resemblance |x1 ∩ x2 ∩ x| / |x1 ∪ x2 ∪ x|.
• 4-way NN: We rank x by the 4-way resemblance |x1 ∩ x2 ∩ x3 ∩ x| / |x1 ∪ x2 ∪ x3 ∪ x|.

We look at the top 1, 10, 20, and 50 points under each ordering. Since all members of a query triplet share the same label, the percentage of top retrieved candidates having the same label as the query items is a natural metric for retrieval quality. These percentages, accumulated over all triplets, are summarized in Table 2. The top candidates retrieved by 3-way resemblance, using 2 query points, are of better quality than those from vanilla pairwise similarity search; 4-way resemblance, with 3 query points, further improves on 3-way. This clearly demonstrates that multi-way resemblance similarity search is preferable whenever we have more than one representative query in mind. Note that for MNIST, which contains 10 classes, the boost over pairwise retrieval is substantial. The results follow a consistent trend.
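A minimal sketch of this evaluation metric, on made-up toy sets and labels (not the paper's data or code): rank a database by k-way resemblance with the query sets and measure the fraction of top-k results sharing the query's label.

```python
def resemblance(*sets):
    """k-way resemblance: |intersection| / |union|."""
    union = set.union(*sets)
    return len(set.intersection(*sets)) / len(union) if union else 0.0

def topk_label_precision(queries, database, k):
    """queries: list of (tuple_of_query_sets, label); database: list of
    (set, label). Returns the accumulated percentage of the top-k
    retrieved items whose label matches the query's."""
    hits = total = 0
    for qsets, qlabel in queries:
        ranked = sorted(database,
                        key=lambda item: resemblance(*qsets, item[0]),
                        reverse=True)
        for _, label in ranked[:k]:
            hits += (label == qlabel)
            total += 1
    return 100.0 * hits / total

# Toy classes: "a" items live around {1,2,3}; "b" items around {7,8,9}.
database = [({1, 2, 3}, "a"), ({1, 2}, "a"), ({2, 3}, "a"),
            ({7, 8}, "b"), ({8, 9}, "b"), ({7, 9}, "b")]
triplet = ({1, 2, 3}, {1, 2}, {2, 3})          # same-class query triplet
pairwise = topk_label_precision([(triplet[:1], "a")], database, 3)
threeway = topk_label_precision([(triplet[:2], "a")], database, 3)
fourway  = topk_label_precision([(triplet,     "a")], database, 3)
```

Using one, two, or three query sets corresponds to the Pairwise, 3-way NN, and 4-way NN columns of Table 2; on real data the multi-way rankings yield higher precision.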
8 Future Work

While the work presented in this paper is promising for efficient 3-way and k-way similarity search in binary high-dimensional data, there are numerous interesting and practical research problems we can study as future work. In this section, we mention a few such examples.

One-permutation hashing. Traditionally, building hash tables for near-neighbor search has required many (e.g., 1000) independent hashes. This is both time- and energy-consuming, for building the tables as well as for processing unseen queries. One-permutation hashing [22] offers the hope of reducing the many permutations to merely one. The version in [22], however, was not applicable to near-neighbor search because of its many empty bins (which offer no indexing capability). The most recent work [27] is able to fill the empty bins and works well for pairwise near-neighbor search; it would be interesting to extend [27] to k-way search.

Non-binary sparse data. This paper focuses on minwise hashing for binary data. Various extensions to real-valued data are possible. For example, our results naturally apply to consistent weighted sampling [25, 15], which is one way to handle non-binary sparse data. The problem, however, is not solved if we are interested in similarities such as (normalized) k-way inner products, although the line of work on Conditional Random Sampling (CRS) [19, 18] may be promising. CRS works on non-binary sparse data by storing a bottom subset of nonzero entries after applying one permutation to the (real-valued) sparse data matrix. CRS performs very well for certain applications, but it does not work in our context because the bottom (nonzero) subsets are not properly aligned.

Building hash tables directly from the bits of minwise hashing. This would be a different approach from the way hash tables are constructed in this paper.
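The probabilistic-hashing idea behind minwise hashing extends beyond the pairwise case: under a random permutation of the universe, the probability that the smallest elements of S1, ..., Sk all coincide equals the k-way resemblance |S1 ∩ ... ∩ Sk| / |S1 ∪ ... ∪ Sk|. A small simulation on toy sets (an illustrative sketch, not the paper's implementation) checks this:

```python
import random

def minhash_collision_rate(sets, n_perm, seed=0):
    """Estimate Pr[min h(S1) = ... = min h(Sk)] over random permutations
    of the universe of elements."""
    rng = random.Random(seed)
    universe = sorted(set.union(*sets))
    hits = 0
    for _ in range(n_perm):
        # random.sample of the full universe acts as a random permutation
        order = {x: r for r, x in enumerate(rng.sample(universe, len(universe)))}
        mins = {min(s, key=order.__getitem__) for s in sets}
        hits += (len(mins) == 1)  # all k minima coincide
    return hits / n_perm

S1, S2, S3 = {1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}
exact = len(S1 & S2 & S3) / len(S1 | S2 | S3)          # 3-way resemblance = 2/6
estimate = minhash_collision_rate([S1, S2, S3], 20000)  # close to 1/3
```

The collision happens exactly when the smallest element of the union lies in the intersection, which is why the rate equals the k-way resemblance; this is the property that lets the LSH machinery extend to k-way search.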
For example, [26] directly used the bits from b-bit minwise hashing [20, 23] to build hash tables and demonstrated significant advantages over sim-hash [8, 12] and spectral hashing [29]. It would be interesting to see how this approach performs in k-way similarity search.

k-way sign random projections. It would be very useful to develop a theory for k-way sign random projections. For the usual (real-valued) random projections, it is known that the volume (which is related to the determinant) is approximately preserved [24, 17]. We speculate that the collision probability of k-way sign random projections might also be a (monotonic) function of the determinant.

9 Conclusions

We formulate a new framework for k-way similarity search and obtain fast algorithms in the case of k-way resemblance with provable worst-case approximation guarantees. We show some applications of k-way resemblance search in practice and demonstrate the advantages over traditional search. Our analysis involves the idea of probabilistic hashing and extends the well-known LSH family beyond the pairwise case. We believe the idea of probabilistic hashing still has a long way to go.

Acknowledgement

The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. Ping Li thanks Kenneth Church for introducing Google Sets to him in the summer of 2004 at Microsoft Research.

References

[1] http://www.howtogeek.com/howto/15799/how-to-use-autofill-on-a-google-docs-spreadsheet-quick-tips/.
[2] S. Agarwal, Jongwoo Lim, L. Zelnik-Manor, P. Perona, D. Kriegman, and S. Belongie. Beyond pairwise clustering. In CVPR, 2005.
[3] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawlat, Gautam Das, and Cong Yu. Group recommendation: semantics and efficiency. Proc. VLDB Endow., 2(1):754–765, 2009.
[4] Christina Brandt, Thorsten Joachims, Yisong Yue, and Jacob Bank. Dynamic ranked retrieval. In WSDM, pages 247–256, 2011.
[5] Andrei Z. Broder.
On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21–29, Positano, Italy, 1997.
[6] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.
[7] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999.
[8] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[9] Flavio Chierichetti and Ravi Kumar. LSH-preserving functions and their applications. In SODA, 2012.
[10] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[11] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.
[12] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, 1995.
[13] Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(14):321–350, 2012.
[14] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604–613, Dallas, TX, 1998.
[15] Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, 2010.
[16] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, Amsterdam, Netherlands, 2007.
[17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Technical report, arXiv:1207.6083, 2013.
[18] Ping Li and Kenneth W. Church.
A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics (preliminary results appeared in HLT/EMNLP 2005), 33(3):305–354, 2007.
[19] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, Vancouver, Canada, 2006.
[20] Ping Li and Arnd Christian König. b-bit minwise hashing. In Proceedings of the 19th International Conference on World Wide Web, pages 671–680, Raleigh, NC, 2010.
[21] Ping Li, Arnd Christian König, and Wenhao Gui. b-bit minwise hashing for estimating three-way similarities. In NIPS, Vancouver, Canada, 2010.
[22] Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.
[23] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice. In Internetware, Changsha, China, 2013.
[24] Avner Magen and Anastasios Zouzias. Near optimal dimensionality reductions that preserve volumes. In APPROX / RANDOM, pages 523–534, 2008.
[25] Mark Manasse, Frank McSherry, and Kunal Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.
[26] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML, Bristol, UK, 2012.
[27] Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.
[28] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194–205, 1998.
[29] Yair Weiss, Antonio Torralba, and Robert Fergus. Spectral hashing. In NIPS, Vancouver, Canada, 2008.
[30] D. Zhou, J. Huang, and B. Schölkopf. Beyond pairwise classification and clustering using hypergraphs. In NIPS, Vancouver, Canada, 2006.

3 0.66045028 344 nips-2013-Using multiple samples to learn mixture models

Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana

Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1

4 0.59748846 326 nips-2013-The Power of Asymmetry in Binary Hashing

Author: Behnam Neyshabur, Nati Srebro, Ruslan Salakhutdinov, Yury Makarychev, Payman Yadollahpour

Abstract: When approximating binary similarity using the hamming distance between short binary hashes, we show that even if the similarity is symmetric, we can have shorter and more accurate hashes by using two distinct code maps. I.e. by approximating the similarity between x and x as the hamming distance between f (x) and g(x ), for two distinct binary codes f, g, rather than as the hamming distance between f (x) and f (x ). 1

5 0.58944112 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

Author: Vincent Q. Vu, Juhee Cho, Jing Lei, Karl Rohe

Abstract: We propose a novel convex relaxation of sparse principal subspace estimation based on the convex hull of rank-d projection matrices (the Fantope). The convex problem can be solved efficiently using alternating direction method of multipliers (ADMM). We establish a near-optimal convergence rate, in terms of the sparsity, ambient dimension, and sample size, for estimation of the principal subspace of a general covariance matrix without assuming the spiked covariance model. In the special case of d = 1, our result implies the near-optimality of DSPCA (d’Aspremont et al. [1]) even when the solution is not rank 1. We also provide a general theoretical framework for analyzing the statistical properties of the method for arbitrary input matrices that extends the applicability and provable guarantees to a wide array of settings. We demonstrate this with an application to Kendall’s tau correlation matrices and transelliptical component analysis. 1

6 0.58814001 249 nips-2013-Polar Operators for Structured Sparse Estimation

7 0.5879389 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

8 0.58527261 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

9 0.58482254 355 nips-2013-Which Space Partitioning Tree to Use for Search?

10 0.58406889 224 nips-2013-On the Sample Complexity of Subspace Learning

11 0.58352405 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization

12 0.58233279 252 nips-2013-Predictive PAC Learning and Process Decompositions

13 0.5822947 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

14 0.58215392 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

15 0.58143711 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity

16 0.58117747 233 nips-2013-Online Robust PCA via Stochastic Optimization

17 0.58051574 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

18 0.58039027 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

19 0.58028752 184 nips-2013-Marginals-to-Models Reducibility

20 0.58017099 318 nips-2013-Structured Learning via Logistic Regression