
322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs


Source: pdf

Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. [sent-5, score-0.429]

2 Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). [sent-6, score-0.937]

3 Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. [sent-7, score-0.612]

4 Empirical results show significantly improved scalability over state-of-the-art symbolic planners. [sent-10, score-0.216]

5 1 Introduction. We study symbolic dynamic programming (SDP) for Markov Decision Processes (MDPs) with exponentially large factored state and action spaces. [sent-11, score-0.588]

6 Most prior SDP work has focused on exact [1] and approximate [2, 3] solutions to MDPs with factored states, assuming just a handful of atomic actions. [sent-12, score-0.164]

7 In contrast to this, many applications are most naturally modeled as having factored actions described in terms of multiple action variables, which yields an exponential number of joint actions. [sent-13, score-0.474]

8 In recent work [4] we have extended SDP to factored actions by giving a symbolic VI algorithm that explicitly reasons about action variables. [sent-17, score-0.66]

9 The key bottleneck of that approach is the space and time complexity of computing symbolic Bellman backups, which requires reasoning about all actions at all states simultaneously. [sent-18, score-0.361]

10 We start by considering Modified Policy Iteration (MPI) [5], which adds a few policy evaluation steps between consecutive Bellman backups. [sent-20, score-0.457]

11 MPI is attractive for factored-action spaces because policy evaluation does not require reasoning about all actions at all states, but rather only about the current policy’s action at each state. [sent-21, score-0.741]

12 Existing work on symbolic MPI [6] assumes a small atomic action space and does not scale to factored actions. [sent-22, score-0.533]

13 Our first contribution (Section 3) is a new algorithm, Factored Action MPI (FA-MPI), that conducts exact policy evaluation steps by treating the policy as a constraint on normal Bellman backups. [sent-23, score-0.789]

14 While FA-MPI is shown to improve scalability compared to VI in some cases, we observed that in practice the strict enforcement of the policy constraint can cause the representation of value functions to become too large and dominate run time. [sent-24, score-0.457]

15 Our second and main contribution (Section 4) is to overcome this issue using a new backup operator that lies between policy evaluation and a Bellman backup (Figure 1 shows an example of a DBN MDP with factored actions). [sent-25, score-0.759]

16 This new algorithm, Opportunistic Policy Iteration (OPI), constrains a select subset of the actions in a way that guarantees that there is no growth in the representation of the value function. [sent-27, score-0.186]

17 2 MDPs with Factored State and Action Spaces. In a factored MDP M, the state space S and action space A are specified by finite sets of binary variables X = (X1, . . . , Xl) and A = (A1, . . . , Am). [sent-30, score-0.405]

18 The DBN model consists of a two-time-step graphical model that shows, for each next-state variable X' and the immediate reward, the set of current state and action variables, denoted by parents(X'). [sent-39, score-0.255]

19 The DBN encodes that the computers c1, c2 and c3 are arranged in a directed ring so that the running status of each is influenced by its reboot action and the status of its predecessor. [sent-46, score-0.346]
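
To make these ring dynamics concrete, the following small sketch (in Python) encodes one computer's transition probability as a function of its reboot action and its predecessor's status; the specific probability values are illustrative assumptions, not numbers taken from the paper.

    # Toy conditional probability for one computer in the unidirectional-ring
    # SysAdmin domain: P(running at next step | current status, predecessor, reboot).
    # The probabilities below are assumed for illustration only.
    def p_running_next(running: bool, pred_running: bool, reboot: bool) -> float:
        if reboot:
            return 0.95                          # a rebooted machine almost surely comes up
        if running:
            return 0.9 if pred_running else 0.5  # a healthy predecessor helps it stay up
        return 0.05                              # a failed machine rarely recovers by itself

    # Example: failed computer, running predecessor, no reboot action.
    print(p_running_next(False, True, False))    # 0.05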

20 SDP uses the compact MDP model to derive compact value functions by iterating symbolic Bellman backups that avoid enumerating all states. [sent-68, score-0.4]

21 In recent work, we extended SDP to factored action spaces by computing Bellman backups using an algorithm called Factored Action Regression (FAR) [4]. [sent-71, score-0.509]

22 Let T^Q(V) denote the backup operator that computes the next iterate of the Q-value function starting with value function V: T^Q(V) = R + γ Σ_{X'_1} P^{X'_1} · · · Σ_{X'_l} P^{X'_l} × primed(V). [sent-73, score-0.214]

23 Here primed(V) swaps the state variables X in the diagram V with the next-state variables X'. [sent-81, score-0.183]

24 Written in this way, where the domain dynamics are explicitly expressed in terms of action variables and where max_A = max_{A1,...,Am} is a symbolic marginalization operation over the action variables, we get the Factored Action Regression (FAR) algorithm [4]. [sent-87, score-0.184] [sent-90, score-0.389]
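
For reference, the quantity that the symbolic FAR backup computes can be written as a plain enumeration over factored states and actions. The sketch below does exactly that, and is therefore exponentially expensive; the point of FAR is to obtain the same result with ADD operations instead of enumeration. The factored-MDP interface (R, P) used here is a hypothetical one introduced only for this illustration.

    from itertools import product

    def q_backup(R, P, V, n_state_vars, n_action_vars, gamma=0.95):
        """Enumerative T^Q(V): Q(s,a) = R(s,a) + gamma * sum_{s'} prod_l P(l, s'_l, s, a) * V(s').
        R(s, a) -> reward, P(l, x, s, a) -> probability that X'_l = x, V[s'] -> value."""
        states = list(product((0, 1), repeat=n_state_vars))
        actions = list(product((0, 1), repeat=n_action_vars))
        Q = {}
        for s in states:
            for a in actions:
                expected = 0.0
                for s_next in states:
                    prob = 1.0
                    for l, x in enumerate(s_next):
                        prob *= P(l, x, s, a)
                    expected += prob * V[s_next]
                Q[(s, a)] = R(s, a) + gamma * expected
        return Q

    def bellman_backup(Q):
        """T(V): maximize Q(s, a) over the exponentially many joint actions."""
        V_new = {}
        for (s, a), q in Q.items():
            V_new[s] = max(V_new.get(s, float("-inf")), q)
        return V_new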

26 In the following, we use T() to denote a Bellman-like backup, where a superscript Q, as in T^Q(), denotes that actions are not maximized out, so the output is a function of state and actions, and a subscript, as in T^Q_π() defined below, denotes that the update is restricted to the actions in π. [sent-91, score-0.483]

27 Similarly Tπ () restricts to a (possibly partial) policy π and does not maximize over the unspecified action choice. [sent-92, score-0.548]

28 In this work we will build on Modified Policy Iteration (MPI), which generalizes value iteration and policy iteration, by interleaving k policy evaluation steps between successive Bellman backups [5]. [sent-93, score-0.958]

29 Here a policy evaluation step corresponds to iterating exact policy backups, denoted by Tπ where the action is prescribed by the policy π in each state. [sent-94, score-1.315]

30 MPI has the potential to speed up convergence over VI because, at least for flat action spaces, policy evaluation is considerably cheaper than full Bellman backups. [sent-95, score-0.585]
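
In flat (tabular) terms, the structure that MPI imposes is one Bellman backup, which also yields a greedy policy, followed by k evaluation sweeps of that policy; k = 0 recovers VI. A minimal tabular sketch is below; the symbolic algorithms in this paper compute the same quantities with ADD operations instead of explicit matrices.

    import numpy as np

    def modified_policy_iteration(P, R, gamma, k, n_iterations):
        """P[a]: |S| x |S| transition matrix, R[a]: length-|S| reward vector for action a."""
        n_states = R[0].shape[0]
        V = np.zeros(n_states)
        for _ in range(n_iterations):
            # One full Bellman backup; the argmax is the greedy policy.
            Q = np.stack([R[a] + gamma * P[a] @ V for a in range(len(P))])
            pi, V = Q.argmax(axis=0), Q.max(axis=0)
            # k policy-evaluation steps: back up only the action prescribed by pi.
            for _ in range(k):
                V = np.array([R[pi[s]][s] + gamma * P[pi[s]][s] @ V
                              for s in range(n_states)])
        return V, pi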

31 In addition, when k > 0, one might hope for larger jumps in policy improvement because the greedy action in T is based on a more accurate estimate of the value of the policy. [sent-96, score-0.569]

32 Interestingly, the first approach to symbolic planning in MDPs was a version of MPI for factored states called Structured Policy Iteration (SPI) [6], which was later adapted to relational problems [8]. [sent-97, score-0.432]

33 SPI represents the policy as a decision tree with state-variables labeling interior nodes and a concrete action as a leaf node. [sent-98, score-0.606]

34 The policy backup uses the graphical form of the policy. [sent-99, score-0.558]

35 In each such backup, for each leaf node (policy action) a in the policy tree, its Q-function Qa is computed and attached to the leaf. [sent-100, score-0.398]

36 Although SPI leverages the factored state representation, it represents the policy in terms of concrete joint actions, which fails to capture the structure among the action variables in FA-MDPs. [sent-101, score-0.77]

37 In addition, in factored action spaces this requires an explicit calculation of Q-functions for all joint actions. [sent-102, score-0.32]

38 Finally, the space required for policy backup can be prohibitive because each Q-function Qa is joined to each leaf of the policy. [sent-103, score-0.591]

39 SPI goes to great lengths in order to enforce a policy backup which, intuitively, ought to be much easier to compute than a Bellman backup. [sent-104, score-0.558]

40 In fact, we are not aware of any implementation of this algorithm that scales well for FA-MDPs or even for factored state spaces. [sent-105, score-0.2]

41 3 Factored Action MPI (FA-MPI). In this section, we introduce Factored Action MPI (FA-MPI), which uses a novel form of policy backup. [sent-107, score-0.365]

42 Each iteration of the outer while loop starts with one full Bellman backup using Equation 1. [sent-109, score-0.23]

43 The inner loop performs k steps of policy backups using a new algorithm described below that avoids enumerating all actions. [sent-112, score-0.498]

44 We represent the policy using a Binary Decision Diagram (BDD) with state and action variables where a leaf value of 1 denotes any combination of action variables that is the policy action, and a leaf value of −∞ indicates otherwise. [sent-113, score-1.284]
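
As a tiny illustration of this representation, the constraint used in the worked example later in the text ("A1 must be true whenever X2 is false") can be written as a two-node decision diagram with leaves 1 (consistent with the policy) and -inf (inconsistent). The (variable, true-branch, false-branch) tuple format below is an assumption of this sketch, not the paper's data structure, and is reused in the pruning sketch further down.

    import math

    X2, A1 = 0, 1                      # illustrative variable indices
    # If X2 is true the policy allows anything; if X2 is false, A1 must be true.
    pi = (X2, 1, (A1, 1, -math.inf))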

45 Using this representation, we perform policy backups using T^Q_π(V), given in Equation 2 below, followed by a max over the actions in the resulting diagram. [sent-114, score-0.625]

46 The proof uses the fact that (s, a) pairs that do not agree with the policy get a value −∞ via the constraints and therefore do not affect the maximum. [sent-130, score-0.386]
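
The same fact can be checked on an explicit Q table: adding the 0/-inf policy indicator to each (s, a) entry before maximizing over actions means that entries inconsistent with π can never win the max. A dict-based sketch, purely illustrative:

    import math

    def policy_constrained_max(Q, allowed):
        """Q: dict (s, a) -> value; allowed(s, a): True iff a agrees with the policy at s."""
        V = {}
        for (s, a), q in Q.items():
            q_constrained = q + (0.0 if allowed(s, a) else -math.inf)
            V[s] = max(V.get(s, -math.inf), q_constrained)
        return V

    # Two states, two actions; the policy always picks action 0.
    Q = {(0, 0): 1.0, (0, 1): 5.0, (1, 0): 2.0, (1, 1): 0.0}
    print(policy_constrained_max(Q, lambda s, a: a == 0))   # {0: 1.0, 1: 2.0}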

47 The next section shows how to approximate the backup in Equation 2 while ensuring no growth in the size of the ADD. [sent-135, score-0.212]

48 As seen in Figure 2, OPI is identical to FA-MPI except that it uses an alternative, more conservative policy backup. [sent-137, score-0.365]

49 The key idea in OPI is to enforce the policy constraint opportunistically, i.e., only when doing so does not increase the size of the value function representation. [sent-140, score-0.387]

50 In an exponential action space, we can sometimes expect a Bellman backup to be a coarser partitioning of the state variables than the value function of a given policy (e.g., two states that have the same value under the optimal action may have different values under the policy action). [sent-143, score-0.84] [sent-145, score-0.592]

52 In this case, enforcing the policy constraint via T^Q_π(V) is actually harmful in terms of the size of the representation. [sent-146, score-0.387]

53 OPI is motivated by retaining the coarseness of Bellman backups in some states, and otherwise enforcing the policy constraint. [sent-147, score-0.498]

54 The OPI backup is sensitive to the size of the value ADD so that it is guaranteed to be smaller than the results of both Bellman backup and policy backup. [sent-148, score-0.772]

55 First we describe the symbolic implementation of OPI . [sent-149, score-0.186]

56 The trade-off between policy evaluation and policy improvement is made via a pruning procedure (pseudo-code in Figure 3). [sent-150, score-0.826]

57 This procedure assigns a value of −∞ to only those paths in a value function ADD that violate the policy constraint π. [sent-151, score-0.464]

58 The interesting case is when the root variable of π is ordered below the root of D (and thus does not appear in D) so that the only way to violate the constraint is to violate both true and false branches. [sent-152, score-0.161]
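
A direct (unoptimized) rendering of this idea is sketched below: walk the value diagram D and replace a leaf by -inf only if the policy diagram π, restricted to the partial assignment on the path to that leaf, is violated on every completion — the "both true and false branches" condition above. Diagrams are the (variable, true-branch, false-branch) tuples introduced earlier; this is an illustrative reference implementation, not the paper's ADD-based procedure.

    import math
    NEG_INF = -math.inf

    def is_leaf(node):
        return not isinstance(node, tuple)

    def restrict(pi, var, value):
        """BDD restriction: fix `var` to `value` inside the policy diagram pi."""
        if is_leaf(pi):
            return pi
        v, hi, lo = pi
        if v == var:
            return hi if value else lo
        return (v, restrict(hi, var, value), restrict(lo, var, value))

    def always_violated(pi):
        """True iff every completion of the current partial assignment violates pi."""
        if is_leaf(pi):
            return pi == NEG_INF
        return always_violated(pi[1]) and always_violated(pi[2])

    def prune(D, pi):
        """Assign -inf to exactly those paths (leaves) of D that violate pi; D never grows."""
        if is_leaf(D):
            return NEG_INF if always_violated(pi) else D
        var, hi, lo = D
        return (var, prune(hi, restrict(pi, var, True)),
                     prune(lo, restrict(pi, var, False)))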

59 The novel backup introduced in OPI interleaves the application of pruning with the summation steps so as to prune the diagram as early as possible. [sent-176, score-0.319]

60 The backup used by OPI, which is shown in Figure 2, is T̂^Q_π(V) = P_π( P_π(R) + γ P_π( Σ_{X'_1} P^{X'_1} · · · P_π( Σ_{X'_l} P^{X'_l} × primed(V) ) · · · ) )   (3). Using the properties of P_π we can show that T̂^Q_π(V) overestimates the true backup of a policy, but is still bounded by the true value function. [sent-178, score-0.193] [sent-181, score-0.214]
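
The structure of Equation 3 — pruning interleaved with every sum-out step so the diagram is cut down as early as possible — can be written schematically as follows. The diagram primitives (multiply, sum_out, scale, add) stand for operations of some ADD package and are passed in as parameters; they are assumptions of this sketch, not definitions from the paper. prune_pi(D) prunes D against the policy constraint as in the sketch above.

    def opi_backup(R, cpts, V_primed, gamma, prune_pi, multiply, sum_out, scale, add):
        """Schematic OPI backup: cpts is a list of (next_state_var, cpt_diagram) pairs,
        one per X'_l."""
        D = prune_pi(V_primed)
        for var, cpt in reversed(cpts):          # sum out one next-state variable at a time
            D = prune_pi(sum_out(multiply(cpt, D), var))
        return prune_pi(add(prune_pi(R), scale(D, gamma)))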

62 The policy backup used by OPI is bounded between the full Bellman backup and the true policy backup, i.e., T^Q_π(V) ≤ T̂^Q_π(V) ≤ T^Q(V). [sent-183, score-1.116]

63 Since none of the value functions generated by OPI overestimate the optimal value function, it follows that both OPI and FA-MPI converge to the optimal policy under the same conditions as MPI [5]. [sent-186, score-0.407]

64 In terms of a flat MDP, OPI can be interpreted as sometimes picking a greedy off-policy action while evaluating a fixed policy, when the value function of the greedy policy is at least as good and more compact than that of the given policy. [sent-190, score-0.599]

65 Thus, OPI may be viewed as asynchronous policy iteration ([9]). [sent-191, score-0.423]

66 However, unlike traditional asynchronous PI, the policy improvement in OPI is motivated by the size of the representation, rather than any measure of the magnitude of improvement. [sent-192, score-0.386]

67 Suppose that π is a policy constraint that says that the action variable A1 must be true when the state variable X2 is false. [sent-195, score-0.606]

68 The backup T^Q(R) does not involve X2, and therefore pruning does not change the diagram: P_π(T^Q(R)) = T^Q(R). [sent-196, score-0.319]

69 Note that the improved policy (always set A1 ) is more compact than π, and so is its value. [sent-198, score-0.395]

70 5 Memory-Bounded OPI. Memory is usually a limiting factor for symbolic planning. [sent-200, score-0.186]

71 In [4] we proposed a symbolic memory bounded (MB) VI algorithm for FA-MDPs, which we refer to below as Memory Bounded Factored Action Regression (MBFAR). [sent-201, score-1.17]

72 Figure 5: An illustration where OPI computes an incorrect but more compact value function that is a partial policy improvement, on a small MDP with state variable X2 and one action variable A1: (a) a simple policy for the MDP, (b) the optimal policy backup, (c) the OPI backup, which has a smaller value-function representation. [sent-203, score-0.204] [sent-205, score-0.416]

74 The key idea is that a backup can be computed over a partially instantiated action, by fixing the value of an action variable. [sent-209, score-0.397]

75 MB-MPI is parametrized by k, the number of policy backups, and M , the maximum size (in nodes) of a Z-value function. [sent-214, score-0.365]
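
A schematic version of the memory-bounded maximization is given below: if the current diagram has more than M nodes, fix one action variable to each of its values, solve the two smaller sub-problems, and combine them with a pointwise max; otherwise maximize the remaining action variables out directly. The primitives restrict, max_out, pointwise_max, and size again stand for assumed ADD-package operations, so this is a sketch of the idea rather than the paper's implementation.

    def mb_max_actions(D, action_vars, M, restrict, max_out, pointwise_max, size):
        """Memory-bounded maximization of the action variables out of diagram D (sketch)."""
        if not action_vars:
            return D
        if size(D) <= M:
            for a in action_vars:                # small enough: maximize directly
                D = max_out(D, a)
            return D
        a, rest = action_vars[0], action_vars[1:]
        hi = mb_max_actions(restrict(D, a, True), rest, M, restrict, max_out, pointwise_max, size)
        lo = mb_max_actions(restrict(D, a, False), rest, M, restrict, max_out, pointwise_max, size)
        return pointwise_max(hi, lo)             # combine the two partially instantiated backups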

76 We heuristically chose to do the expectation over state variables in the top-down way, and maximization of action variables in the bottom-up way with respect to the variable ordering. [sent-226, score-0.263]

77 An instance of IC with n shops and m trucks has a joint state space of size 2^(2n) and an action space of size Σ_{i=0}^{m} C(n, i), respectively. [sent-233, score-0.297]
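
For instance, with the 8-shop instances used in the experiments and, say, m = 2 trucks (an assumed value for illustration), these counts work out as follows.

    from math import comb

    n, m = 8, 2                                           # 8 shops; m = 2 trucks is illustrative
    num_states = 2 ** (2 * n)                             # joint state space: 2^(2n)
    num_actions = sum(comb(n, i) for i in range(m + 1))   # sum_{i=0}^{m} C(n, i)
    print(num_states, num_actions)                        # 65536 37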

78 Each computer is either running (reward of +1) or failed (reward of 0) so that |S| = 2^n, and each computer has an associated deterministic action of rebooting (with a cost of -0. [sent-236, score-0.234]

79 A running computer that is not being rebooted may fail, with probability depending on the status of its predecessor.

80 Figure 6: Impact of policy evaluation, parallel actions vs. solution time (in minutes) for VI, OPI(2), and OPI(5) on Inventory Control (8 shops), SysAdmin star network (11 computers), unidirectional ring (10 computers), and bidirectional ring (10 computers). [sent-248, score-0.967]

82 A state is described as follows: for each floor, whether a person is waiting to go up or down; for each elevator, whether a person inside the elevator is going up or down, whether the elevator is at each floor, and its current direction (up or down). [sent-259, score-0.374]
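
One way to read this factored state description is as the following collection of binary variables (an illustrative encoding; the benchmark's exact RDDL variables may differ slightly).

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class ElevatorsState:
        waiting_up: List[bool]            # per floor: someone is waiting to go up
        waiting_down: List[bool]          # per floor: someone is waiting to go down
        carrying_up: List[bool]           # per elevator: a passenger going up is inside
        carrying_down: List[bool]         # per elevator: a passenger going down is inside
        at_floor: List[List[bool]]        # per elevator, per floor: elevator is at that floor
        direction_up: List[bool]          # per elevator: current direction is up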

83 Each person gets into an elevator if it is at the same floor and has the same direction (up or down), and exits at the top or bottom floor based on his direction. [sent-263, score-0.169]

84 2 Experimental validation In order to evaluate scaling with respect to the action space we fix the size of the state-space and measure time to convergence (Bellman error less than 0. [sent-269, score-0.183]

85 In addition, we compare to symbolic value iteration: the well-established baseline for factored states, SPUDD [1], and, for factored states and actions, FA-MPI(0).

86 Figure 8: Impact of policy evaluation in Elevators, comparing MB-MPI(5) and MB-OPI(5) (CPU time in minutes). [Figure: compression in π vs. number of parallel actions.]

88 Impact of policy evaluation : We compare symbolic VI and OPI in Figure 6. [sent-328, score-0.588]

89 For Inventory Control, as the number of parallel actions increases, SPUDD takes increasingly more time but FA-MPI(0) takes increasingly less time, giving VI a bell-shaped profile. [sent-329, score-0.17]

90 For all the topologies, as the size of the action space increases, VI takes an increasing amount of time. [sent-332, score-0.183]

91 OPI scales significantly better and does better with more steps of policy evaluation, suggesting that more lookahead is useful in this domain. [sent-333, score-0.365]

92 Figure 7 shows that with increasing state and action spaces FA-MPI exceeds the memory limit (EML) whereas OPI does not and that when both converge OPI converges much faster. [sent-338, score-0.355]

93 Impact of memory-bounding : Even though memory bounding can mitigate the memory problem in FA-MPI, it can cause a large overhead in time, and can still exceed the limit due to intermediate steps in the exact policy backups. [sent-343, score-0.506]

94 Representation compactness : The main bottleneck toward scalability beyond our current results is the growth of the value and policy diagrams with problem complexity, which is a function of the suitability of our ADD representation to the problem at hand. [sent-349, score-0.514]

95 Possible future directions include better alternative symbolic representations as well as approximations. [sent-354, score-0.186]

96 7 Discussion. This paper presented symbolic variants of MPI that scale to large action spaces and generalize and improve over state-of-the-art algorithms. [sent-355, score-0.398]

97 The insight that the policy can be treated as a loose constraint within value iteration steps gives a new interpretation of MPI. [sent-356, score-0.445]

98 Our algorithm OPI computes some policy improvements during policy evaluation and is related to Asynchronous Policy Iteration [9]. [sent-357, score-0.767]

99 Further scalability can be achieved by incorporating approximate value backups (e. [sent-358, score-0.184]

100 Previous work [13] has studied theoretical properties of such approximations of MPI, but no efficient symbolic version exists. [sent-364, score-0.186]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('opi', 0.656), ('policy', 0.365), ('backup', 0.193), ('mpi', 0.189), ('symbolic', 0.186), ('action', 0.183), ('factored', 0.164), ('eml', 0.145), ('elevator', 0.139), ('backups', 0.133), ('actions', 0.127), ('bellman', 0.127), ('primed', 0.088), ('sysadmin', 0.088), ('vi', 0.088), ('shops', 0.078), ('spi', 0.076), ('mins', 0.067), ('diagram', 0.067), ('spudd', 0.067), ('oor', 0.064), ('mbfar', 0.063), ('opportunistic', 0.063), ('rebooted', 0.063), ('memory', 0.061), ('pruning', 0.059), ('computers', 0.057), ('dbn', 0.055), ('adds', 0.055), ('inventory', 0.051), ('xl', 0.049), ('sdp', 0.046), ('add', 0.046), ('elevators', 0.044), ('mdps', 0.044), ('parallel', 0.043), ('mb', 0.042), ('roni', 0.041), ('reward', 0.038), ('fampi', 0.038), ('evaluation', 0.037), ('iteration', 0.037), ('mdp', 0.036), ('state', 0.036), ('domain', 0.035), ('bidirectional', 0.035), ('diagrams', 0.035), ('violate', 0.035), ('impact', 0.035), ('compression', 0.034), ('unidirectional', 0.033), ('ring', 0.033), ('leaf', 0.033), ('relational', 0.033), ('fa', 0.033), ('failed', 0.032), ('ic', 0.032), ('compact', 0.03), ('scalability', 0.03), ('person', 0.03), ('spaces', 0.029), ('craig', 0.029), ('status', 0.027), ('exceeds', 0.027), ('planning', 0.026), ('apricodd', 0.025), ('arents', 0.025), ('floors', 0.025), ('rddl', 0.025), ('shop', 0.025), ('uniring', 0.025), ('star', 0.025), ('control', 0.025), ('bottleneck', 0.025), ('decision', 0.025), ('root', 0.025), ('maxa', 0.023), ('states', 0.023), ('policies', 0.023), ('constraint', 0.022), ('aswin', 0.022), ('hoey', 0.022), ('variables', 0.022), ('lled', 0.022), ('operations', 0.022), ('asynchronous', 0.021), ('value', 0.021), ('parents', 0.021), ('coarser', 0.02), ('else', 0.02), ('marginalization', 0.02), ('false', 0.019), ('dynamic', 0.019), ('jesse', 0.019), ('assignments', 0.019), ('representation', 0.019), ('running', 0.019), ('limit', 0.019), ('growth', 0.019), ('alan', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1

2 0.30904114 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.27749822 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

4 0.22654878 348 nips-2013-Variational Policy Search via Trajectory Optimization

Author: Sergey Levine, Vladlen Koltun

Abstract: In order to learn effective control policies for dynamical systems, policy search methods must be able to discover successful executions of the desired task. While random exploration can work well in simple domains, complex and highdimensional tasks present a serious challenge, particularly when combined with high-dimensional policies that make parameter-space exploration infeasible. We present a method that uses trajectory optimization as a powerful exploration strategy that guides the policy search. A variational decomposition of a maximum likelihood policy objective allows us to use standard trajectory optimization algorithms such as differential dynamic programming, interleaved with standard supervised learning for the policy itself. We demonstrate that the resulting algorithm can outperform prior methods on two challenging locomotion tasks. 1

5 0.20139745 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1

6 0.17237061 241 nips-2013-Optimizing Instructional Policies

7 0.17008732 347 nips-2013-Variational Planning for Graph-based MDPs

8 0.16678841 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

9 0.16491394 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

10 0.16108908 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

11 0.1610111 257 nips-2013-Projected Natural Actor-Critic

12 0.15649933 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

13 0.15577057 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

14 0.150949 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

15 0.14707209 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

16 0.13731724 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

17 0.12813495 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

18 0.12761277 165 nips-2013-Learning from Limited Demonstrations

19 0.12296271 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs

20 0.12017155 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.173), (1, -0.333), (2, -0.184), (3, 0.164), (4, -0.012), (5, 0.007), (6, -0.081), (7, 0.05), (8, -0.003), (9, 0.056), (10, 0.021), (11, -0.018), (12, 0.035), (13, 0.006), (14, 0.048), (15, -0.007), (16, 0.039), (17, -0.047), (18, -0.014), (19, -0.049), (20, -0.002), (21, -0.002), (22, 0.013), (23, 0.036), (24, 0.01), (25, -0.03), (26, 0.037), (27, -0.022), (28, -0.029), (29, -0.01), (30, -0.017), (31, -0.062), (32, 0.037), (33, 0.009), (34, 0.053), (35, -0.017), (36, -0.018), (37, 0.034), (38, 0.054), (39, 0.015), (40, -0.009), (41, -0.057), (42, 0.012), (43, -0.004), (44, -0.013), (45, 0.001), (46, 0.041), (47, -0.028), (48, -0.072), (49, -0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97938186 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1

2 0.86185145 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Author: Paul Wagner

Abstract: Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of softgreedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality. 1

3 0.8571865 28 nips-2013-Adaptive Step-Size for Policy Gradient Methods

Author: Matteo Pirotta, Marcello Restelli, Luca Bascetta

Abstract: In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement–learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step–size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second–order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear–quadratic regulator problem. 1

4 0.82657206 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

Author: Victor Gabillon, Mohammad Ghavamzadeh, Bruno Scherrer

Abstract: Tetris is a video game that has been widely used as a benchmark for various optimization techniques including approximate dynamic programming (ADP) algorithms. A look at the literature of this game shows that while ADP algorithms that have been (almost) entirely based on approximating the value function (value function based) have performed poorly in Tetris, the methods that search directly in the space of policies by learning the policy parameters using an optimization black box, such as the cross entropy (CE) method, have achieved the best reported results. This makes us conjecture that Tetris is a game in which good policies are easier to represent, and thus, learn than their corresponding value functions. So, in order to obtain a good performance with ADP, we should use ADP algorithms that search in a policy space, instead of the more traditional ones that search in a value function space. In this paper, we put our conjecture to test by applying such an ADP algorithm, called classification-based modified policy iteration (CBMPI), to the game of Tetris. Our experimental results show that for the first time an ADP algorithm, namely CBMPI, obtains the best results reported in the literature for Tetris in both small 10 × 10 and large 10 × 20 boards. Although the CBMPI’s results are similar to those of the CE method in the large board, CBMPI uses considerably fewer (almost 1/6) samples (calls to the generative model) than CE. 1

5 0.80599439 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1

6 0.78082734 241 nips-2013-Optimizing Instructional Policies

7 0.72555786 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

8 0.72502571 347 nips-2013-Variational Planning for Graph-based MDPs

9 0.70995277 257 nips-2013-Projected Natural Actor-Critic

10 0.69689262 348 nips-2013-Variational Policy Search via Trajectory Optimization

11 0.68145812 165 nips-2013-Learning from Limited Demonstrations

12 0.65616745 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes

13 0.63615578 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes

14 0.62204993 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs

15 0.62033278 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

16 0.60121965 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search

17 0.59706813 270 nips-2013-Regret based Robust Solutions for Uncertain Markov Decision Processes

18 0.59301722 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning

19 0.57140785 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

20 0.55659986 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.011), (16, 0.047), (33, 0.075), (34, 0.147), (36, 0.013), (41, 0.021), (49, 0.034), (56, 0.109), (67, 0.281), (70, 0.033), (75, 0.03), (85, 0.035), (89, 0.016), (93, 0.038), (95, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7765978 322 nips-2013-Symbolic Opportunistic Policy Iteration for Factored-Action MDPs

Author: Aswin Raghavan, Roni Khardon, Alan Fern, Prasad Tadepalli

Abstract: This paper addresses the scalability of symbolic planning under uncertainty with factored states and actions. Our first contribution is a symbolic implementation of Modified Policy Iteration (MPI) for factored actions that views policy evaluation as policy-constrained value iteration (VI). Unfortunately, a naïve approach to enforce policy constraints can lead to large memory requirements, sometimes making symbolic MPI worse than VI. We address this through our second and main contribution, symbolic Opportunistic Policy Iteration (OPI), which is a novel convergent algorithm lying between VI and MPI, that applies policy constraints if it does not increase the size of the value function representation, and otherwise performs VI backups. We also give a memory bounded version of this algorithm allowing a space-time tradeoff. Empirical results show significantly improved scalability over state-of-the-art symbolic planners. 1

2 0.75333083 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation

Author: Bogdan Savchynskyy, Jörg Hendrik Kappes, Paul Swoboda, Christoph Schnörr

Abstract: We consider energy minimization for undirected graphical models, also known as the MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a significant progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are often defined on sparse graphs and convex relaxation methods, such as linear programming relaxations then provide good approximations to integral solutions. We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve much larger problems. We demonstrate the efficacy of our approach on a computer vision energy minimization benchmark. 1

3 0.69976574 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis

Author: James R. Voss, Luis Rademacher, Mikhail Belkin

Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. 
In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. 
We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. 
For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 
3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. 
Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ R^d. Based on some criterion (typically maximization or minimization of f), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons:

1. As an approximate Newton method, FastICA requires computation of ∇_u f and a quick-to-compute estimate of (H_u(f))^{-1} at each iterative step. Due to the estimate, the computation runs in O(Nd) time, where N is the number of samples.
2. The iterative step in FastICA has local quadratic-order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step size) when choosing f as the fourth moment [18].

Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on ∇_u κ_r. We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA.

While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs "robust cumulants" which are only minimally affected by sparse noise. Welling's robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing-Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ_3, then we obtain quadratic convergence; and if G is κ_4, we obtain cubic convergence. Lemma 4.1 helps explain why this is true.

Lemma 4.1. ∇_v G(v · y) = r \sum_{i=1}^{d} (v · R_i)^{r−1} G(s̃_i) R_i.

If we consider what is happening in the basis of the columns of R, then, up to some multiplicative constant, each coordinate is raised to the r−1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r−1 convergence.

Theorem 4.2. If for a unit vector input v to Algorithm 2, h = arg max_i |(v · R_i)^{r−2} G(s̃_i)| has a unique answer, then v has order r−1 convergence to R_h up to sign. In particular, if the following conditions are met: (1) there exists a coordinate random variable s_i of s such that G(s_i) ≠ 0; (2) the v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S^{d−1}; then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r−1.

Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of ∇_v κ_3 and ∇_v κ_4, which can be used as practical choices of ∇_v G on real data.
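The mechanics behind Lemma 4.1 can be seen in a toy computation: in the (unknown) basis of the columns of R, one step of the update raises each coordinate to the power r−1, scales it by G(s̃_i), and renormalizes. The weights g below are stand-ins for |G(s̃_i)|, assumed non-zero; this is an illustration of the convergence order only, not a run of the actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4                                 # fourth cumulant -> cubic (order r-1 = 3) convergence
g = rng.uniform(0.5, 2.0, size=5)     # hypothetical non-zero values of |G(s_i)|
c = rng.normal(size=5)                # coordinates of v in the basis of R
c /= np.linalg.norm(c)
for step in range(6):
    c = g * c ** (r - 1)              # the per-coordinate effect of one update (Lemma 4.1)
    c /= np.linalg.norm(c)
    print(step, np.round(np.abs(c), 6))
# After a few steps |c| concentrates on a single coordinate, i.e. v aligns with one column of R.
```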
1: function GI-ICA(v, y)
2:   repeat
3:     v ← ∇_v G(v^T y)
4:     v ← v / ||v||_2
5:   until convergence
6:   return v
7: end function

Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. Ã recovers A up to column order and scaling. R^T W is the demixing matrix for the observed random vector x.

function GaussianRobustICA(G, x)
  W = FindQuasiOrthogonalizationMatrix(x)
  y = Wx
  R_columns = ∅
  for i = 1 to d do
    Draw v from S^{d−1} ∩ span(R_columns)^⊥ uniformly at random.
    R_columns = R_columns ∪ {GI-ICA(v, y)}
  end for
  Construct a matrix R using the elements of R_columns as columns.
  s̃ = R^T y
  Ã = (R^T W)^{-1}
  return Ã, s̃
end function

By convergence up to sign, we include the possibility that v oscillates between R_h and −R_h on alternating steps. This can occur if G(s̃_i) < 0 and r is odd. Due to space limitations, the proof is omitted.

Recovering all Independent Components. As a corollary to Theorem 4.2 we get:

Corollary 4.3. Suppose R_1, R_2, . . . , R_k are known for some k < d. Suppose there exists i > k such that G(s_i) ≠ 0. If v is drawn uniformly at random from S^{d−1} ∩ span(R_1, . . . , R_k)^⊥, where S^{d−1} denotes the unit sphere in R^d, then Algorithm 2 with input v converges to a new column of R almost surely.

Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G.

5 Time Complexity Analysis and Estimation of Cumulants

To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower-order cumulants are more statistically stable to estimate than higher-order cumulants. κ_3 is useful in Algorithm 2 for non-symmetric distributions. However, since κ_3(s_i) = 0 whenever s_i is a symmetric distribution, it is plausible that κ_3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ_4 for G. Alternatively, one can fall back to κ_4 from κ_3 when κ_3 is detected to be near 0.

Denote by z^{(1)}, z^{(2)}, . . . , z^{(N)} the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by k_r(z^{(i)}) the k-statistic sample estimate of κ_r(z). Letting m_r(z^{(i)}) := \frac{1}{N} \sum_{i=1}^{N} (z^{(i)} − \bar z)^r give the r-th sample central moment, then

k_3(z^{(i)}) := \frac{N^2 m_3(z^{(i)})}{(N−1)(N−2)}, \qquad k_4(z^{(i)}) := \frac{N^2 \left[ (N+1) m_4(z^{(i)}) − 3(N−1) m_2(z^{(i)})^2 \right]}{(N−1)(N−2)(N−3)}

gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following lemma shows how to obtain unbiased estimates:

Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z^{(i)} be an iid sample of z. Let α ∈ N^d be a multi-index. Then ∂_u^α k_r(u · z^{(i)}) is an unbiased estimate for ∂_u^α κ_r(u · z).
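The following sketch assembles Algorithms 2 and 3 in NumPy, assuming mean-subtracted data and using a plug-in gradient of κ_4 (∇_u κ_4(u·y) ≈ 4 E[(u·y)^3 y] − 12 E[(u·y)^2] E[(u·y) y]) in place of the unbiased estimators (2)–(3). It also enforces orthogonality against previously found columns, as suggested in the text. Names such as gi_ica_direction are ours, and the quasi_orth argument stands for any quasi-orthogonalization routine (e.g., the Algorithm 1 sketch above).

```python
import numpy as np

def grad_k4(Y, v):
    # Plug-in estimate of the gradient of kappa_4(v^T y) for centered samples Y (N, d).
    z = Y @ v
    return (4 * (Y * (z ** 3)[:, None]).mean(axis=0)
            - 12 * np.mean(z ** 2) * (Y * z[:, None]).mean(axis=0))

def gi_ica_direction(Y, v, found, tol=1e-4, max_iter=200):
    # Algorithm 2 sketch, with optional re-orthogonalization against columns already found.
    for _ in range(max_iter):
        v_new = grad_k4(Y, v)
        for r in found:
            v_new -= (v_new @ r) * r
        v_new /= np.linalg.norm(v_new)
        if min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v)) < tol:  # convergence up to sign
            return v_new
        v = v_new
    return v

def gaussian_robust_ica(X, quasi_orth):
    # Algorithm 3 sketch: quasi-orthogonalize, then recover the columns of R one at a time.
    X = X - X.mean(axis=0)
    W = quasi_orth(X)
    Y = X @ W.T                              # rows are y = W x
    d = X.shape[1]
    rng = np.random.default_rng(0)
    found = []
    for _ in range(d):
        v = rng.normal(size=d)
        for r in found:
            v -= (v @ r) * r                 # draw v from the orthogonal complement of found columns
        v /= np.linalg.norm(v)
        found.append(gi_ica_direction(Y, v, found))
    R = np.column_stack(found)
    A_hat = np.linalg.inv(R.T @ W)           # estimate of A up to column order and scaling
    return A_hat, Y @ R                      # (A-tilde, s-tilde)
```

On data from the experimental setup below, calling gaussian_robust_ica(X, find_quasi_orthogonalization_matrix) would return an estimate of A up to column permutation and scale.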
If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are

∇_u k_3(u · y) = \frac{3N}{(N−1)(N−2)} \sum_{i=1}^{N} (u \cdot y^{(i)})^2 \, y^{(i)},   (2)

∇_u k_4(u · y) = \frac{N^2}{(N−1)(N−2)(N−3)} \left[ \frac{4(N+1)}{N} \sum_{i=1}^{N} (u \cdot y^{(i)})^3 \, y^{(i)} \;−\; \frac{12(N−1)}{N^2} \left( \sum_{i=1}^{N} (u \cdot y^{(i)})^2 \right) \left( \sum_{i=1}^{N} (u \cdot y^{(i)}) \, y^{(i)} \right) \right],   (3)

H_u k_4(u · x) = \frac{12 N^2}{(N−1)(N−2)(N−3)} \left[ \frac{N+1}{N} \sum_{i=1}^{N} (u \cdot x^{(i)})^2 \, (xx^T)^{(i)} \;−\; \frac{N−1}{N^2} \left( \sum_{i=1}^{N} (u \cdot x^{(i)})^2 \right) \left( \sum_{i=1}^{N} (xx^T)^{(i)} \right) \;−\; \frac{2N−2}{N^2} \left( \sum_{i=1}^{N} (u \cdot x^{(i)}) \, x^{(i)} \right) \left( \sum_{i=1}^{N} (u \cdot x^{(i)}) \, x^{(i)} \right)^T \right].   (4)

Using (4) to estimate H_u κ_4(u^T x) from data when implementing Algorithm 1, the resulting quasi-orthogonalization algorithm runs in O(N d^3) time. Using (2) or (3) to estimate ∇_u G(v^T y) (with G chosen to be κ_3 or κ_4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d^2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved.

6 Simulation Results

In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA, the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, "comparison on 5 distributions" indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the t-distribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari index introduced in [1]. Let B̂ denote the approximate demixing matrix returned by an ICA algorithm, and let M = B̂A. Then, the Amari index is given by

E := \sum_{i=1}^{n} \left( \frac{\sum_{j=1}^{n} |m_{ij}|}{\max_k |m_{ik}|} − 1 \right) + \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{n} |m_{ij}|}{\max_k |m_{kj}|} − 1 \right).

The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix PD (where P is a permutation matrix and D is a diagonal matrix). From the noiseless data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth-order methods (GI-κ4, JADE, and κ4-FastICA) perform comparably.
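The Amari index defined above is straightforward to compute; the small helper below follows that formula directly (the function name is ours).

```python
import numpy as np

def amari_index(B_hat, A):
    # Amari index of M = B_hat @ A; it is 0 exactly when M is a scaled permutation matrix.
    M = np.abs(B_hat @ A)
    row_term = (M.sum(axis=1) / M.max(axis=1) - 1.0).sum()
    col_term = (M.sum(axis=0) / M.max(axis=0) - 1.0).sum()
    return row_term + col_term
```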
[Figure 1: Comparison of ICA algorithms under various levels of noise. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10. The six panels plot the Amari index against the number of samples (100 to 100 000) for GI-κ4 (white), GI-κ4 (quasi-orthogonal), κ4-FastICA, log cosh-FastICA, and JADE at d = 5 and d = 10, under noiseless data, 25% noise magnitude, and 50% noise magnitude.]

The difference between GI-κ4 and κ4-FastICA is not statistically significant over 50 runs with 100 000 samples. We note that GI-κ4 under whitening and κ4-FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability.

It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples), including log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GI-κ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary to be robust to Gaussian noise.

Run times were indeed reasonably fast. For 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time² of 0.19 seconds using PCA whitening, and 0.23 seconds under quasi-orthogonalization. The corresponding average numbers of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08. In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5), and note that once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small.
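For reproducing experiments of this kind, a data generator along the lines described above might look as follows. The standardization of the sources and the omission of the bias term b are our assumptions, not details stated in the text; the five source distributions, the singular-value spread of A, and the noise-magnitude convention follow the description above.

```python
import numpy as np

def make_mixing_matrix(d, rng, cond=10.0):
    # Random A with prescribed singular values: minimum 1, maximum `cond`, intermediates uniform.
    sv = np.sort(rng.uniform(1.0, cond, size=d))
    sv[0], sv[-1] = 1.0, cond
    U, _ = np.linalg.qr(rng.normal(size=(d, d)))
    V, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return U @ np.diag(sv) @ V.T

def sample_data(N, rng, noise_pct=50):
    # Five-distribution setup; 100% noise magnitude corresponds to variance 10.
    s = np.column_stack([
        rng.laplace(size=N),
        rng.binomial(1, 0.5, size=N).astype(float),
        rng.standard_t(df=5, size=N),
        rng.exponential(size=N),
        rng.uniform(-1, 1, size=N),
    ])
    s = (s - s.mean(axis=0)) / s.std(axis=0)   # assumption: sources standardized
    A = make_mixing_matrix(5, rng)
    noise_var = 10.0 * noise_pct / 100.0
    eta = rng.normal(scale=np.sqrt(noise_var), size=(N, 5))
    return s @ A.T + eta, A                    # observations x = A s + eta, and the true A
```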
Number of data pts                        500      1000     5000     10000    50000    100000
whitening+GI-κ4: mean num steps           11.76    5.92     4.99     4.59     4.35     4.16
quasi-orth.+GI-κ4: mean num steps         213.92   65.95    4.48     4.36     4.06     4.08

Acknowledgments
This work was supported by NSF grant IIS 1117707.

² Using a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM.

References
[1] S. Amari, A. Cichocki, H. H. Yang, et al. A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems, pages 757–763, 1996.
[2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.
[3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP, volume 30: COLT, pages 270–287, 2013.
[4] C. M. Bishop. Variational principal components. Proc. Ninth Int. Conf. on Artificial Neural Networks (ICANN), 1:509–514, 1999.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993.
[7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http://perso.telecom-paristech.fr/~cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013].
[8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.
[9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
[10] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen. Matlab FastICA v 2.5. http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013].
[11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013.
[12] A. Hyvärinen. Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998.
[13] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[14] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, part 2. Van Nostrand, 1962.
[16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008.
[17] L. Mattner. What are cumulants? Documenta Mathematica, 4:601–622, 1999.
[18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009.
[19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. http://sourceforge.net/projects/giica/, 2013. [Online].
[20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005.
[21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
[22] V. Zarzoso and P. Comon. How fast is FastICA? EUSIPCO, 2006.

4 0.61565679 177 nips-2013-Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

Author: John Duchi, Martin J. Wainwright, Michael Jordan

Abstract: We provide a detailed study of the estimation of probability distributions— discrete and continuous—in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental trade-offs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner’s classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents. 1

5 0.5951007 347 nips-2013-Variational Planning for Graph-based MDPs

Author: Qiang Cheng, Qiang Liu, Feng Chen, Alex Ihler

Abstract: Markov Decision Processes (MDPs) are extremely useful for modeling and solving sequential decision making problems. Graph-based MDPs provide a compact representation for MDPs with large numbers of random variables. However, the complexity of exactly solving a graph-based MDP usually grows exponentially in the number of variables, which limits their application. We present a new variational framework to describe and solve the planning problem of MDPs, and derive both exact and approximate planning algorithms. In particular, by exploiting the graph structure of graph-based MDPs, we propose a factored variational value iteration algorithm in which the value function is first approximated by the multiplication of local-scope value functions, then solved by minimizing a Kullback-Leibler (KL) divergence. The KL divergence is optimized using the belief propagation algorithm, with complexity exponential in only the cluster size of the graph. Experimental comparison on different models shows that our algorithm outperforms existing approximation algorithms at finding good policies. 1

6 0.58018297 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents

7 0.57700288 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

8 0.57649708 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

9 0.57392067 122 nips-2013-First-order Decomposition Trees

10 0.57184714 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

11 0.57159519 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

12 0.56741726 86 nips-2013-Demixing odors - fast inference in olfaction

13 0.56641167 210 nips-2013-Noise-Enhanced Associative Memories

14 0.56453079 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression

15 0.56426322 348 nips-2013-Variational Policy Search via Trajectory Optimization

16 0.56331688 143 nips-2013-Integrated Non-Factorized Variational Inference

17 0.56293714 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

18 0.56269723 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

19 0.56268352 38 nips-2013-Approximate Dynamic Programming Finally Performs Well in the Game of Tetris

20 0.56264412 357 nips-2013-k-Prototype Learning for 3D Rigid Structures