nips nips2008 nips2008-212 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ozgur Simsek, Andrew G. Barto
Abstract: We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms. Our characterization may be used directly to form a set of skills suitable for a given task. More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Skill characterization based on betweenness Özgür Şimşek∗ Andrew G. Barto [sent-1, score-0.504]
2 Abstract We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. [sent-4, score-0.619]
3 Our characterization may be used directly to form a set of skills suitable for a given task. [sent-7, score-0.462]
4 More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. [sent-8, score-0.291]
5 For example, for a robot performing tasks that require manipulating objects, grasping is a useful skill that employs lower-level sensory and motor primitives. [sent-10, score-0.251]
6 And, how can an agent identify such skills autonomously? [sent-12, score-0.664]
7 Our main contribution is a characterization of a useful class of skills based on a graphical representation of the agent’s interaction with its environment. [sent-14, score-0.619]
8 Specifically, we use betweenness, a measure of centrality on graphs [1, 2], to define a set of skills that allows efficient navigation on the interaction graph. [sent-15, score-0.641]
9 In the game of Tic-Tac-Toe, these skills translate into setting up a fork, creating an opportunity to win the game. [sent-16, score-0.506]
10 Our characterization may be used directly to form a set of skills suitable for a given task if the interaction graph is readily available. [sent-18, score-0.685]
11 More importantly, this characterization is a useful guide for developing low-cost, incremental algorithms for skill discovery that do not rely on complete representation of the interaction graph. [sent-19, score-0.5]
12 Bottlenecks have been described as regions that the agent tends to visit frequently on successful trajectories but not on unsuccessful ones [3], border states of strongly connected areas [6], and states that allow transitions to a different part of the environment [7]. [sent-22, score-0.472]
13 We hope that our explicit and concrete description of what makes a useful skill will lead to further development of these existing algorithms and inspire alternative methods. [sent-24, score-0.251]
14 Figure 1: A visual representation of betweenness on two sample graphs. [sent-26, score-0.453]
15 The interaction graph is a directed graph in which the vertices represent the states of the MDP and the edges represent possible state transitions brought about by available actions. [sent-28, score-0.469]
16 It gives the fraction of shortest paths on the graph (between all possible sources and destinations) that pass through the vertex of interest. [sent-33, score-0.243]
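To make this definition concrete, here is a minimal sketch of computing betweenness on a toy interaction graph with the networkx library; the two-room layout, the state names, and the uniform weights are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (illustrative, not the authors' code): betweenness on a
# small directed interaction graph with uniform edge and path weights.
import networkx as nx

# Hypothetical fragment of a two-room gridworld: states are vertices and
# reversible actions induce edges in both directions; "door" joins the rooms.
G = nx.DiGraph()
G.add_edges_from([("a1", "a2"), ("a2", "a1"),
                  ("a2", "door"), ("door", "a2"),
                  ("door", "b1"), ("b1", "door"),
                  ("b1", "b2"), ("b2", "b1")])

# Fraction of shortest paths between all source-destination pairs that pass
# through each vertex (endpoints excluded).
bc = nx.betweenness_centrality(G, normalized=True)
print(sorted(bc.items(), key=lambda kv: -kv[1]))  # "door" ranks highest
```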
17 Depending on the reward function, or a probability distribution over possible reward functions, some parts of the interaction graph may be given more weight than others, according to how well they serve the agent’s needs. [sent-38, score-0.311]
18 We define as subgoals those states that correspond to local maxima of betweenness on the interaction graph, in other words, states that have a higher betweenness than other states in their neighborhood. [sent-39, score-1.578]
19 Skills for efficiently reaching the local maxima of betweenness represent a set of behaviors that may be combined in different ways to efficiently reach different regions, serving as useful building blocks for navigating the graph. [sent-41, score-0.776]
20 Figure 1 is a visual representation of betweenness on two sample graphs, computed using uniform edge and path weights. [sent-42, score-0.478]
21 The gray-scale shading on the vertices corresponds to the relative values of betweenness, with black representing the highest betweenness on the graph and white representing the lowest. [sent-43, score-0.605]
22 The graph on the left corresponds to a gridworld in which a doorway connects two rooms. [sent-44, score-0.283]
23 The graph on the right has a doorway of a different type: an edge connecting two otherwise distant nodes. [sent-45, score-0.207]
24 In both graphs, states that are local maxima of betweenness correspond to our intuitive choice of subgoals. [sent-46, score-0.712]
25 3 Examples We applied the skill definition of Section 2 to various domains in the literature: Taxi [11], Playroom [12, 13], and the game of Tic-Tac-Toe. [sent-49, score-0.283]
26 Interaction graphs of these domains, displaying betweenness values as gray-scale shading on the vertices, are shown in Figure 2. [sent-50, score-0.526]
27 We considered a node to be a local maximum if its betweenness was higher than or equal to those of its immediate neighbors, taking into account both incoming and outgoing edges. [sent-52, score-0.489]
28 Unless stated otherwise, actions had uniform cost and betweenness was computed using uniform path weights. [sent-53, score-0.53]
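A sketch of this local-maximum test under uniform weights is given below; it assumes a networkx-style directed graph and applies the greater-than-or-equal comparison over both predecessors and successors described above, as an illustration rather than the authors' implementation.

```python
import networkx as nx

def betweenness_local_maxima(G):
    # Sketch of the criterion in the text: a vertex is a local maximum if its
    # betweenness is >= that of every immediate neighbor, where neighbors are
    # taken over both incoming and outgoing edges of the directed graph G.
    bc = nx.betweenness_centrality(G)
    maxima = []
    for v in G.nodes():
        neighbors = (set(G.predecessors(v)) | set(G.successors(v))) - {v}
        if all(bc[v] >= bc[u] for u in neighbors):
            maxima.append(v)
    return maxima
```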
29 Taxi This domain includes a taxi and a passenger on the 5 × 5 grid shown in Figure 4. [sent-54, score-0.606]
30 At each grid location, the taxi has six primitive actions: north, east, south, west, pick-up, and put-down. [sent-55, score-0.457]
31 The navigation actions succeed in moving the taxi in the intended direction with probability 0.8. [sent-56, score-0.5]
32 With probability 0.2, the action takes the taxi to the right or left of the intended direction. [sent-58, score-0.43]
33 If the direction of movement is blocked, the taxi remains in the same location. [sent-59, score-0.381]
34 Pick-up places the passenger in the taxi if the taxi is at the passenger location; otherwise it has no effect. [sent-60, score-1.162]
35 Similarly, put-down delivers the passenger if the passenger is inside the taxi and the taxi is at the destination; otherwise it has no effect. [sent-61, score-1.162]
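As a concrete illustration of these dynamics, the sketch below implements the stochastic navigation rule; the even 0.1/0.1 split of the slip probability between the two perpendicular directions and the blocked(x, y, direction) wall predicate are assumptions made for the example.

```python
import random

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}
PERPENDICULAR = {"north": ("east", "west"), "south": ("east", "west"),
                 "east": ("north", "south"), "west": ("north", "south")}

def navigate(x, y, action, blocked, p_intended=0.8):
    # Move in the intended direction with probability 0.8; otherwise slip to
    # the right or left of it (assumed 0.1 each). If the move is blocked, the
    # taxi stays where it is. `blocked` is a hypothetical wall predicate.
    if random.random() < p_intended:
        direction = action
    else:
        direction = random.choice(PERPENDICULAR[action])
    if blocked(x, y, direction):
        return x, y
    dx, dy = MOVES[direction]
    return x + dx, y + dy
```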
36 We used a continuing version of this problem in which a new passenger appears after each successful delivery. [sent-63, score-0.231]
37 The highest local maxima of betweenness are at the four regions of the graph that correspond to passenger delivery. [sent-64, score-0.951]
38 The corresponding skills are (approximately) those that take the taxi to the passenger location, to the destination (having picked up the passenger), or to a navigational bottleneck. [sent-66, score-1.161]
39 These skills closely resemble those that are hand-coded for this domain in the literature. [sent-67, score-0.411]
40 Playroom We created a Markov version of this domain in which an agent interacts with a number of objects in its surroundings: a light switch, a ball, a bell, a button for turning music on and off. (Footnote 1, on the Taxi subgoals: except when the passenger is waiting at Y, in which case the taxi is at x = 1, y = 3.) [sent-68, score-0.958]
41 For wait location Y, the corresponding subgoal has the taxi at x = 1, y = 3, having picked up the passenger. [sent-69, score-0.557]
42 The agent has an eye, a hand, and a marker it can place on objects. [sent-74, score-0.281]
43 In order to operate on an object, the agent must be looking at the object and holding the object in its hand. [sent-78, score-0.4]
44 The toy monkey starts to make frightened sounds if the bell is rung while the music is playing; it stops only when the music is turned off. [sent-80, score-0.226]
45 The MDP state consists of the object that the agent is looking at, the object that the agent is holding, the object that the marker is placed on, music (on/off), light (on/off), monkey (frightened/not), and bell (ringing/not). [sent-82, score-0.922]
46 The six different clusters of the interaction graph in Figure 2 emerge naturally from the force-directed layout algorithm and correspond to the different settings of the music, light, and monkey variables. [sent-83, score-0.281]
47 Betweenness peaks at regions that immediately connect neighboring clusters, corresponding to skills that change the setting of the music, light, or monkey variables. [sent-85, score-0.447]
48 Tic-Tac-Toe In the interaction graph, the node at the center is the empty board, with other board configurations forming rings around it according to their distance from this initial configuration. [sent-86, score-0.419]
49 The opponent followed a policy that, in decreasing order of priority, (1) placed the third mark in a row whenever possible, winning the game, (2) blocked the agent from completing a row, and (3) placed its mark on a random empty square. [sent-89, score-0.296]
50 We assigned a weight of +1 to paths that terminate at a win for the agent and 0 to all other paths. [sent-91, score-0.327]
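This weighting scheme can be folded into a brute-force version of betweenness, sketched below: every source-destination pair contributes its path weight rather than one. The naive enumeration is far too slow for the full Tic-Tac-Toe graph and is meant only to pin down the definition; it is not the authors' implementation.

```python
import networkx as nx

def weighted_betweenness(G, path_weight):
    # path_weight(s, t) is, e.g., 1 if t is a board where the agent has won
    # and 0 otherwise; interior vertices of shortest s->t paths accumulate
    # that weight, split evenly across tied shortest paths.
    score = {v: 0.0 for v in G}
    for s in G:
        reachable = nx.single_source_shortest_path_length(G, s)
        for t in reachable:
            if t == s:
                continue
            w = path_weight(s, t)
            if w == 0:
                continue
            paths = list(nx.all_shortest_paths(G, s, t))
            for p in paths:
                for v in p[1:-1]:
                    score[v] += w / len(paths)
    return score
```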
51 The state with the highest betweenness is the one shown in Figure 4. [sent-92, score-0.487]
52 This state gives the agent two possibilities for setting up a fork (board locations marked with *), creating an opportunity to win on the next turn. [sent-94, score-0.357]
53 There were nine other local maxima that similarly allowed the agent to immediately create a fork. [sent-95, score-0.426]
54 In addition, there were a number of “trivial” local maxima that allowed the agent to immediately win the game. [sent-96, score-0.469]
55 4 Empirical Performance We evaluated the impact of our skills on the agent’s learning performance in Taxi, Playroom, TicTac-Toe, and two additional domains, called Rooms and Shortcut, whose interaction graphs are those presented in Figure 1. [sent-97, score-0.58]
56 They move the agent in the intended direction with probability 0. [sent-100, score-0.252]
57 In Rooms, the local maxima of betweenness are the two states that surround the doorway, which have a slightly higher betweenness than the doorway itself. [sent-104, score-1.236]
58 The transition dynamics of Shortcut are identical, except that there is one additional long-range action connecting two particular states, which are the local maxima of betweenness in this domain. [sent-105, score-0.651]
59 We represented skills using the options framework [14, 15]. [sent-106, score-0.435]
60 The initiation set was restricted to a certain number of states: those with the least distance to the subgoal on the interaction graph. [sent-107, score-0.405]
61 The skills terminated with probability one outside the initiation set and at the subgoal, with probability zero at all other states. [sent-108, score-0.464]
62 The skill policy was the optimal policy for reaching the subgoal. [sent-109, score-0.293]
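A minimal sketch of how such an option might be assembled from the interaction graph follows; the class layout, the use of a shortest-path policy as the optimal policy under uniform action costs, and the parameter names are illustrative assumptions, not details from the paper.

```python
import networkx as nx

class SubgoalOption:
    # Sketch of an option built around one subgoal state of a directed
    # interaction graph G (uniform action costs assumed).
    def __init__(self, G, subgoal, init_size):
        # Initiation set: the init_size states with the smallest
        # shortest-path distance to the subgoal.
        dist_to_goal = nx.shortest_path_length(G.reverse(copy=False), subgoal)
        closest = sorted(dist_to_goal, key=dist_to_goal.get)[:init_size]
        self.subgoal = subgoal
        self.initiation_set = set(closest)
        # Policy: from each state in the initiation set, the next state on a
        # shortest path toward the subgoal.
        self.next_state = {s: nx.shortest_path(G, s, subgoal)[1]
                           for s in self.initiation_set if s != subgoal}

    def terminates(self, state):
        # Terminate with probability one at the subgoal and outside the
        # initiation set, with probability zero elsewhere.
        return state == self.subgoal or state not in self.initiation_set
```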
63 We compared three agents: one that used only the primitive actions of the domain, one that used primitives and our skills, and one that used primitives and a control group of skills whose subgoals were selected randomly. [sent-110, score-0.758]
64 The number of subgoals used and the size of the initiation sets were identical in the two skill conditions. [sent-111, score-0.438]
65 The agent used Q-learning with ε-greedy exploration with ε = 0.05. [sent-112, score-0.228]
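For reference, a bare-bones sketch of tabular Q-learning with ε-greedy action selection over primitive actions follows (ε = 0.05 as above); the learning rate, discount factor, and the omission of the SMDP-style update needed when options are included are all simplifying assumptions.

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # tabular action values, default 0

def epsilon_greedy(state, actions, epsilon=0.05):
    # With probability epsilon explore uniformly, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # One-step Q-learning backup; alpha and gamma are placeholder values.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```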
66 Figure 3 shows performance results in Rooms, Shortcut, and Playroom, where we had the agent perform 100 different episodic tasks, choosing a single goal state uniformly at random in each task. [sent-119, score-0.262]
67 If no number is present, the skills were made available everywhere in the domain. [sent-123, score-0.411]
68 The availability of our skills, those identified using local maxima of betweenness, yielded a large improvement over using primitive actions only. [sent-124, score-0.279]
69 In some cases, random skills improved performance as well, but this improvement was much smaller than that obtained by our skills. [sent-125, score-0.411]
70 In Taxi, we examined performance on the single continuing task that rewarded the agent for delivering passengers. [sent-128, score-0.259]
71 Reward was −1 for each action, an additional +50 for passenger delivery, and an additional −10 for an unsuccessful pick-up or put-down. [sent-129, score-0.24]
72 In Tic-Tac-Toe, the agent received a reward of −0.001 for each action, an additional +1 for winning the game, and an additional −1 for losing. [sent-130, score-0.294]
73 Creating an individual skill for reaching each of the identified subgoals (which is what we have done in other domains) generates skills that are not of much use in Tic-Tac-Toe because reaching any particular board configuration is usually not possible. [sent-131, score-0.902]
74 Instead, we defined a single skill with multiple subgoals: the ten local maxima of betweenness that allow the agent to set up a fork. [sent-132, score-1.096]
75 We set the initial Q-value of this skill to 1 at the start state to ensure that the skill got executed frequently enough. [sent-133, score-0.468]
76 It is not clear what this single skill can be meaningfully compared to, so we do not provide a control condition with randomly-selected subgoals. [sent-134, score-0.217]
77 5 Our analysis shows that, in a diverse set of domains, the skill definition of Section 2 gives rise to skills that are consistent with common sense, are similar to skills people handcraft for these domains, and improve learning performance. [sent-135, score-1.039]
78 The improvements in performance are greater than those observed when using a control group of randomly-generated skills, suggesting that they should not be attributed to the presence of skills alone but to the presence of the specific skills that are formed based on betweenness. [sent-136, score-0.822]
79 Amarel advocated representing action consequences in the environment as a graph and forming skills that correspond to navigating this graph by exploiting its structural regularities. [sent-138, score-0.719]
80 Our skill definition captures the “bottleneck” concept, which has inspired many of the existing skill discovery algorithms [3, 6, 4, 5, 7, 8, 9]. [sent-140, score-0.475]
81 There is clearly an overlap between our skills and the skills that are generated by these algorithms. [sent-141, score-0.822]
82 McGovern & Barto [3] examine past trajectories to identify states that are common in successful trajectories but not in unsuccessful ones. [sent-143, score-0.194]
83 It can be applied only after the agent has successfully performed the task at least once. [sent-145, score-0.228]
84 It examines different paths that reach the same destination, while we look for the most efficient ways of navigating between different source and destination pairs. [sent-148, score-0.224]
85 Unfortunately, however, sample efficiency is an even larger concern with these algorithms, because they require the agent to have already identified the optimal policy, not only for a single task, but for many different tasks in the domain. [sent-152, score-0.253]
86 They apply a clustering algorithm to partition the graph into blocks and create skills that efficiently take the agent to states that connect the different blocks. [sent-156, score-0.8]
87 We claim that the more fundamental property that makes a doorway a useful subgoal is that it lies between many source-destination pairs, and that graph partitioning cannot directly tap into this property, although it can sometimes do so indirectly. [sent-164, score-0.374]
88 6 An Incremental Discovery Algorithm Our skill definition may be used directly to form a set of skills suitable for a given environment. [sent-165, score-0.628]
89 Because of its reliance on complete knowledge of the interaction graph and the computational cost of betweenness, the use of our approach as a skill-discovery method is limited, although there are conditions under which it would be useful. [sent-166, score-0.223]
90 Although betweenness of a given vertex is a global graph property that cannot be estimated reliably without knowledge of the entire graph, it should be possible to reliably determine the local maxima of betweenness using limited information. [sent-168, score-1.262]
91 In particular, we apply the statistical approach from Simsek & Barto [7] and Simsek, Wolfe & Barto [9] using the skill description in the present paper. [sent-170, score-0.217]
92 The resulting algorithm is founded on the premise that local maxima of betweenness of the interaction graph are likely to be local maxima on its subgraphs. [sent-171, score-1.072]
93 The agent uses short trajectories to construct subgraphs of the interaction graph and identifies the local maxima of betweenness on these subgraphs. [sent-174, score-1.163]
94 The reward structure creates two local maxima of betweenness on the graph. [sent-183, score-0.695]
95 These are the regions that look like doorways in the figure—they are useful subgoals for the same reasons that doorways are. [sent-184, score-0.262]
96 In both domains, we had the agent perform a random walk of 40,000 steps. [sent-189, score-0.228]
97 Every 1000 transitions, the agent created a new interaction graph using the last 1000 transitions. [sent-190, score-0.451]
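A sketch of this incremental loop is given below, reusing the betweenness_local_maxima helper from the earlier sketch; the sliding window, the vote counter, and the acceptance threshold are illustrative assumptions rather than the algorithm's actual statistical test.

```python
from collections import Counter, deque
import networkx as nx

def incremental_subgoal_discovery(transition_stream, window=1000, min_votes=3):
    # Every `window` transitions, build a subgraph from the most recent
    # transitions, find its local maxima of betweenness, and tally them;
    # states flagged repeatedly across windows are proposed as subgoals.
    recent = deque(maxlen=window)
    votes = Counter()
    for t, (s, s_next) in enumerate(transition_stream, start=1):
        recent.append((s, s_next))
        if t % window == 0:
            subgraph = nx.DiGraph(list(recent))
            votes.update(betweenness_local_maxima(subgraph))
    return [state for state, n in votes.items() if n >= min_votes]
```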
98 An interesting direction is to develop algorithms that actively explore to discover local maxima of betweenness rather than only passively mining available trajectories. [sent-194, score-0.651]
99 Automatic discovery of subgoals in reinforcement learning using diverse density. [sent-212, score-0.277]
100 Identifying useful subgoals in reinforcement learning by local graph partitioning. [sent-248, score-0.406]
wordName wordTfidf (topN-words)
[('betweenness', 0.453), ('skills', 0.411), ('taxi', 0.381), ('agent', 0.228), ('skill', 0.217), ('passenger', 0.2), ('subgoals', 0.168), ('maxima', 0.162), ('simsek', 0.152), ('interaction', 0.123), ('playroom', 0.122), ('destination', 0.107), ('doorway', 0.107), ('subgoal', 0.107), ('barto', 0.103), ('graph', 0.1), ('shortcut', 0.091), ('gridworld', 0.076), ('music', 0.073), ('rooms', 0.073), ('reinforcement', 0.068), ('states', 0.061), ('centrality', 0.061), ('navigating', 0.061), ('vertex', 0.058), ('paths', 0.056), ('looking', 0.055), ('marker', 0.053), ('initiation', 0.053), ('actions', 0.052), ('characterization', 0.051), ('wolfe', 0.049), ('primitives', 0.049), ('graphs', 0.046), ('mcgovern', 0.046), ('menache', 0.046), ('stolle', 0.046), ('board', 0.046), ('object', 0.045), ('bell', 0.044), ('reward', 0.044), ('win', 0.043), ('succeed', 0.043), ('discovery', 0.041), ('navigational', 0.04), ('button', 0.04), ('unsuccessful', 0.04), ('bottleneck', 0.039), ('domains', 0.036), ('local', 0.036), ('light', 0.036), ('monkey', 0.036), ('mannor', 0.034), ('trajectories', 0.034), ('incremental', 0.034), ('useful', 0.034), ('state', 0.034), ('mdp', 0.034), ('continuing', 0.031), ('amarel', 0.03), ('clearing', 0.03), ('doorways', 0.03), ('fork', 0.03), ('ozgur', 0.03), ('wst', 0.03), ('reaching', 0.03), ('game', 0.03), ('abstraction', 0.029), ('shortest', 0.029), ('primitive', 0.029), ('episodes', 0.029), ('bottlenecks', 0.027), ('pivotal', 0.027), ('rings', 0.027), ('subgraphs', 0.027), ('shading', 0.027), ('holding', 0.027), ('identi', 0.026), ('transitions', 0.026), ('partitioning', 0.026), ('action', 0.025), ('concern', 0.025), ('vertices', 0.025), ('path', 0.025), ('grid', 0.025), ('identify', 0.025), ('location', 0.024), ('intended', 0.024), ('options', 0.024), ('wait', 0.023), ('amherst', 0.023), ('blocked', 0.023), ('completed', 0.023), ('policy', 0.023), ('creating', 0.022), ('six', 0.022), ('environment', 0.022), ('precup', 0.022), ('picked', 0.022), ('winning', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 212 nips-2008-Skill Characterization Based on Betweenness
Author: Ozgur Simsek, Andrew G. Barto
Abstract: We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms. Our characterization may be used directly to form a set of skills suitable for a given task. More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. 1
2 0.12240104 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
Author: Luke Zettlemoyer, Brian Milch, Leslie P. Kaelbling
Abstract: In partially observable worlds with many agents, nested beliefs are formed when agents simultaneously reason about the unknown state of the world and the beliefs of the other agents. The multi-agent filtering problem is to efficiently represent and update these beliefs through time as the agents act in the world. In this paper, we formally define an infinite sequence of nested beliefs about the state of the world at the current time t, and present a filtering algorithm that maintains a finite representation which can be used to generate these beliefs. In some cases, this representation can be updated exactly in constant time; we also present a simple approximation scheme to compact beliefs if they become too complex. In experiments, we demonstrate efficient filtering in a range of multi-agent domains. 1
3 0.083081439 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir
Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1
4 0.07986252 181 nips-2008-Policy Search for Motor Primitives in Robotics
Author: Jens Kober, Jan R. Peters
Abstract: Many motor skills in humanoid robotics can be learned using parametrized motor primitives as done in imitation learning. However, most interesting motor learning problems are high-dimensional reinforcement learning problems often beyond the reach of current methods. In this paper, we extend previous work on policy learning from the immediate reward case to episodic reinforcement learning. We show that this results in a general, common framework also connected to policy gradient methods and yielding a novel algorithm for policy learning that is particularly well-suited for dynamic motor primitives. The resulting algorithm is an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it in the context of motor learning and show that it can learn a complex Ball-in-a-Cup task using a real Barrett WAMTM robot arm. 1
5 0.078507639 84 nips-2008-Fast Prediction on a Tree
Author: Mark Herbster, Massimiliano Pontil, Sergio R. Galeano
Abstract: Given an n-vertex weighted tree with structural diameter S and a subset of m vertices, we present a technique to compute a corresponding m × m Gram matrix of the pseudoinverse of the graph Laplacian in O(n + m2 + mS) time. We discuss the application of this technique to fast label prediction on a generic graph. We approximate the graph with a spanning tree and then we predict with the kernel perceptron. We address the approximation of the graph with either a minimum spanning tree or a shortest path tree. The fast computation of the pseudoinverse enables us to address prediction problems on large graphs. We present experiments on two web-spam classification tasks, one of which includes a graph with 400,000 vertices and more than 10,000,000 edges. The results indicate that the accuracy of our technique is competitive with previous methods using the full graph information. 1
6 0.077688791 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
7 0.073626041 131 nips-2008-MDPs with Non-Deterministic Policies
8 0.06500753 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
9 0.0646322 223 nips-2008-Structure Learning in Human Sequential Decision-Making
10 0.061574455 171 nips-2008-Online Prediction on Large Diameter Graphs
11 0.056024969 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
12 0.055740014 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
13 0.054631002 150 nips-2008-Near-optimal Regret Bounds for Reinforcement Learning
14 0.05230296 10 nips-2008-A rational model of preference learning and choice prediction by children
15 0.051187079 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
16 0.049182851 107 nips-2008-Influence of graph construction on graph-based clustering measures
17 0.048019424 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE
18 0.047343057 166 nips-2008-On the asymptotic equivalence between differential Hebbian and temporal difference learning using a local third factor
19 0.046518568 39 nips-2008-Bounding Performance Loss in Approximate MDP Homomorphisms
20 0.045877382 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
topicId topicWeight
[(0, -0.113), (1, 0.13), (2, 0.027), (3, -0.062), (4, 0.088), (5, -0.05), (6, 0.054), (7, -0.069), (8, -0.012), (9, -0.089), (10, -0.015), (11, 0.03), (12, -0.059), (13, -0.05), (14, 0.006), (15, -0.011), (16, -0.024), (17, 0.025), (18, -0.052), (19, 0.041), (20, -0.001), (21, -0.049), (22, 0.019), (23, 0.007), (24, 0.011), (25, -0.063), (26, -0.031), (27, -0.019), (28, 0.059), (29, -0.026), (30, 0.076), (31, 0.055), (32, -0.002), (33, -0.026), (34, 0.019), (35, -0.038), (36, 0.137), (37, 0.05), (38, -0.085), (39, 0.084), (40, 0.06), (41, -0.028), (42, 0.095), (43, -0.044), (44, 0.019), (45, 0.02), (46, -0.044), (47, 0.098), (48, 0.022), (49, -0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.94815159 212 nips-2008-Skill Characterization Based on Betweenness
Author: Ozgur Simsek, Andrew G. Barto
Abstract: We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms. Our characterization may be used directly to form a set of skills suitable for a given task. More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. 1
2 0.75514376 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
Author: Luke Zettlemoyer, Brian Milch, Leslie P. Kaelbling
Abstract: In partially observable worlds with many agents, nested beliefs are formed when agents simultaneously reason about the unknown state of the world and the beliefs of the other agents. The multi-agent filtering problem is to efficiently represent and update these beliefs through time as the agents act in the world. In this paper, we formally define an infinite sequence of nested beliefs about the state of the world at the current time t, and present a filtering algorithm that maintains a finite representation which can be used to generate these beliefs. In some cases, this representation can be updated exactly in constant time; we also present a simple approximation scheme to compact beliefs if they become too complex. In experiments, we demonstrate efficient filtering in a range of multi-agent domains. 1
3 0.52827382 211 nips-2008-Simple Local Models for Complex Dynamical Systems
Author: Erik Talvitie, Satinder P. Singh
Abstract: We present a novel mathematical formalism for the idea of a “local model” of an uncontrolled dynamical system, a model that makes only certain predictions in only certain situations. As a result of its restricted responsibilities, a local model may be far simpler than a complete model of the system. We then show how one might combine several local models to produce a more detailed model. We demonstrate our ability to learn a collection of local models on a large-scale example and do a preliminary empirical comparison of learning a collection of local models and some other model learning methods. 1
4 0.52079815 230 nips-2008-Temporal Difference Based Actor Critic Learning - Convergence and Neural Implementation
Author: Dotan D. Castro, Dmitry Volkinshtein, Ron Meir
Abstract: Actor-critic algorithms for reinforcement learning are achieving renewed popularity due to their good convergence properties in situations where other approaches often fail (e.g., when function approximation is involved). Interestingly, there is growing evidence that actor-critic approaches based on phasic dopamine signals play a key role in biological learning through cortical and basal ganglia loops. We derive a temporal difference based actor critic learning algorithm, for which convergence can be proved without assuming widely separated time scales for the actor and the critic. The approach is demonstrated by applying it to networks of spiking neurons. The established relation between phasic dopamine and the temporal difference signal lends support to the biological relevance of such algorithms. 1
5 0.48696738 187 nips-2008-Psychiatry: Insights into depression through normative decision-making models
Author: Quentin J. Huys, Joshua Vogelstein, Peter Dayan
Abstract: Decision making lies at the very heart of many psychiatric diseases. It is also a central theoretical concern in a wide variety of fields and has undergone detailed, in-depth, analyses. We take as an example Major Depressive Disorder (MDD), applying insights from a Bayesian reinforcement learning framework. We focus on anhedonia and helplessness. Helplessness—a core element in the conceptualizations of MDD that has lead to major advances in its treatment, pharmacological and neurobiological understanding—is formalized as a simple prior over the outcome entropy of actions in uncertain environments. Anhedonia, which is an equally fundamental aspect of the disease, is related to the effective reward size. These formulations allow for the design of specific tasks to measure anhedonia and helplessness behaviorally. We show that these behavioral measures capture explicit, questionnaire-based cognitions. We also provide evidence that these tasks may allow classification of subjects into healthy and MDD groups based purely on a behavioural measure and avoiding any verbal reports.
6 0.47828123 107 nips-2008-Influence of graph construction on graph-based clustering measures
7 0.47642958 10 nips-2008-A rational model of preference learning and choice prediction by children
8 0.47373754 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression
9 0.4696545 69 nips-2008-Efficient Exact Inference in Planar Ising Models
10 0.46578699 84 nips-2008-Fast Prediction on a Tree
11 0.45083198 33 nips-2008-Bayesian Model of Behaviour in Economic Games
12 0.44249517 144 nips-2008-Multi-resolution Exploration in Continuous Spaces
13 0.44158766 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
14 0.41602927 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
15 0.41118491 94 nips-2008-Goal-directed decision making in prefrontal cortex: a computational framework
16 0.40969059 171 nips-2008-Online Prediction on Large Diameter Graphs
17 0.39882666 131 nips-2008-MDPs with Non-Deterministic Policies
18 0.39820844 225 nips-2008-Supervised Bipartite Graph Inference
19 0.38426292 115 nips-2008-Learning Bounded Treewidth Bayesian Networks
20 0.38218799 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
topicId topicWeight
[(4, 0.034), (6, 0.033), (7, 0.039), (12, 0.029), (15, 0.011), (28, 0.118), (57, 0.039), (59, 0.029), (63, 0.482), (71, 0.012), (77, 0.048), (83, 0.03)]
simIndex simValue paperId paperTitle
1 0.82721847 3 nips-2008-A Massively Parallel Digital Learning Processor
Author: Hans P. Graf, Srihari Cadambi, Venkata Jakkula, Murugan Sankaradass, Eric Cosatto, Srimat Chakradhar, Igor Dourdanovic
Abstract: We present a new, massively parallel architecture for accelerating machine learning algorithms, based on arrays of vector processing elements (VPEs) with variable-resolution arithmetic. Groups of VPEs operate in SIMD (single instruction multiple data) mode, and each group is connected to an independent memory bank. The memory bandwidth thus scales with the number of VPEs, while the main data flows are local, keeping power dissipation low. With 256 VPEs, implemented on two FPGAs (field programmable gate array) chips, we obtain a sustained speed of 19 GMACS (billion multiplyaccumulate per sec.) for SVM training, and 86 GMACS for SVM classification. This performance is more than an order of magnitude higher than that of any FPGA implementation reported so far. The speed on one FPGA is similar to the fastest speeds published on a Graphics Processor for the MNIST problem, despite a clock rate that is an order of magnitude lower. Tests with Convolutional Neural Networks show similar compute performances. This massively parallel architecture is particularly attractive for embedded applications, where low power dissipation is critical. 1 I n trod u cti on Machine learning demands higher and higher compute-performance, but serial processors are not improving that much anymore - at least not as quickly as they used to. Mainstream processor development is moving to multi-core systems, using shared memory technology to hide the parallel nature of the processors. But shared memory technology does not scale to hundreds or thousands of cores. In order to reach such levels of parallelization alternative approaches have to be developed. Massively parallel general-purpose computers had limited success so far, because of difficulties programming these machines, and they remain a niche market, mostly in highperformance computing. Yet processors specialized for certain application domains, such as graphics processors or routing processors 1, have been parallelized to several hundred cores and are successful mass products. They improve performance over general-purpose processors by focusing on a few key algorithmic elements, yet still maintain enough flexibility that they can be programmed for a variety of applications. We explore in this paper if a similar approach can lead to efficient machine learning processors. 1 e.g. Nvidia, Quadro FX 5600 graphics processor; Cisco, CRS-1 routing processor Several processors optimized for machine learning, in particular for neural networks, were developed during the 1980’s and 90’s. Examples are the Synapse-1 architecture [1], or the Connectionist Network Supercomputer, CNS1 [2]. Recently there has been less activity in this field, but some accelerators are sold today for specific applications, such as the Axeon [3] processor for power train control of cars. Beside digital processors a large number of analog circuits were built, emulating neural network structures. Extremely high performance with low power dissipation is achievable, see e.g. [4][5], but these networks have little flexibility. SVM implementations on FPGA have been demonstrated in recent years [6-8], yet reached only low compute-performances. All machine learning processors had only limited success so far, indicating how difficult it is to find a good combination of performance, flexibility, price and ease of use. An important consideration is that many applications of machine learning, such as video analysis, data mining, or personalization of services, show the most promise in embedded systems. 
Embedded learning requires high compute performance while dissipating little power, a combination that is difficult to achieve and has so far required application-specific ICs (ASICs). Our aim is to develop architectures that meet the requirements for embedded learning but are programmable and can therefore be used in a wide range of applications. With the goal of analyzing different architectures, we designed a development and testing environment in which the parallel computation is mapped onto FPGAs. Initially this system was intended only for experimentation, but its performance is so high that the platform is useful in its own right as an accelerator for high-performance systems. While the experiments shown here emphasize high performance, the architecture has been designed from the start for low power dissipation. The main features for achieving this goal are: low-resolution arithmetic, keeping the main data flow local, low operating frequencies, and a modular design, so that unused parts can be powered down dynamically. All results shown here are from the test platform; migration to low-power FPGA or chip designs will be done at a later stage.

2 Algorithms - Arithmetic - Architecture

For a substantial improvement over a general-purpose processor, the algorithms, the arithmetic units, and the architecture have to be optimized simultaneously. This is not just an exercise in hardware design: algorithms and their software implementations have to be developed concurrently. Most machine learning algorithms have not been developed with parallelization in mind. Therefore, we first need to find good parallel versions, identify their performance bottlenecks, and then extract common computational patterns that can be mapped into accelerator hardware.

2.1 Algorithms

Characteristic of machine learning is that large amounts of data need to be processed, often with predictable data access patterns and no dependencies between operations over large segments of the computation. This is why data parallelization can often provide good acceleration on multi-core chips, clusters of machines, or even loosely coupled networks of machines. Using MapReduce, speedups linear in the number of processors have been reported in [9] for several machine learning algorithms. Up to 16 cores were tested, and simulations indicate good scaling to more processors in some cases. Many algorithms, such as KNN, K-means clustering, LVQ, and Neural Networks, can be reduced to forms where the computation is dominated by vector-matrix multiplications, which are easily parallelizable. For Convolutional Neural Networks (CNN) the data flow can be complex, yet the core of the computation is a convolution, an operation that has been studied extensively for parallel implementations. For Support Vector Machines (SVM), several parallel algorithms have been described, but most saturate quickly beyond 16 processors. Scaling to larger numbers of processors has been demonstrated by applying MapReduce on a graphics processor with 128 cores [10]. Another implementation on a cluster of 48 dual-core machines (with 384 MMX units) [11] even scales super-linearly and, according to simulations, scales to thousands of cores. Based on this analysis it is clear that vector-matrix and matrix-matrix multiplications for large vector dimensionalities and large numbers of vectors must be handled efficiently. Yet this alone is not sufficient, since data access patterns vary greatly between algorithms.
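To make this concrete, the following sketch (a minimal NumPy illustration of the pattern, not code from any of the cited implementations; X, x_i, gamma, and the block count are illustrative names) computes one column of an RBF kernel matrix block by block, the kind of vector-matrix workload that splits naturally across independent memory banks and SIMD groups.

```python
# Minimal sketch (not code from the cited implementations): one column of an
# RBF kernel matrix computed block by block. Each block reads only its own
# slice of X, so blocks could live in separate memory banks and be processed
# by independent SIMD groups. X, x_i, gamma, n_blocks are illustrative names.
import numpy as np

def kernel_column_blocked(X, x_i, gamma, n_blocks=4):
    """Return K[:, i] for an RBF kernel, one row block at a time."""
    n = X.shape[0]
    col = np.empty(n)
    for rows in np.array_split(np.arange(n), n_blocks):
        dots = X[rows] @ x_i                              # dominant multiply-accumulate work
        sq_dist = (X[rows] ** 2).sum(axis=1) - 2.0 * dots + x_i @ x_i
        col[rows] = np.exp(-gamma * sq_dist)              # ||x - x_i||^2 -> kernel value
    return col

# toy usage
X = np.random.rand(1000, 784).astype(np.float32)
k_col = kernel_column_blocked(X, X[0], gamma=0.02)
```

Because each block touches only its own slice of the data, the blocks can be streamed from separate memory banks and processed without any communication between them.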
We analyze this in more detail for SVM and CNN. These algorithms were chosen because they are widely used in industrial applications and cover a broad range of computation, I/O, and memory requirements.

The characteristics of SVM training are summarized in Table 1. We use an approach similar to the one described in [11] to split the computation between a host CPU and the FPGA accelerator. For large vector dimensions d, the calculation of the columns of the kernel matrix dominates by far. These columns are needed to update the gradients, and in the present implementation only this part is mapped onto the FPGA. If the dimensionality d is smaller than around 100, operations 2 and 5 can become bottlenecks and should also be mapped onto the accelerator. A challenge is that for each kernel computation a new data vector has to be loaded into the processor, leading to very high I/O requirements. We consider here dimensions of 10 to 10^4 and numbers of training data of 10^5 to 10^7, easily resulting in Gigabytes that need to be transferred to the processors at each iteration.

   Operation                             Computation   IO            Unit
1  Initialize all αx, Gx                 2n            2n            CPU
2  Do: Find working set αi, αj           I*2n          I*2n          CPU
3      Update αi, αj                     I*10          I*2           CPU
4      Get 2 columns of kernel matrix    I*2nd         I*(2d+2dn)    FPGA
5      Update gradients Gx               I*n           I*n           CPU
6  While not converged

Table 1: Compute and IO requirements of each step of SVM training (SMO algorithm). n: number of training data; d: dimension of the vectors; G: gradients; α: support vector factors; I: number of iterations. The last column indicates whether the execution happens on the host CPU or on the accelerator FPGA. It is assumed that the kernel computation requires a dot product between vectors (e.g. rbf, polynomial, tanh kernels).

Neural network algorithms are essentially sequences of vector-matrix multiplications, but networks with special connectivity patterns, such as convolutional networks, have very different IO characteristics than fully connected networks. Table 2 shows the computation and IO requirements for scanning several convolution kernels over one input plane. A full network requires multiple such operations for one layer, with nonlinearities between layers. We map all of these operations onto the FPGA accelerator, since intermediate results are re-used right away. The most significant difference between the SVM and the CNN is the Compute/IO ratio: SVM: ~1; CNN: ~L*k^2 > 100. The requirements of the two algorithms are therefore very different, and handling both cases efficiently is quite a challenge for an architecture design.

   Operation                   Computation   IO       Unit
1  Load L kernels                            L*k^2    FPGA
   For all input pixels:
2    Shift in new pixel                      n*m      FPGA
3    Multiply kernels          n*m*L*k^2              FPGA
4    Shift out result                        n*m      FPGA

Table 2: Compute and IO requirements of the CNN computation (forward pass), where L kernels of size k*k are scanned simultaneously over an input plane of size n*m. This is representative of implementations with kernel unrolling (kernel pixels processed in parallel). Internal shifts, computation of the non-linearity, and border effects are not shown.

2.2 Arithmetic

Hardware can be built much more compactly, and runs with lower power dissipation, if it uses fixed-point instead of floating-point operations. Fortunately, many learning algorithms tolerate a low resolution in most of the computations. This has been investigated extensively for neural networks [12][13], but less so for other learning algorithms.
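As a rough software illustration of what such low-resolution arithmetic looks like (an emulation sketch, not the hardware design; the bit width and scale factor are assumptions chosen for the example), the factors of a dot product can be quantized to a few bits while the accumulation stays exact:

```python
# Sketch of fixed-point dot-product arithmetic (software emulation, not the
# FPGA design). Bit width and scale are assumptions chosen for the example.
def quantize(x, bits, scale):
    """Round x to a signed fixed-point integer with the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, int(round(x * scale))))

def fixed_point_dot(u, v, bits=16, scale=1 << 8):
    """Dot product with low-resolution factors and exact integer accumulation."""
    acc = 0                                  # Python ints never overflow; real hardware
    for a, b in zip(u, v):                   # would use a wide (e.g. 48 bit) accumulator
        acc += quantize(a, bits, scale) * quantize(b, bits, scale)
    return acc / float(scale * scale)        # rescale to a real-valued result

# toy comparison against the floating-point result
u = [0.1 * i for i in range(100)]
v = [0.05 * (100 - i) for i in range(100)]
print(fixed_point_dot(u, v), sum(a * b for a, b in zip(u, v)))
```

The point of the design, discussed in more detail below, is that only the factors need to be narrow; the accumulator and a few state variables keep a higher resolution.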
Learning from data is inherently a noisy process, because we see only a sparse sampling of the true probability distributions. A different type of noise is introduced in gradient-descent algorithms, when only a few training data are used at a time to move the optimization forward iteratively. This noise is particularly pronounced for stochastic gradient descent. There is no point in representing noisy variables with high resolution, and it is therefore a property inherent to many algorithms that low-resolution computation can be used. It is important not to confuse this tolerance to low resolution with the resolution required to avoid numeric instabilities. Some of the computations have to be performed at high resolution, in particular for variables that are updated incrementally: they maintain the state of the optimization and may change in very small steps. But usually by far the largest part of the computation can be executed at low resolution. The key is that the hardware is flexible enough to take advantage of reduced resolution while handling high resolution where necessary.

                  Kernel: float                        Kernel: 16 bit fixed point
Problem           Obj. f.     # SV      F-score       Obj. f.     # SV      F-score      F-sc. (4b in)
Adult             31,930.77   11,486    77.58         31,930.1    11,490    77.63        NA
Forest            653,170.7   49,333    98.29         652,758     49,299    98.28        NA
MNIST             4,960.13    6,172     99.12         4,959.64    6,166     99.11        99.11
NORB              1,243.71    3,077     93.34         1,244.76    3,154     93.26        92.78

Table 3: Comparison of the results of SVM training when the kernels are represented with floating-point numbers (32 or 64 bits, left half) and with 16 bit fixed point (right half). The last column shows the results when the resolution of the training data is reduced from 8 bit to 4 bit. For NORB this reduces the accuracy; all other differences in accuracy are not significant. All are two-class problems. Adult: n=32,562, d=122; Forest: n=522,000, d=54 (2 against the rest); MNIST: n=60,000, d=784 (odd-even); NORB: n=48,560, d=5,184.

We developed a simulator that allows running the training algorithms with various resolutions in each of the variables (a minimal sketch of such a resolution sweep is shown after Figure 1 below). A few examples for SVM training are shown in Table 3. Reducing the resolution of the kernel values from double or float to a 16 bit fixed-point representation does not affect the accuracy for any of the problems. Therefore all multiplications in the dot products of the kernel computation can be done at low resolution (4-16 bit factors), but the accumulator needs sufficient resolution to avoid over/underflow (48 bit). Once the calculation of the kernel value is completed, it can be reduced to 16 bit. A resolution of 16 bit is also tolerable for the α values, but a high resolution is required for the gradients (double). For Neural Networks, including CNN, several studies have confirmed that states and gradients can be kept at low resolution (<16 bit), but the weights must be maintained at high resolution (float) during training (see e.g. [12]). In our own evaluations, 24 bits for the weights tend to be sufficient. Once the network is trained, low resolutions (<16 bit) can be used for the weights as well during classification.

2.3 Architecture

Figure 1: Left: Schematic of the architecture with the main data flows; on one FPGA, 128 VPEs are configured into four SIMD groups; L-S: load-store units. Right: Picture of an FPGA board; in our experiments one or two of them are used, connected via PCI bus to a host CPU.
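The following is the minimal resolution-sweep sketch referred to above (our illustration, not the authors' simulator; the RBF kernel, data sizes, and bit widths are placeholders). It only shows how kernel values degrade when forced onto a b-bit fixed-point grid; the actual simulator reruns the full training at each resolution.

```python
# Minimal resolution sweep (our illustration, not the authors' simulator):
# kernel values are rounded to a b-bit signed fixed-point grid and the
# worst-case deviation is reported. Data sizes and gamma are placeholders.
import numpy as np

def rbf_kernel(X, gamma=0.02):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def quantize_to_bits(K, bits, max_abs=1.0):
    """Round values to a signed fixed-point grid with the given bit width."""
    step = max_abs / (1 << (bits - 1))
    hi = ((1 << (bits - 1)) - 1) * step
    return np.clip(np.round(K / step) * step, -hi, hi)

X = np.random.rand(200, 50)
K = rbf_kernel(X)
for bits in (32, 16, 8, 4):
    err = np.abs(quantize_to_bits(K, bits) - K).max()
    print(f"{bits:2d} bit kernel values: max abs error {err:.1e}")
```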
Based on the analysis above, it is clear that the architecture must be optimized for processing massive amounts of data at relatively low precision. Most of the time, data access patterns are predictable and data are processed in blocks that can be stored contiguously. This type of computation is well suited to vector processing, and simple vector processing elements (VPEs) with fixed-point arithmetic can handle the operations. Since large blocks of data are typically processed with the same operation, groups of VPEs can work in SIMD (single instruction multiple data) mode. Algorithms must then be segmented to map the high-volume, low-precision parts onto the vector accelerators and the parts requiring high-precision arithmetic onto the CPU.

The most important design decision is the organization of the memory. Most memory accesses are done in large blocks, so the data can be streamed, making complex caching unnecessary. This is fortunate, since the amounts of data to be loaded onto the processor are so large that conventional caching strategies would be overwhelmed anyway. Because the blocks tend to be large, high data bandwidth is crucial, while the latency for starting a block transfer is less critical. Therefore we can use regular DDR memories and still obtain high IO rates. This led to the design shown schematically in Figure 1, where independent memory banks are connected via separate IO ports to each group of 32 VPEs. By connecting multiple units of the kind shown in Figure 1 to a CPU, this architecture scales to larger numbers of VPEs. Parallel data IO and parallel memory access scale simultaneously with the number of parallel cores, and we therefore refer to this as the P3 (P-cube) architecture. Notice also that the main data flow is only local, between a group of VPEs and its own memory block. Avoiding movements of data over long distances is crucial for low power dissipation. How far this architecture can reasonably scale with one CPU depends on the algorithms, the amount of data, and the vector dimensionality (see below). A few hundred VPEs per CPU have provided good acceleration in all our tests, and much higher numbers are possible with multi-core CPUs and faster CPU-FPGA connections.

3 Implementation of the P3 Architecture

This architecture fits surprisingly well onto some of the recent FPGA chips, which are available with several hundred Digital Signal Processing (DSP) units and over 1,000 IO pins for data transfers. The boards used here each contain one Xilinx Virtex 5 LX330T-2 FPGA coupled to 4 independent DDR2 SDRAM banks with a total of 1 GB, and 2 independent 4 MB SSRAM memory banks (commercial board from AlphaData). One FPGA chip contains 192 DSP units with a maximum speed of 550 MHz, which corresponds to a theoretical compute performance of 105.6 GMACS (18 bit and 25 bit operands). There is a total of 14 Mbit of on-chip memory, and the chip provides 960 pins for data IO. Due to routing overhead, not all DSP units can be used, and the actual clock frequencies tend to be considerably lower than what is advertised for such chips (typically 230 MHz or less for our designs). Nevertheless, we obtain high performance because we can use a large number of DSP units for the main computation. The main architectural features are:

• Parallel processing (on one chip): 128 VPEs (hardware DSPs) are divided into 4 blocks of 32, each group controlled by one sequencer with a vector instruction set.

• Custom precision: Data are represented with 1 to 16 bit resolution.
Higher resolutions are possible by operating multiple DSPs as one processor.

• Overlapping computation and communication: CPU-FPGA communication is overlapped with the FPGA computation.

• Overlapping memory operations with computation: All loads and stores between the FPGA and off-chip memory are performed concurrently with computations.

• High off-chip memory bandwidth: 6 independent data ports, each 32 bits wide, access banked memories concurrently (12 GB/s per chip).

• Streaming data flow, simple access patterns: Load/store units are tailored to streaming input and output data and to simple, bursty access patterns. Caching is done under application control with dual-port memory on chip.

• Load/store with (de)compression: To increase the effective IO bandwidth, the load/store units provide compression and decompression in hardware.

Figure 2 shows the configuration of the VPEs for the vector dot-product computation used in SVM training and classification. For training, the main computation is the calculation of one column of the kernel matrix. One vector is pre-fetched and stored in on-chip memory; all other vectors are streamed in from off-chip memory banks 1-4. Since this is a regular and predictable access pattern, we can use burst mode, achieving a throughput of close to one memory word per cycle. The speed is nevertheless IO bound. When several vectors can be stored on-chip, as is the case for classification, the speed becomes compute bound.

Figure 2: Architecture for the vector dot-product computation. The left side shows a high-level schematic with the main data flow. The data are streamed from memory banks 1-4 to the VPE arrays, while memory banks 5 and 6 alternately receive results or stream them back to the host. The right side shows how a group of VPEs is pipelined to improve clock speed.

The operation for SVM training on the FPGA corresponds to a vector-matrix multiplication, and the one for classification to a matrix-matrix multiplication. Therefore the configuration of Figure 2 is useful for many other algorithms as well, wherever operations with large vectors and matrices are needed, such as Neural Networks. For Convolutional Neural Networks we implemented a specialized configuration for more efficiency and lower power dissipation: the VPEs are daisy-chained and operate as a systolic array. In this way we can take advantage of the high computation-to-IO ratio (Table 2) to reduce the data transfers from memory.

4 Evaluations

We evaluated SVM training and classification on the NORB and MNIST problems, the latter with up to 2 million training samples (data from [11]). Both are benchmarks with vectors of high dimensionality, representative of applications in image and video analysis. The computation is split between CPU and FPGA as indicated in Table 1. The DDR2 memory banks are clocked at 230 MHz, providing double that rate for data transfers. The data may be compressed to save IO bandwidth; on the FPGA they are first decompressed and then distributed to the VPEs. In our case, a 32 bit word contains eight 4-bit vector components. Four 32 bit words are needed to feed all 32 VPEs of a group; therefore clocking the VPEs faster than 115 MHz does not improve performance. A VPE executes one multiply-accumulate operation per clock cycle, resulting in a theoretical maximum of 14.7 GMACS per chip. The sustained compute rate is lower, about 9.4 GMACS, due to overhead (see Table 4).
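These clock and throughput figures can be checked with a short back-of-the-envelope calculation (our reading of the numbers above, not a formula taken from the design files):

```python
# Back-of-the-envelope check of the numbers above (our reading of the text):
# each group of 32 VPEs is fed by one 32 bit port of a DDR2 bank at 230 MHz.
vpe_per_group      = 32
groups             = 4
mem_clock_hz       = 230e6
bits_per_word      = 32
bits_per_component = 4                         # eight 4-bit components per word

words_per_second      = 2 * mem_clock_hz       # double data rate
components_per_second = words_per_second * (bits_per_word // bits_per_component)
# each VPE consumes one new component per clock, i.e. 32 per group and cycle:
io_limited_vpe_clock = components_per_second / vpe_per_group
print(f"IO-limited VPE clock: {io_limited_vpe_clock / 1e6:.0f} MHz")      # ~115 MHz

peak_gmacs = vpe_per_group * groups * io_limited_vpe_clock / 1e9
print(f"Theoretical peak: {peak_gmacs:.2f} GMACS per chip")               # ~14.7 GMACS
```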
The computation on the host CPU overlaps with that on the FPGA and has no effect on the speed in the experiments shown here. For classification the VPEs can be clocked higher, at 230 MHz. By using 4-bit operands we can execute 2 multiply-accumulates simultaneously on one DSP, resulting in a speed that is more than four times higher: a sustained 43.0 GMACS, limited by the number and speed of the VPEs. Adding a second FPGA card doubles the speed, with little saturation yet, but for more FPGAs per CPU saturation will set in (see Fig. 3). The compute speed in GMACS obtained for NORB is almost identical.

Samples   Iterations   CPU               CPU+MMX              CPU+FPGA             CPU+2 FPGA
                       time      speed   time        speed    time        speed    time        speed
60k       8,000        754 s     0.5     240 s       1.57     40 s        9.42     21 s        17.9
2M        266,900      --        --      531,534 s   1.58     88,589 s    9.48     48,723 s    17.2

Table 4: Training times and average compute speed for SVM training. Systems tested: CPU (Opteron, 2.2 GHz); CPU using MMX; CPU with one FPGA; CPU with two FPGA boards. Results are shown for training sizes of 60k and 2M samples. Compute speed is in GMACS (kernel computations only). Training algorithm: SMO with second-order working set selection.

Parallelizations of SVM training have been reported recently for a GPU [10] and for a cluster [11], both using the MNIST data. In [10], different stopping bounds were used than here and in [11]. Nevertheless, a comparison of the compute performance is possible because, based on the number of iterations, we can compute the average GMACS for the kernel computations. As can be seen in Table 5, a single FPGA is similar in speed to a GPU with 128 stream processors, despite a clock rate that is about 5.5 times lower for IO and 11 times lower for the VPEs. The cluster with 384 MMX units is about 6 times faster than one FPGA with 128 VPEs, but dissipates about two orders of magnitude more electric power. For the FPGA this calculation includes only the computation of the kernel values, while the part on the CPU is neglected. This is justified for this study because the rest of the calculations can be mapped onto the FPGA as well and will increase the power dissipation only minimally.

Processor             Number of cores   Clock speed   Operand type   Power dissipation   Average compute speed
CPU (Opteron)         1                 2.2 GHz       float          40 W                0.5 GMACS
GPU (from [10])       128               1.35 GHz      float          80 W                7.4 GMACS
Cluster (from [11])   384               1.6 GHz       byte           > 1 kW              54 GMACS
FPGA                  128               0.12 GHz      4-bit nibble   9 W                 9.4 GMACS

Table 5: Comparison of performance for SVM training (MNIST data). GPU: Nvidia 8800 GTX. Cluster: 48 dual-core CPUs (Athlon), 384 MMX units. The GPU was trained with 60k samples ([10], Table 2, second order); the cluster was trained with 2 million samples.

Figure 3: Acceleration of SVM training as a function of the number of VPEs. MNIST: n=2,000,000, d=784; NORB: n=48,560, d=5,184. The points for 128 and 256 VPEs are experimental; the higher ones are simulations. Curves MNIST, NORB: multiple FPGAs are attached to one CPU. Curve MNIST C: each FPGA is attached to a separate host CPU.

Scaling of the acceleration with the number of VPEs is shown in Figure 3. The reference speed is that of one FPGA attached to a CPU. The evaluation has been done experimentally for 128 and 256 VPEs, and beyond that with a simulator. The onset of saturation depends on the dimensionality of the vectors, but to a much lesser extent on the number of training vectors (up to the limit of the memory on the FPGA card). MNIST saturates for more than two FPGAs because the CPU and FPGA computation times then become comparable. For the larger vectors of NORB (d=5,184) this saturation starts to be noticeable only for more than 4 FPGAs.
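This saturation behaviour can be captured with a crude overlap model (our simplification, with invented per-iteration times purely for illustration): per iteration, the CPU part runs concurrently with the FPGA kernel computation, so extra boards help only while the FPGA part dominates.

```python
# Crude saturation model for Figure 3 (our simplification, not the paper's
# simulator): the per-iteration CPU work overlaps with the FPGA kernel
# computation, so adding FPGA boards only helps until the CPU part dominates.
def time_per_iteration(t_cpu, t_fpga, n_fpga):
    return max(t_cpu, t_fpga / n_fpga)          # overlapped execution

def acceleration(t_cpu, t_fpga, n_fpga):
    """Speed relative to a single FPGA attached to one CPU (reference in Fig. 3)."""
    return time_per_iteration(t_cpu, t_fpga, 1) / time_per_iteration(t_cpu, t_fpga, n_fpga)

# illustrative numbers only: the FPGA part dominates for one board
t_cpu, t_fpga = 1.0, 8.0
for k in (1, 2, 4, 8, 16):
    print(k, round(acceleration(t_cpu, t_fpga, k), 2))
# prints 1.0, 2.0, 4.0, 8.0, 8.0: linear until t_fpga/k falls below t_cpu, then flat
```

Attaching each FPGA to its own host CPU, as in the MNIST C configuration, effectively removes the fixed CPU term from this model, which is consistent with the continued scaling of that curve.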
Alternatively, a system can be scaled by grouping multiple CPUs, each with one attached FPGA accelerator. The scaling then follows a linear or even super-linear acceleration (curve MNIST C) to several thousand VPEs. If the CPUs work in a cluster arrangement, the scaling is similar to the one described in [11].

For convolutional neural networks, the architecture of Figure 2 is modified to allow a block of VPEs to operate as a systolic array. In this way convolutions can be implemented with minimal data movement. In addition to the convolution, sub-sampling, the non-linear functions, and the logistics for handling multiple layers with arbitrary numbers of kernels in each layer are also done on the FPGA. Four separate blocks of such convolvers are packed onto one FPGA, using 100 VPEs. Clocked at 115 MHz, this architecture provides a maximum of 11.5 GMACS. Including all overhead, the sustained speed is about 10 GMACS.

5 Conclusions

By systematically exploiting characteristic properties of machine learning algorithms, we developed a new massively parallel processor architecture that is very efficient and can be scaled to thousands of processing elements. The implementation demonstrated here is more than an order of magnitude higher in performance than previous FPGA implementations of SVM or CNN. For the MNIST problem it is comparable to the fastest GPU implementations reported so far. These results underline the importance of flexibility over raw compute speed for massively parallel systems. The flexibility of the FPGA allows more efficient routing and packing of the data and the use of computations with the lowest resolution an algorithm permits. The results of Table 5 indicate the potential of this architecture for low-power operation in embedded applications.

References

[1] Ramacher, et al. (1995) Synapse-1: A high-speed general purpose parallel neurocomputer system. In Proc. 9th Intl. Symposium on Parallel Processing (IPPS'95), pp. 774-781.
[2] Asanovic, K., Beck, Feldman, J., Morgan, N. & Wawrzynek, J. (1994) A Supercomputer for Neural Computation. In Proc. IEEE Intl. Joint Conference on Neural Networks, pp. 5-9, Orlando, Florida.
[3] Neil, P. (2005) Combining hardware with a powerful automotive MCU for powertrain applications. In Industrial Embedded Resource Guide, p. 88.
[4] Korekado, et al. (2003) A Convolutional Neural Network VLSI for Image Recognition Using Merged/Mixed Analog-Digital Architecture. In Proc. 7th KES 2003, Oxford, pp. 169-176.
[5] Murasaki, M., Arima, Y. & Shinohara, H. (1993) A 20 Tera-CPS Analog Neural Network Board. In Proc. Int. Joint Conf. Neural Networks, pp. 3027-3030.
[6] Pedersen, R. & Schoeberl, M. (2006) An Embedded Support Vector Machine. WISE 2006.
[7] Dey, S., Kedia, M., Agarwal, N. & Basu, A. Embedded Support Vector Machine: Architectural Enhancements and Evaluation. In Proc. 20th Int. Conf. VLSI Design.
[8] Anguita, D., Boni, A. & Ridella, S. (2003) A Digital Architecture for Support Vector Machines: Theory, Algorithm, and FPGA Implementation. IEEE Trans. Neural Networks, 14/5, pp. 993-1009.
[9] Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A. & Olukotun, K. (2007) Map-Reduce for Machine Learning on Multicore. Advances in Neural Information Processing Systems 19, MIT Press.
[10] Catanzaro, B., Sundaram, N. & Keutzer, K. (2008) Fast Support Vector Machine Training and Classification on Graphics Processors. Proc. 25th Int. Conf.
Machine Learning, pp. 104-111.
[11] Durdanovic, I., Cosatto, E. & Graf, H. (2007) Large Scale Parallel SVM Implementation. In L. Bottou, O. Chapelle, D. DeCoste & J. Weston (eds.), Large Scale Kernel Machines, pp. 105-138, MIT Press.
[12] Simard, P. & Graf, H. (1994) Backpropagation without Multiplication. In J. Cowan, G. Tesauro & J. Alspector (eds.), Neural Information Processing Systems 6, pp. 232-239, Morgan Kaufmann.
[13] Savich, A., Moussa, M. & Areibi, S. (2007) The Impact of Arithmetic Representation on Implementing MLP-BP on FPGAs: A Study. IEEE Trans. Neural Networks, 18/1, pp. 240-252.
same-paper 2 0.81314617 212 nips-2008-Skill Characterization Based on Betweenness
Author: Ozgur Simsek, Andre S. Barreto
Abstract: We present a characterization of a useful class of skills based on a graphical representation of an agent’s interaction with its environment. Our characterization uses betweenness, a measure of centrality on graphs. It captures and generalizes (at least intuitively) the bottleneck concept, which has inspired many of the existing skill-discovery algorithms. Our characterization may be used directly to form a set of skills suitable for a given task. More importantly, it serves as a useful guide for developing incremental skill-discovery algorithms that do not rely on knowing or representing the interaction graph in its entirety. 1
3 0.80043709 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
Author: Peter Carbonetto, Mark Schmidt, Nando D. Freitas
Abstract: The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning. Despite its farreaching application, there is almost no work on applying stochastic approximation to learning problems with general constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization. 1
4 0.71418029 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
5 0.40287864 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
Author: Marek Petrik, Bruno Scherrer
Abstract: Most algorithms for solving Markov decision processes rely on a discount factor, which ensures their convergence. It is generally assumed that using an artificially low discount factor will improve the convergence rate, while sacrificing the solution quality. We however demonstrate that using an artificially low discount factor may significantly improve the solution quality, when used in approximate dynamic programming. We propose two explanations of this phenomenon. The first justification follows directly from the standard approximation error bounds: using a lower discount factor may decrease the approximation error bounds. However, we also show that these bounds are loose, thus their decrease does not entirely justify the improved solution quality. We thus propose another justification: when the rewards are received only sporadically (as in the case of Tetris), we can derive tighter bounds, which support a significant improvement in the solution quality with a decreased discount factor. 1
6 0.40105662 141 nips-2008-Multi-Agent Filtering with Infinitely Nested Beliefs
7 0.39557257 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms
8 0.39206758 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
9 0.39159808 209 nips-2008-Short-Term Depression in VLSI Stochastic Synapse
10 0.39046499 196 nips-2008-Relative Margin Machines
11 0.38899919 1 nips-2008-A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation
12 0.38743046 75 nips-2008-Estimating vector fields using sparse basis field expansions
13 0.38546613 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
14 0.38449538 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
15 0.38384569 62 nips-2008-Differentiable Sparse Coding
16 0.3832781 134 nips-2008-Mixed Membership Stochastic Blockmodels
17 0.37702498 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
18 0.37120488 30 nips-2008-Bayesian Experimental Design of Magnetic Resonance Imaging Sequences
19 0.369194 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
20 0.36564332 84 nips-2008-Fast Prediction on a Tree