nips nips2003 nips2003-105 knowledge-graph by maker-knowledge-mining

105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time


Source: pdf

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. [sent-5, score-1.026]

2 We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. [sent-6, score-1.095]

3 We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). [sent-7, score-1.123]

4 In the setting of self play (where all agents use the same algorithm), most existing MARL algorithms seek to learn to play a Nash equilibrium. [sent-14, score-0.912]

5 An equilibrium can be viewed as a convention that the learning agents reach for playing the unknown game. [sent-16, score-1.052]

6 A key difficulty here is that a game usually contains multiple equilibria, and the agents need to coordinate on which one to play. [sent-17, score-0.802]

7 Furthermore, the agents may have different preferences among the equilibria. [sent-18, score-0.605]

8 Most prior work has avoided this problem by focusing on games with a unique equilibrium or games in which the agents have common interests. [sent-19, score-0.818]

9 In this paper, we advocate Pareto-optimal Nash equilibria as the equilibria that a MARL algorithm should drive agents to. [sent-20, score-0.822]

10 This is a natural goal: Pareto-optimal equilibria are equilibria for which no other equilibrium exists where both agents are better off. [sent-21, score-0.844]

11 We further design efficient algorithms for learning agents to achieve this goal in polynomial time. [sent-22, score-0.603]

12 2 Definitions and background We study a repeated 2-agent game where the agents do not know the game up front, and try to learn how to play based on the experiences in the previous rounds of the game. [sent-23, score-1.284]

13 As usual, we assume that the agents observe each others’ actions. [sent-24, score-0.563]

14 We allow for the possibility that the agents receive noisy but bounded payoffs (as is the case in many real-world MARL settings); this complicates the game because the joint action does not determine the agents’ payoffs deterministically. [sent-25, score-1.335]

15 Furthermore, the agents may prefer different outcomes of the game. [sent-26, score-0.559]

16 The results of their joint action can be represented in matrix form: The rows correspond to agent 1’s actions and the columns correspond to agent 2’s actions. [sent-33, score-0.752]

17 A Nash equilibrium (NE) is a strategy profile π = {πi , π−i } in which no agent can improve its payoff by unilaterally deviating to a different strategy: ui ({πi , π−i }) ≥ ui ({πi , π−i }) for both agents (i = 1, 2) and any strategy πi . [sent-39, score-1.073]

18 We focus on the important and widely studied class of games called coordination games: 1 Definition 1 [Coordination game] A 2-agent coordination game G is an N × N matrix game with N strict Nash equilibria (called conventions). [sent-43, score-1.112]

19 ) A coordination game captures the notion that agents have the common interest of being coordinated (they both get higher payoffs by playing equilibria than other strategy profiles), but at the same time there are potentially non-identical interests (each agent may prefer different equilibria). [sent-45, score-1.584]

20 In this game, though agents favor different conventions, they would rather have a deal than opt out. [sent-65, score-0.574]

21 The convention where both agents opt out is Pareto-dominated and the other two conventions are Pareto-optimal. [sent-66, score-1.236]

22 Definition 2 [Pareto-optimality] A convention {a1 , a2 } is Pareto-dominated if there exists at least one other convention {a1 , a2 } such that ui ({a1 , a2 }) < ui ({a1 , a2 }) and u−i ({a1 , a2 }) ≤ u−i ({a1 , a2 }). [sent-67, score-0.95]

23 A Pareto-dominated convention is unpreferable because there is another convention that makes both agents better off. [sent-73, score-1.326]

24 Therefore, we advocate that a MARL algorithm should at least cause agents to learn a PO convention. [sent-74, score-0.622]

25 2 Learning in game theory: Necessary background Learning in game theory [6] studies repeated interactions of agents, usually with the goal of having the agents learn to play Nash equilibrium. [sent-79, score-1.284]

26 In the former, the agents are usually assumed to know the game before play, while in MARL the agents have to learn the game structure in addition to learning how to play. [sent-81, score-1.667]

27 1 Adaptive play (AP) The learning process of AP is as follows: Learning agents are assumed to have a memory to keep record of recent m plays of the game. [sent-89, score-0.773]

28 The stochastic stable conventions of a game can be identified by considering the mistakes being made during state transitions. [sent-123, score-0.651]

29 Call the absorbing states in the unperturbed process convention states in the perturbed process. [sent-125, score-0.537]

30 For each convention state h, we construct an h-tree τh (with each node being a convention state) such that there is a unique direct path from every other convention state to h. [sent-126, score-1.281]

31 Label the direct edges (v, v ) in τh with the number of mistakes rv,v needed to make the transition from convention state v to convention state v . [sent-127, score-0.909]

32 The stochastic potential of the convention state h is the least resistance among all possible h-trees τh . [sent-129, score-0.489]

33 3 Reinforcement learning Reinforcement learning offers an effective way for agents to estimate the expected payoffs associated with individual actions based on previous experience—without knowing the game structure. [sent-132, score-1.168]

34 In our single-state setting, we take a simplified form of the algorithm, with Q-value Qi (a) recording the estimate of the expected payoffs ui (a) for t agent i at time t. [sent-136, score-0.511]

35 Qi (a) = Qi (a) + α(Rt − Qi (a)) t+1 t t (1) In single-agent RL, if each action is sampled infinitely and the learning rate α is decreased over time fast enough but not too fast, the Q-values will converge to agent i’s expected payoff ui . [sent-138, score-0.543]

36 In this paper, we develop efficient MARL algorithms for learning a PO convention in an unknown coordination game. [sent-146, score-0.555]

37 We consider both the perfect monitoring setting where agents observe each others’ payoffs, and the imperfect monitoring setting where agents do not observe each others’ payoffs (and do not want to tell each other their payoffs). [sent-147, score-1.717]

38 In the latter setting, our agents learn to play PO conventions without learning each others’ preferences over conventions. [sent-148, score-1.086]

39 Then with probability at least 1 − δ, agents will start to play a joint policy a within steps polynomial in 1 , 1 , and N , such that δ there exists no convention a that satisfies u1 (a) + < u1 (a ) and u2 (a) + < u2 (a ). [sent-150, score-1.261]

40 3 An efficient algorithm for the perfect monitoring setting In order to play an -PO convention, agents need to find all these conventions first. [sent-152, score-1.142]

41 However, these approaches are thin for the goal: Even when the game structure estimated from samples is within of G, its PO conventions might still be away from these of G. [sent-154, score-0.554]

42 If (1) GM = (Q1 , Q2 ) has N conventions, and (2) for every convention {ai , a−i } in GM and every agent i, M M Qi ({ai , a−i }) > Qi ({ai , a−i }) + 2 w for every ai = ai , then Stop; else w ← w + 1, Goto Step 2. [sent-165, score-0.799]

43 M M In Step 2 and Step 3, agent i samples the coordination game G sufficiently so that the game GM = (Q1 , Q2 ) formed from M samples is within w of G with probability at least M M δ 1 − 2w−1 . [sent-166, score-1.009]

44 So, any convention not strictly Pareto-dominated in G M is a 2 -PO convention in G by definition. [sent-169, score-0.794]

45 After learning the game, the agents will further learn how to play, that is, to determine which PO convention in GM to choose. [sent-174, score-0.992]

46 A simple solution is to let two agents randomize their action selection until they arrive at a PO convention in GM . [sent-175, score-1.052]

47 However, this treatment is problematic because each agent may have different preferences over the conventions and thus will not randomly choose an action unless it believes the action is a best response to the other’s strategy. [sent-176, score-0.88]

48 In this paper, we consider the learning agents which use adaptive play to negotiate the convention they should play. [sent-177, score-1.154]

49 In game theory, AP was suggested as a simple learning model for bargaining [18], where each agent dynamically adjusts its offer w. [sent-178, score-0.567]

50 Choose an action against V GM as in adaptive play except that when there exist multiple best-response actions that correspond to some conventions in the game, choose an action that belongs to a convention that offers the greatest payoff (breaking remaining ties randomly). [sent-190, score-1.263]

51 In Step 1, agents construct a virtual game V GM from the game GM = (Q1 , Q2 ) by M M setting the payoffs of all actions except PO conventions to zero. [sent-195, score-1.664]

52 Theorem 1 In any unknown 2-agent coordination game with perfect monitoring, if m ≥ 4k, agents that use the above algorithm will learn a 2 -PO policy with probability at least 1 − 2δ in time poly(N, 1 , 1 , m, k). [sent-202, score-1.086]

53 4 An efficient algorithm for the imperfect monitoring setting In this section, we present an efficient MARL algorithm for the imperfect monitoring setting where the agents do not observe each others’ payoff during learning. [sent-204, score-1.07]

54 Actually, since agents can observe joint actions, they may explicitly signal to each other their preferences over conventions through actions. [sent-205, score-0.938]

55 Here we assume that agents are not willing to explicitly signal each other their preferences over conventions, even part of such information (e. [sent-207, score-0.605]

56 Because each agent is unable to observe the other’s payoffs and because there is noise in payoffs received, it is difficult for the agent to determine when enough samples have been taken to identify all conventions. [sent-211, score-0.904]

57 We address this by allowing agents to demonstrate to each other their understanding of game structure (where the conventions are) after sampling. [sent-212, score-1.067]

58 Each agent plays its actions in order, with wrap around, until both agents have just wrapped around. [sent-214, score-0.917]

59 Given and δ, agents are randomly sampling the game until every joint action has been visited at least δ M ( w , w−1 ) times (with w = 1) and updating their Q-values using Equation 1 along the way. [sent-220, score-1.012]

60 Starting at the same time, each agent i goes through the other’s N individual actions a −i in order, playing the action ai such that Qi ({ai , a−i }) > 2 w + Qi ({ai , a−i }) for any ai = ai . [sent-222, score-0.727]

61 (If such an action ai does not exists M M for some a−i , then agent i plays action 1 throughout this demonstration phase. [sent-223, score-0.615]

62 Each agent determines whether the agents hold the same view of the N strict Nash equilibria. [sent-225, score-0.831]

63 After learning the game, the agents start to learn how to play. [sent-227, score-0.595]

64 The difficulty is, without knowing about the other’s preferences over conventions, agents cannot explicitly eliminate Pareto-dominated conventions in GM . [sent-228, score-0.902]

65 Thus, even if there exists a better convention in which one agent compromise a little but the other is better off greatly, it will not be chosen. [sent-232, score-0.647]

66 The intriguing question here is whether agents can learn to play a PO convention without knowing the other’s preferences at all. [sent-233, score-1.227]

67 2) causes agents to choose “stochastic stable” conventions most of time. [sent-236, score-0.819]

68 Specifically, over Qi , each agent i first constructs a best-response M set by including, for each possible action of the other agent a−i , the joint action {a∗ , a−i } i where a∗ is i’s best response to a−i . [sent-238, score-0.783]

69 We have proved that in the virtual game (V Q1 , V Q2 ), conventions strictly M M Pareto-dominated are not stochastic stable [15]. [sent-240, score-0.605]

70 This implies that using AP with persistent noise, agents will play 2 -PO conventions most of time even without knowing the other’s preferences. [sent-241, score-1.017]

71 Therefore, if the agents can stop using noise in action selection at some point (and will thus play a particular convention from then on), there is a high probability that they end up playing a 2 -PO convention. [sent-242, score-1.259]

72 In an N × N game this occurs for both agents at the same time, but the technique also works for games with a different number of actions per agent. [sent-246, score-1.014]

73 3 state (containing the m most recent joint actions) is a convention state. [sent-247, score-0.501]

74 If it is, then in each of the following k steps, the agent has probability ε to randomly independently choose an action, and probability 1 − ε to play the best-response action. [sent-251, score-0.489]

75 We can model this learning process as a Markov chain, with the state space including all and only convention states. [sent-253, score-0.465]

76 Let st be the state at time t and sc be the first convention state t the agents reach after time t. [sent-254, score-1.039]

77 It is also irreducible and aperiodic, because with ε > 0, all actions have positive probability to be chosen in a convention state. [sent-257, score-0.528]

78 Our algorithm intends to let agents stop taking noisy actions at some point and stick to a particular convention. [sent-262, score-0.624]

79 If the sampling is unbiased, the agents have a probability at least 1 − Cε to learn a 2 -convention. [sent-264, score-0.618]

80 After the walk, if agents find that Ah defines an h-tree (see Section 2. [sent-270, score-0.532]

81 Otherwise, agents take another random sample from S and repeat random walk, and so on. [sent-273, score-0.572]

82 Lov´ sz and Winkler proved that the algorithm makes an exact sampling of the Markov a chain and that its expected running time is O(¯ 3 log N ), where h is the maximum expected h ¯ time to transfer from one convention state to another. [sent-274, score-0.558]

83 In our setting, we know that the probability to transit from one convention state to another is polynomial in ε (probability to make mistakes in convention states). [sent-275, score-0.932]

84 In addition, recall that ¯ ε our Markov chain is constructed on the convention states instead of all states. [sent-277, score-0.471]

85 In our setting, individual agents generate random numbers independently. [sent-280, score-0.552]

86 Without knowing each others’ random numbers, agents cannot commit to a convention together. [sent-281, score-1.002]

87 If one of our learning agents commits to the final action before the other, the other may never commit because it is unable to complete the random walk. [sent-282, score-0.719]

88 It is nontrivial to coordinate a joint commitment time between the agents because the agents cannot communicate (except via actions). [sent-283, score-1.125]

89 We solve this problem by making the agents use the same random numbers (without requiring communication). [sent-284, score-0.552]

90 In our learning setting, the agents share the same observations of previous plays. [sent-293, score-0.555]

91 Our learning agents have the same random hash function γX . [sent-295, score-0.659]

92 Whenever an agent should make a call to a random number generator, it instead inputs to γX the m most recent joint actions and the total number of steps played so far, and uses the output of γX as the random number. [sent-296, score-0.487]

93 4 This way the agents see the same uniform random numbers, and because the agents use the same algorithms, they will reach commitment to the final action at the same step. [sent-297, score-1.251]

94 , 3N do h = γS (ht , t) (Use random hash function γS to choose a convention state h uniformly from S. [sent-311, score-0.568]

95 Denote the most recent convention state by Ah (h ) (d) U = U ∪ {h } 7. [sent-316, score-0.464]

96 Endfor Theorem 2 In any unknown 2-agent coordination game with imperfect monitoring, for 0 < ε < 1 and some constant C > 0, if m ≥ 4k, using the above algorithm, the agents learn a 2 -PO deterministic policy with probability at least 1 − 2δ − Cε in time poly(N, 1 , 1 , 1 , m, k). [sent-319, score-1.116]

97 δ ε 5 Conclusions and future research In this paper, we studied how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. [sent-320, score-1.026]

98 We focused on 2-agent repeated coordination games of non-identical interest where the agents do not know the game structure up front and receive noisy payoffs. [sent-321, score-1.095]

99 We designed efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting (where the agents only observe their own payoffs and the joint actions). [sent-322, score-1.157]

100 4 Recall that agents have established the same numbering of actions. [sent-384, score-0.532]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('agents', 0.532), ('convention', 0.397), ('game', 0.27), ('conventions', 0.265), ('agent', 0.25), ('payoffs', 0.177), ('marl', 0.156), ('play', 0.153), ('coordination', 0.135), ('equilibria', 0.133), ('gm', 0.132), ('action', 0.123), ('monitoring', 0.121), ('games', 0.12), ('nash', 0.104), ('actions', 0.092), ('ap', 0.092), ('qi', 0.084), ('hash', 0.084), ('ksap', 0.084), ('po', 0.079), ('ai', 0.076), ('preferences', 0.073), ('imperfect', 0.067), ('ui', 0.065), ('payoff', 0.063), ('reinforcement', 0.062), ('ah', 0.057), ('chain', 0.05), ('strict', 0.049), ('absorbing', 0.048), ('polynomial', 0.048), ('lov', 0.047), ('equilibrium', 0.046), ('state', 0.045), ('plays', 0.043), ('opt', 0.042), ('learn', 0.04), ('demand', 0.04), ('joint', 0.037), ('perfect', 0.037), ('rl', 0.036), ('winkler', 0.036), ('multiagent', 0.035), ('persistent', 0.035), ('playing', 0.034), ('setting', 0.034), ('markov', 0.033), ('knowing', 0.032), ('young', 0.032), ('observe', 0.031), ('ef', 0.03), ('sz', 0.028), ('others', 0.027), ('prefer', 0.027), ('ht', 0.027), ('least', 0.026), ('strategy', 0.026), ('policy', 0.026), ('adaptive', 0.025), ('mistakes', 0.025), ('stable', 0.025), ('randomly', 0.024), ('states', 0.024), ('virtual', 0.024), ('advocate', 0.024), ('appended', 0.024), ('bargaining', 0.024), ('brafman', 0.024), ('commitment', 0.024), ('endfor', 0.024), ('negotiate', 0.024), ('pivazyan', 0.024), ('sandholm', 0.024), ('played', 0.024), ('chooses', 0.023), ('learning', 0.023), ('perturbed', 0.023), ('recent', 0.022), ('choose', 0.022), ('steps', 0.022), ('stochastic', 0.021), ('poly', 0.021), ('aperiodic', 0.021), ('commit', 0.021), ('unperturbed', 0.021), ('xiaofeng', 0.021), ('probability', 0.02), ('reach', 0.02), ('random', 0.02), ('aaai', 0.02), ('walk', 0.02), ('receive', 0.019), ('samples', 0.019), ('repeated', 0.019), ('wang', 0.019), ('successor', 0.019), ('irreducible', 0.019), ('goto', 0.019), ('expected', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

2 0.34452319 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

3 0.29821751 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

4 0.25829238 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

Author: David C. Parkes, Satinder P. Singh

Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total longterm value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1

5 0.13052142 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

Author: Kazuyuki Samejima, Kenji Doya, Yasumasa Ueda, Minoru Kimura

Abstract: When we model a higher order functions, such as learning and memory, we face a difficulty of comparing neural activities with hidden variables that depend on the history of sensory and motor signals and the dynamics of the network. Here, we propose novel method for estimating hidden variables of a learning agent, such as connection weights from sequences of observable variables. Bayesian estimation is a method to estimate the posterior probability of hidden variables from observable data sequence using a dynamic model of hidden and observable variables. In this paper, we apply particle filter for estimating internal parameters and metaparameters of a reinforcement learning model. We verified the effectiveness of the method using both artificial data and real animal behavioral data. 1

6 0.12352327 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

7 0.10259564 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

8 0.10248181 68 nips-2003-Eye Movements for Reward Maximization

9 0.10165124 19 nips-2003-Algorithms for Interdependent Security Games

10 0.1011055 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems

11 0.091043361 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

12 0.0815209 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

13 0.066752538 55 nips-2003-Distributed Optimization in Adaptive Networks

14 0.064535052 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

15 0.060241599 62 nips-2003-Envelope-based Planning in Relational MDPs

16 0.056770429 158 nips-2003-Policy Search by Dynamic Programming

17 0.052054465 78 nips-2003-Gaussian Processes in Reinforcement Learning

18 0.046557538 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias

19 0.044817496 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

20 0.040673252 135 nips-2003-Necessary Intransitive Likelihood-Ratio Classifiers


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.151), (1, 0.335), (2, -0.129), (3, -0.071), (4, -0.002), (5, 0.046), (6, -0.07), (7, 0.069), (8, -0.277), (9, -0.187), (10, 0.083), (11, 0.06), (12, -0.144), (13, -0.131), (14, -0.028), (15, -0.076), (16, 0.05), (17, 0.009), (18, 0.006), (19, -0.009), (20, 0.02), (21, -0.062), (22, -0.154), (23, 0.058), (24, 0.127), (25, 0.062), (26, 0.038), (27, 0.099), (28, 0.079), (29, -0.019), (30, 0.066), (31, -0.035), (32, -0.065), (33, 0.047), (34, 0.054), (35, 0.046), (36, -0.012), (37, -0.086), (38, 0.045), (39, -0.039), (40, -0.03), (41, 0.08), (42, 0.028), (43, -0.041), (44, -0.094), (45, 0.051), (46, 0.094), (47, -0.052), (48, 0.033), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97399986 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

2 0.80764192 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

Author: David C. Parkes, Satinder P. Singh

Abstract: Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total longterm value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium. 1

3 0.75123143 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

Author: Gerald Tesauro

Abstract: Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented. 1

4 0.67221266 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efficient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman filtering to allow an agent to construct a good training signal and learn an effective policy. 1

5 0.55552149 165 nips-2003-Reasoning about Time and Knowledge in Neural Symbolic Learning Systems

Author: Artur Garcez, Luis C. Lamb

Abstract: We show that temporal logic and combinations of temporal logics and modal logics of knowledge can be effectively represented in artificial neural networks. We present a Translation Algorithm from temporal rules to neural networks, and show that the networks compute a fixed-point semantics of the rules. We also apply the translation to the muddy children puzzle, which has been used as a testbed for distributed multi-agent systems. We provide a complete solution to the puzzle with the use of simple neural networks, capable of reasoning about time and of knowledge acquisition through inductive learning. 1

6 0.42227614 19 nips-2003-Algorithms for Interdependent Security Games

7 0.41900802 68 nips-2003-Eye Movements for Reward Maximization

8 0.41730347 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

9 0.29441357 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions

10 0.29266784 36 nips-2003-Auction Mechanism Design for Multi-Robot Coordination

11 0.28673261 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter

12 0.26921278 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

13 0.25798091 62 nips-2003-Envelope-based Planning in Relational MDPs

14 0.21276197 52 nips-2003-Different Cortico-Basal Ganglia Loops Specialize in Reward Prediction at Different Time Scales

15 0.18821537 41 nips-2003-Boosting versus Covering

16 0.18478428 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

17 0.18145962 55 nips-2003-Distributed Optimization in Adaptive Networks

18 0.18078957 78 nips-2003-Gaussian Processes in Reinforcement Learning

19 0.17003015 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices

20 0.16752869 102 nips-2003-Large Scale Online Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.025), (11, 0.02), (29, 0.011), (30, 0.013), (35, 0.041), (53, 0.051), (69, 0.017), (71, 0.027), (76, 0.029), (85, 0.056), (91, 0.59), (99, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98999083 154 nips-2003-Perception of the Structure of the Physical World Using Unknown Multimodal Sensors and Effectors

Author: D. Philipona, J.k. O'regan, J.-p. Nadal, Olivier Coenen

Abstract: Is there a way for an algorithm linked to an unknown body to infer by itself information about this body and the world it is in? Taking the case of space for example, is there a way for this algorithm to realize that its body is in a three dimensional world? Is it possible for this algorithm to discover how to move in a straight line? And more basically: do these questions make any sense at all given that the algorithm only has access to the very high-dimensional data consisting of its sensory inputs and motor outputs? We demonstrate in this article how these questions can be given a positive answer. We show that it is possible to make an algorithm that, by analyzing the law that links its motor outputs to its sensory inputs, discovers information about the structure of the world regardless of the devices constituting the body it is linked to. We present results from simulations demonstrating a way to issue motor orders resulting in “fundamental” movements of the body as regards the structure of the physical world. 1

same-paper 2 0.98253393 105 nips-2003-Learning Near-Pareto-Optimal Conventions in Polynomial Time

Author: Xiaofeng Wang, Tuomas Sandholm

Abstract: We study how to learn to play a Pareto-optimal strict Nash equilibrium when there exist multiple equilibria and agents may have different preferences among the equilibria. We focus on repeated coordination games of non-identical interest where agents do not know the game structure up front and receive noisy payoffs. We design efficient near-optimal algorithms for both the perfect monitoring and the imperfect monitoring setting(where the agents only observe their own payoffs and the joint actions). 1

3 0.97897494 110 nips-2003-Learning a World Model and Planning with a Self-Organizing, Dynamic Neural System

Author: Marc Toussaint

Abstract: We present a connectionist architecture that can learn a model of the relations between perceptions and actions and use this model for behavior planning. State representations are learned with a growing selforganizing layer which is directly coupled to a perception and a motor layer. Knowledge about possible state transitions is encoded in the lateral connectivity. Motor signals modulate this lateral connectivity and a dynamic field on the layer organizes a planning process. All mechanisms are local and adaptation is based on Hebbian ideas. The model is continuous in the action, perception, and time domain.

4 0.97150332 87 nips-2003-Identifying Structure across Pre-partitioned Data

Author: Zvika Marx, Ido Dagan, Eli Shamir

Abstract: We propose an information-theoretic clustering approach that incorporates a pre-known partition of the data, aiming to identify common clusters that cut across the given partition. In the standard clustering setting the formation of clusters is guided by a single source of feature information. The newly utilized pre-partition factor introduces an additional bias that counterbalances the impact of the features whenever they become correlated with this known partition. The resulting algorithmic framework was applied successfully to synthetic data, as well as to identifying text-based cross-religion correspondences. 1 In t ro d u c t i o n The standard task of feature-based data clustering deals with a single set of elements that are characterized by a unified set of features. The goal of the clustering task is to identify implicit constructs, or themes, within the clustered set, grouping together elements that are characterized similarly by the features. In recent years there has been growing interest in more complex clustering settings, in which additional information is incorporated [1], [2]. Several such extensions ([3]-[5]) are based on the information bottleneck (IB) framework [6], which facilitates coherent information-theoretic representation of different information types. In a recent line of research we have investigated the cross-dataset clustering task [7], [8]. In this setting, some inherent a-priori partition of the clustered data to distinct subsets is given. The clustering goal it to identify corresponding (analogous) structures that cut across the different subsets, while ignoring internal structures that characterize individual subsets. To accomplish this task, those features that commonly characterize elements across the different subsets guide the clustering process, while within-subset regularities are neutralized. In [7], we presented a distance-based hard clustering algorithm for the coupledclustering problem, in which the clustered data is pre-partitioned to two subsets. In [8], our setting, generalized to pre-partitions of any number of subsets, was addressed by a heuristic extension of the probabilistic IB algorithm, yielding improved empirical results. Specifically, the algorithm in [8] was based on a modification of the IB stable-point equation, which amplified the impact of features characterizing a formed cluster across all, or most, subsets. This paper describes an information-theoretic framework that motivates and extends the algorithm proposed in [8]. The given pre-partitioning is represented via a probability distribution variable, which may represent “soft” pre-partitioning of the data, versus the strictly disjoint subsets assumed in the earlier cross-dataset framework. Further, we present a new functional that captures the cross-partition motivation. From the new functional, we derive a stable-point equation underlying our algorithmic framework in conjunction with the corresponding IB equation. Our algorithm was tested empirically on synthetic data and on a real-world textbased task that aimed to identify corresponding themes across distinct religions. We have cross-clustered five sets of keywords that were extracted from topical corpora of texts about Buddhism, Christianity, Hinduism, Islam and Judaism. In distinction from standard clustering results, our algorithm reveals themes that are common to all religions, such as sacred writings, festivals, narratives and myths and theological principles, and avoids topical clusters that correspond to individual religions (for example, ‘Christmas’ and ‘Easter’ are clustered together with ‘Ramadan’ rather than with ‘Church’). Finally, we have paid specific attention to the framework of clustering with side information [4]. While this approach was presented for a somewhat different mindset, it might be used directly to address clustering across pre-partitioned data. We compare the technical details of the two approaches and demonstrate empirically that clustering with side information does not seem appropriate for the kind of cross-partition tasks that we explored. 2 Th e In fo rmat i o n B ot t len eck M et h od Probabilistic (“soft”) data clustering outputs, for each element x of the set being clustered and each cluster c, an assignment probability p(c|x). The IB method [6] interprets probabilistic clustering as lossy data compression. The given data is represented by a random variable X ranging over the clustered elements. X is compressed through another random variable C, ranging over the clusters. Every element x is characterized by conditional probability distribution p(Y|x), where Y is a third random variable taking the members y of a given set of features as values. The IB method formalizes the clustering task as minimizing the IB functional: L(IB) = I(C; X) − β I(C; Y) . (1) As known from information theory (Ch. 13 of [9]), minimizing the mutual information I(C; X) optimizes distorted compression rate. A complementary bias to maximize I(C; Y) is interpreted in [6] as articulating the level of relevance of Y to the obtained clustering, inferred from the level by which C can predict Y. β is a free parameter counterbalancing the two biases. It is shown in [6] that p(c|x) values that minimize L(IB) satisfy the following equation: p(c|x) = 1 p (c )e −β DKL [ p ( Y |x )|| p (Y |c ) ] , z( β , x) (2) where DKL stands for the Kullback-Leibler (KL) divergence, or relative entropy, between two distributions and z(β ,x) is a normalization function over C. Eq. (2) implies that, optimally, x is assigned to c in proportion to their KL distance in a feature distribution space, where the distribution p(Y|c) takes the role of a Start at time t = 0 and iterate the following update-steps, till convergence: IB1: initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ IB2: pt (c) = IB3: pt (y|c) = pt −1 (c ) e ∑ x (t = 0) (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x) p ( y | x ) p ( x) p t (c ) x Figure 1: The Information Bottleneck iterative algorithm (with fixed β and |C|). representative, or centroid, of c. The feature variable Y is hence utilized as the (exclusive) means to guide clustering, beyond the random nature of compression. Figure 1 presents the IB iterative algorithm for a fixed value of β . The IB1 update step follows Eq. (2). The other two steps, which are derived from the IB functional as well, estimate the p(c) and p(y|c) values required for the next iteration. The algorithm converges to a local minimum of the IB functional. The IB setting, particularly the derivation of steps IB1 and IB3 of the algorithm, assumes that Y and C are independent given X, that is: I(C; Y|X) = ∑x p(x) I(C|x; Y|x) = 0. The balancing parameter β affects the number of distinct clusters being formed in a manner that resembles (inverse) temperature in physical systems. The higher β is (i.e., the stronger the bias to construct C that predicts Y well), more distinct clusters are required for encoding the data. For each |C| = 2, 3, …, there is a minimal β value, enabling the formation of |C| distinct clusters. Setting β to be smaller than this critical value corresponding to the current |C| would result in two or more clusters that are identical to one another. Based on this, the iterative algorithm is applied repeatedly within a gradual cooling-like (deterministic annealing) scheme: starting with random initialization of the p0 (c|x)'s, generate two clusters with the critical β value, found empirically, for |C| = 2. Then, use a perturbation on the obtained two-cluster configuration to initialize the p0(c|x)'s for a larger set of clusters and execute additional runs of the algorithm to identify the critical β value for the larger |C|. And so on: each output configuration is used as a basis for a more granular one. The final outcome is a “soft hierarchy” of probabilistic clusters. 3 Cro ss- p a rt i t i o n Clu st eri n g Cross-partition (CP) clustering introduces a factor – a pre-given partition of the clustered data – additional to what considered in a standard clustering setting. For representing this factor we introduce the pre-partitioning variable W, ranging over all parts w of the pre-given partition. Every data element x is associated with W through a given probability distribution p(W|x). Our goal is to cluster the data, so that the clusters C would not be correlated with W. We notice that Y, which is intended to direct the formation of clusters, might be a-priori correlated with W, so the formed clusters might end up being correlated with W as well. Our method aims at eliminating this aspect of Y. 3.1 I n f or ma t i o n D e f oc us i n g As noted, some of the information conveyed by Y characterizes structures correlated with W, while the other part of the information characterizes the target cross-W structures. We are interested in detecting the latter while filtering out the former. However, there is no direct a-priori separation between the two parts of the Ymediated information. Our strategy in tackling this difficulty is: we follow in general Y's directions, as the IB method does, while avoiding Y's impact whenever it entails undesired inter-dependencies of C and W. Our strategy implies conflicting biases with regard to the mutual information I(C,Y): it should be maximized in order to form meaningful clusters, but be minimized as well in the specific context where Y entails C–W dependencies. Accordingly, we propose a computational procedure directed by two distinct cost-terms in tandem. The first one is the IB functional (Eq. 1), introducing the bias to maximize I(C,Y). With this bias alone, Y might dictate (or “explain”, in retrospect) substantial C–W dependencies, implying a low I(C;W|Y) value. 1 Hence, the guideline of preventing Y from accounting for C–W dependencies is realized through an opposing bias of maximizing I(C;W|Y) = ∑y p(y) I(C|y; W|y). The second cost term – the Information Defocusing (ID) functional – consequently counterbalances minimization of I(C,Y) against the new bias: L(ID) = I(C; Y) − η I(C;W|Y) , (3) where η is a free parameter articulating the tradeoff between the biases. The ID functional captures our goal of reducing the impact of Y selectively: “defocusing” a specific aspect of the information Y conveys: the information correlated with W. In a like manner to the stable-point equation of the IB functional (Eq. 2), we derive the following stable-point equation for the ID functional: η p ( w) 1 p ( c )∏ w p ( y | c, w) η +1 , p(c|y) = z (η , y ) (4) where z(η,y) is a normalization function over C. The derivation relies on an additional assumption, I(C;W) = 0, imposing the intended independence between C and W (the detailed derivation will be described elsewhere). The intuitive interpretation of Eq. (4) is as follows: a feature y is to be associated with a cluster c in proportion to a weighted, though flattened, geometric mean of the “W-projected centroids” p(y|c,w), priored by p(c). 2 This scheme overweighs y's that contribute to c evenly across W. Thus, clusters satisfying Eq. (4) are situated around centroids biased towards evenly contributing features. The higher η is, heavier emphasis is put on suppressing disagreements between the w's. For η → ∞ a plain weighted geometric-mean scheme is obtained. The inclusion of a step derived from Eq. (4) in our algorithm (see below) facilitates convergence on a configuration with centroids dominated by features that are evenly distributed across W. 3.2 T h e Cr os s - p a r t i t i on C l us t e r i n g A l g or i t h m Our proposed cross partition (CP) clustering algorithm (Fig. 2) seeks a clustering configuration that optimizes simultaneously both the IB and ID functionals, 1 Notice that “Z explaining well the dependencies between A and B” is equivalent with “A and B sharing little information in common given Z”, i.e. low I(A;B|Z) . Complete conditional independence is exemplified in the IB framework, assuming I(C;Y|X) = 0. 2 Eq. (4) resembles our suggestion in [8] to compute a geometric average over the subsets; in the current paper this scheme is analytically derived from the ID functional. Start at time t = 0 and iterate the following update-steps, till convergence: CP1: Initialize p t (c|x) randomly or arbitrarily −β DKL [ p (Y | x )|| pt −1 (Y |c ) ] pt (c|x) ∝ CP2: pt (c) = CP3: p*t (y|c,w) = CP4: (t = 0) p t −1 (c ) e ∑ x (t > 0) p t (c | x ) p ( x ) 1 ∑ pt ( c | x ) p ( y | x ) p ( w | x ) p ( x ) p t ( c ) p ( w) x Initialize p*t (c) randomly or arbitrarily (t = 0) p*t (c) (t > 0) = ∑ y p *t −1 (c | y ) p ( y ) η CP5: p*t (c|y) ∝ p *t (c)∏w p *t ( y | c, w) η +1 CP6: pt (y|c) = p ( w) p *t (c | y ) p ( y ) p *t (c ) Figure 2: The cross-partition clustering iterative algorithm (with fixed β, η, and |C|). thus obtaining clusters that cut across the pre-given partition W. To this end, the algorithm interleaves an iterative computation of the stable-point equations, and the additional estimated parameters, for both functionals. Steps CP1, CP2 and CP6 correspond to the computations related to the IB functional, while steps CP3, CP4 and CP5, which compute a separate set of parameters (denoted by an asterisk), correspond to the ID functional. Figure 3 summarizes the roles of the two functionals in the dynamics of the CP algorithm. The two components of the iterative cycle are tied together in steps CP3 and CP6, in which parameters from one set are used as input to compute a parameter of other set. The derivation of step CP3 relies on an additional assumption, namely that C, Y and W are jointly independent given X. This assumption, which extends to W the underlying assumption of the IB setting that C and Y are independent given X, still entails the IB stable point equation. At convergence, the stable point equations for both the IB and ID functionals are satisfied, each by its own set of parameters (in steps CP1 and CP5). The deterministic annealing scheme, which gradually increases β over repeated runs (see Sec. 2), is applied for the CP algorithm as well with η held fixed. For a given target number of clusters |C|, the algorithm empirically converges with a wide range of η values 3. I(C;X) ↓ IB β↑ I(C;Y) ↓ ID η↑ I(C; W|Y) I(C; Y; W|X) = 0 ← assumptions → I(C;W) = 0 Figure 3: The interplay of the IB and the ID functionals in the CP algorithm. High η values tend to dictate centroids with features that are unevenly distributed across W, resulting in shrinkage of some of the clusters. Further analysis will be provided in future work. 3 4 Exp e ri men t a l Resu lt s Our synthetic setting consisted of 75 virtual elements, evenly pre-partitioned into three 25-element parts denoted X 1 , X2 and X3 (in our formalism, for each clustered element x, p(w|x) = 1 holds for either w = 1, 2, or 3). On top of this pre-partition, we partitioned the data twice, getting two (exhaustive) clustering configurations: 1. Target cross-W clustering: five clusters, each with representatives from all X w's; 2. Masking within-w clustering: six clusters, each consisting of roughly half the elements of either X 1, X 2 or X3 with no representatives from the other X w's. Each cluster, of both configurations, was characterized by a designated subset of features. Masking clusters were designed to be more salient than target clusters: they had more designated features (60 vs. 48 per cluster, i.e., 360 vs. 240 in total) and their elements shared higher feature-element (virtual) co-occurrence counts with those designated features (900 vs. 450 per element-feature pair). Noise (random positive integer < 200) was added to all counts associating elements with their designated features (for both within-w and cross-W clusters), as well as to roughly quarter of the zero counts associating elements with the rest of the features. The plain IB method consistently produced configurations strongly correlated with the masking clustering, while the CP algorithm revealed the target configuration. We got (see Table 1A) almost perfect results in configurations of nearly equal-sized cross-W clusters, and somewhat less perfect reconstruction in configurations of diverging sizes (6, 9, 15, 21 and 24). Performance level was measured relatively to optimal target-output cluster match by the proportion of elements correctly assigned, where assignment of an element x follows its highest p(c|x). The results indicated were averaged over 200 runs. They were obtained for the optimal η, which was found to be higher in the diverging-sizes task. In the text-based task, the clustered elements – keywords – were automatically extracted from five distinct corpora addressing five religions: introductory web pages, online magazines, encyclopedic entries etc., all downloaded from the Internet. The clustered keyword set X was consequently pre-partitioned to disjoint subsets {X w} w∈W, one for each religion4 (|X w| ≈ 200 for each w). We conducted experiments simultaneously involving religion pairs as well as all five religions. We took the features Y to be a set of words that commonly occur within all five corpora (|Y| ≈ 7000). x–y co-occurrences were recorded within ±5-word sliding window truncated by sentence boundaries. η was fixed to a value (1.0) enabling the formation of 20 clusters in all settings. The obtained clusters revealed interesting cross religion themes (see Sec. 1). For instance, the cluster (one of nine) capturing the theme of sacred festivals: the three highest p(c/x) members within each religion were Full-moon, Ceremony, Celebration (Buddhism); Easter, Sunday, Christmas Table 1: Average correct assignment proportion scores for the synthetic task (A) and Jaccard-coefficient scores for the religion keyword classification task (B). A. Synthetic Data IB CP B. Religion Data IB Coupled Clustering [7] CP (cross-expert agreement on religion pairs .462±.232) equal-size clusters .305 .985 non-equal clusters .292 .827 4 religion pairs all five (one case) .200±.100 .220±.138 .407±.144 .104 ––––––– .167 A keyword x that appeared in the corpora of different religions was considered as a distinct element for each religion, so the Xw were kept disjointed. (Chrsitianity); Puja, Ceremony, Festival (Hinduism); Id-al-Fitr, Friday, Ramadan, (Islam); and Sukkoth, Shavuot, Rosh-Hodesh (Judaism). The closest cluster produced by the plain IB method was poorer by far, including Islamic Ramadan, and Id and Jewish Passover, Rosh-Hashanah and Sabbath (which our method ranked high too), but no single related term from the other religions. Our external evaluation standards were cross-religion keyword classes constructed manually by experts of comparative religion studies. One such expert classification involved all five religions, and eight classifications addressed religions in pairs. Each of the eight religion-pair classifications was contributed by two independent experts using the same keywords, so we could also assess the agreement between experts. As an overlap measure we employed the Jaccard coefficient: the number of element pairs co-assigned together by both one of the evaluated clusters and one of the expert classes, divided by the number of pairs co-assigned by either our clusters or the expert (or both). We did not assume the number of expert classes is known in advance (as done in the synthetic experiments), so the results were averaged over all configurations of 2–16 cluster hierarchy, for each experiment. The results shown in Table 1B – clear improvement relatively to plain IB and the distance-based coupled clustering [7] – are, however, persistent when the number of clusters is taken to be equal to the number of classes, or if only the best score in hierarchy is considered. The level of cross-expert agreement indicates that our results are reasonably close to the scores expected in such subjective task. 5 C o mp a ri so n t o R e la t ed W o r k The information bottleneck framework served as the basis for several approaches that represent additional information in their clustering setting. The multivariate information bottleneck (MIB) adapts the IB framework for networks of multiple variables [3]. However, all variables in such networks are either compressed (like X), or predicted (like Y). The incorporation of an empirical variable to be masked or defocused in the sense of our W is not possible. Including such variables in the MIB framework might be explored in future work. Particularly relevant to our work is the IB-based method for extracting relevant constructs with side information [4]. This approach addresses settings in which two different types of features are distinguished explicitly: relevant versus irrelevant ones, denoted by Y+ and Y−. Both types of features are incorporated within a single functional to be minimized: L(IB-side-info) = I(C; X) − β ( I(C; Y +) − γ I(C; Y−) ), which directly drives clustering to de-correlate C and Y−. Formally, our setting can be mapped to the side information setting by regarding the pre-partition W simply as the additional set of irrelevant features, giving symmetric (and opposite) roles to W and Y. However, it seems that this view does not address properly the desired cross-partition setting. In our setting, it is assumed that clustering should be guided in general by Y, while W should only neutralize particular information within Y that would otherwise yield the undesired correlation between C and W (as described in Section 3.1). For that reason, the defocusing functional tie the three variables together by conditioning the de-correlation of C and W on Y, while its underlying assumption ensures the global de-correlation. Indeed, our method was found empirically superior on the cross-dataset task. The side-information IB method (the iterative algorithm with best scoring γ) achieves correct assignment proportion of 0.52 in both synthetic tasks, where our method scored 0.99 and 0.83 (see Table 1A) and, in the religion-pair keyword classification task, Jaccard coefficient improved by 20% relatively to plain IB (compared to our 100% improvement, see Table 1B). 6 C o n c lu si o n s This paper addressed the problem of clustering a pre-partitioned dataset, aiming to detect new internal structures that are not correlated with the pre-given partition but rather cut across its components. The proposed framework extends the cross-dataset clustering algorithm [8], providing better formal grounding and representing any pre-given (soft) partition of the dataset. Supported by empirical evidence, we suggest that our framework is better suited for the cross-partition task than applying the side-information framework [4], which was originally developed to address a somewhat different setting. We also demonstrate substantial empirical advantage over the distance-based coupled-clustering algorithm [7]. As an applied real-world goal, the algorithm successfully detects cross-religion commonalities. This goal exemplifies the more general notion of detecting analogies across different systems, which is a somewhat vague and non-consensual task and therefore especially challenging for a computational framework. Our approach can be viewed as an initial step towards principled identification of “hidden” commonalities between substantially different real world systems, while suppressing the vast majority of attributes that are irrelevant for the analogy. Further research may study the role of defocusing in supervised learning, where some pre-given partitions might mask the role of underlying discriminative features. Additionally, it would be interesting to explore relationships to other disciplines, e.g., network information theory ([9], Ch. 14) which provided motivation for the side-information approach. Finally, both frameworks (ours and side-information) suggest the importance of dealing wisely with information that should not dictate the clustering output directly. A c k n ow l e d g me n t s We thank Yuval Krymolowski for helpful discussions and Tiina Mahlamäki, Eitan Reich and William Shepard, for contributing the religion keyword classifications. References [1] Hofmann, T. (2001) Unsupervised learning by probabilistic latent semantic analysis. Journal of Machine Learning Research, 41(1):177-196. [2] Wagstaff K., Cardie C., Rogers S. and Schroedl S., 2001. Constrained K-Means clustering with background knowledge. The 18th International Conference on Machine Learning (ICML-2001), pp 577-584. [3] Friedman N., Mosenzon O., Slonim N. & Tishby N. (2002) Multivariate information bottleneck. The 17th conference on Uncertainty in Artificial Intelligence (UAI-17), pp. 152161. [4] Chechik G. & Tishby N. (2002) Extracting relevant structures with side information. Advances in Neural Processing Information Systems 15 (NIPS'02). [5] Globerson, A., Chechik G. & Tishby N. (2003) Sufficient dimensionality reduction. Journal of Machine Learning Research, 3:1307-1331. [6] Tishby, N., Pereira, F. C. & Bialek, W. (1999) The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, pp. 368-379. [7] Marx, Z., Dagan, I., Buhmann, J. M. & Shamir E. (2002) Coupled clustering: A method for detecting structural correspondence. Journal of Machine Learning Research, 3:747-780. [8] Dagan, I., Marx, Z. & Shamir E (2002) Cross-dataset clustering: Revealing corresponding themes across multiple corpora. Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 15-21. [9] Cover T. M. & Thomas J. A. (1991) Elements of Information Theory. Sons, Inc., New York, New York. John Wiley &

5 0.96423399 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

Author: G.C. Littlewort, M.S. Bartlett, I.R. Fasel, J. Chenu, T. Kanda, H. Ishiguro, J.R. Movellan

Abstract: Computer animated agents and robots bring a social dimension to human computer interaction and force us to think in new ways about how computers could be used in daily life. Face to face communication is a real-time process operating at a time scale of less than a second. In this paper we present progress on a perceptual primitive to automatically detect frontal faces in the video stream and code them with respect to 7 dimensions in real time: neutral, anger, disgust, fear, joy, sadness, surprise. The face finder employs a cascade of feature detectors trained with boosting techniques [13, 2]. The expression recognizer employs a novel combination of Adaboost and SVM’s. The generalization performance to new subjects for a 7-way forced choice was 93.3% and 97% correct on two publicly available datasets. The outputs of the classifier change smoothly as a function of time, providing a potentially valuable representation to code facial expression dynamics in a fully automatic and unobtrusive manner. The system was deployed and evaluated for measuring spontaneous facial expressions in the field in an application for automatic assessment of human-robot interaction.

6 0.93804854 46 nips-2003-Clustering with the Connectivity Kernel

7 0.80153841 26 nips-2003-An MDP-Based Approach to Online Mechanism Design

8 0.77705246 65 nips-2003-Extending Q-Learning to General Adaptive Multi-Agent Systems

9 0.750175 4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

10 0.73533577 68 nips-2003-Eye Movements for Reward Maximization

11 0.73009056 84 nips-2003-How to Combine Expert (and Novice) Advice when Actions Impact the Environment?

12 0.71060652 14 nips-2003-A Nonlinear Predictive State Representation

13 0.71042258 43 nips-2003-Bounded Invariance and the Formation of Place Fields

14 0.69774008 55 nips-2003-Distributed Optimization in Adaptive Networks

15 0.6934275 73 nips-2003-Feature Selection in Clustering Problems

16 0.68819243 19 nips-2003-Algorithms for Interdependent Security Games

17 0.68114114 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

18 0.67998362 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

19 0.67568046 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction

20 0.67264599 107 nips-2003-Learning Spectral Clustering