jmlr jmlr2007 jmlr2007-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
Reference: text
sentIndex sentText sentNum sentScore
1 Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. [sent-14, score-0.605]
2 Few current machine learning methods are able to transfer knowledge between pairs of tasks, and none are able to transfer between a broad range of tasks to the extent that humans are. [sent-21, score-0.591]
3 This article presents a new method for transfer learning in the reinforcement learning (RL) framework using temporal difference (TD) learning methods (Sutton and Barto, 1998), whereby an agent can learn faster in a target task after training on a different, typically less complex, source task. [sent-22, score-0.785]
4 In this article we introduce transfer via inter-task mapping (TVITM), whereby a TD learner trained with action-value function RL on one task can learn faster when training on another task with related, but different, state and action spaces. [sent-31, score-0.756]
5 It is this transfer functional which defines transfer in the TVITM framework. [sent-35, score-0.522]
6 In this article we consider the general case where the state features in the source and target tasks are different (Ssource ≠ Starget), and/or the actions in the source and target tasks are different (Asource ≠ Atarget). [sent-81, score-0.746]
7 To use the learned action-value function from the source task, Q(source, final), as the initial action-value function for a TD learner in a target task, we must transform the action-value function so that it can be directly applied to the new state and action space. [sent-82, score-0.571]
8 Similarly, χA maps each action in the target task to the most similar action in the source task: χA(ai,target) = aj,source. [sent-93, score-0.486]
9 χX and χA , mappings from the target task to the source task, are used to construct ρ, a transfer functional from the source task to the target task (see Figure 2). [sent-94, score-0.94]
10 Figure 2: χX and χA are mappings from a target to a source task; ρ maps an action-value function from a source to a target task. [sent-97, score-0.454]
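To make the inter-task mappings concrete, the following sketch shows one way χA and χX could be represented for transfer from 4 vs. 3 into 3 vs. 2 Keepaway. The dictionary entries, variable names, and the map_state_action helper are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the inter-task mappings used by TVITM.
# Each novel target-task action/state variable is mapped to the most
# similar source-task counterpart, as described in the surrounding text.

# chi_A: target (4 vs. 3) actions -> source (3 vs. 2) actions
chi_A = {
    "hold": "hold",
    "pass to closest teammate": "pass to closest teammate",
    "pass to second closest teammate": "pass to second closest teammate",
    # The novel target action maps to the most similar source action:
    "pass to third closest teammate": "pass to second closest teammate",
}

# chi_X: target state variables -> source state variables (entries assumed)
chi_X = {
    "dist(K1, K4)": "dist(K1, K3)",  # novel teammate distance -> nearest analogue
    "dist(K1, T3)": "dist(K1, T2)",  # novel taker distance -> nearest analogue
}

def map_state_action(state, action):
    """Map a target-task (state, action) pair into the source task's spaces."""
    source_state = {chi_X.get(var, var): val for var, val in state.items()}
    return source_state, chi_A[action]
```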
11 Rather than transferring higher-level information about a source task into a target task, we focus on the information contained in individual weights within function approximators. [sent-107, score-0.593]
12 Learned source task knowledge may be able to improve initial target task performance relative to learning the target task without transfer. [sent-128, score-0.543]
13 Transfer may allow an agent to accumulate more reward in the target task when compared to the non-transfer case; better initial performance and faster learning would help agents achieve more on-line reward. [sent-133, score-0.501]
14 Assuming that the transfer learner is able to learn faster or reach a higher performance, the area under the transfer curve will be greater than the area under the non-transfer curve. [sent-140, score-0.64]
15 The ratio r = (area under curve with transfer - area under curve without transfer) / (area under curve without transfer) gives us a metric for how much transfer improves learning. [sent-141, score-1.044]
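A minimal sketch of this ratio, assuming two performance curves sampled at the same training times and a trapezoidal estimate of area (one reasonable choice, not the paper's prescribed one):

```python
import numpy as np

def transfer_ratio(times, perf_with_transfer, perf_without_transfer):
    """r = (area with transfer - area without transfer) / (area without transfer)."""
    area_with = np.trapz(perf_with_transfer, times)
    area_without = np.trapz(perf_without_transfer, times)
    return (area_with - area_without) / area_without

# Illustrative numbers only (training hours vs. average episode duration):
times = np.array([0.0, 10.0, 20.0, 30.0])
r = transfer_ratio(times,
                   np.array([6.0, 9.0, 11.0, 12.0]),
                   np.array([5.0, 7.0, 9.0, 11.0]))
print(f"transfer ratio r = {r:.2f}")
```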
16 In the tasks we consider, the learners that use transfer and the learners that learn without transfer do not always plateau to the same performance, nor is there a defined task length. [sent-144, score-0.789]
17 This criterion is relevant when the source task is created for the sole purpose of speeding up learning with transfer and Q(source, final) is not reused. [sent-151, score-0.461]
18 We also introduce the Knight Joust, a task that we will later use as a supplemental source task for transfer into Keepaway. [sent-154, score-0.536]
19 In particular, the simulator places the players at their initial positions at the start of each episode and ends an episode when the ball leaves the play region or is taken away. [sent-174, score-0.412]
20 Whenever the takers take possession or the ball leaves the region, the episode ends and the players are reset for another episode (with the keepers being given possession of the ball again). [sent-177, score-0.793]
21 The player receives a reward of +20 every time it takes the forward action, 0 if either knight jump action is taken, and an additional +20 upon reaching the goal line. [sent-249, score-0.444]
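Taking the reward description above at face value, a minimal Knight Joust reward function might look like the following; the action names and the goal test are assumptions for illustration.

```python
def knight_joust_reward(action, reached_goal_line):
    """Reward as described above: +20 per Forward action, 0 for either
    knight-jump action, plus an additional +20 on reaching the goal line."""
    reward = 20 if action == "Forward" else 0  # JumpW / JumpE earn 0
    if reached_goal_line:
        reward += 20
    return reward
```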
22 Learning Keepaway: TVITM aims to improve learning in the target task based on prior learning in the source, and therefore a prerequisite is that both source and target tasks are learnable. [sent-254, score-0.437]
23 The keepers learn in a constrained policy space: they have the freedom to decide which action to take only when in possession of the ball. [sent-273, score-0.466]
24 Therefore the number of actions from which the keeper with the ball may choose is equal to the number of keepers in the task. [sent-275, score-0.53]
25 Keepers not in possession of the ball are required to execute the Receive macro-action in which the player who can reach the ball the fastest goes to the ball and the remaining players follow a handcoded strategy to try to get open for a pass. [sent-276, score-0.521]
26 The keepers learn which action to take when in possession of the ball by using episodic SMDP Sarsa(λ) (Sutton and Barto, 1998). [sent-281, score-0.539]
27 Since we want the keepers to maintain possession of the ball for as long as possible, the reward in the Keepaway task is simply the number of time steps the ball remains in play after a macro-action is initiated. [sent-284, score-0.646]
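The sketch below shows the shape of a Sarsa(λ) update with linear function approximation and the Keepaway reward (time steps the ball stays in play between macro-action choices). It is a generic sketch, not the authors' code; the step size, λ, and the use of an undiscounted return are assumptions.

```python
import numpy as np

def sarsa_lambda_step(theta, traces, phi, a, r, phi_next, a_next,
                      alpha=0.125, gamma=1.0, lam=0.9):
    """One episodic Sarsa(lambda) update with linear function approximation.

    theta[a] and traces[a] are per-action weight and eligibility vectors,
    phi is the feature vector for the current state, and r is the Keepaway
    reward: the number of time steps the ball stayed in play since the
    last macro-action was initiated.
    """
    delta = r + gamma * (theta[a_next] @ phi_next) - (theta[a] @ phi)
    traces[a] = traces[a] + phi              # accumulate trace for taken action
    for b in theta:                          # update all actions, then decay traces
        theta[b] = theta[b] + alpha * delta * traces[b]
        traces[b] = gamma * lam * traces[b]
    return theta, traces
```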
28 As more keepers are added, the keeper with the ball has more passing options but the average pass distance is shorter. [sent-288, score-0.559]
29 Note that as the number of keepers n and the number of takers m increase, the number of state variables also increases so that the more complex state can be fully described. [sent-304, score-0.457]
30 |S| increases (there are more distances to players to account for) and |A| increases as there are more teammates for the keeper with possession of the ball to pass to. [sent-313, score-0.482]
31 In this article we use three distinct function approximators and show that all are able to learn Keepaway, as well as use our transfer methodology (see Figure 6). [sent-316, score-0.424]
32 Three keepers start in the other three corners, one per corner, and the fourth keeper begins an episode at the center of the field. [sent-392, score-0.43]
33 The 4 vs. 3 task is set up like 3 vs. 2 Keepaway, but now A = {hold, pass to closest teammate, pass to second closest teammate, pass to third closest teammate}, and S is made up of 19 state variables due to the added players. [sent-394, score-0.451]
34 Furthermore, adding a keeper in the center of the field changes the start state significantly as now the keeper that starts with the ball has a teammate that is closer to itself, but is also closer to the takers. [sent-401, score-0.575]
35 When a group of four CMAC keepers has learned to hold the ball from the three takers for an average of 11. [sent-408, score-0.428]
36 (After training four keepers using ANN function approximation without transfer in 4 vs. [sent-424, score-0.5]
37 In 5 vs. 4, three keepers are again placed in three corners and the two remaining keepers are placed in the middle of the 25m × 25m field. [sent-433, score-0.436]
38 There are now five actions: {hold, pass to closest teammate, pass to second closest teammate, pass to third closest teammate, pass to fourth closest teammate}, and 25 state variables. [sent-435, score-0.576]
39 TVITM relies on a functional ρ that is able to transfer an action-value function from a source task into a target task with different state and action spaces. [sent-449, score-0.797]
40 We map the novel target action “Pass to third closest keeper” to “Pass to second closest keeper” in the source task. [sent-466, score-0.418]
41 Now that χA and χX are defined, relating the state variables and actions in a target task to the state variables and actions in a source task, we can use them to construct ρs for different internal representations. [sent-475, score-0.594]
42 The functionals will transfer the learned action-value function from the source task into the target task. [sent-476, score-0.59]
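As one concrete reading of such a functional, the sketch below initializes each target-task weight from the source-task weight selected by the inter-task mappings. Indexing weights by a (state variable, action) pair is a simplification; a real CMAC or ANN representation needs a more involved correspondence.

```python
def rho(source_weights, target_vars, target_actions, chi_X, chi_A):
    """Construct initial target-task weights from learned source-task weights.

    source_weights: dict keyed by (source_state_variable, source_action).
    chi_X / chi_A: inter-task mappings from target variables/actions to their
    most similar source counterparts.
    """
    target_weights = {}
    for var in target_vars:
        for act in target_actions:
            mapped_key = (chi_X[var], chi_A[act])
            # Each target weight starts at the value learned for the
            # corresponding (mapped) source weight, defaulting to 0.
            target_weights[(var, act)] = source_weights.get(mapped_key, 0.0)
    return target_weights
```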
43 Relevant points are the center of the field C, keepers K1 -K4 , and takers T1 -T3 , where players are ordered by increasing distance from the ball. [sent-494, score-0.443]
44 Table 3 describes the mapping of state variables and actions from 4 vs. 3 Keepaway to Knight Joust; the action mapping is Hold Ball → Forward, Pass to closest teammate → JumpW, Pass to second closest teammate → JumpE, and Pass to third closest teammate → JumpE. [sent-509, score-0.873]
45 Rather than initializing a function approximator in the target task with values learned in the source task, we instead reuse the entire set of Q-values learned in the source task. [sent-556, score-0.6]
46 A copy of the source task’s function approximator is retained so that it can calculate the source task’s Q-values for any state, action pair: QsourceFA : S × A → R. [sent-557, score-0.425]
47 When computing Q-values for the target task, we first map the target task state and action to the source task’s state and action via the inter-task mappings. [sent-558, score-0.722]
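A minimal sketch of Q-value Reuse as described above: the retained source-task function approximator is queried through the inter-task mappings, and its estimate is combined with a target-task approximator that continues to learn. Summing the two estimates is our reading of the scheme and should be treated as an assumption.

```python
class QValueReuse:
    """Q-value Reuse: keep the source task's function approximator and query
    it on (state, action) pairs mapped from the target task."""

    def __init__(self, q_source_fa, q_target_fa, chi_X, chi_A):
        self.q_source_fa = q_source_fa  # frozen, learned in the source task
        self.q_target_fa = q_target_fa  # continues learning in the target task
        self.chi_X = chi_X
        self.chi_A = chi_A

    def q(self, state, action):
        # Map the target-task state and action into the source task's spaces.
        source_state = {self.chi_X.get(v, v): x for v, x in state.items()}
        source_action = self.chi_A[action]
        # Combine the fixed source estimate with the learned target estimate.
        return (self.q_source_fa(source_state, source_action)
                + self.q_target_fa(state, action))
```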
48 Q-value Reuse may be considered a type of reward shaping (Colombetti and Dorigo, 1993; Mataric, 1994): we are able to directly use the expected rewards from the source task to bias the learner in the target task. [sent-561, score-0.487]
49 Such requirements will grow linearly in the number of transfer steps; while they are not substantial with a single source task, they may become prohibitive when using multiple source tasks or when performing multi-step transfer (as shown later in Section 7). [sent-566, score-0.841]
50 As source task training time increases, the required target task training time decreases. [sent-638, score-0.455]
51 To determine if Keepaway players using CMAC function approximation can benefit from transfer, we compare the time it takes agents to learn the target task after transferring from the source task with the time it takes to learn the target task without transfer. [sent-679, score-1.009]
52 The top row of each table represents learning the task without transfer and thus any column with transfer times lower than the top row shows beneficial transfer. [sent-681, score-0.597]
53 Our second goal of transfer would be met if the total training time in both tasks with transfer were less than the time required to learn the target task without transfer. [sent-682, score-0.984]
54 Again, successful transfer is demonstrated as both the transfer agents’ target task training time and the transfer agent’s total training time are less than the time required to learn the target task without transfer. [sent-773, score-1.275]
55 Learning trials that used a sufficient number of 3 vs. 2 source task episodes show a statistically significant difference from those that learn without transfer (p < 0.05), while the learning trials that used less than 500 source task episodes did not significantly reduce the target task training time. [sent-778, score-0.426] [sent-779, score-0.515]
57 The difference in 4 vs. 3 training times for the ANN players between using TVITM and training without transfer is statistically significant (p < 0. [sent-781, score-0.441]
58 Table 6: Results showing that transfer with the full ρCMAC outperforms using ρCMAC without the final averaging step, using only the averaging step of ρCMAC, and when averaging weights in the source task before transferring the weights. [sent-809, score-0.659]
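One variant compared in Table 6 fills untouched (zero) weights in the transferred target CMAC with an average source weight. A hedged sketch of that step, assuming flat weight vectors and averaging over nonzero source weights:

```python
import numpy as np

def fill_zero_weights_with_source_average(target_weights, source_weights):
    """Replace zero entries of the transferred target CMAC weight vector with
    the average of the nonzero source CMAC weights (assumed interpretation)."""
    nonzero = source_weights[source_weights != 0]
    avg = nonzero.mean() if nonzero.size else 0.0
    filled = target_weights.copy()
    filled[filled == 0] = avg
    return filled
```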
59 We anticipated that the benefit from transfer would be increasingly degraded, relative to using the actual ρCMAC, as fewer training episodes in the source task were used. [sent-815, score-0.596]
60 While our results demonstrate that all three function approximators can successfully transfer knowledge, we focus our supplementary experiments on CMAC function approximation so that our transfer work can be directly comparable to previous work in Keepaway, which also used CMACs (Stone et al. [sent-826, score-0.599]
61 We also conducted a set of 30 trials which modified ρCMAC so that the average weight in the source CMAC is put into all zero-weights in the target CMAC, which is possible when agents in the source task know that their saved weights will be used for TVITM. [sent-854, score-0.597]
62 We first defined ρmodified by modifying χA so that, instead of mapping the novel target task action “Pass to third closest keeper” into the action “Pass to second closest keeper,” we map the novel action into “Hold ball.” [sent-872, score-0.561]
63 In 4 vs. 3, the initial hold times of the keepers do not differ significantly from those of keepers with uninitialized CMACs. [sent-913, score-0.461]
64 For example, in the degenerate case where the source and target task are identical, the initial performance in the target task will be equivalent to the final performance in the source task. [sent-965, score-0.593]
65 The former gives more benefit after more training is completed in the source task, and the latter helps when less knowledge is gained in the source task before transfer. [sent-1006, score-0.421]
66 Players initialized by TVITM from the source task do not initially outperform uninitialized players in the target task, but they are able to learn faster. [sent-1015, score-0.515]
67 In addition to changing the number of players between the source and target tasks, other variations such as the size of the field, wind, and player ability can be modified. [sent-1104, score-0.429]
68 Table 12: Results showing transfer via inter-task mapping benefits CMAC players utilizing ρCMAC with two kinds of actuators. [sent-1144, score-0.445]
69 These results demonstrate that transfer can succeed even when actions in the source and target tasks are qualitatively different. [sent-1145, score-0.618]
70 We are also able to use the same ρCMAC to speed up learning in the target task when both the target and source tasks have inaccurate actuators (rows 5 and 6). [sent-1167, score-0.585]
71 2 keepers that learned using an accurate pass action in the source task. [sent-1174, score-0.56]
72 In rows 7 and 8 of Table 12, even though the players in the source task have different actuators than in the target task, transfer is able to significantly speed up learning compared to not using transfer. [sent-1177, score-0.808]
73 Actuators are changed in the benchmark players by changing the pass action from the default “PassNormal” to “PassFast” which increases the speed of the pass by 50%, reducing accuracy. [sent-1179, score-0.407]
74 This result confirms that the same ρ will allow TVITM to transfer between tasks where not only have S and A changed, but the effects of the actions have also changed qualitatively. [sent-1180, score-0.409]
75 Alternatively, if agents have learned a task and then later want to learn another task but have damaged their actuators since learning the source task, transfer may still increase the speed of learning. [sent-1183, score-0.896]
76 We also perform the inverse experiment where agents in the source task have inaccurate actuators and agents in the target task have normal actuators. [sent-1184, score-0.719]
77 Thus a fielded agent with worn down actuators would be able to successfully transfer its learned action-value function to agents whose actuators were undamaged. [sent-1189, score-0.691]
78 Interestingly, transferring from source keepers that have accurate actuators is more effective than transferring from source keepers that have inaccurate actuators both when the target task has accurate actuators and when it has inaccurate actuators. [sent-1190, score-1.344]
79 In this subsection we demonstrate that transfer works when actuators are inaccurate in both the source and target tasks. [sent-1195, score-0.592]
80 Combined, our results show that TVITM is able to speed up learning in multiple target tasks with different state and action spaces, and even when the agents have somewhat different actuators in the two tasks. [sent-1201, score-0.574]
81 In 5 vs. 4, we say that the task has been learned when the 5 keepers are able to hold the ball for an average of 9. [sent-1218, score-0.416]
82 For instance, the target task actions “Pass to Second Closest Teammate”, “Pass to Third Closest Teammate”, and “Pass to Fourth Closest Teammate” are mapped to the source task action “Pass to Second Closest Teammate.” [sent-1229, score-0.539]
83 Given the available actions, the optimal action for players is for the player closest to the takers to hold the ball until the takers capture it. [sent-1317, score-0.627]
84 We use a variant of ρCMAC to transfer the learned weights, because Knight Joust is learned with a tabular function approximator rather than a CMAC. [sent-1364, score-0.425]
85 Note that this new variant of the transfer functional is not necessitated by the novel target task and that, if we learned Knight Joust with a CMAC, the original ρCMAC would be sufficient for transfer between Knight Joust and 4 vs. 3 Keepaway. [sent-1366, score-0.726]
86 The 4 vs. 3 transfer times (in simulator hours) are statistically different from learning without transfer (determined via Student’s t-tests). [sent-1370, score-0.579]
87 Another question that arises from this work is: if the ultimate goal is to learn the target task in the minimal amount of time, how can one determine the optimal amount of training in the source task automatically based on task characteristics? [sent-1407, score-0.506]
88 The best learned actions in the source task must, for a given state, be mapped to the best action in the target task via the inter-task mappings. [sent-1429, score-0.509]
89 Thus, if all weights from this policy were multiplied by −1, the keepers would continually pass the ball until a taker came within 6m. [sent-1435, score-0.526]
90 In this work, an example of the first condition being met is that a keeper learns to hold the ball in the source task until forced to pass. [sent-1439, score-0.433]
91 If either of these conditions were not true, the transfer functional we employed would have to account for the differences (or suffer from reduced transfer efficacy). [sent-1446, score-0.522]
92 Past research confirms that if two tasks are closely related the learned policy from a source task can be used to provide a good initial policy for a target task. [sent-1469, score-0.535]
93 TVITM differs in intent in that we aim to transfer behaviors from existing, relevant tasks which can have different state and action spaces, rather than creating artificial problems which are easier for the agent to learn. [sent-1486, score-0.575]
94 The technique of autonomous shaping (Konidaris and Barto, 2006) may prove to be useful for transfer learning, particularly in agents that have many sensors. [sent-1518, score-0.433]
95 If a later task has a similar reward structure and actions, the learned reward shaping will help the agent initially have a much higher performance than if it were learning without transfer. [sent-1520, score-0.5]
96 For instance, if a signal device (a beacon) is near the goal state, the agent may learn a shaping reward that gives an internal reward for approaching the beacon even if the environmental reward is zero. [sent-1521, score-0.542]
97 Additionally, while the authors limit themselves to transfer between “reward-linked” tasks, no method is given for determining whether a sequence of tasks is reward-linked, and a learned shaping function will not necessarily be useful in a given target task. [sent-1523, score-0.492]
98 Using such a translation mechanism allows, for instance, transfer between a source task policy search learner and a value function target task learner. [sent-1546, score-0.716]
99 Rather than utilizing abstract knowledge, this transfer method is able to leverage the weights from function approximators specifying action-value functions, a very task-specific form of knowledge. [sent-1559, score-0.411]
100 We also show how transfer efficacy is reduced when the source task and target task are less related, such as when using 3 vs. [sent-1568, score-0.62]
wordName wordTfidf (topN-words)
[('cmac', 0.47), ('keepaway', 0.365), ('transfer', 0.261), ('tvitm', 0.254), ('keepers', 0.218), ('dist', 0.213), ('keeper', 0.155), ('ang', 0.151), ('players', 0.138), ('reward', 0.13), ('source', 0.125), ('agents', 0.119), ('episodes', 0.114), ('joust', 0.111), ('teammate', 0.111), ('knight', 0.104), ('action', 0.101), ('actuators', 0.099), ('takers', 0.087), ('target', 0.084), ('appings', 0.083), ('ransfer', 0.083), ('soccer', 0.083), ('tone', 0.083), ('player', 0.082), ('actions', 0.079), ('ann', 0.079), ('ball', 0.078), ('transferring', 0.078), ('approximators', 0.077), ('state', 0.076), ('task', 0.075), ('approximator', 0.074), ('opponent', 0.071), ('pass', 0.071), ('nter', 0.07), ('tasks', 0.069), ('agent', 0.068), ('sarsa', 0.067), ('reinforcement', 0.065), ('iu', 0.063), ('rbf', 0.059), ('simulator', 0.057), ('episode', 0.057), ('policy', 0.056), ('taylor', 0.056), ('robocup', 0.055), ('taker', 0.055), ('closest', 0.054), ('stone', 0.054), ('learn', 0.051), ('weights', 0.048), ('td', 0.048), ('tiles', 0.048), ('learned', 0.045), ('tile', 0.044), ('learner', 0.04), ('possession', 0.04), ('passing', 0.037), ('mappings', 0.036), ('learners', 0.036), ('hours', 0.036), ('atarget', 0.036), ('giveaway', 0.036), ('article', 0.035), ('shaping', 0.033), ('inal', 0.03), ('time', 0.027), ('reach', 0.027), ('reuse', 0.027), ('activated', 0.027), ('earning', 0.026), ('speed', 0.026), ('initial', 0.025), ('utilizing', 0.025), ('yes', 0.025), ('asource', 0.024), ('ied', 0.024), ('sinitial', 0.024), ('xtarget', 0.024), ('averaging', 0.024), ('copied', 0.023), ('pole', 0.023), ('tilings', 0.023), ('initialized', 0.023), ('inaccurate', 0.023), ('sutton', 0.023), ('mapping', 0.021), ('seconds', 0.021), ('matthew', 0.021), ('training', 0.021), ('trials', 0.021), ('autonomous', 0.02), ('flat', 0.02), ('damaged', 0.02), ('noda', 0.02), ('yaxin', 0.02), ('initially', 0.019), ('bene', 0.019), ('threshold', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
2 0.11531969 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning
Author: Mohammad Ghavamzadeh, Sridhar Mahadevan
Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nystr¨ m extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker o sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory
4 0.06971626 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
5 0.055640657 82 jmlr-2007-The Need for Open Source Software in Machine Learning
Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson
Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software 2444 M ACHINE L EARNING O PEN S OURCE S OFTWARE
6 0.047896266 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
7 0.041675672 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
8 0.031345706 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
9 0.027262464 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
10 0.026719078 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
11 0.023579771 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
12 0.02309409 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
13 0.022854224 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
14 0.022820011 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
15 0.020388339 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
16 0.018291626 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
17 0.018201439 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
18 0.018007735 11 jmlr-2007-Anytime Learning of Decision Trees
19 0.016813688 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
20 0.016632415 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
topicId topicWeight
[(0, 0.117), (1, 0.073), (2, -0.063), (3, -0.025), (4, -0.229), (5, 0.102), (6, 0.093), (7, 0.012), (8, -0.014), (9, 0.092), (10, 0.065), (11, -0.369), (12, 0.081), (13, 0.247), (14, -0.071), (15, 0.114), (16, -0.011), (17, -0.054), (18, 0.015), (19, 0.087), (20, 0.044), (21, -0.029), (22, 0.141), (23, -0.162), (24, -0.126), (25, 0.131), (26, -0.109), (27, -0.001), (28, 0.022), (29, -0.065), (30, 0.033), (31, 0.007), (32, -0.12), (33, -0.016), (34, -0.008), (35, -0.065), (36, -0.019), (37, 0.069), (38, 0.051), (39, 0.02), (40, 0.035), (41, 0.062), (42, 0.024), (43, 0.016), (44, -0.077), (45, 0.027), (46, 0.036), (47, -0.04), (48, 0.019), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.96963149 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
2 0.81954777 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning
Author: Mohammad Ghavamzadeh, Sridhar Mahadevan
Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nystr¨ m extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker o sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory
4 0.34656319 82 jmlr-2007-The Need for Open Source Software in Machine Learning
Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson
Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software 2444 M ACHINE L EARNING O PEN S OURCE S OFTWARE
5 0.30194077 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
6 0.21904671 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
7 0.20160255 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
8 0.1810952 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.13022074 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
12 0.12993258 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
13 0.12609352 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
14 0.12326443 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
15 0.11195783 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
16 0.10687736 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
17 0.10137846 23 jmlr-2007-Concave Learners for Rankboost
18 0.09967722 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
19 0.095067509 11 jmlr-2007-Anytime Learning of Decision Trees
20 0.090629891 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
topicId topicWeight
[(8, 0.012), (10, 0.012), (12, 0.017), (15, 0.014), (22, 0.01), (28, 0.028), (40, 0.018), (45, 0.013), (48, 0.025), (60, 0.014), (67, 0.01), (80, 0.015), (85, 0.054), (98, 0.666)]
simIndex simValue paperId paperTitle
1 0.99568582 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
Author: Gaëlle Loosli, Stéphane Canu
Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performances through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons of these different behaviors are given through the analysis of the stopping criterion. Keywords: SVM, CVM, large scale, KKT gap, stopping condition, stopping criteria
2 0.9954896 23 jmlr-2007-Concave Learners for Rankboost
Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang
Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . 
For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. 
The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form: i f f j (x) > θ, 1 0 i f f j (x) ≤ θ, (3) h(x) = qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. 
Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001).

4. Regularizing Ranking Functions with Consistency in Preference and Confidence

In this paper, as in Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y_j = f_j(x), for that instance, so that H(x) = G(f_1(x), . . . , f_n(x)) for certain functions G(y_1, . . . , y_n) = G(y). Here we assume that each f_j(x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for converting raw measures to relative orderings (rankings) rather than combining measures directly, arguing that it is more general and avoids the semantics of particular measures.

As the y_j's are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning setting our motivations are different. Fairness is not the goal; the goal is to improve the accuracy or performance of the ranking function. Thus regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy.

Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a_j ≤ b_j for j = 1, . . . , n, then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, w_t in Eq. 1, combining the binary weak learners (Eq. 3) be cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits.

In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts.

4.1 Grocer Example

A grocer needs to make two decisions: to decide between stocking oat bran vs. granola, and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinions about stocking each item and, based on their responses, makes her two decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, for example if they all prefer granola over oat bran in the first problem. Let's assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision.
Let's say that the same consultants that liked granola more also liked radishes more (and the same ones that liked oat bran more liked turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants' relative confidence.

The above properties of preference and confidence imply that preference should be applied consistently across different problems and that confidence should reflect the relative confidences of the individual consultants.

4.2 The General Principle of Consistency in Preference and Confidence

To generalize the above grocer's problem, consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and the following condition holds for any pair of decision problems.

Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants, in the sense that A is the (non-empty) set of all consultants whose preference is identical to that of the combiner. Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that, compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem.

Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = A^c. If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem, regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B, since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below.

Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering).
In particular, the levels of confidence of different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable.

4.3 Confidence for Ranks and Ranking Functions

In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence at low ranks than at high ranks. For example, we would expect the difference between ranks 1 and 3 to represent a more significant distinction on the part of a ranker than, say, the difference between ranks 34 and 36, which may be almost arbitrarily assigned. Specifically we make the following definition:

Definition 2. Preference and confidence for univariate rank values: For any pair of rank values {r, r′} ⊂ R with r < r′, by monotonicity r is preferred. For any two pairs of rank values {r, r′} and {s, s′} with r < r′ and s ≤ s′, the confidence level for {r, r′} is higher than that of {s, s′} if r ≤ s and s′ − s ≤ r′ − r, with at least one of the inequalities strict. The pair {r, r′} provides no preference if r = r′. Likewise, if either r − s = r′ − s′ = 0 (the pairs are identical) or r − r′ = s − s′ = 0 (neither pair provides a preference), the two pairs {r, r′} and {s, s′} are defined to have the same level of confidence.

In this definition, confidence between pairs of ranks can only be compared if the pair with the lower rank has a gap at least as large as the other pair. Thus we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and imposes only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility.

As the ranking function, G, is real-valued, we apply the same definitions of preference and confidence to its values. That is, for a ranking function G and any fixed rank vectors y and y′, the preferred rank vector is the one with smaller G; that is, y is preferred iff G(y) < G(y′). For confidence we again need to consider pairs of rank vectors, {y, y′} and {z, z′}, with G(y) ≤ G(y′) and G(z) ≤ G(z′). If G(y) ≤ G(z) and G(y′) − G(y) ≥ G(z′) − G(z), then we say that the first decision, between y and y′, shows equal or more confidence than the second decision, between z and z′.

This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if, in the second decision, the confidences of the agreeing rank values are comparable and increase and the confidences of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides.

4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks

As described, the consistency property is defined over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y′ and y″. We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.
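Before stating the theorem, the following small sketch illustrates the partial ordering of Definition 2 on rank pairs; the function name and the None-for-incomparable convention are ours, not the paper's.

```python
# Sketch of the partial order in Definition 2: pair1 = (r, r') with r <= r' has confidence
# at least as high as pair2 = (s, s') when r <= s and r' - r >= s' - s.
def confidence_at_least(pair1, pair2):
    (r, rp), (s, sp) = sorted(pair1), sorted(pair2)
    if r <= s and (rp - r) >= (sp - s):
        return True           # pair1 is at least as confident as pair2
    if s <= r and (sp - s) >= (rp - r):
        return False          # pair2 is at least as confident as pair1
    return None               # not comparable under the partial ordering

# e.g., confidence_at_least((1, 3), (34, 36)) -> True  (same gap, lower ranks)
#       confidence_at_least((1, 2), (10, 15)) -> None  (not comparable)
```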
Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in the individual components of y and is jointly concave in y.

Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y′ componentwise, then the individual components all agree in their preference for y over y′, so that G(y) ≤ G(y′) by the consistency of G in preference. This implies the monotonicity of G in y.

We pick three points y, y − a and y + a, such that all three are rank vectors, and consider the pair of binary comparison problems y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These comparisons are comparable by Definition 2 (the two pairs have equal componentwise gaps, a). Since the ranks of the agreeing components, A = {j : y_j < y_j + a_j}, are larger in the y vs. y + a comparison than in the y − a vs. y comparison, while the ranks of the disagreeing components (those outside A) are smaller, the consistency of the combiner in preference and confidence (Definition 1) gives G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y).

A function G is concave on a convex domain iff 2G(y) ≥ G(y − a) + G(y + a) for every y and a with y ± a in the domain. To verify this inequality for a particular y and a, one of three cases must hold: G(y + a) > G(y), G(y − a) > G(y), or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already shown that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry the same argument with −a in place of a gives G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof, since 2G(y) ≥ G(y − a) + G(y + a) holds automatically in the third case.

Proof of Sufficiency: Assume the ranking function G is nondecreasing in the individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair of decision problems involving three rank vectors y, y′ and y″. Without loss of generality, suppose that the first problem is to choose between y and y′ and the second problem is to choose between y and y″. We break the proof up by the result of the comparison between y and y′ and the relative location of the third rank vector y″ among those satisfying the preference and confidence requirements. If G(y) = G(y′) = G(y″), the combiner has equal confidence in the two decision problems by Definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) ≠ G(y′).

[Figure 1: Two decision problems in two dimensions involving three points, y, y′ and y″, where the first decision problem is to choose between y and y′ and the second problem is to choose between y and y″. The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and the relative location of y″; the three panels show the cases G(y) < G(y′) with y_A ≤ y″_A, G(y) < G(y′) with y_A ≥ y″_A, and G(y) > G(y′) with y_A ≥ y″_A, with the agreeing component A on the vertical axis and the disagreeing component B on the horizontal axis. The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides.]
Figure 1 illustrates three of these cases two-dimensionally, where there is a single agreeing component A, on the vertical axis, and a single disagreeing component B, on the horizontal axis. Let A be the set of agreeing indices, A = {j : sgn(y′_j − y_j) = sgn(G(y′) − G(y)) ≠ 0}, and let B = A^c be the set of disagreeing indices. We use the notation y_A = (y_j, j ∈ A) to denote the corresponding subvectors. Also, when used, vector inequalities are componentwise.

Case 1: G(y) < G(y′) and y_A ≤ y″_A. In the first decision problem, as G agrees with A and disagrees with B,

y_A < y′_A,  y_B ≥ y′_B.

We also know that A ≠ ∅, since otherwise y ≥ y′ and therefore, by monotonicity, G(y) ≥ G(y′), which is a contradiction. For the second decision problem we consider the values of y″ which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence, or switching sides, along the B indices. As y_A ≤ y″_A, either by the confidence relationship or by switching sides (see Figure 1) these y″ values satisfy

y″_A − y_A ≥ y′_A − y_A,  y_B − y″_B ≤ y_B − y′_B,

which implies y′ ≤ y″, and therefore G(y′) ≤ G(y″) by monotonicity. Thus G(y) < G(y′) ≤ G(y″), which implies that in the second decision problem of choosing between y and y″, the preference of G is the same as that of A (i.e., y, since y_A ≤ y″_A) and the confidence of G is at least as high as in the first decision problem.

Case 2: G(y) < G(y′) and y_A ≥ y″_A. Since in this case y″_A ≤ y_A, either by the confidence relationship or by switching sides y″ satisfies

y_A − y″_A ≥ y′_A − y_A,  y″_B − y_B ≤ y_B − y′_B.

This implies that y′ + y″ ≤ 2y, which means that G(y′) + G(y″) ≤ 2G((y′ + y″)/2) ≤ 2G(y) by the concavity and monotonicity of G. Thus G(y) − G(y″) ≥ G(y′) − G(y), and since G(y′) > G(y) we have G(y) − G(y″) > 0 and thus G(y) > G(y″). Therefore the preference of G in the two decision problems is the same as that of A (with y in the first problem and with y″ in the second problem), and the confidence is no smaller for the second comparison.

Case 3: G(y) > G(y′) and y_A ≤ y″_A. Since y′ is preferred in the first decision problem we have

y_A > y′_A,  y_B ≤ y′_B.

For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks of A in the second problem have to be at most the smaller ranks in the first problem, that is, y_A ≤ y′_A. But with y_A ≥ y′_A and y_A ≤ y′_A this can only hold if y_A = y′_A, which leads to y ≤ y′ and, by monotonicity, implies G(y) ≤ G(y′). However, that contradicts G(y) > G(y′), and therefore this is not a viable case for comparing confidence.

Case 4: G(y) > G(y′) and y_A ≥ y″_A. Since in this case y″_A ≤ y_A, either by the confidence relationship or by switching sides y″ satisfies

y_A − y″_A ≥ y_A − y′_A,  y″_B − y_B ≤ y′_B − y_B,

which means that y″ ≤ y′. Thus, by the monotonicity of G, G(y″) ≤ G(y′). Since G(y′) < G(y), the preference of G in the second decision problem is also with A (i.e., y″) and the confidence is at least as high as in the first decision problem.

4.5 Applying Regularization to Rankboost

It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost, H(x) = ∑_t w_t h_t(x) (Eq. 1). To make H an increasing and concave function of the constituent rankings y_j = f_j(x), we need to constrain the weak ranking learners.
If the learners themselves are monotonically increasing and concave functions of the y_j, then linearly combining them with positive weights gives an H that is also an increasing and concave function of the y_j.

In this paper we apply the "third method" of Freund et al. (2003) for setting the weight value w_t. That is, weak learners are selected on their ability to maximize r from Eq. 4, and then

w_t = 0.5 ln((1 + r_max) / (1 − r_max)).   (5)

Therefore, using monotonic and concave weak learners, we select only those to which rankboost gives a positive r value, which yields a positive weight w_t. If no r values are positive, the rankboost algorithm stops.

We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing and convex (Melnik et al., 2004).

5. Minimum Weighted Group Ranks

The functional structure of the new learner we propose is

h(y) = min{γ_1 y_1, . . . , γ_n y_n, 1},   (6)

where y = (y_1, . . . , y_n) = (f_1(x), . . . , f_n(x)) is the vector of ranking features and the γ_j are learned positive coefficients. Note that the function's range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function of y. Thus, if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference.

To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992), which assigns combined ranks as the best rank each class receives from any ranker. This is a confidence-based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, the combination becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy.

5.1 Learning

At each boosting iteration, rather than selecting from all possible weak learners of the form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = {f_1(x), . . . , f_n(x)} be the set of ranking features. Recall that in rankboost H(x) = ∑_t w_t h_t(x), where h_t(x) = min{γ_1^(t) f_1(x), . . . , γ_n^(t) f_n(x), 1} and the γ_j^(t) are learned coefficients. We set

H_t = { h(x) : h(x) = min{γ_1^(s) f_1(x), . . . , γ_n^(s) f_n(x)}, s ≤ t }

and select h_{t+1}(x) from weak learners of the form

h_new(x) = min{αh(x), βf(x), 1}   (7)

with h(x) ∈ H_t and f(x) ∈ F. This learner can be rewritten in the form of Eq. 6, with the γ_j^(t+1) derived from the learned α, β, h(x) and f(x). Thus, at each iteration we look at combinations of the features and the existing learners. As discussed in the next section, we can either consider all such combinations or use a heuristic selection strategy.

We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β play symmetric roles, we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as:

r = ∑_{x_0, x_1} D^(t)(x_0, x_1)(h(x_0) − h(x_1)) = ∑_x π(x)h(x),

where π(x) = ∑_{x′} ( D^(t)(x, x′) − D^(t)(x′, x) ).
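Before continuing with the optimization of β, here is a short sketch of the MWGR learner of Eq. 6 and of the rewritten utility r = ∑_x π(x)h(x). It assumes the coefficients γ are already given (Section 5.1 builds them incrementally via Eq. 7); the function names are ours.

```python
# Sketch of the MWGR weak learner, h(y) = min{gamma_1 y_1, ..., gamma_n y_n, 1} (Eq. 6),
# and of the rank utility written as r = sum_x pi(x) h(x). Illustrative only.
import numpy as np

def mwgr(Y, gammas):
    """Y: m x n matrix of ranks (one row per instance); gammas: length-n positive coefficients."""
    return np.minimum((Y * gammas).min(axis=1), 1.0)

def rank_utility(D, h_values):
    pi = D.sum(axis=1) - D.sum(axis=0)     # pi(x) = sum_{x'} [D(x, x') - D(x', x)]
    return np.dot(pi, h_values)
```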
Given the form of Eq. 7, we can write r as a function of β:

r(β) = ∑_{x : βf(x) ≤ min(αh(x), 1)} (βf(x) − 1) π(x) + ∑_{x : αh(x) ≤ min(βf(x), 1)} (αh(x) − 1) π(x),   (8)

where instances for which 1 is the smallest of the three terms contribute nothing (since ∑_x π(x) = 0, subtracting 1 inside the minimum does not change the maximizing β). For a fixed α, r(β) is piecewise linear in β with breakpoints at the candidate values β_l = min(αh(x_l), 1)/f(x_l), one per training instance x_l, so its maximum is attained at one of these candidates. Sweeping through the sorted candidates, the sums in Eq. 8 can be maintained incrementally: writing W = 1 if αh(x_l) < 1 and W = 0 otherwise,

O(β_{l+1}) = O(β_l) − π(x_l),  P(β_{l+1}) = P(β_l) − f(x_l)π(x_l),
Q(β_{l+1}) = Q(β_l) + W π(x_l),  R(β_{l+1}) = R(β_l) + W αh(x_l)π(x_l),

where O and P accumulate π(x) and f(x)π(x) over the instances still in the first sum of Eq. 8, while Q and R accumulate π(x) and αh(x)π(x) over the instances that have moved to the second sum, so that r(β_l) = β_l P − O + R − Q. Combining these formulas gives Algorithm 2 for optimizing β.

Algorithm 2: Algorithm for optimizing β.
  Given α, h(x) ∈ H_t, f(x) ∈ F and the training instances.
  For all x_l generate and sort the set of candidate βs, B, such that β_1 ≤ β_2 ≤ · · · ≤ β_|B|, where β_l = min(αh(x_l), 1) / f(x_l).
  O = ∑_{x_l} π(x_l)
  P = ∑_{x_l} f(x_l) π(x_l)
  Q = 0
  R = 0
  r_best = 0
  for l = 1 . . . |B| do
    r = β_l P − O + R − Q
    if r > r_best then r_best = r end if
    O = O − π(x_l), P = P − f(x_l) π(x_l)
    if αh(x_l) < 1 then Q = Q + π(x_l), R = R + αh(x_l) π(x_l) end if
  end for

5.2 Heuristics for Learner Selection

If at each boosting iteration we select from all combinations of h(x) and f(x), we end up with an O(T²) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed-size pool of previously trained learners h(x) and features f(x), where at each iteration the pool is randomly chosen from the full H_t and F.

To improve performance, instead of using a uniform sampling distribution we can apply a heuristic that focuses the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f(x)π(x) and h(x)π(x) need to be large. We can therefore consider s_f = ∑_x f(x)π(x) and s_h = ∑_x h(x)π(x) as indicators of how well these components work with π(x), and we might expect larger r values to occur when these two scores are larger. Of course, this discounts interactions, which are the reason for the combination. Using these scores, we can order all h(x) and all f(x) separately and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting method. Thus, for example, all f ∈ F are sorted by their score and each is assigned a number based on the rank of its score, (maxrank − rank)/maxrank, which gives each f an equally sized bin in the range [0, 1]. Given a random number ξ ∼ U(0, 1), we calculate ξ^p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments.

[Figure 2: Convergence of the training error (average rank of the correct class on the dup II training data) as a function of the iteration number, for three values of the heuristic selection pressure p (p = 1, 1/2, 1/4). These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen, the heuristic improves the convergence rate of the training error.]
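The following is a minimal sketch of the score-biased sampling heuristic just described, under our reading of the binning scheme (better scores map to bins near 1, and ξ^p with p < 1 is biased toward 1); the function name and signature are ours.

```python
# Sketch of the selection-pressure sampling of Section 5.2. Candidates are sorted by score,
# mapped to equal-sized bins in [0, 1] (better scores near 1), and an index is drawn as
# xi**p with xi ~ U(0, 1); p < 1 biases the draw toward the better-scoring bins.
import numpy as np

def sample_by_score(scores, p=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(scores)              # ascending: the last entry has the best (largest) score
    u = rng.uniform() ** p                  # p < 1 biases u toward 1, i.e., toward better scores
    k = min(int(u * len(scores)), len(scores) - 1)
    return order[k]                         # index of the selected learner or feature
```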
6. Face Recognition Experiments

We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order, or ranking, to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy.

6.1 Algorithm Methods

We consider a data set I of face images (queries) to train on. For each query image i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, which specifies the output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, into an instance, x ≡ (i, u). As such, f_j(x) = f_j((i, u)) is the rank assigned to identity u for query image i by recognition algorithm j.

As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u* be the correct identity for training image i; then Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1 for all u ≠ u*, setting all remaining elements of Φ(x_0, x_1) to 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i's), Φ(x_0, x_1) is effectively a function of three variables, (i, u_0, u_1).

Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f(x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in Algorithm 3.

[Figure 3: Error convergence of rankboost with MWGR on the FERET data (average rank of the correct class vs. iteration number; training error on dup II and test error on dup I). This plot is typical of the convergence behavior: both training and test errors tended to converge within 10–30 iterations of boosting with no significant post-convergence divergence.]

Algorithm 3: Algorithm for selecting a learner from the pool.
  Given poolsize and the selection pressure.
  Calculate s_f for every rank feature and s_h for every existing learner.
  for l = 1 . . . poolsize do
    Generate d_f ∼ U(0, 1) and d_h ∼ U(0, 1)
    Select which h_new = min{αh(x), βf(x), 1} to try, using s_f, s_h, d_f, d_h.
    Setting α = 1, optimize β (Algorithm 2)
    Keeping the optimized β, optimize α (Algorithm 2)
    Get the r of the learner with this α, β
    if r > 0 and r > r_best then
      r_best = r, h_best = min{αh(x), βf(x), 1}, ĥ_best = min{αh(x), βf(x)}
    end if
  end for
  h_t = h_best, H_{t+1} = H_t ∪ {ĥ_best}
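Before turning to the experimental setup, the following is a minimal sketch of how the favor function and initial pair weights described in Section 6.1 could be constructed for the identification task, restricted to a single query and with identities flattened to integer indices; the function name is ours and the per-query normalization is a simplifying assumption.

```python
# Sketch of the favor function of Section 6.1: for one query, the correct identity is
# favored over every other identity, and all other pairs receive zero favor.
import numpy as np

def identification_weights(num_identities, correct):
    """Initial D over instance pairs for a single query (Eq. 2), given the correct identity."""
    phi = np.zeros((num_identities, num_identities))
    for u in range(num_identities):
        if u != correct:
            phi[u, correct] = +1.0     # the correct identity should outrank identity u
            phi[correct, u] = -1.0     # anti-symmetry
    D = np.maximum(phi, 0.0)
    return D / D.sum()                 # normalization here is per query, for illustration
```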
6.2 Experimental Setup

FERET (Phillips et al., 2000) was a government-sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images, which consists of pictures taken the same day with different facial expressions; the fafc data set of 194 images, which contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images, which has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images, which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study, where dup II was also a subset of dup I.

The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principal components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose the ANM algorithm, which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm's performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to the average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland's 1997 test submission (UMD) and the University of Southern California's 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997).

The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II, were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all four FERET data sets and randomly shuffled them together. We call this the homogeneous data set, as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes.

For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, covering combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks.
All ranks above 30 returned by a boosted combiner for the correct class were truncated to 30. In evaluating the performance of the combiners, not all the test data are equally useful. We consider the following two cases as non-informative. When the two best face recognizers, UMD and USC, both give the correct class a rank of 1, there is very little reason for the combined rank to be different. Also, when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30), it makes little sense to compare the combiners. The testing data was filtered to remove these cases, and the results are presented without them.

Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992).

We present two sets of experiments: the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers.

For all experiments we measure the average rank of the correct class for both learners:

A = (1/N) ∑_{i=1}^{N} min{Rank(x_i*), 30},

where Rank(x_i*) is the rank of the correct class for query i and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference.
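As a concrete illustration of this evaluation procedure, the following sketch computes the truncated average rank A and the paired statistics reported in the tables; the function names are ours and the one-sided t statistic is computed directly rather than with a statistics library.

```python
# Sketch of the evaluation measure of Section 6.2: the average rank of the correct class,
# truncated at 30, plus the paired comparison statistics (mean difference, its standard
# deviation, and a one-sided paired t statistic). Illustrative only.
import numpy as np

def truncated_average_rank(ranks, cap=30):
    return np.minimum(np.asarray(ranks, dtype=float), cap).mean()

def paired_comparison(bin_ranks, mwgr_ranks, cap=30):
    d = (np.minimum(np.asarray(bin_ranks, dtype=float), cap)
         - np.minimum(np.asarray(mwgr_ranks, dtype=float), cap))
    mean_diff = d.mean()                            # positive values favor MWGR
    std_diff = d.std(ddof=1)
    t = mean_diff / (std_diff / np.sqrt(len(d)))    # one-sided paired t statistic
    return mean_diff, std_diff, t
```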
6.3 Results

In the experiments with the homogeneous data sets, combining all 10 classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and a p-value of 0.039, whereas combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with a standard deviation of 3.6, a paired t-test statistic of 0.68 and a p-value of 0.24.

           ANM     ARL    EFAVG   EFML1   EFML2   EXCA    MSU     RUT     UMD     USC
  dup i    9.52   16.85   15.02   18.23   15.85   11.9    17.64   17.87   13.21   12.44
  dup ii  18.14   16.96   20.61   22.21   20.33   16.28   23.44   16.48   13.74    6.85
  fafb    10.88    8.94   14.43   16.23   12.21   11.67    6.81   12.93    4.57    5.84
  fafc    19.71   28.02   26.17   17.39   19.78   20.30   14.27   22.79    8.96    5.17

  Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems.

Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in the columns bin mean and mwgr mean with the average rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of the average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two experiments without improvement do not have significant p-values.

  test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
  dup i     dup ii     7.19      5.96       1.22       5.22      .7e-4
  dup i     fafb       7.36      6.21       1.14       5.15      .001
  dup i     fafc       8.31      6.27       2.04       5.51      .1e-6
  dup ii    dup i      5.25      5.38       -.12       4.0       .658
  dup ii    fafb       5.15      4.4        0.74       4.61      .018
  dup ii    fafc       5.19      4.95       0.23       5.1       .275
  fafb      dup i      2.62      2.22       0.39       3.09      .115
  fafb      dup ii     3.38      2.60       0.78       3.79      .029
  fafb      fafc       2.60      2.28       0.32       2.81      .142
  fafc      dup i      2.85      2.78       0.06       3.33      .424
  fafc      dup ii     2.40      2.43       -.03       2.27      .537
  fafc      fafb       2.56      2.03       0.52       2.57      .028

  Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets.

Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight of those 11 have significant p-values (a ninth, at p = .074, is marginal). The one experiment without improvement does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some cases we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents.

  test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
  dup i     dup ii     6.73      5.89       0.84       4.79      .009
  dup i     fafb       8.06      7.09       0.96       6.01      .014
  dup i     fafc       6.68      4.87       1.81       5.24      .5e-6
  dup ii    dup i      5.75      5.67       0.08       4.61      .408
  dup ii    fafb       6.24      5.47       0.76       4.22      .009
  dup ii    fafc       5.86      5.31       0.55       4.87      .074
  fafb      dup i      2.67      2.07       0.59       2.97      .033
  fafb      dup ii     3.56      2.45       1.11       4.51      .012
  fafb      fafc       2.68      2.17       0.51       2.03      .01
  fafc      dup i      3.36      2.51       0.85       3.23      .007
  fafc      dup ii     3.17      2.95       0.22       3.95      .3
  fafc      fafb       2.22      2.23       -0.01      2.08      .52

  Table 3: Results of combining all 10 classifiers using individual FERET data sets.

All sets of experiments, on homogeneous data and on heterogeneous data sets, combining the 3 select recognizers and combining all 10 recognizers at once, yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests.

7. Information Retrieval Experiments

The annual Text REtrieval Conference (TREC) generates high-quality retrieval results from different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task, which uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003), we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwa1, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc2001, ricMM, jscbtawtl4, apl10wd.

Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents which hopefully has greater precision than the individual systems. For the 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of the system results ranges between 662 and 2,664.

7.1 Methods

In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents.
Thus, we can set the favor function similarly to the way it was set in the face recognition classification task. Consider a data set with I queries to train on. For each query i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each relevant document u* and irrelevant document u for query i, we set Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1, setting all remaining elements of Φ(x_0, x_1) to 0. Thus, all relevant documents are given positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents, are given zero favor.

As the task has only 50 queries, rather than separating the data into train and test sets, we opted to do a cross-validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank positions in the list. It is described by the following equation:

AveP = [∑_{r=1}^{N} Prec(r) × rel(r)] / (number of relevant documents),

where r is the rank, N is the number of documents retrieved, rel(r) is a binary function of the relevance of the document at rank r, and Prec(r) is the precision ([number of relevant documents retrieved] / [number of documents retrieved]) at cut-off rank r. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank.

  System        NMAP
  JuruFull      0.759
  hum01tdlx     0.760
  UniNEn7d      0.763
  iit01m        0.762
  apl10wd       0.735
  jscbtawtl4    0.761
  csiro0mwa1    0.721
  kuadhoc2001   0.727
  flabxtd       0.741
  ok10wtnd1     0.745
  fub01be2      0.734
  ricMM         0.765

  Table 4: Normalized mean average precision for each constituent.

As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4), MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using Algorithm 3, with a selection pressure of p = 0.5.

7.2 Results

As seen in Figure 4, the MWGR learner converges in significantly fewer iterations than the binary learner. This could possibly be attributed to the fact that MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also, the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so MWGR can better accommodate the variety in the 1,000 ranks being considered. The normalized mean average precision for MWGR after 100 iterations was 0.8537, and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 × 10^−5 on their final iteration (better for MWGR). A paired t-test on the cross-validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR.

[Figure 4: The cross-validation mean average precision score of the two weak learners, MWGR and binary, as a function of the boosting iteration number.]
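For reference, the following is a small sketch of the per-query average precision measure defined in Section 7.1; the handling of unretrieved relevant documents is simplified (they only enter through the denominator), and the function name is ours.

```python
# Sketch of the average precision measure of Section 7.1 for a single query: ranked_docs is
# the returned list in rank order and relevant is the set of relevant documents.
def average_precision(ranked_docs, relevant):
    hits, precision_sum = 0, 0.0
    for r, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:                 # rel(r) = 1
            hits += 1
            precision_sum += hits / r       # Prec(r) at cut-off rank r
    return precision_sum / len(relevant) if relevant else 0.0

# e.g., average_precision(["d3", "d7", "d1"], {"d3", "d1"}) == (1/1 + 2/3) / 2
```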
8. Discussion

The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how the structure of ranks can be correctly exploited to maximize performance.

The semiparametric nature of rankboost offers the possibility of generating arbitrarily flexible ranking functions. But, as observed, this flexibility comes at the cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and, as such, on how they can be combined.

We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show consistency in how they treat the preference and relative confidence exhibited by their constituents. We prove that, under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function.

We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR), is that its parameters can be readily optimized individually with respect to the rankboost criterion, allowing it to be tested on real-world data.

In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR procedure. We also conducted experiments on heterogeneous data, testing the robustness, or adaptivity, of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner.

Our second experiment was on an information retrieval task taken from the TREC conference. In this task MWGR converges in significantly fewer iterations and generates a statistically significant improvement in performance.

Final Words

Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition.

Acknowledgments

This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions, which improved the presentation of the results.

Appendix A.

In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions.
For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function G: if

G(y) < G(y′),  y′_B − y_B ≤ 0 < y′_A − y_A,
z_A ≤ y_A,  y′_A − y_A ≤ z′_A − z_A,
y′_B ≤ z′_B,  z_B − z′_B ≤ y_B − y′_B,

then

G(z′) − G(z) > G(y′) − G(y),   (11)

where y and y′ are the rank vectors of the first decision problem and z and z′ are the rank vectors of the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say is that, since the constraints are linear, in the same way that concavity and monotonicity of the weak learners conferred those properties on the ranking function, a weak learner that satisfies Eq. 11 will also confer this property on a ranking function that uses it with positive weights.

References

K. Arrow. Social Choice and Individual Values. Wiley, 1951.

C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001.

Y. Freund, R.E. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996.

T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992.

T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992.

J. Kittler and F. Roli, editors. Multiple Classifier Systems. Lecture Notes in Computer Science 1857. Springer, 2000.

R. Meir and G. Ratsch. An introduction to boosting and leveraging. In Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, pages 119–184. Springer, 2003.

O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004.

H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001.

P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1090–1104, 2000.

M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM.

G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983.

E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov.

L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775–779, 1997.

W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng.
Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. Springer-Verlag, Berlin, 1998.