jmlr jmlr2007 jmlr2007-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
Reference: text
sentIndex sentText sentNum sentScore
1 Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. [sent-14, score-0.605]
2 Few current machine learning methods are able to transfer knowledge between pairs of tasks, and none are able to transfer between a broad range of tasks to the extent that humans are. [sent-21, score-0.591]
3 This article presents a new method for transfer learning in the reinforcement learning (RL) framework using temporal difference (TD) learning methods (Sutton and Barto, 1998), whereby an agent can learn faster in a target task after training on a different, typically less complex, source task. [sent-22, score-0.785]
4 In this article we introduce transfer via inter-task mapping (TVITM), whereby a TD learner trained with action-value function RL on one task can learn faster when training on another task with related, but different, state and action spaces. [sent-31, score-0.756]
5 It is this transfer functional which defines transfer in the TVITM framework. [sent-35, score-0.522]
6 In this article we consider the general case where the state features in the source and target tasks are different (Ssource ≠ Starget), and/or the actions in the source and target tasks are different (Asource ≠ Atarget). [sent-81, score-0.746]
7 To use the learned action-value function from the source task, Q(source, final), as the initial action-value function for a TD learner in a target task, we must transform the action-value function so that it can be directly applied to the new state and action space. [sent-82, score-0.571]
8 Similarly, χA maps each action in the target task to the most similar action in the source task: χA(ai,target) = aj,source. [sent-93, score-0.486]
9 χX and χA , mappings from the target task to the source task, are used to construct ρ, a transfer functional from the source task to the target task (see Figure 2). [sent-94, score-0.94]
10 Figure 2: χX and χA are mappings from a target to a source task; ρ maps an action-value function from a source to a target task. [sent-97, score-0.454]
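To make the inter-task mappings concrete, the following sketch shows one way χA and χX could be represented for transfer from 4 vs. 3 into 3 vs. 2 Keepaway. The dictionary entries, variable names, and the map_state_action helper are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the inter-task mappings used by TVITM.
# Each novel target-task action/state variable is mapped to the most
# similar source-task counterpart, as described in the surrounding text.

# chi_A: target (4 vs. 3) actions -> source (3 vs. 2) actions
chi_A = {
    "hold": "hold",
    "pass to closest teammate": "pass to closest teammate",
    "pass to second closest teammate": "pass to second closest teammate",
    # The novel target action maps to the most similar source action:
    "pass to third closest teammate": "pass to second closest teammate",
}

# chi_X: target state variables -> source state variables (entries assumed)
chi_X = {
    "dist(K1, K4)": "dist(K1, K3)",  # novel teammate distance -> nearest analogue
    "dist(K1, T3)": "dist(K1, T2)",  # novel taker distance -> nearest analogue
}

def map_state_action(state, action):
    """Map a target-task (state, action) pair into the source task's spaces."""
    source_state = {chi_X.get(var, var): val for var, val in state.items()}
    return source_state, chi_A[action]
```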
11 Rather than transferring higher-level information about a source task into a target task, we focus on the information contained in individual weights within function approximators. [sent-107, score-0.593]
12 Learned source task knowledge may be able to improve initial target task performance relative to learning the target task without transfer. [sent-128, score-0.543]
13 Transfer may allow an agent to accumulate more reward in the target task when compared to the non-transfer case; better initial performance and faster learning would help agents achieve more on-line reward. [sent-133, score-0.501]
14 Assuming that the transfer learner is able to learn faster or reach a higher performance, the area under the transfer curve will be greater than the area under the non-transfer curve. [sent-140, score-0.64]
15 The ratio r = (area under curve with transfer - area under curve without transfer) / (area under curve without transfer) gives us a metric for how much transfer improves learning. [sent-141, score-1.044]
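A minimal sketch of this ratio, assuming two performance curves sampled at the same training times and a trapezoidal estimate of area (one reasonable choice, not the paper's prescribed one):

```python
import numpy as np

def transfer_ratio(times, perf_with_transfer, perf_without_transfer):
    """r = (area with transfer - area without transfer) / (area without transfer)."""
    area_with = np.trapz(perf_with_transfer, times)
    area_without = np.trapz(perf_without_transfer, times)
    return (area_with - area_without) / area_without

# Illustrative numbers only (training hours vs. average episode duration):
times = np.array([0.0, 10.0, 20.0, 30.0])
r = transfer_ratio(times,
                   np.array([6.0, 9.0, 11.0, 12.0]),
                   np.array([5.0, 7.0, 9.0, 11.0]))
print(f"transfer ratio r = {r:.2f}")
```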
16 In the tasks we consider, the learners that use transfer and the learners that learn without transfer do not always plateau to the same performance, nor is there a defined task length. [sent-144, score-0.789]
17 This criterion is relevant when the source task is created for the sole purpose of speeding up learning with transfer and Q(source, final) is not reused. [sent-151, score-0.461]
18 We also introduce the Knight Joust, a task that we will later use as a supplemental source task for transfer into Keepaway. [sent-154, score-0.536]
19 In particular, the simulator places the players at their initial positions at the start of each episode and ends an episode when the ball leaves the play region or is taken away. [sent-174, score-0.412]
20 Whenever the takers take possession or the ball leaves the region, the episode ends and the players are reset for another episode (with the keepers being given possession of the ball again). [sent-177, score-0.793]
21 The player receives a reward of +20 every time it takes the forward action, 0 if either knight jump action is taken, and an additional +20 upon reaching the goal line. [sent-249, score-0.444]
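Taking the reward description above at face value, a minimal Knight Joust reward function might look like the following; the action names and the goal test are assumptions for illustration.

```python
def knight_joust_reward(action, reached_goal_line):
    """Reward as described above: +20 per Forward action, 0 for either
    knight-jump action, plus an additional +20 on reaching the goal line."""
    reward = 20 if action == "Forward" else 0  # JumpW / JumpE earn 0
    if reached_goal_line:
        reward += 20
    return reward
```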
22 Learning Keepaway: TVITM aims to improve learning in the target task based on prior learning in the source, and therefore a prerequisite is that both source and target tasks are learnable. [sent-254, score-0.437]
23 The keepers learn in a constrained policy space: they have the freedom to decide which action to take only when in possession of the ball. [sent-273, score-0.466]
24 Therefore the number of actions from which the keeper with the ball may choose is equal to the number of keepers in the task. [sent-275, score-0.53]
25 Keepers not in possession of the ball are required to execute the Receive macro-action in which the player who can reach the ball the fastest goes to the ball and the remaining players follow a handcoded strategy to try to get open for a pass. [sent-276, score-0.521]
26 The keepers learn which action to take when in possession of the ball by using episodic SMDP Sarsa(λ) (Sutton and Barto, 1998). [sent-281, score-0.539]
27 Since we want the keepers to maintain possession of the ball for as long as possible, the reward in the Keepaway task is simply the number of time steps the ball remains in play after a macro-action is initiated. [sent-284, score-0.646]
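The sketch below shows the shape of a Sarsa(λ) update with linear function approximation and the Keepaway reward (time steps the ball stays in play between macro-action choices). It is a generic sketch, not the authors' code; the step size, λ, and the use of an undiscounted return are assumptions.

```python
import numpy as np

def sarsa_lambda_step(theta, traces, phi, a, r, phi_next, a_next,
                      alpha=0.125, gamma=1.0, lam=0.9):
    """One episodic Sarsa(lambda) update with linear function approximation.

    theta[a] and traces[a] are per-action weight and eligibility vectors,
    phi is the feature vector for the current state, and r is the Keepaway
    reward: the number of time steps the ball stayed in play since the
    last macro-action was initiated.
    """
    delta = r + gamma * (theta[a_next] @ phi_next) - (theta[a] @ phi)
    traces[a] = traces[a] + phi              # accumulate trace for taken action
    for b in theta:                          # update all actions, then decay traces
        theta[b] = theta[b] + alpha * delta * traces[b]
        traces[b] = gamma * lam * traces[b]
    return theta, traces
```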
28 As more keepers are added, the keeper with the ball has more passing options but the average pass distance is shorter. [sent-288, score-0.559]
29 Note that as the number of keepers n and the number of takers m increase, the number of state variables also increases so that the more complex state can be fully described. [sent-304, score-0.457]
30 |S| increases (there are more distances to players to account for) and |A| increases as there are more teammates for the keeper with possession of the ball to pass to. [sent-313, score-0.482]
31 In this article we use three distinct function approximators and show that all are able to learn Keepaway, as well as use our transfer methodology (see Figure 6). [sent-316, score-0.424]
32 Three keepers start in the other three corners, one per corner, and the fourth keeper begins an episode at the center of the field. [sent-392, score-0.43]
33 The 4 vs. 3 task is set up like 3 vs. 2 Keepaway, but now A = {hold, pass to closest teammate, pass to second closest teammate, pass to third closest teammate}, and S is made up of 19 state variables due to the added players. [sent-394, score-0.451]
34 Furthermore, adding a keeper in the center of the field changes the start state significantly as now the keeper that starts with the ball has a teammate that is closer to itself, but is also closer to the takers. [sent-401, score-0.575]
35 When a group of four CMAC keepers has learned to hold the ball from the three takers for an average of 11. [sent-408, score-0.428]
36 (After training four keepers using ANN function approximation without transfer in 4 vs. [sent-424, score-0.5]
37 In 5 vs. 4, three keepers are again placed in three corners and the two remaining keepers are placed in the middle of the 25m × 25m field. [sent-433, score-0.436]
38 There are now five actions: {hold, pass to closest teammate, pass to second closest teammate, pass to third closest teammate, pass to fourth closest teammate}, and 25 state variables. [sent-435, score-0.576]
39 TVITM relies on a functional ρ that is able to transfer an action-value function from a source task into a target task with different state and action spaces. [sent-449, score-0.797]
40 We map the novel target action “Pass to third closest keeper” to “Pass to second closest keeper” in the source task. [sent-466, score-0.418]
41 Now that χA and χX are defined, relating the state variables and actions in a target task to the state variables and actions in a source task, we can use them to construct ρs for different internal representations. [sent-475, score-0.594]
42 The functionals will transfer the learned action-value function from the source task into the target task. [sent-476, score-0.59]
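As one concrete reading of such a functional, the sketch below initializes each target-task weight from the source-task weight selected by the inter-task mappings. Indexing weights by a (state variable, action) pair is a simplification; a real CMAC or ANN representation needs a more involved correspondence.

```python
def rho(source_weights, target_vars, target_actions, chi_X, chi_A):
    """Construct initial target-task weights from learned source-task weights.

    source_weights: dict keyed by (source_state_variable, source_action).
    chi_X / chi_A: inter-task mappings from target variables/actions to their
    most similar source counterparts.
    """
    target_weights = {}
    for var in target_vars:
        for act in target_actions:
            mapped_key = (chi_X[var], chi_A[act])
            # Each target weight starts at the value learned for the
            # corresponding (mapped) source weight, defaulting to 0.
            target_weights[(var, act)] = source_weights.get(mapped_key, 0.0)
    return target_weights
```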
43 Relevant points are the center of the field C, keepers K1 -K4 , and takers T1 -T3 , where players are ordered by increasing distance from the ball. [sent-494, score-0.443]
44 Table 3 describes the mapping of state variables and actions from 4 vs. 3 Keepaway to Knight Joust; the action mapping is Hold Ball → Forward, Pass to closest teammate → JumpW, Pass to second closest teammate → JumpE, and Pass to third closest teammate → JumpE. [sent-509, score-0.873]
45 Rather than initializing a function approximator in the target task with values learned in the source task, we instead reuse the entire set of Q-values learned in the source task. [sent-556, score-0.6]
46 A copy of the source task’s function approximator is retained so that it can calculate the source task’s Q-values for any state, action pair: QsourceFA : S × A → R. [sent-557, score-0.425]
47 When computing Q-values for the target task, we first map the target task state and action to the source task’s state and action via the inter-task mappings. [sent-558, score-0.722]
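A minimal sketch of Q-value Reuse as described above: the retained source-task function approximator is queried through the inter-task mappings, and its estimate is combined with a target-task approximator that continues to learn. Summing the two estimates is our reading of the scheme and should be treated as an assumption.

```python
class QValueReuse:
    """Q-value Reuse: keep the source task's function approximator and query
    it on (state, action) pairs mapped from the target task."""

    def __init__(self, q_source_fa, q_target_fa, chi_X, chi_A):
        self.q_source_fa = q_source_fa  # frozen, learned in the source task
        self.q_target_fa = q_target_fa  # continues learning in the target task
        self.chi_X = chi_X
        self.chi_A = chi_A

    def q(self, state, action):
        # Map the target-task state and action into the source task's spaces.
        source_state = {self.chi_X.get(v, v): x for v, x in state.items()}
        source_action = self.chi_A[action]
        # Combine the fixed source estimate with the learned target estimate.
        return (self.q_source_fa(source_state, source_action)
                + self.q_target_fa(state, action))
```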
48 Q-value Reuse may be considered a type of reward shaping (Colombetti and Dorigo, 1993; Mataric, 1994): we are able to directly use the expected rewards from the source task to bias the learner in the target task. [sent-561, score-0.487]
49 Such requirements will grow linearly in the number of transfer steps; while they are not substantial with a single source task, they may become prohibitive when using multiple source tasks or when performing multi-step transfer (as shown later in Section 7). [sent-566, score-0.841]
50 As source task training time increases, the required target task training time decreases. [sent-638, score-0.455]
51 To determine if Keepaway players using CMAC function approximation can benefit from transfer, we compare the time it takes agents to learn the target task after transferring from the source task with the time it takes to learn the target task without transfer. [sent-679, score-1.009]
52 The top row of each table represents learning the task without transfer and thus any column with transfer times lower than the top row shows beneficial transfer. [sent-681, score-0.597]
53 Our second goal of transfer would be met if the total training time in both tasks with transfer were less than the time required to learn the target task without transfer. [sent-682, score-0.984]
54 Again, successful transfer is demonstrated as both the transfer agents’ target task training time and the transfer agent’s total training time are less than the time required to learn the target task without transfer. [sent-773, score-1.275]
55 Learning trials that used a sufficient number of 3 vs. 2 source task episodes show a statistically significant difference from those that learn without transfer (p < 0.05), while the learning trials that used less than 500 source task episodes did not significantly reduce the target task training time. [sent-778, score-0.426] [sent-779, score-0.515]
57 The difference in 4 vs. 3 training times for the ANN players between using TVITM and training without transfer is statistically significant (p < 0. [sent-781, score-0.441]
58 Table 6: Results showing that transfer with the full ρCMAC outperforms using ρCMAC without the final averaging step, using only the averaging step of ρCMAC, and when averaging weights in the source task before transferring the weights. [sent-809, score-0.659]
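One variant compared in Table 6 fills untouched (zero) weights in the transferred target CMAC with an average source weight. A hedged sketch of that step, assuming flat weight vectors and averaging over nonzero source weights:

```python
import numpy as np

def fill_zero_weights_with_source_average(target_weights, source_weights):
    """Replace zero entries of the transferred target CMAC weight vector with
    the average of the nonzero source CMAC weights (assumed interpretation)."""
    nonzero = source_weights[source_weights != 0]
    avg = nonzero.mean() if nonzero.size else 0.0
    filled = target_weights.copy()
    filled[filled == 0] = avg
    return filled
```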
59 We anticipated that the benefit from transfer would be increasingly degraded, relative to using the actual ρCMAC, as fewer training episodes in the source task were used. [sent-815, score-0.596]
60 While our results demonstrate that all three function approximators can successfully transfer knowledge, we focus our supplementary experiments on CMAC function approximation so that our transfer work can be directly comparable to previous work in Keepaway, which also used CMACs (Stone et al. [sent-826, score-0.599]
61 We also conducted a set of 30 trials which modified ρCMAC so that the average weight in the source CMAC is put into all zero-weights in the target CMAC, which is possible when agents in the source task know that their saved weights will be used for TVITM. [sent-854, score-0.597]
62 We first defined ρmodified by modifying χA so that, instead of mapping the novel target task action “Pass to third closest keeper” into the action “Pass to second closest keeper,” we map the novel action into “Hold ball.” [sent-872, score-0.561]
63 In 4 vs. 3, the initial hold times of the keepers do not differ significantly from those of keepers with uninitialized CMACs. [sent-913, score-0.461]
64 For example, in the degenerate case where the source and target task are identical, the initial performance in the target task will be equivalent to the final performance in the source task. [sent-965, score-0.593]
65 The former gives more benefit after more training is completed in the source task, and the latter helps when less knowledge is gained in the source task before transfer. [sent-1006, score-0.421]
66 Players initialized by TVITM from the source task do not initially outperform uninitialized players in the target task, but they are able to learn faster. [sent-1015, score-0.515]
67 In addition to changing the number of players between the source and target tasks, other variations such as the size of the field, wind, and player ability can be modified. [sent-1104, score-0.429]
68 Table 12: Results showing transfer via inter-task mapping benefits CMAC players utilizing ρCMAC with two kinds of actuators. [sent-1144, score-0.445]
69 These results demonstrate that transfer can succeed even when actions in the source and target tasks are qualitatively different. [sent-1145, score-0.618]
70 We are also able to use the same ρCMAC to speed up learning in the target task when both the target and source tasks have inaccurate actuators (rows 5 and 6). [sent-1167, score-0.585]
71 2 keepers that learned using an accurate pass action in the source task. [sent-1174, score-0.56]
72 In rows 7 and 8 of Table 12, even though the players in the source task have different actuators than in the target task, transfer is able to significantly speed up learning compared to not using transfer. [sent-1177, score-0.808]
73 Actuators are changed in the benchmark players by changing the pass action from the default “PassNormal” to “PassFast” which increases the speed of the pass by 50%, reducing accuracy. [sent-1179, score-0.407]
74 This result confirms that the same ρ will allow TVITM to transfer between tasks where not only have S and A changed, but the effects of the actions have also changed qualitatively. [sent-1180, score-0.409]
75 Alternatively, if agents have learned a task and then later want to learn another task but have damaged their actuators since learning the source task, transfer may still increase the speed of learning. [sent-1183, score-0.896]
76 We also perform the inverse experiment where agents in the source task have inaccurate actuators and agents in the target task have normal actuators. [sent-1184, score-0.719]
77 Thus a fielded agent with worn down actuators would be able to successfully transfer its learned action-value function to agents whose actuators were undamaged. [sent-1189, score-0.691]
78 Interestingly, transferring from source keepers that have accurate actuators is more effective than transferring from source keepers that have inaccurate actuators both when the target task has accurate actuators and when it has inaccurate actuators. [sent-1190, score-1.344]
79 In this subsection we demonstrate that transfer works when actuators are inaccurate in both the source and target tasks. [sent-1195, score-0.592]
80 Combined, our results show that TVITM is able to speed up learning in multiple target tasks with different state and action spaces, and even when the agents have somewhat different actuators in the two tasks. [sent-1201, score-0.574]
81 In 5 vs. 4, we say that the task has been learned when the 5 keepers are able to hold the ball for an average of 9. [sent-1218, score-0.416]
82 For instance, the target task actions “Pass to Second Closest Teammate”, “Pass to Third Closest Teammate”, and “Pass to Fourth Closest Teammate” are mapped to the source task action “Pass to Second Closest Teammate.” [sent-1229, score-0.539]
83 Given the available actions, the optimal action for players is for the player closest to the takers to hold the ball until the takers capture it. [sent-1317, score-0.627]
84 We use a variant of ρCMAC to transfer the learned weights, because Knight Joust is learned with a tabular function approximator rather than a CMAC. [sent-1364, score-0.425]
85 Note that this new variant of the transfer functional is not necessitated by the novel target task and that, if we learned Knight Joust with a CMAC, the original ρCMAC would be sufficient for transfer between Knight Joust and 4 vs. 3 Keepaway. [sent-1366, score-0.726]
86 The 4 vs. 3 transfer times (in simulator hours) are statistically different from learning without transfer (determined via Student’s t-tests). [sent-1370, score-0.579]
87 Another question that arises from this work is: if the ultimate goal is to learn the target task in the minimal amount of time, how can one determine the optimal amount of training in the source task automatically based on task characteristics? [sent-1407, score-0.506]
88 The best learned actions in the source task must, for a given state, be mapped to the best action in the target task via the inter-task mappings. [sent-1429, score-0.509]
89 Thus, if all weights from this policy were multiplied by −1, the keepers would continually pass the ball until a taker came within 6m. [sent-1435, score-0.526]
90 In this work, an example of the first condition being met is that a keeper learns to hold the ball in the source task until forced to pass. [sent-1439, score-0.433]
91 If either of these conditions were not true, the transfer functional we employed would have to account for the differences (or suffer from reduced transfer efficacy). [sent-1446, score-0.522]
92 Past research confirms that if two tasks are closely related the learned policy from a source task can be used to provide a good initial policy for a target task. [sent-1469, score-0.535]
93 TVITM differs in intent in that we aim to transfer behaviors from existing, relevant tasks which can have different state and action spaces, rather than creating artificial problems which are easier for the agent to learn. [sent-1486, score-0.575]
94 The technique of autonomous shaping (Konidaris and Barto, 2006) may prove to be useful for transfer learning, particularly in agents that have many sensors. [sent-1518, score-0.433]
95 If a later task has a similar reward structure and actions, the learned reward shaping will help the agent initially have a much higher performance than if it were learning without transfer. [sent-1520, score-0.5]
96 For instance, if a signal device (a beacon) is near the goal state, the agent may learn a shaping reward that gives an internal reward for approaching the beacon even if the environmental reward is zero. [sent-1521, score-0.542]
97 Additionally, while the authors limit themselves to transfer between “reward-linked” tasks, no method is given for determining whether a sequence of tasks is reward-linked, and a learned shaping function will not necessarily be useful in a given target task. [sent-1523, score-0.492]
98 Using such a translation mechanism allows, for instance, transfer between a source task policy search learner and a value function target task learner. [sent-1546, score-0.716]
99 Rather than utilizing abstract knowledge, this transfer method is able to leverage the weights from function approximators specifying action-value functions, a very task-specific form of knowledge. [sent-1559, score-0.411]
100 We also show how transfer efficacy is reduced when the source task and target task are less related, such as when using 3 vs. [sent-1568, score-0.62]
wordName wordTfidf (topN-words)
[('cmac', 0.47), ('keepaway', 0.365), ('transfer', 0.261), ('tvitm', 0.254), ('keepers', 0.218), ('dist', 0.213), ('keeper', 0.155), ('ang', 0.151), ('players', 0.138), ('reward', 0.13), ('source', 0.125), ('agents', 0.119), ('episodes', 0.114), ('joust', 0.111), ('teammate', 0.111), ('knight', 0.104), ('action', 0.101), ('actuators', 0.099), ('takers', 0.087), ('target', 0.084), ('appings', 0.083), ('ransfer', 0.083), ('soccer', 0.083), ('tone', 0.083), ('player', 0.082), ('actions', 0.079), ('ann', 0.079), ('ball', 0.078), ('transferring', 0.078), ('approximators', 0.077), ('state', 0.076), ('task', 0.075), ('approximator', 0.074), ('opponent', 0.071), ('pass', 0.071), ('nter', 0.07), ('tasks', 0.069), ('agent', 0.068), ('sarsa', 0.067), ('reinforcement', 0.065), ('iu', 0.063), ('rbf', 0.059), ('simulator', 0.057), ('episode', 0.057), ('policy', 0.056), ('taylor', 0.056), ('robocup', 0.055), ('taker', 0.055), ('closest', 0.054), ('stone', 0.054), ('learn', 0.051), ('weights', 0.048), ('td', 0.048), ('tiles', 0.048), ('learned', 0.045), ('tile', 0.044), ('learner', 0.04), ('possession', 0.04), ('passing', 0.037), ('mappings', 0.036), ('learners', 0.036), ('hours', 0.036), ('atarget', 0.036), ('giveaway', 0.036), ('article', 0.035), ('shaping', 0.033), ('inal', 0.03), ('time', 0.027), ('reach', 0.027), ('reuse', 0.027), ('activated', 0.027), ('earning', 0.026), ('speed', 0.026), ('initial', 0.025), ('utilizing', 0.025), ('yes', 0.025), ('asource', 0.024), ('ied', 0.024), ('sinitial', 0.024), ('xtarget', 0.024), ('averaging', 0.024), ('copied', 0.023), ('pole', 0.023), ('tilings', 0.023), ('initialized', 0.023), ('inaccurate', 0.023), ('sutton', 0.023), ('mapping', 0.021), ('seconds', 0.021), ('matthew', 0.021), ('training', 0.021), ('trials', 0.021), ('autonomous', 0.02), ('flat', 0.02), ('damaged', 0.02), ('noda', 0.02), ('yaxin', 0.02), ('initially', 0.019), ('bene', 0.019), ('threshold', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
2 0.11531969 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning
Author: Mohammad Ghavamzadeh, Sridhar Mahadevan
Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nystr¨ m extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker o sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory
4 0.06971626 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
5 0.055640657 82 jmlr-2007-The Need for Open Source Software in Machine Learning
Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson
Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software 2444 M ACHINE L EARNING O PEN S OURCE S OFTWARE
6 0.047896266 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
7 0.041675672 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
8 0.031345706 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
9 0.027262464 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
10 0.026719078 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
11 0.023579771 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
12 0.02309409 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
13 0.022854224 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
14 0.022820011 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
15 0.020388339 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
16 0.018291626 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
17 0.018201439 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
18 0.018007735 11 jmlr-2007-Anytime Learning of Decision Trees
19 0.016813688 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
20 0.016632415 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
topicId topicWeight
[(0, 0.117), (1, 0.073), (2, -0.063), (3, -0.025), (4, -0.229), (5, 0.102), (6, 0.093), (7, 0.012), (8, -0.014), (9, 0.092), (10, 0.065), (11, -0.369), (12, 0.081), (13, 0.247), (14, -0.071), (15, 0.114), (16, -0.011), (17, -0.054), (18, 0.015), (19, 0.087), (20, 0.044), (21, -0.029), (22, 0.141), (23, -0.162), (24, -0.126), (25, 0.131), (26, -0.109), (27, -0.001), (28, 0.022), (29, -0.065), (30, 0.033), (31, 0.007), (32, -0.12), (33, -0.016), (34, -0.008), (35, -0.065), (36, -0.019), (37, 0.069), (38, 0.051), (39, 0.02), (40, 0.035), (41, 0.062), (42, 0.024), (43, 0.016), (44, -0.077), (45, 0.027), (46, 0.036), (47, -0.04), (48, 0.019), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.96963149 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning
Author: Matthew E. Taylor, Peter Stone, Yaxin Liu
Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping
2 0.81954777 41 jmlr-2007-Hierarchical Average Reward Reinforcement Learning
Author: Mohammad Ghavamzadeh, Sridhar Mahadevan
Abstract: Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms
Author: Sridhar Mahadevan, Mauro Maggioni
Abstract: This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nystr¨ m extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker o sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. Keywords: Markov decision processes, reinforcement learning, value function approximation, manifold learning, spectral graph theory
4 0.34656319 82 jmlr-2007-The Need for Open Source Software in Machine Learning
Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson
Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software 2444 M ACHINE L EARNING O PEN S OURCE S OFTWARE
5 0.30194077 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
Author: Avrim Blum, Yishay Mansour
Abstract: External regret compares the performance of an online algorithm, selecting among N actions, to the performance of the best of those actions in hindsight. Internal regret compares the loss of an online algorithm to the loss of a modified online algorithm, which consistently replaces one action by another. In this paper we give a simple generic reduction that, given an algorithm for the external regret problem, converts it to an efficient online algorithm for the internal regret problem. We provide methods that work both in the full information model, in which the loss of every action is observed at each time step, and the partial information (bandit) model, where at each time step only the loss of the selected action is observed. The importance of internal regret in game theory is due to the fact that in a general game, if each player has sublinear internal regret, then the empirical frequencies converge to a correlated equilibrium. For external regret we also derive a quantitative regret bound for a very general setting of regret, which includes an arbitrary set of modification rules (that possibly modify the online algorithm) and an arbitrary set of time selection functions (each giving different weight to each time step). The regret for a given time selection and modification rule is the difference between the cost of the online algorithm and the cost of the modified online algorithm, where the costs are weighted by the time selection function. This can be viewed as a generalization of the previously-studied sleeping experts setting. Keywords: online learning, internal regret, external regret, multi-arm bandit, sleeping experts, reductions
6 0.21904671 56 jmlr-2007-Multi-Task Learning for Classification with Dirichlet Process Priors
7 0.20160255 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
8 0.1810952 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.13022074 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
12 0.12993258 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation
13 0.12609352 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
14 0.12326443 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
15 0.11195783 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
16 0.10687736 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
17 0.10137846 23 jmlr-2007-Concave Learners for Rankboost
18 0.09967722 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes
19 0.095067509 11 jmlr-2007-Anytime Learning of Decision Trees
20 0.090629891 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
topicId topicWeight
[(8, 0.012), (10, 0.012), (12, 0.017), (15, 0.014), (22, 0.01), (28, 0.028), (40, 0.018), (45, 0.013), (48, 0.025), (60, 0.014), (67, 0.01), (80, 0.015), (85, 0.054), (98, 0.666)]
simIndex simValue paperId paperTitle
1 0.99568582 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
Author: Gaëlle Loosli, Stéphane Canu
Abstract: In a recently published paper in JMLR, Tsang et al. (2005) present an algorithm for SVM called Core Vector Machines (CVM) and illustrate its performances through comparisons with other SVM solvers. After reading the CVM paper we were surprised by some of the reported results. In order to clarify the matter, we decided to reproduce some of the experiments. It turns out that to some extent, our results contradict those reported. Reasons of these different behaviors are given through the analysis of the stopping criterion. Keywords: SVM, CVM, large scale, KKT gap, stopping condition, stopping criteria
2 0.9954896 23 jmlr-2007-Concave Learners for Rankboost
Author: Ofer Melnik, Yehuda Vardi, Cun-Hui Zhang
Abstract: Rankboost has been shown to be an effective algorithm for combining ranks. However, its ability to generalize well and not overfit is directly related to the choice of weak learner, in the sense that regularization of the rank function is due to the regularization properties of its weak learners. We present a regularization property called consistency in preference and confidence that mathematically translates into monotonic concavity, and describe a new weak ranking learner (MWGR) that generates ranking functions with this property. In experiments combining ranks from multiple face recognition algorithms and an experiment combining text information retrieval systems, rank functions using MWGR proved superior to binary weak learners. Keywords: rankboost, ranking, convex/concave, regularization 1. Ranking Problems A ranking problem is a learning problem that involves ranks as the inputs, the outputs or both. An example where ranks are used as inputs is a collaborative filtering application where people are asked to rank movies according to their preferences. In such an application the ranks assigned by different people are combined to generate recommendations. Another type of problem in which ranks are used as inputs are meta-search problems, where the ranks of multiple search engines are combined (Dwork et al., 2001). However, the inputs to a ranking problem are not always ranks. An object recognition ranking problem (Kittler and Roli, 2000) may receive as inputs a graphical representation and output a ranking of the possible objects, sorted by likelihood. The outputs of a ranking problem may also be ranks. For example, in combining multiple search engines the output are ranks which are a synthesis of the ranks from the individual search engines. A similar meta-recognition problem is the combination of the outputs of multiple face- recognition systems to improve the accuracy of detecting the correct face. While the inputs and outputs of this problem are ranks, the outputs can be simplified to only return the most likely candidate. Another example where the outputs do not need to be complete ranks could be an information retrieval combination task. In such a task the inputs might be ranks of sets of documents with respect to c 2007 Ofer Melnik, Yehuda Vardi and Cun-Hui Zhang. M ELNIK , VARDI AND Z HANG a particular query by different experts. Again, the outputs could be the complete ranks, or more simply a designation for each document of whether it is relevant or not to the particular query. 2. Rankboost The rankboost algorithm (Freund et al., 2003) tries to address this variety in ranking problems by maintaining generality in how it regards its inputs and how it applies different loss functions to outputs. The rankboost algorithm works on instances which are the discrete items (e.g., movies or faces) that either are ranked as input or are to be ranked as output. To allow a general input mechanism, the inputs to rankboost are called the ranking features of an instance, specified as f j (x) which is the j-th ranking feature of instance x. While this generality in how inputs are handled is potentially powerful, in the original rankboost paper as well as in this paper, only the case where the inputs are different rankings of the instances is considered. Thus, in this paper, the inputs to rankboost are constituent ranks, denoted as f 1 . . . fn , where f j (x ) < f j (x ) implies that instance x has a better rank than instance x under ranking f j . 
For example, in some of the experiments we present, each f j is the ranking result of a particular face recognition algorithm. The output of rankboost is a new ranking function, H(x), which defines a linear ordering on the instances, that is, H (x ) < H (x ) iff x has a better rank than x . In rankboost T H(x) = ∑ wt ht (x), (1) t=1 a weighted sum of weak ranking learners, where the ht (x)’s are relatively simple learned functions of the constituent rankings. To address the variety in possible loss functions of the outputs, in rankboost the desirable properties for the output loss function are specified with a favor function, Φ : X × X → R, where X is the space of instances (note that this function has been renamed from “preference function” to avoid confusion with the use of preference in this paper in Section 4). Here Φ (x , x ) > 0 means that x should be better ranked than x for a given query. It is an anti-symmetric function, that is, Φ (x , x ) = −Φ (x , x ) and Φ (x, x) = 0, which avoids loops where two instances should both be ranked better than the other. Also Φ (x , x ) = 0 when there is no favor in how two instances should be relatively ranked. For example, as described in Section 6.1, for the face recognition combination problem described above the favor function can be used to specify that the correct identity should be given a better rank than all other identities, while zeroing all other entries in the favor function, giving no favor in how incorrect identities are ranked between them. In a similar fashion for the information retrieval combination task mentioned above, the favor function can be specified such that all relevant documents should be better ranked than irrelevant documents, without specifying favor for the ordering between relevant documents and the ordering between irrelevant documents (Section 7.1). Rankboost is shown in Algorithm 1 as described in Freund et al. (2003). It uses the favor function to specify an initial weight function over instance pair orderings: D x ,x = c · max 0, Φ x , x −1 , (2) where c = ∑x ,x max (0, Φ (x , x )) . The algorithm itself is very similar to adaboost (Freund and Schapire, 1996). At each iteration the algorithm selects the weak learner that best maximizes a 792 C ONCAVE L EARNERS FOR R ANKBOOST Algorithm 1 rankboost algorithm for generating a ranking function. Input: constituent ranks, a favor function, and a class of weak learners with outputs between 0 and 1 and an appropriate training algorithm. Output: ranking function H(x) (Eq. 1) Initialize D(1) = D (Eq. 2) for t = 1 . . . T do Find weak learner, ht , that maximizes r(h) = ∑x ,x D(t) (x , x )(h(x ) − h(x )) (Eq. 4). Choose wt = 0.5 ln ((1 + r(ht )) / (1 − r(ht ))) . (Eq. 5) Update: D(t+1) (x , x ) = D(t) (x , x ) exp (wt (ht (x ) − ht (x ))) Zt where Zt is chosen to make ∑x ,x D(t+1) (x , x ) = 1 end for rank function’s utility, r(h), (Eq. 4). It then assigns it a weight in proportion to its performance and adds the weighted weak learner to the ranking function, H(x). After which the weight function D is adjusted to reflect the impact of the weak learner. Freund et al. (2003) prove that the rankboost algorithm iteratively minimizes the ranking loss function, a measure of the quantity of misranked pairs: ∑ D x ,x I H x ≤ H x x ,x where I is the indicator function (1 if true, 0 otherwise) and H(x) is a ranking function output by rankboost. 
The rankboost paper (Freund et al., 2003) uses binary weak learners of the following functional form: i f f j (x) > θ, 1 0 i f f j (x) ≤ θ, (3) h(x) = qde f i f f j (x) unde f ined. Each binary weak learner, h, operates on a single f j (ranking feature), giving a binary output of 0 or 1 depending on whether the instance’s rank is smaller than a learned parameter, θ. Note that function h(x) in (3), which is dependent on only one f j (x) with a fixed j, is monotonic increasing but not convex or concave. If there is no rank for the instance then it returns a prespecified q de f value. 3. Rankboost, the Weak Learner and Regularization While rankboost inherits the accuracy of adaboost and has been shown to be very successful in practice, in two important ways it is very different from adaboost and similar classifier leveraging algorithms (Meir and Ratsch, 2003). The first is the choice of the weak learner, h. In adaboost the weak learner is expected to minimize the weighted empirical classification error: N ∑ d (t) (i)I [yi = h (zi )] , i=1 where yi is the class label, I is the indicator function and d (t) is a weighting over training samples. This is a standard function to minimize in classification with many possible types of algorithms to 793 M ELNIK , VARDI AND Z HANG choose from as possible weak learners. In contrast the weak ranking learner for rankboost (with outputs in [0, 1]) needs to maximize the following rank utility function: r = r(h) = ∑ D(t) (x , x )(h(x ) − h(x )), (4) x ,x where D(t) is the weight function over pairs of instances. As Eq. 4 contains the term h(x ) − h(x ), short of linear learners this equation can not be concave in h or easily approximated by a concave function. Therefore the weak learner needs to be optimized either by heuristics or by brute force, which limits the choice of h. It is not surprising that the original rankboost paper only included one type of weak learner, a binary threshold function (Eq. 3) that was tuned using brute force. The second difference between rankboost and adaboost also concerns the weak ranking learner. One feature of boosting that has sparked a great deal of interesting research is the algorithm’s ability to avoid overfitting for low noise classification problems (with modifications to higher noise problems), see Meir and Ratsch (2003) for a survey. In contrast for rankboost it is only by limiting the type of underlying weak learners that overfitting is avoided. In their paper, Freund et al. (2003) show that not using weak ranking learners with cumulative positive coefficients leads to overfitting and poor performance quite quickly. Therefore, choosing a weak learner that regularizes the ranking function, the output of rankboost, is very important for achieving accuracy and avoiding overfitting. It is clear from the above discussion that the choice of a weak ranking learner for rankboost is important and non trivial. First, the learner must be efficiently tunable with respect to Eq. 4, typically limiting its complexity. Second, the learner itself must demonstrate regularization properties that are appropriate for ranking functions. In this paper we present a new weak ranking learner that enforces consistency in preference and confidence for the ranking function by being monotonic and concave. We start with a discussion of these regularization properties, theoretically justify them, and show what they mean in terms of the final ranking function. 
Then we present a new learner, Minimum Weighted Group Ranks (MWGR), that satisfies these properties and can be readily learned. This learner is tested and compared with the binary learner of rankboost on combining multiple face recognition systems from the FERET study (Phillips et al., 2000) and on an information retrieval combination task from TREC (Voorhees and Harman, 2001).

4. Regularizing Ranking Functions with Consistency in Preference and Confidence

In this paper, as in Freund et al. (2003), we consider ranking functions H(x) which depend on x only through the values of the ranking features, y_j = f_j(x), for that instance, so that H(x) = G(f_1(x), . . . , f_n(x)) for certain functions G(y_1, . . . , y_n) = G(y). Here we assume that each f_j(x) is an actual rank assigned to an instance, x, by the j-th ranker. Note that if the original data are numerical scores then they can easily be converted to rankings. Freund et al. (2003) make a strong case for converting raw measures to relative orderings (rankings) rather than combining measures directly, arguing that it is more general and avoids the semantics of particular measures.

As the y_j's are ranks instead of points in a general space, care should be taken as to the functional form of G. A great deal of literature in social choice theory (Arrow, 1951) revolves around the properties of various rank combination strategies that try to achieve fair rankings. In this machine learning setting our motivations are different. Fairness is not the goal; the goal is to improve the accuracy or performance of the ranking function. Thus regularization, by functionally constraining G, is used to confer information on how to interpret ranks in order to ultimately improve accuracy.

Freund et al. (2003) imposed the regularization principle of monotonicity on the ranking function. Monotonicity encompasses a belief that for individual features a smaller rank is always considered better than a bigger rank. It means that for every two rank vectors, a and b, if a_j ≤ b_j for j = 1, . . . , n, then G(a) ≤ G(b). Monotonicity was enforced by requiring that the coefficients, w_t in Eq. 1, combining the binary weak learners (Eq. 3) be cumulatively positive. As shown by Freund et al. (2003), without enforcing monotonicity the rankboost algorithm quickly overfits.

In this section we present another regularization principle, consistency in preference and confidence (which includes monotonicity). A ranking function with this regularization property should be consistent in its preference for the individual rankers and also in how it captures their confidence. The following example illustrates these two concepts.

4.1 Grocer Example

A grocer needs to make two decisions: to decide between stocking oat bran vs. granola, and to decide between stocking turnips vs. radishes. The grocer asks her consultants to express their opinions about stocking each item and, based on their responses, makes her two decisions. First of all, in either problem, the grocer will adopt the opinion of her consultants if they all agree with each other, for example if they all prefer granola over oat bran in the first problem. Let's assume the grocer considered the first problem and chose granola over oat bran. What this implies is that the grocer adopted the opinions of the consultants that preferred granola over oat bran. Now consider the turnips vs. radishes decision.
Let's say that the same consultants that liked granola more also liked radishes more (and the same ones that liked oat bran more liked turnips more). Also, for this decision the radish lovers actually feel more confident in their choice than they did for granola, while the turnip lovers are more unsure than they were for oat bran. Then for this second choice, if the grocer is consistent in her method of combining the opinions of her consultants, we would expect her to pick radishes over the turnips. In addition, we would expect her to be more confident in this second decision as a reflection of the consultants' relative confidence.

The above properties of preference and confidence imply that preference should be applied consistently across different problems and that confidence should reflect the relative confidences of the individual consultants.

4.2 The General Principle of Consistency in Preference and Confidence

To generalize the above grocer's problem, consider the problem of combining the opinions of a panel of consultants on several multiple-choice decision problems. Suppose for each decision problem each consultant provides his preference as one of the possible choices and a qualitative measure of the confidence level for his preference. The consistency principle in preference and confidence holds if the combiner always agrees with at least one of the consultants in each decision problem and the following condition holds for any pair of decision problems.

Definition 1. Consistency principle for a pair of decision problems: Suppose that for the first decision problem, the combiner adopts the preference of a subset A of consultants, in the sense that A is the (non-empty) set of all consultants whose preference is identical to that of the combiner. Suppose that for the second decision problem, the preference of the consultants in A is again identical. Suppose further that, compared with the confidence level for the first decision problem, each consultant in A has higher or the same confidence level for his preference in the second problem, while each consultant with a different preference than A for the second problem has lower or the same confidence level. Then the preference of the combiner for the second problem is identical to that of the consultants in A, and the confidence level of the combiner for the second problem is at least as high as his confidence level for the first problem.

Let B be the set of consultants whose preferences are different from that of the combiner in the first decision problem, that is, B = A^c. If some members of B switch sides or have no preference in the second problem, the combiner would be even more confident about the adoption of the preference of A in the second problem, regardless of the confidence of those who switch sides. Thus, Definition 1 requires only those against the preference of A in the second problem (necessarily a subset of B, since members of A act in unison in both problems) to have lower or equal confidence in the second problem, instead of all members of B. This is taken into consideration in Theorem 1 below.

Note that while preference for individual consultants is specified within each decision problem, two decision problems are needed to compare the qualitative measure of confidence. However, comparison of confidence is not always possible (it is a partial ordering).
In particular, the levels of confidence of different experts may not be comparable, and the levels of confidence of a given expert (or the combiner) for different decision problems are not always comparable.

4.3 Confidence for Ranks and Ranking Functions

In order to apply consistency to rank combination we need to specify what we mean by more or less confidence. For ranks we make the assumption that a constant change of ranks requires more confidence at low ranks than at high ranks. For example, we would expect the difference between ranks 1 and 3 to represent a more significant distinction on the part of a ranker than, say, the difference between ranks 34 and 36, which may be almost arbitrarily assigned. Specifically we make the following definition:

Definition 2. Preference and confidence for univariate rank values: For any pair of rank values {r, r′} ⊂ R with r < r′, by monotonicity r is preferred. For any two pairs of rank values {r, r′} and {s, s′} with r < r′ and s ≤ s′, the confidence level for {r, r′} is higher than that of {s, s′} if r ≤ s and s′ − s ≤ r′ − r, with at least one of the inequalities strict. The pair {r, r′} provides no preference if r = r′. Likewise, if either r − s = r′ − s′ = 0 (the pairs are identical) or r − r′ = s − s′ = 0 (neither pair provides a preference), the two pairs {r, r′} and {s, s′} are defined to have the same level of confidence.

In this definition, confidence between pairs of ranks can only be compared if the pair with the lower rank has a gap at least as large as the other pair. Thus we are actually comparing two numbers, that is, we are doing a vector comparison. As such, this comparison does not cover all pairs of ranks and imposes only a partial ordering on rank pairs. As a regularization principle, a partial ordering is desirable since it does not overly constrain the ranking function and allows for flexibility.

As the ranking function, G, is real-valued, we apply the same definitions of preference and confidence to its values. That is, for a ranking function G and any fixed rank vectors y and y′, the preferred rank vector is the one with smaller G; that is, y is preferred iff G(y) < G(y′). For confidence we again need to consider pairs of rank vectors, {y, y′} and {z, z′}, with G(y) ≤ G(y′) and G(z) ≤ G(z′). If G(y) ≤ G(z) and G(y′) − G(y) ≥ G(z′) − G(z), then we say that the first decision, between y and y′, shows equal or more confidence than the second decision, between z and z′.

This numerical method of capturing confidence and preference in ranks, y, and ranking functions, G, allows us to apply Definition 1. Specifically, for a pair of binary decisions the confidence of a consistent ranking function increases for the second decision if, in the second decision, the confidences of the agreeing rank values are comparable and increase and the confidences of the disagreeing rank values are comparable and decrease, with the exception of those that switched sides.

4.4 Three-Point Consistency in Preference and Confidence for Combining Ranks

As described, the consistency property is defined over two binary decision problems. In this section we consider ranking functions, G, that have consistency in preference and confidence for all pairs of binary decision problems involving three rank vectors y, y′ and y″. We show that such functions have specific mathematical properties under this three-point consistency principle in preference and confidence for ranks.
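Before stating the theorem, the following small sketch illustrates the partial ordering of Definition 2 on rank pairs; the function name and the None-for-incomparable convention are ours, not the paper's.

```python
# Sketch of the partial order in Definition 2: pair1 = (r, r') with r <= r' has confidence
# at least as high as pair2 = (s, s') when r <= s and r' - r >= s' - s.
def confidence_at_least(pair1, pair2):
    (r, rp), (s, sp) = sorted(pair1), sorted(pair2)
    if r <= s and (rp - r) >= (sp - s):
        return True           # pair1 is at least as confident as pair2
    if s <= r and (sp - s) >= (rp - r):
        return False          # pair2 is at least as confident as pair1
    return None               # not comparable under the partial ordering

# e.g., confidence_at_least((1, 3), (34, 36)) -> True  (same gap, lower ranks)
#       confidence_at_least((1, 2), (10, 15)) -> None  (not comparable)
```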
Theorem 1: For a ranking function G whose domain is convex, three-point consistency in preference and confidence holds for G iff G(y) is nondecreasing in the individual components of y and is jointly concave in y.

Proof of Necessity: Assume that the ranking function G exhibits three-point consistency in preference and confidence for any three rank vectors. If y ≤ y′ componentwise, then the individual components all agree in their preference for y over y′, so that G(y) ≤ G(y′) by the consistency of G in preference. This implies the monotonicity of G in y.

We pick three points y, y − a and y + a, such that all three are rank vectors, and consider the pair of binary comparison problems y vs. y + a and y − a vs. y. Assume that G(y) < G(y + a). These comparisons are comparable by Definition 2 (the two pairs have equal componentwise gaps, a). Since the ranks of the agreeing components, A = {j : y_j < y_j + a_j}, are larger in the y vs. y + a comparison than in the y − a vs. y comparison, while the ranks of the disagreeing components (those outside A) are smaller, the consistency of the combiner in preference and confidence (Definition 1) gives G(y − a) ≤ G(y) and G(y) − G(y − a) ≥ G(y + a) − G(y).

A function G is concave on a convex domain iff 2G(y) ≥ G(y − a) + G(y + a) for every y and a with y ± a in the domain. To verify this inequality for a particular y and a, one of three cases must hold: G(y + a) > G(y), G(y − a) > G(y), or the third case of G(y) ≥ max{G(y + a), G(y − a)}. We have already shown that the consistency properties of preference and confidence imply G(y) − G(y − a) ≥ G(y + a) − G(y) in the first case. By symmetry the same argument with −a in place of a gives G(y) − G(y + a) ≥ G(y − a) − G(y) in the second case. This completes the necessity part of the proof, since 2G(y) ≥ G(y − a) + G(y + a) holds automatically in the third case.

Proof of Sufficiency: Assume the ranking function G is nondecreasing in the individual components of y and jointly concave in y. We need to prove consistency in preference and confidence for a pair of decision problems involving three rank vectors y, y′ and y″. Without loss of generality, suppose that the first problem is to choose between y and y′ and the second problem is to choose between y and y″. We break the proof up by the result of the comparison between y and y′ and the relative location of the third rank vector y″ among those satisfying the preference and confidence requirements. If G(y) = G(y′) = G(y″), the combiner has equal confidence in the two decision problems by Definition 2, so that the consistency principle holds automatically. Thus, we only need to consider the cases where G(y) ≠ G(y′).

[Figure 1: Two decision problems in two dimensions involving three points, y, y′ and y″, where the first decision problem is to choose between y and y′ and the second problem is to choose between y and y″. The cases where we expect consistency in preference and confidence (Definitions 1 and 2) can be enumerated by the result of the first decision problem and the relative location of y″; the three panels show the cases G(y) < G(y′) with y_A ≤ y″_A, G(y) < G(y′) with y_A ≥ y″_A, and G(y) > G(y′) with y_A ≥ y″_A, with the agreeing component A on the vertical axis and the disagreeing component B on the horizontal axis. The darker gray box is the location where B has lesser or equal confidence in the second problem, and the lighter gray box is where B switches sides.]
Figure 1 illustrates three of these cases two-dimensionally, where there is a single agreeing component A, on the vertical axis, and a single disagreeing component B, on the horizontal axis. Let A be the set of agreeing indices, A = {j : sgn(y′_j − y_j) = sgn(G(y′) − G(y)) ≠ 0}, and let B = A^c be the set of disagreeing indices. We use the notation y_A = (y_j, j ∈ A) to denote the corresponding subvectors. Also, when used, vector inequalities are componentwise.

Case 1: G(y) < G(y′) and y_A ≤ y″_A. In the first decision problem, as G agrees with A and disagrees with B,

y_A < y′_A,  y_B ≥ y′_B.

We also know that A ≠ ∅, since otherwise y ≥ y′ and therefore, by monotonicity, G(y) ≥ G(y′), which is a contradiction. For the second decision problem we consider the values of y″ which are consistent with the confidence assumption (Definitions 1 and 2), that is, agreeing with more confidence along the A indices and disagreeing with less confidence, or switching sides, along the B indices. As y_A ≤ y″_A, either by the confidence relationship or by switching sides (see Figure 1) these y″ values satisfy

y″_A − y_A ≥ y′_A − y_A,  y_B − y″_B ≤ y_B − y′_B,

which implies y′ ≤ y″, and therefore G(y′) ≤ G(y″) by monotonicity. Thus G(y) < G(y′) ≤ G(y″), which implies that in the second decision problem of choosing between y and y″, the preference of G is the same as that of A (i.e., y, since y_A ≤ y″_A) and the confidence of G is at least as high as in the first decision problem.

Case 2: G(y) < G(y′) and y_A ≥ y″_A. Since in this case y″_A ≤ y_A, either by the confidence relationship or by switching sides y″ satisfies

y_A − y″_A ≥ y′_A − y_A,  y″_B − y_B ≤ y_B − y′_B.

This implies that y′ + y″ ≤ 2y, which means that G(y′) + G(y″) ≤ 2G((y′ + y″)/2) ≤ 2G(y) by the concavity and monotonicity of G. Thus G(y) − G(y″) ≥ G(y′) − G(y), and since G(y′) > G(y) we have G(y) − G(y″) > 0 and thus G(y) > G(y″). Therefore the preference of G in the two decision problems is the same as that of A (with y in the first problem and with y″ in the second problem), and the confidence is no smaller for the second comparison.

Case 3: G(y) > G(y′) and y_A ≤ y″_A. Since y′ is preferred in the first decision problem we have

y_A > y′_A,  y_B ≤ y′_B.

For the confidence assumption to hold, that is, greater confidence for A in the second problem (Definition 2), the smaller ranks of A in the second problem have to be at most the smaller ranks in the first problem, that is, y_A ≤ y′_A. But with y_A ≥ y′_A and y_A ≤ y′_A this can only hold if y_A = y′_A, which leads to y ≤ y′ and, by monotonicity, implies G(y) ≤ G(y′). However, that contradicts G(y) > G(y′), and therefore this is not a viable case for comparing confidence.

Case 4: G(y) > G(y′) and y_A ≥ y″_A. Since in this case y″_A ≤ y_A, either by the confidence relationship or by switching sides y″ satisfies

y_A − y″_A ≥ y_A − y′_A,  y″_B − y_B ≤ y′_B − y_B,

which means that y″ ≤ y′. Thus, by the monotonicity of G, G(y″) ≤ G(y′). Since G(y′) < G(y), the preference of G in the second decision problem is also with A (i.e., y″) and the confidence is at least as high as in the first decision problem.

4.5 Applying Regularization to Rankboost

It follows from the above theorem that to have consistency in preference and confidence we desire ranking functions that are monotonic and concave. In rankboost, H(x) = ∑_t w_t h_t(x) (Eq. 1). To make H an increasing and concave function of the constituent rankings y_j = f_j(x), we need to constrain the weak ranking learners.
If the learners themselves are monotonically increasing and concave functions of the y_j, then linearly combining them with positive weights gives an H that is also an increasing and concave function of the y_j.

In this paper we apply the "third method" of Freund et al. (2003) for setting the weight value w_t. That is, weak learners are selected on their ability to maximize r from Eq. 4, and then

w_t = 0.5 ln((1 + r_max) / (1 − r_max)).   (5)

Therefore, using monotonic and concave weak learners, we select only those to which rankboost gives a positive r value, which yields a positive weight w_t. If no r values are positive, the rankboost algorithm stops.

We mention that a ranking function can be construed as the negative of a score function, that is, G(y) = −S(y). For score functions these regularization properties become monotonically decreasing and convex (Melnik et al., 2004).

5. Minimum Weighted Group Ranks

The functional structure of the new learner we propose is

h(y) = min{γ_1 y_1, . . . , γ_n y_n, 1},   (6)

where y = (y_1, . . . , y_n) = (f_1(x), . . . , f_n(x)) is the vector of ranking features and the γ_j are learned positive coefficients. Note that the function's range is in [0, 1] due to the 1 term in the min function. Using rankings as our features, the learner function (Eq. 6) is monotonically increasing. It is also a concave function of y. Thus, if these learners are linearly combined with positive weights, the resulting ranking function, H, will have three-point consistency in confidence and preference.

To gain some intuition, the functional form of the learner is related to the Highest Rank combination method (Ho, 1992), which assigns combined ranks as the best rank each class receives from any ranker. This is a confidence-based approach, as Highest Rank bets on the classifier that is most confident, the one giving the best rank. As such, a single learner can potentially be error prone. But as we combine many of these learners during boosting, the combination becomes more powerful, allowing the confidence of different classifiers to be weighed with preference for their potential accuracy.

5.1 Learning

At each boosting iteration, rather than selecting from all possible weak learners of the form (Eq. 6), we limit our choices by building new weak learners out of the ones that have been previously trained. Let F = {f_1(x), . . . , f_n(x)} be the set of ranking features. Recall that in rankboost H(x) = ∑_t w_t h_t(x), where h_t(x) = min{γ_1^(t) f_1(x), . . . , γ_n^(t) f_n(x), 1} and the γ_j^(t) are learned coefficients. We set

H_t = { h(x) : h(x) = min{γ_1^(s) f_1(x), . . . , γ_n^(s) f_n(x)}, s ≤ t }

and select h_{t+1}(x) from weak learners of the form

h_new(x) = min{αh(x), βf(x), 1}   (7)

with h(x) ∈ H_t and f(x) ∈ F. This learner can be rewritten in the form of Eq. 6, with the γ_j^(t+1) derived from the learned α, β, h(x) and f(x). Thus, at each iteration we look at combinations of the features and the existing learners. As discussed in the next section, we can either consider all such combinations or use a heuristic selection strategy.

We propose to optimize α and β separately, in stages. That is, given a value for one of the variables we optimize the other. As α and β play symmetric roles, we show how to optimize β given α. We need to find a value of β that maximizes r in Eq. 4. Freund et al. (2003) pointed out that this equation can be rewritten as:

r = ∑_{x_0, x_1} D^(t)(x_0, x_1)(h(x_0) − h(x_1)) = ∑_x π(x)h(x),

where π(x) = ∑_{x′} ( D^(t)(x, x′) − D^(t)(x′, x) ).
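Before continuing with the optimization of β, here is a short sketch of the MWGR learner of Eq. 6 and of the rewritten utility r = ∑_x π(x)h(x). It assumes the coefficients γ are already given (Section 5.1 builds them incrementally via Eq. 7); the function names are ours.

```python
# Sketch of the MWGR weak learner, h(y) = min{gamma_1 y_1, ..., gamma_n y_n, 1} (Eq. 6),
# and of the rank utility written as r = sum_x pi(x) h(x). Illustrative only.
import numpy as np

def mwgr(Y, gammas):
    """Y: m x n matrix of ranks (one row per instance); gammas: length-n positive coefficients."""
    return np.minimum((Y * gammas).min(axis=1), 1.0)

def rank_utility(D, h_values):
    pi = D.sum(axis=1) - D.sum(axis=0)     # pi(x) = sum_{x'} [D(x, x') - D(x', x)]
    return np.dot(pi, h_values)
```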
Given the form of Eq. 7, we can write r as a function of β:

r(β) = ∑_{x : βf(x) ≤ min(αh(x), 1)} (βf(x) − 1) π(x) + ∑_{x : αh(x) ≤ min(βf(x), 1)} (αh(x) − 1) π(x),   (8)

where instances for which 1 is the smallest of the three terms contribute nothing (since ∑_x π(x) = 0, subtracting 1 inside the minimum does not change the maximizing β). For a fixed α, r(β) is piecewise linear in β with breakpoints at the candidate values β_l = min(αh(x_l), 1)/f(x_l), one per training instance x_l, so its maximum is attained at one of these candidates. Sweeping through the sorted candidates, the sums in Eq. 8 can be maintained incrementally: writing W = 1 if αh(x_l) < 1 and W = 0 otherwise,

O(β_{l+1}) = O(β_l) − π(x_l),  P(β_{l+1}) = P(β_l) − f(x_l)π(x_l),
Q(β_{l+1}) = Q(β_l) + W π(x_l),  R(β_{l+1}) = R(β_l) + W αh(x_l)π(x_l),

where O and P accumulate π(x) and f(x)π(x) over the instances still in the first sum of Eq. 8, while Q and R accumulate π(x) and αh(x)π(x) over the instances that have moved to the second sum, so that r(β_l) = β_l P − O + R − Q. Combining these formulas gives Algorithm 2 for optimizing β.

Algorithm 2: Algorithm for optimizing β.
  Given α, h(x) ∈ H_t, f(x) ∈ F and the training instances.
  For all x_l generate and sort the set of candidate βs, B, such that β_1 ≤ β_2 ≤ · · · ≤ β_|B|, where β_l = min(αh(x_l), 1) / f(x_l).
  O = ∑_{x_l} π(x_l)
  P = ∑_{x_l} f(x_l) π(x_l)
  Q = 0
  R = 0
  r_best = 0
  for l = 1 . . . |B| do
    r = β_l P − O + R − Q
    if r > r_best then r_best = r end if
    O = O − π(x_l), P = P − f(x_l) π(x_l)
    if αh(x_l) < 1 then Q = Q + π(x_l), R = R + αh(x_l) π(x_l) end if
  end for

5.2 Heuristics for Learner Selection

If at each boosting iteration we select from all combinations of h(x) and f(x), we end up with an O(T²) algorithm, where T is the number of iterations. However, we can sacrifice accuracy for speed by only evaluating and selecting from a fixed-size pool of previously trained learners h(x) and features f(x), where at each iteration the pool is randomly chosen from the full H_t and F.

To improve performance, instead of using a uniform sampling distribution we can apply a heuristic that focuses the distribution on combinations with better potential. As Eq. 8 is composed of two sums, for r to be large the terms f(x)π(x) and h(x)π(x) need to be large. We can therefore consider s_f = ∑_x f(x)π(x) and s_h = ∑_x h(x)π(x) as indicators of how well these components work with π(x), and we might expect larger r values to occur when these two scores are larger. Of course, this discounts interactions, which are the reason for the combination. Using these scores, we can order all h(x) and all f(x) separately and sample such that learners and features with better scores are more likely to be selected. We opted for a polynomial weighting method. Thus, for example, all f ∈ F are sorted by their score and each is assigned a number based on the rank of its score, (maxrank − rank)/maxrank, which gives each f an equally sized bin in the range [0, 1]. Given a random number ξ ∼ U(0, 1), we calculate ξ^p and select the f that corresponds to the bin this number falls into. Here p < 1 can be construed as a selection pressure, where bins corresponding to higher scores are more likely to be selected. Figure 2 demonstrates the effect of different values of the p parameter in one of our experiments.

[Figure 2: Convergence of the training error (average rank of the correct class on the dup II training data) as a function of the iteration number, for three values of the heuristic selection pressure p (p = 1, 1/2, 1/4). These plots are averages over 10 runs and are typical of the other training data sets as well. As can be seen, the heuristic improves the convergence rate of the training error.]
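The following is a minimal sketch of the score-biased sampling heuristic just described, under our reading of the binning scheme (better scores map to bins near 1, and ξ^p with p < 1 is biased toward 1); the function name and signature are ours.

```python
# Sketch of the selection-pressure sampling of Section 5.2. Candidates are sorted by score,
# mapped to equal-sized bins in [0, 1] (better scores near 1), and an index is drawn as
# xi**p with xi ~ U(0, 1); p < 1 biases the draw toward the better-scoring bins.
import numpy as np

def sample_by_score(scores, p=0.5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(scores)              # ascending: the last entry has the best (largest) score
    u = rng.uniform() ** p                  # p < 1 biases u toward 1, i.e., toward better scores
    k = min(int(u * len(scores)), len(scores) - 1)
    return order[k]                         # index of the selected learner or feature
```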
6. Face Recognition Experiments

We present experiments on the combination of face recognizers, comparing the binary learner with the MWGR learner. Given an image of a face, a face recognizer assigns a similarity score to each of the faces it is trained to recognize. These scores give a linear order, or ranking, to the gallery of faces. Different face recognition algorithms have different performance characteristics. Thus, we explore how combining the outputs of multiple face recognizers can improve recognition accuracy.

6.1 Algorithm Methods

We consider a data set I of face images (queries) to train on. For each query image i in I, we need to rank all u ∈ U faces in the gallery. In rankboost the favor function, Φ, which specifies the output loss, is a function of the query and the item to be ranked. Therefore, the notational convention is to combine the query, i, and the item to be ranked, u, into an instance, x ≡ (i, u). As such, f_j(x) = f_j((i, u)) is the rank assigned to identity u for query image i by recognition algorithm j.

As there is only one correct identity, we only care about the ranking of the one correct identity for each query. We set the favor function as stated by Freund et al. (2003) for this type of output loss. Let u* be the correct identity for training image i; then Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1 for all u ≠ u*, setting all remaining elements of Φ(x_0, x_1) to 0. That is, the correct identity of a query image is given positive favor compared to all other identities for that image, while all other rankings, including interactions between training images, are given zero favor. Note that since there is no favor interaction between queries (different i's), Φ(x_0, x_1) is effectively a function of three variables, (i, u_0, u_1).

Both weak learners were trained for 100 iterations, giving them ample time to converge. See Figure 3 for an illustration of convergence times. The binary learner was trained as specified by Freund et al. (2003). At each iteration the MWGR learner was selected from a pool of candidate combinations of f(x) and h(x), with a selection pressure of p = 0.5. For each candidate, first β was optimized with α = 1, then α was optimized using the optimized β. The candidate with the most positive r value was always selected. This is summarized in Algorithm 3.

[Figure 3: Error convergence of rankboost with MWGR on the FERET data (average rank of the correct class vs. iteration number; training error on dup II and test error on dup I). This plot is typical of the convergence behavior: both training and test errors tended to converge within 10–30 iterations of boosting with no significant post-convergence divergence.]

Algorithm 3: Algorithm for selecting a learner from the pool.
  Given poolsize and the selection pressure.
  Calculate s_f for every rank feature and s_h for every existing learner.
  for l = 1 . . . poolsize do
    Generate d_f ∼ U(0, 1) and d_h ∼ U(0, 1)
    Select which h_new = min{αh(x), βf(x), 1} to try, using s_f, s_h, d_f, d_h.
    Setting α = 1, optimize β (Algorithm 2)
    Keeping the optimized β, optimize α (Algorithm 2)
    Get the r of the learner with this α, β
    if r > 0 and r > r_best then
      r_best = r, h_best = min{αh(x), βf(x), 1}, ĥ_best = min{αh(x), βf(x)}
    end if
  end for
  h_t = h_best, H_{t+1} = H_t ∪ {ĥ_best}
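Before turning to the experimental setup, the following is a minimal sketch of how the favor function and initial pair weights described in Section 6.1 could be constructed for the identification task, restricted to a single query and with identities flattened to integer indices; the function name is ours and the per-query normalization is a simplifying assumption.

```python
# Sketch of the favor function of Section 6.1: for one query, the correct identity is
# favored over every other identity, and all other pairs receive zero favor.
import numpy as np

def identification_weights(num_identities, correct):
    """Initial D over instance pairs for a single query (Eq. 2), given the correct identity."""
    phi = np.zeros((num_identities, num_identities))
    for u in range(num_identities):
        if u != correct:
            phi[u, correct] = +1.0     # the correct identity should outrank identity u
            phi[correct, u] = -1.0     # anti-symmetry
    D = np.maximum(phi, 0.0)
    return D / D.sum()                 # normalization here is per query, for illustration
```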
6.2 Experimental Setup

FERET (Phillips et al., 2000) was a government-sponsored program for the evaluation of face recognition algorithms. In this program commercial and academic algorithms were evaluated on their ability to differentiate between 1,196 individuals. The test consisted of different data sets of varying difficulty, for a total of 3,816 different images. The data sets, in order of perceived difficulty, are: the fafb data set of 1,195 images, which consists of pictures taken the same day with different facial expressions; the fafc data set of 194 images, which contains pictures taken with different cameras and lighting conditions; the dup I data set of 488 images, which has duplicate pictures taken within a year of the initial photo; and the most difficult, the dup II data set of 234 images, which contains duplicate pictures taken more than a year later. Note that in our experiments we separate the images of dup II from the dup I data set, unlike the FERET study, where dup II was also a subset of dup I.

The FERET study evaluated 10 baseline and proprietary face recognition algorithms. The baseline algorithms consisted of a correlation-based method and a number of eigenfaces (principal components) methods that differ in the internal metric they use. Of the proprietary algorithms, most were from different academic institutions and one was commercial. Of the 10 algorithms we selected three dominant algorithms. From the baseline algorithms we chose the ANM algorithm, which uses a Mahalanobis distance variation on angular distances for eigenfaces (Moon and Phillips, 2001). While this algorithm's performance is not distinctive, within the class of baseline algorithms it was strong. Moreover, in accuracy with respect to the average rank of the correct class on the dup I data set it demonstrated superior performance to all other algorithms. The other two algorithms we used were the University of Maryland's 1997 test submission (UMD) and the University of Southern California's 1997 test submission (USC). These algorithms clearly outperformed the other algorithms. UMD is based on a discriminant analysis of eigenfaces (Zhao et al., 1998), and USC is an elastic bunch graph matching approach (Wiskott et al., 1997).

The outputs of the 10 face recognizers on the four FERET data sets, fafb, fafc, dup I and dup II, were the data for the experiments. Thus, we never had access to the actual classifiers, only to data on how they ranked the different faces in these data sets. We conducted experiments based on homogeneous and heterogeneous data sets, testing the efficiency and robustness (adaptivity) of the MWGR procedure. For the homogeneous case we took all four FERET data sets and randomly shuffled them together. We call this the homogeneous data set, as both the training and testing data are selected from the same combined pool. On this combined data set we did 4-fold cross validation. For each fold 75% of the data was used for training and the rest for testing. We combined the results of all four runs together for evaluation purposes.

For the heterogeneous case, in each experiment one of the FERET data sets was selected as a training set and another data set was selected for testing. This gave 12 experiments (not including training and testing on the same data set) per group of face recognizers, covering combinations of training on easy data sets and testing on hard data sets, training on hard and testing on easy data sets, and training and testing on hard data sets. To reduce noise in our experiments the training ranks were truncated at 150, and outliers were removed. In face recognition and other classification applications usually only the top ranks are important. Thus, in evaluating the results we focused on the top 30 ranks.
All ranks above 30 returned by a boosted combiner for the correct class were truncated to 30. In evaluating the performance of the combiners, not all the test data are equally useful. We consider the following two cases as non-informative. When the two best face recognizers, UMD and USC, both give the correct class a rank of 1, there is very little reason for the combined rank to be different. Also, when both the binary-learner-based combiner and the MWGR-based combiner give the correct class a rank greater than the truncation value (30), it makes little sense to compare the combiners. The testing data was filtered to remove these cases, and the results are presented without them.

Before presenting the results, it should be said that rankboost with both learner types gives ranking functions that significantly outperform all the individual face recognition algorithms. In addition, in our tests both learners also clearly outperformed other standard rank combination methods, such as the Borda count, highest rank and logistic regression (Ho et al., 1992).

We present two sets of experiments: the combination of the 3 selected classifiers (ANM, UMD and USC) and the combination of all 10 classifiers. These are qualitatively different tasks. In combining 3, we seek to capitalize on the unique strengths of each classifier. In combining 10, we are introducing classifiers which may be correlated and classifiers which are comparably much noisier. The size of the pool was 6 when we combined 3 classifiers and 20 when combining all 10 classifiers.

For all experiments we measure the average rank of the correct class for both learners:

A = (1/N) ∑_{i=1}^{N} min{Rank(x_i*), 30},

where Rank(x_i*) is the rank of the correct class for query i and the sum is over all useful test queries, as described above. The average rank difference between the learners is calculated to show the improvement in performance. To evaluate the significance of the improvement we ran paired one-sided t-tests and evaluated the significance of the p-value (a value less than 0.05 is significant). In addition we show the standard deviation of the rank difference.
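As a concrete illustration of this evaluation procedure, the following sketch computes the truncated average rank A and the paired statistics reported in the tables; the function names are ours and the one-sided t statistic is computed directly rather than with a statistics library.

```python
# Sketch of the evaluation measure of Section 6.2: the average rank of the correct class,
# truncated at 30, plus the paired comparison statistics (mean difference, its standard
# deviation, and a one-sided paired t statistic). Illustrative only.
import numpy as np

def truncated_average_rank(ranks, cap=30):
    return np.minimum(np.asarray(ranks, dtype=float), cap).mean()

def paired_comparison(bin_ranks, mwgr_ranks, cap=30):
    d = (np.minimum(np.asarray(bin_ranks, dtype=float), cap)
         - np.minimum(np.asarray(mwgr_ranks, dtype=float), cap))
    mean_diff = d.mean()                            # positive values favor MWGR
    std_diff = d.std(ddof=1)
    t = mean_diff / (std_diff / np.sqrt(len(d)))    # one-sided paired t statistic
    return mean_diff, std_diff, t
```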
6.3 Results

In the experiments with the homogeneous data sets, combining all 10 classifiers gives an improvement in the average rank of the correct class of 0.296 for MWGR, with a standard deviation of 3.9, a paired t-test statistic of 1.76 and a p-value of 0.039, whereas combining the ANM, UMD and USC classifiers gives an average rank improvement of 0.1 for MWGR, with a standard deviation of 3.6, a paired t-test statistic of 0.68 and a p-value of 0.24.

           ANM     ARL    EFAVG   EFML1   EFML2   EXCA    MSU     RUT     UMD     USC
  dup i    9.52   16.85   15.02   18.23   15.85   11.9    17.64   17.87   13.21   12.44
  dup ii  18.14   16.96   20.61   22.21   20.33   16.28   23.44   16.48   13.74    6.85
  fafb    10.88    8.94   14.43   16.23   12.21   11.67    6.81   12.93    4.57    5.84
  fafc    19.71   28.02   26.17   17.39   19.78   20.30   14.27   22.79    8.96    5.17

  Table 1: The best average rank of the correct class on the different data sets for all constituent face recognition systems.

Table 2 contains the results for combining the 3 classifiers in the experiments with heterogeneous data sets (compare the combiner results in the columns bin mean and mwgr mean with the average rank of the correct class for all constituent classifiers in Table 1). The diff mean column contains the improvement of MWGR over the binary learner in terms of the average rank of the correct class. Of the 12 experiments, we see an improvement in 10 cases. Six of those 10 have significant p-values. The two experiments without improvement do not have significant p-values.

  test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
  dup i     dup ii     7.19      5.96       1.22       5.22      .7e-4
  dup i     fafb       7.36      6.21       1.14       5.15      .001
  dup i     fafc       8.31      6.27       2.04       5.51      .1e-6
  dup ii    dup i      5.25      5.38       -.12       4.0       .658
  dup ii    fafb       5.15      4.4        0.74       4.61      .018
  dup ii    fafc       5.19      4.95       0.23       5.1       .275
  fafb      dup i      2.62      2.22       0.39       3.09      .115
  fafb      dup ii     3.38      2.60       0.78       3.79      .029
  fafb      fafc       2.60      2.28       0.32       2.81      .142
  fafc      dup i      2.85      2.78       0.06       3.33      .424
  fafc      dup ii     2.40      2.43       -.03       2.27      .537
  fafc      fafb       2.56      2.03       0.52       2.57      .028

  Table 2: Results of combining the ANM, UMD, and USC classifiers using individual FERET data sets.

Table 3 contains the results for combining all 10 classifiers. Of the 12 experiments, we see an improvement for MWGR in 11 cases. Eight of those 11 have significant p-values (a ninth, at p = .074, is marginal). The one experiment without improvement does not have a significant p-value. It is interesting to note that we do not seem to see overfitting when increasing the number of constituents. In some cases we see improvement, in others we see slight degradation, but all in all the combiner seems resilient to the noise of adding less informative constituents.

  test set  train set  bin mean  mwgr mean  diff mean  diff std  pval
  dup i     dup ii     6.73      5.89       0.84       4.79      .009
  dup i     fafb       8.06      7.09       0.96       6.01      .014
  dup i     fafc       6.68      4.87       1.81       5.24      .5e-6
  dup ii    dup i      5.75      5.67       0.08       4.61      .408
  dup ii    fafb       6.24      5.47       0.76       4.22      .009
  dup ii    fafc       5.86      5.31       0.55       4.87      .074
  fafb      dup i      2.67      2.07       0.59       2.97      .033
  fafb      dup ii     3.56      2.45       1.11       4.51      .012
  fafb      fafc       2.68      2.17       0.51       2.03      .01
  fafc      dup i      3.36      2.51       0.85       3.23      .007
  fafc      dup ii     3.17      2.95       0.22       3.95      .3
  fafc      fafb       2.22      2.23       -0.01      2.08      .52

  Table 3: Results of combining all 10 classifiers using individual FERET data sets.

All sets of experiments, on homogeneous data and on heterogeneous data sets, combining the 3 select recognizers and combining all 10 recognizers at once, yielded significant improvements in accuracy, as is visible in the change in the average rank of the correct class and the significance of the statistical tests.

7. Information Retrieval Experiments

The annual Text REtrieval Conference (TREC) generates high-quality retrieval results from different systems on different retrieval tasks (Voorhees and Harman, 2001). We use the result data sets of the TREC-2001 web ad hoc task, which uses actual web queries taken from web logs. This task has been used in other rank fusion experiments (Renda and Straccia, 2003). As did Renda and Straccia (2003), we combine the results of the following top 12 systems: iit01m, ok10wtnd1, csiro0mwa1, flabxtd, UniNEn7d, fub01be2, hum01tdlx, JuruFull, kuadhoc2001, ricMM, jscbtawtl4, apl10wd.

Similar to other TREC information retrieval tasks, the TREC-2001 ad hoc task consists of 50 queries. For each query a list of relevant documents is supplied. Each system returns an ordered list of 1,000 documents for each query. The fusion goal is to combine these 12 individual lists into one list of 1,000 documents which hopefully has greater precision than the individual systems. For the 50 queries of the TREC-2001 ad hoc task, the number of relevant documents that intersect with the union of the system results ranges between 662 and 2,664.

7.1 Methods

In the information retrieval task the favor function needs to show favor for relevant documents while disregarding other documents.
Thus, we can set the favor function similarly to the way it was set in the face recognition classification task. Consider a data set with I queries to train on. For each query i in I, we need to rank all u ∈ U documents. An instance in this case is a pair, x ≡ (i, u). For each relevant document u* and irrelevant document u for query i, we set Φ((i, u), (i, u*)) = +1 and Φ((i, u*), (i, u)) = −1, setting all remaining elements of Φ(x_0, x_1) to 0. Thus, all relevant documents are given positive favor with respect to irrelevant documents, while all other rankings, including interactions between relevant documents and interactions between irrelevant documents, are given zero favor.

As the task has only 50 queries, rather than separating the data into train and test sets, we opted to do a cross-validation performance evaluation. We use 5-fold cross validation on the normalized mean average precision measure (Salton and McGill, 1983), a standard IR measure which accounts for the precision (quantity of correct documents retrieved) and their rank positions in the list. It is described by the following equation:

AveP = [∑_{r=1}^{N} Prec(r) × rel(r)] / (number of relevant documents),

where r is the rank, N is the number of documents retrieved, rel(r) is a binary function of the relevance of the document at rank r, and Prec(r) is the precision ([number of relevant documents retrieved] / [number of documents retrieved]) at cut-off rank r. It is clear that higher values of AveP are better. Note that for purposes of normalization we assigned unretrieved relevant documents a constant rank.

  System        NMAP
  JuruFull      0.759
  hum01tdlx     0.760
  UniNEn7d      0.763
  iit01m        0.762
  apl10wd       0.735
  jscbtawtl4    0.761
  csiro0mwa1    0.721
  kuadhoc2001   0.727
  flabxtd       0.741
  ok10wtnd1     0.745
  fub01be2      0.734
  ricMM         0.765

  Table 4: Normalized mean average precision for each constituent.

As the binary and MWGR weak learners had significantly different convergence properties on this task (see Figure 4), MWGR was trained for 100 iterations and the binary learner for 300 iterations. As in the FERET experiments, MWGR was optimized using Algorithm 3, with a selection pressure of p = 0.5.

7.2 Results

As seen in Figure 4, the MWGR learner converges in significantly fewer iterations than the binary learner. This could possibly be attributed to the fact that MWGR is a more complex function that can incorporate the rankings of multiple classifiers in each learner. Also, the MWGR function is not tuned to particular rank cutoffs, whereas the binary learner is, so MWGR can better accommodate the variety in the 1,000 ranks being considered. The normalized mean average precision for MWGR after 100 iterations was 0.8537, and it was 0.8508 for the binary learner after 300 iterations. Compare these results with the precision of the constituents in Table 4. Both weak learners had a performance rate of change of approximately 3 × 10^−5 on their final iteration (better for MWGR). A paired t-test on the cross-validation results of the two learners gives a statistically significant p-value of 0.007 in favor of MWGR.

[Figure 4: The cross-validation mean average precision score of the two weak learners, MWGR and binary, as a function of the boosting iteration number.]
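For reference, the following is a small sketch of the per-query average precision measure defined in Section 7.1; the handling of unretrieved relevant documents is simplified (they only enter through the denominator), and the function name is ours.

```python
# Sketch of the average precision measure of Section 7.1 for a single query: ranked_docs is
# the returned list in rank order and relevant is the set of relevant documents.
def average_precision(ranked_docs, relevant):
    hits, precision_sum = 0, 0.0
    for r, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:                 # rel(r) = 1
            hits += 1
            precision_sum += hits / r       # Prec(r) at cut-off rank r
    return precision_sum / len(relevant) if relevant else 0.0

# e.g., average_precision(["d3", "d7", "d1"], {"d3", "d1"}) == (1/1 + 2/3) / 2
```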
8. Discussion

The question of how to combine ordinal data has become an active focus of research in machine learning, as applications in pattern recognition, information retrieval and other domains have come to the forefront. A particular question of importance is how the structure of ranks can be correctly exploited to maximize performance.

The semiparametric nature of rankboost offers the possibility of generating arbitrarily flexible ranking functions. But, as observed, this flexibility comes at the cost of significant overfitting without further regularization. Freund et al. (2003) demonstrate that successful generalization only occurs when the resulting ranking functions are constrained to be monotonic. This constraint can be thought of as a regularization that incorporates prior knowledge on the interpretation of ranks and, as such, on how they can be combined.

We present a regularization framework based on the concept of consistency in confidence and preference. Ranking functions with this property show consistency in how they treat the preference and relative confidence exhibited by their constituents. We prove that, under a natural interpretation of preference and confidence for ranks, this consistency property of the combiner is equivalent to monotonicity and concavity of its ranking function.

We enhance rankboost by designing a weak ranking learner that exhibits consistency in preference and confidence. A computational advantage of this weak learner, called minimum weighted group ranks (MWGR), is that its parameters can be readily optimized individually with respect to the rankboost criterion, allowing it to be tested on real-world data.

In our first experiments we compare the original rankboost binary weak learner with MWGR on a task of combining the output of multiple face recognition algorithms from the FERET study. We conducted experiments on homogeneous data, testing the intrinsic efficiency of the MWGR procedure. We also conducted experiments on heterogeneous data, testing the robustness, or adaptivity, of the procedure. In almost all cases we see that MWGR shows improved performance compared with the binary weak learner, whether combining three or all of the face recognizers, confirming the utility of this monotonic and concave learner.

Our second experiment was on an information retrieval task taken from the TREC conference. In this task MWGR converges in significantly fewer iterations and generates a statistically significant improvement in performance.

Final Words

Ofer Melnik and Cun-Hui Zhang are very saddened that our colleague and friend Yehuda Vardi passed away before he could give this paper his final stamp of approval. He was very enthusiastic about this research and would have been very pleased to see it come to fruition.

Acknowledgments

This research is partially supported by NSF Grants DMS 04-05202, DMS 05-04387 and DMS 0604571, ONR Grant N00014-02-1-056 and NSA Grant H98230-04-1-0041. Ofer Melnik would also like to thank DIMACS for their support in this research. The authors are also grateful to the editor and reviewers for their constructive suggestions, which improved the presentation of the results.

Appendix A.

In this paper we showed how three-point consistency in preference and confidence implies concave and monotonic ranking functions.
For two decision problems involving two pairs of rank vectors, the four-point consistency property implies the following constraints for a ranking function G: if

G(y) < G(y′),  y′_B − y_B ≤ 0 < y′_A − y_A,
z_A ≤ y_A,  y′_A − y_A ≤ z′_A − z_A,
y′_B ≤ z′_B,  z_B − z′_B ≤ y_B − y′_B,

then

G(z′) − G(z) > G(y′) − G(y),   (11)

where y and y′ are the rank vectors of the first decision problem and z and z′ are the rank vectors of the second decision problem. Unlike the three-point case, the four-point consistency property does not imply a clearly recognizable functional form for the ranking function. What we can say is that, since the constraints are linear, in the same way that concavity and monotonicity of the weak learners conferred those properties on the ranking function, a weak learner that satisfies Eq. 11 will also confer this property on a ranking function that uses it with positive weights.

References

K. Arrow. Social Choice and Individual Values. Wiley, 1951.

C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. 10th Intl. World Wide Web Conf., pages 613–622, 2001.

Y. Freund, R.E. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.

Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156, 1996.

T. K. Ho. A Theory of Multiple Classifier Systems and Its Application to Visual Word Recognition. PhD thesis, State University of New York at Buffalo, May 1992.

T. K. Ho, J. J. Hull, and S. N. Srihari. Combination of decisions by multiple classifiers. In H. S. Baird, H. Bunke, and K. Yamamoto, editors, Structured Document Image Analysis, pages 188–202. Springer-Verlag, Heidelberg, 1992.

J. Kittler and F. Roli, editors. Multiple Classifier Systems. Lecture Notes in Computer Science 1857. Springer, 2000.

R. Meir and G. Ratsch. An introduction to boosting and leveraging. In Advanced Lectures in Machine Learning, Lecture Notes in Computer Science 2600, pages 119–184. Springer, 2003.

O. Melnik, Y. Vardi, and C-H. Zhang. Mixed group ranks: Preference and confidence in classifier combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):973–981, 2004.

H. Moon and P.J. Phillips. Computational and performance aspects of PCA-based face-recognition algorithms. Perception, 30:303–321, 2001.

P.J. Phillips, H. Moon, S.A. Rizvi, and P.J. Rauss. The FERET evaluation methodology for face-recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:1090–1104, 2000.

M.E. Renda and U. Straccia. Web metasearch: Rank vs. score based rank aggregation methods. In 18th Annual ACM Symposium on Applied Computing (SAC-03), pages 841–846, Melbourne, Florida, USA, 2003. ACM.

G. Salton and J.M. McGill. Introduction to Modern Information Retrieval. Addison Wesley Publ. Co., 1983.

E.M. Voorhees and D.K. Harman, editors. NIST Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001), number SN003-003-03750-8, 2001. Department of Commerce, National Institute of Standards and Technology, Government Printing Office. URL http://trec.nist.gov.

L. Wiskott, J.-M. Fellous, N. Kruger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(7):775–779, 1997.

W. Zhao, A. Krishnaswamy, R. Chellappa, D. Swets, and J. Weng.
Face Recognition: From Theory to Applications, chapter Discriminant Analysis of Principal Components, pages 73–86. Springer-Verlag, Berlin, 1998.