nips nips2006 nips2006-89 knowledge-graph by maker-knowledge-mining

89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising


Source: pdf

Author: Sandeep Pandey, Christopher Olston

Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. [sent-5, score-0.432]

2 Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. [sent-6, score-0.505]

3 Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. [sent-8, score-0.405]

4 In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. [sent-10, score-0.667]

5 We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. [sent-11, score-0.786]

6 Under the standard “pay-per-click” arrangement, search engines earn revenue by displaying appealing advertisements that attract user clicks. [sent-15, score-0.562]

7 The search engine runs a large auction where each advertiser submits its bids to the search engine for the query phrases in which it is interested. [sent-26, score-0.6]

8 Advertiser Ai submits advertisement ai,j to target query phrase Qj, and promises to pay an amount bi,j for each click on this advertisement, where bi,j is Ai's bid for ai,j. [sent-27, score-0.855]

9 Advertiser Ai can also specify a daily budget (di ) that is the total amount of money it is willing to pay for the clicks on its advertisements in a day. [sent-28, score-0.58]

10 Given a user search query on phrase Qj , the search engine selects a constant number C ≥ 1 of advertisements from the candidate set of advertisements {a∗,j }, targeted to Qj . [sent-29, score-0.914]
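
To make this auction model concrete, a minimal data model might look like the Python sketch below. The class and field names (Advertiser, Ad, daily_budget, displays) are illustrative assumptions, not notation from the paper, but the fields mirror the bid bi,j, the daily budget di, and the counts needed to estimate CTRs. Later sketches in this summary reuse these two classes.

```python
from dataclasses import dataclass

@dataclass
class Advertiser:
    """Advertiser A_i with an (optional) daily budget d_i."""
    name: str
    daily_budget: float = float("inf")  # d_i; infinity means no budget constraint
    spent_today: float = 0.0            # revenue already charged to A_i today


@dataclass
class Ad:
    """Advertisement a_{i,j}: advertiser A_i targets query phrase Q_j with bid b_{i,j}."""
    advertiser: Advertiser
    query_phrase: str
    bid: float         # b_{i,j}: amount paid per click
    clicks: int = 0    # clicks observed so far
    displays: int = 0  # n_{i,j}: number of times displayed so far

    @property
    def ctr_estimate(self) -> float:
        """Empirical CTR estimate (0 if the ad has never been displayed)."""
        return self.clicks / self.displays if self.displays else 0.0
```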

11 The objective in selecting advertisements is to maximize the search engine’s total daily revenue. [sent-30, score-0.367]

12 For now we assume that each day a new set of advertisements is given to the search engine and the set remains fixed throughout the day; we drop both of these assumptions later in Section 4. [sent-32, score-0.392]

13 High revenue is achieved by displaying advertisements that have high bids as well as high likelihood of being clicked on by users. [sent-35, score-0.524]

14 Formally, the click-through rate (CTR) ci,j of advertisement ai,j is the probability that a user clicks on ai,j given that it was displayed to the user for query phrase Qj. [sent-36, score-1.156]

15 In the absence of budget constraints, revenue is maximized by displaying advertisements with the highest ci,j · bi,j value. [sent-37, score-0.698]

16 The work of [8] showed how to maximize revenue in the presence of budget constraints, but under the assumption that all CTRs are known in advance. [sent-38, score-0.464]

17 In this paper we tackle the more difficult but realistic problem of maximizing advertisement revenue when CTRs are not necessarily known at the outset, and must be learned on the fly. [sent-39, score-0.46]

18 GREEDY refers to selection of advertisements according to expected revenue (i.e., the product of estimated CTR and bid). [sent-41, score-0.424]
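
For reference, a minimal sketch of this GREEDY rule: rank the candidate ads for a query phrase by estimated expected revenue (estimated CTR times bid) and show the top C. It reuses the illustrative Ad class from the earlier sketch; the function name is ours.

```python
def greedy_select(candidates, C=1):
    """GREEDY: show the C candidate ads with the highest estimated expected
    revenue (estimated CTR times bid); no exploration bonus."""
    return sorted(candidates, key=lambda ad: ad.ctr_estimate * ad.bid, reverse=True)[:C]
```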

19 In this paper we give the first policies for Cells V and VI, where we must choose which advertisements to display while simultaneously estimating click-through rates of advertisements. [sent-48, score-0.373]

20 To maximize short-term revenue, the search engine should exploit its current, imperfect CTR estimates by displaying advertisements whose estimated CTRs are large. [sent-51, score-0.402]

21 This kind of exploration entails displaying advertisements whose current CTR estimates are of low confidence, which inevitably leads to displaying some low-CTR ads in the short-term. [sent-55, score-0.672]

22 This tradeoff arises, for example, in clinical trials, and has been extensively studied in the context of the multi-armed bandit problem [4]. [sent-58, score-0.36]

23 In this paper we draw upon and extend the existing bandit literature to solve the advertisement problem in the case of unknown CTR. [sent-59, score-0.608]

24 In particular, first in Section 3 we show that the unbudgeted variant of the problem (Cell V in Figure 2) is an instance of the multi-armed bandit problem. [sent-60, score-0.469]

25 Then, in Section 4 we introduce a new kind of bandit problem that we termed the budgeted multi-armed multi-bandit problem (BMMP), and show that the budgeted unknown-CTR advertisement problem (Cell VI) is an instance of BMMP. [sent-61, score-0.834]

26 Extensions include exploiting any prior information available about the CTRs of ads, and permitting advertisers to submit and revoke advertisements at any time, not just at day boundaries. [sent-66, score-0.369]

27 Reference [7] discusses how contextual information such as user demographic or ad topic can be used to estimate CTRs, and makes connections to the recommender and bandit problems, but stops short of presenting technical solutions. [sent-69, score-0.513]

28 The key problem addressed in [1] is to satisfy the contracts made with the advertisers in terms of the minimum guaranteed number of impressions (as opposed to the budget constraints in our problem). [sent-72, score-0.35]

29 Our unbudgeted problem is an instance of the multi-armed bandit problem [4], which is the following: we have K arms where each arm has an associated reward and payoff probability. [sent-77, score-0.766]

30 The payoff probability is not known to us while the reward may or may not be known (both versions of the bandit problem exist). [sent-78, score-0.446]

31 The objective is to determine a policy for activating the arms so as to maximize the total reward over some number of invocations. [sent-81, score-0.379]

32 To solve the unbudgeted unknown-CTR advertisement problem, we create a multi-armed bandit problem instance for each query phrase Q, where ads targeted for the query phrase are the arms, bid values are the rewards and CTRs are the payoff probabilities of the bandit instance. [sent-82, score-1.973]

33 Since there are no budget constraints, we can treat each query phrase independently and solve each bandit instance in isolation. [sent-83, score-0.883]

34 The number of invocations for a bandit instance is not known in advance because the number of queries of phrase Q in a given day is not known in advance. [sent-84, score-0.701]

35 A variety of policies have been proposed for the bandit problem, e.g., the upper-confidence-bound (UCB) policies of [3]. [sent-85, score-0.482]

36 A mistake occurs when a suboptimal arm is chosen by a policy (the optimal arm is the one with the highest expected reward). [sent-90, score-0.321]

37 Display the C ads targeted for Qj that have the highest priority. [sent-95, score-0.366]

38 The priority Pi,j of ad ai,j is a function of its current CTR estimate (ĉi,j), its bid value (bi,j), the number of times it has been displayed so far (ni,j), and the number of times phrase Qj has been queried so far in the day (nj). [sent-96, score-0.389]

39 Formally, priority Pi,j is defined as: Pi,j = (ĉi,j + √(2 ln nj / ni,j)) · bi,j if ni,j > 0, and Pi,j = ∞ otherwise. The conventional multi-armed bandit problem is defined for C = 1. [sent-97, score-0.498]
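
Reading the reconstructed priority rule literally, one plausible implementation of the MIX selection step is sketched below: a UCB1-style confidence bonus added to the CTR estimate and scaled by the bid, with never-displayed ads getting infinite priority. The function names and the max(n_j, 1) guard are our assumptions, not the authors' code.

```python
import math

def mix_priority(ad, n_j):
    """P_{i,j} = (ctr_estimate + sqrt(2 ln n_j / n_{i,j})) * b_{i,j},
    or infinity for an ad that has never been displayed."""
    if ad.displays == 0:
        return float("inf")
    bonus = math.sqrt(2.0 * math.log(max(n_j, 1)) / ad.displays)
    return (ad.ctr_estimate + bonus) * ad.bid

def mix_select(candidates, n_j, C=1):
    """Show the C ads targeted for Q_j with the highest priority
    (one pass over the candidate ads)."""
    return sorted(candidates, key=lambda ad: mix_priority(ad, n_j), reverse=True)[:C]
```

Note that greedy_select above is the special case of this rule with the confidence bonus removed.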

40 Importantly, the MIX policy is practical to implement because it can be evaluated efficiently using a single pass over the ads targeted for a query phrase. [sent-108, score-0.637]

41 Recall from Section 3 that in the absence of budget constraints, we were able to treat the bandit instance created for a query phrase independently of the other bandit instances. [sent-111, score-1.243]

42 However, budget constraints create dependencies between query phrases targeted by an advertiser. [sent-112, score-0.498]

43 To model this situation, we introduce a new kind of bandit problem that we call Budgeted Multi-armed Multi-bandit Problem (BMMP), in which multiple bandit instances are run in parallel under overarching budget constraints. [sent-113, score-0.972]

44 Budgeted Multi-armed Multi-bandit Problem: BMMP consists of a finite set of multi-armed bandit instances, B = {B1, B2, . . .}. [sent-120, score-0.36]

45 Each bandit instance Bi has a finite number of arms and associated rewards and payoff probabilities as described in Section 3. [sent-120, score-0.555]

46 Each type Ti ∈ T has budget di ∈ [0, ∞] which specifies the maximum amount of reward that can be generated by activating all the arms of that type. [sent-122, score-0.49]

47 Once the specified budget is reached for a type, the corresponding arms can still be activated but no further reward is earned. [sent-123, score-0.464]
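
A minimal sketch of that capping rule, under the assumption that the system simply stops crediting reward once a type's (here, an advertiser's) budget is exhausted; the helper name is ours.

```python
def credit_reward(advertiser, reward):
    """Credit a click's reward (the bid), capped so that the advertiser's
    total payment never exceeds its daily budget; any excess earns nothing."""
    remaining = max(advertiser.daily_budget - advertiser.spent_today, 0.0)
    earned = min(reward, remaining)
    advertiser.spent_today += earned
    return earned
```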

48 With each invocation of the bandit system, one bandit instance from B is invoked; the policy has no control over which bandit instance is invoked. [sent-124, score-1.399]

49 Then the policy activates C arms of the invoked bandit instance, and the activated arms generate some (possibly zero) total reward. [sent-125, score-0.832]

50 It is easy to see that the budgeted unknown-CTR advertisement problem is an instance of BMMP. [sent-126, score-0.389]

51 Each query phrase acts as a bandit instance and the ads targeted for it act as bandit arms, as described in Section 3. [sent-127, score-1.382]

52 Each advertiser defines a unique type of arms and gives a budget constraint for that type; all ads submitted by an advertiser belong to the type defined by it. [sent-128, score-0.93]

53 When a query is submitted by a user, the corresponding bandit instance is invoked. [sent-129, score-0.579]

54 We now show how to derive a policy for BMMP given as input a policy POL for the regular multi-armed bandit problem such as one of the policies from [3]. [sent-130, score-0.748]

55 Observe that in the second step of BPOL, when POL is restarted, POL loses any state it has built up, including any knowledge gained about the payoff probabilities of bandit arms. [sent-143, score-0.36]

56 In practice, since most bandit policies can take prior information about the payoff probabilities as input, when restarting POL we can supply the previous payoff probability estimates as the prior (as done in our experiments). [sent-145, score-0.497]
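
Putting these pieces together, the BPOL construction can be sketched as a thin wrapper around an arbitrary bandit policy POL, as below. The POL interface assumed here (choose, update, ctr_estimates), the per-phrase bookkeeping, and the restart-all simplification are our assumptions for illustration; the text only specifies that POL is restarted when a budget is depleted, with the previous CTR estimates supplied as a prior.

```python
class BPOL:
    """Sketch of the BPOL wrapper: one POL bandit policy per query phrase,
    restarted (seeded with its previous CTR estimates as a prior) whenever
    an advertiser's daily budget becomes depleted."""

    def __init__(self, make_policy):
        self.make_policy = make_policy  # assumed factory: prior -> fresh POL instance
        self.policies = {}              # query phrase -> POL instance
        self.depleted = set()           # advertisers already observed to be depleted

    def new_day(self):
        """Daily budgets reset at day boundaries; forget depletion flags."""
        self.depleted.clear()

    def select(self, phrase, candidates, C=1):
        pol = self.policies.setdefault(phrase, self.make_policy(prior={}))
        return pol.choose(candidates, C)

    def observe(self, phrase, ad, clicked):
        """Record the outcome of one display; returns the revenue earned."""
        ad.displays += 1
        earned = 0.0
        if clicked:
            ad.clicks += 1
            earned = credit_reward(ad.advertiser, ad.bid)
        self.policies[phrase].update(ad, clicked)
        adv = ad.advertiser
        if adv.spent_today >= adv.daily_budget and adv.name not in self.depleted:
            # Budget newly depleted: restart the POL instances, supplying
            # their current CTR estimates as the prior.
            self.depleted.add(adv.name)
            for p, old in list(self.policies.items()):
                self.policies[p] = self.make_policy(prior=old.ctr_estimates())
        return earned
```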

57 Performance Bound for BMMP Policies: Let S denote the sequence of bandit instances that are invoked, i.e., S = {S(1), S(2), . . . , S(N)}. [sent-147, score-0.385]

58 Here S(n) denotes the index of the bandit instance invoked at the nth invocation. [sent-152, score-0.469]

59 We compare the performance of BPOL with that of the optimal policy, denoted by OPT, where OPT has advance knowledge of S and the exact payoff probabilities of all bandit instances. [sent-153, score-0.379]

60 Since bandit arms generate rewards stochastically, it is not clear how we should compare BPOL and OPT. [sent-158, score-0.499]

61 For example, even if BPOL and OPT behave in exactly the same way (activate the same arm on each bandit invocation), we cannot guarantee that both will have the same total reward in the end. [sent-159, score-0.558]

62 To enable meaningful comparison, we define a payoff instance, denoted by I, such that I(i, n) denotes the reward generated by arm i of bandit instance S(n) for invocation n in payoff instance I. [sent-160, score-0.726]

63 Let B (I, n) and O(I, n) denote the arms of bandit instance S(n) activated under BPOL and OPT respectively. [sent-165, score-0.567]

64 Based on the different possibilities that can arise, we classify invocation n into one of three categories: • Category 1: The arm activated by OPT, O(I, n), is of smaller or equal expected reward in comparison to the arm activated by BPOL, B (I, n). [sent-166, score-0.416]

65 • Category 2: Arm O(I, n) is of greater expected reward than B (I, n), but O(I, n) is not available for BPOL to activate at invocation n due to budget restrictions. [sent-168, score-0.422]

66 Let bpol_k(N) and opt_k(N) denote the expected reward obtained during the invocations of category k (1, 2 or 3) by BPOL and OPT respectively. [sent-171, score-0.75]

67 In [9] we show that bpol_k(N) = Σ_{I} P(I) · Σ_{n ∈ N_k(I)} I(B(I, n), n), and similarly opt_k(N) = Σ_{I} P(I) · Σ_{n ∈ N_k(I)} I(O(I, n), n), where P(I) denotes the probability of payoff instance I and N_k(I) the set of category-k invocations in I. Then for each k we bound opt_k(N) in terms of bpol(N). [sent-172, score-1.136]

68 In [9] we provide proofs of each of the following bounds. Lemma 1: opt_1(N) ≤ bpol_1(N). [sent-173, score-0.568]

69 Lemma 2: opt_2(N) ≤ bpol(N) + |T| · rmax, where |T| denotes the number of arm types and rmax denotes the maximum reward. [sent-174, score-0.708]

70 From the above bounds we obtain our overall claim, Theorem 1: bpol(N) ≥ opt(N)/2 − O(f(N)), where bpol(N) and opt(N) denote the total expected reward obtained under BPOL and OPT respectively. [sent-176, score-0.78]

71 Proof: opt(N) = opt_1(N) + opt_2(N) + opt_3(N) ≤ bpol_1(N) + bpol(N) + |T| · rmax + O(f(N)) ≤ 2 · bpol(N) + O(f(N)). Hence, bpol(N) ≥ opt(N)/2 − O(f(N)). [sent-177, score-2.065]

72 If we supply MIX (Section 3) as input to our generic BPOL framework, we obtain BMIX, a policy for the budgeted unknown-CTR advertisement problem. [sent-178, score-0.481]

73 Due to the way MIX structures and maintains its internal state, it is not necessary to restart a MIX instance when an advertiser’s budget is depleted in BMIX, as specified in the generic BPOL framework (the exact steps of BMIX are given in [9]). [sent-179, score-0.332]

74 So far, for modeling purposes, we have assumed the search engine receives an entirely new batch of advertisements each day. [sent-180, score-0.33]

75 In fact, with a little care we can permit ads to be submitted and revoked at arbitrary times (not just at day boundaries). [sent-184, score-0.415]

76 In particular, we compare it with the greedy policy proposed for the known-CTR advertisement problem (Cells I–IV in Figure 2). [sent-188, score-0.47]

77 Internally, BMIX runs instances of MIX to select which ads to display. [sent-194, score-0.332]

78 It is shown in [8] that in the presence of budget constraints, it is beneficial to display the ads of an advertiser less often as the advertiser’s remaining budget decreases. [sent-199, score-0.927]

79 In particular, they propose to multiply bids from advertiser Ai by the following discount factor: φ(d̂i) = 1 − e^(−d̂i/di), where d̂i is the current remaining budget of advertiser Ai for the day and di is its total daily budget. [sent-200, score-0.787]
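
Taking the reconstructed discount factor at face value, this throttling heuristic can be combined with the MIX priority as in the sketch below; the exact composition with the priority, the zero-discount guard, and the treatment of unbudgeted advertisers are our assumptions about how a BMIX-T-style variant would apply the factor. Since the priority is linear in the bid, multiplying the priority by φ is equivalent to multiplying the bid by φ.

```python
import math

def msvv_discount(remaining, total):
    """phi = 1 - exp(-remaining/total): about 0.63 when the budget is
    untouched, decaying to 0 as the remaining budget approaches 0."""
    if math.isinf(total):
        return 1.0  # unbudgeted advertisers are never throttled
    return 1.0 - math.exp(-remaining / total)

def discounted_priority(ad, n_j):
    """BMIX-T-style score (assumed composition): the MIX priority with the
    bid effectively scaled by the advertiser's current discount factor."""
    adv = ad.advertiser
    remaining = max(adv.daily_budget - adv.spent_today, 0.0)
    disc = msvv_discount(remaining, adv.daily_budget)
    if disc == 0.0:
        return 0.0  # budget fully depleted: never prioritize this ad
    return disc * mix_priority(ad, n_j)
```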

80 Experiment Setup: We evaluate advertisement policies by conducting simulations over real-world data. [sent-204, score-0.37]

81 Since we have the frequency counts of these query phrases but not the actual order, we ran the simulations multiple times with random orderings of the query instances and report the average revenue in all our experiment results. [sent-207, score-0.559]

82 For each query phrase we have the list of advertisers interested in it and the ads submitted by them to Yahoo! [sent-209, score-0.667]

83 Roughly 60% of the advertisers in our data set impose daily budget constraints. [sent-212, score-0.396]

84 From these ads we selected those that have been displayed more than a thousand times, and therefore we have highly accurate CTR estimates for them. [sent-216, score-0.663]

85 (Although this method may introduce some skew compared with the (unknown) true distribution, it is the best we could do short of serving live ads just for the purpose of measuring CTRs). [sent-219, score-0.307]

86 Due to lack of space we consider a simple setting here where the set of ads is fixed and no prior information about CTR is available. [sent-221, score-0.307]

87 Exploration/Exploitation Tradeoff: We ran each of the policies for a time horizon of ten days; each policy carries over its CTR estimates from one day to the next. [sent-224, score-0.317]
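
A minimal driver for this kind of simulation is sketched below. It assumes the BPOL-style interface from the earlier sketches, a query log given as a list of phrases (replayed in random order, as described), and a Bernoulli click model driven by the measured CTRs; all of these are assumptions about the setup, not details from the paper.

```python
import random

def simulate(policy, query_log, ads_by_phrase, click_prob, days=10, C=1, seed=0):
    """Replay the query log over a multi-day horizon: daily budgets reset
    each day, while the policy keeps its CTR estimates across days."""
    rng = random.Random(seed)
    revenue = 0.0
    for _ in range(days):
        for ads in ads_by_phrase.values():  # reset daily budgets
            for ad in ads:
                ad.advertiser.spent_today = 0.0
        policy.new_day()
        queries = list(query_log)
        rng.shuffle(queries)                # random ordering of query instances
        for phrase in queries:
            shown = policy.select(phrase, ads_by_phrase[phrase], C)
            for ad in shown:
                clicked = rng.random() < click_prob(ad)  # Bernoulli click model
                revenue += policy.observe(phrase, ad, clicked)
    return revenue
```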

88 For now we fix the number of displayed ads (C) to 1. [sent-226, score-0.356]

89 Figure 3 plots the revenue generated by each policy after a given number of days (for confidentiality reasons we have changed the unit of revenue). [sent-227, score-0.369]

90 However, that does not happen because GREEDY immediately fixates on ads that are not very profitable (i. [sent-233, score-0.307]

91 Next we vary the number of ads displayed for each query (C). [sent-236, score-0.494]

92 Each policy earns more revenue when more ads are displayed (larger C). [sent-238, score-0.701]

93 In fact, GREEDY must display almost twice as many ads as BMIX-E to generate the same amount of revenue. [sent-240, score-0.346]

94 [Figure 3: Revenue generated by different advertisement policies (C=1); y-axis: total revenue, x-axis: time horizon in days (1–10); curves: GREEDY, BMIX, BMIX-T, BMIX-E, BMIX-ET.] [sent-241, score-0.794]

95 [Figure 4: Effect of C (number of ads displayed per query); y-axis: total revenue, x-axis: ads per query (C); same policies as Figure 3.] [sent-242, score-0.494]

96 Summary and Future Work: In this paper we studied how a search engine should select which ads to display in order to maximize revenue, when click-through rates are not initially known. [sent-243, score-0.489]

97 We dealt with the underlying exploration/exploitation tradeoff using multi-armed bandit theory. [sent-244, score-0.36]

98 In the process we contributed to bandit theory by proposing a new variant of the bandit problem that we call budgeted multi-armed multi-bandit problem (BMMP). [sent-245, score-0.805]

99 Practical extensions of our advertisement policies are given in the extended version of the paper. [sent-247, score-0.37]

100 Extensive experiments over real ad data demonstrate substantial revenue gains compared to a greedy strategy that has no provision for exploration. [sent-248, score-0.385]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bandit', 0.36), ('bpol', 0.338), ('ads', 0.307), ('advertisement', 0.248), ('opt', 0.23), ('budget', 0.227), ('ctr', 0.222), ('revenue', 0.212), ('advertisements', 0.212), ('payo', 0.175), ('ctrs', 0.148), ('query', 0.138), ('policy', 0.133), ('advertiser', 0.127), ('policies', 0.122), ('arms', 0.117), ('bmix', 0.116), ('bmmp', 0.106), ('phrase', 0.102), ('advertisers', 0.095), ('pol', 0.095), ('arm', 0.094), ('greedy', 0.089), ('reward', 0.086), ('budgeted', 0.085), ('ad', 0.084), ('engine', 0.08), ('daily', 0.074), ('invocations', 0.074), ('invocation', 0.074), ('mix', 0.07), ('priority', 0.064), ('exploitation', 0.063), ('day', 0.062), ('di', 0.06), ('targeted', 0.059), ('tradeo', 0.059), ('exploration', 0.059), ('instance', 0.056), ('advertising', 0.053), ('click', 0.053), ('invoked', 0.053), ('unbudgeted', 0.053), ('qj', 0.049), ('displayed', 0.049), ('displaying', 0.047), ('phrases', 0.046), ('budgets', 0.042), ('display', 0.039), ('search', 0.038), ('ln', 0.038), ('nj', 0.036), ('activate', 0.035), ('user', 0.035), ('activated', 0.034), ('mistakes', 0.034), ('bi', 0.033), ('appeal', 0.032), ('bids', 0.032), ('clicks', 0.032), ('depleted', 0.032), ('ucb', 0.032), ('constraints', 0.028), ('yahoo', 0.028), ('queries', 0.028), ('bid', 0.028), ('instances', 0.025), ('maximize', 0.025), ('submitted', 0.025), ('users', 0.025), ('days', 0.024), ('arrangement', 0.023), ('rmax', 0.023), ('erent', 0.023), ('cells', 0.023), ('ai', 0.023), ('category', 0.022), ('rewards', 0.022), ('banner', 0.021), ('clicked', 0.021), ('heuristical', 0.021), ('msvv', 0.021), ('olston', 0.021), ('pandey', 0.021), ('poli', 0.021), ('revoked', 0.021), ('submits', 0.021), ('vi', 0.021), ('cell', 0.02), ('advance', 0.019), ('engines', 0.018), ('recommender', 0.018), ('total', 0.018), ('ratio', 0.018), ('money', 0.017), ('geared', 0.017), ('restart', 0.017), ('contextual', 0.016), ('arrival', 0.016), ('supply', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

Author: Sandeep Pandey, Christopher Olston

Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1

2 0.13456988 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

3 0.11730297 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

4 0.1148407 186 nips-2006-Support Vector Machines on a Budget

Author: Ofer Dekel, Yoram Singer

Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1

5 0.087554 44 nips-2006-Bayesian Policy Gradient Algorithms

Author: Mohammad Ghavamzadeh, Yaakov Engel

Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1

6 0.064761423 154 nips-2006-Optimal Change-Detection and Spiking Neurons

7 0.062263448 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

8 0.057828315 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

9 0.056819245 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

10 0.046640951 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

11 0.046632931 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

12 0.044436168 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

13 0.043117132 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

14 0.035353079 122 nips-2006-Learning to parse images of articulated bodies

15 0.034426752 163 nips-2006-Prediction on a Graph with a Perceptron

16 0.034112994 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

17 0.032868352 130 nips-2006-Max-margin classification of incomplete data

18 0.030529773 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

19 0.028653985 129 nips-2006-Map-Reduce for Machine Learning on Multicore

20 0.024642445 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.084), (1, -0.015), (2, -0.123), (3, -0.116), (4, 0.021), (5, -0.032), (6, -0.008), (7, -0.019), (8, 0.18), (9, 0.004), (10, -0.01), (11, -0.027), (12, 0.007), (13, -0.033), (14, 0.015), (15, 0.029), (16, 0.016), (17, -0.024), (18, 0.055), (19, -0.058), (20, -0.023), (21, 0.055), (22, 0.041), (23, 0.037), (24, -0.01), (25, 0.082), (26, -0.14), (27, -0.027), (28, 0.027), (29, -0.07), (30, -0.005), (31, -0.032), (32, -0.037), (33, -0.158), (34, -0.019), (35, -0.042), (36, 0.029), (37, 0.023), (38, -0.083), (39, -0.053), (40, -0.077), (41, 0.052), (42, -0.075), (43, -0.089), (44, 0.073), (45, 0.009), (46, -0.122), (47, 0.005), (48, -0.22), (49, 0.129)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96853685 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

Author: Sandeep Pandey, Christopher Olston

Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1

2 0.53148609 125 nips-2006-Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

Author: Peter Auer, Ronald Ortner

Abstract: We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds for the algorithm’s online performance after some finite number of steps. In the spirit of similar methods already successfully applied for the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves logarithmic online regret in the number of steps taken with respect to an optimal policy. 1 1.1

3 0.52593124 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

Author: Peter L. Bartlett, Ambuj Tewari

Abstract: We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

4 0.45512509 44 nips-2006-Bayesian Policy Gradient Algorithms

Author: Mohammad Ghavamzadeh, Yaakov Engel

Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate this gradient. Since Monte Carlo methods tend to have high variance, a large number of samples is required, resulting in slow convergence. In this paper, we propose a Bayesian framework that models the policy gradient as a Gaussian process. This reduces the number of samples needed to obtain accurate gradient estimates. Moreover, estimates of the natural gradient as well as a measure of the uncertainty in the gradient estimates are provided at little extra cost. 1

5 0.41443944 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner

Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best fit between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically significant differences between different genetic strains and stress conditions. 1

6 0.35957694 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

7 0.32633552 186 nips-2006-Support Vector Machines on a Budget

8 0.32146472 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes

9 0.31216136 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

10 0.30980438 129 nips-2006-Map-Reduce for Machine Learning on Multicore

11 0.30266365 174 nips-2006-Similarity by Composition

12 0.2965261 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models

13 0.29082915 156 nips-2006-Ordinal Regression by Extended Binary Classification

14 0.28958076 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

15 0.28609726 137 nips-2006-Multi-Robot Negotiation: Approximating the Set of Subgame Perfect Equilibria in General-Sum Stochastic Games

16 0.28211769 114 nips-2006-Learning Time-Intensity Profiles of Human Activity using Non-Parametric Bayesian Models

17 0.27043572 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow

18 0.24797723 154 nips-2006-Optimal Change-Detection and Spiking Neurons

19 0.24105836 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints

20 0.23929679 155 nips-2006-Optimal Single-Class Classification Strategies


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.075), (6, 0.014), (7, 0.057), (9, 0.02), (22, 0.038), (44, 0.061), (57, 0.043), (65, 0.031), (69, 0.02), (71, 0.027), (99, 0.496)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81675434 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising

Author: Sandeep Pandey, Christopher Olston

Abstract: We consider how a search engine should select advertisements to display with search results, in order to maximize its revenue. Under the standard “pay-per-click” arrangement, revenue depends on how well the displayed advertisements appeal to users. The main difficulty stems from new advertisements whose degree of appeal has yet to be determined. Often the only reliable way of determining appeal is exploration via display to users, which detracts from exploitation of other advertisements known to have high appeal. Budget constraints and finite advertisement lifetimes make it necessary to explore as well as exploit. In this paper we study the tradeoff between exploration and exploitation, modeling advertisement placement as a multi-armed bandit problem. We extend traditional bandit formulations to account for budget constraints that occur in search engine advertising markets, and derive theoretical bounds on the performance of a family of algorithms. We measure empirical performance via extensive experiments over real-world data. 1

2 0.65367019 49 nips-2006-Causal inference in sensorimotor integration

Author: Konrad P. Körding, Joshua B. Tenenbaum

Abstract: Many recent studies analyze how data from different modalities can be combined. Often this is modeled as a system that optimally combines several sources of information about the same variable. However, it has long been realized that this information combining depends on the interpretation of the data. Two cues that are perceived by different modalities can have different causal relationships: (1) They can both have the same cause, in this case we should fully integrate both cues into a joint estimate. (2) They can have distinct causes, in which case information should be processed independently. In many cases we will not know if there is one joint cause or two independent causes that are responsible for the cues. Here we model this situation as a Bayesian estimation problem. We are thus able to explain some experiments on visual auditory cue combination as well as some experiments on visual proprioceptive cue integration. Our analysis shows that the problem solved by people when they combine cues to produce a movement is much more complicated than is usually assumed, because they need to infer the causal structure that is underlying their sensory experience.

3 0.54090053 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis

Author: Murat Dundar, Balaji Krishnapuram, R. B. Rao, Glenn M. Fung

Abstract: Many computer aided diagnosis (CAD) problems can be best modelled as a multiple-instance learning (MIL) problem with unbalanced data: i.e. , the training data typically consists of a few positive bags, and a very large number of negative instances. Existing MIL algorithms are much too computationally expensive for these datasets. We describe CH, a framework for learning a Convex Hull representation of multiple instances that is significantly faster than existing MIL algorithms. Our CH framework applies to any standard hyperplane-based learning algorithm, and for some algorithms, is guaranteed to find the global optimal solution. Experimental studies on two different CAD applications further demonstrate that the proposed algorithm significantly improves diagnostic accuracy when compared to both MIL and traditional classifiers. Although not designed for standard MIL problems (which have both positive and negative bags and relatively balanced datasets), comparisons against other MIL methods on benchmark problems also indicate that the proposed method is competitive with the state-of-the-art.

4 0.25415927 5 nips-2006-A Kernel Method for the Two-Sample-Problem

Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola

Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in O(m2 ) time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

5 0.25217444 20 nips-2006-Active learning for misspecified generalized linear models

Author: Francis R. Bach

Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1

6 0.25201187 65 nips-2006-Denoising and Dimension Reduction in Feature Space

7 0.24963586 99 nips-2006-Information Bottleneck Optimization and Independent Component Extraction with Spiking Neurons

8 0.24958983 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

9 0.2495856 117 nips-2006-Learning on Graph with Laplacian Regularization

10 0.24909015 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

11 0.24872692 175 nips-2006-Simplifying Mixture Models through Function Approximation

12 0.24861695 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

13 0.24746688 35 nips-2006-Approximate inference using planar graph decomposition

14 0.24643104 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

15 0.24626401 21 nips-2006-AdaBoost is Consistent

16 0.24623908 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

17 0.24602666 193 nips-2006-Tighter PAC-Bayes Bounds

18 0.24599819 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics

19 0.24539612 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines

20 0.24506375 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation