nips nips2011 nips2011-160 knowledge-graph by maker-knowledge-mining

160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

Source: pdf

Author: Yisong Yue, Carlos Guestrin

Abstract: Diversiﬁed retrieval and online learning are two core research areas in the design of modern information retrieval systems. In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-rich submodular utility models for diversiﬁed retrieval. We present an algorithm, called LSBG REEDY, and prove that it efﬁciently converges to a near-optimal model. As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. In a live user study, we found that LSBG REEDY signiﬁcantly outperforms existing online learning approaches. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Diversiﬁed retrieval and online learning are two core research areas in the design of modern information retrieval systems. [sent-4, score-0.234]

2 In this paper, we propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-rich submodular utility models for diversiﬁed retrieval. [sent-5, score-0.686]

3 As a case study, we applied our approach to the setting of personalized news recommendation, where the system must recommend small sets of news articles selected from tens of thousands of available articles each day. [sent-7, score-1.245]

4 In a live user study, we found that LSBG REEDY signiﬁcantly outperforms existing online learning approaches. [sent-8, score-0.232]

5 1 Introduction User feedback has become an invaluable source of training data for optimizing information retrieval systems in a rapidly expanding range of domains, most notably content recommendation (e. [sent-9, score-0.249]

6 When designing retrieval systems that adapt to user feedback, two important challenges arise. [sent-12, score-0.289]

7 First, the system should recommend optimally diversiﬁed content that maximizes coverage of the information the user ﬁnds interesting (to maximize positive feedback). [sent-13, score-0.377]

8 In most retrieval settings, the retrieval system must recommend sets of articles, rather than individual articles. [sent-16, score-0.277]

9 This is motivated by the principle that recommending redundant articles leads to diminishing returns on utility, since users need to consume redundant information only once. [sent-18, score-0.668]

10 This notion of diminishing returns is well-captured by submodular utility models, which have become an increasingly popular approach to modeling diversiﬁed retrieval tasks in recent years [24, 25, 18, 3, 21, 9, 16]. [sent-19, score-0.445]

11 In most retrieval settings, users typically only provide feedback on the articles recommended to them. [sent-21, score-0.755]

12 This partial feedback issue leads to an inherent tension between exploration and exploitation when deciding which articles to recommend to the user. [sent-22, score-0.683]

13 Furthermore, it is typically desirable to learn a feature-based model that can generalize to new or previously unseen articles and users; this is often called the contextual bandits problem [13, 15, 7]. [sent-23, score-0.578]

14 For instance, existing online approaches for optimizing submodular functions typically assume a featurefree model, and thus cannot generalize easily [18, 22, 23]. [sent-25, score-0.281]

15 Thus, they are not suitable for many retrieval settings since the set of available articles can change frequently (e. [sent-29, score-0.594]

16 We propose the linear submodular bandits problem, which is an online learning setting for optimizing a general class of feature-based 1 submodular utility models. [sent-33, score-0.686]

17 To make learning practical, we represent the beneﬁt of adding an article to an existing set of selected articles as a linear model with respect to the user’s preferences. [sent-34, score-0.597]

18 This class of models encompasses several existing information coverage utility models for diversiﬁed retrieval [24, 25, 9], and allows us to learn ﬂexible models that can generalize to new predictions. [sent-35, score-0.265]

19 Similar to the contextual bandits setting considered in [15], our setting can be characterized as a feature-based exploration-exploitation problem, where the uncertainty lies in how best to model user interests using the available features. [sent-36, score-0.326]

20 In contrast to [15], we aim to recommend optimally diversiﬁed sets of articles rather than just single articles. [sent-37, score-0.557]

21 When learning a d-dimensional model to recommend sets of L articles for T time steps, we prove that LSBG REEDY incurs rep gret that grows as O(d LT ) (ignoring log factors). [sent-40, score-0.557]

22 In addition to simulation experiments, we conducted a live user study over a period of ten rounds, where in each round the retrieval system must recommend a small set of news articles selected from tens of thousands of available articles for that round. [sent-43, score-1.434]

23 We compared against existing online learning approaches that either employ no exploration [9], or learn to recommend only single articles (and thus do not model diversity) [15]. [sent-44, score-0.654]

24 Suppose that news articles are represented using a set of d “topics” or “concepts” that we wish to cover (e. [sent-50, score-0.572]

25 1 Intuitively, recommending two articles that cover highly overlapping topics might not be more beneﬁcial than recommending just one of the articles – this is the notion of diminishing returns we wish to capture in our information coverage model. [sent-53, score-1.34]

26 A set function F mapping sets of recommended articles A to real values (e. [sent-55, score-0.549]

27 , the total information covered by A) is monotone and submodular if and only if F (A [ {a}) F (A) and F (A [ {a}) F (A) F (B [ {a}) F (B), respectively, for all articles a and sets A ✓ B. [sent-57, score-0.8]

28 Submodularity provides a natural framework for characterizing diminishing returns in information coverage, since the gain of adding a second (redundant) article on a topic will be smaller than the gain of adding the ﬁrst. [sent-59, score-0.286]

29 For each topic i, let Fi (A) be a monotone submodular function corresponding to how well the recommended articles A cover topic i. [sent-60, score-0.973]

30 Since sums of monotone submodular functions are themselves monotone submodular, this implies that F (A|w) is also monotone submodular (this would not hold if w has negative components). [sent-66, score-0.668]

31 This class of information utility models encompasses several existing models of information coverage for diversiﬁed retrieval [24, 25, 9]. [sent-68, score-0.265]

32 , submodular advantage) of topic i by article a, conditioned on articles A having already been selected. [sent-75, score-0.87]

33 Another attractive property of monotone submodular functions is that the myopic greedy algorithm is guaranteed to produce a near-optimal solution [17]. [sent-78, score-0.328]

34 3 Problem Formulation We propose the linear submodular bandits problem which is described in the following. [sent-86, score-0.322]

35 , T , our algorithm interacts with the user in the following way: • A set of articles At is made available to the algorithm. [sent-90, score-0.646]

36 Each article a 2 At is represented using a set of d basis coverage functions F1 , . [sent-91, score-0.231]

37 , at ), using the basis coverage functions of the articles and the outcomes of previous time steps. [sent-98, score-0.599]

38 , clicks on or ignores each article), and the rewards for each recommended articles rt (At ) (4) are observed. [sent-101, score-0.664]

39 We assume the user scans the recommended articles A = (a(1) , . [sent-103, score-0.721]

40 For each article a(`) , the user considers the new information covered by a(`) and not covered by the above articles A(1:` 1) (A(1:`) denotes the articles in the ﬁrst ` slots). [sent-107, score-1.272]

41 The user then clicks on (or likes) a(`) with independent probability > (w⇤ ) (a(`) |A(1:` 1) ), where w⇤ is the hidden preferences of the user. [sent-109, score-0.23]

42 Formally, for any set of articles A chosen at time t, the rewards rt (A) can be written as the sum of rewards at each slot, rt (A) = L X (`) rt (A). [sent-110, score-0.72]

43 We call this independence property conditional submodular independence, which we will leverage in 2 E. [sent-113, score-0.257]

44 , the topics and coverage probabilities can be derived from a topic model such as LDA [4]. [sent-115, score-0.223]

45 While conditional submodular independence may seem ideal, we will show in our user study experiments that it is not required for our proposed algorithm to achieve good performance. [sent-126, score-0.448]

46 Thus, E[rt ] is monotone submodular, and a clairvoyant system with perfect knowledge of w⇤ can greedily select articles to achieve (expected) reward at least (1 1/e)OP T , where OP T denotes the total expected reward of the optimal recommendations for t = 1, . [sent-128, score-0.629]

47 Let A⇤ denote the optimal set of articles t at time t. [sent-132, score-0.474]

48 To minimize regret (6), an algorithm must exploit its past experience to recommend sets of articles that appear to maximize information coverage. [sent-136, score-0.61]

49 In order to avoid this situation, the algorithm must explore by recommending articles about seemingly poor topics in order to gather more information about them. [sent-140, score-0.623]

50 Lines 3–5 in Algorithm 1 describe this step, (`) where ⌧ denotes the incremental coverage features of the article selected at time ⌧ and slot `, and (`) r⌧ denotes the associated reward. [sent-146, score-0.331]

51 Given wt , we can now estimate the incremental gain of adding any article a to an existing set of results A. [sent-149, score-0.227]

52 If our algorithm were to > purely exploit prior knowledge, then it would greedily choose articles that maximize wt (a|A). [sent-152, score-0.55]

53 Each day comprises 3 articles covering 4 topics, which are depicted in the two plots. [sent-159, score-0.587]

54 In day 1, LSBG REEDY recommends articles to explore topics 1, 2, and 3, and the user indicates liking a1 and disliking a2. [sent-161, score-0.874]

55 In day 2, LSBG REEDY recommends b1 to exploitatively cover topic 1, and b3 to both cover topic 1 and explore topic 4. [sent-162, score-0.325]

56 Our algorithm’s uncertainty in the gain of article a given set A depends directly to how much feedback we have collected regarding prominent topics in (a|A). [sent-166, score-0.285]

57 In our linear setting, uncertainty is measured using the inverse covariance matrix Mt 1 of the submodular features of the previously selected articles (Line 9). [sent-167, score-0.746]

58 If our algorithm were to purely explore, q then it would greedily select articles that have maximal uncertainty (a|A)> Mt 1 (a|A). [sent-168, score-0.521]

59 In order to achieve low regret, LSBG REEDY greedily selects articles that maximize a compromise between estimated gain and uncertainty (Line 10), with ↵t controlling the tradeoff. [sent-170, score-0.549]

60 This means that the average loss incurred per slot and per day by p LSBG REEDY relative to (1 1/e)OP T decreases at a rate of O(d/ T L). [sent-176, score-0.222]

61 [9, 15, 16]), where the system is tasked with recommending sets of articles that maximally cover the interesting information of the available articles. [sent-183, score-0.574]

62 , obey conditional submodular independence), our user study tests the effectiveness of our approach in settings beyond those considered in our theoretical analysis. [sent-190, score-0.455]

63 For the synthetic data, all articles 5 Figure 2: Simulation results comparing LSBG REEDY (red), RankLinUCB (black thick), Multiplicative Weighting (black thin), and ✏-Greedy (dashed thin). [sent-195, score-0.474]

64 were randomly generated using d = 25 topics, and w⇤ was randomly generated and re-scaled so the most likely articles were liked with probability ⇡ 75%. [sent-198, score-0.557]

65 For the blog dataset, articles are represented using d = 100 topics generated using Latent Dirichlet Allocation [4], and w⇤ was derived from a preliminary version of our user study. [sent-199, score-0.736]

66 Our simulated user behaves according to the user model described in Section 3. [sent-200, score-0.344]

67 We use probabilistic coverage (2) as the submodular basis functions. [sent-201, score-0.366]

68 Note that all learning algorithms use the same underlying submodular utility model. [sent-204, score-0.307]

69 We presented each user with ten articles per day over ten days from January 18, 2009 to January 27, 2009. [sent-222, score-0.796]

70 Each day, the articles are selected using an interleaving of two policies (described below). [sent-223, score-0.512]

71 The articles are displayed as a title with its contents viewable via a preview pane. [sent-224, score-0.474]

72 The user is instructed p One can show that RankLinUCB achieves greedy regret (6) that grows as O(dL T ) (ignoring log factors), p which is a factor L worse than the regret guarantee of LSBG REEDY. [sent-225, score-0.303]

73 4 6 Figure 3: Displaying normalized learned preferences of LSBG REEDY (dark) and MW (light) for two user study sessions. [sent-238, score-0.222]

74 In the right session, the user likes very few articles, and MW does not discover any topics that interest the user. [sent-240, score-0.271]

75 The parenthetical values in the last column are computed ignoring clicks on articles jointly recommended by both algorithms (see Section 5. [sent-246, score-0.611]

76 to brieﬂy skim each article to get a sense of its content and, one by one, mark each article as “interested in reading in detail” (like), or “not interested” (dislike). [sent-249, score-0.248]

77 As in [9], for each decision, the user is told to take into account the articles shown above in the current day, so as to capture the notion of incremental coverage. [sent-250, score-0.677]

78 For example, a user might be interested in reading an article regarding the Middle East appearing at the top slot, and would mark it as “interested. [sent-251, score-0.294]

79 ” However, if several very similar articles appear below it, the user may mark the subsequent articles as “not interested. [sent-252, score-1.136]

80 Interleaving allows us to make paired comparisons such that we simultaneously control for the particular user and particular day (certain days may contain more or less interesting content to the user than other days). [sent-255, score-0.498]

81 We created 18 genres (examples shown in Figure 3), labeled relevant articles and trained a model via linear regression for each genre. [sent-261, score-0.474]

82 Note that many articles are relevant to multiple genres. [sent-262, score-0.474]

83 For each user, we computed three statistics: (1) whether LSBG REEDY won, tied, or lost in terms of total number of liked articles, (2) the difference in liked articles per day, and (3) the fraction of liked articles recommended by LSBG REEDY. [sent-270, score-1.288]

84 Jointly recommended articles can be either counted as half to each algorithm or ignored (these results are shown in parentheticals in Table 1). [sent-271, score-0.549]

85 On average, LSBG REEDY obtains about one additional liked article per day and 63% of all liked articles versus the static baseline, and about half an additional liked article per day and 57% of all liked articles versus the two competing learning algorithms. [sent-273, score-1.825]

86 Since MW does not employ exploration, it can either overﬁt to its previous experience and not ﬁnd new topics that interest the user (left plot), or fail to discover any good topics (right plot). [sent-276, score-0.336]

87 We are chieﬂy interested in training ﬂexible submodular utility models, since such models yield practical algorithmic approaches. [sent-279, score-0.307]

88 Beyond submodular models of information coverage, other approaches include methods that balance relevance and novelty [5, 26, 6] and graph-based methods [27]. [sent-286, score-0.241]

89 It may be possible to develop more robust algorithms for our submodular bandits setting by building upon algorithms with more general guarantees (e. [sent-293, score-0.339]

90 Most set-based settings, such as bandit submodular optimization or the general bandit slate problem, assume a feature-free model [18, 22, 23, 12]. [sent-296, score-0.371]

91 However, it is unclear how to incorporate our submodular features into their setting. [sent-301, score-0.256]

92 Our approach requires access to submodular basis functions as features. [sent-303, score-0.264]

93 As such, one important direction for future work is to learn the appropriate basis features from user feedback, which is similar to the setting of interactive topic modeling [11]. [sent-307, score-0.276]

94 Moreover, user behavior is likely to be inﬂuenced by many factors beyond those well-modeled by submodular basis features. [sent-308, score-0.436]

95 For example, the probability of the user liking a certain article could be inﬂuenced by the time of day, or day of the week. [sent-309, score-0.415]

96 A more uniﬁed approach would be to incorporate both these standard features as well as submodular basis features in a joint model. [sent-310, score-0.294]

97 8 Conclusion We proposed an online learning setting for optimizing a general class of submodular functions. [sent-316, score-0.298]

98 This setting is well-suited for modeling diversiﬁed retrieval systems that interactively learn from user feedback. [sent-317, score-0.286]

99 We conducted simulations as well as user studies in the setting of news recommendation, and found that LSBG REEDY outperforms competing online learning approaches. [sent-319, score-0.361]

100 An analysis of the approximations for maximizing submodular set functions. [sent-429, score-0.241]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lsbg', 0.497), ('articles', 0.474), ('reedy', 0.45), ('submodular', 0.241), ('user', 0.172), ('diversi', 0.135), ('ranklinucb', 0.13), ('day', 0.113), ('article', 0.106), ('coverage', 0.102), ('retrieval', 0.097), ('liked', 0.083), ('recommend', 0.083), ('bandits', 0.081), ('recommending', 0.077), ('slot', 0.077), ('recommended', 0.075), ('news', 0.075), ('topics', 0.072), ('rt', 0.07), ('recommendation', 0.069), ('mw', 0.068), ('utility', 0.066), ('feedback', 0.063), ('monotone', 0.062), ('regret', 0.053), ('bandit', 0.053), ('mt', 0.051), ('topic', 0.049), ('personalized', 0.047), ('users', 0.046), ('wt', 0.045), ('recommendations', 0.041), ('online', 0.04), ('interleaving', 0.038), ('competing', 0.037), ('exploration', 0.037), ('op', 0.035), ('dence', 0.033), ('fd', 0.033), ('incremental', 0.031), ('preferences', 0.031), ('greedily', 0.031), ('gain', 0.028), ('clicks', 0.027), ('likes', 0.027), ('diminishing', 0.026), ('exploitation', 0.026), ('radlinski', 0.026), ('sigir', 0.026), ('greedy', 0.025), ('weighting', 0.024), ('liking', 0.024), ('regg', 0.024), ('slate', 0.024), ('subtopic', 0.024), ('settings', 0.023), ('cover', 0.023), ('covered', 0.023), ('basis', 0.023), ('contextual', 0.023), ('days', 0.021), ('clairvoyant', 0.021), ('personalization', 0.021), ('streeter', 0.021), ('employ', 0.02), ('challenges', 0.02), ('content', 0.02), ('conducted', 0.02), ('live', 0.02), ('guestrin', 0.02), ('study', 0.019), ('acm', 0.019), ('det', 0.019), ('recommends', 0.019), ('conference', 0.019), ('versus', 0.019), ('rewards', 0.018), ('ignoring', 0.018), ('tl', 0.018), ('blog', 0.018), ('yue', 0.018), ('multiplicative', 0.018), ('setting', 0.017), ('jointly', 0.017), ('appendix', 0.017), ('session', 0.017), ('reyzin', 0.017), ('con', 0.017), ('adding', 0.017), ('mark', 0.016), ('east', 0.016), ('independence', 0.016), ('argmaxa', 0.016), ('uncertainty', 0.016), ('per', 0.016), ('vs', 0.016), ('features', 0.015), ('returns', 0.015), ('redundant', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

Author: Yisong Yue, Carlos Guestrin

2 0.22291167 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they ﬁnd the complete information they need. We are speciﬁcally interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a conﬁdent diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) deﬁned over a ground set V is called submodular if it satisﬁes the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, inﬂuence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is deﬁned to be the set containing the ﬁrst i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order speciﬁed by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it ﬁrst in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have deﬁned it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can deﬁne F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a ﬁxed set of candidate diseases, then this function is submodular. Speciﬁcally, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The ofﬂine approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this ofﬂine algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single ﬁxed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem ﬁts in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard ofﬂine problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efﬁcient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best ﬁxed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an ofﬂine approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an ofﬂine algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this ofﬂine algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known ﬁnite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Ofﬂine Analysis In this work we present an algorithm for online submodular set cover which extends the ofﬂine algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this ofﬂine algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Ofﬂine Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in ofﬂine analysis the ofﬂine algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no beneﬁt but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is deﬁned such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it ﬁnds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Speciﬁcally, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), deﬁne a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) deﬁne a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next deﬁne truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one ﬁnal lemma before proving the main result of this section. For a sequence S deﬁne Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Deﬁne Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it sufﬁces to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it sufﬁces to show that after selecting the ﬁrst k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Deﬁne a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ deﬁnition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to ﬁx. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modiﬁcation does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the ﬁrst k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the ofﬂine setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Deﬁne F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to ﬁrst select the two broad actions covering all common ads then select the narrow actions in any order. However, the ofﬂine cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for ofﬂine solutions as baselines. As seen in the ﬁgure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the ofﬂine cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We deﬁne an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Deﬁne the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Deﬁne F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Speciﬁcally, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Deﬁne F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netﬂix’s Watch Instantly service and use 803 questions based on those we used for an ofﬂine problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modiﬁcation of Normal-Hedge which ﬁxes ct = 1 for a ﬁxed, more aggressive learning rate than theoretically justiﬁed. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed signiﬁcantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modiﬁed Normal-Hedge has no theoretical justiﬁcation, so it may not generalize to other problems. For the modiﬁed Normal-Hedge the ﬁnal average cover times are 7.72 adaptive and 8.22 cumulative. The ofﬂine values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a ﬁxed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of inﬂuence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efﬁcient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

3 0.17890799 199 nips-2011-On fast approximate submodular minimization

Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes

Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest signiﬁcant speedups over minimum norm while retaining higher accuracies. 1

4 0.14652537 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide uniﬁed optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting speciﬁc submodular functions, we give a new interpretation to known norms, such as the total variation; we also deﬁne new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

5 0.10730389 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

Abstract: Thompson sampling is one of oldest heuristic to address the exploration / exploitation trade-off, but it is surprisingly unpopular in the literature. We present here some empirical results using Thompson sampling on simulated and real data, and show that it is highly competitive. And since this heuristic is very easy to implement, we argue that it should be part of the standard baselines to compare against. 1

6 0.09334261 129 nips-2011-Improving Topic Coherence with Regularized Topic Models

7 0.081228681 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation

8 0.080325313 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

9 0.074554987 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

10 0.071168333 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

11 0.069534272 61 nips-2011-Contextual Gaussian Process Bandit Optimization

12 0.067248046 136 nips-2011-Inverting Grice's Maxims to Learn Rules from Natural Language Extractions

13 0.062900297 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

14 0.059412234 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

15 0.055210292 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

16 0.048328701 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

17 0.047912374 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

18 0.047831535 56 nips-2011-Committing Bandits

19 0.047730658 220 nips-2011-Prediction strategies without loss

20 0.046343669 281 nips-2011-The Doubly Correlated Nonparametric Topic Model

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.118), (1, -0.098), (2, -0.025), (3, 0.028), (4, 0.065), (5, -0.143), (6, -0.178), (7, 0.206), (8, -0.086), (9, 0.088), (10, -0.04), (11, -0.014), (12, 0.091), (13, 0.093), (14, 0.066), (15, -0.015), (16, -0.075), (17, 0.028), (18, 0.005), (19, -0.021), (20, 0.004), (21, 0.009), (22, -0.022), (23, -0.026), (24, 0.048), (25, 0.023), (26, -0.016), (27, -0.02), (28, 0.021), (29, 0.071), (30, 0.06), (31, -0.038), (32, 0.013), (33, -0.031), (34, 0.017), (35, 0.03), (36, 0.009), (37, -0.009), (38, -0.028), (39, -0.005), (40, -0.02), (41, 0.046), (42, -0.026), (43, 0.026), (44, -0.022), (45, 0.081), (46, -0.021), (47, -0.112), (48, 0.029), (49, -0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94380718 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

Author: Yisong Yue, Carlos Guestrin

2 0.75768983 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

3 0.67619962 199 nips-2011-On fast approximate submodular minimization

Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes

4 0.65789157 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

5 0.55883908 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

Author: Matthias Hein, Simon Setzer

Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efﬁcient ﬁrst-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1

6 0.51633048 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

7 0.45727935 32 nips-2011-An Empirical Evaluation of Thompson Sampling

8 0.38496795 277 nips-2011-Submodular Multi-Label Learning

9 0.36841327 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

10 0.35623476 61 nips-2011-Contextual Gaussian Process Bandit Optimization

11 0.35177973 245 nips-2011-Selecting the State-Representation in Reinforcement Learning

12 0.3489916 129 nips-2011-Improving Topic Coherence with Regularized Topic Models

13 0.34867918 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation

14 0.34421214 272 nips-2011-Stochastic convex optimization with bandit feedback

15 0.33448368 128 nips-2011-Improved Algorithms for Linear Stochastic Bandits

16 0.32925299 281 nips-2011-The Doubly Correlated Nonparametric Topic Model

17 0.29901111 210 nips-2011-PAC-Bayesian Analysis of Contextual Bandits

18 0.28514129 293 nips-2011-Understanding the Intrinsic Memorability of Images

19 0.28205627 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

20 0.27562115 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.022), (4, 0.026), (10, 0.047), (20, 0.017), (26, 0.022), (31, 0.076), (33, 0.024), (43, 0.034), (45, 0.149), (47, 0.284), (57, 0.035), (74, 0.059), (79, 0.014), (83, 0.03), (99, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78499526 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

Author: Yisong Yue, Carlos Guestrin

2 0.66249084 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems

Author: Philipp Hennig

Abstract: The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A ﬁrst order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an inﬁnite-dimensional partial differential equation. An approximate ﬁnitedimensional projection gives an impression for how this result may be helpful. 1 Introduction – Optimal Reinforcement Learning Reinforcement learning is about doing two things at once: Optimizing a function while learning about it. These two objectives must be balanced: Ignorance precludes efﬁcient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a “correct” model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as “ -greedy” [1], or “Thompson sampling” [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system’s dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the “tree” are uncountably inﬁnite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9]. In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: Structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration exploitation trade-off locally, by constructing a linear approximation using a Kalman ﬁlter. Simpkins and colleagues further assumed to know the loss function, and the dynamics up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a ﬁrst-order description of optimal reinforcement learning in form of inﬁnite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an only initial example of such treatments, Section 4 1 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm; tested in some simple but instructive experiments (Section 5). An intuitive description of the paper’s results is this: From prior and corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system). Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that ﬁeld. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s | s, a) and utilities q with losses (negative rewards) −r. The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent). 2 A Class of Learning Problems We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ RD ×R lie in a D +1 dimensional Euclidean phase space-time: A cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the ﬁrst inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms afﬁne to the control: dx(t) = [f (x, t) + g(x, t)u(x, t)] dt (1) where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g are known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f (x, t) = [f1 , . . . , fD ] is acquired stochastically: A Poisson process of constant rate λ produces mutually independent samples yq (x, t) = q(x, t)+ q and yf d (x, t) = fd (x, t)+ fd where q 2 ∼ N (0, σq ); fd 2 ∼ N (0, σf d ). (2) The noise levels σq and σf are presumed known. Let our initial beliefs about q and f be given by D Gaussian processes GP kq (q; µq , Σq ); and independent Gaussian processes d GP kf d (fd ; µf d , Σf d ), respectively, with kernels kr , kf 1 , . . . , kf D over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an inﬁnite vector Ω of i.i.d. Gaussian variables, as 1/2 ˜ fd ([x, t]) = µf d ([x, t])+ 1/2 Σf d ([x, t], [x , t ])Ω(x , t )dx dt = µf d ([x, t])+(Σf d Ω)([x, t]) (3) the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process. To ensure continuous trajectories, we also need to regularize the control. Following control custom, 1 we introduce a quadratic control cost ρ(u) = 2 u R−1 u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. The overall task is to ﬁnd the optimal discounted horizon value ∞ v(x, t) = min u t 1 e−(τ −t)/γ q[χ[τ, u(χ, τ )], τ ] + u(χ, τ ) R−1 u(χ, τ ) dτ 2 (4) where χ(τ, u) is the trajectory generated by the dynamics deﬁned in Equation (1), using the control law (policy) u(x, t). The exponential deﬁnition of the discount γ > 0 gives the unit of time to γ. Before beginning the analysis, consider the relative generality of this deﬁnition: We allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The speciﬁc choice of a Poisson process for the generation of samples is 2 somewhat ad-hoc, but some measure is required to quantify the ﬂow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system’s dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13]. 3 Optimal Control for the Learning Process The optimal solution to the exploration exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm’s dynamics: At time t = τ , let the system be at phase space-time sτ = (x(τ ), τ ) and have the Gaussian process belief GP(q; µτ (s), Στ (s, s )) over the function q (all derivations in this section will focus on q, and we will drop the sub-script q from many quantities for readability. The forms for f , or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, . . . , D). This belief stems from a ﬁnite number N of samples y 0 = [y1 , . . . , yN ] ∈ RN collected at space-times S 0 = [(x1 , t1 ), . . . , (xN , tN )] ≡ [s1 , . . . , sN ] ∈ KN (note that t1 to tN need not be equally spaced, ordered, or < τ ). For arbitrary points s∗ = (x∗ , t∗ ) ∈ K, the belief over q(s∗ ) is a Gaussian with mean function µτ , and co-variance function Στ [15] 2 µτ (s∗ ) = k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 y 0 i i (5) 2 Στ (s∗ , s∗ ) = k(s∗ , s∗ ) − k(s∗ , S 0 )[K(S 0 , S 0 ) + σq I]−1 k(S 0 , s∗ ) i j i j i j where K(S 0 , S 0 ) is the Gram matrix with elements Kab = k(sa , sb ). We will abbreviate K0 ≡ 2 [K(S 0 , S 0 ) + σy I] from here on. The co-vector k(s∗ , S 0 ) has elements ki = k(s∗ , si ) and will be shortened to k0 . How does this belief change as time moves from τ to τ + dt? If dt 0, the chance of acquiring a datapoint yτ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with (λ dt)2 and so on. So the mean after dt is expected to be µτ + dt = λ dt (k0 , kτ ) K0 ξτ ξτ κτ −1 y0 −1 + (1 − λ dt − O(λ dt)2 ) · k0 K0 y 0 + O(λ dt)2 (6) yτ where we have deﬁned the map kτ = k(s∗ , sτ ), the vector ξ τ with elements ξτ,i = k(si , sτ ), and 2 the scalar κτ = k(sτ , sτ ) + σq . Algebraic re-formulation yields −1 −1 −1 −1 µτ + dt = k0 K0 y 0 + λ(kt − k0 K0 ξ t )(κt − ξ t K0 ξ t )−1 (yt − ξ t K0 y 0 ) dt. −1 ξ τ K0 y 0 −1 ξ τ K0 ξ τ ) Note that = µ(sτ ), the mean prediction at sτ and (κτ − ¯ ¯ the marginal variance there. Hence, we can deﬁne scalars Σ, σ and write −1 (yτ − ξ τ K0 y 0 ) [Σ1/2 Ω](sτ ) + σω ¯τ = ≡ Σ1/2 Ω + στ ω ¯ −1 1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (κτ − ξ τ K0 ξ τ ) = 2 σq (7) + Σ(sτ , sτ ), with ω ∼ N (0, 1). (8) So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler, a detailed proof is left out for lack of space). We use the Wiener [16] measure dω to write dµsτ (s∗ ) = λ −1 kτ − k0 K0 ξ τ [Σ1/2 Ω](sτ ) + σω ¯τ dt ≡ λLsτ (s∗ )[Σ1/2 Ω dt + στ dω] ¯ −1 (κτ − ξ τ K0 ξ τ )−1/2 [Σ(sτ , sτ ) + σ 2 ]1/2 (9) where we have implicitly deﬁned the innovation function L. Note that L is a function of both s∗ and sτ . A similar argument ﬁnds the change of the covariance function to be the deterministic rate dΣsτ (s∗ , s∗ ) = −λLsτ (s∗ )Lsτ (s∗ ) dt. i j i j 3 (10) So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples a Gaussian processes with covariance function proportional to LL . This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for ﬁnite-dimensional Gaussian beliefs, for data with known noise magnitude). We introduce the belief-augmented space H containing states z(τ ) ≡ [x(τ ), τ, µτ (s), µτ 1 , . . . , µτ D , q f f Στ (s, s ), Στ 1 , . . . , Στ D ]. Since the means and covariances are functions, H is inﬁnite-dimensional. q f f Under our beliefs, z(τ ) obeys a stochastic differential equation of the form dz = [A(z) + B(z)u + C(z)Ω] dt + D(z) dω (11) with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D A = µτ (zx , zt ) , 1 , 0 , 0 , . . . , 0 , −λLq Lq , −λLf 1 Lf 1 , . . . , −λLf D Lf D ; f B = [g(s∗ ), 0, 0, 0, . . . ]; (12) 1/2 ¯q ¯ 1/2 ¯ 1/2 C = diag(Σf τ , 0, λLq Σ1/2 , λLf 1 Σf 1 , . . . , λLf D Σf d , 0, . . . , 0); D = diag(0, 0, λLq σq , λLf 1 σf 1 , . . . , λLf D σf D , 0, . . . , 0) ¯ ¯ ¯ (13) ∗ The value – the expected cost to go – of any state s is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman’s principle and a ﬁrst-order expansion, using Eq. (4): 1 (14) µq (sτ ) + Σ1/2 Ωq + σq ωq + u R−1 u dt + v(zτ + dt ) dω dΩ qτ 2 1 v(zτ ) ∂v 1 µτ +Σ1/2 Ωq + u R−1 u+ + +[A+Bu+CΩ] v+ tr[D ( 2 v)D]dΩ dt q qτ 2 dt ∂t 2 v(zτ ) = min u = min u Integration over ω can be performed with ease, and removes the stochasticity from the problem; The uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, v, 2 v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. For example, if we (inexactly) swap integration and minimisation, draw samples Ωi and solve for the value for each sample, we get an “average optimal controller”. This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and v, we can integrate (17) locally, all terms in Ω drop out and we are left with an “optimal average controller”, which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a “global one-step look-ahead”. One could imagine extensions that consider the inﬂuence of Ω on v to a higher order, but these will be left for future work. Under this ﬁrst-order approximation, analytic minimisation over u can be performed in closed form, and bears u(z) = −RB(z) v(z) = −Rg(x, t) x v(z). (15) The optimal Hamilton-Jacobi-Bellman equation is then 1 1 v − [ v] BRB v + tr D ( 2 v)D . 2 2 A more explicit form emerges upon re-inserting the deﬁnitions of Eq. (12) into Eq. (16): γ −1 v(z) = µτ + A q γ −1 v(z) = [µτ + µτ (zx , zt ) q f x + t 1 v(z) − [ 2 x v(z)] free drift cost c=q,f1 ,...,fD x v(z) control beneﬁt − λ Lc Lc + g (zx , zt )Rg(zx , zt ) (16) Σc 1 v(z) + λ2 σc Lf d ( ¯2 2 exploration bonus 2 µf d v(z))Lf d (17) diffusion cost Equation (17) is the central result: Given Gaussian priors on nonlinear control-afﬁne dynamic systems, up to a ﬁrst order approximation, optimal reinforcement learning is described by an inﬁnitedimensional second-order partial differential equation. It can be interpreted as follows (labels in the 4 equation, note the negative signs of “beneﬁcial” terms): The value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the beneﬁt of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the measurement noise. The ﬁrst two lines of the right hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: If the ﬁrst two lines line dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the ﬁrst differential statement about reinforcement learning’s two objectives. Finally, note the role of the sampling rate λ: If λ is very low, exploration is useless over the discount horizon. Even after these approximations, solving Equation (17) for v remains nontrivial for two reasons: First, although the vector product notation is pleasingly compact, the mean and covariance functions are of course inﬁnite-dimensional, and what looks like straightforward inner vector products are in fact integrals. For example, the average exploration bonus for the loss, writ large, reads ∂v(z) (18) −λLq Lq Σq v(z) = − λL(q) (s∗ )L(q) (s∗ ) ds∗ ds∗ . sτ i sτ j ∂Σ(s∗ , s∗ ) i j K i j (note that this object remains a function of the state sτ ). For general kernels k, these integrals may only be solved numerically. However, for at least one speciﬁc choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used that the “numerical” part of the paper (Section 4) will restrict the choice of kernel to this class. The other problem, of course, is that Equation (17) is a nontrivial differential Equation. Section 4 presents one, initial attempt at a numerical solution that should not be mistaken for a deﬁnitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: It replaces the intractable deﬁnition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling. Digression: Relaxing Some Assumptions This paper only applies to the speciﬁc problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: For example, it is intractable to simultaneously learn both g and f nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become inﬁnitely dependent when conditioned on data. But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ u Ru and the control-afﬁne form (g(x, t)u) are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here. 4 Numerically Solving the Hamilton-Jacobi-Bellman Equation Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one speciﬁc Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials k(a, b) = kSE (a, b; θ, S) = 1 θ2 exp(− 2 (a − b) S −1 (a − b)) with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We ﬁnd an approximate solution through a factorizing parametric Ansatz: Let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions: v(z) = φe (ze ) we with φe , we ∈ RNe (19) e=x,Σq ,µq ,Σf ,µf 5 This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17). For square exponential kernels, one convenient choice is φa (zs ) = k(sz , sa ; θa , Sa ) s (20) [Σz (s∗ , s∗ ) − k(s∗ , s∗ )]k(s∗ , sb ; θb , Sb )k(s∗ , sb ; θb , Sb ) ds∗ ds∗ i j i j i j i j φb (zΣ ) = Σ and (21) K µz (s∗ )µz (s∗ )k(s∗ , sc , θc , Sc )k(s∗ , sc , θc , Sc ) ds∗ ds∗ i j i j i j φc (zµ ) = µ (22) K (the subtracted term in the ﬁrst integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in wx , linear in all other we : 1 w Ψ(zx )wx − q(zx ) + 2 x Ξe (ze )we = 0 (23) e=x,µq ,Σq ,µf ,Σf using co-vectors Ξ and a matrix Ψ with elements Ξx (zs ) = γ −1 φa (zs ) − f (zx ) a s ΞΣ (zΣ ) = γ −1 φa (zΣ ) + λ a Σ a x φs (zs ) − a t φs (zs ) Lsτ (s∗ )Lsτ (s∗ ) i j K Ξµ (zµ ) a = γ −1 φa (zµ ) µ Ψ(z)k = [ k x φs (z)] − λ2 σsτ ¯2 2 ∂φΣ (zΣ ) ds∗ ds∗ ∂Σz (s∗ , s∗ ) i j i j Lsτ (s∗ )Lsτ (s∗ ) i j K g(zx )Rg(zx ) [ ∂ 2 φa (zµ ) µ ds∗ ds∗ ∂µz (s∗ )∂µz (s∗ ) i j i j (24) x φs (z)] Note that Ξµ and ΞΣ are both functions of the physical state, through sτ . It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z eval sufﬁcient to constrain the resulting system of quadratic equations, and then ﬁnd the least-squares solution wopt by function minimisation, using standard methods, such as Levenberg-Marquardt [19]. A disadvantage of this approach is that is has a number of degrees of freedom Θ, such as the kernel parameters, and the number and locations xa of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically. 5 5.1 Experiments Illustrative Experiment on an Artiﬁcial Environment As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential (“radial”) basis functions (Equation 20), initially all with weight i wx = 0. For the information terms, only a single basis function was used for each term (i.e. one single φΣq , one single φµq , and equally for f , all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on Lsτ . Five times the number of parameters, i.e. Neval = 1175 evaluation points zeval were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each ze should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it sufﬁces to use the current beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50s, the sampling rate at λ = 2/s, the control cost at 10m2 /($s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5s. Figure 1 shows the situation 50s after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on sτ induces a complex shape. The system 6 0 40 40 0.5 20 −2 20 x 0 0 0 −4 −20 −0.5 −20 −6 −40 −1 −40 −8 0 20 40 60 80 100 0 40 20 40 60 80 100 0 40 20 20 0.5 0 −20 x 0 −20 −40 −40 0 0 20 40 60 80 −0.5 100 −1 0 t 20 40 60 80 100 t Figure 1: State after 50 time steps, plotted over phase space-time. top left: µq (blue is good). The belief over f is not shown, but has similar structure. top right: value estimate v at current belief: compare to next two panels to note that the approximation is relatively coarse. bottom left: exploration terms. bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (blue region in bottom right panel is roughly cancelled by red, forward cone in bottom left one). constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This is an important insight that casts doubt on the usefulness of simple, local exploration boni, used in many reinforcement learning algorithms. Secondly, note that the system’s trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reﬂects this, by assigning low (good) value to regions behind the system’s trajectory. This amounts to a sense of “remorse”: If the learner would have known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: Remember that the value is deﬁned on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space. 5.2 Comparative Experiment – The Furuta Pendulum The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an -greedy TD(λ) learner with linear function approximation, Simpkins’ et al.’s [12] Kalman method and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: Note that the sampling frequency inﬂuences TD in nontrivial ways rarely studied (for example through the coarseness of the -greedy policy). The parameters were set to γ = 5s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieve a stable controlled balance, although 7 u θ1 1 2 Method cumulative loss Full Information (baseline) TD(λ) Kalman ﬁlter Optimal Learner Gaussian process optimal learner 4.4 ±0.3 6.401±0.001 6.408±0.001 4.6 ±1.4 θ2 Figure 2: The Furuta pendulum system: A pendulum of length 2 is attached to a rotatable arm of length 1 . The control input is the torque applied to the arm. Right: cost to go achieved by different methods. Lower is better. Error measures are one standard deviation over ﬁve experiments. the GP learner does swing up the pendulum. This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17). There is of course a large number of other algorithms methods to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no “conservative” option that cold be adopted in place of the Bayesian answer. 6 Conclusion Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone does not make the problem easier, as solving nonlinear differential equations is in general intractable. However, the ubiquity of differential descriptions in other ﬁelds raises hope that this insight opens new approaches to reinforcement learning. For intuition on how such solutions might work, one speciﬁc approximation was presented, using functionals to reduce the problem to ﬁnite least-squares parameter estimation. The critical reader will have noted how central the prior is for the arguments in Section 3: The dynamics of the learning process are predictions of future data, thus inherently determined exclusively by prior assumptions. One may ﬁnd this unappealing, but there is no escape from it. Minimizing future loss requires predicting future loss, and predictions are always in danger of falling victim to incorrect assumptions. A ﬁnite initial identiﬁcation phase may mitigate this problem by replacing prior with posterior uncertainty – but even then, predictions and decisions will depend on the model. The results of this paper raise new questions, theoretical and applied. The most pressing questions concern better solution methods for Eq. (14), in particular better means for taking the expectation over the uncertain dynamics to more than ﬁrst order. There are also obvious probabilistic issues: Are there other classes of priors that allow similar treatments? (Note some conceptual similarities between this work and the BEETLE algorithm [4]). To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results? Acknowledgments The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper. 8 References [1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998. [2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of two samples. Biometrika, 25:275–294, 1933. [3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002. [4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006. [5] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Uncertainty in Artiﬁcial Intelligence, pages 150–159, 1999. [6] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000. [7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005. [8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artiﬁcial Intelligence, 2009. [9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009. [10] E. Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems, 19, 2007. [11] H. J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems., pages 149–181, 2007. [12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008. [13] I. Fantoni and L. Rogelio. Non-linear Control for Underactuated Mechanical Systems. Springer, 1973. [14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961. [15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923. [17] T. Kailath. An innovations approach to least-squares estimation — part I: Linear ﬁltering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968. [18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010. [19] D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963. [20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992. 9

3 0.60515088 32 nips-2011-An Empirical Evaluation of Thompson Sampling

Author: Olivier Chapelle, Lihong Li

4 0.56747365 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

5 0.55884814 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a uniﬁed framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be uniﬁed under the minimum KL contraction framework with different choices of the KL contraction operators. 1

6 0.55634135 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

7 0.5547877 271 nips-2011-Statistical Tests for Optimization Efficiency

8 0.55308598 283 nips-2011-The Fixed Points of Off-Policy TD

9 0.55200565 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

10 0.55067527 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

11 0.54891491 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

12 0.54878801 220 nips-2011-Prediction strategies without loss

13 0.54700291 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

14 0.54667789 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning

15 0.54574925 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

16 0.54459327 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

17 0.54443383 277 nips-2011-Submodular Multi-Label Learning

18 0.54191631 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

19 0.54155725 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

20 0.54155183 186 nips-2011-Noise Thresholds for Spectral Clustering