nips nips2013 nips2013-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. [sent-7, score-0.394]
2 In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. [sent-8, score-0.436]
3 For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. [sent-9, score-0.452]
4 In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. [sent-10, score-0.331]
5 More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. [sent-11, score-0.613]
6 We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. [sent-13, score-0.346]
7 1 Introduction The ability of a website to present personalized content recommendations is playing an increasingly crucial role in achieving user satisfaction. [sent-15, score-0.34]
8 Because of the appearance of new content, and due to the ever-changing nature of content popularity, modern approaches to content recommendation are strongly adaptive, and attempt to match as closely as possible users’ interests by learning good mappings between available content and users. [sent-16, score-0.467]
9 The need to focus on content that raises the user's interest and, simultaneously, the need to explore new content in order to globally improve the user experience create an exploration-exploitation dilemma, which is commonly formalized as a multi-armed bandit problem. [sent-18, score-0.828]
10 Indeed, contextual bandits have become a reference model for the study of adaptive techniques in recommender systems (e.g., …). [sent-19, score-0.418]
11 In many cases, however, the users targeted by a recommender system form a social network. [sent-21, score-0.378]
12 This is because the knowledge gathered about the interests of a given user may be exploited to improve the recommendation to the user’s friends. [sent-24, score-0.32]
13 In this work, an algorithmic approach to networked contextual bandits is proposed which is provably able to leverage user similarities represented as a graph. [sent-25, score-0.534]
14 Our approach consists in running an instance of a contextual bandit algorithm at each network node. [sent-26, score-0.53]
15 This mechanism is implemented by running instances of a linear contextual bandit algorithm in a specific reproducing kernel Hilbert space (RKHS). [sent-29, score-0.478]
16 The Laplacian matrix provides the information we rely upon to share user feedback from one node to the others, according to the network structure. [sent-33, score-0.423]
17 Moreover, the existing performance guarantees for the specific bandit algorithm we use can be directly lifted to the RKHS, and expressed in terms of spectral properties of the user network. [sent-35, score-0.445]
18 First, running a network of linear contextual bandit algorithms with a Laplacian-based feedback sharing mechanism may cause significant scaling problems, even on small to medium sized social networks. [sent-37, score-0.661]
19 Second, the social information provided by the network structure at hand need not be fully reliable in accounting for user behavior similarities. [sent-38, score-0.354]
20 After collecting empirical evidence on the sensitivity of networked bandit methods to graph noise, we propose two simple modifications to our basic strategy, both aimed at circumventing the above issues by clustering the graph nodes. [sent-40, score-0.598]
21 The first approach reduces graph noise simply by deleting edges between pairs of clusters. [sent-41, score-0.27]
22 We run experiments on two real-world datasets: one is extracted from the social bookmarking web service Delicious, and the other one from the music streaming platform Last.fm. [sent-45, score-0.246]
23 2 Related work The benefit of using social relationships in order to improve the quality of recommendations is a recognized fact in the literature of content recommender systems —see, e.g., …. [sent-47, score-0.377]
24 Linear models for contextual bandits were introduced in [4]. [sent-50, score-0.306]
25 Their application to personalized content recommendation was pioneered in [15], where the LinUCB algorithm was introduced. [sent-51, score-0.289]
26 To the best of our knowledge, this is the first work that combines contextual bandits with the social graph information. [sent-53, score-0.551]
27 However, non-contextual stochastic bandits in social networks were studied in a recent independent work [20]. [sent-54, score-0.277]
28 Other works, such as [2, 19], consider contextual bandits assuming metric or probabilistic dependencies on the product space of contexts and actions. [sent-55, score-0.357]
29 A non-contextual model of bandit algorithms running on the nodes of a graph was studied in [14]. [sent-57, score-0.505]
30 In that work, only one node reveals its payoffs, and the statistical information acquired by this node over time is spread across the entire network following the graphical structure. [sent-58, score-0.386]
31 More recently, a new model of distributed non-contextual bandit algorithms has been presented in [21], where the number of communications among the nodes is limited, and all the nodes in the network have the same best action. [sent-60, score-0.472]
32 3 Learning model We assume the social relationships over users are encoded as a known undirected and connected graph G = (V, E), where V = {1, …, n}. [sent-61, score-0.408]
33 Recall that a graph G can be equivalently defined in terms of its Laplacian matrix L = [L_{i,j}]_{i,j=1}^n, where L_{i,i} is the degree of node i (i.e., the number of edges incident to i), and L_{i,j} = −1 if (i,j) ∈ E, 0 otherwise. [sent-65, score-0.314]
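For concreteness, here is a minimal NumPy sketch (ours, not the paper's) of building L from an undirected edge list:

```python
import numpy as np

def laplacian(n, edges):
    """Graph Laplacian L = D - A of an undirected graph on nodes 0..n-1."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

# Example: a triangle on 3 nodes; rows of a Laplacian sum to zero.
L = laplacian(3, [(0, 1), (1, 2), (0, 2)])
assert np.allclose(L.sum(axis=1), 0.0)
```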
34 At each time step t, the learner receives a user index i_t ∈ V together with a set of context vectors C_{i_t} = {x_{t,1}, x_{t,2}, …, x_{t,c_t}} ⊆ R^d. [sent-71, score-0.307]
35 The learner then selects some index k_t and recommends x̄_t = x_{t,k_t} ∈ C_{i_t} to user i_t, observing some payoff a_t ∈ [−1, 1], a function of i_t and x̄_t. [sent-75, score-0.535]
36 A standard modeling assumption for bandit problems with contextual information (one that is also adopted here) is that rewards are generated by noisy versions of unknown linear functions of the context vectors. [sent-77, score-0.509]
37 That is, we assume each node i ∈ V hosts an unknown parameter vector u_i ∈ R^d, and that the reward value a_i(x) associated with node i and context vector x ∈ R^d is given by the random variable a_i(x) = u_i^⊤ x + ε_i(x), where ε_i(x) is a conditionally zero-mean and bounded-variance noise term. [sent-78, score-1.057]
38 Therefore, u_i^⊤ x is the expected reward observed at node i for context vector x. [sent-88, score-0.593]
39 The regret r_t of the learner at time t is the amount by which the average reward of the best choice in hindsight at node i_t exceeds the average reward of the algorithm's choice, i.e., r_t = max_{x ∈ C_{i_t}} u_{i_t}^⊤ x − u_{i_t}^⊤ x̄_t. [sent-90, score-0.711]
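A toy illustration of the reward model and the instantaneous regret just defined; the dimension, noise scale, and all variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
u = rng.normal(size=d)
u /= np.linalg.norm(u)                    # unknown user vector u_{i_t}
C = rng.normal(size=(10, d))              # context set C_{i_t}

def reward(u, x, noise=0.1):
    # a_i(x) = u' x + eps, with zero-mean bounded noise
    return u @ x + rng.uniform(-noise, noise)

expected = C @ u                          # expected reward of each context
k_t = 3                                   # the algorithm's (arbitrary) pick
r_t = expected.max() - expected[k_t]      # instantaneous regret
```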
40 We model the similarity among users in V by making the assumption that nearby users hold similar underlying vectors u_i, so that reward signals received at a given node i_t at time t are also, to some extent, informative for learning the behavior of other users j connected to i_t within G. [sent-100, score-0.984]
41 We follow a standard multitask learning approach (see, e.g., [8]), and assume that the quantity Σ_{(i,j)∈E} ‖u_i − u_j‖²  (1)  is small compared to Σ_{i∈V} ‖u_i‖², where ‖·‖ denotes the standard Euclidean norm of vectors. [sent-103, score-0.269]
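The quantity in (1) is exactly the Laplacian quadratic form of the matrix stacking the u_i as rows, i.e. Σ_{(i,j)∈E} ‖u_i − u_j‖² = tr(U⊤ L U); a small sketch checking this identity:

```python
import numpy as np

def smoothness(U, edges):
    """sum over (i,j) in E of ||u_i - u_j||^2; U holds one row per node."""
    return sum(float(np.sum((U[i] - U[j]) ** 2)) for i, j in edges)

def laplacian_quadratic(U, L):
    # Same quantity through the Laplacian quadratic form tr(U' L U).
    return float(np.trace(U.T @ L @ U))

edges = [(0, 1), (1, 2), (0, 2)]
L = np.array([[2., -1., -1.], [-1., 2., -1.], [-1., -1., 2.]])
U = np.random.default_rng(0).normal(size=(3, 4))
assert np.isclose(smoothness(U, edges), laplacian_quadratic(U, L))
```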
42 4 Algorithm and regret analysis Our bandit algorithm maintains at time t an estimate wi,t for vector ui . [sent-106, score-0.443]
43 Vectors w_{i,t} are updated based on the reward signals as in a standard linear bandit algorithm (e.g., …). [sent-107, score-0.506]
44 Every node i of G hosts a linear bandit algorithm like the one described in Figure 1. [sent-110, score-0.479]
45 The algorithm in Figure 1 maintains at time t a prototype vector wt which is the result of a standard linear least-squares approximation to the unknown parameter vector u associated with the node under consideration. [sent-111, score-0.257]
46 The linear bandit algorithm selects x̄_t = x_{t,k_t} as the vector in C_t that maximizes an upper-confidence-corrected estimate of the expected reward achieved over context vectors x_{t,k}. [sent-120, score-0.716]
47 The estimate is based on the current w_{t−1}, while the upper confidence level CB_t is suggested by the standard analysis of linear bandit algorithms —see, e.g., …. [sent-121, score-0.274]
48 Once the actual reward a_t associated with x̄_t is observed, the algorithm uses x̄_t to update M_{t−1} to M_t via a rank-one adjustment, and b_{t−1} to b_t via an additive update whose learning rate is precisely a_t. [sent-124, score-0.474]
49 This algorithm can be seen as a version of LinUCB [9], a linear bandit algorithm derived from LinRel [4]. [sent-125, score-0.274]
50 Set k_t = argmax_{k=1,…,c_t} ( w_{t−1}^⊤ φ_{t,k} + CB_t(φ_{t,k}) ) and x̄_t = x_{t,k_t}; observe reward a_t ∈ [−1, 1]; update M_t = M_{t−1} + x̄_t x̄_t^⊤ and b_t = b_{t−1} + a_t x̄_t. [sent-163, score-0.675]
51 Here CB_t(φ_{t,k}) = √(φ_{t,k}^⊤ M_{t−1}^{−1} φ_{t,k}) · ( σ √(ln(|M_t|/δ)) + U ); after observing the reward a_t ∈ [−1, 1] at node i_t, the update M_t = M_{t−1} + φ_{t,k_t} φ_{t,k_t}^⊤ is applied. Figure 1: Pseudocode of the linear bandit algorithm sitting at each node i of the given graph. [sent-164, score-0.902]
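A hedged Python sketch of the per-node algorithm of Figure 1; the confidence width below uses a generic tuning parameter alpha in place of the exact CB_t constants above:

```python
import numpy as np

class LinearBanditNode:
    """Per-node linear bandit sketch (generic UCB width, not the paper's
    exact CB_t; alpha is a tuning parameter of ours)."""
    def __init__(self, d, alpha=1.0):
        self.M = np.eye(d)       # M_0 = I
        self.b = np.zeros(d)     # b_0 = 0
        self.alpha = alpha

    def select(self, X):
        """X: one context vector per row; returns the index k_t."""
        w = np.linalg.solve(self.M, self.b)        # least-squares estimate
        Minv = np.linalg.inv(self.M)               # see the rank-one sketch below
        width = np.sqrt(np.einsum('td,dk,tk->t', X, Minv, X))
        return int(np.argmax(X @ w + self.alpha * width))

    def update(self, x, a):
        self.M += np.outer(x, x)                   # rank-one adjustment
        self.b += a * x                            # additive update, rate a_t

node = LinearBanditNode(d=5, alpha=0.5)
X = np.random.default_rng(0).normal(size=(10, 5))  # the set C_t
k = node.select(X)
node.update(X[k], a=0.3)                           # observed payoff a_t
```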
52 GOB.Lin lets the algorithm in Figure 1 operate on each node i of G (we should then add subscript i throughout, replacing w_t by w_{i,t}, M_t by M_{i,t}, and so forth). [sent-171, score-0.228]
53 The updates M_{i,t−1} → M_{i,t} and b_{i,t−1} → b_{i,t} are performed at node i through vector x̄_t both when i = i_t (i.e., [sent-172, score-0.241]
54 when node i is the one which the context vectors in C_{i_t} refer to) and, to a lesser extent, when i ≠ i_t (i.e., [sent-174, score-0.303]
55 when node i is not the one which the vectors in C_{i_t} refer to). [sent-176, score-0.228]
56 This is because, as we said, the payoff a_t received for node i_t is somehow informative also for all other nodes i ≠ i_t. [sent-177, score-0.478]
57 In other words, because we are assuming the underlying parameter vectors ui are close to each other, we should let the corresponding prototype vectors wi,t undergo similar updates, so as to also keep the wi,t close to each other over time. [sent-178, score-0.27]
58 Without sharing, M_t would reduce to the block-diagonal matrix diag(D_1, …, D_n), whose i-th block D_i is the d × d matrix D_i = I_d + Σ_{t : i_t = i} x̄_t x̄_t^⊤. [sent-203, score-0.26]
59 Similarly, b_t would be the dn-long vector whose i-th d-dimensional block contains Σ_{t : i_t = i} a_t x̄_t. [sent-204, score-0.247]
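One way to realize the Laplacian-based sharing, consistent with the description above, is to run the algorithm of Figure 1 in R^{dn} on whitened long features. The sketch below assumes A = I_n + L and A⊗ = A ⊗ I_d; these names and the exact whitening are our reading of the construction, not a verbatim transcription:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

n, d = 4, 3
edges = [(0, 1), (1, 2), (2, 3)]
L = np.zeros((n, n))
for i, j in edges:                       # build the Laplacian of G
    L[i, j] -= 1.0; L[j, i] -= 1.0
    L[i, i] += 1.0; L[j, j] += 1.0

A = np.eye(n) + L                        # assumed: A = I_n + L
A_kron = np.kron(A, np.eye(d))           # A (x) I_d, a dn x dn matrix
W = fractional_matrix_power(A_kron, -0.5)

def gob_phi(x, i):
    """Context x for user i, placed in the i-th d-block of R^{dn},
    then whitened so that updates at node i leak to its neighbors."""
    phi = np.zeros(n * d)
    phi[i * d:(i + 1) * d] = x
    return W @ phi
```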
60 This would be equivalent to running n independent linear bandit algorithms (Figure 1), one per node of G. [sent-205, score-0.485]
61 Yet, the only reward signal observed at time t is the one available at node it . [sent-208, score-0.399]
62 GOB.Lin's running time is mainly affected by the inversion of the dn × dn matrix M_t, which can be performed in time of order (dn)² per round by using well-known formulas for incremental matrix inversions. [sent-213, score-0.326]
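The "well-known formulas" are rank-one inverse updates (Sherman-Morrison); a minimal sketch:

```python
import numpy as np

def sherman_morrison(M_inv, x):
    """Inverse of (M + x x') from M^{-1} in O(m^2) instead of O(m^3)."""
    Mx = M_inv @ x
    return M_inv - np.outer(Mx, Mx) / (1.0 + x @ Mx)

rng = np.random.default_rng(1)
M, M_inv = np.eye(6), np.eye(6)
for _ in range(5):                       # five rank-one rounds
    x = rng.normal(size=6)
    M += np.outer(x, x)
    M_inv = sherman_morrison(M_inv, x)
assert np.allclose(M_inv, np.linalg.inv(M))
```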
63 In the next section, we show that simple graph compression schemes (like node clustering) can conveniently be applied to both reduce edge noise and bring the algorithm to reasonable scaling behaviors. [sent-219, score-0.345]
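A minimal sketch of the edge-deletion approach mentioned earlier; the clustering method (spectral clustering here) and all names are our choices, since the exact procedure is not pinned down at this point:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def compress_graph(n, edges, n_clusters):
    """Cluster nodes, then keep only edges whose endpoints fall in
    the same cluster (the first of the two variants described above)."""
    adj = np.zeros((n, n))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(adj)
    kept = [(i, j) for i, j in edges if labels[i] == labels[j]]
    return labels, kept
```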
64 , [8, 17, 12] and a version of the linear bandit algorithm described and analyzed in [1]. [sent-225, score-0.274]
65 Let G = (V, E) be the given graph with V = {1, …, n}, hosting at each node i ∈ V a vector u_i ∈ R^d. [sent-231, score-0.286]
66 Then the cumulative regret satisfies Σ_{t=1}^T r_t ≤ 2 √( T ( 2σ² ln |M_T| + 2 L(u_1, …, u_n) ) ). [sent-243, score-0.258]
67 Compared to running n independent bandit algorithms (which corresponds to A⊗ being the identity matrix), the bound in the above theorem has an extra term Σ_{(i,j)∈E} ‖u_i − u_j‖², which we assume small according to our working assumption. [sent-247, score-0.468]
68 For GOB.Lin, the ln |M_T| term can be a factor n smaller than the corresponding term for the n independent bandit case (see, e.g., …). [sent-250, score-0.274]
69 On the other hand, when G is the complete graph, TR(M_T) = TR(I) + 2T/(n + 1) = dn + 2T/(n + 1), hence ln |M_T| ≤ dn ln(1 + 2T/(dn(n + 1))). [sent-256, score-0.284]
70 We compared GOB.Lin (and its variants) to linear bandit algorithms which do not exploit the relational information provided by the graph. [sent-259, score-0.274]
71 We tested our algorithm and its competitors on a synthetic dataset and two freely available real-world datasets extracted from the social bookmarking web service Delicious and from the music streaming service Last. [sent-263, score-0.363]
72 This is an artificial dataset whose graph contains four cliques of 25 nodes each to which we added graph noise. [sent-267, score-0.301]
73 More precisely, we created an n × n symmetric noise matrix of random numbers in [0, 1], and we selected a threshold value such that the expected number of matrix elements above this value is exactly some chosen noise rate parameter. [sent-269, score-0.226]
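A sketch of this construction under one reading of the procedure (here the thresholded noise entries flip edges; how the noise matrix perturbs the graph is our assumption):

```python
import numpy as np

def noisy_cliques(n_cliques=4, clique_size=25, noise_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = n_cliques * clique_size
    adj = np.zeros((n, n), dtype=bool)
    for c in range(n_cliques):                       # four 25-node cliques
        lo = c * clique_size
        adj[lo:lo + clique_size, lo:lo + clique_size] = True
    noise = np.triu(rng.uniform(size=(n, n)), 1)     # symmetric noise matrix
    noise += noise.T
    upper = noise[np.triu_indices(n, 1)]
    thr = np.quantile(upper, 1.0 - noise_rate)       # noise_rate fraction above
    adj ^= noise > thr                               # assumption: flip edges
    np.fill_diagonal(adj, False)
    return adj
```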
74 This is a social network containing 1,892 nodes and 12,717 edges. [sent-274, score-0.256]
75 The dataset contains information about the artists each user listened to, and we used this information to create the payoffs: if a user listened to an artist at least once, the payoff is 1; otherwise, the payoff is 0. [sent-276, score-0.799]
76 The payoffs were created using the information about the bookmarked URLs for each user: the payoff is 1 if the user bookmarked the URL, otherwise the payoff is 0. [sent-280, score-0.991]
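Both payoff rules admit a direct sketch; the lookup structures below are assumed bookkeeping, not the authors' code:

```python
def lastfm_payoff(user, artist, listens):
    """listens: dict mapping (user, artist) to a play count."""
    return 1 if listens.get((user, artist), 0) >= 1 else 0

def delicious_payoff(user, url, bookmarks):
    """bookmarks: set of (user, url) pairs."""
    return 1 if (user, url) in bookmarks else 0
```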
77 These two networks are structurally different: on Delicious, payoffs depend on users more strongly than they do on Last.fm. [sent-283, score-0.317]
78 1 10 100 1000 ITEM ID 10000 100000 We preprocessed datasets by breaking down the tags into smaller tags made up of single words. [sent-302, score-0.399]
79 In fact, many users tend to create tags like “webdesign tutorial css”. [sent-303, score-0.316]
80 This tag was split into three smaller tags corresponding to the three words therein. [sent-304, score-0.224]
81 More generally, we split all compound tags containing underscores, hyphens, and apostrophes. [sent-305, score-0.252]
82 This makes sense because users create tags independently, and we may have both “rock and roll” and “rock n roll”. [sent-306, score-0.316]
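A sketch of this preprocessing; splitting on whitespace is implied by the "webdesign tutorial css" example, and the separator set otherwise follows the characters named above:

```python
import re

def split_tag(tag):
    """Break a compound tag into single-word tags (whitespace,
    underscores, hyphens, apostrophes)."""
    return [w for w in re.split(r"[\s_\-']+", tag.lower()) if w]

assert split_tag("webdesign tutorial css") == ["webdesign", "tutorial", "css"]
assert split_tag("rock_and-roll") == ["rock", "and", "roll"]
```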
83 In the context of recommender systems, these two datasets may be seen as representatives of two "markets" whose products have significantly different market shares (the well-known dichotomy of hit vs. niche products). [sent-313, score-0.224]
84 Graph noise increases from top to bottom, payoff noise increases from left to right. [sent-348, score-0.366]
85 GOB.Lin is clearly more robust to payoff noise than its competitors. [sent-350, score-0.302]
86 We used all tags associated with a single item to create a TF-IDF context vector that uniquely represents that item, independent of which user the item is proposed to. [sent-358, score-0.629]
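A sketch with scikit-learn on toy tags; any further steps (e.g., dimensionality reduction) used by the authors are not visible in this extract:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

item_tags = {                       # one "document" per item: its split tags
    "item-1": ["webdesign", "tutorial", "css"],
    "item-2": ["rock", "and", "roll"],
}
corpus = [" ".join(tags) for tags in item_tags.values()]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # row k: TF-IDF context vector of item k
```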
87 In 4Cliques we assigned the same unit norm random vector ui to every node in the same clique i of the original graph (before adding graph noise). [sent-363, score-0.514]
88 Payoffs were then generated according to the following stochastic model: a_i(x) = u_i^⊤ x + ε, where ε (the payoff noise) is uniformly distributed in a bounded interval centered around zero. [sent-364, score-0.395]
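A sketch of the 4Cliques payoff generator; the context dimension and noise width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cliques, clique_size, d = 4, 25, 10
U = rng.normal(size=(n_cliques, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # one unit vector per clique
u = np.repeat(U, clique_size, axis=0)           # u_i shared within a clique

def payoff(i, x, noise=0.1):
    # a_i(x) = u_i' x + eps, eps ~ Uniform[-noise, noise]
    return u[i] @ x + rng.uniform(-noise, noise)
```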
89 The context vectors x_{t,1}, …, x_{t,25} in C_{i_t} were generated by picking 24 vectors at random from the dataset and one among those vectors with nonzero payoff for user i_t. [sent-373, score-0.531]
90 This is necessary in order to avoid a meaningless comparison: with high probability, a purely random selection would result in payoffs equal to zero for all the context vectors in Cit . [sent-374, score-0.318]
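A sketch of this context-set construction; pos_items is assumed bookkeeping mapping each user to the indices of items with payoff 1:

```python
import numpy as np

def sample_context_set(i_t, X, pos_items, rng, k=25):
    """k-1 random item vectors plus one with nonzero payoff for user i_t."""
    rand = rng.choice(len(X), size=k - 1, replace=False)
    pos = rng.choice(pos_items[i_t])    # an item the user actually liked
    C = np.vstack([X[rand], X[pos][None]])
    rng.shuffle(C)                      # hide the positive item's position
    return C

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # toy item vectors
pos_items = {0: [3, 7, 42]}             # assumed positives for user 0
C = sample_context_set(0, X, pos_items, rng)
```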
91 We compared GOB.Lin and its variants against two baselines: a baseline LinUCB-IND that runs an independent instance of the algorithm in Figure 1 at each node (this is equivalent to running GOB.Lin with A⊗ set to the identity matrix). [sent-376, score-0.289]
92 [Figure 4 plot: cumulative reward vs. TIME] Figure 4: Cumulative reward for all the bandit algorithms introduced in this section. [sent-389, score-0.506]
93 Table 2 and Figure 4 show the cumulative reward for each algorithm, as compared ("normalized") to that of the random predictor, that is Σ_t (a_t − ā_t), where a_t is the payoff obtained by the algorithm and ā_t is the payoff obtained by the random predictor, i.e., [sent-420, score-0.824]
94 the average payoff over the context vectors available at time t. [sent-422, score-0.374]
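A sketch of this normalization; the per-round payoff logs are assumed bookkeeping:

```python
import numpy as np

def normalized_cumulative_reward(alg_payoffs, context_payoffs):
    """sum_t (a_t - abar_t): a_t is the algorithm's payoff at round t,
    abar_t the mean payoff over that round's context set (random predictor)."""
    abar = np.array([np.mean(p) for p in context_payoffs])
    return np.cumsum(np.asarray(alg_payoffs) - abar)
```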
95 GOB.Lin and LinUCB-SIN are more robust to payoff noise than LinUCB-IND. [sent-424, score-0.302]
96 When the payoff noise is low and the graph noise grows, GOB.Lin … [sent-427, score-0.48]
97 …Last.fm, where many users gave positive payoffs to the same few items. [sent-435, score-0.317]
98 Movie recommendation using random walks over the contextual graph. [sent-478, score-0.28]
99 Personalized recommendation of social software items based on social relations. [sent-534, score-0.447]
100 Multi-armed bandits in the presence of side observations in social networks. [sent-581, score-0.277]
wordName wordTfidf (topN-words)
[('cit', 0.3), ('bandit', 0.274), ('delicious', 0.26), ('mt', 0.258), ('payoff', 0.238), ('reward', 0.232), ('payoffs', 0.182), ('tags', 0.181), ('user', 0.171), ('node', 0.167), ('contextual', 0.16), ('cbt', 0.152), ('bandits', 0.146), ('users', 0.135), ('social', 0.131), ('recommendation', 0.12), ('ui', 0.119), ('cumulative', 0.116), ('graph', 0.114), ('recommender', 0.112), ('dn', 0.108), ('content', 0.106), ('item', 0.101), ('bt', 0.094), ('context', 0.075), ('xt', 0.074), ('nodes', 0.073), ('laplacian', 0.072), ('bookmarked', 0.065), ('linucb', 0.065), ('rdn', 0.065), ('items', 0.065), ('noise', 0.064), ('personalized', 0.063), ('ln', 0.062), ('wt', 0.061), ('vectors', 0.061), ('artists', 0.057), ('listened', 0.057), ('networked', 0.057), ('multitask', 0.057), ('ct', 0.053), ('kt', 0.052), ('network', 0.052), ('contexts', 0.051), ('regret', 0.05), ('edges', 0.049), ('running', 0.044), ('bookmarking', 0.043), ('deleting', 0.043), ('everybody', 0.043), ('hetrec', 0.043), ('onzero', 0.043), ('splitted', 0.043), ('service', 0.042), ('variants', 0.04), ('clustering', 0.039), ('tested', 0.038), ('artist', 0.038), ('hosts', 0.038), ('niche', 0.038), ('nities', 0.038), ('tems', 0.038), ('ai', 0.038), ('baseline', 0.038), ('italy', 0.038), ('datasets', 0.037), ('pseudocode', 0.037), ('vt', 0.036), ('un', 0.036), ('macro', 0.035), ('urls', 0.035), ('id', 0.035), ('contained', 0.034), ('rd', 0.034), ('di', 0.033), ('gang', 0.033), ('milano', 0.033), ('rock', 0.033), ('roll', 0.033), ('studi', 0.033), ('uit', 0.033), ('preferences', 0.033), ('matrix', 0.033), ('created', 0.032), ('degli', 0.031), ('init', 0.031), ('dramatic', 0.031), ('uj', 0.031), ('markets', 0.03), ('web', 0.03), ('rt', 0.03), ('interests', 0.029), ('moderately', 0.029), ('prototype', 0.029), ('cluster', 0.028), ('compound', 0.028), ('multi', 0.028), ('relationships', 0.028), ('block', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
2 0.18397853 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
3 0.17228775 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
4 0.14167567 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
Author: Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Yishay Mansour
Abstract: We consider the partial observability model for multi-armed bandits, introduced by Mannor and Shamir [14]. Our main result is a characterization of regret in the directed observability model in terms of the dominating and independence numbers of the observability graph (which must be accessible before selecting an action). In the undirected case, we show that the learner can achieve optimal regret without even accessing the observability graph before selecting an action. Both results are shown using variants of the Exp3 algorithm operating on the observability graph in a time-efficient manner. 1
5 0.12852246 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
Author: Shunan Zhang, Angela J. Yu
Abstract: How humans achieve long-term goals in an uncertain environment, via repeated trials and noisy observations, is an important problem in cognitive science. We investigate this behavior in the context of a multi-armed bandit task. We compare human behavior to a variety of models that vary in their representational and computational complexity. Our result shows that subjects' choices, on a trial-to-trial basis, are best captured by a "forgetful" Bayesian iterative learning model [21] in combination with a partially myopic decision policy known as Knowledge Gradient [7]. This model accounts for subjects' trial-by-trial choice better than a number of other previously proposed models, including optimal Bayesian learning and risk minimization, ε-greedy and win-stay-lose-shift. It has the added benefit of being closer in performance to the optimal Bayesian model than all the other heuristic models that have the same computational complexity (all are significantly less complex than the optimal model). These results constitute an advancement in the theoretical understanding of how humans negotiate the tension between exploration and exploitation in a noisy, imperfectly known environment. 1
6 0.12518413 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
7 0.12369441 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
8 0.11040115 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
9 0.10986156 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
10 0.10700451 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
11 0.10584713 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
12 0.1029698 137 nips-2013-High-Dimensional Gaussian Process Bandits
13 0.10293508 148 nips-2013-Latent Maximum Margin Clustering
14 0.10264266 85 nips-2013-Deep content-based music recommendation
15 0.10055266 330 nips-2013-Thompson Sampling for 1-Dimensional Exponential Family Bandits
16 0.09911298 162 nips-2013-Learning Trajectory Preferences for Manipulators via Iterative Improvement
17 0.09718968 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
18 0.096358337 230 nips-2013-Online Learning with Costly Features and Labels
19 0.096252166 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
20 0.094680436 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
topicId topicWeight
[(0, 0.236), (1, -0.103), (2, 0.082), (3, -0.088), (4, 0.036), (5, -0.056), (6, 0.099), (7, -0.087), (8, -0.03), (9, -0.016), (10, -0.029), (11, -0.017), (12, 0.132), (13, -0.07), (14, -0.071), (15, 0.061), (16, -0.011), (17, 0.154), (18, -0.002), (19, -0.081), (20, 0.113), (21, -0.059), (22, 0.013), (23, -0.094), (24, 0.048), (25, -0.03), (26, 0.031), (27, 0.008), (28, -0.048), (29, -0.08), (30, -0.004), (31, -0.032), (32, 0.03), (33, 0.056), (34, -0.112), (35, -0.087), (36, -0.001), (37, -0.047), (38, -0.076), (39, -0.01), (40, 0.068), (41, 0.031), (42, 0.107), (43, -0.001), (44, 0.004), (45, 0.075), (46, -0.111), (47, 0.071), (48, 0.016), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.94875628 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
2 0.57237351 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
Author: Xiaoxiao Guo, Satinder Singh, Richard L. Lewis
Abstract: We consider how to transfer knowledge from previous tasks (MDPs) to a current task in long-lived and bounded agents that must solve a sequence of tasks over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent’s behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent’s performance relative to other approaches, including an approach that transfers policies. 1
3 0.53382218 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
Author: Tai Qin, Karl Rohe
Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1
4 0.53283042 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
Author: Dan Russo, Benjamin Van Roy
Abstract: This paper considers the sample complexity of the multi-armed bandit with dependencies among the arms. Some of the most successful algorithms for this problem use the principle of optimism in the face of uncertainty to guide exploration. The clearest example of this is the class of upper confidence bound (UCB) algorithms, but recent work has shown that a simple posterior sampling algorithm, sometimes called Thompson sampling, can be analyzed in the same manner as optimistic approaches. In this paper, we develop a regret bound that holds for both classes of algorithms. This bound applies broadly and can be specialized to many model classes. It depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among action rewards. Compared to UCB algorithm regret bounds for specific model classes, our general bound matches the best available for linear models and is stronger than the best available for generalized linear models. 1
5 0.52685028 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, state-of-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n⁴), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n²(m + log n + δ² + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
6 0.51380199 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
7 0.51012129 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
8 0.48381621 288 nips-2013-Scalable Influence Estimation in Continuous-Time Diffusion Networks
9 0.48086452 124 nips-2013-Forgetful Bayes and myopic planning: Human learning and decision-making in a bandit setting
10 0.4749051 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
11 0.46918356 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
12 0.46677476 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.46572784 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
14 0.45193586 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
15 0.4487721 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
16 0.44256601 253 nips-2013-Prior-free and prior-dependent regret bounds for Thompson Sampling
17 0.43842202 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
18 0.43759379 85 nips-2013-Deep content-based music recommendation
19 0.43757525 196 nips-2013-Modeling Overlapping Communities with Node Popularities
20 0.4362936 338 nips-2013-Two-Target Algorithms for Infinite-Armed Bandits with Bernoulli Rewards
topicId topicWeight
[(16, 0.029), (33, 0.096), (34, 0.08), (41, 0.017), (49, 0.028), (56, 0.128), (70, 0.018), (85, 0.461), (89, 0.022), (93, 0.037), (95, 0.011)]
simIndex simValue paperId paperTitle
1 0.90152651 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
Author: Tai Qin, Karl Rohe
Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1
same-paper 2 0.87313306 7 nips-2013-A Gang of Bandits
Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella
Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure. 1
3 0.80296886 168 nips-2013-Learning to Pass Expectation Propagation Messages
Author: Nicolas Heess, Daniel Tarlow, John Winn
Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex non-Gaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1
4 0.80026734 332 nips-2013-Tracking Time-varying Graphical Structure
Author: Erich Kummerfeld, David Danks
Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1
5 0.79148215 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Author: Bruno Scherrer
Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most n(m − 1)⌈(1/(1 − γ)) log(1/(1 − γ))⌉ = O((nm/(1 − γ)) log(1/(1 − γ))) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n²(m − 1)(1 + (2/(1 − γ)) log(1/(1 − γ))) = O((n²m/(1 − γ)) log(1/(1 − γ))) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions on the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τ_t and the maximal time τ_r to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n²(m − 1)(⌈τ_r log(nτ_r)⌉ + ⌈τ_r log(nτ_t)⌉) + (m − 1)(⌈nτ_t log(nτ_t)⌉ + ⌈nτ_t log(n²τ_t)⌉) = …
6 0.68353343 89 nips-2013-Dimension-Free Exponentiated Gradient
7 0.65497941 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
8 0.6450156 196 nips-2013-Modeling Overlapping Communities with Node Popularities
9 0.62398517 232 nips-2013-Online PCA for Contaminated Data
10 0.61222452 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
11 0.5999617 289 nips-2013-Scalable kernels for graphs with continuous attributes
12 0.59731978 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.59550309 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
14 0.55928618 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
15 0.5562166 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
16 0.55570447 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
17 0.54853374 25 nips-2013-Adaptive Anonymity via $b$-Matching
18 0.53887892 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
19 0.53692776 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
20 0.5340665 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning