nips nips2013 nips2013-159 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kareem Amin, Afshin Rostamizadeh, Umar Syed
Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. We define the natural notion of strategic regret — the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no-(strategic)-regret when the buyer discounts her future surplus — i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting.
Reference: text
sentIndex sentText sentNum sentScore
1 Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. [sent-5, score-1.321]
2 We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. [sent-6, score-0.91]
3 We define the natural notion of strategic regret — the lost revenue as measured against a truthful (non-strategic) buyer. [sent-7, score-0.409]
4 We present seller algorithms that are no-(strategic)-regret when the buyer discounts her future surplus — i.e. [sent-8, score-1.385]
5 the buyer prefers showing advertisements to users sooner rather than later. [sent-10, score-0.805]
6 We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting. [sent-11, score-0.859]
7 When a user visits a web page whose advertising inventory is managed by an ad exchange, a description of the web page, the user, and other relevant properties of the impression, along with a reserve price for the impression, is transmitted to bidding servers operating on behalf of advertisers. [sent-15, score-0.308]
8 Without competitive pressure from other bidders, the task of maximizing the publisher’s revenue falls entirely to the reserve price setting mechanism. [sent-23, score-0.302]
9 The seller offers a price for a good, and a buyer decides whether to accept or reject the price (i.e., whether or not to buy). [sent-25, score-1.528]
10 In each round t, the seller offers a good to a buyer for price pt . [sent-29, score-1.491]
11 If the buyer accepts price pt , the seller receives revenue pt , and the buyer receives surplus vt − pt . [sent-32, score-2.854]
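To make the interaction protocol concrete, here is a minimal sketch (ours, not the paper's) of the repeated posted-price game; the function names and the uniform value distribution are illustrative assumptions.

import random

def run_posted_price(seller_price, buyer_accept, T=100, seed=0):
    # seller_price: maps the history of (price, accept) pairs to the next price.
    # buyer_accept: maps (value, price, round) to an allocation decision in {0, 1}.
    rng = random.Random(seed)
    history, revenue, surplus = [], 0.0, 0.0
    for t in range(1, T + 1):
        v_t = rng.uniform(0.0, 1.0)      # private value v_t ~ D (here uniform)
        p_t = seller_price(history)      # the seller may adapt to past feedback
        a_t = buyer_accept(v_t, p_t, t)  # the buyer accepts (1) or rejects (0)
        revenue += a_t * p_t             # seller revenue for the round
        surplus += a_t * (v_t - p_t)     # buyer surplus for the round
        history.append((p_t, a_t))       # the seller never observes v_t itself
    return revenue, surplus

truthful = lambda v, p, t: int(v >= p)   # a truthful buyer accepts iff v_t >= p_t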
12 Since the same buyer participates in the auction in each round, the seller has the opportunity to learn about the buyer’s value distribution and set prices accordingly. [sent-33, score-1.376]
13 Taken as an online learning problem, we can view this as a ‘bandit’ problem [18, 16], since the revenue for any price not offered is not observed (e.g. [sent-35, score-0.31]
14 even if a buyer rejects a price, she may well have accepted a lower price). [sent-37, score-0.819]
15 Consider the motivations of a buyer interacting with an ad exchange where the prices are set by a no-regret algorithm, and suppose for simplicity that the buyer has a fixed value vt = v for all t. [sent-41, score-1.823]
16 The goal of the buyer is to acquire the most valuable advertising inventory for the least total cost, i.e. [sent-42, score-0.856]
17 to maximize her total surplus Σt (v − pt), where the sum is over rounds in which the buyer accepts the seller’s price. [sent-44, score-1.127]
18 A naive buyer might simply accept the seller’s price pt if and only if vt ≥ pt ; a buyer who behaves this way is called truthful. [sent-45, score-2.009]
19 Against a truthful buyer any no-regret algorithm will eventually learn to offer prices pt ≈ v on nearly all rounds. [sent-46, score-1.066]
20 But a more savvy buyer will notice that if she rejects prices in earlier rounds, then she will tend to see lower prices in later rounds. [sent-47, score-1.031]
21 Indeed, suppose the buyer only accepts prices below some small amount ε. [sent-48, score-0.929]
22 If v ≫ ε then the deceptive buyer strategy results in a large gain in total surplus for the buyer, and a large loss in total revenue for the seller, relative to the truthful buyer. [sent-51, score-1.175]
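The gap in this example is easy to reproduce numerically. The sketch below (our own toy simulation, using an illustrative seller that simply cuts its price after each rejection, not an algorithm from the paper) compares a truthful buyer with a deceptive one who only accepts prices below a small threshold ε.

def decreasing_price_seller(history, beta=0.9):
    # Start at price 1 and cut the price by a factor beta after every rejection.
    p = 1.0
    for _, accepted in history:
        if not accepted:
            p *= beta
    return p

def simulate(threshold, v=0.9, T=500, beta=0.9):
    history, revenue, surplus = [], 0.0, 0.0
    for _ in range(T):
        p = decreasing_price_seller(history, beta)
        a = int(p <= threshold)          # buyer accepts only prices <= threshold
        revenue += a * p
        surplus += a * (v - p)
        history.append((p, a))
    return revenue, surplus

rev_truthful, _ = simulate(threshold=0.9)    # truthful: accept iff p <= v
rev_deceived, _ = simulate(threshold=0.05)   # deceptive: accept only p <= eps
# rev_deceived is a small fraction of rev_truthful, while the deceptive
# buyer's per-round surplus is roughly v - eps instead of roughly 0.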
23 In our setting, the seller chooses a learning algorithm for selecting prices and announces this algorithm to the buyer. [sent-54, score-0.55]
24 We assume that the buyer will examine this algorithm and adopt whatever strategy maximizes her expected surplus over all T rounds. [sent-55, score-0.963]
25 We define the seller’s strategic regret to be the difference between his expected revenue and the expected revenue he would have earned if, rather than using his chosen algorithm to set prices, he had instead offered the best fixed price p∗ on all rounds and the buyer had been truthful. [sent-56, score-1.528]
26 As we have seen, this revenue can be much higher than the revenue of the best fixed price in hindsight (in the example above, p∗ = v). [sent-57, score-0.423]
27 We make one further assumption about buyer behavior, which is based on the observation that in many important real-world markets — and particularly in online advertising — sellers are far more willing to wait for revenue than buyers are willing to wait for goods. [sent-59, score-1.018]
28 To model this phenomenon we multiply the buyer’s surplus in each round by a discount factor: If the buyer accepts the seller’s price pt in round t, she receives surplus γt (vt − pt), where {γt} is a nonincreasing sequence contained in the interval (0, 1]. [sent-63, score-1.544]
29 In Section 4 we consider the special case that the buyer has a fixed value vt = v for all rounds t, and give an algorithm with regret at most O(Tγ √T). [sent-67, score-1.047]
30 In Section 6 we prove that this requirement is necessary for no-regret: any seller algorithm has regret at least Ω(Tγ ). [sent-72, score-0.543]
31 That our regret bounds should depend so crucially on Tγ is foreshadowed by the example above, in which a deceptive buyer foregoes surplus in early rounds to obtain even more surplus in later rounds. [sent-74, score-1.276]
32 A buyer with a short horizon Tγ will be unable to execute this strategy, as she will not be capable of bearing the short-term costs required to manipulate the seller. [sent-75, score-0.823]
33 Prior work [9] studies a related problem of setting the reserve price in a second-price auction with multiple (but not repeated) bidders at each round. [sent-78, score-0.372]
34 This is because a new buyer is considered at each time step, and if the seller’s behavior depends only on previous buyers, the setting immediately becomes strategyproof. [sent-82, score-1.256]
35 Contrary to what is studied in these previous theoretical settings, electronic exchanges in practice see the same buyer appearing in multiple auctions and, thus, the buyer has incentive to act strategically. [sent-83, score-1.775]
36 Previous work [1, 13] has shown, as we do in Section 6, that a seller cannot benefit from conditioning prices on past behavior if the buyer is not myopic and can respond strategically. [sent-86, score-1.367]
37 However, in contrast to our work, these results assume that the seller knows the buyer’s value distribution. [sent-87, score-0.437]
38 Further, our game has a particular structure that allows us to design seller algorithms that are much more efficient than generic algorithms for solving repeated games. [sent-90, score-0.476]
39 This resembles the ability, in our setting, of a strategic buyer to adapt to the seller algorithm’s behavior. [sent-97, score-1.347]
40 3 Preliminaries and Model We consider a posted-price model for a single buyer repeatedly purchasing items from a single seller. [sent-100, score-0.818]
41 Associated with the buyer is a fixed distribution D over the interval [0, 1], which is known only to the buyer. [sent-101, score-0.805]
42 On each round t, the buyer receives a value vt ∈ V ⊆ [0, 1] from the distribution D. [sent-102, score-0.92]
43 Finally, the buyer selects an allocation decision at ∈ {0, 1}. [sent-104, score-0.843]
44 On each round t, the buyer receives an instantaneous surplus of at (vt − pt ), and the seller receives an instantaneous revenue of at pt . [sent-105, score-1.766]
45 As is common in mechanism design, we assume that the seller announces his choice of algorithm A in advance. [sent-112, score-0.449]
46 The buyer then selects her allocation strategy in response. [sent-113, score-0.858]
47 In the most general setting, we will consider a buyer whose surplus may be discounted through time. [sent-117, score-0.948]
48 For a choice of T, we will define the effective “time-horizon” for the buyer as Tγ = Σ_{t=1}^T γt. [sent-120, score-0.805]
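For the geometric discounting γt = γ^(t−1) used in Sections 4 and 5, the effective horizon has the closed form Tγ = (1 − γ^T)/(1 − γ) ≤ 1/(1 − γ), so it stays bounded as T grows. A quick numeric check (ours):

def effective_horizon(gamma, T):
    # T_gamma = sum over t = 1..T of gamma**(t - 1)
    return sum(gamma ** (t - 1) for t in range(1, T + 1))

gamma, T = 0.95, 10_000
assert abs(effective_horizon(gamma, T) - (1 - gamma**T) / (1 - gamma)) < 1e-9
# Here T_gamma is about 20, far below sqrt(T) = 100, which is the regime
# where the Monotone algorithm of Section 4 achieves sublinear regret.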
49 We assume that the seller is faced with a strategic buyer who adapts to the choice of A. [sent-122, score-1.347]
50 Thus, let B∗(A, D) be a surplus-maximizing buyer for seller algorithm A and value distribution D. [sent-123, score-1.242]
51 Let p∗ = arg max_{p∈P} p · PrD[v ≥ p], the revenue-maximizing choice of price for a seller that knows the distribution D and simply posts the price p∗ on every round. [sent-126, score-0.693]
52 Against such a pricing strategy, it is in the buyer’s best interest to be truthful, accepting if and only if vt ≥ p∗, and the seller would receive a revenue of T · p∗ · Pr_{v∼D}[v ≥ p∗]. [sent-127, score-0.669]
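Concretely, p∗ can be found by direct search over a finite price set; the following small sketch (ours) estimates Pr[v ≥ p] empirically from samples of D.

def best_fixed_price(prices, value_samples):
    # p* = argmax over p in P of p * Pr[v >= p], with the acceptance
    # probability estimated from a sample drawn from D.
    def expected_revenue(p):
        accept_prob = sum(v >= p for v in value_samples) / len(value_samples)
        return p * accept_prob
    return max(prices, key=expected_revenue)

# For v uniform on [0, 1] the revenue curve is p * (1 - p), so the search
# returns a price near 1/2.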
53 4 Fixed Value Setting In this section we consider the case of a single unknown fixed buyer value, that is V = {v} for some v ∈ (0, 1]. [sent-135, score-0.805]
54 We show that in this setting a very simple pricing algorithm with monotonically decreasing price offerings is able to achieve regret O(Tγ √T) when the buyer discount is γt = γ^(t−1). [sent-136, score-0.964]
55 In the Monotone algorithm, the seller starts at the maximum price of 1, and decreases the price by a factor of β whenever the buyer rejects the price, and otherwise leaves it unchanged. [sent-140, score-1.512]
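A minimal sketch of the Monotone pricing rule just described (the interface is ours; the decrement factor β is the algorithm's single parameter):

class Monotone:
    # Start at the maximum price 1; multiply the price by beta after every
    # rejection, and leave it unchanged after every acceptance.
    def __init__(self, beta):
        self.beta = beta
        self.p = 1.0
    def offer(self):
        return self.p
    def observe(self, accepted):
        if not accepted:
            self.p *= self.beta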
56 Since Monotone is deterministic and the buyer’s value v is fixed, the surplus-maximizing buyer algorithm B∗(Monotone, v) is characterized by a deterministic allocation sequence a∗_{1:T} ∈ {0, 1}^T. [sent-141, score-0.825]
57 The following lemma partially characterizes the optimal buyer allocation sequence. [sent-142, score-0.839]
58 If there are multiple optimal sequences, the buyer can then choose to randomize over the set of optimal sequences. [sent-148, score-0.805]
59 In other words, once a buyer decides to start accepting the offered price at a certain time step, she will keep accepting from that point on. [sent-151, score-0.999]
60 Now, to finish characterizing the optimal allocation sequence, we provide the following lemma, which describes the time steps by which the buyer has certainly begun to accept the offered price. [sent-155, score-0.869]
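Together, these lemmas reduce the buyer's best response against Monotone to a one-dimensional search: reject through round k − 1, driving the price down to β^(k−1), and accept from round k onward. A brute-force sketch (ours) of that search:

def best_switch_round(v, beta, gamma, T):
    # Discounted surplus if the buyer rejects rounds 1..k-1 and accepts from
    # round k onward at the then-frozen price beta**(k - 1).
    def discounted_surplus(k):
        price = beta ** (k - 1)
        return sum(gamma ** (t - 1) * (v - price) for t in range(k, T + 1))
    return max(range(1, T + 1), key=discounted_surplus)

# Stronger discounting pushes the acceptance round earlier: the buyer cannot
# afford to forgo early surplus in order to manipulate the seller.
print(best_switch_round(v=0.5, beta=0.95, gamma=0.99, T=200))  # waits longer
print(best_switch_round(v=0.5, beta=0.95, gamma=0.50, T=200))  # accepts sooner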
61 By Lemmas 1 and 2 we receive no revenue until at most round dβ,γ + 1, and from that round onwards we receive revenue of at least β^(dβ,γ) per round. [sent-164, score-0.342]
62 This corollary shows that it is indeed possible to achieve no-regret against a strategic buyer with an unknown fixed value as long as Tγ = o(√T). [sent-172, score-0.91]
63 That is, the effective buyer horizon must be smaller than the square root of the game’s finite horizon by more than a constant factor. [sent-173, score-0.823]
64 5 Stochastic Value Setting We next give a seller algorithm that attains no-regret when the set of prices P is finite, the buyer’s discount is γt = γ^(t−1), and the buyer’s value vt for each round is drawn from a fixed distribution D that satisfies a certain continuity assumption, detailed below. [sent-174, score-0.67]
65 Let Ap,i = Number of explore rounds in phase i where price p was offered and the buyer accepted. [sent-181, score-1.056]
66 During exploit rounds, the algorithm repeatedly selects the price that realized the greatest revenue during the immediately preceding explore rounds. [sent-186, score-0.334]
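A hedged sketch of this explore/exploit pattern (the phase schedule below is a placeholder of our own, not the paper's exact choice):

def phased_seller(prices, phase_schedule, buyer):
    # phase_schedule: list of (n_explore_per_price, n_exploit) pairs.
    # buyer: callable mapping (price, exploring) to an accept decision in {0, 1}.
    revenue = 0.0
    for n_explore, n_exploit in phase_schedule:
        accepts = {p: 0 for p in prices}
        for p in prices:                     # explore: offer every price
            for _ in range(n_explore):
                a = buyer(p, exploring=True)
                accepts[p] += a
                revenue += a * p
        # A_{p,i} corresponds to accepts[p]; exploit the price with the
        # greatest realized revenue in the explore rounds just finished.
        best = max(prices, key=lambda p: p * accepts[p])
        for _ in range(n_exploit):           # exploit: post that price
            a = buyer(best, exploring=False)
            revenue += a * best
    return revenue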
67 First notice that a strategic buyer has no incentive to lie during exploit rounds (i.e. [sent-187, score-1.059]
68 she will accept any price pt < vt and reject any price pt > vt), since her decisions there do not affect any of her future prices. [sent-189, score-0.61]
69 Thus, the exploit rounds are the time at which the seller can exploit what it has learned from the buyer during exploration. [sent-190, score-1.335]
70 Alternatively, if the buyer has successfully manipulated the seller into offering a low price, we can view the buyer as “exploiting” the seller. [sent-191, score-2.066]
71 During explore rounds, on the other hand, the strategic buyer can benefit by telling lies that cause her to see better prices during the corresponding exploit rounds. [sent-192, score-1.045]
72 However, the value of these lies to the buyer will depend on the fraction of the phase consisting of explore rounds. [sent-193, score-0.837]
73 Taken to the extreme, if the entire phase consists of explore rounds, the buyer is not interested in lying. [sent-194, score-0.837]
74 In general, the more explore rounds there are, the more revenue a lying buyer must sacrifice during them. [sent-195, score-0.986]
75 Otherwise, the buyer is largely indifferent to being offered prices p or p′, but distinguishing between the two prices is essential for the seller during exploit rounds. [sent-201, score-1.484]
76 The lower bound proved in the next section states that, in the worst case, any seller algorithm will incur a regret of at least Ω(T^c). [sent-213, score-0.543]
77 6 Lower Bound In this section we state the main lower bound, which establishes a connection between the regret of any seller algorithm and the buyer’s discounting. [sent-214, score-0.543]
78 Specifically, we prove that the regret of any seller algorithm is Ω(Tγ ). [sent-215, score-0.543]
79 When γt = 1 for all t, i.e., the buyer does not discount her future surplus, our lower bound proves that no-regret seller algorithms do not exist, and thus it is impossible for the seller to take advantage of learned information. [sent-218, score-1.863]
80 For example, consider the seller algorithm that uniformly selects prices pt from [0, 1]. [sent-219, score-0.647]
81 The optimal buyer algorithm is truthful, accepting if and only if pt < vt, since the seller algorithm is non-adaptive and the buyer gains no advantage from being strategic. [sent-220, score-2.229]
82 In such a scenario the seller would quickly learn a good estimate of the value distribution D. [sent-221, score-0.437]
83 What is surprising is that a seller cannot use this information if the buyer does not discount her future surplus. [sent-222, score-1.273]
84 If the seller attempts to leverage information learned through interactions with the buyer, the buyer can react accordingly to negate this advantage. [sent-223, score-1.242]
85 The lower bound further relates regret in the repeated setting to regret in a particular single-shot game between the buyer and the seller. [sent-224, score-1.056]
86 This demonstrates that, against a non-discounted buyer, the seller is no better off in the repeated setting than he would be by repeatedly implementing such a single-shot mechanism (ignoring previous interactions with the buyer). [sent-225, score-0.471]
87 A seller selects a family of distributions S indexed by b ∈ [0, 1], where each Sb is a distribution on [0, 1] × {0, 1}. [sent-229, score-0.455]
88 The family S is revealed to a buyer with private value v ∈ [0, 1]; the buyer then selects a bid b ∈ [0, 1], and (p, a) ∼ Sb is drawn from the corresponding distribution. [sent-230, score-0.827]
89 As usual, the buyer gets a surplus of a(v − p), while the seller enjoys a revenue of ap. [sent-231, score-1.526]
90 We restrict the set of seller strategies to distributions that are incentive compatible and rational. [sent-232, score-0.521]
91 (i.e., any buyer maximizing expected surplus is actually incentivized to play the game). [sent-236, score-0.948]
92 Informally, the Lemma states that the surplus enjoyed by an optimal buyer algorithm would only increase if this surplus were viewed without discounting. [sent-249, score-1.091]
93 For any seller algorithm A, value distribution D, and surplus-maximizing buyer algorithm B∗(A, D), E[ Σ_{t=1}^T γt at(vt − pt) ] ≤ E[ Σ_{t=1}^T at(vt − pt) ]. Notice that if at(vt − pt) ≥ 0 for all t, then Lemma 4 is trivial. [sent-252, score-1.515]
94 This would occur if the buyer only ever accepts prices less than its value (at = 1 only if pt ≤ vt ). [sent-253, score-1.091]
95 However, Lemma 4 is interesting in that it holds for any seller algorithm A. [sent-254, score-0.437]
96 It’s easy to imagine a seller algorithm that incentivizes the buyer to sometimes accept a price pt > vt with the promise that this will generate better prices in the future. [sent-255, score-1.661]
97 Let A be any seller algorithm for the repeated setting. [sent-261, score-0.458]
98 There exists a buyer value distribution D such that Regret(A, D, T) ≥ (1/12) Tγ. [sent-262, score-0.805]
99 We show that if the buyer values inventory in the present more than in the far future, no-regret (with respect to revenue gained against a truthful buyer) learning is possible. [sent-284, score-1.021]
100 Future directions include studying buyer behavior under weaker, polynomial discounting rates, as well as understanding when existing “off-the-shelf” bandit algorithms (UCB or EXP3), perhaps with slight modifications, are able to perform well against strategic buyers. [sent-287, score-0.924]
[(1, 0.067), (2, 0.029), (12, 0.301), (16, 0.022), (33, 0.101), (34, 0.045), (36, 0.014), (41, 0.024), (49, 0.025), (56, 0.087), (70, 0.025), (85, 0.039), (89, 0.017), (91, 0.041), (93, 0.022), (95, 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.70743895 159 nips-2013-Learning Prices for Repeated Auctions with Strategic Buyers
Author: Kareem Amin, Afshin Rostamizadeh, Umar Syed
Abstract: Inspired by real-time ad exchanges for online display advertising, we consider the problem of inferring a buyer’s value distribution for a good when the buyer is repeatedly interacting with a seller through a posted-price mechanism. We model the buyer as a strategic agent, whose goal is to maximize her long-term surplus, and we are interested in mechanisms that maximize the seller’s long-term revenue. We define the natural notion of strategic regret — the lost revenue as measured against a truthful (non-strategic) buyer. We present seller algorithms that are no(strategic)-regret when the buyer discounts her future surplus — i.e. the buyer prefers showing advertisements to users sooner rather than later. We also give a lower bound on strategic regret that increases as the buyer’s discounting weakens and shows, in particular, that any seller algorithm will suffer linear strategic regret if there is no discounting. 1
2 0.61003143 340 nips-2013-Understanding variable importances in forests of randomized trees
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. 
Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 
1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. 
For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. 
Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). 
To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way. First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are 5 never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm ; Y |B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance. To make things clearer, let us consider a simple example. Let X1 perfectly explains Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1 ; Y ) ≈ I(X2 ; Y ), I(X1 ; Y |X2 ) = and I(X2 ; Y |X1 ) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2 : 1 1 1 Imp(X1 ) = I(X1 ; Y ) + I(X1 ; Y |X2 ) = I(X1 ; Y ) + 2 2 2 2 1 1 1 Imp(X2 ) = I(X2 ; Y ) + I(X2 ; Y |X1 ) = I(X2 ; Y ) + 0 2 2 2 In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y , all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, ImpK=2 (X1 ) = I(X1 ; Y ) and ImpK=2 (X2 ) = I(X2 ; Y |X1 ) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1 as if X2 were irrelevant, while it is only a bit less informative than X1 . As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1 , X2 } and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that ImpK=2 (X2 ) now includes I(X2 ; Y ) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. 
In conclusion, the importances as computed with totally randomized trees exhibit properties that neither Random Forests nor Extra-Trees possess by extension. With totally randomized trees, the importance of Xm depends only on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, this may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables nevertheless remains a sound strategy. In such a case, I(Xm; Y | B) cannot be measured for every Xm and every B. It is therefore pragmatic to promote the variables that look the most promising – even if the resulting importances may be biased.

6 Illustration on a digit recognition problem

In this section, we consider the digit recognition problem of (Breiman et al., 1984) to illustrate variable importances as computed with totally randomized trees. We verify that they match our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K.

Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1, ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1.

[Figure 1: 7-segment display.]

Table 1: Values of Y, X1, ..., X7

 y   x1  x2  x3  x4  x5  x6  x7
 0    1   1   1   0   1   1   1
 1    0   0   1   0   0   1   0
 2    1   0   1   1   1   0   1
 3    1   0   1   1   0   1   1
 4    0   1   1   1   0   1   0
 5    1   1   0   1   0   1   1
 6    1   1   0   1   1   1   1
 7    1   0   1   0   0   1   0
 8    1   1   1   1   1   1   1
 9    1   1   1   1   0   1   1

Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances.

Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1, ..., X7; Y) = H(Y) = log2(10) = 3.321, since the inputs allow the output to be perfectly predicted.

       Eqn. 3  K=1    K=2    K=3    K=4    K=5    K=6    K=7
 X1    0.412   0.414  0.362  0.327  0.309  0.304  0.305  0.306
 X2    0.581   0.583  0.663  0.715  0.757  0.787  0.801  0.799
 X3    0.531   0.532  0.512  0.496  0.489  0.483  0.475  0.475
 X4    0.542   0.543  0.525  0.484  0.445  0.414  0.409  0.412
 X5    0.656   0.658  0.731  0.778  0.810  0.827  0.831  0.835
 X6    0.225   0.221  0.140  0.126  0.122  0.122  0.121  0.120
 X7    0.372   0.368  0.385  0.392  0.387  0.382  0.375  0.372
 Sum   3.321   3.321  3.321  3.321  3.321  3.321  3.321  3.321

To verify our theoretical developments, we then compare in Table 2 the variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y. Such a direct computation of Equation 3 is sketched below.
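Reusing importance_pruned from the sketch in Section 5.2 (with q = p = 7, i.e. Equation 3), the Eqn. 3 column of Table 2 can be recomputed from the 10 equiprobable rows of Table 1. The encoding below is our own transcription of the table, not part of the paper.

```python
# Rows of Table 1: (x1, ..., x7) for y = 0, ..., 9; Y is stored last.
SEGMENTS = [
    (1, 1, 1, 0, 1, 1, 1),  # 0
    (0, 0, 1, 0, 0, 1, 0),  # 1
    (1, 0, 1, 1, 1, 0, 1),  # 2
    (1, 0, 1, 1, 0, 1, 1),  # 3
    (0, 1, 1, 1, 0, 1, 0),  # 4
    (1, 1, 0, 1, 0, 1, 1),  # 5
    (1, 1, 0, 1, 1, 1, 1),  # 6
    (1, 0, 1, 0, 0, 1, 0),  # 7
    (1, 1, 1, 1, 1, 1, 1),  # 8
    (1, 1, 1, 1, 0, 1, 1),  # 9
]
rows = [seg + (y,) for y, seg in enumerate(SEGMENTS)]
p = 7

imps = [importance_pruned(rows, m, y_idx=p, p=p, q=p) for m in range(p)]
for m, imp in enumerate(imps, start=1):
    print(f"Imp(X{m}) = {imp:.3f}")    # should match the Eqn. 3 column
print(f"sum = {sum(imps):.3f}")        # should give log2(10) = 3.321
```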
As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. The small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. The table also shows that the importances indeed add up to I(X1, ..., X7; Y), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all the others, followed – in that order – by X2, X4 and X3, which also show large importances. X1, X7 and X6 appear to be the least informative.

The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1, X3, X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which confirms that importances are affected differently.

To better understand why a variable is important, it is also insightful to look at the decomposition of its importance into its p sub-importance terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one of the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds to the importance of Xm when conditioned with k other variables, i.e., to the term \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y \mid B) in Equation 3 in the case of totally randomized trees.

[Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones (left panel: K = 1; right panel: K = 7; rows X1–X7, columns k = 0–6). At K = 1, all I(Xm; Y | B) terms are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects.]

In the left plot, for K = 1, the figure first illustrates how the importances yielded by totally randomized trees decompose along the degrees k of the interaction terms. We can observe that each degree contributes (at most) equally to the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned on any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why X6 is not important: neither alone nor combined with others does X6 appear to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned on a couple of other variables (the first terms are significantly larger than 0), but becomes uninformative when conditioned on many others (the last terms are closer to 0).

The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is the masking effects. Some of the I(Xm; Y | B) terms are indeed clearly never encountered, and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all the others are null, which means that only those two variables are ever selected at the root node, hence masking the others.
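For completeness, the degree-k sub-importances plotted in the left panel of Figure 2 (the K = 1 case) are just the individual summands of Equation 3 and can be recomputed with the helpers defined above; the helper name below and the reuse of rows, p, cond_mutual_info, combinations and comb from the earlier sketches are our own choices.

```python
def sub_importance(rows, m, y_idx, p, k):
    """Degree-k term of Equation 3: I(X_m; Y | B) averaged over all
    conditioning sets B of size k, weighted by 1 / (C_p^k (p - k))."""
    others = [j for j in range(p) if j != m]
    return sum(cond_mutual_info(rows, m, y_idx, B)
               for B in combinations(others, k)) / (comb(p, k) * (p - k))

# One row of the K = 1 panel per variable, one column per degree k.
for m in range(p):
    terms = [sub_importance(rows, m, p, p, k) for k in range(p)]
    print(f"X{m + 1}: " + "  ".join(f"{t:.3f}" for t in terms))
```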
As a consequence of these masking effects, the importances of the remaining variables are biased: they only account for the relevance of those variables when conditioned on X2 or X5, but not for their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2; Y) (resp. I(X5; Y)), since X2 (resp. X5) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned on many others. All in all, this figure thus confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances.

7 Conclusions

In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: the importance of a variable is equal to zero if and only if the variable is irrelevant, and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6).

There remain several limitations to our framework that we would like to address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split into only two subsets, which means that any variable may then appear one or several times within a branch; this should make variable importances dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, the results presented in this work are valid in an asymptotic setting only. An important direction of future work is the characterization of the distribution of variable importances in a finite setting.

Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office.

References

Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 13:1063–1095.

Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.1. Statistics Department, University of California, Berkeley, CA, USA.

Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.

Ho, T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844.

Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537.

Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324.

Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3):18–22.

Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590.

Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307.

Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25.

White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329.

Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics.