nips nips2010 nips2010-269 knowledge-graph by maker-knowledge-mining

269 nips-2010-Throttling Poisson Processes


Source: pdf

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. [sent-4, score-0.585]

2 We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. [sent-5, score-0.67]

3 The optimization criterion is allowed to depend on the rate of decision outcomes within a time interval; the criterion is not necessarily a sum of a loss function over individual decisions. [sent-11, score-0.488]

4 The problems that we study cannot adequately be modeled as Markov or semi-Markov decision problems because the probability of transitioning from any value of decision rates to any other value depends on the exact points in time at which each event occurred in the past. [sent-12, score-0.291]

5 Encoding the entire backlog of time stamps in the state of a Markov process would lead to an unwieldy formalism. [sent-13, score-0.302]

6 A prevention system has to meet each request with a decision either to suppress it or to allow it to be processed by the service provider. [sent-17, score-0.242]

7 Only when the rate of passed abusive requests exceeds a certain capacity does the service become unavailable and costs are incurred. [sent-20, score-0.443]

8 Any email service provider has to deal with a certain fraction of accounts that are set up to disseminate phishing messages and email spam. [sent-22, score-0.483]

9 But if the rate of spam messages that an outbound email server discharges triggers alerting mechanisms of other providers, then that outbound server will become blacklisted and the service is disrupted. [sent-24, score-1.261]

10 Let x denote a sequence of decision events x1, …. [sent-26, score-0.232]

11 Sequence t denotes the time stamps ti ∈ R+ of the decision events with ti < ti+1 . [sent-30, score-1.138]

12 In our application, an episode corresponds to the sequence of emails sent within an observation interval from a legitimate (y = −1) or abusive (y = +1) account e. [sent-32, score-0.414]

13 We write xi and ti to denote the initial subsequences consisting of the first i elements of x and t, respectively. [sent-33, score-0.445]

14 Let A = {−1, +1} be a binary decision set, where +1 corresponds to suppressing an event and −1 corresponds to passing it. [sent-35, score-0.233]

15 The decision model π gets to make a decision π (xi , ti ) ∈ A at each point in time ti at which an event occurs. [sent-36, score-1.037]

16 The outbound rate rπ (t′ |x, t) at time t′ for episode e and decision model π is a crucial concept. [sent-37, score-0.63]

17 It is therefore defined as rπ(t′|x, t) = |{i : π(xi, ti) = −1 ∧ ti ∈ [t′ − τ, t′)}|. [sent-39, score-0.746]
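As an illustration of this definition, here is a minimal Python sketch of the sliding-window rate computation; the function name and the representation of past decisions as parallel lists are our own conventions, not taken from the paper.

```python
def outbound_rate(t_prime, event_times, decisions, tau):
    """Count events passed by the policy (decision == -1) whose time stamp
    falls in the sliding window [t_prime - tau, t_prime)."""
    return sum(1 for t_i, a_i in zip(event_times, decisions)
               if a_i == -1 and t_prime - tau <= t_i < t_prime)
```

For example, with tau = 3600 this counts the messages passed during the last hour before t_prime.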

18 In outbound spam throttling, τ corresponds to the time interval that is used by other providers to estimate the incoming spam rate. [sent-40, score-0.741]

19 Additionally, the rate-based loss λ : Y × R+ → R+ is the loss that accrues per unit of time. [sent-42, score-0.418]

20 We require λ to be a convex, monotonically increasing function in the outbound rate for y = +1 and to be 0 otherwise. [sent-43, score-0.435]

21 The rate-based loss reflects the risk of the service getting blacklisted based on the current sending behaviour. [sent-44, score-0.42]

22 This risk grows with the rate of spam messages discharged and with the duration over which a high sending rate of spam messages is maintained. [sent-45, score-0.863]

23 First, a rate parameter ρ, label y, and the sequence of instances x, are drawn from a joint distribution p(x, ρ, y). [sent-54, score-0.225]

24 The expected loss of decision model π is taken over all input sequences x, rate parameter ρ, label y, and over all possible sequences of time stamps t that can be generated according to the Poisson process. [sent-56, score-0.794]

25 Derivation of Empirical Loss: In deriving the empirical counterpart of the expected loss, we want to exploit our assumption that time stamps are generated by a Poisson process with unknown but fixed rate parameter. [sent-58, score-0.419]

26 Equation 4 introduces the observed sequence of time stamps t′ into Equation 3 and uses the fact that the rate parameter ρ is independent of x and y given t′. [sent-60, score-0.394]

27 Equation 5 rearranges the terms, and Equation 6 writes the central integral as a conditional expected value of the loss given the rate ρ. [sent-61, score-0.391]

28 The regularized empirical risk functional is Ê_{x,t,y}[L(π; x, t, y)] = (1/m) Σ_{e=1..m} Et[L(π; xe, t, ye) | ρ∗_e] + ηΩ(π), with ρ∗_e = argmax_ρ p(ρ|te) (Equation 8). Minimizing this risk functional is the basis of the learning procedure in the next section. [sent-65, score-0.489]
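The excerpt does not spell out how ρ∗_e = argmax_ρ p(ρ|te) is computed; assuming a flat prior over ρ, the maximizer coincides with the maximum-likelihood estimate of a homogeneous Poisson rate, which the following sketch uses (the uniform-prior assumption and the function name are ours).

```python
def estimate_rate(event_times, t_start, t_end):
    """ML estimate of a homogeneous Poisson rate: number of observed events
    divided by the length of the observation interval [t_start, t_end]."""
    duration = t_end - t_start
    if duration <= 0:
        raise ValueError("observation interval must have positive length")
    return len(event_times) / duration
```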

29 However, as we will see in Section 4, the λ-dependent loss makes the task of learning a decision function hard to solve; attributing to individual decisions their “fair share” of the rate loss, and thus estimating the cost of each decision, is problematic. [sent-67, score-0.493]

30 The Erlang learning model of Section 3 employs a decision function that allows the rate loss to be factorized naturally. [sent-68, score-0.463]

31 3 Erlang Learning Model: In the following, we derive an optimization problem that is based on modeling the policy as a data-dependent rate limit. [sent-69, score-0.371]

32 This allows us to apply a result from queuing theory and approximate the empirical risk functional of Equation (8) with a convex upper bound. [sent-70, score-0.253]

33 We define the decision model π in terms of the function fθ(xi, ti) = exp(θᵀϕ(xi, ti)), which sets a limit on the admissible rate of events, where ϕ is some feature mapping of the initial sequence (xi, ti) and θ is a parameter vector. [sent-71, score-1.444]

34 The throttling model is defined as π(xi, ti) = −1 (“allow”) if rπ(ti|xi, ti) + 1 ≤ fθ(xi, ti), and π(xi, ti) = +1 (“suppress”) otherwise (Equation 9). [sent-72, score-1.253]

35 The decision model blocks event xi if the number of instances that were sent within [ti − τ, ti), plus the current instance, would exceed the rate limit fθ(xi, ti). [sent-73, score-1.212]
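The following Python sketch plays the throttling model of Equation 9 forward over one episode; numpy is used only for the inner product, and the feature map phi is a placeholder for the attributes described in Section 5, so treat it as an illustration rather than the authors' implementation.

```python
import numpy as np

def run_throttling_policy(theta, phi, x, times, tau):
    """Apply the throttling model (Equation 9) sequentially to one episode.

    x, times -- event contents and their time stamps (sorted increasingly)
    phi(x_prefix, t_prefix) -- placeholder feature map of the initial sequence
    Returns a list of decisions, -1 ("allow") or +1 ("suppress")."""
    decisions, allowed_times = [], []
    for i, t_i in enumerate(times):
        # outbound rate so far: allowed events inside the window [t_i - tau, t_i)
        r = sum(1 for t in allowed_times if t_i - tau <= t < t_i)
        f = np.exp(theta @ phi(x[:i + 1], times[:i + 1]))  # data-dependent rate limit
        if r + 1 <= f:
            decisions.append(-1)
            allowed_times.append(t_i)
        else:
            decisions.append(+1)
    return decisions
```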

36 To this end, we first decompose the expected loss of an input sequence given the rate parameter in Equation 8 into immediate and rate-dependent loss terms. [sent-75, score-0.669]

37 Note that te denotes the observed training sequence, whereas t serves as the expectation variable for the expectation Et[·|ρ∗_e] over all sequences conditional on the Poisson process rate parameter ρ∗_e, as in Equation 8. [sent-76, score-0.451]

38 Equation 11 exploits the fact that only decisions against the correct label, π(xe_i, ti) ≠ ye, incur a positive loss ℓ(y, π(xe_i, ti)). [sent-78, score-0.998]

39 We will first derive a convex approximation of the expected rate-based loss Et[ ∫_{t1}^{te_n + τ} λ(ye, rπ(t′|xe, t)) dt′ | ρ∗_e ] (left side of Equation 11). [sent-79, score-0.294]

40 Our definition of the decision model allows us to factorize the expected rate-based loss into contributions of individual rate-limit decisions. [sent-80, score-0.543]

41 Since the outbound rate rπ increases only at decision points ti , we can upper-bound its value with the value immediately after the most recent decision in Equation 12. [sent-82, score-0.982]

42 Equation 13 approximates the actual outbound rate with the rate limit given by fθ(xe_i, te_i). [sent-83, score-0.804]

43 This is reasonable because the outbound rate depends on the policy decisions, which are defined in terms of the rate limit. [sent-84, score-0.809]

44 Because t is generated by a Poisson process, Et[ti+1 − ti | ρ∗] = 1/ρ∗ (Equation 14). [sent-85, score-0.394]

45 We will now derive a closed-form approximation of Et[δ(π(xe_i, ti) ≠ ye) | ρ∗_e], the second part of the loss functional in Equation 11. [sent-87, score-0.608]

46 Queuing theory provides a convex approximation: The Erlang-B formula [5] gives the probability that a queuing process which maintains a constant rate limit of f within a time interval of τ will block an event when events are generated by a Poisson process with given rate parameter ρ. [sent-88, score-0.848]

47 Fortet’s formula (Equation 15) generalizes the Erlang-B formula to non-integer rate limits. [sent-89, score-0.254]

48 The formula requires a constant rate limit, so that the process can reach an equilibrium. [sent-93, score-0.228]

49 In our model, the rate limit fθ (xi , ti ) is a function of the sequences xi and ti until instance xi , and Fortet’s formula therefore serves as an approximation. [sent-94, score-1.132]

50 Et[δ(π(xe_i, ti) = 1) | ρ∗_e] ≈ B(fθ(xe_i, te_i), ρ∗_e τ) (Equation 16), where B(f, ρ∗_e τ) = [ ∫_0^∞ e^(−z) (1 + z/(ρ∗_e τ))^f dz ]^(−1) with f = fθ(xe_i, te_i) (Equation 17). Unfortunately, Equation 17 is not convex in θ. [sent-95, score-0.641]
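A minimal numerical sketch of Fortet's formula in Equation 17; scipy's quad handles the improper integral, and the function name is ours.

```python
import numpy as np
from scipy.integrate import quad

def fortet_blocking_prob(f, offered_load):
    """Blocking probability B(f, rho*tau) of Equation 17 for a (possibly
    non-integer) rate limit f and offered load rho*tau."""
    integral, _ = quad(lambda z: np.exp(-z) * (1.0 + z / offered_load) ** f,
                       0.0, np.inf)
    return 1.0 / integral
```

For integer limits this reproduces the classical Erlang-B values, e.g. fortet_blocking_prob(1.0, 2.0) ≈ 2/3.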

51 We approximate it with the convex upper bound −log(1 − B(fθ(xe_i, te_i), ρ∗_e τ)) (cf. [sent-96, score-0.27]

52 Likewise, Et[δ(π(xe_i, ti) = −1) | ρ∗_e] is approximated by the upper bound −log(B(fθ(xe_i, te_i), ρ∗_e τ)). [sent-99, score-0.578]

53 We have thus derived a convex upper bound of Et[δ(π(xe_i, ti) ≠ ye) | ρ∗_e]. [sent-100, score-0.464]

54 Combining the two components of the optimization goal (Equation 11) and adding the convex regularizer Ω(θ) with regularization parameter η > 0 (Equation 8), we arrive at an optimization problem for finding the optimal policy parameters θ. [sent-101, score-0.343]
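To make the structure of the resulting optimization problem concrete, here is a heavily simplified sketch that assembles the pieces derived above: the 1/ρ-weighted rate loss of Equations 12-14, the −log bounds of Equations 16-17, and an L2 regularizer. The quadratic form of λ, the per-event feature arrays, and the use of scipy.optimize.minimize are our own assumptions; the excerpt does not reproduce the authors' exact optimization problem.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

def blocking_prob(f, offered_load):
    integral, _ = quad(lambda z: np.exp(-z) * (1.0 + z / offered_load) ** f,
                       0.0, np.inf)
    return 1.0 / integral

def erlang_objective(theta, episodes, tau, c_plus, c_minus, c_lam, eta):
    """Sketch of the regularized, convexified risk: episodes is a list of
    (Phi_e, y_e, rho_e) with Phi_e an (n_e, d) array of per-event features."""
    total = 0.0
    for Phi_e, y_e, rho_e in episodes:
        for phi_i in Phi_e:
            f = np.exp(theta @ phi_i)                   # rate limit f_theta
            B = blocking_prob(f, rho_e * tau)
            if y_e == +1:                               # abusive account
                total += c_lam * f ** 2 / rho_e         # approx. rate-based loss
                total += c_plus * (-np.log(B))          # bound on passing an event
            else:                                       # legitimate account
                total += c_minus * (-np.log(1.0 - B))   # bound on suppressing an event
    return total / len(episodes) + eta * float(theta @ theta)

# Usage sketch (d is the feature dimension):
# result = minimize(erlang_objective, x0=np.zeros(d),
#                   args=(episodes, tau, c_plus, c_minus, c_lam, eta))
```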

55 4 Prior Work and Reference Methods: We will now discuss how the problem of minimizing the expected loss, π∗ = argmin_π Ex,t,y[L(π; x, t, y)], from a sample of sequences x of events with labels y and observed rate parameters ρ∗ relates to previously studied methods. [sent-117, score-0.312]

56 Sequential decision-making problems are commonly solved by reinforcement learning approaches, which have to attribute the loss of an episode (Equation 2) to individual decisions in order to learn to decide optimally in each state. [sent-118, score-0.387]

57 Thus, a crucial part of defining an appropriate procedure for learning the optimal policy consists in defining an appropriate state-action loss function. [sent-119, score-0.4]

58 Qπ (s, a) estimates the loss of performing action a in state s when following policy π for the rest of the episode. [sent-120, score-0.4]

59 For example, policy gradient methods such as in [4] assign the loss of an episode to individual decisions proportional to the log-probabilities of the decisions. [sent-122, score-0.586]

60 Other approaches use sampled estimates of the rest of the episode Q(si , ai ) = L(π, s) − L(π, si ) or the expected loss if a distribution of states of the episode is known [7]. [sent-123, score-0.513]

61 Assigning the cumulative loss of the episode to all instances leads to a grave distortion of the optimization criterion. [sent-126, score-0.345]

62 As a reference in our experiments, we use a state-action loss function that assigns the immediate loss ℓ(y, ai) to state si only. [sent-127, score-0.584]

63 Decision ai determines the loss incurred by λ only for τ time units, in the interval [ti, ti + τ). [sent-128, score-0.671]

64 The corresponding rate loss is ∫_{ti}^{ti+τ} λ(y, rπ(t′|x, t)) dt′. [sent-129, score-0.34]

65 Thus, the loss of deciding ai = −1 instead of ai = +1 is the difference in the corresponding λ-induced loss. [sent-130, score-0.288]

66 This leads to a state-action loss function that is the sum of immediate loss and λ-induced loss; it serves as our first baseline. [sent-132, score-0.485]
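A sketch of this baseline state-action loss for a spam episode, assuming the quadratic λ used later in Section 5 and approximating the integral over [ti, ti + τ) on a regular grid; the grid approximation and the function signature are our simplifications, not the authors' code.

```python
import numpy as np

def baseline_state_action_loss(i, y, decisions, times, tau, c_lam, ell, n_grid=200):
    """Immediate loss ell(y, a_i) plus the lambda-induced rate loss accumulated
    over [t_i, t_i + tau), with the outbound rate induced by the recorded decisions."""
    t_i = times[i]
    loss = ell(y, decisions[i])
    if y != +1:
        return loss                                    # rate-based loss is 0 for y = -1
    passed = [t for t, a in zip(times, decisions) if a == -1]
    grid = np.linspace(t_i, t_i + tau, n_grid, endpoint=False)
    rates = np.array([sum(1 for t in passed if g - tau <= t < g) for g in grid], float)
    loss += c_lam * np.mean(rates ** 2) * tau          # Riemann approximation of the integral
    return loss
```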

67 Since the loss crucially depends on the outbound rate rπ(t′|x, t), any throttling model must have access to the current outbound rate. [sent-135, score-1.012]

68 The transition between a current and a subsequent rate depends on the time at which the next event occurs, but also on the entire backlog of events, because past events may drop out of the interval τ at any time. [sent-136, score-0.455]

69 In analogy to the information that is available to the Erlang learning model, it is natural to encode states si as a vector of features ϕ(xi , ti ) (see Section 5 for details) together with the current outbound rate rπ (ti |x, t). [sent-137, score-0.839]

70 Given a representation of the state and a state-action loss function, different approaches for defining the policy π and optimizing its parameters have been investigated. [sent-138, score-0.4]

71 Policy gradient methods model a stochastic policy directly as a parameterized decision function. [sent-141, score-0.335]

72 The gradient of the expected loss with respect to the parameters is estimated in each iteration k for the distribution over episodes, states, and losses that the current policy πk induces. [sent-143, score-0.505]

73 We implement two policy gradient algorithms for experimentation that differ only in using Qit and Qub, respectively. [sent-145, score-0.267]

74 Classifiers πt for time step t are learned iteratively on the distribution of states generated by the policy (π0, …). [sent-153, score-0.245]

75 Our derived algorithm iteratively learns a weighted support vector machine (SVM) classifier πk+1 in iteration k+1 on the set of instances and losses Qπk(s, a) that were observed after classifier πk was used as the policy on the training sample. [sent-157, score-0.264]

76 All four procedures iteratively estimate the loss of a policy decision on the data via a state-action loss function and learn a new policy π based on this estimated cost of the decisions. [sent-163, score-0.899]

77 Since the transition distribution in fact depends on the entire backlog of time stamps and the duration over which state si has been maintained, the Markov assumption is violated to some extent in practice. [sent-165, score-0.327]

78 In other words, the λ-based loss is minimized without explicitly estimating the loss of any decisions that are implied by the rate limit. [sent-171, score-0.592]

79 5 Application: The goal of our experiments is to study the relative benefits of the Erlang learning model and the four reference methods over a number of loss functions. [sent-173, score-0.233]

80 The subject of our experimentation is the problem of suppressing spam and phishing messages sent from abusive accounts registered at a large email service provider. [sent-174, score-0.763]

81 We monitor 10,000 randomly selected accounts over two days and label them automatically based on information passed by other email service providers via feedback loops (in most cases triggered by “report spam” buttons). [sent-176, score-0.36]

82 Finally, other attributes quantify the size of the message and the score returned by a content-based spam filter employed by the email service. [sent-182, score-0.349]

83 We implemented the baseline methods that were described in Section 4, namely the iterative SVM methods It-SVMub and It-SVMit and the policy gradient methods PGub and PGit. [sent-183, score-0.286]

84 In our experiments, we assume a cost that is quadratic in the outbound rate. [sent-187, score-0.269]

85 That is, λ(1, rπ(t′|x, t)) = cλ · rπ(t′|x, t)², with cλ > 0 determining the influence of the rate loss on the overall loss. [sent-188, score-0.34]
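For illustration only (the constants are ours, not from the paper): with cλ = 0.1 and an outbound rate of 10 passed spam messages within the window of length τ, the rate-based loss accrues at 0.1 · 10² = 10 per unit of time, whereas halving the rate to 5 reduces it to 2.5, reflecting the quadratic penalty on high outbound spam rates.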

86 We evaluated our method for different costs of incorrectly classified non-spam emails (c−), incorrectly classified spam emails (c+) (see the definition of ℓ in Equation 1), and the rate of outbound spam messages (cλ). [sent-191, score-0.997]

87 Splits were chosen such that there were equally many spam episodes in the training and test sets. [sent-193, score-0.226]

88 Results: Figure 1 shows the resulting average loss of the Erlang learning model and the reference methods. [sent-196, score-0.233]

89 Each of the three plots shows loss versus parameter cλ which determines the influence of the rate loss on the overall loss. [sent-197, score-0.538]

90 We can see in Figure 1 that the Erlang learning model outperforms all baseline methods for larger values of cλ, that is, when the rate-dependent loss has more influence on the overall loss, in two of the three settings. [sent-199, score-0.34]

91 The iterative classifier It-SVMub that uses the approximated state-action loss Qub performs uniformly better than It-SVMit , the iterative SVM method that uses the sampled loss from the previous iteration. [sent-201, score-0.496]

92 Both policy gradient methods perform comparably to the Erlang learning model for smaller values of cλ but deteriorate for larger values. [sent-203, score-0.236]

93 As expected, the iterative SVM and the standard SVM algorithms perform better than the Erlang learning model and the policy gradient models if the influence of the rate-dependent loss is very small. [sent-217, score-0.626]

94 However, for larger cλ the influence of the rate-dependent loss rises and increasingly dominates the immediate classification loss ℓ. [sent-222, score-0.603]

95 Consequently, in those cases, which are the important ones in this real-world application, the better rate-loss estimation of the Erlang learning model leads to better performance than the baselines. [sent-223, score-0.34]

96 The Erlang learning model converged after 44 minutes and the policy gradient methods took approximately 45 minutes. [sent-226, score-0.236]

97 6 Conclusion: We devised a model for sequential decision-making problems in which events are generated by a Poisson process and the loss may depend on the rate of decision outcomes. [sent-228, score-0.588]

98 Using a throttling policy that enforces a data-dependent rate-limit, we were able to factor the loss over single events. [sent-229, score-0.555]

99 Applying a result from queuing theory led us to a closed-form approximation of the immediate event-specific loss under a rate limit set by a policy. [sent-230, score-0.531]

100 It has replaced a procedure of manual deactivation of accounts after inspection triggered by spam reports. [sent-235, score-0.237]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xe', 0.404), ('ti', 0.373), ('erlang', 0.288), ('outbound', 0.269), ('policy', 0.202), ('loss', 0.198), ('te', 0.179), ('spam', 0.177), ('stamps', 0.173), ('rate', 0.142), ('throttling', 0.134), ('poisson', 0.133), ('elm', 0.115), ('fortet', 0.115), ('pgit', 0.115), ('pgub', 0.115), ('email', 0.114), ('service', 0.109), ('decision', 0.099), ('events', 0.098), ('episode', 0.098), ('svm', 0.092), ('queuing', 0.077), ('abusive', 0.077), ('backlog', 0.077), ('tne', 0.077), ('messages', 0.075), ('equation', 0.073), ('event', 0.071), ('immediate', 0.065), ('convex', 0.065), ('suppressing', 0.063), ('emails', 0.062), ('requests', 0.062), ('dxdt', 0.058), ('qit', 0.058), ('qub', 0.058), ('unsuppressed', 0.058), ('formula', 0.056), ('si', 0.055), ('decisions', 0.054), ('legitimate', 0.051), ('providers', 0.051), ('dt', 0.05), ('iterative', 0.05), ('limit', 0.049), ('episodes', 0.049), ('risk', 0.048), ('sent', 0.046), ('interval', 0.045), ('classi', 0.044), ('et', 0.043), ('sequences', 0.041), ('losses', 0.04), ('blacklisted', 0.038), ('phishing', 0.038), ('uence', 0.038), ('convexity', 0.038), ('xi', 0.037), ('functional', 0.037), ('reinforcement', 0.037), ('reference', 0.035), ('sequence', 0.035), ('suppress', 0.034), ('blatt', 0.034), ('server', 0.034), ('gradient', 0.034), ('accounts', 0.033), ('ai', 0.033), ('costs', 0.033), ('expected', 0.031), ('experimentation', 0.031), ('potsdam', 0.031), ('sender', 0.031), ('attributes', 0.031), ('process', 0.03), ('sending', 0.027), ('triggered', 0.027), ('optimization', 0.027), ('message', 0.027), ('label', 0.026), ('upper', 0.026), ('serves', 0.024), ('dz', 0.024), ('er', 0.024), ('monotonically', 0.024), ('panel', 0.024), ('deciding', 0.024), ('factorize', 0.024), ('approximates', 0.023), ('runs', 0.022), ('moving', 0.022), ('time', 0.022), ('regularizer', 0.022), ('instances', 0.022), ('generated', 0.021), ('enforces', 0.021), ('integral', 0.02), ('exceeds', 0.02), ('nm', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 269 nips-2010-Throttling Poisson Processes

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

2 0.17132396 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

3 0.16299625 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

4 0.14331332 208 nips-2010-Policy gradients in linearly-solvable MDPs

Author: Emanuel Todorov

Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.

5 0.1350245 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

Author: Umar Syed, Robert E. Schapire

Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1

6 0.12322611 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

7 0.11700938 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

8 0.11494214 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

9 0.1111659 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

10 0.10788964 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

11 0.093419552 243 nips-2010-Smoothness, Low Noise and Fast Rates

12 0.088730969 134 nips-2010-LSTD with Random Projections

13 0.088219419 227 nips-2010-Rescaling, thinning or complementing? On goodness-of-fit procedures for point process models and Generalized Linear Models

14 0.085856952 282 nips-2010-Variable margin losses for classifier design

15 0.08094722 22 nips-2010-Active Estimation of F-Measures

16 0.078298502 43 nips-2010-Bootstrapping Apprenticeship Learning

17 0.078068703 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

18 0.075583048 191 nips-2010-On the Theory of Learnining with Privileged Information

19 0.075079545 130 nips-2010-Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains

20 0.074578367 38 nips-2010-Batch Bayesian Optimization via Simulation Matching


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, -0.172), (2, 0.012), (3, -0.017), (4, 0.031), (5, 0.056), (6, -0.031), (7, -0.033), (8, 0.017), (9, -0.146), (10, -0.023), (11, -0.005), (12, 0.027), (13, 0.101), (14, -0.041), (15, 0.066), (16, 0.091), (17, 0.066), (18, 0.037), (19, -0.117), (20, 0.035), (21, -0.001), (22, 0.041), (23, -0.016), (24, 0.003), (25, 0.011), (26, 0.049), (27, 0.06), (28, 0.047), (29, 0.082), (30, -0.012), (31, 0.05), (32, 0.045), (33, -0.056), (34, -0.009), (35, -0.015), (36, -0.025), (37, -0.092), (38, -0.008), (39, -0.047), (40, 0.029), (41, 0.009), (42, -0.054), (43, 0.014), (44, 0.083), (45, -0.002), (46, -0.02), (47, 0.009), (48, 0.055), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93384898 269 nips-2010-Throttling Poisson Processes

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

2 0.68078363 152 nips-2010-Learning from Logged Implicit Exploration Data

Author: Alex Strehl, John Langford, Lihong Li, Sham M. Kakade

Abstract: We provide a sound and consistent foundation for the use of nonrandom exploration data in “contextual bandit” or “partially labeled” settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which “offline” data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from Yahoo!.

3 0.66716588 189 nips-2010-On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Author: Tang Jie, Pieter Abbeel

Abstract: Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (i) using the past experience to estimate only the gradient of the expected return U (θ) at the current policy parameterization θ, rather than to obtain a more complete estimate of U (θ), and (ii) using past experience under the current policy only rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines—a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds. 1

4 0.64777458 208 nips-2010-Policy gradients in linearly-solvable MDPs

Author: Emanuel Todorov

Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.

5 0.61848891 290 nips-2010-t-logistic regression

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

6 0.61446321 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement

7 0.60452616 14 nips-2010-A Reduction from Apprenticeship Learning to Classification

8 0.58547944 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration

9 0.58015603 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

10 0.57392669 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks

11 0.55800056 282 nips-2010-Variable margin losses for classifier design

12 0.55724829 38 nips-2010-Batch Bayesian Optimization via Simulation Matching

13 0.54442286 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning

14 0.53922665 64 nips-2010-Distributionally Robust Markov Decision Processes

15 0.52056199 201 nips-2010-PAC-Bayesian Model Selection for Reinforcement Learning

16 0.51535314 228 nips-2010-Reverse Multi-Label Learning

17 0.51384443 202 nips-2010-Parallelized Stochastic Gradient Descent

18 0.5110122 61 nips-2010-Direct Loss Minimization for Structured Prediction

19 0.50898498 255 nips-2010-Static Analysis of Binary Executables Using Structural SVMs

20 0.4861114 243 nips-2010-Smoothness, Low Noise and Fast Rates


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.026), (17, 0.015), (27, 0.05), (30, 0.042), (45, 0.222), (50, 0.051), (52, 0.035), (55, 0.332), (60, 0.019), (77, 0.042), (78, 0.011), (90, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7317642 269 nips-2010-Throttling Poisson Processes

Author: Uwe Dick, Peter Haider, Thomas Vanck, Michael Brückner, Tobias Scheffer

Abstract: We study a setting in which Poisson processes generate sequences of decisionmaking events. The optimization goal is allowed to depend on the rate of decision outcomes; the rate may depend on a potentially long backlog of events and decisions. We model the problem as a Poisson process with a throttling policy that enforces a data-dependent rate limit and reduce the learning problem to a convex optimization problem that can be solved efficiently. This problem setting matches applications in which damage caused by an attacker grows as a function of the rate of unsuppressed hostile events. We report on experiments on abuse detection for an email service. 1

2 0.69012713 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

Author: Kun Gai, Guangyun Chen, Chang-shui Zhang

Abstract: In this paper, we point out that there exist scaling and initialization problems in most existing multiple kernel learning (MKL) approaches, which employ the large margin principle to jointly learn both a kernel and an SVM classifier. The reason is that the margin itself can not well describe how good a kernel is due to the negligence of the scaling. We use the ratio between the margin and the radius of the minimum enclosing ball to measure the goodness of a kernel, and present a new minimization formulation for kernel learning. This formulation is invariant to scalings of learned kernels, and when learning linear combination of basis kernels it is also invariant to scalings of basis kernels and to the types (e.g., L1 or L2 ) of norm constraints on combination coefficients. We establish the differentiability of our formulation, and propose a gradient projection algorithm for kernel learning. Experiments show that our method significantly outperforms both SVM with the uniform combination of basis kernels and other state-of-art MKL approaches. 1

3 0.59744346 282 nips-2010-Variable margin losses for classifier design

Author: Hamed Masnadi-shirazi, Nuno Vasconcelos

Abstract: The problem of controlling the margin of a classifier is studied. A detailed analytical study is presented on how properties of the classification risk, such as its optimal link and minimum risk functions, are related to the shape of the loss, and its margin enforcing properties. It is shown that for a class of risks, denoted canonical risks, asymptotic Bayes consistency is compatible with simple analytical relationships between these functions. These enable a precise characterization of the loss for a popular class of link functions. It is shown that, when the risk is in canonical form and the link is inverse sigmoidal, the margin properties of the loss are determined by a single parameter. Novel families of Bayes consistent loss functions, of variable margin, are derived. These families are then used to design boosting style algorithms with explicit control of the classification margin. The new algorithms generalize well established approaches, such as LogitBoost. Experimental results show that the proposed variable margin losses outperform the fixed margin counterparts used by existing algorithms. Finally, it is shown that best performance can be achieved by cross-validating the margin parameter. 1

4 0.59384954 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers

Author: Leonidas Lefakis, Francois Fleuret

Abstract: The standard strategy for efficient object detection consists of building a cascade composed of several binary classifiers. The detection process takes the form of a lazy evaluation of the conjunction of the responses of these classifiers, and concentrates the computation on difficult parts of the image which cannot be trivially rejected. We introduce a novel algorithm to construct jointly the classifiers of such a cascade, which interprets the response of a classifier as the probability of a positive prediction, and the overall response of the cascade as the probability that all the predictions are positive. From this noisy-AND model, we derive a consistent loss and a Boosting procedure to optimize that global probability on the training set. Such a joint learning allows the individual predictors to focus on a more restricted modeling problem, and improves the performance compared to a standard cascade. We demonstrate the efficiency of this approach on face and pedestrian detection with standard data-sets and comparisons with reference baselines. 1

5 0.59288037 239 nips-2010-Sidestepping Intractable Inference with Structured Ensemble Cascades

Author: David Weiss, Benjamin Sapp, Ben Taskar

Abstract: For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem. 1

6 0.59232986 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification

7 0.59210199 63 nips-2010-Distributed Dual Averaging In Networks

8 0.59165925 290 nips-2010-t-logistic regression

9 0.59154123 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

10 0.5915246 158 nips-2010-Learning via Gaussian Herding

11 0.59112585 1 nips-2010-(RF)^2 -- Random Forest Random Field

12 0.59108478 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

13 0.59063596 177 nips-2010-Multitask Learning without Label Correspondences

14 0.5906086 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models

15 0.59059399 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

16 0.59047508 103 nips-2010-Generating more realistic images using gated MRF's

17 0.5902757 243 nips-2010-Smoothness, Low Noise and Fast Rates

18 0.59018445 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery

19 0.59011191 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect

20 0.59003764 224 nips-2010-Regularized estimation of image statistics by Score Matching