jmlr jmlr2011 jmlr2011-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
Reference: text
sentIndex sentText sentNum sentScore
1 We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. [sent-13, score-0.327]
2 We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. [sent-15, score-0.675]
3 This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. [sent-16, score-0.66]
4 Our challenge is to design a measurement policy which produces fast convergence to the optimal solution, evaluated using the expected objective function after a specified number of iterations. [sent-24, score-0.34]
5 … problem, where the difference is that the number of alternatives |X | is extremely large relative to the measurement budget. [sent-31, score-0.279]
6 We do not make any explicit structural assumptions about θx , but we do assume that we are given an ordered set G and a family of aggregation functions Gg : X → X g , g ∈ G , each of which maps X to a region X g , which is successively smaller than the original set of alternatives. [sent-32, score-0.381]
7 After n observations, we obtain a family of estimates $\mu_x^{g,n}$ of the function at different levels of aggregation, and we form an estimate $\mu_x^n$ of $\theta_x$ using
$$\mu_x^n = \sum_{g \in \mathcal{G}} w_x^{g,n} \mu_x^{g,n}, \qquad (1)$$
where the weights $w_x^{g,n}$ sum to one over all the levels of aggregation for each point x. [sent-34, score-0.446]
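The weighted combination in Equation (1) can be sketched in a few lines. This is an illustrative snippet, not the authors' code; the function and variable names are assumptions.

```python
# Hedged sketch of Equation (1): the point estimate for alternative x is a
# convex combination of its estimates at each aggregation level g = 0..G,
# with weights summing to one over levels.
def combined_estimate(level_estimates, level_weights):
    """level_estimates[g] is mu_x^{g,n}; level_weights[g] is w_x^{g,n}."""
    assert abs(sum(level_weights) - 1.0) < 1e-9, "weights must sum to one"
    return sum(w * m for w, m in zip(level_weights, level_estimates))
```

For example, with three levels and weights (0.5, 0.3, 0.2), estimates (10, 12, 8) combine to 10.0*0.5 + 12.0*0.3 + 8.0*0.2 = 10.2.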
8 The estimates $\mu_x^{g,n}$ at more aggregate levels have lower statistical variance since they are based upon more observations, but exhibit aggregation bias. [sent-35, score-0.512]
9 Our goal is to create a measurement policy π that leads us to find the alternative x that maximizes θx . [sent-38, score-0.403]
10 A number of measurement policies have been proposed for the ranking and selection problem when the number of alternatives is not too large, and where our beliefs about the value of each alternative are independent. [sent-41, score-0.581]
11 (2009) which proposes a policy, the knowledge-gradient policy for correlated beliefs, that exploits correlations in the belief structure, but where these correlations are assumed known. [sent-43, score-0.292]
12 First, we propose a version of the knowledge gradient policy that exploits aggregation structure and similarity between alternatives, without requiring that we specify an explicit covariance matrix for our belief. [sent-45, score-0.599]
13 Our method requires that a family of aggregation functions be provided, but otherwise does not make any specific assumptions about the structure of the function or set of alternatives. [sent-51, score-0.381]
14 In Section 3, we present our model, the aggregation techniques we use, and the Bayesian updating approach. [sent-54, score-0.398]
15 We present our measurement policy in Section 4 and a proof of convergence of this policy in Section 5. [sent-55, score-0.536]
16 We are aware of no work in the online learning community, however, whether considering cumulative value or terminal value, that considers the type of hierarchical aggregation structures that we consider here. [sent-98, score-0.487]
17 (2006) and the knowledge-gradient policy for correlated normal beliefs (KGCB) from Frazier et al. [sent-132, score-0.289]
18 The latter policy is an extension of the knowledge-gradient algorithm in the presence of correlated beliefs, where measuring one alternative updates our belief about other alternatives. [sent-134, score-0.315]
19 Moreover, the presence of categorical dimensions might give a good indication for the aggregation function to be used. [sent-145, score-0.417]
20 There is a separate literature on aggregation and the use of mixtures of estimates. [sent-147, score-0.381]
21 Bertsekas and Castanon (1989) describe adaptive aggregation techniques in the context of dynamic programming, while Bertsekas and Tsitsiklis (1996) provide a good presentation of state aggregation methods used in value iteration. [sent-150, score-0.762]
22 Therefore we define Π to be the set of sampling policies that satisfy the requirement xn ∈ F n and introduce π ∈ Π as a policy that produces a sequence of decisions x0 , . [sent-189, score-0.483]
23 1 Model Specification In this section we describe our statistical model, beginning first by describing the aggregation structure upon which it relies, and then describing our Bayesian prior on the sampling means θx . [sent-198, score-0.416]
24 Throughout these sections we make the following assumptions: (i) we assume independent beliefs across different levels of aggregation, and (ii) we treat two quantities as fixed parameters of our model even though we estimate them using an empirical Bayes approach. [sent-201, score-0.47]
25 Aggregation is performed using a set of aggregation functions Gg : X → X g , where X g represents the gth level of aggregation of the original set X . [sent-204, score-0.819]
26 We denote the set of all aggregation levels by G = {0, 1, . [sent-205, score-0.404]
27 , G}, with g = 0 being the lowest aggregation level (which might be the finest discretization of a continuous set of alternatives), g = G being the highest aggregation level, and G = |G | − 1. [sent-208, score-0.813]
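The family of aggregation functions $G^g$ can be made concrete with a small example. The sketch below assumes the perfect binary tree used later in the paper as the default structure, with $X = \{0, \dots, 2^K - 1\}$; the function names are illustrative, not from the paper's code.

```python
# Illustrative binary-tree aggregation: at level g, each aggregated
# alternative G^g(x) contains 2^g base alternatives; g = 0 is the
# unaggregated level and g = G is the coarsest.
def G(x, g):
    """Map alternative x to its aggregated alternative at level g."""
    return x >> g  # integer division by 2**g

def common_levels(x, xp, G_max):
    """Levels g at which x and xp share the same aggregated alternative,
    i.e. the set G(x, x') from the text (restricted to 0..G_max)."""
    return {g for g in range(G_max + 1) if G(x, g) == G(xp, g)}
```

Under this structure, alternatives 5 and 6 differ at levels 0 and 1 but belong to the same aggregated alternative at levels 2 and above.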
28 The aggregation functions Gg are typically problem specific and involve a certain amount of domain knowledge, but it is possible to define generic forms of aggregation. [sent-209, score-0.381]
29 We introduce the following notation and illustrate its value using the example of Figure 1: G (x, x′ ) Set of all aggregation levels that the alternatives x and x′ have in common, with G (x, x′ ) ⊆ G . [sent-218, score-0.539]
30 X g (x) Set of all alternatives that share the same aggregated alternative Gg (x) at the gth aggregation level, with X g (x) ⊆ X . [sent-220, score-0.686]
31 Figure 1: Example with nine alternatives and three aggregation levels. [sent-222, score-0.516]
32 Given this aggregation structure, we now define our Bayesian model. [sent-223, score-0.381]
33 x x The values θg − θx are independent across different values of g, and between values of x that x differ at aggregation level g, that is, that have different values of G g (x). [sent-231, score-0.41]
34 We suppose that we observe a value yg,n+1 for each aggregation level g ∈ G . [sent-239, score-0.41]
35 Our estimate will be $(\delta_x^{g,n})^2$, where $\delta_x^{g,n}$ is an estimate of the aggregation bias that we define here. [sent-267, score-0.419]
36 At the unaggregated level (g = 0), the aggregation bias is clearly 0, so we set $\delta_x^{0,n} = 0$. [sent-268, score-0.41]
37 When δ > 0, estimates of the aggregation bias are prevented from falling below some minimum threshold, which prevents the algorithm from placing too much weight on a frequently measured aggregate level when estimating the value of an infrequently measured disaggregate level. [sent-270, score-0.535]
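The thresholding of the bias estimate described above can be sketched directly; this is an illustrative reading of the text, with hypothetical names, not the paper's implementation.

```python
# Hedged sketch: the aggregation-bias estimate used in the variance term is
# floored at a user-chosen delta >= 0, so that a frequently measured
# aggregate level cannot receive excessive weight relative to an
# infrequently measured disaggregate level.
def floored_bias(raw_bias, delta):
    """Return the bias magnitude, prevented from falling below delta."""
    return max(abs(raw_bias), delta)
```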
38 In the computation of $(\sigma_x^{g,n,\varepsilon})^2$, the numbers $m_{x'}^{0,n}$ can be regarded as weights: the sum of the bias and measurement variance of the alternative we measured the most contributes the most to the group variance $(\sigma_x^{g,n,\varepsilon})^2$. [sent-284, score-0.333]
39 Given this method of inference, we formally present in the next section the HKG policy for choosing the measurements xn . [sent-291, score-0.317]
40 2 we summarize the knowledge-gradient policy for independent and correlated multivariate normal beliefs as introduced in Frazier et al. [sent-304, score-0.289]
41 We end with an illustration of how the hierarchical knowledge gradient policy chooses its measurements (Section 4. [sent-308, score-0.348]
42 In the case of correlated beliefs, an observation $\hat{y}_x^{n+1}$ of alternative x may change our estimates $\mu_{x'}^{n}$ of alternatives $x' \neq x$. [sent-330, score-0.338]
43 We rewrite the updating Equation (2) as
$$\mu_x^{g,n+1} = \left(\beta_x^{g,n}\mu_x^{g,n} + \beta_x^{g,n,\varepsilon}\,\hat{y}_x^{n+1}\right)\big/\beta_x^{g,n+1} = \mu_x^{g,n} + \frac{\beta_x^{g,n,\varepsilon}}{\beta_x^{g,n} + \beta_x^{g,n,\varepsilon}}\left(\hat{y}_x^{n+1} - \mu_x^{g,n}\right).$$ [sent-352, score-0.299]
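The update above is a standard normal-normal conjugate update written in precision form. A minimal sketch, with illustrative names, at a single aggregation level g:

```python
# Hedged sketch of the precision-weighted update: beta is the current
# precision of the level-g belief, beta_eps the precision of the new
# observation y. The posterior mean is the precision-weighted average.
def update(mu, beta, y, beta_eps):
    beta_new = beta + beta_eps
    mu_new = (beta * mu + beta_eps * y) / beta_new
    return mu_new, beta_new
```

Equivalently, `mu_new = mu + (beta_eps / (beta + beta_eps)) * (y - mu)`, matching the second form of the rewritten equation.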
44 The conditional distribution of $\hat{y}_x^{n+1}$ is $N\!\left(\mu_x^n, (\sigma_x^n)^2 + \lambda_x\right)$, where the variance of $\hat{y}_x^{n+1}$ is given by the measurement noise $\lambda_x$ of the current measurement plus the variance $(\sigma_x^n)^2$ of our belief given by (6). [sent-356, score-0.613]
45 The problem here is that the values $\mu_{x'}^n$ for all alternatives $x' \in X$ are updated whenever they share at least one aggregation level with alternative x, which is to say for all $x'$ for which $\mathcal{G}(x', x)$ is not empty. [sent-360, score-0.608]
46 However, it can be shown that without aggregation levels they coincide (if G = 0, then $\tilde{a}_{x'}^n(x) = \mu_{x'}^{0,n} = \mu_{x'}^n$ and $\tilde{b}_{x'}^n(x) = \sigma_{x'}^{0,n}$). [sent-371, score-0.404]
47 …
$$\sum_{i=1,\dots,M-1} \left(\tilde{b}^n_{i+1}(x) - \tilde{b}^n_i(x)\right) f\!\left(-\left|\frac{\tilde{a}^n_{i+1}(x) - \tilde{a}^n_i(x)}{\tilde{b}^n_{i+1}(x) - \tilde{b}^n_i(x)}\right|\right), \qquad (21)$$
where $\tilde{a}^n_i(x)$ and $\tilde{b}^n_i(x)$ follow from (19) and (20), after the sort and merge operation as described in Section 4. [sent-379, score-0.372]
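The sum in (21) is straightforward to compute once the sequence has been sorted and merged. The sketch below assumes the standard function $f(z) = z\,\Phi(z) + \varphi(z)$ from the knowledge-gradient literature and a sequence with strictly increasing slopes $\tilde{b}_i$; names are illustrative.

```python
import math

def f(z):
    """f(z) = z*Phi(z) + phi(z), with Phi/phi the standard normal cdf/pdf."""
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z * Phi + phi

def kg_value(a, b):
    """Sum over i of (b[i+1]-b[i]) * f(-|(a[i+1]-a[i]) / (b[i+1]-b[i])|),
    assuming b is strictly increasing after the sort/merge step."""
    total = 0.0
    for i in range(len(a) - 1):
        slope = -abs((a[i + 1] - a[i]) / (b[i + 1] - b[i]))
        total += (b[i + 1] - b[i]) * f(slope)
    return total
```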
48 As illustrated in Powell and Frazier (2008), the independent KG policy prefers to measure alternatives with a high mean and/or with a low precision. [sent-387, score-0.331]
49 As an illustration, consider Figure 3, where we use an aggregation structure given by a perfect binary tree (see Section 6. [sent-388, score-0.381]
50 At aggregation level 5, there are four aggregated alternatives. [sent-390, score-0.489]
51 The fifth measurement will be either in an unexplored region one aggregation level lower (aggregation level 4 consisting of eight aggregated alternatives) or at an already explored region that has a high weighted estimate. [sent-392, score-0.69]
52 The same holds for the sixth measurement, which would be either from one of the three remaining unexplored aggregated alternatives from level 4, or from an already explored alternative with a high weighted mean. [sent-394, score-0.423]
53 In this case, HKG chooses to sample from the region 32 < x ≤ 40, which corresponds to an unexplored alternative at aggregation level 3. [sent-395, score-0.518]
54 This implies that the HKG policy learns the true values of every alternative as n → ∞ (Corollary 2) and eventually finds a globally optimal alternative (Corollary 3). [sent-417, score-0.322]
55 Although the posterior inference and the derivation of the HKG policy assumed that samples from alternative x were normal random variables with known variance λx , the theoretical results in this section allow general sampling distributions. [sent-420, score-0.411]
56 Proof Consider what happens as the number of measurements n we make under the HKG policy goes to infinity. [sent-429, score-0.285]
57 In this result, keep in mind that $\hat{x}^N = \arg\max_x \mu_x^N$ is the alternative one would estimate to be best at time N, given all the measurements collected by HKG. [sent-472, score-0.288]
58 To this end, we compare HKG with some well-known competing policies and study the sensitivity of these policies to various problem settings such as the dimensionality and smoothness of the function, and the measurement noise. [sent-486, score-0.547]
59 We also consider a hybrid version of the HKG algorithm (HHKG) in which we only exploit the similarity between alternatives in the updating equations and not in the measurement decision. [sent-496, score-0.314]
60 As a result, this policy uses the measurement decision of IKG and the updating equations of HKG. [sent-497, score-0.357]
61 The possible advantage of this hybrid policy is that it is able to cope with similarity between alternatives without the computational complexity of HKG. [sent-498, score-0.349]
62 Since several of the policies require choosing one or more parameters, we provide a brief description of the implementation of these policies in Appendix D. [sent-499, score-0.384]
63 3 Experimental Settings We consider the following experimental factors: the measurement variance λ, the measurement budget N, and for the HKG policy the aggregation structure. [sent-641, score-0.942]
64 However, for these problems it would not make sense to compare with the tested versions of IE, UCB and BOLTZ since, in the absence of an informed prior, these methods typically choose one measurement of each of the M alternatives before measuring any alternative a second time. [sent-647, score-0.342]
65 To obtain meaningful results for the tested versions of IE, UCB and BOLTZ, we start with an experiment with a relatively large measurement budget and relatively large measurement noise. [sent-652, score-0.311]
66 This is especially true for the multi-dimensional problems, where the number of alternatives after discretization is much larger than the measurement budget. [sent-657, score-0.279]
67 … and the sensitivity of HKG to the aggregation structure. [sent-665, score-0.4]
68 5 and 1, and five different aggregation structures as presented at the end of this subsection. [sent-667, score-0.381]
69 We end this section with an explanation of the aggregation functions used by HKG. [sent-683, score-0.381]
70 Our default aggregation structure is given by a binary tree, that is, |X g (x)| = 2g for all x ∈ X g and g ∈ G . [sent-684, score-0.381]
71 As a result, we have 8 (ln(128)/ ln(2) + 1) aggregation levels for the one-dimensional problems and 6 (ln(32)/ ln(2) + 1) for the two-dimensional problems. [sent-685, score-0.404]
72 For the experiment with varying aggregation functions, we introduce a variable ω to denote the number of alternatives Gg (x), g < G that should be aggregated in a single alternative Gg+1 (x) one aggregation level higher. [sent-686, score-1.068]
73 To evaluate the impact of having a difference in the size of aggregated sets, we introduce a fifth aggregation structure where ω alternately takes values 2 and 4. [sent-690, score-0.46]
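The alternating ω = 2/4 structure described above can be sketched by listing the number of aggregated alternatives at each level; this is an illustrative reconstruction with hypothetical names, not the paper's code.

```python
# Hedged sketch: at each step up the hierarchy, omega alternates between
# the given values, so an aggregated alternative one level higher contains
# either 2 or 4 alternatives from the level below.
def level_sizes(M, omegas=(2, 4)):
    """Number of (aggregated) alternatives at levels g = 0, 1, ... for a
    base set of M alternatives."""
    sizes, i = [M], 0
    while sizes[-1] > 1:
        w = omegas[i % len(omegas)]
        sizes.append(max(1, sizes[-1] // w))
        i += 1
    return sizes
```

For the one-dimensional problems with M = 128, this gives level sizes 128, 64, 16, 8, 2, 1.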
74 At aggregation level 0, we have 25 regions for location and domicile, and 6 capacity types, producing 3750 attribute vectors. [sent-692, score-0.43]
75 2952 H IERARCHICAL K NOWLEDGE G RADIENT FOR S EQUENTIAL S AMPLING At aggregation level 1, we represent the driver domicile as one of 5 areas. [sent-693, score-0.499]
76 At aggregation level 2, we ignore the driver domicile; at aggregation level 3, we ignore capacity type; and at aggregation level 4, we represent location as one of 5 areas. [sent-694, score-1.297]
77 In particular, it outperforms others on functions for which the use of an aggregation function seems to be a natural choice, but it also performs well on problems for which the other policies are specifically designed. [sent-702, score-0.593]
78 1 One-dimensional Functions In our first experiment, we focus on the comparison with R&S; policies using a relatively large measurement budget. [sent-705, score-0.336]
79 To illustrate the sensitivity of the performance of these policies to the number of measurements n, we also provide a graphical illustration in Figure 6. [sent-707, score-0.3]
80 BOLTZ only performs well for a small number of measurements (n ≤ M), after which it underperforms the other policies with the exception of EXPL, which spends an unnecessary portion of its measurements on less attractive alternatives. [sent-711, score-0.392]
81 However, it seems that the number of measurements required for IE to outperform KGCB and HKG increases with increasing measurement variance λ. [sent-713, score-0.287]
82 Apart from the given aggregation function, HKG does not assume any structure and therefore has a slower rate of convergence on these instances. [sent-738, score-0.381]
83 As a final test with one-dimensional functions, we now vary the aggregation structure used by HKG. [sent-869, score-0.381]
84 Obviously, HKG is sensitive to the choice of aggregation structure. [sent-871, score-0.381]
85 The aggregation function with ω = 16 is so coarse that, even at the lowest aggregation level, there exist aggregated alternatives that have local maxima as well as local minima in their aggregated sets. [sent-872, score-1.052]
86 We also see that the performance under the ω = 2/4 structure is close to that of ω = 4, which indicates that having some symmetry in the aggregation function is preferable. [sent-873, score-0.381]
87 When comparing the two figures, we see that the impact of the aggregation function decreases with increasing λ. [sent-874, score-0.381]
88 As a result, the benefit of having more precise lower aggregation levels decreases. [sent-876, score-0.404]
89 Figure 8: Sensitivity of HKG to the aggregation function. [sent-886, score-0.559]
90 However, their influence on the fit (via the aggregation function) is limited since HKG automatically puts a low weight on them. [sent-899, score-0.381]
91 function to this surface, but we do require a family of aggregation functions. [sent-1070, score-0.381]
92 In particular, it outperforms the other policies on functions for which the use of an aggregation function seems to be a natural choice (e. [sent-1080, score-0.593]
93 The limitation of the HKG policy is that it requires a given aggregation structure, which means that we depend on having some insight into the problem. [sent-1083, score-0.577]
94 When this is the case, the ability to capture this knowledge in an aggregation structure is actually a strength, since we can capture the most important features in the highest levels of aggregation. [sent-1084, score-0.404]
95 If we do not have this insight, designing the aggregation functions imposes an additional modeling burden. [sent-1085, score-0.381]
96 First, we observe convergence problems for HKG in the case of low measurement variance, where HKG tends to become too confident about the values of alternatives never measured before. [sent-1087, score-0.351]
97 As a starting point we might create this set by running HKG on a higher aggregation level which has fewer elements. [sent-1093, score-0.41]
98 Future research could derive such bounds, or could create new techniques appropriate for problems with hierarchical aggregation structures that have bounds on their convergence rates. [sent-1098, score-0.405]
99 8: for i = 1 to M − 1 do 9: if $\tilde{b}^n_i(x) = \tilde{b}^n_{i+1}(x)$ then 10: Remove entry i from the sequence $(\tilde{a}^n_i(x), \tilde{b}^n_i(x))_{i=1}^{M}$ 11: end if 12: end for 13: Use Algorithm 1 from Frazier et al. [sent-1105, score-0.279]
100 Proposition 4 The posterior belief on θx given observations up to time n at all aggregation levels is normally distributed with mean and precision
$$\mu_x^n = \frac{1}{\beta_x^n}\left[\beta_x^0 \mu_x^0 + \sum_{g\in\mathcal{G}} \left((\sigma_x^{g,n})^2 + \nu_x^g\right)^{-1} \mu_x^{g,n}\right], \qquad \beta_x^n = \beta_x^0 + \sum_{g\in\mathcal{G}} \left((\sigma_x^{g,n})^2 + \nu_x^g\right)^{-1}.$$ [sent-1107, score-0.498]
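The precision-weighted combination in Proposition 4 can be sketched directly; this is an illustrative snippet with assumed names, where each level g contributes precision $((\sigma_x^{g,n})^2 + \nu_x^g)^{-1}$.

```python
# Hedged sketch of Proposition 4: combine a prior (mu0, beta0) with the
# level-g estimates, each weighted by the inverse of its variance
# (sigma^2) plus the bias term (nu).
def posterior(mu0, beta0, level_mu, level_var, level_nu):
    beta = beta0
    num = beta0 * mu0
    for m, v, nu in zip(level_mu, level_var, level_nu):
        p = 1.0 / (v + nu)   # precision contributed by level g
        beta += p
        num += p * m
    return num / beta, beta
```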
wordName wordTfidf (topN-words)
[('hkg', 0.561), ('aggregation', 0.381), ('sko', 0.27), ('kgcb', 0.263), ('policy', 0.196), ('policies', 0.192), ('measurement', 0.144), ('alternatives', 0.135), ('oc', 0.132), ('frazier', 0.122), ('gg', 0.117), ('hhkg', 0.111), ('equential', 0.104), ('nowledge', 0.104), ('yx', 0.094), ('owell', 0.094), ('razier', 0.094), ('bn', 0.093), ('measurements', 0.089), ('ampling', 0.088), ('aggregated', 0.079), ('ierarchical', 0.079), ('radient', 0.073), ('kg', 0.071), ('camel', 0.069), ('ie', 0.069), ('maxx', 0.068), ('alternative', 0.063), ('boltz', 0.062), ('transportation', 0.059), ('sn', 0.057), ('expl', 0.055), ('ikg', 0.055), ('terminal', 0.055), ('aggregate', 0.054), ('ph', 0.054), ('variance', 0.054), ('opportunity', 0.053), ('limn', 0.051), ('driver', 0.047), ('beliefs', 0.047), ('regret', 0.045), ('posterior', 0.044), ('wx', 0.042), ('domicile', 0.042), ('drivers', 0.042), ('tbranin', 0.042), ('ucb', 0.042), ('shuf', 0.037), ('categorical', 0.036), ('sampling', 0.035), ('branin', 0.035), ('disaggregate', 0.035), ('reward', 0.033), ('xn', 0.032), ('bi', 0.03), ('level', 0.029), ('belief', 0.029), ('truth', 0.028), ('decisions', 0.028), ('gth', 0.028), ('unexplored', 0.028), ('cumulative', 0.027), ('correlated', 0.027), ('wr', 0.026), ('gn', 0.026), ('hierarchical', 0.024), ('bubeck', 0.024), ('levels', 0.023), ('powell', 0.023), ('budget', 0.023), ('outperformed', 0.022), ('exception', 0.022), ('lowest', 0.022), ('gradient', 0.022), ('stationary', 0.021), ('tilted', 0.021), ('srinivas', 0.021), ('ego', 0.021), ('nsgp', 0.021), ('precision', 0.021), ('bayesian', 0.02), ('capacity', 0.02), ('correlations', 0.02), ('outperforms', 0.02), ('george', 0.019), ('sensitivity', 0.019), ('estimate', 0.019), ('normal', 0.019), ('dy', 0.018), ('hybrid', 0.018), ('measured', 0.018), ('infn', 0.018), ('induction', 0.017), ('chooses', 0.017), ('updating', 0.017), ('arg', 0.017), ('resembles', 0.017), ('surely', 0.017), ('gp', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
2 0.19650573 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to have a risk as close as possible to risk of the best function in F. Up to now, optimal aggregation procedures are convex combinations of every elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situations. Then, we perform a numerical study for the problem of selection of the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallow’s Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
3 0.13709363 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
Author: Jaedeug Choi, Kee-Eung Kim
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to handle more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert’s behavior, namely the case in which the expert’s policy is explicitly given, and the case in which the expert’s trajectories are available instead. The IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness in solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming
4 0.10460202 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
5 0.06223968 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
Author: Adam D. Bull
Abstract: In the efficient global optimization problem, we minimize an unknown function f , using as few observations f (x) as possible. It can be considered a continuum-armed-bandit problem, with noiseless data, and simple regret. Expected-improvement algorithms are perhaps the most popular methods for solving the problem; in this paper, we provide theoretical results on their asymptotic behaviour. Implementing these algorithms requires a choice of Gaussian-process prior, which determines an associated space of functions, its reproducing-kernel Hilbert space (RKHS). When the prior is fixed, expected improvement is known to converge on the minimum of any function in its RKHS. We provide convergence rates for this procedure, optimal for functions of low smoothness, and describe a modified algorithm attaining optimal rates for smoother functions. In practice, however, priors are typically estimated sequentially from the data. For standard estimators, we show this procedure may never find the minimum of f . We then propose alternative estimators, chosen to minimize the constants in the rate of convergence, and show these estimators retain the convergence rates of a fixed prior. Keywords: convergence rates, efficient global optimization, expected improvement, continuumarmed bandit, Bayesian optimization
6 0.047015872 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
7 0.041268013 104 jmlr-2011-X-Armed Bandits
8 0.034316536 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
9 0.032533571 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
10 0.029915638 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
11 0.028117206 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
12 0.02730247 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
13 0.024951058 61 jmlr-2011-Logistic Stick-Breaking Process
14 0.024866054 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
15 0.024481159 14 jmlr-2011-Better Algorithms for Benign Bandits
16 0.023673587 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
17 0.023653556 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
18 0.023005618 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes
19 0.021995259 16 jmlr-2011-Clustering Algorithms for Chains
20 0.021964548 55 jmlr-2011-Learning Multi-modal Similarity
topicId topicWeight
[(0, 0.143), (1, 0.026), (2, -0.086), (3, 0.142), (4, -0.018), (5, 0.209), (6, -0.176), (7, -0.06), (8, -0.079), (9, -0.051), (10, -0.012), (11, 0.079), (12, -0.129), (13, 0.173), (14, -0.111), (15, 0.241), (16, 0.008), (17, 0.364), (18, -0.194), (19, 0.191), (20, -0.069), (21, 0.01), (22, -0.038), (23, 0.206), (24, -0.123), (25, -0.109), (26, 0.052), (27, 0.019), (28, -0.052), (29, -0.094), (30, 0.031), (31, -0.043), (32, 0.023), (33, -0.085), (34, -0.098), (35, -0.009), (36, 0.054), (37, 0.074), (38, 0.084), (39, 0.001), (40, -0.047), (41, -0.01), (42, -0.022), (43, 0.011), (44, -0.06), (45, 0.006), (46, 0.004), (47, -0.039), (48, 0.017), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.95180255 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge on the smoothness of the function in the form of an aggregation function and computational issues limit the number of alternatives that can be easily considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
2 0.65589005 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
Author: Stéphane Gaïffas, Guillaume Lecué
Abstract: Given a finite set F of functions and a learning sample, the aim of an aggregation procedure is to achieve a risk as close as possible to the risk of the best function in F. Up to now, optimal aggregation procedures have been convex combinations of all elements of F. In this paper, we prove that optimal aggregation procedures combining only two functions in F exist. Such algorithms are of particular interest when F contains many irrelevant functions that should not appear in the aggregation procedure. Since selectors are suboptimal aggregation procedures, this proves that two is the minimal number of elements of F required for the construction of an optimal aggregation procedure in every situation. Then, we perform a numerical study for the problem of selecting the regularization parameters of the Lasso and the Elastic-net estimators. We compare on simulated examples our aggregation algorithms to aggregation with exponential weights, to Mallows's Cp and to cross-validation selection procedures. Keywords: aggregation, exact oracle inequality, empirical risk minimization, empirical process theory, sparsity, Lasso, Lars
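The core idea above — a convex combination of just two functions, with the weight chosen from the sample — can be sketched as follows. The weight selection here is plain empirical-risk minimization over a grid, a deliberate simplification of the paper's procedure; the function name and grid rule are illustrative assumptions.

```python
def aggregate_two(f1, f2, xs, ys, grid_size=101):
    """Convex combination lam*f1 + (1-lam)*f2 whose weight lam minimizes
    the empirical squared risk on the sample (xs, ys). A simplified
    stand-in for a two-function aggregation step."""
    def risk(lam):
        return sum((lam * f1(x) + (1 - lam) * f2(x) - y) ** 2
                   for x, y in zip(xs, ys)) / len(xs)
    # grid search over lam in [0, 1]
    lam = min((k / (grid_size - 1) for k in range(grid_size)), key=risk)
    return lam, (lambda x: lam * f1(x) + (1 - lam) * f2(x))
```

Note the design point the abstract stresses: the returned predictor never mixes more than two elements of F, no matter how large F is.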
3 0.48358858 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Value function approximation methods have been successfully used in many applications, but the prevailing techniques often lack useful a priori error bounds. We propose a new approximate bilinear programming formulation of value function approximation, which employs global optimization. The formulation provides strong a priori guarantees on both robust and expected policy loss by minimizing specific norms of the Bellman residual. Solving a bilinear program optimally is NP-hard, but this worst-case complexity is unavoidable because the Bellman-residual minimization itself is NP-hard. We describe and analyze the formulation as well as a simple approximate algorithm for solving bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. We also briefly analyze the behavior of bilinear programming algorithms under incomplete samples. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on simple benchmark problems. Keywords: value function approximation, approximate dynamic programming, Markov decision processes
4 0.4663153 47 jmlr-2011-Inverse Reinforcement Learning in Partially Observable Environments
Author: Jaedeug Choi, Kee-Eung Kim
Abstract: Inverse reinforcement learning (IRL) is the problem of recovering the underlying reward function from the behavior of an expert. Most of the existing IRL algorithms assume that the environment is modeled as a Markov decision process (MDP), although it is desirable to handle partially observable settings in order to accommodate more realistic scenarios. In this paper, we present IRL algorithms for partially observable environments that can be modeled as a partially observable Markov decision process (POMDP). We deal with two cases according to the representation of the given expert's behavior, namely the case in which the expert's policy is explicitly given, and the case in which the expert's trajectories are available instead. IRL in POMDPs poses a greater challenge than in MDPs since it is not only ill-posed due to the nature of IRL, but also computationally intractable due to the hardness of solving POMDPs. To overcome these obstacles, we present algorithms that exploit some of the classical results from the POMDP literature. Experimental results on several benchmark POMDP domains show that our work is useful for partially observable settings. Keywords: inverse reinforcement learning, partially observable Markov decision process, inverse optimization, linear programming, quadratically constrained programming
5 0.20113637 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
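The "global model through a mixture of local generalized linear models" can be sketched with a finite mixture of local linear regressions — a hedged simplification of the DP-GLM posterior predictive (the Dirichlet-process machinery and the component fitting are omitted; the parameterization below is an assumption for illustration).

```python
import math

def mixture_predict(x, components):
    """Predict y at input x with a finite mixture of local linear models,
    a simplified stand-in for the DP-GLM posterior predictive.
    Each component is (weight, mean_x, sd_x, slope, intercept): a Gaussian
    model of the input region plus a local linear response model."""
    num = den = 0.0
    for w, mx, sx, a, b in components:
        # responsibility of this component for input x
        r = w * math.exp(-0.5 * ((x - mx) / sx) ** 2) / (sx * math.sqrt(2 * math.pi))
        num += r * (a * x + b)   # local linear prediction, weighted
        den += r
    return num / den
```

Near each component's input region the prediction follows that component's local linear model, which is how the mixture handles globally nonlinear regression functions.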
6 0.15666358 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
7 0.14684431 61 jmlr-2011-Logistic Stick-Breaking Process
8 0.1322784 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
9 0.12871158 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
10 0.12592669 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
11 0.12586498 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
12 0.12098811 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
13 0.11405896 16 jmlr-2011-Clustering Algorithms for Chains
14 0.11337015 3 jmlr-2011-A Cure for Variance Inflation in High Dimensional Kernel Principal Component Analysis
15 0.11209062 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
16 0.11067857 104 jmlr-2011-X-Armed Bandits
17 0.10844812 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
18 0.10288476 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
19 0.10282719 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
20 0.099733114 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
topicId topicWeight
[(4, 0.023), (9, 0.027), (10, 0.03), (22, 0.415), (24, 0.036), (31, 0.079), (32, 0.025), (41, 0.035), (60, 0.012), (65, 0.032), (67, 0.029), (70, 0.012), (73, 0.035), (78, 0.073), (90, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.68066967 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
Author: Martijn R.K. Mes, Warren B. Powell, Peter I. Frazier
Abstract: We propose a sequential sampling policy for noisy discrete global optimization and ranking and selection, in which we aim to efficiently explore a finite set of alternatives before selecting an alternative as best when exploration stops. Each alternative may be characterized by a multidimensional vector of categorical and numerical attributes and has independent normal rewards. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. We propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement. This approach greatly reduces the measurement effort required, but it requires some prior knowledge of the smoothness of the function in the form of an aggregation function, and computational considerations limit the number of alternatives that can practically be considered to the thousands. We prove that our policy is consistent, finding a globally optimal alternative when given enough measurements, and show through simulations that it performs competitively with or significantly better than other policies. Keywords: sequential experimental design, ranking and selection, adaptive learning, hierarchical statistics, Bayesian statistics
2 0.29222956 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
Author: Lauren A. Hannah, David M. Blei, Warren B. Powell
Abstract: We propose Dirichlet Process mixtures of Generalized Linear Models (DP-GLM), a new class of methods for nonparametric regression. Given a data set of input-response pairs, the DP-GLM produces a global model of the joint distribution through a mixture of local generalized linear models. DP-GLMs allow both continuous and categorical inputs, and can model the same class of responses that can be modeled with a generalized linear model. We study the properties of the DP-GLM, and show why it provides better predictions and density estimates than existing Dirichlet process mixture regression models. We give conditions for weak consistency of the joint distribution and pointwise consistency of the regression estimate. Keywords: Bayesian nonparametrics, generalized linear models, posterior consistency
3 0.29049736 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.2859948 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
Author: Nicolò Cesa-Bianchi, Shai Shalev-Shwartz, Ohad Shamir
Abstract: We investigate three variants of budgeted learning, a setting in which the learner is allowed to access a limited number of attributes from training or test examples. In the "local budget" setting, where a constraint is imposed on the number of available attributes per training example, we design and analyze an efficient algorithm for learning linear predictors that actively samples the attributes of each training instance. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on the training set. This result is complemented by a general lower bound for the easier "global budget" setting, where it is only the overall number of accessible training attributes that is being constrained. In the third, "prediction on a budget" setting, where the constraint is on the number of available attributes per test example, we show that there are cases in which there exists a linear predictor with zero error but it is statistically impossible to achieve arbitrary accuracy without full information on test examples. Finally, we run simple experiments on a digit recognition problem that reveal that our algorithm performs well against both partial-information and full-information baselines. Keywords: budgeted learning, statistical learning, linear predictors, learning with partial information, learning theory
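The "actively samples the attributes of each training instance" step rests on a simple building block: an unbiased estimate of an inner product from only k of the d attributes. The sketch below shows the importance-weighting idea in its plainest form; it is a hedged illustration, not the paper's sampling scheme, and the uniform-sampling rule is an assumption.

```python
import random

def sampled_dot(w, x, k, rng):
    """Unbiased estimate of <w, x> that reads only k of the d attributes:
    sample k coordinates uniformly without replacement and rescale by d/k,
    so the expectation over the sampling equals the full inner product."""
    d = len(x)
    idx = rng.sample(range(d), k)
    return (d / k) * sum(w[i] * x[i] for i in idx)
```

Such estimates are noisier than the full inner product, which is why a budgeted learner needs additional examples to compensate, as the abstract's analysis quantifies.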
5 0.28418118 16 jmlr-2011-Clustering Algorithms for Chains
Author: Antti Ukkonen
Abstract: We consider the problem of clustering a set of chains into k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First, we adapt classical k-means to chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd's algorithm, orders, preference statements, planted partition model, randomization testing
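One natural way to "map chains to a vector space", as the abstract describes, is a pairwise-preference encoding. The sketch below is a simple illustrative embedding, not necessarily either of the paper's two mappings; once chains are vectors, ordinary k-means applies.

```python
def chain_to_vector(chain, items):
    """Map a chain (a totally ordered subset of `items`) to a pairwise
    preference vector over all item pairs: +1 if the first item of the
    pair precedes the second in the chain, -1 if it follows it, and 0
    if either item is absent from the chain."""
    pos = {item: r for r, item in enumerate(chain)}
    vec = []
    for a in range(len(items)):
        for b in range(a + 1, len(items)):
            i, j = items[a], items[b]
            if i in pos and j in pos:
                vec.append(1.0 if pos[i] < pos[j] else -1.0)
            else:
                vec.append(0.0)  # pair not compared by this chain
    return vec
```

The zero entries are what make chains (partial rankings) harder than full rankings: two chains over disjoint item subsets carry no direct preference information about each other.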
6 0.2840488 12 jmlr-2011-Bayesian Co-Training
7 0.28342918 91 jmlr-2011-The Sample Complexity of Dictionary Learning
8 0.28236705 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
9 0.28215262 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
10 0.28207928 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
11 0.28139919 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.28033638 104 jmlr-2011-X-Armed Bandits
13 0.27936736 14 jmlr-2011-Better Algorithms for Benign Bandits
14 0.27935231 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
15 0.27873808 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
16 0.27868041 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
17 0.27830663 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
18 0.27785209 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
19 0.27727574 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
20 0.27652755 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates