jmlr jmlr2008 jmlr2008-94 knowledge-graph by maker-knowledge-mining

94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management


Source: pdf

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. [sent-7, score-0.634]

2 These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. [sent-8, score-0.942]

3 We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. [sent-9, score-0.824]

4 We use the setting of two levels of aggregation to explain why this approximation works so well. [sent-12, score-0.528]

5 • Managing a fleet of trucks to move loads - The truck can be described using attributes such as its current location, the home domicile of the driver, the maintenance level and whether it is being driven by a solo driver or a team of two drivers. [sent-27, score-0.558]

6 If we act on the resource, we may produce a resource with attribute a with value va . [sent-35, score-0.693]

7 In a dynamic programming setting, the value va refers to the solution of a finite horizon discounted reward dynamic program. [sent-36, score-0.614]

8 Instead of estimating va , we might define an aggregation function G(a) which produces an aggregated attribute with fewer possible outcomes. [sent-41, score-1.033]

9 For example, a five-digit zip code can be aggregated up to a three-digit zip; a numerical attribute can be divided into fewer ranges; or an attribute can be completely ignored. [sent-42, score-0.697]

10 The resulting smaller attribute space produces more observations of each attribute, but at a cost of aggregation error. [sent-43, score-0.812]

11 A truck driver might be characterized by his location and his home domicile; the value of a driver at a location depends very much on where he lives. [sent-49, score-0.587]

12 We are interested in developing a method for estimating the value v a of a resource with attribute a, making minimal assumptions about the structure of the attribute space. [sent-50, score-0.807]

13 We take advantage of the fact that for every application with which we are familiar, it is quite easy to design a family of aggregation functions G, where G^{(g)} : A → A^{(g)} is an aggregation of the attribute space A. [sent-51, score-1.227]

14 Techniques range from picking a fixed level of aggregation (Whitt, 1978; Bean et al. [sent-59, score-0.492]

15 A nice discussion of aggregation and abstraction techniques in an approximate dynamic programming setting is given in Boutilier et al. [sent-67, score-0.702]

16 Instead of using a single level of aggregation, researchers have considered combining estimates from all the levels of aggregation at once. [sent-69, score-0.69]

17 In this paper, we solve the problem of optimally combining (correlated) value estimates at different levels of aggregation in an approximate dynamic programming setting and derive expressions for optimal weights. [sent-76, score-0.9]

18 The major contribution of this paper lies in finding that an inverse-variance weighting formula (adjusted for bias), which is optimal only when the estimates are independent, proves to be nearoptimal even though estimates at different levels of aggregation are not independent. [sent-79, score-0.867]

19 We show that if we compute the optimal weights (without assuming independence) and compare them with the weights computed assuming independence, the results are the same at two extremes: when the difference between the aggregate and disaggregate value estimates is very large or very small. [sent-81, score-0.66]

20 The method of weighting a family of aggregate estimates is shown to naturally shift the weight from aggregate to disaggregate estimates as the algorithm progresses. [sent-84, score-0.925]

21 The next three sections, however, focus purely on the statistics of aggregation outside of a dynamic programming setting. [sent-91, score-0.676]

22 For this reason, we propose a simpler formula that assumes that statistics from different levels of aggregation are independent. [sent-95, score-0.528]

23 In Section 5, we compare the two weighting formulas (with and without the independence assumption) for the special case where there are only two levels of aggregation which allows the optimal weights to be computed analytically. [sent-96, score-0.759]

24 In Section 6, we demonstrate our approximation method in the context of an approximate dynamic programming algorithm for solving a multiattribute resource allocation problem. [sent-99, score-0.523]

25 This research estimated the value of a resource at different levels of aggregation, but kept track of the variance of these estimates at each level of aggregation and always used the estimate that provided the smallest variance. [sent-123, score-0.869]

26 Resource allocation problems can be modeled by letting a ∈ A be an attribute vector (a may consist of categorical and numerical attributes), and by letting Rta be the number of resources with attribute a. [sent-124, score-0.719]

27 In this section, we describe the basic approximate dynamic programming (ADP) strategy to solve the problem of managing a single resource with multiple attributes, the nomadic trucker. [sent-129, score-0.599]

28 The state of the resource is defined by an attribute vector, a, composed of multiple attributes, which may be numerical or categorical. [sent-131, score-0.538]

29 For the nomadic trucker problem, examples of attributes include the location of the truck, the home domicile of the driver and the number of hours driven. [sent-132, score-0.584]

30 Typically, the set of potential decisions depends on the current state (attributes) of our resource, so we let D a be the decisions available to a resource with attribute a. [sent-139, score-0.606]

31 We assume that the impact of a decision d on a resource with attribute a is deterministic, and is given by the function a′ = a^M(a, d). [sent-140, score-0.506]
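To connect these pieces, here is a minimal sketch of the decision step in such an ADP iteration. The transition function, contribution function, and discount factor passed in are hypothetical stand-ins for the paper's nomadic trucker model, not its exact implementation.

```python
def greedy_decision(a, decisions, transition, contribution, value_estimate, gamma=0.95):
    """Choose the decision d that maximizes immediate contribution plus the
    discounted value estimate of the post-decision attribute a' = a_M(a, d)."""
    best_d, best_score = None, float("-inf")
    for d in decisions(a):                 # D_a: decisions available in attribute state a
        a_next = transition(a, d)          # deterministic attribute transition a' = a_M(a, d)
        score = contribution(a, d) + gamma * value_estimate(a_next)
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score
```

The quality of value_estimate(a_next) is exactly what the multi-level aggregation machinery below is meant to improve.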

32 The next three sections consider the general problem of estimating a quantity (the value of a resource with attribute a) outside of the context of approximate dynamic programming. [sent-180, score-0.656]

33 The Statistics of Aggregation In this section, we investigate the statistics of aggregation by studying a sampling process where at iteration n we first sample the attribute vector a = an . [sent-186, score-0.777]

34 Consider a case where the attribute vector has more than one dimension, with A i denoting the number of distinct states that attribute ai can assume. [sent-201, score-0.68]

35 One strategy is to resort to aggregation (such as dropping one or more dimensions of a) which can quickly reduce the number of values but introduces structural error. [sent-204, score-0.525]

36 In one of our trucking applications, one attribute is the location of the truck, while a second attribute is the driver’s home domicile. [sent-207, score-0.835]

37 In general, aggregation of attribute vectors is performed using a collection of aggregation functions, G^{(g)} : A → A^{(g)}, where A^{(g)} represents the gth level of aggregation of the attribute space A. [sent-210, score-2.104]

38 We define the following: a^{(g)} = G^{(g)}(a), the gth-level aggregation of the attribute vector a. [sent-211, score-0.877]

39 The first-level aggregation function, G^{(1)}, involves aggregating the location to the regional level, which reduces the number of states to three. [sent-220, score-0.582]

40 The second-level aggregation function, G^{(2)}, would be defined as aggregating out the capacity type attribute completely, which leaves us with a single state. [sent-221, score-0.777]

41 As in this example and in the experimental work to follow, it is usually the case that the gth level of aggregation acts on the (g − 1)st level. [sent-222, score-0.55]
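To make the family of aggregation functions concrete, here is a minimal sketch in Python. The attribute fields (location, capacity type), the region lookup table, and the function names are hypothetical and chosen only to mirror the two-level example above; the paper does not prescribe any particular encoding.

```python
# A hypothetical attribute vector: (location, capacity_type).
# G^(0) is the identity; G^(1) maps the location up to its region;
# G^(2) aggregates out every attribute, leaving a single state.

REGION_OF = {"TX": "South", "GA": "South", "NJ": "East", "NY": "East", "CA": "West"}

def G0(a):
    # disaggregate level: keep the full attribute vector
    return a

def G1(a):
    # first-level aggregation: location -> region, capacity type kept
    location, capacity_type = a
    return (REGION_OF[location], capacity_type)

def G2(a):
    # second-level aggregation: drop all attributes (one aggregate state)
    return ("ALL",)

AGGREGATION_FUNCTIONS = [G0, G1, G2]   # G^(g) for g = 0, 1, 2

if __name__ == "__main__":
    a = ("TX", "flatbed")
    print([G(a) for G in AGGREGATION_FUNCTIONS])
    # [('TX', 'flatbed'), ('South', 'flatbed'), ('ALL',)]
```

As noted in the sentence above, each level here acts as a coarsening of the previous one.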

42 v_a^{(g)} = the estimate of the value associated with the attribute vector a at the gth level of aggregation, given the sample N. [sent-233, score-0.666]

43 We can compute the estimate v_a^{(g)} as v_a^{(g)} = (1 / N_a^{(g)}) Σ_{n ∈ N_a^{(g)}} v̂^n. [sent-234, score-0.512]
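As a rough illustration of this averaging, the sketch below accumulates sample means of the observed values at every aggregation level, reusing the hypothetical G0/G1/G2 functions from the earlier sketch; the observation stream and its values are assumptions made for the example.

```python
from collections import defaultdict

def estimate_by_level(observations, aggregation_functions):
    """observations: iterable of (attribute_vector, observed_value) pairs.
    Returns, for each level g, a dict mapping aggregated attribute -> (mean, count)."""
    sums = [defaultdict(float) for _ in aggregation_functions]
    counts = [defaultdict(int) for _ in aggregation_functions]
    for a, v_hat in observations:
        for g, G in enumerate(aggregation_functions):
            key = G(a)
            sums[g][key] += v_hat
            counts[g][key] += 1
    return [
        {key: (sums[g][key] / counts[g][key], counts[g][key]) for key in sums[g]}
        for g in range(len(aggregation_functions))
    ]
```

At the disaggregate level each attribute vector has few observations and a noisy mean; at higher levels the means rest on more observations but mix attribute vectors with different true values, which is the statistical versus structural error trade-off discussed above.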

44 Consider the state of a resource to be composed of two attributes, namely, location of the resource and resource type. [sent-236, score-0.581]

45 We use aggregation functions that aggregate out the type attribute and then the location attribute to obtain three different levels of aggregation. [sent-240, score-1.393]

46 We can form estimates of the values of the various states at the different levels of aggregation as illustrated in Table 1. [sent-258, score-0.674]

47 A traditional strategy is to choose the right level of aggregation by trading off statistical and structural errors to find a model with the least overall error. [sent-260, score-0.592]

48 In order to better understand these two kinds of errors in an aggregation setting, we first let δ_a^{(g)} denote the total error in the estimate v_a^{(g)} from the true value associated with attribute vector a: δ_a^{(g)} = v_a^{(g)} − ν_a. [sent-261, score-1.015]

49 Consider our most recently observed attribute vector â^n and some other attribute a, where â^n and a may aggregate up to the same aggregated attribute at some level g ∈ G, that is, G^{(g)}(a) = G^{(g)}(â^n) (for the moment, these are simply two attribute vectors). [sent-263, score-1.54]

50 We can express v_a^{(g)} in terms of its bias and noise components as follows: v_a^{(g)} = (1 / N_a^{(g)}) Σ_{n ∈ N_a^{(g)}} (ν_a + µ_a^n + ε^n) = ν_a + (1 / N_a^{(g)}) Σ_{n ∈ N_a^{(g)}} µ_a^n + (1 / N_a^{(g)}) Σ_{n ∈ N_a^{(g)}} ε^n. [sent-266, score-0.486]

51 This enables us to express the total error as follows: δ_a^{(g)} = µ_a^{(g)} + ε_a^{(g)} (5), where µ_a^{(g)} gives an estimate of the bias between the values of a at the gth level of aggregation and at the disaggregate level. [sent-269, score-0.928]

52 Although it is common in practice to choose a single level of aggregation that produces the lowest overall error, it can be useful to combine estimates from several levels of aggregation. [sent-278, score-0.69]

53 Breiman (1996) proposes a method called stacked regression, which in our setting would be equivalent to combining estimates at different levels of aggregation using v_a = Σ_{g ∈ G} w^{(g)} · v_a^{(g)}, where w^{(g)} is a set of weights, one for each level of aggregation. [sent-284, score-1.242]

54 We prefer to use the strategy suggested by LeBlanc and Tibshirani (1996) (Section 8), where the weights depend on the attribute: v_a = Σ_{g ∈ G} w_a^{(g)} · v_a^{(g)}. [sent-286, score-0.702]

55 Optimal Weights: We begin by finding the weighting scheme that will optimally combine the estimates at the different levels of aggregation, that is, the weights which give a combined estimate with the least squared deviation from the true value associated with attribute vector a. [sent-301, score-0.751]
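For reference, the weight-selection problem described in this sentence (referred to later, in sentence 98, as (6)-(7)) can be written compactly; the display below is a reconstruction from the surrounding definitions, not the paper's exact typesetting.

```latex
\min_{w_a} \; \frac{1}{2}\,\mathbb{E}\Bigl[\Bigl(\sum_{g \in \mathcal{G}} w_a^{(g)} \, v_a^{(g)} - \nu_a\Bigr)^{2}\Bigr]
\quad \text{subject to} \quad
\sum_{g \in \mathcal{G}} w_a^{(g)} = 1, \qquad w_a^{(g)} \ge 0 .
```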

56 An Approximation Assuming Independence: It is a well-known result in statistics that if the estimates v_a^{(g)}, g ∈ G, were independent and unbiased, then the optimal weights would be given by w_a^{(g)} = (1 / (σ_a^{(g)2} / N_a^{(g)})) · ( Σ_{g′ ∈ G} 1 / (σ_a^{(g′)2} / N_a^{(g′)}) )^{−1}. [sent-316, score-0.561]

57 In order to adapt the simpler formula in (13) to the aggregation setting while acknowledging the bias, we first define: µ_a^{(g)} = the expected bias in the estimate v_a^{(g)} = E[v_a^{(g)}] − ν_a. [sent-323, score-0.936]

58 We first compute estimates of the bias and the variance using s_a^{(g)2} = the sample variance of the observations corresponding to the estimate v_a^{(g)} = (1 / (N_a^{(g)} − 1)) Σ_{n ∈ N_a^{(g)}} (v̂^n − v_a^{(g)})^2. [sent-331, score-0.54]

59 µ̃_a^{(g)} = an estimate of the bias in the estimated value v_a^{(g)} from the true value = v_a^{(g)} − v_a^{(0)}. [sent-332, score-0.512]
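The sketch below implements the bias-adjusted inverse-variance weighting just described: each level's weight is taken proportional to the reciprocal of its estimated total squared error, s^2/N + µ̃^2, and the weights are normalized to sum to one. The exact total-variation term used in the paper may differ slightly; treat this as an illustrative reading of the formulas above.

```python
def wind_weights(levels):
    """levels: list of (mean, sample_variance, count, bias_estimate) per aggregation level,
    where bias_estimate is (mean at level g) - (mean at the disaggregate level).
    Returns normalized weights, assuming the level estimates are independent."""
    raw = []
    for mean, s2, n, bias in levels:
        total_error = s2 / max(n, 1) + bias ** 2   # variance of the mean plus squared bias
        raw.append(1.0 / total_error if total_error > 0 else float("inf"))
    # If some level has (numerically) zero estimated error, put all weight there.
    if any(r == float("inf") for r in raw):
        raw = [1.0 if r == float("inf") else 0.0 for r in raw]
    total = sum(raw)
    return [r / total for r in raw]

def combined_estimate(levels):
    weights = wind_weights(levels)
    return sum(w * mean for w, (mean, _, _, _) in zip(weights, levels))
```

Early on, the disaggregate level has few observations, so s^2/N dominates and the weight shifts toward the aggregate levels; as the disaggregate count grows, the squared-bias term penalizes the aggregate levels and the weight migrates back toward the disaggregate estimate, which matches the behaviour reported in the experiments later in the paper.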

60 However, it produces the best results only when the estimates of values at different levels of aggregation are independent, an assumption that we cannot expect to hold true. [sent-338, score-0.648]

61 We assume that the statistical noise is independent of the attribute vector sampled and also that we know the probability distributions of the sampling of the attribute vectors and their values. [sent-347, score-0.654]

62 Analytical Comparison: For the two-level problem, we can obtain the optimal weights (WOPT) by solving the following system of equations: E[δ_a^{(0)2}] w_a^{(0)} + E[δ_a^{(0)} δ_a^{(1)}] w_a^{(1)} − λ = 0, E[δ_a^{(0)} δ_a^{(1)}] w_a^{(0)} + E[δ_a^{(1)2}] w_a^{(1)} − λ = 0, w_a^{(0)} + w_a^{(1)} = 1, with w_a^{(0)}, w_a^{(1)} ≥ 0. [sent-354, score-0.609]

63 Since we are concerned with computing the weights for a particular attribute vector, we drop the index a in the following analysis. [sent-355, score-0.555]
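Under the reconstruction above, the two-level optimal weights follow from a small linear system in (w^{(0)}, w^{(1)}, λ). A minimal sketch, which eliminates λ by subtracting the two stationarity equations and handles the non-negativity constraint only by crude clipping, is:

```python
def wopt_two_levels(e_d0_sq, e_d1_sq, e_d0_d1):
    """Solve for (w0, w1) given E[delta0^2], E[delta1^2], and E[delta0*delta1],
    using the first-order conditions together with the constraint w0 + w1 = 1."""
    denom = e_d0_sq + e_d1_sq - 2.0 * e_d0_d1
    if denom == 0.0:
        return 0.5, 0.5          # the two estimates are interchangeable
    w0 = (e_d1_sq - e_d0_d1) / denom
    w0 = min(max(w0, 0.0), 1.0)  # crude handling of the w >= 0 constraint
    return w0, 1.0 - w0
```

The WIND scheme corresponds to setting the cross term E[δ^{(0)}δ^{(1)}] to zero in this calculation, which is the comparison carried out in the rest of this section.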

64 These results enable us to rewrite Equation 17 for computing the weight on the disaggregate estimate under the WOPT scheme (which we denote as w_opt). [sent-360, score-0.56]

65 v_opt = w_opt v^{(0)} + (1 − w_opt) v^{(1)}, and v_ind = w_ind v^{(0)} + (1 − w_ind) v^{(1)}. [sent-364, score-0.682]

66 We can write the difference between the estimates of the value obtained with and without the independence assumption as ∆ = v_opt − v_ind = ∆w · ∆v, where ∆w = w_opt − w_ind and ∆v = v^{(0)} − v^{(1)}. [sent-365, score-0.517]

67 Estimates of values of each attribute vector are computed at both the aggregate and disaggregate levels. [sent-382, score-0.766]

68 Figure 4 gives the weights (to be applied to the disaggregate level) produced by the optimal formula, WOPT, and the formula assuming independence, WIND, for each attribute a. [sent-391, score-0.72]

69 Note, however (consistent with our understanding from the previous section) that the weights are also quite different for the cells a = 2 and a = 5 where the aggregate and disaggregate functions are most similar (which means the bias is small). [sent-395, score-0.6]

70 θ_s^{εG} = Σ_{a ∈ A} (ṽ_{sa} − ν_a)^2, the sum of squared errors using the static aggregation strategy, which treats the function as a constant over its domain. [sent-410, score-0.564]

71 Experiments in an ADP Application: We implemented the hierarchical weighting strategy in the approximate dynamic programming procedure for solving the nomadic trucker problem described in Section 2. [sent-445, score-0.513]

72 The final attribute that we consider is the number of days that the driver is away from home. [sent-472, score-0.501]

73 For example, at aggregation level 1, the location attribute is aggregated from 50 regions to 10 geographical areas. [sent-476, score-0.951]

74 The apparent discrepancy in the size of the state space at levels 0 and 1 arises because the days-away-from-home attribute is always set to 0 for the location corresponding to the home of the driver, while for all the other locations it can be any number from 1 to 12. [sent-479, score-0.644]

75 In order to compute the true values associated with each attribute vector, we use a standard backward dynamic programming algorithm. [sent-480, score-0.581]

76 A ‘∗’ corresponding to a particular attribute indicates that the attribute is included in the attribute vector, and a ‘−’ indicates that it is aggregated out. [sent-483, score-1.024]

77 our aggregation strategies in the approximate dynamic programming algorithm outlined in Figure 1. [sent-484, score-0.727]

78 Equation 4 is replaced by a series of equations for updating the value function estimates at all the levels of aggregation corresponding to the currently visited attribute vector: v_a^{(g),n} = (1 − α) v_a^{(g),n−1} + α v̂_a^n, a ∈ A, g ∈ G. [sent-492, score-1.211]
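A rough sketch of how this multi-level update slots into the ADP loop, reusing the hypothetical aggregation functions from the earlier sketches; the stepsize handling and bookkeeping fields are illustrative choices, not the paper's exact implementation.

```python
def update_all_levels(stats, aggregation_functions, a, v_hat, alpha):
    """stats: one dict per level g, mapping aggregated attribute ->
    {'v': smoothed value estimate, 'n': count, 'sum': sum of observations,
     'sumsq': sum of squared observations}.
    Applies v^(g),n = (1 - alpha) * v^(g),n-1 + alpha * v_hat at every level,
    and keeps the raw sums needed for the sample variance in the weighting formula."""
    for g, G in enumerate(aggregation_functions):
        key = G(a)
        s = stats[g].setdefault(key, {"v": 0.0, "n": 0, "sum": 0.0, "sumsq": 0.0})
        s["n"] += 1
        s["v"] = v_hat if s["n"] == 1 else (1.0 - alpha) * s["v"] + alpha * v_hat
        s["sum"] += v_hat
        s["sumsq"] += v_hat * v_hat

def sample_variance(s):
    """Sample variance of the observations recorded for one aggregated attribute."""
    if s["n"] < 2:
        return 0.0
    return (s["sumsq"] - s["sum"] ** 2 / s["n"]) / (s["n"] - 1)
```

The per-level means, counts, variances, and the bias relative to the disaggregate level computed from these statistics are exactly the inputs to the weighting formula sketched earlier.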

79 For methods 2 and 3, we point out that the chosen level of aggregation and the weights on the estimates at different levels are dynamic in nature, in that they change with each iteration. [sent-496, score-0.941]

80 The static aggregation strategies for g = 0, 1 & 2 are denoted as Disaggregate, Aggregate1 and Aggregate2 respectively. [sent-511, score-0.516]

81 As the algorithm progresses, higher weights are put on the more disaggregate levels, with ultimately the highest weight on the most disaggregate level. [sent-518, score-0.685]

82 It is interesting (and noteworthy) that the weights on the higher levels of aggregation drop quickly at first, but then stabilize. [sent-519, score-0.629]

83 We see that the hierarchical weighting scheme outperforms all the static aggregation strategies as well as the dynamic aggregation strategy which picks the “best” level of aggregation, producing the least errors in the estimates. [sent-583, score-1.37]

84 Two higher levels of aggregation are also used in the problem, but omitted from the tables, as they give inferior results. [sent-749, score-0.528]

85 x_{ad} = the number of times we act on a resource with attribute a using a decision of type d ∈ D. [sent-767, score-0.564]

86 The right level of aggregation depends not only on the specific attribute (some are sampled more often than others) but also on the number of iterations we have run the algorithm. [sent-780, score-0.819]

87 of the driver (out of 100 regions), the driver’s home domicile (out of 100 regions), and whether the driver was a single driver employed by the company, a contract driver, or a team (two drivers working together). [sent-785, score-0.612]

88 At the most disaggregate level, the attribute space consisted of 30,000 elements. [sent-786, score-0.619]

89 We considered an aggregate level which captured only the location of the driver (100 elements) and a hierarchical strategy that used a weighted estimate across four levels of aggregation. [sent-787, score-0.61]

90 As expected, if we use only an aggregate attribute vector, we obtain fast convergence but it levels off quickly. [sent-792, score-0.552]

91 By contrast, using a weighted combination of estimates at different levels of aggregation produced the fastest convergence and the best results overall. [sent-794, score-0.671]

92 Even though the weighted combination strategy has a greater number of computations per iteration compared to static aggregation schemes, we have observed that the increase in running time for additional computations is not significant. [sent-796, score-0.562]

93 Requiring no more than a set of aggregation functions which are typically quite easy to design, we have proposed an adaptive weighting strategy that produces an estimate of the value of a resource with a particular attribute. [sent-800, score-0.776]

94 The weighting system is quite easy to compute, and is implementable for very large scale problems with attribute spaces that are arbitrarily large, since the computational complexity is a function of how many attributes you actually visit. [sent-801, score-0.495]

95 Our most surprising result was the finding that a weighting system that assumed independence of the estimates at each level of aggregation worked so well. [sent-805, score-0.742]

96 If the difference between a disaggregate estimate and the corresponding aggregate estimate was large, the weights produced by our approximation closely matched the weights that would have been produced had we properly accounted for the correlation. [sent-808, score-0.693]

97 If the difference between the disaggregate and aggregate estimates was small, the weights could be quite different, but in this case it did not matter. [sent-809, score-0.66]

98 The Lagrangian for the problem formulated in (6)-(7) is L(w, λ) = E[ (1/2) ( Σ_{g ∈ G} w_a^{(g)} · v_a^{(g)} − ν_a )^2 ] + λ ( 1 − Σ_{g ∈ G} w_a^{(g)} ) = (1/2) E[ ( Σ_{g ∈ G} w_a^{(g)} ( v_a^{(g)} − ν_a ) )^2 ] + λ ( 1 − Σ_{g ∈ G} w_a^{(g)} ). [sent-817, score-0.934]

99 The first-order optimality conditions are easily shown to be E[ ( Σ_{g ∈ G} w_a^{(g)} ( v_a^{(g)} − ν_a ) ) ( v_a^{(g′)} − ν_a ) ] − λ = 0 for all g′ ∈ G (27), and Σ_{g ∈ G} w_a^{(g)} − 1 = 0. [sent-818, score-0.68]

100 To simplify Equation 27, we note that E[ ( Σ_{g ∈ G} w_a^{(g)} ( v_a^{(g)} − ν_a ) ) ( v_a^{(g′)} − ν_a ) ] = E[ Σ_{g ∈ G} w_a^{(g)} δ_a^{(g)} δ_a^{(g′)} ] = Σ_{g ∈ G} w_a^{(g)} E[ δ_a^{(g)} δ_a^{(g′)} ]. [sent-819]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('aggregation', 0.45), ('attribute', 0.327), ('disaggregate', 0.292), ('na', 0.26), ('va', 0.213), ('wind', 0.175), ('resource', 0.153), ('dynamic', 0.15), ('aggregate', 0.147), ('driver', 0.142), ('wopt', 0.141), ('eorge', 0.133), ('owell', 0.133), ('pproximation', 0.133), ('ulkarni', 0.133), ('wa', 0.127), ('estimates', 0.12), ('home', 0.117), ('weights', 0.101), ('weighting', 0.099), ('unction', 0.093), ('vn', 0.086), ('eet', 0.083), ('multiattribute', 0.083), ('powell', 0.083), ('wimse', 0.083), ('ultiple', 0.081), ('levels', 0.078), ('programming', 0.076), ('nomadic', 0.075), ('trucker', 0.075), ('managing', 0.071), ('attributes', 0.069), ('location', 0.064), ('bias', 0.06), ('state', 0.058), ('gth', 0.058), ('truck', 0.058), ('xad', 0.058), ('unbiased', 0.05), ('strategy', 0.048), ('aggregated', 0.043), ('level', 0.042), ('domicile', 0.042), ('minv', 0.042), ('static', 0.041), ('discount', 0.041), ('vehicle', 0.041), ('hierarchical', 0.04), ('routing', 0.038), ('equation', 0.036), ('aircraft', 0.035), ('adp', 0.035), ('tsitsiklis', 0.035), ('observations', 0.035), ('allocation', 0.035), ('decisions', 0.034), ('freight', 0.033), ('jets', 0.033), ('leblanc', 0.033), ('loaded', 0.033), ('trucks', 0.033), ('days', 0.032), ('transportation', 0.032), ('independence', 0.031), ('princeton', 0.03), ('resources', 0.03), ('bertsekas', 0.03), ('blood', 0.028), ('loads', 0.028), ('backward', 0.028), ('team', 0.027), ('load', 0.027), ('structural', 0.027), ('dm', 0.027), ('gg', 0.027), ('st', 0.026), ('decision', 0.026), ('states', 0.026), ('estimate', 0.026), ('approximate', 0.026), ('errors', 0.025), ('horizon', 0.025), ('cargo', 0.025), ('gendreau', 0.025), ('geographical', 0.025), ('locomotives', 0.025), ('military', 0.025), ('simao', 0.025), ('spivey', 0.025), ('stacked', 0.025), ('vind', 0.025), ('vopt', 0.025), ('strategies', 0.025), ('policy', 0.024), ('management', 0.024), ('equations', 0.023), ('projects', 0.023), ('boutilier', 0.023), ('weighted', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

2 0.14046755 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

3 0.10502937 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

Author: Amit Dhurandhar, Alin Dobra

Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees

4 0.095295817 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen

Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue

5 0.072738096 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments

Author: Balázs Csanád Csáji, László Monostori

Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms

6 0.05340888 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

7 0.041325606 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

8 0.034473754 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

9 0.034330595 52 jmlr-2008-Learning from Multiple Sources

10 0.034306746 87 jmlr-2008-Stationary Features and Cat Detection

11 0.030636085 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

12 0.028866753 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

13 0.028637145 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

14 0.028573895 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses

15 0.027875688 49 jmlr-2008-Learning Control Knowledge for Forward Search Planning

16 0.027678154 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

17 0.026514011 14 jmlr-2008-An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons

18 0.025945602 86 jmlr-2008-SimpleMKL

19 0.025230242 65 jmlr-2008-Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum Stochastic Games: An Experimental Study

20 0.024777278 44 jmlr-2008-Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.152), (1, -0.046), (2, 0.031), (3, -0.066), (4, -0.141), (5, -0.087), (6, 0.029), (7, 0.007), (8, 0.093), (9, 0.028), (10, -0.022), (11, 0.057), (12, -0.043), (13, -0.044), (14, -0.102), (15, -0.32), (16, 0.133), (17, -0.43), (18, 0.131), (19, -0.146), (20, 0.149), (21, -0.089), (22, -0.029), (23, -0.013), (24, 0.097), (25, -0.041), (26, 0.084), (27, 0.114), (28, 0.095), (29, -0.003), (30, 0.116), (31, 0.107), (32, 0.023), (33, -0.004), (34, -0.113), (35, 0.062), (36, -0.092), (37, -0.14), (38, -0.035), (39, 0.004), (40, 0.017), (41, 0.074), (42, -0.164), (43, -0.034), (44, -0.136), (45, -0.003), (46, 0.079), (47, 0.043), (48, -0.151), (49, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97409111 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

2 0.46544498 77 jmlr-2008-Probabilistic Characterization of Random Decision Trees

Author: Amit Dhurandhar, Alin Dobra

Abstract: In this paper we use the methodology introduced by Dhurandhar and Dobra (2009) for analyzing the error of classifiers and the model selection measures, to analyze decision tree algorithms. The methodology consists of obtaining parametric expressions for the moments of the generalization error (GE) for the classification model of interest, followed by plotting these expressions for interpretability. The major challenge in applying the methodology to decision trees, the main theme of this work, is customizing the generic expressions for the moments of GE to this particular classification algorithm. The specific contributions we make in this paper are: (a) we primarily characterize a subclass of decision trees namely, Random decision trees, (b) we discuss how the analysis extends to other decision tree algorithms and (c) in order to extend the analysis to certain model selection measures, we generalize the relationships between the moments of GE and moments of the model selection measures given in (Dhurandhar and Dobra, 2009) to randomized classification algorithms. An empirical comparison of the proposed method with Monte Carlo and distribution free bounds obtained using Breiman’s formula, depicts the advantages of the method in terms of running time and accuracy. It thus showcases the use of the deployed methodology as an exploratory tool to study learning algorithms. Keywords: moments, generalization error, decision trees

3 0.44646749 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces

Author: Sébastien Loustau

Abstract: This paper investigates statistical performances of Support Vector Machines (SVM) and considers the problem of adaptation to the margin parameter and to complexity. In particular we provide a classifier with no tuning parameter. It is a combination of SVM classifiers. Our contribution is two-fold: (1) we propose learning rates for SVM using Sobolev spaces and build a numerically realizable aggregate that converges with same rate; (2) we present practical experiments of this method of aggregation for SVM using both Sobolev spaces and Gaussian kernels. Keywords: classification, support vector machines, learning rates, approximation, aggregation of classifiers

4 0.39464149 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

Author: Jun Zhu, Zaiqing Nie, Bo Zhang, Ji-Rong Wen

Abstract: Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies—attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model’s parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models. Keywords: conditional random fields, dynamic hierarchical Markov random fields, integrated web data extraction, statistical hierarchical modeling, blocky artifact issue

5 0.26689121 95 jmlr-2008-Value Function Based Reinforcement Learning in Changing Markovian Environments

Author: Balázs Csanád Csáji, László Monostori

Abstract: The paper investigates the possibility of applying value function based reinforcement learning (RL) methods in cases when the environment may change over time. First, theorems are presented which show that the optimal value function of a discounted Markov decision process (MDP) Lipschitz continuously depends on the immediate-cost function and the transition-probability function. Dependence on the discount factor is also analyzed and shown to be non-Lipschitz. Afterwards, the concept of (ε, δ)-MDPs is introduced, which is a generalization of MDPs and ε-MDPs. In this model the environment may change over time, more precisely, the transition function and the cost function may vary from time to time, but the changes must be bounded in the limit. Then, learning algorithms in changing environments are analyzed. A general relaxed convergence theorem for stochastic iterative algorithms is presented. We also demonstrate the results through three classical RL methods: asynchronous value iteration, Q-learning and temporal difference learning. Finally, some numerical experiments concerning changing environments are presented. Keywords: Markov decision processes, reinforcement learning, changing environments, (ε, δ)MDPs, value function bounds, stochastic iterative algorithms

6 0.21475093 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

7 0.18137889 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting

8 0.17239736 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses

9 0.16693932 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

10 0.15639628 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

11 0.1559394 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

12 0.14899476 53 jmlr-2008-Learning to Combine Motor Primitives Via Greedy Additive Regression

13 0.13133636 4 jmlr-2008-A Multiple Instance Learning Strategy for Combating Good Word Attacks on Spam Filters

14 0.12628442 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection

15 0.12616907 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

16 0.12544374 86 jmlr-2008-SimpleMKL

17 0.12275361 44 jmlr-2008-Incremental Identification of Qualitative Models of Biological Systems using Inductive Logic Programming

18 0.11905306 87 jmlr-2008-Stationary Features and Cat Detection

19 0.11456294 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

20 0.11000575 52 jmlr-2008-Learning from Multiple Sources


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.017), (5, 0.024), (40, 0.042), (54, 0.062), (58, 0.037), (66, 0.044), (69, 0.447), (76, 0.023), (78, 0.012), (88, 0.061), (92, 0.052), (94, 0.046), (99, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.77136058 42 jmlr-2008-HPB: A Model for Handling BN Nodes with High Cardinality Parents

Author: Jorge Jambeiro Filho, Jacques Wainer

Abstract: We replaced the conditional probability tables of Bayesian network nodes whose parents have high cardinality with a multilevel empirical hierarchical Bayesian model called hierarchical pattern Bayes (HPB). The resulting Bayesian networks achieved significant performance improvements over Bayesian networks with the same structure and traditional conditional probability tables, over Bayesian networks with simpler structures like naïve Bayes and tree augmented naïve Bayes, over Bayesian networks where traditional conditional probability tables were substituted by noisy-OR gates, default tables, decision trees and decision graphs and over Bayesian networks constructed after a cardinality reduction preprocessing phase using the agglomerative information bottleneck method. Our main tests took place in important fraud detection domains, which are characterized by the presence of high cardinality attributes and by the existence of relevant interactions among them. Other tests, over UCI data sets, show that HPB may have a quite wide applicability. Keywords: probabilistic reasoning, Bayesian networks, smoothing, hierarchical Bayes, empirical Bayes

same-paper 2 0.75625008 94 jmlr-2008-Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management

Author: Abraham George, Warren B. Powell, Sanjeev R. Kulkarni

Abstract: We consider the problem of estimating the value of a multiattribute resource, where the attributes are categorical or discrete in nature and the number of potential attribute vectors is very large. The problem arises in approximate dynamic programming when we need to estimate the value of a multiattribute resource from estimates based on Monte-Carlo simulation. These problems have been traditionally solved using aggregation, but choosing the right level of aggregation requires resolving the classic tradeoff between aggregation error and sampling error. We propose a method that estimates the value of a resource at different levels of aggregation simultaneously, and then uses a weighted combination of the estimates. Using the optimal weights, which minimizes the variance of the estimate while accounting for correlations between the estimates, is computationally too expensive for practical applications. We have found that a simple inverse variance formula (adjusted for bias), which effectively assumes the estimates are independent, produces near-optimal estimates. We use the setting of two levels of aggregation to explain why this approximation works so well. Keywords: hierarchical statistics, approximate dynamic programming, mixture models, adaptive learning, multiattribute resources

3 0.27977774 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

Author: Rémi Munos, Csaba Szepesvári

Abstract: In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaving in the sense that by using a sufficiently large number of samples for a large class of MDPs, arbitrary good performance can be achieved with high probability. An important feature of our proof technique is that it permits the study of weighted L p -norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is “aligned” with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings. Keywords: fitted value iteration, discounted Markovian decision processes, generative model, reinforcement learning, supervised learning, regression, Pollard’s inequality, statistical learning theory, optimal control

4 0.2718448 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

Author: Michael Collins, Amir Globerson, Terry Koo, Xavier Carreras, Peter L. Bartlett

Abstract: Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or maxmargin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O( 1 ) EG ε updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log( 1 )) updates are required. For both the max-margin and log-linear cases, our bounds suggest ε that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. Keywords: exponentiated gradient, log-linear models, maximum-margin models, structured prediction, conditional random fields ∗. These authors contributed equally. c 2008 Michael Col

5 0.26982853 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

6 0.26968667 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

7 0.26608744 56 jmlr-2008-Magic Moments for Structured Output Prediction

8 0.26456445 86 jmlr-2008-SimpleMKL

9 0.2635937 58 jmlr-2008-Max-margin Classification of Data with Absent Features

10 0.26250777 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

11 0.26109836 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

12 0.26090258 83 jmlr-2008-Robust Submodular Observation Selection

13 0.25869629 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers

14 0.25822219 33 jmlr-2008-Evidence Contrary to the Statistical View of Boosting

15 0.25802928 57 jmlr-2008-Manifold Learning: The Price of Normalization

16 0.25766623 9 jmlr-2008-Active Learning by Spherical Subdivision

17 0.25713876 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields

18 0.2568168 63 jmlr-2008-Model Selection for Regression with Continuous Kernel Functions Using the Modulus of Continuity    (Special Topic on Model Selection)

19 0.25592622 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

20 0.25114602 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns