jmlr jmlr2006 jmlr2006-37 knowledge-graph by maker-knowledge-mining

37 jmlr-2006-Incremental Algorithms for Hierarchical Classification


Source: pdf

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. [sent-9, score-0.497]

2 In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. [sent-15, score-0.619]

3 Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function 1. [sent-16, score-0.615]

4 That is, we assume that every data instance is labelled with a (possibly empty) set of class nodes and, whenever an instance is labelled with a certain node i, then it is also labelled with all the nodes on the path from the root of the tree where i occurs down to node i. [sent-21, score-0.842]

5 A distinctive feature of our framework is that we also allow multiple-path labellings (instances can be labelled with nodes belonging to more than one path in the forest), and partial-path labellings (instances can be labelled with nodes belonging to a path that does not end on a leaf). [sent-22, score-0.358]

6 A hierarchical classification is then obtained by evaluating the node classifiers in a top-down fashion, so that the final labelling is consistent with the taxonomy. [sent-24, score-0.497]

7 The problem of hierarchical classification, especially of textual information, has been extensively investigated in past years (see, e. [sent-25, score-0.291]

8 The on-line approach to hierarchical classification, which we analyze here, seems well suited when dealing with scenarios in which new data are produced frequently and in large amounts (e. [sent-31, score-0.291]

9 An important ingredient in a hierarchical classification problem is the loss function used to evaluate the classifier’s performance. [sent-35, score-0.339]

10 In a hierarchical setting this loss would simply count one mistake each time, on a given data instance, the set of class labels output by the hierarchical classifier is not perfectly identical to the set of true labels associated to that instance. [sent-37, score-0.813]

11 In this paper we define a new loss function, the H-loss (hierarchical loss), whose simple definition captures the following intuition: “if a mistake is made at node i of the taxonomy, then further mistakes made in the subtree rooted at i are unimportant”. [sent-43, score-0.505]
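
The H-loss of this definition can be computed in a single pass over the nodes. The sketch below is ours, not the paper's: it assumes a hypothetical parent-array encoding of the taxonomy (parent[i] = -1 marks a root) and unit costs.

```python
def h_loss(y_pred, y_true, parent):
    """H-loss: charge one unit at node i only if y_pred[i] != y_true[i]
    and every ancestor of i is predicted correctly; deeper mistakes in
    the subtree rooted at i are then not charged."""
    loss = 0
    for i in range(len(y_true)):
        if y_pred[i] != y_true[i]:
            j, ancestors_ok = parent[i], True
            while j != -1:                    # walk up to the root
                if y_pred[j] != y_true[j]:
                    ancestors_ok = False
                    break
                j = parent[j]
            if ancestors_ok:
                loss += 1
    return loss

# Toy chain 0 -> 1 -> 2: node 1 is wrong, so node 2's mistake is free.
print(h_loss([1, 0, 1], [1, 1, 0], parent=[-1, 0, 1]))  # prints 1
```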

12 The hierarchical labellings associated to the instances, instead, are assumed to be independently generated according to a parametric stochastic process defined on the taxonomy. [sent-48, score-0.328]

13 Our main theoretical result is a bound on the regret of our hierarchical learning algorithm with respect to a reference hierarchical classifier based on the true parameters of the label-generating process. [sent-50, score-0.853]

14 More specifically, we bound the contribution to the cumulative regret of each node classifier in terms of quantities related to the position of the node in the taxonomy and the data process parameters. [sent-51, score-0.969]

15 In general, the overall cumulative regret is seen to grow at most logarithmically in the length T of the data sequence. [sent-53, score-0.335]

16 The use of hierarchically trained linear-threshold classifiers is common to several of the previous approaches to hierarchical classification (e. [sent-55, score-0.291]

17 However, to our knowledge, this research is the first one to provide a rigorous performance analysis of hierarchical classification algorithms in the presence of multiple and partial path classifications. [sent-61, score-0.32]

18 The core of our analysis is a local cumulative regret bound showing that the instantaneous regret of each node classifier vanishes at a rate 1/T . [sent-63, score-0.799]

19 The precise dependence of the rate of convergence on the eigenstructure of the data each node observes is a major contribution of this paper. [sent-64, score-0.309]

20 The experiments show that our on-line algorithm performs significantly better than Perceptron-based hierarchical classifiers. [sent-72, score-0.291]

21 A node i ∈ {1, . . . , N} belongs to the multilabel of x if and only if vi = 1. [sent-93, score-0.329]

22 A multilabel v ∈ {0, 1}N is said to respect a taxonomy G if and only if v is the union of one or more paths in G, where each path starts from a root but need not terminate on a leaf, see Figure 1. [sent-95, score-0.563]

23 We assume the data-generating mechanism produces examples (x, v) such that v respects some fixed underlying taxonomy G with N nodes (see Section 5). [sent-96, score-0.33]

24 We use PAR(i) to denote the unique parent of node i, ANC(i) to denote the set of ancestors of i, SUB(i) to denote the set of nodes in the subtree rooted at i (including i), and CHILD(i) to denote the set of children of node i. [sent-98, score-0.572]

25 According to our definition, the multilabel v = (1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0) respects this taxonomy (since it is the union of paths 1 → 2, 1 → 3 and 6 → 8 → 10), while the multilabel v′ = (1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0) does not, since 1 → 2 → 4 is not a path in the forest. [sent-102, score-0.822]
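
Checking whether a multilabel respects G is a purely local test: every positive non-root node must have a positive parent. A minimal sketch under the same hypothetical parent-array encoding used in the H-loss snippet above:

```python
def respects(v, parent):
    """v respects the taxonomy iff v is a union of paths starting at roots,
    i.e., every positive non-root node has a positive parent."""
    return all(v[i] == 0 or parent[i] == -1 or v[parent[i]] == 1
               for i in range(len(v)))

# Chain 0 -> 1 -> 2:
print(respects([1, 1, 0], parent=[-1, 0, 1]))  # True: the path 0 -> 1
print(respects([1, 0, 1], parent=[-1, 0, 1]))  # False: node 2 on, parent off
```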

26 The H-Loss. Two very simple loss functions, measuring the discrepancy between the prediction multilabel ŷ = (ŷ1 , . . . [sent-107, score-0.287]

27 In the next lemma we show an important (and intuitive) property of the H-loss: when the multilabel v to be predicted respects a taxonomy G then there is no loss of generality in restricting to predictions which respect G. [sent-125, score-0.589]

28 Formally, given a multilabel ŷ ∈ {0, 1}N , we define the G-truncation of ŷ as the multilabel ŷ′ = (ŷ′1 , . . . [sent-126, score-0.478]

29 Note that the G-truncation of any multilabel always respects G. [sent-133, score-0.323]
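
The G-truncation itself is one top-down sweep. The sketch below assumes nodes are numbered so that each parent precedes its children (an encoding assumption of ours, not the paper's):

```python
def g_truncate(y, parent):
    """Zero out every positive node that has a zeroed ancestor; assumes a
    topological numbering, i.e., parent[i] < i for every non-root i."""
    y_trunc = list(y)
    for i in range(len(y)):
        if parent[i] != -1 and y_trunc[parent[i]] == 0:
            y_trunc[i] = 0
    return y_trunc

print(g_truncate([1, 0, 1], parent=[-1, 0, 1]))  # [1, 0, 0], which respects G
```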

30 Each node corresponds to a class in the taxonomy G, hence in this case N = 12. [sent-136, score-0.398]

31 Gray nodes are included in the multilabel under consideration, white nodes are not. [sent-137, score-0.347]

32 (d) Superposition of multilabel (b) on multilabel (c): Only the checked nodes contribute to the H-loss between (b) and (c). [sent-140, score-0.532]

33 Hence the H-loss between multilabel (b) and multilabel (c) is 3. [sent-141, score-0.478]

34 But, since the G-truncation of ŷ does not change the value of a node i whose ancestors j are such that ŷj = 1, this also implies ŷ′i = ŷi . [sent-156, score-0.294]

35 A New Hierarchical Learning Algorithm In this section we describe our on-line algorithm for hierarchical classification. [sent-159, score-0.291]

36 instance xt is revealed to the algorithm, which outputs the prediction ŷt = (ŷ1,t , . . . [sent-165, score-0.435]

37 This is viewed as a guess for the multilabel vt = (v1,t , v2,t , . . . [sent-169, score-0.29]

38 , vN,t ) associated with the current instance xt . [sent-172, score-0.36]

39 After each prediction, the algorithm observes the true multilabel vt and adjusts its parameters for the next prediction. [sent-173, score-0.335]

40 These node classifiers are evaluated, starting from each root, in the following top-down fashion: the root is labelled by evaluating its node classifier; if a node has been labelled 1, then each child is labelled by evaluating its node classifier. [sent-178, score-1.065]
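
That evaluation scheme translates directly into a root-first sweep. A sketch with our own naming: W[i] is the linear-threshold weight vector of node i, and children[i] lists its children.

```python
import numpy as np

def predict_topdown(x, W, parent, children):
    """Top-down evaluation: roots are always evaluated; a non-root node is
    evaluated only if its parent was labelled 1, and is labelled 0 otherwise,
    so the output multilabel always respects the taxonomy."""
    N = len(parent)
    y = [0] * N
    frontier = [i for i in range(N) if parent[i] == -1]   # the roots
    while frontier:
        i = frontier.pop()
        y[i] = int(W[i] @ x >= 0)          # linear-threshold node classifier
        if y[i] == 1:
            frontier.extend(children[i])   # descend only below positive nodes
    return y
```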

41 Figure 3: The hierarchical learning algorithm H - RLS. [sent-208, score-0.291]

42 In other words, wi is considered for update only on those instances xt such that vPAR(i),t = 1. [sent-216, score-0.379]

43 Let Q(i,t) denote the number of times the parent of node i observes a positive label up to time t, i. [sent-217, score-0.281]

44 The weight vector wi,t stored at time t in node i is a (conditional) regularized least squares estimator given by wi,t = (I + Si,Q(i,t−1) Si,Q(i,t−1)⊤ + xt xt⊤)−1 Si,Q(i,t−1) (vi,i1 , . . . [sent-220, score-0.927]
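
In code, each wi,t is one ridge-regression style solve over the instances stored at node i plus the current instance. A numpy sketch with our own naming (S is d x Q with one stored instance per column, labels the corresponding observed node-i labels):

```python
import numpy as np

def hrls_weight(S, labels, x_t):
    """w_{i,t} = (I + S S^T + x_t x_t^T)^{-1} S labels, per the estimator
    above; S holds the instances whose parent label was 1 at node i."""
    d = S.shape[0]
    A = np.eye(d) + S @ S.T + np.outer(x_t, x_t)
    return np.linalg.solve(A, S @ labels)   # solve instead of explicit inverse
```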

45 , 2003) where we include the current instance xt in the computation of wi,t (see, e. [sent-232, score-0.36]

46 of instances, we base our analysis on the following stochastic model for generating the multilabel associated to an instance xt . [sent-243, score-0.624]

47 The multilabel V t is distributed according to the joint distribution fG (· | xt ). [sent-264, score-0.573]

48 With each node i in the taxonomy, we associate a unit-norm weight vector ui ∈ Rd . [sent-267, score-0.29]

49 Then, we define the conditional probabilities for a nonroot node i with parent j as follows: P(Vi = 1 | Vj = 1, x) = (1 + ui⊤x)/2. [sent-268, score-0.345]
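
Under this linear noise model a multilabel can be sampled top-down, which also makes the model concrete: a root fires with probability (1 + u⊤x)/2, and a non-root fires with that probability only when its parent fired. A sketch assuming unit-norm ui and xt (so the probabilities lie in [0, 1]):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_multilabel(x, U, parent, children):
    """Sample V top-down: P(V_i = 1 | V_parent(i) = 1, x) = (1 + u_i.x)/2;
    nodes below a zero label are forced to 0, so V respects the taxonomy."""
    N = len(parent)
    v = [0] * N
    frontier = [i for i in range(N) if parent[i] == -1]
    while frontier:
        i = frontier.pop()
        p = (1.0 + U[i] @ x) / 2.0          # in [0, 1] since |u_i.x| <= 1
        v[i] = int(rng.random() < p)
        if v[i] == 1:
            frontier.extend(children[i])
    return v
```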

50 Regret and the Reference Classifier Assuming the stochastic model described in Section 5, we compare the performance of our algorithm to the performance of the fixed hierarchical classifier built on the true parameters u1 , . [sent-278, score-0.291]

51 This reference hierarchical classifier has the same form as the classifiers generated by H - RLS. [sent-282, score-0.33]

52 To evaluate our algorithm against the reference hierarchical classifier defined in (4), we use the cumulative regret. [sent-287, score-0.433]

53 We measure the performance of H - RLS through its cumulative regret on a sequence of T examples: ∑t=1..T ( E ℓ(ŷt , Vt ) − E ℓ(ȳt , Vt ) ), (5) where ȳt denotes the prediction of the reference classifier. [sent-289, score-0.335]

54 The regret bound we prove in Section 7 holds when ℓ = ℓH , and is shown to depend on the interaction between the spectral structure of the data generating process and the structure of the taxonomy on which the process is applied. [sent-290, score-0.449]

55 Analysis. We now prove a bound on the cumulative regret of H - RLS with respect to the H-loss function ℓH . [sent-292, score-0.335]

56 Our analysis hinges on proving that for any node i, the estimated margin wi,t xt is an asymptotically unbiased estimator of the true margin ui xt , and then on using known large deviation arguments to obtain the stated bound. [sent-293, score-1.035]

57 Then the cumulative regret of the H - RLS algorithm (described in Figure 3) satisfies, for each T ≥ 1, ∑t=1..T ( E ℓH (ŷt , Vt ) − E ℓH (ȳt , Vt ) ) ≤ 16(1 + 1/e) ∑i=1..N (Ci / ∆i²) E[ ∑j=1..d log(1 + λi,j ) ], where ∆i,t = ui⊤xt and ∆i² = min t=1,...,T ∆i,t². [sent-306, score-0.753]

58 Remark 4 It is important to emphasize the interplay between the taxonomy structure and the process generating the examples, as expressed by the above regret bound. [sent-320, score-0.449]

59 From the previous remark we have ∑j=1..d λi,j = trace(Si,Q(i,T)⊤ Si,Q(i,T)) = Q(i, T ) since ||xt|| = 1 for all t, and ∑j=1..d log(1 + λi,j ) ≤ max{ ∑j=1..d log(1 + µj ) : ∑j=1..d µj = Q(i, T ) } = d log(1 + Q(i, T )/d). [sent-325, score-0.358]

60 d Moreover, Q(i, T ) is the sum of T Bernoulli random variables, where the t-th variable takes value 1 when the parent of the i-th node in the taxonomy observes label VPAR(i),t = 1 at time t. [sent-326, score-0.473]

61 (7) Bound (6) is obviously a log T cumulative regret bound, since Q(i, T ) ≤ T anyway. [sent-329, score-0.359]

62 It is important, however, to see how the regret bound depends on the taxonomy structure. [sent-330, score-0.424]

63 If i is a root node then E Q(i, T ) = Q(i, T ) = T (since a root node observes all labels). [sent-332, score-0.585]

64 If i is a root node, then its contribution to the overall regret is roughly log T . [sent-335, score-0.35]

65 On the other hand, the deeper node i is within the taxonomy, the smaller is its contribution to the overall regret. [sent-336, score-0.634]

66 Therefore, the contribution of leaf node i is smaller than log T because the hierarchical nature of the problem (as expressed by the H-loss) lowers the relative importance of the accuracy of estimator wi,t when computing the overall regret. [sent-340, score-0.606]

67 Remark 5 Nothing prevents us from generalizing the H-loss by associating fixed cost coefficients to each taxonomy node: ℓH (y, v) = ∑i=1..N ci {yi ≠ vi ∧ yj = vj , j ∈ ANC(i)} , where the cost coefficients ci are positive real numbers. [sent-342, score-0.35]

68 For brevity, in the rest of this proof we use the notations ∆i,t = ui⊤xt (the target margin on xt ) and ∆̂i,t = wi,t⊤xt (the algorithm margin on xt ). [sent-365, score-1.468]

69 Recalling the definition (1) of wi,t , this implies (for conciseness we write Q instead of Q(i, t − 1)) Ei,t [∆̂i,t ] = ui⊤ Si,Q Si,Q⊤ (I + Si,Q Si,Q⊤ + xt xt⊤)−1 xt . [sent-391, score-1.086]

70 Note that ∆i,t = Ei,t [∆̂i,t ] + ui⊤ (I + xt xt⊤)(I + Si,Q Si,Q⊤ + xt xt⊤)−1 xt = Ei,t [∆̂i,t ] + Bi,t , where Bi,t = ui⊤ (I + xt xt⊤)(I + Si,Q Si,Q⊤ + xt xt⊤)−1 xt is the conditional bias of wi,t . [sent-392, score-3.508]

71 It is useful to introduce the short-hand notation ri,t = xt⊤ (I + Si,Q Si,Q⊤ + xt xt⊤)−1 xt . [sent-393, score-1.336]

72 (Zi,t,1 , . . . , Zi,t,Q )⊤ = Si,Q⊤ (I + Si,Q Si,Q⊤ + xt xt⊤)−1 xt . [sent-401, score-1.002]

73 be arbitrary while the real-valued labels yt are defined as yt = u⊤ xt + εt , where the εt are i.i.d. [sent-467, score-0.525]

74 A regret bound similar to the one established by Theorem 2 can be proven for the zero-one loss using the fact that this loss can be crudely upper bounded by the H-loss (with all cost coefficients set to 1). [sent-471, score-0.328]

75 In any case, since these two loss functions are unable to capture the hierarchical nature of our classification problem, we believe the resulting bounds are less relevant to this paper. [sent-481, score-0.339]

76 The associated taxonomy of labels, which are the document topics, contains 101 nodes organized in a forest of 4 trees. [sent-485, score-0.298]

77 The average number of paths in the multilabel of an instance is 1. [sent-491, score-0.304]

78 The performance of SH - RLS is compared against five baseline algorithms: a flat and a hierarchical version of the Perceptron algorithm (Novikov, 1962; Rosenblatt, 1958), a flat and a hierarchical [sent-517, score-0.582]

79 The first algorithm we consider, H - PERC, is a simple hierarchical version of the Perceptron. [sent-524, score-0.291]

80 In particular, H - PERC learns a hierarchical classifier by training a linear-threshold classifier at each node via the Perceptron algorithm. [sent-526, score-0.497]

81 If {wi,t xt ≥ 0} ≠ vi,t for such a node i, then the weight vector wi,t is updated using the Perceptron rule wi,t+1 = wi,t + vi,t xt ; on the other hand, if {wi,t xt ≥ 0} = vi,t , then wi,t+1 = wi,t (no update takes place at node i). [sent-535, score-1.414]
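
Per node, H-PERC is thus a plain mistake-driven Perceptron gated by the parent's true label. A sketch of one update step (mapping labels {0,1} to {-1,+1} for the additive rule is an encoding choice of ours):

```python
import numpy as np

def hperc_update(w, x_t, v_i, v_parent):
    """One H-PERC step at node i: the node sees x_t only when the parent's
    true label is 1; on a mistake, add or subtract x_t."""
    if v_parent != 1:
        return w                            # instance not observed at node i
    if int(w @ x_t >= 0) != v_i:            # mistake-driven update
        return w + (1 if v_i == 1 else -1) * x_t
    return w
```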

82 . . . , ŷN ) of a test instance x using the same top-down process described in Figure 3: ŷi = {wi⊤ x ≥ 0} if i is a root node; ŷi = {wi⊤ x ≥ 0} if i is not a root node and ŷj = 1 for j = PAR(i); ŷi = 0 if i is not a root node and ŷj = 0 for j = PAR(i). (15) [sent-539, score-0.674]

83 However, whereas H - RLS would update the weight wi,t of all such nodes i, SH - RLS also requires the margin condition |wi,t xt | ≤ (5 ln t)/Ni,t , where Ni,t is the number of instances stored at node i up to time t − 1. [sent-543, score-0.663]
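
A sketch of that storage gate, taking the threshold exactly as printed above (our reading of the extracted text; the paper's exact form may differ):

```python
import math

def shrls_stores(w, x_t, t, n_stored):
    """SH-RLS margin gate at node i: given the parent's label is 1, store
    x_t only when the margin is small, |w.x| <= (5 ln t) / N_{i,t}."""
    return abs(w @ x_t) <= (5.0 * math.log(t)) / max(n_stored, 1)  # guard N=0
```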

84 We also tested a hierarchical version of SVM (denoted by H - SVM) in which each node is an SVM classifier trained using a batch version of our hierarchical learning protocol. [sent-546, score-0.814]

85 The resulting set of linear-threshold functions was then evaluated on the test set using the hierarchical classification scheme (15). [sent-548, score-0.291]

86 , 2000) for SVM and found the setting C = 1 to work best for our data (recall that all instances xt are normalized, ||xt|| = 1). [sent-550, score-0.713]

87 Table 1: Experimental results on two hierarchical text classification tasks under various loss functions. [sent-629, score-0.362]

88 To give an idea of how flat and hierarchical algorithms compare in terms of running times, we mention that hierarchical algorithms turned out to be roughly four times faster than the corresponding flat algorithms running on the same data sets. [sent-641, score-0.582]

89 Then node i makes an H-loss mistake on (x, v) if yi ≠ vi ∧ yj = vj = 1, j ∈ ANC(i). [sent-645, score-0.441]

90 Thus, node i makes a false positive mistake if yi = 1 ∧ vi = 0 ∧ yj = vj = 1, j ∈ ANC(i), and makes a false negative mistake if yi = 0 ∧ vi = 1 ∧ yj = vj = 1, j ∈ ANC(i). [sent-649, score-0.676]
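
The same ancestor test splits H-loss mistakes into the two kinds; a sketch reusing the parent-array encoding of the earlier snippets:

```python
def h_loss_fp_fn(y_pred, y_true, parent):
    """Count false-positive / false-negative H-loss mistakes: node i is
    charged only if every ancestor j has y_pred[j] = y_true[j] = 1."""
    fp = fn = 0
    for i in range(len(y_true)):
        if y_pred[i] == y_true[i]:
            continue
        j, ancestors_on = parent[i], True
        while j != -1:
            if not (y_pred[j] == 1 and y_true[j] == 1):
                ancestors_on = False
                break
            j = parent[j]
        if ancestors_on:
            if y_pred[i] == 1:
                fp += 1
            else:
                fn += 1
    return fp, fn
```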

91 Conclusions, Ongoing Research, and Open Problems We have introduced H - RLS, a new on-line algorithm for hierarchical classification that maintains and updates regularized least-squares estimators on the nodes of a taxonomy. [sent-654, score-0.345]

92 The linear-threshold classifications, obtained from the estimators, are combined to produce a single hierarchical multilabel through a simple top-down evaluation model. [sent-655, score-0.53]

93 To properly evaluate hierarchical classifiers in this framework we have defined the H-loss, a new hierarchical loss function, with cost coefficients possibly associated to each taxonomy node—see Remark 5. [sent-657, score-0.822]

94 Our main theoretical result states that, on any sequence of instances, the cumulative H-loss of H - RLS is never much bigger than the cumulative H-loss of a reference classifier tuned with the parameters of the stochastic process generating the multilabels for the given sequence of instances. [sent-658, score-0.354]

95 The experiments show that one epoch of training of our algorithm is enough to achieve a performance close to that of the hierarchical SVM. [sent-660, score-0.328]

96 The first open question is the derivation of a hierarchical algorithm especially designed to minimize the H-loss. [sent-662, score-0.291]

97 Since such an optimal classifier turns out to be remarkably different from the hierarchical classifiers produced by H - RLS, a related theoretical question is to prove a reasonable bound on the regret with respect to the Bayes optimal classifier. [sent-664, score-0.523]

98 Throughout this appendix, A denotes the positive definite matrix I + Si,Q Si,Q⊤ , while r denotes the quadratic form xt⊤ (A + xt xt⊤)−1 xt . [sent-675, score-1.336]

99 Proof of Lemma 8. Setting for brevity H = Si,Q⊤ A−1 xt and a = xt⊤ A−1 xt , we can write ||Zi,t||² = xt⊤ (A + xt xt⊤)−1 Si,Q Si,Q⊤ (A + xt xt⊤)−1 xt , and expand (A + xt xt⊤)−1 = A−1 − (A−1 xt xt⊤ A−1)/(1 + xt⊤ A−1 xt ) by the Sherman-Morrison formula. [sent-677, score-6.012]

100 Expanding and collecting terms gives ||Zi,t||² = H⊤H − 2a/(1 + a) H⊤H + a²/(1 + a)² H⊤H = H⊤H/(1 + a)² = xt⊤ A−1 Si,Q Si,Q⊤ A−1 xt /(1 + a)² ≤ (a/(1 + a)²) ||A−1/2 Si,Q Si,Q⊤ A−1/2|| , (16) since ||A−1/2 xt||² = xt⊤ A−1 xt = a. [sent-680, score-2.338]
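
The Sherman-Morrison step in this chain is easy to sanity-check numerically; a purely illustrative sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, Q = 5, 8
S = rng.standard_normal((d, Q))
x = rng.standard_normal(d)

A = np.eye(d) + S @ S.T                      # positive definite, as above
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(x, x))
rhs = A_inv - (A_inv @ np.outer(x, x) @ A_inv) / (1.0 + x @ A_inv @ x)
assert np.allclose(lhs, rhs)                 # the Sherman-Morrison identity
```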


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('anc', 0.376), ('xt', 0.334), ('rls', 0.292), ('hierarchical', 0.291), ('multilabel', 0.239), ('regret', 0.232), ('node', 0.206), ('taxonomy', 0.192), ('perc', 0.157), ('sh', 0.136), ('ohsumed', 0.135), ('ianchi', 0.125), ('fp', 0.12), ('ierarchical', 0.115), ('ncremental', 0.115), ('esa', 0.112), ('vpar', 0.111), ('cumulative', 0.103), ('mistake', 0.101), ('al', 0.096), ('vi', 0.09), ('fn', 0.085), ('ui', 0.084), ('respects', 0.084), ('multilabels', 0.084), ('yt', 0.075), ('mistakes', 0.074), ('lgorithms', 0.074), ('classi', 0.073), ('par', 0.071), ('lassification', 0.07), ('root', 0.064), ('fg', 0.062), ('labelled', 0.059), ('nodes', 0.054), ('forest', 0.052), ('lai', 0.051), ('vt', 0.051), ('subtree', 0.05), ('loss', 0.048), ('documents', 0.047), ('instances', 0.045), ('dekel', 0.045), ('observes', 0.045), ('perceptron', 0.045), ('svm', 0.044), ('mi', 0.044), ('incremental', 0.044), ('yi', 0.044), ('labels', 0.041), ('det', 0.04), ('paths', 0.039), ('reference', 0.039), ('er', 0.039), ('epoch', 0.037), ('labellings', 0.037), ('ruiz', 0.037), ('hierarchy', 0.036), ('electronically', 0.034), ('xx', 0.034), ('ci', 0.034), ('ers', 0.032), ('claudio', 0.031), ('conconi', 0.031), ('contribution', 0.03), ('parent', 0.03), ('corpus', 0.029), ('estimator', 0.029), ('path', 0.029), ('sun', 0.028), ('hofmann', 0.028), ('eigenstructure', 0.028), ('azoury', 0.028), ('milano', 0.028), ('mccallum', 0.027), ('batch', 0.026), ('leaf', 0.026), ('lemma', 0.026), ('instance', 0.026), ('dipartimento', 0.026), ('instantaneous', 0.026), ('rooted', 0.026), ('generating', 0.025), ('eigenvalues', 0.025), ('charged', 0.025), ('granitzer', 0.025), ('lnt', 0.025), ('mladenic', 0.025), ('nonroot', 0.025), ('tagged', 0.025), ('taxonomical', 0.025), ('sub', 0.024), ('log', 0.024), ('margin', 0.024), ('squares', 0.024), ('dell', 0.024), ('dumais', 0.024), ('lnai', 0.024), ('incrementally', 0.023), ('text', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

2 0.21804105 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

3 0.21313386 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

4 0.15885414 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation

5 0.15183857 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

6 0.13831142 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

7 0.094751604 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

8 0.079654977 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

9 0.068851486 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

10 0.06482818 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

11 0.064652629 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

12 0.062979624 12 jmlr-2006-Active Learning with Feedback on Features and Instances

13 0.061470687 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

14 0.055029929 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

15 0.051764708 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

16 0.04682963 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

17 0.04682596 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

18 0.046476893 91 jmlr-2006-The Interplay of Optimization and Machine Learning Research     (Special Topic on Machine Learning and Optimization)

19 0.044068176 57 jmlr-2006-Linear State-Space Models for Blind Source Separation

20 0.039656915 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.306), (1, 0.298), (2, 0.28), (3, 0.052), (4, -0.059), (5, 0.045), (6, 0.091), (7, -0.11), (8, -0.018), (9, 0.071), (10, -0.045), (11, 0.117), (12, -0.052), (13, 0.112), (14, 0.129), (15, -0.027), (16, 0.016), (17, -0.172), (18, 0.103), (19, 0.018), (20, -0.245), (21, -0.001), (22, -0.084), (23, -0.045), (24, -0.001), (25, 0.095), (26, -0.064), (27, 0.02), (28, -0.111), (29, 0.043), (30, -0.029), (31, -0.04), (32, -0.004), (33, 0.042), (34, 0.066), (35, -0.108), (36, -0.076), (37, 0.031), (38, 0.021), (39, -0.02), (40, -0.01), (41, 0.043), (42, -0.074), (43, -0.084), (44, 0.063), (45, -0.064), (46, 0.026), (47, 0.15), (48, -0.053), (49, -0.1)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9572503 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

2 0.68951392 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

3 0.57867771 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

4 0.52232641 86 jmlr-2006-Step Size Adaptation in Reproducing Kernel Hilbert Space

Author: S. V. N. Vishwanathan, Nicol N. Schraudolph, Alex J. Smola

Abstract: This paper presents an online support vector machine (SVM) that uses the stochastic meta-descent (SMD) algorithm to adapt its step size automatically. We formulate the online learning problem as a stochastic gradient descent in reproducing kernel Hilbert space (RKHS) and translate SMD to the nonparametric setting, where its gradient trace parameter is no longer a coefficient vector but an element of the RKHS. We derive efficient updates that allow us to perform the step size adaptation in linear time. We apply the online SVM framework to a variety of loss functions, and in particular show how to handle structured output spaces and achieve efficient online multiclass classification. Experiments show that our algorithm outperforms more primitive methods for setting the gradient step size. Keywords: online SVM, stochastic meta-descent, structured output spaces

5 0.49609795 75 jmlr-2006-Policy Gradient in Continuous Time

Author: Rémi Munos

Abstract: Policy search is a method for approximately solving an optimal control problem by performing a parametric optimization search in a given class of parameterized policies. In order to process a local optimization technique, such as a gradient method, we wish to evaluate the sensitivity of the performance measure with respect to the policy parameters, the so-called policy gradient. This paper is concerned with the estimation of the policy gradient for continuous-time, deterministic state dynamics, in a reinforcement learning framework, that is, when the decision maker does not have a model of the state dynamics. We show that usual likelihood ratio methods used in discrete-time, fail to proceed the gradient because they are subject to variance explosion when the discretization time-step decreases to 0. We describe an alternative approach based on the approximation of the pathwise derivative, which leads to a policy gradient estimate that converges almost surely to the true gradient when the timestep tends to 0. The underlying idea starts with the derivation of an explicit representation of the policy gradient using pathwise derivation. This derivation makes use of the knowledge of the state dynamics. Then, in order to estimate the gradient from the observable data only, we use a stochastic policy to discretize the continuous deterministic system into a stochastic discrete process, which enables to replace the unknown coefficients by quantities that solely depend on known data. We prove the almost sure convergence of this estimate to the true policy gradient when the discretization time-step goes to zero. The method is illustrated on two target problems, in discrete and continuous control spaces. Keywords: optimal control, reinforcement learning, policy search, sensitivity analysis, parametric optimization, gradient estimate, likelihood ratio method, pathwise derivation 1. Introduction and Statement of the Problem We consider an optimal control problem with continuous state (xt ∈ IRd )t≥0 whose state dynamics is defined according to the controlled differential equation: dxt = f (xt , ut ), dt (1) where the control (ut )t≥0 is a Lebesgue measurable function with values in a control space U. Note that the state-dynamics f may also depend on time, but we omit this dependency in the notation, for simplicity. We intend to maximize a functional J that depends on the trajectory (xt )0≤t≤T over a finite-time horizon T > 0. For simplicity, in the paper, we illustrate the case of a terminal reward c 2006 Rémi Munos. M UNOS only: J(x; (ut )t≥0 ) := r(xT ), (2) where r : IRd → IR is the reward function. Extension to the case of general functional of the kind J(x; (ut )t≥0 ) = Z T 0 r(t, xt )dt + R(xT ), (3) with r and R being current and terminal reward functions, would easily follow, as indicated in Remark 1. The optimal control problem of finding a control (ut )t≥0 that maximizes the functional is replaced by a parametric optimization problem for which we search for a good feed-back control law in a given class of parameterized policies {πα : [0, T ] × IRd → U}α , where α ∈ IRm is the parameter. The control ut ∈ U (or action) at time t is ut = πα (t, xt ), and we may write the dynamics of the resulting feed-back system as dxt = fα (xt ), (4) dt where fα (xt ) := f (x, πα (t, x)). In the paper, we will make the assumption that fα is C 2 , with bounded derivatives. Let us define the performance measure V (α) := J(x; πα (t, xt )t≥0 ), where its dependency with respect to (w.r.t.) 
the parameter α is emphasized. One may also consider an average performance measure according to some distribution µ for the initial state: V (α) := E[J(x; πα (t, xt )t≥0 )|x ∼ µ]. In order to find a local maximum of V (α), one may perform a local search, such as a gradient ascent method α ← α + η∇αV (α), (5) with an adequate step η (see for example (Polyak, 1987; Kushner and Yin, 1997)). The computation of the gradient ∇αV (α) is the object of this paper. A first method would be to approximate the gradient by a finite-difference quotient for each of the m components of the parameter: V (α + εei ) −V (α) , ε for some small value of ε (we use the notation ∂α instead of ∇α to indicate that it is a singledimensional derivative). This finite-difference method requires the simulation of m + 1 trajectories to compute an approximation of the true gradient. When the number of parameters is large, this may be computationally expensive. However, this simple method may be efficient if the number of parameters is relatively small. In the rest of the paper we will not consider this approach, and will aim at computing the gradient using one trajectory only. ∂αi V (α) ≃ 772 P OLICY G RADIENT IN C ONTINUOUS T IME Pathwise estimation of the gradient. We now illustrate that if the decision-maker has access to a model of the state dynamics, then a pathwise derivation would directly lead to the policy gradient. Indeed, let us define the gradient of the state with respect to the parameter: zt := ∇α xt (i.e. zt is defined as a d × m-matrix whose (i, j)-component is the derivative of the ith component of xt w.r.t. α j ). Our smoothness assumption on fα allows to differentiate the state dynamics (4) w.r.t. α, which provides the dynamics on (zt ): dzt = ∇α fα (xt ) + ∇x fα (xt )zt , dt (6) where the coefficients ∇α fα and ∇x fα are, respectively, the derivatives of f w.r.t. the parameter (matrix of size d × m) and the state (matrix of size d × d). The initial condition for z is z0 = 0. When the reward function r is smooth (i.e. continuously differentiable), one may apply a pathwise differentiation to derive a gradient formula (see e.g. (Bensoussan, 1988) or (Yang and Kushner, 1991) for an extension to the stochastic case): ∇αV (α) = ∇x r(xT )zT . (7) Remark 1 In the more general setting of a functional (3), the gradient is deduced (by linearity) from the above formula: ∇αV (α) = Z T 0 ∇x r(t, xt )zt dt + ∇x R(xT )zT . What is known from the agent? The decision maker (call it the agent) that intends to design a good controller for the dynamical system may or may not know a model of the state dynamics f . In case the dynamics is known, the state gradient zt = ∇α xt may be computed from (6) along the trajectory and the gradient of the performance measure w.r.t. the parameter α is deduced at time T from (7), which allows to perform the gradient ascent step (5). However, in this paper we consider a Reinforcement Learning (Sutton and Barto, 1998) setting in which the state dynamics is unknown from the agent, but we still assume that the state is fully observable. The agent knows only the response of the system to its control. To be more precise, the available information to the agent at time t is its own control policy πα and the trajectory (xs )0≤s≤t up to time t. At time T , the agent receives the reward r(xT ) and, in this paper, we assume that the gradient ∇r(xT ) is available to the agent. From this point of view, it seems impossible to derive the state gradient zt from (6), since ∇α f and ∇x f are unknown. 
The term ∇_x f(x_t) may be approximated by a least squares method from the observation of past states (x_s)_{s≤t}, as will be explained later in Subsection 3.2. However, the term ∇_α f(x_t) cannot be calculated analogously. In this paper, we introduce the idea of using stochastic policies to approximate the state (x_t) and the state gradient (z_t) by discrete-time stochastic processes (X^∆_t) and (Z^∆_t) (with ∆ being some discretization time-step). We show how Z^∆_t can be computed without the knowledge of ∇_α f, using only information available to the agent. We prove the convergence (with probability one) of the gradient estimate ∇_x r(X^∆_T) Z^∆_T, derived from the stochastic processes, to ∇_α V(α) when ∆ → 0. Here, almost sure convergence is obtained using the concentration of measure phenomenon (Talagrand, 1996; Ledoux, 2001).

Figure 1: A trajectory (X^∆_{t_n})_{0≤n≤N} and the state-dynamics vector f_α of the continuous process (x_t)_{0≤t≤T}.

Likelihood ratio method? It is worth mentioning that this strong convergence result contrasts with the usual likelihood ratio method (also called the score method) in discrete time (see, e.g., Reiman and Weiss, 1986; Glynn, 1987, or, more recently in the reinforcement learning literature, Williams, 1992; Sutton et al., 2000; Baxter and Bartlett, 2001; Marbach and Tsitsiklis, 2003), for which the policy gradient estimate is subject to variance explosion when the discretization time-step ∆ tends to 0. The intuitive reason for this problem lies in the fact that the number of decisions before getting the reward grows to infinity when ∆ → 0 (the variance of likelihood ratio estimates being usually linear in the number of decisions). Let us illustrate this problem on a simple two-dimensional process. Consider the deterministic continuous process (x_t)_{0≤t≤1} defined by the state dynamics

dx_t/dt = f_α := (α, 1 − α)′,  (8)

with 0 < α < 1 and initial condition x_0 = (0, 0)′ (where ′ denotes the transpose operator). The performance measure V(α) is the reward at the terminal state at time T = 1, the reward function being the first coordinate of the state, r((x, y)′) := x. Thus V(α) = r(x_{T=1}) = α, and its derivative is ∇_α V(α) = 1. Let (X^∆_{t_n})_{0≤n≤N} ∈ R² be a discrete-time stochastic process (the discrete times being {t_n = n∆}_{n=0,…,N} with the discretization time-step ∆ = 1/N) that starts from the initial state X^∆_0 = x_0 = (0, 0)′ and makes N random moves of length ∆ towards the right (action u_1) or the top (action u_2) (see Figure 1), according to the stochastic policy (i.e., the probability of choosing each action in state x) π_α(u_1|x) = α, π_α(u_2|x) = 1 − α. The process is thus defined according to the dynamics

X^∆_{t_{n+1}} = X^∆_{t_n} + (U_n, 1 − U_n)′ ∆,  (9)

where (U_n)_{0≤n<N} are independent Bernoulli(α) random variables.
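The variance explosion claimed above can be checked empirically on this example. The following sketch simulates the process (9) and forms the likelihood ratio (score) estimate of the gradient — the terminal reward r(X_T) weighted by the summed score of the N decisions — then reports its empirical mean and variance as N = 1/∆ grows. The routine name and sample sizes are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lr_estimates(alpha, N, n_runs):
    """Likelihood ratio (score method) estimates of grad_alpha V for the process (9):
    the terminal reward r(X_T) is weighted by the total score of the N decisions."""
    delta = 1.0 / N
    U = rng.random((n_runs, N)) < alpha             # U_n = 1: move right (action u1)
    reward = U.sum(axis=1) * delta                  # first coordinate of X_T, i.e. r(X_T)
    score = (U / alpha - (~U) / (1.0 - alpha)).sum(axis=1)
    return reward * score                           # unbiased: mean is grad_alpha V = 1

for N in (10, 100, 1000):
    g = lr_estimates(alpha=0.3, N=N, n_runs=10_000)
    print(f"N = {N:>4}: mean = {g.mean():.3f}, variance = {g.var():.1f}")
# The mean stays near the true gradient 1, while the variance grows roughly
# linearly with the number of decisions N = 1/Delta, as claimed above.
```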
For all N > 0, there exists a constant C that does not depend on N such that d_n ≤ C/N. Thus we may take D² = C²/N. Now, from the previous paragraph, ||E[X_N] − x_N|| ≤ e(N), with e(N) → 0 when N → ∞. This means that ||h − E[h]|| + e(N) ≥ ||X_N − x_N||, thus P(||h − E[h]|| ≥ ε + e(N)) ≥ P(||X_N − x_N|| ≥ ε), and we deduce from (31) that

P(||X_N − x_N|| ≥ ε) ≤ 2 e^{−N(ε + e(N))² / (2C²)}.

Thus, for all ε > 0, the series ∑_{N≥0} P(||X_N − x_N|| ≥ ε) converges. Now, from the Borel–Cantelli lemma, we deduce that for all ε > 0 there exists N_ε such that for all N ≥ N_ε we have ||X_N − x_N|| < ε, which proves the almost sure convergence of X_N to x_N as N → ∞ (i.e., X^∆_T → x_T almost surely as ∆ → 0).

Appendix C. Proof of Proposition 8

First, note that Q_t = (1/n_t) ∑_{s∈S(t)} X_{s+} X_{s+}′ − X̄ X̄′, with X̄ := (1/n_t) ∑_{s∈S(t)} X_{s+}, is a symmetric, non-negative matrix, since it may be rewritten as

Q_t = (1/n_t) ∑_{s∈S(t)} (X_{s+} − X̄)(X_{s+} − X̄)′.

In solving the least squares problem (21), we deduce b = ∆X̄ − A X̄ ∆, where ∆X̄ := (1/n_t) ∑_{s∈S(t)} ∆X_s; thus

min_{A,b} (1/n_t) ∑_{s∈S(t)} ||∆X_s − b − A X_{s+} ∆||²
 = min_A (1/n_t) ∑_{s∈S(t)} ||∆X_s − ∆X̄ − A (X_{s+} − X̄) ∆||²
 ≤ (1/n_t) ∑_{s∈S(t)} ||∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_{s+} − X̄) ∆||².  (32)

Now, since X_s = X̄ + O(∆), one may obtain, as in (19) and (20) (replacing X_t by X̄), that

∆X_s − ∆X̄ − ∇_x f(X̄, u_t)(X_{s+} − X̄) ∆ = O(∆³).  (33)

Writing ∇̂_x f(X_t, u_t) for the least-squares solution, we deduce from (32) and (33) that

(1/n_t) ∑_{s∈S(t)} ||(∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t))(X_{s+} − X̄) ∆||² = O(∆⁶).

By developing each component,

∑_{i=1}^d [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]_{row i} Q_t [∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)]′_{row i} = O(∆⁴).

Now, from the definition of ν(∆), u′ Q_t u ≥ ν(∆) ||u||² for every vector u ∈ R^d, thus

ν(∆) ||∇̂_x f(X_t, u_t) − ∇_x f(X̄, u_t)||² = O(∆⁴).

Condition (23) then yields ∇̂_x f(X_t, u_t) = ∇_x f(X̄, u_t) + o(1), and since ∇_x f(X_t, u_t) = ∇_x f(X̄, u_t) + O(∆), we deduce

lim_{∆→0} ∇̂_x f(X_t, u_t) = ∇_x f(X_t, u_t).
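For illustration, here is a minimal sketch of the kind of least-squares Jacobian estimate analyzed in Proposition 8. It fits ∆X_s ≈ b + A X_s ∆ over a window of observed transitions and returns A as the estimate of ∇_x f. This simplified regression (plain states X_s as regressors rather than the paper's X_{s+}, and a generic window) is an assumption of the sketch, not the exact problem (21).

```python
import numpy as np

def estimate_jacobian(X, dt):
    """Least-squares estimate of grad_x f from a window of observed states:
    fit dX_s ≈ b + A X_s dt over the window and return A."""
    dX = np.diff(X, axis=0)                    # observed increments dX_s, shape (n, d)
    Xs = X[:-1]                                # states at the start of each step
    Xc = (Xs - Xs.mean(axis=0)) * dt           # centered regressors, scaled by dt
    dXc = dX - dX.mean(axis=0)                 # centering eliminates the intercept b
    # Solve min_A ||dXc - Xc A'||^2; lstsq returns A' of shape (d, d).
    A, *_ = np.linalg.lstsq(Xc, dXc, rcond=None)
    return A.T

# Illustrative check on linear dynamics dx/dt = M x (assumed example):
M = np.array([[0.0, 1.0], [-1.0, -0.5]])
dt, x = 1e-2, np.array([1.0, 0.0])
traj = [x.copy()]
for _ in range(20):                            # short window of Euler transitions
    x = x + dt * (M @ x)
    traj.append(x.copy())
print(estimate_jacobian(np.array(traj), dt))   # should be close to M
```

Centering the regressors to eliminate the intercept mirrors the first step of the proof above, where the optimal b is substituted back into the least squares objective.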
References

J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001.

A. Bensoussan. Perturbation Methods in Optimal Control. Wiley/Gauthier-Villars Series in Modern Applied Mathematics. John Wiley & Sons, Chichester, 1988. Translated from the French by C. Tomson.

A. Bogdanov. Optimal control of a double inverted pendulum on a cart. Technical Report CSE-04-006, CSEE, OGI School of Science and Engineering, OHSU, 2004.

P. W. Glynn. Likelihood ratio gradient estimation: an overview. In A. Thesen, H. Grant, and W. D. Kelton, editors, Proceedings of the 1987 Winter Simulation Conference, pages 366–375, 1987.

E. Gobet and R. Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales: application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

G. H. Golub and C. F. Van Loan. Matrix Computations, 3rd edition. Johns Hopkins University Press, Baltimore, MD, 1996.

R. E. Kalman, P. L. Falb, and M. A. Arbib. Topics in Mathematical System Theory. McGraw-Hill, New York, 1969.

P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, 1995.

H. J. Kushner and G. Yin. Stochastic Approximation Algorithms and Applications. Springer-Verlag, Berlin and New York, 1997.

S. M. LaValle. Planning Algorithms. Cambridge University Press, 2006.

M. Ledoux. The Concentration of Measure Phenomenon. American Mathematical Society, Providence, RI, 2001.

P. Marbach and J. N. Tsitsiklis. Approximate gradient methods in policy-space optimization of Markov reward processes. Journal of Discrete Event Dynamical Systems, 13:111–148, 2003.

B. T. Polyak. Introduction to Optimization. Optimization Software Inc., New York, 1987.

M. I. Reiman and A. Weiss. Sensitivity analysis via likelihood ratios. In J. Wilson, J. Henriksen, and S. Roberts, editors, Proceedings of the 1986 Winter Simulation Conference, pages 285–289, 1986.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063. MIT Press, 2000.

M. Talagrand. A new look at independence. Annals of Probability, 24:1–34, 1996.

R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

J. Yang and H. J. Kushner. A Monte Carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216–1249, 1991.

6 0.45481569 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

7 0.33964452 63 jmlr-2006-New Algorithms for Efficient High-Dimensional Nonparametric Classification

8 0.33306408 3 jmlr-2006-A Hierarchy of Support Vector Machines for Pattern Detection

9 0.27344537 12 jmlr-2006-Active Learning with Feedback on Features and Instances

10 0.26922667 38 jmlr-2006-Incremental Support Vector Learning: Analysis, Implementation and Applications     (Special Topic on Machine Learning and Optimization)

11 0.25944772 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

12 0.25484321 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

13 0.25240669 90 jmlr-2006-Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

14 0.23949653 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

15 0.21297629 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

16 0.19321585 80 jmlr-2006-Segmental Hidden Markov Models with Random Effects for Waveform Modeling

17 0.17877653 40 jmlr-2006-Infinite-σ Limits For Tikhonov Regularization

18 0.1725501 23 jmlr-2006-Consistency and Convergence Rates of One-Class SVMs and Related Algorithms

19 0.16766661 35 jmlr-2006-Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

20 0.1665262 26 jmlr-2006-Efficient Learning of Label Ranking by Soft Projections onto Polyhedra     (Special Topic on Machine Learning and Optimization)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.017), (36, 0.072), (40, 0.422), (45, 0.023), (50, 0.057), (61, 0.019), (63, 0.051), (76, 0.017), (78, 0.014), (79, 0.013), (81, 0.048), (84, 0.021), (90, 0.052), (91, 0.016), (96, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75158358 37 jmlr-2006-Incremental Algorithms for Hierarchical Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, then no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on texual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine. Keywords: incremental algorithms, online learning, hierarchical classification, second order perceptron, support vector machines, regret bound, loss function

2 0.38668737 41 jmlr-2006-Kernel-Based Learning of Hierarchical Multilabel Classification Models     (Special Topic on Machine Learning and Optimization)

Author: Juho Rousu, Craig Saunders, Sandor Szedmak, John Shawe-Taylor

Abstract: We present a kernel-based algorithm for hierarchical text classification where the documents are allowed to belong to more than one category at a time. The classification model is a variant of the Maximum Margin Markov Network framework, where the classification hierarchy is represented as a Markov tree equipped with an exponential family defined on the edges. We present an efficient optimization algorithm based on incremental conditional gradient ascent in single-example subspaces spanned by the marginal dual variables. The optimization is facilitated with a dynamic programming based algorithm that computes best update directions in the feasible set. Experiments show that the algorithm can feasibly optimize training sets of thousands of examples and classification hierarchies consisting of hundreds of nodes. Training of the full hierarchical model is as efficient as training independent SVM-light classifiers for each node. The algorithm’s predictive accuracy was found to be competitive with other recently introduced hierarchical multicategory or multilabel classification learning algorithms. Keywords: kernel methods, hierarchical classification, text categorization, convex optimization, structured outputs

3 0.35174054 96 jmlr-2006-Worst-Case Analysis of Selective Sampling for Linear Classification

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Luca Zaniboni

Abstract: A selective sampling algorithm is a learning algorithm for classification that, based on the past observed data, decides whether to ask the label of each new instance to be classified. In this paper, we introduce a general technique for turning linear-threshold classification algorithms from the general additive family into randomized selective sampling algorithms. For the most popular algorithms in this family we derive mistake bounds that hold for individual sequences of examples. These bounds show that our semi-supervised algorithms can achieve, on average, the same accuracy as that of their fully supervised counterparts, but using fewer labels. Our theoretical results are corroborated by a number of experiments on real-world textual data. The outcome of these experiments is essentially predicted by our theoretical results: Our selective sampling algorithms tend to perform as well as the algorithms receiving the true label after each classification, while observing in practice substantially fewer labels. Keywords: selective sampling, semi-supervised learning, on-line learning, kernel algorithms, linear-threshold classifiers

4 0.34199318 70 jmlr-2006-Online Passive-Aggressive Algorithms

Author: Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, Yoram Singer

Abstract: We present a family of margin based online learning algorithms for various prediction tasks. In particular we derive and analyze algorithms for binary and multiclass categorization, regression, uniclass prediction and sequence prediction. The update steps of our different algorithms are all based on analytical solutions to simple constrained optimization problems. This unified view allows us to prove worst-case loss bounds for the different algorithms and for the various decision problems based on a single lemma. Our bounds on the cumulative loss of the algorithms are relative to the smallest loss that can be attained by any fixed hypothesis, and as such are applicable to both realizable and unrealizable settings. We demonstrate some of the merits of the proposed algorithms in a series of experiments with synthetic and real data sets.

5 0.31149182 52 jmlr-2006-Learning Spectral Clustering, With Application To Speech Separation

Author: Francis R. Bach, Michael I. Jordan

Abstract: Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram. Keywords: spectral clustering, blind source separation, computational auditory scene analysis

6 0.31121051 9 jmlr-2006-Accurate Error Bounds for the Eigenvalues of the Kernel Matrix

7 0.309331 29 jmlr-2006-Estimation of Gradients and Coordinate Covariation in Classification

8 0.30735999 60 jmlr-2006-Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples

9 0.30487341 66 jmlr-2006-On Model Selection Consistency of Lasso

10 0.30234319 1 jmlr-2006-A Direct Method for Building Sparse Kernel Learning Algorithms

11 0.30080929 28 jmlr-2006-Estimating the "Wrong" Graphical Model: Benefits in the Computation-Limited Setting

12 0.30071068 17 jmlr-2006-Bounds for the Loss in Probability of Correct Classification Under Model Based Approximation

13 0.30062613 53 jmlr-2006-Learning a Hidden Hypergraph

14 0.29845008 44 jmlr-2006-Large Scale Transductive SVMs

15 0.29742712 61 jmlr-2006-Maximum-Gain Working Set Selection for SVMs     (Special Topic on Machine Learning and Optimization)

16 0.29533046 73 jmlr-2006-Pattern Recognition for Conditionally Independent Data

17 0.29296219 95 jmlr-2006-Walk-Sums and Belief Propagation in Gaussian Graphical Models

18 0.29222891 56 jmlr-2006-Linear Programs for Hypotheses Selection in Probabilistic Inference Models     (Special Topic on Machine Learning and Optimization)

19 0.29155359 11 jmlr-2006-Active Learning in Approximately Linear Regression Based on Conditional Expectation of Generalization Error

20 0.29069015 51 jmlr-2006-Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming     (Special Topic on Machine Learning and Optimization)