nips nips2001 nips2001-105 knowledge-graph by maker-knowledge-mining

105 nips-2001-Kernel Machines and Boolean Functions


Source: pdf

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We give results about the learnability and required complexity of logical formulae to solve classification problems. [sent-10, score-0.32]

2 These results are obtained by linking propositional logic with kernel machines. [sent-11, score-0.395]

3 In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. [sent-12, score-0.958]

4 Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning. [sent-13, score-1.173]

5 1 Introduction The question of how many Boolean primitives are needed to learn a logical formula is typically an NP-hard problem, especially when learning from noisy data. [sent-14, score-0.209]

6 Likewise, when dealing with decision trees, the question of what depth and complexity of a tree are required to learn a certain mapping has proven to be a difficult task. [sent-15, score-0.414]

7 We address this issue in the present paper and give lower bounds on the number of Boolean functions required to learn a mapping. [sent-16, score-0.116]

8 This is achieved by a constructive algorithm which can be carried out in polynomial time. [sent-17, score-0.173]

9 Our tools for this purpose are a Support Vector learning algorithm and a special polynomial kernel. [sent-18, score-0.201]

10 We show that we can treat propositional logic and decision trees within the same framework. [sent-20, score-0.557]

11 Furthermore we will argue that in the limit boosted decision trees correspond to polynomial classifiers built directly on the data. [sent-21, score-0.52]

12 Section 3 contains our main result linking the margin of separation to a simple complexity measure on the class of logical formulae (number of terms and depth). [sent-22, score-0.592]

13 Subsequently we apply this connection to devise test procedures concerning the complexity of logical formulae capable of learning a certain dataset. [sent-23, score-0.32]

14 More specifically, this will involve the training of a perceptron to minimize the regularized risk functional. [sent-24, score-0.638]

15 2 Polynomial Representation of Boolean Formulae We use the standard assumptions of supervised learning: we have a training set of labelled examples. [sent-30, score-0.047]

16 Based on these observations we attempt to find a function which incorporates the information given by the training set. [sent-31, score-0.047]

17 What makes the situation in this paper special is that we assume that the inputs are Boolean vectors and the labels are binary. [sent-34, score-0.038]


19 We consider the set of all polynomials of a given expansion form, written with a compact notation for monomials and the usual conventions. [sent-41, score-0.336]

20 The subset of such polynomials of bounded degree is given its own notation; the set of all polynomials given by the expansion will be called disjunctive normal forms. [sent-43, score-0.767]

21 It is linked by the following lemma to the set of disjunctive normal forms commonly used in propositional logic. [sent-44, score-0.363]

22 The latter consist of all formulae that can be expressed as disjunctions of clauses, each clause being a conjunction of a bounded number of logical primitives. [sent-45, score-0.167]

23 Standing Assumption: Unless stated otherwise, we will assume in the rest of the paper that condition (3) of the above lemma holds. [sent-47, score-0.124]

24 Now we consider another special subclass of polynomials, called decision trees. [sent-52, score-0.229]

25 These are polynomials which have expansions of type (1) in which all coefficients are ±1 and, for every input, exactly one monomial 'fires', i.e. [sent-53, score-0.284]

26 exactly one of the monomial values equals 1 and all the others are 0. [sent-55, score-0.121]

27 (1) shows that each decision tree can be expressed as half of a difference between two disjunctive normal forms such that for any given input, one and only one of the conjunctions comprising them will be true. [sent-57, score-0.459]
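The polynomial view of a decision tree described above can be made concrete. The Python sketch below is illustrative only and uses hypothetical names and a hypothetical tree; it assumes 0/1 inputs with the complemented bits appended (so every leaf indicator is a plain monomial, in line with the Standing Assumption) and leaf labels in {-1, +1}, and it checks that exactly one monomial fires for any input. Grouping the positive and the negative leaves gives two DNF-style polynomials whose difference recovers the tree; the exact factor of one half depends on the paper's normalization of DNFs, which is not visible in this extract.

def with_complements(x):
    # Append the complemented bits: x -> (x, 1 - x).
    return x + [1 - b for b in x]

# Hypothetical depth-2 tree over (x1, x2). Each leaf is a pair
# (label in {-1, +1}, indices of the coordinates whose product is the leaf monomial).
# Coordinates 0, 1 are x1, x2; coordinates 2, 3 are their complements.
LEAVES = [
    (+1, [0, 1]),   # x1 = 1, x2 = 1
    (-1, [0, 3]),   # x1 = 1, x2 = 0
    (-1, [2, 1]),   # x1 = 0, x2 = 1
    (+1, [2, 3]),   # x1 = 0, x2 = 0
]

def tree_polynomial(x):
    z = with_complements(x)
    fired = [c * int(all(z[i] == 1 for i in idx)) for c, idx in LEAVES]
    assert sum(abs(v) for v in fired) == 1   # exactly one monomial fires
    return sum(fired)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), tree_polynomial([x1, x2]))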

28 There is also an obvious link to the popular decision trees (on Boolean variables) used for classification in machine learning. [sent-58, score-0.409]

29 Here the depth of a leaf equals the degree of the corresponding monomial, and the coefficient corresponds to the class associated with the leaf. [sent-60, score-0.118]

30 Such binary polynomials are widely used under the name of score tables. [sent-61, score-0.136]

31 3 Reproducing Kernel Hilbert Space and Risk Kernel The next step is to map the complexity measure applied to decision trees, such as depth or number of leaves, to a Reproducing Kernel Hilbert Space (RKHS), as used in Support Vector machines. [sent-64, score-0.329]

32 The RKHS is defined with the scalar product corresponding to the norm given by the quadratic form (4), whose weights act as complexity weights for each polynomial degree and whose variables are the coefficients of expansion (1). [sent-65, score-0.468]

33 Lemma 2 (Existence of Kernel) The RKHS kernel realizing the dot product corresponding to the quadratic form (4) has the efficient functional form given in (5). [sent-66, score-0.171]
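The closed form (5) itself did not survive extraction, so the sketch below shows one common kernel of this family for 0/1 inputs, not necessarily the paper's exact formula: monomials of degree d are assumed to be weighted geometrically by lambda_**d, and since a monomial is non-zero on both inputs exactly when its variables lie where both inputs equal 1, the kernel reduces to a function of the inner product.

from math import comb

def boolean_kernel(x, z, lambda_=1.0, max_degree=None):
    # Number of positions where both Boolean vectors equal 1.
    s = sum(xi * zi for xi, zi in zip(x, z))
    D = len(x) if max_degree is None else max_degree
    # Sum over degrees d of (assumed weight lambda_**d) * (number of surviving monomials).
    # The empty, degree-0 monomial is included here by convention.
    return sum((lambda_ ** d) * comb(s, d) for d in range(0, min(s, D) + 1))

# With lambda_ = 1 and no degree cap this equals 2 ** <x, z>; smaller lambda_
# penalizes high-degree monomials more strongly (cf. the complexity control below).
print(boolean_kernel([1, 0, 1, 1], [1, 1, 1, 0], lambda_=0.5))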

34 Furthermore it is easy to check that (4) defines a homogeneous quadratic form on this space. [sent-68, score-0.067]

35 From [1] we obtain that there exists a unique kernel corresponding to this quadratic form. [sent-71, score-0.186]

36 The key observation for the derivation of form (5) is that, given the positions on which two Boolean inputs both equal 1, one can count exactly how many monomials of each degree are non-vanishing on both inputs, and this count leads directly to (5). [sent-72, score-0.041]

37 The larger the parameter in (6), the less severely we will penalize higher order polynomials, which provides us with an effective means of controlling the complexity of the estimates. [sent-74, score-0.096]

38 Note that this is applicable only in certain parameter regimes. [sent-75, score-0.034]

39 Next we introduce regularized risk functionals. [sent-76, score-2.387]

40 They follow the standard assumptions made in soft-margin SVM and regularization networks. [sent-77, score-0.05]

41 The first risk is typically used by regularization networks [8], the other by support vector machines [5]. [sent-79, score-0.263]
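The actual definitions (7) do not survive in this extract; in the usual notation (which may differ from the paper's constants and weighting), the two standard regularized risk functionals referred to here are:

R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} \bigl(y_i - f(x_i)\bigr)^2 + \lambda\,\|f\|_{\mathcal{H}}^2   (regularization networks, squared loss)

R_{\mathrm{reg}}[f] = \frac{1}{m}\sum_{i=1}^{m} \max\bigl(0,\, 1 - y_i f(x_i)\bigr) + \lambda\,\|f\|_{\mathcal{H}}^2   (soft-margin SVM, hinge loss)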

42 Note that in (7) equalities hold throughout for decision trees, and in such a case the risks are fully determined by the depths of the leaves of the decision tree and the number of classification errors. [sent-82, score-0.531]

43 Furthermore, in the particular case of decision trees with all complexity weights set to one, i.e. [sent-83, score-0.358]

44 when the regularizer equals the number of leaves of the decision tree, the regularized risks are exactly equal to the "cost complexity" employed to prune decision trees in the CART algorithm [4]. [sent-85, score-1.09]

45 In other words, the basis of the pruning algorithm in CART is the minimisation of the regularised risk in the class of subtrees of the maximal tree, with the regularisation constant selected by a heuristic applied to a validation set. [sent-86, score-0.268]
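For reference, the cost-complexity criterion minimized in CART pruning takes the standard form below (using the usual notation from the CART literature, which may differ from the paper's):

R_\alpha(T) = R(T) + \alpha\,|\widetilde{T}|

where R(T) is the training error of the subtree T, |\widetilde{T}| is its number of leaves, and \alpha \ge 0 is the regularization constant selected on a validation set.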

46 This can then be translated into a lower bound on the complexity of the logical formula. [sent-88, score-0.17]

47 4 Complexity Bounds The last part missing to establish a polynomial-time device to lower-bound the required complexity of a logical formula is to present actual algorithms for minimizing the regularized risks. [sent-90, score-0.33]

48 In this section we will study two such methods: the kernel perceptron and the maximum margin perceptron and establish bounds on execution time and regularized risk. [sent-91, score-1.173]

49 Kernel Perceptron Test The kernel perceptron learning algorithm is a direct modification of the ordinary linear perceptron learning rule. [sent-92, score-0.398]

50 In a particular case it becomes the ordinary perceptron learning rule in the feature space. [sent-93, score-0.365]

51 In general it can be viewed as the perceptron learning rule in an extended feature space. Algorithm 1 (Regularized kernel perceptron) takes as given a Mercer kernel and a regularization constant. [sent-96, score-0.934]

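Because the pseudo-code of Algorithm 1 did not survive extraction, the following is only a minimal sketch of a kernel perceptron in dual form: the ordinary perceptron rule lifted to a feature space through a Mercer kernel k. The paper's regularized variant additionally involves the constant mentioned above, which is omitted in this sketch.

def kernel_perceptron(X, y, k, max_epochs=100):
    n = len(X)
    alpha = [0.0] * n                       # dual coefficients
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_i = sum(alpha[j] * k(X[j], X[i]) for j in range(n))
            if y[i] * f_i <= 0:             # mistake: perceptron update in the RKHS
                alpha[i] += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

# Example usage with the Boolean kernel sketched earlier:
#   alpha = kernel_perceptron(X, y, lambda a, b: boolean_kernel(a, b, lambda_=0.5))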

53 Theorem 3 Assume that the coefficients were generated by the kernel perceptron; then bound (9) holds. Note that the quantity defined above is the maximal margin of separation of the training data by polynomials (treated as elements of the RKHS). [sent-109, score-0.486]

54 Maximum Margin Perceptron Test Below we state formally the soft-margin version of the maximal margin perceptron algorithm. [sent-110, score-0.776]

55 This is a simplified (homogeneous) version of the algorithm introduced in [9]. [sent-111, score-0.033]

56 Algorithm 2 (Greedy Maximal Margin Perceptron, MMP) Given: a constant and a Mercer kernel. [sent-112, score-0.135]


58 The proof of the following theorem uses the extended feature space [13]. [sent-118, score-0.076]


60 If the algorithm halts after a given number of updates, then the stated bound holds. [sent-125, score-0.033]


62 Note that condition (10) ensures the convergence of the algorithm in a finite time. [sent-127, score-0.136]

63 The above theorem ensures that, in the appropriate parameter limit, the solution generated by Algorithm 2 converges to the (hard) maximum margin classifier. [sent-128, score-0.217]

64 Further, it can be shown that the bound (11) holds for every admissible choice of the coefficients. [sent-129, score-0.144]

65 Bounds on classification error The task of finding a linear perceptron minimizing the number of classification errors on the training set is known to be NP-hard. [sent-130, score-0.431]

66 On this basis it is reasonable to expect that finding a decision tree or disjunctive normal form of upper-bounded complexity that minimizes the number of errors is also hard. [sent-131, score-0.607]

67 In this section we provide a lower bound on the number of errors for such classifiers. [sent-132, score-0.126]

68 The following estimates on the number of classification errors (8) can be derived from Theorems 3 and 4. [sent-135, score-0.052]



71 On the other hand, if the solution has been generated after a given iteration of the "while loop" of the MMP learning rule, then the corresponding bound holds. [sent-142, score-0.404]


73 Additionally, the estimate (14) holds for every admissible parameter value. [sent-145, score-0.912]

74 Note that the relevant constant takes a different value in (13), while it equals 1 in (14). [sent-147, score-0.052]

75 Theorem 6 There exists an algorithm, running in time polynomial in both the input dimension and the number of training samples, that, given the labelled training sample, outputs a polynomial satisfying the stated error guarantee. [sent-150, score-0.442]

76 Using the standard quadratic programming approach this can be done in time polynomial in both quantities [3]. [sent-152, score-0.133]

77 Next, define the weight vector of the hyperplane with the lowest error rate on the whole training set. [sent-153, score-0.047]

78 We generated 500 examples for training and 5000 for independent testing, under the assumption of a 10% probability of each bit being reversed. [sent-156, score-0.047]

79 The task was to discriminate between two classes, digits 0-4 and digits 5-9. [sent-157, score-0.108]

80 Each "noisy digit" data vector was complemented by an additional 7-bit vector to ensure that our Standing Assumption of Section 2 holds true. [sent-158, score-0.034]
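A hedged sketch of how such data can be generated follows: the seven-segment encodings below use one common convention, and the segment order and exact noise model are assumptions rather than the paper's specification.

import random

SEGMENTS = {                 # segments a, b, c, d, e, f, g as 0/1 bits
    0: [1,1,1,1,1,1,0], 1: [0,1,1,0,0,0,0], 2: [1,1,0,1,1,0,1],
    3: [1,1,1,1,0,0,1], 4: [0,1,1,0,0,1,1], 5: [1,0,1,1,0,1,1],
    6: [1,0,1,1,1,1,1], 7: [1,1,1,0,0,0,0], 8: [1,1,1,1,1,1,1],
    9: [1,1,1,1,0,1,1],
}

def noisy_example(flip_prob=0.1):
    digit = random.randrange(10)
    bits = [b ^ (random.random() < flip_prob) for b in SEGMENTS[digit]]
    # Append the complemented 7 bits so the Standing Assumption of Section 2 holds.
    x = bits + [1 - b for b in bits]
    y = +1 if digit <= 4 else -1          # discriminate digits 0-4 vs 5-9
    return x, y

train = [noisy_example() for _ in range(500)]
test = [noisy_example() for _ in range(5000)]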


82 Table 1: Results for recognition of two groups of digits on a faulty LED display. [sent-168, score-0.188]

83 Table 1 compares, for each classifier (decision tree, kernel SVM, kernel perceptron, mask perceptron), the number of leaves/support vectors/terms and the train/test error rates in %; the numeric entries did not survive extraction. [sent-170, score-0.103]

85 A lower bound on the risk is obtained from the maximal margin criterion. [sent-194, score-0.485]

86 Similarly, a lower bound on the risk is obtained from the kernel perceptron criterion. [sent-198, score-0.684]

87 Risks for SVM solutions approach this bound and for kernel perceptron they are reasonably close. [sent-202, score-0.539]

88 Comparison with the risks obtained for decision trees shows that our lower bounds are meaningful (for the "un-pruned" decision trees the risks were only slightly worse). [sent-203, score-1.184]

89 The mask perceptron results show that simple (low number of terms) polynomial solutions with risks approaching our lower bounds can be practically found. [sent-204, score-0.809]

90 SVM solutions have error rates closest to the Bayesian classifier (the test error rate exceeds that of the Bayes-optimal classifier by only 7%). [sent-209, score-0.032]

91 Boosted Decision Trees An obvious question to ask is what happens if we take a large enough linear combination of decision trees. [sent-210, score-0.167]

92 In a nutshell, the proof relies on a partition of the identity and on solving the resulting expansion, where the remainder turns out to be a decision tree. [sent-213, score-0.25]

93 This means that, in the limit, boosting decision trees finds a maximum margin solution, a goal more directly achievable via a maximum margin perceptron on the polynomial feature space. [sent-214, score-1.042]

94 The constructive lower bounds we proved offer a fresh approach to some seemingly intractable problems. [sent-216, score-0.159]

95 For instance, such bounds can be used as points of reference for practical applications of inductive techniques such as decision trees. [sent-217, score-0.249]

96 The use of Boolean kernels introduced here allows a more insightful comparison of the performance of logic-based and analytical (linear) machine learning algorithms. [sent-218, score-0.097]

97 This contributes to research in the theory of learning systems, as illustrated by the result on the existence of a polynomial-time algorithm for estimating the minimal number of training errors for decision trees and disjunctive normal forms. [sent-219, score-0.794]

98 A potentially more practical link, to boosted decision trees and their convergence to maximum margin solutions, has to be investigated further. [sent-220, score-0.473]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.404), ('perceptron', 0.332), ('trees', 0.191), ('margin', 0.176), ('risks', 0.176), ('boolean', 0.173), ('decision', 0.167), ('disjunctive', 0.153), ('risk', 0.143), ('polynomials', 0.136), ('kernel', 0.135), ('ba', 0.134), ('classi', 0.133), ('logical', 0.116), ('regularized', 0.116), ('formulae', 0.108), ('leaves', 0.103), ('pe', 0.102), ('propositional', 0.102), ('polynomial', 0.097), ('logic', 0.097), ('complexity', 0.096), ('maximal', 0.092), ('cg', 0.089), ('ph', 0.088), ('telstra', 0.088), ('iq', 0.086), ('tree', 0.085), ('eg', 0.083), ('bounds', 0.082), ('nn', 0.077), ('ffd', 0.076), ('xx', 0.075), ('coef', 0.074), ('ih', 0.071), ('every', 0.07), ('cients', 0.069), ('cation', 0.067), ('depth', 0.066), ('boosted', 0.065), ('os', 0.065), ('rkhs', 0.065), ('tt', 0.064), ('sv', 0.062), ('linking', 0.061), ('reproducing', 0.058), ('mask', 0.056), ('lemma', 0.054), ('digits', 0.054), ('normal', 0.054), ('vs', 0.053), ('equals', 0.052), ('errors', 0.052), ('exists', 0.051), ('kowalczyk', 0.051), ('primitives', 0.051), ('standing', 0.051), ('regularization', 0.05), ('hilbert', 0.048), ('expansion', 0.048), ('training', 0.047), ('belmont', 0.046), ('monomial', 0.046), ('cart', 0.046), ('ns', 0.043), ('bq', 0.043), ('constructive', 0.043), ('rk', 0.043), ('xh', 0.043), ('analytical', 0.043), ('formula', 0.042), ('theorem', 0.041), ('svm', 0.041), ('tr', 0.041), ('monomials', 0.041), ('subsequently', 0.041), ('bound', 0.04), ('dw', 0.039), ('hh', 0.039), ('special', 0.038), ('er', 0.038), ('ers', 0.036), ('quadratic', 0.036), ('loop', 0.036), ('proof', 0.035), ('separation', 0.035), ('lower', 0.034), ('holds', 0.034), ('smola', 0.034), ('schuurmans', 0.033), ('tools', 0.033), ('ordinary', 0.033), ('convergence', 0.033), ('algorithm', 0.033), ('de', 0.033), ('solutions', 0.032), ('expansions', 0.032), ('mercer', 0.032), ('else', 0.031), ('homogeneous', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

2 0.22177128 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.     ¡

3 0.2208285 56 nips-2001-Convolution Kernels for Natural Language

Author: Michael Collins, Nigel Duffy

Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.

4 0.2121136 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

Author: Roni Khardon, Dan Roth, Rocco A. Servedio

Abstract: We study online learning in Boolean domains using kernels which capture feature expansions equivalent to using conjunctions over basic features. We demonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Perceptron algorithm over an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. We also consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions. While known upper bounds imply that Winnow can learn DNF formulae with a polynomial mistake bound in this setting, we prove that it is computationally hard to simulate Winnow’s behavior for learning DNF over such a feature set, and thus that such kernel functions for Winnow are not efficiently computable.

5 0.15991823 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

6 0.15289129 147 nips-2001-Pranking with Ranking

7 0.15252838 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

8 0.13419047 164 nips-2001-Sampling Techniques for Kernel Methods

9 0.12661752 8 nips-2001-A General Greedy Approximation Algorithm with Applications

10 0.12655987 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

11 0.1221692 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

12 0.11695819 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

13 0.11659446 190 nips-2001-Thin Junction Trees

14 0.115605 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

15 0.11536678 46 nips-2001-Categorization by Learning and Combining Object Parts

16 0.11530294 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

17 0.11110738 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

18 0.10859199 43 nips-2001-Bayesian time series classification

19 0.10731739 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines

20 0.10704603 180 nips-2001-The Concave-Convex Procedure (CCCP)


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.309), (1, 0.195), (2, -0.008), (3, 0.175), (4, 0.071), (5, -0.049), (6, 0.004), (7, 0.099), (8, -0.104), (9, 0.062), (10, -0.198), (11, -0.102), (12, 0.032), (13, 0.114), (14, 0.061), (15, -0.003), (16, -0.022), (17, 0.265), (18, 0.016), (19, -0.146), (20, -0.004), (21, -0.06), (22, 0.115), (23, 0.026), (24, 0.046), (25, -0.104), (26, 0.063), (27, -0.037), (28, -0.039), (29, -0.122), (30, -0.074), (31, 0.044), (32, 0.023), (33, 0.025), (34, -0.139), (35, 0.069), (36, 0.029), (37, -0.004), (38, 0.084), (39, -0.123), (40, 0.037), (41, 0.029), (42, -0.052), (43, -0.104), (44, -0.096), (45, -0.021), (46, 0.131), (47, -0.031), (48, -0.033), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95784044 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

2 0.79374558 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

Author: Roni Khardon, Dan Roth, Rocco A. Servedio

Abstract: We study online learning in Boolean domains using kernels which capture feature expansions equivalent to using conjunctions over basic features. We demonstrate a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the resulting classifier. We first describe several kernel functions which capture either limited forms of conjunctions or all conjunctions. We show that these kernels can be used to efficiently run the Perceptron algorithm over an exponential number of conjunctions; however we also prove that using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. We also consider an analogous use of kernel functions to run the multiplicative-update Winnow algorithm over an expanded feature space of exponentially many conjunctions. While known upper bounds imply that Winnow can learn DNF formulae with a polynomial mistake bound in this setting, we prove that it is computationally hard to simulate Winnow’s behavior for learning DNF over such a feature set, and thus that such kernel functions for Winnow are not efficiently computable.

3 0.66367996 56 nips-2001-Convolution Kernels for Natural Language

Author: Michael Collins, Nigel Duffy

Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.

4 0.54150939 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms

Author: Nicolò Cesa-bianchi, Alex Conconi, Claudio Gentile

Abstract: In this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good datadependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.

5 0.5406304 139 nips-2001-Online Learning with Kernels

Author: Jyrki Kivinen, Alex J. Smola, Robert C. Williamson

Abstract: We consider online learning in a Reproducing Kernel Hilbert Space. Our method is computationally efficient and leads to simple algorithms. In particular we derive update equations for classification, regression, and novelty detection. The inclusion of the -trick allows us to give a robust parameterization. Moreover, unlike in batch learning where the -trick only applies to the -insensitive loss function we are able to derive general trimmed-mean types of estimators such as for Huber’s robust loss.     ¡

6 0.45987326 147 nips-2001-Pranking with Ranking

7 0.45953676 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

8 0.45444489 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition

9 0.45276025 180 nips-2001-The Concave-Convex Procedure (CCCP)

10 0.45192158 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

11 0.44630021 164 nips-2001-Sampling Techniques for Kernel Methods

12 0.41841376 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

13 0.41541165 190 nips-2001-Thin Junction Trees

14 0.40853134 156 nips-2001-Rao-Blackwellised Particle Filtering via Data Augmentation

15 0.37981477 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

16 0.37469989 31 nips-2001-Algorithmic Luckiness

17 0.37171444 118 nips-2001-Matching Free Trees with Replicator Equations

18 0.37136853 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines

19 0.35887665 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces

20 0.35631257 25 nips-2001-Active Learning in the Drug Discovery Process


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.101), (17, 0.014), (19, 0.032), (27, 0.096), (30, 0.073), (36, 0.01), (38, 0.014), (59, 0.026), (72, 0.07), (79, 0.02), (83, 0.398), (91, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.87802505 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

2 0.87040323 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

3 0.79303759 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as being either of classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. To use one joint model is a Bayesian requirement, which has the advantage to fuse information according to its certainty. The classifier is implemented as hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

4 0.54274338 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter

Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially observable Markov decision process (POMDP). Gradient ascent methods (e.g., [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal∗ Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good estimate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investigate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f : X → R, and we know the integral of another function ϕ : X → R. Since X f = X (f − ϕ) + X ϕ, the integral of f − ϕ can be estimated instead. Obviously if ϕ = f then the variance is zero. More generally, Var(f − ϕ) = Var(f ) − 2Cov(f, ϕ) + Var(ϕ), so that if φ and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. 
In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actorcritic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices {P(u) : u ∈ U}, each with components pij (u) for i, j ∈ S, u ∈ U, an observation process ν : S → PY , where PY is the space of probability distributions over Y, and a reward function r : S → R. We assume that S, U, Y are finite, although all our results extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of mappings {µ(·, θ) : Y → PU |θ ∈ RK }. Together with the POMDP, this defines the controlled POMDP (S, U, Y, P , ν, r, µ). The joint state, observation and control process, {Xt , Yt , Ut }, is Markov. The state process, {Xt }, is also Markov, with transition probabilities pij (θ) = y∈Y,u∈U νy (i)µu (y, θ)pij (u), where νy (i) denotes the probability of observation y given the state i, and µu (y, θ) denotes the probability of action u given parameters θ and observation y. The Markov chain M(θ) = (S, P(θ)) then describes the behaviour of the process {Xt }. Assumption 1 The controlled POMDP (S, U, Y, P , ν, r, µ) satisfies: For all θ ∈ RK there exists a unique stationary distribution satisfying π (θ) P(θ) = π (θ). There is an R < ∞ such that, for all i ∈ S, |r(i)| ≤ R. There is a B < ∞ such that, for all u ∈ U, y ∈ Y and θ ∈ RK the derivatives ∂µu (y, θ)/∂θk (1 ≤ k ≤ K) exist, and the vector of these derivatives satisfies µu (y, θ)/µu (y, θ) ≤ B, where · denotes the Euclidean norm on RK . def T −1 1 We consider the average reward, η(θ) = limT →∞ E T t=0 r(Xt ) . Assumption 1 implies that this limit exists, and does not depend on the start state X0 . The aim is to def select a policy to maximize this quantity. Define the discounted value function, J β (i, θ) = T −1 t limT →∞ E t=0 β r(Xt ) X0 = i . Throughout the rest of the paper, dependences upon θ are assumed, and dropped in the notation. 
For a random vector A, we denote Var(A) = E (A − E [A])2 , where a2 denotes a a, and a denotes the transpose of the column vector a. GPOMDP Algorithm The GPOMDP algorithm [4] uses a sample path to estimate the gradient approximation def µu(y) η, but the βη = E β η approaches the true gradient µu(y) Jβ (j) . As β → 1, def variance increases. We consider a slight modification [2]: with Jt = def ∆T = 1 T T −1 t=0 2T s=t µUt (Yt ) Jt+1 . µUt (Yt ) β s−t r(Xs ), (1) Throughout this paper the process {Xt , Yt , Ut , Xt+1 } is generally understood to be generated by a controlled POMDP satisfying Assumption 1, with X0 ∼π (ie the initial state distributed according to the stationary distribution). That is, before computing the gradient estimates, we wait until the process has settled down to the stationary distribution. Dependent Samples Correlation terms arise in the variance quantities to be analysed. We show here that considering iid samples gives an upper bound on the variance of the general case. The mixing time of a finite ergodic Markov chain M = (S, P ) is defined as def τ = min t > 1 : max dT V i,j Pt i , Pt j ≤ e−1 , where [P t ]i denotes the ith row of P t and dT V is the total variation distance, dT V (P, Q) = i |P (i) − Q(i)|. Theorem 1 Let M = (S, P ) be a finite ergodic Markov chain, with mixing time τ , and 2|S|e and 0 ≤ α < let π be its stationary distribution. There are constants L < exp(−1/(2τ )), which depend only on M , such that, for all f : S → R and all t, Covπ (t) ≤ Lαt Varπ (f), where Varπ (f) is the variance of f under π, and Covπ (t) is f f the auto-covariance of the process {f(Xt )}, where the process {Xt } is generated by M with initial distribution π. Hence, for some constant Ω∗ ≤ 4Lτ , Var 1 T T −1 f(Xt ) t=0 ≤ Ω∗ Varπ (f). T We call (L, τ ) the mixing constants of the Markov chain M (or of the controlled POMDP D; ie the Markov chain (S, P )). We omit the proof (all proofs are in the full version [8]). Briefly, we show that for a finite ergodic Markov chain M , Covπ (t) ≤ Rt (M )Varπ (f), f 2 t for some Rt (M ). We then show that Rt (M ) < 2|S| exp(− τ ). In fact, for a reversible chain, we can choose L = 1 and α = |λ2 |, the second largest magnitude eigenvalue of P . 2 Baseline We consider an alteration of (1), def ∆T = 1 T T −1 µUt (Yt ) (Jt+1 − Ar (Yt )) . µUt (Yt ) t=0 (2) For any baseline Ar : Y → R, it is easy to show that E [∆T ] = E [∆T ]. Thus, we select Ar to minimize variance. The following theorem shows that this variance is bounded by a variance involving iid samples, with Jt replaced by the exact value function. Theorem 2 Suppose that D = (S, U, Y, P , ν, r, µ) is a controlled POMDP satisfying Assumption 1, D has mixing constants (L, τ ), {Xt , Yt , Ut , Xt+1 } is a process generated by D with X0 ∼π ,Ar : Y → R is a baseline that is uniformly bounded by M, and J (j) has the distribution of ∞ β s r(Xt ), where the states Xt are generated by D starting in s=0 X0 = j. Then there are constants C ≤ 5B2 R(R + M) and Ω ≤ 4Lτ ln(eT ) such that Var 1 T T −1 t=0 µUt (Yt ) Ω (Jt+1 −Ar (Yt )) ≤ Varπ µUt (Yt ) T + Ω E T µu (y) (J (j) − Jβ (j)) µu (y) µu (y) (Jβ (j)−Ar (y)) µu (y) 2 + Ω +1 T C βT , (1 − β)2 where, as always, (i, y, u, j) are generated iid with i∼π, y∼ν(i), u∼µ(y) and j∼P i (u). The proof uses Theorem 1 and [2, Lemma 4.3]. Here we have bounded the variance of (2) with the variance of a quantity we may readily analyse. 
The second term on the right hand side shows the error associated with replacing an unbiased, uncorrelated estimate of the value function with the true value function. This quantity is not dependent on the baseline. The final term on the right hand side arises from the truncation of the discounted reward— and is exponentially decreasing. We now concentrate on minimizing the variance σ 2 = Varπ r µu (y) (Jβ (j) − Ar (y)) , µu (y) (3) which the following lemma relates to the variance σ 2 without a baseline, µu (y) Jβ (j) . µu (y) σ 2 = Varπ Lemma 3 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For any baseline Ar : Y → R, and for i∼π, y∼ν(i), u∼µ(y) and j∼Pi (u), σ 2 = σ 2 + E A2 (y) E r r µu (y) µu (y) 2 y − 2Ar (y)E µu (y) µu (y) 2 Jβ (j) y . From Lemma 3 it can be seen that the task of finding the optimal baseline is in effect that of minimizing a quadratic for each observation y ∈ Y. This gives the following theorem. Theorem 4 For the controlled POMDP as in Lemma 3,  2 µu (y) min σ 2 = σ 2 − E  E Jβ (j) y r Ar µu (y) 2 /E µu (y) µu (y) 2 y and this minimum is attained with the baseline 2 µu (y) µu (y) A∗ (y) = E r Jβ (j) , 2 µu (y) µu (y) /E y  y . Furthermore, the optimal constant baseline is µu (y) µu (y) A∗ = E r 2 Jβ (j) /E µu (y) µu (y) 2 . The following theorem shows that the variance of an estimate with an arbitrary baseline can be expressed as the sum of the variance with the optimal baseline and a certain squared weighted distance between the baseline function and the optimal baseline function. Theorem 5 If Ar : Y → R is a baseline function, A∗ is the optimal baseline defined in r Theorem 4, and σ 2 is the variance of the corresponding estimate, then r∗ µu (y) µu (y) σ 2 = σ2 + E r r∗ 2 (Ar (y) − A∗ (y)) r 2 , where i∼π, y ∼ν(i), and u∼µ(y). Furthermore, the same result is true for the case of constant baselines, with Ar (y) replaced by an arbitrary constant baseline Ar , and A∗ (y) r replaced by A∗ , the optimum constant baseline defined in Theorem 4. r For the constant baseline Ar = E i∼π [Jβ (i)], Theorem 5 implies that σ 2 is equal to r min Ar ∈R σ2 r + E µu (y) µu (y) 2 E [Jβ (j)] − E µu (y) µu (y) 2 2 /E Jβ (j) µu (y) µu (y) 2 . Thus, its performance depends on the random variables ( µu (y)/µu (y))2 and Jβ (j); if they are nearly independent, E [Jβ ] is a good choice. 3 Fixed Value Functions: Actor-Critic Methods We consider an alteration of (1), ˜ def 1 ∆T = T T −1 t=0 µUt (Yt ) ˜ Jβ (Xt+1 ), µUt (Yt ) (4) ˜ for some fixed value function Jβ : S → R. Define ∞ def β k d(Xk , Xk+1 ) Aβ (j) = E X0 = j , k=0 def ˜ ˜ where d(i, j) = r(i) + β Jβ (j) − Jβ (i) is the temporal difference. Then it is easy to show that the estimate (4) has a bias of µu (y) ˜ Aβ (j) . β η − E ∆T = E µu (y) The following theorem gives a bound on the expected squared error of (4). The main tool in the proof is Theorem 1. Theorem 6 Let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1. For a sample path from D, that is, {X0∼π, Yt∼ν(Xt ), Ut∼µ(Yt ), Xt+1∼PXt (Ut )}, E ˜ ∆T − βη 2 ≤ Ω∗ Varπ T µu (y) ˜ Jβ (j) + E µu (y) µu (y) Aβ (j) µu (y) 2 , where the second expectation is over i∼π, y∼ν(i), u∼µ(y), and j∼P i (u). ˜ If we write Jβ (j) = Jβ (j) + v(j), then by selecting v = (v(1), . . . , v(|S|)) from the right def null space of the K × |S| matrix G, where G = i,y,u πi νy (i) µu (y)Pi (u), (4) will produce an unbiased estimate of β η. An obvious example of such a v is a constant vector, (c, c, . . . , c) : c ∈ R. 
We can use this to construct a trivial example where (4) produces an unbiased estimate with zero variance. Indeed, let D = (S, U, Y, P , ν, r, µ) be a controlled POMDP satisfying Assumption 1, with r(i) = c, for some 0 < c ≤ R. Then Jβ (j) = c/(1 − β) and β η = 0. If we choose v = (−c/(1 − β), . . . , −c/(1 − β)) and ˜ ˜ Jβ (j) = Jβ (j) + v(j), then µµu(y) Jβ (j) = 0 for all y, u, j, and so (4) gives an unbiased u(y) estimate of β η, with zero variance. Furthermore, for any D for which there exists a pair y, u such that µu (y) > 0 and µu (y) = 0, choosing ˜β (j) = Jβ (j) gives a variance J greater than zero—there is a non-zero probability event, (Xt = i, Yt = y, Ut = u, Xt+1 = j), such that µµu(y) Jβ (j) = 0. u(y) 4 Algorithms Given a parameterized class of baseline functions Ar (·, θ) : Y → R θ ∈ RL , we can use Theorem 5 to bound the variance of our estimates. Computing the gradient of this bound with respect to the parameters θ of the baseline function allows a gradient optimization of the baseline. The GDORB Algorithm produces an estimate ∆ S of these gradients from a sample path of length S. Under the assumption that the baseline function and its gradients are uniformly bounded, we can show that these estimates converge to the gradient of σ 2 . We omit the details (see [8]). r GDORB Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized baseline Ar . set z0 = 0 (z0 ∈ RL ), ∆0 = 0 (∆0 ∈ RL ) for all {is , ys , us , is+1 , ys+1 } generated by the POMDP do zs+1 = βzs + ∆s+1 = ∆s + end for Ar (ys ) 1 s+1 µus(ys ) µus(ys ) 2 ((Ar (ys ) − βAr (ys+1 ) − r(xs+1 )) zs+1 − ∆s ) ˜ For a parameterized class of fixed value functions {Jβ (·, θ) : S → R θ ∈ RL }, we can use Theorem 6 to bound the expected squared error of our estimates, and compute the gradient of this bound with respect to the parameters θ of the baseline function. The GBTE Algorithm produces an estimate ∆S of these gradients from a sample path of length S. Under the assumption that the value function and its gradients are uniformly bounded, we can show that these estimates converge to the true gradient. GBTE Algorithm: Given: Controlled POMDP (S, U, Y, P , ν, r, µ), parameterized fixed value function ˜β . J set z0 = 0 (z0 ∈ RK ), ∆A0 = 0 (∆A0 ∈ R1×L ), ∆B 0 = 0 (∆B 0 ∈ RK ), ∆C 0 = 0 (∆C 0 ∈ RK ) and ∆D0 = 0 (∆D0 ∈ RK×L ) for all {is , ys , us , is+1 , is+2 } generated by the POMDP do µ s(y ) zs+1 = βzs + µuu(yss ) s µus(ys ) ˜ µus(ys ) Jβ (is+1 ) µus(ys ) µus(ys ) ˜ Jβ (is+1 ) ∆As+1 = ∆As + 1 s+1 ∆B s+1 = ∆B s + 1 s+1 µus(ys ) ˜ µus(ys ) Jβ (is+1 ) ∆C s+1 = ∆C s + 1 s+1 ˜ ˜ r(is+1 ) + β Jβ (is+2 ) − Jβ (is+1 ) zs+1 − ∆C s ∆Ds+1 = ∆Ds + 1 s+1 µus(ys ) µus(ys ) ∆s+1 = end for Ω∗ T ∆As+1 − − ∆B s ˜ Jβ (is+1 ) Ω∗ T ∆B s+1 ∆D s+1 − ∆As − ∆D s − ∆C s+1 ∆Ds+1 5 Experiments Experimental results comparing these GPOMDP variants for a simple three state MDP (described in [5]) are shown in Figure 1. The exact value function plots show how different choices of baseline and fixed value function compare when all algorithms have access to the exact value function Jβ . Using the expected value function as a baseline was an improvement over GPOMDP. Using the optimum, or optimum constant, baseline was a further improvement, each performing comparably to the other. Using the pre-trained fixed value function was also an improvement over GPOMDP, showing that selecting the true value function was indeed not the best choice in this case. 
The trained fixed value function was not optimal though, as Jβ (j) − A∗ is a valid choice of fixed value function. r The optimum baseline, and fixed value function, will not normally be known. The online plots show experiments where the baseline and fixed value function were trained using online gradient descent whilst the performance gradient was being estimated, using the same data. Clear improvement over GPOMDP is seen for the online trained baseline variant. For the online trained fixed value function, improvement is seen until T becomes—given the simplicity of the system—very large. References [1] L. Baird and A. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11, pages 968–974. MIT Press, 1999. [2] P. L. Bartlett and J. Baxter. Estimation and approximation bounds for gradient-based reinforcement learning. Journal of Computer and Systems Sciences, 2002. To appear. [3] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13:834–846, 1983. [4] J. Baxter and P. L. Bartlett. Infinite-horizon gradient-based policy search. Journal of Artificial Intelligence Research, 15:319–350, 2001. [5] J. Baxter, P. L. Bartlett, and L. Weaver. Infinite-horizon gradient-based policy search: II. Gradient ascent algorithms and experiments. Journal of Artificial Intelligence Research, 15:351–381, 2001. [6] M. Evans and T. Swartz. Approximating integrals via Monte Carlo and deterministic methods. Oxford University Press, 2000. Exact Value Function—Mean Error Exact Value Function—One Standard Deviation 0.4 0.4 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain Relative Norm Difference Relative Norm Difference 0.25 GPOMDP-Jβ BL- [Jβ ] BL-A∗ (y) r BL-A∗ r FVF-pretrain 0.35   0.3 0.2 0.15 0.1 0.05 ¡ 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 1000 T Online—Mean Error 100000 1e+06 1e+07 Online—One Standard Deviation 1 1 GPOMDP BL-online FVF-online 0.8 Relative Norm Difference Relative Norm Difference 10000 T 0.6 0.4 0.2 0 GPOMDP BL-online FVF-online 0.8 0.6 0.4 0.2 0 1 10 100 1000 10000 100000 1e+06 1e+07 1 10 100 T 1000 10000 100000 1e+06 1e+07 T Figure 1: Three state experiments—relative norm error ∆ est − η / η . Exact value function plots compare mean error and standard deviations for gradient estimates (with knowledge of Jβ ) computed by: GPOMDP [GPOMDP-Jβ ]; with baseline Ar = [Jβ ] [BL- [Jβ ]]; with optimum baseline [BL-A∗ (y)]; with optimum constant baseline [BL-A∗ ]; with pre-trained fixed value r r function [FVF-pretrain]. Online plots do a similar comparison of estimates computed by: GPOMDP [GPOMDP]; with online trained baseline [BL-online]; with online trained fixed value function [FVFonline]. The plots were computed over 500 runs (1000 for FVF-online), with β = 0.95. Ω∗ /T was set to 0.001 for FVF-pretrain, and 0.01 for FVF-online. ¢ ¢ [7] P. W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33:75–84, 1990. [8] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Technical report, ANU, 2002. [9] H. Kimura, K. Miyazaki, and S. Kobayashi. Reinforcement learning in POMDPs with function approximation. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML’97), pages 152–160, 1997. [10] V. R. 
Konda and J. N. Tsitsiklis. Actor-Critic Algorithms. In Advances in Neural Information Processing Systems 12, pages 1008–1014. MIT Press, 2000. [11] P. Marbach and J. N. Tsitsiklis. Simulation-Based Optimization of Markov Reward Processes. Technical report, MIT, 1998. [12] R. Y. Rubinstein. How to optimize complex stochastic systems from a single sample path by the score function method. Ann. Oper. Res., 27:175–211, 1991. [13] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge MA, 1998. ISBN 0-262-19398-1. [14] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063. MIT Press, 2000. [15] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.

5 0.52349246 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.

6 0.505422 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

7 0.50499952 56 nips-2001-Convolution Kernels for Natural Language

8 0.48044303 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

9 0.47942132 118 nips-2001-Matching Free Trees with Replicator Equations

10 0.47485906 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

11 0.47085774 121 nips-2001-Model-Free Least-Squares Policy Iteration

12 0.47058803 137 nips-2001-On the Convergence of Leveraging

13 0.46961066 13 nips-2001-A Natural Policy Gradient

14 0.46723461 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

15 0.4656316 139 nips-2001-Online Learning with Kernels

16 0.46538585 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

17 0.46464744 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

18 0.46179205 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

19 0.45962393 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

20 0.45952776 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms