nips nips2001 nips2001-43 knowledge-graph by maker-knowledge-mining

43 nips-2001-Bayesian time series classification


Source: pdf

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as belonging to one of several classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. Using one joint model is a Bayesian requirement, which has the advantage of fusing information according to its certainty. The classifier is implemented as a hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian time series classification Stephen Roberts Department of Engineering Science University of Oxford Oxford, OX1 3PJ, UK sjrob@robots. [sent-1, score-0.111]

2 uk Abstract This paper proposes an approach to classification of adjacent segments of a time series as belonging to one of several classes. [sent-7, score-0.235]

3 We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. [sent-8, score-0.392]

4 The novel part of our work is that we link these stages probabilistically by using a latent feature space. [sent-10, score-0.493]

5 The classifier is implemented as a hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. [sent-12, score-0.315]

6 Inference is done with Markov chain Monte Carlo (MCMC) techniques. [sent-14, score-0.074]

7 We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. [sent-15, score-0.159]

8 All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. [sent-16, score-0.4]

9 Hence we expect that this idea generalizes well to other hierarchical models. [sent-17, score-0.077]

10 1 Introduction Many applications in signal or image processing are hierarchical in the sense that a probabilistic model is built on top of variables that are the coefficients of some feature extraction technique. [sent-18, score-0.475]

11 In this paper we consider a particular problem of that kind, where a Gaussian and Multinomial observation hidden Markov model (GMOHMM) is used to discriminate coefficients of an Auto Regressive (AR) process as belonging to one of several classes. [sent-19, score-0.274]

12 However this is suboptimal, since it does not establish a probabilistic link between feature extraction and classification. [sent-23, score-0.293]

13 Two arguments suggest building one probabilistic model which combines feature extraction and classification: since there is a probabilistic link, the generative classifier acts as a prior for feature extraction. [sent-24, score-0.471]

14 The advantage of using this prior is that it naturally encodes our knowledge about features as obtained from training data and other sensors. [sent-25, score-0.179]

15 1 A Gaussian and Multinomial observation hidden Markov model As we attempt to classify adjacent segments of a time series, it is very likely that we find correlations between successive class labels. [sent-30, score-0.506]

16 Hence our model has a hidden Markov model ([RJ86]) like architecture, with diagonal Gaussian observation models for continuous variables and Multinomial observation models for discrete variables. [sent-31, score-0.524]

17 We call the architecture a Gaussian and Multinomial observation hidden Markov model or GMOHMM for short. [sent-32, score-0.274]
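
As a rough illustration of this architecture, the sketch below samples from a GMOHMM-like generative model: a first order Markov chain over hidden states, where each state emits a diagonal-Gaussian feature vector and a Multinomial class label. All dimensions and parameter values are placeholders invented for the example, not taken from the paper.

```python
import numpy as np

def sample_gmohmm(T, pi0, A, means, stds, class_probs, rng=None):
    """Sample T segments from a toy GMOHMM: hidden Markov states emit
    diagonal-Gaussian features and a Multinomial class label."""
    if rng is None:
        rng = np.random.default_rng(0)
    K = len(pi0)                                   # number of hidden states
    states, feats, labels = [], [], []
    s = rng.choice(K, p=pi0)                       # initial hidden state
    for t in range(T):
        feats.append(rng.normal(means[s], stds[s]))                   # diagonal Gaussian
        labels.append(rng.choice(len(class_probs[s]), p=class_probs[s]))
        states.append(s)
        s = rng.choice(K, p=A[s])                  # Markov transition
    return np.array(states), np.array(feats), np.array(labels)

# Toy parameters: 2 hidden states, 3-dimensional features, 2 classes.
pi0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
means = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 0.5]])
stds = np.ones((2, 3))
class_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
states, feats, labels = sample_gmohmm(10, pi0, A, means, stds, class_probs)
```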

18 Contrary to the classical approach, where each class is represented by its own trained HMM, our model has class labels which are child nodes of the hidden state variables. [sent-33, score-0.385]

19 We use here the convention found in [RG97], where circular nodes are latent and square nodes are observed variables. [sent-35, score-0.28]

20 1 Quantities of interest We regard all variables in the DAG that represent the probabilistic model of the time series as quantities of interest. [sent-38, score-0.44]

21 These are the hidden states, the variables of the latent feature space, the class labels, and the sufficient statistics of the AR process. [sent-39, score-0.647]

22 The DAG shows the observation model only for one state and for the corresponding segments of the time series. [sent-40, score-0.292]

23 We have latent feature variables which represent the coefficients of the preprocessing model for each segment at each sensor. [sent-41, score-0.674]

24 The state conditional distributions are modeled by diagonal Gaussians. [sent-42, score-0.101]

25 The latent model indicator represents the model order of the preprocessing model and hence the dimension of the latent feature variable. [sent-43, score-0.645]

26 The third observation represents the class label of each segment. [sent-45, score-0.106]

27 The observation model for the class label is again a Multinomial-one distribution. [sent-46, score-0.168]

28 Note that depending on whether we know the class label or not, it can be a latent variable or observed. [sent-47, score-0.386]

29 The child node of these is the observed variable, which represents sufficient statistics of the corresponding time series segment. [sent-48, score-0.111]

30 Finally we use a separate variable to represent the precision of the residual noise model. [sent-51, score-0.079]

31 2 Model coefficients Since we integrate over all unknown quantities, there is no conceptual difference between model coefficients and the variables described above. [sent-57, score-0.149]

32 Furthermore the model coefficients are only updated during model inference whereas all quantities of interest are updated during model inference and for prediction. [sent-60, score-0.389]

33 We have three different prior counts which define the Dirichlet priors of the corresponding probabilities. [sent-61, score-0.158]

34 This allows us to obtain the unconditional prior probability of states from a recurrence relation. [sent-64, score-0.112]
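
A minimal sketch of such a recurrence, assuming the usual marginal update of a Markov chain (the paper's exact expression is not given here): the unconditional state probabilities at one step are the previous marginal propagated through the transition matrix.

```python
import numpy as np

def unconditional_state_probs(p0, A, T):
    """Propagate the marginal state distribution through the chain,
    assuming the recurrence p_{t+1} = p_t A."""
    probs = [np.asarray(p0, dtype=float)]
    for _ in range(T - 1):
        probs.append(probs[-1] @ np.asarray(A, dtype=float))
    return np.vstack(probs)

p = unconditional_state_probs([1.0, 0.0], [[0.9, 0.1], [0.2, 0.8]], T=5)
```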

35 The prior probability of the first hidden state … [sent-65, score-0.218]

36 Another variable represents the class probabilities, which are conditional on the state as well. [sent-67, score-0.221]

37 The prior probabilities for observing the model indicator are represented by a further parameter. [sent-68, score-0.418]

38 The probability is again conditional on the state. [sent-69, score-0.101]

39 As was mentioned above, this indicator represents the model order of the time series model. [sent-70, score-0.178]

40 Hence another interpretation is that of state dependent prior probabilities for observing particular model orders. [sent-71, score-0.409]

41 The observation models for the latent features are dynamic mixtures of Gaussians, with one model for each sensor. [sent-72, score-0.269]

42 Another interpretation is that the discrete indicator variables, together with the mixture parameters, determine a Gaussian prior over the latent features. [sent-75, score-0.255]

43 Several further nodes define a hierarchical prior setting which is discussed below. [sent-76, score-0.189]

44 All other variables denote model coefficients: the transition probabilities; the probabilities for each class; the mean vectors and covariance matrices of the Gaussian observation model at each sensor; and the probabilities for observing each model order. [sent-81, score-0.979]

45 2 Likelihood and priors for the GMOHMM Suppose that we are provided with segments of training data. [sent-83, score-0.22]

46 The likelihood function of the GMOHMM parameters is then obtained by summation over all possible sequences of latent states. [sent-84, score-0.344]

47 Gibbs sampling requires that we obtain full conditional distributions (see footnote 1) we can sample from. [sent-89, score-0.139]

48 The state conditional class probabilities get a Dirichlet prior. [sent-95, score-0.160]
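
With a Dirichlet prior the Gibbs update for the state conditional class probabilities is conjugate: the full conditional is again Dirichlet, with the prior counts incremented by the observed label counts. The sketch below assumes a single symmetric prior count; variable names are illustrative, not the paper's notation.

```python
import numpy as np

def gibbs_update_class_probs(states, labels, n_states, n_classes,
                             prior_count=1.0, rng=None):
    """Sample state-conditional class probabilities from their Dirichlet
    full conditional: Dir(prior_count + count of each class in that state)."""
    if rng is None:
        rng = np.random.default_rng(0)
    states = np.asarray(states)
    labels = np.asarray(labels)
    probs = np.empty((n_states, n_classes))
    for s in range(n_states):
        counts = np.bincount(labels[states == s], minlength=n_classes)
        probs[s] = rng.dirichlet(prior_count + counts)
    return probs
```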

49 The transition probabilities and the probabilities for observing different model orders depend on the state. [sent-97, score-0.297]

50 The prior for the inverse covariance matrix is specified in terms of the range of the estimates at the corresponding sensor. [sent-104, score-0.152]

51 3 Sampling from the posterior During model inference we need to update all unobserved variables of the DAG, whereas for predictions we update only the variables summarized in section 2. [sent-107, score-0.294]

52 Most of the updates are done using the corresponding full conditional distributions, which have the same functional forms as the corresponding priors. [sent-110, score-0.092]
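
To make the overall structure of the sampler concrete, a schematic sweep is sketched below; every update call is a placeholder for the corresponding full conditional or Metropolis-Hastings step discussed in the text, not an implementation of it.

```python
def mcmc_sweep(model, data, rng, infer_coefficients=True):
    """One sweep of the sampler: quantities of interest are always updated,
    model coefficients only during model inference (not for prediction)."""
    model.update_hidden_states(data, rng)           # Markov-dependent state update
    model.update_latent_features(data, rng)         # Metropolis-Hastings step
    model.update_model_orders(data, rng)            # reversible jump step
    model.update_missing_class_labels(data, rng)    # only where labels are latent
    if infer_coefficients:                          # skipped at prediction time
        model.update_transition_probs(data, rng)
        model.update_class_probs(data, rng)
        model.update_gaussian_observation_params(data, rng)
        model.update_model_order_probs(data, rng)
```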

53 These full conditionals follow closely from what was published previously in [Syk00], with some modifications necessary (see e. [sent-111, score-0.096]

54 [Rob96]), because we need to consider the Markov dependency between successive hidden states. [sent-113, score-0.155]

55 As the derivations of the full conditionals do not differ much from previous work, we will omit them here and instead concentrate on an illustration of how to update the latent feature space. [sent-114, score-0.496]

56 1 A representation of the latent feature space The AR model in Equation (2) is a linear regression model. [sent-117, score-0.467]

57 We use separate symbols to denote the AR coefficients, the model order and a sample from the noise process, which we assume to be Gaussian with a given precision. [sent-118, score-0.34]

58 Hence AR coefficients are not a convenient representation of the latent feature space. (Footnote 1: These are the distributions obtained when we condition on all other variables of the DAG.) [sent-123, score-0.571]

59 A much more convenient representation is provided by using reflection coefficients (statistically speaking they are partial correlation coefficients), which relate to AR coefficients via the order recursive Levinson algorithm. [sent-125, score-0.098]
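
The order recursive (step-up) Levinson mapping from reflection coefficients to AR coefficients can be sketched as below; sign conventions differ between texts, so treat this as one common variant rather than the paper's exact definition.

```python
import numpy as np

def reflection_to_ar(k):
    """Step-up recursion: reflection coefficients k_1..k_m -> AR coefficients,
    for the convention x_t = sum_j a_j x_{t-j} + e_t (|k_i| < 1 for stability)."""
    a = np.array([], dtype=float)
    for k_i in k:
        order = len(a) + 1
        a_new = np.empty(order)
        a_new[:order - 1] = a - k_i * a[::-1]   # update lower-order coefficients
        a_new[order - 1] = k_i                  # new coefficient equals k_i
        a = a_new
    return a

a = reflection_to_ar([0.5, -0.3, 0.2])
```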

60 For such processes, the latent density is defined on a bounded domain, in contrast with the proposed DAG, where we use a finite Gaussian mixture as probabilistic model for the latent variable, which is defined on the entire real space. [sent-135, score-0.67]

61 In order to avoid this mismatch, we reparameterise the space of reflection coefficients by applying a suitable transformation, to obtain a more convenient representation of the latent features. [sent-136, score-0.395]
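
One natural choice for this reparameterisation is the inverse hyperbolic tangent (Fisher z) map, which takes each reflection coefficient from (-1, 1) to the whole real line; whether this is the exact transform used in the paper is not stated here, so it should be read as an assumption.

```python
import numpy as np

def to_latent(k):
    """Map reflection coefficients in (-1, 1) to an unconstrained space."""
    return np.arctanh(np.asarray(k, dtype=float))

def to_reflection(z):
    """Inverse map back to reflection coefficients in (-1, 1)."""
    return np.tanh(np.asarray(z, dtype=float))
```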

62 In order to obtain a likelihood ratio of 1, we propose from the multivariate Student-t distribution shown below, reparameterise in terms of reflection coefficients and apply the transformation. [sent-145, score-0.131]

63 The proposal in Equation (5) gives a likelihood ratio of 1. [sent-148, score-0.125]
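
Because the proposal is constructed so that the (Student-t) likelihood ratio cancels, a Metropolis-Hastings update for a latent feature reduces, in this sketch, to the ratio of the GMOHMM observation densities at the new and old latent values. This is an illustrative independence-sampler skeleton, not the paper's derivation; `sample_student_t_proposal` and `log_observation_density` are assumed helper functions.

```python
import numpy as np

def mh_update_latent_feature(x_old, log_observation_density,
                             sample_student_t_proposal, rng):
    """Independence MH step for one latent feature vector. With a proposal
    matched to the integrated likelihood, the likelihood ratio is 1 and
    only the observation-model (prior) ratio remains."""
    x_new = sample_student_t_proposal(rng)           # propose in latent space
    log_alpha = log_observation_density(x_new) - log_observation_density(x_old)
    if np.log(rng.uniform()) < log_alpha:
        return x_new, True                           # accepted
    return x_old, False                              # rejected
```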

64 3 Updating model orders Updating model orders requires us to sample across different dimensional parameter spaces. [sent-156, score-0.491]

65 One way of doing this is by using the reversible jump MCMC which was recently proposed in [Gre95]. [sent-157, score-0.19]

66 We implement the reversible jump move from one parameter space to another as a partial proposal. [sent-158, score-0.19]
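
A generic reversible jump skeleton for the model order move is sketched below. The proposal densities and posterior terms are placeholders supplied by the caller rather than the paper's partial-proposal construction (Equations 7-8); the probability of proposing each order move is assumed symmetric so it cancels, and the Jacobian is 1 because the new parameters are drawn directly from a proposal density.

```python
import numpy as np

def rj_update_model_order(m_old, theta_old, rng, propose_order, propose_params,
                          log_posterior, log_proposal_density):
    """Generic reversible jump move between parameter spaces of different
    dimension; all callables are assumed helpers provided by the model."""
    m_new = propose_order(m_old, rng)                # e.g. m_old +/- 1
    theta_new = propose_params(m_new, rng)           # draw parameters for new order
    log_alpha = (log_posterior(m_new, theta_new) - log_posterior(m_old, theta_old)
                 + log_proposal_density(m_old, theta_old)
                 - log_proposal_density(m_new, theta_new))
    if np.log(rng.uniform()) < log_alpha:
        return m_new, theta_new
    return m_old, theta_old
```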

67 Integrating out the precision of the noise model suggests the following proposal (Equation 7), where we obtain again a Student-t distributed likelihood. [sent-162, score-0.146]

68 We denote the number of observations and the estimated auto covariance at each time lag, which we use to obtain a sample covariance matrix of the appropriate dimension. [sent-168, score-0.326]
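
A sketch of these sufficient statistics, under the assumption that the sample covariance matrix is the usual Toeplitz matrix built from the estimated autocovariances:

```python
import numpy as np
from scipy.linalg import toeplitz

def autocovariance(x, max_lag):
    """Biased autocovariance estimates c(0), ..., c(max_lag) of a 1-d signal."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])

def sample_covariance_matrix(x, m):
    """m-dimensional Toeplitz matrix of autocovariances, C[i, j] = c(|i - j|)."""
    return toeplitz(autocovariance(x, m - 1))
```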

69 Assuming that the probability of proposing this move is independent of the current model order, the proposal has the acceptance probability given in Equation (8); an update in the opposite direction uses the reciprocal of the operation in Equation (8). [sent-169, score-0.108]

70 We generate a first order Markov sequence as target labels (2 state values) with 200 samples used for training and 600 used for testing. [sent-173, score-0.094]

71 Each sample is used as the label of a segment with 200 samples from an auto regressive process. [sent-174, score-0.293]

72 In order to make the problem more realistic, we use a second state sequence to replace some of the segments with white noise. [sent-179, score-0.176]
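
The synthetic setup can be sketched as below. The segment length (200 samples), training size (200 segments) and test size (600 segments) follow the description above, while the AR coefficients for the two classes and the white-noise replacement probability are invented for illustration (they are not specified here).

```python
import numpy as np

def markov_labels(n, p_stay=0.9, rng=None):
    """First order Markov sequence of binary target labels."""
    if rng is None:
        rng = np.random.default_rng(1)
    labels = [int(rng.integers(2))]
    for _ in range(n - 1):
        labels.append(labels[-1] if rng.uniform() < p_stay else 1 - labels[-1])
    return np.array(labels)

def ar_segment(coeffs, n=200, noise_std=1.0, rng=None):
    """Generate n samples from an AR process x_t = sum_j a_j x_{t-j} + e_t."""
    if rng is None:
        rng = np.random.default_rng(2)
    p = len(coeffs)
    x = np.zeros(n + p)
    for t in range(p, n + p):
        x[t] = np.dot(coeffs, x[t - p:t][::-1]) + rng.normal(0.0, noise_std)
    return x[p:]

def make_dataset(n_segments, p_noise=0.2, rng=None):
    """Labelled AR segments; a second state sequence replaces some segments
    with white noise (replacement probability is illustrative)."""
    if rng is None:
        rng = np.random.default_rng(3)
    ar_by_class = {0: [0.6, -0.3], 1: [-0.5, 0.2]}   # invented AR coefficients
    labels = markov_labels(n_segments, rng=rng)
    segments = [rng.normal(size=200) if rng.uniform() < p_noise
                else ar_segment(ar_by_class[lab], rng=rng) for lab in labels]
    return np.array(segments), labels

train_X, train_y = make_dataset(200)
test_X, test_y = make_dataset(600)
```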

73 When compared with conditioning on feature estimates, the latent features show increased likelihood. [sent-183, score-0.538]

74 The likelihood gets even larger when we regard both the feature values and the model orders of the preprocessing stage as random variables. [sent-184, score-0.617]

75 As can be seen in figure 2, this effect is also evident when we look at the generalization probabilities which become larger as well. [sent-185, score-0.113]

76 We explain this by sharper “priors” over feature values and model orders, which are due to the information provided by temporal context 2 of every segment. [sent-186, score-0.237]

77 This reduces the variance of the observation models which in turn increases likelihoods and target probabilities. [sent-187, score-0.16]

78 Table 1 shows that these higher probabilities correspond to a significant improvement in generalization accuracy. [sent-188, score-0.113]

79 [Figure 2 panels: probabilities from integrating over features; probabilities from integrating over model orders and features.] [sent-190, score-0.49]

80 Figure 2: This figure shows the generalization probabilities obtained with different settings. [sent-192, score-0.113]

81 We see that the class probabilities get larger when we regard features as random variables. [sent-193, score-0.314]

82 This effect is even stronger when both the features and the model orders are random variables. [sent-194, score-0.289]

83 2 Classification of cognitive tasks The data used in these experiments are EEG recordings from 5 young, healthy and untrained subjects while they performed different cognitive tasks. [sent-196, score-0.135]

84 We classify 2 task pairings: auditory-navigation and left motor-right motor imagination. [sent-197, score-0.106]

85 The recordings were taken from 3 electrode sites: T4, P4 (right tempero-parietal for spatial and auditory tasks), C3’ , C3” (left motor area for right motor imagination) and C4’ , C4” (right motor area for left motor imagination). [sent-198, score-0.512]

86 The data were recorded using an ISO-DAM system with a fourth order band pass filter. [sent-200, score-0.127]

87 Classification again uses the same settings as with the synthetic problem. [sent-203, score-0.069]

88 We observe again significantly improved results when we regard features and model orders as latent variables. [sent-205, score-0.644]

89 The values in brackets are the significance levels for comparing integration of features with conditioning and full integration with integration over feature values only. [sent-206, score-0.466]

90 4 Discussion We propose in this paper a novel approach to hierarchical time series processing which makes use of a latent feature representation. [sent-207, score-0.588]

91 This understanding of features and model orders as random variables is a direct consequence of applying Bayesian theory. [sent-208, score-0.371]

92 Empirical results are summarized in Table 1. (Footnote 2: In a multi sensor setting there is spatial context as well.) [sent-209, score-0.101]

93 Table 1: Generalization accuracies of different experiments: conditioning, marginalizing over features, and full integration. [sent-210, score-0.236]

94 The only disadvantage of having a latent feature space is that all computations get more involved, since there are additional variables that have to be integrated over. [sent-217, score-0.482]

95 However this additional complexity does not render the method intractable since the algorithm remains polynomial in the number of segments to be classified. [sent-218, score-0.124]

96 Finally we want to point out that the improvements observed in our results can only be attributed to the idea of using a latent feature space. [sent-219, score-0.4]

97 This idea is certainly not limited to time series classification and should generalize well to other hierarchical architectures. [sent-220, score-0.188]

98 Stokes, who provided us with the EEG recordings that were used in the experiments section. [sent-224, score-0.091]

99 Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. [sent-244, score-0.225]

100 On input selection with reversible jump Markov chain Monte Carlo sampling. [sent-306, score-0.264]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('latent', 0.28), ('coef', 0.279), ('ar', 0.273), ('cients', 0.235), ('gmohmm', 0.167), ('orders', 0.155), ('dag', 0.145), ('ection', 0.145), ('classi', 0.126), ('segments', 0.124), ('feature', 0.12), ('probabilities', 0.113), ('prior', 0.112), ('series', 0.111), ('markov', 0.107), ('motor', 0.106), ('reversible', 0.106), ('hidden', 0.106), ('sensor', 0.101), ('observation', 0.101), ('ba', 0.097), ('multinomial', 0.089), ('bayesian', 0.088), ('extraction', 0.086), ('jump', 0.084), ('variables', 0.082), ('auto', 0.079), ('precision', 0.079), ('hierarchical', 0.077), ('carlo', 0.076), ('regard', 0.075), ('chain', 0.074), ('dirichlet', 0.073), ('conditioning', 0.071), ('eeg', 0.07), ('synthetic', 0.069), ('features', 0.067), ('model', 0.067), ('integrating', 0.067), ('oxford', 0.067), ('regressive', 0.067), ('reparameterise', 0.067), ('sykacek', 0.067), ('richardson', 0.066), ('observing', 0.065), ('likelihood', 0.064), ('qh', 0.063), ('inference', 0.063), ('cation', 0.062), ('quantities', 0.062), ('yx', 0.061), ('monte', 0.061), ('proposal', 0.061), ('indicator', 0.061), ('class', 0.059), ('cient', 0.059), ('likelihoods', 0.059), ('gilks', 0.058), ('imagination', 0.058), ('integration', 0.055), ('gibbs', 0.055), ('segment', 0.053), ('preprocessing', 0.053), ('spiegelhalter', 0.053), ('conditionals', 0.053), ('wp', 0.053), ('state', 0.052), ('hz', 0.052), ('covariance', 0.051), ('hence', 0.05), ('provided', 0.05), ('gh', 0.049), ('denote', 0.049), ('conditional', 0.049), ('stages', 0.049), ('successive', 0.049), ('convenient', 0.048), ('suf', 0.047), ('gaussian', 0.047), ('sample', 0.047), ('label', 0.047), ('electrode', 0.047), ('acceptance', 0.047), ('mcmc', 0.047), ('priors', 0.046), ('recorded', 0.045), ('cognitive', 0.045), ('re', 0.045), ('link', 0.044), ('full', 0.043), ('probabilistic', 0.043), ('labels', 0.042), ('stage', 0.042), ('distributions', 0.041), ('gets', 0.041), ('recordings', 0.041), ('pu', 0.041), ('band', 0.041), ('acyclic', 0.041), ('gamma', 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as belonging to one of several classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. Using one joint model is a Bayesian requirement, which has the advantage of fusing information according to its certainty. The classifier is implemented as a hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

2 0.1571243 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller

Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high noise cases (inter-trial variablity).

3 0.15708108 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

4 0.14037369 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

5 0.13543627 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

6 0.12958777 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

7 0.12700707 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

8 0.1230462 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

9 0.10976133 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

10 0.10859199 105 nips-2001-Kernel Machines and Boolean Functions

11 0.10518356 115 nips-2001-Linear-time inference in Hierarchical HMMs

12 0.10488077 139 nips-2001-Online Learning with Kernels

13 0.10392684 164 nips-2001-Sampling Techniques for Kernel Methods

14 0.10189127 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

15 0.099319018 190 nips-2001-Thin Junction Trees

16 0.095827088 46 nips-2001-Categorization by Learning and Combining Object Parts

17 0.095446281 107 nips-2001-Latent Dirichlet Allocation

18 0.09317106 84 nips-2001-Global Coordination of Local Linear Models

19 0.092618957 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

20 0.091461249 144 nips-2001-Partially labeled classification with Markov random walks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.31), (1, 0.02), (2, -0.023), (3, 0.012), (4, -0.209), (5, -0.058), (6, 0.051), (7, 0.072), (8, 0.034), (9, -0.029), (10, 0.005), (11, -0.015), (12, -0.106), (13, -0.142), (14, -0.124), (15, -0.123), (16, -0.019), (17, 0.064), (18, 0.168), (19, -0.057), (20, 0.019), (21, 0.024), (22, 0.021), (23, 0.012), (24, -0.054), (25, -0.093), (26, -0.004), (27, 0.055), (28, 0.021), (29, 0.031), (30, -0.016), (31, 0.102), (32, -0.027), (33, 0.086), (34, -0.051), (35, 0.022), (36, -0.066), (37, 0.014), (38, 0.146), (39, -0.038), (40, 0.011), (41, -0.102), (42, 0.039), (43, -0.09), (44, -0.062), (45, 0.055), (46, 0.008), (47, 0.045), (48, 0.035), (49, -0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96797085 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as belonging to one of several classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. Using one joint model is a Bayesian requirement, which has the advantage of fusing information according to its certainty. The classifier is implemented as a hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

2 0.62863362 183 nips-2001-The Infinite Hidden Markov Model

Author: Matthew J. Beal, Zoubin Ghahramani, Carl E. Rasmussen

Abstract: We show that it is possible to extend hidden Markov models to have a countably infinite number of hidden states. By using the theory of Dirichlet processes we can implicitly integrate out the infinitely many transition parameters, leaving only three hyperparameters which can be learned from data. These three hyperparameters define a hierarchical Dirichlet process capable of capturing a rich set of transition dynamics. The three hyperparameters control the time scale of the dynamics, the sparsity of the underlying state-transition matrix, and the expected number of distinct hidden states in a finite sequence. In this framework it is also natural to allow the alphabet of emitted symbols to be infinite— consider, for example, symbols being possible words appearing in English text.

3 0.60724831 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

Author: Carl E. Rasmussen, Zoubin Ghahramani

Abstract: We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets – thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

4 0.5672943 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

Author: Aaron C. Courville, David S. Touretzky

Abstract: The Temporal Coding Hypothesis of Miller and colleagues [7] suggests that animals integrate related temporal patterns of stimuli into single memory representations. We formalize this concept using quasi-Bayes estimation to update the parameters of a constrained hidden Markov model. This approach allows us to account for some surprising temporal effects in the second order conditioning experiments of Miller et al. [1 , 2, 3], which other models are unable to explain. 1

5 0.56188107 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

Author: Lawrence K. Saul, Daniel D. Lee

Abstract: We investigate a learning algorithm for the classification of nonnegative data by mixture models. Multiplicative update rules are derived that directly optimize the performance of these models as classifiers. The update rules have a simple closed form and an intuitive appeal. Our algorithm retains the main virtues of the Expectation-Maximization (EM) algorithm—its guarantee of monotonic improvement, and its absence of tuning parameters—with the added advantage of optimizing a discriminative objective function. The algorithm reduces as a special case to the method of generalized iterative scaling for log-linear models. The learning rate of the algorithm is controlled by the sparseness of the training data. We use the method of nonnegative matrix factorization (NMF) to discover sparse distributed representations of the data. This form of feature selection greatly accelerates learning and makes the algorithm practical on large problems. Experiments show that discriminatively trained mixture models lead to much better classification than comparably sized models trained by EM. 1

6 0.5583986 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

7 0.53924584 61 nips-2001-Distribution of Mutual Information

8 0.52447551 115 nips-2001-Linear-time inference in Hierarchical HMMs

9 0.52026898 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

10 0.50619811 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior

11 0.5041194 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity

12 0.50160009 148 nips-2001-Predictive Representations of State

13 0.49496537 68 nips-2001-Entropy and Inference, Revisited

14 0.48911139 3 nips-2001-ACh, Uncertainty, and Cortical Inference

15 0.48909888 154 nips-2001-Products of Gaussians

16 0.48431465 153 nips-2001-Product Analysis: Learning to Model Observations as Products of Hidden Variables

17 0.48296592 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

18 0.47716406 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation

19 0.47570598 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

20 0.47227511 107 nips-2001-Latent Dirichlet Allocation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.035), (17, 0.016), (19, 0.019), (27, 0.094), (30, 0.06), (38, 0.011), (59, 0.042), (72, 0.077), (79, 0.052), (83, 0.376), (91, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92362028 187 nips-2001-The Steering Approach for Multi-Criteria Reinforcement Learning

Author: Shie Mannor, Nahum Shimkin

Abstract: We consider the problem of learning to attain multiple goals in a dynamic environment, which is initially unknown. In addition, the environment may contain arbitrarily varying elements related to actions of other agents or to non-stationary moves of Nature. This problem is modelled as a stochastic (Markov) game between the learning agent and an arbitrary player, with a vector-valued reward function. The objective of the learning agent is to have its long-term average reward vector belong to a given target set. We devise an algorithm for achieving this task, which is based on the theory of approachability for stochastic games. This algorithm combines, in an appropriate way, a finite set of standard, scalar-reward learning algorithms. Sufficient conditions are given for the convergence of the learning algorithm to a general target set. The specialization of these results to the single-controller Markov decision problem are discussed as well. 1

2 0.88909286 105 nips-2001-Kernel Machines and Boolean Functions

Author: Adam Kowalczyk, Alex J. Smola, Robert C. Williamson

Abstract: We give results about the learnability and required complexity of logical formulae to solve classification problems. These results are obtained by linking propositional logic with kernel machines. In particular we show that decision trees and disjunctive normal forms (DNF) can be represented by the help of a special kernel, linking regularized risk to separation margin. Subsequently we derive a number of lower bounds on the required complexity of logic formulae using properties of algorithms for generation of linear estimators, such as perceptron and maximal perceptron learning.

same-paper 3 0.87925541 43 nips-2001-Bayesian time series classification

Author: Peter Sykacek, Stephen J. Roberts

Abstract: This paper proposes an approach to classification of adjacent segments of a time series as belonging to one of several classes. We use a hierarchical model that consists of a feature extraction stage and a generative classifier which is built on top of these features. Such two stage approaches are often used in signal and image processing. The novel part of our work is that we link these stages probabilistically by using a latent feature space. Using one joint model is a Bayesian requirement, which has the advantage of fusing information according to its certainty. The classifier is implemented as a hidden Markov model with Gaussian and Multinomial observation distributions defined on a suitably chosen representation of autoregressive models. The Markov dependency is motivated by the assumption that successive classifications will be correlated. Inference is done with Markov chain Monte Carlo (MCMC) techniques. We apply the proposed approach to synthetic data and to classification of EEG that was recorded while the subjects performed different cognitive tasks. All experiments show that using a latent feature space results in a significant improvement in generalization accuracy. Hence we expect that this idea generalizes well to other hierarchical models.

4 0.62036419 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Author: Evan Greensmith, Peter L. Bartlett, Jonathan Baxter

Abstract: We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem.

5 0.61223215 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning

Author: Yu-Han Chang, Leslie Pack Kaelbling

Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the longrun against fair opponents.

6 0.59596223 32 nips-2001-An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

7 0.57754725 56 nips-2001-Convolution Kernels for Natural Language

8 0.57671726 51 nips-2001-Cobot: A Social Reinforcement Learning Agent

9 0.57344997 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

10 0.56900573 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

11 0.56439936 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

12 0.55860901 95 nips-2001-Infinite Mixtures of Gaussian Process Experts

13 0.55448145 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

14 0.55415374 121 nips-2001-Model-Free Least-Squares Policy Iteration

15 0.5527792 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

16 0.55266917 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

17 0.5484978 183 nips-2001-The Infinite Hidden Markov Model

18 0.54660988 118 nips-2001-Matching Free Trees with Replicator Equations

19 0.54443455 169 nips-2001-Small-World Phenomena and the Dynamics of Information

20 0.5428654 13 nips-2001-A Natural Policy Gradient