
79 nips-2007-Efficient multiple hyperparameter learning for log-linear models


Source: pdf

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Efficient multiple hyperparameter learning for log-linear models Chuong B. [sent-1, score-0.528]

2 Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. [sent-5, score-0.688]

3 While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. [sent-6, score-0.737]

4 In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). [sent-7, score-0.676]

5 Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. [sent-8, score-0.21]

6 In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. [sent-9, score-0.905]

7 1 Introduction In many supervised learning methods, overfitting is controlled through the use of regularization penalties for limiting model complexity. [sent-10, score-0.174]

8 The effectiveness of penalty-based regularization for a given learning task depends not only on the type of regularization penalty used (e.g. [sent-11, score-0.348]

9 , L1 vs. L2 ) [29] but also (and perhaps even more importantly) on the choice of hyperparameters governing the regularization penalty (e.g. [sent-13, score-0.57]

10 , the hyperparameter λ in an isotropic Gaussian parameter prior, λ||w||2 ). [sent-15, score-0.462]

11 When only a single hyperparameter must be tuned, cross-validation provides a simple yet reliable procedure for hyperparameter selection. [sent-16, score-0.953]

12 For example, the regularization hyperparameter C in a support vector machine (SVM) is usually tuned by training the SVM with several different values of C, and selecting the one that achieves the best performance on a holdout set. [sent-17, score-0.858]

13 Unfortunately, for sophisticated models with multiple hyperparameters [23], the naïve grid search strategy of directly trying out possible combinations of hyperparameter settings quickly grows infeasible as the number of hyperparameters becomes large. [sent-19, score-1.384]
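To make the scaling concrete, here is a minimal sketch (not from the paper) of naïve grid search over k hyperparameters: with g candidate values per hyperparameter it must train g^k models, which is exactly what becomes infeasible as k grows.

```python
import itertools

def naive_grid_search(train_and_score, candidates, k):
    """Try every combination of k hyperparameters drawn from `candidates`.

    `train_and_score` is a hypothetical stand-in that trains the model on
    the training set and returns holdout logloss; this loop calls it
    len(candidates) ** k times.
    """
    best_loss, best_d = float("inf"), None
    for d in itertools.product(candidates, repeat=k):
        loss = train_and_score(d)
        if loss < best_loss:
            best_loss, best_d = loss, d
    return best_d, best_loss

# e.g., 7 candidate values and k = 10 groups -> 7**10, roughly 2.8e8 training runs
```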

14 Scalable strategies for cross-validation–based hyperparameter learning that rely on computing the gradient of cross-validation loss with respect to the desired hyperparameters arose first in the neural network modeling community [20, 21, 1, 12]. [sent-20, score-1.028]

15 Here, we consider the problem of hyperparameter learning for a specialized class of structured classification models known as conditional log-linear models (CLLMs), a generalization of conditional random fields (CRFs) [19]. [sent-22, score-0.633]

16 Designing hyperparameter learning algorithms for structured classification models thus yields a number of unique computational challenges not normally encountered in the flat classification setting. [sent-24, score-0.533]

17 In this paper, we derive a gradient-based approach for optimizing the hyperparameters of a CLLM using the loss incurred on a holdout set. [sent-25, score-0.59]

18 Finally, we demonstrate on both simulations and a real-world computational biology task that our hyperparameter learning method can give gains over learning flat unstructured regularization priors. [sent-27, score-0.636]

19 2 Preliminaries Conditional log-linear models (CLLMs) are a probabilistic framework for sequence labeling or parsing problems, where X is an exponentially large space of possible input sequences and Y is an exponentially large space of candidate label sequences or parse trees. [sent-28, score-0.183]

20 While a number of efficient procedures exist for solving the optimization problem OPT1 [34, 11], little attention is usually given to choosing an appropriate regularization matrix C. [sent-36, score-0.319]

21 Generally, C is parameterized using a small number of free variables, d ∈ Rk, known as the hyperparameters of the model. [sent-37, score-0.396]

22 Given a holdout set H = {(x̃(i), ỹ(i))}_{i=1}^{m̃} of i.i.d. [sent-38, score-0.194]

23 examples drawn from D, hyperparameter learning itself can be cast as an optimization problem: minimize_{d ∈ Rk} −Σ_{i=1}^{m̃} log P(ỹ(i) | x̃(i); w⋆(C)). (OPT2) [sent-41, score-0.502]

24 In words, OPT2 finds the hyperparameters d whose regularization matrix C leads the parameter vector w⋆(C) learned from the training set to obtain small logloss on holdout data. [sent-42, score-1.074]

25 While this parameterization may be partially motivated by concerns of hyperparameter overfitting [28], such a choice usually stems from the difficulty of hyperparameter inference. [sent-44, score-0.924]

26 In practice, grid-search procedures provide a reliable method for determining hyperparameters to low precision: one trains the model using several candidate values of C (e.g. [sent-45, score-0.424]

27 ), and chooses the C that minimizes holdout logloss. [sent-53, score-0.194]

28 While this strategy is suitable for tuning a single model hyperparameter, more sophisticated strategies are necessary when optimizing multiple hyperparameters. [sent-54, score-0.163]

29 3 Learning multiple hyperparameters In this section, we lay the framework for multiple hyperparameter learning by describing a simple yet flexible parameterization of C that arises quite naturally in many practical problems. [sent-55, score-0.93]

30 We then describe a generic strategy for hyperparameter adaptation via gradient-based optimization. [sent-56, score-0.543]

31 Consider a setting in which predefined subsets of parameter components (which we call regularization groups) are constrained to use the same hyperparameters [6]. [sent-57, score-0.57]

32 For instance, in an NLP task, individual word occurrence features may be placed in a separate regularization group from word bigram features. [sent-58, score-0.362]

33 Formally, let k be a fixed number of regularization groups, and let π : {1, ..., n} → {1, ... [sent-59, score-0.174]

34 , k} be a prespecified mapping from parameters to regularization groups. [sent-65, score-0.174]

35 In the sequel, we parameterize C ∈ Rn×n in terms of some hyperparameter vector d ∈ Rk as the diagonal matrix whose ith diagonal entry is exp(dπ(i)), written compactly as C(d) = diag(exp(d)). [sent-70, score-0.462]
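As a concrete illustration, here is a minimal NumPy sketch of this parameterization; the group map `pi` and its example values are hypothetical, not from the paper.

```python
import numpy as np

def regularization_matrix(d, pi):
    """Build C(d) with C_ii = exp(d_pi(i)): one shared hyperparameter per
    regularization group, expanded to all n parameters via the map pi.
    The exp() keeps every diagonal entry strictly positive, so d can be
    optimized by unconstrained gradient methods."""
    return np.diag(np.exp(d[pi]))

# Hypothetical example: 5 parameters in k = 2 groups
# (say, word features in group 0, word-bigram features in group 1).
pi = np.array([0, 0, 0, 1, 1])    # parameter index -> group index
d = np.array([0.0, -1.0])         # one hyperparameter per group
C = regularization_matrix(d, pi)  # diag(1, 1, 1, e^-1, e^-1)
```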

36 Specifically, let ℓT(w) = −Σ_{i=1}^{m} log P(y(i) | x(i); w) denote the training logloss and ℓH(w) = −Σ_{i=1}^{m̃} log P(ỹ(i) | x̃(i); w) the holdout logloss for a parameter vector w. [sent-72, score-0.696]

37 minimize_{d ∈ Rk} ℓH(w⋆), where w⋆ = argmin_{w ∈ Rn} [ℓT(w) + ½ wᵀC(d)w]. (OPT2') For any fixed setting of these hyperparameters, the objective function of OPT2' can be evaluated by (1) using the hyperparameters d to determine the regularization matrix C, (2) solving OPT1 using C to determine w⋆, and (3) computing the holdout logloss using the parameters w⋆. [sent-74, score-1.078]
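A minimal sketch of this three-step evaluation; `train_logloss`, `train_grad`, and `holdout_logloss` are hypothetical stand-ins for the CLLM inference routines, and the inner solve uses L-BFGS as one reasonable choice.

```python
import numpy as np
from scipy.optimize import minimize

def opt2_objective(d, pi, train_logloss, train_grad, holdout_logloss, n):
    """Evaluate l_H(w*(C(d))) for one fixed hyperparameter setting d."""
    c = np.exp(d[pi])                                      # (1) diagonal of C(d)
    obj = lambda w: train_logloss(w) + 0.5 * w @ (c * w)   # OPT1 objective
    jac = lambda w: train_grad(w) + c * w                  # its gradient
    res = minimize(obj, np.zeros(n), jac=jac, method="L-BFGS-B")  # (2) solve OPT1
    w_star = res.x
    return holdout_logloss(w_star), w_star                 # (3) holdout logloss
```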

38 Given both procedures for function and gradient evaluation, we may apply standard gradient-based optimization (e.g. [sent-76, score-0.175]

39 , conjugate gradient or L-BFGS [30]) in order to find a local optimum of the objective. [sent-78, score-0.147]

40 In general, we observe that only a few iterations (∼ 5) are usually sufficient to determine reasonable hyperparameters to low accuracy. [sent-79, score-0.396]

41 4 The hyperparameter gradient Note that the optimization objective ℓH (w⋆ ) is a function of w⋆ . [sent-80, score-0.609]

42 In turn, w⋆ is a function of the hyperparameters d, as implicitly defined by the gradient stationarity condition, Cw⋆ + ∇w ℓT (w⋆ ) = 0. [sent-81, score-0.503]
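Because the whole derivation rests on this stationarity condition holding at the fitted w⋆, a cheap numerical check is worthwhile before differentiating through the solution; the sketch below reuses the hypothetical `c`, `w_star`, and `train_grad` names from the snippet above.

```python
import numpy as np

def check_stationarity(c, w_star, train_grad, tol=1e-5):
    """Verify Cw* + grad l_T(w*) is (nearly) zero at the OPT1 solution;
    an under-converged inner solve makes the implicit hyperparameter
    gradient unreliable."""
    residual = c * w_star + train_grad(w_star)
    return float(np.max(np.abs(residual))) < tol
```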

43 To compute the hyperparameter gradient, we will use both of these facts. [sent-82, score-0.462]

44 4.1 Deriving the hyperparameter gradient. First, we apply the chain rule to the objective function of OPT2' to obtain ∇d ℓH(w⋆) = Jdᵀ ∇w ℓH(w⋆), (1) where Jd is the n × k Jacobian matrix whose (i, j)th entry is ∂wi⋆/∂dj. [sent-84, score-0.614]

45 The term ∇w ℓH (w⋆ ) is simply the gradient of the holdout logloss evaluated at w⋆ . [sent-85, score-0.538]

46 Specifically, we can differentiate both sides of (2) with respect to dj to obtain 0 = Σ_{p=1}^{n} [(∂Cip/∂dj) wp⋆ + Cip (∂wp⋆/∂dj)] + Σ_{p=1}^{n} (∂/∂wp)(∂ℓT(w⋆)/∂wi)(∂wp⋆/∂dj) (3) = I{π(i)=j} wi⋆ exp(dj) + Σ_{p=1}^{n} [Cip + ∂²ℓT(w⋆)/∂wp∂wi] (∂wp⋆/∂dj). (4) [sent-93, score-0.223]

47 , k}, we obtain the equivalent matrix equation, 0 = B + (C + ∇²w ℓT(w⋆)) Jd, (5) where B is the n × k matrix whose (i, j)th element is I{π(i)=j} wi⋆ exp(dj), and ∇²w ℓT(w⋆) is the Hessian of the training logloss evaluated at w⋆; solving (5) yields Jd = −(C + ∇²w ℓT(w⋆))⁻¹ B, which is equation (6). [sent-100, score-0.422]

48 4.2 Computing the hyperparameter gradient efficiently. In principle, one could simply use (6) to obtain the Jacobian matrix Jd directly. [sent-103, score-0.614]

49 Computing the Hessian matrix ∇²w ℓT(w⋆) in a typical CLLM requires approximately n times the cost of a single logloss gradient evaluation. [sent-105, score-0.418]

50 Algorithm 1: Gradient computation for hyperparameter selection. [sent-110, score-0.491]

51 Input: training set T = {(x(i), y(i))}_{i=1}^{m}; holdout set H = {(x̃(i), ỹ(i))}_{i=1}^{m̃}; current hyperparameters d ∈ Rk. Output: hyperparameter gradient ∇d ℓH(w⋆). [sent-111, score-1.187]

52 1. Compute solution w⋆ to OPT1 using regularization matrix C = diag(exp(d)). [sent-112, score-0.219]

53 Use the conjugate gradient algorithm to solve the linear system (C + ∇²w ℓT(w⋆)) x = ∇w ℓH(w⋆). [sent-116, score-0.147]

54 Figure 1: Pseudocode for gradient computation. To deal with these problems, we first explain why (C + ∇²w ℓT(w⋆)) v can be computed in O(n) time for any arbitrary vector v ∈ Rn, even though forming (C + ∇²w ℓT(w⋆))⁻¹ is expensive. [sent-119, score-0.136]

55 Using this result, we then describe an efficient procedure for computing the holdout hyperparameter gradient which avoids the expensive Hessian computation and inversion steps of the direct method. [sent-120, score-0.792]

56 By applying standard rules for differential operators, Rv{∇w ℓT(w⋆)} can be computed recursively using a modified version of the original gradient computation routine; see [31] for details. [sent-134, score-0.136]
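Where an exact R-operator implementation is not at hand, a central finite difference of the gradient gives a Hessian-vector product at the same O(n) cost (two gradient evaluations); this sketch is a stand-in for the exact recursion of [31], not the paper's implementation.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-6):
    """Approximate (Hessian of l_T at w) @ v by central differences.

    Costs two gradient evaluations -- O(n) -- versus the O(n^2) cost of
    forming the Hessian explicitly."""
    step = eps / max(np.linalg.norm(v), 1e-12)  # scale the step to ||v||
    return (grad_fn(w + step * v) - grad_fn(w - step * v)) / (2.0 * step)
```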

57 Hessian-vector products for graphical models were previously used in the context of step-size adaptation for stochastic gradient descent [36]. [sent-135, score-0.219]

58 Given the above procedure for computing matrix-vector products, we can now use the conjugate gradient (CG) method to solve the matrix equation (5) to obtain Jd . [sent-137, score-0.192]

59 Unlike the direct method of forming the (C + ∇²w ℓT(w⋆)) matrix and its inverse, solving the linear systems avoids the expensive Ω(n²) cost of Hessian computation and matrix inversion. [sent-141, score-0.151]

60 Nevertheless, even this approach for computing the Jacobian matrices still requires the solution of multiple linear systems, which scales poorly when the number of hyperparameters k is large. [sent-142, score-0.432]
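The remedy, which the CG step of Algorithm 1 already reflects, is never to form Jd at all: since ∇d ℓH(w⋆) = Jdᵀ ∇w ℓH(w⋆) by (1) and Jd = −(C + ∇²w ℓT(w⋆))⁻¹B by (5), a single CG solve of (C + ∇²w ℓT(w⋆)) x = ∇w ℓH(w⋆) followed by the contraction −Bᵀx yields the entire k-dimensional gradient. A minimal sketch, reusing the hypothetical helpers from the earlier snippets:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def hyperparameter_gradient(d, pi, w_star, train_grad, holdout_grad):
    """grad_d l_H(w*) via one CG solve, never forming the Hessian.

    B has (i, j) entry I{pi(i)=j} * w*_i * exp(d_j), so the contraction
    (B^T x)_j is just a per-group sum of w*_i * exp(d_pi(i)) * x_i."""
    n, k = w_star.size, d.size
    c = np.exp(d[pi])  # diagonal of C(d)
    A = LinearOperator(
        (n, n),
        matvec=lambda v: c * v + hessian_vector_product(train_grad, w_star, v),
    )
    x, info = cg(A, holdout_grad(w_star))   # solve (C + H) x = grad_w l_H(w*)
    assert info == 0, "CG did not converge"
    return -np.bincount(pi, weights=w_star * c * x, minlength=k)  # -B^T x
```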

[Figure 2: (a) an HMM-style model with hidden states y1, ..., yL, each carrying "observed features" and "noise features" xj, j ∈ {1, ..., 40}; (b, c) accuracy as a function of the number of relevant features R for the grid, single, separate, and grouped schemes.]

63 In both (b) and (c), each point represents an average over 100 independent runs of HMM training/holdout/testing set generation and CRF training and hyperparameter optimization. [sent-168, score-0.49]

64 A similar trick was previously used for hyperparameter adaptation in SVMs [16] and kernel logistic regression [33]. [sent-172, score-0.51]

65 Figure 1 shows a summary of our algorithm for hyperparameter gradient computation. [sent-173, score-0.569]

66 , 40} with each hidden state yi, including R "relevant" observed features whose values were chosen based on yi, and (40 − R) "irrelevant" noise features whose values were chosen to be either 0 or 1 with equal probability, independent of yi. [sent-180, score-0.235]

67 Next, we constructed a CRF based on an HMM model similar to that shown in Figure 2a in which potentials were included for the initial node y1 , between each yi and yi+1 , and between yi and each xj (including both the observed features and the noise features). [sent-183, score-0.191]
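A minimal data-generation sketch for this simulation; the extract says only that relevant-feature values were "chosen based on yi", so the flip-probability emission and the 0.8 self-transition probability below are assumptions.

```python
import numpy as np

def simulate_sequence(L, R, n_feats=40, flip=0.2, rng=None):
    """One synthetic sequence: a binary hidden Markov chain y, R relevant
    features (noisy copies of y_t), and n_feats - R Bernoulli(0.5) noise
    features that are independent of y."""
    rng = rng or np.random.default_rng()
    y = np.zeros(L, dtype=int)
    y[0] = rng.integers(2)
    for t in range(1, L):                       # sticky chain (assumed)
        y[t] = y[t - 1] if rng.random() < 0.8 else 1 - y[t - 1]
    x = rng.integers(0, 2, size=(L, n_feats))   # noise features
    flips = rng.random((L, R)) < flip
    x[:, :R] = np.where(flips, 1 - y[:, None], y[:, None])  # relevant features
    return x, y
```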

68 Figure 2b shows the performance of the CRF for each of the three parameter-tying gradient-based optimization schemes, as well as the performance of scheme (a) when using the standard grid-search strategy of trying regularization matrices CI for C ∈ . [sent-185, score-0.247]

69 As seen in Figures 2b and 2c, the gradient-based procedure performed either as well as or better than a grid search for single hyperparameter models. [sent-192, score-0.522]

70 Roughly 3-5 line searches were sufficient to identify good hyperparameter settings; assuming that each line search takes 2-4 times the cost of solving OPT1, the overall hyperparameter learning procedure takes approximately 20 times the cost of solving OPT1 once. [sent-194, score-0.988]

71 Figure 3: RNA secondary structure prediction. [sent-254, score-0.204]

72 (a) An illustration of the secondary structure prediction task. [sent-255, score-0.204]

73 (b) Grouped hyperparameters learned using our algorithm for each of the two folds. [sent-256, score-0.396]

74 (c) Performance comparison with state-of-the-art methods when using either a single hyperparameter (the “original” CONTRAfold), separate hyperparameters, or grouped hyperparameters. [sent-257, score-0.653]

75 Enforcing regularization groups, however, gave consistently lower error rates, achieving an absolute reduction in generalization error over the next-best model of 6. [sent-259, score-0.209]

76 We also applied our framework to the problem of RNA secondary structure prediction. [sent-263, score-0.204]

77 Here, we focus on the task of predicting RNA secondary structure, i.e. [sent-266, score-0.204]

78 As a starting point, we used CONTRAfold [7], a current state-of-the-art secondary structure prediction program based on CLLMs. [sent-269, score-0.204]

79 In brief, the CONTRAfold program models RNA secondary structures using a variant of stochastic context-free grammars (SCFGs) which incorporates features chosen to closely match the energetic terms found in standard physics-based models of RNA structure. [sent-270, score-0.323]

80 (e.g., all features for describing hairpin loop lengths were placed in a single regularization group). [sent-278, score-0.513]

81 For testing, we collected 151 RNA sequences from the Rfam database [13] for which experimentally-determined secondary structures were already known. [sent-279, score-0.235]

82 Despite the small size of the training set, the hyperparameters learned on each fold were nonetheless qualitatively similar, indicating the robustness of the procedure (see Figure 3b). [sent-281, score-0.459]

83 Using either separate or grouped hyperparameters gave both increased sensitivity and increased specificity compared to the original model, which was learned using a single regularization hyperparameter. [sent-284, score-0.829]

84 Overall, the testing logloss (summed over the two folds) decreased by roughly 6. [sent-285, score-0.269]

85 6 Discussion and related work In this work, we presented a gradient-based approach for hyperparameter learning based on minimizing logloss on a holdout set. [sent-290, score-0.893]

86 While the use of cross-validation loss as a proxy for generalization error is fairly natural, in many other supervised learning methods besides log-linear models, other objective functions have been proposed for hyperparameter optimization. [sent-291, score-0.462]

87 A different method for dealing with hyperparameters, common in neural network modeling, is the Bayesian approach of treating hyperparameters themselves as parameters in the model to be estimated. [sent-294, score-0.427]

88 In an ideal Bayesian scheme, one does not perform hyperparameter or parameter inference, but rather integrates over all possible hyperparameters and parameters in order to obtain a posterior distribution over predicted outputs given the training data. [sent-295, score-0.886]

89 In this strategy, however, the “Occam factor” used for hyperparameter optimization still requires a Hessian computation, which does not scale well for log-linear models. [sent-303, score-0.502]

90 Furthermore, our algorithm is simple and efficient, both conceptually and in practice: one iteratively optimizes the parameters of a log-linear model using a fixed setting of the hyperparameters, and then one changes the hyperparameters based on the holdout logloss gradient. [sent-307, score-0.827]

91 The gradient computation relies primarily on a simple conjugate gradient solver for linear systems, coupled with the ability to compute Hessian-vector products (straightforward in any modern programming language that allows for operation overloading). [sent-308, score-0.317]

92 As we demonstrated in the context of RNA secondary structure prediction, gradient-based hyperparameter learning is a practical and effective method for tuning hyperparameters when applied to large-scale log-linear models. [sent-309, score-1.095]

93 Finally we note that for neural networks, [9] and [5] proposed techniques for simultaneous optimization of hyperparameters and parameters; these results suggest that similar procedures for faster hyperparameter learning that do not require a doubly-nested optimization may be possible. [sent-310, score-0.997]

94 Optimal use of regularization and cross-validation in neural network modeling. [sent-340, score-0.205]

95 Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. [sent-412, score-0.429]

96 An efficient method for gradient-based adaptation of hyperparameters in SVM models. [sent-419, score-0.444]

97 Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. [sent-425, score-0.396]

98 Faster optimization of SVR hyperparameters based on minimizing crossvalidation error. [sent-430, score-0.436]

99 Design and regularization of neural networks: the optimal use of a validation set. [sent-444, score-0.205]

100 Accelerated training of conditional random fields with stochastic gradient methods. [sent-551, score-0.17]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hyperparameter', 0.462), ('hyperparameters', 0.396), ('rna', 0.305), ('logloss', 0.237), ('secondary', 0.204), ('holdout', 0.194), ('regularization', 0.174), ('jd', 0.134), ('contrafold', 0.128), ('rv', 0.119), ('gradient', 0.107), ('loop', 0.104), ('dj', 0.1), ('grouped', 0.097), ('hessian', 0.089), ('cllms', 0.085), ('lengths', 0.083), ('jacobian', 0.079), ('cw', 0.075), ('hmm', 0.074), ('cg', 0.071), ('crfs', 0.071), ('rk', 0.069), ('wi', 0.067), ('separate', 0.065), ('bulge', 0.064), ('cip', 0.064), ('hairpin', 0.064), ('larsen', 0.064), ('nnsp', 0.064), ('rnas', 0.064), ('features', 0.059), ('wp', 0.056), ('loops', 0.054), ('xj', 0.054), ('adaptation', 0.048), ('nlp', 0.047), ('parsing', 0.047), ('matrix', 0.045), ('bayesian', 0.044), ('labeling', 0.044), ('rn', 0.043), ('andersen', 0.043), ('cllm', 0.043), ('helix', 0.043), ('kobayashi', 0.043), ('rfam', 0.043), ('svarer', 0.043), ('crf', 0.043), ('groups', 0.043), ('structured', 0.041), ('optimization', 0.04), ('conjugate', 0.04), ('internal', 0.04), ('derivative', 0.04), ('stacking', 0.039), ('yi', 0.039), ('proportion', 0.038), ('pairings', 0.037), ('nucleotides', 0.037), ('keerthi', 0.037), ('svr', 0.037), ('tertiary', 0.037), ('svm', 0.036), ('multiple', 0.036), ('fold', 0.035), ('gave', 0.035), ('conditional', 0.035), ('icml', 0.035), ('auc', 0.034), ('products', 0.034), ('neurocomputing', 0.034), ('ijcnn', 0.034), ('base', 0.034), ('wt', 0.034), ('sensitivity', 0.033), ('strategy', 0.033), ('tuning', 0.033), ('solving', 0.032), ('testing', 0.032), ('elds', 0.032), ('word', 0.032), ('strategies', 0.032), ('regularizers', 0.032), ('hansen', 0.032), ('sequences', 0.031), ('bt', 0.031), ('lim', 0.031), ('neural', 0.031), ('grid', 0.031), ('models', 0.03), ('nucleic', 0.03), ('exp', 0.03), ('classi', 0.03), ('single', 0.029), ('computation', 0.029), ('acid', 0.028), ('training', 0.028), ('procedures', 0.028), ('svms', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

2 0.095244482 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute we develop an efficient, general, online approximation to the natural gradient descent which is suited to large scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.

3 0.086140968 8 nips-2007-A New View of Automatic Relevance Determination

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1

4 0.076489076 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy

Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1

5 0.071658321 97 nips-2007-Hidden Common Cause Relations in Relational Learning

Author: Ricardo Silva, Wei Chu, Zoubin Ghahramani

Abstract: When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies. 1

6 0.068115287 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

7 0.067813247 135 nips-2007-Multi-task Gaussian Process Prediction

8 0.066933244 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

9 0.066077143 40 nips-2007-Bundle Methods for Machine Learning

10 0.062378377 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

11 0.060264945 166 nips-2007-Regularized Boost for Semi-Supervised Learning

12 0.060120653 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

13 0.059701838 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

14 0.058967493 134 nips-2007-Multi-Task Learning via Conic Programming

15 0.058713999 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

16 0.058031172 69 nips-2007-Discriminative Batch Mode Active Learning

17 0.056582522 145 nips-2007-On Sparsity and Overcompleteness in Image Models

18 0.054386858 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators

19 0.054273956 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables

20 0.053564057 195 nips-2007-The Generalized FITC Approximation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.207), (1, 0.028), (2, -0.049), (3, 0.02), (4, 0.002), (5, -0.009), (6, -0.025), (7, -0.079), (8, -0.025), (9, -0.006), (10, 0.033), (11, -0.045), (12, 0.004), (13, 0.021), (14, -0.004), (15, -0.011), (16, -0.106), (17, 0.03), (18, 0.03), (19, -0.013), (20, -0.053), (21, 0.033), (22, 0.073), (23, 0.066), (24, 0.035), (25, 0.023), (26, -0.065), (27, 0.029), (28, -0.074), (29, -0.025), (30, -0.03), (31, -0.069), (32, -0.051), (33, -0.062), (34, -0.055), (35, -0.068), (36, 0.062), (37, 0.016), (38, 0.086), (39, 0.013), (40, -0.127), (41, 0.151), (42, 0.179), (43, 0.069), (44, -0.088), (45, 0.078), (46, -0.084), (47, 0.086), (48, -0.092), (49, -0.114)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9464286 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1
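The implicit-differentiation trick in this abstract can be illustrated outside the CRF setting. Below is a minimal logistic-regression analogue, not the paper's algorithm: the inner problem is solved by Newton's method, and the gradient of a holdout loss with respect to per-feature log-hyperparameters follows from the implicit function theorem. The function names and the exp(lam_j) parameterization of the penalty are illustrative assumptions.

```python
import numpy as np

def fit_w(X, y, lam, iters=50):
    """Newton's method for the inner problem:
    minimize sum_i log(1 + exp(-y_i x_i.w)) + 0.5 * sum_j exp(lam_j) w_j^2,
    with labels y in {-1, +1}. Returns the optimum and the Hessian there."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))                  # P(y=+1 | x)
        g = X.T @ (p - (y + 1) / 2) + np.exp(lam) * w
        H = X.T @ ((p * (1 - p))[:, None] * X) + np.diag(np.exp(lam))
        w = w - np.linalg.solve(H, g)
    p = 1.0 / (1.0 + np.exp(-X @ w))                      # Hessian at the optimum
    H = X.T @ ((p * (1 - p))[:, None] * X) + np.diag(np.exp(lam))
    return w, H

def hyper_grad(X_tr, y_tr, X_ho, y_ho, lam):
    """Gradient of the holdout log-loss w.r.t. lam, via implicit
    differentiation of the optimality condition grad_w J(w*, lam) = 0:
    dL/dlam_j = -exp(lam_j) * w*_j * [H^{-1} grad_w L]_j."""
    w, H = fit_w(X_tr, y_tr, lam)
    p = 1.0 / (1.0 + np.exp(-X_ho @ w))
    grad_L = X_ho.T @ (p - (y_ho + 1) / 2)                # holdout loss gradient at w*
    return -np.exp(lam) * w * np.linalg.solve(H, grad_L)
```

One would then run plain gradient descent on lam (e.g., lam -= eta * hyper_grad(...)), refitting w at each step.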

2 0.6441316 206 nips-2007-Topmoumoute Online Natural Gradient Algorithm

Author: Nicolas L. Roux, Pierre-antoine Manzagol, Yoshua Bengio

Abstract: Guided by the goal of obtaining an optimization algorithm that is both fast and yields good generalization, we study the descent direction maximizing the decrease in generalization error or the probability of not increasing generalization error. The surprising result is that from both the Bayesian and frequentist perspectives this can yield the natural gradient direction. Although that direction can be very expensive to compute, we develop an efficient, general, online approximation to natural gradient descent that is suited to large-scale problems. We report experimental results showing much faster convergence in computation time and in number of iterations with TONGA (Topmoumoute Online natural Gradient Algorithm) than with stochastic gradient descent, even on very large datasets.
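As a rough illustration of the update this abstract describes (TONGA itself maintains an online, low-rank approximation that this sketch does not attempt), the following preconditions the minibatch gradient by a damped empirical covariance of per-example gradients. The function name and the eps damping are assumptions for illustration.

```python
import numpy as np

def natural_gradient_step(theta, per_example_grads, lr=0.1, eps=1e-3):
    """One naive natural-gradient step: precondition the mean gradient by
    the (damped) empirical covariance of per-example gradients.
    per_example_grads has shape (n_examples, n_params)."""
    g_bar = per_example_grads.mean(axis=0)
    G = per_example_grads - g_bar                 # centered per-example gradients
    C = G.T @ G / len(per_example_grads)          # empirical gradient covariance
    direction = np.linalg.solve(C + eps * np.eye(len(theta)), g_bar)
    return theta - lr * direction
```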

3 0.5976373 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta

Author: Ben Blum, Rhiju Das, Philip Bradley, David Baker, Michael I. Jordan, David Tax

Abstract: Rosetta is one of the leading algorithms for protein structure prediction today. It is a Monte Carlo energy minimization method requiring many random restarts to find structures with low energy. In this paper we present a resampling technique for structure prediction of small alpha/beta proteins using Rosetta. From an initial round of Rosetta sampling, we learn properties of the energy landscape that guide a subsequent round of sampling toward lower-energy structures. Rather than attempt to fit the full energy landscape, we use feature selection methods—both L1-regularized linear regression and decision trees—to identify structural features that give rise to low energy. We then enrich these structural features in the second sampling round. Results are presented across a benchmark set of nine small alpha/beta proteins demonstrating that our methods seldom impair, and frequently improve, Rosetta’s performance. 1
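The feature-selection step described here can be sketched generically: fit an L1-regularized linear model of energy on binary structural features and keep the features whose weights point toward low energy. This is a stand-in using scikit-learn's Lasso, not Rosetta's actual pipeline; the alpha value and the sign convention are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_low_energy_features(F, energy, alpha=0.05):
    """Fit energy ~ F with an L1 penalty and return indices of features
    whose negative weights associate them with low energy -- candidates
    to enrich in a second round of sampling.
    F: (n_structures, n_features) binary feature matrix;
    energy: (n_structures,) scored energies."""
    model = Lasso(alpha=alpha).fit(F, energy)
    return np.where(model.coef_ < 0)[0]
```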

4 0.55939966 152 nips-2007-Parallelizing Support Vector Machines on Distributed Computers

Author: Kaihua Zhu, Hao Wang, Hongjie Bai, Jian Li, Zhihuan Qiu, Hang Cui, Edward Y. Chang

Abstract: Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use through performing a row-based, approximate matrix factorization, and which loads only essential data to each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n2 ) to O(np/m), and improves computation time to O(np2 /m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
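The memory saving from a row-based approximate factorization can be illustrated with a Nyström-style low-rank factorization K ≈ GGᵀ, which stores O(np) numbers instead of O(n²). This is a sketch of the general idea only, not PSVM's parallel incomplete-factorization routine; the kernel interface, p, and the uniform landmark sampling are assumptions.

```python
import numpy as np

def nystrom_factor(X, kernel, p, seed=0):
    """Approximate the n x n Gram matrix K by G @ G.T with G of shape (n, p),
    storing O(np) numbers instead of O(n^2). `kernel(A, B)` must return the
    Gram matrix between the rows of A and the rows of B."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=p, replace=False)   # landmark rows
    C = kernel(X, X[idx])                                 # n x p block of K
    vals, vecs = np.linalg.eigh(kernel(X[idx], X[idx]))   # p x p block W
    vals = np.clip(vals, 1e-12, None)                     # guard tiny eigenvalues
    return C @ (vecs / np.sqrt(vals)) @ vecs.T            # G = C @ W^{-1/2}

# Example kernel: Gaussian RBF.
rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
```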

5 0.5383442 156 nips-2007-Predictive Matrix-Variate t Models

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: It is becoming increasingly important to learn from a partially-observed random matrix and predict its missing elements. We assume that the entire matrix is a single sample drawn from a matrix-variate t distribution and suggest a matrix-variate t model (MVTM) to predict those missing elements. We show that MVTM generalizes a range of known probabilistic models, and automatically performs model selection to encourage sparse predictive models. Due to the non-conjugacy of its prior, it is difficult to make predictions by computing the mode or mean of the posterior distribution. We suggest an optimization method that sequentially minimizes a convex upper-bound of the log-likelihood, which is very efficient and scalable. Experiments on a toy dataset and the EachMovie dataset show the good predictive accuracy of the model.
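For the same missing-entry prediction task, a point-estimate baseline makes the setup concrete: a regularized low-rank factorization fit by alternating ridge regressions. This is deliberately not the MVTM (no matrix-variate t prior, no convex upper-bound optimization); the rank, regularizer, and iteration count below are illustrative assumptions.

```python
import numpy as np

def als_complete(Y, mask, rank=5, lam=0.1, iters=20, seed=0):
    """Fill in missing entries of Y (mask==1 where observed) with a
    rank-`rank` factorization Y ~ U @ V.T, fit by alternating ridge
    regressions on the observed entries only."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.standard_normal((n, rank))
    V = rng.standard_normal((m, rank))
    I = lam * np.eye(rank)
    for _ in range(iters):
        for i in range(n):                        # re-fit each row factor
            o = mask[i] == 1
            U[i] = np.linalg.solve(V[o].T @ V[o] + I, V[o].T @ Y[i, o])
        for j in range(m):                        # re-fit each column factor
            o = mask[:, j] == 1
            V[j] = np.linalg.solve(U[o].T @ U[o] + I, U[o].T @ Y[o, j])
    return U @ V.T
```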

6 0.53402513 96 nips-2007-Heterogeneous Component Analysis

7 0.5327118 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

8 0.52862805 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning

9 0.51716411 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines

10 0.50233227 97 nips-2007-Hidden Common Cause Relations in Relational Learning

11 0.49927887 8 nips-2007-A New View of Automatic Relevance Determination

12 0.4974792 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

13 0.48955002 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation

14 0.46336082 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

15 0.45991516 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

16 0.45506671 158 nips-2007-Probabilistic Matrix Factorization

17 0.44085926 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach

18 0.43180794 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

19 0.42912069 40 nips-2007-Bundle Methods for Machine Learning

20 0.42243356 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.041), (13, 0.058), (16, 0.075), (18, 0.015), (21, 0.072), (28, 0.253), (31, 0.02), (34, 0.02), (35, 0.033), (47, 0.067), (49, 0.026), (83, 0.113), (85, 0.032), (87, 0.011), (90, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77852976 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

Author: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of structured prediction probabilistic models which includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter. 1

2 0.72891378 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1
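The Cohn-Atlas-Ladner scheme that this algorithm extends is easy to state for 1-D thresholds: query a label only when hypotheses consistent with the labels seen so far disagree on the new point. The sketch below is the realizable-case toy version, not the paper's agnostic reduction; the function name and the stream/oracle interface are assumptions.

```python
import numpy as np

def cal_thresholds(stream, oracle):
    """Selective sampling for 1-D thresholds h_t(x) = +1 iff x >= t.
    Maintain the interval (lo, hi] of thresholds consistent with all
    labels seen; query only where consistent thresholds disagree."""
    lo, hi = -np.inf, np.inf
    queries = 0
    for x in stream:
        if lo < x < hi:                   # disagreement region: query the label
            queries += 1
            if oracle(x) == 1:            # +1 label => threshold t <= x
                hi = min(hi, x)
            else:                         # -1 label => threshold t > x
                lo = max(lo, x)
    return (lo, hi), queries              # any t in (lo, hi] is consistent
```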

3 0.67314428 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

Author: András Antos, Csaba Szepesvári, Rémi Munos

Abstract: We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems. 1 Preliminaries We will build on the results from [1, 2, 3] and for this reason we use the same notation as these papers. The unattributed results cited in this section can be found in the book [4]. A discounted MDP is defined by a quintuple (X , A, P, S, γ), where X is the (possible infinite) state space, A is the set of actions, P : X × A → M (X ) is the transition probability kernel with P (·|x, a) defining the next-state distribution upon taking action a from state x, S(·|x, a) gives the corresponding distribution of immediate rewards, and γ ∈ (0, 1) is the discount factor. Here X is a measurable space and M (X ) denotes the set of all probability measures over X . The Lebesguemeasure shall be denoted by λ. We start with the following mild assumption on the MDP: Assumption A1 (MDP Regularity) X is a compact subset of the dX -dimensional Euclidean space, ˆ A is a compact subset of [−A∞ , A∞ ]dA . The random immediate rewards are bounded by Rmax and that the expected immediate reward function, r(x, a) = rS(dr|x, a), is uniformly bounded by Rmax : r ∞ ≤ Rmax . A policy determines the next action given the past observations. Here we shall deal with stationary (Markovian) policies which choose an action in a stochastic way based on the last observation only. The value of a policy π when it is started from a state x is defined as the total expected discounted ∞ reward that is encountered while the policy is executed: V π (x) = Eπ [ t=0 γ t Rt |X0 = x]. Here Rt ∼ S(·|Xt , At ) is the reward received at time step t, the state, Xt , evolves according to Xt+1 ∼ ∗ Also with: Computer and Automation Research Inst. of the Hungarian Academy of Sciences Kende u. 13-17, Budapest 1111, Hungary. 1 P (·|Xt , At ), where At is sampled from the distribution determined by π. We use Qπ : X × A → R ∞ to denote the action-value function of policy π: Qπ (x, a) = Eπ [ t=0 γ t Rt |X0 = x, A0 = a]. The goal is to find a policy that attains the best possible values, V ∗ (x) = supπ V π (x), at all states ∗ x ∈ X . Here V ∗ is called the optimal value function and a policy π ∗ that satisfies V π (x) = ∗ ∗ ∗ V (x) for all x ∈ X is called optimal. The optimal action-value function Q (x, a) is Q (x, a) = supπ Qπ (x, a). We say that a (deterministic stationary) policy π is greedy w.r.t. an action-value function Q ∈ B(X × A), and we write π = π (·; Q), if, for all x ∈ X , π(x) ∈ argmaxa∈A Q(x, a). ˆ Under mild technical assumptions, such a greedy policy always exists. Any greedy policy w.r.t. Q∗ is optimal. For π : X → A we define its evaluation operator, T π : B(X × A) → B(X × A), by (T π Q)(x, a) = r(x, a) + γ X Q(y, π(y)) P (dy|x, a). It is known that Qπ = T π Qπ . Further, if we let the Bellman operator, T : B(X × A) → B(X × A), defined by (T Q)(x, a) = r(x, a) + γ X supb∈A Q(y, b) P (dy|x, a) then Q∗ = T Q∗ . It is known that V π and Qπ are bounded by Rmax /(1 − γ), just like Q∗ and V ∗ . 
For π : X → A, the operator E π : B(X × A) → B(X ) is defined by (E π Q)(x) = Q(x, π(x)), while E : B(X × A) → B(X ) is defined by (EQ)(x) = supa∈A Q(x, a). Throughout the paper F ⊂ {f : X × A → R} will denote a subset of real-valued functions over the state-action space X × A and Π ⊂ AX will be a set of policies. For ν ∈ M (X ) and f : X → R p measurable, we let (for p ≥ 1) f p,ν = X |f (x)|p ν(dx). We simply write f ν for f 2,ν . 2 Further, we extend · ν to F by f ν = A X |f |2 (x, a) dν(x) dλA (a), where λA is the uniform distribution over A. We shall use the shorthand notation νf to denote the integral f (x)ν(dx). We denote the space of bounded measurable functions with domain X by B(X ). Further, the space of measurable functions bounded by 0 < K < ∞ shall be denoted by B(X ; K). We let · ∞ denote the supremum norm. 2 Fitted Q-iteration with approximate policy maximization We assume that we are given a finite trajectory, {(Xt , At , Rt )}1≤t≤N , generated by some stochastic stationary policy πb , called the behavior policy: At ∼ πb (·|Xt ), Xt+1 ∼ P (·|Xt , At ), Rt ∼ def S(·|Xt , At ), where πb (·|x) is a density with π0 = inf (x,a)∈X ×A πb (a|x) > 0. The generic recipe for fitted Q-iteration (FQI) [5] is Qk+1 = Regress(Dk (Qk )), (1) where Regress is an appropriate regression procedure and Dk (Qk ) is a dataset defining a regression problem in the form of a list of data-point pairs: Dk (Qk ) = (Xt , At ), Rt + γ max Qk (Xt+1 , b) b∈A 1≤t≤N .1 Fitted Q-iteration can be viewed as approximate value iteration applied to action-value functions. To see this note that value iteration would assign the value (T Qk )(x, a) = r(x, a) + γ maxb∈A Qk (y, b) P (dy|x, a) to Qk+1 (x, a) [6]. Now, remember that the regression function for the jointly distributed random variables (Z, Y ) is defined by the conditional expectation of Y given Z: m(Z) = E [Y |Z]. Since for any fixed function Q, E [Rt + γ maxb∈A Q(Xt+1 , b)|Xt , At ] = (T Q)(Xt , At ), the regression function corresponding to the data Dk (Q) is indeed T Q and hence if FQI solved the regression problem defined by Qk exactly, it would simulate value iteration exactly. However, this argument itself does not directly lead to a rigorous analysis of FQI: Since Qk is obtained based on the data, it is itself a random function. Hence, after the first iteration, the “target” function in FQI becomes random. Furthermore, this function depends on the same data that is used to define the regression problem. Will FQI still work despite these issues? To illustrate the potential difficulties consider a dataset where X1 , . . . , XN is a sequence of independent random variables, which are all distributed uniformly at random in [0, 1]. Further, let M be a random integer greater than N which is independent of the dataset (Xt )N . Let U be another random variable, uniformly t=1 distributed in [0, 1]. Now define the regression problem by Yt = fM,U (Xt ), where fM,U (x) = sgn(sin(2M 2π(x + U ))). Then it is not hard to see that no matter how big N is, no procedure can 1 Since the designer controls Qk , we may assume that it is continuous, hence the maximum exists. 2 estimate the regression function fM,U with a small error (in expectation, or with high probability), even if the procedure could exploit the knowledge of the specific form of fM,U . On the other hand, if we restricted M to a finite range then the estimation problem could be solved successfully. 
The example shows that if the complexity of the random functions defining the regression problem is uncontrolled then successful estimation might be impossible. Amongst the many regression methods in this paper we have chosen to work with least-squares methods. In this case Equation (1) takes the form N Qk+1 = argmin Q∈F t=1 1 πb (At |Xt ) 2 Q(Xt , At ) − Rt + γ max Qk (Xt+1 , b) b∈A . (2) We call this method the least-squares fitted Q-iteration (LSFQI) method. Here we introduced the weighting 1/πb (At |Xt ) since we do not want to give more weight to those actions that are preferred by the behavior policy. Besides this weighting, the only parameter of the method is the function set F. This function set should be chosen carefully, to keep a balance between the representation power and the number of samples. As a specific example for F consider neural networks with some fixed architecture. In this case the function set is generated by assigning weights in all possible ways to the neural net. Then the above minimization becomes the problem of tuning the weights. Another example is to use linearly parameterized function approximation methods with appropriately selected basis functions. In this case the weight tuning problem would be less demanding. Yet another possibility is to let F be an appropriate restriction of a Reproducing Kernel Hilbert Space (e.g., in a ball). In this case the training procedure becomes similar to LS-SVM training [7]. As indicated above, the analysis of this algorithm is complicated by the fact that the new dataset is defined in terms of the previous iterate, which is already a function of the dataset. Another complication is that the samples in a trajectory are in general correlated and that the bias introduced by the imperfections of the approximation architecture may yield to an explosion of the error of the procedure, as documented in a number of cases in, e.g., [8]. Nevertheless, at least for finite action sets, the tools developed in [1, 3, 2] look suitable to show that under appropriate conditions these problems can be overcome if the function set is chosen in a judicious way. However, the results of these works would become essentially useless in the case of an infinite number of actions since these previous bounds grow to infinity with the number of actions. Actually, we believe that this is not an artifact of the proof techniques of these works, as suggested by the counterexample that involved random targets. The following result elaborates this point further: Proposition 2.1. Let F ⊂ B(X × A). Then even if the pseudo-dimension of F is finite, the fatshattering function of ∨ Fmax = VQ : VQ (·) = max Q(·, a), Q ∈ F a∈A 2 can be infinite over (0, 1/2). Without going into further details, let us just note that the finiteness of the fat-shattering function is a sufficient and necessary condition for learnability and the finiteness of the fat-shattering function is implied by the finiteness of the pseudo-dimension [9].The above proposition thus shows that without imposing further special conditions on F, the learning problem may become infeasible. One possibility is of course to discretize the action space, e.g., by using a uniform grid. However, if the action space has a really high dimensionality, this approach becomes unfeasible (even enumerating 2dA points could be impossible when dA is large). Therefore we prefer alternate solutions. Another possibility is to make the functions in F, e.g., uniformly Lipschitz in their state coordinates. 
∨ Then the same property will hold for functions in Fmax and hence by a classical result we can bound the capacity of this set (cf. pp. 353–357 of [10]). One potential problem with this approach is that this way it might be difficult to get a fine control of the capacity of the resulting set. 2 The proof of this and the other results are given in the appendix, available in the extended version of this paper, downloadable from http://hal.inria.fr/inria-00185311/en/. 3 In the approach explored here we modify the fitted Q-iteration algorithm by introducing a policy set Π and a search over this set for an approximately greedy policy in a sense that will be made precise in a minute. Our algorithm thus has four parameters: F, Π, K, Q0 . Here F is as before, Π is a user-chosen set of policies (mappings from X to A), K is the number of iterations and Q0 is an initial value function (a typical choice is Q0 ≡ 0). The algorithm computes a sequence of iterates (Qk , πk ), k = 0, . . . , K, defined by the following equations: ˆ N π0 ˆ = argmax π∈Π Q0 (Xt , π(Xt )), t=1 N Qk+1 = argmin Q∈F t=1 1 Q(Xt , At ) − Rt + γQk (Xt+1 , πk (Xt+1 )) ˆ πb (At |Xt ) 2 , (3) N πk+1 ˆ = argmax π∈Π Qk+1 (Xt , π(Xt )). (4) t=1 Thus, (3) is similar to (2), while (4) defines the policy search problem. The policy search will generally be solved by a gradient procedure or some other appropriate method. The cost of this step will be primarily determined by how well-behaving the iterates Qk+1 are in their action arguments. For example, if they were quadratic and if π was linear then the problem would be a quadratic optimization problem. However, except for special cases3 the action value functions will be more complicated, in which case this step can be expensive. Still, this cost could be similar to that of searching for the maximizing actions for each t = 1, . . . , N if the approximately maximizing actions are similar across similar states. This algorithm, which we could also call a fitted actor-critic algorithm, will be shown to overcome the above mentioned complexity control problem provided that the complexity of Π is controlled appropriately. Indeed, in this case the set of possible regression problems is determined by the set ∨ FΠ = { V : V (·) = Q(·, π(·)), Q ∈ F, π ∈ Π } , ∨ and the proof will rely on controlling the complexity of FΠ by selecting F and Π appropriately. 3 3.1 The main theoretical result Outline of the analysis In order to gain some insight into the behavior of the algorithm, we provide a brief summary of its error analysis. The main result will be presented subsequently. For f ,Q ∈ F and a policy π, we define the tth TD-error as follows: dt (f ; Q, π) = Rt + γQ(Xt+1 , π(Xt+1 )) − f (Xt , At ). Further, we define the empirical loss function by 1 ˆ LN (f ; Q, π) = N N t=1 d2 (f ; Q, π) t , λ(A)πb (At |Xt ) where the normalization with λ(A) is introduced for mathematical convenience. Then (3) can be ˆ written compactly as Qk+1 = argminf ∈F LN (f ; Qk , πk ). ˆ ˆ The algorithm can then be motivated by the observation that for any f ,Q, and π, LN (f ; Q, π) is an unbiased estimate of def 2 L(f ; Q, π) = f − T π Q ν + L∗ (Q, π), (5) where the first term is the error we are interested in and the second term captures the variance of the random samples: L∗ (Q, π) = E [Var [R1 + γQ(X2 , π(X2 ))|X1 , A1 = a]] dλA (a). A 3 Linear quadratic regulation is such a nice case. 
It is interesting to note that in this special case the obvious choices for F and Π yield zero error in the limit, as can be proven based on the main result of this paper. 4 ˆ This result is stated formally by E LN (f ; Q, π) = L(f ; Q, π). Since the variance term in (5) is independent of f , argminf ∈F L(f ; Q, π) = 2 π argminf ∈F f − T Q ν . Thus, if πk were greedy w.r.t. Qk then argminf ∈F L(f ; Qk , πk ) = ˆ ˆ 2 argminf ∈F f − T Qk ν . Hence we can still think of the procedure as approximate value iteration over the space of action-value functions, projecting T Qk using empirical risk minimization on the space F w.r.t. · ν distances in an approximate manner. Since πk is only approximately greedy, we ˆ will have to deal with both the error coming from the approximate projection and the error coming from the choice of πk . To make this clear, we write the iteration in the form ˆ ˆ ˆ Qk+1 = T πk Qk + εk = T Qk + εk + (T πk Qk − T Qk ) = T Qk + εk , def ˆ ˆ where εk is the error committed while computing T πk Qk , εk = T πk Qk − T Qk is the error committed because the greedy policy is computed approximately and εk = εk + εk is the total error of step k. Hence, in order to show that the procedure is well behaved, one needs to show that both errors are controlled and that when the errors are propagated through these equations, the resulting error stays controlled, too. Since we are ultimately interested in the performance of the policy obtained, we will also need to show that small action-value approximation errors yield small performance losses. For these we need a number of assumptions that concern either the training data, the MDP, or the function sets used for learning. 3.2 Assumptions 3.2.1 Assumptions on the training data We shall assume that the data is rich, is in a steady state, and is fast-mixing, where, informally, mixing means that future depends weakly on the past. Assumption A2 (Sample Path Properties) Assume that {(Xt , At , Rt )}t=1,...,N is the sample path of πb , a stochastic stationary policy. Further, assume that {Xt } is strictly stationary (Xt ∼ ν ∈ M (X )) and exponentially β-mixing with the actual rate given by the parameters (β, b, κ).4 We further assume that the sampling policy πb satisfies π0 = inf (x,a)∈X ×A πb (a|x) > 0. The β-mixing property will be used to establish tail inequalities for certain empirical processes.5 Note that the mixing coefficients do not need to be known. In the case when no mixing condition is satisfied, learning might be impossible. To see this just consider the case when X1 = X2 = . . . = XN . Thus, in this case the learner has many copies of the same random variable and successful generalization is thus impossible. We believe that the assumption that the process is in a steady state is not essential for our result, as when the process reaches its steady state quickly then (at the price of a more involved proof) the result would still hold. 3.2.2 Assumptions on the MDP In order to prevent the uncontrolled growth of the errors as they are propagated through the updates, we shall need some assumptions on the MDP. A convenient assumption is the following one [11]: Assumption A3 (Uniformly stochastic transitions) For all x ∈ X and a ∈ A, assume that P (·|x, a) is absolutely continuous w.r.t. ν and the Radon-Nikodym derivative of P w.r.t. ν is bounded def < +∞. uniformly with bound Cν : Cν = supx∈X ,a∈A dP (·|x,a) dν ∞ Note that by the definition of measure differentiation, Assumption A3 means that P (·|x, a) ≤ Cν ν(·). 
This assumption essentially requires the transitions to be noisy. We will also prove (weaker) results under the following, weaker assumption: 4 For the definition of β-mixing, see e.g. [2]. We say “empirical process” and “empirical measure”, but note that in this work these are based on dependent (mixing) samples. 5 5 Assumption A4 (Discounted-average concentrability of future-state distributions) Given ρ, ν, m ≥ 1 and an arbitrary sequence of stationary policies {πm }m≥1 , assume that the futuredef state distribution ρP π1 P π2 . . . P πm is absolutely continuous w.r.t. ν. Assume that c(m) = π1 π2 πm def satisfies m≥1 mγ m−1 c(m) < +∞. We shall call Cρ,ν = supπ1 ,...,πm d(ρP Pdν ...P ) ∞ max (1 − γ)2 m≥1 mγ m−1 c(m), (1 − γ) m≥1 γ m c(m) the discounted-average concentrability coefficient of the future-state distributions. The number c(m) measures how much ρ can get amplified in m steps as compared to the reference distribution ν. Hence, in general we expect c(m) to grow with m. In fact, the condition that Cρ,µ is finite is a growth rate condition on c(m). Thanks to discounting, Cρ,µ is finite for a reasonably large class of systems (see the discussion in [11]). A related assumption is needed in the error analysis of the approximate greedy step of the algorithm: Assumption A5 (The random policy “makes no peak-states”) Consider the distribution µ = (ν × λA )P which is the distribution of a state that results from sampling an initial state according to ν and then executing an action which is selected uniformly at random.6 Then Γν = dµ/dν ∞ < +∞. Note that under Assumption A3 we have Γν ≤ Cν . This (very mild) assumption means that after one step, starting from ν and executing this random policy, the probability of the next state being in a set is upper bounded by Γν -times the probability of the starting state being in the same set. def Besides, we assume that A has the following regularity property: Let Py(a, h, ρ) = (a , v) ∈ RdA +1 : a − a 1 ≤ ρ, 0 ≤ v/h ≤ 1 − a − a 1 /ρ denote the pyramid with hight h and base given by the 1 def -ball B(a, ρ) = a ∈ RdA : a − a 1 ≤ρ centered at a. Assumption A6 (Regularity of the action space) We assume that there exists α > 0, such that for all a ∈ A, for all ρ > 0, λ(Py(a, 1, ρ) ∩ (A × R)) λ(A) ≥ min α, λ(Py(a, 1, ρ)) λ(B(a, ρ)) For example, if A is an 1 . -ball itself, then this assumption will be satisfied with α = 2−dA . Without assuming any smoothness of the MDP, learning in infinite MDPs looks hard (see, e.g., [12, 13]). Here we employ the following extra condition: Assumption A7 (Lipschitzness of the MDP in the actions) Assume that the transition probabilities and rewards are Lipschitz w.r.t. their action variable, i.e., there exists LP , Lr > 0 such that for all (x, a, a ) ∈ X × A × A and measurable set B of X , |P (B|x, a) − P (B|x, a )| ≤ LP a − a 1 , |r(x, a) − r(x, a )| ≤ Lr a − a 1 . Note that previously Lipschitzness w.r.t. the state variables was used, e.g., in [11] to construct consistent planning algorithms. 3.2.3 Assumptions on the function sets used by the algorithm These assumptions are less demanding since they are under the control of the user of the algorithm. However, the choice of these function sets will greatly influence the performance of the algorithm, as we shall see it from the bounds. 
The first assumption concerns the class F: Assumption A8 (Lipschitzness of candidate action-value functions) Assume F ⊂ B(X × A) and that any elements of F is uniformly Lipschitz in its action-argument in the sense that |Q(x, a) − Q(x, a )| ≤ LA a − a 1 holds for any x ∈ X , a,a ∈ A, and Q ∈ F . 6 Remember that λA denotes the uniform distribution over the action set A. 6 We shall also need to control the capacity of our function sets. We assume that the reader is familiar with the concept of VC-dimension.7 Here we use the pseudo-dimension of function sets that builds upon the concept of VC-dimension: Definition 3.1 (Pseudo-dimension). The pseudo-dimension VF + of F is defined as the VCdimension of the subgraphs of functions in F (hence it is also called the VC-subgraph dimension of F). Since A is multidimensional, we define VΠ+ to be the sum of the pseudo-dimensions of the coordinate projection spaces, Πk of Π: dA V Π+ = VΠ + , k=1 k Πk = { πk : X → R : π = (π1 , . . . , πk , . . . , πdA ) ∈ Π } . Now we are ready to state our assumptions on our function sets: Assumption A9 (Capacity of the function and policy sets) Assume that F ⊂ B(X × A; Qmax ) for Qmax > 0 and VF + < +∞. Also, A ⊂ [−A∞ , A∞ ]dA and VΠ+ < +∞. Besides their capacity, one shall also control the approximation power of the function sets involved. Let us first consider the policy set Π. Introduce e∗ (F, Π) = sup inf ν(EQ − E π Q). Q∈F π∈Π Note that inf π∈Π ν(EQ − E π Q) measures the quality of approximating νEQ by νE π Q. Hence, e∗ (F, Π) measures the worst-case approximation error of νEQ as Q is changed within F. This can be made small by choosing Π large. Another related quantity is the one-step Bellman-error of F w.r.t. Π. This is defined as follows: For a fixed policy π, the one-step Bellman-error of F w.r.t. T π is defined as E1 (F; π) = sup inf Q∈F Q ∈F Q − T πQ ν . Taking again a pessimistic approach, the one-step Bellman-error of F is defined as E1 (F, Π) = sup E1 (F; π). π∈Π Typically by increasing F, E1 (F, Π) can be made smaller (this is discussed at some length in [3]). However, it also holds for both Π and F that making them bigger will increase their capacity (pseudo-dimensions) which leads to an increase of the estimation errors. Hence, F and Π must be selected to balance the approximation and estimation errors, just like in supervised learning. 3.3 The main result Theorem 3.2. Let πK be a greedy policy w.r.t. QK , i.e. πK (x) ∈ argmaxa∈A QK (x, a). Then under Assumptions A1, A2, and A5–A9, for all δ > 0 we have with probability at least 1 − δ: given Assumption A3 (respectively A4), V ∗ − V πK ∞ (resp. V ∗ − V πK 1,ρ ), is bounded by    d 1+1 κ+1   A   4κ (log N + log(K/δ))  + γK , C E1 (F, Π) + e∗ (F, Π) + 1/4   N   where C depends on dA , VF + , (VΠ+ )dA , γ, κ, b, β, Cν (resp. Cρ,ν ), Γν , LA , LP ,Lr , α, λ(A), π0 , k=1 k κ+1 ˆ Qmax , Rmax , Rmax , and A∞ . In particular, C scales with V 4κ(dA +1) , where V = 2VF + + VΠ+ plays the role of the “combined effective” dimension of F and Π. 7 Readers not familiar with VC-dimension are suggested to consult a book, such as the one by Anthony and Bartlett [14]. 7 4 Discussion We have presented what we believe is the first finite-time bounds for continuous-state and actionspace RL that uses value functions. Further, this is the first analysis of fitted Q-iteration, an algorithm that has proved to be useful in a number of cases, even when used with non-averagers for which no previous theoretical analysis existed (e.g., [15, 16]). 
In fact, our main motivation was to show that there is a systematic way of making these algorithms work and to point at possible problem sources the same time. We discussed why it can be difficult to make these algorithms work in practice. We suggested that either the set of action-value candidates has to be carefully controlled (e.g., assuming uniform Lipschitzness w.r.t. the state variables), or a policy search step is needed, just like in actorcritic algorithms. The bound in this paper is similar in many respects to a previous bound of a Bellman-residual minimization algorithm [2]. It looks that the techniques developed here can be used to obtain results for that algorithm when it is applied to continuous action spaces. Finally, although we have not explored them here, consistency results for FQI can be obtained from our results using standard methods, like the methods of sieves. We believe that the methods developed here will eventually lead to algorithms where the function approximation methods are chosen based on the data (similar to adaptive regression methods) so as to optimize performance, which in our opinion is one of the biggest open questions in RL. Currently we are exploring this possibility. Acknowledgments Andr´ s Antos would like to acknowledge support for this project from the Hungarian Academy of Sciences a (Bolyai Fellowship). Csaba Szepesv´ ri greatly acknowledges the support received from the Alberta Ingenuity a Fund, NSERC, the Computer and Automation Research Institute of the Hungarian Academy of Sciences. References [1] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. In COLT-19, pages 574–588, 2006. [2] A. Antos, Cs. Szepesv´ ri, and R. Munos. Learning near-optimal policies with Bellman-residual minia mization based fitted policy iteration and a single sample path. Machine Learning, 2007. (accepted). [3] A. Antos, Cs. Szepesv´ ri, and R. Munos. Value-iteration based fitted policy iteration: learning with a a single trajectory. In IEEE ADPRL, pages 330–337, 2007. [4] D. P. Bertsekas and S.E. Shreve. Stochastic Optimal Control (The Discrete Time Case). Academic Press, New York, 1978. [5] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005. [6] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Bradford Book. MIT Press, 1998. [7] N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines (and other kernel-based learning methods). Cambridge University Press, 2000. [8] J.A. Boyan and A.W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In NIPS-7, pages 369–376, 1995. [9] P.L. Bartlett, P.M. Long, and R.C. Williamson. Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52:434–452, 1996. [10] A.N. Kolmogorov and V.M. Tihomirov. -entropy and -capacity of sets in functional space. American Mathematical Society Translations, 17(2):277–364, 1961. [11] R. Munos and Cs. Szepesv´ ri. Finite time bounds for sampling based fitted value iteration. Technical a report, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Kende u. 13-17, Budapest 1111, Hungary, 2006. [12] A.Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. 
In Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 406–415, 2000. [13] P.L. Bartlett and A. Tewari. Sample complexity of policy search with known dynamics. In NIPS-19. MIT Press, 2007. [14] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999. [15] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In 16th European Conference on Machine Learning, pages 317–328, 2005. [16] S. Kalyanakrishnan and P. Stone. Batch reinforcement learning in a complex domain. In AAMAS-07, 2007. 8
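For reference, a generic fitted Q-iteration loop in the style of Ernst et al. [5] looks as follows, using extremely randomized trees as the regressor and a finite action grid for the greedy max; the contribution of the paper above is precisely to replace that max with a search over a restricted policy class when actions are continuous. Shapes and hyperparameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(X, A, R, Xn, actions, gamma=0.95, K=50):
    """Generic FQI on a batch of transitions.
    X: (N, dx) states, A: (N, da) actions, R: (N,) rewards,
    Xn: (N, dx) next states, actions: iterable of (da,) candidate actions."""
    Z = np.hstack([X, A])                         # regressor inputs (x, a)
    q = None
    for _ in range(K):
        if q is None:
            targets = R                           # first iterate regresses rewards
        else:
            best = np.full(len(Xn), -np.inf)      # max_b Q_k(x', b) over the grid
            for a in actions:
                Za = np.hstack([Xn, np.tile(a, (len(Xn), 1))])
                best = np.maximum(best, q.predict(Za))
            targets = R + gamma * best
        q = ExtraTreesRegressor(n_estimators=50).fit(Z, targets)
    return q                                      # greedy policy: argmax_a q(x, a)
```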

4 0.59528792 140 nips-2007-Neural characterization in partially observed populations of spiking neurons

Author: Jonathan W. Pillow, Peter E. Latham

Abstract: Point process encoding models provide powerful statistical methods for understanding the responses of neurons to sensory stimuli. Although these models have been successfully applied to neurons in the early sensory pathway, they have fared less well capturing the response properties of neurons in deeper brain areas, owing in part to the fact that they do not take into account multiple stages of processing. Here we introduce a new twist on the point-process modeling approach: we include unobserved as well as observed spiking neurons in a joint encoding model. The resulting model exhibits richer dynamics and more highly nonlinear response properties, making it more powerful and more flexible for fitting neural data. More importantly, it allows us to estimate connectivity patterns among neurons (both observed and unobserved), and may provide insight into how networks process sensory input. We formulate the estimation procedure using variational EM and the wake-sleep algorithm, and illustrate the model’s performance using a simulated example network consisting of two coupled neurons.
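The building block of such point-process encoding models is a Poisson GLM, which can be fit by ascending its log-likelihood; the sketch below does only that, and the latent-neuron and variational-EM machinery of the paper is not attempted. The exponential nonlinearity, learning rate, and clipping are assumptions.

```python
import numpy as np

def fit_poisson_glm(X, counts, iters=200, lr=1e-3):
    """Maximum-likelihood fit of a Poisson GLM with exponential nonlinearity:
    counts_t ~ Poisson(exp(x_t . w)). Rows of X can stack stimulus and
    spike-history covariates. Plain gradient ascent on the log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        rate = np.exp(np.clip(X @ w, -30, 30))    # clip for numerical safety
        w += lr * (X.T @ (counts - rate))         # grad log-lik = X^T (y - rate)
    return w
```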

5 0.59472954 104 nips-2007-Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

Author: Maneesh Sahani, Byron M. Yu, John P. Cunningham, Krishna V. Shenoy

Abstract: Neural spike trains present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised estimate of the spike train’s underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We test the performance of the method on simulated data and experimentally gathered neural spike trains, and we demonstrate improvements over conventional estimators. 1
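A simplified stand-in for the idea, assuming binned spike counts are treated as noisy Gaussian observations of the rate rather than through a full point-process likelihood: standard GP regression gives a posterior mean and variance for the smoothed firing rate. The kernel form and hyperparameter values below are illustrative assumptions.

```python
import numpy as np

def gp_rate_posterior(t, counts, t_star, ell=0.05, sf=5.0, noise=1.0):
    """GP regression on binned spike counts: posterior mean and variance
    of the underlying rate at test times t_star, with an RBF kernel."""
    k = lambda a, b: sf**2 * np.exp(-0.5 * (a[:, None] - b[None]) ** 2 / ell**2)
    K = k(t, t) + noise * np.eye(len(t))          # train covariance + noise
    Ks = k(t_star, t)                             # test/train cross-covariance
    mu = counts.mean()
    mean = mu + Ks @ np.linalg.solve(K, counts - mu)
    var = sf**2 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, var
```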

6 0.58945471 195 nips-2007-The Generalized FITC Approximation

7 0.58237714 156 nips-2007-Predictive Matrix-Variate t Models

8 0.57594061 36 nips-2007-Better than least squares: comparison of objective functions for estimating linear-nonlinear models

9 0.57503444 63 nips-2007-Convex Relaxations of Latent Variable Training

10 0.57334101 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons

11 0.57168859 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

12 0.57123131 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

13 0.57036221 185 nips-2007-Stable Dual Dynamic Programming

14 0.56988907 174 nips-2007-Selecting Observations against Adversarial Objectives

15 0.56756979 200 nips-2007-The Tradeoffs of Large Scale Learning

16 0.56570673 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

17 0.56541896 164 nips-2007-Receptive Fields without Spike-Triggering

18 0.5649913 86 nips-2007-Exponential Family Predictive Representations of State

19 0.56421959 7 nips-2007-A Kernel Statistical Test of Independence

20 0.56388646 18 nips-2007-A probabilistic model for generating realistic lip movements from speech