
Efficient multiple hyperparameter learning for log-linear models



Authors: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction that includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter.
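As a rough illustration of the approach the abstract describes, the sketch below applies implicit differentiation to learn per-feature log-precisions for an L2-regularized logistic regression model, a much simpler stand-in for the paper's log-linear/CRF setting. This is a hedged sketch under stated assumptions, not the authors' implementation: the logistic model, the BFGS inner solver, the dense Hessian solve, and all names (train_obj, hyper_grad, etc.) are illustrative choices; the paper works with CRF likelihoods and would typically avoid forming the Hessian explicitly.

```python
# Hedged sketch, NOT the authors' code: gradient-based learning of multiple
# regularization hyperparameters via implicit differentiation, illustrated on
# per-feature L2-regularized logistic regression instead of a CRF.
import jax
import jax.numpy as jnp
from jax.scipy.optimize import minimize


def nll(w, X, y):
    # average logistic negative log-likelihood, labels y in {-1, +1}
    return jnp.mean(jnp.logaddexp(0.0, -y * (X @ w)))


def train_obj(w, d, X, y):
    # training objective: NLL plus a Gaussian prior with per-feature
    # precisions exp(d_i); the d_i are the log-scale hyperparameters
    return nll(w, X, y) + 0.5 * jnp.sum(jnp.exp(d) * w ** 2)


def hyper_grad(d, Xtr, ytr, Xva, yva):
    # inner problem: w*(d) = argmin_w train_obj(w, d, Xtr, ytr)
    w0 = jnp.zeros(Xtr.shape[1])
    w_star = minimize(train_obj, w0, args=(d, Xtr, ytr), method="BFGS").x
    # implicit function theorem at the optimum (grad_w train_obj = 0):
    #   dw*/dd = -H^{-1} B,  where H is the Hessian of train_obj in w and
    #   B[:, i] = exp(d_i) * w*_i * e_i  (diagonal for this Gaussian prior)
    H = jax.hessian(train_obj, argnums=0)(w_star, d, Xtr, ytr)
    g_val = jax.grad(nll, argnums=0)(w_star, Xva, yva)    # dL_val/dw at w*
    v = jnp.linalg.solve(H, g_val)                        # H^{-1} dL_val/dw
    return -v * jnp.exp(d) * w_star, w_star               # dL_val/dd, w*


if __name__ == "__main__":
    # toy data: only the first two of six features carry signal
    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    X = jax.random.normal(k1, (120, 6))
    y = jnp.sign(X @ jnp.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
                 + 0.3 * jax.random.normal(k2, (120,)))
    Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]

    d = jnp.zeros(6)                      # one log-precision per feature
    for _ in range(25):                   # plain gradient descent on d
        g, w_star = hyper_grad(d, Xtr, ytr, Xva, yva)
        d = d - 0.5 * g
    print("learned log-precisions:", d)
```

In this toy setup the noisy, uninformative features should end up with larger learned precisions (stronger regularization) than the informative ones, which is the behavior the multiple-hyperparameter approach is meant to produce; a CRF version would replace the logistic NLL with the structured log-likelihood and the dense solve with conjugate gradients on Hessian-vector products.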


References

[1] L. Andersen, J. Larsen, L. Hansen, and M. Hintz-Madsen. Adaptive regularization of neural classifiers. In NNSP, 1997.

[2] D. Anguita, S. Ridella, F. Rivieccio, and R. Zunino. Hyperparameter design criteria for support vector classifiers. Neurocomputing, 55:109–134, 2003.

Footnote 4: Following [7], we used the maximum expected accuracy algorithm for decoding, which returns a set of candidate parses reflecting different trade-offs between sensitivity (proportion of true base-pairs called) and specificity (proportion of called base-pairs which are correct); a short illustration of these two measures appears after the reference list.

[3] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12:1889–1900, 2000.

[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.

[5] D. Chen and M. Hagan. Optimal use of regularization and cross-validation in neural network modeling. In IJCNN, 1999.

[6] C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: discriminative training for protein sequence alignment. In RECOMB, pages 160–174, 2006.

[7] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–e98, 2006.

[8] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51(4):41–59, 2003.

[9] R. Eigenmann and J. A. Nossek. Gradient based adaptive regularization. In NNSP, pages 87–94, 1999.

[10] T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Comp., 17(10):2099–2105, 2005.

[11] A. Globerson, T. Y. Koo, X. Carreras, and M. Collins. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312, 2007.

[12] C. Goutte and J. Larsen. Adaptive regularization of neural networks using conjugate gradient. In ICASSP, 1998.

[13] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33:D121–D124, 2005.

[14] A. Kapoor, Y. Qi, H. Ahn, and R. W. Picard. Hyperparameter and kernel learning for graph based semisupervised classification. In NIPS, pages 627–634, 2006.

[15] S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225–1229, 2002.

[16] S. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In NIPS, 2007.

[17] K. Kobayashi, D. Kitakoshi, and R. Nakano. Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. In IJCNN, volume 2, pages 871–876, 2005.

[18] K. Kobayashi and R. Nakano. Faster optimization of SVR hyperparameters based on minimizing crossvalidation error. In IEEE Conference on Cybernetics and Intelligent Systems, 2004.

[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[20] J. Larsen, L. K. Hansen, C. Svarer, and M. Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In NNSP, 1996.

[21] J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132, 1996.

[22] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[23] D. J. C. MacKay and R. Takeuchi. Interpolation models with multiple hyperparameters. Statistics and Computing, 8:15–23, 1998.

[24] J. R. R. A. Martins, P. Sturdza, and J. J. Alonso. The complex-step derivative approximation. ACM Trans. Math. Softw., 29(3):245–262, 2003.

[25] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, volume 17, pages 362–369, 2001.

[26] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In UAI, pages 392–399, 2004.

[27] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

[28] A. Y. Ng. Preventing overfitting of cross-validation data. In ICML, pages 245–253, 1997.

[29] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.

[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

[31] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[32] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005.

[33] M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS, 2007.

[34] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, pages 134–141, 2003.

[35] S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Comp., 13(5):1103–1118, 2001.

[36] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, pages 969–976, 2006.

[37] M. Welling and S. Parise. Bayesian random fields: the Bethe-Laplace approximation. In ICML, 2006.

[38] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

[39] X. Zhang and W. S. Lee. Hyperparameter learning for graph based semi-supervised learning algorithms. In NIPS, 2007.
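The footnote attached to reference [2] above defines the two accuracy measures used to compare candidate parses. The following purely illustrative sketch (the helper name base_pair_accuracy and the toy base pairs are assumptions, not taken from the paper) shows how those measures can be computed for sets of predicted base pairs:

```python
# Hedged illustration (not from the paper): sensitivity and specificity of a
# set of predicted base pairs against the true structure, as defined in the
# footnote to reference [2] above.
def base_pair_accuracy(predicted, true):
    """predicted, true: sets of (i, j) index pairs for paired bases."""
    correct = len(predicted & true)
    sensitivity = correct / len(true) if true else 0.0             # true pairs called
    specificity = correct / len(predicted) if predicted else 0.0   # called pairs correct
    return sensitivity, specificity


# Example: two of three true pairs recovered, plus one spurious call.
print(base_pair_accuracy({(1, 10), (2, 9), (5, 20)}, {(1, 10), (2, 9), (3, 8)}))
# -> (0.666..., 0.666...)
```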