
Efficient multiple hyperparameter learning for log-linear models



Authors: Chuan-sheng Foo, Chuong B. Do, Andrew Y. Ng

Abstract: In problems where input features have varying amounts of noise, using distinct regularization hyperparameters for different features provides an effective means of managing model complexity. While regularizers for neural networks and support vector machines often rely on multiple hyperparameters, regularizers for structured prediction models (used in tasks such as sequence labeling or parsing) typically rely only on a single shared hyperparameter for all features. In this paper, we consider the problem of choosing regularization hyperparameters for log-linear models, a class of probabilistic models for structured prediction that includes conditional random fields (CRFs). Using an implicit differentiation trick, we derive an efficient gradient-based method for learning Gaussian regularization priors with multiple hyperparameters. In both simulations and the real-world task of computational RNA secondary structure prediction, we find that multiple hyperparameter learning can provide a significant boost in accuracy compared to using only a single regularization hyperparameter.
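As a rough illustration of the approach the abstract describes, the sketch below applies implicit differentiation to learn per-feature log-precisions for an L2-regularized logistic regression model, a much simpler stand-in for the paper's log-linear/CRF setting. This is a hedged sketch under stated assumptions, not the authors' implementation: the logistic model, the BFGS inner solver, the dense Hessian solve, and all names (train_obj, hyper_grad, etc.) are illustrative choices; the paper works with CRF likelihoods and would typically avoid forming the Hessian explicitly.

```python
# Hedged sketch, NOT the authors' code: gradient-based learning of multiple
# regularization hyperparameters via implicit differentiation, illustrated on
# per-feature L2-regularized logistic regression instead of a CRF.
import jax
import jax.numpy as jnp
from jax.scipy.optimize import minimize


def nll(w, X, y):
    # average logistic negative log-likelihood, labels y in {-1, +1}
    return jnp.mean(jnp.logaddexp(0.0, -y * (X @ w)))


def train_obj(w, d, X, y):
    # training objective: NLL plus a Gaussian prior with per-feature
    # precisions exp(d_i); the d_i are the log-scale hyperparameters
    return nll(w, X, y) + 0.5 * jnp.sum(jnp.exp(d) * w ** 2)


def hyper_grad(d, Xtr, ytr, Xva, yva):
    # inner problem: w*(d) = argmin_w train_obj(w, d, Xtr, ytr)
    w0 = jnp.zeros(Xtr.shape[1])
    w_star = minimize(train_obj, w0, args=(d, Xtr, ytr), method="BFGS").x
    # implicit function theorem at the optimum (grad_w train_obj = 0):
    #   dw*/dd = -H^{-1} B,  where H is the Hessian of train_obj in w and
    #   B[:, i] = exp(d_i) * w*_i * e_i  (diagonal for this Gaussian prior)
    H = jax.hessian(train_obj, argnums=0)(w_star, d, Xtr, ytr)
    g_val = jax.grad(nll, argnums=0)(w_star, Xva, yva)    # dL_val/dw at w*
    v = jnp.linalg.solve(H, g_val)                        # H^{-1} dL_val/dw
    return -v * jnp.exp(d) * w_star, w_star               # dL_val/dd, w*


if __name__ == "__main__":
    # toy data: only the first two of six features carry signal
    k1, k2 = jax.random.split(jax.random.PRNGKey(0))
    X = jax.random.normal(k1, (120, 6))
    y = jnp.sign(X @ jnp.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])
                 + 0.3 * jax.random.normal(k2, (120,)))
    Xtr, ytr, Xva, yva = X[:80], y[:80], X[80:], y[80:]

    d = jnp.zeros(6)                      # one log-precision per feature
    for _ in range(25):                   # plain gradient descent on d
        g, w_star = hyper_grad(d, Xtr, ytr, Xva, yva)
        d = d - 0.5 * g
    print("learned log-precisions:", d)
```

In this toy setup the noisy, uninformative features should end up with larger learned precisions (stronger regularization) than the informative ones, which is the behavior the multiple-hyperparameter approach is meant to produce; a CRF version would replace the logistic NLL with the structured log-likelihood and the dense solve with conjugate gradients on Hessian-vector products.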


References

[1] L. Andersen, J. Larsen, L. Hansen, and M. Hintz-Madsen. Adaptive regularization of neural classifiers. In NNSP, 1997.

[2] D. Anguita, S. Ridella, F. Rivieccio, and R. Zunino. Hyperparameter design criteria for support vector classifiers. Neurocomputing, 55:109–134, 2003.

Footnote 4: Following [7], we used the maximum expected accuracy algorithm for decoding, which returns a set of candidate parses reflecting different trade-offs between sensitivity (proportion of true base-pairs called) and specificity (proportion of called base-pairs which are correct); a short illustration of these two measures appears after the reference list.

[3] Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12:1889–1900, 2000.

[4] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee. Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3):131–159, 2002.

[5] D. Chen and M. Hagan. Optimal use of regularization and cross-validation in neural network modeling. In IJCNN, 1999.

[6] C. B. Do, S. S. Gross, and S. Batzoglou. CONTRAlign: discriminative training for protein sequence alignment. In RECOMB, pages 160–174, 2006.

[7] C. B. Do, D. A. Woods, and S. Batzoglou. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14):e90–e98, 2006.

[8] K. Duan, S. S. Keerthi, and A. N. Poo. Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51(4):41–59, 2003.

[9] R. Eigenmann and J. A. Nossek. Gradient based adaptive regularization. In NNSP, pages 87–94, 1999.

[10] T. Glasmachers and C. Igel. Gradient-based adaptation of general Gaussian kernels. Neural Comp., 17(10):2099–2105, 2005.

[11] A. Globerson, T. Y. Koo, X. Carreras, and M. Collins. Exponentiated gradient algorithms for log-linear structured prediction. In ICML, pages 305–312, 2007.

[12] C. Goutte and J. Larsen. Adaptive regularization of neural networks using conjugate gradient. In ICASSP, 1998.

[13] S. Griffiths-Jones, S. Moxon, M. Marshall, A. Khanna, S. R. Eddy, and A. Bateman. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res, 33:D121–D124, 2005.

[14] A. Kapoor, Y. Qi, H. Ahn, and R. W. Picard. Hyperparameter and kernel learning for graph based semisupervised classification. In NIPS, pages 627–634, 2006.

[15] S. S. Keerthi. Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Transactions on Neural Networks, 13(5):1225–1229, 2002.

[16] S. S. Keerthi, V. Sindhwani, and O. Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In NIPS, 2007.

[17] K. Kobayashi, D. Kitakoshi, and R. Nakano. Yet faster method to optimize SVR hyperparameters based on minimizing cross-validation error. In IJCNN, volume 2, pages 871–876, 2005.

[18] K. Kobayashi and R. Nakano. Faster optimization of SVR hyperparameters based on minimizing crossvalidation error. In IEEE Conference on Cybernetics and Intelligent Systems, 2004.

[19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.

[20] J. Larsen, L. K. Hansen, C. Svarer, and M. Ohlsson. Design and regularization of neural networks: the optimal use of a validation set. In NNSP, 1996.

[21] J. Larsen, C. Svarer, L. N. Andersen, and L. K. Hansen. Adaptive regularization in neural network modeling. In Neural Networks: Tricks of the Trade, pages 113–132, 1996.

[22] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.

[23] D. J. C. MacKay and R. Takeuchi. Interpolation models with multiple hyperparameters. Statistics and Computing, 8:15–23, 1998.

[24] J. R. R. A. Martins, P. Sturdza, and J. J. Alonso. The complex-step derivative approximation. ACM Trans. Math. Softw., 29(3):245–262, 2003.

[25] T. P. Minka. Expectation propagation for approximate Bayesian inference. In UAI, volume 17, pages 362–369, 2001.

[26] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: approximate MCMC algorithms. In UAI, pages 392–399, 2004.

[27] R. M. Neal. Bayesian Learning for Neural Networks. Springer, 1996.

[28] A. Y. Ng. Preventing overfitting of cross-validation data. In ICML, pages 245–253, 1997.

[29] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.

[30] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 1999.

[31] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.

[32] Y. Qi, M. Szummer, and T. P. Minka. Bayesian conditional random fields. In AISTATS, 2005.

[33] M. Seeger. Cross-validation optimization for large scale hierarchical classification kernel methods. In NIPS, 2007.

[34] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL, pages 134–141, 2003.

[35] S. Sundararajan and S. S. Keerthi. Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Comp., 13(5):1103–1118, 2001.

[36] S. V. N. Vishwanathan, N. N. Schraudolph, M. W. Schmidt, and K. P. Murphy. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, pages 969–976, 2006.

[37] M. Welling and S. Parise. Bayesian random fields: the Bethe-Laplace approximation. In ICML, 2006.

[38] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

[39] X. Zhang and W. S. Lee. Hyperparameter learning for graph based semi-supervised learning algorithms. In NIPS, 2007.
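The footnote attached to reference [2] above defines the two accuracy measures used to compare candidate parses. The following purely illustrative sketch (the helper name base_pair_accuracy and the toy base pairs are assumptions, not taken from the paper) shows how those measures can be computed for sets of predicted base pairs:

```python
# Hedged illustration (not from the paper): sensitivity and specificity of a
# set of predicted base pairs against the true structure, as defined in the
# footnote to reference [2] above.
def base_pair_accuracy(predicted, true):
    """predicted, true: sets of (i, j) index pairs for paired bases."""
    correct = len(predicted & true)
    sensitivity = correct / len(true) if true else 0.0             # true pairs called
    specificity = correct / len(predicted) if predicted else 0.0   # called pairs correct
    return sensitivity, specificity


# Example: two of three true pairs recovered, plus one spurious call.
print(base_pair_accuracy({(1, 10), (2, 9), (5, 20)}, {(1, 10), (2, 9), (3, 8)}))
# -> (0.666..., 0.666...)
```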