Authors: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and their correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake-bound model. Empirical evaluation on both synthetic and text data shows that our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.
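To make the mechanism concrete, below is a minimal sketch of a CW-style update with a diagonal covariance, in the spirit of the closed-form "Stdev" variant the paper derives: the confidence constraint y(μ·x) ≥ φ√(xᵀΣx) is checked on each example, and on violation the mean moves along Σx while the variances of the observed features shrink. The class name is illustrative and the coefficient formulas are a reconstruction, to be checked against the paper rather than treated as the authors' reference implementation.

```python
import numpy as np
from scipy.stats import norm


class DiagonalCW:
    """Sketch of confidence-weighted learning with a diagonal covariance."""

    def __init__(self, dim, eta=0.9, a=1.0):
        self.mu = np.zeros(dim)        # mean of the Gaussian over weight vectors
        self.sigma = a * np.ones(dim)  # diagonal of the covariance matrix
        self.phi = norm.ppf(eta)       # confidence parameter: phi = Phi^{-1}(eta)

    def predict(self, x):
        return np.sign(self.mu @ x)

    def update(self, x, y):
        """One online step on example x with label y in {-1, +1}."""
        phi = self.phi
        psi, zeta = 1.0 + phi**2 / 2.0, 1.0 + phi**2
        m = y * (self.mu @ x)        # signed margin under the mean weights
        v = (self.sigma * x) @ x     # margin variance x^T Sigma x
        if v == 0.0:
            return  # no observed features, nothing to update
        # Closed-form step size; it is zero exactly when the confidence
        # constraint y(mu.x) >= phi * sqrt(x^T Sigma x) already holds.
        alpha = max(0.0, (-m * psi + np.sqrt(m**2 * phi**4 / 4.0
                                             + v * phi**2 * zeta)) / (v * zeta))
        if alpha == 0.0:
            return
        sqrt_u = 0.5 * (-alpha * v * phi
                        + np.sqrt(alpha**2 * v**2 * phi**2 + 4.0 * v))
        beta = alpha * phi / (sqrt_u + v * alpha * phi)
        self.mu += alpha * y * self.sigma * x       # mean step along Sigma x
        self.sigma -= beta * (self.sigma * x) ** 2  # shrink variance of seen features
```

On each round, predict labels the incoming example and update is then called with the true label; examples that already satisfy the confidence constraint leave the distribution unchanged, which gives the method its passive-aggressive flavor.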
[1] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.
[2] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44(3):427–485, May 1997.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.
[4] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.
[5] M. Dredze and K. Crammer. Active learning with confidence. In ACL, 2008.
[6] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning (ICML), 2008.
[7] E. Harrington, R. Herbrich, J. Kivinen, J. Platt, and R. C. Williamson. Online Bayes point machines. In 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2003.
[8] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.
[9] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Workshop on Artificial Intelligence and Statistics, 1997.
[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.
[11] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[13] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.
[14] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2007.
[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[16] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. (Reprinted in Neurocomputing, MIT Press, 1988.)
[17] P. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In AISTATS, 2007.
[18] R. S. Sutton. Adapting bias by gradient descent: an incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992.
[19] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.
[20] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, 2006.