
Exact Convex Confidence-Weighted Learning



Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods.
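
A sketch of the confidence constraint underlying this framework, following the standard reduction used in [6] (the notation below is illustrative, not quoted from the paper): for a weight vector drawn as $w \sim \mathcal{N}(\mu, \Sigma)$, the margin $y\,(w \cdot x)$ is itself Gaussian with mean $y\,(\mu \cdot x)$ and variance $x^{\top} \Sigma\, x$, so the probability constraint admits a closed form:

\[
\Pr_{w \sim \mathcal{N}(\mu,\Sigma)}\left[\, y\,(w \cdot x) \ge 0 \,\right] \ge \eta
\;\Longleftrightarrow\;
y\,(\mu \cdot x) \ge \phi\, \sqrt{x^{\top} \Sigma\, x},
\qquad \phi = \Phi^{-1}(\eta),
\]

where $\Phi$ is the standard normal CDF and $\eta \ge 1/2$. As a set in $(\mu, \Sigma)$ this constraint is not convex, since $\sqrt{x^{\top} \Sigma\, x}$ is concave in $\Sigma$. Roughly, the paper's exact convex form comes from re-parametrizing through a square root of the covariance, $\Sigma = \Upsilon^2$, under which the term becomes the norm $\|\Upsilon x\|$, convex in $\Upsilon$, so the constraint defines a convex feasible set (see the paper for the precise derivation).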


References

[1] Y. Censor and S.A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.

[2] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the Association for Computing Machinery, 44(3):427–485, May 1997.

[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. A second-order perceptron algorithm. SIAM Journal on Computing, 34(3):640–668, 2005.

[4] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006.

[5] M. Dredze and K. Crammer. Active learning with confidence. In ACL, 2008.

[6] M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning, 2008.

[7] E. Harrington, R. Herbrich, J. Kivinen, J. Platt, and R. C. Williamson. Online Bayes point machines. In 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2003.

[8] R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.

[9] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Workshop on Artificial Intelligence and Statistics, 1997.

[10] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.

[11] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

[12] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

[13] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume XII, pages 615–622, 1962.

[14] K. B. Petersen and M. S. Pedersen. The Matrix Cookbook, 2007.

[15] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

[16] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958. Reprinted in Neurocomputing, MIT Press, 1988.

[17] P. Shivaswamy and T. Jebara. Ellipsoidal kernel machines. In AISTATS, 2007.

[18] R. S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992.

[19] M. E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

[20] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, 2006.