NIPS 2007, paper 45 (nips2007-45)
Authors: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-Yeung Shum
Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information.
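As a rough illustration of the decision rule described in the abstract, the sketch below implements a simplified version of the minimum-incremental-coding-length criterion in Python, assuming the Gaussian lossy coding length used in [13, 21]. The distortion parameter eps, the label-coding term, and the function names are illustrative assumptions, not the paper's exact formulation or tuned settings.

```python
import numpy as np

def coding_length(X, eps):
    """Approximate bits needed to code the columns of X (d x n) up to mean
    squared distortion eps^2, following the lossy coding length of [13, 21]:
    (n+d)/2 * log2 det(I + d/(eps^2 n) Xc Xc^T) plus a term for the mean."""
    d, n = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    # log-determinant computed via slogdet for numerical stability
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (eps ** 2 * n)) * (Xc @ Xc.T))
    data_bits = (n + d) / 2.0 * logdet / np.log(2)
    # bits to code the class mean: d/2 * log2(1 + |mu|^2 / eps^2)
    mean_bits = d / 2.0 * np.log2(1.0 + (mu.T @ mu).item() / eps ** 2)
    return data_bits + mean_bits

def micl_classify(x, class_data, eps=1.0):
    """Assign x to the class whose training data needs the fewest additional
    bits to also encode x, plus the cost of coding the class label."""
    n_total = sum(Xj.shape[1] for Xj in class_data.values())
    best_label, best_bits = None, np.inf
    for label, Xj in class_data.items():
        # incremental coding length of x with respect to class `label`
        delta = coding_length(np.column_stack([Xj, x]), eps) - coding_length(Xj, eps)
        label_bits = -np.log2(Xj.shape[1] / n_total)  # -log2 of the empirical class prior
        if delta + label_bits < best_bits:
            best_label, best_bits = label, delta + label_bits
    return best_label
```

For example, with class_data = {0: X0, 1: X1} holding d x n_j training matrices and x a length-d test vector, micl_classify(x, class_data, eps=0.5) returns the label whose class requires the fewest extra bits to encode x at that distortion level.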
[1] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Transactions on Information Theory, 44(6):2743–2760, 1998.
[2] R. Basri and D. Jacobs. Lambertian reflection and linear subspaces. PAMI, 25(2):218–233, 2003.
[3] P. Bickel and B. Li. Regularization in statistics. TEST, 15(2):271–344, 2006.
[4] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001.
[5] T. Cover and J. Thomas. Elements of Information Theory. Wiley Series in Telecommunications, 1991.
[6] J. Friedman. Regularized discriminant analysis. JASA, 84:165–175, 1989.
[7] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. PAMI, 23(6):643–660, 2001.
[8] P. Grunwald and J. Langford. Suboptimal behaviour of Bayes and MDL in classification under misspecification. In Proceedings of Conference on Learning Theory, 2004.
[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[11] K. Lee, J. Ho, and D. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. PAMI, 27(5):684–698, 2005.
[12] J. Li. A source coding approach to classification by vector quantization and the principle of minimum description length. In IEEE DCC, pages 382–391, 2002.
[13] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy data coding and compression. PAMI, 29(9):1546–1562, 2007.
[14] D. MacKay. Developments in probabilistic modelling with neural networks – ensemble learning. In Proc. 3rd Annual Symposium on Neural Networks, pages 191–198, 1995.
[15] M. Madiman, M. Harrison, and I. Kontoyiannis. Minimum description length vs. maximum likelihood in lossy data compression. In IEEE International Symposium on Information Theory, 2004.
[16] J. Rissanen. Modeling by shortest data description. Automatica, 14:465–471, 1978.
[17] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In Proceedings of NIPS, volume 5, 1993.
[18] P. Simard, D. Steinkraus, and J. Platt. Best practice for convolutional neural networks applied to visual document analysis. In ICDAR, pages 958–962, 2003.
[19] N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Allerton, 1999.
[20] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
[21] J. Wright, Y. Tao, Z. Lin, Y. Ma, and H. Shum. Classification via minimum incremental coding length (MICL). Technical report, UILU-ENG-07-2201, http://perception.csl.uiuc.edu/coding, 2007.