
65 nips-2013-Compressive Feature Learning


Source: pdf

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning.
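
As a rough illustration of the approach sketched in the abstract, the Python snippet below selects word k-grams that exactly cover a toy document while trading off per-pointer cost against dictionary storage cost, relaxing the binary program to a linear program and reweighting it in the spirit of Candès, Wakin, and Boyd [5]. The toy document, the unit pointer cost, the length-proportional storage cost, and the reweighting schedule are all illustrative assumptions, not the authors' implementation (which is detailed in the paper and its supplement [21]).

# Minimal sketch (assumptions noted above), not the authors' implementation.
import numpy as np
from scipy.optimize import linprog

doc = "the cat sat on the mat the cat sat".split()
K = 2  # maximum k-gram length considered

# Enumerate candidate pointers: (start position, k-gram) pairs.
pointers = [(i, tuple(doc[i:i + k]))
            for k in range(1, K + 1)
            for i in range(len(doc) - k + 1)]
grams = sorted({g for _, g in pointers})
gram_idx = {g: j for j, g in enumerate(grams)}

n_ptr, n_gram = len(pointers), len(grams)
n_var = n_ptr + n_gram  # variables: pointer uses x, then dictionary flags d

# Equality constraints: every word position is covered by exactly one pointer.
A_eq = np.zeros((len(doc), n_var))
for p, (i, g) in enumerate(pointers):
    A_eq[i:i + len(g), p] = 1.0
b_eq = np.ones(len(doc))

# Inequality constraints: a pointer may be used only if its k-gram is stored,
# i.e. x_p - d_{g(p)} <= 0.
A_ub = np.zeros((n_ptr, n_var))
for p, (i, g) in enumerate(pointers):
    A_ub[p, p] = 1.0
    A_ub[p, n_ptr + gram_idx[g]] = -1.0
b_ub = np.zeros(n_ptr)

bounds = [(0.0, 1.0)] * n_var                   # LP relaxation of {0,1}
ptr_cost = np.ones(n_ptr)                       # assumed: unit cost per pointer use
dict_cost = np.array([len(g) for g in grams], dtype=float)  # assumed: storage cost = k-gram length

# Reweighted LP: reweighting the dictionary penalty pushes the relaxation
# toward a sparse, nearly binary dictionary selection.
w = np.ones(n_gram)
for _ in range(5):
    c = np.concatenate([ptr_cost, w * dict_cost])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    d = res.x[n_ptr:]
    w = 1.0 / (d + 1e-3)  # standard reweighting update

selected = [" ".join(g) for g, dv in zip(grams, d) if dv > 0.5]
print("selected dictionary k-grams:", selected)

On this toy input the repeated phrase "the cat sat" makes bigrams such as "the cat" cheaper to reuse than storing only unigrams, which is the intuition the paper scales up to full corpora.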


References

[1] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. PRL, 88(4):048702, 2002.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[3] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. JMLR, 7:2673–2698, 2006.

[4] P. Brucker. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters, 3(3):163–166, 1984.

[5] E. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted ℓ1 minimization. J Fourier Analysis and Applications, 14(5-6):877–905, 2008.

[6] R. Cilibrasi and P. M. Vitányi. Clustering by compression. TIT, 51(4):1523–1545, 2005.

[7] E. Frank, C. Chui, and I. Witten. Text categorization using compression models. Technical Report 00/02, University of Waikato, Department of Computer Science, 2000.

[8] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J Stat Softw, 33(1):1–22, 2010.

[9] E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5. In ICML, 2004.

[10] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. JMLR, 3:1157–1182, 2003.

[11] E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards parameter-free data mining. In KDD, 2004.

[12] D. Kim, S. Sra, and I. S. Dhillon. Tackling box-constrained optimization via a new projected quasi-Newton approach. SIAM Journal on Scientific Computing, 32(6):3548–3563, 2010.

[13] V. Kuleshov. Fast algorithms for sparse principal component analysis based on Rayleigh quotient iteration. In ICML, 2013.

[14] M. Lan, C. Tan, and H. Low. Proposing a new term weighting scheme for text categorization. In AAAI, 2006.

[15] K. Lang. NewsWeeder: Learning to filter netnews. In ICML, 1995.

[16] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.

[17] B. Li and C. Vogel. Improving multiclass text classification with error-correcting output coding and sub-class partitions. In Can Conf Adv Art Int, 2010.

[18] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. TKDE, 17(4):491–502, 2005.

[19] T. Liu, S. Liu, Z. Chen, and W. Ma. An evaluation on feature selection for text clustering. In ICML, 2003.

[20] A. Maas, R. Daly, P. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In ACL, 2011.

[21] H. S. Paskov, R. West, J. C. Mitchell, and T. J. Hastie. Supplementary material for Compressive Feature Learning, 2013.

[22] J. Rennie. 20 Newsgroups dataset, 2008. http://qwone.com/~jason/20Newsgroups (accessed May 31, 2013).

[23] J. Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.

[24] D. Sculley and C. E. Brodley. Compression and machine learning: A new perspective on feature space vectors. In DCC, 2006.

[25] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B, 58(1):267–288, 1996.

[26] Y. Yang and J. Pedersen. A comparative study on feature selection in text categorization. In ICML, 1997.

[27] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In NIPS, 2004.

[28] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. TIT, 23(3):337–343, 1977.

[29] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. JCGS, 15(2):265–286, 2006.