jmlr jmlr2009 jmlr2009-50 jmlr2009-50-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Omid Madani, Michael Connor, Wiley Greiner
Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization
S. Albers and J. Westbrook. Self-organizing data structures. In A. Fiat and G. Woeginger, editors, Online Algorithms: The State of the Art, pages 31–51. Springer LNCS 1442, 1998. J. K. Anlauf and M. Biehl. The adatron: an adaptive perceptron algorithm. Europhysics Letters, 1989. H. Aradhye, G. Toderici, and J. Yagnik. Video2text: Learning to annotate video content. In IEEE Int. Conf. on Data Mining (ICDM) Workshop on Internet Multimedia Mining, 2009. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. R. Bayardo, Y. Ma, and R. Srikant. Scaling up all-pairs similarity search. In Proc. Int. World Wide Web Conference (WWW), 2007. A. Blum. Empirical support for winnow and wighted majority algorithms: Results on a calendar scheduling domain. Machine Learning, 26:5–23, 1997. A. Borodin and R. El Yaniv. Online Computation and Competitive Analysis. Cambridge University Press, 1998. V. R. Carvalho and W. Cohen. Single pass online learning. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2006. N. Cesa-Bianchi, Y. Freund, D. P. Helmbold, D. Haussler, R. Schapire, and M. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, 1997. T. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng. Nus-wide: A real-world web image database from national university of singapore. In Proc. of ACM Conf. on Image and Video Retrieval (CIVR’09), 2009. 2609 M ADANI , C ONNOR AND G REINER W. W. Cohen and Y. Singer. Context-senstive learning methods for text categorization. ACM Trans. on Information Systems (TOIS), 17:141–173, 1999. K. Crammer and Y. Singer. A new family of online algorithms for category ranking. Journal of Machine Learning Research (JMLR), 3:1025–1058, 2003a. K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. Journal of Machine Learning Research, 3:951–991, 2003b. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585, 2006. O. Dekel, J. Keshet, and Y. Singer. Large margin hierarchical classification. In Proc. Int. Conf. on Machine Learning (ICML), 2003. T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting codes. Journal of Artificial Intelligence Research, 2:263–286, 1995. S. Dumais and H. Chen. Hierarchical classification of web content. In Proc. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 2000. Y. Even-Zohar and D. Roth. A classification approach to word prediction. In Proc. of the 1st North Amercian Association of Computational Linguistics (NAACL), 2000. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. 24th Int. Conf. Very Large Scale Data Bases (VLDB), 1998. S. Fidler and A. Leonardis. Towards scalable representations of object categories: Learning a hierarchy of parts. In Proc. of IEEE Int. Conf. on Vision and Pattern Recognition (CVPR), 2007. D. A. Forsyth and J. Ponce. Computer Vision. Prentice Hall, 2003. Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296, 1999. Y. Freund, R. Schapire, Y. Singer, and M. Warmuth. Using and combining predictors that specialize. In Proc. ACM Symposum on Theory of Computing (STOC), 1997. E. Gabrilovich and S. Markovitch. Computing semantic relatedness using wikipida-baed explicit semantic analysis. In Proc. Int. Joint Conf. on AI (IJCAI), 2007. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NPCompleteness. W. H. Freeman, 1979. C. Genest and J. V. Zidek. Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1):114–148, 1986. C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001. 2610 L EARNING W HEN C ONCEPTS A BOUND P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In DIMACS: Series in Discrete Mathematics and Theoretical Computer Science: Special Issue on Eternal Memory Algorithms and Visualization, 1999. J. T. Goodman. A bit of progress in language modeling. Computer Speech and Language, 15(4): 403–434, October 2001. K. Grill-Spector and N. Kanwisher. Visual recognition, as soon as you know it is there, you know what it is. Pscychological Science, 16(2):152–160, 2005. M. Grobelnik and D. Mladenic. Efficient text categorization. In Text Mining Workshop at European Conf. on Machine Learning (ECML). 1998. S. Guha and A. McGregor. Space-efficient sampling. In Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2007. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001. C. J. Hsieh, K. J. Chang, C. J. Lin, and S. Sathiya Keerthi. A dual coordinate descent method for large-scale linear SVM. In Proc. Int. Conf. on Machine Learning (ICML), 2008. J. Huang, O. Madani, and C. Lee Giles. Error-driven generalist+experts (EDGE): A multi-stage ensemble framework for text categorization. In Proc. ACM Conf. on Information and Knowledge Management (CIKM), 2008. R. M. Karp, C. H. Papadimitriou, and S. Shenker. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Systems (TODS), 28:51–55, 2003. S. Keerthi and D. DeCoste. A modified finite newton method for fast solution of large scale linear svms. Journal of Machine Learning Research (JMLR), 6:341–361, 2005. D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proc. Int. Conf. on Machine Learning (ICML), 1997. W. Krauth and M. Mezard. Learning algorithms with optimal stability in neural networks. J. of Physics A, 20, 1987. K. Lang. Newsweeder: Learning to filter netnews. In Proc. Int. Conf. on Machine Learning, 1995. D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research (JMLR), 5:361–397, 2004. Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3), 2002. Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The perceptron algorithm with uneven margins. In Proc. Int. Conf. on Machine Learning (ICML), 2002. N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988. 2611 M ADANI , C ONNOR AND G REINER T. Liu, Y. Yang, H. Wan, H. Zeng, Z. Chen, and W. Ma. Support vector machines classification with very large scale taxonomy. SIGKDD Explorations, 7, 2005. O. Madani. Exploring massive learning via a prediction system. In AAAI Fall Symposium Series: Computational Approaches to Representation Change During Learning and Development, 2007a. O. Madani. Prediction games in infinitely rich worlds. Technical Report 2, Yahoo! Research (and workshop on Utility Based Data Mining (UBDM)’06), June 2007b. O. Madani and M. Connor. Ranked Recall: Efficient classification by efficient learning of indices that rank. Technical Report 3, Yahoo! Research, 2007. O. Madani and M. Connor. Large-scale many-class learning. In SIAM Conf. on Data Mining (SDM), 2008. O. Madani and J. Huang. On updates that constrain the features’ connections during learning. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2008. O. Madani, W. Greiner, D. Kempe, and M. Salavatipour. Recall systems: Efficient learning and use of category indices. In Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2007. O. Madani, H. Bui, and E. Yeh. Efficient online learning and prediction of users’ desktop actions. In Proc. Int. Joint Conf. on AI (IJCAI), 2009. C. Mesterharm. A multiclass linear learning algorithm related to Winnow. In Proc. Neural Information Processing Systems (NIPS), 2000. C. Mesterharm. Transforming linear-threshold learning algorithms into multiclass linear learning algorithms. Technical Report dcs-tr-460, Rutgers, 2001. G. L. Murphy. The Big Book of Concepts. MIT Press, 2002. J. Platt. Probabilities for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schlkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 61–74. MIT Press, 1999. D. R. Radev, H. Qi, H. Wu, and W. Fan. Evaluating web-based question answering systems. In Proc. Int. Conf. on Language Resources and Evaluation (LREC), 2002. H. Raghavan, O. Madani, and R. Jones. When will a human in the loop accelerate learning? quantifying the complexity of classification problems. In Int. Workshop on AI for Human Computing, at IJCAI, 2007. J. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling the poor assumption of Naive Bayes text classifiers. In Proc. Int. Conf. on Machine Learning (ICML), 2003. R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5, 2004. T. G. Rose, M. Stevenson, and Miles Whitehead. The reuters corpus vol. 1 - from yesterday’s news to tomorrow’s language resources. In Proc. Int. Conf. on Lang. Resources and Evaluation (LREC), 2002. 2612 L EARNING W HEN C ONCEPTS A BOUND F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34: 1–47, 2002. S. Shalev-Schwartz, Y. Singer, and N. Srebro. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In Proc. Int. Conf. on Machine Learning (ICML), 2007. S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381: 520–522, 1996. T. Turtle and J. Flood. Query evaluation: Strategies and optimizations. Information Processing & Management, 31(6), 1995. V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 2000. V. G. Vovk. Aggregating strategies. In Annual Workshop on Computational Learning Theory, 1990. J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947– 963, 2001. I. H. Witten, T. C. Bell, and A. Moffat. Managing Gigabytes: Compressing and Indexing Documents and Images. John Wiley & Sons, 1994. GR. Xue, D. Xing, Q. Yang, and Y. Yu. Deep classification in large-scale text hierarchies. In Proc. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR), 2008. 2613