emnlp emnlp2012 emnlp2012-120 emnlp2012-120-reference knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Benjamin Van Durme
Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification.
Rahul Bhagat and Deepak Ravichandran. 2008. Large Scale Acquisition of Paraphrases for Learning Surface Patterns. In Proceedings of ACL.
Constantinos Boulis and Mari Ostendorf. 2005. A quantitative analysis of lexical differences between genders in telephone conversations. In Proceedings of ACL.
John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on twitter. In Proceedings of EMNLP.
Moses Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of STOC.
Christopher Cieri, David Miller, and Kevin Walker. 2004. The fisher corpus: a resource for the next generations of speech-to-text. In Proceedings of LREC.
Graham Cormode and Marios Hadjieleftheriou. 2009. Finding the frequent items in streams of data. Communications of the ACM, 52(10):97–105.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585.
Christopher P. Diehl, Galileo Namata, and Lise Getoor. 2007. Relationship identification for social network discovery. In Proceedings of AAAI.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A library for large linear classification. Journal of Machine Learning Research, (9).
Philippe Flajolet. 1985. Approximate counting: a detailed analysis. BIT, 25(1):113–134.
Nikesh Garera and David Yarowsky. 2009. Modeling latent biographic attributes in conversational genres. In Proceedings of ACL.
John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In Proceedings of ICASSP.
Amit Goyal, Hal Daumé III, and Suresh Venkatasubramanian. 2009. Streaming for large scale NLP: Language Modeling. In Proceedings of NAACL.
Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of STOC.
Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of the International Conference on Web Search and Web Data Mining, pages 219–230.
Abby Levenberg and Miles Osborne. 2009. Stream-based randomised language models for smt. In Proceedings of EMNLP.
Gurmeet Singh Manku and Rajeev Motwani. 2002. Approximate frequency counts over data streams. In Proceedings of the 28th international conference on Very Large Data Bases (VLDB).
Robert Morris. 1978. Counting large numbers of events in small registers. Communications of the ACM, 21(10):840–842.
S. Muthu Muthukrishnan. 2005. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2).
Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Proceedings of ICWSM.
Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of ACL.
Sasa Petrovic, Miles Osborne, and Victor Lavrenko. 2010. Streaming first story detection with application to twitter. In Proceedings of NAACL.
Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents (SMUC).
Deepak Ravichandran, Patrick Pantel, and Eduard Hovy. 2005. Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering. In Proceedings of ACL.
David Talbot and Thorsten Brants. 2008. Randomized language models via perfect hash functions. In Proceedings of ACL.
David Talbot and Miles Osborne. 2007a. Randomised language modelling for statistical machine translation. In Proceedings of ACL.
David Talbot and Miles Osborne. 2007b. Smoothed Bloom filter language models: Tera-Scale LMs on the Cheap. In Proceedings of EMNLP.
David Talbot. 2009. Succinct approximate counting of skewed data. In Proceedings of IJCAI.
Benjamin Van Durme and Ashwin Lall. 2009. Probabilistic Counting with Randomized Storage. In Proceedings of IJCAI.
Benjamin Van Durme and Ashwin Lall. 2010. Online Generation of Locality Sensitive Hash Signatures. In Proceedings of ACL.
Benjamin Van Durme and Ashwin Lall. 2011. Efficient Online Locality Sensitive Hashing via Reservoir Counting. In Proceedings of ACL.
Benjamin Van Durme. 2012. Jerboa: A toolkit for randomized and streaming algorithms. Technical Report 7, Human Language Technology Center of Excellence, Johns Hopkins University.
Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw., 11:37–57, March.