acl acl2012 acl2012-219 knowledge-graph by maker-knowledge-mining

219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Source: pdf

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We discuss the design and implementation of langid . [sent-7, score-0.779]

2 py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. [sent-8, score-0.199]

3 py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data. [sent-10, score-0.271]

4 1 Introduction Language identification (LangID) is the task of determining the natural language that a document is written in. [sent-11, score-0.214]

5 Natural language processing techniques typically pre-suppose that all documents being processed are written in a given language (e. [sent-13, score-0.056]

6 English), but as focus shifts onto processing documents from internet sources such as microblogging services, this becomes increasingly difficult to guarantee. [sent-15, score-0.056]

7 Language identification is also a key component of many web services. [sent-16, score-0.244]

8 For example, the language that a web page is written in is an important consideration in determining whether it is likely to be of interest to a particular user of a search engine, and automatic identification is an essential step in building language corpora from the web. [sent-17, score-0.262]

9 It has practical implications for social networking and social media, where it may be desirable to organize comments and other user-generated content by language. [sent-18, score-0.06]

10 What is required is thus a generic language identification tool that is usable off-the-shelf, i. [sent-23, score-0.335]

11 py, a LangID tool with the following characteristics: (1) fast, (2) usable off-the-shelf, (3) unaffected by domain-specific features (e. [sent-27, score-0.096]

12 HTML, XML, markdown), (4) single file with minimal dependencies, and (5) flexible interface 2 Methodology langid . [sent-29, score-0.8]

13 py is trained over a naive Bayes classifier with a multinomial event model (McCallum and Nigam, 1998), over a mixture of byte n-grams (1≤n≤4). [sent-30, score-0.149]
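
The byte n-gram representation mentioned above can be illustrated with a minimal sketch. It counts every byte n-gram with 1≤n≤4 in the UTF-8 encoding of a string. The function name byte_ngrams and the choice of UTF-8 are assumptions made for illustration, not taken from the langid.py source.

```python
from collections import Counter


def byte_ngrams(text, n_min=1, n_max=4):
    """Count every byte n-gram with n_min <= n <= n_max.

    Hypothetical helper: the text is encoded as UTF-8 (an assumption),
    so one non-ASCII character may contribute several bytes.
    """
    data = text.encode("utf-8")
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the byte sequence.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts


counts = byte_ngrams("abab")
```

Working on bytes rather than characters means no decoding step is needed before features are extracted.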

14 One key difference from conventional text categorization solutions is that langid . [sent-31, score-0.79]

15 In order to address (2), we integrate information about the language identification task from a variety of domains by using LD feature selection (Lui and Baldwin, 2011). [sent-36, score-0.332]

16 Lui and Baldwin (2011) showed that it is relatively easy to attain high accuracy for language identification in a traditional text categorization setting, where we have in-domain training data. [Page footer: Proceedings of the 50th Annual Meeting of the ..., Jeju, Republic of Korea, 8-14 July 2012.] [sent-37, score-0.093]

17 Table 1: Summary of the LangID datasets. [sent-66, score-0.107]

18 LD feature selection addresses this problem by focusing on key features that are relevant to the language identification task. [sent-68, score-0.282]

19 It is based on Information Gain (IG), originally introduced as a splitting criterion for decision trees (Quinlan, 1986), and later shown to be effective for feature selection in text categorization (Yang and Pedersen, 1997; Forman, 2003). [sent-69, score-0.097]
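
As a rough illustration of Information Gain for a binary "feature present / absent" event, here is a minimal sketch. It computes plain IG with respect to the class label only. It is not the LD score itself, whose exact definition is not given in the sentences above, and the data layout (a list of label and feature-set pairs) is an assumption.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, feature):
    """IG of a binary feature: H(C) - H(C | feature present or absent).

    docs is a list of (label, set_of_features) pairs (hypothetical layout).
    """
    all_labels = Counter(label for label, _ in docs)
    with_f = Counter(label for label, feats in docs if feature in feats)
    without_f = Counter(label for label, feats in docs if feature not in feats)
    n = len(docs)
    n_with = sum(with_f.values())
    n_without = n - n_with
    conditional = ((n_with / n) * entropy(with_f.values())
                   + (n_without / n) * entropy(without_f.values()))
    return entropy(all_labels.values()) - conditional
```

A feature that perfectly separates two equally frequent classes scores 1 bit, while a feature that never occurs scores 0.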

20 For practical reasons, before the IG calculation the candidate feature set is pruned by means of a term-frequency based feature selection. [sent-72, score-0.06]

21 Lui and Baldwin (2011) presented empirical evidence that LD feature selection was effective for domain adaptation in language identification. [sent-73, score-0.068]

22 py, as well as two support modules LDfeatureselect . [sent-77, score-0.128]

23 py is the single file which packages the language identification tool, and the only file needed to use langid . [sent-81, score-1.053]

24 It comes with an embedded model which covers 97 languages using training data drawn from 5 domains. [sent-83, score-0.087]

25 Tokenization and feature selection are carried out in a single pass over the input document via Aho-Corasick string matching (Aho and Corasick, 1975). [sent-84, score-0.09]
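
The single-pass idea can be sketched with a small Aho-Corasick automaton that counts occurrences of a fixed set of byte patterns while reading the input once. This is a textbook construction written for illustration. The function names and the table layout are assumptions and do not reflect how langid.py stores its automaton.

```python
from collections import Counter, deque


def build_automaton(patterns):
    """Build goto, failure and output tables for a set of byte patterns."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link per state
    out = [[]]    # patterns recognised on entering each state
    for pat in patterns:
        state = 0
        for b in pat:
            if b not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][b] = len(goto) - 1
            state = goto[state][b]
        out[state].append(pat)
    # Breadth-first pass: failure links of depth-1 states stay at the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for b, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and b not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(b, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def count_features(data, automaton):
    """Count all (possibly overlapping) pattern matches in one pass."""
    goto, fail, out = automaton
    counts = Counter()
    state = 0
    for b in data:
        while state and b not in goto[state]:
            state = fail[state]
        state = goto[state].get(b, 0)
        for pat in out[state]:
            counts[pat] += 1
    return counts
```

Because only the selected patterns are in the automaton, matching and feature selection happen together: byte sequences that are not selected features are never counted.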

26 The naive Bayes classifier is implemented using numpy, the de facto numerical computation package for Python. [sent-89, score-0.095]

27 numpy is free and open source, and available for all major platforms. [sent-90, score-0.072]

28 Using numpy introduces a dependency on a library that is not in the Python standard library. [sent-91, score-0.096]

29 This is a reasonable tradeoff, as numpy provides us with an optimized implementation of matrix operations, which allows us to implement fast naive Bayes classification while maintaining the single-file concept of langid . [sent-92, score-0.903]
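
To show why matrix operations suit this model, here is a minimal sketch of multinomial naive Bayes in numpy, where scoring a document against all classes is a single matrix-vector product. The function names, the add-one smoothing (alpha) and the dense count matrix are assumptions for illustration. The sentences above do not specify how langid.py estimates or stores its parameters.

```python
import numpy as np


def train_nb(X, y, n_classes, alpha=1.0):
    """Estimate log priors and log feature likelihoods.

    X: (documents, features) count matrix; y: integer class labels.
    alpha is an assumed additive smoothing constant.
    """
    log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))
    class_counts = np.zeros((n_classes, X.shape[1]))
    np.add.at(class_counts, y, X)  # sum feature counts per class
    smoothed = class_counts + alpha
    log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return log_prior, log_likelihood


def classify(x, log_prior, log_likelihood):
    """Return the index of the highest-scoring class for count vector x."""
    scores = log_prior + log_likelihood @ x
    return int(np.argmax(scores))


X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]])
y = np.array([0, 0, 1, 1])
log_prior, log_likelihood = train_nb(X, y, n_classes=2)
```

In this toy setup, a document dominated by the first feature is assigned to class 0 and one dominated by the second feature to class 1.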

30 py can be used in three ways: Command-line tool: langid . [sent-95, score-0.761]

31 py also supports language identification of entire files via redirection. [sent-99, score-0.214]

32 This allows a user to interactively explore data, as well as to integrate language identification into a pipeline of other unix-style tools. [sent-100, score-0.233]

33 However, use via redirection is not recommended for large quantities of documents, as each invocation requires the trained model to be unpacked into memory; when many documents are processed, a web service is preferable, as the model is unpacked once. [sent-101, score-0.264]

34 py can be started as a web service with a command-line switch. [sent-108, score-0.059]

35 py from other programming environments, as most languages include libraries for interacting with web services over HTTP. [sent-113, score-0.099]

36 It also allows the language identification service to be run as a network/internet service. [sent-114, score-0.243]
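
The general pattern of exposing a classifier over HTTP can be sketched with the Python standard library alone. Everything specific here is hypothetical: the /detect path, the q parameter, the JSON field names and the stand-in classify function are assumptions for illustration and are not claimed to match the interface of langid.py.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def classify(text):
    """Hypothetical stand-in classifier returning (language, confidence)."""
    return ("en", 1.0) if text.isascii() else ("unknown", 0.0)


def handle_path(path):
    """Map a request path such as /detect?q=hello to (status, JSON body)."""
    parsed = urlparse(path)
    query = parse_qs(parsed.query)
    if parsed.path != "/detect" or "q" not in query:
        return 400, json.dumps({"error": "expected /detect?q=<text>"})
    language, confidence = classify(query["q"][0])
    return 200, json.dumps({"language": language, "confidence": confidence})


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = handle_path(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the service; the model would be loaded once, before this call."""
    HTTPServer(("localhost", port), Handler).serve_forever()
```

Keeping the request logic in handle_path separates it from the server loop, so it can be exercised without opening a socket.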

37 py implements estimation of parameters for the multinomial naive Bayes model, as well as the construction of the DFA for the Aho-Corasick string matching algorithm. [sent-126, score-0.129]

38 Its input is a list of byte patterns representing a feature set (such as that selected via LDfeatureselect . [sent-127, score-0.212]

39 It produces the final model as a single compressed, encoded string, which can be saved to an external file and used by langid . [sent-129, score-0.8]
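
One way a model could be turned into a single compressed, encoded string is sketched below using pickle, bz2 and base64 from the standard library. The choice of these three modules and the function names are assumptions for illustration. The sentence above does not state which serialization, compression or encoding is used.

```python
import base64
import bz2
import pickle


def pack_model(model):
    """Serialize, compress and encode a model object as an ASCII string."""
    raw = pickle.dumps(model)
    return base64.b64encode(bz2.compress(raw)).decode("ascii")


def unpack_model(encoded):
    """Reverse pack_model. Only use on strings from a trusted source,
    because unpickling untrusted data can execute arbitrary code."""
    return pickle.loads(bz2.decompress(base64.b64decode(encoded)))


# Toy "model": the parameter names are hypothetical placeholders.
toy_model = {"classes": ["en", "de"], "log_prior": [-0.69, -0.69]}
packed = pack_model(toy_model)
```

A string of this kind can be pasted into a source file as a literal, which is compatible with distributing the tool as a single file with an embedded model.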

40 py is distributed with an embedded model trained using the multi-domain language identification corpus of Lui and Baldwin (2011). [sent-132, score-0.266]

41 This corpus contains documents in a total of 97 languages. [sent-133, score-0.056]

42 The data is drawn from 5 different domains: government documents, software documentation, newswire, online encyclopedia and an internet crawl, though no domain covers the full set of languages by itself, and some languages are present only in a single domain. [sent-134, score-0.113]

43 Previous research has shown that explicit encoding detection is not needed for language identification (Baldwin and Lui, 2010). [sent-139, score-0.214]

44 Our training data consists mostly of UTF8-encoded documents, but some of our evaluation datasets contain a mixture of encodings. [sent-140, score-0.078]

45 We compare the empirical results obtained from langid . [sent-143, score-0.761]

46 py to those obtained from other language identification toolkits which incorporate a pre-trained model, and are thus usable off-the-shelf for language identification. [sent-144, score-0.252]

47 It has traditionally been the de facto LangID tool of choice in research, and is the basis of language identification/filtering in the ClueWeb09 Dataset (Callan and Hoy, 2009) and CorpusBuilder (Ghani et al. [sent-148, score-0.058]

48 LangDetect implements a Naive Bayes classifier, using a character n-gram based representation without feature selection, with a set of normalization heuristics to improve accuracy. [sent-151, score-0.067]

49 CLD is a port of the embedded language identifier in Google’s Chromium browser, maintained by Mike McCandless. [sent-153, score-0.088]

50 The datasets come from a variety of domains, such as newswire (TCL), biomedical corpora (EMEA), government documents (EUROGOV, EUROPARL) and microblog services (T-BE, T-SC). [sent-155, score-0.354]

51 A number of these datasets have been previously used in language identification research. [sent-156, score-0.292]

52 Table 2: Comparison of standalone classification tools, in terms of accuracy and speed (documents/second), relative to langid. [sent-255, score-0.057]

53 py, LangDetect, TextCat, CLD; languages supported: 97, 53, 75, 64+; URLs (one per tool): http://www . [sent-257, score-0.148]

54 com/p/chromium-compact-language-detector/ Table 3: Summary of the LangID tools compared. [...] brief summary of the characteristics of each dataset in Table 1. [sent-268, score-0.156]

55 The datasets we use for evaluation are different from and independent of the datasets from which the embedded model of langid . [sent-269, score-0.969]

56 In Table 2, we report the accuracy of each tool, measured as the proportion of documents from each dataset that are correctly classified. [sent-271, score-0.135]

57 We present the absolute accuracy and performance for langid . [sent-272, score-0.818]

58 py, and relative accuracy and slowdown for the other systems. [sent-273, score-0.057]
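
The two reported quantities can be written down as a short sketch. Accuracy follows the definition given above (proportion of documents correctly classified). The slowdown function assumes that slowdown means the ratio of the baseline throughput to the other system's throughput, both in documents/second. That reading is an assumption, as the exact formula is not stated in these sentences.

```python
def accuracy(predicted, gold):
    """Proportion of documents whose predicted language matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def slowdown(baseline_docs_per_sec, other_docs_per_sec):
    """Assumed definition: how many times slower the other system is."""
    return baseline_docs_per_sec / other_docs_per_sec
```

Under this reading, a slowdown below 1 would indicate a system that is faster than the baseline.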

59 We only utilized a single core, as none of the language identification tools tested are inherently multicore. [sent-275, score-0.252]

60 It outperforms TextCat in terms of speed and accuracy on all of the datasets considered. [sent-281, score-0.135]

61 This is primarily due to the design of TextCat, which requires that the supplied models be read from file for each document classified. [sent-284, score-0.057]

62 py generally outperforms LangDetect, except in datasets derived from government documents (EUROGOV, EUROPARL). [sent-286, score-0.2]

63 However, the difference in accuracy between langid . [sent-287, score-0.111]

64 py and LangDetect on such datasets is very small, and langid . [sent-288, score-0.915]

65 Here, LangDetect is much faster, but has extremely poor accuracy (0. [sent-291, score-0.057]

66 py and CLD both performed very well, providing evidence that it is possible to build a generic language identifier that is insensitive to domain-specific characteristics. [sent-298, score-0.061]

67 This may reveal some insight into the design of CLD, which is likely to have been tuned for language identification of web pages. [sent-302, score-0.262]

68 The EMEA corpus is heavy in XML markup, which CLD and langid . [sent-303, score-0.761]

69 However, this increase in speed comes at the cost of decreased accuracy in other domains, as we will see in Section 5. [sent-306, score-0.057]

70 3 Comparison on microblog messages The size of the input text is known to play a significant role in the accuracy of automatic language identification, with accuracy decreasing on shorter input documents (Cavnar and Trenkle, 1994; Sibun and Reynar, 1996; Baldwin and Lui, 2010). [sent-309, score-0.393]

71 Recently, language identification of short strings has generated interest in the research community. [sent-310, score-0.24]

72 They develop a method which uses a decision tree to integrate outputs from several different language identification approaches. [sent-313, score-0.233]

73 (2010) focus on messages of 5–21 characters, using n-gram language models over data drawn from UDHR in a naive Bayes classifier. [sent-315, score-0.172]

74 A recent application where language identification is an open issue is over the rapidly-increasing volume of data being generated by social media. [sent-316, score-0.232]

75 Microblog services such as Twitter allow users to post short text messages. [sent-317, score-0.06]

76 It is estimated that half the messages on Twitter are not in English. [sent-320, score-0.102]

77 This has led to recent research focused specifically on the task of language identification of Twitter messages. [sent-322, score-0.214]

78 (to appear) improve language identification in Twitter messages by augmenting standard methods [sent-324, score-0.316]

79 with language identification priors based on a user’s previous messages and by the content of links embedded in messages. [sent-328, score-0.39]

80 Tromp and Pechenizkiy (2011) present a method for language identification of short text messages by means of a graph structure. [sent-329, score-0.342]

81 Despite the recently published results on language identification of microblog messages, there is no dedicated off-the-shelf system to perform the task. [sent-330, score-0.335]

82 We thus examine the accuracy and performance of using generic language identification tools to identify the language of microblog messages. [sent-331, score-0.455]

83 It is important to note that none of the systems we test have been specifically tuned for the microblog domain. [sent-332, score-0.121]

84 We make use of two datasets of Twitter messages kindly provided to us by other researchers. [sent-335, score-0.18]

85 The first is T-BE (Tromp and Pechenizkiy, 2011), which contains 9659 messages in 6 European languages. [sent-336, score-0.102]

86 , to appear), which contains 5000 messages in 5 European languages. [sent-338, score-0.102]

87 py has better accuracy than any of the other systems tested. [sent-340, score-0.057]

88 On T-BE, Tromp and Pechenizkiy (2011) report accuracy between 0. [sent-341, score-0.057]

89 In our experiments, the accuracy of TextCat is much lower (0. [sent-351, score-0.057]

90 Our results show that it is possible for a generic language identification tool to attain reasonably high accuracy (0. [sent-355, score-0.39]

91 89) without artificially constraining the set of languages to be considered, which corresponds more closely to the demands of automatic language identification to real-world data sources, where there is generally no prior knowledge of the languages present. [sent-356, score-0.307]

92 We also observe that while CLD is still the fastest classifier, this has come at the cost of accuracy in an alternative domain such as Twitter messages, where both langid . [sent-357, score-0.138]

93 py and LangDetect attain better accuracy than CLD. [sent-358, score-0.169]

94 An interesting point of comparison between the Twitter datasets is how the accuracy of all systems is generally higher on T-BE than on T-SC, despite them covering essentially the same languages (T-BE includes Italian, whereas T-SC does not). [sent-359, score-0.193]

95 This is likely to be because the T-BE dataset was produced using a semi-automatic method which involved a language identification step using the method of Cavnar and Trenkle (1994) (E Tromp, personal communication, July 6 2011). [sent-360, score-0.236]

96 This may also explain why TextCat, which is also based on Cavnar and Trenkle’s work, has unusually high accuracy on this dataset. [sent-361, score-0.057]

97 6 Conclusion In this paper, we presented langid . [sent-362, score-0.761]

98 We demonstrated the robustness of the tool over a range of test corpora of both long and short documents (including micro-blogs). [sent-364, score-0.14]

99 Language identification of short text segments with n-gram models. [sent-443, score-0.24]

100 A comparative study on feature selection in text categorization. [sent-448, score-0.068]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('langid', 0.761), ('identification', 0.214), ('textcat', 0.163), ('lui', 0.158), ('cld', 0.145), ('carter', 0.127), ('microblog', 0.121), ('ld', 0.114), ('emea', 0.11), ('langdetect', 0.109), ('tromp', 0.109), ('baldwin', 0.104), ('messages', 0.102), ('cavnar', 0.091), ('eurogov', 0.091), ('twitter', 0.079), ('datasets', 0.078), ('ct', 0.076), ('dfa', 0.072), ('numpy', 0.072), ('pechenizkiy', 0.072), ('tcl', 0.072), ('trenkle', 0.072), ('naive', 0.07), ('europarl', 0.069), ('tool', 0.058), ('accuracy', 0.057), ('documents', 0.056), ('angid', 0.054), ('byte', 0.054), ('langdet', 0.054), ('embedded', 0.052), ('bayes', 0.049), ('government', 0.043), ('file', 0.039), ('selection', 0.038), ('usable', 0.038), ('python', 0.038), ('tools', 0.038), ('implements', 0.037), ('identifier', 0.036), ('ig', 0.036), ('attain', 0.036), ('ceylan', 0.036), ('nicta', 0.036), ('unpacked', 0.036), ('usersupplied', 0.036), ('vatanen', 0.036), ('languages', 0.035), ('services', 0.034), ('marco', 0.032), ('vegas', 0.032), ('hoy', 0.032), ('rain', 0.032), ('lb', 0.032), ('ghani', 0.032), ('sibun', 0.032), ('callan', 0.032), ('domains', 0.031), ('feature', 0.03), ('web', 0.03), ('categorization', 0.029), ('service', 0.029), ('timothy', 0.028), ('customized', 0.027), ('requests', 0.027), ('fastest', 0.027), ('aho', 0.027), ('appear', 0.026), ('short', 0.026), ('automaton', 0.025), ('classifier', 0.025), ('generic', 0.025), ('org', 0.025), ('processors', 0.024), ('implications', 0.024), ('library', 0.024), ('generally', 0.023), ('las', 0.023), ('australian', 0.023), ('dataset', 0.022), ('string', 0.022), ('priors', 0.022), ('biomedical', 0.022), ('le', 0.022), ('lt', 0.021), ('quantities', 0.021), ('cat', 0.021), ('engine', 0.021), ('determination', 0.021), ('parallel', 0.02), ('xml', 0.02), ('summary', 0.02), ('integrate', 0.019), ('social', 0.018), ('communications', 0.018), ('consideration', 0.018), ('http', 0.018), ('design', 0.018), ('trying', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

2 0.060280651 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obtained in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

3 0.053998437 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii

Abstract: The problem addressed in this paper is to segment a given multilingual document into segments for each language and then identify the language of each segment. The problem was motivated by an attempt to collect a large amount of linguistic data for non-major languages from the web. The problem is formulated in terms of obtaining the minimum description length of a text, and the proposed solution finds the segments and their languages through dynamic programming. Empirical results demonstrating the potential of this approach are presented for experiments using texts taken from the Universal Declaration of Human Rights and Wikipedia, covering more than 200 languages.

4 0.052617762 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan

Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.

5 0.040691823 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer

Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.

6 0.039541554 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

7 0.038787425 160 acl-2012-Personalized Normalization for a Multilingual Chat System

8 0.038325135 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

9 0.034859806 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

10 0.034670826 64 acl-2012-Crosslingual Induction of Semantic Roles

11 0.034442604 37 acl-2012-Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

12 0.033215489 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

13 0.033180904 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling

14 0.032441396 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

15 0.032035295 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis

16 0.031814564 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

17 0.030959036 96 acl-2012-Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection

18 0.0304345 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

19 0.029397136 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

20 0.028474437 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.101), (1, 0.036), (2, 0.008), (3, 0.016), (4, 0.02), (5, 0.028), (6, 0.05), (7, 0.016), (8, 0.009), (9, 0.024), (10, -0.011), (11, 0.016), (12, 0.0), (13, 0.034), (14, -0.003), (15, -0.01), (16, 0.001), (17, 0.044), (18, -0.009), (19, -0.008), (20, -0.049), (21, 0.006), (22, 0.042), (23, -0.025), (24, -0.001), (25, 0.029), (26, -0.006), (27, 0.072), (28, 0.014), (29, -0.006), (30, 0.087), (31, -0.011), (32, 0.013), (33, 0.008), (34, 0.054), (35, -0.048), (36, -0.071), (37, 0.011), (38, -0.111), (39, 0.001), (40, 0.045), (41, 0.092), (42, -0.098), (43, -0.014), (44, -0.019), (45, -0.18), (46, 0.076), (47, 0.092), (48, 0.019), (49, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9070769 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

2 0.63850415 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

Author: Hiroshi Yamaguchi ; Kumiko Tanaka-Ishii

3 0.58120465 160 acl-2012-Personalized Normalization for a Multilingual Chat System

Author: Ai Ti Aw ; Lian Hau Lee

Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if user uses vocabularies outside the training data and create short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing user to create and use personalized short-forms in multilingual chat.

4 0.52557296 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

5 0.5242967 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

Author: Marco Guerini ; Carlo Strapparava ; Oliviero Stock

Abstract: In recent years there has been a growing interest in crowdsourcing methodologies to be used in experimental research for NLP tasks. In particular, evaluation of systems and theories about persuasion is difficult to accommodate within existing frameworks. In this paper we present a new cheap and fast methodology that allows fast experiment building and evaluation with fully-automated analysis at a low cost. The central idea is exploiting existing commercial tools for advertising on the web, such as Google AdWords, to measure message impact in an ecological setting. The paper includes a description of the approach, tips for how to use AdWords for scientific research, and results of pilot experiments on the impact of affective text variations which confirm the effectiveness of the approach.

6 0.50603199 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

7 0.44378734 163 acl-2012-Prediction of Learning Curves in Machine Translation

8 0.44166645 153 acl-2012-Named Entity Disambiguation in Streaming Data

9 0.43180114 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

10 0.41917428 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

11 0.38252419 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

12 0.36120689 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

13 0.35363469 151 acl-2012-Multilingual Subjectivity and Sentiment Analysis

14 0.35284519 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing

15 0.35127568 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

16 0.35007292 137 acl-2012-Lemmatisation as a Tagging Task

17 0.3468217 68 acl-2012-Decoding Running Key Ciphers

18 0.33741787 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

19 0.33400843 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

20 0.3184264 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.034), (26, 0.045), (28, 0.046), (30, 0.017), (34, 0.294), (37, 0.044), (39, 0.063), (59, 0.016), (74, 0.032), (82, 0.022), (84, 0.027), (85, 0.049), (90, 0.107), (92, 0.04), (94, 0.013), (99, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91070694 200 acl-2012-Toward Automatically Assembling Hittite-Language Cuneiform Tablet Fragments into Larger Texts

Author: Stephen Tyndall

Abstract: This paper presents the problem within Hittite and Ancient Near Eastern studies of fragmented and damaged cuneiform texts, and proposes to use well-known text classification metrics, in combination with some facts about the structure of Hittite-language cuneiform texts, to help classify a number of fragments of clay cuneiform-script tablets into more complete texts. In particular, I propose using Sumerian and Akkadian ideogrammatic signs within Hittite texts to improve the performance of Naive Bayes and Maximum Entropy classifiers. The performance in some cases is improved, and in some cases very much not, suggesting that the variable frequency of occurrence of these ideograms in individual fragments makes considerable difference in the ideal choice for a classification method. Further, complexities of the writing system and the digital availability of Hittite texts complicate the problem.

2 0.89921898 112 acl-2012-Humor as Circuits in Semantic Networks

Author: Igor Labutov ; Hod Lipson

Abstract: This work presents a first step to a general implementation of the Semantic-Script Theory of Humor (SSTH). Of the scarce amount of research in computational humor, no research had focused on humor generation beyond simple puns and punning riddles. We propose an algorithm for mining simple humorous scripts from a semantic network (ConceptNet) by specifically searching for dual scripts that jointly maximize overlap and incongruity metrics in line with Raskin’s Semantic-Script Theory of Humor. Initial results show that a more relaxed constraint of this form is capable of generating humor of deeper semantic content than wordplay riddles. We evaluate the said metrics through a user-assessed quality of the generated two-liners.

same-paper 3 0.74812812 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

4 0.59601659 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graph-based document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve a 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

5 0.48660821 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

6 0.47814408 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

7 0.477406 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

8 0.47492141 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

9 0.47308123 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

10 0.4700602 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

11 0.46989405 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

12 0.46867689 187 acl-2012-Subgroup Detection in Ideological Discussions

13 0.46847892 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

14 0.46761009 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

15 0.46699917 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

16 0.4667064 136 acl-2012-Learning to Translate with Multiple Objectives

17 0.46663788 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

18 0.4659009 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

19 0.46579713 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

20 0.46532181 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords