160 acl-2012-Personalized Normalization for a Multilingual Chat System


Source: pdf

Author: Ai Ti Aw ; Lian Hau Lee

Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if users use vocabulary outside the training data and create short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. [sent-4, score-1.613]

2 One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. [sent-5, score-2.013]

3 Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if users use vocabulary outside the training data and create short-forms freely. [sent-6, score-1.206]

4 We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat. [sent-7, score-2.389]

5 1 Introduction Processing user-generated textual content on social media and networking usually encounters challenges due to the language used by the online community. [sent-8, score-0.18]

6 Though some jargon of the online language has made its way into the standard dictionary, a large portion of the abbreviations, slang and context-specific terms are still uncommon and only understood within the user community. [sent-9, score-0.233]

7 Consequently, content analysis or translation techniques developed for a more formal genre like news or even conversations cannot apply directly and effectively to the social media content. [sent-10, score-0.241]

8 , 2011) on text normalization to preprocess user-generated content such as tweets and short messages before further processing. [sent-14, score-0.645]

9 The approaches include supervised or unsupervised methods based on morphological and phonetic variations. [sent-15, score-0.025]

10 However, most of the multilingual chat systems on the Internet have not yet integrated this feature and instead request users to type in proper language in order to get good translation. [sent-16, score-0.883]

11 This is because the current techniques are not robust enough to model the different characteristics featured in the social media content. [sent-17, score-0.173]

12 It is also difficult to unify the language uniqueness among different users into a single model. [sent-19, score-0.046]

13 We propose a practical and effective method, exploiting a personalized dictionary for each user, to support the use of user-defined short-forms in a multilingual chat system - AsiaSpik. [sent-20, score-1.184]

14 The use of this personalized dictionary reduces the dependence on available training data and gives users the flexibility and interactivity to include and manage their own vocabularies during chat. [sent-21, score-0.474]

15 2 ASIASPIK System Overview AsiaSpik is a web-based multilingual instant messaging system that enables online chats written in one language to be readable in other languages by other users. [sent-22, score-0.208]

16 It describes the process flow between the Chat Client, Chat Server, Translation Bot and Normalization Bot whenever the Chat Client starts the chat module. [sent-24, score-0.762]

17 When the Chat Client starts the chat module, it checks whether the normalization option for the language used by the user is activated. [sent-25, score-1.174]

18 [Page footer: © 2012 Association for Computational Linguistics, pages 31–36.] so, any message sent by the user will be routed to the Normalization Bot for normalization before reaching the Chat Server. [sent-28, score-0.649]

19 The Chat Server then directs the message to the designated recipients. [sent-29, score-0.165]

20 Chat Client at each recipient invokes a translation request to the Translation Bot to translate the message to the language set by the recipient. [sent-30, score-0.283]

21 This allows the same source message to be received by different recipients in different target languages. [sent-31, score-0.187]
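
The message flow described in these sentences can be sketched as follows. This is a minimal illustration, not the AsiaSpik implementation: the function names, the toy normalizer, the toy translator and the expansion chosen for "gtg" are all hypothetical stand-ins for the Normalization Bot and Translation Bot.

```python
from typing import Callable, Dict


def deliver(message: str,
            sender_lang: str,
            recipients: Dict[str, str],
            normalize: Callable[[str], str],
            translate: Callable[[str, str, str], str],
            normalization_on: bool = True) -> Dict[str, str]:
    """Return the text each recipient sees, keyed by recipient name."""
    # Sender side: normalize before the message reaches the chat server.
    outgoing = normalize(message) if normalization_on else message
    delivered = {}
    for name, lang in recipients.items():
        # Recipient side: each client asks for a translation into its own language.
        if lang == sender_lang:
            delivered[name] = outgoing
        else:
            delivered[name] = translate(outgoing, sender_lang, lang)
    return delivered


def toy_normalize(text: str) -> str:
    # Hypothetical one-entry dictionary; a real system would use the user's profile.
    lingo = {"gtg": "got to go"}
    return " ".join(lingo.get(token, token) for token in text.split())


def toy_translate(text: str, src: str, tgt: str) -> str:
    # Placeholder that only tags the text with the translation direction.
    return f"[{src}->{tgt}] {text}"


result = deliver("i gtg", "en", {"keith": "en", "mahani": "ms"},
                 toy_normalize, toy_translate)
```

Because translation is requested per recipient, the same normalized source message can end up in a different target language for each participant.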

22 We custom build a web-based Chat Client to communicate with the Chat Server based on Jabber/XMPP to receive presence and messaging information. [sent-33, score-0.05]

23 We also develop a user management plug-in to synchronize and authenticate user login. [sent-34, score-0.408]

24 The translation and normalization function used by the Translation Bot and Normalization Bot are provided through Web Services. [sent-35, score-0.319]

25 The Translation Web Service uses in-house translation engines and supports the translation from Chinese, Malay and Indonesian to English and vice versa. [sent-36, score-0.161]

26 Multilingual chat among these languages is achieved through pivot translation using English as the pivot language. [sent-37, score-0.838]
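
Pivot translation through English can be sketched as below. The `engines` table and the tagging lambdas are hypothetical placeholders for the in-house translation engines, which are not described in enough detail here to reproduce.

```python
from typing import Callable, Dict, Tuple

Engine = Callable[[str], str]


def pivot_translate(text: str, src: str, tgt: str,
                    engines: Dict[Tuple[str, str], Engine],
                    pivot: str = "en") -> str:
    """Translate src -> tgt, going through the pivot when no direct pair exists."""
    if src == tgt:
        return text
    if src == pivot or tgt == pivot:
        # One hop: the pair involves the pivot language directly.
        return engines[(src, tgt)](text)
    # Two hops: source -> pivot, then pivot -> target.
    intermediate = engines[(src, pivot)](text)
    return engines[(pivot, tgt)](intermediate)


def make_tagger(src: str, tgt: str) -> Engine:
    # Toy engine that appends the direction instead of translating.
    return lambda text: f"{text}|{src}>{tgt}"


# Only pairs to and from English are available, as in the system described.
engines = {}
for lang in ("zh", "ms", "id"):
    engines[(lang, "en")] = make_tagger(lang, "en")
    engines[("en", lang)] = make_tagger("en", lang)

two_hop = pivot_translate("x", "zh", "ms", engines)
one_hop = pivot_translate("x", "en", "id", engines)
```

With three non-English languages this needs six engines rather than one per language pair, at the cost of compounding errors across the two hops.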

27 Both web services are running on Apache Tomcat web server with Apache Axis2. [sent-39, score-0.116]

28 3 Personalized Normalization Personalized Normalization is the main distinction of AsiaSpik among other multilingual chat systems. [sent-40, score-0.814]

29 It gives users the flexibility to personalize their short-forms for messages in English. [sent-41, score-0.404]

30 1 Related Work The traditional text normalization strategy follows the noisy channel model (Shannon, 1948). [sent-43, score-0.281]

31 Suppose the chat message is C and its corresponding standard form is S; the approach aims to find arg max P(S | C) by computing arg max P(C | S) P(S), in which P(S) is usually a language model and P(C | S) is an error model. [sent-44, score-1.14]

32 The objective of using this model in the chat message normalization context is to develop an appropriate error model for converting the non-standard and unconventional words found in chat messages into standard words. [sent-45, score-2.154]

33 $\hat{S} = \arg\max_S P(S \mid C) = \arg\max_S P(C \mid S)\,P(S)$ Recently, Aw et al. [sent-46, score-0.21]
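
As a toy illustration of the noisy channel decision rule, the sketch below scores each candidate standard form S by log P(C | S) + log P(S) and keeps the best one. The probability tables are invented numbers for a single token, not values from the paper, and a real language model would score the whole sentence.

```python
import math
from typing import Dict, Iterable, Tuple


def best_standard_form(chat: str,
                       candidates: Iterable[str],
                       error_model: Dict[Tuple[str, str], float],
                       language_model: Dict[str, float]) -> str:
    """Return the candidate S that maximizes P(C | S) * P(S)."""
    def score(standard: str) -> float:
        # Work in log space so products of small probabilities stay stable.
        return (math.log(error_model[(chat, standard)])
                + math.log(language_model[standard]))
    return max(candidates, key=score)


# Hypothetical numbers: both expansions are equally likely to be written "din",
# so the language model prior decides between them.
error_model = {("din", "didn't"): 0.5, ("din", "dinner"): 0.5}
language_model = {"didn't": 0.02, "dinner": 0.001}

best = best_standard_form("din", ["didn't", "dinner"], error_model, language_model)
```

When the error model cannot separate the candidates, the prior alone picks the more frequent word, which is one way a short-form can be expanded to something the sender did not intend.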

34 (2006) model text message normalization as translation from the texting language into the standard language. [sent-47, score-0.55]

35 (2007) model the word-level text generation process for SMS messages, by considering graphemic/phonetic abbreviations and unintentional typos as hidden Markov model (HMM) state transitions and emissions, respectively. [sent-49, score-0.16]

36 Cook and Stevenson (2009) expand the error model by introducing inference from different erroneous formation processes, according to the sample error distribution. [sent-50, score-0.097]

37 Han and Baldwin (2011) use a classifier to detect ill-formed words, and generate correction candidates based on morphophonemic similarity. [sent-51, score-0.022]

38 These models are effective in the experiments conducted; however, much work remains to be done to handle the diversity and dynamics of content and the fast evolution of words used in social media and networking. [sent-52, score-0.205]

39 We notice that, unlike spelling errors, which are mostly made unintentionally by the writers, abbreviations or slang found in chat messages are introduced intentionally by the senders most of the time. [sent-53, score-1.007]

40 This leads us to suggest that if facilities are given to users to define their abbreviations, the dynamic of the social content and the fast evolution of words could be well captured and managed by the user. [sent-54, score-0.155]

41 In this way, the normalization model could evolve together with the social media language, and chat messages could also be personalized for each user dynamically and interactively. [sent-55, score-1.767]

42 2 Personalized Normalization Model We employ a simple but effective approach for chat normalization. [sent-57, score-0.715]

43 We define P(si,j | ci) as a uniform distribution computed from a set of dictionaries collected from corpora, SMS messages and Internet sources. [sent-60, score-0.247]

44 A total of 11,119 entries are collected and each entry is assigned an initial probability, $P_s(s_{i,j} \mid c_i) = \frac{1}{|c_i|}$, where $|c_i|$ is the number of $c_i$ entries defined in the dictionary. [sent-61, score-0.131]

45 We manually adjust the probability for some entries that are very common and occur more than a certain threshold, t, in the NUS SMS corpus (How and Kan, 2005), giving them a higher weightage, w. [sent-62, score-0.191]

46 This model, together with the language model, forms our baseline system for chat normalization. [sent-63, score-0.739]
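
The default-dictionary scores described above can be sketched as follows: every expansion of a short-form starts at 1/|ci|, and expansions seen more than t times in a reference corpus are multiplied by w. The candidate expansions, counts, t and w below are made-up values, and the function name is an assumption for illustration.

```python
from typing import Dict, List


def default_probs(expansions: List[str],
                  corpus_counts: Dict[str, int],
                  t: int,
                  w: float) -> Dict[str, float]:
    """Uniform 1/|c_i| score per expansion, boosted by w when the count exceeds t."""
    base = 1.0 / len(expansions)
    return {
        s: base * w if corpus_counts.get(s, 0) > t else base
        for s in expansions
    }


# Hypothetical entry: two default expansions for the short-form "tgt".
scores = default_probs(["together", "target"], {"together": 50}, t=10, w=2.0)
```

Note that the boosted values no longer sum to one; in this sketch they act as feature scores to be combined with the language model rather than as a strict distribution.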

47 $P_s(s_{i,j} \mid c_i) = \begin{cases} \frac{1}{|c_i|} \times w & \text{if } |(s_{i,j}, c_i)| > t \\ \frac{1}{|c_i|} & \text{if } |(s_{i,j}, c_i)| \le t \end{cases}$ To enable personalized real-time management of user-defined abbreviations and short-forms, we define a personalized model $P_{user_i}(s_{i,j} \mid c_i)$ for each user based on his/her dictionary profile. [sent-64, score-1.039]

48 Each personalized model is loaded into the memory once the user activates the normalization option. [sent-65, score-0.759]

49 Whenever there is a change in the entry, the entry’s probability will be re-distributed and updated based on the following model. [sent-66, score-0.023]

50 This characterizes the AsiaSpik system which supports personalized and dynamic chat normalization. [sent-67, score-1.055]

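51 $P_{user_i}(s_{i,j} \mid c_i) = \begin{cases} \frac{N}{N+M} \times P_s(s_{i,j} \mid c_i) & \text{if } (c_i, s_{i,j}) \in SD \\ \frac{1}{N+M} & \text{if } (c_i, s_{i,j}) \notin SD \end{cases}$ where SD denotes the default dictionary, N denotes the number of $c_i$ entries in SD, and M denotes the number of $c_i$ entries in the user dictionary. [sent-68, score-0.72]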
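
One reading of the redistribution step can be sketched as below. It assumes that default entries are rescaled by N/(N+M) and that each user-added entry receives 1/(N+M), where N and M count the default and user expansions of the same short-form; this is an interpretation for illustration, and the function name and example entries are hypothetical.

```python
from typing import Dict, List


def personalized_probs(default: Dict[str, float],
                       user_entries: List[str]) -> Dict[str, float]:
    """Redistribute probability mass when a user adds expansions for a short-form.

    `default` maps each default-dictionary expansion to its P_s value.
    Assumption: user expansions already in the default dictionary are not recounted.
    """
    new_entries = [s for s in user_entries if s not in default]
    n, m = len(default), len(new_entries)
    # Default entries shrink by N / (N + M) ...
    probs = {s: n * p / (n + m) for s, p in default.items()}
    # ... and every user entry gets an equal share 1 / (N + M).
    probs.update({s: 1.0 / (n + m) for s in new_entries})
    return probs


# "din" has one default expansion; the user adds "dinner" to their profile.
updated = personalized_probs({"didn't": 1.0}, ["dinner"])
```

Under this assumption a uniform default distribution stays normalized after the update, and calling the function again whenever the profile changes gives the dynamic behaviour described.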

52 The feature weights in the normalization model are optimized by minimum error rate training (Och, 2003), which searches for weights maximizing the normalization accuracy using a small development set. [sent-69, score-0.566]

53 We use standard state-of-the-art open source tools, Moses (Koehn, 2007), to develop the system and the SRI language modeling toolkit (Stolcke, 2003) to train a trigram language model on the English portion of the Europarl Corpus (Koehn, 2005). [sent-70, score-0.075]

54 3 Experiments We conducted a small experiment using 134 chat messages sent by high school students. [sent-72, score-0.942]

55 Out of these messages, 73 short-forms are uncommon and not found in our default dictionary. [sent-73, score-0.055]

56 Most of these short-forms are very irregular, and their standard forms are hard to predict using morphological and phonetic similarity. [sent-74, score-0.048]

57 It is also hard to train a statistical model if training data is not available. [sent-75, score-0.046]

58 We asked the students to define their personal abbreviations in the system and ran the messages through the system with and without the user dictionary. [sent-76, score-0.342]

59 We asked them to give a score of 1 if the output is acceptable to them as proper English, otherwise a 0 will be given. [sent-77, score-0.023]

60 We compared the results using both the baseline model and the model implemented using the same training data as in Aw et al. [sent-78, score-0.046]

61 Both models show improvement with the use of user dictionary. [sent-81, score-0.179]

62 It also shows that it is very critical to have similar training data for the targeted domain to have good normalization performance. [sent-82, score-0.258]

63 A simple model helps if such training data is unavailable. [sent-83, score-0.023]

64 Nevertheless, the use of a dictionary driven by the user is an alternative to improve the overall performance. [sent-84, score-0.248]

65 One reason for the inability of both models to capture the variations fully is because many messages require some degree of rephrasing in addition to insertion and deletion to make it readable and acceptable. [sent-85, score-0.235]

66 For example, the ideal output for “haiz, I wanna pontang school” is “Sigh, I do not feel like going to school”, which may not be just a normalization problem. [sent-86, score-0.258]

67 In the examples shown in Table 2, ‘din’ and ‘dnr’ are normalized to ‘didn’t’ and ‘do not reply’ based on the entries captured in the default dictionary. [sent-90, score-0.069]

68 With the extension of normalization hypotheses in the user dictionary, the system produces the correct expansion to ‘dinner’. [sent-91, score-0.481]

69 Figure 2 and Figure 3 show the personal lingo defined by two users. [sent-92, score-0.023]

70 Note that expansions for “gtg” and “tgt” are defined differently and expanded differently for the two users. [sent-93, score-0.13]

71 ‘Me’ in the message box indicates the message typed by the user while ‘Expansion’ is the message expanded by the system. [sent-94, score-0.734]

72 [Figure 2 and Figure 3 captions: Short-forms defined and messages expanded for user 1 and for user 2.] Figure 4 shows the multilingual chat exchange between a Malay language user (Mahani) and an English user (Keith). [sent-96, score-1.828]

73 The figure shows that the messages are first expanded to the correct forms before being translated to the recipient language. [sent-97, score-0.273]

74 The system aims to overcome the limitations of normalizing social media content universally through a personalized normalization model. [sent-100, score-0.739]

75 The proposed strategy makes the user the active contributor in defining the chat language and enables the system to model the user's chat language dynamically. [sent-101, score-1.857]

76 The normalization approach is a simple probabilistic model making use of the normalization probability defined for each short-form and the language model probability. [sent-102, score-0.585]

77 The model can be further improved by fine-tuning the normalization probability and incorporating other feature functions. [sent-103, score-0.304]

78 The baseline model can also be further improved with more sophisticated methods without changing the architecture of the full system. [sent-104, score-0.023]

79 We would like to expand the normalization model to include more features and support other languages such as Malay and Chinese. [sent-106, score-0.301]

80 We would also like to further enhance the system to convert the translated English chat messages back to the social media language as defined by the user. [sent-107, score-1.067]

81 Investigation and modeling of the structure of texting language. [sent-116, score-0.043]

82 Optimizing predictive text entry for short message service on mobile phones. [sent-129, score-0.234]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chat', 0.715), ('personalized', 0.277), ('normalization', 0.258), ('user', 0.179), ('messages', 0.178), ('asiaspik', 0.174), ('message', 0.165), ('bot', 0.149), ('ci', 0.122), ('client', 0.105), ('multilingual', 0.099), ('abbreviations', 0.092), ('malay', 0.087), ('server', 0.076), ('social', 0.076), ('media', 0.074), ('aw', 0.071), ('sms', 0.07), ('dictionary', 0.069), ('translation', 0.061), ('expanded', 0.06), ('cook', 0.059), ('arg', 0.058), ('messaging', 0.05), ('max', 0.047), ('entries', 0.046), ('choudhury', 0.043), ('texting', 0.043), ('sd', 0.039), ('entry', 0.039), ('supports', 0.039), ('han', 0.037), ('recipient', 0.035), ('readable', 0.035), ('vocabularies', 0.035), ('uncommon', 0.032), ('apache', 0.031), ('pivot', 0.031), ('content', 0.03), ('service', 0.03), ('koehn', 0.028), ('develop', 0.028), ('ps', 0.028), ('error', 0.027), ('sent', 0.025), ('evolution', 0.025), ('phonetic', 0.025), ('whenever', 0.025), ('flexibility', 0.025), ('differently', 0.024), ('users', 0.024), ('school', 0.024), ('system', 0.024), ('europarl', 0.024), ('proper', 0.023), ('hard', 0.023), ('personal', 0.023), ('english', 0.023), ('model', 0.023), ('probability', 0.023), ('default', 0.023), ('starts', 0.022), ('activates', 0.022), ('animesh', 0.022), ('anupam', 0.022), ('monojit', 0.022), ('morphophonemic', 0.022), ('saraf', 0.022), ('sudeshna', 0.022), ('unintentional', 0.022), ('unintentionally', 0.022), ('vijit', 0.022), ('interactivity', 0.022), ('sbest', 0.022), ('routed', 0.022), ('inability', 0.022), ('lian', 0.022), ('shannon', 0.022), ('chatting', 0.022), ('uniqueness', 0.022), ('contributor', 0.022), ('invokes', 0.022), ('reliance', 0.022), ('expansions', 0.022), ('recipients', 0.022), ('khk', 0.022), ('emissions', 0.022), ('requesting', 0.022), ('synchronize', 0.022), ('slang', 0.022), ('personalize', 0.022), ('connexis', 0.022), ('indonesian', 0.022), ('unconventional', 0.022), ('internet', 0.021), ('denotes', 0.02), ('expansion', 0.02), ('expand', 0.02), 
('web', 0.02), ('moses', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 160 acl-2012-Personalized Normalization for a Multilingual Chat System

2 0.11778974 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

Author: Fei Liu ; Fuliang Weng ; Xiao Jiang

Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a . 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.

3 0.095008589 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model

Author: Rafael E. Banchs ; Haizhou Li

Abstract: This system demonstration paper presents IRIS (Informal Response Interactive System), a chat-oriented dialogue system based on the vector space model framework. The system belongs to the class of example-based dialogue systems and builds its chat capabilities on a dual search strategy over a large collection of dialogue samples. Additional strategies allowing for system adaptation and learning implemented over the same vector model space framework are also described and discussed.

4 0.094814964 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obtained in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

5 0.076592527 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems

Author: Srinivasan Janarthanam ; Oliver Lemon ; Xingkun Liu

Abstract: We demonstrate a web-based environment for development and testing of different pedestrian route instruction-giving systems. The environment contains a City Model, a TTS interface, a game-world, and a user GUI including a simulated street-view. We describe the environment and components, the metrics that can be used for the evaluation of pedestrian route instruction-giving systems, and the shared challenge which is being organised using this environment.

6 0.069947414 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

7 0.059107624 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

8 0.058269478 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

9 0.054815792 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

10 0.054310635 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

11 0.049450949 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

12 0.049358871 140 acl-2012-Machine Translation without Words through Substring Alignment

13 0.046607736 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

14 0.046075027 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

15 0.044476613 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

16 0.044076212 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

17 0.042893101 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

18 0.042327885 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System

19 0.041738309 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs

20 0.040939584 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.119), (1, -0.003), (2, 0.051), (3, 0.03), (4, 0.026), (5, 0.041), (6, 0.103), (7, 0.02), (8, 0.029), (9, 0.073), (10, -0.006), (11, 0.058), (12, -0.05), (13, 0.104), (14, -0.023), (15, -0.031), (16, -0.046), (17, 0.045), (18, 0.042), (19, -0.066), (20, -0.054), (21, 0.027), (22, 0.036), (23, -0.1), (24, -0.01), (25, 0.063), (26, 0.016), (27, 0.051), (28, 0.028), (29, -0.078), (30, 0.08), (31, -0.089), (32, 0.089), (33, 0.136), (34, 0.051), (35, 0.097), (36, 0.034), (37, 0.055), (38, -0.136), (39, 0.037), (40, 0.034), (41, -0.058), (42, -0.029), (43, 0.016), (44, -0.08), (45, -0.25), (46, -0.109), (47, 0.194), (48, 0.108), (49, -0.167)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94874668 160 acl-2012-Personalized Normalization for a Multilingual Chat System

2 0.59021038 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

Author: Marco Guerini ; Carlo Strapparava ; Oliviero Stock

Abstract: In recent years there has been a growing interest in crowdsourcing methodologies to be used in experimental research for NLP tasks. In particular, evaluation of systems and theories about persuasion is difficult to accommodate within existing frameworks. In this paper we present a new cheap and fast methodology that allows fast experiment building and evaluation with fully-automated analysis at a low cost. The central idea is exploiting existing commercial tools for advertising on the web, such as Google AdWords, to measure message impact in an ecological setting. The paper includes a description of the approach, tips for how to use AdWords for scientific research, and results of pilot experiments on the impact of affective text variations which confirm the effectiveness of the approach.

3 0.55348843 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique real-world entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obtained in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known Expectation-Maximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

4 0.53850085 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

Author: Fei Liu ; Fuliang Weng ; Xiao Jiang

Abstract: Social media language contains huge amount and wide variety of nonstandard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy nonstandard tokens before applying other NLP techniques. A major challenge facing this task is the system coverage, i.e., for any user-created nonstandard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively-driven normalization system that integrates different human perspectives in normalizing the nonstandard tokens, including the enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated on both word- and message-level using four SMS and Twitter data sets. Results show that our system achieves over 90% word-coverage across all data sets (a . 10% absolute increase compared to state-of-the-art); the broad word-coverage can also successfully translate into message-level performance gain, yielding 6% absolute increase compared to the best prior approach.

5 0.52321106 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

6 0.51588523 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

7 0.44615844 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs

8 0.3661682 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

9 0.36253375 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

10 0.35838991 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

11 0.34141767 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

12 0.33398888 68 acl-2012-Decoding Running Key Ciphers

13 0.32789072 92 acl-2012-FLOW: A First-Language-Oriented Writing Assistant System

14 0.32187772 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy

15 0.31135461 114 acl-2012-IRIS: a Chat-oriented Dialogue System based on the Vector Space Model

16 0.30857348 24 acl-2012-A Web-based Evaluation Framework for Spatial Instruction-Giving Systems

17 0.27885786 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations

18 0.27421626 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

19 0.26612464 82 acl-2012-Entailment-based Text Exploration with Application to the Health-care Domain

20 0.25954866 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.019), (26, 0.025), (28, 0.042), (30, 0.016), (37, 0.025), (39, 0.064), (74, 0.028), (75, 0.285), (82, 0.016), (84, 0.016), (85, 0.034), (90, 0.169), (92, 0.048), (94, 0.034), (99, 0.083)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78763664 160 acl-2012-Personalized Normalization for a Multilingual Chat System

2 0.59950691 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

Author: Ce Zhang ; Feng Niu ; Christopher Re ; Jude Shavlik

Abstract: Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower quality) labeled data from distant supervision and crowd sourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowd-source labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but lower, impact on precision and recall.

3 0.59121829 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

Author: Xinfan Meng ; Furu Wei ; Xiaohua Liu ; Ming Zhou ; Ge Xu ; Houfeng Wang

Abstract: The amount of labeled sentiment data in English is much larger than that in other languages. Such a disproportion arouse interest in cross-lingual sentiment classification, which aims to conduct sentiment classification in the target language (e.g. Chinese) using labeled data in the source language (e.g. English). Most existing work relies on machine translation engines to directly adapt labeled data from the source language to the target language. This approach suffers from the limited coverage of vocabulary in the machine translation results. In this paper, we propose a generative cross-lingual mixture model (CLMM) to leverage unlabeled bilingual parallel data. By fitting parameters to maximize the likelihood of the bilingual parallel data, the proposed model learns previously unseen sentiment words from the large bilingual parallel data and improves vocabulary coverage significantly. Experiments on multiple data sets show that CLMM is consistently effective in two settings: (1) labeled data in the target language are unavailable; and (2) labeled data in the target language are also available.

4 0.59036559 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

Author: Wan-Yu Lin ; Nanyun Peng ; Chun-Chao Yen ; Shou-de Lin

Abstract: In this paper, we introduce a framework that identifies online plagiarism by exploiting lexical, syntactic and semantic features that includes duplication-gram, reordering and alignment of words, POS and phrase tags, and semantic similarity of sentences. We establish an ensemble framework to combine the predictions of each model. Results demonstrate that our system can not only find considerable amount of real-world online plagiarism cases but also outperforms several state-of-the-art algorithms and commercial software. Keywords: Plagiarism Detection, Lexical, Syntactic, Semantic

5 0.58726609 191 acl-2012-Temporally Anchored Relation Extraction

Author: Guillermo Garrido ; Anselmo Penas ; Bernardo Cabaleiro ; Alvaro Rodrigo

Abstract: Although much work on relation extraction has aimed at obtaining static facts, many of the target relations are actually fluents, as their validity is naturally anchored to a certain time period. This paper proposes a methodological approach to temporally anchored relation extraction. Our proposal performs distant supervised learning to extract a set of relations from a natural language corpus, and anchors each of them to an interval of temporal validity, aggregating evidence from documents supporting the relation. We use a rich graph-based document-level representation to generate novel features for this task. Results show that our implementation for temporal anchoring is able to achieve a 69% of the upper bound performance imposed by the relation extraction step. Compared to the state of the art, the overall system achieves the highest precision reported.

6 0.58580208 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

7 0.58483338 61 acl-2012-Cross-Domain Co-Extraction of Sentiment and Topic Lexicons

8 0.58478057 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling

9 0.58428264 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules

10 0.58359802 73 acl-2012-Discriminative Learning for Joint Template Filling

11 0.58206248 140 acl-2012-Machine Translation without Words through Substring Alignment

12 0.58123344 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

13 0.58012652 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

14 0.57960182 168 acl-2012-Reducing Approximation and Estimation Errors for Chinese Lexical Processing with Heterogeneous Annotations

15 0.57930321 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

16 0.5790776 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

17 0.57840836 182 acl-2012-Spice it up? Mining Refinements to Online Instructions from User Generated Content

18 0.57833922 193 acl-2012-Text-level Discourse Parsing with Rich Linguistic Features

19 0.57796252 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

20 0.57795864 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics