acl acl2013 acl2013-34 knowledge-graph by maker-knowledge-mining

34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection


Source: pdf

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.

Reference: text


Summary: the most important sentences generated by the tfidf model
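The exact scoring pipeline behind this page is undocumented; as a rough, hypothetical sketch of how such sentence scores can be produced, one can rank sentences by the length-normalized sum of their tokens' tf-idf weights against a background collection (all names, tokenization, and normalization below are assumptions):

```python
# Hypothetical sketch of tf-idf sentence scoring; the pipeline behind this
# page is undocumented, so tokenization and normalization are assumptions.
import math
from collections import Counter

def tfidf_sentence_scores(sentences, background_docs):
    """Score each sentence by the length-normalized sum of token tf-idf weights."""
    n_docs = len(background_docs)
    df = Counter()  # document frequency of each term
    for doc in background_docs:
        df.update(set(doc.lower().split()))
    scores = []
    for sent in sentences:
        tokens = sent.lower().split()
        tf = Counter(tokens)
        raw = sum(tf[t] * math.log(n_docs / (1 + df[t])) for t in tf)
        scores.append(raw / max(len(tokens), 1))
    return scores
```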

sentIndex sentText sentNum sentScore

1 Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). [sent-2, score-0.418]

2 Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. [sent-3, score-0.03]

3 We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. [sent-4, score-0.033]

4 1 Introduction: Accurate word segmentation (WS) is a key component of successful language processing. [sent-6, score-0.059]

5 In particular, compound nouns pose difficulties for WS since they are productive and often consist of unknown words. [sent-8, score-0.365]

6 In Japanese, transliterated foreign compound words written in Katakana are extremely difficult to split up into components without proper lexical knowledge. [sent-9, score-0.427]

7 For example, when splitting a compound noun ブラキッシュレッド burakisshureddo, a traditional word segmenter can easily segment this as ブラキッ/シュレッド “*blacki shred”, since シュレッド shureddo “shred” is a known, frequent word. [sent-10, score-0.362]

8 … the language does not have a separate script to represent transliterated words. [sent-17, score-0.146]

9 Kaji and Kitsuregawa (2011) tackled Katakana compound splitting using backtransliteration and paraphrasing. [sent-18, score-0.375]

10 Their approach is an offline one, which focuses on creating dictionaries by extracting new words from large corpora separately before WS. [sent-19, score-0.064]

11 However, offline approaches have limitations unless the lexicon is constantly updated. [sent-20, score-0.064]

12 Moreover, their method deals only with Katakana and is not directly applicable to Chinese, since the language lacks a separate script for transliterated words. [sent-21, score-0.146]

13 Instead, we adopt an online approach, which deals with unknown words as the model analyzes the input. [sent-22, score-0.143]

14 We refer to this process of transliterating unknown words into another language and using the target LM as LM projection. [sent-24, score-0.186]
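A minimal sketch of this idea under toy data: `back_transliterate` is a placeholder for a trained transliteration model, and the English LM is reduced to a unigram log-probability table; neither is the paper's actual component.

```python
# Minimal sketch of LM projection: back-transliterate an unknown word and
# score it with a target-language (English) LM. The candidate generator and
# the LM are toy placeholders, not the paper's actual models.

def back_transliterate(katakana: str) -> list[str]:
    # A real system would generate candidates with a trained model.
    table = {"ブラキッシュ": ["blackish"], "レッド": ["red"]}
    return table.get(katakana, [])

ENGLISH_UNIGRAM_LOGP = {"blackish": -9.2, "red": -5.1}  # toy log-probabilities
EPSILON_LOGP = -20.0  # fallback when no candidate or n-gram is found

def lm_projection_score(word: str) -> float:
    """Best English-LM log-probability over back-transliteration candidates."""
    candidates = back_transliterate(word)
    if not candidates:
        return EPSILON_LOGP
    return max(ENGLISH_UNIGRAM_LOGP.get(c, EPSILON_LOGP) for c in candidates)

print(lm_projection_score("ブラキッシュ"))  # -9.2
```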

15 Since the model employs a general transliteration model and a general English LM, it achieves robust WS for unknown words. [sent-25, score-0.331]

16 To the best of our knowledge, this paper is the first to use transliteration and projected LMs in an online, seamlessly integrated fashion for WS. [sent-26, score-0.222]

17 To show the effectiveness of our approach, we test our models on a balanced Japanese corpus, an electronic-commerce domain corpus, and a balanced Chinese corpus. [sent-27, score-0.072]

18 2 Related Work: In Japanese WS, unknown words are usually dealt with in an online manner with the unknown word model, which uses heuristics [sent-29, score-0.286]

19 depending on character types (Kudo et al. [sent-31, score-0.043]

20 Nagata (1999) proposed a Japanese unknown word model which considers PoS (part of speech), a word-length model, and orthography. [sent-33, score-0.143]

21 Uchimoto et al. (2001) proposed a maximum entropy morphological analyzer robust to unknown words. [sent-35, score-0.197]

22 For offline approaches, Mori and Nagao (1996) extracted unknown words and estimated their PoS from a corpus through distributional analysis. [sent-38, score-0.207]

23 Asahara and Matsumoto (2004) built a character-based chunking model using SVM for Japanese unknown word detection. [sent-39, score-0.143]

24 They built a model to split Katakana compounds using backtransliteration and paraphrasing mined from large corpora. [sent-41, score-0.17]

25 Nakazawa et al. (2005) took a similar approach, using a Ja-En dictionary to translate compound components and checking their occurrence in an English corpus. [sent-43, score-0.234]

26 Correct splitting of compound nouns has a positive effect on MT (Koehn and Knight, 2003) and IR (Braschler and Ripplinger, 2004) . [sent-45, score-0.313]

27 … where compounds may not be explicitly split by whitespaces. [sent-47, score-0.092]

28 Koehn and Knight (2003) tackled the splitting problem in German by using word statistics in a monolingual corpus. [sent-48, score-0.121]

29 They also used information on whether translations of compound parts appear in a German-English bilingual corpus. [sent-49, score-0.176]

30 Lehal (2010) used Urdu-Devnagri transliteration and a Hindi corpus for handling the space omission problem in Urdu compound words. [sent-50, score-0.404]

31 Here, wi and wi−1 denote the current and previous word in question, together with the level-j PoS tags assigned to them. [sent-66, score-0.173]

32 l(w) and c(w) are the length and the set of character types of word w. [sent-67, score-0.043]

33 If there is a substring for which no dictionary entries are found, the unknown word model is invoked. [sent-68, score-0.201]

34 In Japanese, our unknown word model relies on heuristics based on character types and word length to generate word nodes, similar to that of MeCab (Kudo et al. [sent-69, score-0.186]

35 In Chinese, we aggregate consecutive sequences of 1 to 4 characters and add them as “n (common noun)”, “ns (place name)”, “nr (personal name)”, and “nz (other proper noun)” nodes, since most of the unknown words in Chinese are proper nouns. [sent-71, score-0.293]
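A sketch of that node-generation step; the (start, surface, PoS) tuple representation is an assumption for illustration, not the authors' data structure.

```python
# Sketch of Chinese unknown-word node generation: propose every consecutive
# 1-to-4-character substring as a candidate node with each proper-noun tag.
UNKNOWN_POS_TAGS = ["n", "ns", "nr", "nz"]  # common / place / personal / other proper noun

def unknown_word_nodes(sentence: str, max_len: int = 4):
    nodes = []  # each node: (start_position, surface_form, pos_tag)
    for start in range(len(sentence)):
        for length in range(1, min(max_len, len(sentence) - start) + 1):
            surface = sentence[start:start + length]
            for pos in UNKNOWN_POS_TAGS:
                nodes.append((start, surface, pos))
    return nodes
```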

36 For other character types, a single node with PoS “w (others) ” is created. [sent-73, score-0.075]

37 1 The Japanese dictionary and the corpus we used have 6 levels of PoS tag hierarchy, while the Chinese ones have only one level, which is why some of the PoS features are not included in Chinese. [sent-74, score-0.058]

38 As character types, Hiragana (JA), Katakana (JA), Latin alphabet, Number, Chinese characters, and Others are distinguished. [sent-75, score-0.043]
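A sketch of the character-type function behind c(w); the Unicode ranges are approximations chosen for illustration, not the authors' exact definition.

```python
# Sketch of the character-type distinction; Unicode ranges are approximate.
def char_type(ch: str) -> str:
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if ch.isascii() and ch.isalpha():
        return "Latin alphabet"
    if ch.isdigit():
        return "Number"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Chinese character"
    return "Others"

def char_type_set(word: str) -> set:
    """c(w): the set of character types appearing in word w."""
    return {char_type(ch) for ch in word}

print(char_type_set("ブラキッシュレッド"))  # {'Katakana'}
```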

39 [Figure 1: word lattice for the input 大人気色ブラキッシュレッド “very popular color blackish red”.] [sent-77, score-0.046]

40 Here the empirical probabilities p(wi) and p(wi−1, wi) are computed from the source language corpus. [sent-84, score-0.033]
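A minimal sketch of these empirical estimates from a pre-segmented source-language corpus; smoothing and the wiring into log-probability features are omitted.

```python
# Sketch of the empirical unigram p(w_i) and bigram p(w_{i-1}, w_i)
# estimates from a segmented source-language corpus.
from collections import Counter

def empirical_probs(segmented_corpus):
    """segmented_corpus: iterable of sentences, each a list of words."""
    uni, bi = Counter(), Counter()
    for words in segmented_corpus:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    n_uni = max(sum(uni.values()), 1)
    n_bi = max(sum(bi.values()), 1)
    p_uni = {w: c / n_uni for w, c in uni.items()}
    p_bi = {pair: c / n_bi for pair, c in bi.items()}
    return p_uni, p_bi
```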

41 In Japanese, we applied this source language augmentation only to Katakana words. [sent-85, score-0.088]

42 Language Model Projection: As we mentioned in Section 2, English LM knowledge helps split transliterated compounds. [sent-88, score-0.176]

43 For example, Feature 21 is set to φ1LMP(“blackish”) for node (a), to φ1LMP(“red”) for node (b), and Feature 22 is set to φ2LMP(“blackish”, “red”) for edge (c) in Figure 1. [sent-92, score-0.064]

44 If no transliterations were generated, or the n-grams do not appear in the English corpus, a small frequency ε is assumed. [sent-93, score-0.035]
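A sketch of that feature assignment with toy English n-gram counts; the φ values are modeled here as log frequencies with the small-frequency ε fallback, and all tables and names are illustrative.

```python
# Sketch of the projected-LM features: phi1^LMP on nodes from English
# unigram counts, phi2^LMP on edges from bigram counts, with a small
# frequency epsilon for unseen n-grams. Counts are toy values.
import math

EN_UNIGRAM_COUNTS = {"blackish": 2000, "red": 5000000}
EN_BIGRAM_COUNTS = {("blackish", "red"): 150}
EPSILON = 1e-9  # small frequency assumed for unseen n-grams

def phi1_lmp(candidate: str) -> float:
    return math.log(EN_UNIGRAM_COUNTS.get(candidate, EPSILON))

def phi2_lmp(prev_candidate: str, candidate: str) -> float:
    return math.log(EN_BIGRAM_COUNTS.get((prev_candidate, candidate), EPSILON))

# Feature 21 on nodes (a) and (b), Feature 22 on edge (c) in the example:
print(phi1_lmp("blackish"), phi1_lmp("red"), phi2_lmp("blackish", "red"))
```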

45 Finally, the created edges are traversed from EOS, and associated original nodes are chosen as the WS result. [sent-94, score-0.091]

46 In Figure 1, the bold edges are traversed at the final step, and the corresponding nodes “大 - 人気 - 色 - ブラキッシュ - レッド” are chosen as the final WS result. [sent-95, score-0.091]
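A sketch of that final traversal, assuming each lattice node carries a back-pointer to its best predecessor set during decoding; the Node class is illustrative, not the authors' implementation.

```python
# Sketch of the final backtrace: follow best-predecessor pointers from EOS
# back to BOS and emit the chosen surface forms in order.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    surface: str
    best_prev: Optional["Node"] = None  # set during Viterbi decoding

def backtrace(eos: Node) -> list:
    words = []
    node = eos.best_prev  # skip the EOS marker itself
    while node is not None and node.best_prev is not None:  # stop before BOS
        words.append(node.surface)
        node = node.best_prev
    return list(reversed(words))

bos = Node("BOS")
w1 = Node("ブラキッシュ", best_prev=bos)
w2 = Node("レッド", best_prev=w1)
eos = Node("EOS", best_prev=w2)
print(backtrace(eos))  # ['ブラキッシュ', 'レッド']
```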

47 For Japanese, we only expand and project Katakana noun nodes (whether they are known or unknown words) since transliterated words are almost always written in Katakana. [sent-96, score-0.378]

48 For Chinese, only “ns (place name)”, “nr (personal name)”, and “nz (other proper noun)” nodes whose surface form is more than one character long are transliterated. [sent-97, score-0.123]

49 5 Transliteration: For transliterating Japanese/Chinese words back to English, we adopted the Joint Source Channel (JSC) Model (Li et al. [sent-99, score-0.043]

50 The JSC model, given an input of source word s and target word t, defines the transliteration probability based on transliteration units (TUs) u_i = ⟨s_i, t_i⟩ as: P_JSC(⟨s, t⟩) = ∏_{i=1}^{f} P(u_i | u_{i−n+1}, ..., u_{i−1}), [sent-103, score-0.467]

51 where f is the number of TUs in a given source / target word pair. [sent-106, score-0.033]

52 TUs are atomic pair units of source / target words, such as “la/ラ” and “ish/ッシュ”. [sent-107, score-0.033]
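Under this definition, a minimal sketch of the TU n-gram product; `tu_ngram_prob` stands in for a trained model, and the segmentation into TUs is assumed to be given.

```python
# Sketch of the JSC transliteration probability: the product of TU n-gram
# probabilities over a given TU segmentation of a source/target word pair.
import math

def jsc_log_prob(tus, tu_ngram_prob, n=3):
    """tus: list of TU pairs, e.g. [("la", "ラ"), ("ish", "ッシュ")].
    tu_ngram_prob(tu, history) -> P(u_i | u_{i-n+1}, ..., u_{i-1})."""
    logp = 0.0
    for i, tu in enumerate(tus):
        history = tuple(tus[max(0, i - n + 1):i])  # up to n-1 preceding TUs
        logp += math.log(tu_ngram_prob(tu, history))
    return logp

# Toy usage with a uniform placeholder model:
print(jsc_log_prob([("bla", "ブラ"), ("ckish", "キッシュ")], lambda tu, h: 0.1))
```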

53 In order to generate transliteration candidates, we used a stack decoder described in (Hagiwara and Sekine, 2012) . [sent-109, score-0.188]
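The decoder of Hagiwara and Sekine (2012) is not reproduced here; as a rough sketch of the general idea, a beam search can extend partial hypotheses left to right over the source string using a TU table (toy entries and scores below are illustrative only).

```python
# Rough sketch of stack-decoder-style candidate generation, not the decoder
# of Hagiwara and Sekine (2012): extend partial hypotheses left to right
# over the source string using a toy TU table with log-probabilities.
import heapq

TU_TABLE = {"ブラ": [("bla", -1.0)], "キッシュ": [("ckish", -1.5)]}

def generate_candidates(source: str, beam: int = 5):
    hyps = [(0.0, 0, "")]  # (negative log-prob, source position, target prefix)
    finished = []
    while hyps:
        next_hyps = []
        for neg, pos, tgt in hyps:
            if pos == len(source):
                finished.append((neg, tgt))
                continue
            for length in range(1, len(source) - pos + 1):
                for tgt_tu, logp in TU_TABLE.get(source[pos:pos + length], []):
                    next_hyps.append((neg - logp, pos + length, tgt + tgt_tu))
        hyps = heapq.nsmallest(beam, next_hyps)
    return [tgt for _, tgt in sorted(finished)]

print(generate_candidates("ブラキッシュ"))  # ['blackish']
```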

54 Note that one could also adopt other generative / discriminative transliteration models, such as (Jiampojamarn et al. [sent-117, score-0.188]

55 3 We only allow TUs whose length is at most 3, both on the source and the target side. [sent-120, score-0.033]

56 Therefore, we can regard this performance as a lower bound of the transliteration module performance we used for WS. [sent-123, score-0.188]

57 We additionally evaluated the performance limited to Katakana (JA) or proper nouns (ZH) in order to see the impact of compound splitting. [sent-142, score-0.297]

58 Japanese WS Results: We compared the baseline model, the model augmented with the source language LM (+LM-S), and the projected model (+LM-P). [sent-145, score-0.033]

59 …0 (Kurohashi and Nagao, 1994). Since the dictionary is not explicitly annotated with PoS tags, we first took the intersection of the training corpus and the dictionary words, and assigned all possible PoS tags to the words which appeared in the corpus. [sent-149, score-0.116]

60 We observed slight improvement by incorporating the source LM, and observed a 0. [sent-155, score-0.033]

61 +LM-P also improved compounds whose components do not appear in the training data, such as *ルーカスフィルム ruukasufirumu to ルーカス/フィルム “Lucas Film. [sent-171, score-0.062]

62 One type of error can be attributed to non-English words such as スノコベッド sunokobeddo, which is a compound of the Japanese word スノコ sunoko “duckboard” and the English word ベッド beddo “bed. [sent-174, score-0.176]

63 [Table 4: Chinese WS Performance (%)] … performance, which may be because one cannot limit where the source LM features are applied. [sent-213, score-0.033]

64 However, the overall F-measure increase and the proper-noun F-measure decrease suggest that the effect of LM projection is not limited to proper nouns but also promotes finer granularity, as we observed an increase in proper-noun recall. [sent-217, score-0.432]

65 One of the reasons that make Chinese LM projection difficult is that the corpus allows single tokens with a transliterated part and Chinese affixes, e. [sent-218, score-0.225]

66 Proper noun performance for the Stanford segmenter is not shown since it does not assign PoS tags. [sent-223, score-0.095]

67 … appropriate transliterations: 维娜斯 weinasi “Venus,” spelled 维纳斯 weinasi. [sent-224, score-0.035]

68 The concept of LM projection is general enough to be used for splitting other compound nouns. [sent-228, score-0.346]

69 For example, for Japanese personal names such as 仲里依紗 Naka Riisa, if we could successfully estimate the pronunciation Nakariisa and look up possible splits in an English LM, one would expect to find the correct segmentation Naka Riisa, because the first and/or last name are mentioned in the LM. [sent-229, score-0.067]
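A sketch of that idea with toy counts: enumerate binary splits of the romanized surface and keep those whose halves appear in an English (name) unigram table; the table and scoring are illustrative assumptions.

```python
# Sketch of LM-projection-style name splitting: enumerate binary splits of
# a romanized name and score splits whose parts the English LM has seen.
EN_NAME_COUNTS = {"naka": 1200, "riisa": 40}  # toy unigram counts

def candidate_splits(romanized: str):
    splits = []
    for i in range(1, len(romanized)):
        left, right = romanized[:i], romanized[i:]
        if left in EN_NAME_COUNTS and right in EN_NAME_COUNTS:
            splits.append((left, right, EN_NAME_COUNTS[left] * EN_NAME_COUNTS[right]))
    return sorted(splits, key=lambda s: -s[2])

print(candidate_splits("nakariisa"))  # [('naka', 'riisa', 48000)]
```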

70 Seeking broader applications of LM projection is future work. [sent-230, score-0.079]

71 How effective is stemming and decompounding for German text retrieval? [sent-241, score-0.03]

72 The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese) . [sent-249, score-0.112]

73 Splitting noun compounds via monolingual and bilingual paraphrasing: A study on Japanese Katakana words. [sent-269, score-0.891]

74 A word segmentation system for handling the space omission problem in Urdu script. [sent-285, score-0.135]

75 A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context. [sent-320, score-0.143]

76 Automatic acquisition of a basic Katakana lexicon from a given corpus. [sent-324, score-0.443]

77 Chinese segmentation and new word detection using conditional random fields. [sent-336, score-0.059]

78 Morphological analysis based on a maximum entropy model — an approach to the unknown word problem — (in Japanese) . [sent-350, score-0.143]

79 Joint word segmentation and POS tagging using a single perceptron. [sent-354, score-0.114]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('katakana', 0.443), ('japanese', 0.345), ('ws', 0.293), ('lm', 0.213), ('transliteration', 0.188), ('compound', 0.176), ('wi', 0.173), ('bccwj', 0.161), ('transliterated', 0.146), ('chinese', 0.145), ('unknown', 0.143), ('blacki', 0.104), ('rakuten', 0.104), ('wer', 0.102), ('tus', 0.092), ('kudo', 0.091), ('splitting', 0.091), ('jiampojamarn', 0.085), ('mecab', 0.085), ('ja', 0.08), ('projection', 0.079), ('backtransliteration', 0.078), ('blackish', 0.078), ('jsc', 0.078), ('shred', 0.078), ('unidic', 0.078), ('hagiwara', 0.076), ('proper', 0.075), ('kytea', 0.069), ('sekine', 0.066), ('offline', 0.064), ('ogura', 0.064), ('compounds', 0.062), ('satoshi', 0.061), ('masato', 0.06), ('segmentation', 0.059), ('ui', 0.058), ('dictionary', 0.058), ('logp', 0.056), ('lattice', 0.055), ('augmentation', 0.055), ('kaji', 0.055), ('pos', 0.055), ('morphological', 0.054), ('segmenter', 0.054), ('hanae', 0.052), ('lehal', 0.052), ('mcenery', 0.052), ('naka', 0.052), ('riisa', 0.052), ('uchimoto', 0.049), ('nodes', 0.048), ('red', 0.046), ('braschler', 0.046), ('nagao', 0.046), ('nakazawa', 0.046), ('nile', 0.046), ('lms', 0.046), ('nouns', 0.046), ('character', 0.043), ('nz', 0.043), ('sittichai', 0.043), ('juman', 0.043), ('transliterating', 0.043), ('traversed', 0.043), ('haizhou', 0.042), ('noun', 0.041), ('asahara', 0.04), ('kitsuregawa', 0.04), ('omission', 0.04), ('zh', 0.04), ('mori', 0.038), ('shinsuke', 0.038), ('koehn', 0.037), ('finch', 0.036), ('urdu', 0.036), ('tseng', 0.036), ('news', 0.036), ('balanced', 0.036), ('personal', 0.035), ('knight', 0.035), ('transliterations', 0.035), ('seamlessly', 0.034), ('kiyotaka', 0.034), ('grzegorz', 0.033), ('neubig', 0.033), ('source', 0.033), ('name', 0.032), ('kumaran', 0.032), ('kurohashi', 0.032), ('node', 0.032), ('sadao', 0.031), ('den', 0.031), ('makoto', 0.03), ('tackled', 0.03), ('split', 0.03), ('english', 0.03), ('german', 0.03), ('yuji', 0.03), ('li', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.

2 0.18847585 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

Author: Tingting Li ; Tiejun Zhao ; Andrew Finch ; Chunyue Zhang

Abstract: Machine Transliteration is an essential task for many NLP applications. However, names and loan words typically originate from various languages, obey different transliteration rules, and therefore may benefit from being modeled independently. Recently, transliteration models based on Bayesian learning have overcome issues with over-fitting allowing for many-to-many alignment in the training of transliteration models. We propose a novel coupled Dirichlet process mixture model (cDPMM) that simultaneously clusters and bilingually aligns transliteration data within a single unified model. The unified model decomposes into two classes of non-parametric Bayesian component models: a Dirichlet process mixture model for clustering, and a set of multinomial Dirichlet process models that perform bilingual alignment independently for each cluster. The experimental results show that our method considerably outperforms conventional alignment models.

3 0.14093295 80 acl-2013-Chinese Parsing Exploiting Characters

Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu

Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.

4 0.13518591 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

Author: Yuki Arase ; Ming Zhou

Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.

5 0.13341896 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources

Author: Sumire Uematsu ; Takuya Matsuzaki ; Hiroki Hanaoka ; Yusuke Miyao ; Hideki Mima

Abstract: This paper describes a method of inducing wide-coverage CCG resources for Japanese. While deep parsers with corpus-induced grammars have been emerging for some languages, those for Japanese have not been widely studied, mainly because most Japanese syntactic resources are dependency-based. Our method first integrates multiple dependency-based corpora into phrase structure trees and then converts the trees into CCG derivations. The method is empirically evaluated in terms of the coverage of the obtained lexicon and the accuracy of parsing.

6 0.1257167 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

7 0.12149275 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners

8 0.11420554 97 acl-2013-Cross-lingual Projections between Languages from Different Families

9 0.11392299 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

10 0.11299387 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

11 0.11045623 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

12 0.10631019 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

13 0.098067164 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

14 0.089894854 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

15 0.088199921 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation

16 0.085977428 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation

17 0.084986046 255 acl-2013-Name-aware Machine Translation

18 0.077995919 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

19 0.07699883 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

20 0.076653421 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, -0.09), (2, -0.062), (3, 0.047), (4, 0.105), (5, -0.061), (6, -0.084), (7, 0.034), (8, 0.019), (9, 0.01), (10, -0.005), (11, -0.046), (12, 0.039), (13, -0.004), (14, -0.087), (15, 0.003), (16, 0.051), (17, -0.045), (18, -0.043), (19, 0.027), (20, 0.022), (21, -0.063), (22, 0.05), (23, 0.026), (24, 0.105), (25, 0.088), (26, 0.031), (27, -0.001), (28, 0.021), (29, -0.079), (30, 0.032), (31, -0.015), (32, -0.018), (33, -0.107), (34, -0.014), (35, 0.005), (36, 0.014), (37, 0.148), (38, -0.061), (39, -0.072), (40, -0.137), (41, -0.087), (42, 0.095), (43, 0.001), (44, -0.05), (45, 0.016), (46, -0.171), (47, 0.096), (48, 0.056), (49, -0.006)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91913867 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.

2 0.60630286 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.

3 0.60049933 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration

Author: Phillippe Langlais

Abstract: Analogical learning over strings is a holistic model that has been investigated by a few authors as a means to map forms of a source language to forms of a target language. In this study, we revisit this learning paradigm and apply it to the transliteration task. We show that alone, it performs worse than a statistical phrase-based machine translation engine, but the combination of both approaches outperforms each one taken separately, demonstrating the usefulness of the information captured by a so-called formal analogy.

4 0.53885192 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

Author: Volkan Cirik

Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.

5 0.53711408 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue

Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.

6 0.52937573 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

7 0.52741635 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

8 0.5198707 235 acl-2013-Machine Translation Detection from Monolingual Web-Text

9 0.50856179 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation

10 0.50396323 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

11 0.50083202 128 acl-2013-Does Korean defeat phonotactic word segmentation?

12 0.49555475 80 acl-2013-Chinese Parsing Exploiting Characters

13 0.49180281 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners

14 0.48360342 97 acl-2013-Cross-lingual Projections between Languages from Different Families

15 0.48179227 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

16 0.46256524 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners

17 0.45041859 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison

18 0.43652844 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

19 0.4329851 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation

20 0.42729849 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (6, 0.026), (11, 0.103), (12, 0.261), (14, 0.012), (15, 0.02), (16, 0.014), (24, 0.044), (26, 0.064), (28, 0.017), (35, 0.063), (42, 0.051), (48, 0.042), (70, 0.033), (88, 0.029), (90, 0.022), (95, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8035301 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages

Author: Lian Tze Lim ; Lay-Ki Soon ; Tek Yong Lim ; Enya Kong Tang ; Bali Ranaivo-Malancon

Abstract: Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on an English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.

same-paper 2 0.75904602 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

Author: Masato Hagiwara ; Satoshi Sekine

Abstract: Transliterated compound nouns not separated by whitespace pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvement over the state of the art, reducing errors by 16% in Japanese.

3 0.71619785 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora

Author: Dhouha Bouamor ; Nasredine Semmar ; Pierre Zweigenbaum

Abstract: This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce the use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specialized French-English comparable corpora, empirical experimental results show that our method improves the results obtained by two state-of-the-art approaches.

4 0.58189732 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning

Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation. 1

5 0.57787544 154 acl-2013-Extracting bilingual terminologies from comparable corpora

Author: Ahmet Aker ; Monica Paramita ; Rob Gaizauskas

Abstract: In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. The performance of our classifier reaches the 100% precision level for many language pairs. We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora. The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.

6 0.57302082 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

7 0.57133651 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

8 0.57016218 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation

9 0.56956536 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

10 0.56693089 245 acl-2013-Modeling Human Inference Process for Textual Entailment Recognition

11 0.56616598 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

12 0.56602335 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

13 0.56591409 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

14 0.56540096 333 acl-2013-Summarization Through Submodularity and Dispersion

15 0.56392163 318 acl-2013-Sentiment Relevance

16 0.56328869 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction

17 0.56301075 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning

18 0.56286961 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation

19 0.56260073 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction

20 0.56229329 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting