acl acl2013 acl2013-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvements over the state-of-the-art, reducing errors by 16% in Japanese.
Reference: text
sentIndex sentText sentNum sentScore
1 hagiwara, satoshi Abstract Transliterated compound nouns not separated by whitespaces pose difficulty for word segmentation (WS). [sent-2, score-0.418]
2 Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. [sent-3, score-0.03]
3 We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. [sent-4, score-0.033]
4 1 Introduction Accurate word segmentation (WS) is a key component of successful language processing. [sent-6, score-0.059]
5 In particular, compound nouns pose difficulties for WS since they are productive and often consist of unknown words. [sent-8, score-0.365]
6 In Japanese, transliterated foreign compound words written in Katakana are extremely difficult to split up into components without proper lexical knowledge. [sent-9, score-0.427]
7 For example, when splitting the compound noun ブラキッシュレッド burakisshureddo, a traditional word segmenter can easily segment it as ブラキッ/シュレッド “*blacki shred” since シュレッド shureddo “shred” is a known, frequent word. [sent-10, score-0.362]
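To make this failure mode concrete, here is a toy longest-match segmenter over a fixed lexicon (a minimal sketch, not the segmenter evaluated in the paper; the two-word lexicon is an illustrative assumption). Because シュレッド “shred” is in the lexicon and ブラキッシュ is not, the known word attracts the split and the correct boundary is never considered:

```python
def greedy_segment(text, lexicon, max_len=10):
    """Toy longest-match segmentation over a fixed lexicon; characters not
    covered by any entry fall back to single-character tokens."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in lexicon:
                out.append(text[i:i + length])
                i += length
                break
        else:  # no lexicon entry starts at this position
            out.append(text[i])
            i += 1
    return out

# greedy_segment("ブラキッシュレッド", {"シュレッド", "レッド"})
#   -> ['ブ', 'ラ', 'キ', 'ッ', 'シュレッド']  (never ブラキッシュ/レッド)
```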
8 … the language does not have a separate script to represent transliterated words. [sent-17, score-0.146]
9 Kaji and Kitsuregawa (2011) tackled Katakana compound splitting using backtransliteration and paraphrasing. [sent-18, score-0.375]
10 Their approach is an offline one, which focuses on creating dictionaries by extracting new words from large corpora separately, before WS. [sent-19, score-0.064]
11 However, offline approaches have limitations unless the lexicon is constantly updated. [sent-20, score-0.064]
12 Moreover, their method deals only with Katakana and is not directly applicable to Chinese, since the language lacks a separate script for transliterated words. [sent-21, score-0.146]
13 Instead, we adopt an online approach, which deals with unknown words as the model analyzes the input. [sent-22, score-0.143]
14 We refer to this process of transliterating unknown words into another language and using the target LM as LM projection. [sent-24, score-0.186]
15 Since the model employs a general transliteration model and a general English LM, it achieves robust WS for unknown words. [sent-25, score-0.331]
16 To the best of our knowledge, this paper is the first to use transliteration and projected LMs in an online, seamlessly integrated fashion for WS. [sent-26, score-0.222]
17 To show the effectiveness of our approach, we test our models on a Japanese balanced corpus, an electronic commerce domain corpus, and a balanced Chinese corpus. [sent-27, score-0.072]
18 2 Related Work In Japanese WS, unknown words are usually dealt with in an online manner with the unknown word model, which uses heuristics [sent-29, score-0.286]
19 depending on character types (Kudo et al. [sent-31, score-0.043]
20 Nagata (1999) proposed a Japanese unknown word model which considers PoS (part of speech), a word length model, and orthography. [sent-33, score-0.143]
21 Uchimoto et al. (2001) proposed a maximum entropy morphological analyzer robust to unknown words. [sent-35, score-0.197]
22 For offline approaches, Mori and Nagao (1996) extracted unknown words and estimated their PoS from a corpus through distributional analysis. [sent-38, score-0.207]
23 Asahara and Matsumoto (2004) built a character-based chunking model using SVM for Japanese unknown word detection. [sent-39, score-0.143]
24 Kaji and Kitsuregawa (2011) built a model to split Katakana compounds using backtransliteration and paraphrasing mined from large corpora. [sent-41, score-0.17]
25 Nakazawa et al. (2005) is a similar approach, using a Ja-En dictionary to translate compound components and check their occurrence in an English corpus. [sent-43, score-0.234]
26 Correct splitting of compound nouns has a positive effect on MT (Koehn and Knight, 2003) and IR (Braschler and Ripplinger, 2004) . [sent-45, score-0.313]
27 … where compounds may not be explicitly split by whitespaces. [sent-47, score-0.092]
28 Koehn and Knight (2003) tackled the splitting problem in German by using word statistics in a monolingual corpus. [sent-48, score-0.121]
29 They also used information on whether translations of compound parts appear in a German-English bilingual corpus. [sent-49, score-0.176]
30 Lehal (2010) used Urdu-Devnagri transliteration and a Hindi corpus for handling the space omission problem in Urdu compound words. [sent-50, score-0.404]
31 Here, wi and wi−1 denote the current and previous words in question; the level-j PoS tags assigned to them are also used as features. [sent-66, score-0.173]
32 l(w) and c(w) are the length and the set of character types of word w. [sent-67, score-0.043]
33 If there is a substring for which no dictionary entries are found, the unknown word model is invoked. [sent-68, score-0.201]
34 In Japanese, our unknown word model relies on heuristics based on character types and word length to generate word nodes, similar to that of MeCab (Kudo et al. [sent-69, score-0.186]
35 In Chinese, we aggregate 1 to 4 consecutive characters and add them as “n (common noun)”, “ns (place name)”, “nr (personal name)”, and “nz (other proper noun)” nodes, since most of the unknown words in Chinese are proper nouns. [sent-71, score-0.293]
36 For other character types, a single node with PoS “w (others)” is created. [sent-73, score-0.075]
37 1The Japanese dictionary and the corpus we used have 6 levels of PoS tag hierarchy, while the Chinese ones have only one level, which is why some of the PoS features are not included in Chinese. [sent-74, score-0.058]
38 As character types, Hiragana (JA), Katakana (JA), Latin alphabet, Number, Chinese characters, and Others are distinguished. [sent-75, score-0.043]
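A minimal sketch of the unknown-word node generation described above; the character-type buckets and the Chinese PoS labels follow the text, but the helper names and the type-detection heuristics are assumptions, not the actual implementation:

```python
import unicodedata

PROPER_NOUN_TAGS = ("n", "ns", "nr", "nz")  # common / place / person / other proper noun

def char_type(ch: str) -> str:
    """Coarse character types, mirroring the distinctions listed above."""
    name = unicodedata.name(ch, "")
    if "HIRAGANA" in name:
        return "hiragana"
    if "KATAKANA" in name:
        return "katakana"
    if "CJK UNIFIED" in name:
        return "han"
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"

def chinese_unknown_nodes(text: str, start: int):
    """Aggregate 1 to 4 consecutive characters and propose one candidate
    word node per proper-noun PoS tag."""
    nodes = []
    for length in range(1, 5):
        if start + length > len(text):
            break
        surface = text[start:start + length]
        nodes.extend((surface, tag) for tag in PROPER_NOUN_TAGS)
    return nodes
```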
[Figure 1: Example word lattice for the input 大人気色ブラキッシュレッド “very popular color blackish red”, from BOS, with transliteration / LM-projection nodes for ブラキッシュ “blackish” and レッド “red”.] [sent-77, score-0.046]
40 Here, the empirical probabilities p(wi) and p(wi−1, wi) are computed from the source language corpus. [sent-84, score-0.033]
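As a sketch of how such source-LM features could be computed (only the two empirical probabilities come from the text; the smoothing constant and function names are assumptions):

```python
import math
from collections import Counter

def source_lm_features(w_prev, w_cur, unigrams: Counter, bigrams: Counter, eps=1e-9):
    """log p(w_i) and log p(w_{i-1}, w_i) estimated from raw corpus counts;
    a small constant eps stands in for unseen events."""
    n = sum(unigrams.values()) or 1
    phi_uni = math.log(max(unigrams[w_cur] / n, eps))
    phi_bi = math.log(max(bigrams[(w_prev, w_cur)] / n, eps))
    return phi_uni, phi_bi

# unigrams = Counter(corpus_words)
# bigrams = Counter(zip(corpus_words, corpus_words[1:]))
```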
41 In Japanese, we applied this source language augmentation only to Katakana words. [sent-85, score-0.088]
42 4.1 Language Model Projection As we mentioned in Section 2, English LM knowledge helps split transliterated compounds. [sent-88, score-0.176]
43 For example, Feature 21 is set to φ1LMP(“blackish”) for node (a), to φ1LMP(“red”) for node (b), and Feature 22 is set to φ2LMP(“blackish”, “red”) for edge (c) in Figure 1. [sent-92, score-0.064]
44 If no transliterations were generated, or the n-grams do not appear in the English corpus, a small frequency ε is assumed. [sent-93, score-0.035]
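A hedged sketch of the projection features φ1LMP / φ2LMP with the ε fallback described above; `back_transliterate` and `english_ngrams` are assumed interfaces, not the paper's API:

```python
import math
from itertools import product

def lm_projection_feature(words, back_transliterate, english_ngrams, eps=1e-8):
    """Back-transliterate each source word, look the resulting English n-gram
    up in English LM counts, and fall back to a small frequency eps when no
    transliteration is generated or the n-gram is unseen."""
    candidates = [back_transliterate(w) for w in words]  # one list per word
    if not all(candidates):
        return math.log(eps)
    best = eps
    for combo in product(*candidates):  # best combination of transliterations
        best = max(best, english_ngrams.get(tuple(combo), eps))
    return math.log(best)

# lm_projection_feature(["ブラキッシュ", "レッド"], bt, counts)
#   corresponds to φ2LMP("blackish", "red") for an edge like (c) in Figure 1.
```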
45 Finally, the created edges are traversed from EOS, and associated original nodes are chosen as the WS result. [sent-94, score-0.091]
46 In Figure 1, the bold edges are traversed at the final step, and the corresponding nodes “大 - 人気 - 色 - ブラキッシュ- レッド” are chosen as the final WS result. [sent-95, score-0.091]
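The final traversal is a standard Viterbi backtrace; a minimal sketch, assuming a `best_prev` table filled by the forward pass over the lattice:

```python
def backtrace(best_prev, eos="EOS", bos="BOS"):
    """Follow the best-scoring edges from EOS back to BOS and return the
    chosen word nodes left to right."""
    path, node = [], best_prev[eos]
    while node != bos:
        path.append(node)
        node = best_prev[node]
    path.reverse()
    return path

# backtrace({"EOS": "レッド", "レッド": "ブラキッシュ", "ブラキッシュ": "色",
#            "色": "人気", "人気": "大", "大": "BOS"})
#   -> ['大', '人気', '色', 'ブラキッシュ', 'レッド']
```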
47 For Japanese, we only expand and project Katakana noun nodes (whether they are known or unknown words) since transliterated words are almost always written in Katakana. [sent-96, score-0.378]
48 For Chinese, only “ns (place name)”, “nr (personal name)”, and “nz (other proper noun)” nodes whose surface form is more than 1 character long are transliterated. [sent-97, score-0.123]
5 Transliteration For transliterating Japanese/Chinese words back to English, we adopted the Joint Source Channel (JSC) Model (Li et al. [sent-99, score-0.043]
The JSC model, given an input of source word s and target word t, defines the transliteration probability based on transliteration units (TUs) $u_i = \langle s_i, t_i \rangle$ as $P_{\mathrm{JSC}}(\langle s, t \rangle) = \prod_{i=1}^{f} P(u_i \mid u_{i-n+1}, \ldots, u_{i-1})$, [sent-103, score-0.467]
where f is the number of TUs in a given source / target word pair. [sent-106, score-0.033]
52 TUs are atomic pair units of source / target words, such as “la/ラ” and “ish/ッシュ”. [sent-107, score-0.033]
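The TU n-gram product above translates directly into a log-domain sum; a sketch, where `tu_cond_prob` is an assumed lookup for the conditional TU n-gram probability:

```python
import math

def jsc_log_prob(tus, tu_cond_prob, n=3):
    """log P_JSC(<s, t>) = sum_i log P(u_i | u_{i-n+1}, ..., u_{i-1}),
    with each TU represented as a (source, target) pair."""
    total = 0.0
    for i, u in enumerate(tus):
        history = tuple(tus[max(0, i - n + 1):i])  # up to n-1 preceding TUs
        total += math.log(tu_cond_prob(u, history))
    return total

# e.g. jsc_log_prob([("la", "ラ"), ("ish", "ッシュ")], tu_cond_prob)
```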
53 In order to generate transliteration candidates, we used a stack decoder described in (Hagiwara and Sekine, 2012). [sent-109, score-0.188]
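A hedged sketch of such a stack (beam) decoder: hypotheses covering the same number of source characters share a stack, and each is extended with TUs whose source side matches the upcoming input (at most 3 characters, per the footnote below). `tu_table` and `tu_cond_logp` are assumed interfaces, not the cited decoder's API:

```python
import heapq

def stack_decode(source, tu_table, tu_cond_logp, beam=5, max_tu_len=3, n=3):
    """Generate transliteration candidates for `source` by beam search over
    transliteration-unit sequences."""
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0] = [(0.0, ())]  # (log-probability, TU sequence so far)
    for pos in range(len(source)):
        stacks[pos] = heapq.nlargest(beam, stacks[pos], key=lambda h: h[0])
        for score, tus in stacks[pos]:
            for length in range(1, max_tu_len + 1):
                if pos + length > len(source):
                    break
                src = source[pos:pos + length]
                for tgt in tu_table.get(src, ()):
                    u = (src, tgt)
                    hist = tus[-(n - 1):]  # TU n-gram history
                    stacks[pos + length].append(
                        (score + tu_cond_logp(u, hist), tus + (u,)))
    final = heapq.nlargest(beam, stacks[len(source)], key=lambda h: h[0])
    return ["".join(t for _, t in tus) for _, tus in final]
```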
2Note that one could also adopt other generative / discriminative transliteration models, such as (Jiampojamarn et al. [sent-117, score-0.188]
3We only allow TUs whose length is shorter than or equal to 3, on both the source and the target side. [sent-120, score-0.033]
Therefore, we can regard this performance as a lower bound on the performance of the transliteration module we used for WS. [sent-123, score-0.188]
57 We additionally evaluated the performance limited to Katakana (JA) or proper nouns (ZH) in order to see the impact of compound splitting. [sent-142, score-0.297]
6.2 Japanese WS Results We compared the baseline model, the augmented model with the source language (+LM-S), and the projected model (+LM-P). [sent-145, score-0.033]
JUMAN (Kurohashi and Nagao, 1994). 4Since the dictionary is not explicitly annotated with PoS tags, we first took the intersection of the training corpus and the dictionary words, and assigned all the possible PoS tags to the words which appeared in the corpus. [sent-149, score-0.116]
60 We observed slight improvement by incorporating the source LM, and observed a 0. [sent-155, score-0.033]
+LM-P also improved compounds whose components do not appear in the training data, such as *ルーカスフィルム ruukasufirumu to ルーカス/フィルム “Lucas Film. [sent-171, score-0.062]
One type of error can be attributed to non-English words such as スノコベッド sunokobeddo, which is a compound of the Japanese word スノコ sunoko “duckboard” and the English word ベッド beddo “bed. [sent-174, score-0.176]
[Table 4: Chinese WS Performance (%)] … performance, which may be because one cannot limit where the source LM features are applied. [sent-213, score-0.033]
However, the overall F-measure increase together with the proper noun F-measure decrease suggests that the effect of LM projection is not limited to proper nouns but also promotes finer granularity, since we observed an increase in proper noun recall. [sent-217, score-0.432]
One of the reasons that make Chinese LM projection difficult is that the corpus allows single tokens consisting of a transliterated part and Chinese affixes, e. [sent-218, score-0.225]
66 Proper noun performance for the Stanford segmenter is not shown since it does not assign PoS tags. [sent-223, score-0.095]
[Table legend: ours, Overall (O) and Proper Nouns (P).] …propriate transliterations 维娜斯 weinasi “Venus,” spelled 维纳斯 weinasi. [sent-224, score-0.035]
68 The concept of LM projection is general enough to be used for splitting other compound nouns. [sent-228, score-0.346]
For example, for Japanese personal names such as 仲里依紗 Naka Riisa, if we could successfully estimate the pronunciation Nakariisa and look up possible splits in an English LM, we would expect to find the correct split Naka Riisa, because the first and/or the last name are mentioned in the LM. [sent-229, score-0.067]
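A sketch of that lookup under the stated assumptions (an estimated romanization and an English LM vocabulary set; both interfaces are hypothetical):

```python
def name_splits(romanized, english_vocab):
    """Enumerate two-way splits of an estimated pronunciation and keep those
    whose halves both occur in an English LM vocabulary."""
    return [(romanized[:i], romanized[i:])
            for i in range(1, len(romanized))
            if romanized[:i] in english_vocab and romanized[i:] in english_vocab]

# name_splits("nakariisa", {"naka", "riisa"}) -> [('naka', 'riisa')]
```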
70 Seeking broader application of LM projection is a future work. [sent-230, score-0.079]
How effective is stemming and decompounding for German text retrieval? [sent-241, score-0.03]
72 The development of an electronic dictionary for morphological analysis and its application to Japanese corpus linguistics (in Japanese) . [sent-249, score-0.112]
Splitting noun compounds via monolingual and bilingual paraphrasing: A study on Japanese Katakana words. [sent-269, score-0.891]
A word segmentation system for handling space omission problem in Urdu script. [sent-285, score-0.135]
75 A part of speech estimation method for Japanese unknown words using a statistical model of morphology and context. [sent-320, score-0.143]
Automatic acquisition of basic Katakana lexicon from a given corpus. [sent-324, score-0.443]
77 Chinese segmentation and new word detection using conditional random fields. [sent-336, score-0.059]
78 Morphological analysis based on a maximum entropy model — an approach to the unknown word problem — (in Japanese) . [sent-350, score-0.143]
Joint word segmentation and POS tagging using a single perceptron. [sent-354, score-0.114]
wordName wordTfidf (topN-words)
[('katakana', 0.443), ('japanese', 0.345), ('ws', 0.293), ('lm', 0.213), ('transliteration', 0.188), ('compound', 0.176), ('wi', 0.173), ('bccwj', 0.161), ('transliterated', 0.146), ('chinese', 0.145), ('unknown', 0.143), ('blacki', 0.104), ('rakuten', 0.104), ('wer', 0.102), ('tus', 0.092), ('kudo', 0.091), ('splitting', 0.091), ('jiampojamarn', 0.085), ('mecab', 0.085), ('ja', 0.08), ('projection', 0.079), ('backtransliteration', 0.078), ('blackish', 0.078), ('jsc', 0.078), ('shred', 0.078), ('unidic', 0.078), ('hagiwara', 0.076), ('proper', 0.075), ('kytea', 0.069), ('sekine', 0.066), ('offline', 0.064), ('ogura', 0.064), ('compounds', 0.062), ('satoshi', 0.061), ('masato', 0.06), ('segmentation', 0.059), ('ui', 0.058), ('dictionary', 0.058), ('logp', 0.056), ('lattice', 0.055), ('augmentation', 0.055), ('kaji', 0.055), ('pos', 0.055), ('morphological', 0.054), ('segmenter', 0.054), ('hanae', 0.052), ('lehal', 0.052), ('mcenery', 0.052), ('naka', 0.052), ('riisa', 0.052), ('uchimoto', 0.049), ('nodes', 0.048), ('red', 0.046), ('braschler', 0.046), ('nagao', 0.046), ('nakazawa', 0.046), ('nile', 0.046), ('lms', 0.046), ('nouns', 0.046), ('character', 0.043), ('nz', 0.043), ('sittichai', 0.043), ('juman', 0.043), ('transliterating', 0.043), ('traversed', 0.043), ('haizhou', 0.042), ('noun', 0.041), ('asahara', 0.04), ('kitsuregawa', 0.04), ('omission', 0.04), ('zh', 0.04), ('mori', 0.038), ('shinsuke', 0.038), ('koehn', 0.037), ('finch', 0.036), ('urdu', 0.036), ('tseng', 0.036), ('news', 0.036), ('balanced', 0.036), ('personal', 0.035), ('knight', 0.035), ('transliterations', 0.035), ('seamlessly', 0.034), ('kiyotaka', 0.034), ('grzegorz', 0.033), ('neubig', 0.033), ('source', 0.033), ('name', 0.032), ('kumaran', 0.032), ('kurohashi', 0.032), ('node', 0.032), ('sadao', 0.031), ('den', 0.031), ('makoto', 0.03), ('tackled', 0.03), ('split', 0.03), ('english', 0.03), ('german', 0.03), ('yuji', 0.03), ('li', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvements over the state-of-the-art, reducing errors by 16% in Japanese.
2 0.18847585 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
Author: Tingting Li ; Tiejun Zhao ; Andrew Finch ; Chunyue Zhang
Abstract: Machine Transliteration is an essential task for many NLP applications. However, names and loan words typically originate from various languages, obey different transliteration rules, and therefore may benefit from being modeled independently. Recently, transliteration models based on Bayesian learning have overcome issues with over-fitting allowing for many-to-many alignment in the training of transliteration models. We propose a novel coupled Dirichlet process mixture model (cDPMM) that simultaneously clusters and bilingually aligns transliteration data within a single unified model. The unified model decomposes into two classes of non-parametric Bayesian component models: a Dirichlet process mixture model for clustering, and a set of multinomial Dirichlet process models that perform bilingual alignment independently for each cluster. The experimental results show that our method considerably outperforms conventional alignment models.
3 0.14093295 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
4 0.13518591 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
Author: Yuki Arase ; Ming Zhou
Abstract: We propose a method for automatically detecting low-quality Web-text translated by statistical machine translation (SMT) systems. We focus on the phrase salad phenomenon that is observed in existing SMT results and propose a set of computationally inexpensive features to effectively detect such machine-translated sentences from a large-scale Web-mined text. Unlike previous approaches that require bilingual data, our method uses only monolingual text as input; therefore it is applicable for refining data produced by a variety of Web-mining activities. Evaluation results show that the proposed method achieves an accuracy of 95.8% for sentences and 80.6% for text in noisy Web pages.
5 0.13341896 199 acl-2013-Integrating Multiple Dependency Corpora for Inducing Wide-coverage Japanese CCG Resources
Author: Sumire Uematsu ; Takuya Matsuzaki ; Hiroki Hanaoka ; Yusuke Miyao ; Hideki Mima
Abstract: This paper describes a method of inducing wide-coverage CCG resources for Japanese. While deep parsers with corpus-induced grammars have been emerging for some languages, those for Japanese have not been widely studied, mainly because most Japanese syntactic resources are dependency-based. Our method first integrates multiple dependency-based corpora into phrase structure trees and then converts the trees into CCG derivations. The method is empirically evaluated in terms of the coverage of the obtained lexicon and the accuracy of parsing.
6 0.1257167 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
7 0.12149275 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
8 0.11420554 97 acl-2013-Cross-lingual Projections between Languages from Different Families
9 0.11392299 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
10 0.11299387 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
11 0.11045623 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
12 0.10631019 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
13 0.098067164 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
14 0.089894854 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
15 0.088199921 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
16 0.085977428 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
17 0.084986046 255 acl-2013-Name-aware Machine Translation
18 0.077995919 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
19 0.07699883 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation
20 0.076653421 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example
topicId topicWeight
[(0, 0.179), (1, -0.09), (2, -0.062), (3, 0.047), (4, 0.105), (5, -0.061), (6, -0.084), (7, 0.034), (8, 0.019), (9, 0.01), (10, -0.005), (11, -0.046), (12, 0.039), (13, -0.004), (14, -0.087), (15, 0.003), (16, 0.051), (17, -0.045), (18, -0.043), (19, 0.027), (20, 0.022), (21, -0.063), (22, 0.05), (23, 0.026), (24, 0.105), (25, 0.088), (26, 0.031), (27, -0.001), (28, 0.021), (29, -0.079), (30, 0.032), (31, -0.015), (32, -0.018), (33, -0.107), (34, -0.014), (35, 0.005), (36, 0.014), (37, 0.148), (38, -0.061), (39, -0.072), (40, -0.137), (41, -0.087), (42, 0.095), (43, 0.001), (44, -0.05), (45, 0.016), (46, -0.171), (47, 0.096), (48, 0.056), (49, -0.006)]
simIndex simValue paperId paperTitle
same-paper 1 0.91913867 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvements over the state-of-the-art, reducing errors by 16% in Japanese.
2 0.60630286 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
Author: Phillippe Langlais
Abstract: Analogical learning over strings is a holistic model that has been investigated by a few authors as a means to map forms of a source language to forms of a target language. In this study, we revisit this learning paradigm and apply it to the transliteration task. We show that alone, it performs worse than a statistical phrase-based machine translation engine, but the combination of both approaches outperforms each one taken separately, demonstrating the usefulness of the information captured by a so-called formal analogy.
4 0.53885192 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
Author: Volkan Cirik
Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.
5 0.53711408 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue
Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks.
6 0.52937573 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
7 0.52741635 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
8 0.5198707 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
9 0.50856179 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
10 0.50396323 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
11 0.50083202 128 acl-2013-Does Korean defeat phonotactic word segmentation?
12 0.49555475 80 acl-2013-Chinese Parsing Exploiting Characters
13 0.49180281 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
14 0.48360342 97 acl-2013-Cross-lingual Projections between Languages from Different Families
15 0.48179227 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
16 0.46256524 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
17 0.45041859 327 acl-2013-Sorani Kurdish versus Kurmanji Kurdish: An Empirical Comparison
18 0.43652844 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
19 0.4329851 50 acl-2013-An improved MDL-based compression algorithm for unsupervised word segmentation
20 0.42729849 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts
topicId topicWeight
[(0, 0.035), (6, 0.026), (11, 0.103), (12, 0.261), (14, 0.012), (15, 0.02), (16, 0.014), (24, 0.044), (26, 0.064), (28, 0.017), (35, 0.063), (42, 0.051), (48, 0.042), (70, 0.033), (88, 0.029), (90, 0.022), (95, 0.092)]
simIndex simValue paperId paperTitle
1 0.8035301 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
Author: Lian Tze Lim ; Lay-Ki Soon ; Tek Yong Lim ; Enya Kong Tang ; Bali Ranaivo-Malancon
Abstract: Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpora of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on a English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.
same-paper 2 0.75904602 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty for word segmentation (WS). Offline approaches have been proposed to split them using word statistics, but they rely on a static lexicon, limiting their use. We propose an online approach that integrates a source LM and/or back-transliteration and an English LM. Experiments on Japanese and Chinese WS show that the proposed models achieve significant improvements over the state-of-the-art, reducing errors by 16% in Japanese.
3 0.71619785 93 acl-2013-Context Vector Disambiguation for Bilingual Lexicon Extraction from Comparable Corpora
Author: Dhouha Bouamor ; Nasredine Semmar ; Pierre Zweigenbaum
Abstract: This paper presents an approach that extends the standard approach used for bilingual lexicon extraction from comparable corpora. We focus on the unresolved problem of polysemous words revealed by the bilingual dictionary and introduce a use of a Word Sense Disambiguation process that aims at improving the adequacy of context vectors. On two specialized FrenchEnglish comparable corpora, empirical experimental results show that our method improves the results obtained by two stateof-the-art approaches.
4 0.58189732 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models
Author: Spence Green ; Sida Wang ; Daniel Cer ; Christopher D. Manning
Abstract: We present a fast and scalable online method for tuning statistical machine translation models with large feature sets. The standard tuning algorithm—MERT—only scales to tens of features. Recent discriminative algorithms that accommodate sparse features have produced smaller than expected translation quality gains in large systems. Our method, which is based on stochastic gradient descent with an adaptive learning rate, scales to millions of features and tuning sets with tens of thousands of sentences, while still converging after only a few epochs. Large-scale experiments on Arabic-English and Chinese-English show that our method produces significant translation quality gains by exploiting sparse features. Equally important is our analysis, which suggests techniques for mitigating overfitting and domain mismatch, and applies to other recent discriminative methods for machine translation.
5 0.57787544 154 acl-2013-Extracting bilingual terminologies from comparable corpora
Author: Ahmet Aker ; Monica Paramita ; Rob Gaizauskas
Abstract: In this paper we present a method for extracting bilingual terminologies from comparable corpora. In our approach we treat bilingual term extraction as a classification problem. For classification we use an SVM binary classifier and training data taken from the EUROVOC thesaurus. We test our approach on a held-out test set from EUROVOC and perform precision, recall and f-measure evaluations for 20 European language pairs. The performance of our classifier reaches the 100% precision level for many language pairs. We also perform manual evaluation on bilingual terms extracted from English-German term-tagged comparable corpora. The results of this manual evaluation showed 60-83% of the term pairs generated are exact translations and over 90% exact or partial translations.
6 0.57302082 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
7 0.57133651 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
8 0.57016218 137 acl-2013-Enlisting the Ghost: Modeling Empty Categories for Machine Translation
9 0.56956536 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
10 0.56693089 245 acl-2013-Modeling Human Inference Process for Textual Entailment Recognition
11 0.56616598 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study
12 0.56602335 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
13 0.56591409 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
14 0.56540096 333 acl-2013-Summarization Through Submodularity and Dispersion
15 0.56392163 318 acl-2013-Sentiment Relevance
16 0.56328869 207 acl-2013-Joint Inference for Fine-grained Opinion Extraction
17 0.56301075 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
18 0.56286961 174 acl-2013-Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation
19 0.56260073 81 acl-2013-Co-Regression for Cross-Language Review Rating Prediction
20 0.56229329 117 acl-2013-Detecting Turnarounds in Sentiment Analysis: Thwarting