acl acl2010 acl2010-180 knowledge-graph by maker-knowledge-mining

180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities


Source: pdf

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We observe that (1) how a given named entity (NE) is translated (i.e. [sent-3, score-0.146]

2 either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. [sent-5, score-0.152]

3 Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. [sent-6, score-0.162]

4 From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. [sent-7, score-0.228]

5 It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). [sent-8, score-0.762]

6 The experiments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task. [sent-12, score-0.154]

7 1 Introduction In trans-lingual language processing tasks, such as machine translation and cross-lingual information retrieval, named entity (NE) translation is essential. [sent-13, score-0.162]

8 Since NE alignment can only be conducted after its associated NEs have first been identified, the including-rate of the first recognition stage significantly limits the final alignment performance. [sent-15, score-0.364]

9 In this way, it avoids the NE recognition errors which would otherwise be [sent-20, score-0.089]

10 brought into the alignment stage from the target side; however, the NE errors from the source side still remain. [sent-26, score-0.195]

11 , 2003) expands the NE candidate-sets in both languages before conducting the alignment, which is done by treating the original results as anchors, and then re-generating further candidates by enlarging or shrinking those anchors' boundaries. [sent-28, score-0.088]

12 Of course, this strategy will be in vain if the NE anchor is missed in the initial detection stage. [sent-29, score-0.09]

13 Although the above expansion strategy has substantially alleviated the error accumulation problem, the final alignment accuracy is still not good (type-sensitive F-score only 68.4%). [sent-33, score-0.151]

14 After having examined the data, we found that: (1) how a given NE is translated, either semantically (called translation) or phonetically (called transliteration), depends greatly on its associated entity type (footnote 2). [sent-36, score-0.095]

15 Footnote 2: The proportions of semantic translation, which denote the ratios of semantically translated words among all the associated NE words, for person names (PER), location names (LOC), and organization names (ORG) approximate 0%, 28. [sent-38, score-0.205]

16 The Initial Detection subtask first locates the initial NEs and their associated NE types on both the Chinese and English sides. [sent-47, score-0.138]

17 Afterwards, the Expansion subtask re-generates the candidate-sets in both languages to recover from those initial NE recognition errors. [sent-48, score-0.155]

18 Finally, the Alignment&Re-identification subtask jointly recognizes and aligns bilingual NEs via the proposed joint model presented in Section 3. [sent-49, score-0.155]

19 6%, has been observed in our Chinese-English NE alignment task. [sent-53, score-0.115]

20 2 Motivation The problem of NE recognition requires both boundary identification and type classification. [sent-54, score-0.138]

21 Since alignment would force the linked NE pair to share the same semantic meaning, the NE that is more reliably identified in one language can be used to ensure its counterpart in another language. [sent-60, score-0.183]

22 This benefits both the NE boundary identification and type classification processes, and it hints that alignment can help to re-identify those initially recognized NEs which had been less reliable. [sent-61, score-0.309]

23 As shown in the following example, although the desired NE "北韩中央通信社" is only partially recognized as "北韩中央" in the initial recognition stage, the full NE would be preferred once its English counterpart "North Korean's Central News Agency" is given. [sent-62, score-0.233]

24 (I) The initial NE detection in a Chinese sentence: 官方的 北韩中央 通信社引述海军. [sent-64, score-0.09]

25 (II) The initial NE detection of its English counterpart: Official North Korean's Central News Agency quoted the navy's statement… (III) The word alignment between the two NEs: … … (IV) The re-identified Chinese NE boundary after alignment: 官方的 北韩中央通信社 引述海军声明. [sent-67, score-0.205]

26 As another example, the word “lake” in the English NE is linked to the Chinese character “湖 ” as illustrated below, and this mapping is found to be a translation and not a transliteration. [sent-70, score-0.129]

27 , 2003), the desired NE type “LOC” would be preferred to be shared between the English NE “Lake Constance” and its corresponding Chinese NE “康斯坦茨湖”. [sent-72, score-0.077]

28 As a result, the original incorrect type “PER” of the given English NE is fixed, and the necessity of using mapping type ratio and NE type consistency constraint becomes evident. [sent-73, score-0.417]

29 Let Score(RCNEk ,RENE[k ]) denote the associated linking score for a given candidate-pair RCNEk  and RENE[k ] , where  k  and [k ] are the associated indexes of the re-generated Chinese and English NE candidates, respectively. [sent-76, score-0.116]

30 Furthermore, let RType_k be the NE type to be reassigned and shared by RCNE_k and RENE_[k] (as they possess the same meaning). [sent-77, score-0.077]

31 Assume that RCNEk and RENE[k ] are derived from initially recognized CNEi and ENEj , respectively, and MIC denotes their internal component mapping, to be defined in Section 3. [sent-78, score-0.23]

32 The associated probability factors in the above linking score can be further derived as follows. [sent-81, score-0.15]

33 2) used to assign preference to each selected RCNE and RENE , based on the initially recognized NEs (which act as anchors). [sent-85, score-0.117]

34 1 Bilingual Related Factors The bilingual alignment factor mainly represents the likelihood value of a specific internal component mapping MIC , given a pair of possible NE configurations RCNE and RENE and their associated RType . [sent-87, score-0.416]

35 Since Chinese word segmentation is problematic, especially for transliterated words, the bilingual alignment factor P(MIC | RType, RCNE, RENE) in Eq (2) is derived to be conditioned on RENE (i. [sent-88, score-0.241]

36 In total, there are N component mappings, with N_TS translation mappings [cpn_n1, ew_[n1], TS] (n1 = 1, ..., N_TS) and N_TL transliteration mappings [cpn_n2, ew_[n2], TL] (n2 = 1, ..., N_TL), so that N = N_TS + N_TL. [sent-92, score-0.102]

37 Moreover, since the mapping type distributions of various NE types deviate greatly from one another, as illustrated in the second footnote, the associated mapping type ratio N_TS / N is thus an important feature, and is included in the internal component mapping configuration specified above. [sent-93, score-0.587]

38 For example, the MIC between "康斯坦茨湖" and "Constance Lake" is [康斯坦茨, Constance, TL] and [湖, Lake, TS], so its associated mapping type ratio will be 0.5. [sent-94, score-0.246]
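The mapping type ratio feature is straightforward to compute from an internal component mapping. Below is a minimal sketch (function name and data layout are my own, not the paper's implementation): each component mapping is a (Chinese part, English part, kind) triple, with kind "TS" for a semantic translation and "TL" for a transliteration, and the ratio is N_TS / N.

```python
# Hypothetical sketch of the mapping type ratio feature (N_TS / N):
# the fraction of internal component mappings that are semantic
# translations (TS) rather than transliterations (TL).

def mapping_type_ratio(mic):
    """mic: list of (chinese_part, english_part, kind) triples,
    where kind is 'TS' (translation) or 'TL' (transliteration)."""
    if not mic:
        return 0.0
    n_ts = sum(1 for _, _, kind in mic if kind == "TS")
    return n_ts / len(mic)

# The paper's "Constance Lake" example: one TL and one TS mapping.
mic = [("康斯坦茨", "Constance", "TL"), ("湖", "Lake", "TS")]
print(mapping_type_ratio(mic))  # 0.5
```

Since person names are almost entirely transliterated while organization names are largely translated, this ratio is strongly correlated with the NE type, which is what makes it a useful feature.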

39 2 Monolingual Candidate Certainty Factors On the other hand, the monolingual candidate certainty factors in Eq (2) indicate the likelihood that a re-generated NE candidate is the true NE given its originally detected NE. [sent-104, score-0.293]

40 Also, Str[RCNE] stands for the associated Chinese string of RCNE, cc_m denotes the m-th Chinese character within that string, and M denotes the total number of Chinese characters within RCNE. [sent-108, score-0.194]

41 Also, the bigram unit cc_m of the Chinese NE string is replaced by the English word unit ew_n. [sent-111, score-0.168]

42 All the bilingual and monolingual factors mentioned above, which are derived from Eq (1), are weighted differently according to their contributions. [sent-112, score-0.256]

43 The corresponding weighting coefficients are obtained using the well-known Minimum Error Rate Training (Och, 2003; commonly abbreviated as MERT) algorithm by minimizing the number of associated errors in the development set. [sent-113, score-0.131]
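The weight-tuning step can be pictured with a deliberately simplified stand-in. The sketch below is not MERT itself (MERT performs an efficient exact line search per feature dimension); it is a crude grid search over a single interpolation weight between a bilingual and a monolingual factor score, minimizing the number of development-set sentence pairs whose top-scoring candidate is not the gold one. All names and the data layout are illustrative assumptions.

```python
# Crude grid-search stand-in for MERT-style weight tuning (illustrative,
# not the paper's implementation). Each dev group is a list of
# (bi_score, mono_score, is_gold) candidates for one sentence pair.

def tune_weight(dev_groups, n_steps=20):
    """Pick the interpolation weight w in [0, 1] minimizing the number
    of groups whose argmax candidate under w*bi + (1-w)*mono is wrong."""
    def errors(w):
        n_err = 0
        for group in dev_groups:
            best = max(group, key=lambda c: w * c[0] + (1 - w) * c[1])
            n_err += 0 if best[2] else 1
        return n_err
    return min((i / n_steps for i in range(n_steps + 1)), key=errors)
```

Real MERT searches each feature weight with an exact piecewise-linear line search rather than a fixed grid, but the objective (minimize dev-set errors of the argmax candidate) is the same idea.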

44 The following diagram gives the details of this framework. For each given bilingual sentence-pair: (A) Initial NE Recognition: generates the initial NE anchors with off-the-shelf packages. [sent-116, score-0.232]

45 (B) NE-Candidate-Set Expansion: For each initially detected NE, several NE candidates will be re-generated from the original NE by allowing its boundaries to be shrunk or enlarged within a pre-specified range. [sent-117, score-0.083]

46 1) Create both RCNE and RENE candidate-sets, which are expanded from those initial NEs identified in the previous stage. [sent-119, score-0.062]

47 (C) NE Alignment&Re-identification: Rank each candidate in the NE-Pair-Candidate-Set constructed above with the linking score specified in Eq (1). [sent-122, score-0.064]

48 Steps to Generate the Final NE-Pairs: It is our observation that four Chinese characters for both shrinking and enlarging, and two English words for shrinking and three for enlarging, are enough in most cases. [sent-125, score-0.179]
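The boundary-expansion step described above can be sketched as follows: given an initial NE anchor as a token span, candidates are re-generated by moving each boundary within a shrink/enlarge budget (per the paper, roughly four characters each way for Chinese, and two words shrinking / three enlarging for English). The function and parameter names are my own.

```python
# Sketch of NE-candidate-set expansion around an initial anchor span.
# tokens[start:end] is the initially detected NE; candidates are all
# valid spans reachable by shrinking/enlarging each boundary.

def expand_candidates(tokens, start, end, max_shrink, max_enlarge):
    candidates = set()
    # Left boundary: negative delta enlarges (moves left), positive shrinks.
    for dl in range(-max_enlarge, max_shrink + 1):
        # Right boundary: negative delta shrinks, positive enlarges.
        for dr in range(-max_shrink, max_enlarge + 1):
            s, e = start + dl, end + dr
            if 0 <= s < e <= len(tokens):
                candidates.add((s, e))
    return sorted(candidates)

tokens = "Official North Korean Central News Agency quoted".split()
cands = expand_candidates(tokens, 1, 3, max_shrink=2, max_enlarge=3)
```

Note that the original anchor (dl = dr = 0) is always among the candidates, so the expansion can only help; as the text observes, it is in vain only when the anchor itself was missed.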

49 , 2003) is re-implemented in our environment as the baseline, in which the translation cost, transliteration cost and tagging cost are used. [sent-134, score-0.136]

50 This model is selected for comparison because it not only adopts the same candidate-set expansion strategy as mentioned above, but also utilizes the monolingual information when selecting NE-pairs (however, only a simple bi-gram model is used as the tagging cost in their paper). [sent-135, score-0.201]

51 Note that it enforces the same NE type only when the tagging cost is evaluated: C_tag = min_RType [ -log(prod_{m=1}^{M} P(cc_m | cc_{m-1}, RType)) - log(prod_{n=1}^{N} P(ew_n | ew_{n-1}, RType)) ]. [sent-136, score-0.275]
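The baseline tagging cost above can be sketched directly: a minimum over NE types of the joint negative log-probability of the Chinese character bigrams and English word bigrams, conditioned on a shared type. The bigram model here is a hypothetical placeholder (in practice it would be a smoothed model estimated from tagged data).

```python
import math

# Sketch of the baseline tagging cost C_tag: the best (minimum) joint
# negative log-probability of Chinese character bigrams and English word
# bigrams under a single shared NE type. bigram_prob(prev, cur, ne_type)
# is a hypothetical smoothed conditional bigram model.

def tagging_cost(cn_chars, en_words, bigram_prob,
                 ne_types=("PER", "LOC", "ORG")):
    def neg_logprob(units, ne_type):
        cost, prev = 0.0, "<s>"
        for u in units:
            cost -= math.log(bigram_prob(prev, u, ne_type))
            prev = u
        return cost
    return min(neg_logprob(cn_chars, t) + neg_logprob(en_words, t)
               for t in ne_types)
```

Taking the min over a single shared RType is exactly what enforces type consistency in the baseline, but only at this cost term, unlike the proposed model, which carries the shared type through all factors.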

52 The second part of the training set is the LDC2005T34 bilingual NE dictionary (footnote 3), which is denoted as Training-Set-II. [sent-140, score-0.096]

53 In our experiments, for the baseline system, the translation cost and the transliteration cost are trained on Training-Set-II, while the tagging cost is trained on Training-Set-I. [sent-142, score-0.168]

54 For the proposed approach, the monolingual candidate certainty factors are trained on Training-Set-I, and Training-Set-II is used to train the parameters relating to bilingual alignment factors. [sent-143, score-0.468]

55 Afterwards, the answer keys for NE recognition and alignment were annotated manually, and used as the gold standard to calculate metrics of precision (P), recall (R), and F-score (F) for both NE recognition (NER) and NE alignment (NEA). [sent-148, score-0.352]

56 The number of NE pairs is less than that of NEs, because not all those recognized NEs can be aligned. [sent-150, score-0.131] [sent-151, score-0.07]

57 Footnote 3: The LDC2005T34 data-set consists of proofread bilingual entries: 73,352 person names, 76,460 location names, and 68,960 organization names.

58 Baseline System: Both the baseline and the proposed models share the same initial detection subtask, which adopts the Chinese NE recognizer reported by Wu et al. [sent-155, score-0.170]

59 (2005), which is a hybrid statistical model incorporating multi-knowledge sources, and the English NE recognizer included in the publicly available Mallet toolkit4 to generate initial NEs. [sent-156, score-0.091]

60 Initial Chinese NEs and English NEs are recognized by these two available packages respectively. [sent-157, score-0.07]

61 Table 1 shows the initial NE recognition performances for both Chinese and English (the largest entry in each column is highlighted for visibility). [sent-164, score-0.123]

62 From Table 1, it is observed that the F-score of ORG type is the lowest among all NE types for both English and Chinese. [sent-165, score-0.077]

63 This is because many organization names are partially recognized or missed. [sent-166, score-0.105]

64 Besides, not shown in the table, the location names or abbreviated organization names tend to be incorrectly recognized as person names. [sent-167, score-0.199]

65 In general, the initial Chinese NER outperforms the initial English NER, as the NE type classification turns out to be a more difficult problem for this English NER system. [sent-168, score-0.201]

66 Such a low performance is mainly due to those NE recognition errors which have been brought into the alignment stage. [sent-171, score-0.204]

67 To diminish the effect of errors accumulating, which stems from the recognition stage, the baseline system also adopts the same expansion strategy described in Section 3. [sent-172, score-0.176]

68 Therefore, it is conjectured that the baseline alignment model is unable to achieve good performance if those features/factors proposed in this paper are not adopted. [sent-180, score-0.115]

69 Exp0 is the basic system, which ignores the monolingual candidate certainty scores, and also disregards the mapping type and the NE type consistency constraint by ignoring P(Mtype_n | ew_[n], RType) and P(N_TS/N | RType), and also by replacing P(cpn_n | Mtype_n, ew_[n], RType) with P(cpn_n | ew_[n]) in Eq (3). [sent-183, score-0.502]

70 In addition, Exp4 (named Exp0+RTypeReassignment) adds the NE type reassignment score, Eq (4), to Exp0 to show the effect of enforcing NE-type consistency. [sent-185, score-0.106]

71 Furthermore, Exp5 (named All-BiFactors) shows the full power of the set of proposed bilingual factors by turning on all the options mentioned above. [sent-186, score-0.072]

72 To show the influence of additional information carried by those initially recognized NEs, Exp7 (named Exp6+LeftD/RightD) adds left and right distance information into Exp6, as that specified in Eq (5). [sent-188, score-0.146]

73 To study the monolingual bigram capability, Exp8 (named Exp6+Bigram) adds the NE-type-dependent bigram model of each language to Exp6. [sent-189, score-0.188]

74 Similar to what we have done on the bilingual alignment factor above, Exp9 (named Exp6+N-Bigram) adds the normalized NE-type-dependent bigram to Exp6 for removing the bias induced by having different NE lengths. [sent-191, score-0.310]

75 The normalized Chinese NE-type-dependent bigram score is defined as [prod_{m=1}^{M} P(cc_m | cc_{m-1}, RType)]^(1/M). [sent-192, score-0.175]
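The length normalization can be sketched concretely, assuming (as the garbled formula suggests) that the normalized score is the geometric mean of the bigram probabilities, i.e. the product raised to the power 1/M. Computing it in log space avoids underflow for long NEs; the bigram model here is a hypothetical placeholder.

```python
import math

# Sketch of the length-normalized bigram score
# [prod_{m=1}^{M} P(cc_m | cc_{m-1})]^(1/M), computed in log space.
# bigram_prob(prev, cur) is a hypothetical conditional bigram model.

def normalized_bigram_score(chars, bigram_prob):
    if not chars:
        return 0.0
    logp, prev = 0.0, "<s>"
    for c in chars:
        logp += math.log(bigram_prob(prev, c))
        prev = c
    return math.exp(logp / len(chars))
```

Without the 1/M exponent, longer candidates would be penalized simply for having more bigram factors, which is the length bias Exp9 is designed to remove.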

76 Lastly, Exp10 (named Fully-JointModel) shows the full power of the proposed Recognition and Alignment Joint Model by adopting all the normalized factors mentioned above. [sent-194, score-0.072]

77 The first one (named type-insensitive) only checks the scope of each NE without taking its associated NE type into consideration, and is reported in Table 2. [sent-197, score-0.121]

78 The second one (named type-sensitive) would also evaluate the associated NE type of each NE, and is given within parentheses in Table 2. [sent-201, score-0.121]

79 A large degradation is observed when NE type is also taken into account. [sent-202, score-0.077]

80 It directly adopts all those primitive features mentioned above as its inputs (including internal component mapping, initial and final NE type, NE bigram-based string, and left/right distance), without involving any related probability factors derived within the proposed model. [sent-208, score-0.326]

81 Since the ME approach is unable to utilize the bilingual NE dictionary (Training-SetII), for fair comparison, this dictionary was also not used to train our models here. [sent-211, score-0.096]

82 5 Error Analysis and Discussion Although the proposed model has substantially improved the performance of both NE alignment and recognition, some errors still remain. [sent-227, score-0.143]

83 (B) NE components are one-to-one linked, but the associated NE anchors generated from the initial recognition stage are either missing or spurious (24%). [sent-229, score-0.27]

84 Although increasing the number of output candidates generated from the initial recognition stage might cover the missing problem, possible side effects might also be expected (as the complexity of the alignment task would also be increased). [sent-230, score-0.29]

85 For example, one NE is abbreviated while its counterpart is not; or some loanwords or out-of-vocabulary terms are translated neither semantically nor phonetically. [sent-232, score-0.181]

86 Errors of this type are difficult to resolve, and their possible solutions are beyond the scope of this paper. [sent-234, score-0.103]

87 As an instance of abbreviation errors, a Chinese NE "葛兰素制药厂 (GlaxoSmithKline Factory)" is tagged as "葛兰素/PRR 药厂/n", while its counterpart on the English side is simply abbreviated as "GSK" (or sometimes replaced by the pronoun "it"). [sent-236, score-0.122]

88 As an example of errors resulting from loanwords, the Japanese kanji "明仁" (the name of a Japanese emperor) is linked to the English word "Akihito". [sent-239, score-0.082]

89 Here the Japanese kanji "明仁" is directly adopted as the corresponding Chinese characters (as those characters were originally borrowed from Chinese), which would be pronounced as "Mingren" in Chinese and thus deviates greatly from the English pronunciation of "Akihito". [sent-240, score-0.177]

90 Further extending the model to cover this new conversion type seems necessary; however, such a kind of extension is very likely to be language pair dependent. [sent-242, score-0.077]

91 The corresponding differences in performance (of the weighted version) when compared with the initial NER (ΔP, ΔR, and ΔF) are shown in Table 4. [sent-244, score-0.062]

92 The result shows that the proposed joint model has a clear win over the initial NER for both Chinese and English NER. [sent-253, score-0.062]

93 However, if the mapping type ratio is omitted, only 21. [sent-261, score-0.202]

94 With the benefits shown above, the alignment model could thus be used to train the monolingual NE recognition model via semi-supervised learning. [sent-265, score-0.176]

95 Therefore, only the English NE recognizer and the alignment model are updated during training iterations. [sent-268, score-0.144]

96 Table 5 shows the results of semi-supervised learning after convergence for adopting only the English NER model (NER-Only), the baseline alignment model (NER+Baseline), and our un-weighted joint model (NER+JointModel) respectively. [sent-271, score-0.115]

97 However, with additional mapping constraints from the aligned sentence of another language, the alignment module could guide the searching process to converge to a more desirable point in the parameter space; and these additional constraints become more effective as the seed-corpus gets smaller. [sent-277, score-0.193]

98 Supervised Learning of English NE Recognition. 7 Conclusion: In summary, our experiments show that the new monolingual candidate certainty factors are more effective than the tagging cost (only a bigram model) adopted in the baseline system. [sent-278, score-0.356]

99 Moreover, both the mapping type ratio and the entity type consistency constraint are very helpful in identifying the associated NE boundaries and types. [sent-279, score-0.454]

100 After having adopted the features and enforced the constraint mentioned above, the proposed framework, which jointly recognizes and aligns bilingual named entities, achieves a remarkable 42.1% F-score imperfection reduction. [sent-280, score-0.293]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ne', 0.508), ('rtype', 0.34), ('rcne', 0.288), ('nes', 0.279), ('rene', 0.262), ('cpn', 0.196), ('chinese', 0.163), ('mic', 0.131), ('eq', 0.121), ('cne', 0.118), ('alignment', 0.115), ('certainty', 0.115), ('ccm', 0.105), ('ctype', 0.105), ('bilingual', 0.096), ('constance', 0.092), ('ene', 0.092), ('ner', 0.082), ('named', 0.082), ('etype', 0.078), ('leftd', 0.078), ('mtypen', 0.078), ('rightd', 0.078), ('mapping', 0.078), ('type', 0.077), ('anchors', 0.074), ('ew', 0.071), ('recognized', 0.07), ('lake', 0.066), ('initial', 0.062), ('recognition', 0.061), ('abbreviated', 0.059), ('monolingual', 0.058), ('org', 0.057), ('internal', 0.053), ('lenc', 0.052), ('adopts', 0.051), ('transliteration', 0.049), ('english', 0.049), ('factors', 0.048), ('initially', 0.047), ('ratio', 0.047), ('shrinking', 0.046), ('characters', 0.045), ('associated', 0.044), ('enlarging', 0.042), ('counterpart', 0.04), ('cnei', 0.039), ('dependant', 0.039), ('enej', 0.039), ('imperfection', 0.039), ('netype', 0.039), ('adopted', 0.036), ('boundaries', 0.036), ('expansion', 0.036), ('candidate', 0.036), ('names', 0.035), ('entity', 0.034), ('loc', 0.034), ('consistency', 0.033), ('subtask', 0.032), ('cost', 0.032), ('nts', 0.032), ('ewn', 0.032), ('bigram', 0.031), ('translated', 0.03), ('derived', 0.03), ('nn', 0.03), ('component', 0.03), ('afterwards', 0.029), ('enforces', 0.029), ('adds', 0.029), ('recognizer', 0.029), ('stage', 0.029), ('constraint', 0.028), ('linking', 0.028), ('linked', 0.028), ('primitive', 0.028), ('detection', 0.028), ('errors', 0.028), ('wrong', 0.027), ('jointly', 0.027), ('akihito', 0.026), ('captain', 0.026), ('ctypei', 0.026), ('etypej', 0.026), ('ferry', 0.026), ('gsk', 0.026), ('kanji', 0.026), ('loanwords', 0.026), ('ntl', 0.026), ('uneasy', 0.026), ('semantically', 0.026), ('greatly', 0.025), ('agency', 0.025), ('mentioned', 0.024), ('translation', 0.023), ('entities', 0.023), ('ying', 0.023), ('side', 0.023)]
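The topN-word list above could be produced by a standard tf-idf computation like the sketch below: raw term frequency in this paper times inverse document frequency over a background collection. The exact weighting and normalization used by this page's tfidf model are not specified, so this is only an illustrative reconstruction.

```python
import math
from collections import Counter

# Illustrative tf-idf top-word scoring: tf(w) * log((1+N)/(1+df(w))),
# with df counted over a background collection of documents.

def tfidf_top_words(doc_tokens, corpus_docs, top_n=10):
    tf = Counter(doc_tokens)
    n_docs = len(corpus_docs)
    def idf(w):
        df = sum(1 for d in corpus_docs if w in d)
        return math.log((1 + n_docs) / (1 + df))
    scores = {w: c * idf(w) for w, c in tf.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]
```

Words that are frequent in this paper but rare elsewhere (such as "ne" or "rtype") dominate the list, while ubiquitous words get near-zero idf.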

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

2 0.21697867 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

3 0.089275144 262 acl-2010-Word Alignment with Synonym Regularization

Author: Hiroyuki Shindo ; Akinori Fujino ; Masaaki Nagata

Abstract: We present a novel framework for word alignment that incorporates synonym knowledge collected from monolingual linguistic resources in a bilingual probabilistic model. Synonym information is helpful for word alignment because we can expect a synonym to correspond to the same word in a different language. We design a generative model for word alignment that uses synonym information as a regularization term. The experimental results show that our proposed method significantly improves word alignment quality.

4 0.085838258 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment

Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell

Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.

5 0.082583167 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

Author: Wenliang Chen ; Jun'ichi Kazama ; Kentaro Torisawa

Abstract: This paper proposes a dependency parsing method that uses bilingual constraints to improve the accuracy of parsing bilingual texts (bitexts). In our method, a targetside tree fragment that corresponds to a source-side tree fragment is identified via word alignment and mapping rules that are automatically learned. Then it is verified by checking the subtree list that is collected from large scale automatically parsed data on the target side. Our method, thus, requires gold standard trees only on the source side of a bilingual corpus in the training phase, unlike the joint parsing model, which requires gold standard trees on the both sides. Compared to the reordering constraint model, which requires the same training data as ours, our method achieved higher accuracy because ofricher bilingual constraints. Experiments on the translated portion of the Chinese Treebank show that our system outperforms monolingual parsers by 2.93 points for Chinese and 1.64 points for English.

6 0.078071781 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

7 0.07582438 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

8 0.074371248 133 acl-2010-Hierarchical Search for Word Alignment

9 0.073113471 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

10 0.071854457 170 acl-2010-Letter-Phoneme Alignment: An Exploration

11 0.068810686 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation

12 0.066858158 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

13 0.065680824 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

14 0.065171972 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

15 0.062417295 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

16 0.062004201 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

17 0.061525282 79 acl-2010-Cross-Lingual Latent Topic Extraction

18 0.06009382 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

19 0.05820955 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

20 0.057100851 169 acl-2010-Learning to Translate with Source and Target Syntax


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.151), (1, -0.081), (2, -0.025), (3, 0.005), (4, 0.038), (5, 0.041), (6, -0.052), (7, 0.008), (8, 0.021), (9, 0.045), (10, -0.032), (11, -0.019), (12, -0.018), (13, -0.035), (14, -0.048), (15, -0.027), (16, 0.001), (17, -0.013), (18, 0.09), (19, -0.109), (20, -0.064), (21, -0.075), (22, -0.012), (23, 0.07), (24, -0.034), (25, -0.001), (26, -0.153), (27, -0.036), (28, -0.001), (29, -0.092), (30, -0.048), (31, 0.033), (32, 0.009), (33, -0.128), (34, -0.088), (35, 0.103), (36, -0.137), (37, 0.08), (38, 0.061), (39, -0.156), (40, -0.02), (41, 0.073), (42, -0.137), (43, 0.091), (44, 0.103), (45, 0.135), (46, -0.066), (47, -0.063), (48, 0.043), (49, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95305091 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experi- ments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

2 0.6893751 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

Author: Yassine Benajiba ; Imed Zitouni ; Mona Diab ; Paolo Rosso

Abstract: Building an accurate Named Entity Recognition (NER) system for languages with complex morphology is a challenging task. In this paper, we present research that explores the feature space using both gold and bootstrapped noisy features to build an improved highly accurate Arabic NER system. We bootstrap noisy features by projection from an Arabic-English parallel corpus that is automatically tagged with a baseline NER system. The feature space covers lexical, morphological, and syntactic features. The proposed approach yields an improvement of up to 1.64 F-measure (absolute).

3 0.56082606 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

Author: Nadir Durrani ; Hassan Sajjad ; Alexander Fraser ; Helmut Schmid

Abstract: We present a novel approach to integrate transliteration into Hindi-to-Urdu statistical machine translation. We propose two probabilistic models, based on conditional and joint probability formulations, that are novel solutions to the problem. Our models consider both transliteration and translation when translating a particular Hindi word given the context, whereas in previous work transliteration is only used for translating OOV (out-of-vocabulary) words. We use transliteration as a tool for disambiguation of Hindi homonyms which can be both translated or transliterated or transliterated differently based on different contexts. We obtain final BLEU scores of 19.35 (conditional probability model) and 19.00 (joint probability model) as compared to 14.30 for a baseline phrase-based system and 16.25 for a system which transliterates OOV words in the baseline system. This indicates that transliteration is useful for more than only translating OOV words for language pairs like Hindi-Urdu.

4 0.49367863 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

Author: Jenny Rose Finkel ; Christopher D. Manning

Abstract: One of the main obstacles to producing high quality joint models is the lack of jointly annotated data. Joint modeling of multiple natural language processing tasks outperforms single-task models learned from the same data, but still underperforms compared to single-task models learned on the more abundant quantities of available single-task annotated data. In this paper we present a novel model which makes use of additional single-task annotated data to improve the performance of a joint model. Our model utilizes a hierarchical prior to link the feature weights for shared features in several single-task models and the joint model. Experiments on joint parsing and named entity recognition, using the OntoNotes corpus, show that our hierarchical joint model can produce substantial gains over a joint model trained on only the jointly annotated data.

5 0.43320304 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

Author: Dong Yang ; Paul Dixon ; Sadaoki Furui

Abstract: This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model, the first CRF segments an input word into chunks and the second one converts each chunk into one unit in the target language. In this paper, we propose a method to jointly optimize the two-step CRFs and also a fast algorithm to realize it. Our experiments show that the proposed method outperforms the well-known joint source channel model (JSCM) and our proposed fast algorithm decreases the decoding time significantly. Furthermore, combination of the proposed method and the JSCM gives further improvement, which outperforms state-of-the-art results in terms of top-1 accuracy.

6 0.42546016 262 acl-2010-Word Alignment with Synonym Regularization

7 0.40614417 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

8 0.36746266 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

9 0.35162869 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures

10 0.35098112 133 acl-2010-Hierarchical Search for Word Alignment

11 0.34860957 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

12 0.34039137 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages

13 0.33764914 28 acl-2010-An Entity-Level Approach to Information Extraction

14 0.33281207 195 acl-2010-Phylogenetic Grammar Induction

15 0.33083743 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

16 0.31591067 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

17 0.31487095 79 acl-2010-Cross-Lingual Latent Topic Extraction

18 0.31248018 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

19 0.30050018 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

20 0.29604712 68 acl-2010-Conditional Random Fields for Word Hyphenation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.042), (33, 0.011), (39, 0.4), (42, 0.01), (59, 0.085), (73, 0.04), (78, 0.021), (83, 0.099), (84, 0.03), (98, 0.138)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89471042 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

Author: Matthew Skala ; Victoria Krakovna ; Janos Kramar ; Gerald Penn

Abstract: Constructing an encoding of a concept lattice using short bit vectors allows for efficient computation of join operations on the lattice. Join is the central operation any unification-based parser must support. We extend the traditional bit vector encoding, which represents join failure using the zero vector, to count any vector with less than a fixed number of one bits as failure. This allows non-joinable elements to share bits, resulting in a smaller vector size. A constraint solver is used to construct the encoding, and a variety of techniques are employed to find near-optimal solutions and handle timeouts. An evaluation is provided comparing the extended representation of failure with traditional bit vector techniques.

2 0.85317421 2 acl-2010-"Was It Good? It Was Provocative." Learning the Meaning of Scalar Adjectives

Author: Marie-Catherine de Marneffe ; Christopher D. Manning ; Christopher Potts

Abstract: Texts and dialogues often express information indirectly. For instance, speakers’ answers to yes/no questions do not always straightforwardly convey a ‘yes’ or ‘no’ answer. The intended reply is clear in some cases (Was it good? It was great!) but uncertain in others (Was it acceptable? It was unprecedented.). In this paper, we present methods for interpreting the answers to questions like these which involve scalar modifiers. We show how to ground scalar modifier meaning based on data collected from the Web. We learn scales between modifiers and infer the extent to which a given answer conveys ‘yes’ or ‘no’ . To evaluate the methods, we collected examples of question–answer pairs involving scalar modifiers from CNN transcripts and the Dialog Act corpus and use response distributions from Mechanical Turk workers to assess the degree to which each answer conveys ‘yes’ or ‘no’ . Our experimental results closely match the Turkers’ response data, demonstrating that meanings can be learned from Web data and that such meanings can drive pragmatic inference.

same-paper 3 0.81426585 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

Author: Yufeng Chen ; Chengqing Zong ; Keh-Yih Su

Abstract: We observe that (1) how a given named entity (NE) is translated (i.e., either semantically or phonetically) depends greatly on its associated entity type, and (2) entities within an aligned pair should share the same type. Also, (3) those initially detected NEs are anchors, whose information should be used to give certainty scores when selecting candidates. From this basis, an integrated model is thus proposed in this paper to jointly identify and align bilingual named entities between Chinese and English. It adopts a new mapping type ratio feature (which is the proportion of NE internal tokens that are semantically translated), enforces an entity type consistency constraint, and utilizes additional monolingual candidate certainty factors (based on those NE anchors). The experiments show that this novel approach has substantially raised the type-sensitive F-score of identified NE-pairs from 68.4% to 81.7% (42.1% F-score imperfection reduction) in our Chinese-English NE alignment task.

4 0.80464029 57 acl-2010-Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation

Author: Michael Bloodgood ; Chris Callison-Burch

Abstract: We explore how to improve machine translation systems by adding more translation data in situations where we already have substantial resources. The main challenge is how to buck the trend of diminishing returns that is commonly encountered. We present an active learning-style data solicitation algorithm to meet this challenge. We test it, gathering annotations via Amazon Mechanical Turk, and find that we get an order of magnitude increase in performance rates of improvement.

5 0.51093304 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

Author: Eric Corlett ; Gerald Penn

Abstract: Letter-substitution ciphers encode a document from a known or hypothesized language into an unknown writing system or an unknown encoding of a known writing system. It is a problem that can occur in a number of practical applications, such as in the problem of determining the encodings of electronic documents in which the language is known, but the encoding standard is not. It has also been used in relation to OCR applications. In this paper, we introduce an exact method for deciphering messages using a generalization of the Viterbi algorithm. We test this model on a set of ciphers developed from various web sites, and find that our algorithm has the potential to be a viable, practical method for efficiently solving decipherment problems.

6 0.50779873 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser

7 0.49440557 130 acl-2010-Hard Constraints for Grammatical Function Labelling

8 0.48593658 39 acl-2010-Automatic Generation of Story Highlights

9 0.48110271 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing

10 0.47811234 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

11 0.47145873 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

12 0.47123593 56 acl-2010-Bridging SMT and TM with Translation Recommendation

13 0.47080028 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

14 0.46994594 199 acl-2010-Preferences versus Adaptation during Referring Expression Generation

15 0.46940628 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

16 0.46794605 170 acl-2010-Letter-Phoneme Alignment: An Exploration

17 0.46788895 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

18 0.46770126 71 acl-2010-Convolution Kernel over Packed Parse Forest

19 0.46761119 115 acl-2010-Filtering Syntactic Constraints for Statistical Machine Translation

20 0.46700472 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization