acl acl2010 acl2010-50 knowledge-graph by maker-knowledge-mining

50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures


Source: pdf

Author: Daphna Shezaf ; Ari Rappoport

Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. [sent-6, score-0.325]

2 Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. [sent-7, score-0.708]

3 We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. [sent-8, score-0.326]

4 Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. [sent-9, score-0.281]

5 We use NAS to eliminate incorrect translations from the generated lexicon. [sent-10, score-0.311]

6 We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. [sent-11, score-0.759]

7 They provide, for each source language word or phrase, a set of translations in the target language, and thus they are a basic component of dictionaries, which also include syntactic information, sense division, usage examples, semantic fields, usage guidelines, etc. [sent-14, score-0.342]

8 Traditionally, when bilingual lexicons are not compiled manually, they are extracted from parallel corpora. [sent-15, score-0.413]

9 Bilingual lexicons can be generated using nonparallel corpora or pivot language lexicons. [sent-17, score-0.979]

10 In this paper we present a method for generating a high quality lexicon given such a noisy one. [sent-22, score-0.347]

11 A naive method for pivot-based lexicon generation goes as follows. [sent-26, score-0.315]

12 For each source headword, take its translations to the pivot language using the source-to-pivot lexicon, then for each such translation take its translations to the target language using the pivot-to-target lexicon. [sent-27, score-1.099]
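The naive composition just described can be written down directly. The following is a minimal Python sketch of it, not the authors' implementation; the lexicon format (dicts mapping a word to a set of translations) and the toy French/English/Spanish entries are illustrative assumptions.

    # Naive pivot-based lexicon generation: compose a source->pivot lexicon
    # with a pivot->target lexicon, keeping every reachable target word.
    from collections import defaultdict

    def compose_lexicons(src_to_pivot, pivot_to_tgt):
        """Return a noisy source->target lexicon as a dict of candidate sets."""
        src_to_tgt = defaultdict(set)
        for src_word, pivot_words in src_to_pivot.items():
            for p in pivot_words:
                # Every target translation of every pivot translation becomes
                # a candidate; pivot-language polysemy is what makes this noisy.
                src_to_tgt[src_word].update(pivot_to_tgt.get(p, set()))
        return dict(src_to_tgt)

    # Toy example mirroring the 'spring' ambiguity discussed in the paper.
    src_to_pivot = {"printemps": {"spring"}}                 # source -> pivot
    pivot_to_tgt = {"spring": {"primavera", "resorte"}}      # pivot -> target
    print(compose_lexicons(src_to_pivot, pivot_to_tgt))
    # {'printemps': {'primavera', 'resorte'}} -- resorte is the wrong sense

Because both the correct and the incorrect sense of the pivot word survive the composition, the output is exactly the kind of noisy ('divergent') lexicon the next sentences describe.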

13 This method yields highly noisy (‘divergent’) lexicons, because lexicons are generally intransitive. [sent-28, score-0.341]

14 This intransitivity stems from polysemy in the pivot language that does not exist in the source language. [sent-29, score-0.499]

15 Further translating spring into Spanish yields both the correct translation primavera and an incorrect one, resorte (the elastic object). [sent-32, score-0.381]

16 For each given source headword we compute its signature and the signatures of all of its candidate translations. [sent-45, score-0.567]

17 We present the non-aligned signatures (NAS) similarity score for signatures and use it to rank these translations. [sent-46, score-0.456]

18 NAS is based on the number of headword signature words that may be translated using the input noisy lexicon into words in the signature of a candidate translation. [sent-47, score-0.941]

19 We evaluate our algorithm by generating a bilingual lexicon for Hebrew and Spanish using pivot Hebrew-English and English-Spanish lexicons compiled by a professional publishing house. [sent-48, score-1.054]

20 Inverse Consultation: Tanaka and Umemura (1994) generated a bilingual lexicon using a pivot language. [sent-59, score-0.781]

21 IC examines the intersection of two pivot language sets: the set of pivot translations of a source-language word w, and the set of pivot translations of each target-language word that is a candidate for being a translation to w. [sent-61, score-1.845]

22 For example, the intersection of the English translations of French printemps and Spanish resorte contains only a single word, spring. [sent-63, score-0.399]

23 The intersection for a correct translation pair printemps and primavera may include two synonym words, spring and springtime. [sent-64, score-0.319]
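A hedged sketch of the Inverse Consultation idea as summarized here: a candidate target word is scored by the overlap between its pivot translations and those of the source word. The function and variable names are my own, and the paper's exact IC formulation (e.g. any normalization) may differ.

    def inverse_consultation_score(src_word, tgt_candidate,
                                   src_to_pivot, tgt_to_pivot):
        """Size of the intersection of the two pivot-translation sets.
        A larger overlap (e.g. both 'spring' and 'springtime') is taken
        as evidence that the candidate is a correct translation."""
        pivots_of_src = src_to_pivot.get(src_word, set())
        pivots_of_tgt = tgt_to_pivot.get(tgt_candidate, set())
        return len(pivots_of_src & pivots_of_tgt)

    # Toy data following the printemps / primavera / resorte example.
    src_to_pivot = {"printemps": {"spring", "springtime"}}
    tgt_to_pivot = {"primavera": {"spring", "springtime"}, "resorte": {"spring"}}
    print(inverse_consultation_score("printemps", "primavera",
                                     src_to_pivot, tgt_to_pivot))  # 2
    print(inverse_consultation_score("printemps", "resorte",
                                     src_to_pivot, tgt_to_pivot))  # 1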

24 One weakness of IC is that it relies on pivot language synonyms to identify correct translations. [sent-68, score-0.475]

25 Mausam et al. (2009) used many input bilingual lexicons to create bilingual lexicons for new language pairs. [sent-74, score-0.714]

26 They represent the multiple input lexicons in a single undirected graph, with words from all the lexicons as nodes. [sent-75, score-0.506]

27 The input lexicons' translation pairs define the edges in the graph. [sent-76, score-0.353]

28 In a sense, this is a generalization of the pivot language idea, where multiple pivots are used. [sent-78, score-0.437]

29 In the example above, if both English and German are used as pivots, printemps and primavera would be accepted as correct because they are linked by both English spring and German Fruehling, while printemps and resorte are not linked by any German pivot. [sent-79, score-0.317]
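The multiple-pivot acceptance test in this example can be sketched as follows. This is a simplified reading, assuming a pair counts as supported when it is linked through a pivot word in a given pivot language; it does not reproduce the graph-based inference of the cited work, and all names and toy entries are assumptions.

    def linked_pivot_languages(src_word, tgt_word, pivot_lexicons):
        """Return the set of pivot languages that link src_word to tgt_word.
        pivot_lexicons maps a language name to (source->pivot, pivot->target)
        lexicons, each a dict of translation sets."""
        languages = set()
        for lang, (src_to_pivot, pivot_to_tgt) in pivot_lexicons.items():
            for p in src_to_pivot.get(src_word, set()):
                if tgt_word in pivot_to_tgt.get(p, set()):
                    languages.add(lang)
                    break
        return languages

    # English and German pivots for the printemps/primavera/resorte example.
    pivot_lexicons = {
        "English": ({"printemps": {"spring"}},
                    {"spring": {"primavera", "resorte"}}),
        "German":  ({"printemps": {"Fruehling"}},
                    {"Fruehling": {"primavera"}}),
    }
    print(linked_pivot_languages("printemps", "primavera", pivot_lexicons))  # both pivots
    print(linked_pivot_languages("printemps", "resorte", pivot_lexicons))    # English only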

30 This multiple-pivot idea is similar to Inverse Consultation in that multiple pivots are required, but using multiple pivot languages frees it from the dependency on rich input lexicons that contain a variety of synonyms. [sent-80, score-0.712]

31 Thus the first translation of some English headword in the English-Spanish and in the English-Hebrew dictionaries would correspond to the same sense of the headword, and would therefore constitute translations of each other. [sent-85, score-0.507]

32 Both works rely on a pre-existing large (16-20K entries), correct, one-to-one lexicon between the source and target languages, which is used to align context vectors between languages. [sent-90, score-0.417]

33 Koehn and Knight (2002) were able to do without the initial large lexicon by limiting themselves to related languages that share a writing system, and using identically-spelled words as context words. [sent-92, score-0.366]

34 In the latter work, the one-to-one lexicon assumption was not made: when a context word had multiple equivalents, it was mapped into all of them, with the original probability equally distributed between them. [sent-99, score-0.282]
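The weight-splitting step described in this sentence is easy to make concrete. The sketch below projects a context vector through a (possibly many-to-many) lexicon, dividing each context word's weight equally among its equivalents; the function name and the toy data are assumptions, not code from the cited work.

    def project_context_vector(vector, lexicon):
        """Map a context vector into the other language via a lexicon.
        A context word with several equivalents has its weight split
        equally among them; words missing from the lexicon are dropped."""
        projected = {}
        for word, weight in vector.items():
            equivalents = lexicon.get(word, set())
            if not equivalents:
                continue
            share = weight / len(equivalents)
            for eq in equivalents:
                projected[eq] = projected.get(eq, 0.0) + share
        return projected

    # Toy example: 'spring' maps to two Spanish words, so its weight is halved.
    vector = {"spring": 0.6, "water": 0.4}
    lexicon = {"spring": {"primavera", "resorte"}, "water": {"agua"}}
    print(project_context_vector(vector, lexicon))
    # {'primavera': 0.3, 'resorte': 0.3, 'agua': 0.4} (key order may vary)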

35 Using cross-lingual cooccurrences to improve a lexicon generated using a pivot language was suggested by Tanaka and Iwasaki (1996). [sent-101, score-0.665]

36 Schafer and Yarowsky (2002) created lexicons between English and a target local language (e. [sent-102, score-0.281]

37 An English pivot lexicon was used in conjunction with pivot-target cognates. [sent-107, score-0.642]

38 Kaji et al. (2008) used a pivot English lexicon to generate initial Japanese-Chinese and Chinese-Japanese lexicons, then used co-occurrence information, aligned using the initial lexicon, to identify correct translations. [sent-111, score-0.774]

39 Algorithm: Our algorithm transforms a noisy lexicon into a high-quality one. [sent-116, score-0.326]

40 As explained above, in this paper we focus on noisy lexicons generated using pivot language lexicons. [sent-117, score-0.738]

41 Other methods for obtaining an initial noisy lexicon could be used as well; their evaluation is deferred to future work. [sent-118, score-0.387]

42 In the setting evaluated in this paper, we first generate an initial noisy lexicon iLex possibly containing many translation candidates for each source headword. [sent-119, score-0.516]

43 iLex is computed from two pivot-language lexicons, and is the only place in which the algorithm utilizes the pivot language. [sent-120, score-0.427]

44 Afterwards, for each source headword, we compute its signature and the signatures of each of its translation candidates. [sent-121, score-0.532]

45 We now rank the candidates according to the non-aligned signatures (NAS) similarity score, which assesses the similarity between each candidate’s signature and that of the headword. [sent-123, score-0.454]

46 For each headword, we select the t translations with the highest NAS scores as correct translations. [sent-124, score-0.356]

47 Input Resources: The resources required by our algorithm as evaluated in this paper are: (a) two bilingual lexicons, one from the source to the pivot language and the other from the pivot to the target language. [sent-126, score-0.998]

48 In principle, these two pivot lexicons can be noisy, although in our evaluation we use manually compiled lexicons; (b) two monolingual corpora, one for each of the source and target languages. [sent-127, score-0.794]

49 Initial Lexicon Construction: We create an initial lexicon from the source to the target language using the pivot language: we look up each source language word s in the source-pivot lexicon, and obtain the set Ps of its pivot translations. [sent-130, score-1.207]

50 Note that it is possible that not all target lexicon words appear as translation candidates. [sent-133, score-0.423]

51 Therefore, two signatures of words in different languages cannot be directly compared; we compare them using a lexicon L as explained below. [sent-143, score-0.463]
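The extracted sentences do not spell out how the signature G_{N,k}(w) used below is built, so the following is only a plausible minimal sketch, assuming the signature is the N words that co-occur most often with w within a window of k tokens of a monolingual corpus; the association measure, window handling and stopword treatment are assumptions.

    from collections import Counter

    def signature(word, tokens, n=200, k=4, stopwords=frozenset()):
        """A plausible sketch of G_{N,k}(w): the n words co-occurring most
        often with `word` within a window of k tokens, stopwords excluded."""
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            counts.update(w for w in window if w != word and w not in stopwords)
        return {w for w, _ in counts.most_common(n)}

    # Tiny corpus; in practice the corpora are large and the 1000 most
    # frequent tokens are removed first (see sentence 62 below).
    tokens = "the spring water flows in spring time near the spring".split()
    print(signature("spring", tokens, n=3, k=2, stopwords={"the", "in"}))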

52 For a lexicon L, a source word s and a target word t, NAS_L(s, t) is defined as the number of words in the signature G_{N,k}(s) of s that may be translated, using L, to words in the signature G_{N,k}(t) of t, normalized by dividing it by N. [sent-150, score-0.807]

53 Formally, NAS_L(s, t) = |{w ∈ G(s) : L(w) ∩ G(t) ≠ ∅}| / N, where L(x) is the set of candidate translations of x under the lexicon L. [sent-151, score-0.545]
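The formula above translates directly into code. Below is a minimal sketch of NAS_L together with the ranking-and-selection step described in sentences 45-46; the signature sets, the lexicon format and the toy data are assumptions consistent with the definitions in this summary, not the authors' implementation.

    def nas(sig_s, sig_t, lexicon_L, n):
        """NAS_L(s, t): the fraction of the n signature words of s that have
        at least one translation under lexicon_L inside the signature of t."""
        covered = sum(1 for w in sig_s if lexicon_L.get(w, set()) & sig_t)
        return covered / n

    def rank_candidates(headword_sig, candidate_sigs, lexicon_L, n, top_t=1):
        """Rank a headword's candidate translations by NAS, keep the top t."""
        scored = sorted(((nas(headword_sig, sig, lexicon_L, n), cand)
                         for cand, sig in candidate_sigs.items()),
                        reverse=True)
        return [cand for _, cand in scored[:top_t]]

    # Toy illustration with made-up signatures and a tiny noisy lexicon.
    lexicon_L = {"flower": {"flor"}, "sun": {"sol"}, "coil": {"bobina"}}
    headword_sig = {"flower", "sun", "rain"}
    candidate_sigs = {"primavera": {"flor", "sol", "lluvia"},
                      "resorte": {"bobina", "metal"}}
    print(rank_candidates(headword_sig, candidate_sigs, lexicon_L, n=3))
    # ['primavera']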

54 Lexicon Generation Experiments: We tested our algorithm by generating bilingual lexicons for Hebrew and Spanish, using English as a pivot language. [sent-163, score-0.752]

55 Lexicons: The source of the Hebrew-English lexicon was the Babylon on-line dictionary. [sent-195, score-0.299]

56 Since the corpus was segmented into words using spaces, lexicon entries containing spaces were discarded. [sent-197, score-0.311]

57 Therefore, every L1-L2 lexicon we mention is identical to the corresponding L2-L1 lexicon in the set of translation pairs it contains. [sent-200, score-0.606]

58 Our lexicon is thus the ‘noisiest’ that can be generated using a pivot language and two source-pivot-target lexicons, but it also provides the most complete candidate set possible. [sent-201, score-0.713]

59 Table 2 details the sizes and branching factors (BF) (the average number of translations per headword) of the input lexicons, as well as those of the generated initial noisy lexicon. [sent-203, score-0.412]
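For concreteness, the branching factor reported here is just the average number of translations per headword; a trivial sketch (the lexicon format is an assumption):

    def branching_factor(lexicon):
        """Average number of translations per headword."""
        return sum(len(t) for t in lexicon.values()) / len(lexicon)

    print(branching_factor({"a": {"x", "y"}, "b": {"z"}}))  # 1.5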

60 , the candidate translations of a source word s, by the size of the intersections of the sets of pivot translations of t_i and s. [sent-217, score-0.995]

61 Our initial lexicon is a many-to-many relation, so multiple alignments were possible; in fact, the number of possible alignments tends to be very large. [sent-225, score-0.343]

62 The 1000 highest-frequency tokens were discarded, as a large number of these are utilized as auxiliary syntactic words. (Footnote 4: We modified the standard cosine and city block metrics so that for all measures higher values would be better.) [sent-235, score-0.281]

63 For each of the test words, the correct translations were extracted from a modern professional concise printed Hebrew-Spanish-Hebrew dictionary (Prolog, 2003). [sent-246, score-0.404]

64 Results: Tables 3 and 4 summarize the results of the Hebrew-Spanish and Spanish-Hebrew lexicon generation respectively, for both the R1 and R2 test sets. [sent-262, score-0.294]

65 In the three co-occurrence-based methods (NAS similarity, cosine distance and city block distance), the highest-ranking translation was selected. [sent-263, score-0.37]

66 test words whose selected translation was one of the translations in the gold standard. [sent-269, score-0.415]

67 The IC ranking of translations is a partial order, as many translations are usually scored equally. [sent-270, score-0.5]

68 A result was counted as precise if any of the highest-ranking translations was in the gold standard, even if other translations were equally ranked, creating a bias in favor of IC. [sent-273, score-0.5]

69 In both the Hebrew-Spanish and the Spanish-Hebrew cases, our method significantly outperformed all baselines in generating a precise lexicon on the highest-ranking translations. [sent-274, score-0.292]

70 That the results for the Spanish-Hebrew lexicon are higher may arise from the difference in the gold standard. [sent-277, score-0.276]

71 IC results are not included, as they are incomparable to those of the other methods: IC tends to score many candidate translations identically, and in practice, the three highest-scoring sets of translation candidates contained on average 77% of all …

72 Table 5: Hebrew-Spanish lexicon generation: accuracy of 3 best translations for the R1 condition. [sent-295, score-0.497]

73 Table 6: Spanish-Hebrew lexicon generation: accuracy of 3 best translations for the R1 condition. [sent-307, score-0.497]

74 For all methods, many of the correct translations that do not rank first rank second or third. [sent-312, score-0.33]

75 The percentage of cases in which each of the scoring methods was able to successfully distinguish the correct (SCE1) or possibly correct (SCE2) translation from the random translation. [sent-327, score-0.315]

76 The set of possible translations in iLex tends to include, besides the “correct” translation of the gold standard, other translations that are suitable in certain contexts or are semantically related. [sent-329, score-0.641]

77 Thus to better compare the capability of NAS to distinguish correct and incorrect translations with that of other scores, we performed two more experiments. [sent-331, score-0.368]

78 In the first score comparison experiment (SCE1), we used the two R1 test sets, Hebrew and Spanish, from the lexicon generation test (section 4. [sent-332, score-0.339]

79 With all scores, precision values in SCE1 are higher than in the lexicon generation experiment. [sent-344, score-0.359]

80 This is consistent with the expectation that selection between a correct and a random, probably incorrect, translation is easier than selecting among the translations in iLex. [sent-345, score-0.442]

81 This may be a result of both translations in SCE2 being … (Figure 1 caption: NAS values, not algorithm precision, for various N sizes.) [sent-347, score-0.283]

82 As N approaches the size of the lexicon used for alignment, NAS values approach 1 for all word pairs. [sent-355, score-0.28]

83 Yet an empirical test has shown that NAS may be useful for a wide range of N values: we computed NAS values for the correct and random translations used in the Hebrew-Spanish SCE1 experiment (section 5), using N values between 50 and 2000. [sent-357, score-0.417]

84 Figure 1 shows the average score values (note that these are not precision values) for the correct and random translations across that N range. [sent-358, score-0.461]

85 The scores for the correct translations are consistently higher than those of the random translations, even while there is a discernible decline in the difference between them. [sent-359, score-0.377]

86 Dependency on Alignment Lexicon: NAS_L values depend on L, the lexicon in use. [sent-363, score-0.28]

87 Clearly, again, at the extremes of an almost empty lexicon or a lexicon containing every possible pair of words (a Cartesian product), this score would not be useful. [sent-364, score-0.563]

88 In general, correct lemmatization should improve results, since the signatures would consist of more meaningful information. [sent-375, score-0.309]

89 Conclusion: We presented a method to create a high-quality bilingual lexicon given a noisy one. [sent-384, score-0.463]

90 We focused on the case in which the noisy lexicon is created using two pivot language lexicons. [sent-385, score-0.721]

91 At the heart of our method is the non-aligned signatures (NAS) context similarity score, used for removing incorrect translations using cross-lingual cooccurrences. [sent-387, score-0.545]

92 The common method for context similarity scoring utilizes some algebraic distance between context vectors, and requires a single alignment of context vectors in one language into the other. [sent-389, score-0.334]

93 Finding a single correct alignment is unrealistic even when a perfectly correct lexicon is available. [sent-390, score-0.442]

94 In our task, moreover, the lexicon used for alignment was automatically generated from pivot language lexicons and was expected to contain errors. [sent-392, score-0.941]

95 While the purpose of this work was to discern correct translations from incorrect ones, it is worth noting that our method actually ranks translation correctness. [sent-399, score-0.554]

96 Unsupervised concept discovery in Hebrew using simple unsupervised word prefix segmentation for Hebrew and Arabic. [sent-420, score-0.4]

97 A statistical view on bilingual lexicon extraction: from parallel corpora to nonparallel corpora. [sent-424, score-0.473]

98 Improving translation lexicon induction from monolingual corpora via dependency contexts and part-of-speech equivalences. [sent-428, score-0.449]

99 Automatic identification of word translations from unrelated English and German corpora. [sent-465, score-0.278]

100 Inducing translation lexicons via diverse similarity measures and bridge languages. [sent-469, score-0.396]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('nas', 0.517), ('pivot', 0.395), ('translations', 0.25), ('lexicon', 0.247), ('lexicons', 0.241), ('signature', 0.21), ('hebrew', 0.2), ('signatures', 0.158), ('ic', 0.128), ('bilingual', 0.116), ('translation', 0.112), ('spanish', 0.106), ('ilex', 0.104), ('headword', 0.099), ('correct', 0.08), ('noisy', 0.079), ('block', 0.078), ('cosine', 0.078), ('consultation', 0.076), ('lemmatization', 0.071), ('printemps', 0.069), ('city', 0.069), ('headwords', 0.061), ('kaji', 0.061), ('tanaka', 0.056), ('dictionary', 0.052), ('intransitivity', 0.052), ('nasl', 0.052), ('primavera', 0.052), ('resorte', 0.052), ('source', 0.052), ('inverse', 0.051), ('corpora', 0.049), ('candidate', 0.048), ('spring', 0.047), ('generation', 0.047), ('pmi', 0.047), ('dictionaries', 0.046), ('bf', 0.046), ('garera', 0.046), ('pekar', 0.046), ('score', 0.045), ('similarity', 0.043), ('vectors', 0.043), ('pivots', 0.042), ('monolingual', 0.041), ('segmented', 0.04), ('target', 0.04), ('relatedness', 0.039), ('incorrect', 0.038), ('context', 0.035), ('alignment', 0.035), ('alignments', 0.035), ('babylon', 0.035), ('deferred', 0.035), ('dinur', 0.035), ('paik', 0.035), ('spanishhebrew', 0.035), ('branching', 0.034), ('languages', 0.034), ('distance', 0.033), ('values', 0.033), ('precision', 0.032), ('utilizes', 0.032), ('parallel', 0.031), ('nonparallel', 0.03), ('bond', 0.03), ('discern', 0.03), ('kumiko', 0.03), ('schafer', 0.03), ('publishing', 0.03), ('gold', 0.029), ('unrelated', 0.028), ('intersection', 0.028), ('hiroyuki', 0.028), ('jerusalem', 0.028), ('nouns', 0.027), ('scores', 0.026), ('wx', 0.026), ('directionality', 0.026), ('initial', 0.026), ('construction', 0.025), ('keys', 0.025), ('mausam', 0.025), ('cooccurrence', 0.025), ('compiled', 0.025), ('divergence', 0.024), ('baselines', 0.024), ('rapp', 0.024), ('words', 0.024), ('tokens', 0.023), ('ranks', 0.023), ('generated', 0.023), ('recall', 0.023), ('comparable', 0.023), ('modern', 0.022), ('counting', 0.022), ('scoring', 0.022), ('method', 0.021), ('random', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures

Author: Daphna Shezaf ; Ari Rappoport

Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.

2 0.25541398 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications

Author: Bin Wei ; Christopher Pal

Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key “reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an English-Chinese sentiment classification task and compare our results with a previous co-training approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improve the performance compared to previous work.

3 0.13740416 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

Author: Boxing Chen ; George Foster ; Roland Kuhn

Abstract: This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are used as additional features of the translation model to improve translation performance. Significant improvements are obtained over a state-of-the-art hierarchical phrase-based machine translation system. 1

4 0.13341105 123 acl-2010-Generating Focused Topic-Specific Sentiment Lexicons

Author: Valentin Jijkoun ; Maarten de Rijke ; Wouter Weerkamp

Abstract: We present a method for automatically generating focused and accurate topic-specific subjectivity lexicons from a general purpose polarity lexicon that allow users to pin-point subjective on-topic information in a set of relevant documents. We motivate the need for such lexicons in the field of media analysis, describe a bootstrapping method for generating a topic-specific lexicon from a general purpose polarity lexicon, and evaluate the quality of the generated lexicons both manually and using a TREC Blog track test set for opinionated blog post retrieval. Although the generated lexicons can be an order of magnitude more selective than the general purpose lexicon, they maintain, or even improve, the performance of an opinion retrieval system.

5 0.13267003 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

Author: Peter Prettenhofer ; Benno Stein

Abstract: We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.

6 0.12601231 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

7 0.10424069 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm

8 0.09466742 16 acl-2010-A Statistical Model for Lost Language Decipherment

9 0.079857774 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

10 0.078789741 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out

11 0.077979513 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

12 0.07761202 44 acl-2010-BabelNet: Building a Very Large Multilingual Semantic Network

13 0.076643892 54 acl-2010-Boosting-Based System Combination for Machine Translation

14 0.076252826 133 acl-2010-Hierarchical Search for Word Alignment

15 0.073536426 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking

16 0.07275895 210 acl-2010-Sentiment Translation through Lexicon Induction

17 0.071306929 262 acl-2010-Word Alignment with Synonym Regularization

18 0.070972763 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

19 0.070731863 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

20 0.06855379 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.191), (1, -0.065), (2, -0.107), (3, 0.072), (4, 0.061), (5, 0.013), (6, -0.013), (7, 0.0), (8, 0.022), (9, 0.041), (10, 0.035), (11, 0.098), (12, 0.035), (13, -0.083), (14, -0.053), (15, -0.011), (16, 0.145), (17, -0.144), (18, -0.004), (19, -0.076), (20, -0.017), (21, -0.099), (22, -0.026), (23, -0.079), (24, -0.004), (25, 0.007), (26, -0.035), (27, -0.041), (28, -0.014), (29, -0.026), (30, -0.075), (31, -0.13), (32, 0.24), (33, -0.213), (34, -0.046), (35, 0.101), (36, 0.033), (37, -0.131), (38, 0.165), (39, 0.052), (40, -0.064), (41, -0.104), (42, -0.07), (43, -0.039), (44, -0.033), (45, -0.087), (46, -0.182), (47, 0.063), (48, 0.116), (49, 0.092)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92547864 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures

Author: Daphna Shezaf ; Ari Rappoport

Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.

2 0.70155793 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications

Author: Bin Wei ; Christopher Pal

Abstract: In this paper, we study the problem of using an annotated corpus in English for the same natural language processing task in another language. While various machine translation systems are available, automated translation is still far from perfect. To minimize the noise introduced by translations, we propose to use only key “reliable” parts from the translations and apply structural correspondence learning (SCL) to find a low dimensional representation shared by the two languages. We perform experiments on an English-Chinese sentiment classification task and compare our results with a previous co-training approach. To alleviate the problem of data sparseness, we create extra pseudo-examples for SCL by making queries to a search engine. Experiments on real-world on-line review data demonstrate the two techniques can effectively improve the performance compared to previous work.

3 0.60876799 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

Author: Peter Prettenhofer ; Benno Stein

Abstract: We present a new approach to cross-language text classification that builds on structural correspondence learning, a recently proposed theory for domain adaptation. The approach uses unlabeled documents, along with a simple word translation oracle, in order to induce task-specific, cross-lingual word correspondences. We report on analyses that reveal quantitative insights about the use of unlabeled data and the complexity of interlanguage correspondence modeling. We conduct experiments in the field of cross-language sentiment classification, employing English as source language, and German, French, and Japanese as target languages. The results are convincing; they demonstrate both the robustness and the competitiveness of the presented ideas.

4 0.51274854 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.

5 0.48416516 104 acl-2010-Evaluating Machine Translations Using mNCD

Author: Marcus Dobrinkat ; Tero Tapiovaara ; Jaakko Vayrynen ; Kimmo Kettunen

Abstract: This paper introduces mNCD, a method for automatic evaluation of machine translations. The measure is based on normalized compression distance (NCD), a general information theoretic measure of string similarity, and flexible word matching provided by stemming and synonyms. The mNCD measure outperforms NCD in system-level correlation to human judgments in English.

6 0.42650953 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

7 0.39480671 16 acl-2010-A Statistical Model for Lost Language Decipherment

8 0.39038733 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

9 0.37981319 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

10 0.37975326 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

11 0.37490255 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web

12 0.37116101 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking

13 0.35830602 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities

14 0.34939796 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration

15 0.34306958 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

16 0.34218895 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.

17 0.32783735 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

18 0.31949735 177 acl-2010-Multilingual Pseudo-Relevance Feedback: Performance Study of Assisting Languages

19 0.31190398 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

20 0.29399574 11 acl-2010-A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.252), (25, 0.033), (42, 0.067), (44, 0.02), (59, 0.14), (73, 0.039), (78, 0.036), (83, 0.099), (84, 0.026), (98, 0.155)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81020916 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures

Author: Daphna Shezaf ; Ari Rappoport

Abstract: Bilingual lexicons are fundamental resources. Modern automated lexicon generation methods usually require parallel corpora, which are not available for most language pairs. Lexicons can be generated using non-parallel corpora or a pivot language, but such lexicons are noisy. We present an algorithm for generating a high quality lexicon from a noisy one, which only requires an independent corpus for each language. Our algorithm introduces non-aligned signatures (NAS), a cross-lingual word context similarity score that avoids the over-constrained and inefficient nature of alignment-based methods. We use NAS to eliminate incorrect translations from the generated lexicon. We evaluate our method by improving the quality of noisy Spanish-Hebrew lexicons generated from two pivot English lexicons. Our algorithm substantially outperforms other lexicon generation methods.

2 0.77168429 206 acl-2010-Semantic Parsing: The Task, the State of the Art and the Future

Author: Rohit J. Kate ; Yuk Wah Wong

Abstract: unkown-abstract

3 0.68984139 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images

Author: Yansong Feng ; Mirella Lapata

Abstract: In this paper we tackle the problem of automatic caption generation for news images. Our approach leverages the vast resource of pictures available on the web and the fact that many of them are captioned. Inspired by recent work in summarization, we propose extractive and abstractive caption generation models. They both operate over the output of a probabilistic image annotation model that preprocesses the pictures and suggests keywords to describe their content. Experimental results show that an abstractive model defined over phrases is superior to extractive methods.

4 0.683599 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

5 0.68123066 150 acl-2010-Inducing Domain-Specific Semantic Class Taggers from (Almost) Nothing

Author: Ruihong Huang ; Ellen Riloff

Abstract: This research explores the idea of inducing domain-specific semantic class taggers using only a domain-specific text collection and seed words. The learning process begins by inducing a classifier that only has access to contextual features, forcing it to generalize beyond the seeds. The contextual classifier then labels new instances, to expand and diversify the training set. Next, a cross-category bootstrapping process simultaneously trains a suite of classifiers for multiple semantic classes. The positive instances for one class are used as negative instances for the others in an iterative bootstrapping cycle. We also explore a one-semantic-class-per-discourse heuristic, and use the classifiers to dynam- ically create semantic features. We evaluate our approach by inducing six semantic taggers from a collection of veterinary medicine message board posts.

6 0.68071675 214 acl-2010-Sparsity in Dependency Grammar Induction

7 0.67926824 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences

8 0.67492616 144 acl-2010-Improved Unsupervised POS Induction through Prototype Discovery

9 0.67393672 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

10 0.67325211 56 acl-2010-Bridging SMT and TM with Translation Recommendation

11 0.67281187 254 acl-2010-Using Speech to Reply to SMS Messages While Driving: An In-Car Simulator User Study

12 0.67265278 54 acl-2010-Boosting-Based System Combination for Machine Translation

13 0.67226082 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning

14 0.67155129 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

15 0.67137617 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

16 0.67130595 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

17 0.67112398 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment

18 0.67050314 160 acl-2010-Learning Arguments and Supertypes of Semantic Relations Using Recursive Patterns

19 0.67027706 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

20 0.67007446 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation