acl acl2012 acl2012-150 knowledge-graph by maker-knowledge-mining

150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia


Source: pdf

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper we propose a method to automatically label multi-lingual data with named entity tags. [sent-6, score-0.368]

2 The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. [sent-8, score-0.629]

3 The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata. [sent-9, score-0.403]

4 The first has been to devise an algorithm to tag foreign language entities using metadata from the semi-structured Wikipedia repository: inter-wiki links, article categories, and cross-language links (Richman and Schone, 2008). [sent-13, score-1.015]

5 The second has been to use parallel English-foreign language data, a high-quality NER tagger for English, and projected annotations for the foreign language (Yarowsky et al. [sent-14, score-0.798]

6 ∗This research was conducted during the author’s internship at Microsoft Research. The goal of this work is to create high-accuracy NER annotated data for foreign languages. [sent-19, score-0.449]

7 It is a conditional model for target sentence annotation given an aligned English source sentence, where the English sentence is used only as a source of features. [sent-25, score-0.465]

8 Our results show that the semi-CRF model improves on the performance of projection models by more than 10 points in F-measure, and that we can achieve tagging F-measure of over 91 using a very small number of annotated sentence pairs. [sent-27, score-0.489]

9 Next, we present our two baseline methods: A Wikipedia metadata-based tagger and a cross-lingual projection tagger in Sections 3 and 4, respectively. [sent-29, score-0.749]

10 2 Data and task As a case study, we focus on two very different foreign languages: Korean and Bulgarian. [sent-31, score-0.449]

11 The English and foreign language sentences that comprise our training and test data are extracted from Wikipedia (http://www. [sent-32, score-0.483]

12 Of these, we manually annotated 91 English-Bulgarian and 79 English-Korean sentence pairs with source and target named entities as well as word-alignment links among named entities in the two languages. [sent-44, score-0.934]

13 The named entity annotation scheme followed has the labels GPE (Geopolitical entity), PER (Person), ORG (Organization), and DATE. [sent-46, score-0.36]

14 The task we evaluate on is tagging of foreign language sentences. [sent-51, score-0.498]

15 Table 1 shows the total number of English, Bulgarian and Korean entities and the percentage of entities that were manually aligned to an entity of the same type in the other language. [sent-54, score-0.674]

16 3 Wiki-based tagger: annotating sentences based on Wikipedia metadata We followed the approach of Richman and Schone (2008) to derive named entity annotations of both English and foreign phrases in Wikipedia, using Wikipedia metadata. [sent-68, score-0.972]

17 To tag English language phrases, we first derived named entity categorizations of English article titles, by assigning a tag based on the article’s category information. [sent-73, score-0.609]

18 Using the article-level annotations and article links we define a local English wiki-based tagger and a global English wiki-based tagger, which will be described in detail next. [sent-79, score-0.776]

19 This Wiki-based tagger tags phrases in an English article based on the article links from these phrases to NE-tagged articles. [sent-81, score-0.813]

20 For example, suppose that the phrase “Split” in the article with title “Igor Tudor” is linked to the article with title “Split”, which is classified as GPE. [sent-82, score-0.546]

21 Thus the local English Wiki-based tagger can tag this phrase as GPE. [sent-83, score-0.387]

22 If, within the same article, the phrase “Split” occurs again, it can be tagged again even if it is not linked to a tagged article (this is the one sense per document assumption). [sent-84, score-0.346]

23 Additionally, the tagger tags English phrases as DATE if they match a set of manually specified regular expressions. [sent-85, score-0.417]

24 This tagger tags phrases with NE tags if these phrases have ever been linked to a categorized article (the most frequent label is used). [sent-88, score-0.757]

25 For example, if “Split” does not have a link anywhere in the current article, but has been linked to the GPE-labeled article with title “Split” in another article, it will still be tagged as GPE. [sent-89, score-0.375]

26 We also apply a local+global Wiki-tagger, which tags entities according to the local Wiki-tagger and additionally tags any non-conflicting entities according to the global tagger. [sent-90, score-0.777]
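To make the local, global, and local+global Wiki-based taggers concrete, the following is a minimal sketch; the dictionary-shaped inputs, tokenization, and all function names are illustrative assumptions for the example, not the authors' implementation.

```python
# Minimal sketch of the local / global / local+global Wiki-based taggers.
# Inputs are plain dicts standing in for Wikipedia link structure and
# category-derived article tags; names are illustrative only.

def local_wiki_tag(article_links, article_ne_tags, phrases):
    """Tag phrases of one article via its outgoing links (one sense per document)."""
    phrase_tag = {}
    for phrase, target_title in article_links.items():
        tag = article_ne_tags.get(target_title)
        if tag:
            phrase_tag[phrase] = tag
    # One-sense-per-document: every occurrence of a tagged phrase in the same
    # article receives the same tag, whether or not that occurrence is linked.
    return {p: phrase_tag[p] for p in phrases if p in phrase_tag}

def global_wiki_tag(global_anchor_tags, phrases):
    """Tag phrases that have ever been linked (in any article) to a categorized
    article, using the most frequent label observed for that anchor text."""
    return {p: global_anchor_tags[p] for p in phrases if p in global_anchor_tags}

def local_plus_global(article_links, article_ne_tags, global_anchor_tags, phrases):
    tags = local_wiki_tag(article_links, article_ne_tags, phrases)
    for p, tag in global_wiki_tag(global_anchor_tags, phrases).items():
        # add global decisions only when they do not conflict with local ones
        if p not in tags:
            tags[p] = tag
    return tags

# Toy usage mirroring the "Split" example in the article "Igor Tudor"
article_links = {"Split": "Split"}       # phrase -> linked article title
article_ne_tags = {"Split": "GPE"}       # article title -> NE tag from categories
global_anchor_tags = {"Split": "GPE"}    # anchor text -> most frequent NE tag
print(local_plus_global(article_links, article_ne_tags, global_anchor_tags,
                        ["Split", "Igor Tudor"]))
```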

27 The idea is the same as for the local English tagger, with the difference that we first assign NE tags to foreign language articles by using the NE tags assigned to English articles to which they are connected with interwiki links. [sent-92, score-0.857]

28 Because we do not have maps from category phrases to NE tags for foreign languages, using inter-wiki links is a way to transfer this knowledge to the foreign languages. [sent-93, score-1.147]

29 After we have categorized foreign language articles we follow the same algorithm as for the local English Wiki-based tagger. [sent-94, score-0.595]

30 Global foreign Wiki-based tagger The global and local+global taggers are analogous, using the categorization of foreign articles as above. [sent-96, score-1.406]

31 The global Wiki-based tagger could assign multiple labels to the same string (corresponding to different senses in different occurrences). [sent-98, score-0.357]

32 The local tagger is best for Korean, as the precision suffers too much due to the global tagger. [sent-114, score-0.429]

33 The projection model described in this section and the Semi-CRF model described in Section 5 are trained using annotated data. [sent-123, score-0.405]

34 They can be applied to tag foreign sentences in English-foreign sentence pairs extracted from Wikipedia. [sent-124, score-0.564]

35 The task of projection is re-cast as a ranking task, where for each source entity Si, we rank all possible candidate target entity spans Tj and select the best span as corresponding to this source entity. [sent-125, score-1.143]

36 The probability distribution over target spans Tj for a given source entity Si is defined as follows: p(Tj | Si) = exp(λ · f(Si, Tj)) / Σ_{j′} exp(λ · f(Si, Tj′)), where λ is a parameter vector and f(Si, Tj) is a feature vector for the candidate entity pair. [sent-127, score-0.692]
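A minimal sketch of this log-linear ranking over candidate target spans follows; the feature function, weights, and span representations are placeholders, not the paper's feature set.

```python
import math

def rank_target_spans(source_entity, candidate_spans, feature_fn, weights):
    """Log-linear ranking of candidate target spans Tj for one source entity Si:
    p(Tj | Si) = exp(w . f(Si, Tj)) / sum_j' exp(w . f(Si, Tj'))."""
    scores = []
    for span in candidate_spans:
        f = feature_fn(source_entity, span)            # feature vector as a dict
        scores.append(sum(weights.get(k, 0.0) * v for k, v in f.items()))
    z = sum(math.exp(s) for s in scores)               # normalizer over candidates
    probs = [math.exp(s) / z for s in scores]
    best = max(range(len(candidate_spans)), key=lambda i: probs[i])
    return candidate_spans[best], probs

# Toy usage with a single made-up feature
feature_fn = lambda s, t: {"same_length": 1.0 if len(s) == len(t) else 0.0}
best, probs = rank_target_spans(("Split",), [("Сплит",), ("град", "Сплит")],
                                feature_fn, {"same_length": 2.0})
print(best, probs)
```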

37 The model projects these entities to corresponding foreign entities. [sent-129, score-0.711]

38 We train and evaluate the projection model using 10-fold cross-validation on the dataset from Table 1. [sent-130, score-0.371]

39 For training, we use the human-annotated gold English entities and the manually specified entity alignments to derive corresponding target entities. [sent-131, score-0.595]

40 At test time we use the local+global Wiki-based tagger to define the English entities and we don’t use the manually annotated alignments. [sent-132, score-0.434]

41 Sum of posterior probabilities of links from words inside one entity to words outside of the other entity. [sent-152, score-0.534]

42 • Indicator feature for whether the source and target entity can be extracted as a phrase pair according to the combined Viterbi alignments and the standard phrase extraction heuristic (grow-diag-final) (Koehn et al. [sent-160, score-0.519]

43 Phonetic similarity features These features measure the similarity between a source and target entity based on pronunciation. [sent-162, score-0.511]

44 We utilize a transliteration model (Cherry and Suzuki, 2009), trained from pairs of English person names and corresponding foreign language names, extracted from Wikipedia. [sent-163, score-0.602]

45 The transliteration model can return an n-best list of transliterations of a foreign string, together with scores. [sent-164, score-0.636]

46 We estimate phonetic similarity between a source and target entity by computing Levenshtein and other distance metrics between the source entity and the closest transliteration of the target (out of a 10-best list of transliterations). [sent-166, score-0.919]
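A sketch of this phonetic-similarity computation is given below; the transliteration n-best list is mocked with a fixed list, since the actual model (Cherry and Suzuki, 2009) is not reproduced here, and the normalization is an illustrative choice.

```python
def levenshtein(a, b):
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def phonetic_similarity(source_entity, nbest_transliterations):
    """Distance between the source entity and the closest of the target's
    10-best transliterations, crudely turned into a similarity score."""
    best = min(levenshtein(source_entity.lower(), t.lower())
               for t in nbest_transliterations[:10])
    return 1.0 - best / max(len(source_entity), 1)

# Toy usage: a made-up n-best list of transliterations of a foreign string
print(phonetic_similarity("Igor Tudor", ["igor tudor", "igor tudo", "iko tutor"]))
```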

47 Position/Length features These report relative length and position of the English and foreign entity following (Feng et al. [sent-169, score-0.71]

48 Wiki-based tagger features These features look at the degree of match between the source and target entities based on the tags assigned to them by the local and global Wiki-taggers for English and the foreign language, and by the Stanford tagger for English. [sent-171, score-1.796]

49 These are indicator features separate for the different source-target tagger combinations, looking at whether the taggers agree in their assignments to the candidate entities. [sent-172, score-0.501]

50 2 Model Evaluation We evaluate the tagging F-measure for projection models on the English-Bulgarian and English-Korean datasets. [sent-174, score-0.386]

51 The foreign language NE F-measure is reported in Table 3. [sent-176, score-0.449]

52 We present a detailed evaluation of the model to gain understanding of the strengths and limitations of the projection approach and to motivate our direct semi-CRF model. [sent-178, score-0.406]

53 To give an estimate of the upper bound on performance for the projection model, we first present two oracles. [sent-179, score-0.337]

54 The goal of the oracles is to estimate the impact of two sources of error for the projection model: the first is the error in detecting English entities, and the second is the error in determining the corresponding foreign entity for a given English entity. [sent-180, score-1.109]

55 The first oracle ORACLE1 has access to the gold-standard English entities and gold-standard word alignments among English and foreign words. [sent-181, score-0.746]

56 For each source entity, ORACLE1 selects the longest foreign language sequence of words that could be extracted in a phrase pair coupled with the source entity word sequence (according to the standard phrase extraction heuristic (Koehn et al. [sent-182, score-0.893]

57 Note that the word alignments do not uniquely identify the corresponding foreign phrase for each English phrase and some error is possible due to this. [sent-184, score-0.628]
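The phrase-pair constraint used both by the indicator feature in sentence 42 and by ORACLE1 can be illustrated with a standard phrase-extraction consistency check; this is a sketch under assumed data shapes (alignment as a set of index pairs, inclusive spans), not the paper's implementation.

```python
def consistent_phrase_pair(alignment, src_span, tgt_span):
    """Check whether (src_span, tgt_span) could be extracted as a phrase pair:
    no alignment link may cross the box defined by the two spans.
    alignment is a set of (src_index, tgt_index) links; spans are (start, end) inclusive."""
    s_lo, s_hi = src_span
    t_lo, t_hi = tgt_span
    has_inside_link = False
    for (i, j) in alignment:
        inside_src = s_lo <= i <= s_hi
        inside_tgt = t_lo <= j <= t_hi
        if inside_src != inside_tgt:       # a link crossing the box boundary
            return False
        if inside_src and inside_tgt:
            has_inside_link = True
    return has_inside_link                 # require at least one internal link

def longest_consistent_target(alignment, src_span, tgt_len):
    """ORACLE1-style choice: the longest target span consistent with the source entity.
    Unaligned target words make several spans consistent, hence the ambiguity noted above."""
    best = None
    for lo in range(tgt_len):
        for hi in range(lo, tgt_len):
            if consistent_phrase_pair(alignment, src_span, (lo, hi)):
                if best is None or (hi - lo) > (best[1] - best[0]):
                    best = (lo, hi)
    return best

# Toy usage: source entity covers tokens 2..3 of a 4-token source sentence
alignment = {(2, 1), (3, 2)}
print(longest_consistent_target(alignment, (2, 3), 5))
```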

58 The second oracle ORACLE2 provides the performance of the projection model when gold-standard source entities are known, but the corresponding target entities still have to be determined by the projection model (gold-standard alignments are not known). [sent-186, score-1.46]

59 In other words, ORACLE2 is the projection model with all features, where in the test set we provide the gold standard English entities as input. [sent-187, score-0.573]

60 ... performance of non-oracle projection models, which do not have access to any manually labeled information. [sent-198, score-0.389]

61 The line above, PM-WF, represents the projection model without the Wiki-tagger-derived features, and is included to show that the gain from using these features is substantial. [sent-201, score-0.463]

62 The difference in accuracy between the projection model and ORACLE2 is very large, and is due to the error of the Wiki-based English taggers. [sent-202, score-0.402]

63 When source entities are assigned with error for this language pair, projecting entity annotations from the source is not better than using the target Wiki-based annotations directly. [sent-205, score-0.923]

64 For Korean, while the trend in model performance is similar as oracle information is removed, the projection model achieves substantially better performance (80. [sent-206, score-0.442]

65 The drawback of the projection model is that it determines target entities only by assigning the best candidate for each source entity. [sent-209, score-0.857]

66 It cannot create target entities that do not correspond to source entities, it is not able to take into account multiple conflicting source NE taggers as sources of information, and it does not make use of target sentence context and entity consistency constraints. [sent-210, score-1.007]

67 We apply Semi-CRFs to learn an NE tagger for labeling foreign sentences in the context of corresponding source sentences with existing NE annotations. [sent-214, score-0.837]

68 The semi-CRF defines a distribution over foreign sentence labeled segmentations (where the segments are named entities with their labels, or segments of length one with label “NONE”). [sent-215, score-0.944]

69 Let s1, . . . , sp denote a segmentation of the foreign sentence x, where a segment sj = ⟨tj, uj, yj⟩ is determined by its start position tj, end position uj, and label yj. [sent-219, score-0.568]
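A minimal sketch of how a labeled segmentation defined this way could be scored under a semi-CRF follows; the feature function, weights, and the normalization over an explicit candidate list are illustrative simplifications (a real semi-CRF computes the partition function with dynamic programming).

```python
import math

# Label set from the annotation scheme plus NONE for length-one non-entity segments.
LABELS = ["GPE", "PER", "ORG", "DATE", "NONE"]

def score_segmentation(tokens, segmentation, weights, feature_fn):
    """Unnormalized semi-CRF score: sum of w . g(x, s_j, y_{j-1}) over segments (t, u, y)."""
    total, prev_label = 0.0, "START"
    for (t, u, y) in segmentation:
        feats = feature_fn(tokens, t, u, y, prev_label)
        total += sum(weights.get(k, 0.0) * v for k, v in feats.items())
        prev_label = y
    return total

def prob_of_segmentation(tokens, segmentation, weights, feature_fn, candidates):
    """p(s | x) = exp(score(s)) / sum_{s'} exp(score(s')), normalized here over an
    enumerated candidate list for illustration only."""
    z = sum(math.exp(score_segmentation(tokens, s, weights, feature_fn))
            for s in candidates)
    return math.exp(score_segmentation(tokens, segmentation, weights, feature_fn)) / z

# Toy usage with a hand-written feature function and two candidate segmentations
tokens = ["Игор", "Тудор"]
feature_fn = lambda x, t, u, y, prev: {"len1_" + y: 1.0 if u == t else 0.0,
                                       "multi_" + y: 1.0 if u > t else 0.0}
weights = {"multi_PER": 2.0, "len1_NONE": 0.5}
cands = [[(0, 1, "PER")], [(0, 0, "NONE"), (1, 1, "NONE")]]
for s in cands:
    print(s, prob_of_segmentation(tokens, s, weights, feature_fn, cands))
```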

70 The features look at the English and foreign sentence as well as external annotations A. [sent-229, score-0.696]

71 Different and possibly conflicting NE tags for candidate English and foreign sentence substrings according to the Wiki-based taggers and the Stanford tagger are specified as one type of external annotations (see Figure 2). [sent-231, score-1.153]

72 They provide two kinds of alignment links between English and foreign tokens: one based on the HMM-word alignments (posterior probability of the link in both directions), and another based on different character-based distance metrics between transliterations of foreign words and English words. [sent-233, score-1.194]

73 A third annotation type is automatically derived links between foreign candidate entity strings (sequences of tokens) and best corresponding English candidate entities. [sent-236, score-1.022]

74 The candidate English entities are defined by the union of entities proposed by the Wiki-based taggers and the Stanford tagger. [sent-237, score-0.642]

75 We link foreign candidate segments with English candidate entities based on the projection model described in Section 4 and trained on the same data. [sent-239, score-1.249]

76 The projection model scores every source-target entity pair and selects the best source for each target candidate entity. [sent-240, score-0.859]

77 For our example target segment, the corresponding source candidate entity is “Split”, labeled GPE by the local+global Wiki-tagger and by the global Wiki-tagger. [sent-241, score-0.656]

78 These features look at target segments and extract indicators of whether the label of the segment agrees with the label assigned by the local, global, and/or local+global wiki tagger. [sent-244, score-0.542]

79 For the example segment from the sentence in Figure 1, since neither the local nor the global tagger has assigned the label GPE, the first three features have value zero. [sent-245, score-0.688]

80 These features look at the linked English segment for the candidate target segment and compare the tags assigned to the English segment by the different English taggers to the candidate target label. [sent-258, score-1.115]

81 In addition to segment-level comparisons, they also look at tag assignments for individual source tokens linked to the individual target tokens (by word alignment and transliteration links). [sent-259, score-0.521]

82 The feature SOURCE-E-WIKI-TAG-MATCH looks at whether the corresponding source entity has the same local+global Wiki-tagger-assigned tag as the candidate target entity. [sent-261, score-0.565]

83 The next two features look at the Stanford tagger and the global Wiki-tagger. [sent-262, score-0.445]

84 The real-valued features like SCORE-SOURCE-E-WIKI-TAG-MATCH return the score of the matching between the source and target candidate entities (according to the projection model), if the labels match. [sent-263, score-0.915]
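A sketch of how the segment-level tag-match features described in sentences 78-84 might be computed is shown here; the feature names, data shapes, and the link score in the toy call are illustrative assumptions.

```python
def wiki_match_features(target_label, target_tags, linked_source_tags, link_score):
    """Indicator and real-valued features comparing a candidate target segment's label
    with the tags assigned by target-side taggers and by the taggers of the linked
    English candidate entity."""
    feats = {}
    # target-side Wiki-tagger agreement (local, global, local+global)
    for name, tag in target_tags.items():
        feats["TARGET-%s-TAG-MATCH" % name.upper()] = 1.0 if tag == target_label else 0.0
    # source-side agreement for the linked English candidate entity
    for name, tag in linked_source_tags.items():
        match = 1.0 if tag == target_label else 0.0
        feats["SOURCE-%s-TAG-MATCH" % name.upper()] = match
        if match:
            # real-valued variant: the projection model's score for the linked pair,
            # fired only when the labels match
            feats["SCORE-SOURCE-%s-TAG-MATCH" % name.upper()] = link_score
    return feats

# Toy usage for the "Split"/GPE example: target-side taggers assigned nothing,
# the linked source entity was tagged GPE; the link score 0.87 is made up.
print(wiki_match_features(
    target_label="GPE",
    target_tags={"local": None, "global": None, "local+global": None},
    linked_source_tags={"e-wiki": "GPE", "stanford": "GPE", "global-wiki": "GPE"},
    link_score=0.87))
```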

85 We perform 10-fold cross-validation as in the projection experiments. [sent-267, score-0.337]

86 Additionally, we report performance of the full bilingual model with all features, but when English candidate entities are generated only according to the local+global Wiki-tagger (BI-ALL-WT). [sent-274, score-0.379]

87 The main results show that the full semi-CRF model greatly outperforms the baseline projection and Wiki-taggers. [sent-275, score-0.371]

88 2, more than 10 points higher than the performance of the projection model. [sent-280, score-0.367]

89 The additional gain due to considering candidate source entities generated from all English taggers was 1. [sent-288, score-0.528]

90 If we restrict the semi-CRF to use only features similar to the ones used by the projection model, we still obtain performance much better than that of the projection model: comparing BI to the projection model, we see gains of 9. [sent-290, score-1.068]

91 This is due to the fact that the semi-CRF is able to relax the assumption of one-to-one correspondence between source and target entities, and can effectively combine information from multiple source and target taggers. [sent-292, score-0.386]

92 We should note that the proposed method can only tag foreign sentences in English-foreign sentence pairs. [sent-293, score-0.564]

93 The next step for this work is to train monolingual NE taggers for the foreign languages, which can work on text within or outside of Wikipedia. [sent-294, score-0.65]

94 6 Related Work As discussed throughout the paper, our model builds upon prior work on Wikipedia metadata-based NE tagging (Richman and Schone, 2008) and cross-lingual projection for named entities (Feng et al. [sent-296, score-0.711]

95 In contrast, our model is not concerned with tagging English sentences but only tags foreign sentences in the context of English sentences. [sent-302, score-0.675]

96 (2010a), our semi-CRF approach does not require enumeration of n-best candidates for the English sentence and is not limited to n-best candidates for the foreign sentence. [sent-304, score-0.488]

97 7 Conclusions In this paper we showed that using resources from Wikipedia, it is possible to combine metadata-based approaches and projection-based approaches for inducing named entity annotations for foreign languages. [sent-306, score-0.827]

98 We presented a direct semi-CRF tagging model for labeling foreign sentences in parallel sentence pairs, which outperformed projection by more than 10 F-measure points for Bulgarian and Korean. [sent-307, score-1.026]

99 Improved named entity translation and bilingual named entity extraction. [sent-337, score-0.638]

100 Inducing multilingual text analysis tools via robust projection across aligned corpora. [sent-385, score-0.377]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('foreign', 0.449), ('projection', 0.337), ('bulgarian', 0.242), ('tagger', 0.206), ('entity', 0.204), ('entities', 0.202), ('korean', 0.172), ('article', 0.168), ('ne', 0.153), ('english', 0.15), ('taggers', 0.147), ('wikipedia', 0.128), ('global', 0.116), ('gpe', 0.113), ('richman', 0.108), ('local', 0.107), ('target', 0.105), ('burkett', 0.096), ('links', 0.094), ('transliteration', 0.093), ('candidate', 0.091), ('named', 0.089), ('linked', 0.088), ('source', 0.088), ('tj', 0.086), ('igor', 0.086), ('segment', 0.085), ('annotations', 0.085), ('schone', 0.075), ('tags', 0.075), ('look', 0.066), ('sarawagi', 0.06), ('transliterations', 0.06), ('wiki', 0.06), ('metadata', 0.06), ('parallel', 0.058), ('alignments', 0.058), ('features', 0.057), ('tudor', 0.056), ('ner', 0.055), ('monolingual', 0.054), ('bilingual', 0.052), ('phrases', 0.051), ('tagging', 0.049), ('segments', 0.048), ('split', 0.046), ('link', 0.045), ('title', 0.045), ('stanford', 0.045), ('matchings', 0.043), ('postech', 0.043), ('sunita', 0.043), ('wikitagger', 0.043), ('label', 0.043), ('tag', 0.042), ('yarowsky', 0.041), ('aligned', 0.04), ('alignment', 0.039), ('sentence', 0.039), ('articles', 0.039), ('si', 0.038), ('feng', 0.038), ('pohang', 0.038), ('interwiki', 0.038), ('oracle', 0.037), ('smith', 0.037), ('derived', 0.035), ('direct', 0.035), ('labels', 0.035), ('assigned', 0.035), ('sentences', 0.034), ('uj', 0.034), ('model', 0.034), ('posterior', 0.032), ('annotation', 0.032), ('phonetic', 0.032), ('kr', 0.032), ('semimarkov', 0.032), ('wf', 0.032), ('pj', 0.032), ('specified', 0.032), ('phrase', 0.032), ('error', 0.031), ('analyzers', 0.03), ('points', 0.03), ('category', 0.029), ('tagged', 0.029), ('conflicting', 0.029), ('capitalization', 0.029), ('versus', 0.028), ('south', 0.028), ('blitzer', 0.028), ('hmm', 0.027), ('match', 0.027), ('labeled', 0.026), ('manually', 0.026), ('bi', 0.026), ('snyder', 0.026), ('cherry', 0.026), ('corresponding', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

2 0.31486487 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction

Author: Seokhwan Kim ; Gary Geunbae Lee

Abstract: Although researchers have conducted extensive studies on relation extraction in the last decade, supervised approaches are still limited because they require large amounts of training data to achieve high performances. To build a relation extractor without significant annotation effort, we can exploit cross-lingual annotation projection, which leverages parallel corpora as external resources for supervision. This paper proposes a novel graph-based projection approach and demonstrates the merits of it by using a Korean relation extraction system based on projected dataset from an English-Korean parallel corpus.

3 0.1492857 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

Author: Preslav Nakov ; Jorg Tiedemann

Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

4 0.14511113 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: Identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show our method cleanly combining various features, resulting in a system that outperforms previous work.

5 0.134929 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

Author: Limin Yao ; Sebastian Riedel ; Andrew McCallum

Abstract: To discover relation types from text, most methods cluster shallow or syntactic patterns of relation mentions, but consider only one possible sense per pattern. In practice this assumption is often violated. In this paper we overcome this issue by inducing clusters of pattern senses from feature representations of patterns. In particular, we employ a topic model to partition entity pairs associated with patterns into sense clusters using local and global features. We merge these sense clusters into semantic relations using hierarchical agglomerative clustering. We compare against several baselines: a generative latent-variable model, a clustering method that does not disambiguate between path senses, and our own approach but with only local features. Experimental results show our proposed approach discovers dramatically more accurate clusters than models without sense disambiguation, and that incorporating global features, such as the document theme, is crucial.

6 0.12722953 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

7 0.10926953 73 acl-2012-Discriminative Learning for Joint Template Filling

8 0.10723397 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

9 0.10589394 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

10 0.1048556 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

11 0.10348944 142 acl-2012-Mining Entity Types from Query Logs via User Intent Modeling

12 0.10243277 153 acl-2012-Named Entity Disambiguation in Streaming Data

13 0.099774867 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

14 0.099753439 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

15 0.0968135 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

16 0.096727744 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

17 0.095798977 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

18 0.094798028 212 acl-2012-Using Search-Logs to Improve Query Tagging

19 0.094489433 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

20 0.091815941 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.27), (1, 0.027), (2, -0.015), (3, 0.091), (4, 0.147), (5, 0.138), (6, 0.003), (7, -0.057), (8, 0.044), (9, -0.109), (10, 0.153), (11, -0.088), (12, -0.015), (13, -0.002), (14, 0.082), (15, 0.077), (16, 0.009), (17, -0.073), (18, 0.087), (19, 0.101), (20, -0.204), (21, 0.076), (22, 0.093), (23, -0.066), (24, -0.082), (25, 0.182), (26, -0.1), (27, -0.049), (28, -0.166), (29, 0.29), (30, -0.025), (31, -0.026), (32, -0.041), (33, 0.148), (34, -0.126), (35, -0.024), (36, -0.091), (37, 0.013), (38, 0.029), (39, 0.12), (40, -0.062), (41, 0.059), (42, -0.095), (43, 0.04), (44, 0.088), (45, 0.037), (46, -0.078), (47, 0.074), (48, 0.027), (49, 0.071)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95929408 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

2 0.8078981 12 acl-2012-A Graph-based Cross-lingual Projection Approach for Weakly Supervised Relation Extraction

Author: Seokhwan Kim ; Gary Geunbae Lee

Abstract: Although researchers have conducted extensive studies on relation extraction in the last decade, supervised approaches are still limited because they require large amounts of training data to achieve high performances. To build a relation extractor without significant annotation effort, we can exploit cross-lingual annotation projection, which leverages parallel corpora as external resources for supervision. This paper proposes a novel graph-based projection approach and demonstrates the merits of it by using a Korean relation extraction system based on projected dataset from an English-Korean parallel corpus.

3 0.51825553 42 acl-2012-Bootstrapping via Graph Propagation

Author: Max Whitney ; Anoop Sarkar

Abstract: Bootstrapping a classifier from a small set of seed rules can be viewed as the propagation of labels between examples via features shared between them. This paper introduces a novel variant of the Yarowsky algorithm based on this view. It is a bootstrapping learning method which uses a graph propagation algorithm with a well defined objective function. The experimental results show that our proposed bootstrapping algorithm achieves state of the art performance or better on several different natural language data sets.

4 0.50974923 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

Author: Marcis Pinnis ; Radu Ion ; Dan Stefanescu ; Fangzhong Su ; Inguna Skadina ; Andrejs Vasiljevs ; Bogdan Babych

Abstract: The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

5 0.5057829 153 acl-2012-Named Entity Disambiguation in Streaming Data

Author: Alexandre Davis ; Adriano Veloso ; Altigran Soares ; Alberto Laender ; Wagner Meira Jr.

Abstract: The named entity disambiguation task is to resolve the many-to-many correspondence between ambiguous names and the unique realworld entity. This task can be modeled as a classification problem, provided that positive and negative examples are available for learning binary classifiers. High-quality sense-annotated data, however, are hard to be obtained in streaming environments, since the training corpus would have to be constantly updated in order to accommodate the fresh data coming on the stream. On the other hand, few positive examples plus large amounts of unlabeled data may be easily acquired. Producing binary classifiers directly from this data, however, leads to poor disambiguation performance. Thus, we propose to enhance the quality of the classifiers using finer-grained variations of the well-known ExpectationMaximization (EM) algorithm. We conducted a systematic evaluation using Twitter streaming data and the results show that our classifiers are extremely effective, providing improvements ranging from 1% to 20%, when compared to the current state-of-the-art biased SVMs, being more than 120 times faster.

6 0.50356328 20 acl-2012-A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

7 0.45886365 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

8 0.45855755 73 acl-2012-Discriminative Learning for Joint Template Filling

9 0.44600928 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

10 0.43972188 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

11 0.42536843 134 acl-2012-Learning to Find Translations and Transliterations on the Web

12 0.41175807 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation

13 0.40568364 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

14 0.40186745 195 acl-2012-The Creation of a Corpus of English Metalanguage

15 0.37988335 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

16 0.37587237 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

17 0.37264869 137 acl-2012-Lemmatisation as a Tagging Task

18 0.35847604 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

19 0.35805336 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

20 0.35521859 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.032), (26, 0.059), (28, 0.071), (30, 0.021), (37, 0.043), (39, 0.043), (57, 0.012), (74, 0.037), (82, 0.064), (84, 0.019), (85, 0.035), (86, 0.158), (90, 0.198), (92, 0.046), (94, 0.019), (99, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96741509 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

Author: Joel Nothman ; Matthew Honnibal ; Ben Hachey ; James R. Curran

Abstract: Interpreting news requires identifying its constituent events. Events are complex linguistically and ontologically, so disambiguating their reference is challenging. We introduce event linking, which canonically labels an event reference with the article where it was first reported. This implicitly relaxes coreference to co-reporting, and will practically enable augmenting news archives with semantic hyperlinks. We annotate and analyse a corpus of 150 documents, extracting 501 links to a news archive with reasonable inter-annotator agreement.

2 0.95881408 32 acl-2012-Automated Essay Scoring Based on Finite State Transducer: towards ASR Transcription of Oral English Speech

Author: Xingyuan Peng ; Dengfeng Ke ; Bo Xu

Abstract: Conventional Automated Essay Scoring (AES) measures may cause severe problems when directly applied in scoring Automatic Speech Recognition (ASR) transcription as they are error sensitive and unsuitable for the characteristic of ASR transcription. Therefore, we introduce a framework of Finite State Transducer (FST) to avoid the shortcomings. Compared with the Latent Semantic Analysis with Support Vector Regression (LSA-SVR) method (stands for the conventional measures), our FST method shows better performance especially towards the ASR transcription. In addition, we apply the synonyms similarity to expand the FST model. The final scoring performance reaches an acceptable level of 0.80 which is only 0.07 lower than the correlation (0.87) between human raters.

3 0.94176263 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.

same-paper 4 0.87295717 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

5 0.81889528 187 acl-2012-Subgroup Detection in Ideological Discussions

Author: Amjad Abu-Jbara ; Pradeep Dasigi ; Mona Diab ; Dragomir Radev

Abstract: The rapid and continuous growth of social networking sites has led to the emergence of many communities of communicating groups. Many of these groups discuss ideological and political topics. It is not uncommon that the participants in such discussions split into two or more subgroups. The members of each subgroup share the same opinion toward the discussion topic and are more likely to agree with members of the same subgroup and disagree with members from opposing subgroups. In this paper, we propose an unsupervised approach for automatically detecting discussant subgroups in online communities. We analyze the text exchanged between the participants of a discussion to identify the attitude they carry toward each other and towards the various aspects of the discussion topic. We use attitude predictions to construct an attitude vector for each discussant. We use clustering techniques to cluster these vectors and, hence, determine the subgroup membership of each participant. We compare our methods to text clustering and other baselines, and show that our method achieves promising results.

6 0.81327307 45 acl-2012-Capturing Paradigmatic and Syntagmatic Lexical Relations: Towards Accurate Chinese Part-of-Speech Tagging

7 0.81318456 191 acl-2012-Temporally Anchored Relation Extraction

8 0.8119396 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

9 0.80762964 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

10 0.80749446 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

11 0.80725121 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

12 0.80425787 140 acl-2012-Machine Translation without Words through Substring Alignment

13 0.80319333 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

14 0.80183059 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

15 0.80170643 73 acl-2012-Discriminative Learning for Joint Template Filling

16 0.80137324 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

17 0.80035126 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

18 0.79960656 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

19 0.7990973 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

20 0.79900545 131 acl-2012-Learning Translation Consensus with Structured Label Propagation