acl acl2013 acl2013-360 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Camillo Lugaresi ; Barbara Di Eugenio
Abstract: We present a corpus analysis of how Italian connectives are translated into LIS, the Italian Sign Language. Since corpus resources are scarce, we propose an alignment method between the syntactic trees of the Italian sentence and of its LIS translation. This method, and clustering applied to its outputs, highlight the different ways a connective can be rendered in LIS: with a corresponding sign, by affecting the location or shape of other signs, or being omitted altogether. We translate these findings into a computational model that will be integrated into the pipeline of an existing Italian-LIS rendering system. Initial experiments to learn the four possible translations with Decision Trees give promising results.
Reference: text
sentIndex sentText sentNum sentScore
1 Translating Italian connectives into Italian Sign Language. Camillo Lugaresi, University of Illinois at Chicago / Politecnico di Milano, clugar2@uic.edu [sent-1, score-0.342]
2 Abstract We present a corpus analysis of how Italian connectives are translated into LIS, the Italian Sign Language. [sent-2, score-0.225]
3 This method, and clustering applied to its outputs, highlight the different ways a connective can be rendered in LIS: with a corresponding sign, by affecting the location or shape of other signs, or being omitted altogether. [sent-4, score-0.12]
4 We translate these findings into a computational model that will be integrated into the pipeline of an existing Italian-LIS rendering system. [sent-5, score-0.029]
5 1 Introduction Automatic translation between a spoken language and a signed language gives rise to some of the same difficulties as translation between spoken languages, but adds unique challenges of its own. [sent-7, score-0.393]
6 Therefore, translation from any spoken language into the signed language of that specific region is at least as complicated as translation between any pair of unrelated languages. [sent-9, score-0.272]
7 The problem of automatic translation is compounded by the fact that the amount of computational resources to draw on is much smaller than is typical for major spoken languages. [sent-10, score-0.222]
Barbara Di Eugenio, Department of Computer Science, University of Illinois at Chicago, bdieugen@uic.edu
8 Moreover, the fact that sign languages employ a different transmission modality (gestures and expressions instead of sounds) means that existing writing systems are not easily adaptable to them. [sent-11, score-0.450] [sent-12, score-0.062]
10 The resulting lack of a shared written form does nothing to improve the availability of sign language corpora; bilingual corpora, which are of particular importance to a translation system, are especially rare. [sent-13, score-0.477]
11 In fact, various projects around the world are trying to ameliorate this sad state of affairs for specific Sign Languages (Lu and Huenerfauth, 2010; Braffort et al. [sent-14, score-0.122]
12 In this paper, we describe our work on the translation of connectives from Italian into LIS, the Italian Sign Language (Lingua Italiana dei Segni). [sent-17, score-0.355]
13 Because the communities of signers in Italy are relatively small and fragmented, and the language has a relatively short history, there is far less existing research and material to draw on than for, say, ASL (American Sign Language) or BSL (British Sign Language). [sent-18, score-0.068]
14 Our work was undertaken within the purview of the ATLAS project (Bertoldi et al., 2012), which developed a full pipeline for translating Italian into LIS. [sent-19, score-0.061] [sent-24, score-0.027]
16 ATLAS is part of a recent crop of projects devoted to developing automatic translation from language L spoken in geographic area G into the sign language used in G (Dreuw et al. [sent-25, score-0.716]
17 Input is taken in the form of written Italian text, parsed, and converted into a semantic representation of its contents; from this semantic representation, LIS output is produced, using a custom serialization format called AEWLIS (which we will describe later). [sent-29, score-0.121]
18 This representation is then augmented with space positioning information, and fed into a final renderer component that performs the signs using a virtual actor. [sent-30, score-0.134]
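The pipeline just described (written Italian text → parse → semantic representation → AEWLIS serialization → spatial positioning → rendering by a virtual actor) can be sketched as a simple composition of stages. Everything below is our own placeholder sketch, not the actual ATLAS components: the function names, data shapes, and stub bodies are invented; only the stage order and the uppercase-lemma convention for LIS glosses come from the text.

```python
# Hypothetical sketch of the ATLAS Italian-to-LIS pipeline structure.
# Stage names and data shapes are our own; each body is a placeholder.
from typing import Any, Callable

def parse_italian(text: str) -> dict:
    """Placeholder: parse written Italian input into a syntax tree."""
    return {"tree": text.split()}

def to_semantics(tree: dict) -> dict:
    """Placeholder: convert the parse into a semantic representation."""
    return {"sem": tree["tree"]}

def generate_aewlis(sem: dict) -> list[str]:
    """Placeholder: produce LIS glosses in AEWLIS order."""
    return [w.upper() for w in sem["sem"]]  # LIS lemmas are written uppercase

def position_signs(glosses: list[str]) -> list[tuple[str, int]]:
    """Placeholder: augment each sign with space positioning info."""
    return [(g, i) for i, g in enumerate(glosses)]

PIPELINE: list[Callable[[Any], Any]] = [
    parse_italian, to_semantics, generate_aewlis, position_signs]

def run(text: str) -> Any:
    out: Any = text
    for stage in PIPELINE:
        out = stage(out)
    return out
```

Keeping each stage behind a uniform callable interface mirrors how the final renderer (the virtual actor) can consume the positioned signs without touching the linguistic stages.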
19 ATLAS focused on a limited domain for which a bilingual Italian/LIS corpus was available: weather forecasts, for which the Italian public broadcasting corporation (RAI) had long been producing special broadcasts with a signed translation. [sent-31, score-0.041] [sent-33, score-0.180]
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 270–280, Sofia, Bulgaria, August 4–9 2013. ©2013 Association for Computational Linguistics.
21 This yielded a corpus of 376 LIS sentences with corresponding Italian text: this corpus, converted into AEWLIS format, was the main data source for the project. [sent-34, score-0.027]
22 Still, it is a very small corpus, hence the main project shied away from statistical NLP techniques, relying instead on rule-based approaches developed with the help of a native Italian/LIS bilingual speaker; a similar approach is taken e. [sent-35, score-0.071]
23 The main semantic-bearing elements of an Italian sentence, such as nouns or verbs, typically have a LIS sign as their direct translation. [sent-41, score-0.397]
24 We focus on a different class of elements, comprising conjunctions and prepositions, but also some adverbs and prepositional phrases; collectively, we refer to them as connectives. [sent-42, score-0.095]
25 Since they are mainly structural elements, they are more heavily affected by differences in the syntax and grammar of Italian and LIS (and, presumably, in those of any spoken language and the “corresponding” SL). [sent-43, score-0.105]
26 Specifically, as we will see later, some connectives are translated with a sign, some are dropped altogether, and others affect the positioning of other signs, or simply their syntactic proximity. [sent-44, score-0.511]
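These four outcomes are what the Decision Tree experiments mentioned in the abstract learn to predict. As an illustration only, a four-way choice of this kind could look like the hand-written rule below; the feature names, labels, and branching order are our invention, not the learned tree:

```python
# Illustrative only: a hand-written four-way decision rule mimicking the
# classification task (sign / omit / relocate / proximity). Features,
# labels, and thresholds are invented for this sketch.

def classify_connective(pos: str, carries_content: bool,
                        links_clauses: bool) -> str:
    if carries_content:          # e.g. temporal adverbs carrying key info
        return "sign"            # rendered with a corresponding LIS sign
    if pos == "preposition":
        return "relocate"        # affects location/shape of other signs
    if links_clauses:
        return "proximity"       # expressed by syntactic adjacency alone
    return "omit"                # dropped altogether
```

A learned decision tree would induce comparable branch tests from annotated examples rather than from hand-picked conditions.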
27 For example, while prepositions can be seen as connectives (Ferrari, 2008), only a few adverbs can work as connectives. [sent-46, score-0.317]
28 From the Italian Treebank, we extracted all words or phrases that belonged to a syntactic category that can be a connective (conjunction, preposition, adverb or prepositional phrase). [sent-47, score-0.124]
29 We then found that we could better serve the needs of ATLAS by running our analysis on the entire resulting list, without filtering it by eliminating the entries that are not actual connectives. [sent-48, score-0.035]
30 For instance, the temporal adverbs “domani” (tomorrow) and “dopodomani” (the day after tomorrow) are nearly always preserved, as they do carry key information (especially for weather forecasting) and are not structural elements. [sent-51, score-0.196]
31 In performing our analysis, we pursued a different path from the main project, relying entirely on the bilingual corpus. [sent-52, score-0.106]
32 Although the use of statistical techniques was hampered by the small size of the corpus, at the same time it presented an interesting opportunity to attack the problem from a different angle. [sent-53, score-0.035]
33 In this paper we describe how we uncovered the translation distributions of the different connectives from Italian to LIS via tree alignment. [sent-54, score-0.311]
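Because LIS glosses in the corpus are written as uppercase Italian lemmas, even a crude lemma-matching pass can bootstrap an alignment. The sketch below is such a first pass, our simplification for illustration, not the paper's syntactic-tree alignment method:

```python
# Minimal sketch of lemma-based word alignment (our simplification, not
# the paper's tree alignment): match Italian lemmas to LIS glosses by
# case-insensitive equality, greedily, left to right.

def align(italian_lemmas: list[str],
          lis_glosses: list[str]) -> list[tuple[int, int]]:
    """Align each Italian lemma to the first unused matching gloss."""
    used: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for i, lemma in enumerate(italian_lemmas):
        for j, gloss in enumerate(lis_glosses):
            if j not in used and gloss.lower() == lemma.lower():
                pairs.append((i, j))
                used.add(j)
                break
    return pairs
```

Italian words that such a pass leaves unaligned, e.g. a dropped connective, are exactly the cases whose translation behavior the tree alignment and clustering are meant to expose.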
34 2 Corpus Analysis The corpus consists of 40 weather forecasts in Italian and LIS. [sent-55, score-0.169]
35 The Italian spoken utterance and LIS signing were transcribed from the original videos; one example of an Italian sentence and its LIS equivalent is shown in Figure 1. [sent-56, score-0.196]
36 An English word-by-word translation is provided for the Italian sentence, followed by a more fluent translation; the LIS glosses are literally translated. [sent-57, score-0.085]
37 Note that for LIS, this simply includes the gloss for the corresponding sign. [sent-58, score-0.041]
38 The 40 weather forecasts comprise 374 Italian sentences and 376 LIS sentences, stored in 372 AEWLIS files. [sent-59, score-0.17]
39 In most cases, a file corresponds to one Italian sentence and one corresponding LIS sentence; however, there are 4 files where an Italian sentence is split into two LIS sentences, and 2 files where two Italian sentences are merged into one LIS sentence. [sent-60, score-0.104]
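These counts are internally consistent, as a quick arithmetic check confirms:

```python
# Consistency check for the corpus statistics reported above:
# 372 AEWLIS files; 4 files split one Italian sentence into two LIS
# sentences; 2 files merge two Italian sentences into one LIS sentence.
files, split_files, merge_files = 372, 4, 2
one_to_one = files - split_files - merge_files   # 366 files map 1:1

italian_sentences = one_to_one + split_files + 2 * merge_files  # 374
lis_sentences = one_to_one + 2 * split_files + merge_files      # 376
```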
40 AEWLIS is an XML-based format (see Figure 2) which represents each sign in the LIS sentence as an element, in the order in which they occur in the sentence. [sent-61, score-0.392]
41 A sign’s lemma is represented by the Italian word with the same meaning, always written in uppercase, and with its part of speech (tipoAG in Figure 2); there are also IDs referencing the lemma’s position in a few dictionaries, but these are not always present. [sent-62, score-0.142]
42 Additional sign attributes are stored as elements grouped by type, and reference the corresponding sign element by its ordinal position in the sentence. [sent-64, score-0.495]
43 The additional attributes are not always available: morphological variations are annotated only when they differ from an assumed standard form of the sign, while the syntactic structure was annotated for only 89 sentences. [sent-65, score-0.07]
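Since the actual AEWLIS schema is not reproduced in the text, the fragment and parser below are a hypothetical sketch: only the uppercase Italian lemmas, the presence of a part-of-speech attribute (cf. tipoAG in Figure 2), and the ordinal-position references for attribute groups come from the description above. All element and attribute names, and the assumption of 1-based ordinals, are ours.

```python
# Hypothetical AEWLIS-like fragment and parser sketch. Element/attribute
# names and the 1-based ordinal convention are assumptions, not the real
# schema (the real files use names like tipoAG for part of speech).
import xml.etree.ElementTree as ET

AEWLIS_SAMPLE = """
<sentence>
  <sign lemma="SARDEGNA" pos="noun"/>
  <sign lemma="NUVOLA" pos="noun"/>
  <attributes type="location">
    <attr sign="2" value="high"/>
  </attributes>
</sentence>
"""

def load_signs(xml_text: str) -> list[dict]:
    root = ET.fromstring(xml_text)
    # Signs appear as elements in sentence order.
    signs = [{"lemma": s.get("lemma"), "pos": s.get("pos"), "attrs": {}}
             for s in root.findall("sign")]
    # Attribute elements are grouped by type and reference the sign they
    # modify by its ordinal position (assumed 1-based here).
    for group in root.findall("attributes"):
        for attr in group.findall("attr"):
            idx = int(attr.get("sign")) - 1
            signs[idx]["attrs"][group.get("type")] = attr.get("value")
    return signs
```

Resolving the ordinal references up front, as `load_signs` does, also makes the sparsity of the annotations explicit: a sign with an empty `attrs` dict simply has no morphological variation annotated.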
(Example from Figure 1, interleaved extraction repaired.) Italian: “Anche sulla Sardegna qualche annuvolamento pomeridiano, possibilità di qualche breve scroscio di pioggia, ma poi tendenza a schiarite.” Word-by-word: “Also on Sardinia a few cloud covers afternoon[adj], chance of a few brief downpour of rain, but then trend to sunny spells.” [sent-67, score-0.581]
45 “Also on Sardinia skies will become overcast in the afternoon, chance of a few brief downpours of rain, but then a trend towards a mix of sun and clouds”. [sent-69, score-0.095]
wordName wordTfidf (topN-words)
[('lis', 0.536), ('italian', 0.463), ('sign', 0.352), ('connectives', 0.225), ('aewlis', 0.198), ('atlas', 0.129), ('sardinia', 0.119), ('weather', 0.111), ('spoken', 0.105), ('afternoon', 0.097), ('cloud', 0.08), ('almohimeed', 0.079), ('downpour', 0.079), ('huenerfauth', 0.079), ('nuvola', 0.079), ('qualche', 0.079), ('sardegna', 0.079), ('signs', 0.073), ('signed', 0.069), ('poi', 0.065), ('lombardo', 0.065), ('signing', 0.065), ('uic', 0.065), ('rain', 0.061), ('positioning', 0.061), ('adverbs', 0.058), ('forecasts', 0.058), ('translation', 0.057), ('connective', 0.055), ('di', 0.052), ('elements', 0.045), ('illinois', 0.044), ('attributes', 0.043), ('bilingual', 0.041), ('concerns', 0.041), ('region', 0.041), ('format', 0.04), ('location', 0.039), ('chicago', 0.037), ('prepositional', 0.037), ('hampered', 0.035), ('arose', 0.035), ('clouds', 0.035), ('crop', 0.035), ('pursued', 0.035), ('rai', 0.035), ('transmission', 0.035), ('morrissey', 0.035), ('inating', 0.035), ('ferrari', 0.035), ('affairs', 0.035), ('dreuw', 0.035), ('signers', 0.035), ('files', 0.035), ('brief', 0.034), ('prepositions', 0.034), ('file', 0.034), ('area', 0.034), ('draw', 0.033), ('languages', 0.033), ('eugenio', 0.032), ('referencing', 0.032), ('purview', 0.032), ('sunny', 0.032), ('forecasting', 0.032), ('dei', 0.032), ('ameliorate', 0.032), ('lingua', 0.032), ('belonged', 0.032), ('ahmad', 0.032), ('bsl', 0.032), ('trend', 0.031), ('forecast', 0.03), ('fragmented', 0.03), ('uppercase', 0.03), ('gesture', 0.03), ('relying', 0.03), ('chance', 0.03), ('stored', 0.029), ('lemma', 0.029), ('undertaken', 0.029), ('uncovered', 0.029), ('asl', 0.029), ('facial', 0.029), ('rendering', 0.029), ('lu', 0.028), ('projects', 0.028), ('literally', 0.028), ('translating', 0.027), ('always', 0.027), ('converted', 0.027), ('adaptable', 0.027), ('compounded', 0.027), ('custom', 0.027), ('sad', 0.027), ('written', 0.027), ('reference', 0.026), ('rendered', 0.026), ('accompanying', 0.026), ('videos', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 360 acl-2013-Translating Italian connectives into Italian Sign Language
Author: Camillo Lugaresi ; Barbara Di Eugenio
Abstract: We present a corpus analysis of how Italian connectives are translated into LIS, the Italian Sign Language. Since corpus resources are scarce, we propose an alignment method between the syntactic trees of the Italian sentence and of its LIS translation. This method, and clustering applied to its outputs, highlight the different ways a connective can be rendered in LIS: with a corresponding sign, by affecting the location or shape of other signs, or being omitted altogether. We translate these findings into a computational model that will be integrated into the pipeline of an existing Italian-LIS rendering system. Initial experiments to learn the four possible translations with Decision Trees give promising results.
2 0.18010506 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic
Author: Arturo Curiel ; Christophe Collet
Abstract: This paper explores the use of Propositional Dynamic Logic (PDL) as a suitable formal framework for describing Sign Language (SL), the language of deaf people, in the context of natural language processing. SLs are visual, complete, standalone languages which are just as expressive as oral languages. Signs in SL usually correspond to sequences of highly specific body postures interleaved with movements, which make reference to real world objects, characters or situations. Here we propose a formal representation of SL signs, that will help us with the analysis of automatically-collected hand tracking data from French Sign Language (FSL) video corpora. We further show how such a representation could help us with the design of computer aided SL verification tools, which in turn would bring us closer to the development of an automatic recognition system for these languages.
3 0.10739921 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization
Author: Vivi Nastase ; Carlo Strapparava
Abstract: We propose the hypothesis that word etymology is useful for NLP applications as a bridge between languages. We support this hypothesis with experiments in cross-language (English-Italian) document categorization. In a straightforward bag-of-words experimental set-up we add etymological ancestors of the words in the documents, and investigate the performance of a model built on English data, on Italian test data (and vice versa). The results show a not only statistically significant but large improvement: a jump of almost 40 points in F1-score over the raw (vanilla bag-of-words) representation.
4 0.09734641 92 acl-2013-Context-Dependent Multilingual Lexical Lookup for Under-Resourced Languages
Author: Lian Tze Lim ; Lay-Ki Soon ; Tek Yong Lim ; Enya Kong Tang ; Bali Ranaivo-Malancon
Abstract: Current approaches for word sense disambiguation and translation selection typically require lexical resources or large bilingual corpora with rich information fields and annotations, which are often infeasible for under-resourced languages. We extract translation context knowledge from a bilingual comparable corpus of a richer-resourced language pair, and inject it into a multilingual lexicon. The multilingual lexicon can then be used to perform context-dependent lexical lookup on texts of any language, including under-resourced ones. Evaluations on a prototype lookup tool, trained on an English–Malay bilingual Wikipedia corpus, show a precision score of 0.65 (baseline 0.55) and mean reciprocal rank score of 0.81 (baseline 0.771). Based on the early encouraging results, the context-dependent lexical lookup tool may be developed further into an intelligent reading aid, to help users grasp the gist of a second or foreign language text.
5 0.05213353 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina
Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results. 1 Unsupervised part-of-speech tagging Currently, part-of-speech (POS) taggers are available for many highly spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widelyspoken languages including Bengali, Javanese, and Lahnda have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. However, with the growing quantity of text available online, and in particular, multilingual parallel texts from sources such as multilingual websites, government documents and large archives ofhuman translations ofbooks, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers. In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (201 1). In this approach, — — pecina@ ufal .mff .cuni . c z a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manuallylabelled data in the target language). 
Our approach is substantially simpler than that of Das and Petrov, the current state-of-the art, yet performs comparably well. 2 Related work There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008). A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001 ; Das and Petrov, 2011). Das and Petrov (201 1) achieved the current state-of-the-art for unsupervised tagging by exploiting high confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices. 
Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes 634 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 634–639, ModelCoverageAccuracy Many-to-1 alignments88%68% 1-to-1 alignments 68% 78% 1-to-1 alignments: Top 60k sents 91% 80% Table 1: Token coverage and accuracy of manyto-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only. while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler to that of Das and Petrov in that it does not require convex optimization for label propagation or a feature based HMM, yet it achieves comparable results. 3 Tagset Our tagger exploits the idea ofprojecting tag information from a resource-rich to resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (de- terminers and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, languagespecific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese. 4 A Simpler Unsupervised POS Tagger Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise. 
We describe the seed model and then explain how it is successively refined through self-training and revision. 4.1 Seed Model The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data. Algorithm 1 Build seed model Algorithm 1Build seed model 1:Tag source side. 2: Word align the corpus with Giza++ and remove the many-to-one mappings. 3: Project tags from source to target using the remaining 1-to-1 alignments. 4: Select the top n sentences based on sentence alignment score. 5: Estimate emission and transition probabilities. 6: Build seed tagger T. We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN. We build a French tagger based on English– French data from the Europarl Corpus (Koehn, 2005). We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that highconfidence tags are important, and hence eliminate the many-to-one alignments. 
In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff × to 60k to balance the accuracy and size of the seed model.1 Returning to our preliminary English– French experiments in Table 1, this process gives improvements in both accuracy and coverage.2 1We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model. 2We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments. 635 The number of parameters for the emission probability is |V | |T| where V is the vocabulary and aTb iilsi ttyh eis tag |s e×t. TTh| ew htrearnesi Vtio ins probability, on atnhed other hand, has only |T|3 parameters for the trigram hmaondde,l we use. TB|ecause of this difference in number of parameters, in step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has less parameters, we again focus on “better” sentences, by estimating this probability from only those sen- tences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability. 
For unaligned words, we simply assign a random POS and very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TNT tagger (Brants, 2000), an implementation of a trigram HMM tagger. 4.2 Self training and revision For self training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores. We believe that sentence alignment score might correspond to difficulty to tag. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide sentences into blocks of 60k. In step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection. Suppose source Algorithm 2 Self training and revision 1:Divide target language sentences into blocks of n sentences. 2: Tag the first block with the seed tagger. 3: Revise the tagged block. 4: Train a new tagger on the tagged block. 5: Add the previous tagger’s lexicon to the new tagger. 6: Use the new tagger to tag the next block. 7: Goto 3 and repeat until all blocks are tagged. 
language word wis is aligned with target language word wjt with probability p(wjt |wsi), Tis is the tag for wis using the tagger availa|bwle for the source language, and Tjt is the tag for wjt using the tagger learned for the > S, where S is a threshold which we heuristically set to 0.7, we replace Tjt by Tis. Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of, i.e., if p(wjt |wsi) < S and Tjt Tis then Tjt = Null. So, on th|ew target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model. Step 4 estimates the emission and transition target language. If p(wtj|wis) = probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage. 5 Experimental Results Using parallel data from Europarl (Koehn, 2005) we apply our method to build taggers for the same eight target languages as Das and Petrov (201 1) Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset which we were unable to obtain). The evaluation metric and test data are the same as that used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003). 
— — 636 DanishDutchGermanGreekItalianPortugueseSpanishSwedishAverage Seed model83.781.183.677.878.684.981.478.981.3 Self training + revision 85.6 84.0 85.4 80.4 81.4 86.3 83.3 81.0 83.4 Das and Petrov (2011) 83.2 79.5 82.8 82.5 86.8 87.9 84.2 80.5 83.4 Table 2: Token-level POS tagging accuracy for our seed model, self training and revision, and the method of Das and Petrov (201 1). The best results on each language, and on average, are shown in bold. 1 1 Iteration 2 2 3 1 1 2 2 3 Iteration Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens for Italian (left) and Dutch (right). Table 2 shows results for our seed model, self training and revision, and the results reported by Das and Petrov. Self training and revision improve the accuracy for every language over the seed model, and gives an average improvement of roughly two percentage points. The average accuracy of self training and revision is on par with that reported by Das and Petrov. On individual languages, self training and revision and the method of Das and Petrov are split each performs better on half of the cases. Interestingly, our method achieves higher accuracies on Germanic languages the family of our source language, English while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which might be more accurate for more-related languages, whereas Das and Petrov additionally rely on label propagation. Compared to Das and Petrov, our model performs poorest on Italian, in terms of percentage point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and proportion of known tokens for each iteration of our model for Italian; iteration 0 is the seed model, and iteration 3 1 is the final model. 
Our model performs poorly on unknown words as indicated by the low accuracy on unknown words, and high accuracy on known — — — words compared to the overall accuracy. The poor performance on unknown words is expected because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did. We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then subsequently improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow. 6 Conclusion We have proposed a method for unsupervised POS tagging that performs on par with the current state- of-the-art (Das and Petrov, 2011), but is substantially less-sophisticated (specifically not requiring convex optimization or a feature-based HMM). The complexity of our algorithm is O(nlogn) compared to O(n2) for that of Das and Petrov 637 (201 1) where n is the size of training data.3 We made our code are available for download.4 In future work we intend to consider using a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more-closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language. 
Using our final model with unsupervised HMM methods might improve the final performance too, i.e., using our final model as the initial state for an HMM and then experimenting with different inference algorithms such as Expectation Maximization (EM), Variational Bayes (VB) or Gibbs sampling (GS).[5] Gao and Johnson (2008) compare EM, VB and GS for unsupervised English POS tagging. In many cases, GS outperformed the other methods, thus we would like to try GS first for our model.

7 Acknowledgements

This work is funded by the Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us the Greek Treebank and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions.

[3] We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight-core Intel Xeon 3.16 GHz CPU with 32 GB RAM, but only 15 minutes for our model.
[4] https://code.google.com/p/universal-tagger/
[5] We in fact have tried EM, but it did not help. The overall performance dropped slightly. This might be because self-training with revision already found the local maximum.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP '00), pages 224–231. Seattle, Washington, USA.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA.

Pascal Denis and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort.
In Proceedings of the 23rd Pacific-Asia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 549–554. Genoa, Italy.

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP '04), pages 222–229. Barcelona, Spain.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), pages 152–159. New York, USA.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2089–2096. Istanbul, Turkey.

Siva Reddy and Serge Sharoff. 2011. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources.
In Proceedings of the IJCNLP 2011 Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041–1050. Honolulu, Hawaii.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03), pages 173–180. Edmonton, Canada.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01), pages 1–8. Pittsburgh, Pennsylvania, USA.
Translating Italian connectives into Italian Sign Language
Author: Camillo Lugaresi ; Barbara Di Eugenio
Abstract: We present a corpus analysis of how Italian connectives are translated into LIS, the Italian Sign Language. Since corpus resources are scarce, we propose an alignment method between the syntactic trees of the Italian sentence and of its LIS translation. This method, and clustering applied to its outputs, highlight the different ways a connective can be rendered in LIS: with a corresponding sign, by affecting the location or shape of other signs, or being omitted altogether. We translate these findings into a computational model that will be integrated into the pipeline of an existing Italian-LIS rendering system. Initial experiments to learn the four possible translations with Decision Trees give promising results.