acl acl2013 acl2013-39 knowledge-graph by maker-knowledge-mining

39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors


Source: pdf

Author: Volkan Cirik

Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. [sent-3, score-0.855]

2 Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. [sent-5, score-0.198]

3 We claim that part-of-speech ambiguity can be solved using substitute vectors. [sent-6, score-0.727]

4 A substitute vector is constructed with possible substitutes of a target word. [sent-7, score-0.723]

5 This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. [sent-8, score-0.222]

6 Token based methods (Berg-Kirkpatrick and Klein, 2010; Goldwater and Griffiths, 2007) categorize word occurrences into syntactic groups. [sent-17, score-0.393]

7 Type based methods (Clark, 2003; Blunsom and Cohn, 2011), on the other hand, categorize word types, and thus face the ambiguity problem, unlike the token based methods. [sent-18, score-0.447]

8 Type based methods suffer from POS ambiguity because one POS tag is assigned to each word type. [sent-19, score-0.21]

9 However, occurrences of many words may have different POS tags. [sent-20, score-0.209]

10 They illustrate a situation where two occurrences of “offers” have different POS tags. [sent-22, score-0.209]

11 (1) “Two rival bidders for Connaught BioSciences extended their offers to acquire the Toronto-based vaccine manufacturer Friday. [sent-24, score-0.154]

12 ” (2) “The company currently offers a word-processing package for personal computers called Legend. [sent-25, score-0.154]

13 We aim to improve (Yatbaz et al., 2012) by solving the ambiguity problem it suffers from because it has a type based approach. [sent-27, score-0.263]

14 The clustering based studies (Schütze, 1995; Mintz, 2003) represent the context of a word with a vector using neighbour words. [sent-28, score-0.183]

15 They claim that the substitutes of a word have similar syntactic categories and they are determined by the context of the word. [sent-31, score-0.37]

16 In addition, we suggest that the occurrences of a word with different part-of-speech categories should be seen in different contexts. [sent-32, score-0.376]

17 In other words, if we categorize the contexts of a word type we can determine different POS tags of the word. [sent-33, score-0.413]

18 We represent the context of a word by constructing substitute vectors using possible substitutes of the word, as in (Yatbaz et al., 2012). [sent-34, score-0.946]

19 Table 1 illustrates the substitute vector of the occurrence of “offers” in (1). [sent-36, score-0.508]

20 To resolve ambiguity [sent-39, score-0.198]

21 of a target word, we separate occurrences of the word into different groups depending on the context information represented by substitute vectors. [sent-42, score-0.904]

22 In the first experiment, for each word type we investigated, we separate all occurrences into two categories using substitute vectors. [sent-44, score-0.719]

23 In the second one, we guess the number of categories into which each word type should be separated. [sent-45, score-0.269]

24 The level of ambiguity can be measured with the perplexity of a word’s gold tag distribution. [sent-48, score-0.63]

25 For instance, the gold tag perplexity of the word “offers” in the Penn Treebank Wall Street Journal corpus we worked on equals 1. [sent-49, score-0.599]

26 Accordingly, the number of different gold tags of “offers” is 2. [sent-51, score-0.219]

27 Although the number of different tags for “board” is equal to 2, only a small fraction of the tags of “board” differ from each other. [sent-54, score-0.365]
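
A minimal sketch of how this perplexity could be computed from a word's gold tag counts; the counts below are illustrative only, not the actual WSJ figures.

```python
import math
from collections import Counter

def gold_tag_perplexity(tag_counts):
    """Perplexity of a word's gold tag distribution: 2**H(p), where p
    is the empirical distribution over the word's treebank tags."""
    total = sum(tag_counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in tag_counts.values())
    return 2 ** entropy

# Illustrative counts only: a word split evenly between two tags scores
# close to 2, while a word dominated by one tag scores close to 1.
print(gold_tag_perplexity(Counter(VBZ=60, NNS=40)))  # close to 2
print(gold_tag_perplexity(Counter(NN=990, VB=10)))   # close to 1
```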

28 In this paper we present a method to solve POS ambiguity for a type based POS induction approach. [sent-56, score-0.341]

29 2 Algorithm We claim that if we categorize the contexts a word type occurs in, we can address ambiguity by separating its occurrences before POS induction. [sent-59, score-0.805]

30 In order to do that, we represent the contexts of word occurrences with substitute vectors. [sent-60, score-0.81]

31 A substitute vector is formed by the whole vocabulary of words and their corresponding probabilities of occurring in the position of the target word. [sent-61, score-0.699]
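
A hedged sketch of this construction: the substitute probabilities would come from a language model scored over the context, as in Yatbaz et al. (2012); `lm_prob` below is a hypothetical stand-in for such a model, not part of the paper.

```python
import numpy as np

def substitute_vector(tokens, position, vocabulary, lm_prob):
    """Substitute vector for the token at `position`: one entry per
    vocabulary word, holding the probability of that word occurring in
    this position given the surrounding context. `lm_prob(left, word,
    right)` is an assumed language-model scoring function."""
    left, right = tokens[:position], tokens[position + 1:]
    scores = np.array([lm_prob(left, w, right) for w in vocabulary])
    return scores / scores.sum()  # normalize into a distribution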

32 We generate substitute vectors for all tokens in our dataset. [sent-65, score-0.673]

33 We want to cluster occurrences of our target words using them. [sent-66, score-0.334]

34 In each substitute vector, there is a row for every word in the vocabulary. [sent-67, score-0.547]

35 As a result, the dimension of substitute vectors is equal to 49,206. [sent-68, score-0.654]

36 Thus, in order not to suffer from the curse of dimensionality, we reduce the dimensions of the substitute vectors. [sent-69, score-0.548]

37 Before reducing the dimensions of these vectors, distance matrices are created using the Jensen distance metric for each word type in step (a) of Figure 1. [sent-70, score-0.245]

38 We should note that these matrices are created with substitute vectors of each word type, not with all of the substitute vectors. [sent-71, score-1.195]
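
A sketch of step (a), reading the paper's “Jensen distance” as the Jensen-Shannon distance (an assumption); SciPy's `jensenshannon` returns the square root of the JS divergence, which is a proper metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform, jensenshannon

def type_distance_matrix(sub_vectors):
    """Pairwise distances among the substitute vectors of one word
    type (one row per occurrence of that type)."""
    X = np.asarray(sub_vectors)
    return squareform(pdist(X, metric=jensenshannon))
```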

39 The output vectors of the ISOMAP algorithm are in 64 dimensions. [sent-74, score-0.141]

40 We repeated our experiments for different numbers of dimensions, and the best results were achieved with 64-dimensional vectors. [sent-75, score-0.183]
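
A sketch of step (b), under the assumption that the ISOMAP run can start from the precomputed per-type distance matrix; scikit-learn's Isomap supports this, though the neighborhood size below is a guess, not a reported setting.

```python
from sklearn.manifold import Isomap

def reduce_to_64d(dist_matrix, n_neighbors=10):
    """Embed one word type's occurrences into 64 dimensions with
    ISOMAP, starting from the precomputed distance matrix."""
    iso = Isomap(n_neighbors=n_neighbors, n_components=64,
                 metric="precomputed")
    return iso.fit_transform(dist_matrix)
```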

41 In step (c) of Figure 1, after creating vectors in a lower dimension, the 64-dimensional vectors are clustered for each word type using a modified k-means algorithm (Arthur and Vassilvitskii, 2007). [sent-76, score-0.389]

42 The number of clusters given as an input to k-means varies with experiments. [sent-77, score-0.148]

43 We induce the number of POS tags of a word type at this step. [sent-78, score-0.246]
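
For step (c), scikit-learn's default `init="k-means++"` implements the seeding of Arthur and Vassilvitskii (2007), so a sketch can be as short as the following; whether this matches the authors' modified k-means exactly is an assumption.

```python
from sklearn.cluster import KMeans

def cluster_type(vectors_64d, n_clusters):
    """Cluster one word type's 64-dimensional vectors. n_clusters is
    2 in Experiment 1, the gap-statistic guess in Experiment 2, and
    the gold tag count in Experiment 3."""
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10)
    return km.fit_predict(vectors_64d)
```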

44 (Yatbaz et al., 2012) demonstrates that clustering substitute vectors of all word types alone has limited success in predicting the part-of-speech tag of a word. [sent-80, score-0.936]

45 To make use of both word identity and context information of a given type, we use S-CODE co-occurrence modeling (Maron et al., 2010). [sent-81, score-0.141]

46 Given a pair of categorical variables, the S-CODE model represents each of their values on a unit sphere such that frequently co-occurring values are located close to each other. [sent-84, score-0.279]
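
As a toy illustration of this idea (not the authors' actual S-CODE optimizer): embed both variables' values on the unit sphere and nudge co-occurring pairs together, with a small random push apart standing in for the model's normalization term.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def toy_sphere_embedding(pairs, dim=25, epochs=5, lr=0.1, seed=0):
    """Toy pull/push SGD in the spirit of S-CODE: values that co-occur
    often end up close on the unit sphere. Simplified sketch only."""
    rng = np.random.default_rng(seed)
    xs = sorted({x for x, _ in pairs})
    ys = sorted({y for _, y in pairs})
    X = {x: _unit(rng.normal(size=dim)) for x in xs}
    Y = {y: _unit(rng.normal(size=dim)) for y in ys}
    for _ in range(epochs):
        for x, y in pairs:
            X[x] = _unit(X[x] + lr * (Y[y] - X[x]))  # pull pair together
            Y[y] = _unit(Y[y] + lr * (X[x] - Y[y]))
            neg = Y[ys[rng.integers(len(ys))]]       # random negative
            X[x] = _unit(X[x] - 0.1 * lr * neg)      # small push apart
    return X, Y
```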

47 In step (d) of Figure 1, the first part of the pair is the word identity concatenated with the cluster id obtained in the previous step. [sent-86, score-0.238]

48 The cluster ids separate word occurrences seen in different context groups. [sent-87, score-0.498]

49 By doing that, we make sure that the occurrences of the same word can be separated on the unit sphere if they are seen in different context groups. [sent-88, score-0.644]

50 The second part of the pair is a substitute word. [sent-89, score-0.476]

51 For an instance of a target word, we sample a substitute word according to the target word’s substitute vector probabilities. [sent-90, score-1.183]
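
A sketch of how these pairs might be assembled for S-CODE's input; the field names on each occurrence record are illustrative, not from the paper.

```python
import numpy as np

def scode_input_pairs(occurrences, vocabulary, seed=0):
    """Build (word-id|cluster-id, sampled substitute) pairs. Each
    occurrence is assumed to carry its word, the cluster id from step
    (c), and its substitute vector over `vocabulary`."""
    rng = np.random.default_rng(seed)
    pairs = []
    for occ in occurrences:
        left = f"{occ['word']}|{occ['cluster_id']}"
        substitute = rng.choice(vocabulary, p=occ['substitute_vector'])
        pairs.append((left, substitute))
    return pairs
```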

52 If occurrences of two different or the same word types have the same substitutes, they should be seen in similar contexts. [sent-91, score-0.382]

53 As a result, words occurring in similar contexts will be close to each other on the unit sphere. [sent-92, score-0.196]

54 In step (e) of Figure 1, on the output of the S-CODE sphere, the words occurring in similar contexts and having the same word identity are located close together. [sent-95, score-0.176]

55 For instance, verb occurrences of “offers” are close to each other on the unit sphere. [sent-97, score-0.265]

56 Furthermore, they are separated from the occurrences of “offers” which are nouns. [sent-99, score-0.259]

57 Lastly, in step (f) of Figure 1, we run the k-means clustering method on the S-CODE sphere and split word-substitute word pairs into 45 clusters because the treebank we worked on uses 45 part-of-speech tags. [sent-100, score-0.551]

58 The output of clustering induces part-of-speech categories of word tokens. [sent-101, score-0.143]
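
The tables later in the page score induced clusters against gold tags; many-to-one accuracy is the standard POS-induction metric, so a sketch of that scorer is given below, with the caveat that the paper's exact evaluation may differ.

```python
from collections import Counter, defaultdict

def many_to_one_accuracy(induced, gold):
    """Map each induced cluster to its most frequent gold tag and
    count the matches over all tokens."""
    by_cluster = defaultdict(Counter)
    for c, g in zip(induced, gold):
        by_cluster[c][g] += 1
    correct = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return correct / len(gold)
```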

59 This subset is chosen because word types occurring more than 4000 times all have low gold tag perplexity. [sent-110, score-0.47]

60 We exclude word types occurring fewer than 100 times, because the clustering algorithm running on 64-dimensional vectors does not work accurately. [sent-112, score-0.441]

61 In that experiment, POS induction is done by using word identities and context information represented by substitute words. [sent-120, score-0.661]

62 As a result, this method inaccurately induces POS tags for the occurrences of word types with high gold tag perplexity. [sent-122, score-0.841]

63 2 Upperbound In this experiment, for each word occurrence, we concatenate the gold tag for the first part of the pairs in the co-occurrence input file. [sent-125, score-0.319]

64 The purpose of this experiment is to set an upperbound for all experiments since we cannot cluster the word tokens any better than the gold tags. [sent-127, score-0.504]

65 3 Experiment 1 In the algorithm section, we mention that after the dimensionality reduction step, we cluster the vectors to separate tokens of a target word seen in similar contexts. [sent-131, score-0.542]

66 In this experiment, we set the number of clusters for each type to 2. [sent-132, score-0.181]

67 In other words, we assume that the number of different POS tags of each word type is equal to 2. [sent-133, score-0.283]

68 Nevertheless, separating all the words into 2 clusters results in some inaccuracy in POS induction. [sent-134, score-0.158]

69 That is because not all words have POS ambiguity and some have more than 2 different POS tags. However, the main purpose of this experiment is to observe whether we can increase the POS induction accuracy for ambiguous types with our approach. [sent-135, score-0.708]

70 4 Experiment 2 In the previous experiment, we set the number of clusters for each word type to 2. [sent-139, score-0.252]

71 However, the number of different POS tags differs for each word type. [sent-140, score-0.181]

72 More importantly, around 41% of our target tokens belong to unambiguous word types. [sent-141, score-0.253]

73 Also, around 36% of our target tokens come from word types whose gold perplexity is below 1. [sent-142, score-0.549]

74 In this experiment, instead of splitting all types, we guess which types should be split. [sent-145, score-0.156]

75 Also, we guess the number of clusters for each type. [sent-146, score-0.207]

76 The Gap statistic is a statistical method to guess the number of clusters formed by a given set of data points. [sent-149, score-0.334]

77 We expect that substitute vectors occurring in similar contexts should be located close together in 64-dimensional space. [sent-150, score-0.777]

78 Thus, the gap statistic can provide us with the number of groups formed by the vectors in 64-dimensional space. [sent-151, score-0.33]

79 That number is possibly equal to the number of different POS tags of the word types. [sent-152, score-0.218]
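
A compact sketch of the gap statistic (Tibshirani et al.), comparing within-cluster dispersion on the data against uniform reference draws over its bounding box; the settings below (k range, reference count) are guesses, not the paper's.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic_k(X, k_max=5, n_refs=10, seed=0):
    """Estimate the number of clusters in X (occurrences x 64 dims)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_wk(data, k):
        # log of within-cluster dispersion (k-means inertia as proxy)
        return np.log(KMeans(n_clusters=k, n_init=10).fit(data).inertia_)

    gaps, sks = [], []
    for k in range(1, k_max + 1):
        refs = [log_wk(rng.uniform(lo, hi, size=X.shape), k)
                for _ in range(n_refs)]
        gaps.append(np.mean(refs) - log_wk(X, k))
        sks.append(np.std(refs) * np.sqrt(1 + 1 / n_refs))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - sks[k]:
            return k  # smallest k whose gap beats the next one's
    return k_max
```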

80 5 Experiment 3 In this experiment, we set the number of clusters for each type to the gold number of tags of each type. [sent-156, score-0.4]

81 The purpose of this experiment is to observe how the accuracy of the number of tags given, which is used at step (c), affects the system. [sent-157, score-0.358]

82 We present our results in 3 separate tables because the accuracy of these methods varies with the ambiguity level of word types. [sent-162, score-0.39]

83 In Table 3, results for the word types whose gold tag perplexity is lower than 1. [sent-166, score-0.568]

84 Lastly, in Table 4, we present the results for word types whose gold tag perplexity is greater than 1. [sent-170, score-0.568]

85 Table 3: Results for Target Words with gold tag perplexity ≤ 1. [sent-180, score-0.432]

86 Table 4: Results for Target Words with gold tag perplexity ≥ 1. [sent-185, score-0.432]

87 That is because our experiments inaccurately induce more than one tag for unambiguous types. [sent-190, score-0.299]

88 Additionally, most of our target words have low gold tag perplexity. [sent-191, score-0.312]

89 That is because, when ambiguity increases, the baseline method inaccurately assigns one POS tag to word types. [sent-194, score-0.506]

90 On the other hand, the gap statistic method is not fully efficient in guessing the number of clusters. [sent-195, score-0.202]

91 It sometimes separates unambiguous types or it does not separate highly ambiguous word types. [sent-196, score-0.291]

92 Additionally, the results of our experiments show that accurately guessing the number of clusters plays a crucial role in this approach. [sent-198, score-0.17]

93 Even using the gold number of different tags in Experiment 3 does not result in a significantly accurate system. [sent-199, score-0.219]

94 That is because the number of different tags does not reflect the perplexity of a word type. [sent-200, score-0.365]

95 The results show that POS ambiguity can be addressed by using substitute vectors for word types with high ambiguity. [sent-201, score-0.951]

96 The accuracy of this approach correlates with the level of ambiguity of word types. [sent-202, score-0.308]

97 Thus, the detection of the level of ambiguity for word types should be the future direction of this research. [sent-203, score-0.334]

98 We again propose that substitute vector distributions could be useful to extract perplexity information for a word type. [sent-204, score-0.763]

99 Two decades of unsupervised POS induction: how far have we come? [sent-226, score-0.23]

100 Estimating the number of data clusters via the gap statistic. [sent-284, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('substitute', 0.476), ('yatbaz', 0.332), ('occurrences', 0.209), ('ambiguity', 0.198), ('pos', 0.19), ('sphere', 0.185), ('perplexity', 0.184), ('offers', 0.154), ('substitutes', 0.151), ('vectors', 0.141), ('tag', 0.139), ('experiment', 0.133), ('clusters', 0.116), ('categorize', 0.113), ('lrinm', 0.111), ('tags', 0.11), ('gold', 0.109), ('board', 0.108), ('inaccurately', 0.098), ('guess', 0.091), ('occurring', 0.086), ('statistic', 0.086), ('induction', 0.078), ('isomap', 0.074), ('koc', 0.074), ('maron', 0.074), ('onescor', 0.074), ('scode', 0.074), ('upperbound', 0.074), ('word', 0.071), ('types', 0.065), ('type', 0.065), ('dimensionality', 0.064), ('target', 0.064), ('wall', 0.062), ('unambiguous', 0.062), ('gap', 0.062), ('cluster', 0.061), ('street', 0.061), ('tpheer', 0.06), ('christodoulopoulos', 0.06), ('worked', 0.059), ('categories', 0.059), ('unit', 0.056), ('tokens', 0.056), ('goldwater', 0.055), ('guessing', 0.054), ('contexts', 0.054), ('claim', 0.053), ('lastly', 0.053), ('tenenbaum', 0.052), ('arthur', 0.05), ('separated', 0.05), ('separate', 0.048), ('ambiguous', 0.045), ('clustering', 0.044), ('tibshirani', 0.044), ('dimensions', 0.042), ('xa', 0.042), ('graff', 0.042), ('separating', 0.042), ('formed', 0.041), ('treebank', 0.04), ('induces', 0.04), ('observe', 0.04), ('unsupervised', 0.04), ('blunsom', 0.04), ('accuracy', 0.039), ('located', 0.038), ('equal', 0.037), ('equals', 0.037), ('seen', 0.037), ('step', 0.036), ('context', 0.036), ('ids', 0.036), ('sharon', 0.035), ('identity', 0.034), ('exclude', 0.034), ('penn', 0.033), ('elie', 0.033), ('internaconference', 0.033), ('lanmodeling', 0.033), ('occurence', 0.033), ('rik', 0.033), ('sert', 0.033), ('walther', 0.033), ('zemel', 0.033), ('varies', 0.032), ('ps', 0.032), ('vector', 0.032), ('matrices', 0.031), ('versity', 0.03), ('bergkirkpatrick', 0.03), ('christos', 0.03), ('curse', 0.03), ('roni', 0.03), ('iwould', 0.028), ('skipped', 0.028), ('murat', 0.028), ('phylogenetic', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

Author: Volkan Cirik

Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.

2 0.15564486 62 acl-2013-Automatic Term Ambiguity Detection

Author: Tyler Baldwin ; Yunyao Li ; Bogdan Alexe ; Ioana R. Stanoi

Abstract: While the resolution of term ambiguity is important for information extraction (IE) systems, the cost of resolving each instance of an entity can be prohibitively expensive on large datasets. To combat this, this work looks at ambiguity detection at the term, rather than the instance, level. By making a judgment about the general ambiguity of a term, a system is able to handle ambiguous and unambiguous cases differently, improving throughput and quality. To address the term ambiguity detection problem, we employ a model that combines data from language models, ontologies, and topic modeling. Results over a dataset of entities from four product domains show that the proposed approach achieves significantly above baseline F-measure of 0.96.

3 0.1057208 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.

4 0.097293898 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina

Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results.

5 0.093794204 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

Author: Rico Sennrich ; Holger Schwenk ; Walid Aransa

Abstract: While domain adaptation techniques for SMT have proven to be effective at improving translation quality, their practicality for a multi-domain environment is often limited because of the computational and human costs of developing and maintaining multiple systems adapted to different domains. We present an architecture that delays the computation of translation model features until decoding, allowing for the application of mixture-modeling techniques at decoding time. We also describe a method for unsupervised adaptation with development and test data from multiple domains. Experimental results on two language pairs demonstrate the effectiveness of both our translation model architecture and automatic clustering, with gains of up to 1 BLEU over unadapted systems and single-domain adaptation.

6 0.08929009 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

7 0.088707261 97 acl-2013-Cross-lingual Projections between Languages from Different Families

8 0.088325851 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing

9 0.077375568 80 acl-2013-Chinese Parsing Exploiting Characters

10 0.074232854 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

11 0.073591515 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis

12 0.071201175 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

13 0.070458017 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

14 0.064833529 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

15 0.064245239 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

16 0.064195238 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

17 0.063237727 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering

18 0.061749168 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering

19 0.061044633 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

20 0.061032597 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.172), (1, -0.022), (2, -0.041), (3, -0.027), (4, 0.006), (5, -0.077), (6, -0.043), (7, 0.038), (8, -0.021), (9, -0.033), (10, 0.027), (11, -0.07), (12, 0.04), (13, -0.018), (14, -0.073), (15, -0.017), (16, -0.021), (17, -0.027), (18, -0.007), (19, -0.041), (20, -0.029), (21, -0.013), (22, 0.067), (23, -0.015), (24, 0.037), (25, -0.043), (26, 0.071), (27, -0.072), (28, 0.037), (29, 0.004), (30, 0.021), (31, 0.01), (32, -0.02), (33, -0.055), (34, -0.005), (35, 0.034), (36, 0.077), (37, 0.018), (38, -0.113), (39, 0.005), (40, -0.005), (41, -0.006), (42, -0.014), (43, -0.076), (44, 0.001), (45, 0.071), (46, 0.024), (47, 0.049), (48, 0.092), (49, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94545335 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

Author: Volkan Cirik

Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.

2 0.71665955 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

Author: Young-Bum Kim ; Benjamin Snyder

Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.

3 0.68078828 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation

Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.

4 0.66210669 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina

Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results. 1 Unsupervised part-of-speech tagging Currently, part-of-speech (POS) taggers are available for many highly spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widelyspoken languages including Bengali, Javanese, and Lahnda have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. However, with the growing quantity of text available online, and in particular, multilingual parallel texts from sources such as multilingual websites, government documents and large archives ofhuman translations ofbooks, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers. In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (201 1). In this approach, — — pecina@ ufal .mff .cuni . c z a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manuallylabelled data in the target language). Our approach is substantially simpler than that of Das and Petrov, the current state-of-the art, yet performs comparably well. 2 Related work There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008). A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001 ; Das and Petrov, 2011). Das and Petrov (201 1) achieved the current state-of-the-art for unsupervised tagging by exploiting high confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices. 
Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes 634 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 634–639, ModelCoverageAccuracy Many-to-1 alignments88%68% 1-to-1 alignments 68% 78% 1-to-1 alignments: Top 60k sents 91% 80% Table 1: Token coverage and accuracy of manyto-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only. while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler to that of Das and Petrov in that it does not require convex optimization for label propagation or a feature based HMM, yet it achieves comparable results. 3 Tagset Our tagger exploits the idea ofprojecting tag information from a resource-rich to resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (de- terminers and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, languagespecific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese. 4 A Simpler Unsupervised POS Tagger Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise. We describe the seed model and then explain how it is successively refined through self-training and revision. 4.1 Seed Model The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data. Algorithm 1 Build seed model Algorithm 1Build seed model 1:Tag source side. 2: Word align the corpus with Giza++ and remove the many-to-one mappings. 3: Project tags from source to target using the remaining 1-to-1 alignments. 4: Select the top n sentences based on sentence alignment score. 5: Estimate emission and transition probabilities. 6: Build seed tagger T. We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN. We build a French tagger based on English– French data from the Europarl Corpus (Koehn, 2005). 
We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that highconfidence tags are important, and hence eliminate the many-to-one alignments. In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff × to 60k to balance the accuracy and size of the seed model.1 Returning to our preliminary English– French experiments in Table 1, this process gives improvements in both accuracy and coverage.2 1We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model. 2We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments. 635 The number of parameters for the emission probability is |V | |T| where V is the vocabulary and aTb iilsi ttyh eis tag |s e×t. TTh| ew htrearnesi Vtio ins probability, on atnhed other hand, has only |T|3 parameters for the trigram hmaondde,l we use. TB|ecause of this difference in number of parameters, in step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has less parameters, we again focus on “better” sentences, by estimating this probability from only those sen- tences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability. For unaligned words, we simply assign a random POS and very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TNT tagger (Brants, 2000), an implementation of a trigram HMM tagger. 4.2 Self training and revision For self training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores. We believe that sentence alignment score might correspond to difficulty to tag. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide sentences into blocks of 60k. 
Algorithm 2 Self-training and revision
1: Divide target language sentences into blocks of n sentences.
2: Tag the first block with the seed tagger.
3: Revise the tagged block.
4: Train a new tagger on the tagged block.
5: Add the previous tagger's lexicon to the new tagger.
6: Use the new tagger to tag the next block.
7: Go to 3 and repeat until all blocks are tagged.

In Step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection. Suppose source language word w_i^s is aligned with target language word w_j^t with probability p(w_j^t | w_i^s), T_i^s is the tag for w_i^s using the tagger available for the source language, and T_j^t is the tag for w_j^t using the tagger learned for the target language. If p(w_j^t | w_i^s) > S, where S is a threshold which we heuristically set to 0.7, we replace T_j^t by T_i^s. Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of: if p(w_j^t | w_i^s) < S and T_j^t ≠ T_i^s, then T_j^t = Null. So, on the target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model. Step 4 estimates the emission and transition probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage.

5 Experimental Results

Using parallel data from Europarl (Koehn, 2005), we apply our method to build taggers for the same eight target languages as Das and Petrov (2011) (Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish), with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset, which we were unable to obtain). The evaluation metric and test data are the same as those used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003).

Table 2: Token-level POS tagging accuracy for our seed model, self-training and revision, and the method of Das and Petrov (2011). The best results on each language, and on average, are shown in bold.

Model                     Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
Seed model                83.7    81.1   83.6    77.8   78.6     84.9        81.4     78.9     81.3
Self-training + revision  85.6    84.0   85.4    80.4   81.4     86.3        83.3     81.0     83.4
Das and Petrov (2011)     83.2    79.5   82.8    82.5   86.8     87.9        84.2     80.5     83.4

[Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens, over iterations, for Italian (left) and Dutch (right).]

Table 2 shows results for our seed model, self-training and revision, and the results reported by Das and Petrov. Self-training and revision improve the accuracy for every language over the seed model, and give an average improvement of roughly two percentage points. The average accuracy of self-training and revision is on par with that reported by Das and Petrov. On individual languages, self-training and revision and the method of Das and Petrov are split: each performs better on half of the cases.
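Returning briefly to the revision step: the rule in Algorithm 2 (Step 3) is compact enough to state directly in code. The sketch below assumes each target token carries its model-assigned tag, its directly-projected tag (None if unaligned), and the alignment probability p(w_j^t | w_i^s); the attribute names are illustrative, not from the paper's codebase.

```python
def revise(block, S=0.7):
    """Algorithm 2, Step 3 (illustrative sketch). For each token:
    - trust the projected tag when the alignment probability exceeds S;
    - drop the tag entirely when the alignment is weak and the model
      and the projection disagree (guards against self-training drift)."""
    for tok in block:
        if tok.projected_tag is None:
            continue                           # unaligned: keep the model's tag
        if tok.align_prob > S:
            tok.tag = tok.projected_tag        # high-confidence projection wins
        elif tok.tag != tok.projected_tag:
            tok.tag = None                     # uncertain and conflicting: Null
    return block
```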
Interestingly, our method achieves higher accuracies on Germanic languages (the family of our source language, English), while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which may be more accurate for more closely related languages, whereas Das and Petrov additionally rely on label propagation. Compared to Das and Petrov, our model performs poorest on Italian, in terms of the percentage point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and the proportion of known tokens for each iteration of our model on Italian; iteration 0 is the seed model, and iteration 31 is the final model. Our model performs poorly on unknown words, as indicated by the low accuracy on unknown words and the high accuracy on known words compared to the overall accuracy. The poor performance on unknown words is expected, because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did.

We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow.

6 Conclusion

We have proposed a method for unsupervised POS tagging that performs on par with the current state of the art (Das and Petrov, 2011), but is substantially less sophisticated (specifically, it requires neither convex optimization nor a feature-based HMM). The complexity of our algorithm is O(n log n), compared to O(n^2) for that of Das and Petrov (2011), where n is the size of the training data [3]. We have made our code available for download [4]. In future work we intend to use a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language. Using our final model with unsupervised HMM methods might also improve the final performance, i.e., use our final model as the initial state for an HMM, then experiment with different inference algorithms such as Expectation Maximization (EM), Variational Bayes (VB), or Gibbs sampling (GS) [5]. Gao and Johnson (2008) compare EM, VB and GS for unsupervised English POS tagging; in many cases GS outperformed the other methods, so we would like to try GS first for our model.

[3] We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight-core Intel Xeon 3.16GHz CPU with 32 GB RAM, but only 15 minutes for our model.
[4] https://code.google.com/p/universal-tagger/
[5] We have in fact tried EM, but it did not help; the overall performance dropped slightly. This might be because self-training with revision had already found the local maximum.

7 Acknowledgements

This work is funded by the Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us with the Greek Treebank, and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions.

References

Thorsten Brants. 2000.
TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP '00), pages 224–231. Seattle, Washington, USA.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA.

Pascal Denis and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort. In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 549–554. Genoa, Italy.

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP '04), pages 222–229. Barcelona, Spain.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '06), pages 152–159. New York, USA.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2089–2096. Istanbul, Turkey.

Siva Reddy and Serge Sharoff. 2011. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources. In Proceedings of the IJCNLP 2011 Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041–1050. Honolulu, Hawaii.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network.
In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03), pages 173–180. Edmonton, Canada.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01), pages 1–8. Pittsburgh, Pennsylvania, USA.

5 0.61698109 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling

Author: Heike Adel ; Ngoc Thang Vu ; Tanja Schultz

Abstract: In this paper, we investigate the application of recurrent neural network language models (RNNLM) and factored language models (FLM) to the task of language modeling for Code-Switching speech. We present a way to integrate part-of-speech tags (POS) and language information (LID) into these models which leads to significant improvements in terms of perplexity. Furthermore, a comparison between RNNLMs and FLMs and a detailed analysis of perplexities on the different backoff levels are performed. Finally, we show that recurrent neural networks and factored language models can be combined using linear interpolation to achieve the best performance. The final combined language model provides 37.8% relative improvement in terms of perplexity on the SEAME development set and a relative improvement of 32.7% on the evaluation set compared to the traditional n-gram language model. Index Terms: multilingual speech processing, code switching, language modeling, recurrent neural networks, factored language models

6 0.61623591 247 acl-2013-Modeling of term-distance and term-occurrence information for improving n-gram language model performance

7 0.60859782 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection

8 0.60788327 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

9 0.60212851 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

10 0.59630293 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

11 0.59325719 390 acl-2013-Word surprisal predicts N400 amplitude during reading

12 0.56803238 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors

13 0.56637543 97 acl-2013-Cross-lingual Projections between Languages from Different Families

14 0.55702901 62 acl-2013-Automatic Term Ambiguity Detection

15 0.55199802 325 acl-2013-Smoothed marginal distribution constraints for language modeling

16 0.54920882 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts

17 0.54639536 227 acl-2013-Learning to lemmatise Polish noun phrases

18 0.54384315 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

19 0.53724283 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach

20 0.53672945 371 acl-2013-Unsupervised joke generation from big data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (11, 0.046), (14, 0.022), (24, 0.014), (26, 0.057), (35, 0.072), (42, 0.055), (48, 0.511), (70, 0.019), (88, 0.029), (95, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9588967 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

same-paper 2 0.95659322 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors


3 0.93443161 54 acl-2013-Are School-of-thought Words Characterizable?

Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge

Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.

4 0.87609339 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

Author: Tiziano Flati ; Roberto Navigli

Abstract: We present SPred, a novel method for the creation of large repositories of semantic predicates. We start from existing collocations to form lexical predicates (e.g., break ∗) and learn the semantic classes that best fit the ∗ argument. To do this, we extract all the occurrences in Wikipedia which match the predicate and abstract its arguments to general semantic classes (e.g., break BODY PART, break AGREEMENT, etc.). Our experiments show that we are able to create a large collection of semantic predicates from the Oxford Advanced Learner's Dictionary with high precision and recall, and perform well against the most similar approach.

5 0.87143838 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics

Author: Angeliki Lazaridou ; Marco Marelli ; Roberto Zamparelli ; Marco Baroni

Abstract: Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex words from their parts. Semantic representations constructed in this way beat a strong baseline and can be of higher quality than representations directly constructed from corpus data. Our results constitute a novel evaluation of the proposed composition methods, in which the full additive model achieves the best performance, and demonstrate the usefulness of a compositional morphology component in distributional semantics.

6 0.85553986 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

7 0.63174134 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words

8 0.62497914 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit

9 0.61181092 237 acl-2013-Margin-based Decomposed Amortized Inference

10 0.6087501 294 acl-2013-Re-embedding words

11 0.60236114 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

12 0.58851558 62 acl-2013-Automatic Term Ambiguity Detection

13 0.58365148 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

14 0.58239841 275 acl-2013-Parsing with Compositional Vector Grammars

15 0.5730173 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

16 0.56287348 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics

17 0.56204742 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

18 0.5604673 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning

19 0.55370784 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

20 0.55186051 175 acl-2013-Grounded Language Learning from Video Described with Sentences