acl acl2013 acl2013-276 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). [sent-13, score-0.393]
2 In particular, we extend the monolingual infinite tree model (Finkel et al. [sent-14, score-0.421]
3 , 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). [sent-15, score-0.657]
4 Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. [sent-16, score-0.354]
5 Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model. [sent-17, score-0.194]
6 In the face of the above situations, this paper proposes an unsupervised method for inducing POS tags for SMT, and aims to improve the performance of syntax-based SMT by utilizing the induced POS tagset. [sent-37, score-0.174]
7 The proposed method is based on the infinite tree model proposed by Finkel et al. [sent-38, score-0.393]
8 (2007), which is a nonparametric Bayesian method for inducing POS tags from syntactic dependency structures. [sent-39, score-0.318]
9 In this model, hidden states represent POS tags, the observations they generate represent the words themselves, and tree structures represent syntactic dependencies between pairs of POS tags. [sent-40, score-0.322]
10 In the joint model, each hidden state jointly emits both a source word and its aligned target word as an observation. [sent-43, score-0.448]
11 The independent model separately emits words in two languages from hidden states. [sent-44, score-0.227]
12 © 2013 Association for Computational Linguistics, pages 841–851. Inducing tags based on bilingual observations, both models can induce POS tags by incorporating information from the other language. [sent-47, score-0.313]
13 , the infinite tree model), the “利用” in Example 1 and 2 would both be assigned the same POS tag since they share the same observation. [sent-51, score-0.417]
14 Inference is efficiently carried out by beam sampling (Gael et al. [sent-53, score-0.183]
15 Experiments are carried out on the NTCIR-9 Japaneseto-English task using a binarized forest-to-string SMT system with dependency trees as its source side. [sent-55, score-0.17]
16 Our bilingually-induced tagset significantly outperforms the original tagset and the monolingually-induced tagset. [sent-56, score-0.278]
17 Further, our independent model achieves a more than 1 point gain in BLEU, which resolves the sparseness problem introduced by the bi-word observations. [sent-57, score-0.15]
18 This limitation has been overcome by automatically adjusting the number of possible POS tags using nonparametric Bayesian methods (Finkel et al. [sent-60, score-0.17]
19 (2007) proposed the infinite tree model, which represents recursive branching structures over infinite hidden states and induces POS tags from syntactic dependency structures. [sent-70, score-1.007]
20 In the following, we overview the infinite tree model, which is the basis of our proposed model. [sent-71, score-0.364]
21 A node t has a hidden state zt (the POS tag) and an observation xt (the word). [sent-78, score-1.05]
22 defined: p(T) = ∏_{t∈T} p(x_t | z_t). Let each hidden state variable have C possible values indexed by k. [sent-80, score-0.187]
23 For each state k, there is a parameter ϕk which parameterizes the observation distribution for that state: xt |zt ∼ F(ϕzt ). [sent-81, score-0.371]
24 Transitions between states are governed by Markov dynamics parameterized by π, where π_ij = p(z_c(t) = j | z_t = i), and π_k are the transition probabilities from the parent's state k. [sent-84, score-0.24]
25 Each row of transition probabilities has a Dirichlet prior, π_k | ρ ∼ Dirichlet(ρ), and the hidden state of each child z_t′ is distributed [sent-89, score-0.187]
26 according to a multinomial distribution π_{z_t} specific to the parent's state z_t: z_t′ | z_t ∼ Multinomial(π_{z_t}). [sent-93, score-0.171]
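The generative story just described can be sketched in a few lines of Python. Everything concrete here (the number of states, the transition rows `pi`, the toy emission supports `emit`, and the fixed branching factor) is an invented assumption for illustration, not the paper's actual parameters.

```python
import random

random.seed(0)

C = 2  # number of hidden states (toy value)
pi = {0: [0.7, 0.3], 1: [0.4, 0.6]}               # transition probs per parent state
emit = {0: ["pay", "use"], 1: ["fees", "usage"]}  # toy emission supports

def sample(probs):
    """Draw an index from a discrete distribution (stand-in for Multinomial)."""
    r, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

def generate(state, depth):
    """Generate a dependency subtree as (word, state, children)."""
    word = random.choice(emit[state])  # stand-in for x_t ~ F(phi_{z_t})
    children = []
    if depth > 0:
        for _ in range(2):  # fixed branching factor for the sketch
            child_state = sample(pi[state])  # z_t' ~ Multinomial(pi_{z_t})
            children.append(generate(child_state, depth - 1))
    return (word, state, children)

tree = generate(state=0, depth=2)
print(tree)
```

Each node draws its children's states from the multinomial row of its own state, and its word from the state's emission distribution, mirroring the two sampling statements above.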
27 2 Infinite Tree Model In the infinite tree model, the number of possible hidden states is potentially infinite. [sent-95, score-0.508]
28 The infinite model is formed by extending the finite tree model using a hierarchical Dirichlet process (HDP) (Teh et al. [sent-96, score-0.47]
29 (2007) originally proposed three types of models: the independent children model, the simultaneous children model, and the Markov children model. [sent-99, score-0.184]
30 two indexed distributions: π_k, a distribution over the transition probabilities from the parent's state k, and ϕ_{k′}, an observation distribution for the state k′. [sent-107, score-0.363]
31 Then, the infinite tree model is formally defined as follows: β | γ ∼ GEM(γ), π_k | α_0, β ∼ DP(α_0, β), ϕ_k ∼ H, z_t′ | z_t ∼ Multinomial(π_{z_t}), x_t | z_t ∼ F(ϕ_{z_t}). [sent-108, score-0.574]
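The β | γ ∼ GEM(γ) draw in the definition above can be approximated by truncated stick-breaking: repeatedly break off a Beta(1, γ) fraction of the remaining stick. The truncation level `K` is an assumption of this sketch; the true GEM draw is infinite-dimensional.

```python
import random

random.seed(1)

def gem(gamma, K):
    """Truncated stick-breaking approximation of beta ~ GEM(gamma)."""
    betas, remaining = [], 1.0
    for _ in range(K - 1):
        v = random.betavariate(1.0, gamma)  # stick-breaking proportion
        betas.append(v * remaining)
        remaining *= (1.0 - v)
    betas.append(remaining)  # lump the infinite tail into the last atom
    return betas

beta = gem(gamma=1.0, K=10)
print(sum(beta))  # the weights always sum to 1
```

Each π_k | α_0, β ∼ DP(α_0, β) then re-weights these shared atoms, which is what ties the transition rows of all states together in the HDP.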
32 Figure 3 shows the graphical representation of the infinite tree model. [sent-109, score-0.364]
33 [Figure 4: An Example of the Joint Model] The difference between Figure 2 and Figure 3 is whether the number of copies of the state is finite or not. [sent-118, score-0.151]
34 3 Bilingual Infinite Tree Model We propose a bilingual variant of the infinite tree model, the bilingual infinite tree model, which utilizes information from the other language. [sent-119, score-0.846]
35 Specifically, the proposed model introduces bilingual observations by embedding the aligned target words in the source-side dependency trees. [sent-120, score-0.419]
36 1 Joint Model The joint model is a simple application of the infinite tree model under a bilingual scenario. [sent-123, score-0.525]
37 The only difference from the infinite tree model is the instances of observations (xt). [sent-126, score-0.478]
38 Observations in the joint model are the combination of source words and their aligned target words, while observations in the monolingual infinite tree model represent only source words. [sent-127, score-0.724]
39 Therefore, a single target word may be emitted multiple times if the target word is aligned with multiple source words. [sent-129, score-0.236]
40 Figure 4 shows the process of generating Example 2 in Figure 1 through the joint model, where aligned words are jointly emitted as observations. [sent-131, score-0.221]
41 Hence, this model can assign different POS tags to the two different instances of the word “利用”, based on the different observation distributions in inference. [sent-135, score-0.214]
42 2 Independent Model The joint model is prone to a data sparseness problem, since each observation is a combination of a source word and its aligned target word. [sent-137, score-0.326]
43 Thus, we propose an independent model, where each hidden state generates a source word and its aligned target word separately. [sent-138, score-0.403]
44 For the aligned target side, we introduce an observation variable xt′ for each zt and a parameter ϕ′k for each state k, which parameterizes a distinct distribution over the observations xt′ for that state. [sent-139, score-1.012]
45 When multiple target words are aligned to a single source word, each aligned word is generated separately from observation distribution parameterized by ϕ′k. [sent-142, score-0.369]
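The sparseness contrast between the two models can be illustrated by counting observation types on a toy alignment (the word pairs below are invented for illustration): the joint model keys its emission counts on whole (source, target) pairs, while the independent model counts each side separately.

```python
from collections import Counter

# Invented source-target word pairs, echoing the running example.
aligned = [("利用", "use"), ("利用", "usage"), ("料金", "fees"),
           ("払う", "pay"), ("利用", "use")]

joint_types = Counter(aligned)              # joint model: one event per (src, tgt) pair
src_types = Counter(s for s, _ in aligned)  # independent model: source-side events
tgt_types = Counter(t for _, t in aligned)  # independent model: target-side events

print(len(joint_types), len(src_types), len(tgt_types))  # → 4 3 4
```

With even this tiny sample, the joint model must estimate a distribution over more event types than either side alone, which is the sparseness problem the independent model avoids.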
46 3 Introduction of Other Factors We assumed the surface form of aligned target words as additional observations in previous sections. [sent-148, score-0.23]
47 In the independent model, we introduce observation variables (e. [sent-156, score-0.165]
48 Specifically, x_t′ and ϕ_k′ are introduced for the surface form of aligned words, and x_t′′ and ϕ_k′′ for the POS of aligned words. [sent-161, score-0.202]
49 The POS tag of “利用” generates the string “利用+use+verb” as the observation in the joint model, while it generates “利用”, “use”, and “verb” independently in the independent model. [sent-163, score-0.226]
50 4 POS Refinement We have assumed a completely unsupervised way of inducing POS tags in dependency trees. [sent-165, score-0.275]
51 , 2007) so that each refined sub-POS tag may reflect the information from the aligned words while preserving the handcrafted distinction from original POS tagset. [sent-168, score-0.154]
52 A major difference is that we introduce separate transition probabilities π_k^s and observation distributions ϕ_k^s for each existing POS tag s. [sent-169, score-0.152]
53 Then, each node t is constrained to follow the distributions indicated by the initially assigned POS tag st, and we use the pair (st, zt) as a state representation. [sent-170, score-0.188]
54 5 Inference In inference, we find the state set that maximizes the posterior probability of state transitions given observations (i. [sent-172, score-0.37]
55 (2007) presented a sampling algorithm for the infinite tree model, which is based on the Gibbs sampling in the direct assignment representation for iHMM (Teh et al. [sent-177, score-0.634]
56 In the Gibbs sampling, individual hidden state variables are resampled conditioned on all other variables. [sent-179, score-0.223]
57 We present an inference procedure based on beam sampling (Gael et al. [sent-182, score-0.183]
58 Beam sampling limits the number of possible state transitions for each node to a finite number using slice sampling (Neal, 2003), and then efficiently samples whole hidden state transitions using dynamic programming. [sent-184, score-0.773]
59 Beam sampling does not suffer from the slow convergence of Gibbs sampling because it samples the whole set of state variables at once. [sent-185, score-0.544]
60 (2008) showed that beam sampling is more robust to initialization and hyperparameter choice than Gibbs sampling. [sent-187, score-0.183]
61 Specifically, we introduce an auxiliary variable ut for each node in a dependency tree to limit the number of possible transitions. [sent-188, score-0.342]
62 Our procedure alternates between sampling each of the following variables: the auxiliary variables u, the state assignments z, the transition probabilities π, the shared DP parameters β, and the hyperparameters α0 and γ. [sent-189, score-0.393]
63 We can parallelize procedures in sampling u and z because the slice sampling for u and the dynamic programming for z are independent for each sentence. [sent-190, score-0.382]
64 The only difference between inferences in the joint model and the independent model is in computing the posterior probability of state transitions given observations (e. [sent-193, score-0.44]
65 Sampling u: Each u_t is sampled from the uniform distribution on [0, π_{z_{d(t)} z_t}], where d(t) is the parent of t: u_t ∼ Uniform(0, π_{z_{d(t)} z_t}). [sent-199, score-0.215]
66 Sampling z: Possible values k of z_t are divided into two sets using u_t: a finite set with π_{z_{d(t)} k} > u_t and an infinite set with π_{z_{d(t)} k} ≤ u_t. [sent-201, score-0.987]
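A minimal sketch of this truncation step, with an invented transition row: draw u_t uniformly below the probability of the current assignment, then keep only the states whose transition probability exceeds u_t. Only finitely many states can pass that test, which is what makes the dynamic programming over z tractable.

```python
import random

random.seed(2)

# Toy transition row pi[z_{d(t)}][k]; the current child-state index is invented.
pi_row = [0.5, 0.3, 0.15, 0.04, 0.01]
current_k = 1  # current value of z_t

# u_t ~ Uniform(0, pi[z_{d(t)}][z_t])
u_t = random.uniform(0.0, pi_row[current_k])

# The finite candidate set: states with pi[z_{d(t)}][k] > u_t.
candidates = [k for k, p in enumerate(pi_row) if p > u_t]
print(candidates)
```

Because u_t is drawn below the current transition probability, the current assignment always survives the cut, so the sampler never rules out the configuration it is currently in.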
67 The beam sampling considers only the former set. [sent-202, score-0.183]
68 Owing to the truncation of the latter set, we can compute the posterior probability of a state z_t given observations for all t (t = 1, . . . , T). [sent-203, score-0.813]
69 Under this assumption, the posterior probability of an observation is as follows: p(x_t | z_t = k) = (ṅ_{x_t k} + ρ) / (ṅ_{·k} + Nρ), where ṅ_{xk} is the number of observations x with state k, ṅ_{·k} is the number of hidden states whose values are k, and N is the total number of observations x. [sent-212, score-0.338]
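That count-based posterior can be written as a small function of sufficient statistics. The counts below are toy values, and `rho` plays the role of the symmetric Dirichlet hyperparameter in the formula; this is a sketch of the computation, not the paper's implementation.

```python
from collections import Counter

def emission_posterior(x, k, counts, rho, N):
    """p(x | z = k) = (n_{xk} + rho) / (n_{.k} + N * rho).

    counts maps (observation, state) -> n_{xk};
    N is the total number of observations.
    """
    n_xk = counts[(x, k)]
    n_dot_k = sum(c for (xx, kk), c in counts.items() if kk == k)
    return (n_xk + rho) / (n_dot_k + N * rho)

# Toy sufficient statistics: 5 observations in total.
counts = Counter({("use", 0): 3, ("usage", 0): 1, ("use", 1): 1})
N = sum(counts.values())
p = emission_posterior("use", 0, counts, rho=0.1, N=N)
print(round(p, 4))  # (3 + 0.1) / (4 + 5 * 0.1) → 0.6889
```

The hyperparameter ρ smooths the estimate so that unseen (observation, state) pairs still receive nonzero probability during sampling.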
70 Sampling π: We introduce a count variable n_ij ∈ n, which is the number of observations with state j whose parent's state is i. [sent-215, score-0.238]
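Resampling a transition row from its posterior can be sketched with a finite-dimensional Dirichlet whose parameters combine the shared DP weights β with the counts n_ij; a Dirichlet draw is built from normalized Gamma draws. All numbers below are invented, and the finite dimension is an approximation assumption (the true posterior has an extra mass term for unrepresented states).

```python
import random

random.seed(3)

def dirichlet(alphas):
    """Draw from Dirichlet(alphas) via normalized Gamma samples."""
    gs = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gs)
    return [g / total for g in gs]

alpha0 = 1.0
beta = [0.5, 0.3, 0.2]  # shared DP weights over represented states (toy)
n_i = [4, 1, 0]         # n_{ij}: counts of child state j under parent state i

# pi_i | n, beta ~ Dirichlet(alpha0*beta_1 + n_{i1}, ..., alpha0*beta_K + n_{iK})
pi_i = dirichlet([alpha0 * b + n for b, n in zip(beta, n_i)])
print(pi_i)
```

States seen often as children of state i get proportionally larger transition probability, while β keeps unobserved transitions tied to the globally shared state weights.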
71 We introduce two types of auxiliary variables for each state (k = 1, . . . , K). [sent-234, score-0.179]
72 1 Experimental Setup We evaluated our bilingual infinite tree model for POS induction using an in-house developed syntax-based forest-to-string SMT system. [sent-255, score-0.501]
73 In the training process, the following steps are performed sequentially: preprocessing, inducing a POS tagset for a source language, training a POS tagger and a dependency parser, and training a forest-to-string MT model. [sent-256, score-0.287]
74 Preprocessing We used the first 10,000 Japanese-English sentence pairs in the NTCIR-9 training data for inducing a POS tagset for Japanese. [sent-258, score-0.186]
75 The Japanese POS tags come from the second-level POS tags in the IPA POS tagset (Asahara and Matsumoto, 2003) and the English POS tags are derived from the Penn Treebank. [sent-260, score-0.52]
76 Note that the Japanese POS tags are used for initialization of hidden states and the English POS tags are used as observations emitted by hidden states. [sent-261, score-0.614]
77 The Japanese sentences are parsed using CaboCha (Kudo and Matsumoto, 2002), which generates dependency structures using a phrasal unit called a bunsetsu, rather than a word unit as in English or Chinese dependency parsing. [sent-265, score-0.202]
78 POS Induction A POS tag for each word in the Japanese sentences is inferred by our bilingual infinite tree model. (Footnote 6: Due to the high computational cost, we did not use all the NTCIR-9 training data.) [sent-268, score-0.476]
79 Footnote 9: We could use other word-based dependency trees, such as those produced by the infinite PCFG model (Liang et al. [sent-276, score-0.471]
80 In sampling α0 and γ, hyperparameters αa, αb, γa, and γb are set to 2, 1, 1, and 1, respectively, which is the same setting as in Gael et al. [sent-284, score-0.173]
81 In the experiments, three types of factors for the aligned English words are compared: surface forms (‘s’), POS tags (‘P’), and the combination of both (‘s+P’). [sent-291, score-0.228]
82 In both frameworks, each hidden state zt is first initialized to the POS tags assigned by MeCab (the IPA POS tagset), and then each state is updated through the inference procedure described in Section 3. [sent-293, score-1.009]
83 Note that in REF, the sampling distribution over zt is constrained to include only states that are a refinement of the initially assigned POS tag. [sent-295, score-0.852]
84 Training a POS Tagger and a Dependency Parser In this step, we train a Japanese dependency parser from the 10,000 Japanese dependency trees with the induced POS tags which are derived from Step 2. [sent-297, score-0.451]
85 We employed a transition-based dependency parser which can jointly learn POS tagging and dependency parsing (Hatori et al. [sent-298, score-0.296]
86 Training a Forest-to-String MT In this step, we train a forest-to-string MT model based on the learned dependency parser in Step 3. [sent-302, score-0.166]
87 All the Japanese and English sentences in the NTCIR-9 training data are segmented in the same way as in Step 1, and then each Japanese sentence is parsed by the dependency parser learned in Step 3, which simultaneously assigns induced POS tags and word dependencies. [sent-304, score-0.315]
88 The results indicate that integrating the aligned target-side information in POS induction makes inferred tagsets more suitable for SMT. [sent-324, score-0.182]
89 This means that sparseness is a severe problem in POS induction when jointly encoding bilingual information into observations. [Table 2: The Number of POS Tags] [sent-326, score-0.187]
90 1 Comparison to the IPA POS Tagset Table 2 shows the number of the IPA POS tags used in the experiments and the POS tags induced by the proposed models. [sent-337, score-0.305]
91 This table shows that each induced tagset contains more POS tags than the IPA POS tagset. [sent-338, score-0.317]
92 These examples show that the proposed models can disambiguate POS tags that have different functions in English, whereas the IPA POS tagset treats them jointly. [sent-351, score-0.266]
93 2 Impact of Tagging and Dependency Accuracy The performance of our methods depends not only on the quality of the induced tag sets but also on the performance of the dependency parser learned in Step 3 of Section 4. [sent-354, score-0.241]
94 We cannot directly evaluate the tagging accuracy of the parser trained through Step 3 because we do not have any data with induced POS tags other than the 10,000-sentence data gained through Step 2. [sent-356, score-0.243]
95 Note that the dependency accuracies are measured on the automatically parsed dependency trees, not on the syntactically correct gold standard trees. [sent-360, score-0.234]
96 It seems performing parsing and tagging with the bilingually-induced POS tagset is too difficult when only monolingual information is available to the parser. [sent-363, score-0.196]
97 The tagging accuracies for Joint[P] both in IND and REF are significantly lower than the others, while the dependency accuracies do not differ significantly. [sent-365, score-0.194]
98 6 Conclusion We proposed a novel method for inducing POS tags for SMT. [sent-367, score-0.174]
99 , POS tags) based on observations representing not only source words themselves but also aligned target words. [sent-370, score-0.23]
100 Our experiments showed that a more favorable POS tagset can be induced by integrating aligned information, and furthermore, the POS tagset generated by the proposed method is more effective for SMT than an existing POS tagset (the IPA POS tagset). [sent-371, score-0.569]
wordName wordTfidf (topN-words)
[('zt', 0.592), ('pos', 0.294), ('infinite', 0.271), ('xt', 0.181), ('japanese', 0.161), ('ipa', 0.149), ('zd', 0.143), ('tagset', 0.139), ('sampling', 0.135), ('tags', 0.127), ('gael', 0.124), ('state', 0.103), ('aligned', 0.101), ('dependency', 0.101), ('tree', 0.093), ('kk', 0.09), ('finkel', 0.089), ('observations', 0.085), ('hidden', 0.084), ('poss', 0.079), ('smt', 0.079), ('ind', 0.078), ('dp', 0.077), ('ut', 0.076), ('ref', 0.072), ('independent', 0.071), ('states', 0.06), ('dirichlet', 0.059), ('bilingual', 0.059), ('observation', 0.058), ('hdp', 0.057), ('gem', 0.057), ('tag', 0.053), ('vk', 0.053), ('induced', 0.051), ('sparseness', 0.05), ('beal', 0.05), ('bunsetsu', 0.05), ('induction', 0.049), ('ihmm', 0.049), ('zc', 0.049), ('beam', 0.048), ('finite', 0.048), ('emitted', 0.047), ('inducing', 0.047), ('transitions', 0.046), ('gk', 0.045), ('wk', 0.044), ('target', 0.044), ('joint', 0.044), ('nonparametric', 0.043), ('alum', 0.043), ('sirts', 0.043), ('emits', 0.043), ('slice', 0.041), ('transition', 0.041), ('auxiliary', 0.04), ('translation', 0.04), ('mono', 0.04), ('bleu', 0.039), ('tt', 0.039), ('multinomial', 0.039), ('watanabe', 0.039), ('hyperparameters', 0.038), ('goto', 0.037), ('hatori', 0.037), ('variables', 0.036), ('teh', 0.036), ('hmm', 0.036), ('parameterized', 0.036), ('refinement', 0.036), ('parser', 0.036), ('blunsom', 0.035), ('trees', 0.035), ('parent', 0.034), ('binarized', 0.034), ('mi', 0.033), ('posterior', 0.033), ('hyperprior', 0.032), ('nij', 0.032), ('accuracies', 0.032), ('tagsets', 0.032), ('zoubin', 0.032), ('node', 0.032), ('taro', 0.031), ('bs', 0.03), ('cohn', 0.03), ('jointly', 0.029), ('liu', 0.029), ('distribution', 0.029), ('model', 0.029), ('tagging', 0.029), ('nakazawa', 0.029), ('isao', 0.029), ('jurgen', 0.029), ('children', 0.028), ('sumita', 0.028), ('liang', 0.028), ('monolingual', 0.028), ('usage', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
2 0.14982568 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
3 0.13706617 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
Author: Zhiguo Wang ; Chengqing Zong ; Nianwen Xue
Abstract: For the cascaded task of Chinese word segmentation, POS tagging and parsing, the pipeline approach suffers from error propagation while the joint learning approach suffers from inefficient decoding due to the large combined search space. In this paper, we present a novel lattice-based framework in which a Chinese sentence is first segmented into a word lattice, and then a lattice-based POS tagger and a lattice-based parser are used to process the lattice from two different viewpoints: sequential POS tagging and hierarchical tree building. A strategy is designed to exploit the complementary strengths of the tagger and parser, and encourage them to predict agreed structures. Experimental results on Chinese Treebank show that our lattice-based framework significantly improves the accuracy of the three sub-tasks. 1
4 0.13130337 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
Author: Trevor Cohn ; Gholamreza Haffari
Abstract: Modern phrase-based machine translation systems make extensive use of wordbased translation models for inducing alignments from parallel corpora. This is problematic, as the systems are incapable of accurately modelling many translation phenomena that do not decompose into word-for-word translation. This paper presents a novel method for inducing phrase-based translation units directly from parallel data, which we frame as learning an inverse transduction grammar (ITG) using a recursive Bayesian prior. Overall this leads to a model which learns translations of entire sentences, while also learning their decomposition into smaller units (phrase-pairs) recursively, terminating at word translations. Our experiments on Arabic, Urdu and Farsi to English demonstrate improvements over competitive baseline systems.
5 0.12639463 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina
Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results. 1 Unsupervised part-of-speech tagging Currently, part-of-speech (POS) taggers are available for many highly spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widelyspoken languages including Bengali, Javanese, and Lahnda have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. However, with the growing quantity of text available online, and in particular, multilingual parallel texts from sources such as multilingual websites, government documents and large archives ofhuman translations ofbooks, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers. In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (201 1). In this approach, — — pecina@ ufal .mff .cuni . c z a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manuallylabelled data in the target language). 
Our approach is substantially simpler than that of Das and Petrov, the current state-of-the art, yet performs comparably well. 2 Related work There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008). A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001 ; Das and Petrov, 2011). Das and Petrov (201 1) achieved the current state-of-the-art for unsupervised tagging by exploiting high confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices. 
Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes 634 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 634–639, ModelCoverageAccuracy Many-to-1 alignments88%68% 1-to-1 alignments 68% 78% 1-to-1 alignments: Top 60k sents 91% 80% Table 1: Token coverage and accuracy of manyto-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only. while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler to that of Das and Petrov in that it does not require convex optimization for label propagation or a feature based HMM, yet it achieves comparable results. 3 Tagset Our tagger exploits the idea ofprojecting tag information from a resource-rich to resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (de- terminers and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, languagespecific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese. 4 A Simpler Unsupervised POS Tagger Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise. 
We describe the seed model and then explain how it is successively refined through self-training and revision. 4.1 Seed Model The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data. Algorithm 1 Build seed model Algorithm 1Build seed model 1:Tag source side. 2: Word align the corpus with Giza++ and remove the many-to-one mappings. 3: Project tags from source to target using the remaining 1-to-1 alignments. 4: Select the top n sentences based on sentence alignment score. 5: Estimate emission and transition probabilities. 6: Build seed tagger T. We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN. We build a French tagger based on English– French data from the Europarl Corpus (Koehn, 2005). We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that highconfidence tags are important, and hence eliminate the many-to-one alignments. 
In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff × to 60k to balance the accuracy and size of the seed model.1 Returning to our preliminary English– French experiments in Table 1, this process gives improvements in both accuracy and coverage.2 1We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model. 2We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments. 635 The number of parameters for the emission probability is |V | |T| where V is the vocabulary and aTb iilsi ttyh eis tag |s e×t. TTh| ew htrearnesi Vtio ins probability, on atnhed other hand, has only |T|3 parameters for the trigram hmaondde,l we use. TB|ecause of this difference in number of parameters, in step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has less parameters, we again focus on “better” sentences, by estimating this probability from only those sen- tences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability. 
For unaligned words, we simply assign a random POS and a very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TnT tagger (Brants, 2000), an implementation of a trigram HMM tagger.

4.2 Self-training and Revision

For self-training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self-training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores. We believe that sentence alignment score may correspond to difficulty of tagging. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide the sentences into blocks of 60k. In Step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection.

Algorithm 2 Self-training and revision
1: Divide target language sentences into blocks of n sentences.
2: Tag the first block with the seed tagger.
3: Revise the tagged block.
4: Train a new tagger on the tagged block.
5: Add the previous tagger's lexicon to the new tagger.
6: Use the new tagger to tag the next block.
7: Goto 3 and repeat until all blocks are tagged.
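The block-wise procedure in Algorithm 2 might be sketched as follows. This is a minimal sketch under our own assumptions, not the authors' implementation: `train` stands in for re-estimating an HMM tagger from a tagged block (Steps 4–5), and `revise` implements the Step 3 comparison of tagger output against directly projected tags, using the confidence threshold S that the text describes next.

```python
def revise(proj_tag, model_tag, p_align, S=0.7):
    """Step 3: reconcile a projected tag with the tagger's tag.

    proj_tag: directly projected tag (None if the word is unaligned);
    model_tag: tag from the current tagger; p_align: alignment
    probability p(w_t | w_s). Returns the tag to keep, or None to
    leave the token untagged.
    """
    if proj_tag is None:        # unaligned word: keep the model's tag
        return model_tag
    if p_align > S:             # confident alignment: trust the projection
        return proj_tag
    if model_tag != proj_tag:   # uncertain and conflicting: drop the tag
        return None
    return model_tag


def self_train(sentences, seed_tagger, train, block_size=60000):
    """Block-wise self-training (Algorithm 2, simplified).

    sentences: [(words, proj_tags, align_probs), ...], sorted by
    alignment score; seed_tagger maps words to tags; train(tagged,
    lexicon) returns (new_tagger, new_lexicon).
    """
    tagger, lexicon = seed_tagger, {}
    for start in range(0, len(sentences), block_size):
        block = sentences[start:start + block_size]
        tagged = []
        for words, proj, probs in block:                 # Steps 2/6: tag
            model_tags = tagger(words)
            tags = [revise(pt, mt, pa)                   # Step 3: revise
                    for pt, mt, pa in zip(proj, model_tags, probs)]
            tagged.append(list(zip(words, tags)))
        tagger, lexicon = train(tagged, lexicon)         # Steps 4-5: retrain
    return tagger
```

Because later blocks are tagged by progressively retrained models, harder (lower alignment score) sentences are handled by a more mature tagger, as the text argues.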
Suppose source language word w_i^s is aligned with target language word w_j^t with probability p(w_j^t | w_i^s), T_i^s is the tag for w_i^s using the tagger available for the source language, and T_j^t is the tag for w_j^t using the tagger learned for the target language. If p(w_j^t | w_i^s) > S, where S is a threshold which we heuristically set to 0.7, we replace T_j^t with T_i^s. Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of, i.e., if p(w_j^t | w_i^s) < S and T_j^t ≠ T_i^s, then T_j^t = Null. So, on the target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model. Step 4 estimates the emission and transition probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage.

5 Experimental Results

Using parallel data from Europarl (Koehn, 2005), we apply our method to build taggers for the same eight target languages as Das and Petrov (2011): Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish, with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset, which we were unable to obtain). The evaluation metric and test data are the same as those used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003).
Table 2: Token-level POS tagging accuracy for our seed model, self-training and revision, and the method of Das and Petrov (2011). The best results on each language, and on average, are shown in bold.

                           Danish  Dutch  German  Greek  Italian  Portuguese  Spanish  Swedish  Average
Seed model                  83.7    81.1   83.6    77.8   78.6     84.9        81.4     78.9     81.3
Self-training + revision    85.6    84.0   85.4    80.4   81.4     86.3        83.3     81.0     83.4
Das and Petrov (2011)       83.2    79.5   82.8    82.5   86.8     87.9        84.2     80.5     83.4

Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens for Italian (left) and Dutch (right).

Table 2 shows results for our seed model, self-training and revision, and the results reported by Das and Petrov. Self-training and revision improve the accuracy for every language over the seed model, and give an average improvement of roughly two percentage points. The average accuracy of self-training and revision is on par with that reported by Das and Petrov. On individual languages, self-training and revision and the method of Das and Petrov are split: each performs better on half of the cases. Interestingly, our method achieves higher accuracies on Germanic languages (the family of our source language, English), while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which might be more accurate for more-related languages, whereas Das and Petrov additionally rely on label propagation. Compared to Das and Petrov, our model performs poorest on Italian, in terms of percentage-point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and proportion of known tokens for each iteration of our model for Italian; iteration 0 is the seed model, and iteration 31 is the final model.
Our model performs poorly on unknown words, as indicated by the low accuracy on unknown words and the high accuracy on known words compared to the overall accuracy. The poor performance on unknown words is expected because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did. We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then subsequently improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow.

6 Conclusion

We have proposed a method for unsupervised POS tagging that performs on par with the current state-of-the-art (Das and Petrov, 2011), but is substantially less sophisticated (specifically, not requiring convex optimization or a feature-based HMM). The complexity of our algorithm is O(n log n), compared to O(n^2) for that of Das and Petrov (2011), where n is the size of the training data.3 We have made our code available for download.4 In future work we intend to consider using a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language.
Using our final model with unsupervised HMM methods might also improve the final performance, i.e., use our final model as the initial state for an HMM, then experiment with different inference algorithms such as Expectation Maximization (EM), Variational Bayes (VB) or Gibbs sampling (GS).5 Gao and Johnson (2008) compare EM, VB and GS for unsupervised English POS tagging. In many cases, GS outperformed the other methods, thus we would like to try GS first for our model.

3We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight-core Intel Xeon 3.16GHz CPU with 32 GB RAM, but only 15 minutes for our model.
4https://code.google.com/p/universal-tagger/
5We in fact tried EM, but it did not help; the overall performance dropped slightly. This might be because self-training with revision had already found the local maximum.

7 Acknowledgements

This work is funded by the Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us the Greek Treebank and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP '00), pages 224–231. Seattle, Washington, USA.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA.

Pascal Denis and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort.
In Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 549–554. Genoa, Italy.

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP '04), pages 222–229. Barcelona, Spain.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), pages 152–159. New York, USA.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2089–2096. Istanbul, Turkey.

Siva Reddy and Serge Sharoff. 2011. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources.
In Proceedings of the IJCNLP 2011 Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041–1050. Honolulu, Hawaii.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03), pages 173–180. Edmonton, Canada.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01), pages 1–8. Pittsburgh, Pennsylvania, USA.
6 0.11101793 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
7 0.11037555 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
8 0.10941683 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
9 0.10890546 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner
10 0.10631019 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
11 0.1057208 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
12 0.10525912 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
13 0.098440647 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
14 0.09661366 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
15 0.095964998 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling
16 0.093404941 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
17 0.091666549 181 acl-2013-Hierarchical Phrase Table Combination for Machine Translation
18 0.090755269 361 acl-2013-Travatar: A Forest-to-String Machine Translation Engine based on Tree Transducers
19 0.088783897 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
20 0.088289998 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
topicId topicWeight
[(0, 0.226), (1, -0.153), (2, -0.06), (3, 0.071), (4, -0.015), (5, -0.033), (6, -0.001), (7, -0.01), (8, -0.021), (9, -0.06), (10, 0.034), (11, -0.014), (12, 0.048), (13, -0.022), (14, 0.001), (15, -0.029), (16, -0.03), (17, 0.035), (18, 0.027), (19, -0.04), (20, 0.014), (21, -0.021), (22, 0.08), (23, 0.025), (24, 0.049), (25, 0.048), (26, -0.03), (27, -0.046), (28, 0.049), (29, -0.015), (30, 0.011), (31, 0.011), (32, -0.028), (33, -0.051), (34, -0.092), (35, 0.044), (36, 0.032), (37, 0.052), (38, -0.095), (39, -0.066), (40, -0.048), (41, -0.053), (42, 0.026), (43, -0.038), (44, 0.053), (45, 0.058), (46, -0.095), (47, 0.098), (48, 0.125), (49, 0.067)]
simIndex simValue paperId paperTitle
same-paper 1 0.94270837 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-ofSpeech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest- to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
2 0.71359146 34 acl-2013-Accurate Word Segmentation using Transliteration and Language Model Projection
Author: Masato Hagiwara ; Satoshi Sekine
Abstract: Transliterated compound nouns not separated by whitespaces pose difficulty on word segmentation (WS) . Offline approaches have been proposed to split them using word statistics, but they rely on static lexicon, limiting their use. We propose an online approach, integrating source LM, and/or, back-transliteration and English LM. The experiments on Japanese and Chinese WS have shown that the proposed models achieve significant improvement over state-of-the-art, reducing 16% errors in Japanese.
3 0.69266838 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
Author: Volkan Cirik
Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that partof-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-ofspeech induction. Experiments show that our methodology works for words with high ambiguity.
4 0.68882191 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections
Author: Long Duong ; Paul Cook ; Steven Bird ; Pavel Pecina
Abstract: We present an unsupervised approach to part-of-speech tagging based on projections of tags in a word-aligned bilingual parallel corpus. In contrast to the existing state-of-the-art approach of Das and Petrov, we have developed a substantially simpler method by automatically identifying “good” training sentences from the parallel corpus and applying self-training. In experimental results on eight languages, our method achieves state-of-the-art results. 1 Unsupervised part-of-speech tagging Currently, part-of-speech (POS) taggers are available for many highly spoken and well-resourced languages such as English, French, German, Italian, and Arabic. For example, Petrov et al. (2012) build supervised POS taggers for 22 languages using the TNT tagger (Brants, 2000), with an average accuracy of 95.2%. However, many widelyspoken languages including Bengali, Javanese, and Lahnda have little data manually labelled for POS, limiting supervised approaches to POS tagging for these languages. However, with the growing quantity of text available online, and in particular, multilingual parallel texts from sources such as multilingual websites, government documents and large archives ofhuman translations ofbooks, news, and so forth, unannotated parallel data is becoming more widely available. This parallel data can be exploited to bridge languages, and in particular, transfer information from a highly-resourced language to a lesser-resourced language, to build unsupervised POS taggers. In this paper, we propose an unsupervised approach to POS tagging in a similar vein to the work of Das and Petrov (201 1). In this approach, — — pecina@ ufal .mff .cuni . c z a parallel corpus for a more-resourced language having a POS tagger, and a lesser-resourced language, is word-aligned. These alignments are exploited to infer an unsupervised tagger for the target language (i.e., a tagger not requiring manuallylabelled data in the target language). 
Our approach is substantially simpler than that of Das and Petrov, the current state-of-the art, yet performs comparably well. 2 Related work There is a wealth of prior research on building unsupervised POS taggers. Some approaches have exploited similarities between typologically similar languages (e.g., Czech and Russian, or Telugu and Kannada) to estimate the transition probabilities for an HMM tagger for one language based on a corpus for another language (e.g., Hana et al., 2004; Feldman et al., 2006; Reddy and Sharoff, 2011). Other approaches have simultaneously tagged two languages based on alignments in a parallel corpus (e.g., Snyder et al., 2008). A number of studies have used tag projection to copy tag information from a resource-rich to a resource-poor language, based on word alignments in a parallel corpus. After alignment, the resource-rich language is tagged, and tags are projected from the source language to the target language based on the alignment (e.g., Yarowsky and Ngai, 2001 ; Das and Petrov, 2011). Das and Petrov (201 1) achieved the current state-of-the-art for unsupervised tagging by exploiting high confidence alignments to copy tags from the source language to the target language. Graph-based label propagation was used to automatically produce more labelled training data. First, a graph was constructed in which each vertex corresponds to a unique trigram, and edge weights represent the syntactic similarity between vertices. 
Labels were then propagated by optimizing a convex function to favor the same tags for closely related nodes 634 Proce dingSsof oifa, th Beu 5l1gsarti Aan,An u aglu Mste 4e-ti9n2g 0 o1f3 t.he ?c A2s0s1o3ci Aatsiosonc fioartio Cno fmorpu Ctoamtiopnuatalt Lioin gauli Lsitnicgsu,i psatgices 634–639, ModelCoverageAccuracy Many-to-1 alignments88%68% 1-to-1 alignments 68% 78% 1-to-1 alignments: Top 60k sents 91% 80% Table 1: Token coverage and accuracy of manyto-one and 1-to-1 alignments, as well as the top 60k sentences based on alignment score for 1-to-1 alignments, using directly-projected labels only. while keeping a uniform tag distribution for unrelated nodes. A tag dictionary was then extracted from the automatically labelled data, and this was used to constrain a feature-based HMM tagger. The method we propose here is simpler to that of Das and Petrov in that it does not require convex optimization for label propagation or a feature based HMM, yet it achieves comparable results. 3 Tagset Our tagger exploits the idea ofprojecting tag information from a resource-rich to resource-poor language. To facilitate this mapping, we adopt Petrov et al.’s (2012) twelve universal tags: NOUN, VERB, ADJ, ADV, PRON (pronouns), DET (de- terminers and articles), ADP (prepositions and postpositions), NUM (numerals), CONJ (conjunctions), PRT (particles), “.” (punctuation), and X (all other categories, e.g., foreign words, abbreviations). These twelve basic tags are common across taggers for most languages. Adopting a universal tagset avoids the need to map between a variety of different, languagespecific tagsets. Furthermore, it makes it possible to apply unsupervised tagging methods to languages for which no tagset is available, such as Telugu and Vietnamese. 4 A Simpler Unsupervised POS Tagger Here we describe our proposed tagger. The key idea is to maximize the amount of information gleaned from the source language, while limiting the amount of noise. 
We describe the seed model and then explain how it is successively refined through self-training and revision. 4.1 Seed Model The first step is to construct a seed tagger from directly-projected labels. Given a parallel corpus for a source and target language, Algorithm 1provides a method for building an unsupervised tagger for the target language. In typical applications, the source language would be a better-resourced language having a tagger, while the target language would be lesser-resourced, lacking a tagger and large amounts of manually POS-labelled data. Algorithm 1 Build seed model Algorithm 1Build seed model 1:Tag source side. 2: Word align the corpus with Giza++ and remove the many-to-one mappings. 3: Project tags from source to target using the remaining 1-to-1 alignments. 4: Select the top n sentences based on sentence alignment score. 5: Estimate emission and transition probabilities. 6: Build seed tagger T. We eliminate many-to-one alignments (Step 2). Keeping these would give more POS-tagged tokens for the target side, but also introduce noise. For example, suppose English and French were the source and target language, respectively. In this case alignments such as English laws (NNS) to French les (DT) lois (NNS) would be expected (Yarowsky and Ngai, 2001). However, in Step 3, where tags are projected from the source to target language, this would incorrectly tag French les as NN. We build a French tagger based on English– French data from the Europarl Corpus (Koehn, 2005). We also compare the accuracy and coverage of the tags obtained through direct projection using the French Melt POS tagger (Denis and Sagot, 2009). Table 1confirms that the one-to-one alignments indeed give higher accuracy but lower coverage than the many-to-one alignments. At this stage of the model we hypothesize that highconfidence tags are important, and hence eliminate the many-to-one alignments. 
In Step 4, in an effort to again obtain higher quality target language tags from direct projection, we eliminate all but the top n sentences based on their alignment scores, as provided by the aligner via IBM model 3. We heuristically set this cutoff × to 60k to balance the accuracy and size of the seed model.1 Returning to our preliminary English– French experiments in Table 1, this process gives improvements in both accuracy and coverage.2 1We considered values in the range 60–90k, but this choice had little impact on the accuracy of the model. 2We also considered using all projected labels for the top 60k sentences, not just 1-to-1 alignments, but in preliminary experiments this did not perform as well, possibly due to the previously-observed problems with many-to-one alignments. 635 The number of parameters for the emission probability is |V | |T| where V is the vocabulary and aTb iilsi ttyh eis tag |s e×t. TTh| ew htrearnesi Vtio ins probability, on atnhed other hand, has only |T|3 parameters for the trigram hmaondde,l we use. TB|ecause of this difference in number of parameters, in step 5, we use different strategies to estimate the emission and transition probabilities. The emission probability is estimated from all 60k selected sentences. However, for the transition probability, which has less parameters, we again focus on “better” sentences, by estimating this probability from only those sen- tences that have (1) token coverage > 90% (based on direct projection of tags from the source language), and (2) length > 4 tokens. These criteria aim to identify longer, mostly-tagged sentences, which we hypothesize are particularly useful as training data. In the case of our preliminary English–French experiments, roughly 62% of the 60k selected sentences meet these criteria and are used to estimate the transition probability. 
For unaligned words, we simply assign a random POS and very low probability, which does not substantially affect transition probability estimates. In Step 6 we build a tagger by feeding the estimated emission and transition probabilities into the TNT tagger (Brants, 2000), an implementation of a trigram HMM tagger. 4.2 Self training and revision For self training and revision, we use the seed model, along with the large number of target language sentences available that have been partially tagged through direct projection, in order to build a more accurate tagger. Algorithm 2 describes this process of self training and revision, and assumes that the parallel source–target corpus has been word aligned, with many-to-one alignments removed, and that the sentences are sorted by alignment score. In contrast to Algorithm 1, all sentences are used, not just the 60k sentences with the highest alignment scores. We believe that sentence alignment score might correspond to difficulty to tag. By sorting the sentences by alignment score, sentences which are more difficult to tag are tagged using a more mature model. Following Algorithm 1, we divide sentences into blocks of 60k. In step 3 the tagged block is revised by comparing the tags from the tagger with those obtained through direct projection. Suppose source Algorithm 2 Self training and revision 1:Divide target language sentences into blocks of n sentences. 2: Tag the first block with the seed tagger. 3: Revise the tagged block. 4: Train a new tagger on the tagged block. 5: Add the previous tagger’s lexicon to the new tagger. 6: Use the new tagger to tag the next block. 7: Goto 3 and repeat until all blocks are tagged. 
language word wis is aligned with target language word wjt with probability p(wjt |wsi), Tis is the tag for wis using the tagger availa|bwle for the source language, and Tjt is the tag for wjt using the tagger learned for the > S, where S is a threshold which we heuristically set to 0.7, we replace Tjt by Tis. Self-training can suffer from over-fitting, in which errors in the original model are repeated and amplified in the new model (McClosky et al., 2006). To avoid this, we remove the tag of any token that the model is uncertain of, i.e., if p(wjt |wsi) < S and Tjt Tis then Tjt = Null. So, on th|ew target side, aligned words have a tag from direct projection or no tag, and unaligned words have a tag assigned by our model. Step 4 estimates the emission and transition target language. If p(wtj|wis) = probabilities as in Algorithm 1. In Step 5, emission probabilities for lexical items in the previous model, but missing from the current model, are added to the current model. Later models therefore take advantage of information from earlier models, and have wider coverage. 5 Experimental Results Using parallel data from Europarl (Koehn, 2005) we apply our method to build taggers for the same eight target languages as Das and Petrov (201 1) Danish, Dutch, German, Greek, Italian, Portuguese, Spanish and Swedish with English as the source language. Our training data (Europarl) is a subset of the training data of Das and Petrov (who also used the ODS United Nations dataset which we were unable to obtain). The evaluation metric and test data are the same as that used by Das and Petrov. Our results are comparable to theirs, although our system is penalized by having less training data. We tag the source language with the Stanford POS tagger (Toutanova et al., 2003). 
— — 636 DanishDutchGermanGreekItalianPortugueseSpanishSwedishAverage Seed model83.781.183.677.878.684.981.478.981.3 Self training + revision 85.6 84.0 85.4 80.4 81.4 86.3 83.3 81.0 83.4 Das and Petrov (2011) 83.2 79.5 82.8 82.5 86.8 87.9 84.2 80.5 83.4 Table 2: Token-level POS tagging accuracy for our seed model, self training and revision, and the method of Das and Petrov (201 1). The best results on each language, and on average, are shown in bold. 1 1 Iteration 2 2 3 1 1 2 2 3 Iteration Figure 1: Overall accuracy, accuracy on known tokens, accuracy on unknown tokens, and proportion of known tokens for Italian (left) and Dutch (right). Table 2 shows results for our seed model, self training and revision, and the results reported by Das and Petrov. Self training and revision improve the accuracy for every language over the seed model, and gives an average improvement of roughly two percentage points. The average accuracy of self training and revision is on par with that reported by Das and Petrov. On individual languages, self training and revision and the method of Das and Petrov are split each performs better on half of the cases. Interestingly, our method achieves higher accuracies on Germanic languages the family of our source language, English while Das and Petrov perform better on Romance languages. This might be because our model relies on alignments, which might be more accurate for more-related languages, whereas Das and Petrov additionally rely on label propagation. Compared to Das and Petrov, our model performs poorest on Italian, in terms of percentage point difference in accuracy. Figure 1 (left panel) shows accuracy, accuracy on known words, accuracy on unknown words, and proportion of known tokens for each iteration of our model for Italian; iteration 0 is the seed model, and iteration 3 1 is the final model. 
Our model performs poorly on unknown words as indicated by the low accuracy on unknown words, and high accuracy on known — — — words compared to the overall accuracy. The poor performance on unknown words is expected because we do not use any language-specific rules to handle this case. Moreover, on average for the final model, approximately 10% of the test data tokens are unknown. One way to improve the performance of our tagger might be to reduce the proportion of unknown words by using a larger training corpus, as Das and Petrov did. We examine the impact of self-training and revision over training iterations. We find that for all languages, accuracy rises quickly in the first 5–6 iterations, and then subsequently improves only slightly. We exemplify this in Figure 1 (right panel) for Dutch. (Findings are similar for other languages.) Although accuracy does not increase much in later iterations, they may still have some benefit as the vocabulary size continues to grow. 6 Conclusion We have proposed a method for unsupervised POS tagging that performs on par with the current state- of-the-art (Das and Petrov, 2011), but is substantially less-sophisticated (specifically not requiring convex optimization or a feature-based HMM). The complexity of our algorithm is O(nlogn) compared to O(n2) for that of Das and Petrov 637 (201 1) where n is the size of training data.3 We made our code are available for download.4 In future work we intend to consider using a larger training corpus to reduce the proportion of unknown tokens and improve accuracy. Given the improvements of our model over that of Das and Petrov on languages from the same family as our source language, and the observation of Snyder et al. (2008) that a better tagger can be learned from a more-closely related language, we also plan to consider strategies for selecting an appropriate source language for a given target language. 
Using unsupervised HMM methods on top of our final model might also improve the final performance: use our final model as the initial state for an HMM, then experiment with different inference algorithms such as Expectation Maximization (EM), Variational Bayes (VB), or Gibbs sampling (GS).5 Gao and Johnson (2008) compare EM, VB, and GS for unsupervised English POS tagging. In many cases GS outperformed the other methods, so we would like to try GS first for our model.

7 Acknowledgements

This work is funded by the Erasmus Mundus European Masters Program in Language and Communication Technologies (EM-LCT) and by the Czech Science Foundation (grant no. P103/12/G084). We would like to thank Prokopis Prokopidis for providing us the Greek Treebank and Antonia Marti for the Spanish CoNLL 06 dataset. Finally, we thank Siva Reddy and Spandana Gella for many discussions and suggestions.

Footnotes:
3 We re-implemented label propagation from Das and Petrov (2011). It took over a day to complete this step on an eight-core Intel Xeon 3.16 GHz CPU with 32 GB RAM, but only 15 minutes for our model.
4 https://code.google.com/p/universal-tagger/
5 We in fact tried EM, but it did not help: the overall performance dropped slightly. This might be because self-training with revision had already found a local maximum.

References

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP '00), pages 224–231. Seattle, Washington, USA.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1 (ACL 2011), pages 600–609. Portland, Oregon, USA.

Pascal Denis and Benoît Sagot. 2009. Coupling an annotated corpus and a morphosyntactic lexicon for state-of-the-art POS tagging with less human effort.
In Proceedings of the 23rd Pacific-Asia Conference on Language, Information and Computation, pages 721–736. Hong Kong, China.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), pages 549–554. Genoa, Italy.

Jianfeng Gao and Mark Johnson. 2008. A comparison of Bayesian estimators for unsupervised hidden Markov model POS taggers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 344–352. Association for Computational Linguistics, Stroudsburg, PA, USA.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP '04), pages 222–229. Barcelona, Spain.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit (MT Summit X), pages 79–86. AAMT, Phuket, Thailand.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL '06), pages 152–159. New York, USA.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2089–2096. Istanbul, Turkey.

Siva Reddy and Serge Sharoff. 2011. Cross language POS taggers (and other tools) for Indian languages: An experiment with Kannada using Telugu resources.
In Proceedings of the IJCNLP 2011 Workshop on Cross Lingual Information Access: Computational Linguistics and the Information Need of Multilingual Societies (CLIA 2011). Chiang Mai, Thailand.

Benjamin Snyder, Tahira Naseem, Jacob Eisenstein, and Regina Barzilay. 2008. Unsupervised multilingual learning for POS tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 1041–1050. Honolulu, Hawaii.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (NAACL '03), pages 173–180. Edmonton, Canada.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies (NAACL '01), pages 1–8. Pittsburgh, Pennsylvania, USA.
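As a concrete illustration of the Gibbs sampling (GS) option mentioned in the conclusion above, a single Gibbs sweep for a first-order HMM tagger with fixed probability tables might look like this (a generic sketch under our own assumptions, not the authors' implementation):

```python
import random

def gibbs_sweep(tags, words, trans, emit, n_tags, rng=random):
    """One Gibbs sweep over a tag sequence for a first-order HMM.

    `trans[a][b]` and `emit[t][w]` are assumed probability tables
    (hypothetical, not the paper's model). Each tag t_i is resampled from
    P(t_i | rest) proportional to
        P(t_i | t_{i-1}) * P(t_{i+1} | t_i) * P(w_i | t_i).
    """
    n = len(tags)
    for i in range(n):
        prev_t = tags[i - 1] if i > 0 else None
        next_t = tags[i + 1] if i < n - 1 else None
        # Unnormalized conditional weight for each candidate tag.
        weights = []
        for t in range(n_tags):
            p = emit[t].get(words[i], 1e-10)
            if prev_t is not None:
                p *= trans[prev_t][t]
            if next_t is not None:
                p *= trans[t][next_t]
            weights.append(p)
        # Sample a tag proportionally to its weight.
        r = rng.random() * sum(weights)
        for t, w in enumerate(weights):
            r -= w
            if r <= 0:
                tags[i] = t
                break
    return tags
```

Repeating such sweeps until mixing, and initializing `tags` from the final self-trained model, would implement the initialization strategy the conclusion proposes.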
5 0.60939038 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
Author: Tingting Li ; Tiejun Zhao ; Andrew Finch ; Chunyue Zhang
Abstract: Machine Transliteration is an essential task for many NLP applications. However, names and loan words typically originate from various languages, obey different transliteration rules, and therefore may benefit from being modeled independently. Recently, transliteration models based on Bayesian learning have overcome issues with over-fitting allowing for many-to-many alignment in the training of transliteration models. We propose a novel coupled Dirichlet process mixture model (cDPMM) that simultaneously clusters and bilingually aligns transliteration data within a single unified model. The unified model decomposes into two classes of non-parametric Bayesian component models: a Dirichlet process mixture model for clustering, and a set of multinomial Dirichlet process models that perform bilingual alignment independently for each cluster. The experimental results show that our method considerably outperforms conventional alignment models.
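The Dirichlet process mixture component described in this abstract rests on the standard Chinese restaurant process prior; a minimal sketch of the cluster-assignment probabilities follows (illustrative only: the full cDPMM also multiplies in per-cluster bilingual alignment likelihoods):

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Chinese-restaurant-process prior for assigning a new item.

    P(existing cluster k) is proportional to n_k (its current size), and
    P(new cluster) is proportional to the concentration parameter alpha.
    Returns the probabilities for each existing cluster followed by the
    probability of opening a new cluster.
    """
    total = sum(cluster_sizes) + alpha
    probs = [n / total for n in cluster_sizes]
    probs.append(alpha / total)
    return probs
```

Larger `alpha` favors opening new clusters, which is how the number of transliteration-origin clusters can grow with the data rather than being fixed in advance.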
6 0.60243076 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation
7 0.5821479 226 acl-2013-Learning to Prune: Context-Sensitive Pruning for Syntactic MT
8 0.58144712 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
9 0.57924116 331 acl-2013-Stop-probability estimates computed on a large corpus improve Unsupervised Dependency Parsing
10 0.57388592 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
11 0.5557124 145 acl-2013-Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks
12 0.55241925 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation
13 0.55172133 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression
14 0.5475198 235 acl-2013-Machine Translation Detection from Monolingual Web-Text
15 0.54430151 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
16 0.54302275 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
17 0.54164749 236 acl-2013-Mapping Source to Target Strings without Alignment by Analogical Learning: A Case Study with Transliteration
18 0.54108596 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
19 0.52779961 80 acl-2013-Chinese Parsing Exploiting Characters
20 0.5246352 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
topicId topicWeight
[(0, 0.07), (6, 0.039), (11, 0.089), (15, 0.016), (21, 0.18), (24, 0.035), (26, 0.079), (35, 0.074), (42, 0.066), (48, 0.066), (70, 0.069), (88, 0.02), (90, 0.042), (95, 0.058)]
simIndex simValue paperId paperTitle
1 0.93943596 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse
Author: Ben King ; Rahul Jha ; Dragomir Radev ; Robert Mankoff
Abstract: In this paper, we study the problem of automatically annotating the factoids present in collective discourse. Factoids are information units that are shared between instances of collective discourse and may have many different ways of being realized in words. Our approach divides this problem into two steps, using a graph-based approach for each step: (1) factoid discovery, finding groups of words that correspond to the same factoid, and (2) factoid assignment, using these groups of words to mark collective discourse units that contain the respective factoids. We study this on two novel data sets: the New Yorker caption contest data set, and the crossword clues data set.
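The graph-based machinery such a method relies on can be illustrated with a generic random walk with restart over a weighted word graph (a hedged sketch; the paper's exact walk and graph construction are not reproduced here):

```python
def random_walk_scores(adj, start, steps=10, restart=0.2):
    """Random walk with restart over a weighted graph.

    `adj` maps a node to a dict of weighted neighbours; `start` is the
    restart node. Mass reaching a dangling node (no outgoing edges) is
    sent back to the start, so total probability mass is conserved.
    """
    scores = {start: 1.0}
    for _ in range(steps):
        nxt = {}
        for node, mass in scores.items():
            nbrs = adj.get(node, {})
            if not nbrs:
                # Dangling node: restart with all of its mass.
                nxt[start] = nxt.get(start, 0.0) + mass
                continue
            total = sum(nbrs.values())
            for nbr, w in nbrs.items():
                nxt[nbr] = nxt.get(nbr, 0.0) + (1 - restart) * mass * w / total
            # Restart component returns to the start node.
            nxt[start] = nxt.get(start, 0.0) + restart * mass
        scores = nxt
    return scores
```

Words that accumulate similar score vectors under such walks can then be grouped as candidate factoids.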
2 0.86438823 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
Author: Xingxing Zhang ; Jianwen Zhang ; Junyu Zeng ; Jun Yan ; Zheng Chen ; Zhifang Sui
Abstract: Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, the accuracy is still not satisfying. In this paper, we point out and analyze some critical factors in DS which have great impact on accuracy, including valid entity type detection, negative training examples construction and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement led by the proposed approach.
3 0.85551429 175 acl-2013-Grounded Language Learning from Video Described with Sentences
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate description of new video.
4 0.85126334 282 acl-2013-Predicting and Eliciting Addressee's Emotion in Online Dialogue
Author: Takayuki Hasegawa ; Nobuhiro Kaji ; Naoki Yoshinaga ; Masashi Toyoda
Abstract: While there have been many attempts to estimate the emotion of an addresser from her/his utterance, few studies have explored how her/his utterance affects the emotion of the addressee. This has motivated us to investigate two novel tasks: predicting the emotion of the addressee and generating a response that elicits a specific emotion in the addressee's mind. We target Japanese Twitter posts as a source of dialogue data and automatically build training data for learning the predictors and generators. The feasibility of our approaches is assessed by using 1099 utterance-response pairs that are built by five human workers.
same-paper 5 0.84101331 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
6 0.76053154 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
7 0.73341441 275 acl-2013-Parsing with Compositional Vector Grammars
8 0.73069412 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
9 0.72961521 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
10 0.72631228 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
11 0.72143173 318 acl-2013-Sentiment Relevance
12 0.72142571 225 acl-2013-Learning to Order Natural Language Texts
13 0.72136265 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
14 0.7211625 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
15 0.72108561 80 acl-2013-Chinese Parsing Exploiting Characters
16 0.71984559 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
17 0.71883988 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
18 0.71774191 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
19 0.71703112 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
20 0.7157734 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction