Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
1 We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). [sent-8, score-1.293]
2 The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. [sent-10, score-0.91]
3 1 Introduction Most contemporary SMT systems view parallel data as independent sentence-pairs whether or not they are from the same document-pair. [sent-12, score-0.096]
4 Consequently, translation models are learned only at sentence-pair level, and document contexts – essential factors for translating documents – are generally overlooked. [sent-13, score-0.483]
5 Indeed, translating documents differs considerably from translating a group of unrelated sentences. [sent-14, score-0.194]
6 One should avoid destroying a coherent document by simply translating it into a group of sentences which are indifferent to each other and detached from the context. [sent-16, score-0.217]
7 Developments in statistics, genetics, and machine learning have shown that latent semantic aspects of complex data can often be captured by a model known as the statistical admixture (or mixed membership model [4]). [sent-17, score-0.205]
8 Statistically, an object is said to be derived from an admixture if it consists of a bag of elements, each sampled independently or coupled in a certain way, from a mixture model. [sent-18, score-0.112]
9 In the context of SMT, each parallel document-pair is treated as one such object. [sent-19, score-0.096]
10 Variants of admixture models have appeared in population genetics [6] and text modeling [1, 4]. [sent-24, score-0.122]
11 Recently, a Bilingual Topic-AdMixture (BiTAM) model was proposed to capture the topical aspects of SMT [10]; word-pairs from a parallel document-pair follow the same weighted mixtures of translation lexicons, inferred for the given document-context. [sent-25, score-0.631]
12 However, they do not capture locality constraints of word alignment, i.e., [sent-27, score-0.191]
13 words “close-in-source” are usually aligned to words “close-in-target”, under document-specific topical assignment. [sent-29, score-0.28]
14 To incorporate such constituents, we integrate the strengths of both HMM and BiTAM, and propose a Hidden Markov Bilingual Topic-AdMixture model, or HM-BiTAM, for word alignment to leverage both locality constraints and topical context underlying parallel document-pairs. [sent-30, score-0.683]
15 In the HM-BiTAM framework, one can estimate topic-specific word-to-word translation lexicons (lexical mappings), as well as the monolingual topic-specific word-frequencies for both languages, based on parallel document-pairs. [sent-31, score-0.87]
16 The resulting model offers a principled way of inferring optimal translation from a given source language in a context-dependent fashion. [sent-32, score-0.371]
17 We show our model’s effectiveness on the word-alignment task; we also demonstrate two application aspects which were untouched in [10]: the utility of HM-BiTAM for bilingual topic exploration, and its application for improving translation qualities. [sent-34, score-0.896]
18 2 Revisit HMM for SMT An SMT system can be formulated as a noisy-channel model [2]: $e^* = \arg\max_e P(e|f) = \arg\max_e P(f|e)P(e)$, (1) where a translation corresponds to searching for the target sentence $e^*$ which explains the source sentence $f$ best. [sent-35, score-0.417]
19 The key component is $P(f|e)$, the translation model; $P(e)$ is the monolingual language model. [sent-36, score-0.635]
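As a minimal sketch of the noisy-channel decision rule in Eq. (1), assuming a small explicit candidate set rather than a real decoder search: the function name `noisy_channel_decode` and the toy scores below are hypothetical and chosen only for illustration.

```python
import math

def noisy_channel_decode(f, candidates, translation_logprob, lm_logprob):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    return max(candidates, key=lambda e: translation_logprob(f, e) + lm_logprob(e))

# Toy, made-up scores: both candidates explain the source equally well,
# so the language model P(e) decides between them.
tm = {("la maison", "the house"): math.log(0.6),
      ("la maison", "house the"): math.log(0.6)}
lm = {"the house": math.log(0.1), "house the": math.log(0.001)}

best = noisy_channel_decode(
    "la maison",
    ["the house", "house the"],
    lambda f, e: tm[(f, e)],
    lambda e: lm[e],
)
```

Working in log space avoids underflow when the two factors are products over many words.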
20 An HMM implements the “proximity-bias” assumption — that words “close-in-source” are aligned to words “close-in-target”, which is effective for improving word alignment accuracies, especially for linguistically close language-pairs [8]. [sent-38, score-0.537]
21 Following [8], to model word-to-word translation, we introduce the mapping $j \to a_j$, which assigns a French word $f_j$ in position $j$ to an English word $e_i$ in position $i = a_j$, denoted as $e_{a_j}$. [sent-39, score-0.687]
22 Each (ordered) French word $f_j$ is an observation, and it is generated by an HMM state defined as $[e_{a_j}, a_j]$, where the alignment indicator $a_j$ for position $j$ is considered to have a dependency on the previous alignment $a_{j-1}$. [sent-40, score-0.863]
23 Thus a first-order HMM for an alignment between $e \equiv e_{1:I}$ and $f \equiv f_{1:J}$ is defined as: $p(f_{1:J}|e_{1:I}) = \sum_{a_{1:J}} \prod_{j=1}^{J} p(f_j|e_{a_j})\, p(a_j|a_{j-1})$, (2) where $p(a_j|a_{j-1})$ is the state transition probability; $J$ and $I$ are sentence lengths of the French and English sentences, respectively. [sent-41, score-0.278]
24 An additional pseudo word “NULL” is used at the beginning of English sentences for the HMM to start with. [sent-43, score-0.242]
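The sum over all alignments in Eq. (2) need not be enumerated; it can be computed with the standard forward recursion. The sketch below is illustrative only: it replaces the NULL start word with an explicit `initial` distribution over the first alignment position (an assumption), and the lexicon and transition values are made up.

```python
def hmm_alignment_likelihood(f, e, lexicon, transition, initial):
    """p(f_{1:J} | e_{1:I}) = sum over a_{1:J} of prod_j p(f_j | e_{a_j}) p(a_j | a_{j-1})."""
    I = len(e)
    # alpha[i] holds p(f_1..f_j, a_j = i) for the current position j
    alpha = [initial[i] * lexicon[(f[0], e[i])] for i in range(I)]
    for j in range(1, len(f)):
        alpha = [
            lexicon[(f[j], e[i])] * sum(alpha[ip] * transition[ip][i] for ip in range(I))
            for i in range(I)
        ]
    return sum(alpha)

# Toy inputs (hypothetical values)
e = ["the", "house"]
f = ["la", "maison"]
lexicon = {("la", "the"): 0.9, ("la", "house"): 0.1,
           ("maison", "the"): 0.1, ("maison", "house"): 0.9}
transition = [[0.3, 0.7], [0.6, 0.4]]   # transition[i_prev][i] = p(a_j = i | a_{j-1} = i_prev)
initial = [0.5, 0.5]                    # stands in for the transition out of NULL

likelihood = hmm_alignment_likelihood(f, e, lexicon, transition, initial)
```

The recursion costs O(J I^2) operations instead of the I^J terms of the explicit sum.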
25 3 Hidden Markov Bilingual Topic-AdMixture We assume that in training corpora of bilingual documents, the document-pair boundaries are known, and indeed they serve as the key information for defining document-specific topic weights underlying aligned sentence-pairs or word-pairs. [sent-48, score-0.653]
26 To simplify the outline, the topics here are sampled at sentence-pair level; topics sampled at word-pair level can be easily derived following the outlined algorithms, in the same spirit as [10]. [sent-49, score-0.48]
27 Given a document-pair $(F, E)$ containing $N$ parallel sentence-pairs $(e_n, f_n)$, HM-BiTAM implements the following generative scheme. [sent-50, score-0.155]
28 The sentence-pairs $\{f_n, e_n\}$ are drawn independently from a mixture of topics. [sent-54, score-0.071]
29 For each sentence-pair $(f_n, e_n)$: (a) $z_n \sim \mathrm{Multinomial}(\theta)$: sample the topic; (b) $e_{n,1:I_n} | z_n \sim P(e_n | z_n; \beta)$: sample all English words from a monolingual topic model (e.g., [sent-58, score-0.89]
30 a unigram model); (c) for each position $j_n = 1, \ldots$ [sent-60, score-0.223]
31 i. $a_{j_n} \sim P(a_{j_n} | a_{j_n - 1}; T)$: sample an alignment link $a_{j_n}$ from a first-order Markov process; [sent-64, score-0.495]
32 ii. $f_{j_n} \sim P(f_{j_n} | e_n, a_{j_n}, z_n; B)$: sample a foreign word $f_{j_n}$ according to a topic-specific translation lexicon. [sent-65, score-1.221]
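The generative steps (a) to (c) can be sketched as a small sampler. This is an illustration under stated assumptions, not the authors' implementation: sentence lengths are passed in as fixed inputs, the alignment chain is assumed to start from position 0, and all parameter values (`theta`, `beta`, `B`, `T`) are toy numbers.

```python
import random

def sample_sentence_pair(theta, beta, B, T, len_e, len_f, rng):
    """Sample (z, e, a, f) for one sentence-pair following steps (a)-(c)."""
    # (a) draw the sentence-pair topic z from the document's topic weights theta
    z = rng.choices(range(len(theta)), weights=theta)[0]
    # (b) draw the English words from the topic-specific unigram beta[z]
    vocab = list(beta[z])
    e = rng.choices(vocab, weights=[beta[z][w] for w in vocab], k=len_e)
    a, f = [], []
    prev = 0  # assumption: the Markov chain over alignments starts at position 0
    for _ in range(len_f):
        # (c)i. draw the alignment link from the first-order transition T
        cur = rng.choices(range(len_e), weights=T[prev])[0]
        # (c)ii. draw the foreign word from the topic-specific lexicon B[z]
        lex = B[z][e[cur]]
        words = list(lex)
        f.append(rng.choices(words, weights=[lex[w] for w in words])[0])
        a.append(cur)
        prev = cur
    return z, e, a, f

# Toy parameters (hypothetical): two topics sharing the ambiguous word "bank"
theta = [0.7, 0.3]
beta = [{"bank": 0.5, "money": 0.5}, {"bank": 0.5, "river": 0.5}]
B = [{"bank": {"banque": 1.0}, "money": {"argent": 1.0}},
     {"bank": {"rive": 1.0}, "river": {"riviere": 1.0}}]
T = [[0.6, 0.4], [0.4, 0.6]]

z, e, a, f = sample_sentence_pair(theta, beta, B, T, 2, 2, random.Random(0))
```

In the toy lexicons, "bank" translates differently under the two topics, which is the behavior the topic-specific lexicon B is meant to capture.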
33 Under an HM-BiTAM model, each sentence-pair consists of a mixture of latent bilingual topics; each topic is associated with a distribution over bilingual word-pairs. [sent-66, score-0.908]
34 Each word f is generated by two hidden factors: a latent topic z drawn from a document-specific distribution over K topics, and the English word e identified by the hidden alignment variable a. [sent-67, score-0.921]
35 2 Extracting Bilingual Topics from HM-BiTAM Because of the parallel nature of the data, the topics of English and the foreign language will share similar semantic meanings. [sent-69, score-0.513]
36 As shown in Figure 1(b), both the English and foreign topics are sampled from the same distribution $\theta$, which is a document-specific topic-weight vector. [sent-71, score-0.355]
37 , unigram) of foreign word $f_w$ under topic $k$ can be computed by $P(f_w|k) = \sum_e P(f_w|e, B_k)\, P(e|\beta_k)$. (3) [sent-75, score-0.562]
38 As a result, HM-BiTAM can actually be used as a bilingual topic explorer in the LDA style and beyond. [sent-76, score-0.552]
39 Given paired documents, it can extract the representations of each topic in both languages in a consistent fashion (which is not guaranteed if topics are extracted separately from each language using, e.g., [sent-77, score-0.56]
40 LDA), as well as the lexical mappings under each topic, based on a maximum likelihood or Bayesian principle. [sent-79, score-0.089]
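The marginalization in Eq. (3) is a single weighted sum over English words. The following sketch assumes dictionary-based toy parameters; the names `foreign_unigram`, `beta_k`, and `B_k` are illustrative, and the numbers are made up.

```python
def foreign_unigram(B_k, beta_k):
    """P(f | k) = sum_e P(f | e, B_k) * P(e | beta_k) for one topic k."""
    p_f = {}
    for e, p_e in beta_k.items():
        for f, p_f_given_e in B_k[e].items():
            p_f[f] = p_f.get(f, 0.0) + p_f_given_e * p_e
    return p_f

# Toy topic: English unigram beta_k and translation lexicon B_k (hypothetical values)
beta_k = {"bank": 0.6, "money": 0.4}
B_k = {"bank": {"banque": 0.9, "rive": 0.1}, "money": {"argent": 1.0}}

p_f = foreign_unigram(B_k, beta_k)
```

Because each row of `B_k` and the vector `beta_k` are normalized, the resulting foreign unigram is normalized as well.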
41 4 Learning and Inference We sketch a generalized mean-field approximation scheme for inferring latent variables in HM-BiTAM, and a variational EM algorithm for estimating model parameters. [sent-83, score-0.092]
42 Equation (6) represents the approximate posterior of the topic weights for each sentence-pair $(f_n, e_n)$. [sent-95, score-0.292]
43 The topical information for updating $\phi_n$ is collected from three aspects: aligned word-pairs weighted by the corresponding topic-specific translation lexicon probabilities, topical distributions of the monolingual English language model, and the smoothing factors from the topic prior. [sent-96, score-1.313]
44 Equation (7) gives the approximate posterior probability for alignment between the $j$-th word in $f_n$ and the $i$-th word in $e_n$, in the form of an exponential model. [sent-97, score-0.743]
45 Inference of optimum word-alignment One of the translation model’s goals is to infer the optimum word alignment: $a^* = \arg\max_a P(a|F, E)$. [sent-99, score-0.481]
46 The variational inference scheme described above leads to an approximate alignment posterior q(a|λ), which is in fact a reparameterized HMM. [sent-100, score-0.326]
47 Thus, extracting the optimum alignment amounts to applying the Viterbi algorithm to q(a|λ). [sent-101, score-0.231]
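A minimal sketch of Viterbi decoding on an HMM over alignment positions is given below. It treats the reparameterized HMM generically: the `emission`, `transition`, and `initial` tables are hypothetical stand-ins for the quantities that q(a|λ) would supply, and all entries are assumed strictly positive so that logarithms are defined.

```python
import math

def viterbi_alignment(emission, transition, initial):
    """emission[j][i] scores aligning f_j to e_i; returns the most probable a_{1:J}."""
    J, I = len(emission), len(initial)
    delta = [math.log(initial[i]) + math.log(emission[0][i]) for i in range(I)]
    back = []
    for j in range(1, J):
        new_delta, ptr = [], []
        for i in range(I):
            # best predecessor position for landing on English position i
            best_prev = max(range(I), key=lambda ip: delta[ip] + math.log(transition[ip][i]))
            ptr.append(best_prev)
            new_delta.append(delta[best_prev] + math.log(transition[best_prev][i])
                             + math.log(emission[j][i]))
        delta = new_delta
        back.append(ptr)
    # trace back from the best final position
    a = [max(range(I), key=lambda i: delta[i])]
    for ptr in reversed(back):
        a.append(ptr[a[-1]])
    return a[::-1]

# Toy tables (made-up values): two foreign words, two English positions
emission = [[0.9, 0.1], [0.1, 0.9]]
transition = [[0.3, 0.7], [0.6, 0.4]]
initial = [0.5, 0.5]

best_alignment = viterbi_alignment(emission, transition, initial)
```

Positions are zero-based here, so the result `[0, 1]` aligns the first foreign word to the first English word and the second to the second.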
48 5 Experiments In this section, we investigate three main aspects of the HM-BiTAM model, including word alignment, bilingual topic exploration, and machine translation. [sent-107, score-0.797]
49 The training data is a collection of parallel document-pairs, with document boundaries explicitly given. [sent-114, score-0.164]
50 As shown in Table 1, our training corpora are general newswire, covering topics mainly about economics, politics, education and sports. [sent-115, score-0.277]
51 This test set contains relatively long sentence-pairs, with an average sentence length of 40. [sent-118, score-0.047]
52 The long sentences introduce more ambiguities for alignment tasks. [sent-120, score-0.282]
53 For testing translation quality, TIDES’02 MT evaluation data is used as development data, and ten documents from TIDES’04 MT-evaluation are used as the unseen test data. [sent-121, score-0.424]
54 BLEU scores are reported to evaluate translation quality with HM-BiTAM models. [sent-122, score-0.29]
55 1 Empirical Validation Word Alignment Accuracy We trained HM-BiTAMs with ten topics using parallel corpora of sizes ranging from 6M to 22. [sent-124, score-0.405]
56 Following the same logic as for all BiTAMs in [10], we choose the HM-BiTAM in which topics are sampled at the word-pair level rather than the sentence-pair level. [sent-126, score-0.24]
57 Figure 2 shows the alignment accuracies of HM-BiTAM, in comparison with those of the baseline HMM, the baseline BiTAM, and the IBM Model-4. [sent-129, score-0.258]
58 different models trained on corpora of different sizes. [sent-140, score-0.066]
59 In HM-BiTAM, two factors contribute to narrowing down the word-alignment decisions: the position and the lexical mapping. [sent-143, score-0.086]
60 The emission lexical probability, however, is different: each state is a mixture of topic-specific translation lexicons, whose weights are inferred from document context. [sent-145, score-0.445]
61 The topic-specific translation lexicons are sharper and smaller than the global one used in HMM. [sent-146, score-0.477]
62 Not surprisingly, HM-BiTAM also outperforms the baseline-BiTAM significantly, because BiTAM captures only the topical aspects and ignores the proximity bias. [sent-148, score-0.219]
63 However, IBM Model-4 does not have a scheme to adjust its lexicon probabilities specific to document topical context as in HM-BiTAM. [sent-159, score-0.16]
64 In a way, HM-BiTAM wins over IBM-4 by leveraging topic models that capture the document context. [sent-160, score-0.289]
65 Overall the likelihoods under HM-BiTAM are significantly better than those under HMM and IBM Model-4, revealing the better modeling power of HM-BiTAM. [sent-163, score-0.066]
66 As shown in Table 2, the likelihoods of HM-BiTAM on these unseen data dominate significantly over those of HMM, BiTAM, and the IBM Models in every case, confirming that HM-BiTAM indeed offers a better fit and generalizability for the bilingual document-pairs. [sent-166, score-0.417]
67 Table 2 layout: columns Publishers, Genre, IBM-1, HMM, IBM-4, BiTAM, HM-BiTAM. Rows (publisher, genre): AgenceFrance (AFP), news; AgenceFrance (AFP), news; AgenceFrance (AFP), news; ForeignMinistryPRC, speech; HongKongNews, speech; People’s Daily, editorial; United Nation, speech; XinHua News, news; XinHua News, news; ZaoBao News, editorial. Values begin: -3752. [sent-167, score-0.423]
68 Perplexity Table 2: Likelihoods of unseen documents under HM-BiTAMs, in comparison with competing models. [sent-222, score-0.102]
69 2 Application 1: Bilingual Topic Extraction Monolingual topics: HM-BiTAM facilitates inference of the latent LDA-style representations of topics [1] in both English and the foreign language (i. [sent-224, score-0.46]
70 The English topics (represented by the topic-specific word frequencies) can be directly read-off from HM-BiTAM parameters β. [sent-227, score-0.402]
71 2, even though the topic-specific distributions of words in the Chinese corpora are not directly encoded in HM-BiTAM, one can marginalize over alignments of the parallel data to synthesize them based on the monolingual English topics and the topic-specific lexical mapping from English to Chinese. [sent-229, score-0.802]
72 The top-ranked frequent words in each topic exhibit coherent semantic meanings; and there are also consistencies between the word semantics under the same topic indexes across languages. [sent-231, score-0.745]
73 Under HM-BiTAM, the two respective monolingual word-distributions for the same topic are statistically coupled due to sharing of the same topic for each sentence-pair in the two languages. [sent-232, score-0.739]
74 If one merely applies LDA to the corpora in each language separately, by contrast, such coupling cannot be exploited. [sent-233, score-0.114]
75 This coupling enforces consistency between the topics across languages. [sent-234, score-0.211]
76 However, as with general clustering algorithms, topics in HM-BiTAM do not necessarily present obvious semantic labels. [sent-235, score-0.254]
77 Figure 4: Monolingual topics of both languages learned from parallel data. [sent-238, score-0.354]
78 It appears that the English topics (on the left panel) are highly parallel to the Chinese ones (annotated with English gloss, on the right panel). [sent-239, score-0.307]
79 Topic-Specific Lexicon Mapping: Table 3 shows two examples of topic-specific lexicon mapping learned by HM-BiTAM. [sent-240, score-0.123]
80 Given a topic assignment, a word usually has far fewer translation candidates, and the topic-specific translation lexicons are generally much smaller and sharper. [sent-241, score-1.179]
81 Different topic-specific lexicons emphasize different aspects of translating the same source words, which cannot be captured by the IBM models or HMM. [sent-242, score-0.343]
82 Table 3 layout: rows Topic-1 through Topic-10, IBM Model-1, HMM, and IBM Model-4; for each of the English words “meet” and “power”, columns TopCand, Meaning, and Probability. Meanings listed for “meet”: sports meeting, to satisfy, to adapt, to adjust, to see someone (the three baseline rows all give sports meeting); for “power”: electric power, electricity factory, to be relevant, strength, Electric watt, power to generate. [sent-244, score-0.402]
85 506258 Table 3: Topic-specific translation lexicons learned by HM-BiTAM. [sent-267, score-0.477]
86 We show the top candidate (TopCand) lexicon mappings of “meet” and “power” under ten topics. [sent-268, score-0.152]
87 (The symbol “-” means that no significant lexicon mapping exists under that topic.) [sent-269, score-0.123]
88 Also shown are the semantic meanings of the mapped Chinese words, and the mapping probability $p(f|e, k)$. [sent-270, score-0.1]
89 3 Application 2: Machine Translation The parallelism of topic-assignment between languages modeled by HM-BiTAM, as shown in § 3. [sent-272, score-0.047]
90 4, enables a natural way of improving translation by exploiting semantic consistency and contextual coherency more explicitly and aggressively. [sent-274, score-0.333]
91 We used $p(e|f, D_F)$ from Eq. (11) to score the bilingual phrase-pairs in a state-of-the-art GALE translation system trained with 250 M words. [sent-277, score-0.621]
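The exact form of Eq. (11) is not reproduced in this text. A plausible reading, given the sum over topics starting at k=1, is a document-specific lexicon obtained by mixing topic-specific lexicons with topic weights inferred from the source document, $p(e|f, D_F) = \sum_k p(e|f, k)\, p(k|D_F)$. The sketch below implements only that assumed form; the function name, the French toy word, and all numbers are hypothetical.

```python
def document_lexicon_prob(e, f, topic_lexicons, topic_weights):
    """Assumed mixture: p(e | f, D_F) = sum_k p(e | f, k) * p(k | D_F)."""
    return sum(w * lex.get((e, f), 0.0) for lex, w in zip(topic_lexicons, topic_weights))

# Toy topic-specific lexicons p(e | f, k) for an ambiguous source word (made-up values)
legal_topic = {("lawyer", "avocat"): 0.9, ("avocado", "avocat"): 0.1}
food_topic = {("lawyer", "avocat"): 0.2, ("avocado", "avocat"): 0.8}

# Topic weights p(k | D_F) for a source document that is mostly about food
weights = [0.1, 0.9]

p_lawyer = document_lexicon_prob("lawyer", "avocat", [legal_topic, food_topic], weights)
p_avocado = document_lexicon_prob("avocado", "avocat", [legal_topic, food_topic], weights)
```

With these weights the document context shifts the preferred translation toward "avocado", which illustrates how such a score could re-rank phrase-pairs per document.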
92 Then decoding of the unseen ten MT04 documents in Table 2 was carried out. [sent-279, score-0.134]
93 Experiments using the topic assignments inferred from ground truth and the ones inferred via HM-BiTAM; n-gram precisions together with final BLEUr4n4 scores are evaluated. [sent-303, score-0.306]
94 If we know the ground truth of translation to infer the topic-weights, improvement is from 32. [sent-305, score-0.29]
95 With topical inference from HM-BiTAM using the monolingual source document, improved n-gram precisions in the translation were observed from 1-gram to 4-gram. [sent-308, score-0.846]
96 6 Discussion and Conclusion We presented a novel framework, HM-BiTAM, for exploring bilingual topics, and generalizing over traditional HMM for improved word-alignment accuracies and translation quality. [sent-317, score-0.648]
97 A variational inference and learning procedure was developed for efficient training and application in translation. [sent-318, score-0.095]
98 We demonstrated significant improvement of word-alignment accuracy over a number of existing systems, and the interesting capability of HM-BiTAM to simultaneously extract coherent monolingual topics from both languages. [sent-319, score-0.537]
99 We also report encouraging improvement of translation quality over current benchmarks; although the margin is modest, it is noteworthy that the current version of HM-BiTAM remains a purely autonomously trained system. [sent-320, score-0.29]
100 A generalized mean field algorithm for variational inference in exponential families. [sent-375, score-0.095]
