acl acl2013 acl2013-143 knowledge-graph by maker-knowledge-mining

143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model


Source: pdf

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. [sent-2, score-0.769]

2 Initial attempts at modeling fertility used heuristic search methods. [sent-3, score-0.568]

3 Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. [sent-4, score-0.176]

4 Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. [sent-6, score-1.089]

5 Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality. [sent-7, score-0.371]

6 1 Introduction Word-based translation models intended to model the translation process have found new uses in identifying word correspondences in sentence pairs. [sent-8, score-0.125]

7 These word alignments are a crucial training component in most machine translation systems. [sent-9, score-0.079]

8 With minor improvements to initialization (Moore, 2004) (which may be important (Toutanova and Galley, 2011)), it can be quite competitive. [sent-15, score-0.034]

9 Models 2 and 3 incorporate a positional model based on the absolute position of the word; Models 4 and 5 use a relative position model instead (an English word tends to align to a French word that is near the French word aligned to the previous English word). [sent-17, score-0.141]

10 Although these latter models covered a broad range of phenomena, estimation techniques and MAP inference were challenging. [sent-19, score-0.125]

11 This captures the positional information in the IBM models in a framework that admits exact parameter estimation and inference, though the objective function is not concave: local maxima are a concern. [sent-24, score-0.22]

12 Modeling fertility is challenging in the HMM framework as it violates the Markov assumption. [sent-25, score-0.622]

13 Where the HMM jump model considers only the prior state, fertility requires looking across the whole state space. [sent-26, score-0.71]

14 Recent work (Zhao and Gildea, 2010) described an extension to the HMM with a fertility model, using MCMC techniques for parameter estimation. [sent-28, score-0.623]

15 This paper introduces a method for exact MAP inference with the fertility HMM using dual decomposition. [sent-30, score-0.944]

16 The resulting model leads to substantial improvements in alignment quality. [sent-31, score-0.203]

17 Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August 4–9 2013. [sent-32, score-0.029]

18 2 HMM alignment Let us briefly review the HMM translation model as a starting point. [sent-34, score-0.243]

19 This model produces distributions over French word sequences f = f1, . . . , fJ. [sent-39, score-0.033]

20 The alignment a ∈ [0..I]^J indicates the English word generating the jth French word, with e0 representing a special NULL state to handle systematically unaligned words. [sent-47, score-0.067]

21 The generative story begins by predicting the number of words in the French sentence (hence the number of elements in the alignment vector). [sent-52, score-0.168]

22 Then for each French word position, first the alignment variable (English word index used to generate the current French word) is selected based only on the prior alignment variable. [sent-53, score-0.445]

23 Next the French word is predicted based on its aligned English word. [sent-54, score-0.033]
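
To make the generative story above concrete, here is a minimal sketch of how such a fertility-free HMM alignment model would score one (e, f, a) triple. The parameter tables p_len, p_jump, and p_trans are hypothetical interfaces (not defined in this summary), and the NULL-state bookkeeping shown is one common convention rather than necessarily the paper's.

    import math

    def hmm_log_prob(e, f, a, p_len, p_jump, p_trans):
        """Log Pr(f, a | e) under a fertility-free HMM alignment model (sketch).

        e: English words e[1..I] (slot 0 reserved for the NULL state)
        f: French words f[1..J]
        a: alignment vector; a[j] in 0..I is the English position generating f[j]
        p_len(J, I): length model; p_jump(i, i_prev, I): transition model;
        p_trans(fj, ei): emission model. All three are assumed callables.
        """
        I, J = len(e) - 1, len(f) - 1
        logp = math.log(p_len(J, I))                   # first predict the French length
        prev = 0                                       # conventional start state
        for j in range(1, J + 1):
            logp += math.log(p_jump(a[j], prev, I))    # jump depends only on the prior state
            logp += math.log(p_trans(f[j], e[a[j]]))   # emit the French word from its aligned English word
            if a[j] != 0:                              # assumption: NULL links do not move the jump state
                prev = a[j]
        return logp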

24 Following prior work (Zhao and Gildea, 2010), we augment the standard HMM with a fertility distribution. [sent-55, score-0.611]

25 This deficient model wastes some probability mass on inconsistent configurations where the number of times that a state i is visited does not match its fertility φi. [sent-61, score-0.677]

26 Following in the footsteps of older, richer, and wiser colleagues (Brown et al. [sent-62, score-0.031]

27 2.1 Parameter estimation Of greater concern is the exponential complexity of inference in this model. [sent-65, score-0.125]

28 For the standard HMM, there is a dynamic programming algorithm to compute the posterior probability over word alignments Pr(a|e, f). [sent-66, score-0.184]

29 The structure of the fertility model violates the Markov assumptions used in this dynamic programming method. [sent-69, score-0.686]

30 However, we may empirically estimate the posterior distribution using Markov chain Monte Carlo methods such as Gibbs sampling (Zhao and Gildea, 2010). [sent-70, score-0.155]

31 We then repeatedly resample each element of that vector conditioned on all other positions according to the distribution Pr(aj |a−j , e, f). [sent-72, score-0.13]

32 Given a complete assignment of the alignment for all words except the current, computing the complete probability including transition, emission, and jump, is straightforward. [sent-73, score-0.221]

33 This estimate comes with a computational cost: we must cycle through all positions of the vector repeatedly to gather a good estimate. [sent-74, score-0.072]
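
The resampling loop just described can be sketched as follows. The cond_prob callback (the unnormalized Pr(aj = i | a−j, e, f), combining transition, emission, jump, and fertility terms) is an assumed interface; the paper gives no code.

    import random

    def gibbs_alignment_posterior(e, f, cond_prob, sweeps=100, seed=0):
        """Estimate the alignment posterior by Gibbs sampling (sketch).

        cond_prob(j, i, a): unnormalized Pr(a_j = i | a_-j, e, f).
        Returns counts[j][i], how often French position j was aligned to English position i.
        """
        rng = random.Random(seed)
        I, J = len(e), len(f)
        a = [rng.randrange(I + 1) for _ in range(J)]      # random start; 0 means NULL
        counts = [[0] * (I + 1) for _ in range(J)]
        for _ in range(sweeps):
            for j in range(J):                            # cycle through every position
                weights = [cond_prob(j, i, a) for i in range(I + 1)]
                total = sum(weights)
                r, acc = rng.random() * total, 0.0
                for i, w in enumerate(weights):
                    acc += w
                    if r <= acc:
                        a[j] = i
                        break
            for j in range(J):
                counts[j][a[j]] += 1                      # accumulate one sample per sweep
        return counts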

34 2.2 MAP inference with dual decomposition Dual decomposition, also known as Lagrangian relaxation, is a method for solving complex combinatorial optimization problems (Rush and Collins, 2012). [sent-77, score-0.575]

35 These complex problems are separated into distinct components with tractable MAP inference procedures. [sent-78, score-0.081]

36 The subproblems are repeatedly solved with some communication over consistency until a consistent and globally optimal solution is found. [sent-79, score-0.072]

37 Here we are interested in the problem of finding the most likely alignment of a sentence pair e, f. [sent-80, score-0.168]

38 Thus, we need to solve the combinatorial optimization problem arg max_a Pr(f, a|e). [sent-81, score-0.121]

39 Note how we've split the optimization into two portions, setting aside the constant p(J|I) term. [sent-88, score-0.062]

40 The first captures fertility as well as some component of the translation distribution, and the second captures the jump distribution and the remainder of the translation distribution. [sent-89, score-0.851]

41 Define ya as ya(i, j) = 1 if aj = i, and 0 otherwise. [sent-91, score-0.396]

42 Let z ∈ {0, 1}^{I×J} be a binary matrix. (Figure 1 begins by initializing u(0)(i, j) := 0 for all i, j.) [sent-92, score-0.047]

43 (Figure 1 pseudocode) z(k) := argmax_z g(z) − Σ_{i,j} u(k−1)(i, j) z(i, j); if ya = z then return a(k); else u(k)(i, j) := u(k−1)(i, j) + δk (z(k)(i, j) − ya(k)(i, j)); end for; return a(K). [sent-102, score-0.273]

44 Figure 1: The dual decomposition algorithm for the fertility HMM, where δk is the step size at the kth iteration for 1 ≤ k ≤ K, and K is the max number of iterations. [sent-104, score-1.17]

45 We use the dual decomposition algorithm from Rush and Collins (2012), reproduced here in Figure 1. [sent-117, score-0.488]
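
A minimal sketch of the Figure 1 loop follows. The two subproblem solvers are assumed interfaces: hmm_argmax(u) returns the Viterbi alignment of the HMM component with the Lagrangian terms u(i, j) added to its scores, and fertility_argmax(u) returns the binary matrix maximizing the fertility component with those terms subtracted. Names, the step-size schedule, and the NULL handling are illustrative, not taken from the paper.

    def dual_decompose(I, J, hmm_argmax, fertility_argmax, K=40, step_size=lambda k: 1.0 / k):
        """Subgradient dual decomposition for the fertility HMM (sketch).

        hmm_argmax(u):        argmax_a f(a) + sum_ij u[i][j] * y_a(i, j)
        fertility_argmax(u):  argmax_z g(z) - sum_ij u[i][j] * z[i][j]
        Returns an alignment; it is provably optimal if the two subproblems agree.
        """
        u = [[0.0] * (J + 1) for _ in range(I + 1)]        # Lagrange multipliers u(i, j), start at 0
        a = None
        for k in range(1, K + 1):
            a = hmm_argmax(u)                              # a[j] in 0..I for j = 1..J
            z = fertility_argmax(u)                        # z[i][j] in {0, 1}
            y = [[0] * (J + 1) for _ in range(I + 1)]      # y_a(i, j) = 1 iff a[j] == i
            for j in range(1, J + 1):
                y[a[j]][j] = 1
            if all(y[i][j] == z[i][j] for i in range(1, I + 1) for j in range(1, J + 1)):
                return a                                   # agreement: certificate of exactness
            d = step_size(k)
            for i in range(1, I + 1):
                for j in range(1, J + 1):
                    u[i][j] += d * (z[i][j] - y[i][j])     # push the two solutions toward agreement
        return a                                           # fall back to the last HMM solution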

46 Note how the Lagrangian adds one additional term per word, scaled by a value indicating whether that word is aligned in the current position. [sent-118, score-0.033]

47 The g function, on the other hand, does not have a commonly used decomposition structure. [sent-136, score-0.208]

48 Luckily we can factor this maximization into pieces that allow for efficient computation. [sent-137, score-0.034]

49 Unlike the HMM, where each French word must have exactly one English generator, this maximization allows each [sent-139, score-0.067]

50 component of the French word to have zero or many generators. [sent-144, score-0.033]

51 Because assignments on which this model and the HMM agree will satisfy the HMM's constraints, the overall dual decomposition algorithm will return valid assignments, even though individual selections for this model may fail to meet the requirements. [sent-145, score-0.654]

52 As the scoring function g can be decomposed into a sum of scores for each row, Σi gi (i. [sent-146, score-0.117]

53 e., there are no interactions between distinct rows of the matrix) we can maximize each row independently: max_z Σ_{i=1}^{I} gi(zi) = Σ_{i=1}^{I} max_{zi} gi(zi). Within each row, we seek the best of all 2^J possible configurations. [sent-148, score-0.06]

54 These configurations may be grouped into equivalence classes based on the number of non-zero entries. [sent-149, score-0.044]

55 In each class, the max assignment is the one using words with the highest log probabilities; the total score of this assignment is the sum of those log probabilities and the log probability of that fertility. [sent-150, score-0.44]

56 Sorting the scores of each cell in the row in descending order by log probability allows for linear time computation of the max for each row. [sent-151, score-0.277]

57 The algorithm described in Figure 2 finds this maximal assignment in O(IJ log J) time, generally faster than the O(I²J) time used by Viterbi. [sent-152, score-0.147]
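
A sketch of that per-row maximization: for one English word, sort its (Lagrangian-adjusted) per-cell log scores, then compare one candidate per fertility class. The inputs cell_logp (adjusted log translation scores for the row) and fert_logp (log fertility probabilities for the word, assumed available up to fertility J) are hypothetical names, not the paper's notation.

    def best_row_assignment(cell_logp, fert_logp):
        """Maximize g_i over all 2^J subsets of one row in O(J log J) (sketch).

        cell_logp[j]: adjusted log score for letting this English word generate French word j
        fert_logp[n]: log probability that this English word has fertility n
        Returns (best score, set of chosen French positions).
        """
        J = len(cell_logp)
        order = sorted(range(J), key=lambda j: cell_logp[j], reverse=True)  # best cells first
        best_score, best_n = fert_logp[0], 0          # fertility 0: generate nothing
        prefix = 0.0
        for n in range(1, J + 1):                     # equivalence class with exactly n non-zeros
            prefix += cell_logp[order[n - 1]]         # the n best cells maximize within the class
            if prefix + fert_logp[n] > best_score:
                best_score, best_n = prefix + fert_logp[n], n
        return best_score, set(order[:best_n])

Running this once per English word (row) gives the O(IJ log J) total quoted above.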

58 We note in passing that this maximizer is picking from an unconstrained set of binary matrices. [sent-153, score-0.047]

59 Since each English word may generate as many French words as it likes, regardless of all other words in the sentence, the underlying matrix may have many more or many fewer non-zero entries than there are French words. [sent-154, score-0.033]

60 A straightforward extension to the algorithm of Figure 2 returns only z matrices with exactly J nonzero entries. [sent-155, score-0.03]

61 Rather than maximizing each row totally independently, we keep track of the best configurations for each number of words generated in each row, and then pick the best combination that sums to J: another straightforward exercise in dynamic programming. [sent-156, score-0.209]

62 This refinement does not change the correctness of the dual decomposition algorithm; rather it speeds the convergence. [sent-157, score-0.458]
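
One way to realize that refinement is a small dynamic program over rows, sketched below under the same assumptions as the per-row maximizer above: row_scores[i][n] is assumed to hold the best score for row i when it generates exactly n French words.

    def best_with_exactly_J(row_scores, J):
        """Pick one fertility class per row so the generated-word counts sum to J (sketch)."""
        NEG = float("-inf")
        dp = [NEG] * (J + 1)                          # dp[t]: best score with t words generated so far
        dp[0] = 0.0
        for scores in row_scores:                     # fold in one English word (row) at a time
            new_dp = [NEG] * (J + 1)
            for used in range(J + 1):
                if dp[used] == NEG:
                    continue
                for n, s in enumerate(scores):
                    if used + n <= J and dp[used] + s > new_dp[used + n]:
                        new_dp[used + n] = dp[used] + s
            dp = new_dp
        return dp[J]                                  # -inf if no combination sums to exactly J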

63 3 Fertility distribution parameters Original IBM models used a categorical distribution of fertility, one such distribution for each English word. [sent-158, score-0.174]

64 This gives EM a great amount of freedom in parameter estimation, with no smoothing or parameter tying, even for rare words. [sent-159, score-0.11]

65 Prior work addressed this by using the single parameter Poisson distribution, forcing infrequent words to share a global parameter estimated from the fertility of all words in the corpus (Zhao and Gildea, 2010). [sent-160, score-0.678]

66 Prior work has explored feature-rich approaches to modeling the translation distribution (Berg-Kirkpatrick et al. [sent-162, score-0.104]

67 , 2010); we use the same technique, but only for the fertility model. [sent-163, score-0.568]

68 The fertility distribution is modeled as a log-linear distribution of F, a binary feature set: p(φ|e) ∝ exp (θ · F(e, φ)). [sent-164, score-0.731]

69 We include a simple set of features: • A binary indicator for each fertility φ. [sent-165, score-0.658]

70 This feature is present for all words, acting as smoothing. [sent-166, score-0.031]

71 • A binary indicator for each word id and fertility, if the word occurs more than 10 times. [sent-167, score-0.152]

72 • A binary indicator for each word length (in letters) and fertility. [sent-168, score-0.152]

73 • A binary indicator for each four-letter word prefix and fertility. [sent-169, score-0.123]

74 Together these features produce a model that can learn a reasonable fertility distribution not only for common words, but also for rare words. [sent-170, score-0.116]

75 Including word length information helps for languages with compounding: long words in one language may correspond to multiple words in the other. [sent-171, score-0.033]
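
A small sketch of the feature set and the resulting log-linear fertility probability. The feature-name strings and the theta weight dictionary are illustrative; only the four indicator families come from the description above, and the fertility cutoff is an assumption.

    import math

    def fertility_features(word, phi, word_count):
        """Binary indicator features F(e, phi) for the log-linear fertility model (sketch)."""
        feats = {f"fert={phi}": 1.0}                              # fires for all words, acts as smoothing
        if word_count > 10:                                       # word-id feature only for frequent words
            feats[f"word={word}|fert={phi}"] = 1.0
        feats[f"len={len(word)}|fert={phi}"] = 1.0                # word length (in letters) and fertility
        feats[f"prefix={word[:4]}|fert={phi}"] = 1.0              # four-letter prefix and fertility
        return feats

    def fertility_prob(word, phi, theta, word_count, max_fert=10):
        """p(phi | e) proportional to exp(theta . F(e, phi)), normalized over 0..max_fert."""
        def score(n):
            return sum(theta.get(k, 0.0) * v
                       for k, v in fertility_features(word, n, word_count).items())
        logits = [score(n) for n in range(max_fert + 1)]
        m = max(logits)                                           # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        return exps[phi] / sum(exps)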

76 4 Evaluation We explore the impact of this improved MAP inference procedure on a task in German-English word alignment. [sent-180, score-0.114]

77 For training data we use the news commentary data from the WMT 2012 translation task. [sent-181, score-0.046]

78 120 of the training sentences were manually annotated with word alignments. [sent-182, score-0.033]

79 The first line is a baseline HMM using exact posterior computation and inference with the standard dynamic programming algorithms. [sent-184, score-0.275]

80 The next line shows the fertility HMM with approximate posterior computation from Gibbs sampling but with final alignment selected by the Viterbi algorithm. [sent-185, score-0.861]

81 The prior work compared Viterbi with a form of local search (sampling repeatedly and keeping the max), finding little difference between the two (Zhao and Gildea, 2010). [sent-187, score-0.143]

82 Here, however, the difference between dual decomposition and Viterbi is significant: their results were likely due to search error. [sent-188, score-0.458]

83 5 Conclusions and future work We have introduced a dual decomposition approach to alignment inference that substantially reduces alignment error. [sent-189, score-0.875]

84 Unfortunately the algorithm is rather slow to converge: after 40 iterations of the dual decomposition, still only 55 percent of the test sentences have converged. [sent-190, score-0.28]

85 We are exploring improvements to the simple sub-gradient method applied here in hopes of finding faster convergence, fast enough to make this algorithm practical. [sent-191, score-0.087]

86 Alternate parameter estimation techniques appear promising given the improvements of dual decomposition over sampling. [sent-192, score-0.557]

87 Once the performance issues of this algorithm are addressed, exploring hard EM or some variant thereof might lead to more substantial improvements. [sent-193, score-0.065]

88 Using word-dependent transition models in HMM-based word alignment for statistical machine translation. [sent-212, score-0.23]

89 A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. [sent-227, score-0.623]

90 Why initialization matters for IBM Model 1: Multiple optima and non-strict convexity. [sent-231, score-0.094]

91 A fast fertility hidden Markov model for word alignment using MCMC. [sent-240, score-0.817]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fertility', 0.568), ('aj', 0.267), ('dual', 0.25), ('hmm', 0.236), ('decomposition', 0.208), ('logp', 0.188), ('alignment', 0.168), ('eaj', 0.155), ('fj', 0.147), ('ya', 0.129), ('french', 0.129), ('ei', 0.113), ('max', 0.085), ('inference', 0.081), ('yj', 0.078), ('zhao', 0.077), ('gildea', 0.074), ('repeatedly', 0.072), ('pr', 0.071), ('argmaxa', 0.07), ('zi', 0.069), ('rush', 0.067), ('jump', 0.065), ('viterbi', 0.065), ('log', 0.064), ('xi', 0.062), ('ibm', 0.06), ('row', 0.06), ('distribution', 0.058), ('posterior', 0.057), ('sum', 0.057), ('parameter', 0.055), ('yp', 0.054), ('arg', 0.054), ('violates', 0.054), ('map', 0.054), ('assignment', 0.053), ('markov', 0.048), ('lagrangian', 0.047), ('binary', 0.047), ('translation', 0.046), ('exact', 0.045), ('configurations', 0.044), ('estimation', 0.044), ('indicator', 0.043), ('prior', 0.043), ('positional', 0.042), ('return', 0.041), ('sampling', 0.04), ('gibbs', 0.04), ('sums', 0.04), ('descending', 0.04), ('relaxation', 0.037), ('dynamic', 0.036), ('combinatorial', 0.036), ('substantial', 0.035), ('brown', 0.035), ('vogel', 0.034), ('initialization', 0.034), ('captures', 0.034), ('maximization', 0.034), ('state', 0.034), ('della', 0.033), ('word', 0.033), ('meet', 0.033), ('otog', 0.031), ('wiser', 0.031), ('jf', 0.031), ('concave', 0.031), ('luckily', 0.031), ('maxa', 0.031), ('mve', 0.031), ('pjj', 0.031), ('recovers', 0.031), ('thoete', 0.031), ('tshtaen', 0.031), ('vz', 0.031), ('wastes', 0.031), ('xjj', 0.031), ('yis', 0.031), ('pietra', 0.031), ('end', 0.031), ('algorithm', 0.03), ('assignments', 0.03), ('transition', 0.029), ('microsoft', 0.029), ('ayn', 0.029), ('cpoumtaptiuotna', 0.029), ('gauslt', 0.029), ('tilo', 0.029), ('dard', 0.029), ('aifr', 0.029), ('exercise', 0.029), ('hopes', 0.029), ('otifo', 0.029), ('poisson', 0.029), ('selections', 0.029), ('programming', 0.028), ('finding', 0.028), ('computation', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

2 0.22970404 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

Author: Thomas Schoenemann

Abstract: We derive variants of the fertility-based models IBM-3 and IBM-4 that, while maintaining their zero and first order parameters, are nondeficient. Subsequently, we proceed to derive a method to compute a likely alignment and its neighbors as well as give a solution of EM training. The arising M-step energies are non-trivial and handled via projected gradient ascent. Our evaluation on gold alignments shows substantial improvements (in weighted F-measure) for the IBM-3. For the IBM-4 there are no consistent improvements. Training the nondeficient IBM-5 in the regular way gives surprisingly good results. Using the resulting alignments for phrase-based translation systems offers no clear insights w.r.t. BLEU scores.

3 0.20640112 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

Author: Mengqiu Wang ; Wanxiang Che ; Christopher D. Manning

Abstract: Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.

4 0.17584033 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data—our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report for the first time—BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

5 0.15243146 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference

Author: Yang Feng ; Trevor Cohn

Abstract: Most modern machine translation systems use phrase pairs as translation units, allowing for accurate modelling of phrase-internal translation and reordering. However phrase-based approaches are much less able to model sentence level effects between different phrase-pairs. We propose a new model to address this imbalance, based on a word-based Markov model of translation which generates target translations left-to-right. Our model encodes word and phrase level phenomena by conditioning translation decisions on previous decisions and uses a hierarchical Pitman-Yor Process prior to provide dynamic adaptive smoothing. This mechanism implicitly supports not only traditional phrase pairs, but also gapping phrases which are non-consecutive in the source. Our experiments on Chinese to English and Arabic to English translation show consistent improvements over competitive baselines, of up to +3.4 BLEU.

6 0.15241854 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network

7 0.12443121 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

8 0.12360664 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

9 0.12088507 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

10 0.12067223 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

11 0.12030669 237 acl-2013-Margin-based Decomposed Amortized Inference

12 0.1199669 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

13 0.11906013 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

14 0.10958961 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation

15 0.092343606 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages

16 0.091303341 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

17 0.087391086 323 acl-2013-Simpler unsupervised POS tagging with bilingual projections

18 0.086621664 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

19 0.079228997 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

20 0.07752455 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, -0.084), (2, 0.057), (3, 0.047), (4, -0.01), (5, -0.003), (6, 0.007), (7, -0.025), (8, -0.124), (9, -0.055), (10, 0.04), (11, -0.204), (12, -0.032), (13, -0.222), (14, 0.021), (15, -0.043), (16, 0.088), (17, 0.028), (18, 0.107), (19, -0.116), (20, -0.051), (21, 0.06), (22, -0.02), (23, 0.054), (24, -0.067), (25, 0.046), (26, -0.167), (27, 0.019), (28, 0.054), (29, -0.019), (30, 0.059), (31, -0.016), (32, 0.01), (33, 0.098), (34, -0.021), (35, 0.076), (36, 0.007), (37, 0.019), (38, 0.109), (39, 0.094), (40, -0.021), (41, -0.035), (42, -0.029), (43, -0.09), (44, 0.044), (45, -0.032), (46, -0.007), (47, 0.07), (48, -0.067), (49, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94563967 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

2 0.76790047 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

Author: Thomas Schoenemann

Abstract: We derive variants of the fertility-based models IBM-3 and IBM-4 that, while maintaining their zero and first order parameters, are nondeficient. Subsequently, we proceed to derive a method to compute a likely alignment and its neighbors as well as give a solution of EM training. The arising M-step energies are non-trivial and handled via projected gradient ascent. Our evaluation on gold alignments shows substantial improvements (in weighted F-measure) for the IBM-3. For the IBM-4 there are no consistent improvements. Training the nondeficient IBM-5 in the regular way gives surprisingly good results. Using the resulting alignments for phrase-based translation systems offers no clear insights w.r.t. BLEU scores.

3 0.746719 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

Author: Mengqiu Wang ; Wanxiang Che ; Christopher D. Manning

Abstract: Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.

4 0.7342844 237 acl-2013-Margin-based Decomposed Amortized Inference

Author: Gourab Kundu ; Vivek Srikumar ; Dan Roth

Abstract: Given that structured output prediction is typically performed over entire datasets, one natural question is whether it is possible to re-use computation from earlier inference instances to speed up inference for future instances. Amortized inference has been proposed as a way to accomplish this. In this paper, first, we introduce a new amortized inference algorithm called the Margin-based Amortized Inference, which uses the notion of structured margin to identify inference problems for which previous solutions are provably optimal. Second, we introduce decomposed amortized inference, which is designed to address very large inference problems, where earlier amortization methods become less effective. This approach works by decomposing the output structure and applying amortization piece-wise, thus increasing the chance that we can re-use previous solutions for parts of the output structure. These parts are then combined to a global coherent solution using Lagrangian relaxation. In our experiments, using the NLP tasks of semantic role labeling and entity-relation extraction, we demonstrate that with the margin-based algorithm, we need to call the inference engine only for a third of the test examples. Further, we show that the decomposed variant of margin-based amortized inference achieves a greater reduction in the number of inference calls.

5 0.7123273 382 acl-2013-Variational Inference for Structured NLP Models

Author: David Burkett ; Dan Klein

Abstract: unkown-abstract

6 0.58353692 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

7 0.57729 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

8 0.57381409 15 acl-2013-A Novel Graph-based Compact Representation of Word Alignment

9 0.56590712 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

10 0.52933443 9 acl-2013-A Lightweight and High Performance Monolingual Word Aligner

11 0.51110303 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation

12 0.50766814 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

13 0.5074293 25 acl-2013-A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration

14 0.50024527 269 acl-2013-PLIS: a Probabilistic Lexical Inference System

15 0.47726145 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

16 0.46648663 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network

17 0.46468309 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

18 0.4640041 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

19 0.44493079 370 acl-2013-Unsupervised Transcription of Historical Documents

20 0.44129118 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.065), (6, 0.448), (11, 0.05), (24, 0.027), (26, 0.049), (35, 0.064), (40, 0.011), (42, 0.039), (48, 0.045), (69, 0.01), (70, 0.039), (88, 0.015), (90, 0.018), (95, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91670334 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

2 0.90624869 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on human-annotated datasets, which are very costly due to their task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors.

3 0.90380639 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

Author: Dehong Gao ; Wenjie Li ; Renxian Zhang

Abstract: The growth of the Web 2.0 technologies has led to an explosion of social networking media sites. Among them, Twitter is the most popular service by far due to its ease for realtime sharing of information. It collects millions of tweets per day and monitors what people are talking about in the trending topics updated timely. Then the question is how users can understand a topic in a short time when they are frustrated with the overwhelming and unorganized tweets. In this paper, this problem is approached by sequential summarization which aims to produce a sequential summary, i.e., a series of chronologically ordered short subsummaries that collectively provide a full story about topic development. Both the number and the content of sub-summaries are automatically identified by the proposed stream-based and semantic-based approaches. These approaches are evaluated in terms of sequence coverage, sequence novelty and sequence correlation and the effectiveness of their combination is demonstrated.

4 0.88757837 145 acl-2013-Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks

Author: Jose G.C. de Souza ; Miquel Espla-Gomis ; Marco Turchi ; Matteo Negri

Abstract: The use of automatic word alignment to capture sentence-level semantic relations is common to a number of cross-lingual NLP applications. Despite its proved usefulness, however, word alignment information is typically considered from a quantitative point of view (e.g. the number of alignments), disregarding qualitative aspects (the importance of aligned terms). In this paper we demonstrate that integrating qualitative information can bring significant performance improvements with negligible impact on system complexity. Focusing on the cross-lingual textual entailment task, we contribute with a novel method that: i) significantly outperforms the state of the art, and ii) is portable, with limited loss in performance, to language pairs where training data are not available.

5 0.83636755 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

6 0.82593787 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

7 0.80737954 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

8 0.59014988 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

9 0.57195312 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

10 0.56940073 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

11 0.56501251 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

12 0.5584718 333 acl-2013-Summarization Through Submodularity and Dispersion

13 0.55799699 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

14 0.54977924 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

15 0.54677433 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

16 0.54410338 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

17 0.54204494 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

18 0.53761727 176 acl-2013-Grounded Unsupervised Semantic Parsing

19 0.53207022 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

20 0.51654619 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations