acl acl2010 acl2010-24 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell
Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.
Reference: text
sentIndex sentText sentNum sentScore
1 Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA Abstract Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. [sent-5, score-1.235]
2 Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. [sent-6, score-1.885]
3 Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner. [sent-7, score-1.758]
4 1 Introduction Corpus-based approaches to machine translation have become predominant, with phrase-based statistical machine translation (PB-SMT) (Koehn et al. [sent-8, score-0.158]
5 Parameters of these alignment models are learnt in an unsupervised manner using the EM algorithm over sentence-level aligned parallel corpora. [sent-12, score-0.604]
6 Increased parallel data enables better estimation of the model parameters, but a large number of language pairs still lack such resources. [sent-14, score-0.132]
7 The second is to use extra annotation, typically word-level human alignment for some sentence pairs, in conjunction with the parallel data to learn alignment in a semi-supervised manner. [sent-17, score-1.113]
8 Our research is in the direction of the latter, and aims to reduce the effort involved in hand-generation of word alignments by using active learning strategies for careful selection of word pairs to seek alignment. [sent-18, score-1.042]
9 In this paper we explore active learning for word alignment, where the input to the active learner is a sentence pair (S, T) and the annotation elicited from a human is a set of links {aij, ∀si ∈ S, tj ∈ T}. [sent-21, score-1.104]
10 Previous approaches require elicitation of a full alignment for the sentence pair, which could be effort-intensive. [sent-23, score-0.506]
11 We propose active learning query strategies to selectively elicit partial alignment information. [sent-24, score-0.649]
12 Experiments in Section 5 show that our selection strategies reduce alignment error rates significantly over the baseline. [sent-25, score-0.749]
13 Fraser and Marcu (2006) pose the problem of alignment as a search problem in log-linear space with features coming from the IBM alignment models. [sent-27, score-1.012]
14 They propose a semisupervised training algorithm which alternates between discriminative error training on the labeled data to learn the weighting parameters and maximum-likelihood EM training on unlabeled data to estimate the parameters. [sent-31, score-0.133]
15 Callison-Burch et al. (2004) also improve alignment by interpolating human alignments with automatic alignments. [sent-33, score-0.691]
16 They observe that while working with such data sets, alignments of higher quality should be given a much higher weight than the lower-quality alignments. [sent-34, score-0.184]
17 To our knowledge, there is no prior work that has looked at reducing human effort by selective elicitation of partial word alignment using active learning techniques. [sent-38, score-1.197]
18 Several studies (Tong and Koller, 2002; Nguyen and Smeulders, 2004; Donmez and Carbonell, 2008) show that active learning greatly helps to reduce the labeling effort in various classification tasks. [sent-41, score-0.449]
19 1 Active Learning Setup We discuss our active learning setup for word alignment in Algorithm 1. [sent-43, score-0.915]
20 We start with an unlabeled dataset U = {(Sk, Tk)}, indexed by k, and a seed pool of partial alignment links A0 = {aikj, ∀si ∈ Sk, tj ∈ Tk}. [sent-44, score-0.679]
21 We take a pool-based active learning strategy, where we have access to all the automatically aligned links and we can score the links based on our active learning query strategy. [sent-47, score-1.317]
22 The query strategy uses the automatically trained alignment model Mt from current iteration t for scoring the links. [sent-48, score-0.782]
23 Re-training and re-tuning an SMT system one link at a time is computationally infeasible. [sent-49, score-0.202]
24 We therefore perform batch learning by selecting a set of N links scored highest by our query strategy. [sent-50, score-0.531]
25 We seek manual corrections for the selected links and add the alignment data to the current labeled data set. [sent-51, score-0.973]
26 The word-level aligned labeled data is provided to our semi-supervised word alignment algorithm for training an alignment model Mt+1 over U. [sent-52, score-1.12]
27 In a more typical scenario, since reducing human effort or cost of elicitation is the objective, we iterate until the available budget is exhausted. [sent-54, score-0.271]
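A minimal Python sketch of this pool-based loop (Algorithm 1 itself is not reproduced in this dump; the names train_model, score_links, and elicit_correction are hypothetical placeholders for the semi-supervised aligner, the query strategy, and the human annotator):

```python
import heapq

def active_alignment_loop(corpus, train_model, score_links,
                          elicit_correction, budget, batch_size):
    """Pool-based active learning for word alignment (sketch of the setup).

    corpus -- the unlabeled pool U of (source, target) sentence pairs
    """
    labeled_links = set()                       # A_0: seed pool of manual links
    model = train_model(corpus, labeled_links)  # M_0 from unlabeled data only
    spent = 0
    while spent < budget:
        # Score every automatically aligned link under the current model M_t.
        scored = score_links(model, corpus)     # list of (score, link) pairs
        # Batch selection: the N links scored highest by the query strategy.
        batch = heapq.nlargest(batch_size, scored, key=lambda x: x[0])
        for _, link in batch:
            labeled_links.add(elicit_correction(link))  # manual correction
            spent += 1
        # Retrain the semi-supervised aligner M_{t+1} over U with constraints.
        model = train_model(corpus, labeled_links)
    return model
```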
28 Manual alignments are incorporated in the EM training phase of these models as constraints that restrict the summation over all possible alignment paths. [sent-57, score-0.711]
29 The manual alignments allow for one-to-many alignments and many-to-many alignments in both directions. [sent-59, score-0.544]
30 Therefore, the restriction of the alignment paths reduces to restricting the summation in EM. [sent-62, score-0.564]
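As a rough illustration, here is a sketch of an E-step for a simplified IBM-Model-1-style aligner in which elicited links restrict the source positions summed over; the paper's aligner is more general, so this simplification is an assumption of the sketch:

```python
def constrained_estep(t_table, src, tgt, manual_links, eps=1e-9):
    """Link posteriors with manual alignments as EM constraints (sketch).

    t_table      -- dict mapping (source_word, target_word) to p(t|s)
    manual_links -- set of (i, j) position pairs elicited from the annotator
    """
    posteriors = {}
    for j, t_word in enumerate(tgt):
        # If the annotator linked target position j, restrict the summation
        # over alignment paths to the manually allowed source positions.
        allowed = [i for (i, jj) in manual_links if jj == j]
        positions = allowed if allowed else range(len(src))
        z = sum(t_table.get((src[i], t_word), eps) for i in positions)
        for i in positions:
            posteriors[(i, j)] = t_table.get((src[i], t_word), eps) / z
    return posteriors
```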
31 4 Query Strategies for Link Selection We propose multiple query selection strategies for our active learning setup. [sent-63, score-0.661]
32 The scoring criterion is designed to select alignment links across sentence pairs that are highly uncertain under the current automatic translation models. [sent-64, score-1.074]
33 These links are difficult for automatic alignment to align correctly and will cause incorrect phrase pairs to be extracted in the translation model, in turn hurting the translation quality of the SMT system. [sent-65, score-0.971]
34 Manual correction of such links produces the maximal benefit to the model. [sent-66, score-0.311]
35 We would ideally like to elicit the least number of manual corrections possible in order to reduce the cost of data acquisition. [sent-67, score-0.282]
36 In this section we discuss our link selection strategies based on the standard active learning paradigm of ‘uncertainty sampling’ (Lewis and Catlett, 1994). [sent-68, score-0.735]
37 We use the automatically trained translation model θt, which consists of bidirectional translation lexicon tables computed from the bidirectional alignments, to score each link for uncertainty. [sent-69, score-0.549]
38 1 Uncertainty Sampling: Bidirectional Alignment Scores The automatic Viterbi alignment produced by the alignment models is used to obtain translation lexicons. [sent-71, score-1.091]
39 These lexicons capture the conditional distributions of source-given-target P(s/t) and target-given-source P(t/s) probabilities at the word level where si ∈ S and tj ∈ T. [sent-72, score-0.151]
40 We define the certainty of a link as the harmonic mean of the bidirectional probabilities. [sent-73, score-0.202]
41 The selection strategy selects the least scoring links according to the formula below, which corresponds to the links with maximum uncertainty: Score(aij/s1I, t1J) = 2 · P(tj/si) · P(si/tj) / (P(tj/si) + P(si/tj)). [sent-74, score-0.733]
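In code, the harmonic-mean certainty and the selection of the least certain links might look as follows; representing the lexicon tables as plain dictionaries is an assumption of this sketch:

```python
def link_certainty(p_t_given_s, p_s_given_t, s_i, t_j, eps=1e-12):
    """Harmonic mean of the bidirectional lexicon probabilities for link a_ij.

    Low scores mean high uncertainty, so the query strategy selects the
    least scoring links for manual correction.
    """
    p_ts = p_t_given_s.get((t_j, s_i), eps)  # P(t_j | s_i)
    p_st = p_s_given_t.get((s_i, t_j), eps)  # P(s_i | t_j)
    return 2.0 * p_ts * p_st / (p_ts + p_st)

def most_uncertain_links(links, p_t_given_s, p_s_given_t, n):
    """Return the n (s_i, t_j) links with the lowest bidirectional certainty."""
    return sorted(links,
                  key=lambda l: link_certainty(p_t_given_s, p_s_given_t, *l))[:n]
```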
42 Given a sentence pair (s1I, t1J) and its word alignment, we compute two confidence metrics at the alignment link level based on the posterior link probability, as seen in Equation 5. [sent-77, score-1.076]
43 We select the alignment links that the initial word aligner is least confident about, according to our metric, and seek manual correction of those links. [sent-78, score-1.001]
44 Targeting some of the uncertain parts of word alignment has already been shown to improve translation quality in SMT (Huang, 2009). [sent-80, score-0.718]
45 We use confidence metrics as an active learning sampling strategy to obtain the most informative links. [sent-81, score-0.78]
46 We also experimented with other confidence metrics as discussed in (Ueffing and Ney, 2007), especially the IBM 1 model score metric, but it did not show significant improvement in this task. [sent-82, score-0.13]
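The referenced Equation 5 is not reproduced in this text, so the sketch below uses one common normalisation of the lexicon score over competing source positions purely as an illustrative stand-in for the posterior link probability:

```python
def posterior_link_confidence(t_table, src, tgt, i, j, eps=1e-12):
    """Illustrative posterior link probability (NOT the paper's Equation 5):
    P(a_ij | s, t) ~ p(t_j | s_i) / sum over i' of p(t_j | s_i').
    """
    num = t_table.get((src[i], tgt[j]), eps)
    den = sum(t_table.get((s_word, tgt[j]), eps) for s_word in src)
    return num / den
```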
47 3 Query by Committee The generative alignments produced differ based on the choice of direction of the language pair. [sent-84, score-0.217]
48 We use As2t to denote alignment in the source to target direction and At2s to denote the target to source direction. [sent-85, score-0.542]
49 We consider these alignments to be two experts that have two different views of the alignment process. [sent-86, score-0.691]
50 We formulate our query strategy to select links on which these two alignments disagree. [sent-87, score-0.48]
51 In general, query by committee is a standard sampling strategy in active learning (Freund et al. [sent-88, score-0.194]
52 We formulate a query by committee sampling strategy for word alignment as shown in Equation 6. [sent-90, score-1.056]
53 In order to break ties, we extend this approach to select the link with higher average frequency of occurrence of words involved in the link. [sent-91, score-0.28]
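A sketch of this committee strategy over the two directional alignments, with the frequency-based tie-breaking; representing links as word pairs and supplying a corpus frequency table freq are assumptions of the sketch:

```python
def qbc_select(a_s2t, a_t2s, freq, n):
    """Query-by-committee over the two directional aligners (sketch).

    a_s2t, a_t2s -- sets of (source_word, target_word) links from the
                    source-to-target and target-to-source alignments
    freq         -- corpus frequency of each word, used to break ties
    """
    disagreements = a_s2t ^ a_t2s   # links proposed by only one "expert"

    def avg_frequency(link):
        s_word, t_word = link
        return (freq.get(s_word, 0) + freq.get(t_word, 0)) / 2.0

    # Prefer disagreement links whose words occur more frequently.
    return sorted(disagreements, key=avg_frequency, reverse=True)[:n]
```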
54 4 Margin Sampling The strategy for confidence-based sampling only considers information about the best scoring link conf(aij/S, T). [sent-93, score-0.639]
55 However we could benefit from information about the second best scoring link as well. [sent-94, score-0.279]
56 Our technique is motivated by margin sampling (Scheffer et al., 2001), where the difference between the probabilities assigned by the underlying model to the first best and second best labels is used as a sampling criterion. [sent-96, score-0.194]
57 Our margin technique is formulated below as Margin(si) = conf(a′ij/S, T) − conf(a″ij/S, T), where a′ij and a″ij are the potential first best and second best scoring alignment links for a word at position i in the source sentence S with translation T. [sent-98, score-1.033]
58 The word with minimum margin value is chosen for human alignment. [sent-99, score-0.136]
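A minimal sketch of the margin computation for one source word, assuming the confidences conf(aij/S, T) for its candidate links have already been computed:

```python
def margin_for_word(candidate_confidences):
    """Margin between the first and second best scoring alignment links for
    one source word; the word with the minimum margin is sent for annotation.

    candidate_confidences -- conf(a_ij | S, T) for every target position j
    """
    if len(candidate_confidences) < 2:
        return float("inf")   # no competing link, nothing ambiguous to ask
    best, second = sorted(candidate_confidences, reverse=True)[:2]
    return best - second
```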
59 1 Data Setup Our aim in this paper is to show that active learning can help select the most informative alignment links that have high uncertainty according to a given automatically trained model. [sent-102, score-1.235]
60 We also show that fixing such alignments leads to the maximum reduction of error in word alignment, as measured by AER. [sent-103, score-0.265]
61 We compare this with a baseline where links are selected at random for manual correction. [sent-104, score-0.342]
62 To run our experiments iteratively, we automate the setup by using a parallel corpus for which the gold-standard human alignment is already available. [sent-105, score-0.642]
63 We select the Chinese-English language pair, where we have access to 21,863 sentence pairs along with complete manual alignment. [sent-106, score-0.179]
64 We then use the learned model in running our link selection algorithm over the entire corpus to determine the most uncertain links according to each active learning strategy. [sent-109, score-0.983]
65 The links are then looked up in the gold-standard human alignment database and corrected. [sent-110, score-0.783]
66 In case a link is not present in the gold-standard data, we introduce a NULL alignment; else we propose the alignment as given in the gold standard. [Figure 1: Performance of active sampling strategies for link selection] [sent-111, score-1.637]
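The simulated annotator can be sketched as a lookup against the gold-standard link database; the particular NULL encoding below is an assumption of this sketch:

```python
def oracle_correction(link, gold_links):
    """Simulated manual correction of a queried link (sketch).

    Returns the link itself when the gold standard confirms it, and a NULL
    alignment for the source position otherwise.
    """
    i, j = link
    if (i, j) in gold_links:
        return (i, j)       # propose the alignment as in the gold standard
    return (i, None)        # NULL alignment: link absent from gold data
```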
67 We select the partial alignment as a set of alignment links and provide it to our semi-supervised word aligner. [sent-112, score-1.377]
68 We plot performance curves as the number of links used in each iteration vs. the resulting alignment error rate (AER). [sent-113, score-0.239]
69 Query by committee performs worse than random, indicating that two alignments differing in direction are not sufficient for deciding uncertainty. [sent-115, score-0.304]
70 We observe that confidence based metrics perform significantly better than the baseline. [sent-117, score-0.167]
71 From the scatter plots in Figure 1 we can say that using our best selection strategy one achieves similar performance to the baseline, but at a much lower cost of elicitation, assuming the cost per link is uniform. [sent-118, score-0.587]
72 We also perform end-to-end machine translation experiments to show that our improvement of alignment quality leads to an improvement of translation scores. [sent-119, score-0.664]
73 We first obtain the baseline score where no manual alignment was used. [sent-123, score-0.609]
74 We also train a configuration using gold-standard manual alignment data for the parallel corpus. [sent-124, score-0.166]
75 This is the maximum translation accuracy that we can achieve by any link selection algorithm. [sent-125, score-0.388]
76 We now take the best link selection criterion, which is the confidence-based sampling strategy. (Footnote 1: the X axis of Figure 1 has the number of links elicited on a log scale.) [Table: translation scores for the baseline, gold-standard human alignment, and active selection configurations] [sent-126, score-0.687]
77 Therefore we achieve 45% of the possible improvement by using only 20% of the elicitation effort. [sent-138, score-0.133]
78 3 Batch Selection Re-training the word alignment models after eliciting every individual alignment link is infeasible. [sent-140, score-1.308]
79 In our data set of 21,863 sentences with 588,075 links, it would be computationally intensive to retrain after eliciting even 100 links in a batch. [sent-141, score-0.297]
80 We therefore sample links as a discrete batch, and train alignment models to report performance at fixed points. [sent-142, score-0.745]
81 Such batch selection is necessarily sub-optimal, as the underlying model changes with every alignment link and therefore becomes ‘stale’ for future selections. [sent-143, score-0.979]
82 We observe that in some scenarios while fixing one alignment link could potentially fix all the mis-alignments in a sentence pair, our batch selection mechanism still samples from the rest of the links in the sentence pair. [sent-144, score-1.294]
83 We experimented with an exponential decay function over the number of links previously selected, in order to discourage repeated sampling from the same sentence pair. [sent-145, score-0.538]
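A sketch of greedy batch selection with such a decay; the exponential form and the rate lam are assumptions, since the exact function is not given here:

```python
import math
from collections import defaultdict

def select_batch_with_decay(scored_links, batch_size, lam=1.0):
    """Greedy batch selection that exponentially discounts the score of a
    link by the number of links already chosen from its sentence (sketch).

    scored_links -- list of (score, sentence_id, link) under strategy conf
    """
    chosen, per_sentence = [], defaultdict(int)
    pool = list(scored_links)
    while pool and len(chosen) < batch_size:
        # Re-rank so that heavily sampled sentence pairs lose priority.
        pool.sort(key=lambda x: x[0] * math.exp(-lam * per_sentence[x[1]]),
                  reverse=True)
        _score, sid, link = pool.pop(0)
        chosen.append(link)
        per_sentence[sid] += 1
    return chosen
```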
84 We performed an experiment by selecting one of our best performing selection strategies (conf) and ran it in both configurations - one with the decay parameter (batchdecay) and one without it (batch). [sent-146, score-0.3]
85 As seen in Figure 2, the decay function has an effect in the initial part of the curve where sampling is sparse but the effect gradually fades away as we observe more samples. [sent-147, score-0.336]
86 In the reported results we do not use batch decay, but an optimal estimation of ‘staleness’ could lead to better gains in batch link selection using active learning. [sent-148, score-1.01]
87 [Figure 2: Batch decay effects on Conf-posterior sampling strategy] 6 Conclusion and Future Work Word alignment is a particularly challenging problem and has been addressed in a completely unsupervised manner thus far (Brown et al. [sent-149, score-0.37]
88 While generative alignment models have been successful, lack of sufficient data, model assumptions, and local optima during training are well-known problems. [sent-151, score-0.575]
89 Semi-supervised techniques use partial manual alignment data to address some of these issues. [sent-152, score-0.657]
90 We have shown that active learning strategies can reduce the effort involved in eliciting human alignment data. [sent-153, score-1.175]
91 The reduction in effort is due to careful selection of maximally uncertain links that provide the most benefit to the alignment model when used in a semi-supervised training fashion. [sent-154, score-1.087]
92 In future we wish to work with word alignments for other language pairs like Arabic and English. [sent-156, score-0.217]
93 We have tested the feasibility of obtaining human word alignment data using Amazon Mechanical Turk and plan to obtain more data to reduce the cost of annotation. [sent-157, score-0.665]
94 The first author would like to thank Qin Gao for the semi-supervised word alignment software and help with running experiments. [sent-160, score-0.542]
95 Statistical machine translation with word- and sentence-aligned parallel corpora. [sent-176, score-0.142]
96 Soft syntactic constraints for word alignment through discriminative training. [sent-181, score-0.542]
97 Optimizing estimated loss reduction for active sampling in rank learning. [sent-186, score-0.575]
98 Parallel implementations of word alignment tool. [sent-213, score-0.542]
99 Support vector machine active learning with applications to text classification. [sent-265, score-0.338]
100 Boosting statistical word alignment using labeled and unlabeled data. [sent-275, score-0.633]
wordName wordTfidf (topN-words)
[('alignment', 0.506), ('active', 0.338), ('links', 0.239), ('link', 0.202), ('sampling', 0.194), ('batch', 0.164), ('alignments', 0.147), ('elicitation', 0.133), ('query', 0.128), ('committee', 0.121), ('selection', 0.107), ('decay', 0.105), ('manual', 0.103), ('fraser', 0.101), ('uncertain', 0.097), ('confidence', 0.095), ('strategies', 0.088), ('translation', 0.079), ('scoring', 0.077), ('aer', 0.075), ('strategy', 0.071), ('tj', 0.071), ('mt', 0.065), ('effort', 0.063), ('parallel', 0.063), ('uncertainty', 0.063), ('aikj', 0.062), ('donmez', 0.062), ('scheffer', 0.062), ('margin', 0.062), ('smt', 0.059), ('summation', 0.058), ('eliciting', 0.058), ('ueffing', 0.058), ('morristown', 0.057), ('bidirectional', 0.056), ('sk', 0.055), ('aug', 0.054), ('conf', 0.054), ('haffari', 0.054), ('unlabeled', 0.054), ('ibm', 0.052), ('tk', 0.052), ('aij', 0.05), ('vamshi', 0.05), ('nj', 0.05), ('partial', 0.048), ('reduce', 0.048), ('em', 0.048), ('informative', 0.047), ('corrections', 0.047), ('elicit', 0.047), ('qin', 0.047), ('gao', 0.046), ('elicited', 0.044), ('meteor', 0.044), ('lt', 0.044), ('si', 0.044), ('ney', 0.044), ('reduction', 0.043), ('semisupervised', 0.042), ('carbonell', 0.042), ('blatz', 0.042), ('select', 0.042), ('koehn', 0.042), ('seek', 0.041), ('jaime', 0.04), ('lavie', 0.04), ('nicola', 0.04), ('fixing', 0.039), ('human', 0.038), ('freund', 0.038), ('tong', 0.038), ('experts', 0.038), ('maximal', 0.038), ('cost', 0.037), ('labeled', 0.037), ('observe', 0.037), ('vogel', 0.036), ('involved', 0.036), ('direction', 0.036), ('marcu', 0.036), ('word', 0.036), ('estimation', 0.035), ('lewis', 0.035), ('selective', 0.035), ('nguyen', 0.035), ('metrics', 0.035), ('aligned', 0.035), ('setup', 0.035), ('assumptions', 0.035), ('iin', 0.034), ('pairs', 0.034), ('generative', 0.034), ('align', 0.034), ('stephan', 0.034), ('correction', 0.034), ('cherry', 0.033), ('careful', 0.032), ('reached', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment
Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell
Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.
2 0.36561239 133 acl-2010-Hierarchical Search for Word Alignment
Author: Jason Riesa ; Daniel Marcu
Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
3 0.35356748 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou
Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better transla- tion performance.
4 0.32651657 170 acl-2010-Letter-Phoneme Alignment: An Exploration
Author: Sittichai Jiampojamarn ; Grzegorz Kondrak
Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.
5 0.30380481 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation
Author: John DeNero ; Dan Klein
Abstract: We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principal advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.
6 0.25312909 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out
7 0.23682754 262 acl-2010-Word Alignment with Synonym Regularization
8 0.23645023 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation
9 0.19125463 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment
10 0.18786886 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels
11 0.1800935 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
12 0.14977269 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
13 0.12945917 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data
14 0.11764578 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
15 0.11569145 194 acl-2010-Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning
16 0.11508211 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation
17 0.11390848 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models
18 0.11159629 54 acl-2010-Boosting-Based System Combination for Machine Translation
19 0.10437876 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
20 0.1022153 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
topicId topicWeight
[(0, -0.295), (1, -0.341), (2, -0.11), (3, -0.017), (4, 0.138), (5, 0.101), (6, -0.253), (7, 0.096), (8, 0.159), (9, -0.099), (10, -0.151), (11, -0.075), (12, -0.199), (13, 0.051), (14, -0.073), (15, 0.024), (16, 0.021), (17, -0.039), (18, -0.058), (19, -0.022), (20, 0.033), (21, 0.021), (22, 0.069), (23, -0.037), (24, 0.01), (25, -0.099), (26, 0.135), (27, 0.059), (28, 0.073), (29, 0.113), (30, 0.005), (31, 0.017), (32, -0.055), (33, -0.034), (34, -0.011), (35, -0.029), (36, -0.003), (37, -0.009), (38, 0.062), (39, 0.027), (40, -0.026), (41, -0.054), (42, -0.074), (43, 0.025), (44, 0.011), (45, 0.1), (46, 0.004), (47, -0.023), (48, 0.055), (49, -0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.98282272 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment
Author: Vamshi Ambati ; Stephan Vogel ; Jaime Carbonell
Abstract: Semi-supervised word alignment aims to improve the accuracy of automatic word alignment by incorporating full or partial manual alignments. Motivated by standard active learning query sampling frameworks like uncertainty-, margin- and query-by-committee sampling we propose multiple query strategies for the alignment link selection task. Our experiments show that by active selection of uncertain and informative links, we reduce the overall manual effort involved in elicitation of alignment link data for training a semisupervised word aligner.
2 0.87440377 170 acl-2010-Letter-Phoneme Alignment: An Exploration
Author: Sittichai Jiampojamarn ; Grzegorz Kondrak
Abstract: Letter-phoneme alignment is usually generated by a straightforward application of the EM algorithm. We explore several alternative alignment methods that employ phonetics, integer programming, and sets of constraints, and propose a novel approach of refining the EM alignment by aggregation of best alignments. We perform both intrinsic and extrinsic evaluation of the assortment of methods. We show that our proposed EM-Aggregation algorithm leads to the improvement of the state of the art in letter-to-phoneme conversion on several different data sets.
3 0.87393731 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
Author: Bing Xiang ; Yonggang Deng ; Bowen Zhou
Abstract: We present a novel method to improve word alignment quality and eventually the translation performance by producing and combining complementary word alignments for low-resource languages. Instead of focusing on the improvement of a single set of word alignments, we generate multiple sets of diversified alignments based on different motivations, such as linguistic knowledge, morphology and heuristics. We demonstrate this approach on an English-to-Pashto translation task by combining the alignments obtained from syntactic reordering, stemming, and partial words. The combined alignment outperforms the baseline alignment, with significantly higher F-scores and better transla- tion performance.
4 0.83483356 133 acl-2010-Hierarchical Search for Word Alignment
Author: Jason Riesa ; Daniel Marcu
Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.
5 0.81411588 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation
Author: John DeNero ; Dan Klein
Abstract: We present a discriminative model that directly predicts which set of phrasal translation rules should be extracted from a sentence pair. Our model scores extraction sets: nested collections of all the overlapping phrase pairs consistent with an underlying word alignment. Extraction set models provide two principal advantages over word-factored alignment models. First, we can incorporate features on phrase pairs, in addition to word links. Second, we can optimize for an extraction-based loss function that relates directly to the end task of generating translations. Our model gives improvements in alignment quality relative to state-of-the-art unsupervised and supervised baselines, as well as providing up to a 1.4 improvement in BLEU score in Chinese-to-English translation experiments.
6 0.780783 88 acl-2010-Discriminative Pruning for Discriminative ITG Alignment
7 0.74380815 262 acl-2010-Word Alignment with Synonym Regularization
8 0.63464803 147 acl-2010-Improving Statistical Machine Translation with Monolingual Collocation
9 0.57641554 240 acl-2010-Training Phrase Translation Models with Leaving-One-Out
10 0.53420222 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels
11 0.50438088 164 acl-2010-Learning Phrase-Based Spelling Error Models from Clickthrough Data
12 0.47083285 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features
13 0.46929035 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation
14 0.46464708 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts
15 0.4450171 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure
16 0.42691106 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models
17 0.42069453 57 acl-2010-Bucking the Trend: Large-Scale Cost-Focused Active Learning for Statistical Machine Translation
18 0.39119875 180 acl-2010-On Jointly Recognizing and Aligning Bilingual Named Entities
19 0.39116853 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
20 0.38281348 135 acl-2010-Hindi-to-Urdu Machine Translation through Transliteration
topicId topicWeight
[(16, 0.023), (25, 0.026), (39, 0.02), (59, 0.093), (73, 0.053), (78, 0.014), (83, 0.068), (98, 0.611)]
simIndex simValue paperId paperTitle
1 0.9963553 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach
Author: Yabin Zheng ; Zhiyuan Liu ; Lixing Xie
Abstract: Motivated by Google Sets, we study the problem of growing related words from a single seed word by leveraging user behaviors hiding in user records of Chinese input method. Our proposed method is motivated by the observation that the more frequently two words cooccur in user records, the more related they are. First, we utilize user behaviors to generate candidate words. Then, we utilize search engine to enrich candidate words with adequate semantic features. Finally, we reorder candidate words according to their semantic relatedness to the seed word. Experimental results on a Chinese input method dataset show that our method gains better performance. 1
2 0.99563742 242 acl-2010-Tree-Based Deterministic Dependency Parsing - An Application to Nivre's Method -
Author: Kotaro Kitagawa ; Kumiko Tanaka-Ishii
Abstract: Nivre’s method was improved by enhancing deterministic dependency parsing through application of a tree-based model. The model considers all words necessary for selection of parsing actions by including words in the form of trees. It chooses the most probable head candidate from among the trees and uses this candidate to select a parsing action. In an evaluation experiment using the Penn Treebank (WSJ section), the proposed model achieved higher accuracy than did previous deterministic models. Although the proposed model’s worst-case time complexity is O(n2), the experimental results demonstrated an average pars- ing time not much slower than O(n).
Author: Reyyan Yeniterzi ; Kemal Oflazer
Abstract: We present a novel scheme to apply factored phrase-based SMT to a language pair with very disparate morphological structures. Our approach relies on syntactic analysis on the source side (English) and then encodes a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), we only perform morphological analysis and disambiguation but treat the complete complex morphological tag as a factor, instead of separating morphemes. We incrementally explore capturing various syntactic substructures as complex tags on the English side, and evaluate how our translations improve in BLEU scores. Our maximal set of source and target side transformations, coupled with some additional techniques, provide an 39% relative improvement from a baseline 17.08 to 23.78 BLEU, all averaged over 10 training and test sets. Now that the syntactic analysis on the English side is available, we also experiment with more long distance constituent reordering to bring the English constituent order close to Turkish, but find that these transformations do not provide any additional consistent tangible gains when averaged over the 10 sets.
4 0.99213582 27 acl-2010-An Active Learning Approach to Finding Related Terms
Author: David Vickrey ; Oscar Kipersztok ; Daphne Koller
Abstract: We present a novel system that helps nonexperts find sets of similar words. The user begins by specifying one or more seed words. The system then iteratively suggests a series of candidate words, which the user can either accept or reject. Current techniques for this task typically bootstrap a classifier based on a fixed seed set. In contrast, our system involves the user throughout the labeling process, using active learning to intelligently explore the space of similar words. In particular, our system can take advantage of negative examples provided by the user. Our system combines multiple preexisting sources of similarity data (a standard thesaurus, WordNet, contextual similarity), enabling it to capture many types of similarity groups (“synonyms of crash,” “types of car,” etc.). We evaluate on a hand-labeled evaluation set; our system improves over a strong baseline by 36%.
5 0.99082541 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation
Author: Xiangyu Duan ; Min Zhang ; Haizhou Li
Abstract: The pipeline of most Phrase-Based Statistical Machine Translation (PB-SMT) systems starts from automatically word aligned parallel corpus. But word appears to be too fine-grained in some cases such as non-compositional phrasal equivalences, where no clear word alignments exist. Using words as inputs to PBSMT pipeline has inborn deficiency. This paper proposes pseudo-word as a new start point for PB-SMT pipeline. Pseudo-word is a kind of basic multi-word expression that characterizes minimal sequence of consecutive words in sense of translation. By casting pseudo-word searching problem into a parsing framework, we search for pseudo-words in a monolingual way and a bilingual synchronous way. Experiments show that pseudo-word significantly outperforms word for PB-SMT model in both travel translation domain and news translation domain. 1
same-paper 6 0.98715115 24 acl-2010-Active Learning-Based Elicitation for Semi-Supervised Word Alignment
7 0.98708522 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization
8 0.96299338 253 acl-2010-Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
9 0.95101058 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures
10 0.94780523 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models
11 0.94035709 90 acl-2010-Diversify and Combine: Improving Word Alignment for Machine Translation on Low-Resource Languages
12 0.90857035 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction
13 0.90479445 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints
14 0.89711756 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification
15 0.89710677 79 acl-2010-Cross-Lingual Latent Topic Extraction
16 0.88317525 37 acl-2010-Automatic Evaluation Method for Machine Translation Using Noun-Phrase Chunking
17 0.88273168 262 acl-2010-Word Alignment with Synonym Regularization
18 0.87946057 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels
19 0.87653434 188 acl-2010-Optimizing Informativeness and Readability for Sentiment Summarization
20 0.87305212 133 acl-2010-Hierarchical Search for Word Alignment