
163 acl-2012-Prediction of Learning Curves in Machine Translation

Source: pdf

Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy

Abstract: Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios: 1) monolingual samples in the source and target languages are available and 2) an additional small amount of parallel corpus is also available. We propose methods for predicting learning curves in both these scenarios.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Since ad-hoc manual translation can represent a significant investment in time and money, a prior assessment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. [sent-2, score-0.182]

2 In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. [sent-3, score-0.579]

3 We consider two scenarios, 1) Monolingual samples in the source and target languages are available and 2) An additional small amount of parallel corpus is also available. [sent-4, score-0.354]

4 We propose methods for predicting learning curves in both these scenarios. [sent-5, score-0.403]

5 In many cases it is possible to allocate some budget for manually translating a limited sample of relevant documents, be it via professional translation services or through increasingly fashionable crowdsourcing. [sent-7, score-0.145]

6 However, it is often difficult to predict how much training data will be required to achieve satisfactory translation accuracy, preventing sound provisional budgeting. [sent-8, score-0.214]

7 This prediction, or more generally the prediction of the learning curve of an SMT system as a function of available in-domain parallel data, is the objective of this paper. [sent-9, score-0.635]

8 In the first scenario (S1), the SMT developer is given only monolingual source and target samples from the relevant domain, and a small test parallel corpus. [sent-12, score-0.488]

9 In the second scenario (S2), an additional small seed parallel corpus is given that can be used to train small in-domain models and measure (with some variance) the evaluation score at a few points on the initial portion of the learning curve. [sent-15, score-0.482]

10 In both cases, the task consists in predicting an evaluation score (BLEU, throughout this work) on the test corpus as a function of the size of a subset of the source sample, assuming that we could have it manually translated and use the resulting bilingual corpus for training. [sent-16, score-0.317]

11 An extensive study across six parametric function families, empirically establishing that a certain three-parameter power-law family is well suited for modeling learning curves for the Moses SMT system when the evaluation score is BLEU. [sent-18, score-0.659]

12 2. A method for inferring learning curves based on features computed from the resources available in scenario S1, suitable for both the scenarios described above (S1) and (S2) (Section 4); [sent-20, score-0.489]

13 3. A method for extrapolating the learning curve from a few measurements, suitable for scenario S2 (Section 5); [sent-21, score-0.652]

14 Our experiments involve 30 distinct language pair and domain combinations and 96 different learning curves. [sent-30, score-0.104]

15 They show that without any parallel data we can predict the expected translation accuracy at 75K segments within an error of 6 BLEU points (Table 4), while using a seed training corpus of 10K segments narrows this error to within 1. [sent-31, score-0.612]

16 2 Related Work Learning curves are routinely used to illustrate how the performance of experimental methods depends on the amount of training data used. [sent-33, score-0.364]

17 (2003) used learning curves to compare performance for various meta-parameter settings such as maximum phrase length, while Turchi et al. [sent-35, score-0.327]

18 (2008) extensively studied the behaviour of learning curves under a number of test conditions on Spanish-English. [sent-36, score-0.375]

19 Their results showed that the most predictive features were the morphological complexity of the languages, their linguistic relatedness and their word-order divergence; in our work, we make use of these features, among others, for predicting translation accuracy (Section 4). [sent-39, score-0.141]

20 (2003) used learning curves for predicting maximum performance bounds of learning algorithms and to compare them. [sent-41, score-0.479]

21 (2001), the learning curves of two classification algorithms were modelled for eight different large data sets. [sent-43, score-0.327]

22 This work uses similar a priori knowledge for restricting the form of learning curves as ours (see Section 3), and also similar empirical evaluation criteria for comparing curve families with one another. [sent-44, score-1.042]

23 While both application and performance metric in our work are different, we arrive at a similar conclusion that a power law family of the form $y = c - a\,x^{-\alpha}$ is a good model of the learning curves. [sent-45, score-0.199]

24 Learning curves are also frequently used for determining empirically the number of iterations for an incremental learning procedure. [sent-46, score-0.327]

25 The crucial difference in our work is that in the previous cases, learning curves are plotted a posteriori i. [sent-47, score-0.327]

26 once the labelled data has become available and the training has been performed, whereas in our work the learning curve itself is the object of the prediction. [sent-49, score-0.514]

27 Our goal is to learn to predict what the learning curve will be a priori without having to label the data at all (S1), or through labelling only a very small amount of it (S2). [sent-50, score-0.662]

28 In this respect, the academic field of Computational Learning Theory has a similar goal, since it strives to identify bounds to performance measures, typically including a dependency on the training sample size. [sent-51, score-0.153]

29 3 Selecting a parametric family of curves The first step in our approach consists in selecting a suitable family of shapes for the learning curves that we want to produce in the two scenarios being considered. [sent-53, score-1.233]

30 For a certain bilingual test dataset d, we consider a set of observations Od = {(x1, y1) , (x2, y2) . [sent-55, score-0.143]

31 , 2002)) of a translation model trained on a parallel corpus of size xi. [sent-59, score-0.238]

32 The corpus size xi is measured in terms of the number of segments (sentences) present in the parallel corpus. [sent-60, score-0.332]

33 We consider such observations to be generated by a regression model of the form: $y_i = F(x_i; \theta) + \epsilon_i$ [sent-61, score-0.211]

34 Based on our prior knowledge of the problem, we limit the search for a suitable F to families that satisfy the following conditions: monotonically increasing, concave and bounded. [sent-64, score-0.278]

35 The second condition expresses a notion of “diminishing returns”, namely that a given amount of additional training data is more advantageous when added to a small rather than to a big amount of initial data. [sent-66, score-0.211]
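As a worked check (not part of the original text), the power-law form $y = c - a\,x^{-\alpha}$ mentioned in the related work can be tested against the three conditions. The sign constraints $a > 0$, $\alpha > 0$ and the domain $x > 0$ are assumptions made here for the check:

```latex
\[
\begin{aligned}
F(x) &= c - a\,x^{-\alpha}, \qquad a > 0,\ \alpha > 0,\ x > 0 \\
F'(x) &= a\,\alpha\,x^{-\alpha-1} > 0
  && \text{(monotonically increasing)} \\
F''(x) &= -a\,\alpha(\alpha+1)\,x^{-\alpha-2} < 0
  && \text{(concave, i.e. diminishing returns)} \\
\lim_{x \to \infty} F(x) &= c
  && \text{(bounded above by } c\text{)}
\end{aligned}
\]
```

Under these assumptions the parameter $c$ can be read as the asymptotic score the curve approaches as the corpus grows.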

36 We consider six possible families of functions satisfying these conditions, which are listed in Table 1. [sent-69, score-0.272]

37 [Table 1: Curve families.] Preliminary experiments indicated that curves from [sent-70, score-0.29]

38 the “Power” and “Exp” family with only two parameters underfitted, while those with five or more parameters led to overfitting and solution instability. [sent-71, score-0.303]

39 We decided to only select families with three or four parameters. [sent-72, score-0.269]

40 Curve fitting technique Given a set of observations {(x1, y1) , (x2, y2) . [sent-73, score-0.255]

41 (xn, yn)} and a curve family F(x; θ) from Table 1, we compute a best fit $\hat{\theta}$ where: $\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \left[y_i - F(x_i;\theta)\right]^2$ (2), through use of the Levenberg-Marquardt method (Moré, 1978) for non-linear regression. [sent-76, score-0.747]
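The least-squares fit of Eq. 2 can be sketched in Python for the three-parameter power law $y = c - a\,x^{-\alpha}$. This is only an illustration under stated assumptions. It does not use Levenberg-Marquardt: for a fixed $\alpha$ the model is linear in $(c, a)$, so the sketch solves those two in closed form and picks $\alpha$ by a plain grid search. The function names, the grid and the data points are hypothetical.

```python
def pow3(x, c, a, alpha):
    """Three-parameter power law: y = c - a * x**(-alpha)."""
    return c - a * x ** (-alpha)


def fit_pow3(xs, ys, alphas=None):
    """Least-squares fit of pow3 to the points (xs, ys).

    For a fixed alpha the model is linear in (c, a), so those two are solved
    in closed form; alpha itself is picked by a plain grid search. This is a
    stand-in for Levenberg-Marquardt, not the procedure used in the paper.
    """
    if alphas is None:
        alphas = [k / 100 for k in range(1, 301)]  # hypothetical grid 0.01..3.00
    n = len(xs)
    y_mean = sum(ys) / n
    best = None
    for alpha in alphas:
        z = [-(x ** (-alpha)) for x in xs]  # model becomes y = c + a * z
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0.0:
            continue  # degenerate case: all corpus sizes are equal
        a = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, ys)) / szz
        c = y_mean - a * z_mean
        sse = sum((yi - (c + a * zi)) ** 2 for zi, yi in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, c, a, alpha)
    if best is None:
        raise ValueError("need at least two distinct corpus sizes")
    return best[1], best[2], best[3]


# Made-up, noiseless observations generated from c=30, a=200, alpha=0.5.
sizes = [1000, 2000, 5000, 10000, 20000, 50000]
bleu = [pow3(x, 30.0, 200.0, 0.5) for x in sizes]
c_hat, a_hat, alpha_hat = fit_pow3(sizes, bleu)
print(c_hat, a_hat, alpha_hat)
```

A real fit would more likely use a non-linear least-squares routine from a numerical library. The grid search is used here only to keep the sketch dependency-free.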

42 For selecting a learning curve family, and for all other experiments in this paper, we trained a large number of systems on multiple configurations of training sets and sample sizes, and tested each on multiple test sets; these are listed in Table 2. [sent-77, score-0.693]

43 Language codes: Cz=Czech, Da=Danish, En=English, De=German, Fr=French, Jp=Japanese, Es=Spanish. Footnote 2: The settings used in training the systems are those described in http : / /www . [sent-82, score-0.328]

44 The goodness of fit for each of the families is evaluated based on their ability to i) fit over the entire set of observations, ii) extrapolate to points beyond the observed portion of the curve and iii) generalize well over different datasets. [sent-85, score-0.72]

45 We use a recursive fitting procedure where the curve obtained from fitting the first i points is used to predict the observations at two points: xi+1, i. [sent-86, score-0.93]

46 the point to the immediate right of the currently observed xi and xn, i. [sent-88, score-0.127]

47 The following error measures quantify the goodness of fit of the curve families: 1. [sent-91, score-0.654]

48 Average root mean-squared error (RMSE): $\frac{1}{N}\sum_{c\in S}\sum_{t\in T_c}\left(\frac{1}{n}\sum_{i=1}^{n}\left[y_i - F(x_i;\hat{\theta})\right]^2\right)_{ct}^{1/2}$ where S is the set of training datasets, Tc is the set of test datasets for training configuration c, $\hat{\theta}$ is as defined in Eq. [sent-92, score-0.347]

49 2, N is the total number of combinations of training configurations and test datasets, and i ranges on a grid of training subset sizes. [sent-93, score-0.278]

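50 Average root mean squared residual at next point X = xi+1 (NPR): $\frac{1}{N}\sum_{c\in S}\sum_{t\in T_c}\left(\frac{1}{n-k-1}\sum_{i=k}^{n-1}\left[y_{i+1} - F(x_{i+1};\hat{\theta}_i)\right]^2\right)_{ct}^{1/2}$ where $\hat{\theta}_i$ is obtained using only observations up to xi in Eq. [sent-96, score-0.505]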

51 Average root mean squared residual at the last point X = xn (LPR): $\frac{1}{N}\sum_{c\in S}\sum_{t\in T_c}\left(\frac{1}{n-k-1}\sum_{i=k}^{n-1}\left[y_n - F(x_n;\hat{\theta}_i)\right]^2\right)_{ct}^{1/2}$ Curve fitting evaluation The evaluation of the goodness of fit for the curve families is presented in Table 3. [sent-99, score-1.41]
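The recursive fitting procedure behind the NPR and LPR measures can be sketched for a single learning curve as follows. This is a minimal sketch under stated assumptions: the curve family is the three-parameter power law, the fit is a grid search standing in for Levenberg-Marquardt, and the function names and the default k = 3 are hypothetical. The normaliser n - k - 1 follows the formulas above. The outer average over configurations and test sets is left out.

```python
def fit_pow3(xs, ys):
    """Grid-search least-squares fit of y = c - a * x**(-alpha).

    A dependency-free stand-in for the Levenberg-Marquardt fit of Eq. 2.
    Returns the tuple (c, a, alpha).
    """
    n, y_mean, best = len(xs), sum(ys) / len(ys), None
    for step in range(1, 301):
        alpha = step / 100
        z = [-(x ** (-alpha)) for x in xs]  # model becomes y = c + a * z
        z_mean = sum(z) / n
        szz = sum((zi - z_mean) ** 2 for zi in z)
        a = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, ys)) / szz
        c = y_mean - a * z_mean
        sse = sum((yi - (c + a * zi)) ** 2 for zi, yi in zip(z, ys))
        if best is None or sse < best[0]:
            best = (sse, c, a, alpha)
    return best[1:]


def next_and_last_point_residuals(xs, ys, k=3):
    """Root mean squared residuals at x_{i+1} (NPR) and x_n (LPR), one curve.

    For i = k .. n-1 the curve is fit on the first i points only, then asked
    to predict the next observation and the last observation.
    """
    n = len(xs)
    if n < k + 2:
        raise ValueError("need at least k + 2 observations")
    next_sq = last_sq = 0.0
    for i in range(k, n):
        c, a, alpha = fit_pow3(xs[:i], ys[:i])
        predict = lambda x: c - a * x ** (-alpha)
        next_sq += (ys[i] - predict(xs[i])) ** 2    # index i holds point i+1
        last_sq += (ys[-1] - predict(xs[-1])) ** 2  # always the last point
    norm = n - k - 1                                # normaliser as in the text
    return (next_sq / norm) ** 0.5, (last_sq / norm) ** 0.5


# Made-up, noiseless curve: both residuals should be close to zero here.
sizes = [1000, 2000, 5000, 10000, 20000, 50000]
bleu = [30.0 - 200.0 * x ** (-0.5) for x in sizes]
npr, lpr = next_and_last_point_residuals(sizes, bleu)
print(npr, lpr)
```

With real, noisy BLEU measurements the two residuals would generally differ, since LPR asks for a longer extrapolation than NPR.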

52 The average values of the root mean-squared error and the average residuals across all the learning curves used in our experiments are shown in this table. [sent-100, score-0.594]

53 Figure 1 shows the curve fits obtained for all the six families on a test dataset for the English-German language pair. Footnote 3: We start the summation from i = k, because at least k points are required for computing $\hat{\theta}_i$. [sent-102, score-0.566]

54 Figure 1: Curve fits using different curve families on a test dataset. [sent-103, score-1.086]

55 Looking at the values in Table 3, we decided to use the Pow3 family as the best overall compromise. [sent-105, score-0.26]

56 The only available parallel resource is a very small test corpus. [sent-108, score-0.217]

57 Our objective is to predict the evolution of the BLEU score on the given test set as a function of the size of a random subset of the training data that we manually translate (see footnote 4). [sent-109, score-0.25]

58 The intuition behind this is that the source-side and target-side monolingual data already convey significant information about the difficulty of the translation task. [sent-110, score-0.186]

59 We first train models to predict the BLEU score at m anchor sizes s1, . [sent-112, score-0.507]

60 m} where wj is a vector of feature weights specific to predicting at anchor size j, and φ is a vector of size-independent configuration features, detailed below. [sent-119, score-0.681]

61 We then perform inference using these models to predict the BLEU score at each anchor, for the test case of interest. [sent-120, score-0.12]

62 We finally estimate the parameters of the learning curve by weighted least squares regression using the anchor predictions. [sent-121, score-0.943]
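The weighted least-squares step can be sketched as below: given predicted BLEU values at the anchor sizes and one confidence weight per anchor, fit the three-parameter power law through them. This is a hedged illustration, not the authors' implementation. The fitting routine is a grid search over α with a closed-form weighted solution for (c, a), and the anchor sizes, predictions and weights are made-up numbers.

```python
def fit_pow3_weighted(sizes, preds, weights):
    """Weighted least-squares fit of y = c - a * s**(-alpha) through anchors.

    Minimises sum_j w_j * (pred_j - F(s_j))**2. For each alpha on a
    hypothetical grid, (c, a) come from weighted linear regression of the
    predictions on z = -s**(-alpha). Returns the tuple (c, a, alpha).
    """
    w_sum = sum(weights)
    y_mean = sum(w * y for w, y in zip(weights, preds)) / w_sum
    best = None
    for step in range(1, 301):
        alpha = step / 100
        z = [-(s ** (-alpha)) for s in sizes]
        z_mean = sum(w * zi for w, zi in zip(weights, z)) / w_sum
        szz = sum(w * (zi - z_mean) ** 2 for w, zi in zip(weights, z))
        szy = sum(w * (zi - z_mean) * (yi - y_mean)
                  for w, zi, yi in zip(weights, z, preds))
        a = szy / szz
        c = y_mean - a * z_mean
        loss = sum(w * (yi - (c + a * zi)) ** 2
                   for w, zi, yi in zip(weights, z, preds))
        if best is None or loss < best[0]:
            best = (loss, c, a, alpha)
    return best[1:]


# Made-up anchor predictions; the weights stand for per-anchor confidences.
anchor_sizes = [10000, 75000, 500000]
anchor_preds = [18.0, 24.0, 27.5]
anchor_weights = [0.4, 0.2, 0.1]
c_u, a_u, alpha_u = fit_pow3_weighted(anchor_sizes, anchor_preds, anchor_weights)
print(c_u, a_u, alpha_u)
```

With exactly three anchors the weights have little effect, because a three-parameter curve can pass almost exactly through three points. They start to matter once there are more anchors than parameters.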

63 Anchor sizes can be chosen rather arbitrarily, but must satisfy the following two constraints: 1. [sent-122, score-0.105]

64 They must be three or more in number in order to allow fitting the tri-parameter curve. [sent-123, score-0.16]

65 Average length of tokens in the (source) test set and in the monolingual source language corpus. [sent-130, score-0.212]

66 Lexical diversity features: (a) type-token ratios for n-grams of order 1 to 5 in the monolingual corpus of both source and target languages (b) perplexity of language models of order 2 to 5 derived from the monolingual source corpus computed on the source side of the test corpus. [sent-132, score-0.586]

67 Footnote 4: We specify that it is a random sample as opposed to a subset deliberately chosen to maximize learning effectiveness. [sent-133, score-0.145]

68 Features capturing divergence between languages in the pair: (a) average ratio of source/target sentence lengths in the test set. [sent-136, score-0.193]

69 (b) ratio of type-token ratios of orders 1 to 5 in the monolingual corpus of both source and target languages. [sent-137, score-0.266]

70 Word-order divergence: The divergence in the word-order between the source and the target languages can be captured using the part-of-speech (POS) tag sequences across languages. [sent-139, score-0.183]

71 We use a cross-entropy measure to capture similarity between the n-gram distributions of the POS tags in the monolingual corpora of the two languages. [sent-140, score-0.121]

72 These features capture our intuition that translation is going to be harder if the language in the domain is highly variable and if the source and target languages diverge more in terms of morphology and word-order. [sent-147, score-0.217]
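Two of the features listed above, n-gram type-token ratios and the cross-entropy between n-gram distributions, can be sketched as follows. The sketch makes several assumptions: the input is already tokenised (or POS-tagged with a tagset shared by both languages), and add-one smoothing is used for the second distribution. That smoothing choice is an assumption made here and is not taken from the paper. All function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def type_token_ratio(sentences, n):
    """Distinct n-grams divided by total n-grams over tokenised sentences."""
    counts = Counter(g for sent in sentences for g in ngrams(sent, n))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def cross_entropy(p_sentences, q_sentences, n):
    """Cross-entropy H(p, q) in bits between two n-gram distributions.

    p is the empirical distribution of the first corpus; q is the add-one
    smoothed distribution of the second corpus over the joint vocabulary.
    """
    p = Counter(g for s in p_sentences for g in ngrams(s, n))
    q = Counter(g for s in q_sentences for g in ngrams(s, n))
    p_total = sum(p.values())
    if p_total == 0:
        return 0.0
    q_total = sum(q.values()) + len(set(p) | set(q))
    return -sum((cnt / p_total) * math.log2((q[g] + 1) / q_total)
                for g, cnt in p.items())


# Made-up toy corpora: token sequences, and POS sequences for two languages.
source = [["the", "cat", "sat"], ["the", "dog", "sat"]]
src_tags = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]]
tgt_tags = [["NOUN", "DET", "VERB"], ["DET", "NOUN", "VERB"]]

ttr_by_order = {n: type_token_ratio(source, n) for n in range(1, 6)}
tag_divergence = cross_entropy(src_tags, tgt_tags, 2)
print(ttr_by_order, tag_divergence)
```

Higher type-token ratios suggest more lexical variability, and a larger cross-entropy between the two POS n-gram distributions suggests more word-order divergence, in line with the intuition stated above.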

73 The training data for fitting these linear models is obtained in the following way. [sent-149, score-0.194]

74 For each configuration (combination of language pair and domain) c and test set t in Table 2, a gold curve is fitted using the selected tri-parameter power-law family using a fine grid of corpus sizes. [sent-150, score-0.968]

75 This is available as a byproduct of the experiments for comparing different parametric families described in Section 3. [sent-151, score-0.335]

76 We then compute the value of the gold curves at the m anchor sizes: we thus have m “gold” vectors µ1, . [sent-152, score-0.691]

77 , µm with accurate estimates of BLEU at the anchor sizes. [sent-155, score-0.358]

78 We construct the design matrix Φ with one column for each feature vector φct corresponding to each combination of training configuration c and test set t. [sent-156, score-0.213]

79 As baseline, we take a constant mean model predicting, for each anchor size sj, the average of all the µjct. [sent-160, score-0.484]

80 We do not assume the difficulty of predicting BLEU at all anchor points to be the same. [sent-161, score-0.519]

81 To allow for this, we use (non-regularized) weighted least squares to fit a curve from our parametric family through the m anchor points (see footnote 6). [sent-162, score-1.203]

82 2), the anchor confidence is set to be the inverse of the cross-validated mean square residuals: $\omega_j = \left(\frac{1}{N}\sum_{c\in S}\sum_{t\in T_c}\left(\phi_{ct}^{\top} w_j^{\setminus c} - \mu_{jct}\right)^2\right)^{-1}$ (4) [sent-166, score-0.496]

83 where $w_j^{\setminus c}$ are the feature weights obtained by the regression above on all training configurations except c, µjct is the gold value at anchor j for training/test combination c, t, and N is the total number of such combinations (see footnote 7). [sent-167, score-0.57]

84 In other words, we assign to each anchor point a confidence inverse to the crossvalidated mean squared error of the model used to predict it. [sent-168, score-0.787]
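The confidence assignment can be illustrated with a small sketch. It assumes the held-out predictions have already been produced, one per training/test combination and anchor, by models trained without the corresponding configuration. The function name, the data layout and the numbers are hypothetical.

```python
def anchor_confidences(cv_predictions, gold):
    """omega_j = 1 / (mean squared cross-validated residual at anchor j).

    cv_predictions[j][i] and gold[j][i] are the held-out prediction and the
    gold BLEU value for the i-th training/test combination at anchor j.
    An anchor predicted perfectly (zero residual) would raise
    ZeroDivisionError; a real implementation would need to guard that case.
    """
    omegas = []
    for preds_j, gold_j in zip(cv_predictions, gold):
        mse = sum((p - g) ** 2 for p, g in zip(preds_j, gold_j)) / len(gold_j)
        omegas.append(1.0 / mse)
    return omegas


# Made-up values for two anchors and two training/test combinations.
cv_predictions = [[10.0, 12.0], [20.0, 24.0]]
gold = [[11.0, 11.0], [22.0, 22.0]]
omegas = anchor_confidences(cv_predictions, gold)
print(omegas)
```

In this toy case the second anchor has four times the mean squared residual of the first, so it receives a quarter of the weight in the subsequent curve fit.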

85 For a new unseen configuration with feature vector φu, we determine the parameters θu of the corresponding learning curve as: $\theta_u = \arg\min_{\theta} \sum_j \omega_j \left(\phi_u^{\top} w_j - F(s_j;\theta)\right)^2$ (5) [sent-169, score-0.631]

86 5 Extrapolating a learning curve fitted on a small parallel corpus Given a small “seed” parallel corpus, the translation system can be used to train small in-domain models and the evaluation score can be measured at a few initial sample sizes {(x1, y1), (x2, y2) . [sent-171, score-1.133]

87 l points provides evidence for predicting its performance for larger sample sizes. [sent-176, score-0.214]

88 In order to do so, a learning curve from the family Pow3 is first fit through these initial points. [sent-177, score-0.814]

89 Footnote 6: When the number of anchor points is the same as the number of parameters in the parametric family, the curve can be fit exactly through all anchor points. [sent-178, score-1.499]

90 However the general discussion is relevant in case there are more anchor points than parameters, and also in view of the combination of inference and extrapolation in Section 6. [sent-179, score-0.567]

91 Footnote 7: Curves on different test data for the same training configuration are highly correlated and are therefore left out. [sent-180, score-0.181]

92 ... that p ≥ 3 for this operation to be well ... The best fit $\hat{\eta}$ is computed using the same curve fitting as in Eq. [sent-182, score-0.187]

93 At each individual anchor size sj, the accuracy of prediction is measured using the root mean-squared error between the prediction of extrapolated curves and the gold values: $\left(\frac{1}{N}\sum_{c\in S}\sum_{t\in T_c}\left[F(s_j;\hat{\eta}_{ct}) - \mu_{ctj}\right]^2\right)^{1/2}$ (6) [sent-184, score-1.06]

94 where $\hat{\eta}_{ct}$ are the parameters of the curve fit using the initial points for the combination ct. [sent-185, score-0.79]

95 In general, we observed that the extrapolated curve tends to over-estimate BLEU for large samples. [sent-186, score-0.535]

96 6 Combining inference and extrapolation In scenario S2, the models trained from the seed parallel corpus and the features used for inference (Section 4) provide complementary information. [sent-187, score-0.352]

97 For the inference method of Section 4, predictions of models at anchor points are weighted by the inverse of the model empirical squared error (ωj). [sent-189, score-0.687]

98 Let u be a new configuration with seed parallel corpus of size xu, and let xl be the largest point in our grid for which xl ≤ xu. [sent-191, score-0.572]

99 We first train translation models and evaluate scores on samples of size x1, . [sent-192, score-0.14]

100 , xl, fit parameters $\hat{\eta}_u$ through the scores, and then extrapolate BLEU at the anchors sj: $F(s_j; \hat{\eta}_u)$, j ∈ {1, .

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('curve', 0.443), ('anchor', 0.358), ('curves', 0.29), ('families', 0.237), ('family', 0.199), ('fitting', 0.16), ('stx', 0.154), ('squared', 0.134), ('monolingual', 0.121), ('bleu', 0.118), ('sj', 0.116), ('tc', 0.106), ('wj', 0.106), ('fit', 0.105), ('parallel', 0.104), ('configuration', 0.099), ('parametric', 0.098), ('observations', 0.095), ('extrapolated', 0.092), ('extrapolation', 0.092), ('xi', 0.091), ('xn', 0.089), ('points', 0.085), ('sizes', 0.077), ('predicting', 0.076), ('xl', 0.075), ('anchors', 0.073), ('predict', 0.072), ('scenario', 0.07), ('smt', 0.07), ('divergence', 0.068), ('translation', 0.065), ('yi', 0.063), ('extrapolating', 0.061), ('jct', 0.061), ('residuals', 0.061), ('inverse', 0.061), ('seed', 0.059), ('goodness', 0.057), ('grid', 0.055), ('extrapolate', 0.054), ('residual', 0.054), ('fitted', 0.054), ('regression', 0.053), ('sample', 0.053), ('parameters', 0.052), ('scenarios', 0.051), ('prediction', 0.051), ('yn', 0.051), ('root', 0.05), ('configurations', 0.05), ('error', 0.049), ('xerox', 0.049), ('test', 0.048), ('argm', 0.046), ('mean', 0.045), ('satisfactory', 0.043), ('regime', 0.043), ('source', 0.043), ('ct', 0.043), ('gold', 0.043), ('size', 0.042), ('ratios', 0.041), ('suitable', 0.041), ('amount', 0.04), ('regularization', 0.039), ('bounds', 0.039), ('average', 0.039), ('languages', 0.038), ('fits', 0.038), ('domain', 0.037), ('moses', 0.037), ('learning', 0.037), ('point', 0.036), ('small', 0.035), ('priori', 0.035), ('six', 0.035), ('measured', 0.034), ('training', 0.034), ('segments', 0.034), ('target', 0.034), ('datasets', 0.033), ('samples', 0.033), ('combination', 0.032), ('confidence', 0.032), ('condition', 0.032), ('decided', 0.032), ('initial', 0.03), ('combinations', 0.03), ('resource', 0.03), ('values', 0.029), ('selecting', 0.028), ('chosen', 0.028), ('subset', 0.027), ('corpus', 0.027), ('manually', 0.027), ('strives', 0.027), ('lasso', 0.027), ('bhee', 0.027), ('usi', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 163 acl-2012-Prediction of Learning Curves in Machine Translation

Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy

2 0.12955561 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

Author: Xiaodong He ; Li Deng

Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.

3 0.090232067 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

Author: Jinsong Su ; Hua Wu ; Haifeng Wang ; Yidong Chen ; Xiaodong Shi ; Huailin Dong ; Qun Liu

Abstract: To adapt a translation model trained from the data in one domain to another, previous works paid more attention to the studies of parallel corpus while ignoring the in-domain monolingual corpora which can be obtained more easily. In this paper, we propose a novel approach for translation model adaptation by utilizing in-domain monolingual topic information instead of the in-domain bilingual corpora, which incorporates the topic information into translation probability estimation. Our method establishes the relationship between the out-of-domain bilingual corpus and the in-domain monolingual corpora via topic mapping and phrase-topic distribution probability estimation from in-domain monolingual corpora. Experimental result on the NIST Chinese-English translation task shows that our approach significantly outperforms the baseline system.

4 0.089766175 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

Author: Malte Nuhn ; Arne Mauser ; Hermann Ney

Abstract: In this paper we show how to train statistical machine translation systems on real-life tasks using only non-parallel monolingual data from two languages. We present a modification of the method shown in (Ravi and Knight, 2011) that is scalable to vocabulary sizes of several thousand words. On the task shown in (Ravi and Knight, 2011) we obtain better results with only 5% of the computational effort when running our method with an n-gram language model. The efficiency improvement of our method allows us to run experiments with vocabulary sizes of around 5,000 words, such as a non-parallel version of the VERBMOBIL corpus. We also report results using data from the monolingual French and English GIGAWORD corpora.

5 0.08926557 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer

Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.

6 0.082956754 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

7 0.081234522 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

8 0.077348925 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

9 0.075095333 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

10 0.07462199 140 acl-2012-Machine Translation without Words through Substring Alignment

11 0.072893478 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation

12 0.068338022 34 acl-2012-Automatically Learning Measures of Child Language Development

13 0.066940732 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation

14 0.066793635 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

15 0.066609524 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

16 0.065071039 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

17 0.064648584 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling

18 0.063469753 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

19 0.063388057 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

20 0.063228205 9 acl-2012-A Cost Sensitive Part-of-Speech Tagging: Differentiating Serious Errors from Minor Errors

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.202), (1, -0.079), (2, 0.028), (3, 0.025), (4, 0.02), (5, 0.014), (6, -0.003), (7, 0.008), (8, -0.02), (9, -0.019), (10, -0.027), (11, 0.014), (12, -0.039), (13, 0.017), (14, -0.017), (15, 0.026), (16, 0.069), (17, 0.003), (18, 0.006), (19, 0.016), (20, -0.007), (21, -0.104), (22, 0.03), (23, -0.049), (24, 0.014), (25, -0.022), (26, 0.016), (27, 0.088), (28, -0.01), (29, 0.107), (30, -0.036), (31, 0.025), (32, 0.075), (33, -0.013), (34, 0.084), (35, -0.101), (36, -0.015), (37, -0.101), (38, -0.061), (39, -0.166), (40, 0.086), (41, 0.08), (42, 0.012), (43, -0.029), (44, 0.035), (45, -0.123), (46, -0.001), (47, 0.016), (48, -0.052), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94060642 163 acl-2012-Prediction of Learning Curves in Machine Translation

Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy

2 0.70638049 136 acl-2012-Learning to Translate with Multiple Objectives

Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata

Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.

3 0.67123872 34 acl-2012-Automatically Learning Measures of Child Language Development

Author: Sam Sahakian ; Benjamin Snyder

Abstract: We propose a new approach for the creation of child language development metrics. A set of linguistic features is computed on child speech samples and used as input in two age prediction experiments. In the first experiment, we learn a child-specific metric and predict the ages at which speech samples were produced. We then learn a more general developmental index by applying our method across children, predicting relative temporal orderings of speech samples. In both cases we compare our results with established measures of language development, showing improvements in age prediction performance.

4 0.61666811 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

Author: Boxing Chen ; Roland Kuhn ; Samuel Larkin

Abstract: Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT evaluation metric which combines precision, recall and an ordering metric and which is primarily designed for tuning MT systems. PORT does not require external resources and is quick to compute. It has a better correlation with human judgment than BLEU. We compare PORT-tuned MT systems to BLEU-tuned baselines in five experimental conditions involving four language pairs. PORT tuning achieves consistently better performance than BLEU tuning, according to four automated metrics (including BLEU) and to human evaluation: in comparisons of outputs from 300 source sentences, human judges preferred the PORT-tuned output 45.3% of the time (vs. 32.7% BLEU tuning preferences and 22.0% ties).

5 0.60704905 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

Author: Tahira Naseem ; Regina Barzilay ; Amir Globerson

Abstract: We present a novel algorithm for multilingual dependency parsing that uses annotations from a diverse set of source languages to parse a new unannotated language. Our motivation is to broaden the advantages of multilingual learning to languages that exhibit significant differences from existing resource-rich languages. The algorithm learns which aspects of the source languages are relevant for the target language and ties model parameters accordingly. The model factorizes the process of generating a dependency tree into two steps: selection of syntactic dependents and their ordering. Being largely language-universal, the selection component is learned in a supervised fashion from all the training languages. In contrast, the ordering decisions are only influenced by languages with similar properties. We systematically model this cross-lingual sharing using typological features. In our experiments, the model consistently outperforms a state-of-the-art multilingual parser. The largest improvement is achieved on the non Indo-European languages yielding a gain of 14.4%.

6 0.59214693 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

7 0.58581597 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

8 0.57842845 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

9 0.57063222 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

10 0.54879636 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

11 0.54845828 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

12 0.53606236 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

13 0.50665903 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

14 0.50134164 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

15 0.50018674 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis

16 0.49229035 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

17 0.48528731 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

18 0.46454281 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

19 0.46165329 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

20 0.44507149 181 acl-2012-Spectral Learning of Latent-Variable PCFGs

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.026), (26, 0.036), (28, 0.046), (30, 0.031), (37, 0.037), (39, 0.049), (59, 0.015), (74, 0.049), (79, 0.28), (82, 0.04), (84, 0.018), (85, 0.03), (90, 0.125), (92, 0.074), (94, 0.03), (99, 0.04)]
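The listing does not say how simValue is derived from these topic weights. As a minimal sketch only, assuming cosine similarity over the sparse (topicId, topicWeight) pairs, two papers could be compared as follows. The helper name `cosine` and the second vector `other_topics` are hypothetical and used purely for illustration; only `paper_topics` is copied from the listing.

```python
import math

# Sparse LDA topic vector for this paper, copied from the listing above.
paper_topics = [
    (25, 0.026), (26, 0.036), (28, 0.046), (30, 0.031), (37, 0.037),
    (39, 0.049), (59, 0.015), (74, 0.049), (79, 0.28), (82, 0.04),
    (84, 0.018), (85, 0.03), (90, 0.125), (92, 0.074), (94, 0.03),
    (99, 0.04),
]

# Hypothetical topic vector for another paper (illustrative values only).
other_topics = [(79, 0.31), (90, 0.10), (12, 0.05)]


def cosine(a, b):
    """Cosine similarity between two sparse (topicId, weight) lists.

    Assumption: the page may use a different measure; this is one common choice.
    """
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db.get(t, 0.0) for t, w in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The topic carrying the largest weight for this paper.
dominant_topic = max(paper_topics, key=lambda tw: tw[1])

similarity = cosine(paper_topics, other_topics)
```

Under this assumption, papers whose weight concentrates on the same few topics (here topics 79 and 90) score close to 1, while papers with no shared topics score 0.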

similar papers list:

simIndex simValue paperId paperTitle

1 0.94097239 166 acl-2012-Qualitative Modeling of Spatial Prepositions and Motion Expressions

Author: Inderjeet Mani ; James Pustejovsky

Abstract: unknown-abstract

same-paper 2 0.74418193 163 acl-2012-Prediction of Learning Curves in Machine Translation

Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy

3 0.58323073 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

Author: Deyi Xiong ; Min Zhang ; Haizhou Li

Abstract: Predicate-argument structure contains rich semantic information of which statistical machine translation hasn’t taken full advantage. In this paper, we propose two discriminative, feature-based models to exploit predicate-argument structures for statistical machine translation: 1) a predicate translation model and 2) an argument reordering model. The predicate translation model explores lexical and semantic contexts surrounding a verbal predicate to select desirable translations for the predicate. The argument reordering model automatically predicts the moving direction of an argument relative to its predicate after translation using semantic features. The two models are integrated into a state-of-the-art phrase-based machine translation system and evaluated on Chinese-to-English translation tasks with large-scale training data. Experimental results demonstrate that the two models significantly improve translation accuracy.

4 0.52917379 31 acl-2012-Authorship Attribution with Author-aware Topic Models

Author: Yanir Seroussi ; Fabian Bohnert ; Ingrid Zukerman

Abstract: Authorship attribution deals with identifying the authors of anonymous texts. Building on our earlier finding that the Latent Dirichlet Allocation (LDA) topic model can be used to improve authorship attribution accuracy, we show that employing a previously-suggested Author-Topic (AT) model outperforms LDA when applied to scenarios with many authors. In addition, we define a model that combines LDA and AT by representing authors and documents over two disjoint topic sets, and show that our model outperforms LDA, AT and support vector machines on datasets with many authors.

5 0.52820635 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater

Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.

6 0.52681923 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

7 0.52625096 167 acl-2012-QuickView: NLP-based Tweet Search

8 0.52615494 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

9 0.52521247 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

10 0.52497935 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing

11 0.52416974 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

12 0.52366847 191 acl-2012-Temporally Anchored Relation Extraction

13 0.52362007 10 acl-2012-A Discriminative Hierarchical Model for Fast Coreference at Large Scale

14 0.52317631 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

15 0.52300364 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization

16 0.52289861 156 acl-2012-Online Plagiarized Detection Through Exploiting Lexical, Syntax, and Semantic Information

17 0.52160752 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

18 0.52159607 187 acl-2012-Subgroup Detection in Ideological Discussions

19 0.52008158 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

20 0.51904261 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars