
140 acl-2012-Machine Translation without Words through Substring Alignment

Source: pdf

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. [sent-2, score-0.979]

2 We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. [sent-3, score-0.184]

3 In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs. [sent-4, score-0.453]

4 1 Introduction Traditionally, the task of statistical machine translation (SMT) is defined as translating a source sentence = {f1, . [sent-5, score-0.37]

5 The most obvious example of this lies in languages that do not separate words with white space such as Chinese, Japanese, or Thai, in which the choice of a segmentation standard has a large effect on translation accuracy (Chang et al. [sent-13, score-0.424]

6 boundaries, all machine translation systems perform at least some precursory form of tokenization, splitting punctuation and words to prevent the sparsity that would occur if punctuated and non-punctuated words were treated as different entities. [sent-16, score-0.467]

7 A myriad of methods have been proposed to handle each of these phenomena individually, including morphological analysis, stemming, compound breaking, number regularization, optimizing word segmentation, and transliteration, which we outline in more detail in Section 2. [sent-18, score-0.227]

8 This method is attractive, as it is theoretically able to handle all sparsity phenomena in a single unified framework, but has only been shown feasible between similar language pairs such as Spanish-Catalan (Vilar et al. [sent-22, score-0.158]

9 (2007) state and we confirm, accurate translations cannot be achieved when applying traditional translation techniques to character-based translation for less similar language pairs. [sent-26, score-0.606]

10 In this paper, we propose improvements to the alignment process tailored to character-based machine translation, and demonstrate that it is, in fact, possible to achieve translation accuracies that ap[...] [Page footer: Proceedings of the 50th Annual Meeting of the A[...], Jeju, Republic of Korea, 8-14 July 2012.] [sent-27, score-0.583]

11 We also propose two improvements to the many-to-many alignment method of Neubig et al. [sent-32, score-0.28]

12 One barrier to applying many-to-many alignment models to character strings is training cost. [sent-34, score-0.378]

13 In the inversion transduction grammar (ITG) framework (Wu, 1997), which is widely used in many-to-many alignment, search is cumbersome for longer sentences, a problem that is further exacerbated when using characters instead of words as the basic unit. [sent-35, score-0.228]

14 Secondly, we describe a method to seed the search process using counts of all substring pairs in the corpus to bias the phrase alignment model. [sent-38, score-0.655]

15 We do this by defining prior probabilities based on these substring counts within the Bayesian phrasal ITG framework. [sent-39, score-0.583]

16 An evaluation on four language pairs with differing morphological properties shows that for distant language pairs, character-based SMT can achieve translation accuracy comparable to word-based systems. [sent-40, score-0.408]

17 Finally, we perform a qualitative analysis, which finds that character-based translation can handle unsegmented text, conjugation, and proper names in a unified framework with no additional processing. [sent-42, score-0.346]

18 In fact, it has been shown that there is a direct negative correlation between vocabulary size (and thus sparsity) of a language and translation accuracy (Koehn, 2005). [sent-44, score-0.303]

19 Sparsity causes trouble for alignment models, both in the form of incorrectly aligned uncommon words, and in the form of garbage collection, where uncommon words in one language are incorrectly aligned to large segments of the sentence in the other language (Och and Ney, 2003). [sent-45, score-0.446]

20 Unknown words are also a problem during the translation process, and the default approach is to map them as-is into the target sentence. [sent-46, score-0.356]

21 This is a major problem in agglutinative languages such as Finnish or compounding languages such as German. [sent-47, score-0.207]

22 Another source of data sparsity that occurs in all languages is proper names, which have been handled by using cognates or transliteration to improve translation (Knight and Graehl, 1998; Kondrak et al. [sent-51, score-0.586]

23 , 2003; Finch and Sumita, 2007), and more sophisticated methods for named entity translation that combine translation and transliteration have also been proposed (Al-Onaizan and Knight, 2002). [sent-52, score-0.669]

24 Choosing word units is also essential for creating good translation results for languages that do not explicitly mark word boundaries, such as Chinese, Japanese, and Thai. [sent-53, score-0.367]

25 A number of works have dealt with this word segmentation problem in translation, mainly focusing on Chinese-to-English translation (Bai et al. [sent-54, score-0.36]

26 We have enumerated these related works to demonstrate the myriad of data sparsity problems and proposed solutions. [sent-59, score-0.16]

27 Character-based translation has the potential to handle all of the phenomena in the previously mentioned research in a single unified framework, requiring no language specific tools such as morphological analyzers or word segmenters. [sent-60, score-0.451]

28 In this work, we propose effective alignment techniques that allow character-based translation to achieve accurate translation results for both close and distant language pairs. [sent-64, score-0.932]

29 These may be words in word-based alignment models or single characters in character-based alignment models. [sent-70, score-0.616]

30 1 We define our alignment as $a_1^K$, where each element is a span $a_k = \langle s, t, u, v \rangle$ indicating that the target string $e_s$, . [sent-71, score-0.538]

31 1 One-to-Many Alignment The most well-known and widely-used models for bitext alignment are for one-to-many alignment, including the IBM models (Brown et al. [sent-79, score-0.28]

32 These models are by nature directional, attempting to find the alignments that maximize the conditional probability of the target sentence $P(e_1^I \mid f_1^J, a_1^K)$. [sent-82, score-0.218]

33 However, in order for one-to-many alignment methods to be effective, each $f_j$ must contain [Footnote 1: Some previous work has also performed alignment using morphological analyzers to normalize or split the sentence into morpheme streams (Corston-Oliver and Gamon, 2004).] [sent-87, score-0.749]

34 enough information to allow for effective alignment with its corresponding elements in $e_1^I$. [sent-88, score-0.326]

35 2 Many-to-Many Alignment On the other hand, in recent years, there have been advances in many-to-many alignment techniques that are able to align multi-element chunks on both sides of the translation (Marcu and Wong, 2002; DeNero et al. [sent-91, score-0.583]

36 (2011), which uses Bayesian inference in the phrasal inversion transduction grammar (ITG, Wu (1997)) framework. [sent-97, score-0.298]

37 ITGs are a variety of synchronous context free grammar (SCFG) that allows for many-to-many alignment to be achieved in polynomial time through the process of biparsing, which we explain more in the following section. [sent-98, score-0.327]

38 4 Look-Ahead Biparsing In this work, we experiment with the alignment method of Neubig et al. [sent-101, score-0.28]

39 This is important in the character-based translation context, as we would like to use phrases that contain large numbers of characters without creating a phrase table so large that it cannot be used in actual decoding. [sent-103, score-0.401]

40 In this framework, training is performed using sentence-wise [...] Figure 1: (a) A chart with inside probabilities in boxes and forward/backward probabilities marking the surrounding arrows. [sent-104, score-0.421]

41 Lightly and darkly shaded spans will be trimmed when the beam is log(P) ≥ −3 and log(P) ≥ −6 respectively. [sent-106, score-0.231]

42 [...] block sampling, acquiring a sample for each sentence by first performing bottom-up biparsing to create a chart of probabilities, then performing top-down sampling of a new tree based on the probabilities in this chart. [sent-107, score-0.375]

43 Within each cell of the chart spanning $e_s^t$ and $f_u^v$ is an “inside” probability $I(a_{s,t,u,v})$. [sent-109, score-0.339]

44 While the exact calculation of these probabilities can be performed in $O(n^6)$ time, where n is the [Footnote 2: $P_t$ can be specified according to Bayesian statistics as described by Neubig et al.] [sent-111, score-0.224]

45 3 In this section we propose the use of a look-ahead probability to increase the efficiency of this chart parsing. [sent-116, score-0.188]

46 (2009), spans are pushed onto a different queue based on their size, and queues are processed in ascending order of size. [sent-118, score-0.179]

47 Agendas can further be trimmed based on a histogram beam (Saers et al. [sent-119, score-0.151]

48 , 2011) compared to the best hypothesis $\hat{a}$. In other words, we have a queue discipline based on the inside probability, and all spans $a_k$ where $I(a_k) < c\,I(\hat{a})$ are pruned. [sent-121, score-0.376]
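
The pruning rule in this sentence can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function name `prune_spans`, the dictionary layout, and the example probabilities are hypothetical, and only the criterion "prune when I(a_k) < c·I(â)" is taken from the text.

```python
def prune_spans(inside, c):
    """Keep spans whose inside probability is at least c times the best one.

    inside: dict mapping a span (s, t, u, v) to its inside probability I(a).
    c: beam width expressed as a probability ratio, 0 < c <= 1
       (hypothetical parameter name).
    """
    if not inside:
        return {}
    best = max(inside.values())  # I(a-hat): the best hypothesis in this queue
    # A span is pruned when I(a_k) < c * I(a-hat); everything else survives.
    return {span: p for span, p in inside.items() if p >= c * best}


# Hypothetical inside probabilities for three spans in the same queue.
queue = {(0, 1, 0, 1): 1e-2, (0, 1, 1, 2): 1e-4, (1, 2, 1, 2): 1e-7}
kept = prune_spans(queue, c=1e-3)
```

With these made-up numbers the threshold is 1e-5, so the first two spans survive and the third is pruned.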

49 Figure 1(a) provides an example of why it is unwise to ignore competing hypotheses during beam pruning. [sent-125, score-0.163]

50 Particularly, the alignment “les/1960s” competes with the high-probability alignment “les/the,” so intuitively should be a good candidate for pruning. [sent-126, score-0.56]

51 However, its probability is only slightly higher than “années/1960s,” which has no competing hypotheses and thus should not be trimmed. [sent-127, score-0.161]

52 As the calculation of the actual outside probability $O(a_k)$ is just as expensive as parsing itself, it is necessary to approximate this with a heuristic function $O^*$ that can be calculated efficiently. [sent-129, score-0.201]

53 During the calculation of the phrase generation probabilities $P_t$, we save the best inside probability $I^*$ for each monolingual span. [sent-135, score-0.377]

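54 $I_e^*(s,t) = \max_{\{\tilde{a} = \langle \tilde{s}, \tilde{t}, \tilde{u}, \tilde{v} \rangle;\ \tilde{s}=s,\ \tilde{t}=t\}} P_t(\tilde{a})$ and $I_f^*(u,v) = \max_{\{\tilde{a} = \langle \tilde{s}, \tilde{t}, \tilde{u}, \tilde{v} \rangle;\ \tilde{u}=u,\ \tilde{v}=v\}} P_t(\tilde{a})$. For each language independently, we calculate forward probabilities $\alpha$ and backward probabilities $\beta$. [sent-136, score-0.25]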

55 For example, $\alpha_e(s)$ is the maximum probability of the span $(0, s)$ of $e$ that can be created by concatenating together consecutive values of $I_e^*$: $\alpha_e(s) = \max_{\{S_1, \ldots\}}$ [...] [sent-137, score-0.166]

56 Backwards probabilities and probabilities over f can be defined similarly. [sent-144, score-0.25]

57 These probabilities are calculated for $e$ and $f$ independently, and can be calculated in $n^2$ time by processing each $\alpha$ in ascending order, and each $\beta$ in descending order in a fashion similar to that of the forward-backward algorithm. [sent-145, score-0.17]

58 Finally, for any span, we define the outside heuristic as the minimum of the two independent look-ahead probabilities over each language: $O^*(a_{s,t,u,v}) = \min(\alpha_e(s) \cdot \beta_e(t),\ \alpha_f(u) \cdot \beta_f(v))$. [sent-146, score-0.163]
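
To make the look-ahead computation concrete, the sketch below derives max-product forward and backward scores from a table of best monolingual span probabilities and combines them with the min rule above. It is a minimal illustration, not the pialign code: the function names, the dictionary representation of $I^*$, and the toy probabilities are all assumptions introduced here.

```python
def max_forward_backward(best_span, n):
    """Max-product forward/backward scores over one language.

    best_span: dict mapping (i, j), 0 <= i < j <= n, to I*(i, j);
               spans that are absent are treated as probability 0.
    Returns (alpha, beta), where alpha[s] is the best probability of
    covering (0, s) and beta[t] the best probability of covering (t, n).
    """
    alpha = [0.0] * (n + 1)
    beta = [0.0] * (n + 1)
    alpha[0] = 1.0
    for s in range(1, n + 1):            # ascending order
        alpha[s] = max(alpha[i] * best_span.get((i, s), 0.0) for i in range(s))
    beta[n] = 1.0
    for t in range(n - 1, -1, -1):       # descending order
        beta[t] = max(best_span.get((t, j), 0.0) * beta[j]
                      for j in range(t + 1, n + 1))
    return alpha, beta


def outside_heuristic(alpha_e, beta_e, alpha_f, beta_f, s, t, u, v):
    """O*(a_{s,t,u,v}) = min(alpha_e(s) * beta_e(t), alpha_f(u) * beta_f(v))."""
    return min(alpha_e[s] * beta_e[t], alpha_f[u] * beta_f[v])


# Toy example: both strings have length 2; all probabilities are made up.
best_e = {(0, 1): 0.5, (1, 2): 0.4, (0, 2): 0.1}
best_f = {(0, 1): 0.3, (1, 2): 0.2, (0, 2): 0.05}
alpha_e, beta_e = max_forward_backward(best_e, 2)
alpha_f, beta_f = max_forward_backward(best_f, 2)
o_star = outside_heuristic(alpha_e, beta_e, alpha_f, beta_f, 0, 1, 0, 1)
```

Each pass fills n + 1 cells and scans at most n predecessors per cell, which matches the quadratic cost stated in the text.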

59 Looking again at Figure 1(b), it can be seen that the relative probability difference between the highest probability span “les/the” and the spans “années/1960s” and “60/1960s” decreases, allowing for tighter beam pruning without losing these good hypotheses. [sent-147, score-0.511]

60 In this section, we overview an existing method used to calculate these prior probabilities, and also propose a new way to calculate priors based on substring co-occurrence statistics. [sent-150, score-0.312]

61 1 Word-based Priors Previous research on many-to-many translation has used IBM model 1 probabilities to bias phrasal alignments so that phrases whose member words are good translations are also aligned. [sent-152, score-0.675]

62 However, for reasons previously stated in Section 3, these methods are less satisfactory when performing character-based alignment, as the amount of information contained in a character does not allow for proper alignment. [sent-159, score-0.144]

63 2 Substring Co-occurrence Priors Instead, we propose a method for using raw substring co-occurrence statistics to bias alignments towards substrings that often co-occur in the entire training corpus. [sent-161, score-0.411]

64 This is similar to the method of Cromieres (2006), but instead of using these co-occurrence statistics as a heuristic alignment criterion, we incorporate them as a prior probability in a statistical model that can take into account mutual exclusivity of overlapping substrings in a sentence. [sent-162, score-0.56]

65 We define this prior probability using three counts over substrings c(e), c(f), and c(e, f). [sent-163, score-0.304]

66 c(e, f) is a count of the total number of sentences in which the substring e occurs on the target side, and f occurs on the source side. [sent-165, score-0.264]
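
As a rough illustration of the counts c(e), c(f), and c(e, f), the sketch below enumerates substrings naively and counts, per sentence pair, which substrings occur and co-occur. This is not the enhanced-suffix-array approach the paper uses; the length cap `max_len`, the function names, and the toy corpus are assumptions made here to keep the example small, and treating c(e) and c(f) as sentence-level counts is likewise an assumption.

```python
from collections import Counter
from itertools import product


def substrings(text, max_len):
    """Return the set of all substrings of text with length 1..max_len."""
    return {text[i:j]
            for i in range(len(text))
            for j in range(i + 1, min(len(text), i + max_len) + 1)}


def cooccurrence_counts(pairs, max_len=3):
    """Count sentences containing each substring and each substring pair.

    pairs: iterable of (target_sentence, source_sentence) strings.
    Returns (c_e, c_f, c_ef); c_ef[(e, f)] is the number of sentence pairs
    in which e occurs on the target side and f on the source side.
    """
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in pairs:
        e_subs = substrings(e_sent, max_len)
        f_subs = substrings(f_sent, max_len)
        c_e.update(e_subs)   # sets, so each substring counts once per sentence
        c_f.update(f_subs)
        c_ef.update(product(e_subs, f_subs))
    return c_e, c_f, c_ef


# Toy two-sentence corpus (hypothetical data).
corpus = [("the", "les"), ("then", "alors")]
c_e, c_f, c_ef = cooccurrence_counts(corpus)
```

The pair table grows with the product of the two substring sets, which is why storing every c(e, f) quickly becomes impractical on real corpora.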

67 We perform the calculation of these statistics using enhanced suffix arrays, a data structure that can efficiently calculate all substrings in a corpus (Abouelhoda et al. [sent-166, score-0.184]

68 While suffix arrays allow for efficient calculation of these statistics, storing all co-occurrence counts $c(e, f)$ is an unrealistic memory burden for larger [Footnote 4: Using the open-source implementation code [...].] [sent-168, score-0.258]

69 /Z for all substring pairs where $c(e, f) > d$ and where $Z$ is a normalization term equal to $Z = \sum P_{cooc}(e \mid f)\, P_{cooc}(f \mid e)$. [sent-179, score-0.211]

70 It should be noted that as we are using discounting, many substring pairs will be given zero probability according to Pcooc. [sent-181, score-0.315]
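
The normalized prior described in the two sentences above can be sketched as follows. The extracted text does not show how $P_{cooc}$ itself is defined, so the discounted relative frequency used below, (c(e, f) − d) / (c(f) − d) and its mirror image, is an assumption made for illustration, as are the function name and the toy counts. Only the product form, the c(e, f) > d cutoff, and the normalization by Z come from the text.

```python
def cooc_prior(c_e, c_f, c_ef, d):
    """Normalized prior over substring pairs with c(e, f) > d.

    c_e, c_f: dicts of substring counts; c_ef: dict of pair counts.
    d: discount. Pairs at or below it receive zero probability and are
       simply left out of the returned dict.
    """
    scores = {}
    for (e, f), n in c_ef.items():
        if n > d:
            # Assumed discounted conditionals; c(f) >= c(e, f) > d keeps
            # both denominators positive.
            p_e_given_f = (n - d) / (c_f[f] - d)
            p_f_given_e = (n - d) / (c_e[e] - d)
            scores[(e, f)] = p_e_given_f * p_f_given_e
    z = sum(scores.values())  # normalization term Z
    return {pair: s / z for pair, s in scores.items()} if z > 0 else {}


# Hypothetical counts for two target and two source substrings.
c_e = {"a": 4, "b": 2}
c_f = {"x": 4, "y": 2}
c_ef = {("a", "x"): 3, ("b", "y"): 2, ("a", "y"): 1}
prior = cooc_prior(c_e, c_f, c_ef, d=1)
```

In this toy case the pair ("a", "y") occurs only once, so with d = 1 it drops out entirely, which is the zero-probability effect the text mentions.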

71 For word-based translation in the Kyoto task, training was performed using the provided tokenization scripts. [sent-196, score-0.343]

72 In character-based translation, white spaces between words were treated as any other character and not given any special treatment. [sent-298, score-0.184]
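
One common way to feed character strings to a toolkit that expects space-separated tokens is to make every character a token and give the space its own visible symbol. The sketch below shows that idea only. The placeholder `_` and both function names are assumptions introduced here; the paper does not say how its preprocessing was implemented.

```python
SPACE = "_"  # hypothetical placeholder for a literal space character


def to_char_tokens(sentence):
    """Turn a sentence into space-separated character tokens.

    The space is treated like any other character: it becomes a token
    of its own (written as SPACE) instead of being dropped.
    """
    return " ".join(SPACE if ch == " " else ch for ch in sentence)


def from_char_tokens(tokens):
    """Invert to_char_tokens, restoring the original spacing."""
    return "".join(" " if tok == SPACE else tok for tok in tokens.split(" "))


encoded = to_char_tokens("the 1960s")
decoded = from_char_tokens(encoded)
```

Note that this toy scheme would be ambiguous for input that already contains the placeholder character; a real pipeline would need an escape or a symbol outside the corpus alphabet.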

73 For alignment, we use the GIZA++ implementation of one-to-many alignment7 and the pialign implementation of the phrasal ITG models8 modified with the proposed improvements. [sent-300, score-0.178]

74 For GIZA++, we used the default settings for word-based alignment, but used the HMM model for character-based alignment to allow for alignment of longer sentences. [sent-301, score-0.606]

75 For pialign, default settings were used except for character-based ITG alignment, which used a probability beam of [...] instead [...]. For decoding, we use the Moses [...] using the default settings except for the stack size, which we set to 1000 instead of 200. [sent-302, score-0.21]

76 Minimum error rate training was performed to maximize word-based BLEU score for all [...]. For language models, word-based translation uses a word 5-gram model, and character-based translation uses a character 12-gram model, both smoothed using interpolated Kneser-Ney. [sent-303, score-0.83]

77 com/pialign/ [Footnote 9: Improvement by using a beam larger than $10^{-4}$ was marginal, especially with co-occurrence prior probabilities.] [sent-310, score-0.165]

78 2 Quantitative Evaluation Table 2 presents a quantitative analysis of the translation results for each of the proposed methods. [sent-314, score-0.303]

79 We evaluate translation quality using BLEU score (Papineni et al. [sent-316, score-0.303]

80 It can be seen that character-based translation with all of the proposed alignment improvements greatly exceeds character-based translation using one-to-many alignment, confirming that substring-based information is necessary for accurate alignments. [sent-318, score-0.886]

81 When compared with word-based translation, character-based translation achieves better, comparable, or inferior results on character-based BLEU, comparable or inferior results on METEOR, and inferior results on word-based BLEU. [sent-319, score-0.453]

82 The differences between the evaluation metrics are due to the fact that character-based translation often gets words mostly correct other than one or two letters. [sent-320, score-0.303]

83 Interestingly, for translation into English, character-based translation achieves higher accuracy compared to word-based translation on Japanese and Finnish input, followed by German, [...] [sent-322, score-0.909]

84 This confirms that character-based translation is performing well on languages that have long words or ambiguous boundaries, and less well on language pairs with relatively strong one-to-one correspondence between words. [sent-327, score-0.453]

85 Two raters evaluated 100 sentences each, assigning a score of 0-5 based on how well the translation conveys the information contained in the reference. [sent-330, score-0.303]

86 Table 4 shows a breakdown of the sentences for which character-based translation received a score at least 2 points higher than word-based translation. [sent-333, score-0.303]

87 It can be seen that character-based translation is properly handling sparsity phenomena. [sent-334, score-0.418]

88 On the other hand, word-based translation was generally stronger with reordering and lexical choice of more common words. [sent-335, score-0.303]

89 4 Effect of Alignment Method In this section, we compare the translation accuracies for character-based translation using the phrasal ITG model with and without the proposed improvements of substring co-occurrence priors and lookahead parsing as described in Sections 4 and 5. [sent-337, score-0.985]

90 It can be seen that the co-occurrence prior gives gains in all cases, indicating that substring statistics are effectively seeding the ITG aligner. [sent-345, score-0.27]

91 The introduced lookahead probabilities improve accuracy significantly when substring co-occurrence counts are not used, and slightly when co-occurrence counts are used. [sent-346, score-0.46]

92 More importantly, they allow for more aggressive pruning, increasing sampling speed from 1. [sent-347, score-0.215]

93 Conclusion and Future Directions This paper demonstrated that character-based translation can act as a unified framework for handling difficult problems in translation: morphology, compound words, transliteration, and segmentation. [sent-352, score-0.423]

94 One future challenge includes scaling training up to longer sentences, which can likely be achieved through methods such as the heuristic span pruning of Haghighi et al. [sent-353, score-0.155]

95 In addition, error analysis showed that word-based translation performed better than character-based translation on reordering and lexical choice, indicating that improved decoding (or pre-ordering) and language modeling tailored to character-based translation will likely greatly improve accuracy. [sent-357, score-1.08]

96 Unsupervised bilingual morpheme segmentation and alignment with context-rich hidden semi-Markov models. [sent-507, score-0.337]

97 An unsupervised model for joint phrase alignment and extraction. [sent-512, score-0.322]

98 Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. [sent-543, score-0.275]

99 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. [sent-584, score-0.172]

100 Improved statistical machine translation by multiple Chinese word segmentation. [sent-599, score-0.303]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('translation', 0.303), ('alignment', 0.28), ('substring', 0.211), ('neubig', 0.207), ('itg', 0.19), ('pcooc', 0.155), ('ak', 0.143), ('itgs', 0.129), ('saers', 0.129), ('phrasal', 0.126), ('probabilities', 0.125), ('sparsity', 0.115), ('vilar', 0.109), ('beam', 0.106), ('morphological', 0.105), ('probability', 0.104), ('biparsing', 0.103), ('fuv', 0.103), ('character', 0.098), ('kyoto', 0.095), ('meteor', 0.092), ('characterbased', 0.086), ('inversion', 0.086), ('transduction', 0.086), ('chart', 0.084), ('uncommon', 0.083), ('finnish', 0.082), ('spans', 0.08), ('substrings', 0.079), ('ppois', 0.078), ('px', 0.077), ('compound', 0.077), ('pt', 0.07), ('morphology', 0.068), ('talbot', 0.068), ('translating', 0.067), ('bleu', 0.066), ('languages', 0.064), ('transliteration', 0.063), ('sampling', 0.063), ('sornlertlamvanich', 0.062), ('counts', 0.062), ('span', 0.062), ('alignments', 0.061), ('bias', 0.06), ('prior', 0.059), ('blunsom', 0.059), ('bayesian', 0.059), ('calculation', 0.059), ('denero', 0.057), ('competing', 0.057), ('segmentation', 0.057), ('characters', 0.056), ('pruning', 0.055), ('queue', 0.054), ('target', 0.053), ('abouelhoda', 0.052), ('cromieres', 0.052), ('discipline', 0.052), ('inv', 0.052), ('pialign', 0.052), ('pprior', 0.052), ('puni', 0.052), ('tatsuya', 0.052), ('japanese', 0.051), ('smt', 0.05), ('inferior', 0.05), ('splitting', 0.049), ('boundaries', 0.048), ('est', 0.048), ('synchronous', 0.047), ('inside', 0.047), ('eiichiro', 0.046), ('suffix', 0.046), ('allow', 0.046), ('arrays', 0.045), ('shinsuke', 0.045), ('wordbased', 0.045), ('ascending', 0.045), ('myriad', 0.045), ('trimmed', 0.045), ('fj', 0.044), ('graham', 0.044), ('koehn', 0.044), ('unified', 0.043), ('priors', 0.042), ('phrase', 0.042), ('kondrak', 0.041), ('cognates', 0.041), ('naradowsky', 0.041), ('str', 0.041), ('agglutinative', 0.041), ('denkowski', 0.041), ('taro', 0.041), ('inverted', 0.041), ('performed', 0.04), ('brown', 0.039), ('ie', 0.039), ('heuristic', 
0.038), ('compounding', 0.038)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

2 0.26864415 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

Author: Ashish Vaswani ; Liang Huang ; David Chiang

Abstract: Two decades after their invention, the IBM word-based translation models, widely available in the GIZA++ toolkit, remain the dominant approach to word alignment and an integral part of many statistical translation systems. Although many models have surpassed them in accuracy, none have supplanted them in practice. In this paper, we propose a simple extension to the IBM models: an $\ell_0$ prior to encourage sparsity in the word-to-word translation model. We explain how to implement this extension efficiently for large-scale data (also released as a modification to GIZA++) and demonstrate, in experiments on Czech, Arabic, Chinese, and Urdu to English translation, significant improvements over IBM Model 4 in both word alignment (up to +6.7 F1) and translation quality (up to +1.4 BLEU).

3 0.24492949 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

Author: Jingbo Zhu ; Tong Xiao ; Chunliang Zhang

Abstract: This paper presents an unsupervised approach to learning translation span alignments from parallel data that improves syntactic rule extraction by deleting spurious word alignment links and adding new valuable links based on bilingual translation span correspondences. Experiments on Chinese-English translation demonstrate improvements over standard methods for tree-to-string and tree-to-tree translation.

4 0.23870663 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

Author: Ning Xi ; Guangchao Tang ; Xinyu Dai ; Shujian Huang ; Jiajun Chen

Abstract: The dominant practice of statistical machine translation (SMT) uses the same Chinese word segmentation specification in both alignment and translation rule induction steps in building Chinese-English SMT system, which may suffer from a suboptimal problem that word segmentation better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two different segmentation specifications for alignment and translation respectively: we use Chinese character as the basic unit for alignment, and then convert this alignment to conventional word alignment for translation rule induction. Experimentally, our approach outperformed two baselines: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance. 1 Introduction Chinese word segmentation is a necessary step in Chinese-English statistical machine translation (SMT) because Chinese sentences do not delimit words by spaces. The key characteristic of a Chinese word segmenter is the segmentation specification [1]. As depicted in Figure 1(a), the dominant practice of SMT uses the same word segmentation for both word alignment and translation rule induction. For brevity, we will refer to the word segmentation of the bilingual corpus as word segmentation for alignment (WSA for short), because it determines the basic tokens for alignment; and refer to the word segmentation of the aligned corpus as word segmentation for rules (WSR for short), because it determines the basic tokens of translation rules [2], which also determines how the translation rules would be matched by the source sentences. [Footnote 1: We hereafter use “word segmentation” for short.] [Figure 1: WSA and WSR in SMT pipeline. Panels: (a) WSA=WSR; (b) WSA≠WSR.]
It is widely accepted that word segmentation with a higher F-score will not necessarily yield better translation performance (Chang et al., 2008; Zhang et al., 2008; Xiao et al., 2010). Therefore, many approaches have been proposed to learn word segmentation suitable for SMT. These approaches were either complicated (Ma et al., 2007; Chang et al., 2008; Ma and Way, 2009; Paul et al., 2010), or of high computational complexity (Chung and Gildea 2009; Duan et al., 2010). Moreover, they implicitly assumed that WSA and WSR should be equal. This requirement may lead to a suboptimal problem that word segmentation better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses different word segmentation specifications as WSA and WSR respectively, as shown in Figure 1(b). We investigate a solution in this framework: first, we use Chinese character as the basic unit for alignment, viz. character alignment; second, we use a simple method (Elming and Habash, 2007) to convert the character alignment to conventional word alignment for translation rule induction. In the experiment, our approach consistently outperformed two baselines with three different word segmenters: fully word-based system (using word for both alignment and translation) and fully character-based system, in terms of alignment quality and translation performance. The remainder of this paper is structured as follows: Section 2 analyzes the influences of WSA and WSR on SMT respectively; Section 3 discusses how to convert character alignment to word alignment; Section 4 presents experimental results, followed by conclusions and future work in section 5. [Footnote 2: Interestingly, word is also a basic token in syntax-based rules.] [Page footer: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 285–290, Jeju, Republic of Korea, 8-14 July 2012. ©2012 Association for Computational Linguistics]
2 Understanding WSA and WSR We propose a solution to tackle the suboptimal problem: using Chinese character for alignment while using Chinese word for translation. Character alignment differs from conventional word alignment in the basic tokens of the Chinese side of the training corpus [3]. Table 1 compares the token distributions of character-based corpus (CCorpus) and word-based corpus (WCorpus). We see that the WCorpus has a longer-tailed distribution than the CCorpus. More than 70% of the unique tokens appear less than 5 times in WCorpus. However, over half of the tokens appear more than or equal to 5 times in the CCorpus. This indicates that modeling word alignment could suffer more from data sparsity than modeling character alignment. Table 2 shows the numbers of the unique tokens (#UT) and unique bilingual token pairs (#UTP) of the two corpora. Consider two features, fertility and translation features, which are extensively used by many state-of-the-art word aligners. The number of parameters w.r.t. fertility features grows linearly with #UT while the number of parameters w.r.t. translation features grows linearly with #UTP. We compare #UT and #UTP of both corpora in Table 2. As can be seen, CCorpus has fewer UT and UTP than WCorpus, i.e. the character alignment model has a more compact parameterization than the word alignment model, where the compactness of parameterization is shown to be very important in statistical modeling (Collins, 1999).
[Footnote 3: Several works have proposed to use character (letter) on both sides of the parallel corpus for SMT between similar (European) languages (Vilar et al., 2007; Tiedemann, 2009), however, Chinese is not similar to English.]
Table 1: Token distribution of CCorpus and WCorpus
Frequency   Characters (%)   Words (%)
1           27.22            45.39
2           11.13            14.61
3           6.18             6.47
4           4.26             4.32
5(+)        50.21            29.21
Table 2: #UT and #UTP in CCorpus and WCorpus
Stats.   Characters   Words
#UT      9.7K         88.1K
#UTP     15.8M        24.2M
Another advantage of character alignment is the reduction in alignment errors caused by word segmentation errors. For example, “切尼 (Cheney)” and “愿 (will)” are wrongly merged into one word 切尼愿 by the word segmenter, and 切尼愿 wrongly aligns to a comma in English sentence in the word alignment; However, both 切 and 尼 align to “Cheney” correctly in the character alignment. However, this kind of errors cannot be fixed by methods which learn new words by packing already segmented words, such as word packing (Ma et al., 2007) and Pseudo-word (Duan et al., 2010). As character could preserve more meanings than word in Chinese, it seems that a character can be wrongly aligned to many English words by the aligner. However, we found this can be avoided to a great extent by the basic features (co-occurrence and distortion) used by many alignment models. For example, we observed that the four characters of the non-compositional word “阿拉法特 (Arafat)” align to Arafat correctly, although these characters preserve different meanings from that of Arafat. This can be attributed to the frequent co-occurrence (192 times) of these characters and Arafat in CCorpus. Moreover, 法 usually means France in Chinese, thus it may co-occur very often with France in CCorpus. If both France and Arafat appear in the English sentence, 法 may wrongly align to France. However, if 阿 aligns to Arafat, 法 will probably align to Arafat, because aligning 法 to Arafat could result in a lower distortion cost than aligning it to France. Different from alignment, translation is a pattern matching procedure (Lopez, 2008). WSR determines how the translation rules would be matched by the source sentences. For example, if we use translation rules with character as WSR to translate name entities such as the non-compositional word 阿拉法特, i.e. translating literally, we may get a wrong translation. 
A literal translation goes wrong because the linguistic knowledge that the four characters together convey a specific meaning, different from the meanings of the individual characters, has been lost, and it cannot always be fully recovered even by using phrases in phrase-based SMT systems (see Chang et al. (2008) for details). Duan et al. (2010) and Paul et al. (2010) further pointed out that coarser-grained segmentation of the source sentence does help capture more context in translation. Therefore, using a coarser-grained unit than the character as WSR, at least as coarse as the conventional word, is quite necessary.

3 Converting Character Alignment to Word Alignment

In order to use words as WSR, we employ the same method as Elming and Habash (2007)4 to convert the character alignment (CA) to its word-based version (CA′) for translation rule induction. The conversion is very intuitive: for every English-Chinese word pair (e, c) in the sentence pair, we align e to c as a link in CA′ if and only if at least one Chinese character of c aligns to e in CA. Given two different segmentations A and B of the same sentence, it is easy to prove that if every word in A is finer-grained than the word of B at the corresponding position, the conversion is unambiguous (we omit the proof due to space limitations). As a character is finer-grained than its original word, character alignment can always be converted to an alignment based on any word segmentation. Therefore, our approach can be naturally extended to syntax-based systems by converting character alignment to word alignment where the word segmentation is consistent with the parsers. We compare CA with the conventional word alignment (WA) as follows: we hand-aligned some sentence pairs as the evaluation set based on characters (ESChar), and converted it to the evaluation set based on words (ESWord) using the above conversion method.
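The conversion rule described above (an English word links to a Chinese word if and only if it links to at least one of that word's characters) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the index-pair representation of links, and the example sentence are assumptions made for the sketch.

```python
def char_to_word_alignment(char_links, zh_words):
    """Convert a character alignment (CA) to a word alignment (CA').

    char_links: set of (english_word_index, chinese_char_index) pairs.
    zh_words:   the Chinese sentence as a list of words; concatenating
                them must give the character sequence that was aligned.
    Returns a set of (english_word_index, chinese_word_index) pairs.
    """
    # Map every character position to the index of the word containing it.
    char_to_word = []
    for word_index, word in enumerate(zh_words):
        char_to_word.extend([word_index] * len(word))
    # A word-level link exists iff at least one of its characters is linked.
    return {(e, char_to_word[c]) for e, c in char_links}


# Hypothetical example: English "Arafat visits France".
zh_words = ["阿拉法特", "访问", "法国"]
# Character positions: 阿0 拉1 法2 特3 访4 问5 法6 国7.
# The link for character 2 is deliberately missing; one link per word suffices.
ca = {(0, 0), (0, 1), (0, 3), (1, 4), (1, 5), (2, 6), (2, 7)}
ca_prime = char_to_word_alignment(ca, zh_words)
```

Because every character belongs to exactly one word of the target segmentation, the mapping is a function, which is the unambiguity property the text refers to. The same call works for any segmentation of the same character sequence.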
It is worth noting that comparing CA and WA by evaluating CA on ESChar and WA on ESWord is meaningless, because the basic tokens in CA and WA are different. However, with the conversion method, CA and WA can be compared by evaluating both CA′ and WA on ESWord.

Footnote 4: They used this conversion for word alignment combination only; no translation results were reported.

4 Experiments

4.1 Setup

The FBIS corpus (LDC2003E14) (210K sentence pairs) was used for the small-scale task. A large bilingual corpus from our lab (1.9M sentence pairs) was used for the large-scale task. The NIST’06 and NIST’08 test sets were used as the development set and test set, respectively. The Chinese portions of all these data were preprocessed by a character segmenter (CHAR), the ICTCLAS word segmenter5 (ICT), and the Stanford word segmenters with the CTB and PKU specifications6, respectively. The first 100 sentence pairs of the hand-aligned set in Haghighi et al. (2009) were hand-aligned as ESChar, which was converted to three ESWords based on the three segmentations. These ESWords were appended to the training corpus with the corresponding word segmentation for evaluation purposes. Both character and word alignment were performed by GIZA++ (Och and Ney, 2003), enhanced with the gdf heuristics to combine bidirectional alignments (Koehn et al., 2003). A 5-gram language model was trained on the Xinhua portion of the Gigaword corpus. A phrase-based MT decoder similar to Moses (Koehn et al., 2007) was used, with the decoding weights optimized by MERT (Och, 2003).

4.2 Evaluation

We first evaluate the alignment quality. The method discussed in Section 3 was used to compare character and word alignment. As can be seen from Table 3, the systems using characters as WSA outperformed the ones using words as WSA in both the small-scale (rows 3-5) and large-scale (rows 6-8) tasks with all segmentations. This gain can be attributed to the smaller vocabulary size (lower sparsity) of character alignment.
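Alignment quality of the kind reported in Table 3 is scored with precision, recall, and a weighted F-measure over sets of links; the Table 3 caption names the F-score of Fraser and Marcu (2007) with a weight of 0.5. The sketch below assumes the usual form F = 1 / (α/P + (1 − α)/R) and ignores any distinction between sure and possible gold links; the function name and the example links are hypothetical.

```python
def alignment_prf(hypothesis, gold, alpha=0.5):
    """Precision, recall and weighted F-measure of an alignment.

    hypothesis, gold: iterables of (english_index, chinese_index) links.
    alpha: weight on precision; 0.5 gives the balanced F-measure.
    """
    hyp, ref = set(hypothesis), set(gold)
    correct = len(hyp & ref)
    precision = correct / len(hyp) if hyp else 0.0
    recall = correct / len(ref) if ref else 0.0
    if precision == 0.0 or recall == 0.0:
        return precision, recall, 0.0
    f_score = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_score


# Hypothetical word-level gold links (ESWord) and a converted alignment (CA').
gold_links = {(0, 0), (1, 1), (2, 2)}
hyp_links = {(0, 0), (1, 1), (2, 1)}
p, r, f = alignment_prf(hyp_links, gold_links)
```

Scoring both the converted character alignment and the word alignment against the same word-level gold set is what makes the two comparable.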
The observation that a smaller alignment vocabulary helps is consistent with Koehn (2005), which claimed that there is a negative correlation between vocabulary size and translation performance, without explicitly distinguishing WSA and WSR.

Footnote 5: http://www.ictclas.org/
Footnote 6: http://nlp.stanford.edu/software/segmenter.shtml

Table 3 Alignment evaluation. Precision (P), recall (R), and F-score (F) with α = 0.5 (Fraser and Marcu, 2007). [Table body not recoverable from the extracted text.]

Table 4 Translation evaluation of WordSys and proposed system using BLEU-SBP (Chiang et al., 2008). [Table body not recoverable from the extracted text.]

We then evaluated the translation performance. The baselines are fully word-based MT systems (WordSys), i.e., using words as both WSA and WSR, and fully character-based systems (CharSys). Table 4 compares WordSys to our proposed system. Significance testing was carried out using the bootstrap re-sampling method proposed by Koehn (2004) with a 95% confidence level. We see that our proposed systems outperformed WordSys under all segmentation specifications. Table 5 lists the results of CharSys in the small-scale task. In this setting, we gradually set the phrase length and the distortion limit of the phrase-based decoder (the context size) to 7, 9, 11 and 13, in order to remove the disadvantage of the shorter context size when using characters as WSR, for a fair comparison with WordSys, as suggested by Duan et al. (2010). Comparing Tables 4 and 5, we see that all CharSys underperformed WordSys. This observation is consistent with Chang et al. (2008), which claimed that using characters, even with a large phrase length (up to 13 in our experiment), cannot always capture everything a Chinese word segmenter can, and that using words for translation is quite necessary.
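The significance test mentioned above, paired bootstrap re-sampling in the style of Koehn (2004), can be sketched as follows. One simplification is assumed and should be read as such: the sketch compares sums of per-sentence scores, whereas BLEU is a corpus-level metric that would have to be recomputed from sufficient statistics on each resampled test set. The function name, parameters, and scores are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence scores of the two systems on the
    same test sentences, in the same order (a simplifying assumption).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw a test set of the original size, with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        total_a = sum(scores_a[i] for i in sample)
        total_b = sum(scores_b[i] for i in sample)
        if total_a > total_b:
            wins += 1
    return wins / n_samples


# Hypothetical per-sentence scores for two systems.
system_a = [0.30, 0.40, 0.50, 0.45]
system_b = [0.20, 0.30, 0.40, 0.35]
win_rate = paired_bootstrap(system_a, system_b)
```

Under this scheme, system A would be called better at the 95% level when the returned fraction is at least 0.95.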
Comparing Table 5 with Table 4 also shows that CharSys underperformed our proposed systems. This is because the harm of using characters as WSR outweighed the benefit of using characters as WSA, which indicates that a word segmentation that is better for alignment is not necessarily better for translation, and vice versa.

Context Size   7       9       11      13
BLEU           20.90   21.19   20.89   21.09
Table 5 Translation evaluation of CharSys.

We finally compared our approach to Ma et al. (2007) and Ma and Way (2009), which proposed “packed word (PW)” and “bilingual motivated word (BS)”, respectively. Both methods iteratively learn word segmentation and alignment in alternation, with the former starting from a word-based corpus and the latter starting from a character-based corpus. Therefore, PW can be tested on all segmentations. Table 6 lists their results in the small-scale task; we see that both PW and BS underperformed our approach. This may be attributed to the low recall of the learned BS or PW in their approaches. BS underperformed both baselines; one reason is that Ma and Way (2009) also employed word lattice decoding techniques (Dyer et al., 2008) to tackle the low recall of BS, which was removed from our experiments for a fair comparison. Interestingly, we found that using characters as WSA and BS as WSR (Char+BS) achieved a moderate gain (+0.43 point) compared with the fully BS-based system, and using characters as WSA and PW as WSR (Char+PW) achieved significant gains compared with the fully PW-based system; the result with CTB segmentation in this setting even outperformed our proposed approach (+0.42 point). This observation indicates that, in our framework, better combinations of WSA and WSR can be found to achieve better translation performance.

Table 6 Comparison with other works. [Table body not recoverable from the extracted text.]
5 Conclusions and Future Work

We proposed an SMT framework that uses characters for alignment and words for translation, which improved both alignment quality and translation performance. We believe that in this framework, using other finer-grained segmentations, with fewer ambiguities than characters, would better parameterize the alignment models, while using other coarser-grained segmentations as WSR can help capture more linguistic knowledge than words to get better translations. We also believe that our approach, if integrated with combination techniques (Dyer et al., 2008; Xi et al., 2011), can yield better results.

Acknowledgments

We thank the ACL reviewers. This work is supported by the National Natural Science Foundation of China (No. 61003112) and the National Fundamental Research Program of China (2010CB327903).

References

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), pages 263-311.
Pi-Chuan Chang, Michel Galley, and Christopher D. Manning. 2008. Optimizing Chinese word segmentation for machine translation performance. In Proceedings of the Third Workshop on SMT, pages 224-232.
David Chiang, Steve DeNeefe, Yee Seng Chan, and Hwee Tou Ng. 2008. Decomposability of translation metrics for improved evaluation and efficient algorithms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 610-619.
Tagyoung Chung and Daniel Gildea. 2009. Unsupervised tokenization for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 718-726.
Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.
Xiangyu Duan, Min Zhang, and Haizhou Li. 2010. Pseudo-word for phrase-based machine translation. In Proceedings of the Association for Computational Linguistics, pages 148-156.
Christopher Dyer, Smaranda Muresan, and Philip Resnik. 2008. Generalizing word lattice translation. In Proceedings of the Association for Computational Linguistics, pages 1012-1020.
Jakob Elming and Nizar Habash. 2007. Combination of statistical word alignments based on multiple preprocessing schemes. In Proceedings of the Association for Computational Linguistics, pages 25-28.
Alexander Fraser and Daniel Marcu. 2007. Squibs and Discussions: Measuring word alignment quality for statistical machine translation. Computational Linguistics, 33(3), pages 293-303.
Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Association for Computational Linguistics, pages 923-931.
Philipp Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the Association for Computational Linguistics, pages 177-180.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods on Natural Language Processing, pages 388-395.
Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the MT Summit.
Adam David Lopez. 2008. Machine translation by pattern matching. Ph.D. thesis, University of Maryland.
Yanjun Ma, Nicolas Stroppa, and Andy Way. 2007. Bootstrapping word alignment via word packing. In Proceedings of the Association for Computational Linguistics, pages 304-311.
Yanjun Ma and Andy Way. 2009. Bilingually motivated domain-adapted word segmentation for statistical machine translation. In Proceedings of the Conference of the European Chapter of the ACL, pages 549-557.
Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the Association for Computational Linguistics, pages 440-447.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), pages 19-51.
Michael Paul, Andrew Finch, and Eiichiro Sumita. 2010. Integration of multiple bilingually-learned segmentation schemes into statistical machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 400-408.
Jörg Tiedemann. 2009. Character-based PSMT for closely related languages. In Proceedings of the Annual Conference of the European Association for Machine Translation, pages 12-19.
David Vilar, Jan-T. Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the Second Workshop on Statistical Machine Translation, pages 33-39.
Xinyan Xiao, Yang Liu, Young-Sook Hwang, Qun Liu, and Shouxun Lin. 2010. Joint tokenization and translation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1200-1208.
Ning Xi, Guangchao Tang, Boyuan Li, and Yinggong Zhao. 2011. Word alignment combination over multiple word segmentation. In Proceedings of the ACL 2011 Student Session, pages 1-5.
Ruiqiang Zhang, Keiji Yasuda, and Eiichiro Sumita. 2008. Improved statistical machine translation by multiple Chinese word segmentation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 216-223.

5 0.23583663 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

Author: Xiaodong He ; Li Deng

Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.

6 0.21295908 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

7 0.20354275 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

8 0.19792259 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

9 0.18208723 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

10 0.17906803 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

11 0.17525113 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

12 0.15650615 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

13 0.15365882 131 acl-2012-Learning Translation Consensus with Structured Label Propagation

14 0.15236695 134 acl-2012-Learning to Find Translations and Transliterations on the Web

15 0.15130609 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

16 0.147172 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

17 0.14165263 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

18 0.13851537 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

19 0.1376911 19 acl-2012-A Ranking-based Approach to Word Reordering for Statistical Machine Translation

20 0.13575622 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.364), (1, -0.326), (2, 0.12), (3, 0.045), (4, 0.09), (5, -0.002), (6, 0.016), (7, -0.068), (8, 0.003), (9, -0.018), (10, -0.033), (11, -0.095), (12, -0.058), (13, -0.03), (14, 0.046), (15, -0.076), (16, -0.06), (17, 0.028), (18, 0.051), (19, 0.142), (20, -0.087), (21, 0.018), (22, 0.035), (23, -0.022), (24, -0.008), (25, -0.031), (26, -0.086), (27, -0.027), (28, 0.109), (29, -0.096), (30, -0.045), (31, -0.016), (32, -0.017), (33, -0.09), (34, 0.042), (35, 0.11), (36, -0.024), (37, 0.034), (38, -0.063), (39, 0.065), (40, -0.077), (41, -0.019), (42, 0.054), (43, -0.048), (44, -0.082), (45, 0.008), (46, 0.046), (47, 0.048), (48, -0.011), (49, 0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96710342 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

2 0.9023279 81 acl-2012-Enhancing Statistical Machine Translation with Character Alignment

Author: Ning Xi ; Guangchao Tang ; Xinyu Dai ; Shujian Huang ; Jiajun Chen

3 0.87281781 179 acl-2012-Smaller Alignment Models for Better Translations: Unsupervised Word Alignment with the l0-norm

Author: Ashish Vaswani ; Liang Huang ; David Chiang

4 0.81929493 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

5 0.80488819 128 acl-2012-Learning Better Rule Extraction with Translation Span Alignment

Author: Jingbo Zhu ; Tong Xiao ; Chunliang Zhang

6 0.785065 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

7 0.71302682 66 acl-2012-DOMCAT: A Bilingual Concordancer for Domain-Specific Computer Assisted Translation

8 0.6958921 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

9 0.67209274 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

10 0.66088468 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

11 0.65732074 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

12 0.65092641 105 acl-2012-Head-Driven Hierarchical Phrase-based Translation

13 0.63104147 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

14 0.62649071 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

15 0.62108105 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

16 0.61216134 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

17 0.61208349 131 acl-2012-Learning Translation Consensus with Structured Label Propagation

18 0.60586262 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

19 0.58148926 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation

20 0.5783444 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.024), (26, 0.037), (28, 0.082), (30, 0.038), (37, 0.03), (38, 0.01), (39, 0.043), (49, 0.015), (55, 0.01), (57, 0.031), (59, 0.012), (74, 0.045), (82, 0.012), (84, 0.036), (85, 0.041), (90, 0.122), (92, 0.059), (94, 0.06), (95, 0.18), (99, 0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78995967 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

2 0.74511689 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

Author: Jinsong Su ; Hua Wu ; Haifeng Wang ; Yidong Chen ; Xiaodong Shi ; Huailin Dong ; Qun Liu

Abstract: To adapt a translation model trained from the data in one domain to another, previous works paid more attention to the studies of parallel corpus while ignoring the in-domain monolingual corpora which can be obtained more easily. In this paper, we propose a novel approach for translation model adaptation by utilizing in-domain monolingual topic information instead of the in-domain bilingual corpora, which incorporates the topic information into translation probability estimation. Our method establishes the relationship between the out-of-domain bilingual corpus and the in-domain monolingual corpora via topic mapping and phrase-topic distribution probability estimation from in-domain monolingual corpora. Experimental result on the NIST Chinese-English translation task shows that our approach significantly outperforms the baseline system.

3 0.68297136 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

Author: Patrick Simianer ; Stefan Riezler ; Chris Dyer

Abstract: With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.

4 0.68061286 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability

Author: Cristian Danescu-Niculescu-Mizil ; Justin Cheng ; Jon Kleinberg ; Lillian Lee

Abstract: Understanding the ways in which information achieves widespread public awareness is a research question of significant interest. We consider whether, and how, the way in which the information is phrased (the choice of words and sentence structure) can affect this process. To this end, we develop an analysis framework and build a corpus of movie quotes, annotated with memorability information, in which we are able to control for both the speaker and the setting of the quotes. We find that there are significant differences between memorable and non-memorable quotes in several key dimensions, even after controlling for situational and contextual factors. One is lexical distinctiveness: in aggregate, memorable quotes use less common word choices, but at the same time are built upon a scaffolding of common syntactic patterns. Another is that memorable quotes tend to be more general in ways that make them easy to apply in new contexts, that is, more portable. We also show how the concept of “memorable language” can be extended across domains.

1 Hello. My name is Inigo Montoya.

Understanding what items will be retained in the public consciousness, and why, is a question of fundamental interest in many domains, including marketing, politics, entertainment, and social media; as we all know, many items barely register, whereas others catch on and take hold in many people’s minds. An active line of recent computational work has employed a variety of perspectives on this question. Building on a foundation in the sociology of diffusion [27, 31], researchers have explored the ways in which network structure affects the way information spreads, with domains of interest including blogs [1, 11], email [37], on-line commerce [22], and social media [2, 28, 33, 38].
There has also been recent research addressing temporal aspects of how different media sources convey information [23, 30, 39] and ways in which people react differently to information on different topics [28, 36]. Beyond all these factors, however, one’s everyday experience with these domains suggests that the way in which a piece of information is expressed (the choice of words, the way it is phrased) might also have a fundamental effect on the extent to which it takes hold in people’s minds. Concepts that attain wide reach are often carried in messages such as political slogans, marketing phrases, or aphorisms whose language seems intuitively to be memorable, “catchy,” or otherwise compelling. Our first challenge in exploring this hypothesis is to develop a notion of “successful” language that is precise enough to allow for quantitative evaluation. We also face the challenge of devising an evaluation setting that separates the phrasing of a message from the conditions in which it was delivered: highly cited quotes tend to have been delivered under compelling circumstances or fit an existing cultural, political, or social narrative, and potentially what appeals to us about the quote is really just its invocation of these extra-linguistic contexts. Is the form of the language adding an effect beyond or independent of these (obviously very crucial) factors? To investigate the question, one needs a way of controlling as much as possible for the role that the surrounding context of the language plays.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 892–901, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

The present work (i): Evaluating language-based memorability

Defining what makes an utterance memorable is subtle, and scholars in several domains have written about this question.
There is a rough consensus that an appropriate definition involves elements of both recognition (people should be able to retain the quote and recognize it when they hear it invoked) and production (people should be motivated to refer to it in relevant situations) [15]. One suggested reason for why some memes succeed is their ability to provoke emotions [16]. Alternatively, memorable quotes can be good for expressing the feelings, mood, or situation of an individual, a group, or a culture (the zeitgeist): “Certain quotes exquisitely capture the mood or feeling we wish to communicate to someone. We hear them ... and store them away for future use” [10]. None of these observations, however, serve as definitions, and indeed, we believe it desirable to not pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments. In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments (perhaps implicit, but in a form we can extract); and (iii) we can control for the setting in which the text was used. Specifically, we use the complete scripts of roughly 1000 movies, representing diverse genres, eras, and levels of popularity, and consider which lines are the most “memorable”. To acquire memorability labels, for each sentence in each script, we determine whether it has been listed as a “memorable quote” by users of the widely-known IMDb (the Internet Movie Database), and also estimate the number of times it appears on the Web. Both of these serve as memorability metrics for our purposes. When we evaluate properties of memorable quotes, we compare them with quotes that are not assessed as memorable, but were spoken by the same character, at approximately the same point in the same movie.
This enables us to control in a fairly fine-grained way for the confounding effects of context discussed above: we can observe differences that persist even after taking into account both the speaker and the setting. In a pilot validation study, we find that human subjects are effective at recognizing the more IMDb-memorable of two quotes, even for movies they have not seen. This motivates a search for features intrinsic to the text of quotes that signal memorability. In fact, comments provided by the human subjects as part of the task suggested two basic forms that such textual signals could take: subjects felt that (i) memorable quotes often involve a distinctive turn of phrase; and (ii) memorable quotes tend to invoke general themes that aren’t tied to the specific setting they came from, and hence can be more easily invoked for future (out of context) uses. We test both of these principles in our analysis of the data.

The present work (ii): What distinguishes memorable quotes

Under the controlled-comparison setting sketched above, we find that memorable quotes exhibit significant differences from non-memorable quotes in several fundamental respects, and these differences in the data reinforce the two main principles from the human pilot study. First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. Interestingly, this distinctiveness takes place at the level of words, but not at the level of other syntactic features: the part-of-speech composition of memorable quotes is in fact more likely with respect to newswire. Thus, we can think of memorable quotes as consisting, in an aggregate sense, of unusual word choices built on a scaffolding of common part-of-speech patterns.
We also identify a number of ways in which memorable quotes convey greater generality. In their patterns of verb tenses, personal pronouns, and determiners, memorable quotes are structured so as to be more “free-standing,” containing fewer markers that indicate references to nearby text. Memorable quotes differ in other interesting aspects as well, such as sound distributions. Our analysis of memorable movie quotes suggests a framework by which the memorability of text in a range of different domains could be investigated. We provide evidence that such cross-domain properties may hold, guided by one of our motivating applications in marketing. In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. This suggests that some of the principles underlying memorable text have the potential to apply across different areas.

Roadmap

§2 lays the empirical foundations of our work: the design and creation of our movie-quotes dataset, which we make publicly available (§2.1), a pilot study with human subjects validating IMDb-based memorability labels (§2.2), and further study of incorporating search-engine counts (§2.3). §3 details our analysis and prediction experiments, using both movie-quotes data and, as an exploration of cross-domain applicability, slogans data. §4 surveys related work across a variety of fields. §5 briefly summarizes and indicates some future directions.

2 I’m ready for my close-up.

2.1 Data

To study the properties of memorable movie quotes, we need a source of movie lines and a designation of memorability.
Following [8], we constructed a corpus consisting of all lines from roughly 1000 movies, varying in genre, era, and popularity; for each movie, we then extracted the list of quotes from IMDb’s Memorable Quotes page corresponding to the movie.1 A memorable quote in IMDb can appear either as an individual sentence spoken by one character, or as a multi-sentence line, or as a block of dialogue involving multiple characters. In the latter two cases, it can be hard to determine which particular portion is viewed as memorable (some involve a build-up to a punch line; others involve the follow-through after a well-phrased opening sentence), and so we focus in our comparisons on those memorable quotes that appear as a single sentence rather than a multi-line block.2

Footnote 1: This extraction involved some edit-distance-based alignment, since the exact form of the line in the script can exhibit minor differences from the version typed into IMDb.

Figure 1: Location of memorable quotes in each decile of movie scripts (the first 10th, the second 10th, etc.), summed over all movies. The same qualitative results hold if we discard each movie’s very first and last line, which might have privileged status. [Plot not reproduced; axes: decile vs. number of memorable quotes.]

We now formulate a task that we can use to evaluate the features of memorable quotes. Recall that our goal is to identify effects based in the language of the quotes themselves, beyond any factors arising from the speaker or context. Thus, for each (single-sentence) memorable quote M, we identify a non-memorable quote that is as similar as possible to M in all characteristics but the choice of words. This means we want it to be spoken by the same character in the same movie. It also means that we want it to have the same length: controlling for length is important because we expect that on average, shorter quotes will be easier to remember than long quotes, and that wouldn’t be an interesting textual effect to report.
Moreover, we also want to control for the fact that a quote’s position in a movie can affect memorability: certain scenes produce more memorable dialogue, and as Figure 1 demonstrates, in aggregate memorable quotes also occur disproportionately near the beginnings and especially the ends of movies. In summary, then, for each M, we pick a contrasting (single-sentence) quote N from the same movie that is as close in the script as possible to M (either before or after it), subject to the conditions that (i) M and N are uttered by the same speaker, (ii) M and N have the same number of words, and (iii) N does not occur in the IMDb list of memorable quotes for the movie (either as a single line or as part of a larger block).

Footnote 2: We also ran experiments relaxing the single-sentence assumption, which allows for stricter scene control and a larger dataset but complicates comparisons involving syntax. The non-syntax results were in line with those reported here.

Table 1: Three example pairs of movie quotes. [Table body omitted.] Each pair satisfies our criteria: the two component quotes are spoken close together in the movie by the same character, have the same length, and one is labeled memorable by the IMDb while the other is not. (Contractions such as “it’s” count as two words.)

Given such pairs, we formulate a pairwise comparison task: given M and N, determine which is the memorable quote. Psychological research on subjective evaluation [35], as well as initial experiments using ourselves as subjects, indicated that this pairwise set-up is easier to work with than simply presenting a single sentence and asking whether it is memorable or not; the latter requires agreement on an “absolute” criterion for memorability that is very hard to impose consistently, whereas the former simply requires a judgment that one quote is more memorable than another.
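The pairing criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `Line` record, the `word_count` heuristic, and the `pick_contrast` helper are hypothetical names, the script is assumed to be already split into single-sentence lines, and "as close in the script as possible" is approximated by the absolute difference in line index.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Line:
    index: int       # position of the line in the script (assumed field)
    speaker: str     # character who utters the line
    text: str        # a single-sentence line
    memorable: bool  # True if the line is on the IMDb memorable-quotes list


def word_count(text: str) -> int:
    """Whitespace token count; a token with an interior apostrophe counts as two
    words, a rough stand-in for the rule that contractions count as two."""
    total = 0
    for tok in text.split():
        core = tok.strip("'’\".,!?;:")
        total += 2 if ("'" in core or "’" in core) else 1
    return total


def pick_contrast(script: List[Line], m: Line) -> Optional[Line]:
    """Return the closest non-memorable line by the same speaker with the same
    number of words as m, or None if no such line exists."""
    candidates = [
        n for n in script
        if not n.memorable                               # condition (iii)
        and n.speaker == m.speaker                       # condition (i)
        and n.index != m.index
        and word_count(n.text) == word_count(m.text)     # condition (ii)
    ]
    return min(candidates, key=lambda n: abs(n.index - m.index), default=None)
```

When two candidates are equally close, this sketch keeps whichever appears first in the script; the source does not say how such ties were resolved.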
Our main dataset, available at http://www.cs.cornell.edu/∼cristian/memorability.html,3 thus consists of approximately 2200 such (M, N) pairs, separated by a median of 5 same-character lines in the script. The reader can get a sense for the nature of the data from the three examples in Table 1. We now discuss two further aspects to the formulation of the experiment: a preliminary pilot study involving human subjects, and the incorporation of search engine counts into the data.

Footnote 3: Also available there: other examples and factoids.

2.2 Pilot study: Human performance

As a preliminary consideration, we did a small pilot study to see if humans can distinguish memorable from non-memorable quotes, assuming our IMDb-induced labels as gold standard. Six subjects, all native speakers of English and none an author of this paper, were presented with 11 or 12 pairs of memorable vs. non-memorable quotes; again, we controlled for extra-textual effects by ensuring that in each pair the two quotes come from the same movie, are by the same character, have the same length, and appear as nearly as possible in the same scene.4 The order of quotes within pairs was randomized. Importantly, because we wanted to understand whether the language of the quotes by itself contains signals about memorability, we chose quotes from movies that the subjects said they had not seen. (This means that each subject saw a different set of quotes.) Moreover, the subjects were requested not to consult any external sources of information.5 The reader is welcome to try a demo version of the task at http://www.cs.cornell.edu/∼cristian/memorability.html.

Table 2: Human pilot study: number of matches to IMDb-induced annotation, ordered by decreasing match percentage. For the null hypothesis of random guessing, these results are statistically significant, p < 2^-6 ≈ .016.
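One plausible reading of the 2^-6 bound in the Table 2 caption is sketched below; the source does not spell out the derivation, so this is an assumption. If a subject guesses at random, the chance of getting more than half of the pairs right is at most 1/2, so the chance that all six independent subjects do so is at most (1/2)^6. The function name `prob_above_chance` is hypothetical.

```python
from math import comb


def prob_above_chance(num_pairs: int) -> float:
    """Probability that a fair-coin guesser gets more than half of num_pairs right."""
    favorable = sum(comb(num_pairs, k)
                    for k in range(num_pairs // 2 + 1, num_pairs + 1))
    return favorable / 2 ** num_pairs


# Per-subject probabilities for the two pair counts used in the pilot study.
p11 = prob_above_chance(11)  # no tie is possible with an odd number of pairs
p12 = prob_above_chance(12)  # a 6/12 tie absorbs some probability mass

# Upper bound on "all six subjects beat chance" under random guessing.
bound = 0.5 ** 6
print(bound)  # 0.015625
```

With 12 pairs a tie at 6 correct is possible, so the per-subject probability drops below 1/2, which is consistent with the strict inequality in the caption.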
Table 2 shows that all the subjects performed (sometimes much) better than chance, and against the null hypothesis that all subjects are guessing randomly, the results are statistically significant, p < 2^-6 ≈ .016. These preliminary findings provide evidence for the validity of our task: despite the apparent difficulty of the job, even humans who haven’t seen the movie in question can recover our IMDb-induced labels with some reliability.6

Footnote 4: In this pilot study, we allowed multi-sentence quotes.

Footnote 5: We did not use crowd-sourcing because we saw no way to ensure that this condition would be obeyed by arbitrary subjects. We do note, though, that after our research was completed and as of Apr. 26, 2012, ≈ 11,300 people completed the online test: average accuracy: 72%, mode number correct: 9/12.

2.3 Incorporating search engine counts

Thus far we have discussed a dataset in which memorability is determined through an explicit labeling drawn from the IMDb. Given the “production” aspect of memorability discussed in §1, we should also expect that memorable quotes will tend to appear more extensively on Web pages than non-memorable quotes; note that incorporating this insight makes it possible to use the (implicit) judgments of a much larger number of people than are represented by the IMDb database. It therefore makes sense to try using search-engine result counts as a second indication of memorability.

We experimented with several ways of constructing memorability information from search-engine counts, but this proved challenging. Searching for a quote as a stand-alone phrase runs into the problem that a number of quotes are also sentences that people use without the movie in mind, and so high counts for such quotes do not testify to the phrase’s status as a memorable quote from the movie.
On the other hand, searching for the quote in a Boolean conjunction with the movie’s title discards most of these uses, but also eliminates a large fraction of the appearances on the Web that we want to find: precisely because memorable quotes tend to have widespread cultural usage, people generally don’t feel the need to include the movie’s title when invoking them. Finally, since we are dealing with roughly 1000 movies, the result counts vary over an enormous range, from recent blockbusters to movies with relatively small fan bases.

In the end, we found that it was more effective to use the result counts in conjunction with the IMDb labels, so that the counts played the role of an additional filter rather than a free-standing numerical value. Thus, for each pair (M, N) produced using the IMDb methodology above, we searched for each of M and N as quoted expressions in a Boolean conjunction with the title of the movie. We then kept only those pairs for which M (i) produced more than five results in our (quoted, conjoined) search, and (ii) produced at least twice as many results as the corresponding search for N. We created a version of this filtered dataset using each of Google and Bing, and all the main findings were consistent with the results on the IMDb-only dataset. Thus, in what follows, we will focus on the main IMDb-only dataset, discussing the relationship to the dataset filtered by search engine counts where relevant (in which case we will refer to the +Google dataset).

Footnote 6: The average accuracy being below 100% reinforces that context is very important, too.

3 Never send a human to do a machine’s job.

We now discuss experiments that investigate the hypotheses discussed in §1. In particular, we devise methods that can assess the distinctiveness and generality hypotheses and test whether there exists a notion of “memorable language” that operates across domains.
In addition, we evaluate and compare the predictive power of these hypotheses.

3.1 Distinctiveness

One of the hypotheses we examine is whether the use of language in memorable quotes is to some extent unusual. In order to quantify the level of distinctiveness of a quote, we take a language-model approach: we model “common language” using the newswire sections of the Brown corpus [21]7, and evaluate how distinctive a quote is by evaluating its likelihood with respect to this model: the lower the likelihood, the more distinctive. In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed8 language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram, and 3-gram part-of-speech9 LMs.

We find strong evidence that from a lexical perspective, memorable quotes are more distinctive than their non-memorable counterparts. As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive. Interestingly, the reverse is true when it comes to syntax: memorable quotes appear to follow the syntactic patterns of “common language” as closely as or more closely than non-memorable quotes. Together, these results suggest that memorable quotes consist of unusual word sequences built on common syntactic scaffolding.

Footnote 7: Results were qualitatively similar if we used the fiction portions. The age of the Brown corpus makes it less likely to contain modern movie quotes.

Footnote 8: We employ Laplace (additive) smoothing with a smoothing parameter of 0.2. The language models’ vocabulary was that of the entire training corpus.

Footnote 9: Throughout we obtain part-of-speech tags by using the NLTK maximum entropy tagger with default parameters.

Table 3: Percentage of quote pairs in which the memorable quote is more distinctive than the non-memorable one according to the respective “common language” model. Significance according to a two-tailed sign test is indicated using *-notation (∗∗∗=“p<.001”).
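The distinctiveness comparison of §3.1 can be illustrated with a small additive-smoothed n-gram model. The sketch below is an assumption-laden toy, not the authors' code: the class name `LaplaceNgramLM`, the sentence-boundary padding symbols, and the extra vocabulary slot reserved for unseen words are all choices made here for illustration; only the smoothing parameter of 0.2 comes from the text. The same class can be trained on part-of-speech tag sequences to obtain the syntactic models.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


class LaplaceNgramLM:
    """n-gram language model with additive (Laplace-style) smoothing."""

    def __init__(self, n: int, alpha: float = 0.2):
        self.n = n
        self.alpha = alpha          # smoothing parameter (0.2 in the text)
        self.ngrams = Counter()     # counts of full n-grams
        self.contexts = Counter()   # counts of (n-1)-gram contexts
        self.vocab = set()

    def _pad(self, tokens: Sequence[str]) -> List[str]:
        # Boundary symbols are an assumption of this sketch.
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def train(self, sentences: Iterable[Sequence[str]]) -> None:
        for sent in sentences:
            padded = self._pad(sent)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                gram = tuple(padded[i - self.n + 1:i + 1])
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_likelihood(self, tokens: Sequence[str]) -> float:
        v = len(self.vocab) + 1     # one extra slot for unseen words (assumption)
        padded = self._pad(tokens)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            gram = tuple(padded[i - self.n + 1:i + 1])
            num = self.ngrams[gram] + self.alpha
            den = self.contexts[gram[:-1]] + self.alpha * v
            total += math.log(num / den)
        return total


def memorable_is_more_distinctive(lm: LaplaceNgramLM,
                                  memorable: Sequence[str],
                                  non_memorable: Sequence[str]) -> bool:
    """Lower likelihood under the 'common language' model means more distinctive."""
    return lm.log_likelihood(memorable) < lm.log_likelihood(non_memorable)
```

Because the two quotes in a pair have the same number of words, their raw log-likelihoods can be compared directly without length normalization.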
3.2 Generality

Another of our hypotheses is that memorable quotes are easier to use outside the specific context in which they were uttered (that is, more “portable”) and therefore exhibit fewer terms that refer to those settings. We use the following syntactic properties as proxies for the generality of a quote:

• Fewer 3rd-person pronouns, since these commonly refer to a person or object that was introduced earlier in the discourse. Utterances that employ fewer such pronouns are easier to adapt to new contexts, and so will be considered more general.
• More indefinite articles like a and an, since they are more likely to refer to general concepts than definite articles. Quotes with more indefinite articles will be considered more general.
• Fewer past tense verbs and more present tense verbs, since the former are more likely to refer to specific previous events. Therefore utterances that employ fewer past tense verbs (and more present tense verbs) will be considered more general.

Table 4 gives the results for each of these four metrics; in each case, we show the percentage of quote pairs for which the memorable quote scores better on the generality metric.

Table 4: Generality: percentage of quote pairs in which the memorable quote is more general than the non-memorable ones according to the respective metric. Pairs where the metric does not distinguish between the quotes are not considered. [Table body omitted; columns: IMDb-only, +Google.]

Note that because the issue of generality is a complex one for which there is no straightforward single metric, our approach here is based on several proxies for generality, considered independently; yet, as the results show, all of these point in a consistent direction. It is an interesting open question to develop richer ways of assessing whether a quote has greater generality, in the sense that people intuitively attribute to memorable quotes.
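The four generality proxies of §3.2 can be approximated from part-of-speech-tagged tokens. The sketch below makes several assumptions that the source does not state: Penn Treebank tags, a hand-written list of third-person pronouns, `VBD` as the only past-tense tag, and `VBP`/`VBZ` as present tense. All function and constant names are hypothetical. Input is pre-tagged so that the example does not depend on any particular tagger.

```python
from typing import Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (token, Penn Treebank tag); tag set is an assumption

THIRD_PERSON = {
    "he", "she", "it", "they", "him", "her", "them", "his", "hers", "its",
    "their", "theirs", "himself", "herself", "itself", "themselves",
}

# For each metric: True if a LOWER count means a more general quote.
LOWER_IS_MORE_GENERAL = {
    "third_person": True,
    "indefinite": False,
    "past": True,
    "present": False,
}


def generality_counts(tagged: Tagged) -> Dict[str, int]:
    """Count the four proxy features in one tagged quote."""
    counts = {"third_person": 0, "indefinite": 0, "past": 0, "present": 0}
    for token, tag in tagged:
        word = token.lower()
        if tag in ("PRP", "PRP$") and word in THIRD_PERSON:
            counts["third_person"] += 1
        elif tag == "DT" and word in ("a", "an"):
            counts["indefinite"] += 1
        elif tag == "VBD":
            counts["past"] += 1
        elif tag in ("VBP", "VBZ"):
            counts["present"] += 1
    return counts


def memorable_is_more_general(memorable: Tagged, non_memorable: Tagged,
                              metric: str) -> Optional[bool]:
    """Compare a pair on one metric; None when the metric does not distinguish
    the two quotes (such pairs are left out of the percentages)."""
    cm = generality_counts(memorable)[metric]
    cn = generality_counts(non_memorable)[metric]
    if cm == cn:
        return None
    return cm < cn if LOWER_IS_MORE_GENERAL[metric] else cm > cn
```

Raw counts are compared rather than rates because the two quotes in a pair have the same length by construction.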
3.3 “Memorable” language beyond movies

One of the motivating questions in our analysis is whether there are general principles underlying “memorable language.” The results thus far suggest potential families of such principles. A further question in this direction is whether the notion of memorability can be extended across different domains, and for this we collected (and distribute on our website) 431 phrases that were explicitly designed to be memorable: advertising slogans (e.g., “Quality never goes out of style.”). The focus on slogans is also in keeping with one of the initial motivations in studying memorability, namely, marketing applications: in other words, assessing whether a proposed slogan has features that are consistent with memorable text.

Table 5: Percentage of slogans that have higher likelihood under the memorable language model than under the non-memorable one (for each of the six language models considered). Rightmost column: for reference, the percentage of newswire sentences that have higher likelihood under the memorable language model than under the non-memorable one.

Table 6: Slogans are most general when compared to memorable and non-memorable quotes. (%s of 3rd pers. pronouns and indefinite articles are relative to all tokens, %s of past tense are relative to all past and present verbs.) [Table body omitted.]

The fact that it’s not clear how to construct a collection of “non-memorable” counterparts to slogans appears to pose a technical challenge. However, we can still use a language-modeling approach to assess whether the textual properties of the slogans are closer to the memorable movie quotes (as one would conjecture) or to the non-memorable movie quotes. Specifically, we train one language model on memorable quotes and another on non-memorable quotes and compare how likely each slogan is to be produced according to these two models.
As shown in the middle column of Table 5, we find that slogans are better predicted both lexically and syntactically by the former model. This result thus offers evidence for a concept of “memorable language” that can be applied beyond a single domain. We also note that the higher likelihood of slogans under a “memorable language” model is not simply occurring for the trivial reason that this model predicts all other large bodies of text better. In particular, the newswire section of the Brown corpus is predicted better at the lexical level by the language model trained on non-memorable quotes.

Finally, Table 6 shows that slogans employ general language, in the sense that for each of our generality metrics, we see a slogans/memorable-quotes/non-memorable quotes spectrum.

3.4 Prediction task

We now show how the principles discussed above can provide features for a basic prediction task, corresponding to the task in our human pilot study: given a pair of quotes, identify the memorable one.

Our first formulation of the prediction task uses a standard bag-of-words model10. If there were no information in the textual content of a quote to determine whether it were memorable, then an SVM employing bag-of-words features should perform no better than chance. Instead, though, it obtains 59.67% (10-fold cross-validation) accuracy, as shown in Table 7. We then develop models using features based on the measures formulated earlier in this section: generality measures (the four listed in Table 4); distinctiveness measures (likelihood according to 1, 2, and 3-gram “common language” models at the lexical and part-of-speech level for each quote in the pair, their differences, and pairwise comparisons between them); and similarity-to-slogans measures (likelihood according to 1, 2, and 3-gram slogan-language models at the lexical and part-of-speech level for each quote in the pair, their differences, and pairwise comparisons between them).
Even a relatively small number of distinctiveness features, on their own, improve significantly over the much larger bag-of-words model. When we include additional features based on generality and language-model features measuring similarity to slogans, the performance improves further (last line of Table 7). Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who (very much unlike our system) have the full cultural context in which movies occur at their disposal.

Footnote 10: We discarded terms appearing fewer than 10 times.

Table 7: Prediction: SVM 10-fold cross-validation results using the respective feature sets. Random baseline accuracy is 50%. Accuracies statistically significantly greater than bag-of-words according to a two-tailed t-test are indicated with *(p<.05) and **(p<.01). [Table body omitted.]

3.5 Other characteristics

We also made some auxiliary observations that may be of interest. Specifically, we find differences in letter and sound distribution (e.g., memorable quotes, after curse-word removal, use significantly more “front sounds” (labials or front vowels such as represented by the letter i) and significantly fewer “back sounds” such as the one represented by u),11 word complexity (e.g., memorable quotes use words with significantly more syllables) and phrase complexity (e.g., memorable quotes use fewer coordinating conjunctions). The latter two are in line with our distinctiveness hypothesis.

Footnote 11: These findings may relate to marketing research on sound symbolism [7, 19, 40].
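The pairwise prediction task of §3.4 can be set up as binary classification over the difference between the two quotes' feature vectors. The sketch below is a deliberately simplified stand-in: it uses a plain perceptron in place of the SVM reported in the paper, uses only bag-of-words features, and omits the rare-term cutoff and cross-validation. All names (`bow_difference`, `build_examples`, `train_perceptron`, `first_is_memorable`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Features = Dict[str, float]


def bow_difference(first: Sequence[str], second: Sequence[str]) -> Features:
    """Bag-of-words counts of `first` minus those of `second`."""
    diff = Counter(first)
    diff.subtract(Counter(second))
    return {w: float(c) for w, c in diff.items() if c != 0}


def build_examples(pairs: List[Tuple[Sequence[str], Sequence[str]]]
                   ) -> List[Tuple[Features, int]]:
    """Each (memorable, non_memorable) pair yields both presentation orders."""
    examples = []
    for m, n in pairs:
        examples.append((bow_difference(m, n), +1))  # memorable shown first
        examples.append((bow_difference(n, m), -1))  # memorable shown second
    return examples


def score(weights: Features, feats: Features) -> float:
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def train_perceptron(examples: List[Tuple[Features, int]],
                     epochs: int = 10) -> Features:
    """Perceptron updates; a stand-in for the SVM, not the paper's learner."""
    weights: Features = {}
    for _ in range(epochs):
        for feats, label in examples:
            if label * score(weights, feats) <= 0:   # misclassified or on the boundary
                for f, v in feats.items():
                    weights[f] = weights.get(f, 0.0) + label * v
    return weights


def first_is_memorable(weights: Features, first: Sequence[str],
                       second: Sequence[str]) -> bool:
    return score(weights, bow_difference(first, second)) > 0
```

Using difference features makes the prediction antisymmetric by design: swapping the order of the two quotes flips the sign of the score. The richer feature sets described above (likelihoods, generality counts) could be added to the same dictionary representation.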
4 A long time ago, in a galaxy far, far away

How an item’s linguistic form affects the reaction it generates has been studied in several contexts, including evaluations of product reviews [9], political speeches [12], on-line posts [13], scientific papers [14], and retweeting of Twitter posts [36]. We use a different set of features, abstracting the notions of distinctiveness and generality, in order to focus on these higher-level aspects of phrasing rather than on particular lower-level features.

Related to our interest in distinctiveness, work in advertising research has studied the effect of syntactic complexity on recognition and recall of slogans [5, 6, 24]. There may also be connections to von Restorff’s isolation effect (Hunt [17]), which asserts that when all but one item in a list are similar in some way, memory for the different item is enhanced. Related to our interest in generality, Knapp et al. [20] surveyed subjects regarding memorable messages or pieces of advice they had received, finding that the ability to be applied to multiple concrete situations was an important factor.

Memorability, although distinct from “memorizability”, relates to short- and long-term recall. Thorn and Page [34] survey sub-lexical, lexical, and semantic attributes affecting short-term memorability of lexical items. Studies of verbatim recall have also considered the task of distinguishing an exact quote from close paraphrases [3]. Investigations of long-term recall have included studies of culturally significant passages of text [29] and findings regarding the effect of rhetorical devices such as alliteration [4] and “rhythmic, poetic, and thematic constraints” [18, 26].

Finally, there are complex connections between humor and memory [32], which may lead to interactions with computational humor recognition [25].

5 I think this is the beginning of a beautiful friendship.
Motivated by the broad question of what kinds of information achieve widespread public awareness, we studied the effect of phrasing on a quote’s memorability. A challenge is that quotes differ not only in how they are worded, but also in who said them and under what circumstances; to deal with this difficulty, we constructed a controlled corpus of movie quotes in which lines deemed memorable are paired with non-memorable lines spoken by the same character at approximately the same point in the same movie. After controlling for context and situation, memorable quotes were still found to exhibit, on average (there will always be individual exceptions), significant differences from non-memorable quotes in several important respects, including measures capturing distinctiveness and generality. Our experiments with slogans show how the principles we identify can extend to a different domain.

Future work may lead to applications in marketing, advertising and education [4]. Moreover, the subtle nature of memorability, and its connection to research in psychology, suggests a range of further research directions. We believe that the framework developed here can serve as the basis for further computational studies of the process by which information takes hold in the public consciousness, and the role that language effects play in this process.

My mother thanks you. My father thanks you. My sister thanks you. And I thank you: Rebecca Hwa, Evie Kleinberg, Diana Minculescu, Alex Niculescu-Mizil, Jennifer Smith, Benjamin Zimmer, and the anonymous reviewers for helpful discussions and comments; our annotators Steven An, Lars Backstrom, Eric Baumer, Jeff Chadwick, Evie Kleinberg, and Myle Ott; and the makers of Cepacol, Robitussin, and Sudafed, whose products got us through the submission deadline. This paper is based upon work supported in part by NSF grants IIS-0910664, IIS-1016099, Google, and Yahoo!

References

[1] Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
[2] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of KDD, 2006.
[3] Elizabeth Bates, Walter Kintsch, Charles R. Fletcher, and Vittoria Giuliani. The role of pronominalization and ellipsis in texts: Some memory experiments. Journal of Experimental Psychology: Human Learning and Memory, 6(6):676–691, 1980.
[4] Frank Boers and Seth Lindstromberg. Finding ways to make phrase-learning feasible: The mnemonic effect of alliteration. System, 33(2):225–238, 2005.
[5] Samuel D. Bradley and Robert Meeds. Surface-structure transformations and advertising slogans: The case for moderate syntactic complexity. Psychology and Marketing, 19:595–619, 2002.
[6] Robert Chamblee, Robert Gilmore, Gloria Thomas, and Gary Soldow. When copy complexity can help ad readership. Journal of Advertising Research, 33(3):23–23, 1993.
[7] John Colapinto. Famous names. The New Yorker, pages 38–43, 2011.
[8] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 2011.
[9] Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. In Proceedings of WWW, pages 141–150, 2009.
[10] Stuart Fischoff, Esmeralda Cardenas, Angela Hernandez, Korey Wyatt, Jared Young, and Rachel Gordon. Popular movie quotes: Reflections of a people and a culture. In Annual Convention of the American Psychological Association, 2000.
[11] Daniel Gruhl, R. Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion through blogspace. Proceedings of WWW, pages 491–501, 2004.
[12] Marco Guerini, Carlo Strapparava, and Oliviero Stock. Trusting politicians’ words (for persuasive NLP). In Proceedings of CICLing, pages 263–274, 2008.
[13] Marco Guerini, Carlo Strapparava, and Gözde Özbal. Exploring text virality in social networks. In Proceedings of ICWSM (poster), 2011.
[14] Marco Guerini, Alberto Pepe, and Bruno Lepri. Do linguistic style and readability of scientific abstracts affect their virality? In Proceedings of ICWSM, 2012.
[15] Richard Jackson Harris, Abigail J. Werth, Kyle E. Bures, and Chelsea M. Bartel. Social movie quoting: What, why, and how? Ciencias Psicologicas, 2(1):35–45, 2008.
[16] Chip Heath, Chris Bell, and Emily Steinberg. Emotional selection in memes: The case of urban legends. Journal of Personality, 81(6):1028–1041, 2001.
[17] R. Reed Hunt. The subtlety of distinctiveness: What von Restorff really did. Psychonomic Bulletin & Review, 2(1):105–112, 1995.
[18] Ira E. Hyman Jr. and David C. Rubin. Memorabeatlia: A naturalistic study of long-term memory. Memory & Cognition, 18(2):205–214, 1990.
[19] Richard R. Klink. Creating brand names with meaning: The use of sound symbolism. Marketing Letters, 11(1):5–20, 2000.
[20] Mark L. Knapp, Cynthia Stohl, and Kathleen K. Reardon. “Memorable” messages. Journal of Communication, 31(4):27–41, 1981.
[21] Henry Kučera and W. Nelson Francis. Computational analysis of present-day American English. Dartmouth Publishing Group, 1967.
[22] Jure Leskovec, Lada Adamic, and Bernardo Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), May 2007.
[23] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD, pages 497–506, 2009.
[24] Tina M. Lowrey. The relation between script complexity and commercial memorability. Journal of Advertising, 35(3):7–15, 2006.
[25] Rada Mihalcea and Carlo Strapparava. Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence, 22(2):126–142, 2006.
[26] Milman Parry and Adam Parry. The making of Homeric verse: The collected papers of Milman Parry. Clarendon Press, Oxford, 1971.
[27] Everett Rogers. Diffusion of Innovations. Free Press, fourth edition, 1995.
[28] Daniel M. Romero, Brendan Meeder, and Jon Kleinberg. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter. Proceedings of WWW, pages 695–704, 2011.
[29] David C. Rubin. Very long-term memory for prose and verse. Journal of Verbal Learning and Verbal Behavior, 16(5):611–621, 1977.
[30] Nathan Schneider, Rebecca Hwa, Philip Gianfortoni, Dipanjan Das, Michael Heilman, Alan W. Black, Frederick L. Crabbe, and Noah A. Smith. Visualizing topical quotations over time to understand news discourse. Technical Report CMU-LTI-01-103, CMU, 2010.
[31] David Strang and Sarah Soule. Diffusion in organizations and social movements: From hybrid corn to poison pills. Annual Review of Sociology, 24:265–290, 1998.
[32] Hannah Summerfelt, Louis Lippman, and Ira E. Hyman Jr. The effect of humor on memory: Constrained by the pun. The Journal of General Psychology, 137(4), 2010.
[33] Eric Sun, Itamar Rosenn, Cameron Marlow, and Thomas M. Lento. Gesundheit! Modeling contagion through Facebook News Feed. In Proceedings of ICWSM, 2009.
[34] Annabel Thorn and Mike Page. Interactions Between Short-Term and Long-Term Memory in the Verbal Domain. Psychology Press, 2009.
[35] Louis L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927.
[36] Oren Tsur and Ari Rappoport. What’s in a Hashtag? Content based prediction of the spread of ideas in microblogging communities. In Proceedings of WSDM, 2012.
[37] Fang Wu, Bernardo A. Huberman, Lada A. Adamic, and Joshua R. Tyler. Information flow in social groups. Physica A: Statistical and Theoretical Physics, 337(1-2):327–335, 2004.
[38] Shaomei Wu, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. Who says what to whom on Twitter. In Proceedings of WWW, 2011.
[39] Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media. In Proceedings of WSDM, 2011.
[40] Eric Yorkston and Geeta Menon. A sound idea: Phonetic effects of brand names on consumer judgments. Journal of Consumer Research, 31(1):43–51, 2004.

5 0.67713052 136 acl-2012-Learning to Translate with Multiple Objectives

Author: Kevin Duh ; Katsuhito Sudoh ; Xianchao Wu ; Hajime Tsukada ; Masaaki Nagata

Abstract: We introduce an approach to optimize a machine translation (MT) system on multiple metrics simultaneously. Different metrics (e.g. BLEU, TER) focus on different aspects of translation quality; our multi-objective approach leverages these diverse aspects to improve overall quality. Our approach is based on the theory of Pareto Optimality. It is simple to implement on top of existing single-objective optimization methods (e.g. MERT, PRO) and outperforms ad hoc alternatives based on linear-combination of metrics. We also discuss the issue of metric tunability and show that our Pareto approach is more effective in incorporating new metrics from MT evaluation for MT optimization.

6 0.67541349 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

7 0.6700691 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

8 0.66797376 83 acl-2012-Error Mining on Dependency Trees

9 0.66782409 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

10 0.66762739 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

11 0.66672462 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

12 0.66502523 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

13 0.66325605 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning

14 0.66309285 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

15 0.66295964 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

16 0.66258305 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules

17 0.66153854 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

18 0.66139787 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

19 0.65980953 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets

20 0.6594016 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model