emnlp emnlp2010 emnlp2010-40 knowledge-graph by maker-knowledge-mining

40 emnlp-2010-Effects of Empty Categories on Machine Translation


Source: pdf

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Effects of Empty Categories on Machine Translation Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We examine effects that empty categories have on machine translation. [sent-1, score-0.625]

2 Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. [sent-2, score-0.729]

3 We start by training machine translation systems with manually inserted empty elements. [sent-3, score-0.668]

4 We find that inclusion of some empty categories in training data improves the translation result. [sent-4, score-0.728]

5 We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result. [sent-6, score-0.296]

6 1 Introduction An empty category is an element in a parse tree that does not have a corresponding surface word. [sent-7, score-0.625]

7 They include traces, such as Wh-traces, which indicate movement operations in interrogative sentences, and dropped pronouns, which indicate the omission of pronouns in places where pronouns are normally expected. [sent-8, score-0.742]

8 Many treebanks include empty nodes in parse trees to represent non-local dependencies or dropped elements. [sent-9, score-0.807]

9 Examples of the latter include dropped pronouns in the Korean Treebank (Han and Ryu, 2005) and the Chinese Treebank (Xue and Xia, 2000). [sent-12, score-0.355]

10 In languages such as Chinese, Japanese, and Korean, pronouns are frequently or regularly dropped when they are pragmatically inferable. [sent-13, score-0.397]

11 Translating these pro-drop languages into languages such as English, where pronouns are regularly retained, could be problematic because English pronouns have to be generated from nothing. [sent-18, score-0.368]

12 If the learned phrases include pronouns on the target side that are dropped from the source side, the system may be able to insert pronouns even when they are missing from the source language. [sent-24, score-0.595]

13 In this paper, we examine a strategy of automatically inserting two types of empty elements from the Korean and Chinese treebanks as a preprocessing step. [sent-30, score-0.667]

14 Table 1: List of empty categories in the Korean Treebank (top) and the Chinese Treebank (bottom) and their per-sentence frequencies in the training data of the initial experiments. [sent-36, score-0.625]

15 We first describe our experiments with data that have been annotated with empty categories, focusing on zero pronouns and traces such as those used in control constructions. [sent-38, score-0.82]

16 We use these annotations to insert empty elements in a corpus and train a machine translation system to see if they improve translation results. [sent-39, score-0.894]

17 Then, we illustrate different methods we have devised to automatically insert empty elements into a corpus. [sent-40, score-0.688]

18 Finally, we describe our experiments with training machine translation systems with corpora that are automatically augmented with empty elements. [sent-41, score-0.651]

19 2.1 Setup We start by testing the plausibility of our idea of preprocessing a corpus to insert empty categories with ideal datasets. [sent-44, score-0.701]

20 We extract null elements along with tree terminals (words) and train a simple phrase-based system. (Table 2: BLEU scores; each experiment has different empty categories added in.) [sent-48, score-0.89]
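
As an illustrative sketch of this extraction step (our own code, not the authors'): assuming Penn-style bracketed trees in which null elements sit under a -NONE- preterminal, the training-side yield, including the selected null tokens, can be read off directly. The set of retained categories is a parameter of each experiment.

    import re

    def tree_terminals(bracketed, keep=('*pro*', '*PRO*')):
        """Extract terminals from a Penn-style bracketed tree, keeping
        null elements such as *pro* and *PRO* that sit under -NONE-."""
        tokens = []
        # Leaves look like "(TAG word)"; nulls look like "(-NONE- *pro*-1)".
        for tag, word in re.findall(r'\(([^\s()]+)\s+([^\s()]+)\)', bracketed):
            if tag == '-NONE-':
                word = re.sub(r'-\d+$', '', word)  # strip trace indices
                if word not in keep:
                    continue  # drop null categories not selected for training
            tokens.append(word)
        return tokens

    tree = "(IP (NP-SBJ (-NONE- *pro*)) (VP (VV 走) (AS 了)))"
    print(' '.join(tree_terminals(tree)))  # -> *pro* 走 了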

21 *PRO* stands for the empty category used to mark control structures and *pro* indicates dropped pronouns for both Chinese and Korean. [sent-49, score-0.978]

22 There are several different empty categories in the different treebanks. [sent-58, score-0.625]

23 We experimented with leaving different empty categories in or out across experiments to see their effect. [sent-59, score-0.639]

24 We hypothesized that nominal phrasal empty categories such as dropped pronouns may be more useful than other ones, since they are the ones that may be missing in the source language (Chinese and Korean) but have counterparts in the target (English). [sent-60, score-1.0]

25 Table 1 summarizes the empty categories in the Chinese and Korean treebanks and their frequencies in the training data. [sent-61, score-0.716]

26 For the Chinese to English experiment, empty categories that mark control structures (*PRO*), which serve as the subject of a dependent clause, and dropped pronouns (*pro*), which mark omission of pragmatically inferable pronouns, helped improve translation. [sent-65, score-1.116]

27 Table 3: Lexical translation table for *pro* from the Korean-English translation system (left) and a lexical translation table for *PRO* from the Chinese-English translation system (right). [sent-67, score-0.309]

28 For the Korean-English lexical translation table, the left column is English words that are aligned to a dropped pronoun (*pro*) and the right column is the conditional probability of P(e | *pro*). [sent-68, score-0.374]
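
For illustration, such a table can be estimated by relative frequency over word-aligned sentence pairs. The data format below (token lists plus a set of alignment links) is our assumption for the sketch; the paper's actual tables come out of its phrase-based training pipeline.

    from collections import Counter

    def lexical_table(aligned_pairs, source_token='*pro*'):
        """Estimate P(e | source_token) by relative frequency from triples
        (source_tokens, target_tokens, links), where links is a set of
        (src_index, tgt_index) word-alignment pairs."""
        counts = Counter()
        for src, tgt, links in aligned_pairs:
            for i, j in links:
                if src[i] == source_token:
                    counts[tgt[j]] += 1
        total = sum(counts.values())
        return {e: c / total for e, c in counts.most_common()}

    pairs = [
        ('*pro* 학교 에 갔다'.split(), 'he went to school'.split(), {(0, 0), (1, 3)}),
        ('*pro* 왔다'.split(), 'she came'.split(), {(0, 0), (1, 1)}),
    ]
    print(lexical_table(pairs))  # -> {'he': 0.5, 'she': 0.5}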

29 For the Korean to English experiment, the dropped pronoun is the only empty category that seems to improve translation. [sent-72, score-0.762]

30 For the Korean to English experiment, we also tried annotating whether the dropped pronouns are a subject, an object, or a complement using information from the Treebank’s function tags, since English pronouns are inflected according to case. [sent-73, score-0.558]

31 This is possibly due to data sparsity created when dropped pronouns are annotated. [sent-75, score-0.355]

32 Dropped pronouns in subject position were the overwhelming majority (91%), and there were too few dropped pronouns in object position to learn good parameters. [sent-76, score-0.521]

33 2.3 Analysis Table 3 and Table 4 give us a glimpse of why having these empty categories may lead to better translation. [sent-78, score-0.625]

34 Table 3 is the lexical translation table for the dropped pronoun (*pro*) from the Korean to English experiment and the marker for control constructions (*PRO*) from the Chinese to English experiment. [sent-79, score-0.48]

35 For the dropped pronoun in the Korean to English experiment, although there are errors, the table largely reflects expected translations of a dropped pronoun. [sent-80, score-0.423]

36 For the control construction marker in the Chinese to English experiment, the top translation for *PRO* is the English word to, which is expected since Chinese clauses that have control construction markers often translate to English as to-infinitives. [sent-82, score-0.372]

37 Table 4 shows how translations from the system trained with null elements and the system trained without null elements differ. [sent-84, score-0.405]

38 Chinese verbs that follow the empty node for control constructions (*PRO*) are generally translated to English as a verb in to-infinitive form, a gerund, or a nominalized verb. [sent-86, score-0.672]

39 The translation results show that the system trained with this null element (*PRO*) translates verbs that follow the null element largely in such a manner. [sent-87, score-0.339]

40 Experiments in this section showed that preprocessing the corpus to include some empty elements can improve translation results. [sent-90, score-0.759]

41 We also identified which empty categories may be helpful for improving translation for different language pairs. [sent-91, score-0.745]

42 In the next section, we focus on how we add these elements automatically to a corpus that is not annotated with empty elements, for the purpose of preprocessing a corpus for machine translation. [sent-92, score-0.777]

43 3 Recovering empty nodes There are a few previous works that have attempted to restore empty nodes in parse trees using the Penn English Treebank. [sent-93, score-1.214]

44 Johnson (2002) uses rather simple pattern matching to restore empty categories as well as their co-indexed antecedents with surprisingly good accuracy. [sent-94, score-0.744]

45 Gabbard et al. (2006) present a more sophisticated algorithm that tries to recover empty categories in several steps. [sent-96, score-0.671]

46 In each step, one or more empty categories are restored using patterns or classifiers (five maximum-entropy and two perceptron-based classifiers to be exact). [sent-97, score-0.678]

47 Table 4: The first column is the Chinese verb that follows the empty node marker. The second column is the English reference translation. [sent-102, score-0.634]

48 The third column is the translation output from the system that is trained with the empty categories added in. [sent-103, score-0.756]

49 The fourth column is the translation output from the system trained without the empty categories added, which was given the test set without the empty categories. [sent-104, score-1.286]

50 Our initial experiments identified a couple of empty categories that would help machine translation. [sent-106, score-0.625]

51 The linguistic differences and the empty categories we are interested in recovering made the task much harder than it is for English. [sent-108, score-0.732]

52 From this section on, we will discuss only Chinese-English translation because Chinese presents a much more interesting case, since we need to recover two different empty categories that are very similarly distributed. [sent-110, score-0.774]

53 As we have discussed in Section 2, we are interested in recovering dropped pronouns (*pro*) and control construction markers (*PRO*). [sent-115, score-0.602]

54 We have tried three different relatively simple methods so that recovering empty elements would not require any special infrastructure. [sent-116, score-0.778]

55 3.1 Pattern matching Johnson (2002) defines a pattern for empty node recovery to be a minimally connected tree fragment containing an empty node and all nodes co-indexed with it. [sent-118, score-1.368]

56 Table 5 shows the top five patterns that match control constructions (*PRO*) and dropped pronouns (*pro*). [sent-121, score-0.509]

57 The top patterns that match *pro* and *PRO* are exactly the same, since the patterns are matched against parse trees from which the empty nodes have been deleted. [sent-122, score-0.74]

58 When it became apparent that we could not use the same definition of patterns to successfully restore empty categories, we added more context to the patterns. [sent-123, score-0.646]

59 Instead of using minimal tree fragments that matched empty categories, we included the parent and siblings of the minimal tree fragment in the pattern (pattern matching method 1). [sent-125, score-0.735]
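
To make this concrete, here is a small sketch in the spirit of pattern matching method 1, using a nested-tuple tree encoding of our own choosing. For each empty node it records the stripped context (parent label plus sibling labels) together with the position and shape of the node to re-insert, assuming for simplicity that the deleted child dominated only the empty element.

    def is_leaf(node):
        return len(node) == 2 and isinstance(node[1], str)

    def extract_patterns(tree):
        """Collect insertion patterns around empty nodes. Trees are nested
        tuples (label, child, ...); leaves are (tag, word). For the tree
        (IP (NP-SBJ (-NONE- *pro*)) VP PU) this yields the stripped context
        ('IP', 'VP', 'PU') plus the position and node to re-insert."""
        patterns = []
        def walk(node):
            if is_leaf(node):
                return
            label, *children = node
            for i, child in enumerate(children):
                if is_leaf(child):
                    continue
                empties = [g for g in child[1:] if is_leaf(g) and g[0] == '-NONE-']
                if empties and len(child) == 2:  # child dominates only the null
                    context = (label,) + tuple(
                        c[0] for j, c in enumerate(children) if j != i)
                    patterns.append((context, i, child[0], empties[0][1]))
            for child in children:
                walk(child)
        walk(tree)
        return patterns

    tree = ('IP', ('NP-SBJ', ('-NONE-', '*pro*')), ('VP', ('VV', '走')), ('PU', '。'))
    print(extract_patterns(tree))  # -> [(('IP', 'VP', 'PU'), 0, 'NP-SBJ', '*pro*')]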

60 However, as can be seen in Table 5, there is still a lot of overlap between patterns for the two empty categories. [sent-127, score-0.583]

61 However, it is apparent that we can at least choose the pattern that maximizes matches for one empty category and then discard that pattern for the other empty category. [sent-128, score-1.214]

62 In this way, we are able to have more context for patterns such as (VP VV (IP (NP (-NONE- *PRO*)) VP)) by knowing which verb precedes the empty category. [sent-130, score-0.6]

63 The Chinese verb 决定 generally translates to English as to decide and is more often followed by a control construction than by a dropped pronoun. [sent-133, score-0.299]

64 Sentences are parsed without empty nodes, and if a tree fragment (IP VP PU) is encountered in a parse tree, the empty node may be inserted according to the learned pattern (IP (NP-SBJ (-NONE- *pro*)) VP PU). [sent-135, score-1.311]

65 For example, if (IP VP) occurs one hundred times in a treebank that is stripped of empty nodes and if pattern (IP (NP (-NONE- *PRO*)) VP) occurs less than fifty times in the same treebank that is annotated with empty nodes, it is discarded. [sent-143, score-1.34]

66 In cases where there was an overlap between the two empty categories, the pattern was chosen for either *pro* or *PRO*, whichever maximized the number of matches, and then discarded for the other. [sent-145, score-0.591]
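
A sketch of these two filtering steps over counts collected from the stripped and annotated versions of the treebank; the data structures and the argmax-then-threshold ordering are our simplification of the procedure described above.

    from collections import Counter

    def prune_patterns(context_counts, pattern_counts, threshold=0.5):
        """For each stripped context, keep the empty category whose pattern
        fired most often (resolving *pro*/*PRO* overlaps), then discard it
        if it accounts for less than `threshold` of the occurrences of the
        context in the stripped treebank."""
        best = {}
        for (context, category), n in pattern_counts.items():
            if context not in best or n > best[context][1]:
                best[context] = (category, n)
        return {ctx: cat for ctx, (cat, n) in best.items()
                if n / context_counts[ctx] >= threshold}

    context_counts = Counter({('IP', 'VP', 'PU'): 100, ('IP', 'VP'): 100})
    pattern_counts = Counter({
        (('IP', 'VP', 'PU'), '*pro*'): 70,   # kept: fires for 70% of the context
        (('IP', 'VP', 'PU'), '*PRO*'): 20,   # loses the overlap to *pro*
        (('IP', 'VP'), '*PRO*'): 40,         # below the 50% threshold
    })
    print(prune_patterns(context_counts, pattern_counts))
    # -> {('IP', 'VP', 'PU'): '*pro*'}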

67 3.3 Parsing In this approach, we annotated nonterminal symbols in the treebank to include information about empty categories and then extracted a context-free grammar from the modified treebank. [sent-158, score-0.757]

68 We used the Berkeley state-splitting grammar trainer to predict empty categories and recovered the empty categories from the trees. [sent-160, score-1.269]

69 For every empty node, its most immediate ancestor with more than one child was annotated with information about the empty node, and the empty node was deleted. [sent-162, score-2.25]

70 We annotated whether the deleted empty node was *pro* or *PRO* and where it was deleted. [sent-163, score-0.588]

71 Adding where the deleted child had been was necessary because, even though most empty nodes are the first child, there are many exceptions. [sent-164, score-0.581]
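
A sketch of this tree transformation, reusing the nested-tuple encoding from the pattern example above. The label format IP/*pro*@0 is our own invention, and the sketch simplifies by annotating the immediate parent rather than searching upward for an ancestor with more than one child.

    def annotate(tree):
        """Delete children that dominate only an empty leaf and record the
        category and child index on the parent label, e.g. IP -> IP/*pro*@0.
        Trees are nested tuples (label, child, ...); leaves are (tag, word)."""
        if len(tree) == 2 and isinstance(tree[1], str):
            return tree  # ordinary leaf
        label, *children = tree
        kept, deleted = [], []
        for i, child in enumerate(children):
            if (len(child) == 2 and isinstance(child[1], tuple)
                    and child[1][0] == '-NONE-'):
                deleted.append((child[1][1], i))  # remember category and slot
            else:
                kept.append(annotate(child))
        for cat, i in deleted:
            label = f'{label}/{cat}@{i}'  # push the info onto the nonterminal
        return (label, *kept)

    tree = ('IP', ('NP-SBJ', ('-NONE-', '*pro*')), ('VP', ('VV', '走')), ('PU', '。'))
    print(annotate(tree))
    # -> ('IP/*pro*@0', ('VP', ('VV', '走')), ('PU', '。'))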

72 We first extracted a plain context-free grammar after modifying the trees, used the modified grammar to parse the test set, and then tried to recover the empty elements. [sent-165, score-0.725]

73 Although the state-splitting procedure is designed to maximize the likelihood of the parse trees, rather than specifically to predict the empty nodes, learning a refined grammar over the modified trees was also effective in helping to predict empty nodes. [sent-170, score-1.153]

74 However, we are dealing with a different language and different kinds of empty categories. [sent-179, score-0.53]

75 In the next section, we take the best variation of each method, use it to add empty categories to a training corpus, and train machine translation systems to see whether having empty categories can help improve translation in more realistic situations. [sent-181, score-1.382]

76 3.5 Analysis The results reveal many interesting aspects about recovering empty categories. [sent-183, score-0.637]

77 This suggests that rather than tree structure, the local context of words and part-of-speech tags may be more important features for predicting dropped pronouns. [sent-188, score-0.263]
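
One way to operationalize this observation, in the spirit of the CRF method discussed below, is to cast null insertion as token-level tagging: each word is labeled with the empty element (if any) to insert before it, with features drawn from a local window of words and part-of-speech tags. The feature templates here are our own illustration, not the paper's feature set.

    def token_features(words, tags, i):
        """Local-context features for deciding whether *pro* or *PRO*
        should be inserted before position i (edges padded with <S>)."""
        w = lambda k: words[i + k] if 0 <= i + k < len(words) else '<S>'
        t = lambda k: tags[i + k] if 0 <= i + k < len(tags) else '<S>'
        return {
            'w0': w(0), 'w-1': w(-1), 'w+1': w(1),
            't0': t(0), 't-1': t(-1), 't+1': t(1),
            't-1|t0': f'{t(-1)}|{t(0)}',  # tag bigram around the insertion site
        }

    words = '决定 走 了'.split()
    tags = 'VV VV AS'.split()
    print(token_features(words, tags, 0))
    # A linear-chain CRF would then be trained over such feature sequences
    # with labels in {NONE, *pro*, *PRO*}.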

78 It is interesting to note how effective the parser was at predicting empty categories. [sent-190, score-0.556]

79 The split-merge cycles learn grammars that produce better parse trees rather than grammars that predict empty categories more accurately. [sent-194, score-0.678]

80 By modifying this learning process, we may be able to learn grammars that are better suited for predicting empty categories. [sent-195, score-0.556]

81 The same method for recovering null elements was applied to the train- [sent-202, score-0.3]

82 Table 8: Final BLEU scores. F1 scores for recovering empty categories are repeated here for comparison. [sent-210, score-0.732]

83 ing, development, and test sets to insert empty nodes for each experiment. [sent-211, score-0.619]

84 The machine translation system that used training data from the method that was overall the best in predicting empty elements performed the best. [sent-222, score-0.763]

85 The BLEU scores presented in Table 8 represent the best variations of each method we have tried for recovering empty elements. [sent-227, score-0.674]

86 Although the difference was small, when the F1 scores were the same for two variations of a method, it seemed that we could get a slightly better BLEU score with the variation that had higher recall for recovering empty elements. (Footnote 3: We thank an anonymous reviewer for the tip regarding the brevity penalty.) [sent-228, score-0.679]

87 We tried a variation of the experiment where the CRF method is used to recover *pro* and the pattern matching is used to recover *PRO*, since these represent the best methods for recovering the respective empty categories. [sent-230, score-0.901]

88 This experiment suggests that more sophisticated methods should be considered when resolving conflicts created by using heterogeneous methods to recover different empty categories. [sent-239, score-0.602]

89 Table 9 shows five example translations of source sentences in the test set that have one of the empty categories. [sent-240, score-0.549]

90 Since empty categories have been automatically inserted, they are not always in the correct places. [sent-241, score-0.625]

91 5 Conclusion In this paper, we have shown that adding some empty elements can help in building machine translation systems. [sent-243, score-0.634]

92 We showed that we can still benefit from augmenting the training corpus with empty elements even when empty element prediction is less than what would be conventionally considered robust. [sent-244, score-1.185]

93 More comprehensive and sophisticated methods such as that of Gabbard et al. [sent-246, score-0.53]

94 Table 9: The system trained without nulls used a test corpus that did not have empty categories; the system trained with nulls is the system trained with the training corpus and the test corpus that have been automatically augmented with empty categories. [sent-247, score-0.548]

95 (2006) may be necessary for more accurate recovery of empty elements. [sent-250, score-0.554]

96 There are several other issues we may consider when recovering empty categories that are missing in the target language. [sent-253, score-0.752]

97 We only considered empty categories that are present in treebanks. [sent-254, score-0.625]

98 However, there might be some empty elements which are not annotated but are nevertheless helpful for improving machine translation. [sent-255, score-0.651]

99 It may be beneficial to include consideration for empty elements in the decoding process, so that it can benefit from interacting with other elements of the machine translation system. [sent-257, score-0.841]

100 A simple pattern-matching algorithm for recovering empty nodes and their antecedents. [sent-296, score-0.672]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pro', 0.598), ('empty', 0.53), ('ip', 0.236), ('dropped', 0.189), ('vp', 0.185), ('pronouns', 0.166), ('pu', 0.157), ('korean', 0.122), ('recovering', 0.107), ('elements', 0.104), ('translation', 0.103), ('np', 0.103), ('categories', 0.095), ('null', 0.089), ('control', 0.076), ('treebank', 0.075), ('chinese', 0.071), ('pattern', 0.061), ('sites', 0.058), ('vv', 0.054), ('insert', 0.054), ('patterns', 0.053), ('crf', 0.051), ('markers', 0.046), ('recover', 0.046), ('bleu', 0.045), ('node', 0.041), ('cp', 0.041), ('tried', 0.037), ('gabbard', 0.037), ('inserted', 0.035), ('nodes', 0.035), ('marker', 0.035), ('inserting', 0.033), ('restore', 0.031), ('traces', 0.031), ('tree', 0.031), ('column', 0.028), ('matching', 0.027), ('trees', 0.027), ('experiment', 0.026), ('minimally', 0.026), ('predicting', 0.026), ('parse', 0.026), ('pn', 0.026), ('pronoun', 0.026), ('english', 0.025), ('penn', 0.025), ('constructions', 0.025), ('afterwards', 0.024), ('dec', 0.024), ('lcp', 0.024), ('omission', 0.024), ('recovery', 0.024), ('preprocessing', 0.022), ('fragment', 0.022), ('element', 0.021), ('pragmatically', 0.021), ('regularly', 0.021), ('ircs', 0.021), ('brevity', 0.021), ('variation', 0.021), ('modified', 0.021), ('missing', 0.02), ('och', 0.019), ('translations', 0.019), ('grammar', 0.019), ('augmented', 0.018), ('construction', 0.018), ('johnson', 0.018), ('lc', 0.017), ('isozaki', 0.017), ('maybe', 0.017), ('stripped', 0.017), ('conflict', 0.017), ('siblings', 0.017), ('became', 0.017), ('bracketing', 0.017), ('guidelines', 0.017), ('category', 0.017), ('annotated', 0.017), ('exemplified', 0.016), ('bies', 0.016), ('translates', 0.016), ('child', 0.016), ('summarizes', 0.016), ('moses', 0.016), ('parent', 0.016), ('recovered', 0.015), ('rochester', 0.015), ('yamada', 0.015), ('hong', 0.015), ('glish', 0.015), ('terminals', 0.015), ('papineni', 0.015), ('sixth', 0.015), ('immediate', 0.015), ('apparent', 0.015), ('koehn', 0.014), ('experimented', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 40 emnlp-2010-Effects of Empty Categories on Machine Translation

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

2 0.13770299 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

Author: Liang Huang ; Haitao Mi

Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial-time in theory, and linear-time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).

3 0.10751718 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

Author: Eugene Charniak

Abstract: We present a new syntactic parser that works left-to-right and top down, thus maintaining a fully-connected parse tree for a few alternative parse hypotheses. All of the commonly used statistical parsers use context-free dynamic programming algorithms and as such work bottom up on the entire sentence. Thus they only find a complete fully connected parse at the very end. In contrast, both subjective and experimental evidence show that people understand a sentence word-to-word as they go along, or close to it. The constraint that the parser keeps one or more fully connected syntactic trees is intended to operationalize this cognitive fact. Our parser achieves a new best result for top-down parsers of 89.4%, a 20% error reduction over the previous single-parser best result for parsers of this type of 86.8% (Roark, 2001). The improved performance is due to embracing the very large feature set available in exchange for giving up dynamic programming.

4 0.091080323 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

5 0.088959679 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian

Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.

6 0.078203164 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

7 0.076079011 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

8 0.071728803 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

9 0.070621677 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

10 0.06630291 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

11 0.064755388 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

12 0.060333043 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

13 0.059405066 47 emnlp-2010-Example-Based Paraphrasing for Improved Phrase-Based Statistical Machine Translation

14 0.056959093 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

15 0.055194326 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

16 0.052125137 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

17 0.051405031 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

18 0.051083792 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

19 0.05087484 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

20 0.050529733 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, -0.068), (2, 0.084), (3, 0.035), (4, 0.036), (5, 0.05), (6, 0.077), (7, -0.068), (8, -0.114), (9, 0.014), (10, 0.124), (11, -0.071), (12, -0.043), (13, -0.002), (14, -0.006), (15, 0.086), (16, 0.125), (17, 0.039), (18, -0.012), (19, -0.177), (20, -0.217), (21, 0.192), (22, -0.085), (23, 0.005), (24, -0.116), (25, -0.002), (26, -0.072), (27, 0.067), (28, -0.082), (29, 0.036), (30, 0.021), (31, -0.109), (32, 0.003), (33, -0.07), (34, 0.186), (35, -0.169), (36, -0.031), (37, 0.054), (38, 0.055), (39, -0.006), (40, -0.062), (41, -0.021), (42, 0.098), (43, -0.015), (44, -0.045), (45, 0.147), (46, 0.299), (47, 0.017), (48, 0.076), (49, -0.135)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96721536 40 emnlp-2010-Effects of Empty Categories on Machine Translation

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

2 0.51990134 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

Author: Liang Huang ; Haitao Mi

Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial-time in theory, and linear-time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).

3 0.45966741 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

Author: Eugene Charniak

Abstract: We present a new syntactic parser that works left-to-right and top down, thus maintaining a fully-connected parse tree for a few alternative parse hypotheses. All of the commonly used statistical parsers use context-free dynamic programming algorithms and as such work bottom up on the entire sentence. Thus they only find a complete fully connected parse at the very end. In contrast, both subjective and experimental evidence show that people understand a sentence word-to-word as they go along, or close to it. The constraint that the parser keeps one or more fully connected syntactic trees is intended to operationalize this cognitive fact. Our parser achieves a new best result for top-down parsers of 89.4%, a 20% error reduction over the previous single-parser best result for parsers of this type of 86.8% (Roark, 2001). The improved performance is due to embracing the very large feature set available in exchange for giving up dynamic programming.

4 0.36052459 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification

Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng

Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.

5 0.33765724 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

Author: Jackie Chi Kit Cheung ; Gerald Penn

Abstract: Syntactic consistency is the preference to reuse a syntactic construction shortly after its appearance in a discourse. We present an analysis of the WSJ portion of the Penn Treebank, and show that syntactic consistency is pervasive across productions with various lefthand side nonterminals. Then, we implement a reranking constituent parser that makes use of extra-sentential context in its feature set. Using a linear-chain conditional random field, we improve parsing accuracy over the generative baseline parser on the Penn Treebank WSJ corpus, rivalling a similar model that does not make use of context. We show that the context-aware and the context-ignorant rerankers perform well on different subsets of the evaluation data, suggesting a combined approach would provide further improvement. We also compare parses made by models, and suggest that context can be useful for parsing by capturing structural dependencies between sentences as opposed to lexically governed dependencies.

6 0.33570206 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

7 0.29485318 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

8 0.29448214 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

9 0.27840969 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping

10 0.27640936 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

11 0.2611703 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

12 0.25331688 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

13 0.24338672 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

14 0.21889082 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

15 0.21403323 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs

16 0.20740907 5 emnlp-2010-A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages

17 0.20145316 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

18 0.19828147 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

19 0.18230782 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

20 0.18093845 14 emnlp-2010-A Tree Kernel-Based Unified Framework for Chinese Zero Anaphora Resolution


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.011), (12, 0.029), (29, 0.09), (30, 0.038), (52, 0.021), (56, 0.064), (61, 0.018), (62, 0.021), (66, 0.095), (72, 0.047), (76, 0.353), (79, 0.015), (83, 0.022), (87, 0.037), (92, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98061424 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

Author: Aida Khemakhem ; Bilel Gargouri ; Abdelmajid Ben Hamadou

Abstract: Electronic dictionaries covering all natural language levels are very relevant for the human use as well as for the automatic processing use, namely those constructed with respect to international standards. Such dictionaries are characterized by a complex structure and an important access time when using a querying system. However, the need of a user is generally limited to a part of such a dictionary according to his domain and expertise level which corresponds to a specialized dictionary. Given the importance of managing a unified dictionary and considering the personalized needs of users, we propose an approach for generating personalized views starting from a normalized dictionary with respect to the Lexical Markup Framework LMF-ISO 24613 norm. This approach provides the re-use of already defined views for a community of users by managing their profiles information and promoting the materialization of the generated views. It is composed of four main steps: (i) the projection of data categories controlled by a set of constraints (related to the user's profiles), (ii) the selection of values with consistency checking, (iii) the automatic generation of the query's model and finally, (iv) the refinement of the view. The proposed approach was consolidated by carrying out an experiment on an LMF normalized Arabic dictionary.

2 0.96780097 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

Author: Stephen Boxwell ; Dennis Mehay ; Chris Brew

Abstract: In many NLP systems, there is a unidirectional flow of information in which a parser supplies input to a semantic role labeler. In this paper, we build a system that allows information to flow in both directions. We make use of semantic role predictions in choosing a single-best parse. This process relies on an averaged perceptron model to distinguish likely semantic roles from erroneous ones. Our system penalizes parses that give rise to low-scoring semantic roles. To explore the consequences of this we perform two experiments. First, we use a baseline generative model to produce n-best parses, which are then re-ordered by our semantic model. Second, we use a modified version of our semantic role labeler to predict semantic roles at parse time. The performance of this modified labeler is weaker than that of our best full SRL, because it is restricted to features that can be computed directly from the parser’s packed chart. For both experiments, the resulting semantic predictions are then used to select parses. Finally, we feed the selected parses produced by each experiment to the full version of our semantic role labeler. We find that SRL performance can be improved over this baseline by selecting parses with likely semantic roles.

same-paper 3 0.81966227 40 emnlp-2010-Effects of Empty Categories on Machine Translation

Author: Tagyoung Chung ; Daniel Gildea

Abstract: We examine effects that empty categories have on machine translation. Empty categories are elements in parse trees that lack corresponding overt surface forms (words) such as dropped pronouns and markers for control constructions. We start by training machine translation systems with manually inserted empty elements. We find that inclusion of some empty categories in training data improves the translation result. We expand the experiment by automatically inserting these elements into a larger data set using various methods and training on the modified corpus. We show that even when automatic prediction of null elements is not highly accurate, it nevertheless improves the end translation result.

4 0.80304426 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats

Author: Su Nam Kim ; Lawrence Cavedon ; Timothy Baldwin

Abstract: We explore the task of automatically classifying dialogue acts in 1-on-1 online chat forums, an increasingly popular means of providing customer service. In particular, we investigate the effectiveness of various features and machine learners for this task. While a simple bag-of-words approach provides a solid baseline, we find that adding information from dialogue structure and inter-utterance dependency provides some increase in performance; learners that account for sequential dependencies (CRFs) show the best performance. We report our results from testing using a corpus of chat dialogues derived from online shopping customer-feedback data.

5 0.53960586 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

6 0.53723329 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation

7 0.53659773 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

8 0.535353 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

9 0.53350139 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

10 0.52986276 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

11 0.52912569 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

12 0.52833581 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

13 0.52397698 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning

14 0.51458168 114 emnlp-2010-Unsupervised Parse Selection for HPSG

15 0.51390159 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

16 0.50139123 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

17 0.49556851 94 emnlp-2010-SCFG Decoding Without Binarization

18 0.49389496 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

19 0.49284738 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

20 0.48817086 68 emnlp-2010-Joint Inference for Bilingual Semantic Role Labeling