emnlp emnlp2011 emnlp2011-47 knowledge-graph by maker-knowledge-mining

47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation


Source: pdf

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. [sent-5, score-1.088]

2 , 2003), synchronous grammar rules as in (Chiang, 2007) or treelet pairs as in (Quirk et al. [sent-12, score-0.344]

3 They showed that using Suffix Arrays, it was possible to efficiently retrieve all sentences containing substrings of the sentence to be translated, and thus extract the needed translation rules on-the-fly. [sent-17, score-0.226]

4 Since Syntax-Based systems usually work with the parse trees of the source-side sentences, we will need to be able to efficiently retrieve example trees from fragments (treelets) of the parse tree of the sentence we want to translate. [sent-25, score-0.288]

5 A treelet T1 is a subtreelet of another treelet T2 if T1 is itself a connected subgraph of T2. [sent-32, score-0.697]

6 A subtree rooted at node n of a tree T is the treelet formed by n and all of its descendants in T. [sent-38, score-0.175]

7 Furthermore, if T2 and T1 are rooted at the same node in the original tree, we say that T2 is a descending supertreelet of T1. [sent-44, score-0.185]

8 In treelet retrieval, we are given a certain treelet type and want to find all of the tokens of this type in the database DB. [sent-48, score-0.775]

9 Each token of a given treelet type will be identified [sent-49, score-0.320]

10 by a mapping from the nodes of the treelet type to the nodes of the treelet token in the database. [sent-50, score-0.682]

11 No treelet whose immediate subtreelets are all empty is ever put in the discovered queue, which allows us to prune most of the empty treelets of the Inclusion DAG. [sent-57, score-1.232]

12 A treelet in the inclusion DAG can be computed as soon as two of its antecedents have been computed. [sent-58, score-0.32]

13 To start the computation (or rather, “seed” it), it is necessary to know the occurrences of treelets of smaller size. [sent-59, score-0.418]

14 Given a labeled tree T and a node n ∈ T, the path-to-root of n is the sequence of labels from n to the root. [sent-68, score-0.158]

15 We can find the set of occurrences of any linear treelet with a binary search. [sent-75, score-0.388]

16 For example, the treelet a(b) corresponds to the label sequence “ba”. [sent-76, score-0.348]

17 Once the Path-to-Root Array is built, for a linear treelet T, we can find its occurrences by a binary search of the first and last path-to-root starting with the labels of T. [sent-80, score-0.388]
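
As a concrete illustration of this lookup, here is a small Python sketch (not the authors' code; the tree encoding, the function names, and the use of bisect's key argument, which needs Python 3.10+, are our own assumptions):

```python
from bisect import bisect_left, bisect_right

def build_path_to_root_array(trees):
    # trees: list of trees, each tree a list of (label, parent_index) pairs,
    # with parent_index == -1 for the root node.
    entries = []
    for tree_id, tree in enumerate(trees):
        for pos in range(len(tree)):
            path, cur = [], pos
            while cur != -1:                      # collect labels from the node up to the root
                path.append(tree[cur][0])
                cur = tree[cur][1]
            entries.append((tuple(path), tree_id, pos))
    entries.sort()                                # lexicographic sort of the path-to-roots
    return entries

def linear_treelet_occurrences(ptr_array, labels):
    # labels: labels of the linear treelet from its lowest node up to its root,
    # e.g. the treelet a(b) corresponds to ("b", "a").
    prefix = tuple(labels)
    key = lambda e: e[0][:len(prefix)]            # compare only the first len(prefix) labels
    lo = bisect_left(ptr_array, prefix, key=key)  # first path-to-root starting with the labels
    hi = bisect_right(ptr_array, prefix, key=key) # one past the last such path-to-root
    return [(tree_id, pos) for _, tree_id, pos in ptr_array[lo:hi]]
```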

18 5 bytes per pointer in the array (tree id: 4 bytes, start position: 1 byte), and 5 bytes per node to store the database in memory (label id: 4 bytes, parent position: 1 byte). [sent-83, score-0.404]

19 The inverted index associates with every label the set of its occurrences, each occurrence being represented by a tuple containing the index of the tree, the position of the label in the tree, and the position of the parent of the label in the tree. [sent-87, score-0.508]

20 Knowing the position of the parent will allow us to compute treelets of size 2 by intersection (D-coverage). [sent-88, score-0.825]
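
A minimal sketch of this inverted index and of the intersection step, under an assumed (label, parent_position) node encoding that is ours rather than the paper's:

```python
from collections import defaultdict

def build_inverted_index(trees):
    # trees: list of trees, each a list of (label, parent_position) pairs.
    index = defaultdict(list)   # label -> [(tree_id, position, parent_position)]
    for tree_id, tree in enumerate(trees):
        for position, (label, parent) in enumerate(tree):
            index[label].append((tree_id, position, parent))
    return index

def size_two_occurrences(index, parent_label, child_label):
    # Occurrences of the treelet parent_label(child_label): keep the child
    # occurrences whose stored parent position is an occurrence of the parent label.
    parent_positions = {(t, p) for t, p, _ in index[parent_label]}
    return [(t, parent, child) for t, child, parent in index[child_label]
            if (t, parent) in parent_positions]
```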

21 Taking the idea further, we can actually consider the possibility of precomputing treelets of size greater than 1, especially if they appear frequently in the corpus. [sent-90, score-0.792]

22 Postorder traversal: The way we choose the treelet to be popped out on line 3 of algorithm 1 will define different computation strategies. [sent-92, score-0.443]

23 We will process treelets in an order depending on their root node. [sent-94, score-0.777]

24 More precisely, we consider the nodes of the query tree in the order given by a depth-first postorder traversal of the query tree. [sent-95, score-0.499]

25 This way, when a treelet rooted at n is processed, all of the treelets rooted at a descendant of n have already been processed. [sent-96, score-1.147]
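
A small sketch of that node ordering, assuming the query tree is given as a child-adjacency map (an encoding of our choosing):

```python
def postorder_nodes(children, root):
    # children: dict mapping each node of the query tree to the list of its children.
    order = []
    def visit(node):
        for child in children.get(node, []):
            visit(child)
        order.append(node)          # a node is listed only after all of its descendants
    visit(root)
    return order

# For the query tree a(b c d(e)), the nodes come out as b, c, e, d, a.
print(postorder_nodes({"a": ["b", "c", "d"], "d": ["e"]}, "a"))
```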

26 We can suppose that every processed treelet is assigned an index that we note #T. [sent-97, score-0.403]

27 For example, if TQ = “a(b c d(e))” and the treelets “b” and “d(e)” have been assigned the indexes 2 and 4, the recursive representation of the treelet “a(b d(e))” would be [a,(2,0,4)]. [sent-110, score-1.135]
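
To make this bookkeeping concrete, here is a small sketch of the index table and of the recursive representation (helper names are invented for illustration):

```python
L = []           # list of processed treelets (their recursive representations)
index_of = {}    # recursive representation -> index #T

def assign_index(rep):
    # Give a newly processed treelet the next index #T (this plays the role of the list L).
    if rep not in index_of:
        L.append(rep)
        index_of[rep] = len(L)      # 1-based index #T
    return index_of[rep]

# A treelet is (root_label, slots): one slot per child of the root in the query
# tree; a slot is 0 if that child's subtree is absent, otherwise the index #T of
# the already-processed treelet rooted at that child.  For TQ = "a(b c d(e))",
# with "b" and "d(e)" previously assigned the indexes 2 and 4, the treelet
# "a(b d(e))" is:
rep = ("a", (2, 0, 4))
```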

28 DNode is a priority queue containing the treelets rooted at Node discovered so far. [sent-112, score-0.801]

29 The priority queue pops out the smallest treelets first. [sent-113, score-0.811]

30 Line 14 maintains a list L of processed treelets and assigns the index of T in L to #T. [sent-114, score-0.809]

31 Line 22 keeps track of the non-empty immediate supertreelets of every treelet through a dictionary S. [sent-115, score-0.166]

32 This is used in the procedure computesupertreelets (algorithm 3) to generate the immediate supertreelets of a treelet T given its recursive representation. [sent-116, score-0.527]

33 In this procedure, line 6 produces the ... (Algorithm 2: DAG traversal by query-tree postorder; 1: for Node in postorder-traversal(query-tree) do; 2: Telem = [Node, (0, 0, ...)]). [sent-117, score-0.145]

34 A treelet is only processed once all of its immediate subtreelets have been computed, which is optimal for reducing the cost of the ⊗ operation. [sent-123, score-0.507]

35 Discovering the supertreelets from the info in S also has several benefits. [sent-125, score-0.143]

36 Similarly, in the next section, we will be able to prevent the discovery of nonmaximal treelets by modifying S. [sent-127, score-0.778]

37 Pruning non-maximal treelets: We now try to address another aspect of the overwhelming number of potential treelets in a query tree. [sent-131, score-1.623]

38 As we said, in most practical cases, most of the larger treelets in a query tree will be empty. [sent-132, score-0.959]

39 Still, it is possible that some tree exactly identical to the query tree (or some tree having a very large treelet in common with the query tree) does exist in the database. [sent-133, score-0.187]

40 (Algorithm 3: compute-supertrees. Input: T, #T. Output: lst, the list of immediate supertreelets of T. 1: m ← |root(T)|; 2: for i in 1 ...; 8: Append Tnew to lst.) [sent-147, score-0.825]

41 This case is obviously a best case for translation, but unfortunately could be a worst-case for our algorithm, as it means that all of the (possibly trillions of) treelets of the query tree will be computed. [sent-148, score-0.959]

42 The idea is that in most cases where a query tree is “full” (that is, all of its treelets are non-empty), most of the larger treelets will share the same occurrences (in the database trees that are very similar to the query tree). [sent-150, score-2.055]

43 If for every matching M1 of occ(T1), there exists a matching M2 of occ(T2) such that M2|T1 = M1, we say that T1 is dominated by T2. [sent-154, score-0.172]

44 We will therefore be, in general, more interested in the larger treelet T2 and can prune as many nonmaximal treelets as we want in the traversal. [sent-157, score-1.121]

45 We note n(T1 ... Tk) the treelet whose root is n, and for which the k subtrees rooted at the k children of n are T1, ..., Tk. [sent-170, score-0.387]

46 The row “T” represents the treelets in the order in which they are processed. The row “#” is the index #T, and the row “R” is the recursive representation of the treelet. [sent-179, score-0.844]

47 When a treelet is popped out of DNode, occ(T) is ... [sent-181, score-0.320]

48 Let us further suppose that for all i, Ti is dominated by a descending supertreelet (with the possibility that Ti = ...). Then n(T1 ... [sent-193, score-0.160]

49 The procedure computesupertreelets, when called during the processing of the parent node, will thus skip all of the treelets that are “trivially” dominated according to property 4. [sent-201, score-0.843]

50 treelet T by one of its supertreelets TS is not a matter of just testing if |occ(T)| = |occ(TS)|, as would be the case with substrings: a treelet can have fewer occurrences than one of its supertreelets (eg. [sent-205, score-0.531]

51 An efficient way is to first check that the two treelets occur in the same number of sentences, then confirm this with a systematic check of the definition. [sent-207, score-0.749]
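
A hedged sketch of such a two-stage domination check; the occurrence representation (a tree id plus a node mapping) is an assumption on our part:

```python
def is_dominated(occ_T, occ_TS, nodes_of_T):
    # occ_T / occ_TS: occurrence lists; each occurrence is (tree_id, mapping) where
    # mapping sends the nodes of the query treelet to database nodes.  We assume the
    # mappings in occ_T have exactly the nodes of T as keys, and the mappings in
    # occ_TS cover at least those nodes.  T is dominated by TS if every occurrence
    # of T is the restriction to T's nodes of some occurrence of TS.
    # Cheap necessary condition first: same number of distinct sentences (trees).
    if len({t for t, _ in occ_T}) != len({t for t, _ in occ_TS}):
        return False
    # Systematic check of the definition.
    restrictions = {(t, tuple(sorted((n, m[n]) for n in nodes_of_T)))
                    for t, m in occ_TS}
    return all((t, tuple(sorted(m.items()))) in restrictions for t, m in occ_T)
```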

52 In a dependency tree, nodes are labeled by words and most non-elementary treelets have a small number of occurrences. [sent-211, score-0.791]

53 In a constituent tree, many treelets containing only internal nodes have a high frequency and will be expensive to compute. [sent-212, score-0.818]

54 However, it is usually not very interesting to retrieve all the occurrences of treelets such as “NP(Det NN)” in the context of an MT system. [sent-214, score-0.869]

55 More precisely, we want to efficiently retrieve treelets containing at least one leaf of the query tree. [sent-217, score-1.02]

56 Therefore, an alternative computation strategy would only explore treelets containing at least one terminal node. [sent-218, score-0.806]

57 Complexity: Processing time will be mainly dependent on two factors: the number of treelets in a query tree that need to be computed, and the average time to compute a treelet. [sent-221, score-0.983]

58 It can be shown quite easily that the time needed to compute a treelet with our method is proportional to its number of occurrences, which is itself growing as O(NC). [sent-223, score-0.344]

59 The number of treelets needing to be computed is, in the worst case, exponential in m. [sent-225, score-0.749]

60 In practice, the only case where most of the treelets are non-empty is when the database contains trees similar to the query tree, and this is handled by the modification of the algorithm in section 4. [sent-226, score-1.113]

61 In other cases, most of the treelets are empty, and empirically, we find that the number of non-empty treelets in a query tree ... [sent-228, score-1.708]

62 It is also possible to bound the size of the retrieved treelets (only retrieving treelets with fewer than 10 nodes, for example), similarly to what is done in (Lopez, 2007). [sent-236, score-1.554]

63 The number of treelets will then only grow as O(m). [sent-237, score-0.749]

64 The total processing time of a given query tree will therefore be on the order of O(m · NC1. [sent-238, score-0.21]

65 The fact that this gives a complexity worse than linear with respect to the database size might seem a concern, but this is actually only because we are retrieving more and more different types of treelets. [sent-240, score-0.168]

66 The cost of retrieving one treelet remains linear with respect to the size of the corpus. [sent-241, score-0.376]

67 It should also be noted that the constant hidden in the big-O notation can be (almost) arbitrarily reduced by precomputing more and more of the most common (and more expensive) treelets (a time-memory trade-off). [sent-243, score-0.792]

68 The largest trees in the database have around 100 nodes. [sent-246, score-0.154]

69 We computed, using our algorithm, 100 randomly selected query trees having from 10 to 70 nodes, with an average of 27 nodes per tree. [sent-248, score-0.209]

70 2 brings virtually no overhead and gives similar performance whether the query tree is in the database or not (effectively reducing the worst-case computation time from days to seconds). [sent-261, score-0.381]

71 9 million dependency trees, we automatically extracted STSG rules of size smaller than 6 and stored them in a database, considering that extracting rules of larger sizes would lead to an unmanageable database size. [sent-266, score-0.160]

72 We compared MT results using only the rules of size smaller than 6 to using all the rules computed on-the-fly after treelet retrieval by our method. [sent-267, score-0.424]

73 We consider that each node of the query tree and database is labeled by 2 labels (or more) of different generality. [sent-270, score-0.373]

74 We want to retrieve treelets that match by word or POS with the query tree. [sent-272, score-0.949]

75 Processing multi-label trees: To do this, the inverted index will just need to include entries for both words and POS. [sent-274, score-0.213]

76 For example, the dependency tree “likes,V:1 (Paul,N:0 Salmon,N:2 (and,CC:3 (Tuna,N:4)))” would produce the following (node, parent) entries in the inverted index: {N: [(0,1) (2,1) (4,3)], Paul: [(0,1)], Salmon:[(2,1)], ... [sent-275, score-0.196]
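
A sketch of how such a word-and-POS index could be built; the node encoding below is our own reading of the quoted example:

```python
from collections import defaultdict

def build_multilabel_index(trees):
    # trees: list of trees; each node is (word, pos_tag, parent_position).
    # Both the word and the POS of a node point at the same (position, parent)
    # pair, so a query node can later be matched by either label.
    index = defaultdict(list)
    for tree_id, tree in enumerate(trees):
        for position, (word, pos_tag, parent) in enumerate(tree):
            index[word].append((tree_id, position, parent))
            index[pos_tag].append((tree_id, position, parent))
    return index

# The quoted tree "likes,V:1 (Paul,N:0 Salmon,N:2 (and,CC:3 (Tuna,N:4)))",
# with -1 marking the parent of the root:
tree = [("Paul", "N", 1), ("likes", "V", -1), ("Salmon", "N", 1),
        ("and", "CC", 2), ("Tuna", "N", 3)]
index = build_multilabel_index([tree])
# index["N"] == [(0, 0, 1), (0, 2, 1), (0, 4, 3)]  (tree 0; positions and parents as above)
```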

77 We actually want to compute all of the treelets of a query tree TQ labeled by words and POS (meaning each node can be matched by either word or POS). [sent-280, score-1.006]

78 First, we modify the recursive representation of a treelet so that it also includes the chosen label of its root node. [sent-282, score-0.411]

79 For each treelet of size m we would have had in a single-label query tree, we now virtually have 2^m treelets. [sent-286, score-0.473]

80 We therefore need to tell the algorithm that some treelets are more important than some others. [sent-291, score-0.749]

81 While we have used the Computation Hypertree representation to compute treelets efficiently, we can also use it to prioritize the treelets we want to compute. [sent-292, score-1.545]

82 We can then modify our traversal strategy of the Inclusion DAG to compute treelets having the biggest weights first: we just need to specify that the treelet popped out on line 3 is the treelet with the highest score (more generally, we could consider an A* search). [sent-297, score-1.506]

83 Experiments: Using the above ideas, we have run some experiments computing query dependency trees labeled with both words and POS. [sent-299, score-0.167]

84 We score the treelets by giving them a penalty of -1 for each POS they contain, and stop the search when all remaining treelets have a score lower than -2 (in other words, treelets are allowed at most 2 POS-matchings). [sent-300, score-2.247]

85 In line 3 of algorithm 2, elementary treelets are assigned a weight of 0 or -1 depending on whether their label is a word or POS. [sent-303, score-0.839]

86 Line 5 is replaced by “pop the first treelet with minimal weight and break the loop if the minimal weight is inferior to -2”. [sent-304, score-0.320]
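
A sketch of this score-driven traversal using a heap; the expand callback and the treelet representation are placeholders rather than the paper's actual data structures:

```python
import heapq

def best_first_treelets(initial, expand, threshold=-2):
    # initial: iterable of (score, treelet); elementary treelets score 0 if matched
    # by word and -1 if matched by POS.  expand(score, treelet) yields scored
    # supertreelets (one extra -1 penalty per additional POS node).
    counter = 0                                   # tie-breaker so heapq never compares treelets
    heap = []
    for score, treelet in initial:
        heap.append((-score, counter, treelet))
        counter += 1
    heapq.heapify(heap)
    while heap:
        neg_score, _, treelet = heapq.heappop(heap)
        score = -neg_score
        if score < threshold:                     # every remaining treelet scores even worse
            break
        yield score, treelet                      # best-scoring treelets are computed first
        for new_score, new_treelet in expand(score, treelet):
            heapq.heappush(heap, (-new_score, counter, new_treelet))
            counter += 1
```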

87 Table 7 shows the increase in the size of the biggest non-empty treelets when allowing 2 nodes to be matched by POS. [sent-306, score-0.791]

88 It also shows the impact on BLEU score of using these additional treelets for onthe-fly rule generation in our simple MT system. [sent-307, score-0.749]

89 Improvement on BLEU is limited, but it might be due to a very experimental handling of approximately matched treelet examples in our MT system. [sent-308, score-0.32]

90 This is due to the increased number of treelets to be computed, and to the fact that POS-labeled elementary treelets have a high number of occurrences. [sent-310, score-1.537]

91 Our method allows the use of packed representations of both the query tree and the database. [sent-318, score-0.28]

92 For example, the inverted index of a database containing the packed forest of figure 8 would contain the following entries: {held: [(1,10a),(1,10b)], NP: [(6,9),(7,9),(9,10a)], VP:[(10,N)], PP:[(8,10b)], a:[(2,6)], talk:[(3,6)], with:[(4,7) (4,8)], Sharon:[(5,7) (5,8)]}. [sent-320, score-0.417]
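
A rough sketch of such a packed-forest index; the forest encoding below is our own reconstruction of the quoted entries:

```python
from collections import defaultdict

def index_packed_forest(forest):
    # forest: iterable of (label, node_id, possible_parent_ids).  A node of a packed
    # forest can hang under several possible parents, so it gets one (node, parent)
    # entry per possible parent; the search code then works unchanged.
    index = defaultdict(list)
    for label, node_id, parents in forest:
        for parent in parents:
            index[label].append((node_id, parent))
    return index

# Encoding of the figure-8 example quoted above ("N" marks the forest root):
forest = [("held", 1, ["10a", "10b"]), ("a", 2, [6]), ("talk", 3, [6]),
          ("with", 4, [7, 8]), ("Sharon", 5, [7, 8]), ("NP", 6, [9]),
          ("NP", 7, [9]), ("PP", 8, ["10b"]), ("NP", 9, ["10a"]),
          ("VP", 10, ["N"])]
index = index_packed_forest(forest)    # index["with"] == [(4, 7), (4, 8)]
```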

93 The inverted index can now be used to search in the trees contained in a packed forest database without any modification. [sent-323, score-0.432]

94 Modifications to the algorithm in order to handle a packed forest query are similar to the ones developed in section 6. [sent-324, score-0.232]

95 This opens the possibility of computing the occurrences of discontinuous treelets in much the same way as is done in (Lopez, 2007) for discontinuous substrings. [sent-329, score-0.921]

96 , 2009) describes a method for efficiently matching precomputed treelet rules. [sent-336, score-0.896]

97 These rules are organized in a kind of prefix tree that allows efficient matching of packed forests. [sent-337, score-0.221]

98 , 2006) also propose a greedy algorithm for matching TSC rules to a query tree. [sent-339, score-0.191]

99 Conclusion and future work: We have presented a method for efficiently retrieving examples of treelets contained in a query tree, thus allowing on-the-fly computation of translation rules for Syntax-Based systems. [sent-340, score-1.074]

100 Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. [sent-441, score-0.224]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('treelets', 0.749), ('treelet', 0.32), ('occ', 0.242), ('supertreelets', 0.143), ('query', 0.125), ('database', 0.112), ('inverted', 0.111), ('arrays', 0.098), ('tree', 0.085), ('supertreelet', 0.071), ('traversal', 0.07), ('array', 0.07), ('packed', 0.07), ('suffix', 0.069), ('occurrences', 0.068), ('ts', 0.068), ('dominated', 0.065), ('bytes', 0.061), ('precomputed', 0.061), ('index', 0.06), ('dnode', 0.057), ('subtreelet', 0.057), ('utiyama', 0.057), ('retrieving', 0.056), ('retrieve', 0.052), ('postorder', 0.052), ('discontinuous', 0.052), ('node', 0.051), ('translation', 0.046), ('empty', 0.045), ('efficiently', 0.044), ('immediate', 0.044), ('lopez', 0.044), ('ebmt', 0.043), ('precomputing', 0.043), ('tnew', 0.043), ('trees', 0.042), ('matching', 0.042), ('nodes', 0.042), ('kyoto', 0.041), ('rooted', 0.039), ('elementary', 0.039), ('db', 0.039), ('forest', 0.037), ('nakazawa', 0.037), ('pop', 0.037), ('tq', 0.037), ('recursive', 0.035), ('flexible', 0.034), ('append', 0.033), ('ascending', 0.033), ('dog', 0.033), ('substrings', 0.033), ('indexes', 0.031), ('kurohashi', 0.031), ('computation', 0.03), ('modifications', 0.03), ('substring', 0.03), ('page', 0.029), ('tot', 0.029), ('dag', 0.029), ('tk', 0.029), ('performances', 0.029), ('parent', 0.029), ('computesupertreelets', 0.029), ('domination', 0.029), ('fabien', 0.029), ('manber', 0.029), ('nonmaximal', 0.029), ('occc', 0.029), ('pointer', 0.029), ('precomputation', 0.029), ('pty', 0.029), ('salmon', 0.029), ('stsg', 0.029), ('subtreelets', 0.029), ('telem', 0.029), ('tid', 0.029), ('label', 0.028), ('root', 0.028), ('containing', 0.027), ('nc', 0.027), ('zhang', 0.026), ('queue', 0.025), ('maximal', 0.025), ('neo', 0.025), ('roots', 0.025), ('concreteness', 0.025), ('compute', 0.024), ('rules', 0.024), ('else', 0.024), ('descending', 0.024), ('every', 0.023), ('position', 0.023), ('line', 0.023), ('want', 0.023), ('iaf', 0.022), ('tsr', 0.022), ('sigmod', 0.022), ('manageable', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999905 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

2 0.087752275 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

Author: Jun Xie ; Haitao Mi ; Qun Liu

Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful to improve translation quality. However, previous works on dependency structure based models typically resort to insertion operations to complete translations, which make it difficult to specify ordering information in translation rules. In our model of this paper, we handle this problem by directly specifying the ordering information in head-dependents rules which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only substitution operation, thus our model requires no heuristics or separate ordering models of the previous works to control the word order of translations. Large-scale experiments show that our model performs well on long distance reordering, and outperforms the state- of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrasebased model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resort to phrases or parse forest. For the first time, a source dependency structure based model catches up with and surpasses the state-of-theart translation models.

3 0.070589766 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

Author: Wei Lu ; Hwee Tou Ng

Abstract: This paper describes a novel probabilistic approach for generating natural language sentences from their underlying semantics in the form of typed lambda calculus. The approach is built on top of a novel reduction-based weighted synchronous context free grammar formalism, which facilitates the transformation process from typed lambda calculus into natural language sentences. Sentences can then be generated based on such grammar rules with a log-linear model. To acquire such grammar rules automatically in an unsupervised manner, we also propose a novel approach with a generative model, which maps from sub-expressions of logical forms to word sequences in natural language sentences. Experiments on benchmark datasets for both English and Chinese generation tasks yield significant improvements over results obtained by two state-of-the-art machine translation models, in terms of both automatic metrics and human evaluation.

4 0.067410991 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

Author: Jiajun Zhang ; Feifei Zhai ; Chengqing Zong

Abstract: Due to its explicit modeling of the grammaticality of the output via target-side syntax, the string-to-tree model has been shown to be one of the most successful syntax-based translation models. However, a major limitation of this model is that it does not utilize any useful syntactic information on the source side. In this paper, we analyze the difficulties of incorporating source syntax in a string-totree model. We then propose a new way to use the source syntax in a fuzzy manner, both in source syntactic annotation and in rule matching. We further explore three algorithms in rule matching: 0-1 matching, likelihood matching, and deep similarity matching. Our method not only guarantees grammatical output with an explicit target tree, but also enables the system to choose the proper translation rules via fuzzy use of the source syntax. Our extensive experiments have shown significant improvements over the state-of-the-art string-to-tree system. 1

5 0.065385811 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

Author: Xinyan Xiao ; Yang Liu ; Qun Liu ; Shouxun Lin

Abstract: Although discriminative training guarantees to improve statistical machine translation by incorporating a large amount of overlapping features, it is hard to scale up to large data due to decoding complexity. We propose a new algorithm to generate translation forest of training data in linear time with the help of word alignment. Our algorithm also alleviates the oracle selection problem by ensuring that a forest always contains derivations that exactly yield the reference translation. With millions of features trained on 519K sentences in 0.03 second per sentence, our system achieves significant improvement by 0.84 BLEU over the baseline system on the NIST Chinese-English test sets.

6 0.058598615 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

7 0.053708054 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

8 0.049849194 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

9 0.046800729 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax

10 0.043542162 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

11 0.042022098 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking

12 0.040956702 127 emnlp-2011-Structured Lexical Similarity via Convolution Kernels on Dependency Trees

13 0.040363837 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

14 0.040196117 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

15 0.039168321 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

16 0.038716346 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

17 0.036961116 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

18 0.03621763 125 emnlp-2011-Statistical Machine Translation with Local Language Models

19 0.035501558 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base

20 0.035158966 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.132), (1, 0.049), (2, 0.008), (3, -0.027), (4, 0.01), (5, -0.054), (6, -0.043), (7, -0.055), (8, 0.061), (9, 0.031), (10, 0.039), (11, 0.053), (12, -0.053), (13, 0.069), (14, 0.048), (15, -0.11), (16, 0.05), (17, 0.074), (18, -0.012), (19, 0.001), (20, 0.042), (21, -0.021), (22, -0.082), (23, 0.144), (24, -0.115), (25, -0.033), (26, 0.027), (27, 0.076), (28, 0.03), (29, -0.028), (30, -0.072), (31, -0.006), (32, -0.091), (33, -0.087), (34, 0.022), (35, -0.041), (36, -0.025), (37, 0.067), (38, -0.016), (39, -0.045), (40, -0.081), (41, 0.051), (42, -0.018), (43, -0.08), (44, 0.057), (45, 0.0), (46, 0.009), (47, 0.032), (48, 0.208), (49, 0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92823017 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

2 0.49369603 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

Author: Jun Xie ; Haitao Mi ; Qun Liu

Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful to improve translation quality. However, previous works on dependency structure based models typically resort to insertion operations to complete translations, which make it difficult to specify ordering information in translation rules. In our model of this paper, we handle this problem by directly specifying the ordering information in head-dependents rules which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only substitution operation, thus our model requires no heuristics or separate ordering models of the previous works to control the word order of translations. Large-scale experiments show that our model performs well on long distance reordering, and outperforms the state- of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrasebased model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resort to phrases or parse forest. For the first time, a source dependency structure based model catches up with and surpasses the state-of-theart translation models.

3 0.45977449 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

Author: Jiajun Zhang ; Feifei Zhai ; Chengqing Zong

Abstract: Due to its explicit modeling of the grammaticality of the output via target-side syntax, the string-to-tree model has been shown to be one of the most successful syntax-based translation models. However, a major limitation of this model is that it does not utilize any useful syntactic information on the source side. In this paper, we analyze the difficulties of incorporating source syntax in a string-totree model. We then propose a new way to use the source syntax in a fuzzy manner, both in source syntactic annotation and in rule matching. We further explore three algorithms in rule matching: 0-1 matching, likelihood matching, and deep similarity matching. Our method not only guarantees grammatical output with an explicit target tree, but also enables the system to choose the proper translation rules via fuzzy use of the source syntax. Our extensive experiments have shown significant improvements over the state-of-the-art string-to-tree system. 1

4 0.43976343 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

Author: Xinyan Xiao ; Yang Liu ; Qun Liu ; Shouxun Lin

Abstract: Although discriminative training guarantees to improve statistical machine translation by incorporating a large amount of overlapping features, it is hard to scale up to large data due to decoding complexity. We propose a new algorithm to generate translation forest of training data in linear time with the help of word alignment. Our algorithm also alleviates the oracle selection problem by ensuring that a forest always contains derivations that exactly yield the reference translation. With millions of features trained on 519K sentences in 0.03 second per sentence, our system achieves significant improvement by 0.84 BLEU over the baseline system on the NIST Chinese-English test sets.

5 0.41151479 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

Author: Amit Goyal ; Hal Daume III

Abstract: We exploit sketch techniques, especially the Count-Min sketch, a memory, and time efficient framework which approximates the frequency of a word pair in the corpus without explicitly storing the word pair itself. These methods use hashing to deal with massive amounts of streaming text. We apply CountMin sketch to approximate word pair counts and exhibit their effectiveness on three important NLP tasks. Our experiments demonstrate that on all of the three tasks, we get performance comparable to Exact word pair counts setting and state-of-the-art system. Our method scales to 49 GB of unzipped web data using bounded space of 2 billion counters (8 GB memory).

6 0.40798739 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

7 0.39874005 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

8 0.3839235 16 emnlp-2011-Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

9 0.38329369 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

10 0.3748925 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

11 0.36781844 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation

12 0.3677991 29 emnlp-2011-Collaborative Ranking: A Case Study on Entity Linking

13 0.33765206 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

14 0.31095061 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

15 0.31026769 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction

16 0.29492229 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars

17 0.29482657 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

18 0.28296199 127 emnlp-2011-Structured Lexical Similarity via Convolution Kernels on Dependency Trees

19 0.25367579 109 emnlp-2011-Random Walk Inference and Learning in A Large Scale Knowledge Base

20 0.24321397 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(15, 0.019), (23, 0.076), (35, 0.329), (36, 0.034), (37, 0.028), (45, 0.073), (53, 0.02), (54, 0.05), (57, 0.018), (62, 0.019), (64, 0.018), (65, 0.012), (66, 0.032), (69, 0.026), (79, 0.045), (82, 0.023), (87, 0.012), (90, 0.031), (96, 0.03), (98, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76106185 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

Author: Fabien Cromieres ; Sadao Kurohashi

Abstract: We propose an algorithm allowing to efficiently retrieve example treelets in a parsed tree database in order to allow on-the-fly extraction of syntactic translation rules. We also propose improvements of this algorithm allowing several kinds of flexible matchings.

2 0.61297476 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

Author: Jiajun Zhang ; Feifei Zhai ; Chengqing Zong

Abstract: Due to its explicit modeling of the grammaticality of the output via target-side syntax, the string-to-tree model has been shown to be one of the most successful syntax-based translation models. However, a major limitation of this model is that it does not utilize any useful syntactic information on the source side. In this paper, we analyze the difficulties of incorporating source syntax in a string-totree model. We then propose a new way to use the source syntax in a fuzzy manner, both in source syntactic annotation and in rule matching. We further explore three algorithms in rule matching: 0-1 matching, likelihood matching, and deep similarity matching. Our method not only guarantees grammatical output with an explicit target tree, but also enables the system to choose the proper translation rules via fuzzy use of the source syntax. Our extensive experiments have shown significant improvements over the state-of-the-art string-to-tree system. 1

3 0.39165559 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

Author: Juri Ganitkevitch ; Chris Callison-Burch ; Courtney Napoles ; Benjamin Van Durme

Abstract: Previous work has shown that high quality phrasal paraphrases can be extracted from bilingual parallel corpora. However, it is not clear whether bitexts are an appropriate resource for extracting more sophisticated sentential paraphrases, which are more obviously learnable from monolingual parallel corpora. We extend bilingual paraphrase extraction to syntactic paraphrases and demonstrate its ability to learn a variety of general paraphrastic transformations, including passivization, dative shift, and topicalization. We discuss how our model can be adapted to many text generation tasks by augmenting its feature set, development data, and parameter estimation routine. We illustrate this adaptation by using our paraphrase model for the task of sentence compression and achieve results competitive with state-of-the-art compression systems.

4 0.38113904 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh

Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors . in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.

5 0.37850603 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

Author: Amit Dubey ; Frank Keller ; Patrick Sturt

Abstract: This paper introduces a psycholinguistic model of sentence processing which combines a Hidden Markov Model noun phrase chunker with a co-reference classifier. Both models are fully incremental and generative, giving probabilities of lexical elements conditional upon linguistic structure. This allows us to compute the information theoretic measure of surprisal, which is known to correlate with human processing effort. We evaluate our surprisal predictions on the Dundee corpus of eye-movement data show that our model achieve a better fit with human reading times than a syntax-only model which does not have access to co-reference information.

6 0.37155852 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

7 0.37025508 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

8 0.37025255 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

9 0.3701784 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

10 0.37011558 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

11 0.36998287 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

12 0.36988154 107 emnlp-2011-Probabilistic models of similarity in syntactic context

13 0.36876836 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

14 0.36800653 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing

15 0.36735108 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

16 0.36698687 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French

17 0.36618835 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

18 0.36552 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

19 0.36518458 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

20 0.36457407 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search