acl acl2013 acl2013-36 knowledge-graph by maker-knowledge-mining

36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning


Source: pdf

Author: Joohyun Kim ; Raymond Mooney

Abstract: We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. Instead, we show how the weak supervision of response feedback (e.g. successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. [sent-3, score-1.026]

2 Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. [sent-4, score-0.714]

3 Subsequently, Kim and Mooney (2012) extended their approach to make it tractable for the more complex problem of learning to follow natural-language navigation instructions from observations of humans following such instructions in a virtual environment (Chen and Mooney, 2011). [sent-12, score-0.549]

4 The observed sequence of actions provides very weak, ambiguous supervision for learning instructional language since there are many possible ways to describe the same execution path. [sent-13, score-0.424]

5 Since their system employs a generative model, discriminative reranking (Collins, 2000) could potentially improve its performance. [sent-18, score-0.516]

6 By training a discriminative classifier that uses global features of complete parses to identify correct interpretations, a reranker can significantly improve the accuracy of a generative model. [sent-19, score-0.39]

7 For the navigation task, this supervision consists of the observed sequence of actions taken by a human when following an instruction. [sent-28, score-0.321]

8 Therefore, it is impossible to directly apply conventional discriminative reranking to such problems. [sent-29, score-0.472]

9 We show how to adapt reranking to work with such weak supervision. [sent-30, score-0.474]

10 Instead of using gold-standard annotations to determine the correct interpretations, we simply prefer interpretations of navigation instructions that, when executed in the world, actually reach the intended destination. [sent-31, score-0.507]

11 Additionally, we extensively revise the features typically used in parse reranking to work with the PCFG approach to grounded language learning. [sent-32, score-0.833]

12 The rest of the paper is organized as follows: Section 2 reviews the navigation task and the PCFG approach to grounded language learning. [sent-33, score-0.421]

13 Section 3 presents our modified approach to reranking and Section 4 describes the novel features used to evaluate parses. [sent-34, score-0.456]

14 (b) A sample natural language instruction and its formal landmarks plan for the path illustrated above. [sent-40, score-0.375]

15 Formally speaking, given training examples of the form (ei, ai, wi), where ei is an NL instruction, ai is an executed action sequence for the instruction, and wi is the initial world state, we want to learn to produce an appropriate action sequence aj given a novel (ej , wj). [sent-50, score-0.384]
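
To make this setup concrete, the following is a minimal Python sketch of the kind of training example the learner receives; the field names and toy values are hypothetical, and the actual action and world-state representations are determined by the navigation environment rather than by this sketch.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class NavExample:
        """One observation of a human following an instruction.

        Field names and contents are illustrative, not the paper's exact format.
        """
        instruction: str                  # e_i: the NL instruction
        action_sequence: List[str]        # a_i: actions observed while following it
        initial_state: Dict[str, object]  # w_i: the initial world state

    # At test time only (e_j, w_j) is observed; the system must produce a_j.
    example = NavExample(
        instruction="Turn left and walk to the sofa",
        action_sequence=["TURN_LEFT", "TRAVEL", "TRAVEL"],
        initial_state={"position": (0, 0), "orientation": "NORTH"},
    )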

16 The MARCO execution module (MacMahon et al., 2006) executes the formal plan, flexibly adapting to situations encountered during execution and producing the action sequence aj. [sent-54, score-0.471]

17 The landmarks and correct plans for a sample instruction are shown in Figure 1b. [sent-57, score-0.406]

18 2 PCFG Induction for Grounded Language Learning The baseline generative model we use for reranking employs the unsupervised PCFG induction approach introduced by Kim and Mooney (2012). [sent-59, score-0.447]

19 After using Expectation-Maximization (EM) to estimate the parameters for these productions from the ambiguous supervision provided by the grounded-learning setting, it produces a PCFG whose most probable parse for a sentence encodes its correct semantic interpretation. [sent-63, score-0.344]

20 Our proposed reranking model is used to discriminatively reorder the top parses produced by this generative model. [sent-71, score-0.667]

21 However, grounded language learning tasks, such as our navigation task, do not provide reference parse trees for training examples. [sent-78, score-0.761]

22 Thus, the third component in our reranking model becomes an evaluation function EXEC that maps a parse tree y into a real number representing the success rate (w.r.t. reaching the intended destination) of the plan it produces. [sent-89, score-0.696]

23 Additionally, we improve the perceptron training algorithm by using multiple reference parses to update the weight vector W¯. [sent-93, score-0.441]

24 Although we determine the pseudo-gold reference tree to be the candidate parse y∗ such that y∗ = arg max_{y ∈ GEN(e)} EXEC(y), it may not actually be the correct parse for the sentence. [sent-94, score-0.671]

25 Other parses may contain useful information for learning, and therefore we devise a way to update weights using all candidate parses whose successful execution rate is greater than the parse preferred by the currently learned model. [sent-95, score-1.066]
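
This selection step can be sketched in Python as below; the fragment is illustrative rather than the authors' code, and `exec_rate` stands in for the EXEC function computed by the external execution module.

    def pseudo_gold(candidates, exec_rate):
        """Candidate parse with the highest execution success rate (the pseudo-gold)."""
        return max(candidates, key=exec_rate)

    def better_executing(candidates, predicted, exec_rate):
        """All candidates whose execution rate exceeds that of the model's current
        prediction; these are the parses used for the multi-reference weight update."""
        threshold = exec_rate(predicted)
        return [y for y in candidates if y is not predicted and exec_rate(y) > threshold]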

26 In a similar vein, when reranking semantic parses, Ge and Mooney (2006) chose as a reference parse the one which was most similar to the gold-standard semantic annotation. [sent-98, score-0.693]

27 However, in the navigation task, the ultimate goal is to generate a plan that, when actually executed in the virtual environment, leads to the desired destination. [sent-99, score-0.485]

28 Therefore, the pseudo-gold reference is chosen as the candidate parse that produces the MR plan with the greatest execution success. [sent-100, score-0.791]

29 This requires an external module that evaluates the execution accuracy of the candidate parses. [sent-101, score-0.408]

30 We use the MARCO (MacMahon et al., 2006) execution module, which is also used to evaluate how well the overall system learns to follow directions (Chen and Mooney, 2011). [sent-103, score-0.312]

31 Since MARCO is nondeterministic when executing underspecified plans, we execute each candidate plan 10 times, and its execution rate is the percentage of trials in which it reaches the correct destination. [sent-104, score-0.56]

32 When there are multiple candidate parses tied for the highest execution rate, the one assigned the largest probability by the baseline model is selected. [sent-105, score-0.565]
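
A sketch of how the execution rate and the tie-breaking rule could be computed is given below; `execute_plan` stands in for the external MARCO-style executor and `baseline_prob` for the generative model's parse probability, both hypothetical names.

    def estimate_exec_rate(plan, execute_plan, trials=10):
        """Fraction of nondeterministic executions that reach the correct destination."""
        successes = sum(1 for _ in range(trials) if execute_plan(plan))
        return successes / trials

    def select_with_tie_breaking(candidates, exec_rate, baseline_prob):
        """Among candidates tied at the highest execution rate, prefer the one
        assigned the largest probability by the baseline generative model."""
        best = max(exec_rate(y) for y in candidates)
        tied = [y for y in candidates if exec_rate(y) == best]
        return max(tied, key=baseline_prob)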

33 The final plan MRs are produced from parse trees using compositional semantics (see Kim and Mooney (2012) for details). [sent-108, score-0.395]

34 Consequently, the n-best parse trees for the baseline model do not necessarily produce the n-best distinct plans, since many parses can produce the same plan. [sent-109, score-0.489]

35 The score assigned to a plan is the probability of the most probable parse that generates that plan. [sent-112, score-0.315]
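
The collapse from n-best parses to distinct plans can be sketched as follows, assuming the parse list arrives sorted by decreasing probability, that plans are hashable (e.g., strings), and that `parse_to_plan` stands in for the compositional-semantics step; all three are assumptions of this sketch.

    def distinct_plans(nbest_parses, parse_to_plan, n=50):
        """Collapse an n-best parse list into at most n distinct MR plans.

        Because parses arrive in decreasing probability order, the first parse
        seen for a plan is its most probable derivation, so its probability
        becomes the plan's score.
        """
        plans = {}
        for parse, prob in nbest_parses:
            plan = parse_to_plan(parse)
            if plan not in plans:
                plans[plan] = prob
                if len(plans) == n:
                    break
        return plans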

36 In our initial modified version, we replaced the gold-standard reference parse with the pseudo-gold reference, which has the highest execution rate amongst all candidate parses. [sent-117, score-0.72]

37 However, this ignores all other candidate parses during perceptron training. [sent-118, score-0.331]

38 There may be multiple candidate parses with the same maximum execution rate, and even candidates with lower execution rates could represent the correct plan for the instruction given the weak, indirect supervision provided by the observed sequence of human actions. [sent-120, score-1.2]

39 Instead of only updating the weights with the single difference between the predicted and pseudo-gold parses, the weight vector W¯ is updated with the sum of feature-vector differences between the current predicted candidate and all other candidates that have a higher execution rate. [sent-122, score-0.408]

40 Formally, in this version, we replace lines 5–6 of Algorithm 1 with:

1: for all y ∈ GEN(ei) such that y ≠ yi and EXEC(y) > EXEC(yi) do
2:     W¯ = W¯ + (EXEC(y) − EXEC(yi)) · (Φ(ei, y) − Φ(ei, yi))
3: end for

where EXEC(y) is the execution rate of the MR plan m derived from parse tree y. [sent-123, score-0.764]

41 In the experiments below, we demonstrate that, by exploiting multiple reference parses, this new update rule increases the execution accuracy of the final system. [sent-124, score-0.485]

42 Intuitively, this approach gathers additional information from all candidate parses with higher execution accuracy when learning the discriminative reranker. [sent-125, score-0.667]

43 In addition, as shown in line 2 of the algorithm above, it uses the difference in execution rates between a candidate and the currently preferred parse to weight the update to the parameters for that candidate. [sent-126, score-0.69]
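
A sketch of this weighted multi-reference update is shown below; it mirrors the replacement of lines 5–6 of Algorithm 1 given above, but the concrete representations are assumptions: `features(y)` is taken to return a sparse feature-count dictionary, `exec_rate` the EXEC score, and the weight vector a plain dictionary.

    def multi_reference_update(weights, candidates, predicted, features, exec_rate):
        """Add, for every candidate that executes better than the current prediction,
        the feature-vector difference weighted by the execution-rate gap."""
        phi_pred = features(predicted)
        base = exec_rate(predicted)
        for y in candidates:
            gap = exec_rate(y) - base
            if y is predicted or gap <= 0:
                continue
            phi_y = features(y)
            for f in set(phi_y) | set(phi_pred):
                delta = phi_y.get(f, 0.0) - phi_pred.get(f, 0.0)
                weights[f] = weights.get(f, 0.0) + gap * delta
        return weights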

44 4 Reranking Features: This section describes the features Φ extracted from parses produced by the generative model and used to rerank the candidates. [sent-128, score-0.317]

45 4.1 Base Features: The base features adapt those used in previous reranking methods, specifically those of Collins (2002a) and Lu et al. [sent-130, score-0.504]

46 Indicates whether a particular PCFG rule as well as the nonterminal above it is used in the parse tree: f(L3 ⇒ L5L6|L1) = 1. [sent-141, score-0.432]

47 Indicates whether a nonterminal has a given NL word below it in the parse tree: f(L2 ; left) = 1 and f(L4 ; turn) = 1. [sent-143, score-0.432]

48 Indicates whether a nonterminal has a child nonterminal which eventually generates a given NL word in the parse tree: f(L4 ; left | L2) = 1. e) Unigram. [sent-145, score-0.73]

49 Indicates whether a nonterminal produces a given child nonterminal or terminal NL word in the parse tree: f(L1 → L2) = 1 and f(L1 → L3) = 1. [sent-146, score-0.71]
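
The base indicator features above can be pictured with a short sketch over a toy tree representation; the (label, children) tuples and the feature-name strings are illustrative choices rather than the paper's implementation, and the word feature is simplified to immediate terminal children.

    def base_features(node, parent=None, feats=None):
        """Collect binary indicator features from a parse tree.

        A node is a (label, children) pair whose children are either nodes
        or NL word strings (an illustrative representation).
        """
        if feats is None:
            feats = set()
        label, children = node
        child_labels = [c if isinstance(c, str) else c[0] for c in children]
        # rule together with the nonterminal above it
        feats.add("rule: %s -> %s | %s" % (label, " ".join(child_labels), parent))
        for c in children:
            if isinstance(c, str):
                feats.add("word: %s ; %s" % (label, c))      # NL word directly below
                feats.add("unigram: %s -> %s" % (label, c))  # terminal child
            else:
                feats.add("unigram: %s -> %s" % (label, c[0]))  # nonterminal child
                base_features(c, label, feats)
        return feats

    tree = ("L1", [("L2", ["left"]), ("L3", [("L5", ["turn"]), ("L6", ["around"])])])
    print(sorted(base_features(tree)))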

50 4.2 Predicate-Only Features: The base features above generally include nonterminal symbols used in the parse tree. [sent-155, score-0.534]

51 There are ∼2,500 nonterminals in the grammar constructed for the navigation data, most of which are very specific and rare. [sent-157, score-0.318]

52 This results in a very large, sparse feature space which can easily lead the reranking model to over-fit the training data and prevent it from generalizing properly. [sent-158, score-0.403]

53 First, we construct generalized versions of the base features in which nonterminal symbols use only predicate names and omit their arguments. [sent-160, score-0.318]

54 In the navigation task, action arguments frequently contain redundant, rarely used information. [sent-161, score-0.362]

55 For instance, a nonterminal for the MR Turn(LEFT) Verify(at: SOFA, front: EASEL) Travel(steps: 3) is transformed into the predicate-only form Turn() Verify() Travel(), and then used to construct more general versions of the base features described in the previous section. [sent-163, score-0.315]
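
A minimal sketch of this predicate-only transformation, assuming nonterminal symbols are strings in the Turn(...)-style notation shown above (nested parentheses are not handled):

    import re

    def predicate_only(mr_symbol):
        """Drop argument lists so that only the predicate names remain."""
        return re.sub(r"\([^()]*\)", "()", mr_symbol)

    print(predicate_only("Turn(LEFT) Verify(at: SOFA, front: EASEL) Travel(steps: 3)"))
    # prints: Turn() Verify() Travel()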

56 Second, another version of the base features is constructed in which nonterminal symbols include action arguments but omit all interleaving verification steps. [sent-164, score-0.535]

57 Although verification steps sometimes help interpret the actions and their surrounding context, they frequently cause the nonterminal symbols to become unnecessarily complex and specific. [sent-166, score-0.36]

58 4.3 Descended Action Features: Finally, another feature group we use captures whether a particular atomic action in a nonterminal “descends” into one of its child nonterminals. [sent-168, score-0.504]

59 When an atomic action descends into lower nonterminals in a parse tree, it indicates that the action is mentioned in the NL instruction and is therefore important. [sent-172, score-0.737]

60 Below are several feature types related to descended actions that are used in our reranking model: a) Descended Action. [sent-173, score-0.616]

61 Indicates whether a given atomic action in a nonterminal descends to the next level. [sent-174, score-0.469]

62 Indicates whether a given atomic action in a nonterminal descends to a child nonterminal and this child generates a given NL word below it: f(Turn(LEFT) ; left) = 1. 5 Experimental Evaluation [sent-181, score-0.755]

63 We need to generate many more top parse trees to get 50 distinct formal MR plans. [sent-200, score-0.33]

64 We limit the number of best parse trees to 1,000,000, and even with this high limit, some training examples were left with fewer than 50 distinct candidate plans. [sent-201, score-0.409]

65 Table 1: Oracle parse and execution accuracy for single sentence and complete paragraph instructions for the n-best parses. [sent-214, score-0.715]

66 Our reranking model is then trained on the training data using the n-best candidate parses. [sent-216, score-0.466]

67 Finally, we measure both parse and execution accuracy on the test data. [sent-220, score-0.561]

68 Successful execution rates are calculated by averaging 10 nondeterministic MARCO executions. [sent-226, score-0.342]

69 5.2 Reranking Results (oracle results): As is typical in reranking experiments, we first present results for an “oracle” that always returns the best result amongst the top-n candidates produced by the baseline system, thereby providing an upper bound on the improvements possible with reranking. [sent-228, score-0.433]

70 Table 1 shows oracle accuracy for both semantic parsing and plan execution for single sentence and complete paragraph instructions for various values of n. [sent-229, score-0.698]

71 For oracle parse accuracy, for each sentence, we pick the parse that gives the highest F1 score. [sent-230, score-0.496]

72 For oracle single-sentence execution accuracy, we pick the parse that gives the highest execution success rate. [sent-231, score-0.904]

73 These single-sentence plans are then concatenated to produce a complete plan for each paragraph instruction in order to measure overall execution accuracy. [sent-232, score-0.749]
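
A sketch of the oracle computation follows; `f1`, `exec_rate`, and the list-of-actions plan representation are hypothetical stand-ins for the evaluation components described above.

    def oracle_parse(candidates, gold_mr, f1):
        """Candidate whose MR gives the highest F1 against the gold-standard MR."""
        return max(candidates, key=lambda y: f1(y, gold_mr))

    def oracle_execution(candidates, exec_rate):
        """Candidate whose plan has the highest execution success rate."""
        return max(candidates, key=exec_rate)

    def paragraph_plan(per_sentence_candidates, exec_rate):
        """Concatenate per-sentence oracle plans into one paragraph-level plan."""
        plan = []
        for candidates in per_sentence_candidates:
            plan.extend(oracle_execution(candidates, exec_rate))
        return plan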

74 Gold-standard reference weight updates: Table 2 presents reranking results for our proposed response-based weight update (Single) for the averaged perceptron. [sent-236, score-0.766]

75 Since the gold-standard annotation gives the correct MR rather than a parse tree for each sentence, Gold selects as a single reference parse the candidate in the top 50 whose resulting MR is most similar to the gold-standard MR as determined by its parse accuracy. [sent-239, score-0.86]

76 Ge and Mooney (2006) employ a similar approach when reranking semantic parses. [sent-240, score-0.403]

77 The results show that our response-based approach (Single) has better execution accuracy than both the baseline and the standard approach using gold-standard parses (Gold). [sent-241, score-0.535]

78 The overall result is very promising because it demonstrates how reranking can be applied to grounded language learning tasks where gold-standard parses are not readily available. [sent-246, score-0.78]

79 Table 2: Reranking results comparing our response-based methods using single (Single) or multiple (Multi) pseudo-gold parses to the standard approach using a single gold-standard parse (Gold). [sent-250, score-0.406]

80 Multiple reference parses: Table 2 also shows performance when using multiple reference parse trees to update weights. [sent-255, score-0.67]

81 The single-best pseudo-gold parse provides weak, ambiguous feedback since it only provides a rough estimate of the response feedback from the execution module. [sent-260, score-0.639]

82 Using a variety of preferable parses to update weights provides a greater amount and variety of weak feedback and therefore leads to a more accurate model. [sent-261, score-0.342]

83 5.3 Comparison of different feature groups: Table 3 compares reranking results using the different feature groups described in Section 4. [sent-262, score-0.403]

84 This indicates that, unlike with Multi, parses other than the best one do not provide useful information for optimizing standard parse accuracy. [sent-266, score-0.461]

85 Desc refers to the descended action features (cf. Section 4.3). [sent-274, score-0.323]

86 All results use the weight update with multiple reference parses. [sent-278, score-0.363]

87 These feature groups further improve the plan execution performance, and reranking using all of the feature groups (All) performs best, as expected. [sent-281, score-0.814]

88 However, since our model is optimizing plan execution during training, the results for parse accuracy are always worse than the baseline model. [sent-282, score-0.686]

89 6 Related Work Discriminative reranking is a common machine learning technique to improve the output of generative models. [sent-283, score-0.447]

90 However, to our knowledge, there has been no previous attempt to apply discriminative reranking to grounded language acquisition, where gold-standard reference parses are not typically available for training reranking models. [sent-289, score-1.326]

91 Although the demands of grounded language tasks, such as following navigation instructions, are different, it would be interesting to try adapting these alternative approaches to such problems. [sent-293, score-0.421]

92 7 Future Work In the future, we would like to explore the construction of better, more-general reranking features that are less prone to over-fitting. [sent-294, score-0.43]

93 Since typical reranking features rely on the combination and/or modification of nonterminals appearing in parse trees, for the large PCFGs produced for grounded language learning, such features are very sparse and rare. [sent-295, score-0.974]

94 In addition, employing other reranking methodologies, such as kernel methods (Collins, 2002b), and forest reranking exploiting a packed forest of exponentially many parse trees (Huang, 2008), is another area of future work. [sent-297, score-1.072]

95 We also would like to apply our approach to other reranking algorithms such as SVMs (Joachims, 2002) and MaxEnt methods (Charniak and Johnson, 2005). [sent-298, score-0.403]

96 8 Conclusions In this paper, we have shown how to adapt discriminative reranking to grounded language learning. [sent-299, score-0.687]

97 Since typical grounded language learning problems, such as navigation instruction following, do not provide the gold-standard reference parses required by standard reranking models, we have devised a novel method for using the weaker supervision provided by response feedback (e. [sent-300, score-1.328]

98 the execution of inferred navigation plans) when training a perceptron-based reranker. [sent-302, score-0.546]

99 In addition, since this response-based supervision is weak and ambiguous, we have also presented a method for using multiple reference parses to perform perceptron weight updates and shown a clear further improvement in end-task performance with this approach. [sent-304, score-0.506]

100 Learning to interpret natural language navigation instructions from observations. [sent-322, score-0.37]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reranking', 0.403), ('execution', 0.312), ('navigation', 0.234), ('parse', 0.216), ('nonterminal', 0.216), ('mooney', 0.212), ('parses', 0.19), ('grounded', 0.187), ('mr', 0.187), ('descended', 0.168), ('instruction', 0.155), ('pcfg', 0.147), ('exec', 0.135), ('plans', 0.134), ('nl', 0.13), ('action', 0.128), ('sofa', 0.124), ('instructions', 0.105), ('plan', 0.099), ('turn', 0.091), ('nonterminals', 0.084), ('descends', 0.084), ('gen', 0.083), ('kim', 0.079), ('perceptron', 0.078), ('reference', 0.074), ('ei', 0.072), ('discriminative', 0.069), ('virtual', 0.069), ('mrs', 0.069), ('macmahon', 0.067), ('update', 0.066), ('collins', 0.065), ('oracle', 0.064), ('candidate', 0.063), ('yi', 0.06), ('landmarks', 0.059), ('interpretations', 0.058), ('executed', 0.056), ('robot', 0.055), ('chen', 0.052), ('austin', 0.052), ('ge', 0.051), ('ingle', 0.05), ('interleaving', 0.05), ('trees', 0.05), ('paragraph', 0.049), ('tree', 0.048), ('left', 0.047), ('base', 0.046), ('updates', 0.046), ('grandparent', 0.045), ('actions', 0.045), ('generative', 0.044), ('feedback', 0.043), ('weak', 0.043), ('supervision', 0.042), ('joohyun', 0.041), ('atomic', 0.041), ('verification', 0.039), ('corner', 0.039), ('orschinger', 0.037), ('environment', 0.036), ('parsing', 0.036), ('child', 0.035), ('mult', 0.034), ('environments', 0.034), ('easel', 0.034), ('trave', 0.034), ('gold', 0.033), ('accuracy', 0.033), ('weight', 0.033), ('averaged', 0.033), ('distinct', 0.033), ('productions', 0.032), ('interpret', 0.031), ('formal', 0.031), ('marco', 0.031), ('sample', 0.031), ('raymond', 0.031), ('travel', 0.03), ('verify', 0.03), ('nondeterministic', 0.03), ('produced', 0.03), ('indicates', 0.029), ('symbols', 0.029), ('rate', 0.029), ('adapt', 0.028), ('maxy', 0.027), ('produces', 0.027), ('features', 0.027), ('simulated', 0.027), ('correct', 0.027), ('actually', 0.027), ('pa', 0.026), ('rerank', 0.026), ('optimizing', 0.026), ('modified', 0.026), ('front', 0.026), ('ambiguous', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

Author: Joohyun Kim ; Raymond Mooney

Abstract: We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. Instead, we show how the weak supervision of response feedback (e.g. successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.

2 0.1586338 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

Author: Srinivasan Janarthanam ; Oliver Lemon ; Phil Bartie ; Tiphaine Dalmas ; Anna Dickinson ; Xingkun Liu ; William Mackaness ; Bonnie Webber

Abstract: We present a city navigation and tourist information mobile dialogue app with integrated question-answering (QA) and geographic information system (GIS) modules that helps pedestrian users to navigate in and learn about urban environments. In contrast to existing mobile apps which treat these problems independently, our Android app addresses the problem of navigation and touristic questionanswering in an integrated fashion using a shared dialogue context. We evaluated our system in comparison with Samsung S-Voice (which interfaces to Google navigation and Google search) with 17 users and found that users judged our system to be significantly more interesting to interact with and learn from. They also rated our system above Google search (with the Samsung S-Voice interface) for tourist information tasks.

3 0.1318472 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

Author: Greg Coppola ; Mark Steedman

Abstract: Higher-order dependency features are known to improve dependency parser accuracy. We investigate the incorporation of such features into a cube decoding phrase-structure parser. We find considerable gains in accuracy on the range of standard metrics. What is especially interesting is that we find strong, statistically significant gains on dependency recovery on out-of-domain tests (Brown vs. WSJ). This suggests that higher-order dependency features are not simply overfitting the training material.

4 0.12862542 312 acl-2013-Semantic Parsing as Machine Translation

Author: Jacob Andreas ; Andreas Vlachos ; Stephen Clark

Abstract: Semantic parsing is the problem of deriving a structured meaning representation from a natural language utterance. Here we approach it as a straightforward machine translation task, and demonstrate that standard machine translation components can be adapted into a semantic parser. In experiments on the multilingual GeoQuery corpus we find that our parser is competitive with the state of the art, and in some cases achieves higher accuracy than recently proposed purpose-built systems. These results support the use of machine translation methods as an informative baseline in semantic parsing evaluations, and suggest that research in semantic parsing could benefit from advances in machine translation.

5 0.1280158 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

Author: Gabor Angeli ; Jakob Uszkoreit

Abstract: Temporal resolution systems are traditionally tuned to a particular language, requiring significant human effort to translate them to new languages. We present a language independent semantic parser for learning the interpretation of temporal phrases given only a corpus of utterances and the times they reference. We make use of a latent parse that encodes a language-flexible representation of time, and extract rich features over both the parse and associated temporal semantics. The parameters of the model are learned using a weakly supervised bootstrapping approach, without the need for manually tuned parameters or any other language expertise. We achieve state-of-the-art accuracy on all languages in the TempEval2 temporal normalization task, reporting a 4% improvement in both English and Spanish accuracy, and to our knowledge the first results for four other languages.

6 0.11434459 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing

7 0.11193723 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing

8 0.11087526 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

9 0.096325994 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search

10 0.093450405 275 acl-2013-Parsing with Compositional Vector Grammars

11 0.08617723 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

12 0.08585307 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification

13 0.084888384 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

14 0.083122551 230 acl-2013-Lightly Supervised Learning of Procedural Dialog Systems

15 0.076625146 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars

16 0.072954267 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri

17 0.070393987 320 acl-2013-Shallow Local Multi-Bottom-up Tree Transducers in Statistical Machine Translation

18 0.069086306 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

19 0.068639055 314 acl-2013-Semantic Roles for String to Tree Machine Translation

20 0.067334414 357 acl-2013-Transfer Learning for Constituency-Based Grammars


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.178), (1, -0.061), (2, -0.091), (3, -0.02), (4, -0.074), (5, 0.035), (6, 0.095), (7, -0.056), (8, 0.016), (9, 0.029), (10, -0.044), (11, 0.028), (12, 0.01), (13, -0.051), (14, 0.035), (15, -0.002), (16, 0.003), (17, 0.08), (18, -0.009), (19, -0.032), (20, -0.036), (21, -0.065), (22, 0.063), (23, 0.077), (24, 0.006), (25, -0.033), (26, 0.033), (27, 0.025), (28, -0.056), (29, 0.059), (30, 0.017), (31, 0.01), (32, 0.063), (33, -0.024), (34, 0.032), (35, 0.042), (36, -0.025), (37, 0.025), (38, 0.045), (39, 0.052), (40, -0.026), (41, 0.089), (42, -0.009), (43, -0.053), (44, 0.024), (45, 0.15), (46, -0.004), (47, 0.013), (48, -0.028), (49, -0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91819161 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

Author: Joohyun Kim ; Raymond Mooney

Abstract: We adapt discriminative reranking to improve the performance of grounded language acquisition, specifically the task of learning to follow navigation instructions from observation. Unlike conventional reranking used in syntactic and semantic parsing, gold-standard reference trees are not naturally available in a grounded setting. Instead, we show how the weak supervision of response feedback (e.g. successful task completion) can be used as an alternative, experimentally demonstrating that its performance is comparable to training on gold-standard parse trees.

2 0.69814277 176 acl-2013-Grounded Unsupervised Semantic Parsing

Author: Hoifung Poon

Abstract: We present the first unsupervised approach for semantic parsing that rivals the accuracy of supervised approaches in translating natural-language questions to database queries. Our GUSP system produces a semantic parse by annotating the dependency-tree nodes and edges with latent states, and learns a probabilistic grammar using EM. To compensate for the lack of example annotations or question-answer pairs, GUSP adopts a novel grounded-learning approach to leverage database for indirect supervision. On the challenging ATIS dataset, GUSP attained an accuracy of 84%, effectively tying with the best published results by supervised approaches.

3 0.68003255 311 acl-2013-Semantic Neighborhoods as Hypergraphs

Author: Chris Quirk ; Pallavi Choudhury

Abstract: Ambiguity preserving representations such as lattices are very useful in a number of NLP tasks, including paraphrase generation, paraphrase recognition, and machine translation evaluation. Lattices compactly represent lexical variation, but word order variation leads to a combinatorial explosion of states. We advocate hypergraphs as compact representations for sets of utterances describing the same event or object. We present a method to construct hypergraphs from sets of utterances, and evaluate this method on a simple recognition task. Given a set of utterances that describe a single object or event, we construct such a hypergraph, and demonstrate that it can recognize novel descriptions of the same event with high accuracy.

4 0.67978507 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions

Author: Gabor Angeli ; Jakob Uszkoreit

Abstract: Temporal resolution systems are traditionally tuned to a particular language, requiring significant human effort to translate them to new languages. We present a language independent semantic parser for learning the interpretation of temporal phrases given only a corpus of utterances and the times they reference. We make use of a latent parse that encodes a language-flexible representation of time, and extract rich features over both the parse and associated temporal semantics. The parameters of the model are learned using a weakly supervised bootstrapping approach, without the need for manually tuned parameters or any other language expertise. We achieve state-of-the-art accuracy on all languages in the TempEval2 temporal normalization task, reporting a 4% improvement in both English and Spanish accuracy, and to our knowledge the first results for four other languages.

5 0.66347879 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars

Author: Yoav Artzi ; Nicholas FitzGerald ; Luke Zettlemoyer

Abstract: unkown-abstract

6 0.65301877 90 acl-2013-Conditional Random Fields for Responsive Surface Realisation using Global Features

7 0.62192506 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing

8 0.60934126 165 acl-2013-General binarization for parsing and translation

9 0.6088379 163 acl-2013-From Natural Language Specifications to Program Input Parsers

10 0.59016418 275 acl-2013-Parsing with Compositional Vector Grammars

11 0.58851683 190 acl-2013-Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs

12 0.58587855 312 acl-2013-Semantic Parsing as Machine Translation

13 0.58285433 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

14 0.56291085 324 acl-2013-Smatch: an Evaluation Metric for Semantic Feature Structures

15 0.55843818 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

16 0.54645783 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

17 0.53982192 175 acl-2013-Grounded Language Learning from Video Described with Sentences

18 0.53802335 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

19 0.53116357 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension

20 0.52884555 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.049), (6, 0.3), (11, 0.052), (14, 0.011), (24, 0.035), (26, 0.041), (35, 0.075), (42, 0.046), (48, 0.038), (64, 0.038), (70, 0.061), (88, 0.054), (90, 0.04), (95, 0.063), (99, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94625366 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

2 0.94559431 319 acl-2013-Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics

Author: Dehong Gao ; Wenjie Li ; Renxian Zhang

Abstract: The growth of the Web 2.0 technologies has led to an explosion of social networking media sites. Among them, Twitter is the most popular service by far due to its ease for realtime sharing of information. It collects millions of tweets per day and monitors what people are talking about in the trending topics updated timely. Then the question is how users can understand a topic in a short time when they are frustrated with the overwhelming and unorganized tweets. In this paper, this problem is approached by sequential summarization which aims to produce a sequential summary, i.e., a series of chronologically ordered short subsummaries that collectively provide a full story about topic development. Both the number and the content of sub-summaries are automatically identified by the proposed stream-based and semantic-based approaches. These approaches are evaluated in terms of sequence coverage, sequence novelty and sequence correlation and the effectiveness of their combination is demonstrated.

3 0.94068366 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning

Author: Daniel Beck ; Lucia Specia ; Trevor Cohn

Abstract: Quality estimation models provide feedback on the quality of machine translated texts. They are usually trained on humanannotated datasets, which are very costly due to its task-specific nature. We investigate active learning techniques to reduce the size of these datasets and thus annotation effort. Experiments on a number of datasets show that with as little as 25% of the training instances it is possible to obtain similar or superior performance compared to that of the complete datasets. In other words, our active learning query strategies can not only reduce annotation effort but can also result in better quality predictors. ,t .

4 0.93179488 145 acl-2013-Exploiting Qualitative Information from Automatic Word Alignment for Cross-lingual NLP Tasks

Author: Jose G.C. de Souza ; Miquel Espla-Gomis ; Marco Turchi ; Matteo Negri

Abstract: The use of automatic word alignment to capture sentence-level semantic relations is common to a number of cross-lingual NLP applications. Despite its proved usefulness, however, word alignment information is typically considered from a quantitative point of view (e.g. the number of alignments), disregarding qualitative aspects (the importance of aligned terms). In this paper we demonstrate that integrating qualitative information can bring significant performance improvements with negligible impact on system complexity. Focusing on the cross-lingual textual en- tailment task, we contribute with a novel method that: i) significantly outperforms the state of the art, and ii) is portable, with limited loss in performance, to language pairs where training data are not available.

5 0.90951115 246 acl-2013-Modeling Thesis Clarity in Student Essays

Author: Isaac Persing ; Vincent Ng

Abstract: Recently, researchers have begun exploring methods of scoring student essays with respect to particular dimensions of quality such as coherence, technical errors, and relevance to prompt, but there is relatively little work on modeling thesis clarity. We present a new annotated corpus and propose a learning-based approach to scoring essays along the thesis clarity dimension. Additionally, in order to provide more valuable feedback on why an essay is scored as it is, we propose a second learning-based approach to identifying what kinds of errors an essay has that may lower its thesis clarity score.

same-paper 6 0.89579195 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

7 0.89191288 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

8 0.70171785 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

9 0.69017905 259 acl-2013-Non-Monotonic Sentence Alignment via Semisupervised Learning

10 0.68378878 333 acl-2013-Summarization Through Submodularity and Dispersion

11 0.68279266 377 acl-2013-Using Supervised Bigram-based ILP for Extractive Summarization

12 0.6827637 353 acl-2013-Towards Robust Abstractive Multi-Document Summarization: A Caseframe Analysis of Centrality and Domain

13 0.67855769 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

14 0.66968447 59 acl-2013-Automated Pyramid Scoring of Summaries using Distributional Semantics

15 0.66875654 176 acl-2013-Grounded Unsupervised Semantic Parsing

16 0.66759801 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

17 0.66755009 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

18 0.66508925 248 acl-2013-Modelling Annotator Bias with Multi-task Gaussian Processes: An Application to Machine Translation Quality Estimation

19 0.65331471 101 acl-2013-Cut the noise: Mutually reinforcing reordering and alignments for improved machine translation

20 0.64654362 129 acl-2013-Domain-Independent Abstract Generation for Focused Meeting Summarization