emnlp emnlp2011 emnlp2011-38 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alan Ritter ; Colin Cherry ; William B. Dolan
Abstract: We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. [sent-10, score-1.028]
2 We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. [sent-11, score-0.757]
3 We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. [sent-13, score-0.579]
4 As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response. [sent-14, score-0.226]
5 1 Introduction Recently there has been an explosion in the number of people having informal, public conversations on social media websites such as Facebook and Twitter. [sent-15, score-0.223]
6 This presents a unique opportunity to build collections of naturally occurring conversations that are orders of magnitude larger than those previously available. [sent-16, score-0.138]
7 We investigate the problem of response generation: given a conversational stimulus, generate an appropriate response. [sent-18, score-0.64]
8 Note that we make no mention of context, intent or dialogue state; our goal is to generate any response that fits the provided stimulus; however, we do so without employing rules or templates, with the hope of creating a system that is both flexible and extensible when operating in an open domain. [sent-20, score-0.711]
9 Success in open domain response generation could be immediately useful to social media platforms, providing a list of suggested responses to a target status, or providing conversation-aware autocomplete for responses in progress. [sent-21, score-1.507]
10 However, we are most excited by the future potential of data-driven response generation when used inside larger dialogue systems, where direct consideration of the user’s utterance could be combined with dialogue state (Wong and Mooney, 2007; Langner et al. [sent-25, score-1.065]
11 In this work, we investigate statistical machine translation as an approach for response generation. [sent-27, score-0.537]
12 For example, consider the stimulus-response pair from the data: Stimulus: I’m slowly making this soup . [sent-29, score-0.113]
13 Haha. Here “it” in the response refers to “this soup” in the status by co-reference; however, there is also a more subtle relationship between “smells” and “looks”, as well as “gorgeous” and “delicious”. [sent-39, score-0.753]
14 Parallelisms such as these are frequent in naturally occurring conversations, leading us to ask whether it might be possible to translate a stimulus into an appropriate response. [sent-40, score-0.226]
15 We apply SMT to this problem, treating Twitter as our parallel corpus, with status posts as our source language and their responses as our target language. [sent-41, score-0.66]
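To make the parallel-corpus framing above concrete, here is a minimal sketch (not the authors' actual pipeline) of writing status-response pairs out as a Moses-style bitext, with statuses on the source side and responses on the target side. The file names, whitespace tokenization, and the demo pair are illustrative assumptions; the response string is invented, since the actual response in the soup example is elided in this dump.

```python
# Minimal sketch (not the authors' pipeline): write status-response pairs
# as a Moses-style parallel corpus, statuses on the source side and
# responses on the target side. File names and the whitespace tokenizer
# are illustrative assumptions.

def write_bitext(pairs, src_path="train.status", tgt_path="train.response"):
    """pairs: iterable of (status, response) strings."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for status, response in pairs:
            # One sentence pair per line, tokens separated by spaces,
            # mirroring the usual bitext format expected by SMT tools.
            src.write(" ".join(status.lower().split()) + "\n")
            tgt.write(" ".join(response.lower().split()) + "\n")

if __name__ == "__main__":
    # Invented pair, loosely based on the soup example in the text.
    demo = [("i'm slowly making this soup and it smells gorgeous !",
             "it looks delicious ! haha")]
    write_bitext(demo)
```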
16 We identify two key challenges in adapting SMT to the response generation task. [sent-43, score-0.606]
17 First, unlike bilingual text, stimulus-response pairs are not semantically equivalent, leading to a wider range of possible responses for a given stimulus phrase. [sent-44, score-0.628]
18 Thus, the most strongly associated word or phrase pairs found by off-the-shelf word alignment and phrase extraction tools are identical pairs. [sent-46, score-0.243]
19 Secondly, in stimulus-response pairs, there are far more unaligned words than in bilingual pairs; it is often the case that large portions of the stimulus are not referenced in the response and vice versa. [sent-48, score-0.676]
20 These difficult cases confuse the IBM word alignment models. [sent-50, score-0.117]
21 We compare our approach to response generation against two Information Retrieval or nearest neighbour approaches, which use the input stimulus to select a response directly from the training data. [sent-52, score-1.321]
22 We show that SMT-based solutions outperform IR-based solutions, and are chosen over actual human responses in our data in 15% of cases. [sent-55, score-0.494]
23 As far as we are aware, this is the first work to investigate the feasibility of SMT’s application to generating responses to open-domain linguistic stimuli. [sent-56, score-0.446]
24 In contrast, we focus on the simpler task of generating an appropriate response to a single utterance. [sent-63, score-0.59]
25 We leverage large amounts of conversational training data to scale to our Social Media domain, where conversations can be on just about any topic. [sent-64, score-0.226]
26 Additionally, there has been work on generating more natural utterances in goal-directed dialogue systems (Ratnaparkhi, 2000; Rambow et al. [sent-65, score-0.311]
27 Currently, most dialogue systems rely on either canned responses or templates for generation, which can result in utterances that sound very unnatural in context (Chambers and Allen, 2004). [sent-67, score-0.681]
28 Recent work has investigated the use of SMT in translating internal dialogue state into natural language (Langner et al. [sent-68, score-0.21]
29 In addition to dialogue state, we believe it may be beneficial to consider the user’s utterance when generating responses in order to generate locally coherent discourse (Barzilay and Lapata, 2005). [sent-70, score-0.709]
30 Data-driven generation based on users’ utterances might also be a useful way to fill in knowledge gaps in the system (Galley et al. [sent-71, score-0.168]
31 Most relevant to our efforts is the work by Soricut and Marcu (2006), who applied the IBM word alignment models to a discourse ordering task, exploiting the same intuition investigated in this paper: certain words (or phrases) tend to trigger the usage of other words in subsequent discourse units. [sent-79, score-0.188]
32 As far as we are aware, ours is the first work to explore the use of phrase-based translation in generating responses to open-domain linguistic stimuli, although the analogy between translation and dialogue has been drawn (Leuski and Traum, 2010). [sent-80, score-0.728]
33 Twitter conversations don’t occur in real time as they do in IRC; rather, as in email, users typically take turns responding to each other. [sent-85, score-0.184]
34 In addition, the Twitter API maintains a reference from each reply to the post it responds to, so unlike IRC, there is no need for conversation disentanglement (Elsner and Charniak, 2008; Wang and Oard, 2009). [sent-87, score-0.186]
35 The first message of a conversation is typically unique, not directed at any particular user but instead broadcast to the author’s followers (a status message). [sent-88, score-0.401]
36 As a result of this constraint, any system trained with this data will be specialized for responding to Twitter status posts. [sent-90, score-0.298]
37 4 Response Generation as Translation When applied to conversations, SMT models the probability of a response r given the input status-post s using a log-linear combination of feature functions. [sent-91, score-0.501]
38 Most prominent among these features are the conditional phrase-translation probabilities in both directions, P(s|r) and P(r|s), which ensure r is an appropriate response to s, and the language model P(r), which ensures r is a well-formed response. [sent-92, score-0.552]
39 As in translation, the response models are estimated from counts of phrase pairs observed in the training bitext, and the language model is built using n-gram statistics from a large set of observed responses. [sent-93, score-0.604]
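As a toy illustration of the n-gram statistics mentioned above (not the paper's LM toolkit; the add-alpha smoothing, vocabulary size, and corpus below are invented assumptions), a bigram language model over observed responses can be estimated from counts:

```python
from collections import Counter

# Toy bigram language model over observed responses (illustrative only;
# a real system would use an SMT LM toolkit with proper smoothing).

responses = ["it looks delicious", "have a good workout", "it looks great"]
unigrams, bigrams = Counter(), Counter()
for r in responses:
    toks = ["<s>"] + r.split() + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def p_bigram(w_prev, w, alpha=1.0, vocab_size=1000):
    # Add-alpha smoothed conditional probability P(w | w_prev).
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * vocab_size)

print(p_bigram("it", "looks"))
```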
40 To find the best response to a given input status-post, we employ the Moses phrase-based decoder (Koehn et al. [sent-94, score-0.501]
41 , 2007), which conducts a beam search for the best response given the input, according to the log-linear model. [sent-95, score-0.501]
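As a rough illustration of the log-linear scoring just described, a candidate response can be scored as a weighted sum of log feature values. The feature values and weights below are invented placeholders, not the paper's tuned model, and Moses additionally searches over candidates with a beam decoder rather than scoring one fixed candidate.

```python
import math

# Minimal sketch of log-linear scoring, not Moses itself: a candidate
# response r is scored as a weighted sum of log feature values. All
# numbers here are illustrative placeholders.

def loglinear_score(features, weights):
    """features: feature name -> probability; weights: name -> weight."""
    return sum(weights[name] * math.log(p) for name, p in features.items())

candidate_features = {
    "p_s_given_r": 0.02,   # inverse phrase-translation probability P(s|r)
    "p_r_given_s": 0.05,   # direct phrase-translation probability P(r|s)
    "p_lm": 0.001,         # language model probability P(r)
}
weights = {"p_s_given_r": 0.2, "p_r_given_s": 0.2, "p_lm": 0.5}

print(loglinear_score(candidate_features, weights))
```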
42 There is a clear correspondence between words in the status and the response. [sent-128, score-0.252]
43 For example, directly applying Moses with default settings to the conversation data produces a system which yields the following (typical) output on the above example: Stimulus: I’m slowly making this soup . [sent-131, score-0.228]
44 Because there is a wide range of acceptable responses to any status, these identical pairs have the strongest associations in the data, and therefore dominate the phrase table. [sent-146, score-0.511]
45 In conversational data, there are some cases in which there is a decomposable alignment between words, as seen in figure 1, and some difficult cases where alignment between large phrases is required, for example figure 2. [sent-152, score-0.17]
46 [Figure 2: Example from the data where word alignment is difficult (requires alignment between large phrases in the status and response).] [sent-222, score-0.452]
48 These difficult sentence pairs confuse the IBM word alignment models, which have no way to distinguish between the easy and hard cases. [sent-224, score-0.162]
49 Given a source and target phrase s and t, we consider the contingency table illustrated in figure 3, which includes co-occurrence counts for s and t, the number of sentence pairs containing s but not t and vice versa, in addition to the number of pairs containing neither s nor t. [sent-234, score-0.169]
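The paper later reports filtering phrase pairs with Fisher's Exact Test, and the contingency counts above are exactly what such a test consumes. A minimal sketch of that filter, using scipy; the counts, helper name, and p-value threshold are illustrative assumptions:

```python
from scipy.stats import fisher_exact

# Sketch of the phrase-pair filtering idea: build the 2x2 contingency
# table for a candidate (source phrase s, target phrase t) and keep the
# pair only if Fisher's exact test finds a significant positive
# association. Counts and the threshold are illustrative.

def significant_pair(c_st, c_s, c_t, n, alpha=1e-5):
    """c_st: pairs containing both s and t; c_s / c_t: pairs containing
    s / t at all; n: total number of status-response pairs."""
    table = [[c_st,        c_s - c_st],
             [c_t - c_st,  n - c_s - c_t + c_st]]
    _, p = fisher_exact(table, alternative="greater")
    return p < alpha

# e.g. s and t co-occur 30 times, s occurs 40 times, t occurs 50 times,
# out of 1,000,000 pairs:
print(significant_pair(30, 40, 50, 1_000_000))
```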
50 3M responses from the training data, along with roughly 1M replies collected using Twitter’s streaming API. [sent-259, score-0.453]
51 3Note that this includes an arbitrary subset of the (1,1,1) pairs (phrase pairs where both phrases were only observed once in the data). [sent-261, score-0.126]
52 We do not use any form of SMT reordering model, as the position of the phrase in the response does not seem to be very correlated with the corresponding position in the status. [sent-263, score-0.559]
53 Because automatic evaluation of response generation is an open problem, we avoided the use of discriminative training algorithms such as Minimum Error-Rate Training (Och, 2003). [sent-266, score-0.606]
54 5 Information Retrieval One straightforward data-driven approach to response generation is nearest neighbour, or information retrieval. [sent-267, score-0.606]
55 Given a novel status s and a training corpus of status/response pairs, two retrieval strategies can be used to return a best response r0: IR-STATUS [r0 = r_{argmax_i sim(s, s_i)}]: retrieve the response r_i whose associated status message s_i is most similar to the user’s input s. [sent-270, score-1.543]
56 IR-RESPONSE [r0 = r_{argmax_i sim(s, r_i)}]: retrieve the response r_i which has the highest similarity when directly compared to s. [sent-271, score-0.501]
57 At first glance, IR-STATUS may appear to be the most promising option; intuitively, if an input status is very similar to a training status, we might expect the corresponding training response to pair well with the input. [sent-272, score-0.753]
58 However, as we describe in §6, it turns out that directly retrieving the most similar response (IR-RESPONSE) tends to return acceptable replies more reliably, as judged by human annotators. [sent-273, score-0.594]
59 To implement our two IR response generators, we rely on the default similarity measure implemented in the Lucene5 Information Retrieval Library, which is an IDF-weighted Vector-Space similarity. [sent-274, score-0.501]
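A minimal sketch of the two retrieval strategies, with scikit-learn's TF-IDF standing in for Lucene's IDF-weighted vector-space similarity; the toy corpus, helper name, and mode flag are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of the two retrieval baselines using an IDF-weighted vector-space
# similarity (scikit-learn stands in for Lucene here; the data is toy).

statuses  = ["i'm slowly making this soup", "off to the gym"]
responses = ["it looks delicious", "have a good workout"]

vec = TfidfVectorizer().fit(statuses + responses)
S, R = vec.transform(statuses), vec.transform(responses)

def respond(s, mode="response"):
    q = vec.transform([s])
    if mode == "status":      # IR-STATUS: match s against training statuses...
        i = cosine_similarity(q, S).argmax()
    else:                     # IR-RESPONSE: ...or directly against responses.
        i = cosine_similarity(q, R).argmax()
    return responses[i]       # either way, return the stored response

print(respond("making soup today", mode="status"))
print(respond("making soup today", mode="response"))
```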
60 6 Experiments In order to compare various approaches to automated response generation, we used human evalu4The language model weight was set to 0. [sent-275, score-0.549]
61 While automated evaluation has been investigated in the area of spoken dialogue systems (Jung et al. [sent-284, score-0.21]
62 , 2009), it is unclear how well it will correlate with human judgment in open-domain conversations where the range of possible responses is very large. [sent-285, score-0.594]
63 These tweets were selected from conversations collected from a later, non-overlapping time period than the one used in training. [sent-290, score-0.172]
64 For each of the 200 statuses, we generated a response using methods a and b, then showed the status and both responses to the Turkers, asking them to choose the best response. [sent-292, score-0.753]
65 The order of the systems used to generate a response was randomized, and each of the 200 HITs was submitted to 3 different Turkers. [sent-293, score-0.501]
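A small sketch of how such pairwise HITs might be aggregated (the vote data and helper name are invented): each HIT's three votes are resolved by majority after the randomized system order has been mapped back to "A" and "B", yielding the per-system preference fractions reported later.

```python
from collections import Counter

# Sketch of pairwise HIT aggregation: each HIT gets three votes
# ("A" or "B"); the majority decides the HIT, and "Fraction A" is the
# share of HITs won by system A. The vote data is made up.

def fraction_a(hits):
    """hits: list of 3-vote lists, e.g. [["A", "A", "B"], ...]."""
    wins = sum(1 for votes in hits if Counter(votes)["A"] >= 2)
    return wins / len(hits)

demo_hits = [["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"], ["A", "B", "B"]]
print(fraction_a(demo_hits))  # 0.5
```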
66 An appropriate response should be on the same topic as the status, and should also “make sense” in response to it. [sent-296, score-1.053]
67 The uniform distribution is appropriate in our setup, since annotators are not told which system generated each output, and the order of choices is randomized. [sent-306, score-0.113]
68 Note that agreement between annotators is lower than typically reported in corpus annotation tasks. [sent-311, score-0.126]
69 When annotating which of two automatically generated outputs is better, there is not always a clear answer; both responses might be good or bad. [sent-312, score-0.408]
70 We can expect strong agreement only in cases where one response is clearly better. [sent-313, score-0.565]
71 Strong agreement is not required, however, as we are using many annotations to compare each pair of systems, and the human judgments are not intended to be used as training data. [sent-314, score-0.144]
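One chance-corrected coefficient consistent with the uniform-chance setup described above is Bennett et al.'s S (one of the coefficients surveyed by Artstein and Poesio): observed pairwise agreement corrected by 1/k for k equally likely categories. A minimal sketch with toy votes; whether this is the exact coefficient the paper used is an assumption.

```python
from itertools import combinations

# Chance-corrected agreement with a uniform chance distribution
# (Bennett et al.'s S): observed pairwise agreement A_o corrected by
# 1/k for k equally likely categories. Here k = 2 ("A" vs "B").

def s_coefficient(hits, k=2):
    agree = total = 0
    for votes in hits:
        for a, b in combinations(votes, 2):
            agree += (a == b)
            total += 1
    a_o = agree / total
    return (a_o - 1.0 / k) / (1.0 - 1.0 / k)

demo_hits = [["A", "A", "B"], ["B", "B", "B"], ["A", "A", "A"], ["A", "B", "B"]]
print(s_coefficient(demo_hits))  # about 0.33
```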
72 Similar agreement was reported in an evaluation of automatically generated MT output as part – – [table residue: Status | MT-CHAT | MT-BASELINE | IR-STATUS | HUMAN | RND-BASELINE | IR-RESPONSE; caption fragment: reasonable responses for illustration purposes]. [sent-315, score-0.508]
73 The column Fraction A lists the fraction of HITs where the majority of annotators agreed that System A’s response was better. [sent-323, score-0.594]
74 We had expected that matching status to status would create a more natural and effective IR system, but in practice, it appears that the additional level of indirection employed by IR-STATUS created only more opportunity for confusion and error. [sent-330, score-0.504]
75 Also, we did not necessarily expect MT-CHAT’s output to be preferred by human annotators: the SMT system is the only one that generates a completely novel response, and is therefore the system most likely to make fluency errors. [sent-331, score-0.178]
76 We had expected human annotators to pick up on these fluency errors, giving the advantage to the IR systems. [sent-332, score-0.155]
77 However, it appears that MT-CHAT’s ability to tailor its response to the status on a fine-grained scale overcame the disadvantage of occasionally introducing fluency errors. [sent-333, score-0.798]
78 In order to test how close MT-CHAT’s responses come to human-level abilities, we compared its output to actual human responses from our dataset. [sent-335, score-0.938]
79 In some cases the human responses change the topic of conversation, and completely ignore the initial status. [sent-336, score-0.456]
80 For instance, one frequent type of response we noticed in the data was a greeting: “How have you been? [sent-337, score-0.501]
81 ” For the purposes of this evaluation, we manually filtered out cases where the human response was completely offtopic from the status, selecting 200 pairs at random that met our criteria and using the actual responses as the HUMAN output. [sent-339, score-1.04]
82 However, its output is preferred over the human responses 15% of the time, a fact that is particularly surprising given the very small (by MT standards) amount of data used to train the model. [sent-341, score-0.541]
83 A few examples where MT-CHAT’s output was selected over the human response are – – 6See inter-annotator agreement in table 4. [sent-342, score-0.649]
84 We also evaluated the effect of filtering all possible phrase pairs using Fisher’s Exact Test, which we did instead of conducting phrase extraction according to the very noisy word alignments. [sent-345, score-0.161]
85 MT-CHAT’s output is preferred 58% of the time over MT-BASELINE, indicating that direct phrase extraction is useful in this conversational setting. [sent-350, score-0.231]
86 Finally, as an additional baseline, we compared MT-CHAT’s output to random responses selected from those observed 2 or more times in the training data. [sent-351, score-0.444]
87 One might argue that short, common responses are very general, and that a reply like “lol” could be considered a good response to almost any status. [sent-352, score-0.944]
88 However, the human evaluation shows a clear preference for MT-CHAT’s output: raters favour responses that are tailored to the stimulus. [sent-353, score-0.456]
89 To evaluate whether BLEU is an appropriate automatic evaluation measure for response generation, we attempted to measure its agreement with the human judgments. [sent-357, score-0.664]
90 It would seem that BLEU has some agreement with human judgments on this task, but perhaps not enough to be immediately useful. [sent-361, score-0.144]
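A sketch of the kind of BLEU-vs-human check described above, using NLTK's smoothed sentence-level BLEU; the strings, n-gram weights, and smoothing choice are assumptions, not the paper's configuration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Score each system's output against a reference response with smoothed
# sentence-level BLEU and check whether the higher-BLEU output matches
# the human preference. All strings here are invented.

smooth = SmoothingFunction().method1

def bleu(reference, hypothesis):
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=(0.5, 0.5), smoothing_function=smooth)

reference = "it looks delicious haha"
out_a, out_b = "it looks delicious", "lol"
human_prefers = "A"

bleu_prefers = "A" if bleu(reference, out_a) > bleu(reference, out_b) else "B"
print(bleu_prefers == human_prefers)
```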
91 Our experiments show that SMT techniques are better suited than IR approaches to the task of response generation. [sent-367, score-0.501]
92 Our system, MT-CHAT, produced responses which were preferred by human annotators over actual human responses 15% of the time. [sent-368, score-1.061]
93 Although this is still far from human-level performance, we believe there is much room for improvement: from designing appropriate word-alignment and decoding algorithms that account for the selective nature of response in dialogue, to simply adding more training data. [sent-369, score-0.552]
94 We described the many challenges posed by adapting phrase-based SMT to dialogue, and presented initial solutions to several, including direct phrasal alignment, and phrase-table scores discouraging responses that are lexically similar to the status. [sent-370, score-0.408]
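The paper does not spell out the exact phrase-table scores here; as one hedged illustration of discouraging responses that are lexically similar to the status, a candidate's score could be penalized by its word overlap with the input. All names and numbers below are invented:

```python
# Illustrative reranking that penalizes candidates which parrot the
# status, in the spirit of the scores mentioned above (the paper's
# actual scoring is not reproduced here).

def overlap_penalty(status, response):
    """Jaccard word overlap between status and candidate response."""
    s, r = set(status.lower().split()), set(response.lower().split())
    return len(s & r) / len(s | r) if s | r else 0.0

def rerank(status, candidates, base_scores, penalty_weight=1.0):
    scored = [(score - penalty_weight * overlap_penalty(status, cand), cand)
              for cand, score in zip(candidates, base_scores)]
    return max(scored)[1]

print(rerank("i'm slowly making this soup",
             ["i'm slowly making this soup", "it looks delicious"],
             [0.2, 0.1]))  # prefers the non-parroting candidate
```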
95 Finally, we have provided results from an initial experiment to evaluate the BLEU metric when applied to response generation, showing that, though the metric as-is does not work well, there is sufficient correlation to suggest that a similar, dialogue-focused approach may be feasible. [sent-371, score-0.501]
96 By generating responses to Tweets out of context, we have demonstrated that the models underlying phrase-based SMT are capable of guiding the con- struction of appropriate responses. [sent-372, score-0.497]
97 In the future, we are excited about the role these models could potentially play in guiding response construction for conversationally-aware chat input schemes, as well as goal-directed dialogue systems. [sent-373, score-0.75]
98 Using mechanical turk to build machine translation evaluation sets. [sent-397, score-0.112]
99 Stochastic language generation in a dialogue system: Toward a domain independent generator. [sent-422, score-0.315]
100 Evaluating a dialog language generation system: comparing the MOUNTAIN system to other NLG approaches. [sent-502, score-0.148]
wordName wordTfidf (topN-words)
[('response', 0.501), ('responses', 0.408), ('status', 0.252), ('dialogue', 0.21), ('stimulus', 0.175), ('conversations', 0.138), ('smt', 0.13), ('twitter', 0.125), ('generation', 0.105), ('fisher', 0.101), ('gorgeous', 0.09), ('conversational', 0.088), ('alignment', 0.082), ('conversation', 0.079), ('soup', 0.07), ('turkers', 0.07), ('ir', 0.068), ('isbell', 0.068), ('langner', 0.068), ('smells', 0.068), ('wong', 0.066), ('agreement', 0.064), ('utterances', 0.063), ('annotators', 0.062), ('bleu', 0.062), ('chris', 0.06), ('phrase', 0.058), ('discourse', 0.053), ('appropriate', 0.051), ('morristown', 0.049), ('preferred', 0.049), ('human', 0.048), ('social', 0.047), ('responding', 0.046), ('fluency', 0.045), ('artstein', 0.045), ('chambers', 0.045), ('chatterbots', 0.045), ('delicious', 0.045), ('hasselgren', 0.045), ('jafarpour', 0.045), ('rargmaxi', 0.045), ('replies', 0.045), ('stimuli', 0.045), ('unfiltered', 0.045), ('weizenbaum', 0.045), ('pairs', 0.045), ('slowly', 0.043), ('dialog', 0.043), ('nj', 0.043), ('mechanical', 0.041), ('dolan', 0.041), ('exact', 0.041), ('ibm', 0.04), ('shaikh', 0.039), ('bennett', 0.039), ('bloodgood', 0.039), ('disentanglement', 0.039), ('excited', 0.039), ('hobbs', 0.039), ('irc', 0.039), ('jung', 0.039), ('leuski', 0.039), ('neighbour', 0.039), ('swanson', 0.039), ('wah', 0.039), ('yuk', 0.039), ('actual', 0.038), ('generating', 0.038), ('media', 0.038), ('moses', 0.037), ('message', 0.037), ('cherry', 0.036), ('ritter', 0.036), ('bill', 0.036), ('translation', 0.036), ('output', 0.036), ('phrases', 0.036), ('coherence', 0.035), ('turk', 0.035), ('confuse', 0.035), ('repetition', 0.035), ('jerry', 0.035), ('rambow', 0.035), ('reply', 0.035), ('quirk', 0.034), ('tweets', 0.034), ('user', 0.033), ('contingency', 0.033), ('responds', 0.033), ('innovative', 0.033), ('landis', 0.033), ('ber', 0.033), ('elsner', 0.033), ('massively', 0.033), ('alignments', 0.032), ('judgments', 0.032), ('galley', 0.032), ('fraction', 0.031), ('alan', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 38 emnlp-2011-Data-Driven Response Generation in Social Media
Author: Alan Ritter ; Colin Cherry ; William B. Dolan
Abstract: We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.
2 0.14896996 24 emnlp-2011-Bootstrapping Semantic Parsers from Conversations
Author: Yoav Artzi ; Luke Zettlemoyer
Abstract: Conversations provide rich opportunities for interactive, continuous learning. When something goes wrong, a system can ask for clarification, rewording, or otherwise redirect the interaction to achieve its goals. In this paper, we present an approach for using conversational interactions of this type to induce semantic parsers. We demonstrate learning without any explicit annotation of the meanings of user utterances. Instead, we model meaning with latent variables, and introduce a loss function to measure how well potential meanings match the conversation. This loss drives the overall learning approach, which induces a weighted CCG grammar that could be used to automatically bootstrap the semantic analysis component in a complete dialog system. Experiments on DARPA Communicator conversational logs demonstrate effective learning, despite requiring no explicit mean- . ing annotations.
3 0.13295922 41 emnlp-2011-Discriminating Gender on Twitter
Author: John D. Burger ; John Henderson ; George Kim ; Guido Zarrella
Abstract: Accurate prediction of demographic attributes from social media and other informal online content is valuable for marketing, personalization, and legal investigation. This paper describes the construction of a large, multilingual dataset labeled with gender, and investigates statistical models for determining the gender of uncharacterized Twitter users. We explore several different classifier types on this dataset. We show the degree to which classifier accuracy varies based on tweet volumes as well as when various kinds of profile metadata are included in the models. We also perform a large-scale human assessment using Amazon Mechanical Turk. Our methods significantly out-perform both baseline models and almost all humans on the same task.
4 0.1159291 105 emnlp-2011-Predicting Thread Discourse Structure over Technical Web Forums
Author: Li Wang ; Marco Lui ; Su Nam Kim ; Joakim Nivre ; Timothy Baldwin
Abstract: Online discussion forums are a valuable means for users to resolve specific information needs, both interactively for the participants and statically for users who search/browse over historical thread data. However, the complex structure of forum threads can make it difficult for users to extract relevant information. The discourse structure of web forum threads, in the form of labelled dependency relationships between posts, has the potential to greatly improve information access over web forum archives. In this paper, we present the task of parsing user forum threads to determine the labelled dependencies between posts. Three methods, including a dependency parsing approach, are proposed to jointly classify the links (relationships) between posts and the dialogue act (type) of each link. The proposed methods significantly surpass an informed baseline. We also experiment with “in situ” classification of evolving threads, and establish that our best methods are able to perform equivalently well over partial threads as complete threads.
5 0.10409559 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: Many machine translation evaluation metrics have been proposed after the seminal BLEU metric, and many among them have been found to consistently outperform BLEU, demonstrated by their better correlations with human judgment. It has long been the hope that by tuning machine translation systems against these new generation metrics, advances in automatic machine translation evaluation can lead directly to advances in automatic machine translation. However, to date there has been no unambiguous report that these new metrics can improve a state-of-theart machine translation system over its BLEUtuned baseline. In this paper, we demonstrate that tuning Joshua, a hierarchical phrase-based statistical machine translation system, with the TESLA metrics results in significantly better humanjudged translation quality than the BLEUtuned baseline. TESLA-M in particular is simple and performs well in practice on large datasets. We release all our implementation under an open source license. It is our hope that this work will encourage the machine translation community to finally move away from BLEU as the unquestioned default and to consider the new generation metrics when tuning their systems.
6 0.10314306 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
7 0.10279492 3 emnlp-2011-A Correction Model for Word Alignments
8 0.10032535 89 emnlp-2011-Linguistic Redundancy in Twitter
9 0.098319225 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
10 0.093884327 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
11 0.09079697 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
12 0.088415951 125 emnlp-2011-Statistical Machine Translation with Local Language Models
13 0.087527439 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
14 0.086997129 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
15 0.082076028 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
16 0.07399787 133 emnlp-2011-The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources
17 0.07182239 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
18 0.071406975 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
19 0.071195461 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
20 0.066383854 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
topicId topicWeight
[(0, 0.237), (1, -0.064), (2, 0.17), (3, -0.133), (4, -0.079), (5, -0.081), (6, -0.03), (7, 0.074), (8, -0.028), (9, -0.142), (10, -0.133), (11, -0.137), (12, -0.037), (13, -0.049), (14, -0.056), (15, 0.125), (16, 0.1), (17, -0.062), (18, -0.053), (19, -0.099), (20, -0.048), (21, 0.03), (22, -0.213), (23, 0.001), (24, -0.006), (25, 0.089), (26, -0.019), (27, -0.011), (28, 0.068), (29, -0.094), (30, 0.058), (31, 0.053), (32, -0.016), (33, -0.043), (34, 0.055), (35, 0.034), (36, -0.119), (37, 0.008), (38, -0.165), (39, 0.072), (40, 0.014), (41, -0.111), (42, 0.228), (43, 0.01), (44, 0.043), (45, -0.103), (46, -0.004), (47, 0.04), (48, 0.08), (49, 0.086)]
simIndex simValue paperId paperTitle
same-paper 1 0.94139636 38 emnlp-2011-Data-Driven Response Generation in Social Media
Author: Alan Ritter ; Colin Cherry ; William B. Dolan
Abstract: We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.
2 0.49204236 24 emnlp-2011-Bootstrapping Semantic Parsers from Conversations
Author: Yoav Artzi ; Luke Zettlemoyer
Abstract: Conversations provide rich opportunities for interactive, continuous learning. When something goes wrong, a system can ask for clarification, rewording, or otherwise redirect the interaction to achieve its goals. In this paper, we present an approach for using conversational interactions of this type to induce semantic parsers. We demonstrate learning without any explicit annotation of the meanings of user utterances. Instead, we model meaning with latent variables, and introduce a loss function to measure how well potential meanings match the conversation. This loss drives the overall learning approach, which induces a weighted CCG grammar that could be used to automatically bootstrap the semantic analysis component in a complete dialog system. Experiments on DARPA Communicator conversational logs demonstrate effective learning, despite requiring no explicit mean- . ing annotations.
3 0.48858702 84 emnlp-2011-Learning the Information Status of Noun Phrases in Spoken Dialogues
Author: Altaf Rahman ; Vincent Ng
Abstract: An entity in a dialogue may be old, new, or mediated/inferrable with respect to the hearer’s beliefs. Knowing the information status of the entities participating in a dialogue can therefore facilitate its interpretation. We address the under-investigated problem of automatically determining the information status of discourse entities. Specifically, we extend Nissim’s (2006) machine learning approach to information-status determination with lexical and structured features, and exploit learned knowledge of the information status of each discourse entity for coreference resolution. Experimental results on a set of Switchboard dialogues reveal that (1) incorporating our proposed features into Nissim’s feature set enables our system to achieve stateof-the-art performance on information-status classification, and (2) the resulting information can be used to improve the performance of learning-based coreference resolvers.
4 0.48443124 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
Author: Dani Yogatama ; Michael Heilman ; Brendan O'Connor ; Chris Dyer ; Bryan R. Routledge ; Noah A. Smith
Abstract: We consider the problem of predicting measurable responses to scientific articles based primarily on their text content. Specifically, we consider papers in two fields (economics and computational linguistics) and make predictions about downloads and within-community citations. Our approach is based on generalized linear models, allowing interpretability; a novel extension that captures first-order temporal effects is also presented. We demonstrate that text features significantly improve accuracy of predictions over metadata features like authors, topical categories, and publication venues.
5 0.44190365 105 emnlp-2011-Predicting Thread Discourse Structure over Technical Web Forums
Author: Li Wang ; Marco Lui ; Su Nam Kim ; Joakim Nivre ; Timothy Baldwin
Abstract: Online discussion forums are a valuable means for users to resolve specific information needs, both interactively for the participants and statically for users who search/browse over historical thread data. However, the complex structure of forum threads can make it difficult for users to extract relevant information. The discourse structure of web forum threads, in the form of labelled dependency relationships between posts, has the potential to greatly improve information access over web forum archives. In this paper, we present the task of parsing user forum threads to determine the labelled dependencies between posts. Three methods, including a dependency parsing approach, are proposed to jointly classify the links (relationships) between posts and the dialogue act (type) of each link. The proposed methods significantly surpass an informed baseline. We also experiment with “in situ” classification of evolving threads, and establish that our best methods are able to perform equivalently well over partial threads as complete threads.
6 0.4287411 3 emnlp-2011-A Correction Model for Word Alignments
7 0.42063713 41 emnlp-2011-Discriminating Gender on Twitter
9 0.38881868 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
10 0.38324174 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association
11 0.37269932 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
12 0.36591232 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
13 0.36301064 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
14 0.35233811 25 emnlp-2011-Cache-based Document-level Statistical Machine Translation
15 0.33911148 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models
16 0.33786383 44 emnlp-2011-Domain Adaptation via Pseudo In-Domain Data Selection
17 0.32591116 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
18 0.32367578 27 emnlp-2011-Classifying Sentences as Speech Acts in Message Board Posts
19 0.32030097 89 emnlp-2011-Linguistic Redundancy in Twitter
20 0.32022002 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
topicId topicWeight
[(23, 0.106), (36, 0.018), (37, 0.023), (45, 0.072), (52, 0.019), (53, 0.035), (54, 0.038), (57, 0.023), (62, 0.025), (64, 0.023), (66, 0.027), (69, 0.019), (75, 0.013), (79, 0.053), (80, 0.263), (82, 0.017), (85, 0.022), (87, 0.047), (90, 0.017), (96, 0.045), (98, 0.03)]
simIndex simValue paperId paperTitle
1 0.78218281 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context
Author: Peter Turney ; Yair Neuman ; Dan Assaf ; Yohai Cohen
Abstract: Metaphor is ubiquitous in text, even in highly technical text. Correct inference about textual entailment requires computers to distinguish the literal and metaphorical senses of a word. Past work has treated this problem as a classical word sense disambiguation task. In this paper, we take a new approach, based on research in cognitive linguistics that views metaphor as a method for transferring knowledge from a familiar, well-understood, or concrete domain to an unfamiliar, less understood, or more abstract domain. This view leads to the hypothesis that metaphorical word usage is correlated with the degree of abstractness of the word’s context. We introduce an algorithm that uses this hypothesis to classify a word sense in a given context as either literal (de- notative) or metaphorical (connotative). We evaluate this algorithm with a set of adjectivenoun phrases (e.g., in dark comedy, the adjective dark is used metaphorically; in dark hair, it is used literally) and with the TroFi (Trope Finder) Example Base of literal and nonliteral usage for fifty verbs. We achieve state-of-theart performance on both datasets.
same-paper 2 0.76943064 38 emnlp-2011-Data-Driven Response Generation in Social Media
Author: Alan Ritter ; Colin Cherry ; William B. Dolan
Abstract: Ottawa, Ontario, K1A 0R6 Co l . Cherry@ nrc-cnrc . gc . ca in Redmond, WA 98052 bi l ldol @mi cro so ft . com large corpus of status-response pairs found on Twitter to create a system that responds to Twitter status We present a data-driven approach to generating responses to Twitter status posts, based on phrase-based Statistical Machine Translation. We find that mapping conversational stimuli onto responses is more difficult than translating between languages, due to the wider range of possible responses, the larger fraction of unaligned words/phrases, and the presence of large phrase pairs whose alignment cannot be further decomposed. After addressing these challenges, we compare approaches based on SMT and Information Retrieval in a human evaluation. We show that SMT outperforms IR on this task, and its output is preferred over actual human responses in 15% of cases. As far as we are aware, this is the first work to investigate the use of phrase-based SMT to directly translate a linguistic stimulus into an appropriate response.
3 0.51362079 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study
Author: Alan Ritter ; Sam Clark ; Mausam ; Oren Etzioni
Abstract: People tweet more than 100 Million times daily, yielding a noisy, informal, but sometimes informative corpus of 140-character messages that mirrors the zeitgeist in an unprecedented manner. The performance of standard NLP tools is severely degraded on tweets. This paper addresses this issue by re-building the NLP pipeline beginning with part-of-speech tagging, through chunking, to named-entity recognition. Our novel T-NER system doubles F1 score compared with the Stanford NER system. T-NER leverages the redundancy inherent in tweets to achieve this performance, using LabeledLDA to exploit Freebase dictionaries as a source of distant supervision. LabeledLDA outperforms cotraining, increasing F1 by 25% over ten common entity types. Our NLP tools are available at: http : / / github .com/ aritt er /twitte r_nlp
4 0.5133006 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
Author: Zhenghua Li ; Min Zhang ; Wanxiang Che ; Ting Liu ; Wenliang Chen ; Haizhou Li
Abstract: Part-of-speech (POS) is an indispensable feature in dependency parsing. Current research usually models POS tagging and dependency parsing independently. This may suffer from error propagation problem. Our experiments show that parsing accuracy drops by about 6% when using automatic POS tags instead of gold ones. To solve this issue, this paper proposes a solution by jointly optimizing POS tagging and dependency parsing in a unique model. We design several joint models and their corresponding decoding algorithms to incorporate different feature sets. We further present an effective pruning strategy to reduce the search space of candidate POS tags, leading to significant improvement of parsing speed. Experimental results on Chinese Penn Treebank 5 show that our joint models significantly improve the state-of-the-art parsing accuracy by about 1.5%. Detailed analysis shows that the joint method is able to choose such POS tags that are more helpful and discriminative from parsing viewpoint. This is the fundamental reason of parsing accuracy improvement.
5 0.50819385 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Author: Kevin Gimpel ; Noah A. Smith
Abstract: We present a quasi-synchronous dependency grammar (Smith and Eisner, 2006) for machine translation in which the leaves of the tree are phrases rather than words as in previous work (Gimpel and Smith, 2009). This formulation allows us to combine structural components of phrase-based and syntax-based MT in a single model. We describe a method of extracting phrase dependencies from parallel text using a target-side dependency parser. For decoding, we describe a coarse-to-fine approach based on lattice dependency parsing of phrase lattices. We demonstrate performance improvements for Chinese-English and UrduEnglish translation over a phrase-based baseline. We also investigate the use of unsupervised dependency parsers, reporting encouraging preliminary results.
6 0.50708306 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation
7 0.5059607 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
8 0.50363708 27 emnlp-2011-Classifying Sentences as Speech Acts in Message Board Posts
9 0.50224447 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
10 0.50180656 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
11 0.49942467 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
12 0.49889436 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
13 0.49859238 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
14 0.49858221 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
15 0.49778292 128 emnlp-2011-Structured Relation Discovery using Generative Models
16 0.49628797 136 emnlp-2011-Training a Parser for Machine Translation Reordering
17 0.49621198 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features
18 0.49507844 37 emnlp-2011-Cross-Cutting Models of Lexical Semantics
19 0.49380526 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
20 0.49142772 13 emnlp-2011-A Word Reordering Model for Improved Machine Translation