
195 acl-2012-The Creation of a Corpus of English Metalanguage


Author: Shomir Wilson

Abstract: Metalanguage is an essential linguistic mechanism which allows us to communicate explicit information about language itself. However, it has been underexamined in research in language technologies, to the detriment of the performance of systems that could exploit it. This paper describes the creation of the first tagged and delineated corpus of English metalanguage, accompanied by an explicit definition and a rubric for identifying the phenomenon in text. This resource will provide a basis for further studies of metalanguage and enable its utilization in language technologies.


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper describes the creation of the first tagged and delineated corpus of English metalanguage, accompanied by an explicit definition and a rubric for identifying the phenomenon in text. [sent-3, score-0.399]

2 This resource will provide a basis for further studies of metalanguage and enable its utilization in language technologies. [sent-4, score-0.526]

3 The use-mention distinction is illustrated simply in Sentences (1) and (2) below: (1) I watch football on weekends. [sent-7, score-0.164]

4 A reader understands that football in Sentence (1) refers to a sporting activity, while the same word in Sentence (2) refers to the term football itself. [sent-9, score-0.336]

5 Evidence suggests that human communication frequently employs metalanguage (Anderson et al. [sent-10, score-0.526]

6 The study of the syntax and semantics of metalanguage is well developed for formal languages. [sent-24, score-0.526]

7 Parsing the distinction is difficult, as shown in Figure 1 below: go does not function as a verb in Sentence (6), but it is tagged as such. [sent-26, score-0.16]

8 Delineating an instance of metalanguage with quotation marks is a common convention, but this often fails to ameliorate the parsing problem. [sent-27, score-0.828]

9 Moreover, applications of natural language processing generally lack the ability to recognize and interpret metalanguage (Anderson et al. [sent-29, score-0.56]

10 Applications of natural language understanding cannot process metalanguage without detecting it, especially when upstream components (such as parsers) mangle its structure. [sent-34, score-0.526]

11 Interactive systems that could leverage expectations of metalanguage competency currently fail to do so. [sent-35, score-0.526]

12 Adding quotation marks around go alters the parser output slightly (not shown), but go remains labeled VB. [sent-41, score-0.383]

13 The exchange shown in Figure 2 is representative of the reactions of nearly all dialog systems: in spite of the domain generality of metalanguage and the user’s expectation of its availability, the system does not recognize it and instead “talks past” the user. [sent-51, score-0.602]

14 In effect, language technologies that ignore metalanguage are discarding the most direct source of linguistic information that text or utterances can provide. [sent-52, score-0.555]

15 Section 2 presents a definition and a rubric for metalanguage in the form of mentioned language. [sent-54, score-1.09]

16 2 Metalanguage and the Use-Mention Distinction¹ Although the reader is likely to be familiar with the terms use-mention distinction and metalanguage, the topic merits further explanation to precisely establish the phenomenon being studied. [sent-57, score-0.252]

17 This paper will adopt the term mentioned language to describe the literal, delineable phenomenon illustrated in examples thus far. [sent-59, score-0.359]

18 Other forms of metalanguage occur through deictic references to linguistic entities that do not appear in the relevant statement. [sent-60, score-0.526]

19 For technical tractability, this study focuses on mentioned language. [sent-62, score-0.229]

20 1 Definition Although the use-mention distinction has enjoyed a long history of theoretical discussion, attempts to explicitly define one or both of the distinction’s disjuncts are difficult (or impossible) to find. [sent-64, score-0.147]

21 Below is the definition of mentioned language adopted by this study, followed by clarifications. [sent-65, score-0.279]

22 Definition: For T a token or a set of tokens in a sentence, if T is produced to draw attention to a property of the token T or the type of T, then T is an instance of mentioned language. [sent-66, score-0.33]

23 A property might ... (Footnote 1: The definition and rubric in this section were originally introduced by Wilson (2011a).) [sent-70, score-0.335]

24 The type of T is relevant in most instances of mentioned language, but the token itself is relevant in others, as in the sentence below: (9) “The” appears between quote marks here. [sent-73, score-0.561]

25 The adoption of this definition was motivated by a desire to study mentioned language with precise, repeatable results. [sent-75, score-0.279]

26 A brief attempt to train annotators using the definition was unsuccessful, and instead a rubric was created for this purpose. [sent-77, score-0.396]

27 2 Annotation Rubric A human reader with some knowledge of the use-mention distinction can often intuit the presence of mentioned language in a sentence. [sent-79, score-0.417]

28 However, to operationalize the concept and move toward corpus construction, it was necessary to create a rubric for labeling it. [sent-80, score-0.32]

29 The rubric is based on substitution, and it may be applied, with caveats described below, to determine whether a linguistic entity is mentioned by the sentence in which it occurs. [sent-81, score-0.555]

30 X is an instance of mentioned language if, when assuming that X' refers to X, the meaning of S' is equivalent to the meaning of S. [sent-87, score-0.43]

31 To maintain coherency, minor adjustments in sentence wording will be necessary for some candidate phrases. [sent-90, score-0.157]
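
The substitution step of the rubric can be illustrated with a minimal sketch. It assumes the annotator supplies the candidate phrase X and a substitute phrase X' (for example "That word"); the function name `build_substituted_sentence` and the example sentence are hypothetical and do not come from the paper. The code only builds S'; judging whether S' means the same as S remains a human decision.

```python
def build_substituted_sentence(sentence: str, candidate: str, substitute: str) -> str:
    """Return S' by replacing the first occurrence of candidate X with X'.

    Capitalization and any minor rewording are left to the caller.
    """
    start = sentence.find(candidate)
    if start == -1:
        raise ValueError("candidate phrase not found in sentence")
    return sentence[:start] + substitute + sentence[start + len(candidate):]


# Hypothetical example sentence (not taken from the corpus).
s = "Cheese is derived from a Latin word."
s_prime = build_substituted_sentence(s, "Cheese", "That word")
print(s_prime)
```

An annotator would then read S' while assuming that "That word" refers to "Cheese" and decide whether the meaning of S is preserved.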

32 Figure 4: Examples of rubric application using the pseudocode in Figure 3. [sent-96, score-0.334]

33 Also, quotation marks around or inside of a candidate phrase require special attention, since their inclusion or exclusion in X can alter the meaning of S’. [sent-97, score-0.488]

34 For this discussion, quotation marks and other stylistic cues are considered informal cues which aid a reader in detecting mentioned language. [sent-98, score-1.364]

35 Style conventions may call for them, and in some cases they might be strictly necessary, but a competent language user possesses sufficient skill to properly discard or retain them as each instance requires (Saka 1998). [sent-99, score-0.203]

36 3 The Mentioned Language Corpus “Laboratory examples” of mentioned language (such as the examples thus far in this paper) only begin to illustrate the variation in the phenomenon. [sent-100, score-0.258]

37 To conduct an empirical examination of mentioned language and to study the feasibility of automatic identification, it was necessary to gather a large, diverse set of samples. [sent-101, score-0.229]

38 This section describes the process of building a series of three progressively more sophisticated corpora of mentioned language. [sent-102, score-0.229]

39 This third corpus is the first to delineate mentioned language: that is, it identifies precise subsequences of words in a sentence that are subject to the phenomenon. [sent-105, score-0.314]

40 1 Approach The article set of English Wikipedia2 was chosen as a source for text, from which instances were mined using a combination of automated and manual efforts. [sent-108, score-0.173]

41 2) Stylistic cues that sometimes delimit mentioned language are present in article text. [sent-112, score-0.559]

42 Contributors tend to use quote marks, italic text, or bold text to delimit mentioned language3, thus following conventions respected across many domains of writing (Strunk & White 1979; Chicago Editorial Staff 2010; American Psychological Association. [sent-113, score-0.375]

43 3 These conventions are stated in Wikipedia’s style manual, though it is unclear whether most contributors read the manual or follow the conventions out of habit. [sent-118, score-0.223]

44 boards and other sources of informal language were considered, but the lack of consistent (or any) stylistic cues would have made candidate phrase collection untenably time-consuming. [sent-119, score-0.701]

45 Articles are written informatively and they generally assume the reader is unfamiliar with their topics, leading to frequent instances of mentioned language. [sent-121, score-0.433]

46 Then, the main bodies of article text (excluding discussion pages, image captions, and other peripheral text) were scanned for sentences that contained instances of highlighted text (i. [sent-126, score-0.262]

47 Since stylistic cues are also used for other language tasks, candidate instances were heuristically filtered and then annotated by human readers. [sent-129, score-0.734]
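
A rough sketch of how highlighted text might be collected as candidate phrases is given below. It assumes the article text is available as MediaWiki markup, where two apostrophes mark italics and three mark bold; the summary does not say which representation was actually scanned, and the names `CUE_PATTERNS` and `find_candidates` are illustrative only. The heuristic filtering and human annotation described above are not modeled.

```python
import re

# Each pattern captures the text inside one kind of stylistic cue.
# The italic pattern uses lookarounds so that it does not fire inside ''' (bold).
CUE_PATTERNS = [
    ("bold", re.compile(r"'''(.+?)'''")),
    ("italic", re.compile(r"(?<!')''(?!')(.+?)(?<!')''(?!')")),
    ("quote", re.compile(r'"([^"]+)"')),
]


def find_candidates(sentence):
    """Return (cue_type, phrase) pairs for highlighted spans in a sentence."""
    found = []
    for cue, pattern in CUE_PATTERNS:
        for match in pattern.finditer(sentence):
            found.append((cue, match.group(1)))
    return found


# Hypothetical wikitext sentence used only for illustration.
example = "The term ''hat trick'' is also called a \"treble\" in '''British''' usage."
candidates = find_candidates(example)
print(candidates)
```

Every span returned here is only a candidate: stylistic cues also mark titles, emphasis, and ordinary quotations, which is why a later filtering and annotation step is needed.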

48 2 Previous Efforts In previous work, a pilot corpus was constructed to verify the fertility of Wikipedia as a source for mentioned language. [sent-131, score-0.229]

49 From 1,000 articles, 1,339 sentences that contained stylistic cues were examined by a human reader, and 171 were found to contain at least one instance of mentioned language. [sent-132, score-0.793]

50 Next, the “Combined Cues” corpus was constructed to test the combination of stylistic filtering and a new lexical filter for selecting candidate instances. [sent-134, score-0.377]

51 From 3,831 articles, a set of 898 sentences were found to contain 1,164 candidate instances that passed the combination of stylistic and lexical filters. [sent-136, score-0.496]

52 Hand annotation of those candidates yielded 1,082 instances of mentioned language. [sent-137, score-0.348]

53 It did not seem plausible that the set of mention-significant words was complete enough to justify that high percentage, and concerns were raised that the lexical filter was rejecting many instances of mentioned language. [sent-139, score-0.348]

54 For each of the 23 original mention-significant words, a human reader started with its containing synset and followed hypernym links until a synset was reached that did not refer to a linguistic entity. [sent-142, score-0.213]

55 Using the combination of stylistic and lexical cues, 2,393 candidate instances were collected, and the researcher used the rubric and definition from Section 2 to identify 629 instances of mentioned language⁴. [sent-147, score-1.212]

56 The researcher also identified four categories of mentioned language based on the nature of the substitution phrase X’ specified by the rubric. [sent-148, score-0.304]

57 4 Corpus Composition As stated previously, categories for mentioned language were identified based on intuitive relationships among the substitution phrases created for the rubric (e. [sent-153, score-0.548]

58 The categories are: 1) Words as Words (WW): Within the context of the sentence, the candidate phrase is used to refer to the word or phrase itself and not what it usually refers to. [sent-156, score-0.28]

59 2) Names as Names (NN): The sentence directly refers to the candidate phrase as a proper name, nickname, or title. [sent-162, score-0.245]

60 3) Spelling or Pronunciation (SP): The candidate text appears only to illustrate spelling, pronunciation, or a character symbol. [sent-163, score-0.145]

61 4) Other Mention/Interesting (OM): The candidate phrase is an instance of mentioned language that does not fit the above three categories. [sent-164, score-0.42]

62 5) Not Mention (XX): The candidate phrase is not mentioned language. [sent-165, score-0.387]

63 The OM category was occupied mostly by instances of speech or language production by an agent, as illustrated by the two OM examples in Table 2. [sent-171, score-0.178]

64 Table 1: The by-category composition of candidate instances in the Enhanced Cues corpus. [sent-172, score-0.235]

Category                    Code   Frequency
Words as Words              WW     438
Names as Names              NN     117
Spelling or Pronunciation   SP     48
Other Mention/Interesting   OM     26
Not Mention                 XX     1,764

65 In the interest of revealing both lexical and syntactic cues for mentioned language, part-of-speech tags were computed (using NLTK (Loper & Bird 2002)) for words in all of the sentences containing candidate instances. [sent-173, score-0.583]

66 Although the heuristics for collecting candidate instances were not intended to function as a classifier, figures for precision are shown for each word: these represent the percentage of occurrences of the word which were associated with candidates identified as mentioned language. [sent-175, score-0.464]

67 For example, 80% of appearances of the verb call preceded a candidate instance that was labeled as mentioned language. [sent-176, score-0.414]
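
The per-word precision figure described above can be sketched as follows. The input format (pairs of a cue word and a boolean label) and the function name `cue_word_precision` are assumptions made for illustration, and the toy observations are invented rather than drawn from the corpus.

```python
from collections import Counter


def cue_word_precision(observations):
    """Compute frequency and precision (%) for each cue word.

    observations: iterable of (cue_word, is_mention) pairs, one per
    occurrence of the word near a candidate instance.
    Returns {cue_word: (frequency, precision_percent)}.
    """
    totals, positives = Counter(), Counter()
    for word, is_mention in observations:
        totals[word] += 1
        if is_mention:
            positives[word] += 1
    return {w: (totals[w], 100.0 * positives[w] / totals[w]) for w in totals}


# Invented toy data: 4 of 5 occurrences of "call" precede mentioned language.
toy = [("call", True)] * 4 + [("call", False)] + [("use", True), ("use", False)]
stats = cue_word_precision(toy)
print(stats)
```

In this toy input, "call" reaches 80% precision because four of its five occurrences are paired with positive candidates.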

68 NN Digeri is the name of a Thracian tribe mentioned by Pliny the Elder, in The Natural History. [sent-179, score-0.28]

69 Candidate phrases appear underlined, with the original stylistic cues removed. [sent-190, score-0.499]

70 Many of these words appeared as mention words for the Combined Cues corpus, indicating that prior intuitions about framing metalanguage were correct. [sent-191, score-0.57]

71 In particular, call (v), word (n), and term (n) were exceptionally frequent and effective at associating with mentioned language. [sent-192, score-0.302]

72 In contrast, the distribution of frequencies for the words following candidate instances exhibited a “long tail”, indicating greater variation in vocabulary. [sent-193, score-0.273]

73 Rank   Word             Freq.   Precision (%)
   1      call (v)         92      80
   2      word (n)         68      95.
   3      term (n)         60
   4      name (n)         31
   5      use (v)          17
   6      know (v)         15
   7      also (rb)        13
   8      name (v)         11
   9      sometimes (rb)   9
   10     Latin (n)        9
   [sent-195, score-0.175]

74 2 Table 3: The top ten words appearing in the three-word sequences before candidate instances, with precisions of association with mentioned language. [sent-203, score-0.465]

75 Rank   Word           Freq.   Precision (%)
   1      mean (v)       31      83.
   2      name (n)       24
   3      use (v)        11
   4      meaning (n)    8
   5      derive (v)     8
   6      refers (n)     7
   7      describe (v)   6
   8      refer (v)      6
   9      word (n)       6
   10     may (md)       5
   [sent-205, score-0.192]

76 5 Table 4: The top ten words appearing in the three-word sequences after candidate instances, with precisions of association with mentioned language. [sent-211, score-0.465]

77 5 Reliability and Consistency of Annotation To provide some indication of the reliability and consistency of the Enhanced Cues Corpus, three additional expert annotators were recruited to label a subset of the candidate instances. [sent-213, score-0.215]

78 These additional annotators received guidelines for annotation that included the five categories, and they worked separately (from each other and from the primary annotator) to label 100 instances selected randomly with quotas for each category. [sent-214, score-0.18]

79 Calculations first were performed to determine the level of agreement on the mere presence of mentioned language, by mapping labels WW, NN, SP, and OM to true and XX to false. [sent-215, score-0.229]

80 All four annotators agreed upon a true label for 46 instances and a false label for 30 instances, with an average pairwise Kappa (computed via NLTK) of 0. [sent-216, score-0.18]
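
The agreement calculation can be sketched in plain Python as below. The paper reports using NLTK; this hand-written version uses Cohen's kappa for each annotator pair, which is an assumption about the statistic involved, and the label sequences are invented. The names `to_binary`, `cohen_kappa`, and `average_pairwise_kappa` are illustrative.

```python
from itertools import combinations

# Labels that count as mentioned language; XX maps to False.
MENTION_LABELS = {"WW", "NN", "SP", "OM"}


def to_binary(labels):
    """Map category labels to True (mention) or False (not mention)."""
    return [label in MENTION_LABELS for label in labels]


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def average_pairwise_kappa(annotations):
    """Mean of Cohen's kappa over all annotator pairs."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Invented labels from two hypothetical annotators.
ann1 = to_binary(["WW", "XX", "NN", "XX"])
ann2 = to_binary(["WW", "XX", "XX", "XX"])
print(cohen_kappa(ann1, ann2))
```

Passing the original category labels instead of the binary mapping gives a rough per-category comparison using the same functions.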

81 The relatively low value for WW was not expected, though it seems possible that the redaction of specific stylistic cues made annotators less certain when to apply this category. [sent-233, score-0.56]

82 Overall, these numbers suggest that, although annotators tend to agree whether a candidate instance is mentioned language or not, there is less of a consensus on how to qualify positive instances. [sent-234, score-0.439]

83 4 Discussion The Enhanced Cues corpus confirms some of the hypothesized properties of metalanguage and yields some unexpected insights. [sent-235, score-0.526]

84 Stylistic cues appear to be strongly associated with mentioned language; although the examination of candidate phrases was limited to “highlighted” text, informal perusal of the remainder of article text confirmed this association. [sent-236, score-0.681]

85 Further evidence can be seen in examples from other texts, shown below with their original stylistic cues intact: Like so many words, the meaning of “addiction” has varied wildly over time, but the trajectory might surprise you. [sent-237, score-0.589]

86 9 However, the connection between mentioned language and stylistic cues is only valuable when stylistic cues are available. [sent-248, score-1.227]

87 Still, even in their absence there appears to be an association between mentioned language and a core set of nouns and verbs. [sent-249, score-0.258]

88 Recurring patterns were observed in how mention-significant words related to mentioned language. [sent-250, score-0.229]

89 Two were particularly common: Noun apposition between a mention-significant noun and mentioned language. [sent-251, score-0.229]

90 An example of this appears in Sentence (5), consisting of the noun verb and the mentioned word have. [sent-252, score-0.258]

91 With further study, it should be possible to exploit these relationships to automatically detect mentioned language in text. [sent-255, score-0.229]

92 5 Related Work The use-mention distinction has enjoyed a long history of chiefly theoretical discussion. [sent-256, score-0.147]

93 Beyond those authors already cited, many others have addressed it as the formal topic of quotation (Davidson 1979; Cappelen & Lepore 1997; García-Carpintero 2004; Partee 1973; Quine 1940; Tarski 1933). [sent-257, score-0.191]

94 (2004), who created a corpus of metalanguage from a subset of the British National Corpus, finding that approximately 11% of spoken utterances contained some form (whether explicit or implicit) of metalanguage. [sent-268, score-0.619]

95 6 Future Work As explained in the introduction, the long-term goal of this research program is to apply an understanding of metalanguage to enhance language technologies. [sent-270, score-0.526]

96 Between these long-term and immediate goals lies an intermediate step: methods must be developed to detect and delineate metalanguage automatically. [sent-272, score-0.57]

97 Using the Enhanced Cues Corpus, a two-stage approach to automatic identification of mentioned language is being developed. [sent-273, score-0.229]

98 The first stage is detection, the determination of whether a sentence contains an instance of mentioned language. [sent-274, score-0.303]

99 The second stage is delineation, the determination of the subsequence of words in a sentence that functions as mentioned language. [sent-279, score-0.27]
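
To make the two stages concrete, here is a deliberately naive baseline. It is not the approach under development in the paper: it assumes detection can be approximated by looking for a few cue words from Table 3 (with added inflected forms), and that delineation can fall back on quotation marks preceded by a cue word within three tokens. All names (`CUE_WORDS`, `detect`, `delineate`) and both example sentences are hypothetical.

```python
import re

# Surface forms of a few cue words from Table 3; the inflected forms
# "called" and "named" are additions made for this sketch.
CUE_WORDS = {"call", "called", "word", "term", "name", "named"}


def tokenize(text):
    """Lowercase alphabetic tokens; a stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def detect(sentence):
    """Stage 1 (detection): does the sentence contain any cue word?"""
    return any(token in CUE_WORDS for token in tokenize(sentence))


def delineate(sentence, window=3):
    """Stage 2 (delineation): quoted spans with a cue word in the
    `window` tokens immediately before the opening quote mark."""
    spans = []
    for match in re.finditer(r'"([^"]+)"', sentence):
        before = tokenize(sentence[:match.start()])[-window:]
        if any(token in CUE_WORDS for token in before):
            spans.append(match.group(1))
    return spans


positive = 'The maneuver is called a "hat trick" by fans.'
negative = 'She said "hello" to the crowd.'
print(detect(positive), delineate(positive))
print(detect(negative), delineate(negative))
```

This baseline depends on quotation marks for delineation, so it cannot handle text in which stylistic cues are absent.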

100 Early efforts have focused on the associations discussed in Section 5 between mentioned language and mention-significant words. [sent-280, score-0.229]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('metalanguage', 0.526), ('rubric', 0.285), ('stylistic', 0.261), ('cues', 0.238), ('mentioned', 0.229), ('quotation', 0.191), ('anderson', 0.153), ('instances', 0.119), ('candidate', 0.116), ('distinction', 0.103), ('om', 0.1), ('saka', 0.088), ('ww', 0.085), ('reader', 0.085), ('marks', 0.078), ('enhanced', 0.078), ('conventions', 0.077), ('wilson', 0.073), ('pronunciation', 0.071), ('xx', 0.071), ('perlis', 0.066), ('spelled', 0.065), ('wikipedia', 0.065), ('phenomenon', 0.064), ('spelling', 0.064), ('football', 0.061), ('meaning', 0.061), ('annotators', 0.061), ('sp', 0.057), ('kappa', 0.057), ('highlighted', 0.057), ('skill', 0.057), ('go', 0.057), ('article', 0.054), ('name', 0.051), ('definition', 0.05), ('names', 0.05), ('pseudocode', 0.049), ('synset', 0.047), ('nn', 0.047), ('refers', 0.046), ('psychological', 0.046), ('chicago', 0.045), ('mention', 0.044), ('addi', 0.044), ('cappelen', 0.044), ('davidson', 0.044), ('delineate', 0.044), ('editorial', 0.044), ('enjoyed', 0.044), ('partee', 0.044), ('purang', 0.044), ('quine', 0.044), ('raux', 0.044), ('shomi', 0.044), ('strunk', 0.044), ('tarski', 0.044), ('threeword', 0.044), ('informal', 0.044), ('appearing', 0.043), ('dialog', 0.042), ('phrase', 0.042), ('sentence', 0.041), ('conversation', 0.04), ('descendants', 0.04), ('tough', 0.038), ('maier', 0.038), ('delimit', 0.038), ('exhibited', 0.038), ('statistic', 0.038), ('garc', 0.038), ('reliability', 0.038), ('term', 0.037), ('call', 0.036), ('contributors', 0.035), ('bus', 0.035), ('rewritten', 0.035), ('operationalize', 0.035), ('title', 0.034), ('refer', 0.034), ('token', 0.034), ('articles', 0.034), ('stated', 0.034), ('recognize', 0.034), ('instance', 0.033), ('symbol', 0.033), ('letter', 0.033), ('researcher', 0.033), ('adler', 0.033), ('precisions', 0.033), ('spoken', 0.032), ('contained', 0.032), ('loper', 0.031), ('quote', 0.031), ('category', 0.03), ('examples', 0.029), ('nltk', 0.029), ('filters', 0.029), ('appears', 0.029), ('utterances', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 195 acl-2012-The Creation of a Corpus of English Metalanguage

Author: Shomir Wilson

2 0.099640645 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

Author: Mark Johnson ; Katherine Demuth ; Michael Frank

Abstract: This paper uses an unsupervised model of grounded language acquisition to study the role that social cues play in language acquisition. The input to the model consists of (orthographically transcribed) child-directed utterances accompanied by the set of objects present in the non-linguistic context. Each object is annotated by social cues, indicating e.g., whether the caregiver is looking at or touching the object. We show how to model the task of inferring which objects are being talked about (and which words refer to which objects) as standard grammatical inference, and describe PCFG-based unigram models and adaptor grammar-based collocation models for the task. Exploiting social cues improves the performance of all models. Our models learn the relative importance of each social cue jointly with word-object mappings and collocation structure, consistent with the idea that children could discover the importance of particular social information sources during word learning.

3 0.073235169 7 acl-2012-A Computational Approach to the Automation of Creative Naming

Author: Gozde Ozbal ; Carlo Strapparava

Abstract: In this paper, we propose a computational approach to generate neologisms consisting of homophonic puns and metaphors based on the category of the service to be named and the properties to be underlined. We describe all the linguistic resources and natural language processing techniques that we have exploited for this task. Then, we analyze the performance of the system that we have developed. The empirical results show that our approach is generally effective and it constitutes a solid starting point for the automation of the naming process.

4 0.071892835 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

5 0.06376908 150 acl-2012-Multilingual Named Entity Recognition using Parallel Data and Metadata from Wikipedia

Author: Sungchul Kim ; Kristina Toutanova ; Hwanjo Yu

Abstract: In this paper we propose a method to automatically label multi-lingual data with named entity tags. We build on prior work utilizing Wikipedia metadata and show how to effectively combine the weak annotations stemming from Wikipedia metadata with information obtained through English-foreign language parallel Wikipedia sentences. The combination is achieved using a novel semi-CRF model for foreign sentence tagging in the context of a parallel English sentence. The model outperforms both standard annotation projection methods and methods based solely on Wikipedia metadata.

6 0.061857831 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

7 0.058885552 94 acl-2012-Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection

8 0.05827558 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

9 0.05790633 153 acl-2012-Named Entity Disambiguation in Streaming Data

10 0.055508588 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

11 0.055129372 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

12 0.054717366 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

13 0.054341469 18 acl-2012-A Probabilistic Model for Canonicalizing Named Entity Mentions

14 0.053600606 73 acl-2012-Discriminative Learning for Joint Template Filling

15 0.052715763 177 acl-2012-Sentence Dependency Tagging in Online Question Answering Forums

16 0.052687831 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach

17 0.051431041 134 acl-2012-Learning to Find Translations and Transliterations on the Web

18 0.04901107 14 acl-2012-A Joint Model for Discovery of Aspects in Utterances

19 0.047490913 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling

20 0.047133032 186 acl-2012-Structuring E-Commerce Inventory

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.168), (1, 0.075), (2, -0.007), (3, 0.03), (4, 0.019), (5, 0.086), (6, -0.005), (7, 0.002), (8, -0.008), (9, 0.041), (10, -0.032), (11, -0.008), (12, -0.038), (13, 0.061), (14, -0.013), (15, -0.029), (16, 0.036), (17, -0.018), (18, -0.059), (19, -0.025), (20, -0.067), (21, -0.029), (22, 0.012), (23, -0.02), (24, -0.081), (25, 0.095), (26, 0.034), (27, -0.038), (28, -0.003), (29, -0.048), (30, 0.035), (31, -0.005), (32, 0.037), (33, 0.112), (34, 0.008), (35, 0.031), (36, 0.027), (37, -0.034), (38, 0.044), (39, -0.046), (40, -0.026), (41, 0.029), (42, 0.021), (43, 0.018), (44, 0.039), (45, 0.168), (46, 0.115), (47, 0.056), (48, 0.003), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.926974 195 acl-2012-The Creation of a Corpus of English Metalanguage

Author: Shomir Wilson

2 0.67938411 7 acl-2012-A Computational Approach to the Automation of Creative Naming

Author: Gozde Ozbal ; Carlo Strapparava

3 0.64222807 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability

Author: Cristian Danescu-Niculescu-Mizil ; Justin Cheng ; Jon Kleinberg ; Lillian Lee

Abstract: Understanding the ways in which information achieves widespread public awareness is a research question of significant interest. We consider whether, and how, the way in which the information is phrased the choice of words and sentence structure — can affect this process. To this end, we develop an analysis framework and build a corpus of movie quotes, annotated with memorability information, in which we are able to control for both the speaker and the setting of the quotes. We find that there are significant differences between memorable and non-memorable quotes in several key dimensions, even after controlling for situational and contextual factors. One is lexical distinctiveness: in aggregate, memorable quotes use less common word choices, but at the same time are built upon a scaffolding of common syntactic patterns. Another is that memorable quotes tend to be more general in ways that make them easy to apply in new contexts — that is, more portable. — We also show how the concept of “memorable language” can be extended across domains. 1 Hello. My name is Inigo Montoya. Understanding what items will be retained in the public consciousness, and why, is a question of fundamental interest in many domains, including marketing, politics, entertainment, and social media; as we all know, many items barely register, whereas others catch on and take hold in many people’s minds. An active line of recent computational work has employed a variety of perspectives on this question. 892 Building on a foundation in the sociology of diffusion [27, 31], researchers have explored the ways in which network structure affects the way information spreads, with domains of interest including blogs [1, 11], email [37], on-line commerce [22], and social media [2, 28, 33, 38]. 
There has also been recent research addressing temporal aspects of how different media sources convey information [23, 30, 39] and ways in which people react differently to infor- mation on different topics [28, 36]. Beyond all these factors, however, one’s everyday experience with these domains suggests that the way in which a piece of information is expressed the choice of words, the way it is phrased might also have a fundamental effect on the extent to which it takes hold in people’s minds. Concepts that attain wide reach are often carried in messages such as political slogans, marketing phrases, or aphorisms whose language seems intuitively to be memorable, “catchy,” or otherwise compelling. Our first challenge in exploring this hypothesis is to develop a notion of “successful” language that is precise enough to allow for quantitative evaluation. We also face the challenge of devising an evaluation setting that separates the phrasing of a message from the conditions in which it was delivered highlycited quotes tend to have been delivered under compelling circumstances or fit an existing cultural, political, or social narrative, and potentially what appeals to us about the quote is really just its invocation of these extra-linguistic contexts. Is the form of the language adding an effect beyond or independent of these (obviously very crucial) factors? To — — — investigate the question, one needs a way of controlProce dJienjgus, R ofep thueb 5lic0t hof A Knonruea ,l M 8-e1e4ti Jnugly o f2 t0h1e2 A.s ?c so2c0ia1t2io Ans fso rc Ciatoiomnp fuotart Cio nmaplu Ltiantgiounisatlic Lsi,n pgaugiestsi8c 9s2–901, ling as much as possible for the role that the surrounding context of the language plays. — — The present work (i): Evaluating language-based memorability Defining what makes an utterance memorable is subtle, and scholars in several domains have written about this question. 
There is a rough consensus that an appropriate definition involves elements of both recognition people should be able to retain the quote and recognize it when they hear it invoked and production people should be motivated to refer to it in relevant situations [15]. One suggested reason for why some memes succeed is their ability to provoke emotions [16]. Alternatively, memorable quotes can be good for expressing the feelings, mood, or situation of an individual, a group, or a culture (the zeitgeist): “Certain quotes exquisitely capture the mood or feeling we wish to communicate to someone. We hear them ... and store them away for future use” [10]. None of these observations, however, serve as definitions, and indeed, we believe it desirable to — — — not pre-commit to an abstract definition, but rather to adopt an operational formulation based on external human judgments. In designing our study, we focus on a domain in which (i) there is rich use of language, some of which has achieved deep cultural penetration; (ii) there already exist a large number of external human judgments perhaps implicit, but in a form we can extract; and (iii) we can control for the setting in which the text was used. Specifically, we use the complete scripts of roughly 1000 movies, representing diverse genres, eras, and levels of popularity, and consider which lines are the most “memorable”. To acquire memorability labels, for each sentence in each script, we determine whether it has been listed as a “memorable quote” by users of the widely-known IMDb (the Internet Movie Database), and also estimate the number oftimes it appears on the Web. Both ofthese serve as memorability metrics for our purposes. When we evaluate properties of memorable quotes, we comparethemwithquotes thatarenotassessed as memorable, but were spoken by the same character, at approximately the same point in the same movie. 
This enables us to control in a fairly fine-grained way for the confounding effects of context discussed above: we can observe differences that persist even after taking into account both the speaker and the setting.

In a pilot validation study, we find that human subjects are effective at recognizing the more IMDb-memorable of two quotes, even for movies they have not seen. This motivates a search for features intrinsic to the text of quotes that signal memorability. In fact, comments provided by the human subjects as part of the task suggested two basic forms that such textual signals could take: subjects felt that (i) memorable quotes often involve a distinctive turn of phrase; and (ii) memorable quotes tend to invoke general themes that aren’t tied to the specific setting they came from, and hence can be more easily invoked for future (out of context) uses. We test both of these principles in our analysis of the data.

The present work (ii): What distinguishes memorable quotes
Under the controlled-comparison setting sketched above, we find that memorable quotes exhibit significant differences from non-memorable quotes in several fundamental respects, and these differences in the data reinforce the two main principles from the human pilot study. First, we show a concrete sense in which memorable quotes are indeed distinctive: with respect to lexical language models trained on the newswire portions of the Brown corpus [21], memorable quotes have significantly lower likelihood than their non-memorable counterparts. Interestingly, this distinctiveness takes place at the level of words, but not at the level of other syntactic features: the part-of-speech composition of memorable quotes is in fact more likely with respect to newswire. Thus, we can think of memorable quotes as consisting, in an aggregate sense, of unusual word choices built on a scaffolding of common part-of-speech patterns.
We also identify a number of ways in which memorable quotes convey greater generality. In their patterns of verb tenses, personal pronouns, and determiners, memorable quotes are structured so as to be more “free-standing,” containing fewer markers that indicate references to nearby text. Memorable quotes differ in other interesting aspects as well, such as sound distributions.

Our analysis of memorable movie quotes suggests a framework by which the memorability of text in a range of different domains could be investigated. We provide evidence that such cross-domain properties may hold, guided by one of our motivating applications in marketing. In particular, we analyze a corpus of advertising slogans, and we show that these slogans have significantly greater likelihood at both the word level and the part-of-speech level with respect to a language model trained on memorable movie quotes, compared to a corresponding language model trained on non-memorable movie quotes. This suggests that some of the principles underlying memorable text have the potential to apply across different areas.

Roadmap
§2 lays the empirical foundations of our work: the design and creation of our movie-quotes dataset, which we make publicly available (§2.1), a pilot study with human subjects validating IMDb-based memorability labels (§2.2), and further study of incorporating search-engine counts (§2.3). §3 details our analysis and prediction experiments, using both movie-quotes data and, as an exploration of cross-domain applicability, slogans data. §4 surveys related work across a variety of fields. §5 briefly summarizes and indicates some future directions.

2 I’m ready for my close-up.
2.1 Data
To study the properties of memorable movie quotes, we need a source of movie lines and a designation of memorability.
Following [8], we constructed a corpus consisting of all lines from roughly 1000 movies, varying in genre, era, and popularity; for each movie, we then extracted the list of quotes from IMDb’s Memorable Quotes page corresponding to the movie (footnote 1). A memorable quote in IMDb can appear either as an individual sentence spoken by one character, or as a multi-sentence line, or as a block of dialogue involving multiple characters. In the latter two cases, it can be hard to determine which particular portion is viewed as memorable (some involve a build-up to a punch line; others involve the follow-through after a well-phrased opening sentence), and so we focus in our comparisons on those memorable quotes that appear as a single sentence rather than a multi-line block (footnote 2).

Footnote 1: This extraction involved some edit-distance-based alignment, since the exact form of the line in the script can exhibit minor differences from the version typed into IMDb.

Figure 1: Location of memorable quotes in each decile of movie scripts (the first 10th, the second 10th, etc.), summed over all movies. The same qualitative results hold if we discard each movie’s very first and last line, which might have privileged status. (Plot not reproduced; x-axis: decile, y-axis: number of memorable quotes.)

We now formulate a task that we can use to evaluate the features of memorable quotes. Recall that our goal is to identify effects based in the language of the quotes themselves, beyond any factors arising from the speaker or context. Thus, for each (single-sentence) memorable quote M, we identify a non-memorable quote that is as similar as possible to M in all characteristics but the choice of words. This means we want it to be spoken by the same character in the same movie. It also means that we want it to have the same length: controlling for length is important because we expect that on average, shorter quotes will be easier to remember than long quotes, and that wouldn’t be an interesting textual effect to report.
Moreover, we also want to control for the fact that a quote’s position in a movie can affect memorability: certain scenes produce more memorable dialogue, and as Figure 1 demonstrates, in aggregate memorable quotes also occur disproportionately near the beginnings and especially the ends of movies.

In summary, then, for each M, we pick a contrasting (single-sentence) quote N from the same movie that is as close in the script as possible to M (either before or after it), subject to the conditions that (i) M and N are uttered by the same speaker, (ii) M and N have the same number of words, and (iii) N does not occur in the IMDb list of memorable quotes for the movie (either as a single line or as part of a larger block).

Footnote 2: We also ran experiments relaxing the single-sentence assumption, which allows for stricter scene control and a larger dataset but complicates comparisons involving syntax. The non-syntax results were in line with those reported here.

Table 1: Three example quote pairs. In each pair, the two quotes are spoken close together in the movie by the same character, have the same length, and one is labeled memorable by the IMDb while the other is not. (Contractions such as “it’s” count as two words.) (Table contents not recoverable from the extracted text.)

Given such pairs, we formulate a pairwise comparison task: given M and N, determine which is the memorable quote. Psychological research on subjective evaluation [35], as well as initial experiments using ourselves as subjects, indicated that this pairwise set-up is easier to work with than simply presenting a single sentence and asking whether it is memorable or not; the latter requires agreement on an “absolute” criterion for memorability that is very hard to impose consistently, whereas the former simply requires a judgment that one quote is more memorable than another.
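The selection of a contrasting quote N for each memorable quote M can be sketched as follows. This is only an illustrative sketch, not the authors' code: the `Line` record, its `memorable` flag, and the rough `word_count` rule (which treats any token with an internal apostrophe as two words) are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

PUNCT = "'\"’“”.,!?;:"


@dataclass
class Line:
    index: int       # position of the line in the script (hypothetical field)
    speaker: str
    text: str
    memorable: bool  # hypothetical flag: line appears in the IMDb list


def word_count(text: str) -> int:
    """Rough word count; a token with an internal apostrophe counts as two."""
    total = 0
    for tok in text.split():
        core = tok.strip(PUNCT)
        total += 2 if ("'" in core or "’" in core) else 1
    return total


def pick_contrast(m: Line, script: List[Line]) -> Optional[Line]:
    """Closest non-memorable line with the same speaker and word count as m."""
    candidates = [
        n for n in script
        if not n.memorable
        and n.speaker == m.speaker
        and word_count(n.text) == word_count(m.text)
    ]
    if not candidates:
        return None
    # "As close in the script as possible", either before or after M.
    return min(candidates, key=lambda n: abs(n.index - m.index))
```

A memorable quote with no candidate satisfying all three conditions simply yields no pair in this sketch.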
Our main dataset, available at http://www.cs.cornell.edu/~cristian/memorability.html (footnote 3), thus consists of approximately 2200 such (M, N) pairs, separated by a median of 5 same-character lines in the script. The reader can get a sense for the nature of the data from the three examples in Table 1.

We now discuss two further aspects to the formulation of the experiment: a preliminary pilot study involving human subjects, and the incorporation of search engine counts into the data.

2.2 Pilot study: Human performance
As a preliminary consideration, we did a small pilot study to see if humans can distinguish memorable from non-memorable quotes, assuming our IMDb-induced labels as gold standard. Six subjects, all native speakers of English and none an author of this paper, were presented with 11 or 12 pairs of memorable vs. non-memorable quotes; again, we controlled for extra-textual effects by ensuring that in each pair the two quotes come from the same movie, are by the same character, have the same length, and appear as nearly as possible in the same scene (footnote 4). The order of quotes within pairs was randomized. Importantly, because we wanted to understand whether the language of the quotes by itself contains signals about memorability, we chose quotes from movies that the subjects said they had not seen. (This means that each subject saw a different set of quotes.) Moreover, the subjects were requested not to consult any external sources of information (footnote 5). The reader is welcome to try a demo version of the task at http://www.cs.cornell.edu/~cristian/memorability.html.

Footnote 3: Also available there: other examples and factoids.

Table 2: Human pilot study: number of matches to IMDb-induced annotation, ordered by decreasing match percentage. For the null hypothesis of random guessing, these results are statistically significant, $p < 2^{-6} \approx .016$. (Table contents not present in the extracted text.)
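One way to read the significance bound quoted for the pilot study is sketched below. It rests on an assumption not spelled out in the text: under the null hypothesis, each of the six subjects independently has probability at most 1/2 of scoring better than chance, so the probability that all six do so is at most (1/2)^6.

```python
# Assumption (ours): under random guessing, each subject independently
# beats chance with probability at most 1/2.
n_subjects = 6
p_bound = 0.5 ** n_subjects  # probability that all six beat chance

print(f"p < {p_bound:.4f}")  # 1/64, i.e. roughly .016
```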
Table 2 shows that all the subjects performed (sometimes much) better than chance, and against the null hypothesis that all subjects are guessing randomly, the results are statistically significant, $p < 2^{-6} \approx .016$. These preliminary findings provide evidence for the validity of our task: despite the apparent difficulty of the job, even humans who haven’t seen the movie in question can recover our IMDb-induced labels with some reliability (footnote 6).

Footnote 4: In this pilot study, we allowed multi-sentence quotes.

Footnote 5: We did not use crowd-sourcing because we saw no way to ensure that this condition would be obeyed by arbitrary subjects. We do note, though, that after our research was completed and as of Apr. 26, 2012, ≈ 11,300 people completed the online test: average accuracy: 72%, mode number correct: 9/12.

2.3 Incorporating search engine counts
Thus far we have discussed a dataset in which memorability is determined through an explicit labeling drawn from the IMDb. Given the “production” aspect of memorability discussed in §1, we should also expect that memorable quotes will tend to appear more extensively on Web pages than non-memorable quotes; note that incorporating this insight makes it possible to use the (implicit) judgments of a much larger number of people than are represented by the IMDb database. It therefore makes sense to try using search-engine result counts as a second indication of memorability.

We experimented with several ways of constructing memorability information from search-engine counts, but this proved challenging. Searching for a quote as a stand-alone phrase runs into the problem that a number of quotes are also sentences that people use without the movie in mind, and so high counts for such quotes do not testify to the phrase’s status as a memorable quote from the movie.
On the other hand, searching for the quote in a Boolean conjunction with the movie’s title discards most of these uses, but also eliminates a large fraction of the appearances on the Web that we want to find: precisely because memorable quotes tend to have widespread cultural usage, people generally don’t feel the need to include the movie’s title when invoking them. Finally, since we are dealing with roughly 1000 movies, the result counts vary over an enormous range, from recent blockbusters to movies with relatively small fan bases.

In the end, we found that it was more effective to use the result counts in conjunction with the IMDb labels, so that the counts played the role of an additional filter rather than a free-standing numerical value. Thus, for each pair (M, N) produced using the IMDb methodology above, we searched for each of M and N as quoted expressions in a Boolean conjunction with the title of the movie. We then kept only those pairs for which M (i) produced more than five results in our (quoted, conjoined) search, and (ii) produced at least twice as many results as the corresponding search for N. We created a version of this filtered dataset using each of Google and Bing, and all the main findings were consistent with the results on the IMDb-only dataset. Thus, in what follows, we will focus on the main IMDb-only dataset, discussing the relationship to the dataset filtered by search engine counts where relevant (in which case we will refer to the +Google dataset).

Footnote 6: The average accuracy being below 100% reinforces that context is very important, too.

3 Never send a human to do a machine’s job.
We now discuss experiments that investigate the hypotheses discussed in §1. In particular, we devise methods that can assess the distinctiveness and generality hypotheses and test whether there exists a notion of “memorable language” that operates across domains.
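The search-count filter from §2.3 reduces to a two-condition test per pair. The sketch below assumes the result counts have already been obtained by some means; the function name, parameter names, and the example counts are hypothetical and no search-engine API is implied.

```python
from typing import Dict, List, Tuple


def keep_pair(count_m: int, count_n: int,
              min_results: int = 5, ratio: float = 2.0) -> bool:
    """Keep (M, N) if M has more than five results and at least twice N's."""
    return count_m > min_results and count_m >= ratio * count_n


def filter_pairs(pairs: List[Tuple[str, str]],
                 counts: Dict[str, int]) -> List[Tuple[str, str]]:
    """Apply the filter to pairs, looking counts up in a precomputed table."""
    return [(m, n) for m, n in pairs
            if keep_pair(counts.get(m, 0), counts.get(n, 0))]
```

For instance, with hypothetical counts of 40 for M and 12 for N the pair is kept, while counts of 40 and 25 fail the ratio condition.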
In addition, we evaluate and compare the predictive power of these hypotheses.

3.1 Distinctiveness
One of the hypotheses we examine is whether the use of language in memorable quotes is to some extent unusual. In order to quantify the level of distinctiveness of a quote, we take a language-model approach: we model “common language” using the newswire sections of the Brown corpus [21] (footnote 7), and evaluate how distinctive a quote is by evaluating its likelihood with respect to this model: the lower the likelihood, the more distinctive. In order to assess different levels of lexical and syntactic distinctiveness, we employ a total of six Laplace-smoothed (footnote 8) language models: 1-gram, 2-gram, and 3-gram word LMs and 1-gram, 2-gram, and 3-gram part-of-speech LMs.

We find strong evidence that from a lexical perspective, memorable quotes are more distinctive than their non-memorable counterparts. As indicated in Table 3, for each of our lexical “common language” models, in about 60% of the quote pairs, the memorable quote is more distinctive. Interestingly, the reverse is true when it comes to part-of-speech (footnote 9) syntax: memorable quotes appear to follow the syntactic patterns of “common language” as closely as or more closely than non-memorable quotes. Together, these results suggest that memorable quotes consist of unusual word sequences built on common syntactic scaffolding.

Footnote 7: Results were qualitatively similar if we used the fiction portions. The age of the Brown corpus makes it less likely to contain modern movie quotes.

Footnote 8: We employ Laplace (additive) smoothing with a smoothing parameter of 0.2. The language models’ vocabulary was that of the entire training corpus.

Footnote 9: Throughout we obtain part-of-speech tags by using the NLTK maximum entropy tagger with default parameters.

Table 3: Percentage of quote pairs in which the memorable quote is more distinctive than the non-memorable one according to the respective “common language” model. Significance according to a two-tailed sign test is indicated using *-notation (∗∗∗=“p<.001”). (Start of caption and table values lost in extraction.)
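The distinctiveness comparison can be illustrated with a small additive-smoothed n-gram model. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name, the sentence padding symbols, and the extra vocabulary slot reserved for unseen tokens are choices made here. The same class works for part-of-speech LMs if tag sequences are passed instead of words.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence


class AdditiveNgramLM:
    """n-gram language model with additive smoothing (alpha = 0.2 as in the text)."""

    def __init__(self, n: int, alpha: float = 0.2) -> None:
        self.n = n
        self.alpha = alpha
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab: set = set()

    def _pad(self, tokens: Sequence[str]) -> List[str]:
        # Padding symbols are an assumption of this sketch.
        return ["<s>"] * (self.n - 1) + list(tokens) + ["</s>"]

    def fit(self, sentences: Iterable[Sequence[str]]) -> "AdditiveNgramLM":
        for sent in sentences:
            padded = self._pad(sent)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1:i])
                self.ngrams[context + (padded[i],)] += 1
                self.contexts[context] += 1
        return self

    def log_likelihood(self, tokens: Sequence[str]) -> float:
        # One extra slot stands in for any token unseen in training (assumption).
        v = len(self.vocab) + 1
        padded = self._pad(tokens)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1:i])
            num = self.ngrams[context + (padded[i],)] + self.alpha
            den = self.contexts[context] + self.alpha * v
            total += math.log(num / den)
        return total


def more_distinctive(lm: AdditiveNgramLM,
                     memorable: Sequence[str],
                     non_memorable: Sequence[str]) -> bool:
    """True if the memorable quote is less likely under the 'common language' LM."""
    return lm.log_likelihood(memorable) < lm.log_likelihood(non_memorable)
```

Because the two quotes in a pair have the same number of words, comparing unnormalized log-likelihoods does not favor the shorter quote.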
3.2 Generality
Another of our hypotheses is that memorable quotes are easier to use outside the specific context in which they were uttered (that is, more “portable”) and therefore exhibit fewer terms that refer to those settings. We use the following syntactic properties as proxies for the generality of a quote:
• Fewer 3rd-person pronouns, since these commonly refer to a person or object that was introduced earlier in the discourse. Utterances that employ fewer such pronouns are easier to adapt to new contexts, and so will be considered more general.
• More indefinite articles like a and an, since they are more likely to refer to general concepts than definite articles. Quotes with more indefinite articles will be considered more general.
• Fewer past tense verbs and more present tense verbs, since the former are more likely to refer to specific previous events. Therefore utterances that employ fewer past tense verbs (and more present tense verbs) will be considered more general.

Table 4 gives the results for each of these four metrics; in each case, we show the percentage of quote pairs for which the memorable quote scores better on the generality metric. Note that because the issue of generality is a complex one for which there is no straightforward single metric, our approach here is based on several proxies for generality, considered independently; yet, as the results show, all of these point in a consistent direction. It is an interesting open question to develop richer ways of assessing whether a quote has greater generality, in the sense that people intuitively attribute to memorable quotes.

Table 4: Generality: percentage of quote pairs in which the memorable quote is more general than the non-memorable one according to the respective metric. Pairs where the metric does not distinguish between the quotes are not considered. (Columns: IMDb-only, +Google; table values not recoverable from the extracted text.)
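The four generality proxies can be computed from part-of-speech-tagged tokens roughly as follows. This sketch assumes Penn Treebank style tags supplied by some upstream tagger; the pronoun word list, the choice of VBD for past tense and VBP/VBZ for present tense, and all function names are assumptions made here rather than details given in the text.

```python
from typing import Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (token, Penn Treebank style tag)

# Hypothetical word list for 3rd-person personal pronouns.
THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them"}
INDEFINITE = {"a", "an"}
PAST_TAGS = {"VBD"}            # assumption: simple past only
PRESENT_TAGS = {"VBP", "VBZ"}  # assumption: present-tense finite verbs


def generality_counts(tagged: Tagged) -> Dict[str, int]:
    """Count the four proxy features in one tagged quote."""
    counts = {"third_person": 0, "indefinite": 0, "past": 0, "present": 0}
    for token, tag in tagged:
        word = token.lower()
        if tag == "PRP" and word in THIRD_PERSON:
            counts["third_person"] += 1
        if tag == "DT" and word in INDEFINITE:
            counts["indefinite"] += 1
        if tag in PAST_TAGS:
            counts["past"] += 1
        if tag in PRESENT_TAGS:
            counts["present"] += 1
    return counts


def memorable_more_general(m: Tagged, n: Tagged) -> Dict[str, Optional[bool]]:
    """Per metric: True/False, or None when the metric does not distinguish."""
    cm, cn = generality_counts(m), generality_counts(n)
    # For these metrics fewer is more general; for the others, more is.
    fewer_is_general = {"third_person", "past"}
    result: Dict[str, Optional[bool]] = {}
    for key in cm:
        if cm[key] == cn[key]:
            result[key] = None
        elif key in fewer_is_general:
            result[key] = cm[key] < cn[key]
        else:
            result[key] = cm[key] > cn[key]
    return result
```

Returning `None` for ties mirrors the convention in the Table 4 caption that pairs a metric cannot distinguish are left out of that metric's percentage.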
3.3 “Memorable” language beyond movies
One of the motivating questions in our analysis is whether there are general principles underlying “memorable language.” The results thus far suggest potential families of such principles. A further question in this direction is whether the notion of memorability can be extended across different domains, and for this we collected (and distribute on our website) 431 phrases that were explicitly designed to be memorable: advertising slogans (e.g., “Quality never goes out of style.”). The focus on slogans is also in keeping with one of the initial motivations in studying memorability, namely, marketing applications: in other words, assessing whether a proposed slogan has features that are consistent with memorable text.

The fact that it’s not clear how to construct a collection of “non-memorable” counterparts to slogans appears to pose a technical challenge. However, we can still use a language-modeling approach to assess whether the textual properties of the slogans are closer to the memorable movie quotes (as one would conjecture) or to the non-memorable movie quotes. Specifically, we train one language model on memorable quotes and another on non-memorable quotes and compare how likely each slogan is to be produced according to these two models.

Table 5 (start of caption and table values lost in extraction): percentage of slogans that have higher likelihood under the memorable language model than under the non-memorable one (for each of the six language models considered). Rightmost column: for reference, the percentage of newswire sentences that have higher likelihood under the memorable language model than under the non-memorable one.

Table 6: Slogans are most general when compared to memorable and non-memorable quotes. (%s of 3rd pers. pronouns and indefinite articles are relative to all tokens, %s of past tense are relative to all past and present verbs.) (Table values not recoverable from the extracted text.)
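The two-model comparison for slogans can be sketched with a pair of additive-smoothed unigram models. This is an illustration only: the function names are hypothetical, only the unigram case is shown, and the use of a shared vocabulary (the union of both training sets plus one slot for unseen words) is an assumption made here so that the two models are directly comparable.

```python
import math
from collections import Counter
from typing import List, Sequence


def train_unigram(sentences: Sequence[Sequence[str]]) -> Counter:
    """Token counts for one training set (memorable or non-memorable quotes)."""
    return Counter(tok for sent in sentences for tok in sent)


def log_likelihood(tokens: Sequence[str], counts: Counter,
                   vocab_size: int, alpha: float = 0.2) -> float:
    """Additive-smoothed unigram log-likelihood of a token sequence."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + alpha) / (total + alpha * vocab_size))
               for t in tokens)


def fraction_closer_to_memorable(slogans: List[Sequence[str]],
                                 memorable: Sequence[Sequence[str]],
                                 non_memorable: Sequence[Sequence[str]]) -> float:
    """Share of slogans with higher likelihood under the memorable-quote model."""
    mem, non = train_unigram(memorable), train_unigram(non_memorable)
    vocab_size = len(set(mem) | set(non)) + 1  # +1 for unseen tokens (assumption)
    wins = sum(
        log_likelihood(s, mem, vocab_size) > log_likelihood(s, non, vocab_size)
        for s in slogans
    )
    return wins / len(slogans)
```

In the paired dataset the two training sets contain the same number of quotes of matching lengths, so neither model gains an advantage from corpus size; the toy inputs one might try with this sketch will not generally have that property.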
As shown in the middle column of Table 5, we find that slogans are better predicted both lexically and syntactically by the former model. This result thus offers evidence for a concept of “memorable language” that can be applied beyond a single domain. We also note that the higher likelihood of slogans under a “memorable language” model is not simply occurring for the trivial reason that this model predicts all other large bodies of text better. In particular, the newswire section of the Brown corpus is predicted better at the lexical level by the language model trained on non-memorable quotes.

Finally, Table 6 shows that slogans employ general language, in the sense that for each of our generality metrics, we see a slogans/memorable quotes/non-memorable quotes spectrum.

3.4 Prediction task
We now show how the principles discussed above can provide features for a basic prediction task, corresponding to the task in our human pilot study: given a pair of quotes, identify the memorable one.

Our first formulation of the prediction task uses a standard bag-of-words model (footnote 10). If there were no information in the textual content of a quote to determine whether it was memorable, then an SVM employing bag-of-words features should perform no better than chance. Instead, though, it obtains 59.67% (10-fold cross-validation) accuracy, as shown in Table 7. We then develop models using features based on the measures formulated earlier in this section: generality measures (the four listed in Table 4); distinctiveness measures (likelihood according to 1, 2, and 3-gram “common language” models at the lexical and part-of-speech level for each quote in the pair, their differences, and pairwise comparisons between them); and similarity-to-slogans measures (likelihood according to 1, 2, and 3-gram slogan-language models at the lexical and part-of-speech level for each quote in the pair, their differences, and pairwise comparisons between them).
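The pairwise feature layout described above (a score for each quote, their difference, and a comparison between them) can be sketched as below. The feature names and the exact encoding of the comparison as a 0/1 indicator are assumptions made here; the resulting vector would then be passed to whatever classifier is used.

```python
from typing import Dict, List


def pair_features(scores_a: Dict[str, float],
                  scores_b: Dict[str, float]) -> List[float]:
    """For each named score: value for A, value for B, difference, A > B flag."""
    features: List[float] = []
    for name in sorted(scores_a):  # fixed order keeps vectors aligned across pairs
        a, b = scores_a[name], scores_b[name]
        features.extend([a, b, a - b, 1.0 if a > b else 0.0])
    return features


# Hypothetical log-likelihoods under lexical and part-of-speech unigram models.
example = pair_features({"lex1": -12.0, "pos1": -7.5},
                        {"lex1": -10.0, "pos1": -8.0})
```

Since the order of quotes within a pair carries the label, pairs should be presented in randomized order before features are built, as was done for the human subjects.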
Even a relatively small number of distinctiveness features, on their own, improve significantly over the much larger bag-of-words model. When we include additional features based on generality and language-model features measuring similarity to slogans, the performance improves further (last line of Table 7). Thus, the main conclusion from these prediction tasks is that abstracting notions such as distinctiveness and generality can produce relatively streamlined models that outperform much heavier-weight bag-of-words models, and can suggest steps toward approaching the performance of human judges who (very much unlike our system) have the full cultural context in which movies occur at their disposal.

3.5 Other characteristics
We also made some auxiliary observations that may be of interest. Specifically, we find differences in letter and sound distribution (e.g., memorable quotes, after curse-word removal, use significantly more “front sounds” (labials or front vowels such as represented by the letter i) and significantly fewer “back sounds” such as the one represented by u) (footnote 11), word complexity (e.g., memorable quotes use words with significantly more syllables) and phrase complexity (e.g., memorable quotes use fewer coordinating conjunctions). The latter two are in line with our distinctiveness hypothesis.

Footnote 10: We discarded terms appearing fewer than 10 times.

Footnote 11: These findings may relate to marketing research on sound symbolism [7, 19, 40].

Table 7: Prediction: SVM 10-fold cross-validation results using the respective feature sets. Random baseline accuracy is 50%. Accuracies statistically significantly greater than bag-of-words according to a two-tailed t-test are indicated with *(p<.05) and **(p<.01). (Table values not recoverable from the extracted text.)
4 A long time ago, in a galaxy far, far away
How an item’s linguistic form affects the reaction it generates has been studied in several contexts, including evaluations of product reviews [9], political speeches [12], on-line posts [13], scientific papers [14], and retweeting of Twitter posts [36]. We use a different set of features, abstracting the notions of distinctiveness and generality, in order to focus on these higher-level aspects of phrasing rather than on particular lower-level features.

Related to our interest in distinctiveness, work in advertising research has studied the effect of syntactic complexity on recognition and recall of slogans [5, 6, 24]. There may also be connections to Von Restorff’s isolation effect (Hunt [17]), which asserts that when all but one item in a list are similar in some way, memory for the different item is enhanced. Related to our interest in generality, Knapp et al. [20] surveyed subjects regarding memorable messages or pieces of advice they had received, finding that the ability to be applied to multiple concrete situations was an important factor.

Memorability, although distinct from “memorizability”, relates to short- and long-term recall. Thorn and Page [34] survey sub-lexical, lexical, and semantic attributes affecting short-term memorability of lexical items. Studies of verbatim recall have also considered the task of distinguishing an exact quote from close paraphrases [3]. Investigations of long-term recall have included studies of culturally significant passages of text [29] and findings regarding the effect of rhetorical devices such as alliteration [4] and “rhythmic, poetic, and thematic constraints” [18, 26].

Finally, there are complex connections between humor and memory [32], which may lead to interactions with computational humor recognition [25].

5 I think this is the beginning of a beautiful friendship.
Motivated by the broad question of what kinds of information achieve widespread public awareness, we studied the effect of phrasing on a quote’s memorability. A challenge is that quotes differ not only in how they are worded, but also in who said them and under what circumstances; to deal with this difficulty, we constructed a controlled corpus of movie quotes in which lines deemed memorable are paired with non-memorable lines spoken by the same character at approximately the same point in the same movie. After controlling for context and situation, memorable quotes were still found to exhibit, on average (there will always be individual exceptions), significant differences from non-memorable quotes in several important respects, including measures capturing distinctiveness and generality. Our experiments with slogans show how the principles we identify can extend to a different domain.

Future work may lead to applications in marketing, advertising and education [4]. Moreover, the subtle nature of memorability, and its connection to research in psychology, suggests a range of further research directions. We believe that the framework developed here can serve as the basis for further computational studies of the process by which information takes hold in the public consciousness, and the role that language effects play in this process.

My mother thanks you. My father thanks you. My sister thanks you. And I thank you:
Rebecca Hwa, Evie Kleinberg, Diana Minculescu, Alex Niculescu-Mizil, Jennifer Smith, Benjamin Zimmer, and the anonymous reviewers for helpful discussions and comments; our annotators Steven An, Lars Backstrom, Eric Baumer, Jeff Chadwick, Evie Kleinberg, and Myle Ott; and the makers of Cepacol, Robitussin, and Sudafed, whose products got us through the submission deadline. This paper is based upon work supported in part by NSF grants IIS-0910664, IIS-1016099, Google, and Yahoo!

References
[1] Eytan Adar, Li Zhang, Lada A.
Adamic, and Rajan M. Lukose. Implicit structure and the dynamics of blogspace. In Workshop on the Weblogging Ecosystem, 2004.
[2] Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, and Xiangyang Lan. Group formation in large social networks: Membership, growth, and evolution. In Proceedings of KDD, 2006.
[3] Elizabeth Bates, Walter Kintsch, Charles R. Fletcher, and Vittoria Giuliani. The role of pronominalization and ellipsis in texts: Some memory experiments. Journal of Experimental Psychology: Human Learning and Memory, 6(6):676–691, 1980.
[4] Frank Boers and Seth Lindstromberg. Finding ways to make phrase-learning feasible: The mnemonic effect of alliteration. System, 33(2):225–238, 2005.
[5] Samuel D. Bradley and Robert Meeds. Surface-structure transformations and advertising slogans: The case for moderate syntactic complexity. Psychology and Marketing, 19:595–619, 2002.
[6] Robert Chamblee, Robert Gilmore, Gloria Thomas, and Gary Soldow. When copy complexity can help ad readership. Journal of Advertising Research, 33(3):23–23, 1993.
[7] John Colapinto. Famous names. The New Yorker, pages 38–43, 2011.
[8] Cristian Danescu-Niculescu-Mizil and Lillian Lee. Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, 2011.
[9] Cristian Danescu-Niculescu-Mizil, Gueorgi Kossinets, Jon Kleinberg, and Lillian Lee. How opinions are received by online communities: A case study on Amazon.com helpfulness votes. In Proceedings of WWW, pages 141–150, 2009.
[10] Stuart Fischoff, Esmeralda Cardenas, Angela Hernandez, Korey Wyatt, Jared Young, and Rachel Gordon. Popular movie quotes: Reflections of a people and a culture. In Annual Convention of the American Psychological Association, 2000.
[11] Daniel Gruhl, R. Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion through blogspace. Proceedings of WWW, pages 491–501, 2004.
[12] Marco Guerini, Carlo Strapparava, and Oliviero Stock. Trusting politicians’ words (for persuasive NLP). In Proceedings of CICLing, pages 263–274, 2008.
[13] Marco Guerini, Carlo Strapparava, and Gözde Özbal. Exploring text virality in social networks. In Proceedings of ICWSM (poster), 2011.
[14] Marco Guerini, Alberto Pepe, and Bruno Lepri. Do linguistic style and readability of scientific abstracts affect their virality? In Proceedings of ICWSM, 2012.
[15] Richard Jackson Harris, Abigail J. Werth, Kyle E. Bures, and Chelsea M. Bartel. Social movie quoting: What, why, and how? Ciencias Psicologicas, 2(1):35–45, 2008.
[16] Chip Heath, Chris Bell, and Emily Steinberg. Emotional selection in memes: The case of urban legends. Journal of Personality, 81(6):1028–1041, 2001.
[17] R. Reed Hunt. The subtlety of distinctiveness: What von Restorff really did. Psychonomic Bulletin & Review, 2(1):105–112, 1995.
[18] Ira E. Hyman Jr. and David C. Rubin. Memorabeatlia: A naturalistic study of long-term memory. Memory & Cognition, 18(2):205–214, 1990.
[19] Richard R. Klink. Creating brand names with meaning: The use of sound symbolism. Marketing Letters, 11(1):5–20, 2000.
[20] Mark L. Knapp, Cynthia Stohl, and Kathleen K. Reardon. “Memorable” messages. Journal of Communication, 31(4):27–41, 1981.
[21] Henry Kučera and W. Nelson Francis. Computational analysis of present-day American English. Dartmouth Publishing Group, 1967.
[22] Jure Leskovec, Lada Adamic, and Bernardo Huberman. The dynamics of viral marketing. ACM Transactions on the Web, 1(1), May 2007.
[23] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. In Proceedings of KDD, pages 497–506, 2009.
[24] Tina M. Lowrey. The relation between script complexity and commercial memorability. Journal of Advertising, 35(3):7–15, 2006.
[25] Rada Mihalcea and Carlo Strapparava. Learning to laugh (automatically): Computational models for humor recognition. Computational Intelligence, 22(2):126–142, 2006.
[26] Milman Parry and Adam Parry. The making of Homeric verse: The collected papers of Milman Parry. Clarendon Press, Oxford, 1971.
[27] Everett Rogers. Diffusion of Innovations. Free Press, fourth edition, 1995.
[28] Daniel M. Romero, Brendan Meeder, and Jon Kleinberg. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on Twitter. Proceedings of WWW, pages 695–704, 2011.
[29] David C. Rubin. Very long-term memory for prose and verse. Journal of Verbal Learning and Verbal Behavior, 16(5):611–621, 1977.
[30] Nathan Schneider, Rebecca Hwa, Philip Gianfortoni, Dipanjan Das, Michael Heilman, Alan W. Black, Frederick L. Crabbe, and Noah A. Smith. Visualizing topical quotations over time to understand news discourse. Technical Report CMU-LTI-01-103, CMU, 2010.
[31] David Strang and Sarah Soule. Diffusion in organizations and social movements: From hybrid corn to poison pills. Annual Review of Sociology, 24:265–290, 1998.
[32] Hannah Summerfelt, Louis Lippman, and Ira E. Hyman Jr. The effect of humor on memory: Constrained by the pun. The Journal of General Psychology, 137(4), 2010.
[33] Eric Sun, Itamar Rosenn, Cameron Marlow, and Thomas M. Lento. Gesundheit! Modeling contagion through Facebook News Feed. In Proceedings of ICWSM, 2009.
[34] Annabel Thorn and Mike Page. Interactions Between Short-Term and Long-Term Memory in the Verbal Domain. Psychology Press, 2009.
[35] Louis L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927.
[36] Oren Tsur and Ari Rappoport. What’s in a Hashtag? Content based prediction of the spread of ideas in microblogging communities. In Proceedings of WSDM, 2012.
[37] Fang Wu, Bernardo A. Huberman, Lada A. Adamic, and Joshua R. Tyler. Information flow in social groups. Physica A: Statistical and Theoretical Physics, 337(1-2):327–335, 2004.
[38] Shaomei Wu, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. Who says what to whom on Twitter. In Proceedings of WWW, 2011.
[39] Jaewon Yang and Jure Leskovec. Patterns of temporal variation in online media. In Proceedings of WSDM, 2011.
[40] Eric Yorkston and Geeta Menon. A sound idea: Phonetic effects of brand names on consumer judgments. Journal of Consumer Research, 31(1):43–51, 2004.

4 0.62994969 186 acl-2012-Structuring E-Commerce Inventory

Author: Karin Mauge ; Khash Rohanimanesh ; Jean-David Ruvini

Abstract: Large e-commerce enterprises feature millions of items entered daily by a large variety of sellers. While some sellers provide rich, structured descriptions of their items, a vast majority of them provide unstructured natural language descriptions. In the paper we present a two-step method for structuring items into descriptive properties. The first step consists of unsupervised property discovery and extraction. The second step involves supervised property synonym discovery using a maximum entropy based clustering algorithm. We evaluate our method on a year's worth of e-commerce data and show that it achieves excellent precision with good recall.

5 0.6287685 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

6 0.50884986 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

7 0.49958387 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

8 0.49693257 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit

9 0.48766047 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

10 0.48710039 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

11 0.4631739 112 acl-2012-Humor as Circuits in Semantic Networks

12 0.45985883 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

13 0.44668308 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

14 0.44201854 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

15 0.43671593 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction

16 0.42084709 129 acl-2012-Learning High-Level Planning from Text

17 0.42023361 15 acl-2012-A Meta Learning Approach to Grammatical Error Correction

18 0.41751552 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

19 0.41627133 6 acl-2012-A Comprehensive Gold Standard for the Enron Organizational Hierarchy

20 0.40946507 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.022), (26, 0.04), (28, 0.04), (30, 0.027), (37, 0.023), (39, 0.069), (59, 0.013), (74, 0.029), (82, 0.014), (84, 0.418), (85, 0.015), (90, 0.095), (92, 0.045), (94, 0.019), (99, 0.064)]
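
The topic list above is a sparse vector of (topicId, topicWeight) pairs. As a minimal sketch of how such a vector can be read and compared, the Python below picks out the dominant topic and computes a cosine similarity between two sparse vectors. The function names are hypothetical, and cosine similarity is only one common choice: this page does not say how simValue is computed, and the same-paper entry scoring about 0.908 rather than 1.0 suggests the listed values come from some other measure.

```python
import math

# (topicId, topicWeight) pairs copied verbatim from the listing above.
TOPICS = [(25, 0.022), (26, 0.04), (28, 0.04), (30, 0.027), (37, 0.023),
          (39, 0.069), (59, 0.013), (74, 0.029), (82, 0.014), (84, 0.418),
          (85, 0.015), (90, 0.095), (92, 0.045), (94, 0.019), (99, 0.064)]


def dominant_topic(pairs):
    """Return the (topicId, topicWeight) pair with the largest weight."""
    return max(pairs, key=lambda pair: pair[1])


def cosine_similarity(a, b):
    """Cosine similarity of two sparse topic vectors.

    Illustrative assumption only: not necessarily the measure used
    to produce the simValue column on this page.
    """
    da, db = dict(a), dict(b)
    # Topics missing from one vector contribute a weight of zero.
    dot = sum(weight * db.get(topic, 0.0) for topic, weight in da.items())
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(dominant_topic(TOPICS))
    print(round(sum(w for _, w in TOPICS), 3))
```

The listed weights sum to less than 1, so some low-weight topics may have been left out of the listing; the page itself does not confirm this.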

similar papers list:

simIndex simValue paperId paperTitle

1 0.93068004 122 acl-2012-Joint Evaluation of Morphological Segmentation and Syntactic Parsing

Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson

Abstract: We present novel metrics for parse evaluation in joint segmentation and parsing scenarios where the gold sequence of terminals is not known in advance. The protocol uses distance-based metrics defined for the space of trees over lattices. Our metrics allow us to precisely quantify the performance gap between non-realistic parsing scenarios (assuming gold segmented and tagged input) and realistic ones (not assuming gold segmentation and tags). Our evaluation of segmentation and parsing for Modern Hebrew sheds new light on the performance of the best parsing systems to date in the different scenarios.

2 0.90886527 68 acl-2012-Decoding Running Key Ciphers

Author: Sravana Reddy ; Kevin Knight

Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.

same-paper 3 0.90839767 195 acl-2012-The Creation of a Corpus of English Metalanguage

Author: Shomir Wilson

4 0.88699919 135 acl-2012-Learning to Temporally Order Medical Events in Clinical Text

Author: Preethi Raghavan ; Albert Lai ; Eric Fosler-Lussier

Abstract: We investigate the problem of ordering medical events in unstructured clinical narratives by learning to rank them based on their time of occurrence. We represent each medical event as a time duration, with a corresponding start and stop, and learn to rank the starts/stops based on their proximity to the admission date. Such a representation allows us to learn all of Allen’s temporal relations between medical events. Interestingly, we observe that this methodology performs better than a classification-based approach for this domain, but worse on the relationships found in the Timebank corpus. This finding has important implications for styles of data representation and resources used for temporal relation learning: clinical narratives may have different language attributes corresponding to temporal ordering relative to Timebank, implying that the field may need to look at a wider range of domains to fully understand the nature of temporal ordering.

5 0.8342616 93 acl-2012-Fast Online Lexicon Learning for Grounded Language Acquisition

Author: David Chen

Abstract: Learning a semantic lexicon is often an important first step in building a system that learns to interpret the meaning of natural language. It is especially important in language grounding where the training data usually consist of language paired with an ambiguous perceptual context. Recent work by Chen and Mooney (2011) introduced a lexicon learning method that deals with ambiguous relational data by taking intersections of graphs. While the algorithm produced good lexicons for the task of learning to interpret navigation instructions, it only works in batch settings and does not scale well to large datasets. In this paper we introduce a new online algorithm that is an order of magnitude faster and surpasses the state-of-the-art results. We show that by changing the grammar of the formal meaning representation language and training on additional data collected from Amazon’s Mechanical Turk we can further improve the results. We also include experimental results on a Chinese translation of the training data to demonstrate the generality of our approach.

6 0.50136262 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

7 0.47296637 111 acl-2012-How Are Spelling Errors Generated and Corrected? A Study of Corrected and Uncorrected Spelling Errors Using Keystroke Logs

8 0.47242612 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

9 0.47153655 139 acl-2012-MIX Is Not a Tree-Adjoining Language

10 0.46809489 34 acl-2012-Automatically Learning Measures of Child Language Development

11 0.46645641 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

12 0.46558788 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

13 0.46326923 194 acl-2012-Text Segmentation by Language Using Minimum Description Length

14 0.45678166 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

15 0.45536885 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

16 0.44705436 104 acl-2012-Graph-based Semi-Supervised Learning Algorithms for NLP

17 0.44508976 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing

18 0.44274437 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

19 0.43275312 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

20 0.43070626 129 acl-2012-Learning High-Level Planning from Text