acl acl2011 acl2011-306 knowledge-graph by maker-knowledge-mining

306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style


Source: pdf

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. [sent-2, score-0.807]

2 We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. [sent-3, score-1.654]

3 Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. [sent-4, score-0.085]

4 We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study. [sent-5, score-1.489]

5 1 Introduction We live in a world with an ever increasing amount and variety of information. [sent-6, score-0.041]

6 A great deal of that content is in a textual format. [sent-7, score-0.045]

7 As such, it is not uncommon to want to gain access to this information when a visual display is not convenient or available (while driving or walking for example). [sent-9, score-0.096]

8 One way of addressing this issue is to use audio displays and, in particular, have users listen to content read to them by a speech synthesizer instead of reading it themselves on a display. [sent-10, score-0.695]

9 While listening to speech opens many opportunities, it also has issues which must be considered when using it as a replacement for reading. [sent-11, score-0.071]

10 One important consideration is that the text that was originally written to be read might not be suitable to be listened to. [sent-12, score-0.26]

11 radio news broadcast) compared (∗Work conducted while interning at Intel Labs; Barbara Rosario, Intel Labs, Santa Clara, CA, USA) [sent-15, score-0.172]

12 One key reason for the difference is that understanding is more important than grammar to a radio news writer. [sent-22, score-0.172]

13 Furthermore, audio has different perceptual and information qualities compared to reading. [sent-23, score-0.397]

14 For example, the use of the negations not and no should be limited since it is easy for listeners to miss that single utterance. [sent-24, score-0.051]

15 Listeners cannot relisten to a word, and missing it has a huge impact on meaning. [sent-25, score-0.043]

16 In this paper, we address the problem of changing the writing style of text to make it suitable for being listened to instead of being read. [sent-26, score-0.139]

17 We start by researching the writing-style differences across text and audio in the linguistics and journalism literatures. [sent-27, score-0.632]

18 Based on this study, we suggest a number of linguistic features that set the two styles apart. [sent-28, score-0.228]

19 We validate these features statistically by analyzing their distributions in a corpus of parallel text- and audio-style documents; and experimentally through a style classification task. [sent-29, score-0.592]

20 Moreover, we evaluate the impact of style transformation on the user experience by conducting a user study. [sent-30, score-0.934]

21 In Section 3, we summarize the main style differences as they appear in the journalism and linguistics literatures. [sent-33, score-0.711]

22 The features that we propose and their validation are discussed in Section 5. [sent-35, score-0.048]

23 In Section 6, we describe the user study and discuss the results. [sent-36, score-0.165]

24 2 Related Work There has been a considerable amount of research on the language variations for different registers and (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: short papers, pages 248–253, Portland, Oregon, June 19-24, 2011) [sent-38, score-0.066]

25 genres in the linguistics community, including research that focused on the variations between written and spoken language (Biber, 1988; Halliday, 1985; Esser, 1993; Whittaker et al. [sent-40, score-0.202]

26 For example, Biber (1988) provides an exhaustive study of such variations. [sent-42, score-0.049]

27 He uses computational techniques to analyze the linguistic characteristics of twenty-three spoken and written genres, enabling identification of the basic, underlying dimensions of variation in English. [sent-43, score-0.169]

28 Halliday (1985) performs a comparative study of spoken and written language, contrasting the prosodic features and grammatical intricacy of speech with the high lexical density and grammatical metaphor of writing. [sent-44, score-0.269]

29 Esser (2000) proposes a general framework for the different presentation structures of medium-dependent linguistic units. [sent-45, score-0.037]

30 Most of these studies focus on the variations between the written and the spontaneous spoken language. [sent-46, score-0.167]

31 Our focus is on the written language for audio, i. [sent-47, score-0.084]

32 on a style that we hypothesize to be somewhere between the formally written and spontaneous speech styles. [sent-49, score-0.631]

33 Fang (1991) provides a pragmatic analysis and a side-by-side comparison of the "writing style differences in newspaper, radio, and television news" as part of the instructions for journalism students learning to write for the three different media. [sent-50, score-0.705]

34 In this work the goal is to optimize the style, and generation is one approach to that end (we plan to address it in future work). Authorship attribution (Mosteller and Wallace, 1964; Stamatatos et al. [sent-55, score-0.095]

35 , 2007; Schler and Argamon, 2009) is also related to our work since arguably different authors write in different styles. [sent-58, score-0.039]

36 (2003) explored differences between male and female writing in a large subset of the British National Corpus covering a range of genres. [sent-60, score-0.27]

37 They used lexical features based on taxonomies of various semantic functions of different lexical items (words or phrases). [sent-63, score-0.048]

38 These studies focused on the correlation between style of the text and the personal characteristics of its author. [sent-64, score-0.507]

39 In our work, we focus on the change in writing style according to the change of the medium. [sent-65, score-0.667]

40 3 Writing Style Differences Across Text and Audio In this section, we summarize the literature on writing style differences across text and audio. [sent-66, score-0.777]

41 Writing styles for different media have evolved due to the unique nature of each medium and to the manner in which its audience consumes it. [sent-68, score-0.185]

42 For example, in audio, the information must be consumed sequentially and the listener does not have the option to skip the information that she finds less interesting. [sent-69, score-0.179]

43 The eye can skip around in text, but there is no such option with listening. [sent-71, score-0.04]

44 Moreover, unlike attentive readers of text, audio listeners may be engaged in some task (e. [sent-72, score-0.448]

45 ) other than absorbing the information they listen to, and therefore are paying less attention. [sent-75, score-0.112]

46 All these differences of the audio medium affect the length of sentences, the choice of words, the structure of phrases of attribution, the use of pronouns, etc. [sent-76, score-0.549]

47 Some general guidelines of audio style (Biber, 1988; Fang, 1991) include 1) the choice of simple words and short, declarative sentences with active voice preferred. [sent-77, score-0.954]

48 It is better to repeat a name, so that the listener will not have to pause or replay to recall. [sent-81, score-0.171]

49 5) Direct quotations are uncommon and the person being quoted is identified before the quotation. [sent-82, score-0.105]

50 7) Numbers should be approximated so that they can be understood. (Figure 1: The distributions of three features, Average Sentence Length, Percentage of Complex Words, and Ratio of Adverbs, for both articles and transcripts.) [sent-85, score-0.319]

51 8) Adjectives and adverbs should be used only when necessary for the meaning. [sent-87, score-0.054]

52 4 Data In order to determine the differences between the text and audio styles, we needed textual data that ideally covered the same semantic content but was produced for the two different media. [sent-88, score-0.552]

53 Through their APIs we obtained the same semantic content in the two different styles: written text style (articles, henceforth) and in audio style (transcripts, henceforth). [sent-90, score-1.54]

54 The NPR Story API output contains links to the Transcript API when a transcript is available. [sent-91, score-0.069]

55 With the Transcript API, we were able to get full transcripts of stories heard on air. [sent-92, score-0.141]

56 We collected 3855 news articles and their corresponding transcripts. [sent-94, score-0.13]

57 The data cover a varied set of topics from four months of broadcast (from March 6 to June 3, 2010). [sent-95, score-0.038]

58 5 Features Based on the study of style differences outlined in section 3, we propose a number of document-level, linguistic features that we hypothesized distinguish the two writing styles. [sent-97, score-0.911]

59 The analysis of these features (discussed later in this section) showed that they are of different importance to style identification. [sent-101, score-0.555]

60 Table 1 shows a list of the top features and their descriptions. [sent-102, score-0.048]

61 1 Statistical Analysis The goal of this analysis is to show that the values of the features that we extracted are really different across the two styles and that the difference is significant. [sent-104, score-0.223]

62 We compute the distribution of the values of each feature in articles and its distribution in transcripts. [sent-105, score-0.093]

63 For example, Figure 1 shows the distributions of 3 features for both articles and transcripts. [sent-106, score-0.178]

64 05) reveals statistically significant difference for all of the features (p < 0. [sent-109, score-0.048]

65 This analysis corroborated our linguistic hypotheses, such as the average sentence length is longer for articles than for transcripts, complex words (more than 3 syllables) are more common in articles, articles contain more adverbs, etc. [sent-111, score-0.223]
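The features named above can be approximated with a few lines of code. The following is only a minimal sketch, not the authors' feature extractor: the sentence splitter, the tokenizer, and the vowel-group syllable counter are rough heuristics assumed here for illustration, and all function names are hypothetical. A "complex word" is taken to mean more than 3 syllables, as stated in the text.

```python
import re

def split_sentences(text):
    # Assumption: naive splitter that breaks after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def words(text):
    # Assumption: a word is a run of ASCII letters or apostrophes.
    return re.findall(r"[A-Za-z']+", text)

def count_syllables(word):
    # Heuristic (assumption): one syllable per group of consecutive vowels.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def average_sentence_length(text):
    # Mean number of words per sentence; 0.0 for empty input.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(words(s)) for s in sentences) / len(sentences)

def percent_complex_words(text, min_syllables=4):
    # "More than 3 syllables" is implemented as at least 4 syllables.
    tokens = words(text)
    if not tokens:
        return 0.0
    complex_count = sum(1 for w in tokens if count_syllables(w) >= min_syllables)
    return 100.0 * complex_count / len(tokens)
```

Computing these values for every article and every transcript would give the two per-feature distributions that the statistical analysis compares.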

66 2 Classification To further verify that our features really distinguish between the two writing styles, we conducted a classification experiment. [sent-113, score-0.24]

67 We used the features described in Table 1 (excluding the Direct Quotation feature) and the dataset described in section 4 to train a classifier. [sent-114, score-0.048]

68 We excluded the Direct Quotation feature from this experiment because it is a very distinguishing feature. [sent-119, score-0.093]

69 The vast majority of the articles in our dataset contained direct quotations and none of the transcripts did. [sent-120, score-0.23]

70 6 User Study To better understand which features are more important indicators of the style, we use Guyon et al. [sent-122, score-0.048]

71 ’s (2002) method for feature selection using SVM to rank the features based on their importance. [sent-123, score-0.048]

72 Up to this point, we know that there are differences in style between articles and transcripts, and we formalized these differences in the form of linguistic features that are easy to extract using computational techniques. [sent-125, score-0.952]

73 However, we still do not know the impact of changing the style on the user experience. [sent-126, score-0.666]

74 To address this issue, we did manual transformation of style for 50 article paragraphs. [sent-127, score-0.66]

75 The transformation was done in light of the features described in the previous section. [sent-128, score-0.167]

76 For example, if a sentence is longer than 25 words, we simplify it; and, if it is in passive voice we change it to active voice whenever possible, etc. [sent-129, score-0.1]
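The transformation in the study was done manually, but the sentence-length rule lends itself to a simple screening step. The sketch below is an assumption-laden illustration only: the helper name `flag_long_sentences` is hypothetical, the sentence splitter is naive, and words are counted by whitespace. It flags sentences longer than 25 words as candidates for simplification and does not attempt passive-voice detection.

```python
import re

MAX_WORDS = 25  # threshold taken from the example in the text

def flag_long_sentences(paragraph, max_words=MAX_WORDS):
    # Assumption: naive splitter that breaks after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    flagged = []
    for sentence in sentences:
        n_words = len(sentence.split())  # whitespace-delimited word count
        if n_words > max_words:
            # Keep the count next to the sentence so an editor can prioritize.
            flagged.append((n_words, sentence))
    return flagged
```

A human editor would still decide how to split or rephrase each flagged sentence.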

77 We used a speech synthesizer to convert the original paragraphs and their transformed versions into audio clips. [sent-130, score-0.655]

78 We used these audio clips to conduct a user study. [sent-131, score-0.607]

79 We gave human participants the audio clips to listen to and transcribe. [sent-132, score-0.572]

80 Each audio clip was divided into segments 15 seconds long. [sent-133, score-0.491]

81 Each segment can be played only once and pauses automatically when it is finished to allow the user to transcribe the segment. [sent-134, score-0.174]

82 The user was not allowed to replay any segment of the clip. [sent-135, score-0.179]

83 Our hypothesis for this study is that audio clips of the transformed paragraphs (audio style) are easier to comprehend, and hence, easier to transcribe than the original paragraphs (text style). [sent-136, score-0.843]

84 We use the edit distance between the transcripts and the text of each audio clip to measure the transcription accuracy. [sent-137, score-0.751]

85 We assume that the transcription accuracy is an indicator for the comprehension level, i. [sent-138, score-0.145]

86 the higher the accuracy of the transcription the higher the comprehension. [sent-140, score-0.082]

87 We used Amazon Mechanical Turk to run the user study. [sent-141, score-0.116]

88 We restricted the workers to those who have more than 95% approval rate for all their previous work and who live in the United States (since we are targeting English speakers). [sent-143, score-0.078]

89 We also assigned the same audio clip to 10 different workers and took the average edit distance of the 10 transcripts for each audio clip. [sent-144, score-1.103]
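The accuracy measure described above, edit distance averaged over the workers assigned to a clip, can be sketched as follows. The text does not say whether the distance is computed over characters or words, nor how the text is normalized, so word-level Levenshtein distance with lowercasing is an assumption made here; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein distance via dynamic programming, keeping two rows.
    # Works on any pair of sequences (strings or lists of tokens).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def tokenize(text):
    # Assumption: lowercase, whitespace-delimited tokens.
    return text.lower().split()

def mean_edit_distance(clip_text, worker_transcripts):
    # Average distance between the clip's text and each worker's transcript.
    ref = tokenize(clip_text)
    distances = [edit_distance(ref, tokenize(t)) for t in worker_transcripts]
    return sum(distances) / len(distances)
```

A lower mean distance corresponds to higher transcription accuracy, which the study treats as an indicator of comprehension.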

90 The differences in the transcription accuracy for the original and the transformed paragraphs were statistically significant at the 0. [sent-145, score-0.347]

91 This result indicates that the change in style has an impact on the comprehension of the delivered information as measured by the accuracy of the transcriptions. [sent-150, score-0.644]

92 7 Conclusions and Future Work In this paper, we presented progress on ongoing research on writing style transformation from text style to audio style. [sent-151, score-1.69]

93 We surveyed the linguistics and journalism literatures for the differences in writing style for different media. [sent-153, score-0.871]

94 We formalized the problem by suggesting a number of linguistic features and showing their validity in distinguishing between the two styles of interest, text vs audio. [sent-154, score-0.275]

95 We also conducted a user study to show the impact of style transformation on comprehension and the overall user experience. [sent-155, score-1.013]

96 The next step in this work would be to build a style transformation system that uses the features discussed in this paper as the basis for determining when, where, and how to do the style transformation. [sent-156, score-1.181]

97 Gender, genre, and writing style in formal written texts. [sent-159, score-0.751]

98 Medium-transferability and presentation structure in speech and writing. [sent-187, score-0.04]

99 Inference and disputed authorship: The Federalist / [by] Frederick Mosteller [and] David L. [sent-221, score-0.069]

100 Play it again: a study of the factors underlying speech browsing behavior. [sent-254, score-0.089]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('style', 0.507), ('audio', 0.397), ('argamon', 0.166), ('writing', 0.16), ('styles', 0.143), ('transcripts', 0.141), ('radio', 0.135), ('transformation', 0.119), ('user', 0.116), ('differences', 0.11), ('listener', 0.108), ('esser', 0.107), ('listened', 0.107), ('clip', 0.094), ('biber', 0.094), ('clips', 0.094), ('journalism', 0.094), ('articles', 0.093), ('shlomo', 0.091), ('paragraphs', 0.09), ('written', 0.084), ('transcription', 0.082), ('listen', 0.081), ('labs', 0.074), ('jrgen', 0.071), ('paraphrase', 0.07), ('transcript', 0.069), ('authorship', 0.069), ('fang', 0.066), ('transformed', 0.065), ('attribution', 0.063), ('comprehension', 0.063), ('replay', 0.063), ('guyon', 0.063), ('npr', 0.063), ('synthesizer', 0.063), ('intel', 0.058), ('mosteller', 0.058), ('quotations', 0.058), ('transcribe', 0.058), ('clara', 0.054), ('quotation', 0.054), ('adverbs', 0.054), ('listeners', 0.051), ('halliday', 0.051), ('api', 0.051), ('voice', 0.05), ('driving', 0.049), ('television', 0.049), ('schler', 0.049), ('stamatatos', 0.049), ('study', 0.049), ('quirk', 0.049), ('spoken', 0.048), ('features', 0.048), ('uncommon', 0.047), ('formalized', 0.047), ('santa', 0.047), ('libsvm', 0.047), ('whittaker', 0.047), ('shinyama', 0.046), ('content', 0.045), ('madnani', 0.044), ('impact', 0.043), ('medium', 0.042), ('live', 0.041), ('chi', 0.041), ('speech', 0.04), ('skip', 0.04), ('write', 0.039), ('broadcast', 0.038), ('read', 0.037), ('linguistic', 0.037), ('workers', 0.037), ('henceforth', 0.037), ('news', 0.037), ('edit', 0.037), ('distributions', 0.037), ('paraphrases', 0.036), ('variations', 0.035), ('genres', 0.035), ('article', 0.034), ('stroudsburg', 0.034), ('newspaper', 0.033), ('experience', 0.033), ('suitable', 0.032), ('addressing', 0.032), ('really', 0.032), ('direct', 0.031), ('mechanical', 0.031), ('consumed', 0.031), ('anat', 0.031), ('researching', 0.031), ('navendu', 0.031), ('absorbing', 0.031), ('registers', 0.031), ('monograph', 0.031), ('opens', 0.031), 
('delivered', 0.031), ('addisonwesley', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

2 0.12252117 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

3 0.11545272 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

4 0.10698386 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

Author: David Chen ; William Dolan

Abstract: A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.

5 0.093352765 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

6 0.087808564 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

7 0.080650948 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web

8 0.076982334 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

9 0.069982201 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

10 0.06829682 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

11 0.065082699 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

12 0.062661611 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

13 0.062203042 109 acl-2011-Effective Measures of Domain Similarity for Parsing

14 0.056960635 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations

15 0.056831811 228 acl-2011-N-Best Rescoring Based on Pitch-accent Patterns

16 0.056775928 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

17 0.055661302 133 acl-2011-Extracting Social Power Relationships from Natural Language

18 0.055247508 177 acl-2011-Interactive Group Suggesting for Twitter

19 0.055164002 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

20 0.053927433 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.151), (1, 0.039), (2, -0.011), (3, 0.05), (4, -0.084), (5, 0.053), (6, 0.055), (7, 0.006), (8, -0.011), (9, -0.107), (10, -0.117), (11, 0.024), (12, -0.039), (13, 0.023), (14, 0.028), (15, 0.022), (16, -0.025), (17, -0.006), (18, 0.024), (19, -0.06), (20, 0.081), (21, -0.013), (22, -0.079), (23, 0.047), (24, -0.019), (25, -0.01), (26, 0.131), (27, -0.019), (28, -0.037), (29, -0.108), (30, -0.075), (31, 0.069), (32, -0.061), (33, 0.05), (34, 0.011), (35, -0.031), (36, -0.071), (37, 0.031), (38, -0.047), (39, 0.111), (40, 0.023), (41, 0.067), (42, 0.03), (43, 0.023), (44, 0.048), (45, -0.037), (46, -0.109), (47, -0.013), (48, 0.001), (49, -0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9433431 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

2 0.75013149 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

Author: Sara Rosenthal ; Kathleen McKeown

Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.

3 0.66727579 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics

Author: Steffen Hedegaard ; Jakob Grue Simonsen

Abstract: We investigate authorship attribution using classifiers based on frame semantics. The purpose is to discover whether adding semantic information to lexical and syntactic methods for authorship attribution will improve them, specifically to address the difficult problem of authorship attribution of translated texts. Our results suggest (i) that frame-based classifiers are usable for author attribution of both translated and untranslated texts; (ii) that frame-based classifiers generally perform worse than the baseline classifiers for untranslated texts, but (iii) perform as well as, or superior to the baseline classifiers on translated texts; (iv) that, contrary to current belief, naïve classifiers based on lexical markers may perform tolerably on translated texts if the combination of author and translator is present in the training set of a classifier.

4 0.61971921 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution

Author: Hugo Jair Escalante ; Thamar Solorio ; Manuel Montes-y-Gomez

Abstract: This paper proposes the use of local histograms (LH) over character n-grams for authorship attribution (AA). LHs are enriched histogram representations that preserve sequential information in documents; they have been successfully used for text categorization and document visualization using word histograms. In this work we explore the suitability of LHs over n-grams at the character-level for AA. We show that LHs are particularly helpful for AA, because they provide useful information for uncovering, to some extent, the writing style of authors. We report experimental results in AA data sets that confirm that LHs over character n-grams are more helpful for AA than the usual global histograms, yielding results far superior to state of the art approaches. We found that LHs are even more advantageous in challenging conditions, such as having imbalanced and small training sets. Our results motivate further research on the use of LHs for modeling the writing style of authors for related tasks, such as authorship verification and plagiarism detection.

5 0.59307981 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts

Author: Derya Ozkan ; Louis-Philippe Morency

Abstract: In many computational linguistic scenarios, training labels are subjective, making it necessary to acquire the opinions of multiple annotators/experts, which is referred to as "wisdom of crowds". In this paper, we propose a new approach for modeling wisdom of crowds based on the Latent Mixture of Discriminative Experts (LMDE) model that can automatically learn the prototypical patterns and hidden dynamics among different experts. Experiments show improvement over state-of-the-art approaches on the task of listener backchannel prediction in dyadic conversations.

6 0.58938462 133 acl-2011-Extracting Social Power Relationships from Natural Language

7 0.58055806 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications

8 0.54014802 194 acl-2011-Language Use: What can it tell us?

9 0.53368706 83 acl-2011-Contrasting Multi-Lingual Prosodic Cues to Predict Verbal Feedback for Rapport

10 0.53299731 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

11 0.52595717 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

12 0.5217194 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

13 0.51751524 286 acl-2011-Social Network Extraction from Texts: A Thesis Proposal

14 0.51678276 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

15 0.51625651 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

16 0.51029891 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

17 0.49329022 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

18 0.49069092 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

19 0.4848026 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

20 0.48429313 55 acl-2011-Automatically Predicting Peer-Review Helpfulness


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.029), (5, 0.053), (17, 0.047), (26, 0.023), (31, 0.017), (37, 0.069), (39, 0.034), (41, 0.059), (53, 0.027), (55, 0.03), (57, 0.251), (59, 0.024), (72, 0.036), (91, 0.039), (96, 0.164), (97, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.82784563 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

same-paper 2 0.82342964 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

3 0.81772131 243 acl-2011-Partial Parsing from Bitext Projections

Author: Prashanth Mannem ; Aswarth Dara

Abstract: Recent work has shown how a parallel corpus can be leveraged to build a syntactic parser for a target language by projecting automatic source parses onto the target sentences using word alignments. The projected target dependency parses are not always fully connected, which limits their usefulness for training traditional dependency parsers. In this paper, we present a greedy non-directional parsing algorithm which doesn’t need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them. Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi. It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi.

4 0.80314922 305 acl-2011-Topical Keyphrase Extraction from Twitter

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

5 0.79385632 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

Author: Manaal Faruqui ; Sebastian Pado

Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; (b) lexical cues can predict T/V almost at human level.

6 0.64289331 101 acl-2011-Disentangling Chat with Local Coherence Models

7 0.62986732 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations

8 0.62958056 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

9 0.6268118 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques

10 0.62490493 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

11 0.62438458 318 acl-2011-Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

12 0.62377381 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

13 0.6230216 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

14 0.6226846 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

15 0.62183535 235 acl-2011-Optimal and Syntactically-Informed Decoding for Monolingual Phrase-Based Alignment

16 0.62166059 133 acl-2011-Extracting Social Power Relationships from Natural Language

17 0.62135667 11 acl-2011-A Fast and Accurate Method for Approximate String Search

18 0.62103641 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

19 0.62059414 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules

20 0.62043804 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations