acl acl2012 acl2012-29 knowledge-graph by maker-knowledge-mining

29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

Source: pdf

Author: Karolina Owczarzak ; Peter A. Rankel ; Hoa Trang Dang ; John M. Conroy

Abstract: We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid scoring. We identify inconsistencies in the data and measure to what extent these inconsistencies affect the ranking of automatic summarization systems. Finally, we examine the stability of automatic metrics (ROUGE and CLASSY) with respect to the inconsistent assessments.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Assessing the Effect of Inconsistent Assessors on Summarization Evaluation Karolina Owczarzak National Institute of Standards and Technology Gaithersburg, MD 20899 karol ina . [sent-1, score-0.021]

2 Hoa Trang Dang National Institute of Standards and Technology Gaithersburg, MD 20899 hoa . [sent-3, score-0.034]

3 Abstract We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. [sent-5, score-0.908]

4 Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid scoring. [sent-6, score-0.536]

5 We identify inconsistencies in the data and measure to what extent these inconsistencies affect the ranking of automatic summarization systems. [sent-7, score-0.277]

6 Finally, we examine the stability of automatic metrics (ROUGE and CLASSY) with respect to the inconsistent assessments. [sent-8, score-0.285]

7 1 Introduction Automatic summarization of documents is a research area that unfortunately depends on human feedback. [sent-9, score-0.184]

8 Although attempts have been made at automating the evaluation of summaries, none is so good as to remove the need for human assessors. [sent-10, score-0.058]

9 We investigate two ways of measuring evaluation consistency in order to see what effect it has on summarization evaluation and training of automatic evaluation metrics. [sent-12, score-0.418]

10 2 Assessor consistency In the Text Analysis Conference (TAC) Summarization track, participants are allowed to submit more than one run (usually two), and this option is often used to test different settings or versions of the same summarization system. [sent-13, score-0.357]

11 In cases when the system versions are not too divergent, they sometimes 359 Peter A. [sent-14, score-0.053]

12 Summaries are randomized within each topic before they are evaluated, so the identical copies are usually interspersed with 40-50 other summaries for the same topic and are not evaluated in a row. [sent-20, score-0.441]

13 Given that each topic is evaluated by a single assessor, it then becomes possible to check assessor consistency, i. [sent-21, score-0.513]

14 , whether the assessor judged the two identical summaries in the same way. [sent-23, score-0.749]

15 For each summary, assessors conduct content evaluation according to the Pyramid framework (Nenkova and Passonneau, 2004) and assign it Responsiveness and Readability scores1, so assessor consistency can be checked in these three areas separately. [sent-24, score-1.086]

16 We found between 230 (in 2009) and 430 (in 2011) pairs of identical summaries for the 2008-2011 data (given on average 45 topics, 50 runs, and two summarization conditions: main and update), giving in effect anywhere from around 30 to 60 instances per assessor per year. [sent-25, score-0.885]

17 Using Krippendorff’s alpha (Freelon, 2004), we calculated assessor consistency within each year, as well as total consistency over all years’ data (for those assessors who worked multiple years). [sent-26, score-1.311]
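The consistency check described above can be sketched in a few lines. This is a minimal illustration, assuming each pair of identical summaries yields two numeric scores from the same assessor and that the interval distance metric is used; the function name `krippendorff_alpha_interval` and the input layout are hypothetical and not taken from the paper or from any particular alpha calculator.

```python
from itertools import permutations


def krippendorff_alpha_interval(pairs):
    """Krippendorff's alpha (interval metric) for units judged exactly twice.

    `pairs` is a list of (first_score, second_score) tuples, one tuple per
    pair of identical summaries scored by the same assessor (assumed layout).
    """
    values = [v for pair in pairs for v in pair]
    n = len(values)
    if n < 2:
        raise ValueError("need at least one pair of scores")
    # Observed disagreement: every unit holds two values, so both ordered
    # pairs (a, b) and (b, a) are counted, each with weight 1 / (2 - 1).
    d_o = sum(2 * (a - b) ** 2 for a, b in pairs) / n
    # Expected disagreement: squared differences over all ordered pairs of
    # values, regardless of which unit they came from.
    d_e = sum((x - y) ** 2 for x, y in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - d_o / d_e
```

An assessor who always gives both copies the same score gets alpha = 1, while systematic disagreement pushes alpha below zero.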

18 Table 1 shows rankings of assessors in 2011, based on their Readability, Responsiveness, and Pyramid judgments for identical summary pairs (around 60 pairs per assessor). [sent-27, score-0.606]

19 Interestingly, consistency values for Readability are lower overall than those for Responsiveness and Pyramid, even for the most consistent assessors. [sent-28, score-0.215]

20 by assigning a numerical score according to detailed guidelines, this sug1http://www. [sent-31, score-0.068]

21 Table 1: [...] Readability and Responsiveness scores and in Pyramid evaluation, as represented by Krippendorff’s alpha for interval values, on 2011 data. [sent-41, score-0.103]

22 gests that Readability as a quality of text is inherently more vague and difficult to pinpoint. [sent-42, score-0.019]

23 On the other hand, Pyramid consistency values are generally the highest, which can be explained by how the Pyramid evaluation is designed. [sent-43, score-0.21]

24 Even if the assessor is inconsistent in selecting Summary Content Units (SCUs) across different summaries, as long as the total summary weight is similar, the summary’s final score will be similar, too. [sent-44, score-0.756]

25 2 Therefore, it would be better to look at whether assessors tend to find the same SCUs (information “nuggets”) in different summaries on the same topic, and whether they annotate them consistently. [sent-45, score-0.671]

26 This can be done using the “autoannotate” function of the Pyramid process, where all SCU contributors (selected text strings) from already annotated summaries are matched against the text of a candidate (un-annotated) summary. [sent-46, score-0.288]

27 The autoannotate function works fairly well for matching between extractive summaries, which tend to repeat verbatim whole sentences from source documents. [sent-47, score-0.091]

28 For each summary in 2008-2011 data, we autoannotated it using all remaining manually-annotated summaries from the same topic, and then we compared the resulting “autoPyramid” score with the score from the original manual annotation for that summary. [sent-48, score-0.504]

29 Figure 1: Annotator consistency in selecting SCUs in Pyramid evaluation, as represented by the difference between manual Pyramid and automatic Pyramid scores (mP-aP), on 2011 data. [sent-50, score-0.364]

30 Either way, if we then average out score differences for all summaries for a given topic, it will give us a good picture of the annotation consistency in this particular topic. [sent-53, score-0.513]

31 Higher average autoPyramid scores suggest that the assessor was missing content, or otherwise making frequent random mistakes in assigning content. [sent-54, score-0.525]

32 Figure 1 shows the macro-average difference between manual Pyramid scores and autoPyramid scores for each assessor in 2011. [sent-55, score-0.591]
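The mP-aP comparison can be sketched as follows. This is only one reading of "macro-average", assumed here to mean averaging score differences within each topic first and then averaging the topic means for each assessor; the record layout and the name `macro_average_difference` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def macro_average_difference(records):
    """Macro-average (manual - auto) Pyramid score difference per assessor.

    `records` is an iterable of (assessor, topic, manual_score, auto_score)
    tuples, one per summary (assumed layout).
    """
    by_topic = defaultdict(list)
    for assessor, topic, manual, auto in records:
        by_topic[(assessor, topic)].append(manual - auto)
    # First average within each topic, then across an assessor's topics.
    by_assessor = defaultdict(list)
    for (assessor, _topic), diffs in by_topic.items():
        by_assessor[assessor].append(mean(diffs))
    return {a: mean(topic_means) for a, topic_means in by_assessor.items()}
```

Under this reading, a negative value for an assessor means the autoPyramid scores were higher on average than that assessor's manual scores.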

33 This can be explained by the fact that the Pyramid evaluation and assigning Readability scores are different processes and might require different skills and types of focus. [sent-57, score-0.104]

34 3 Impact on evaluation Since human assessment is used to rank participating summarizers in the TAC Summarization track, (footnote 3: Due to space constraints, we report figures for only 2011, but the results for other years are similar.) [sent-58, score-0.165]

35 Table 2: [...] ranking and the ranking after excluding topics by one or two worst assessors in each category. [sent-63, score-0.852]

36 we should examine the potential impact of inconsistent assessors on the overall evaluation. [sent-64, score-0.612]

37 Because the final summarizer score is the average over many topics, and the topics are fairly evenly distributed among assessors for annotation, excluding noisy topics/assessors has very little impact on summarizer ranking. [sent-65, score-0.95]

38 As an example, consider the 2011 assessor consistency data in Table 1 and Figure 1. [sent-66, score-0.631]

39 If we exclude topics by the worst performing assessor from each of these categories, recalculate the summarizer rankings, and then check the correlation between the original and newly created rankings, we obtain results in Table 2. [sent-67, score-0.876]
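The exclusion experiment can be sketched as below. The choice of Spearman rank correlation is an assumption made for illustration (the extracted text does not say which coefficient was reported), and the data layout and all function names are hypothetical.

```python
from statistics import mean


def summarizer_means(scores, excluded_topics=frozenset()):
    """scores: {summarizer: {topic: score}} -> {summarizer: mean score}."""
    return {
        system: mean(s for topic, s in by_topic.items()
                     if topic not in excluded_topics)
        for system, by_topic in scores.items()
    }


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranking_stability(scores, excluded_topics):
    """Spearman correlation between full and reduced summarizer rankings."""
    systems = sorted(scores)
    full = summarizer_means(scores)
    reduced = summarizer_means(scores, excluded_topics)
    return pearson(average_ranks([full[s] for s in systems]),
                   average_ranks([reduced[s] for s in systems]))
```

Here `excluded_topics` would be the set of topics judged by the least consistent assessor in the category of interest; a value close to 1 indicates that dropping those topics barely changes the ranking.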

40 Although the impact on evaluating automatic summarizers is small, it could be argued that excluding topics with inconsistent human scoring will have an impact on the performance of automatic evaluation metrics, which might be unfairly penalized by their inability to emulate random human mistakes. [sent-68, score-0.802]

41 Table 3 shows ROUGE-2 (Lin, 2004), one of the state-of-the-art automatic metrics used in TAC, and its correlations with human metrics, before and after exclusion of noisy topics from 2011 data. [sent-69, score-0.38]

42 The results are fairly inconclusive: it seems that in most cases, removing topics does more harm than good, suggesting that the signal-to-noise ratio is still tipped in favor of signal. [sent-70, score-0.242]

43 The only exception is Readability, where ROUGE records a slight increase in correlation; this is unsurprising, given that consistency values for Readability are the lowest of all categories, and perhaps here removing noise has more impact. [sent-71, score-0.25]

44 In the case of Pyramid, there is a small gain when we exclude the single worst assessor, but excluding two assessors results in a decreased correlation, perhaps because we remove too much valid information at the same time. [sent-72, score-0.639]

45 A different picture emerges when we examine how well ROUGE-2 can predict human scores on the summary level. [sent-73, score-0.284]

46 Table 3: Correlation between the summarizer rankings according to ROUGE-2 and human metrics, before and after excluding topics by one or two worst assessors in that category. [sent-78, score-0.903]

47 Table 4: Correlation between ROUGE-2 and human metrics on a summary level before and after excluding topics by one or two worst assessors in that category. [sent-83, score-0.962]

48 maries annotated by each particular assessor and calculated the correlation between ROUGE-2 and this assessor’s manual scores for individual summaries. [sent-84, score-0.679]

49 Then we calculated the mean correlation over all assessors. [sent-85, score-0.133]
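The per-assessor procedure can be sketched as follows. Pearson correlation is assumed here only for illustration (the extracted text does not name the coefficient), and the record layout and function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_per_assessor_correlation(records, excluded_assessors=frozenset()):
    """Mean over assessors of the summary-level correlation.

    `records` is an iterable of (assessor, automatic_score, manual_score)
    tuples, one per summary (assumed layout).
    """
    grouped = defaultdict(lambda: ([], []))
    for assessor, automatic, manual in records:
        if assessor not in excluded_assessors:
            grouped[assessor][0].append(automatic)
            grouped[assessor][1].append(manual)
    # One correlation per assessor, then an unweighted mean across assessors.
    return mean(pearson(auto, manual) for auto, manual in grouped.values())
```

Passing the least consistent assessors as `excluded_assessors` mirrors the before/after comparison reported in Table 4.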

50 Unsurprisingly, inconsistent assessors tend to correlate poorly with automatic (and therefore always consistent) metrics, so excluding one or two worst assessors from each category increases ROUGE’s average per-assessor summary-level correlation, as can be seen in Table 4. [sent-86, score-1.182]

51 The only exception here is when we exclude assessors based on their autoPyramid performance: again, because inconsistent SCU selection doesn’t necessarily translate into inconsistent final Pyramid scores, excluding those assessors doesn’t do much for ROUGE-2. [sent-87, score-1.26]

52 4 Impact on training Another area where excluding noisy topics might be useful is in training new automatic evaluation metrics. [sent-88, score-0.389]

53 To examine this issue we turned to CLASSY (Rankel et al. [sent-89, score-0.033]

54 , 2011), an automatic evaluation metric submitted to TAC each year from 2009-2011. [sent-90, score-0.089]

55 CLASSY consists of four different versions, each aimed at predicting a particular human evaluation score. [sent-91, score-0.058]

56 Each version of CLASSY is based on one of three regression methods: robust regression, nonnegative least squares, or canonical correlation. [sent-92, score-0.032]

57 The regressions are calculated based on a collection of linguistic and content features, derived from the summary to be scored. [sent-93, score-0.198]

58 CLASSY requires two years of marked data to score summaries in a new year. [sent-94, score-0.331]

59 In order to predict the human metrics in 2011, for example, CLASSY uses the human ratings from 2009 and 2010. [sent-95, score-0.173]

60 It first considers each subset of the features in turn, and using each of the regression methods, fits a model to the 2009 data. [sent-96, score-0.032]

61 The subset/method combination that best predicts the 2010 scores is then used to predict scores for 2011. [sent-97, score-0.122]
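The selection loop described above can be sketched generically. This is not CLASSY's implementation: the single fitting routine `fit_scaled_sum` is a toy stand-in for the three regression methods named earlier, mean squared error is an assumed selection criterion, and all names and data layouts are hypothetical.

```python
from itertools import combinations


def mean_squared_error(predicted, actual):
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def fit_scaled_sum(rows, targets):
    """Toy method: least-squares scale factor applied to the feature sum."""
    sums = [sum(row) for row in rows]
    scale = sum(s * y for s, y in zip(sums, targets)) / sum(s * s for s in sums)
    return lambda row: scale * sum(row)


def project(rows, columns):
    """Keep only the selected feature columns of every row."""
    return [[row[c] for c in columns] for row in rows]


def select_model(train, valid, n_features, methods):
    """Try every feature subset with every method; keep the best on `valid`.

    `train` and `valid` are (rows, targets) pairs, e.g. the first and second
    training year. Returns (error, columns, method_name, model).
    """
    train_rows, train_targets = train
    valid_rows, valid_targets = valid
    best = None
    for size in range(1, n_features + 1):
        for columns in combinations(range(n_features), size):
            for name, fit in methods.items():
                model = fit(project(train_rows, columns), train_targets)
                predicted = [model(row) for row in project(valid_rows, columns)]
                error = mean_squared_error(predicted, valid_targets)
                if best is None or error < best[0]:
                    best = (error, columns, name, model)
    return best
```

The returned model would then be applied to the third year's summaries, projected onto the same feature columns. Exhaustive subset search grows exponentially with the number of features, so the sketch is practical only for small feature sets.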

62 First, we trained all four CLASSY versions on all available 2009-2010 topics, and then trained again excluding topics by the most inconsistent assessor(s). [sent-99, score-0.451]

63 A different subset of topics was excluded depending on whether this particular version of CLASSY was aiming to predict Responsiveness, Readability, or the Pyramid score. [sent-100, score-0.245]

64 Then we tested CLASSY’s performance on 2011 data, ranking either automatic summarizers (NoModels case) or human and automatic summarizers together (AllPeers case), separately for main and update summaries, and calculated its correlation with the metrics it was aiming to predict. [sent-101, score-0.577]

65 For Pyramid, (a) indicates that excluded topics were selected based on Krippendorff’s alpha, and (b) indicates that topics were excluded based on their mean difference between manual and automatic Pyramid scores. [sent-103, score-0.479]

66 The results are encouraging; it seems that removing noisy topics from training data does improve the correlations with manual metrics in most cases. [sent-104, score-0.395]

67 The greatest increase takes place in CLASSY’s correlations with Responsiveness for main summaries in AllPeers case, and for correlations with Readability. [sent-105, score-0.387]

68 While none of the changes are large enough to achieve statistical significance, the pattern of improvement is fairly consistent. [sent-106, score-0.042]

69 5 Conclusions We investigated the consistency of human assessors in the area of summarization evaluation. [sent-107, score-0.773]

70 We considered two ways of measuring assessor consistency, depending on the metric, and studied the impact of consistent scoring on ranking summarization systems and on the performance of automatic evaluation systems. [sent-108, score-0.775]

71 We found that summarization system ranking, based on scores for multiple topics, was surprisingly stable and didn’t change significantly [...]. (Table caption fragment: [...] metrics on 2011 data (main and update summaries), before and after excluding most inconsistent topic from 2009-2010 training data for CLASSY.) [sent-109, score-0.491]

72 However, on a summary level, removing topics scored by the most inconsistent assessors helped ROUGE-2 increase its correlation with human metrics. [sent-111, score-0.909]

73 In the area of training automatic metrics, we found some encouraging results; removing noise from the training data allowed most CLASSY versions to improve their correlations with the manual metrics that they were aiming to model. [sent-112, score-0.421]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('assessor', 0.444), ('assessors', 0.402), ('pyramid', 0.375), ('summaries', 0.269), ('classy', 0.213), ('consistency', 0.187), ('scus', 0.169), ('readability', 0.164), ('responsiveness', 0.155), ('autopyramid', 0.146), ('topics', 0.143), ('inconsistent', 0.136), ('excluding', 0.119), ('summarization', 0.117), ('summary', 0.114), ('correlation', 0.079), ('worst', 0.078), ('summarizers', 0.077), ('summarizer', 0.072), ('rouge', 0.072), ('metrics', 0.071), ('krippendorff', 0.063), ('tac', 0.06), ('correlations', 0.059), ('alpha', 0.058), ('conroy', 0.058), ('rankel', 0.058), ('manual', 0.057), ('ranking', 0.055), ('rankings', 0.054), ('versions', 0.053), ('nenkova', 0.051), ('topic', 0.049), ('allpeers', 0.049), ('autoannotate', 0.049), ('automatic', 0.045), ('scores', 0.045), ('maryland', 0.043), ('duc', 0.042), ('scu', 0.042), ('fairly', 0.042), ('impact', 0.041), ('exclude', 0.04), ('removing', 0.038), ('passonneau', 0.036), ('assigning', 0.036), ('identical', 0.036), ('excluded', 0.035), ('human', 0.035), ('strings', 0.035), ('aiming', 0.035), ('gaithersburg', 0.034), ('ani', 0.034), ('hoa', 0.034), ('calculated', 0.033), ('examine', 0.033), ('regression', 0.032), ('score', 0.032), ('area', 0.032), ('predict', 0.032), ('encouraging', 0.031), ('inconsistencies', 0.03), ('years', 0.03), ('selecting', 0.03), ('content', 0.03), ('consistent', 0.028), ('standards', 0.027), ('doesn', 0.027), ('noisy', 0.027), ('md', 0.026), ('exception', 0.025), ('picture', 0.025), ('update', 0.025), ('rebecca', 0.025), ('evaluation', 0.023), ('annotator', 0.023), ('scoring', 0.022), ('mean', 0.021), ('cantly', 0.021), ('deen', 0.021), ('divergent', 0.021), ('gov', 0.021), ('ina', 0.021), ('inability', 0.021), ('inconclusive', 0.021), ('maries', 0.021), ('math', 0.021), ('regressions', 0.021), ('year', 0.021), ('check', 0.02), ('track', 0.02), ('interspersed', 0.019), ('unfairly', 0.019), ('anywhere', 0.019), ('contributors', 0.019), ('gests', 0.019), ('harm', 0.019), ('mirrors', 0.019), ('missed', 0.019), ('randomized', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

Author: Karolina Owczarzak ; Peter A. Rankel ; Hoa Trang Dang ; John M. Conroy

Abstract: We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid scoring. We identify inconsistencies in the data and measure to what extent these inconsistencies affect the ranking of automatic summarization systems. Finally, we examine the stability of automatic metrics (ROUGE and CLASSY) with respect to the inconsistent assessments.

2 0.2968629 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan

Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-of-the-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.

3 0.1913901 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

4 0.062225368 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation

Author: Vladimir Eidelman ; Jordan Boyd-Graber ; Philip Resnik

Abstract: We propose an approach that biases machine translation systems toward relevant translations based on topic-specific contexts, where topics are induced in an unsupervised way using topic models; this can be thought of as inducing subcorpora for adaptation without any human annotation. We use these topic distributions to compute topic-dependent lexical weighting probabilities and directly incorporate them into our translation model as features. Conditioning lexical probabilities on the topic biases translations toward topic-relevant output, resulting in significant improvements of up to 1 BLEU and 3 TER on Chinese to English translation over a strong baseline.

5 0.061991192 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

Author: Viet-An Nguyen ; Jordan Boyd-Graber ; Philip Resnik

Abstract: One of the key tasks for analyzing conversational data is segmenting it into coherent topic segments. However, most models of topic segmentation ignore the social aspect of conversations, focusing only on the words used. We introduce a hierarchical Bayesian nonparametric model, Speaker Identity for Topic Segmentation (SITS), that discovers (1) the topics used in a conversation, (2) how these topics are shared across conversations, (3) when these topics shift, and (4) a person-specific tendency to introduce new topics. We evaluate against current unsupervised segmentation models to show that including person-specific information improves segmentation performance on meeting corpora and on political debates. Moreover, we provide evidence that SITS captures an individual’s tendency to introduce new topics in political contexts, via analysis of the 2008 US presidential debates and the television program Crossfire.

Author affiliations: Jordan Boyd-Graber, iSchool and UMIACS, University of Maryland, College Park, MD (jbg@umiacs.umd.edu); Philip Resnik, Department of Linguistics and UMIACS, University of Maryland, College Park, MD (resnik@umd.edu).

1 Topic Segmentation as a Social Process

Conversation, interactive discussion between two or more people, is one of the most essential and common forms of communication. Whether in an informal situation or in more formal settings such as a political debate or business meeting, a conversation is often not about just one thing: topics evolve and are replaced as the conversation unfolds. Discovering this hidden structure in conversations is a key problem for conversational assistants (Tur et al., 2010) and tools that summarize (Murray et al., 2005) and display (Ehlen et al., 2007) conversational data. Topic segmentation also can illuminate individuals’ agendas (Boydstun et al., 2011), patterns of agreement and disagreement (Hawes et al., 2009; Abbott et al., 2011), and relationships among conversational participants (Ireland et al., 2011).
One of the most natural ways to capture conversational structure is topic segmentation (Reynar, 1998; Purver, 2011). Topic segmentation approaches range from simple heuristic methods based on lexical similarity (Morris and Hirst, 1991; Hearst, 1997) to more intricate generative models and supervised methods (Georgescul et al., 2006; Purver et al., 2006; Gruber et al., 2007; Eisenstein and Barzilay, 2008), which have been shown to outperform the established heuristics. However, previous computational work on conversational structure, particularly in topic discovery and topic segmentation, focuses primarily on content, ignoring the speakers. We argue that, because conversation is a social process, we can understand conversational phenomena better by explicitly modeling behaviors of conversational participants. In Section 2, we incorporate participant identity in a new model we call Speaker Identity for Topic Segmentation (SITS), which discovers topical structure in conversation while jointly incorporating a participant-level social component. Specifically, we explicitly model an individual’s tendency to introduce a topic. After outlining inference in Section 3 and introducing data in Section 4, we use SITS to improve state-of-the-art topic segmentation and topic identification models in Section 5. In addition, in Section 6, we also show that the per-speaker model is able to discover individuals who shape and influence the course of a conversation. Finally, we discuss related work and conclude the paper in Section 7.

(Page footer: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 78–87, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics)

2 Modeling Multiparty Discussions

Data Properties We are interested in turn-taking, multiparty discussion. This is a broad category, including political debates, business meetings, and online chats. More formally, such datasets contain C conversations.
A conversation c has $T_c$ turns, each of which is a maximal uninterrupted utterance by one speaker.1 In each turn $t \in [1, T_c]$, a speaker $a_{c,t}$ utters N words $\{w_{c,t,n}\}$. Each word is from a vocabulary of size V, and there are M distinct speakers.

Modeling Approaches The key insight of topic segmentation is that segments evince lexical cohesion (Galley et al., 2003; Olney and Cai, 2005). Words within a segment will look more like their neighbors than other words. This insight has been used to tune supervised methods (Hsueh et al., 2006) and inspire unsupervised models of lexical cohesion using bags of words (Purver et al., 2006) and language models (Eisenstein and Barzilay, 2008). We too take the unsupervised statistical approach. It requires few resources and is applicable in many domains without extensive training. Like previous approaches, we consider each turn to be a bag of words generated from an admixture of topics. Topics, after the topic modeling literature (Blei and Lafferty, 2009), are multinomial distributions over terms. These topics are part of a generative model posited to have produced a corpus. However, topic models alone cannot model the dynamics of a conversation. Topic models typically do not model the temporal dynamics of individual documents, and those that do (Wang et al., 2008; Gerrish and Blei, 2010) are designed for larger documents and are not applicable here because they assume that most topics appear in every time slice. Instead, we endow each turn with a binary latent variable $l_{c,t}$, called the topic shift. This latent variable signifies whether the speaker changed the topic of the conversation. To capture the topic-controlling behavior of the speakers across different conversations, we further associate each speaker m with a latent topic shift tendency, $\pi_m$. Informally, this variable is intended to capture the propensity of a speaker to effect a topic shift.
Formally, it represents the probability that the speaker m will change the topic (distribution) of a conversation. We take a Bayesian nonparametric approach (Müller and Quintana, 2004). Unlike parametric models, which a priori fix the number of topics, nonparametric models use a flexible number of topics to better represent data. Nonparametric distributions such as the Dirichlet process (Ferguson, 1973) share statistical strength among conversations using a hierarchical model, such as the hierarchical Dirichlet process (HDP) (Teh et al., 2006).

Footnote 1: Note the distinction with phonetic utterances, which by definition are bounded by silence.

2.1 Generative Process

In this section, we develop SITS, a generative model of multiparty discourse that jointly discovers topics and speaker-specific topic shifts from an unannotated corpus (Figure 1a). As in the hierarchical Dirichlet process (Teh et al., 2006), we allow an unbounded number of topics to be shared among the turns of the corpus. Topics are drawn from a base distribution H over multinomial distributions over the vocabulary, a finite Dirichlet with symmetric prior λ. Unlike the HDP, where every document (here, every turn) draws a new multinomial distribution from a Dirichlet process, the social and temporal dynamics of a conversation, as specified by the binary topic shift indicator $l_{c,t}$, determine when new draws happen. The full generative process is as follows:

1. For speaker $m \in [1, M]$, draw speaker shift probability $\pi_m \sim \mathrm{Beta}(\gamma)$
2. Draw global probability measure $G_0 \sim \mathrm{DP}(\alpha, H)$
3. For each conversation $c \in [1, C]$
   (a) Draw conversation distribution $G_c \sim \mathrm{DP}(\alpha_0, G_0)$
   (b) For each turn $t \in [1, T_c]$ with speaker $a_{c,t}$
       i. If $t = 1$, set the topic shift $l_{c,t} = 1$. Otherwise, draw $l_{c,t} \sim \mathrm{Bernoulli}(\pi_{a_{c,t}})$.
       ii. If $l_{c,t} = 1$, draw $G_{c,t} \sim \mathrm{DP}(\alpha_c, G_c)$. Otherwise, set $G_{c,t} \equiv G_{c,t-1}$.
       iii. For each word index $n \in [1, N_{c,t}]$
            • Draw $\psi_{c,t,n} \sim G_{c,t}$
            • Draw $w_{c,t,n} \sim \mathrm{Multinomial}(\psi_{c,t,n})$

The hierarchy of Dirichlet processes allows statistical strength to be shared across contexts; within a conversation and across conversations. The per-speaker topic shift tendency $\pi_m$ allows speaker identity to influence the evolution of topics. To make notation concrete and aligned with the topic segmentation, we introduce notation for segments in a conversation. A segment s of conversation c is a sequence of turns $[\tau, \tau']$ such that $l_{c,\tau} = l_{c,\tau'+1} = 1$ and $l_{c,t} = 0, \forall t \in (\tau, \tau']$. When $l_{c,t} = 0$, $G_{c,t}$ is the same as $G_{c,t-1}$ and all topics (i.e. multinomial distributions over words) $\{\psi_{c,t,n}\}$ that generate words in turn t and the topics $\{\psi_{c,t-1,n}\}$ that generate words in turn $t - 1$ come from the same distribution. Thus all topics used in a segment s are drawn from a single distribution, $G_{c,s}$,

$$G_{c,s} \mid l_{c,1}, l_{c,2}, \cdots, l_{c,T_c}, \alpha_c, G_c \sim \mathrm{DP}(\alpha_c, G_c) \qquad (1)$$

For notational convenience, $S_c$ denotes the number of segments in conversation c, and $s_t$ denotes the segment index of turn t. We emphasize that all segment-related notations are derived from the posterior over the topic shifts l and not part of the model itself.

Figure 1: Graphical model representations of our proposed models: (a) the nonparametric version; (b) the parametric version. Nodes represent random variables (shaded ones are observed), lines are probabilistic dependencies. Plates represent repetition. The innermost plates are turns, grouped in conversations.

Parametric Version SITS is a generalization of a parametric model (Figure 1b) where each turn has a multinomial distribution over K topics. In the parametric case, the number of topics K is fixed. Each topic, as before, is a multinomial distribution $\phi_1 \ldots \phi_K$. In the parametric case, each turn t in conversation c has an explicit multinomial distribution over K topics $\theta_{c,t}$, identical for turns within a segment.
A new topic distribution θ is drawn from a Dirichlet distribution parameterized by α when the topic shift indicator l is 1. The parametric version does not share strength within or across conversations, unlike SITS. When applied on a single conversation without speaker identity (all speakers are identical) it is equivalent to (Purver et al., 2006). In our experiments (Section 5), we compare against both.

3 Inference

To find the latent variables that best explain observed data, we use Gibbs sampling, a widely used Markov chain Monte Carlo inference technique (Neal, 2000; Resnik and Hardisty, 2010). The state space is latent variables for topic indices assigned to all tokens $\mathbf{z} = \{z_{c,t,n}\}$ and topic shifts assigned to turns $\mathbf{l} = \{l_{c,t}\}$. We marginalize over all other latent variables. Here, we only present the conditional sampling equations; for more details, see our supplement.2

3.1 Sampling Topic Assignments

To sample $z_{c,t,n}$, the index of the shared topic assigned to token n of turn t in conversation c, we need to sample the path assigning each word token to a segment-specific topic, each segment-specific topic to a conversational topic and each conversational topic to a shared topic. For efficiency, we make use of the minimal path assumption (Wallach, 2008) to generate these assignments.3 Under the minimal path assumption, an observation is assumed to have been generated by using a new distribution if and only if there is no existing distribution with the same value.

Footnote 2: ∼vietan/topicshift/appendix.pdf

Footnote 3: We also investigated using the maximal assumption and fully sampling assignments. We found the minimal path assumption worked as well as explicitly sampling seating assignments and that the maximal path assumption worked less well.
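Because the sampler's counts are kept per segment, it may help to see how segment indices follow from the topic shift indicators of Section 2.1. The sketch below is illustrative only: it draws shift indicators from fixed per-speaker probabilities (standing in for the shift tendencies) and derives 0-based segment indices; the function names and data layout are hypothetical and not part of the SITS implementation.

```python
import random


def sample_topic_shifts(speakers, shift_prob, rng):
    """Draw one topic shift indicator per turn of a single conversation.

    `speakers` lists the speaker of each turn; `shift_prob` maps a speaker
    to an assumed shift probability. The first turn always starts a segment.
    """
    shifts = []
    for t, speaker in enumerate(speakers):
        if t == 0:
            shifts.append(1)
        else:
            shifts.append(1 if rng.random() < shift_prob[speaker] else 0)
    return shifts


def segment_indices(shifts):
    """Map each turn to the 0-based index of the segment it belongs to."""
    if not shifts or shifts[0] != 1:
        raise ValueError("the first turn must carry a topic shift")
    indices, current = [], -1
    for shift in shifts:
        current += shift  # a shift of 1 opens a new segment
        indices.append(current)
    return indices


if __name__ == "__main__":
    rng = random.Random(0)
    speakers = ["moderator", "candidate", "candidate", "moderator"]
    shifts = sample_topic_shifts(speakers, {"moderator": 0.8, "candidate": 0.2}, rng)
    print(shifts, segment_indices(shifts))
```

Flipping a single indicator from 0 to 1 splits a segment in two, and flipping it back merges the two halves, which is the merge/split view used when sampling topic shifts.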
We use $N_{c,s,k}$ to denote the number of tokens in segment s in conversation c assigned topic k; $N_{c,k}$ denotes the total number of segment-specific topics in conversation c assigned topic k and $N_k$ denotes the number of conversational topics assigned topic k. $TW_{k,w}$ denotes the number of times the shared topic k is assigned to word w in the vocabulary. Marginal counts are represented with · and ∗ represents all hyperparameters. The conditional distribution for $z_{c,t,n}$ is

$$
P(z_{c,t,n} = k \mid w_{c,t,n} = w, z_{-c,t,n}, w_{-c,t,n}, l, \ast) \propto
\frac{N^{-c,t,n}_{c,s_t,k} + \alpha_c \, \dfrac{N^{-c,t,n}_{c,k} + \alpha_0 \, \dfrac{N^{-c,t,n}_{k} + \frac{\alpha}{K}}{N^{-c,t,n}_{\cdot} + \alpha}}{N^{-c,t,n}_{c,\cdot} + \alpha_0}}{N^{-c,t,n}_{c,s_t,\cdot} + \alpha_c}
\times
\begin{cases}
\dfrac{TW^{-c,t,n}_{k,w} + \lambda}{TW^{-c,t,n}_{k,\cdot} + V\lambda} & \text{if } k \text{ exists} \\[2ex]
\dfrac{1}{V} & \text{if } k \text{ is new}
\end{cases}
\tag{2}
$$

Here V is the size of the vocabulary, K is the current number of shared topics and the superscript −c,t,n denotes counts without considering $w_{c,t,n}$. In Equation 2, the first factor is proportional to the probability of sampling a path according to the minimal path assumption; the second factor is proportional to the likelihood of observing w given the sampled topic. Since an uninformed prior is used, when a new topic is sampled, all tokens are equiprobable.

3.2 Sampling Topic Shifts

Sampling the topic shift variable $l_{c,t}$ requires us to consider merging or splitting segments. We use $k_{c,t}$ to denote the shared topic indices of all tokens in turn t of conversation c; $S_{a_{c,t},x}$ to denote the number of times speaker $a_{c,t}$ is assigned the topic shift with value x ∈ {0, 1}; $J^{x}_{c,s}$ to denote the number of topics in segment s of conversation c if $l_{c,t} = x$ and $N^{x}_{c,s,j}$ to denote the number of tokens assigned to the segment-specific topic j when $l_{c,t} = x$.[4] Again, the superscript −c,t is used to denote exclusion of turn t of conversation c in the corresponding counts. Recall that the topic shift is a binary variable. We use 0 to represent the case that the topic distribution is identical to the previous turn. We sample this assignment

$$
P(l_{c,t} = 0 \mid l_{-c,t}, w, k, a, \ast) \propto
\frac{S^{-c,t}_{a_{c,t},0} + \gamma}{S^{-c,t}_{a_{c,t},\cdot} + 2\gamma}
\times
\frac{\alpha_c^{J^{0}_{c,s_t}} \prod_{j=1}^{J^{0}_{c,s_t}} \left(N^{0}_{c,s_t,j} - 1\right)!}{\prod_{x=1}^{N^{0}_{c,s_t,\cdot}} \left(x - 1 + \alpha_c\right)}
\tag{3}
$$

[4] Deterministically knowing the path assignments is the primary efficiency motivation for using the minimal path assumption. The alternative is to explicitly sample the path assignments, which is more complicated (for both notation and computation). This option is spelled in full detail in the supplementary material.

In Equation 3, the first factor is proportional to the probability of assigning a topic shift of value 0 to speaker $a_{c,t}$ and the second factor is proportional to the joint probability of all topics in segment $s_t$ of conversation c when $l_{c,t} = 0$. The other alternative is for the topic shift to be 1, which represents the introduction of a new distribution over topics inside an existing segment. We sample this as

$$
P(l_{c,t} = 1 \mid l_{-c,t}, w, k, a, \ast) \propto
\frac{S^{-c,t}_{a_{c,t},1} + \gamma}{S^{-c,t}_{a_{c,t},\cdot} + 2\gamma}
\times
\left[
\frac{\alpha_c^{J^{1}_{c,(s_t-1)}} \prod_{j=1}^{J^{1}_{c,(s_t-1)}} \left(N^{1}_{c,(s_t-1),j} - 1\right)!}{\prod_{x=1}^{N^{1}_{c,(s_t-1),\cdot}} \left(x - 1 + \alpha_c\right)}
\cdot
\frac{\alpha_c^{J^{1}_{c,s_t}} \prod_{j=1}^{J^{1}_{c,s_t}} \left(N^{1}_{c,s_t,j} - 1\right)!}{\prod_{x=1}^{N^{1}_{c,s_t,\cdot}} \left(x - 1 + \alpha_c\right)}
\right]
\tag{4}
$$

As above, the first factor in Equation 4 is proportional to the probability of assigning a topic shift of value 1 to speaker $a_{c,t}$; the second factor in the big bracket is proportional to the joint distribution of the topics in segments $s_t − 1$ and $s_t$. In this case $l_{c,t} = 1$ means splitting the current segment, which results in two joint probabilities for two segments.

4 Datasets

This section introduces the three corpora we use. We preprocess the data to remove stopwords and remove turns containing fewer than five tokens.

The ICSI Meeting Corpus: The ICSI Meeting Corpus (Janin et al., 2003) is 75 transcribed meetings. For evaluation, we used a standard set of reference segmentations (Galley et al., 2003) of 25 meetings. Segmentations are binary, i.e., each point of the document is either a segment boundary or not, and on average each meeting has 8 segment boundaries. After preprocessing, there are 60 unique speakers and the vocabulary contains 3346 non-stopword tokens.

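The preprocessing step described at the start of Section 4 (removing stopwords and dropping short turns) could look roughly like the sketch below. This is only an illustration under stated assumptions: the tiny stopword list, the whitespace tokenizer, the `(speaker, text)` input layout, and the choice to count tokens after stopword removal are all hypothetical, since the text does not specify them.

```python
# Minimal preprocessing sketch. STOPWORDS is a tiny illustrative list,
# not the list used in the experiments described in the text.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "is", "in", "it", "that"})


def preprocess(turns, stopwords=STOPWORDS, min_tokens=5):
    """Remove stopwords, then drop turns with fewer than `min_tokens` tokens.

    `turns` is a list of (speaker, text) pairs (a hypothetical layout).
    Assumption: the five-token threshold is applied after stopword removal.
    """
    kept = []
    for speaker, text in turns:
        # Naive lowercase + whitespace tokenization (an assumption).
        tokens = [w for w in text.lower().split() if w not in stopwords]
        if len(tokens) >= min_tokens:
            kept.append((speaker, tokens))
    return kept


example = [
    ("A", "the economy is going to get much worse before it gets better"),
    ("B", "yes it is"),
]
cleaned = preprocess(example)
```

Whether the length threshold should apply before or after stopword removal changes which turns survive, so that choice is worth checking against the original preprocessing if exact replication matters.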
The 2008 Presidential Election Debates: Our second dataset contains three annotated presidential debates (Boydstun et al., 2011) between Barack Obama and John McCain and a vice presidential debate between Joe Biden and Sarah Palin. Each turn is one of two types: questions (Q) from the moderator or responses (R) from a candidate. Each clause in a turn is coded with a Question Topic (TQ) and a Response Topic (TR). Thus, a turn has a list of TQ’s and TR’s both of length equal to the number of clauses in the turn. Topics are from the Policy Agendas Topics Codebook, a manual inventory of 19 major topics and 225 subtopics.[5] Table 1 shows an example annotation. To get reference segmentations, we assign each turn a real value from 0 to 1 indicating how much a turn changes the topic.

Table 1: Example turns from the annotated 2008 election debates. The topics (TQ and TR) are from the Policy Agendas Topics Codebook, which contains the following topic codes: Macroeconomics (1), Housing & Community Development (14), Government Operations (20).
[Table 1 columns: Speaker, Type, Turn clauses, TQ, TR. Rows: Brokaw (Q), Obama (R), Brokaw (Q), McCain (R). The only fully legible row is Brokaw (Q): “Sen. McCain, in all candor, do you think the economy is going to get worse before it gets better?”, TQ = 1, TR = N/A.]

For a question-typed turn, the score is the fraction of clause topics not appearing in the previous turn; for response-typed turns, the score is the fraction of clause topics that do not appear in the corresponding question. This results in a set of non-binary reference segmentations. For evaluation metrics that require binary segmentations, we create a binary segmentation by setting a turn as a segment boundary if the computed score is 1. This threshold is chosen to include only true segment boundaries.

CNN’s Crossfire: Crossfire was a weekly U.S. television “talking heads” program engineered to incite heated arguments (hence the name). Each episode features two recurring hosts, two guests, and clips from the week’s news. Our Crossfire dataset contains 1134 transcribed episodes aired between 2000 and 2004.[6] There are 2567 unique speakers. Unlike the previous two datasets, Crossfire does not have explicit topic segmentations, so we use it to explore speaker-specific characteristics (Section 6).

[6] ∼vietan/topicshift/

5 Topic Segmentation Experiments

In this section, we examine how well SITS can replicate annotations of when new topics are introduced. We discuss metrics for evaluating an algorithm’s segmentation against a gold annotation, describe our experimental setup, and report those results.

Evaluation Metrics: To evaluate segmentations, we use Pk (Beeferman et al., 1999) and WindowDiff (WD) (Pevzner and Hearst, 2002). Both metrics measure the probability that two points in a document will be incorrectly separated by a segment boundary. Both techniques consider all spans of length k in the document and count whether the two endpoints of the window are (im)properly segmented against the gold segmentation. However, these metrics have drawbacks. First, they require both hypothesized and reference segmentations to be binary.
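The rule for deriving reference segmentations from the debate annotations (Section 4) can be sketched as follows. The sketch is not the authors' code: the dictionary layout of a turn, the function names, the choice to treat a response's own TQ list as "the corresponding question", and the handling of the first turn are all assumptions made for illustration.

```python
def fraction_new(clause_topics, context_topics):
    """Fraction of clause topics that do not appear in the context topics."""
    if not clause_topics:
        return 0.0
    context = set(context_topics)
    return sum(t not in context for t in clause_topics) / len(clause_topics)


def reference_scores(turns):
    """Assign each turn a score in [0, 1] for how much it changes the topic.

    Each turn is a dict with keys "type" ("Q" or "R"), "tq" and "tr"
    (lists of topic codes, one per clause); this layout is hypothetical.
    """
    scores = []
    prev_topics = []  # assumption: nothing precedes the first turn
    for turn in turns:
        if turn["type"] == "Q":
            own = turn["tq"]
            # Question: compare its topics with the previous turn's topics.
            score = fraction_new(own, prev_topics)
        else:
            own = turn["tr"]
            # Response: compare its topics with the question topics (TQ)
            # coded on the same turn (an assumption about the annotation).
            score = fraction_new(own, turn["tq"])
        scores.append(score)
        prev_topics = own
    return scores


def binarize(scores):
    """A turn is a boundary only when its score is exactly 1."""
    return [1 if s == 1.0 else 0 for s in scores]


debate = [
    {"type": "Q", "tq": [1], "tr": []},
    {"type": "R", "tq": [1, 1], "tr": [1, 14]},
    {"type": "Q", "tq": [1], "tr": []},
    {"type": "R", "tq": [1], "tr": [20]},
]
soft = reference_scores(debate)
hard = binarize(soft)
```

The topic codes in `debate` reuse the codes named in the Table 1 caption (1, 14, 20), but the turns themselves are invented for the example.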
Many algorithms (e.g., probabilistic approaches) give non-binary segmentations where candidate boundaries have real-valued scores (e.g., probability or confidence). Thus, evaluation requires arbitrary thresholding to binarize soft scores. To be fair, thresholds are set so the number of segments is equal to a predefined value (Purver et al., 2006; Galley et al., 2003). To overcome these limitations, we also use Earth Mover’s Distance (EMD) (Rubner et al., 2000), a metric that measures the distance between two distributions. The EMD is the minimal cost to transform one distribution into the other. Each segmentation can be considered a multi-dimensional distribution where each candidate boundary is a dimension. In EMD, a distance function across features allows partial credit for “near miss” segment boundaries. In addition, because EMD operates on distributions, we can compute the distance between non-binary hypothesized segmentations with binary or real-valued reference segmentations. We use the FastEMD implementation (Pele and Werman, 2009).

Experimental Methods: We applied the following methods to discover topic segmentations in a document:
• TextTiling (Hearst, 1997) is one of the earliest general-purpose topic segmentation algorithms, sliding a fixed-width window to detect major changes in lexical similarity.
• P-NoSpeaker-S: parametric version without speaker identity run on each conversation (Purver et al., 2006).
• P-NoSpeaker-M: parametric version without speaker identity run on all conversations.
• P-SITS: the parametric version of SITS with speaker identity run on all conversations.
• NP-HMM: the HMM-based nonparametric model with a single topic per turn. This model can be considered a Sticky HDP-HMM (Fox et al., 2008) with speaker identity.
• NP-SITS: the nonparametric version of SITS with speaker identity run on all conversations.

Parameter Settings and Implementations: In our experiment, all parameters of TextTiling are the same as in (Hearst, 1997). For statistical models, Gibbs sampling with 10 randomly initialized chains is used. Initial hyperparameter values are sampled from U(0, 1) to favor sparsity; statistics are collected after 500 burn-in iterations with a lag of 25 iterations over a total of 5000 iterations; and slice sampling (Neal, 2003) optimizes hyperparameters.

Results and Analysis: Table 2 shows the performance of various models on the topic segmentation problem, using the ICSI corpus and the 2008 debates. Consistent with previous results, probabilistic models outperform TextTiling. In addition, among the probabilistic models, the models that had access to speaker information consistently segment better than those lacking such information, supporting our assertion that there is benefit to modeling conversation as a social process. Furthermore, NP-SITS outperforms NP-HMM in both experiments, suggesting that using a distribution over topics for each turn is better than using a single topic. This is consistent with parametric results reported in (Purver et al., 2006).

The contribution of speaker identity seems more valuable in the debate setting. Debates are characterized by strong rewards for setting the agenda; dodging a question or moving the debate toward an opponent’s weakness can be useful strategies (Boydstun et al., 2011). In contrast, meetings (particularly low-stakes ICSI meetings) are characterized by pragmatic rather than strategic topic shifts. Second, agenda-setting roles are clearer in formal debates; a moderator is tasked with setting the agenda and ensuring the conversation does not wander too much.

The nonparametric model does best on the smaller debate dataset. We suspect that an evaluation that directly assessed the topic quality, either via prediction (Teh et al., 2006) or interpretability (Chang et al., 2009), would favor the nonparametric model more.

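The two window-based metrics used in this section, Pk and WindowDiff, can be sketched in a few lines. The version below follows the usual textbook definitions rather than any particular evaluation script: a segmentation is a hypothetical list of 0/1 flags, one per gap between adjacent turns, and the default window size (half the average reference segment length) is a common convention that the text above does not confirm.

```python
def _default_k(reference):
    """Half the average reference segment length (a common convention)."""
    n_units = len(reference) + 1
    n_segments = sum(reference) + 1
    return max(1, round(n_units / (2 * n_segments)))


def pk(reference, hypothesis, k=None):
    """Fraction of windows whose endpoints are separated in exactly one
    of the two segmentations."""
    assert len(reference) == len(hypothesis)
    k = k or _default_k(reference)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_split = sum(reference[i:i + k]) > 0
        hyp_split = sum(hypothesis[i:i + k]) > 0
        errors += ref_split != hyp_split
    return errors / n_windows


def window_diff(reference, hypothesis, k=None):
    """Fraction of windows in which the two segmentations place a
    different number of boundaries."""
    assert len(reference) == len(hypothesis)
    k = k or _default_k(reference)
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        errors += sum(reference[i:i + k]) != sum(hypothesis[i:i + k])
    return errors / n_windows


# 10 turns, boundaries after turns 3 and 7 in the reference.
ref = [0, 0, 1, 0, 0, 0, 1, 0, 0]
no_boundaries = [0] * len(ref)
pk_same = pk(ref, ref)
pk_empty = pk(ref, no_boundaries)
wd_empty = window_diff(ref, no_boundaries)
```

Both functions require binary inputs, which is exactly the limitation that motivates the use of EMD for soft segmentations above. WindowDiff is at least as large as Pk on the same inputs, because it also penalizes windows where both segmentations contain boundaries but in different numbers.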
[Table caption, truncated in extraction: … size of the metrics Pk and WindowDiff chosen to replicate previous results.]

6 Evaluating Topic Shift Tendency

In this section, we focus on the ability of SITS to capture speaker-level attributes. Recall that SITS associates with each speaker a topic shift tendency π that represents the probability of asserting a new topic in the conversation. While topic segmentation is a well studied problem, there are no established quantitative measurements of an individual’s ability to control a conversation. To evaluate whether the tendency is capturing meaningful characteristics of speakers, we compare our inferred tendencies against insights from political science.

2008 Elections: To obtain a posterior estimate of π (Figure 3) we create 10 chains with hyperparameters sampled from the uniform distribution U(0, 1) and average π over the 10 chains (as described in Section 5). In these debates, Ifill is the moderator of the debate between Biden and Palin; Brokaw, Lehrer and Schieffer are the three moderators of the three debates between Obama and McCain. Here “Question” denotes questions from audiences in the “town hall” debate. The role of this “speaker” can be considered equivalent to the debate moderator.

The topic shift tendencies of moderators are much higher than for candidates. In the three debates between Obama and McCain, the moderators (Brokaw, Lehrer and Schieffer) have significantly higher scores than both candidates. This is a useful reality check, since in a debate the moderators are the ones asking questions and literally controlling the topical focus. Interestingly, in the vice-presidential debate, the score of moderator Ifill is only slightly higher than those of Palin and Biden; this is consistent with media commentary characterizing her as a weak moderator.[7] Similarly, the “Question” speaker had a relatively high variance, consistent with an amalgamation of many distinct speakers.
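One plausible way to turn sampled topic shift assignments into a per-speaker estimate of π, and then average it over chains, is sketched below. The smoothed ratio mirrors the first factor of Equations 3 and 4, but using it as the per-chain point estimate is an assumption, as are the function names, the `(no_shift, shift)` count layout, and the per-chain γ values; the text only states that π was averaged over 10 chains.

```python
def shift_tendency(no_shift, shift, gamma):
    """Smoothed estimate of a speaker's topic shift tendency.

    Assumption: pi is estimated as (S_1 + gamma) / (S_. + 2 * gamma),
    i.e. the same form as the first factor of Equation 4.
    """
    return (shift + gamma) / (no_shift + shift + 2.0 * gamma)


def average_over_chains(chains, gammas):
    """Average each speaker's estimate across independent Gibbs chains.

    `chains` is a list of dicts mapping speaker -> (no_shift, shift) counts
    taken from one chain; `gammas` holds that chain's gamma (hypothetical
    layout, since hyperparameters may differ per chain).
    """
    totals = {}
    for counts, gamma in zip(chains, gammas):
        for speaker, (s0, s1) in counts.items():
            totals.setdefault(speaker, []).append(shift_tendency(s0, s1, gamma))
    return {speaker: sum(vals) / len(vals) for speaker, vals in totals.items()}


# Two toy chains for a moderator-like and a candidate-like speaker.
toy_chains = [
    {"moderator": (2, 8), "candidate": (8, 2)},
    {"moderator": (4, 6), "candidate": (6, 4)},
]
pi_hat = average_over_chains(toy_chains, gammas=[1.0, 1.0])
```

With these invented counts the moderator-like speaker ends up with the higher estimate, which is the qualitative pattern the section reports for real moderators.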
These topic shift tendencies suggest that all candidates manage to succeed at some points in setting and controlling the debate topics. Our model gives Obama a slightly higher score than McCain, consistent with social science claims (Boydstun et al., 2011) that Obama had the lead in setting the agenda over McCain. Table 4 shows examples of SITS-detected topic shifts.

[Figure caption, truncated in extraction: … 2008 Presidential Election Debates (larger means greater tendency)]

Crossfire: Crossfire, unlike the debates, has many speakers. This allows us to examine more closely what we can learn about speakers’ topic shift tendency. Having verified that SITS can segment topics, and assuming that changing the topic is useful for a speaker, how can we characterize who does so effectively? We examine the relationship between topic shift tendency, social roles, and political ideology.

To focus on frequent speakers, we filter out speakers with fewer than 30 turns. Most speakers have relatively small π, with the mode around 0.3. There are, however, speakers with very high topic shift tendencies. Table 5 shows the speakers having the highest values according to SITS.

We find that there are three general patterns for who influences the course of a conversation in Crossfire. First, there are structural “speakers” the show uses to frame and propose new topics. These are audience questions, news clips (e.g. many of Gore’s and Bush’s turns from 2000), and voice overs. That SITS is able to recover these is reassuring. Second, the stable of regular hosts receives high topic shift tendencies, which is reasonable given their experience with the format and ostensible moderation roles (in practice they also stoke lively discussion). The remaining class is more interesting.
The remaining non-hosts with high topic shift tendency are relative moderates on the political spectrum:
• John Kasich, one of few Republicans to support the assault weapons ban and now governor of Ohio, a swing state
• Christine Todd Whitman, former Republican governor of New Jersey, a very Democratic state
• John McCain, who before 2008 was known as a “maverick” for working with Democrats (e.g. Russ Feingold)

This suggests that, despite Crossfire’s tendency to create highly partisan debates, those who are able to work across the political spectrum may best be able to influence the topic under discussion in highly polarized contexts. Table 4 shows detected topic shifts from these speakers; two of these examples (McCain and Whitman) show disagreement of Republicans with President Bush. In the other, Kasich is defending a Republican plan (school vouchers) popular with traditional Democratic constituencies.

[Table 4: examples of turns detected as topic shifts for speakers with high topic shift tendency π. Columns: Previous turn, Turn detected as shifting topic. The table body is not recoverable from the extraction.]

Table 5: Top speakers by topic shift tendencies. We mark hosts (†) and “speakers” who often (but not always) appeared in clips (‡). Apart from those groups, speakers with the highest tendency were political moderates.
[Table 5 columns: Rank, Speaker, π (repeated in two column groups). The table body is not recoverable from the extraction.]

7 Related and Future Work

In the realm of statistical models, a number of techniques incorporate social connections and identity to explain content in social networks (Chang and Blei,
2009) and scientific corpora (Rosen-Zvi et al., 2004). However, these models ignore the temporal evolution of content, treating documents as static. Models that do investigate the evolution of topics over time typically ignore the identity of the speaker. Examples include models with sticky topics over n-grams (Johnson, 2010); the sticky HDP-HMM (Fox et al., 2008); models that are an amalgam of sequential models and topic models (Griffiths et al., 2005; Wallach, 2006; Gruber et al., 2007; Ahmed and Xing, 2008; Boyd-Graber and Blei, 2008; Du et al., 2010); and explicit models of time or other relevant features as a distinct latent variable (Wang and McCallum, 2006; Eisenstein et al., 2010). In contrast, SITS jointly models topic and individuals’ tendency to control a conversation. Not only does SITS outperform other models using standard computational linguistics baselines, but it also proposes intriguing hypotheses for social scientists.

Associating each speaker with a scalar that models their tendency to change the topic does improve performance on standard tasks, but it is inadequate to fully describe an individual. Modeling individuals’ perspective (Paul and Girju, 2010), “side” (Thomas et al., 2006), or personal preferences for topics (Grimmer, 2009) would enrich the model and better illuminate the interaction of influence and topic. Statistical analysis of political discourse can help discover patterns that political scientists, who often work via a “close reading,” might otherwise miss. We plan to work with social scientists to validate our implicit hypothesis that our topic shift tendency correlates well with intuitive measures of “influence.”

Acknowledgements

This research was funded in part by the Army Research Laboratory through ARL Cooperative Agreement W911NF-09-2-0072 and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), through the Army Research Laboratory.
Jordan Boyd-Graber and Philip Resnik are also supported by US National Science Foundation grant #1018625. Any opinions, findings, conclusions, or recommendations expressed are the authors’ and do not necessarily reflect those of the sponsors.

References

[Abbott et al., 2011] Abbott, R., Walker, M., Anand, P., Fox Tree, J. E., Bowmani, R., and King, J. (2011). How can you say such things?!?: Recognizing disagreement in informal political argument. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 2–11.
[Ahmed and Xing, 2008] Ahmed, A. and Xing, E. P. (2008). Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, pages 219–230.
[Beeferman et al., 1999] Beeferman, D., Berger, A., and Lafferty, J. (1999). Statistical models for text segmentation. Mach. Learn., 34:177–210.
[Blei and Lafferty, 2009] Blei, D. M. and Lafferty, J. (2009). Text Mining: Theory and Applications, chapter Topic Models. Taylor and Francis, London.
[Boyd-Graber and Blei, 2008] Boyd-Graber, J. and Blei, D. M. (2008). Syntactic topic models. In Proceedings of Advances in Neural Information Processing Systems.
[Boydstun et al., 2011] Boydstun, A. E., Phillips, C., and Glazier, R. A. (2011). It’s the economy again, stupid: Agenda control in the 2008 presidential debates. Forthcoming.
[Chang and Blei, 2009] Chang, J. and Blei, D. M. (2009). Relational topic models for document networks. In Proceedings of Artificial Intelligence and Statistics.
[Chang et al., 2009] Chang, J., Boyd-Graber, J., Wang, C., Gerrish, S., and Blei, D. M. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.
[Du et al., 2010] Du, L., Buntine, W., and Jin, H. (2010). Sequential latent Dirichlet allocation: Discover underlying topic structures within a document. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 148–157.
[Ehlen et al., 2007] Ehlen, P., Purver, M., and Niekrasz, J. (2007). A meeting browser that learns. In Proceedings of the AAAI Spring Symposium on Interaction Challenges for Intelligent Assistants.
[Eisenstein and Barzilay, 2008] Eisenstein, J. and Barzilay, R. (2008). Bayesian unsupervised topic segmentation. In Proceedings of Empirical Methods in Natural Language Processing.
[Eisenstein et al., 2010] Eisenstein, J., O’Connor, B., Smith, N. A., and Xing, E. P. (2010). A latent variable model for geographic lexical variation. In EMNLP’10, pages 1277–1287.
[Ferguson, 1973] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230.
[Fox et al., 2008] Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2008). An HDP-HMM for systems with state persistence. In Proceedings of International Conference of Machine Learning.
[Galley et al., 2003] Galley, M., McKeown, K., Fosler-Lussier, E., and Jing, H. (2003). Discourse segmentation of multi-party conversation. In Proceedings of the Association for Computational Linguistics.
[Georgescul et al., 2006] Georgescul, M., Clark, A., and Armstrong, S. (2006). Word distributions for thematic segmentation in a support vector machine approach. In Conference on Computational Natural Language Learning.
[Gerrish and Blei, 2010] Gerrish, S. and Blei, D. M. (2010). A language-based approach to measuring scholarly impact. In Proceedings of International Conference of Machine Learning.
[Griffiths et al., 2005] Griffiths, T. L., Steyvers, M., Blei, D. M., and Tenenbaum, J. B. (2005). Integrating topics and syntax. In Proceedings of Advances in Neural Information Processing Systems.
[Grimmer, 2009] Grimmer, J. (2009). A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. Political Analysis, 18:1–35.
[Gruber et al., 2007] Gruber, A., Rosen-Zvi, M., and Weiss, Y. (2007). Hidden topic Markov models. In Artificial Intelligence and Statistics.
[Hawes et al., 2009] Hawes, T., Lin, J., and Resnik, P. (2009). Elements of a computational model for multiparty discourse: The turn-taking behavior of Supreme Court justices. Journal of the American Society for Information Science and Technology, 60(8):1607–1615.
[Hearst, 1997] Hearst, M. A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.
[Hsueh et al., 2006] Hsueh, P.-y., Moore, J. D., and Renals, S. (2006). Automatic segmentation of multiparty dialogue. In Proceedings of the European Chapter of the Association for Computational Linguistics.
[Ireland et al., 2011] Ireland, M. E., Slatcher, R. B., Eastwick, P. W., Scissors, L. E., Finkel, E. J., and Pennebaker, J. W. (2011). Language style matching predicts relationship initiation and stability. Psychological Science, 22(1):39–44.
[Janin et al., 2003] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C. (2003). The ICSI meeting corpus. In IEEE International Conference on Acoustics, Speech, and Signal Processing.
[Johnson, 2010] Johnson, M. (2010). PCFGs, topic models, adaptor grammars and learning topical collocations and the structure of proper names. In Proceedings of the Association for Computational Linguistics.
[Morris and Hirst, 1991] Morris, J. and Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17:21–48.
[Müller and Quintana, 2004] Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis. Statistical Science, 19(1):95–110.
[Murray et al., 2005] Murray, G., Renals, S., and Carletta, J. (2005). Extractive summarization of meeting recordings. In European Conference on Speech Communication and Technology.
[Neal, 2000] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265.
[Neal, 2003] Neal, R. M. (2003). Slice sampling. Annals of Statistics, 31:705–767.
[Olney and Cai, 2005] Olney, A. and Cai, Z. (2005). An orthonormal basis for topic segmentation in tutorial dialogue. In Proceedings of the Human Language Technology Conference.
[Paul and Girju, 2010] Paul, M. and Girju, R. (2010). A two-dimensional topic-aspect model for discovering multi-faceted topics. In Association for the Advancement of Artificial Intelligence.
[Pele and Werman, 2009] Pele, O. and Werman, M. (2009). Fast and robust earth mover’s distances. In International Conference on Computer Vision.
[Pevzner and Hearst, 2002] Pevzner, L. and Hearst, M. A. (2002). A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 28.
[Purver, 2011] Purver, M. (2011). Topic segmentation. In Tur, G. and de Mori, R., editors, Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, pages 291–317. Wiley.
[Purver et al., 2006] Purver, M., Körding, K., Griffiths, T. L., and Tenenbaum, J. (2006). Unsupervised topic modelling for multi-party spoken discourse. In Proceedings of the Association for Computational Linguistics.
[Resnik and Hardisty, 2010] Resnik, P. and Hardisty, E. (2010). Gibbs sampling for the uninitiated. Technical Report UMIACS-TR-2010-04, University of Maryland.
[Reynar, 1998] Reynar, J. C. (1998). Topic Segmentation: Algorithms and Applications. PhD thesis, University of Pennsylvania.
[Rosen-Zvi et al., 2004] Rosen-Zvi, M., Griffiths, T. L., Steyvers, M., and Smyth, P. (2004). The author-topic model for authors and documents. In Proceedings of Uncertainty in Artificial Intelligence.
[Rubner et al., 2000] Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40:99–121.
[Teh et al., 2006] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.
[Thomas et al., 2006] Thomas, M., Pang, B., and Lee, L. (2006). Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Proceedings of Empirical Methods in Natural Language Processing.
[Tur et al., 2010] Tur, G., Stolcke, A., Voss, L., Peters, S., Hakkani-Tür, D., Dowding, J., Favre, B., Fernández, R., Frampton, M., Frandsen, M., Frederickson, C., Graciarena, M., Kintzing, D., Leveque, K., Mason, S., Niekrasz, J., Purver, M., Riedhammer, K., Shriberg, E., Tien, J., Vergyri, D., and Yang, F. (2010). The CALO meeting assistant system. Trans. Audio, Speech and Lang. Proc., 18:1601–1611.
[Wallach, 2006] Wallach, H. M. (2006). Topic modeling: Beyond bag-of-words. In Proceedings of International Conference of Machine Learning.
[Wallach, 2008] Wallach, H. M. (2008). Structured Topic Models for Language. PhD thesis, University of Cambridge.
[Wang et al., 2008] Wang, C., Blei, D. M., and Heckerman, D. (2008). Continuous time dynamic topic models. In Proceedings of Uncertainty in Artificial Intelligence.
[Wang and McCallum, 2006] Wang, X. and McCallum, A. (2006). Topics over time: a non-Markov continuous-time model of topical trends. In Knowledge Discovery and Data Mining.

6 0.058500815 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

7 0.054657467 22 acl-2012-A Topic Similarity Model for Hierarchical Phrase-based Translation

8 0.051889278 79 acl-2012-Efficient Tree-Based Topic Modeling

9 0.050712451 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

10 0.048869886 98 acl-2012-Finding Bursty Topics from Microblogs

11 0.047023561 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries

12 0.046499223 120 acl-2012-Information-theoretic Multi-view Domain Adaptation

13 0.044670142 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization

14 0.042142436 144 acl-2012-Modeling Review Comments

15 0.037023764 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

16 0.0357203 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

17 0.03176925 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

18 0.031192364 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

19 0.030942766 31 acl-2012-Authorship Attribution with Author-aware Topic Models

20 0.028421704 210 acl-2012-Unsupervized Word Segmentation: the Case for Mandarin Chinese

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.098), (1, 0.049), (2, 0.04), (3, 0.056), (4, -0.07), (5, 0.007), (6, -0.048), (7, -0.035), (8, -0.077), (9, 0.114), (10, -0.054), (11, 0.015), (12, 0.019), (13, 0.011), (14, -0.026), (15, 0.043), (16, 0.105), (17, -0.05), (18, -0.06), (19, -0.019), (20, 0.218), (21, -0.108), (22, 0.386), (23, 0.015), (24, 0.01), (25, 0.162), (26, -0.056), (27, -0.031), (28, 0.195), (29, 0.021), (30, 0.075), (31, 0.199), (32, -0.014), (33, -0.064), (34, -0.163), (35, -0.026), (36, -0.01), (37, 0.141), (38, -0.165), (39, 0.09), (40, -0.095), (41, -0.094), (42, -0.04), (43, -0.013), (44, -0.024), (45, 0.029), (46, -0.003), (47, -0.118), (48, 0.033), (49, -0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97238779 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

Author: Karolina Owczarzak ; Peter A. Rankel ; Hoa Trang Dang ; John M. Conroy

Abstract: We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid scoring. We identify inconsistencies in the data and measure to what extent these inconsistencies affect the ranking of automatic summarization systems. Finally, we examine the stability of automatic metrics (ROUGE and CLASSY) with respect to the inconsistent assessments.

2 0.77424723 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan

Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-of-the-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.

3 0.77296871 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

4 0.24336413 46 acl-2012-Character-Level Machine Translation Evaluation for Languages with Ambiguous Word Boundaries

Author: Chang Liu ; Hwee Tou Ng

Abstract: In this work, we introduce the TESLA-CELAB metric (Translation Evaluation of Sentences with Linear-programming-based Analysis - Character-level Evaluation for Languages with Ambiguous word Boundaries) for automatic machine translation evaluation. For languages such as Chinese where words usually have meaningful internal structure and word boundaries are often fuzzy, TESLA-CELAB acknowledges the advantage of character-level evaluation over word-level evaluation. By reformulating the problem in the linear programming framework, TESLA-CELAB addresses several drawbacks of the character-level metrics, in particular the modeling of synonyms spanning multiple characters. We show empirically that TESLA-CELAB significantly outperforms character-level BLEU in the English-Chinese translation evaluation tasks.

5 0.23726013 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks

Author: Tsung-Ting Kuo ; San-Chuan Hung ; Wei-Shih Lin ; Nanyun Peng ; Shou-De Lin ; Wei-Fen Lin

Abstract: This paper brings together two seemingly unrelated topics, natural language processing (NLP) and social network analysis (SNA). We propose a new task in SNA, which is to predict the diffusion of a new topic, and design a learning-based framework to solve this problem. We exploit the latent semantic information among users, topics, and social connections as features for prediction. Our framework is evaluated on real data collected from the public domain. The experiments show a 16% AUC improvement over baseline methods. The source code and dataset are available at fusion/

1 Background

The diffusion of information on social networks has been studied for decades. Generally, the proposed strategies fall into two categories, model-driven and data-driven. The model-driven strategies, such as the independent cascade model (Kempe et al., 2003), rely on manually crafted, usually intuitive, models to fit the diffusion data without using diffusion history. The data-driven strategies usually use learning-based approaches to predict the future propagation given historical records of prediction (Fei et al., 2011; Galuba et al., 2010; Petrovic et al., 2011). Data-driven strategies usually perform better than model-driven approaches because the past diffusion behavior is used during learning (Galuba et al., 2010). Recently, researchers started to exploit content information in data-driven diffusion models (Fei et al., 2011; Petrovic et al., 2011; Zhu et al., 2011).

However, most of the data-driven approaches assume that, in order to train a model and predict the future diffusion of a topic, one must obtain historical records of how this topic has propagated in a social network (Petrovic et al., 2011; Zhu et al., 2011). We argue that this assumption does not always hold in real-world scenarios, and that being able to forecast the propagation of novel or unseen topics is more valuable in practice.
For example, a company would like to know which users are more likely to be the source of ‘viva voce’ of a newly released product for advertising purposes. A political party might want to estimate the potential degree of response to a half-baked policy before deciding to bring it up in public. To achieve such goals, one must predict the future propagation behavior of a topic even before any actual diffusion happens on this topic (i.e., no historical propagation data of this topic are available). Lin et al. also propose an idea aimed at predicting the inference of implicit diffusions for novel topics (Lin et al., 2011). The main difference between their work and ours is that they focus on implicit diffusions, whose data are usually not available. Consequently, they need to rely on a model-driven approach instead of a data-driven approach. Our work, on the other hand, focuses on the prediction of explicit diffusion behaviors. Although no diffusion data of novel topics is available, we can still design a data-driven approach that takes advantage of explicit diffusion data of known topics. Our experiments show that being able to utilize such information is critical for diffusion prediction.

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 344–348, Jeju, Republic of Korea, 8-14 July 2012. ©2012 Association for Computational Linguistics

2 The Novel-Topic Diffusion Model

We start by assuming an existing social network G = (V, E), where V is the set of nodes (or users) v, and E is the set of links e. The set of topics is denoted as T. Among them, some are considered novel topics (denoted as N), while the rest (R) are used as the training records. We are also given a set of diffusion records D = {d | d = (src, dest, t)}, where src is the source node (or diffusion source), dest is the destination node, and t is the topic of the diffusion, which belongs to R but not N.
We assume that diffusions cannot occur between nodes without a direct social connection; any diffusion pair implies the existence of a link e = (src, dest) ∈ E. Finally, we assume there are sets of keywords or tags that are relevant to each topic (including existing and novel topics). Note that the set of keywords for novel topics should be seen in that of existing topics. From these sets of keywords, we construct a topic-word matrix TW = (P(word_j | topic_i))_{i,j}, whose elements are the conditional probabilities that a word appears in the text of a certain topic. Similarly, we also construct a user-word matrix UW = (P(word_j | user_i))_{i,j} from these sets of keywords. Given the above information, the goal is to predict whether a given link is active (i.e., belongs to a diffusion link) for topics in N.

2.1 The Framework

The main challenge of this problem lies in the fact that the past diffusion behaviors of new topics are missing. To address this challenge, we propose a supervised diffusion discovery framework that exploits the latent semantic information among users, topics, and their explicit / implicit interactions. Intuitively, four kinds of information are useful for prediction:

• Topic information: Intuitively, knowing the signatures of a topic (e.g., is it about politics?) is critical to the success of the prediction.
• User information: The information of a user such as the personality (e.g., whether this user is aggressive or passive) is generally useful.
• User-topic interaction: Understanding the users' preference on certain topics can improve the quality of prediction.
• Global information: We include some global features (e.g., topology info) of the social network.

Below we describe how these four kinds of information can be modeled in our framework.

2.2 Topic Information

We extract hidden topic category information to model topic signature.
In particular, we exploit the Latent Dirichlet Allocation (LDA) method (Blei et al., 2003), a widely used topic modeling technique, to decompose the topic-word matrix TW into hidden topic categories: TW = TH * HW, where TH is a topic-hidden matrix, HW is a hidden-word matrix, and h is the manually chosen parameter that determines the number of hidden topic categories. TH indicates the distribution of each topic over the hidden topic categories, and HW indicates the distribution of each lexical term over the hidden topic categories. Note that TW and TH include both existing and novel topics. We utilize TH_{t,*}, the row vector of the topic-hidden matrix TH for a topic t, as a feature set. In brief, we apply LDA to extract the topic-hidden vector TH_{t,*} to model topic signature (TG) for both existing and novel topics.

Topic information can be further exploited. To predict whether a novel topic will be propagated through a link, we can first enumerate the existing topics that have been propagated through this link. For each such topic, we can calculate its similarity with the new topic based on the hidden vectors generated above (e.g., using cosine similarity between feature vectors). Then, we sum up the similarity values as a new feature: topic similarity (TS). For example, a link has previously propagated two topics a total of three times {ACL, KDD, ACL}, and we would like to know whether a new topic, EMNLP, will propagate through this link. We can use the topic-hidden vectors to generate the similarity values between EMNLP and the other topics (e.g., {0.6, 0.4, 0.6}), and then sum them up (1.6) as the value of TS.

2.3 User Information

Similar to topic information, we extract latent personal information to model user signature (the users are anonymized already). We apply LDA on the user-word matrix UW: UW = UM * MW, where UM is the user-hidden matrix, MW is the hidden-word matrix, and m is the manually chosen number of hidden user categories.
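As an illustration of the topic similarity (TS) feature from Section 2.2, the following minimal sketch sums the cosine similarity between a new topic's hidden vector and the hidden vector of every past diffusion on a link. The function names and the toy vectors are hypothetical and are not taken from the authors' released code; only the "sum of cosine similarities over the link history" logic follows the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity(new_topic_vec, link_history, topic_vectors):
    """TS feature: sum of similarities between the new topic and each past diffusion.

    link_history lists one topic per past diffusion on the link, so a topic
    that propagated twice contributes its similarity twice.
    """
    return sum(cosine(new_topic_vec, topic_vectors[t]) for t in link_history)


# Hypothetical topic-hidden vectors (rows of TH) with h = 3 hidden categories.
topic_vectors = {
    "ACL": [0.7, 0.2, 0.1],
    "KDD": [0.2, 0.7, 0.1],
}
emnlp = [0.6, 0.3, 0.1]           # hidden vector of the novel topic
history = ["ACL", "KDD", "ACL"]   # the link propagated ACL twice and KDD once
ts = topic_similarity(emnlp, history, topic_vectors)
```

Repeated topics are counted once per diffusion, mirroring the {ACL, KDD, ACL} example in the text, where the similarity to ACL enters the sum twice.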
UM indicates the distribution of each user over the hidden user categories (e.g., age). We then use UM_{u,*}, the row vector of UM for the user u, as a feature set. In brief, we apply LDA to extract the user-hidden vector UM_{u,*} for both source and destination nodes of a link to model user signature (UG).

2.4 User-Topic Interaction

Modeling user-topic interaction turns out to be non-trivial. It is not useful to apply latent semantic analysis directly to the user-topic matrix UR = UQ * QR, where UR represents how many times each user is diffused for existing topic R (R ∈ T), because UR does not contain information of novel topics, and neither do UQ and QR. Given no propagation record about novel topics, we propose a method that still allows us to extract implicit user-topic information. First, we extract from the matrix TH (described in Section 2.2) a subset RH that contains only information about existing topics. Next we apply left division to derive another user-hidden matrix UH:

UH = (RH \ UR^T)^T = ((RH^T RH)^{-1} RH^T UR^T)^T

Using left division, we generate the UH matrix using existing topic information. Finally, we exploit UH_{u,*}, the row vector of the user-hidden matrix UH for the user u, as a feature set. Note that novel topics were included in the process of learning the hidden topic categories on RH; therefore the features learned here do implicitly utilize some latent information of novel topics, which is not the case for UM. Experiments confirm the superiority of our approach. Furthermore, our approach ensures that the hidden categories in the topic-hidden and user-hidden matrices are identical. Intuitively, our method directly models the user's preference for topics' signatures (e.g., how capable is this user of propagating topics in the politics category?). In contrast, the UM mentioned in Section 2.3 represents the users' signature (e.g., aggressiveness) and has nothing to do with their opinions on a topic.
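The left-division step of Section 2.4 amounts to a least-squares solve of RH X = UR^T followed by a transpose. The sketch below implements the normal-equation form given in the text, UH = ((RH^T RH)^{-1} RH^T UR^T)^T, in dependency-free Python. The helper names and the toy matrices are assumptions made for illustration, and the sketch assumes RH has full column rank so that RH^T RH is invertible.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b for square a using Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("RH^T RH is singular; RH needs full column rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def user_hidden(rh, ur):
    """UH = ((RH^T RH)^-1 RH^T UR^T)^T, with rh of shape |R| x h and ur of shape |V| x |R|."""
    rht = transpose(rh)
    gram = matmul(rht, rh)             # h x h
    rhs = matmul(rht, transpose(ur))   # h x |V|
    x = solve(gram, rhs)               # h x |V|, least-squares solution of RH X = UR^T
    return transpose(x)                # |V| x h


# Toy data: 3 existing topics, h = 2 hidden categories, 2 users.
RH = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
UH_true = [[2.0, 0.0], [0.0, 3.0]]
UR = matmul(UH_true, transpose(RH))    # diffusion counts consistent with UH_true
UH = user_hidden(RH, UR)
```

Because the toy UR is generated exactly from UH_true, the solve recovers it up to rounding. On real count data the result is only the least-squares fit, and a library routine such as `numpy.linalg.lstsq` would be the more usual choice than hand-written elimination.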
In short, we obtain the user-hidden probability vector UH_{u,*} as a feature set, which models user preferences to latent categories (UPLC).

2.5 Global Features

Given a candidate link, we can extract global social features such as in-degree (ID) and out-degree (OD). We tried other features such as PageRank values but found them not useful. Moreover, we extract the number of distinct topics (NDT) for a link as a feature. The intuition behind this is that the more distinct topics a user has diffused to another, the more likely the diffusion will happen for novel topics.

2.6 Complexity Analysis

The complexity of producing each feature is as follows: (1) Topic information: O(I * |T| * h * B_t) for LDA using Gibbs sampling, where I is the number of sampling iterations, |T| is the number of topics, and B_t is the average number of tokens in a topic. (2) User information: O(I * |V| * m * B_u), where |V| is the number of users, and B_u is the average number of tokens for a user. (3) User-topic interaction: the time complexity is O(h^3 + h^2 * |T| + h * |T| * |V|). (4) Global features: O(|D|), where |D| is the number of diffusions.

3 Experiments

For evaluation, we use the diffusion records of old topics to predict whether a diffusion link exists between two nodes given a new topic.

3.1 Dataset and Evaluation Metric

We first identify the 100 most popular topics (e.g., earthquake) from the Plurk micro-blog site between 01/2011 and 05/2011. Plurk is a popular micro-blog service in Asia with more than 5 million users (Kuo et al., 2011). We manually separate the 100 topics into 7 groups. We use topic-wise 4-fold cross validation to evaluate our method, because there are only 100 available topics. For each group, we select 3/4 of the topics as training and 1/4 as validation. The positive diffusion records are generated based on the post-response behavior.
That is, if a person x posts a message containing one of the selected topics t, and later a person y responds to this message, we consider that a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance). Our dataset contains a total of 1,642,894 positive instances across 100 distinct topics; the largest and smallest topics contain 303,424 and 2,166 diffusions, respectively. Also, the same number of negative instances for each topic (1,642,894 in total) is sampled for binary classification (similar to the setup in KDD Cup 2011 Track 2). The negative links of a topic t are sampled randomly based on the absence of responses for that given topic.

The underlying social network is created using the post-response behavior as well. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) on at least one topic. Eventually we generated a social network of 163,034 nodes and 382,878 links. Furthermore, the sets of keywords for each topic are required to create the TW and UW matrices for latent topic analysis; we simply extract the content of posts and responses for each topic to create both matrices. We set the hidden category number h = m = 7, which is equal to the number of topic groups.

We use area under the ROC curve (AUC) to evaluate our proposed framework (Davis and Goadrich, 2006); we rank the testing instances based on their likelihood of being positive, and compare the ranking with the ground truth to compute AUC.

3.2 Implementation and Baseline

After trying many classifiers and obtaining similar results for all of them, we report only results from LIBLINEAR with c=0.0001 (Fan et al., 2008) due to space limitations. We remove stop-words, use SCWS (Hightman, 2012) for tokenization, and MALLET (McCallum, 2002) and GibbsLDA++ (Phan and Nguyen, 2007) for LDA.

There are three baseline models we compare the result with.
First, we simply use the total number of existing diffusions among all topics between two nodes as the single feature for prediction. Second, we exploit the independent cascading model (Kempe et al., 2003), and utilize the normalized total number of diffusions as the propagation probability of each link. Third, we try the heat diffusion model (Ma et al., 2008), set initial heat proportional to out-degree, and tune the diffusion time parameter until the best results are obtained. Note that we did not compare with any data-driven approaches, as we have not identified one that can predict diffusion of novel topics.

3.3 Results

The result of each model is shown in Table 1. All except two features outperform the baseline. The best single feature is TS. Note that UPLC performs better than UG, which verifies our hypothesis that maintaining the same hidden features across different LDA models is better. We further conduct experiments to evaluate different combinations of features (Table 2), and found that the best one (TS + ID + NDT) results in about 16% improvement over the baseline, and outperforms the combination of all features. As stated in (Witten et al., 2011), adding useless features may cause the performance of classifiers to deteriorate. Intuitively, TS captures both latent topic and historical diffusion information, while ID and NDT provide complementary social characteristics of users.

4 Conclusions

The main contributions of this paper are as follows:

1. We propose a novel task of predicting the diffusion of unseen topics, which has wide applications in the real world.
2. Compared to the traditional model-driven or content-independent data-driven works on diffusion analysis, our solution demonstrates how one can bring together ideas from two different but promising areas, NLP and SNA, to solve a challenging problem.
3. The promising experimental result (74% AUC) not only demonstrates the usefulness of the proposed models, but also indicates that predicting diffusion of unseen topics without historical diffusion data is feasible.

Acknowledgments

This work was also supported by National Science Council, National Taiwan University and Intel Corporation under Grants NSC 100-2911-I-002-001, and 101R7501.

References

David M. Blei, Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3.993-1022.
Jesse Davis & Mark Goadrich. 2006. The relationship between Precision-Recall and ROC curves. Proceedings of the 23rd international conference on Machine learning, Pittsburgh, Pennsylvania.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang & Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res., 9.1871-74.
Hongliang Fei, Ruoyi Jiang, Yuhao Yang, Bo Luo & Jun Huan. 2011. Content based social behavior prediction: a multi-task learning approach. Proceedings of the 20th ACM international conference on Information and knowledge management, Glasgow, Scotland, UK.
Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic & Wolfgang Kellerer. 2010. Outtweeting the twitterers - predicting information cascades in microblogs. Proceedings of the 3rd conference on Online social networks, Boston, MA.
Hightman. 2012. Simple Chinese Words Segmentation (SCWS).
David Kempe, Jon Kleinberg & Eva Tardos. 2003. Maximizing the spread of influence through a social network. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, Washington, D.C.
Tsung-Ting Kuo, San-Chuan Hung, Wei-Shih Lin, Shou-De Lin, Ting-Chun Peng & Chia-Chun Shih. 2011. Assessing the Quality of Diffusion Models Using Real-World Social Network Data. Conference on Technologies and Applications of Artificial Intelligence, 2011.
C.X. Lin, Q.Z. Mei, Y.L. Jiang, J.W. Han & S.X. Qi. 2011. Inferring the Diffusion and Evolution of Topics in Social Communities. Proceedings of the IEEE International Conference on Data Mining, 2011.
Hao Ma, Haixuan Yang, Michael R. Lyu & Irwin King. 2008. Mining social networks using heat diffusion processes for marketing candidates selection. Proceeding of the 17th ACM conference on Information and knowledge management, Napa Valley, California, USA.
Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.
Sasa Petrovic, Miles Osborne & Victor Lavrenko. 2011. RT to Win! Predicting Message Propagation in Twitter. International AAAI Conference on Weblogs and Social Media, 2011.
Xuan-Hieu Phan & Cam-Tu Nguyen. 2007. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA).
Ian H. Witten, Eibe Frank & Mark A. Hall. 2011. Data Mining: Practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc.
Jiang Zhu, Fei Xiong, Dongzhen Piao, Yun Liu & Ying Zhang. 2011. Statistically Modeling the Effectiveness of Disaster Information in Social Media. Proceedings of the 2011 IEEE Global Humanitarian Technology Conference.
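As a note on the evaluation protocol of Section 3.1 in the paper above (rank test instances by predicted likelihood and compare with the ground truth), AUC can be read as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. The sketch below computes it by direct pairwise comparison. The function name and inputs are hypothetical, and the quadratic loop is only suitable for small illustrations; it would not scale to the millions of instances in the paper's dataset, where a sort-based computation would be needed.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    instance receives the higher score; tied pairs count as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for four candidate links (1 = diffusion occurred).
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

With balanced positive and negative samples, as in the paper's setup, a random scorer would be expected to land near 0.5 under this definition.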

6 0.22742485 178 acl-2012-Sentence Simplification by Monolingual Machine Translation

7 0.19998136 120 acl-2012-Information-theoretic Multi-view Domain Adaptation

8 0.19976249 79 acl-2012-Efficient Tree-Based Topic Modeling

9 0.19303346 171 acl-2012-SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations

10 0.19225457 34 acl-2012-Automatically Learning Measures of Child Language Development

11 0.18622215 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction

12 0.17924905 39 acl-2012-Beefmoves: Dissemination, Diversity, and Dynamics of English Borrowings in a German Hip Hop Forum

13 0.16779238 88 acl-2012-Exploiting Social Information in Grounded Language Learning via Grammatical Reduction

14 0.16539356 51 acl-2012-Collective Generation of Natural Image Descriptions

15 0.16308795 31 acl-2012-Authorship Attribution with Author-aware Topic Models

16 0.15972762 180 acl-2012-Social Event Radar: A Bilingual Context Mining and Sentiment Analysis Summarization System

17 0.15376832 55 acl-2012-Community Answer Summarization for Multi-Sentence Question with Group L1 Regularization

18 0.15195575 98 acl-2012-Finding Bursty Topics from Microblogs

19 0.14079516 144 acl-2012-Modeling Review Comments

20 0.13471004 110 acl-2012-Historical Analysis of Legal Opinions with a Sparse Mixed-Effects Latent Variable Model

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.214), (22, 0.072), (25, 0.021), (26, 0.064), (28, 0.062), (30, 0.039), (37, 0.02), (39, 0.043), (74, 0.016), (82, 0.025), (84, 0.015), (85, 0.041), (90, 0.073), (92, 0.048), (94, 0.019), (99, 0.135)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80308419 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

Author: Karolina Owczarzak ; Peter A. Rankel ; Hoa Trang Dang ; John M. Conroy

Abstract: We investigate the consistency of human assessors involved in summarization evaluation to understand its effect on system ranking and automatic evaluation techniques. Using Text Analysis Conference data, we measure annotator consistency based on human scoring of summaries for Responsiveness, Readability, and Pyramid scoring. We identify inconsistencies in the data and measure to what extent these inconsistencies affect the ranking of automatic summarization systems. Finally, we examine the stability of automatic metrics (ROUGE and CLASSY) with respect to the inconsistent assessments.

2 0.69361293 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

Author: Seyed Abolghasem Mirroshandel ; Alexis Nasr ; Joseph Le Roux

Abstract: Treebanks are not large enough to reliably model precise lexical phenomena. This deficiency provokes attachment errors in the parsers trained on such data. We propose in this paper to compute lexical affinities, on large corpora, for specific lexico-syntactic configurations that are hard to disambiguate and introduce the new information in a parser. Experiments on the French Treebank showed a relative decrease of the error rate of 7.1% Labeled Accuracy Score yielding the best parsing results on this treebank.

3 0.60211641 149 acl-2012-Movie-DiC: a Movie Dialogue Corpus for Research and Development

Author: Rafael E. Banchs

Abstract: This paper describes Movie-DiC, a Movie Dialogue Corpus recently collected for research and development purposes. The collected dataset comprises 132,229 dialogues containing a total of 764,146 turns that have been extracted from 753 movies. Details on how the data collection has been created and how it is structured are provided along with its main statistics and characteristics.

4 0.59572148 101 acl-2012-Fully Abstractive Approach to Guided Summarization

Author: Pierre-Etienne Genest ; Guy Lapalme

Abstract: This paper shows that full abstraction can be accomplished in the context of guided summarization. We describe a work in progress that relies on Information Extraction, statistical content selection and Natural Language Generation. Early results already demonstrate the effectiveness of the approach.

5 0.59163547 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

Author: Ziheng Lin ; Chang Liu ; Hwee Tou Ng ; Min-Yen Kan

Abstract: An ideal summarization system should produce summaries that have high content coverage and linguistic quality. Many state-of-the-art summarization systems focus on content coverage by extracting content-dense sentences from source articles. A current research focus is to process these sentences so that they read fluently as a whole. The current AESOP task encourages research on evaluating summaries on content, readability, and overall responsiveness. In this work, we adapt a machine translation metric to measure content coverage, apply an enhanced discourse coherence model to evaluate summary readability, and combine both in a trained regression model to evaluate overall responsiveness. The results show significantly improved performance over AESOP 2011 submitted metrics.

6 0.58952188 170 acl-2012-Robust Conversion of CCG Derivations to Phrase Structure Trees

7 0.58705884 53 acl-2012-Combining Textual Entailment and Argumentation Theory for Supporting Online Debates Interactions

8 0.55783015 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

9 0.55572903 153 acl-2012-Named Entity Disambiguation in Streaming Data

10 0.55547833 40 acl-2012-Big Data versus the Crowd: Looking for Relationships in All the Right Places

11 0.54963624 159 acl-2012-Pattern Learning for Relation Extraction with a Hierarchical Topic Model

12 0.54838896 191 acl-2012-Temporally Anchored Relation Extraction

13 0.54834127 146 acl-2012-Modeling Topic Dependencies in Hierarchical Text Categorization

14 0.54124212 139 acl-2012-MIX Is Not a Tree-Adjoining Language

15 0.53855461 62 acl-2012-Cross-Lingual Mixture Model for Sentiment Classification

16 0.53622639 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

17 0.52844471 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

18 0.52727461 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

19 0.52508998 218 acl-2012-You Had Me at Hello: How Phrasing Affects Memorability

20 0.52421844 8 acl-2012-A Corpus of Textual Revisions in Second Language Writing