
110 emnlp-2011-Ranking Human and Machine Summarization Systems


Source: pdf

Author: Peter Rankel ; John Conroy ; Eric Slud ; Dianne O'Leary

Abstract: The Text Analysis Conference (TAC) ranks summarization systems by their average score over a collection of document sets. We investigate the statistical appropriateness of this score and propose an alternative that better distinguishes between human and machine evaluation systems.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The Text Analysis Conference (TAC) ranks summarization systems by their average score over a collection of document sets. [sent-6, score-0.195]

2 We investigate the statistical appropriateness of this score and propose an alternative that better distinguishes between human and machine evaluation systems. [sent-7, score-0.044]

3 A major theme of this conference is multi-document summarization: machine summarization of sets of related documents, sometimes query-focused and sometimes generic. [sent-9, score-0.126]

4 The summarizers are judged by how well the summaries match human-generated summaries in either automatic metrics such as ROUGE (Lin and Hovy, 2003) or manual metrics such as responsiveness or pyramid evaluation (Nenkova et al.). [sent-10, score-0.975]

5 Typically the systems are ranked by their average score over all document sets. [sent-12, score-0.069]

6 Ranking by average score is quite appropriate under certain statistical hypotheses, for example, when each sample is drawn from a distribution which differs from the distribution of other samples only through a location shift (Randles and Wolfe, 1979). [sent-13, score-0.033]

7 However, a non-parametric (rank-based) analysis of variance on the summarizers’ scores on each document set revealed an impossibly small p-value (less [sent-14, score-0.172]

8 Conroy, IDA/Center for Computing Sciences, Bowie, Maryland, conroyjohnm@gmail. [sent-15, score-0.121]

9 Figure 1: Confidence intervals from a non-parametric Tukey’s honestly significant difference test for 46 TAC 2010 update document sets. [sent-19, score-0.147]

10 The blue confidence interval (for document set d1032) does not overlap any of the 30 red intervals. [sent-20, score-0.1]

11 Hence, the test concludes that 30 document sets have mean significantly different from the mean of d1032. [sent-21, score-0.069]

12 Figure 5: ROUGE-2 scores for the TAC 2010 update summary task, organized by document set (y-axis) and summarizer (x-axis). [sent-27, score-0.5]

13 The 51 summarizers fall into two distinct groups: machine systems (first 43 columns) and humans (last 8 columns). [sent-28, score-0.3]

14 Note that each human only summarized half of the document sets, thus creating 23 missing values in each of the last 8 columns. [sent-29, score-0.113]

15 Black is used to indicate missing values in the last 8 columns and low scores in the first 43 columns. [sent-30, score-0.091]

16 than 10−12 using Matlab’s kruskalwallis¹), providing evidence that a summary’s score is not independent of the document set. [sent-31, score-0.069]

17 This effect can be seen in Figure 1, showing the confidence bands, as computed by a Tukey honestly significant difference test for each document set’s difficulty as measured by the mean rank responsiveness score for TAC 2010. [sent-32, score-0.337]

18 The test clearly shows that the summarizer performances on different document sets have different averages. [sent-33, score-0.36]

19 Some rows are clearly darker, indicating overall lower scores for the summaries of these documents.¹ [sent-35, score-0.277]

20 ¹The Kruskal-Wallis test performs a one-way analysis of variance of document-set differences after first converting the summary scores for each sample to their ranks within the pooled sample. Computed from the converted scores, the Kruskal-Wallis test statistic is essentially the ratio of the between-group sum of squares to the combined within-group sum of squares. [sent-36, score-0.207]

21 The variances of the scores also differ row-by-row. [sent-37, score-0.056]
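The footnote’s description of the Kruskal-Wallis procedure can be sketched in a few lines. This is a minimal illustration on synthetic scores, not the TAC data (the document-set count and summarizer count mimic the TAC 2010 layout, but every score distribution here is invented), using `scipy.stats.kruskal`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for TAC-style scores: 46 document sets x 51 summarizers,
# with a per-document-set "difficulty" shift to mimic the heterogeneity the
# paper reports. All distributions here are invented for illustration.
n_docsets, n_summarizers = 46, 51
difficulty = rng.normal(0.0, 0.05, size=n_docsets)
scores = difficulty[:, None] + rng.normal(0.10, 0.02, size=(n_docsets, n_summarizers))

# Kruskal-Wallis with one group per document set: rank the pooled scores,
# then compare between-group to within-group variation, as in the footnote.
H, p = stats.kruskal(*scores)
print(f"H = {H:.1f}, p = {p:.3g}")  # a tiny p-value: scores depend on the document set
```

With a document-set effect this strong, the test rejects decisively, which is the qualitative pattern the paper reports for the real scores.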

22 These plots show qualitatively what the non-parametric analysis of variance demonstrates statistically. [sent-38, score-0.047]

23 While the data presented was for the TAC 2010 update document sets, similar results hold for all the TAC 2008, 2009, and 2010 data. [sent-39, score-0.098]

24 Hence, it may be advantageous to measure summarizer quality by accounting for heterogeneity of documents within each test set. [sent-40, score-0.333]

25 A non-parametric paired test like the Wilcoxon signed-rank is one way to do this. [sent-41, score-0.216]

26 In the paper (Conroy and Dang, 2008) the authors noted that while there is a significant gap in performance between machine systems and human summarizers when measured by average manual metrics, this gap is not present when measured by the averages of the best automatic metric (ROUGE). [sent-43, score-0.351]

27 In particular, in the DUC 2005-2007 data some systems have ROUGE performance within the 95% confidence intervals of several human summarizers, but their pyramid, linguistic, and responsiveness scores do not achieve this level of performance. [sent-44, score-0.349]

28 Thus, the inexpensive automatic metrics, as currently employed, do not predict well how machine summaries compare to human summaries. [sent-45, score-0.14]

29 In this work we explore the use of documentpaired testing for summarizer comparison. [sent-46, score-0.383]

30 Our main approach is to consider each pair of two summarizers’ sets of scores (over all documents) as a balanced two-sample dataset, and to assess that pair’s mean difference in scores through a two-sample T or Wilcoxon test, paired or unpaired. [sent-47, score-0.328]

31 Our hope is that paired testing, using either the standard paired two-sample t-test or the distributionfree Wilcoxon signed-rank test, can provide greater power in the statistical analysis of automatic metrics such as ROUGE. [sent-49, score-0.674]

32 2 Size and Power of Tests. Statistical tests are generally compared by choosing rejection thresholds to achieve a certain small probability of Type I error (usually α = .05). [sent-50, score-0.239]

33 Given multiple tests with the same Type I error, one prefers the test with the smallest probability of Type II error. [sent-52, score-0.104]

34 Since power is defined to be one minus the Type II error probability, we prefer the test with the most power. [sent-53, score-0.091]

35 However, in many settings, the null hypothesis comprises many possible probability laws, as here where the null hypothesis is that the underlying probability laws for the score-samples of two separate summarizers are equal, without specifying exactly what that probability distribution is. [sent-56, score-0.676]

36 In this case, the significance level is an upper bound for the attained size of the test, defined as sup_{P∈H0} P(S ≥ c), the largest rejection probability P(S ≥ c) achieved by any probability law compatible with the null hypothesis. [sent-57, score-0.276]

37 The power of the test then depends on the specific probability law Q from the considered alternatives in HA. [sent-58, score-0.171]

38 For each such Q, and given a threshold c, the power for the test at Q is the rejection probability Q(S ≥ c). [sent-59, score-0.264]

39 The alternative hypotheses are composite, that is, each consists of multiple probability laws for the data. [sent-61, score-0.152]

40 However, the discrete TAC data can be thought of as rounded continuous data, rather than as truly discrete data. [sent-64, score-0.087]

41 The thresholds c for the tests can be defined either by theoretical distributions, by large-sample approximations, or by data-resampling (bootstrap) techniques, and (only) in the last case are these thresholds data-dependent, or random. [sent-67, score-0.219]

42 We explain these notions with respect to the two-sample data-structure in which the scores from the first summarizer are denoted X1, . [sent-68, score-0.347]

43 . . . , Xn, where n is the number of documents with non-missing scores for both summarizers, and the scores from the second summarizer are Y1, . . . , Yn. [sent-71, score-0.403]

44 Then the paired statistics are defined as

Tp = √(n(n−1)) · Z̄ / ( Σ_{k=1}^n (Zk − Z̄)² )^{1/2}   and   W = Σ_{k=1}^n sgn(Zk) Rk⁺,

where Rk⁺ is the rank of |Zk| among |Z1|, . . . , |Zn|. [sent-76, score-0.216]
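The two displayed statistics can be computed directly from their definitions. Below is a short sketch on hypothetical paired scores (all numbers invented); as a sanity check, Tp is compared against `scipy.stats.ttest_rel`, which computes the same paired t-statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 46                                     # number of document sets (as in TAC 2010)
x = rng.normal(0.10, 0.03, size=n)         # hypothetical scores, summarizer 1
y = x + rng.normal(0.01, 0.02, size=n)     # hypothetical scores, summarizer 2 (paired)
z = x - y
zbar = z.mean()

# Paired t-statistic exactly as displayed:
#   Tp = sqrt(n(n-1)) * zbar / (sum_k (z_k - zbar)^2)^(1/2)
Tp = np.sqrt(n * (n - 1)) * zbar / np.sqrt(((z - zbar) ** 2).sum())

# Wilcoxon signed-rank statistic W = sum_k sgn(z_k) * R_k+, with R_k+ the
# rank of |z_k| among |z_1|, ..., |z_n|.
Rplus = stats.rankdata(np.abs(z))
W = (np.sign(z) * Rplus).sum()

# Sanity check: Tp coincides with scipy's paired t-test statistic.
t_scipy = stats.ttest_rel(x, y).statistic
assert np.isclose(Tp, t_scipy)
print(f"Tp = {Tp:.3f}, W = {W:.0f}")
```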

45 Under both null and alternative hypotheses, the variates Zk are all assumed independent and identically distributed (iid), while under H0, the random variables Zk are symmetric about 0. [sent-81, score-0.032]

46 However, when n is moderately or very large, the cutoff is well approximated by the standard-normal 1 − α/2 quantile z_{α/2}, and Tp becomes approximately nonparametrically valid with this cutoff, by the Central Limit Theorem. [sent-83, score-0.172]

47 The Wilcoxon signed-rank statistic W has theoretical cutoff c = c(W) which depends only on n, whenever the data Zk are continuously distributed; but for large n, the cutoff is given simply as √(n³/12) · z_{α/2}. [sent-84, score-0.387]

48 When there are ties (as might be common in discrete data), the calculation of cutoffs and p-values for Wilcoxon becomes slightly more complicated and is no longer fully nonparametric except in a large-sample approximate sense. [sent-85, score-0.118]

49 The situation for the two-sample unpaired t-statistic T currently used in TAC evaluation is not so neat. [sent-86, score-0.218]

50 However, an essential element of the summarization data is the heterogeneity of documents. [sent-88, score-0.168]

51 This means that while {Xk}_{k=1}^n can be viewed as iid scores when documents are selected randomly and not necessarily equiprobably from the ensemble of all possible documents, the Yk and Xk samples are dependent. [sent-89, score-0.131]

52 Still, the differences {Zk}_{k=1}^n are iid, which is what makes paired testing valid. [sent-91, score-0.308]

53 However, there is no theoretical distribution for T from which to calculate valid quantiles c for cutoffs, and therefore the use of the unpaired t-statistic cannot be recommended for TAC evaluation. [sent-92, score-0.362]

54 What can be done in a particular dataset, like the TAC summarization score datasets we consider, to ascertain the approximate validity of theoretically derived large-sample cutoffs for test statistics? [sent-93, score-0.21]

55 In the age of plentiful and fast computers, quite a lot, through the powerful computational machinery of the bootstrap (Efron and Tibshirani, 1993). [sent-94, score-0.143]

56 We have done this in two distinct ways, each creating 2000 datasets with n paired scores: MC Monte Carlo Method. [sent-96, score-0.216]

57 For each of many iterations (in our case 2000), define a new dataset {(X′k, Y′k)}_{k=1}^n by independently swapping Xk and Yk with probability 1/2. [sent-97, score-0.04]

58 Hence, (X′k, Y′k) = (Xk, Yk) with probability 1/2 and (Yk, Xk) with probability 1/2. [sent-98, score-0.08]

59 The upshot is that the 1 − α empirical quantile for S based on either of these simulational methods serves as a data-dependent cutoff c attaining approximate size α for all H0-generated data. [sent-105, score-0.172]
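The MC method can be sketched as follows, here applied to the paired t-statistic on hypothetical paired scores (the pair-swapping scheme and the 2000-iteration count follow the text; the data and all other constants are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_iter, alpha = 46, 2000, 0.05
x = rng.normal(0.10, 0.03, size=n)         # hypothetical paired scores
y = x + rng.normal(0.01, 0.02, size=n)

def paired_t(x, y):
    z = x - y
    return np.sqrt(n * (n - 1)) * z.mean() / np.sqrt(((z - z.mean()) ** 2).sum())

# MC method: build 2000 H0-like datasets by independently swapping each pair
# (x_k, y_k) with probability 1/2, and record the resulting |T| values.
null_stats = np.empty(n_iter)
for i in range(n_iter):
    swap = rng.random(n) < 0.5
    x0 = np.where(swap, y, x)
    y0 = np.where(swap, x, y)
    null_stats[i] = abs(paired_t(x0, y0))

# The 1 - alpha empirical quantile is the data-dependent cutoff c.
cutoff = np.quantile(null_stats, 1 - alpha)
T_obs = abs(paired_t(x, y))
print(f"|T| = {T_obs:.2f}, MC cutoff c = {cutoff:.2f}, reject H0: {T_obs > cutoff}")
```

Because swapping a pair just flips the sign of its difference, this is the classical sign-flip resampling scheme, and the empirical cutoff lands close to the theoretical t quantile here.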

60 The MC and HB methods will be employed in Section 4 to check the theoretical p-values. [sent-106, score-0.098]

61 3 Relative Efficiency of W versus Tp Statistical theory does have something to say about the comparative powers of paired W versus Tp statistics. [sent-107, score-0.312]

62 These statistics have been studied (Randles and Wolfe, 1979) in terms of their asymptotic relative efficiency for location-shift alternatives based on symmetric densities (f(z − ϑ) is a location-shifted version of f(z)). [sent-108, score-0.114]

63 For many pairs of parametric and rank-based statistics S, S̃, including W and Tp, the following assertion has been proved for testing H0 at significance level α. [sent-109, score-0.136]

64 First assume the Zk are distributed according to some density f(z − ϑ), where f(z) is a symmetric function (f(−z) = f(z)). [sent-110, score-0.071]

65 When n gets large, the powers at any alternatives with very small ϑ = γ/√n, γ ≠ 0, can be made asymptotically equal by using samples of size n with statistic S and of size ρ · n with statistic S̃. [sent-112, score-0.447]

66 Here ρ = ARE(S, S̃) is a constant not depending on n or γ but definitely depending on f, called the asymptotic relative efficiency of S with respect to S̃. [sent-113, score-0.074]

67 (The smaller ρ < 1 is, the more the statistic S̃ is preferred among the two.) [sent-114, score-0.207]

68 Randles and Wolfe (1979, p. 167) show that the Wilcoxon signed-rank statistic W provides greater robustness and often much greater efficiency than the paired T, with an ARE which is 0. [sent-120, score-0.527]

69 Nevertheless, as we found by statistical analysis of the TAC data, both are far superior to the unpaired T-statistic, with either theoretical or empirical bootstrapped p-values. [sent-124, score-0.289]
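The efficiency claim is easy to probe by simulation. The sketch below (sample size, shift, and replication count are illustrative, not from the paper) estimates the power of the paired t-test and the Wilcoxon signed-rank test when the paired differences follow a shifted Laplace distribution, a heavy-tailed case where the Wilcoxon test is known to be asymptotically more efficient (ARE = 1.5 relative to t):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, alpha, shift = 40, 2000, 0.05, 0.4

# Estimate rejection rates of the paired t-test and the Wilcoxon signed-rank
# test when the paired differences follow a Laplace distribution shifted by
# `shift` -- a heavy-tailed case favoring the rank-based test. All constants
# here are illustrative.
rej_t = rej_w = 0
for _ in range(reps):
    z = rng.laplace(loc=shift, scale=1.0, size=n)  # paired differences
    if stats.ttest_1samp(z, 0.0).pvalue < alpha:
        rej_t += 1
    if stats.wilcoxon(z).pvalue < alpha:
        rej_w += 1

print(f"estimated power -- paired t: {rej_t / reps:.2f}, Wilcoxon: {rej_w / reps:.2f}")
```

Under this heavy-tailed alternative the Wilcoxon test rejects noticeably more often than the paired t-test, consistent with the ARE result; under a normal density the two would be nearly tied.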

70 4 Testing Setup and Results To evaluate our ideas, we used the TAC data from 2008-2010 and focused on three manual metrics (overall responsiveness, pyramid score, and linguistic quality score) and two automatic metrics (ROUGE-2 and ROUGE-SU4). [sent-125, score-0.393]

71 We make the assumption, backed by both the scores given and comments made by NIST summary assessors³, that automatic summarization systems do not perform at the human level of performance. [sent-126, score-0.354]

72 Thus, automatic summarization fails the Turing test of machine intelligence (Turing, 1950). [sent-128, score-0.161]

73 Finally, our own results show that no matter how one compares human and machine scores, all machine systems score significantly worse than humans. [sent-130, score-0.1]

74 [Garbled residue of a results table: per-metric counts of significant pairs under the Unpaired-T, Paired-T, and Wilcoxon tests.] [sent-141, score-0.038]

75 For each of these metrics, we first created a score matrix whose (i, j)-entry represents the score for summarizer j on document set i (these matrices generated the colorplots in Figures 2–5). [sent-147, score-0.36]

76 We then performed a Wilcoxon signed-rank test on certain pairs of columns of this matrix (any pair consisting of one machine system and one human summarizer). [sent-148, score-0.079]

77 As a baseline, we did the same testing with a paired and an unpaired t-test. [sent-149, score-0.526]

78 Each of these tests resulted in a p-value, and we counted how many were less than .05. [sent-150, score-0.064]
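The testing procedure just described can be sketched end-to-end on a synthetic score matrix (the dimensions mimic the TAC 2010 layout described earlier, including humans covering only half the document sets; the score distributions and effect sizes are invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_docs, n_machines, n_humans = 46, 43, 8   # layout as in the TAC 2010 description

# Invented score matrix: per-document-set difficulty plus a summarizer effect,
# with humans scoring somewhat higher; each human covers only half the sets.
difficulty = rng.normal(0.0, 0.04, size=n_docs)
machines = difficulty[:, None] + rng.normal(0.08, 0.02, size=(n_docs, n_machines))
humans = difficulty[:, None] + rng.normal(0.11, 0.02, size=(n_docs, n_humans))
for j in range(n_humans):
    humans[rng.choice(n_docs, n_docs // 2, replace=False), j] = np.nan

# For every machine-human pair, restrict to document sets scored by both,
# run all three tests, and count p-values below .05.
counts = {"unpaired t": 0, "paired t": 0, "Wilcoxon": 0}
for m in range(n_machines):
    for h in range(n_humans):
        ok = ~np.isnan(humans[:, h])
        x, y = machines[ok, m], humans[ok, h]
        if stats.ttest_ind(x, y).pvalue < 0.05:
            counts["unpaired t"] += 1
        if stats.ttest_rel(x, y).pvalue < 0.05:
            counts["paired t"] += 1
        if stats.wilcoxon(x - y).pvalue < 0.05:
            counts["Wilcoxon"] += 1

total = n_machines * n_humans
print({k: f"{v}/{total}" for k, v in counts.items()})
```

Because the shared per-document difficulty cancels in the paired differences but inflates the variance seen by the unpaired test, the paired tests flag many more machine-human pairs as significantly different, which is the pattern the paper reports.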

79 The results of these tests (shown in Table 2), were somewhat surprising. [sent-152, score-0.064]

80 Although we expected the nonparametric signed-rank test to perform better than an unpaired t-test, we were surprised to see that a paired t-test performed even better. [sent-153, score-0.468]

81 All three tests always reject the null hypotheses when human metrics are used. [sent-154, score-0.287]

82 This is what we’d like to happen with automatic metrics as well. [sent-155, score-0.117]

83 As seen from the table, the paired t-test and Wilcoxon signed-rank test offer a good improvement over the unpaired t-test. [sent-156, score-0.434]

84 In this case, we are comparing pairs of machine summarization systems. [sent-158, score-0.126]

85 However, since the number of significant differences increases with paired testing here as well, we believe this also reflects the greater discriminatory power of paired testing. [sent-161, score-0.649]

86 We now apply the Monte Carlo and Hybrid Monte Carlo methods to check the theoretical p-values reported in Tables 1 and 2. [sent-162, score-0.071]

87 The empirical quantiles found by these methods generally confirm the theoretical p-value test results reported there, especially in Table 2. [sent-163, score-0.144]

88 5 Conclusions and Future Work In this paper we observed that summarization systems’ performance varied significantly across document sets on the Text Analysis Conference (TAC) data. [sent-165, score-0.195]

89 This variance in performance suggested that paired testing may be more appropriate than the t-test currently employed at TAC to compare the performance of summarization systems. [sent-166, score-0.508]

90 We estimated the statistical power of the t-test and the Wilcoxon signed-rank test by calcu- lating the number of machine systems whose performance was significantly different than that of human summarizers. [sent-168, score-0.135]

91 As human assessors score machine systems as not achieving human performance in either content or responsiveness, automatic metrics such as ROUGE should ideally indicate this distinction. [sent-169, score-0.243]

92 We found that the paired Wilcoxon test significantly increases the number of machine systems that score significantly different than humans when the pairwise test is performed on ROUGE-2 and ROUGE-SU4 scores. [sent-170, score-0.244]

93 Thus, we demonstrated that the Wilcoxon paired test shows more statistical power than the t-test for comparing summarization systems. [sent-171, score-0.433]

94 Consequently, the use of paired testing should not only be used in formal evaluations such as TAC, but also should be employed by summarization developers to more accurately assess whether changes to an automatic system give rise to improved performance. [sent-172, score-0.496]

95 Further study needs to analyze more summarization metrics such as those proposed at the recent NIST evaluation of automatic metrics, Automatically Evaluating Summaries ofPeers (AESOP) (Nat, 2010). [sent-173, score-0.243]

96 As metrics become more sophisticated and aim to more accurately predict human judgements such as overall responsiveness and linguistic quality, paired testing seems likely to be a more powerful statistical procedure than the unpaired t-test for head-to-head summarizer comparisons. [sent-174, score-1.163]

97 Throughout our research in this paper, we treated each separate kind of scores on a document set as data for one summarizer to be compared with the same kind of scores for other summarizers. [sent-175, score-0.472]

98 However, it might be more fruitful to treat all the scores as multivariate data and compare the summarizers that way. [sent-176, score-0.366]

99 Multivariate statistical techniques such as Principal Component Analysis may play a constructive role in suggesting highly discriminating new composite scores, perhaps leading to statistics with even more power to measure a summary’s quality. [sent-177, score-0.157]

100 It is likely that paired testing is also appropriate for BLEU, and will give additional discriminating power between machine translations and human translations. [sent-180, score-0.478]


similar papers computed by tfidf model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 110 emnlp-2011-Ranking Human and Machine Summarization Systems


2 0.08269985 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures

Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo

Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in the last years remains largely ignored for practical purposes. In this paper we first present an indepth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable) the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.

3 0.080840863 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization

Author: Rui Yan ; Liang Kong ; Congrui Huang ; Xiaojun Wan ; Xiaoming Li ; Yan Zhang

Abstract: We investigate an important and challenging problem in summary generation, i.e., Evolutionary Trans-Temporal Summarization (ETTS), which generates news timelines from massive data on the Internet. ETTS greatly facilitates fast news browsing and knowledge comprehension, and hence is a necessity. Given the collection oftime-stamped web documents related to the evolving news, ETTS aims to return news evolution along the timeline, consisting of individual but correlated summaries on each date. Existing summarization algorithms fail to utilize trans-temporal characteristics among these component summaries. We propose to model trans-temporal correlations among component summaries for timelines, using inter-date and intra-date sen- tence dependencies, and present a novel combination. We develop experimental systems to compare 5 rival algorithms on 6 instinctively different datasets which amount to 10251 documents. Evaluation results in ROUGE metrics indicate the effectiveness of the proposed approach based on trans-temporal information. 1

4 0.077494778 130 emnlp-2011-Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization

Author: Rui Yan ; Jian-Yun Nie ; Xiaoming Li

Abstract: Most traditional summarization methods treat their outputs as static and plain texts, which fail to capture user interests during summarization because the generated summaries are the same for different users. However, users have individual preferences on a particular source document collection and obviously a universal summary for all users might not always be satisfactory. Hence we investigate an important and challenging problem in summary generation, i.e., Interactive Personalized Summarization (IPS), which generates summaries in an interactive and personalized manner. Given the source documents, IPS captures user interests by enabling interactive clicks and incorporates personalization by modeling captured reader preference. We develop . experimental systems to compare 5 rival algorithms on 4 instinctively different datasets which amount to 5197 documents. Evaluation results in ROUGE metrics indicate the comparable performance between IPS and the best competing system but IPS produces summaries with much more user satisfaction according to evaluator ratings. Besides, low ROUGE consistency among these user preferred summaries indicates the existence of personalization.

5 0.074538805 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

Author: Peng Li ; Yinglin Wang ; Wei Gao ; Jing Jiang

Abstract: In this paper, we propose a novel approach to automatic generation of aspect-oriented summaries from multiple documents. We first develop an event-aspect LDA model to cluster sentences into aspects. We then use extended LexRank algorithm to rank the sentences in each cluster. We use Integer Linear Programming for sentence selection. Key features of our method include automatic grouping of semantically related sentences and sentence ranking based on extension of random walk model. Also, we implement a new sentence compression algorithm which use dependency tree instead of parser tree. We compare our method with four baseline methods. Quantitative evaluation based on Rouge metric demonstrates the effectiveness and advantages of our method.

6 0.053743586 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

7 0.046645477 125 emnlp-2011-Statistical Machine Translation with Local Language Models

8 0.043657407 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

9 0.042711828 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

10 0.042367879 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

11 0.039873414 90 emnlp-2011-Linking Entities to a Knowledge Base with Query Expansion

12 0.037376113 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding

13 0.03696293 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

14 0.032235552 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification

15 0.031074092 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

16 0.030931564 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

17 0.030562822 38 emnlp-2011-Data-Driven Response Generation in Social Media

18 0.030296497 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

19 0.030238161 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article

20 0.02931942 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions


similar papers computed by lsi model


similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95556182 110 emnlp-2011-Ranking Human and Machine Summarization Systems


2 0.61074507 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures


3 0.58452541 135 emnlp-2011-Timeline Generation through Evolutionary Trans-Temporal Summarization


4 0.43602696 130 emnlp-2011-Summarize What You Are Interested In: An Optimization Framework for Interactive Personalized Summarization


5 0.39834791 86 emnlp-2011-Lexical Co-occurrence, Statistical Significance, and Word Association

Author: Dipak L. Chaudhari ; Om P. Damani ; Srivatsan Laxman

Abstract: Lexical co-occurrence is an important cue for detecting word associations. We propose a new measure of word association based on a new notion of statistical significance for lexical co-occurrences. Existing measures typically rely on global unigram frequencies to determine expected co-occurrence counts. Instead, we focus only on documents that contain both terms (of a candidate word-pair) and ask if the distribution of the observed spans of the word-pair resembles that under a random null model. This would imply that the words in the pair are not related strongly enough for one word to influence placement of the other. However, if the words are found to occur closer together than explainable by the null model, then we hypothesize a more direct association between the words. Through extensive empirical evaluation on most of the publicly available benchmark data sets, we show the advantages of our measure over existing co-occurrence measures.
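The span-based test sketched in this abstract can be illustrated with a small Monte Carlo simulation: restrict attention to documents containing both words, measure the minimum span between their occurrences, and compare the observed mean span against random re-placement of the same occurrence counts. This is a simplified stand-in for the paper's null model, not its actual implementation; all names are illustrative.

```python
import random

def min_span(pos_a, pos_b):
    """Smallest absolute distance between any occurrence of a and of b."""
    return min(abs(i - j) for i in pos_a for j in pos_b)

def span_significance(docs, w1, w2, trials=1000, seed=0):
    """Empirical p-value: how often random placement of the same
    occurrence counts yields a mean span <= the observed mean span.

    docs: list of token lists. Only documents containing both words
    are considered, mirroring the paper's document-restricted setup.
    """
    rng = random.Random(seed)
    both = [d for d in docs if w1 in d and w2 in d]
    if not both:
        return 1.0
    observed = sum(
        min_span([i for i, t in enumerate(d) if t == w1],
                 [i for i, t in enumerate(d) if t == w2])
        for d in both) / len(both)
    hits = 0
    for _ in range(trials):
        total = 0.0
        for d in both:
            n = len(d)
            k1 = sum(t == w1 for t in d)
            k2 = sum(t == w2 for t in d)
            # Independently re-place each word's occurrences at random.
            total += min_span(rng.sample(range(n), k1),
                              rng.sample(range(n), k2))
        if total / len(both) <= observed:
            hits += 1
    return hits / trials
```

Word pairs that consistently occur closer together than this null model predicts get a small p-value, which is the hypothesized signature of a direct association.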

6 0.3396036 61 emnlp-2011-Generating Aspect-oriented Multi-Document Summarization with Event-aspect model

7 0.29132682 5 emnlp-2011-A Fast Re-scoring Strategy to Capture Long-Distance Dependencies

8 0.28529745 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

9 0.27661216 112 emnlp-2011-Refining the Notions of Depth and Density in WordNet-based Semantic Similarity Measures

10 0.26620242 106 emnlp-2011-Predicting a Scientific Community's Response to an Article

11 0.25925794 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices

12 0.24893238 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information

13 0.24313526 38 emnlp-2011-Data-Driven Response Generation in Social Media

14 0.24088766 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.

15 0.23657452 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora

16 0.23285468 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images

17 0.23206802 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

18 0.2216837 32 emnlp-2011-Computing Logical Form on Regulatory Texts

19 0.21896388 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

20 0.2107493 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(23, 0.079), (36, 0.019), (37, 0.034), (45, 0.067), (53, 0.024), (54, 0.533), (57, 0.016), (62, 0.01), (64, 0.01), (66, 0.022), (69, 0.011), (79, 0.034), (82, 0.015), (90, 0.011), (96, 0.026), (98, 0.021)]
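The topicId/topicWeight pairs above form a sparse LDA topic vector for this paper, and the simIndex/simValue list that follows is consistent with ranking papers by a similarity such as cosine between these sparse vectors. A minimal sketch under that assumption (the function names and toy corpus are illustrative, not the site's actual pipeline):

```python
import math

def cosine_sparse(u, v):
    """Cosine similarity of two sparse vectors given as {topicId: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def rank_similar(query, corpus):
    """Rank papers by topic-vector similarity, most similar first.

    corpus: {paperId: {topicId: weight}}. Returns (paperId, score)
    pairs -- the same shape as the simIndex/simValue list.
    """
    scored = [(pid, cosine_sparse(query, vec)) for pid, vec in corpus.items()]
    return sorted(scored, key=lambda x: -x[1])
```

Because the vectors are sparse, papers sharing no topics score exactly zero, while papers dominated by the same heavy topic (here topic 54) rank at the top.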

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91458637 110 emnlp-2011-Ranking Human and Machine Summarization Systems

Author: Peter Rankel ; John Conroy ; Eric Slud ; Dianne O'Leary

Abstract: The Text Analysis Conference (TAC) ranks summarization systems by their average score over a collection of document sets. We investigate the statistical appropriateness of this score and propose an alternative that better distinguishes between human and machine evaluation systems.

2 0.84513783 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

Author: Wei Lu ; Hwee Tou Ng

Abstract: This paper describes a novel probabilistic approach for generating natural language sentences from their underlying semantics in the form of typed lambda calculus. The approach is built on top of a novel reduction-based weighted synchronous context free grammar formalism, which facilitates the transformation process from typed lambda calculus into natural language sentences. Sentences can then be generated based on such grammar rules with a log-linear model. To acquire such grammar rules automatically in an unsupervised manner, we also propose a novel approach with a generative model, which maps from sub-expressions of logical forms to word sequences in natural language sentences. Experiments on benchmark datasets for both English and Chinese generation tasks yield significant improvements over results obtained by two state-of-the-art machine translation models, in terms of both automatic metrics and human evaluation.

3 0.81604612 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization

Author: Lin Sun ; Anna Korhonen

Abstract: Most previous research on verb clustering has focussed on acquiring flat classifications from corpus data, although many manually built classifications are taxonomic in nature. Also Natural Language Processing (NLP) applications benefit from taxonomic classifications because they vary in terms of the granularity they require from a classification. We introduce a new clustering method called Hierarchical Graph Factorization Clustering (HGFC) and extend it so that it is optimal for the task. Our results show that HGFC outperforms the frequently used agglomerative clustering on a hierarchical test set extracted from VerbNet, and that it yields state-of-the-art performance also on a flat test set. We demonstrate how the method can be used to acquire novel classifications as well as to extend existing ones on the basis of some prior knowledge about the classification.

4 0.43066004 123 emnlp-2011-Soft Dependency Constraints for Reordering in Hierarchical Phrase-Based Translation

Author: Yang Gao ; Philipp Koehn ; Alexandra Birch

Abstract: Long-distance reordering remains one of the biggest challenges facing machine translation. We derive soft constraints from the source dependency parsing to directly address the reordering problem for the hierarchical phrase-based model. Our approach significantly improves Chinese–English machine translation on a large-scale task by 0.84 BLEU points on average. Moreover, when we switch the tuning function from BLEU to the LRscore which promotes reordering, we observe total improvements of 1.21 BLEU, 1.30 LRscore and 3.36 TER over the baseline. On average our approach improves reordering precision and recall by 6.9 and 0.3 absolute points, respectively, and is found to be especially effective for long-distance reordering.

5 0.4271138 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning

Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh

Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.
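The composition step described here, a relational word represented as a matrix applied to the vectors of its arguments, can be sketched for transitive sentences as a pointwise product of the verb matrix with the outer product of subject and object vectors. This is a toy illustration of one composition variant from this line of work, not the authors' exact implementation; all vectors below are made-up values.

```python
def outer(u, v):
    """Outer (tensor) product of two vectors, as a nested list."""
    return [[a * b for b in v] for a in u]

def transitive_sentence(subj, verb_matrix, obj):
    """Compose a transitive sentence vector, flattened to a list.

    The verb matrix (in the paper, learned from corpus co-occurrences
    of the verb's subjects and objects) is multiplied pointwise with
    subj (outer) obj to give the sentence meaning.
    """
    so = outer(subj, obj)
    return [verb_matrix[i][j] * so[i][j]
            for i in range(len(subj)) for j in range(len(obj))]

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0
```

Disambiguation then amounts to comparing composed sentence vectors with cosine: the same subject and object composed with two different verb matrices yield distinct sentence vectors.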

6 0.42379743 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation

7 0.4191815 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation

8 0.40696147 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing

9 0.40602207 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation

10 0.39177722 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

11 0.38883835 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests

12 0.38152418 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

13 0.38089547 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

14 0.3723022 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax

15 0.37126839 38 emnlp-2011-Data-Driven Response Generation in Social Media

16 0.36974743 15 emnlp-2011-A novel dependency-to-string model for statistical machine translation

17 0.36192042 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French

18 0.36175203 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

19 0.35868922 66 emnlp-2011-Hierarchical Phrase-based Translation Representations

20 0.35509437 77 emnlp-2011-Large-Scale Cognate Recovery