acl acl2011 acl2011-25 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anselmo Penas ; Alvaro Rodrigo
Abstract: There are several tasks where not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. We study here an extension of the accuracy measure with this feature and a very easy-to-understand interpretation. The measure proposed (c@1) has a good balance of discrimination power, stability and sensitivity properties. We also show how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract There are several tasks where not responding is preferable to responding incorrectly. [sent-3, score-0.35]
2 This idea is not new, but despite several previous attempts there isn’t a commonly accepted measure to assess non-response. [sent-4, score-0.195]
3 We study here an extension of the accuracy measure with this feature and a very easy-to-understand interpretation. [sent-5, score-0.185]
4 The measure proposed (c@1) has a good balance of discrimination power, stability and sensitivity properties. [sent-6, score-0.456]
5 We also show how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. [sent-7, score-0.766]
6 This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct. [sent-8, score-0.262]
7 However, there are scenarios where we should consider the possibility of not responding, because this behavior has more value than responding incorrectly. [sent-11, score-0.253]
8 In this case, where multiple choices for a question are offered, choosing a wrong option should be penalized relative to leaving the question unanswered. [sent-17, score-0.407]
9 However, utility functions give an arbitrary value to not responding and ignore the behavior the system shows when it does respond (see Section 2). [sent-19, score-0.322]
10 2), as an extension of accuracy (the proportion of correctly answered questions). [sent-21, score-0.402]
11 Since every question has a correct answer, a non-response is not correct, but it is not incorrect either. [sent-34, score-0.39]
12 Thus, if we want to consider n questions in the evaluation, the measure would be: UF = (1/n) Σ_{i=1}^{n} U(i) = (nac − naw)/n (1) The rationale of this utility function is intuitive: not answering adds no value and wrong answers add negative value. [sent-36, score-1.175]
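As a quick sanity check (a hypothetical sketch, not code from the paper), Formula (1) is a one-liner over the counts nac (correct), naw (wrong) and n (total questions):

```python
def uf(n_ac: int, n_aw: int, n: int) -> float:
    """Utility function of Formula (1): +1 per correct answer, -1 per
    wrong answer, 0 per unanswered question, averaged over n questions."""
    assert n_ac + n_aw <= n, "answered questions cannot exceed the total"
    return (n_ac - n_aw) / n
```

For instance, a run with 60 correct and 20 wrong answers out of 100 questions gets UF = 0.4, while 50 correct and 50 wrong gives UF = 0.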
13 Positive values of UF indicate more correct answers than incorrect ones, while negative values indicate the opposite. [sent-37, score-0.348]
14 However, the utility function is giving an arbitrary value to the preferences (-1, 0, 1). [sent-38, score-0.19]
15 Now we want to interpret in some way the value that Formula (1) assigns to unanswered questions. [sent-39, score-0.441]
16 For this purpose, we need to transform Formula (1) into a more meaningful measure with a parameter for the number of unanswered questions (nu). [sent-40, score-0.613]
17 P(C) = (nac + 0.5·nu)/n (2) Measure (2) provides the same ranking of systems as measure (1). [sent-51, score-0.413]
18 In other words, unanswered questions are receiving the same value as if half of them had been answered correctly. [sent-54, score-0.777]
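The equivalence between measure (2) and the utility function can be verified numerically. The sketch below is hypothetical code, with measure (2) reconstructed as (nac + 0.5·nu)/n, i.e., each unanswered question credited as half a correct answer, consistent with the interpretation just given; it checks that both measures induce the same ranking:

```python
def uf(n_ac: int, n_aw: int, n_u: int) -> float:
    """Formula (1) with n = nac + naw + nu."""
    n = n_ac + n_aw + n_u
    return (n_ac - n_aw) / n

def measure2(n_ac: int, n_aw: int, n_u: int) -> float:
    """Formula (2): each unanswered question counts as half correct."""
    n = n_ac + n_aw + n_u
    return (n_ac + 0.5 * n_u) / n

# measure2 is the affine transform 0.5*(UF + 1), so the two measures
# rank any set of systems identically.
systems = [(60, 20, 20), (50, 30, 20), (80, 20, 0)]
for s in systems:
    assert abs(measure2(*s) - 0.5 * (uf(*s) + 1)) < 1e-12
assert sorted(systems, key=lambda s: uf(*s)) == \
       sorted(systems, key=lambda s: measure2(*s))
```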
19 This does not seem correct, given that not answering is rewarded in the same proportion for all systems, without taking into account the performance they have shown on the answered questions. [sent-55, score-0.499]
20 We need to propose a more sensible estimation for the weight of unanswered questions. [sent-56, score-0.35]
21 1 A rationale for the Value of Unanswered Questions According to the utility function suggested, unanswered questions would have value as if half of them had been answered correctly. [sent-58, score-0.881]
22 Let’s generalize this idea and state our hypothesis more clearly: unanswered questions have the same value as if a proportion of them had been answered correctly. [sent-61, score-0.614]
23 The utility measure (2) corresponds to P(C) in Formula (3), where P(C/¬A) receives a constant value of 0.5. [sent-64, score-0.196]
24 Following this, our measure must consist of two parts: The overall accuracy and a better estimation of correctness over the unanswered questions. [sent-67, score-0.53]
25 2 The Measure Proposed: c@1 From the answered questions we have already observed the proportion of questions that received a correct answer (P(C ∩ A) = nac/n). [sent-69, score-0.974]
26 A system that answers all the questions will receive a score equal to the traditional accuracy measure: nu=0 and therefore c@1=nac/n. [sent-73, score-0.46]
27 Unanswered questions will add value to c@1 as if they were answered with the accuracy already shown. [sent-75, score-0.544]
28 A system that does not return any answer would receive a score equal to 0 due to nac=0 in both summands. [sent-77, score-0.214]
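Putting the three properties above together, c@1 = (nac + (nac/n)·nu)/n; a minimal sketch (hypothetical helper, not the paper's code):

```python
def c_at_1(n_ac: int, n_aw: int, n_u: int) -> float:
    """c@1: answered questions contribute nac/n, and the nu unanswered
    questions are credited as if answered with that same observed accuracy."""
    n = n_ac + n_aw + n_u
    return (n_ac + (n_ac / n) * n_u) / n

assert c_at_1(60, 40, 0) == 0.6    # nu = 0: plain accuracy
assert c_at_1(0, 10, 90) == 0.0    # no correct answers: score 0
assert c_at_1(50, 20, 30) == 0.65  # unanswered credited at 50% accuracy
```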
29 3 Other Estimations for P(C/¬A) In this section we study whether other estimations of P(C/¬A) can provide a sensible measure for QA when unanswered questions are taken into account. [sent-81, score-0.381]
30 1 P(C/¬A) ≡ 0 This estimation considers the absence of response as an incorrect response, and we recover the traditional accuracy (nac/n). [sent-89, score-0.241]
31 2 P(C/¬A) ≡ 1 This estimation considers all unanswered questions as correctly answered. [sent-92, score-0.522]
32 This option is not reasonable and is given only for completeness: systems giving no answers would get the maximum score. [sent-93, score-0.297]
33 5 It could be argued that since we cannot have observations of correctness for unanswered questions, we should assume equiprobability between P(C/¬A) and P(¬C/¬A). [sent-96, score-0.345]
34 As previously explained, in this case we are giving an arbitrary constant value to unanswered questions independently of the system’s performance shown with answered ones. [sent-98, score-0.82]
35 We should be aiming at rewarding systems that do not respond instead of giving wrong answers, not at rewarding the mere fact that a system does not respond. [sent-100, score-0.359]
36 4 P(C/¬A) ≡ P(C/A) An alternative is to estimate the probability of correctness for the unanswered questions as the precision observed over the answered ones: P(C/A)= nac/(nac+ naw). [sent-102, score-0.753]
37 In this case, our measure would be the one shown in Formula (5): P(C) = P(C ∩ A) + P(C/¬A)·P(¬A) = P(C/A)·P(A) + P(C/A)·P(¬A) = P(C/A)·(P(A) + P(¬A)) = P(C/A) = nac/(nac + naw) (5) The resulting measure is again the observed precision over the answered questions. [sent-103, score-0.715]
38 This is not a sensible measure, as it would reward a cheating system that decides to leave all questions unanswered except one for which it is sure to have a correct answer. [sent-104, score-0.66]
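This cheating strategy is easy to verify numerically (hypothetical sketch):

```python
def precision_only(n_ac: int, n_aw: int, n_u: int) -> float:
    """Estimation P(C/¬A) ≡ P(C/A): the score collapses to precision over
    the answered questions (Formula (5)); n_u never enters the result."""
    return n_ac / (n_ac + n_aw)

# A cheating system answers only the one question it is sure about and
# leaves the other 99 unanswered, yet receives a perfect score ...
assert precision_only(1, 0, 99) == 1.0
# ... beating an honest system that answers 80 of 100 correctly.
assert precision_only(80, 20, 0) == 0.8
```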
39 5 P(C/¬A) ≡ P(¬C/A) The last option to be considered explores the idea that systems fail when not responding in the same proportion as they fail when they give an answer (i. [sent-107, score-0.557]
40 Estimating P(C/¬A) as naw / (nac + naw), the measure would be: P(C) = P(C ∩ A) + P(C/¬A)·P(¬A) = P(C ∩ A) + P(¬C/A)·P(¬A) = nac/n + (naw/(nac + naw))·(nu/n) (6) This measure is very easy to cheat. [sent-110, score-1.537]
41 It is possible to obtain almost a perfect score just by answering incorrectly only one question and leaving unanswered the rest of the questions. [sent-111, score-0.598]
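That cheat can likewise be checked against Formula (6) (hypothetical sketch):

```python
def estimation6(n_ac: int, n_aw: int, n_u: int) -> float:
    """Formula (6): P(C) = nac/n + (naw / (nac + naw)) * (nu / n)."""
    n = n_ac + n_aw + n_u
    return n_ac / n + (n_aw / (n_ac + n_aw)) * (n_u / n)

# Answering exactly one question incorrectly and leaving the remaining
# 99 unanswered yields an almost perfect score.
assert estimation6(0, 1, 99) == 0.99
```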
42 For this purpose, we have chosen the method described by Buckley and Voorhees (2000) for assessing the stability and discrimination power, as well as the method described by Voorhees and Buckley (2002) for examining the sensitivity of our measure. [sent-113, score-0.329]
43 We have compared the results for c@1 with the ones obtained using both accuracy and the utility function (UF) defined in Formula (1). [sent-115, score-0.173]
44 Then, the experiments about stability and sensitivity will be described. [sent-118, score-0.233]
45 The collection has a set of 500 questions with their answers. [sent-122, score-0.195]
46 The 44 runs in different languages contain the human assessments for the answers given by actual participants. [sent-123, score-0.246]
47 In this case, they had the chance to submit their best candidate in order to assess the performance of their validation module (the one that decides whether or not to give the answer). [sent-125, score-0.172]
48 In order to study the stability of c@1 and to compare it with accuracy, we used the method described by Buckley and Voorhees (2000). [sent-131, score-0.177]
49 The less discriminative the measure is, the more ties between systems there will be. [sent-133, score-0.188]
50 Let f represent the fuzziness value, which is the percent difference between scores such that if the difference is smaller than f then the two scores are deemed to be equivalent. [sent-138, score-0.172]
51 The same algorithm gives us the proportion of ties (Formula (8)), which we use for measuring discrimination power; that is, the lower the proportion of ties, the more discriminative the measure. [sent-141, score-0.535]
52 Figure 1: Algorithm for computing EQM(x,y), GTM(x,y) and GTM(y,x) in the stability method We assume that for each measure the correct decision about whether run x is better than run y happens when there are more cases where the value of x is better than the value of y. [sent-142, score-0.464]
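The counting step of that algorithm can be sketched as follows (a hypothetical implementation of the Buckley and Voorhees method, not the paper's code; the experiments here used 500 questions and subsets of size c = 250):

```python
import random

def stability_counts(scores_x, scores_y, c, trials, f, seed=0):
    """Buckley & Voorhees (2000) counting step: on each of `trials` random
    question subsets of size c, compare the mean per-question scores of runs
    x and y; differences below the fuzziness fraction f count as ties.
    Returns (EQM, GTM(x, y), GTM(y, x))."""
    rng = random.Random(seed)
    n = len(scores_x)
    eq = gt_xy = gt_yx = 0
    for _ in range(trials):
        idx = rng.sample(range(n), c)
        mx = sum(scores_x[i] for i in idx) / c
        my = sum(scores_y[i] for i in idx) / c
        if abs(mx - my) <= f * max(mx, my):
            eq += 1       # scores within fuzziness f: a tie
        elif mx > my:
            gt_xy += 1    # x beats y on this subset
        else:
            gt_yx += 1    # y beats x on this subset
    return eq, gt_xy, gt_yx
```

The error rate is then estimated from min(GTM(x,y), GTM(y,x)) over the number of trials, and the proportion of ties from EQM over the number of trials; a larger f produces more ties but fewer errors.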
53 On the other hand, it is clear that larger fuzziness values decrease the error rate but also decrease the discrimination power of a measure. [sent-144, score-0.229]
54 Since a fixed fuzziness value might imply different trade-offs for different metrics, we decided to vary the fuzziness value from 0. [sent-145, score-0.324]
55 In the Figure we can see how there is a consistent decrease of the error rate of all measures when the proportion of ties increases (this corresponds to the increase in the fuzziness value). [sent-149, score-0.275]
56 Figure 2 shows that the curves of accuracy and c@1 are quite similar (with slightly better behavior for c@1), which means that they have similar stability and discrimination power. [sent-150, score-0.307]
57 The results suggest that the three measures are quite stable, with c@1 and accuracy showing a lower error rate than UF as the proportion of ties grows. [sent-151, score-0.249]
58 These curves are similar to the ones obtained for other QA evaluation measures (Sakai, 2007a). (Figure 2: Error-rate / Proportion of ties curves for accuracy, c@1 and UF with c = 250.) [sent-152, score-0.21]
59 3 Sensitivity The swap-rate (Voorhees and Buckley, 2002) represents the chance of obtaining a discrepancy between two question sets (of the same size) as to whether a system is better than another given a certain difference bin. [sent-154, score-0.179]
60 Looking at the swap-rates of all the difference performance bins, the performance difference required in order to conclude that a run is better than another for a given confidence value can be estimated. [sent-155, score-0.236]
61 For example, if we want to know the required difference for concluding that system A is better than system B with a confidence of 95%, then we select the difference that represents the first bin where the swap-rate is lower than or equal to 0.05. [sent-156, score-0.226]
62 The sensitivity of the measure is the number of times among all the comparisons in the experiment where this performance difference is obtained (Sakai, 2007b). [sent-158, score-0.285]
63 The swap method works as follows: let S denote a set of runs, let x and y denote a pair of runs from S. [sent-161, score-0.203]
64 These results show that their sensitivity values are similar, and higher than the value for accuracy. [sent-184, score-0.192]
65 The aim is to compare the results of the proposed c@1 measure with accuracy in order to compare their behavior. [sent-187, score-0.185]
66 (i) number of questions correctly answered; (ii) number of questions incorrectly answered; (iii) number of unan- swered questions. [sent-192, score-0.426]
67 Table 3 shows a couple of examples where two systems have answered correctly a similar number of questions. [sent-193, score-0.249]
68 However, icia091ro has returned fewer incorrect answers by not responding to some questions. [sent-195, score-0.461]
69 Table 3 shows how accuracy is sensitive only to the number of correct answers, whereas c@1 is able to distinguish when systems keep the number of correct answers but reduce the number of incorrect ones by not responding to some questions. [sent-197, score-0.896]
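The pattern in Table 3 is easy to reproduce numerically (the counts below are hypothetical, not the actual run figures):

```python
def accuracy(n_ac: int, n_aw: int, n_u: int) -> float:
    """Plain accuracy: correct answers over all questions."""
    return n_ac / (n_ac + n_aw + n_u)

def c_at_1(n_ac: int, n_aw: int, n_u: int) -> float:
    """c@1 = (nac + (nac/n)*nu)/n."""
    n = n_ac + n_aw + n_u
    return (n_ac + (n_ac / n) * n_u) / n

# Two systems with the same 50 correct answers over 100 questions;
# system b turns 30 of its wrong answers into non-responses.
a = (50, 50, 0)   # (nac, naw, nu)
b = (50, 20, 30)
assert accuracy(*a) == accuracy(*b) == 0.5  # accuracy cannot tell them apart
assert c_at_1(*a) == 0.5
assert c_at_1(*b) == 0.65                   # c@1 rewards fewer wrong answers
```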
70 5 Related Work The decision to leave a query without response is related to the system's ability to accurately measure its self-confidence in the correctness of its candidate answers. [sent-199, score-0.309]
71 Mean Reciprocal Rank (MRR) has traditionally been used to evaluate Question Answering systems when several answers per question were allowed and given in order (Fukumoto et al. [sent-202, score-0.342]
72 However, as it occurs with Accuracy (the proportion of questions correctly answered), the risk of giving a wrong answer is always preferred over not responding. [sent-204, score-0.304]
73 The QA track at TREC 2001 was the first evaluation campaign in which systems were allowed to leave a question unanswered (Voorhees, 2001). [sent-205, score-0.426]
74 The main evaluation measure was MRR, but performance was also measured by means of the percentage of answered questions and the portion of them that were correctly answered. [sent-206, score-0.571]
75 TREC 2002 discarded the idea of including unanswered questions in the evaluation. [sent-208, score-0.519]
76 Only one answer by question was allowed and all answers had to be ranked according to the system’s self-confidence in the correctness of the answer. [sent-209, score-0.61]
77 Systems were evalu- ated by means of Confidence Weighted Score (CWS), rewarding those systems able to provide more correct answers at the top of the ranking (Voorhees, 2002). [sent-210, score-0.341]
78 The formulation of CWS is the following: CWS = (1/n) Σ_{i=1}^{n} C(i)/i (9) where n is the number of questions, and C(i) is the number of correct answers up to position i in the ranking. [sent-211, score-0.269]
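A sketch of Formula (9), taking a run's answers already sorted by decreasing self-confidence (hypothetical helper, not the official TREC scorer):

```python
def cws(correct_flags):
    """Confidence Weighted Score: correct_flags[i] is 1 if the answer
    ranked at position i+1 is correct; C(i) is the running count of
    correct answers, and CWS averages C(i)/i over all n positions."""
    n = len(correct_flags)
    running_correct = 0
    total = 0.0
    for i, flag in enumerate(correct_flags, start=1):
        running_correct += flag   # C(i)
        total += running_correct / i
    return total / n

# A correct answer at the top contributes to every summand, so the same
# single correct answer is worth far more at rank 1 than at rank 3.
assert cws([1, 0, 0]) > cws([0, 0, 1])
```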
79 Formally: C(i) = Σ_{j=1}^{i} I(j) (10) where I(j) is a function that returns 1 if answer j is correct and 0 if it is not. [sent-212, score-0.276]
80 Since only one answer per question is requested, R equals n (the number of questions) in CWS. [sent-214, score-0.349]
81 However, in the AP formula the summands correspond to the positions of the ranking where there is a relevant result (through the product with I(r)), whereas in CWS every position of the ranking adds value to the measure regardless of whether there is a relevant result at that position. [sent-215, score-0.376]
82 Therefore, CWS gives much more value to some questions than to others: questions whose answers are at the top of the ranking contribute almost all of the value of CWS, whereas questions whose answers are at the bottom of the ranking count almost nothing in the evaluation. [sent-216, score-1.274]
83 Although CWS was aimed at promoting the development of better self-confidence scores, it was discussed as a measure for evaluating QA systems performance. [sent-217, score-0.165]
84 These measures are based on a utility function that returns -1 if the answer is incorrect and 1 if it is correct. [sent-223, score-0.397]
85 K is a variation of K1 for use in evaluations where more than one answer per question is allowed. [sent-225, score-0.349]
86 If the self-score is 0, then the answer is ignored; thus, this measure permits leaving a question unanswered. [sent-226, score-0.476]
87 However, the final value of K1 is difficult to interpret: a positive value does not necessarily indicate more correct answers than incorrect ones, but rather that the sum of the scores of the correct answers is higher than the sum of the scores of the incorrect answers. [sent-228, score-0.852]
88 This could explain the limited success of this measure for evaluating QA systems in favor, again, of the accuracy measure. [sent-229, score-0.223]
89 Thus, the development of better validation technologies (systems able to decide whether the candidate answers are correct or not) is not promoted, even though new QA architectures require them. [sent-233, score-0.44]
90 The starting point was the reformulation of Answer Validation as a Recognizing Textual Entailment problem, under the assumption that hypotheses can be automatically generated by combining the question with the candidate answer (Peñas et al. [sent-242, score-0.382]
91 Thus, validation was seen as a binary classification problem whose evaluation must deal with unbalanced collections (different proportion of positive and negative examples, correct and incorrect answers). [sent-244, score-0.34]
92 For this reason, AVE 2006 used F-measure based on precision and recall for correct answer selection (Peñas et al. [sent-245, score-0.269]
93 With this aim, several measures were proposed to assess the correct selection of candidate answers, the correct rejection of wrong answers and, finally, to estimate the potential gain (in terms of accuracy) that Answer Validation modules can provide to QA (Rodrigo et al. [sent-253, score-0.46]
94 The idea was to give value to correctly rejected answers as if they could have been answered correctly with the accuracy shown when selecting the correct answers. [sent-255, score-0.723]
95 6 Conclusions The central idea of this work is that not responding has more value than responding incorrectly. [sent-257, score-0.286]
96 This idea is not new, but despite several attempts in TREC and CLEF there wasn’t a commonly accepted measure to assess non-response. [sent-258, score-0.195]
97 We have shown also that the proposed measure c@1 has a good balance of discrimination power, stability and sensitivity properties. [sent-261, score-0.456]
98 Finally, we have shown how this measure rewards systems able to maintain the same number of correct answers and at the same time reduce the number of incorrect ones, by leaving some questions unanswered. [sent-262, score-0.713]
99 Among other tasks, measure c@1 is well suited for evaluating Reading Comprehension tests, where multiple choices per question are given, but only one is correct. [sent-263, score-0.3]
100 Non-response must be assessed if we want to measure effective reading and not just the ability to rank options. [sent-264, score-0.206]
wordName wordTfidf (topN-words)
[('unanswered', 0.291), ('naw', 0.248), ('nac', 0.229), ('answer', 0.214), ('answered', 0.213), ('answers', 0.207), ('questions', 0.195), ('voorhees', 0.184), ('responding', 0.175), ('clef', 0.155), ('anselmo', 0.151), ('rodrigo', 0.151), ('cws', 0.139), ('qa', 0.139), ('trec', 0.135), ('question', 0.135), ('uf', 0.131), ('answering', 0.129), ('measure', 0.127), ('stability', 0.119), ('pe', 0.118), ('nas', 0.116), ('felisa', 0.114), ('sensitivity', 0.114), ('validation', 0.104), ('sakai', 0.101), ('discrimination', 0.096), ('formula', 0.095), ('proportion', 0.095), ('buckley', 0.087), ('fuzziness', 0.084), ('nu', 0.081), ('incorrect', 0.079), ('value', 0.078), ('lvaro', 0.076), ('utility', 0.069), ('correct', 0.062), ('ties', 0.061), ('retrieval', 0.061), ('revised', 0.059), ('sensible', 0.059), ('accuracy', 0.058), ('wrong', 0.054), ('correctness', 0.054), ('reward', 0.053), ('ellen', 0.052), ('response', 0.052), ('alvaro', 0.05), ('herrera', 0.05), ('power', 0.049), ('gtm', 0.046), ('reading', 0.046), ('let', 0.046), ('ones', 0.046), ('comprehension', 0.045), ('difference', 0.044), ('bins', 0.044), ('notes', 0.044), ('overview', 0.043), ('leaving', 0.043), ('giving', 0.043), ('mrr', 0.041), ('springer', 0.041), ('option', 0.04), ('lecture', 0.04), ('tetsuya', 0.04), ('runs', 0.039), ('confidence', 0.039), ('interpret', 0.039), ('ranking', 0.038), ('evaluating', 0.038), ('deepqa', 0.038), ('eqm', 0.038), ('nnac', 0.038), ('sama', 0.038), ('valent', 0.038), ('exercise', 0.038), ('ave', 0.038), ('iv', 0.037), ('accomplish', 0.037), ('denote', 0.036), ('sigir', 0.036), ('correctly', 0.036), ('measures', 0.035), ('assess', 0.035), ('concluding', 0.035), ('rationale', 0.035), ('curves', 0.034), ('decide', 0.034), ('fukumoto', 0.034), ('rewarding', 0.034), ('candidate', 0.033), ('reliability', 0.033), ('want', 0.033), ('idea', 0.033), ('required', 0.031), ('multilingual', 0.03), ('campaigns', 0.029), ('showing', 0.029), ('entailment', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 25 acl-2011-A Simple Measure to Assess Non-response
Author: Anselmo Penas ; Alvaro Rodrigo
Abstract: There are several tasks where not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. We study here an extension of the accuracy measure with this feature and a very easy-to-understand interpretation. The measure proposed (c@1) has a good balance of discrimination power, stability and sensitivity properties. We also show how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
2 0.20801064 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
Author: Shuguang Li ; Suresh Manandhar
Abstract: In this paper we address the problem of question recommendation from large archives of community question answering data by exploiting the users’ information needs. Our experimental results indicate that questions based on the same or similar information need can provide excellent question recommendation. We show that translation model can be effectively utilized to predict the information need given only the user’s query question. Experiments show that the proposed information need prediction approach can improve the performance of question recommendation.
3 0.15346344 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives
Author: Guangyou Zhou ; Li Cai ; Jun Zhao ; Kang Liu
Abstract: Community-based question answer (Q&A;) has become an important issue due to the popularity of Q&A; archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in Q&A; archives aims to find historical questions that are semantically equivalent or relevant to the queried questions. In this paper, we propose a novel phrase-based translation model for question retrieval. Compared to the traditional word-based translation models, the phrasebased translation model is more effective because it captures contextual information in modeling the translation ofphrases as a whole, rather than translating single words in isolation. Experiments conducted on real Q&A; data demonstrate that our proposed phrasebased translation model significantly outperforms the state-of-the-art word-based translation model.
Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea
Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.
5 0.094352692 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
6 0.083721206 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
7 0.082036965 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges
8 0.081417643 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
9 0.071877666 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
10 0.064350791 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
11 0.058573615 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
12 0.053291228 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
13 0.048722032 109 acl-2011-Effective Measures of Domain Similarity for Parsing
14 0.047510181 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
15 0.046629842 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment
16 0.046581592 200 acl-2011-Learning Dependency-Based Compositional Semantics
17 0.045106821 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
18 0.045103319 22 acl-2011-A Probabilistic Modeling Framework for Lexical Entailment
19 0.044702541 194 acl-2011-Language Use: What can it tell us?
20 0.043907836 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model
topicId topicWeight
[(0, 0.136), (1, 0.032), (2, -0.033), (3, 0.065), (4, -0.069), (5, -0.022), (6, 0.004), (7, -0.017), (8, 0.021), (9, -0.039), (10, 0.005), (11, -0.03), (12, 0.007), (13, -0.007), (14, -0.042), (15, 0.009), (16, -0.001), (17, -0.083), (18, -0.004), (19, -0.03), (20, 0.122), (21, 0.036), (22, -0.098), (23, 0.075), (24, -0.061), (25, -0.072), (26, -0.06), (27, -0.021), (28, 0.006), (29, 0.093), (30, -0.008), (31, -0.065), (32, -0.006), (33, 0.019), (34, 0.04), (35, 0.023), (36, -0.107), (37, -0.093), (38, 0.241), (39, -0.007), (40, -0.195), (41, -0.08), (42, -0.076), (43, 0.045), (44, 0.058), (45, -0.215), (46, 0.082), (47, -0.065), (48, -0.132), (49, -0.073)]
simIndex simValue paperId paperTitle
same-paper 1 0.97245198 25 acl-2011-A Simple Measure to Assess Non-response
Author: Anselmo Penas ; Alvaro Rodrigo
Abstract: There are several tasks where not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. We study here an extension of the accuracy measure with this feature and a very easy-to-understand interpretation. The measure proposed (c@1) has a good balance of discrimination power, stability and sensitivity properties. We also show how this measure is able to reward systems that maintain the same number of correct answers and at the same time decrease the number of incorrect ones, by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given, but only one is correct.
2 0.85250407 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
Author: Shuguang Li ; Suresh Manandhar
Abstract: In this paper we address the problem of question recommendation from large archives of community question answering data by exploiting the users’ information needs. Our experimental results indicate that questions based on the same or similar information need can provide excellent question recommendation. We show that translation model can be effectively utilized to predict the information need given only the user’s query question. Experiments show that the proposed information need prediction approach can improve the performance of question recommendation.
3 0.67773688 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives
Author: Guangyou Zhou ; Li Cai ; Jun Zhao ; Kang Liu
Abstract: Community-based question answer (Q&A;) has become an important issue due to the popularity of Q&A; archives on the web. This paper is concerned with the problem of question retrieval. Question retrieval in Q&A; archives aims to find historical questions that are semantically equivalent or relevant to the queried questions. In this paper, we propose a novel phrase-based translation model for question retrieval. Compared to the traditional word-based translation models, the phrasebased translation model is more effective because it captures contextual information in modeling the translation ofphrases as a whole, rather than translating single words in isolation. Experiments conducted on real Q&A; data demonstrate that our proposed phrasebased translation model significantly outperforms the state-of-the-art word-based translation model.
Author: Michael Mohler ; Razvan Bunescu ; Rada Mihalcea
Abstract: In this work we address the task of computerassisted assessment of short student answers. We combine several graph alignment features with lexical semantic similarity measures using machine learning techniques and show that the student answers can be more accurately graded than if the semantic measures were used in isolation. We also present a first attempt to align the dependency graphs of the student and the instructor answers in order to make use of a structural component in the automatic grading of student answers.
5 0.53046501 200 acl-2011-Learning Dependency-Based Compositional Semantics
Author: Percy Liang ; Michael Jordan ; Dan Klein
Abstract: Compositional question answering begins by mapping questions to logical forms, but training a semantic parser to perform this mapping typically requires the costly annotation of the target logical forms. In this paper, we learn to map questions to answers via latent logical forms, which are induced automatically from question-answer pairs. In tackling this challenging learning problem, we introduce a new semantic representation which highlights a parallel between dependency syntax and efficient evaluation of logical forms. On two standard semantic parsing benchmarks (GEO and JOBS), our system obtains the highest published accuracies, despite requiring no annotated logical forms.
6 0.46890903 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
7 0.44626674 120 acl-2011-Even the Abstract have Color: Consensus in Word-Colour Associations
8 0.42605099 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
9 0.41067743 99 acl-2011-Discrete vs. Continuous Rating Scales for Language Evaluation in NLP
10 0.38308534 294 acl-2011-Temporal Evaluation
11 0.36667252 257 acl-2011-Question Detection in Spoken Conversations Using Textual Conversations
12 0.36194223 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
13 0.34569848 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
14 0.33646405 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
15 0.32901543 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
16 0.32761794 320 acl-2011-Unsupervised Discovery of Domain-Specific Knowledge from Text
17 0.32367709 109 acl-2011-Effective Measures of Domain Similarity for Parsing
18 0.31870311 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon
19 0.31507489 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes
20 0.31291586 74 acl-2011-Combining Indicators of Allophony
topicId topicWeight
[(5, 0.018), (17, 0.032), (37, 0.027), (39, 0.021), (41, 0.028), (55, 0.014), (59, 0.021), (72, 0.02), (91, 0.024), (96, 0.703), (98, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99878532 25 acl-2011-A Simple Measure to Assess Non-response
Author: Anselmo Penas ; Alvaro Rodrigo
Abstract: There are several tasks where not responding is preferable to responding incorrectly. This idea is not new, but despite several previous attempts there is no commonly accepted measure to assess non-response. We study here an extension of the accuracy measure with this feature and a very easy-to-understand interpretation. The proposed measure (c@1) has a good balance of discrimination power, stability, and sensitivity properties. We also show how this measure rewards systems that maintain the same number of correct answers while decreasing the number of incorrect ones by leaving some questions unanswered. This measure is well suited for tasks such as Reading Comprehension tests, where multiple choices per question are given but only one is correct.
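A minimal sketch of the behavior the abstract describes, assuming the standard definition of c@1 (unanswered questions earn partial credit in proportion to the accuracy achieved on the answered ones); the function and variable names here are illustrative:

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """c@1: extends accuracy with a reward for non-response."""
    # Plain accuracy over all questions, answered or not.
    acc = n_correct / n_total
    # Unanswered questions are credited at the system's observed accuracy rate.
    return (n_correct + n_unanswered * acc) / n_total

# A system with 50 correct answers out of 100 that withholds 20
# otherwise-incorrect responses scores higher than plain accuracy:
baseline = c_at_1(50, 0, 100)   # equals accuracy when nothing is unanswered
improved = c_at_1(50, 20, 100)  # same correct count, fewer incorrect answers
```

With no unanswered questions the measure reduces to ordinary accuracy, which is the property that makes it an extension rather than a replacement.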
2 0.99739069 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?
Author: Maoxi Li ; Chengqing Zong ; Hwee Tou Ng
Abstract: Word is usually adopted as the smallest unit in most tasks of Chinese language processing. However, for automatic evaluation of the quality of Chinese translation output when translating from other languages, either a word-level approach or a character-level approach is possible. So far, there has been no detailed study to compare the correlations of these two approaches with human assessment. In this paper, we compare word-level metrics with character-level metrics on the submitted output of English-to-Chinese translation systems in the IWSLT’08 CT-EC and NIST’08 EC tasks. Our experimental results reveal that character-level metrics correlate with human assessment better than word-level metrics. Our analysis suggests several key reasons behind this finding.
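The word-level vs. character-level contrast can be illustrated with a toy n-gram precision, a simplified stand-in for the BLEU-style metrics the paper evaluates; the segmentation and example strings below are invented for illustration:

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Fraction of candidate n-grams also found in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

# Word-level: tokens are segmented words; character-level: tokens are characters.
cand_words = ["机器", "翻译", "系统"]
ref_words = ["机器", "翻译", "的", "系统"]
word_p = ngram_precision(cand_words, ref_words, n=2)            # 1 of 2 word bigrams match
char_p = ngram_precision("".join(cand_words), "".join(ref_words), n=2)  # 4 of 5 character bigrams match
```

Character-level matching sidesteps word segmentation entirely, so partial overlaps that word segmentation hides (here, around the inserted 的) still receive credit.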
3 0.99561173 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
Author: Daniel Emilio Beck
Abstract: In this paper I present a Master’s thesis proposal in syntax-based Statistical Machine Translation. I propose to build discriminative SMT models using both tree-to-string and tree-to-tree approaches. Translation and language models will be represented mainly through the use of Tree Automata and Tree Transducers. These formalisms have important representational properties that make them well suited for syntax modeling. I also present an experiment plan to evaluate these models through the use of a parallel corpus written in English and Brazilian Portuguese.
Author: Vicent Alabau ; Alberto Sanchis ; Francisco Casacuberta
Abstract: In interactive machine translation (IMT), a human expert is integrated into the core of a machine translation (MT) system. The human expert interacts with the IMT system by partially correcting the errors of the system’s output. Then, the system proposes a new solution. This process is repeated until the output meets the desired quality. In this scenario, the interaction is typically performed using the keyboard and the mouse. In this work, we present an alternative modality for interacting with IMT systems by writing on a tactile display or using an electronic pen. An on-line handwritten text recognition (HTR) system has been specifically designed to operate with IMT systems. Our HTR system improves previous approaches in two main aspects. First, HTR decoding is tightly coupled with the IMT system. Second, the proposed language models are context-aware, in the sense that they take into account the partial corrections and the source sentence by using a combination of n-grams and word-based IBM models. The proposed system achieves an important boost in performance with respect to previous work.
5 0.99389392 270 acl-2011-SciSumm: A Multi-Document Summarization System for Scientific Articles
Author: Nitin Agarwal ; Ravi Shankar Reddy ; Kiran GVR ; Carolyn Penstein Rose
Abstract: In this demo, we present SciSumm, an interactive multi-document summarization system for scientific articles. The document collection to be summarized is a list of papers cited together within the same source article, otherwise known as a co-citation. At the heart of the approach is a topic-based clustering of fragments extracted from each article based on queries generated from the context surrounding the co-cited list of papers. This analysis enables the generation of an overview of common themes from the co-cited papers that relate to the context in which the co-citation was found. SciSumm is currently built over the 2008 ACL Anthology; however, the generalizable nature of the summarization techniques and the extensible architecture make it possible to use the system with other corpora where a citation network is available. Evaluation results on the same corpus demonstrate that our system performs better than an existing widely used multi-document summarization system (MEAD).
6 0.99213558 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
7 0.99098331 272 acl-2011-Semantic Information and Derivation Rules for Robust Dialogue Act Detection in a Spoken Dialogue System
8 0.98586994 335 acl-2011-Why Initialization Matters for IBM Model 1: Multiple Optima and Non-Strict Convexity
9 0.97983348 82 acl-2011-Content Models with Attitude
10 0.96801394 41 acl-2011-An Interactive Machine Translation System with Online Learning
11 0.96319312 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
12 0.95350903 266 acl-2011-Reordering with Source Language Collocations
13 0.94496304 264 acl-2011-Reordering Metrics for MT
14 0.94056702 169 acl-2011-Improving Question Recommendation by Exploiting Information Need
15 0.93727893 251 acl-2011-Probabilistic Document Modeling for Syntax Removal in Text Summarization
16 0.93702233 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
17 0.93428731 21 acl-2011-A Pilot Study of Opinion Summarization in Conversations
18 0.93301195 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
19 0.93046242 326 acl-2011-Using Bilingual Information for Cross-Language Document Summarization
20 0.92883801 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages