emnlp emnlp2010 emnlp2010-122 knowledge-graph by maker-knowledge-mining

122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions


Source: pdf

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. [sent-4, score-0.129]

2 In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. [sent-6, score-0.076]

3 We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. [sent-9, score-0.622]

4 1 Introduction The reliable processing of temporal information is an important step in many NLP applications, such as information extraction, question answering, and document summarisation. [sent-11, score-0.568]

5 Consequently, the tasks of identifying and assigning values to temporal expressions have recently received significant attention, resulting in the creation of mature corpus annotation guidelines (e. [sent-12, score-0.875]

6 In particular, the documents in these corpora tend to be limited in length and, in consequence, discourse structure. [sent-30, score-0.159]

7 This impacts on the number, range and variety of temporal expressions they contain. [sent-31, score-0.782]

8 Existing research carried out on the interpretation of temporal expressions, e. [sent-32, score-0.509]

9 , 2005; Mazur and Dale, 2008), suggests that many temporal expressions in documents, especially news stories, can be interpreted fairly simply as being relative to a reference date that is typically the document creation date. [sent-35, score-0.908]

10 This phenomenon does not carry over to longer, more narrative-style documents that describe extended sequences of events, as found, for example, in biographies or descriptions of protracted geo-political events. [sent-36, score-0.076]

11 Consequently, existing corpora are not ideal as development data for systems intended to work on such historical narrations. [sent-37, score-0.089]

12 In this paper we introduce a new annotated corpus of temporal expressions that is intended to address this shortfall. [sent-38, score-0.782]

13 The corpus, which we call WikiWars, consists of 22 documents from English Wikipedia that describe the historical course of wars. [sent-39, score-0.113]

14 Despite the small number of documents, their length means that the corpus yields a large number of temporal expressions, and poses new challenges for tracking [footnote 3: see corpora LDC2005T07 and LDC2006T06 in the LDC catalogue (http://www.] [sent-40, score-0.596]

15 temporal focus through extended texts. [sent-47, score-0.509]

16 The corpus has been made available for others to use; to give an indication of the difficulty of processing the temporal phenomena in the texts, we also report on the performance of DANTE, our temporal expression tagger, on detecting and interpreting the temporal expressions in the corpus. [sent-48, score-1.957]

17 In Section 2 we describe related work, focusing on the TIMEX2 annotation scheme, and existing corpora that contain annotations of temporal expressions using this scheme. [sent-50, score-0.956]

18 In Section 4 we comment on some artefacts of Wikipedia articles that impact on the annotation process and the use of this corpus. [sent-52, score-0.133]

19 In Section 6 we report on the performance of our temporal expression tagger on this data set. [sent-54, score-0.676]

20 2 Related Work At the time of writing, there are two mature, wide-coverage schemes for the annotation of temporal information in texts: TIMEX2 (Ferro et al. [sent-56, score-0.566]

21 These schemes were used to annotate corpora that are often used in research on temporal expression recognition and normalisation: the series of corpora used for training and evaluation in the Automatic Content Extraction (ACE) program run in 2004, 2005 and 2007, and the TimeBank Corpus. [sent-61, score-0.726]

22 The ACE corpora were prepared for the development and evaluation of systems participating in the ACE program. [sent-62, score-0.093]

23 One is that most of the documents are relatively short, so that the average number of temporal expressions per document is low (typically between seven and nine per document, including the document time stamp as a metadata element). [sent-74, score-1.055]

24 This results in very limited temporal discourse structure, and relatively few underspecified and relative temporal expressions. [sent-75, score-1.049]

25 Unfortunately, these are the more difficult temporal expressions to handle, and so the ACE corpora may not serve as a good baseline for performance more generally. [sent-76, score-0.834]

26 Unfortunately, this corpus has the same limitations as the ACE corpora in regard to document length and complexity of discourse structure. [sent-82, score-0.142]

27 Further, TimeBank is annotated with TimeML, a scheme more complex than TIMEX2 since it also encompasses the tagging of events and temporal relations. [sent-83, score-0.561]

28 However, TIMEX2 is sufficiently sophisticated for the annotation of most types of temporal expressions, and our review of the literature reveals that the majority of existing temporal taggers output TIMEX2 annotations. [sent-84, score-1.075]

29 3 Creating WikiWars Given the above concerns, we were particularly interested in developing a corpus that would allow more rigorous testing of techniques for tracking time across extended narratives, since these give rise to more complex temporal phenomena than are found in simpler documents. [sent-86, score-0.544]

30 After considering various types of historical narrative, we settled on descriptions of the course of wars and conflicts as being particularly rich in the kinds of phenomena we wanted to explore. [sent-88, score-0.16]

31 1 Selecting Data We queried Google with two phrases, ‘most famous wars in history’ and ‘the biggest wars’, and in each case chose the top-ranked result. [sent-90, score-0.123]

32 One of the pages found proposed a list of the 10 most famous wars in history, and the other listed the names of the 20 biggest wars that happened in the 20th century, measured in terms of the number of military deaths. [sent-91, score-0.283]

33 Wikipedia did not contain an article for one war, and we considered two articles as inappropriate for our purposes since they did not describe the course of the wars, but rather some general information about the conflicts. [sent-93, score-0.111]

34 Finally, we converted each of the text files into an SGML file: each document was wrapped in one DOC tag, inside which there are DOCID, DOCTYPE and DATETIME tags. [sent-100, score-0.091]

35 The document time stamp is the date and time at which we downloaded the page from Wikipedia to our local repository. [sent-101, score-0.205]
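The wrapping step described in sentences 34 and 35 can be sketched as follows. This is only an illustration, not the authors' tooling: the tag names DOC, DOCID, DOCTYPE and DATETIME come from the text, while the function name `wrap_as_sgml`, the default DOCTYPE value, the ISO time-stamp format, the document id and the download date are all assumptions made for the example.

```python
from datetime import datetime
from xml.sax.saxutils import escape


def wrap_as_sgml(doc_id: str, body: str, downloaded_at: datetime,
                 doc_type: str = "ARTICLE") -> str:
    """Wrap plain text in one DOC element with DOCID, DOCTYPE and DATETIME.

    Only the tag names follow the description in the text; the default
    doc_type and the ISO time-stamp format are illustrative assumptions.
    """
    lines = [
        "<DOC>",
        f"<DOCID>{escape(doc_id)}</DOCID>",
        f"<DOCTYPE>{escape(doc_type)}</DOCTYPE>",
        # The document time stamp is the moment the page was downloaded.
        f"<DATETIME>{downloaded_at.isoformat(timespec='seconds')}</DATETIME>",
        escape(body),
        "</DOC>",
    ]
    return "\n".join(lines)


# Hypothetical document id and download time, for illustration only.
sgml = wrap_as_sgml("example_doc", "Nixon said in an announcement ...",
                    datetime(2010, 1, 1, 12, 0, 0))
print(sgml)
```

Escaping the body keeps stray `<` or `&` characters in the article text from being read as markup by a downstream SGML or XML tool.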

36 3 Creating Gold Standard Annotations Having prepared the input SGML documents, we then processed them with the DANTE temporal expression tagger (see Mazur and Dale (2007)). [sent-105, score-0.717]

37 DANTE outputs the original SGML documents augmented with an inline TIMEX2 annotation for each temporal expression found. [sent-106, score-0.794]

38 These output files can be imported to Callisto, an annotation tool that supports TIMEX2 annotations. [sent-107, score-0.126]

39 This process also included the annotation of any temporal expression missed by the automatic tagger, and the removal of spurious matches. [sent-111, score-0.679]

40 Then, Annotator 2 (the second author) checked all the revised annotations and prepared a list of errors found and doubts or queries in regard to potentially problematic annotations. [sent-112, score-0.15]

41 The final SGML files containing inline annotations were then transformed into ACE APF XML annotation files, this being the stand-off markup format developed for ACE evaluations. [sent-114, score-0.193]

42 The resulting corpus is thus available in two formats: one contains the original documents enriched with inline annotations, and the other consists of stand-off annotations in the ACE APF format. [sent-116, score-0.18]
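The difference between the two distribution formats in sentences 41 and 42 can be illustrated with a small conversion from inline tags to stand-off offsets. This sketch does not produce ACE APF XML; the dictionary layout, the function name `to_standoff` and the sample VAL attribute are assumptions, and nested TIMEX2 tags are deliberately not handled.

```python
import re

# Matches one non-nested inline TIMEX2 element and captures attributes and extent.
TIMEX2_RE = re.compile(r"<TIMEX2\b([^>]*)>(.*?)</TIMEX2>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def to_standoff(inline: str):
    """Return (plain_text, annotations) for non-nested inline TIMEX2 tags.

    Each annotation records character offsets into the tag-free text;
    'end' is exclusive.
    """
    plain_parts, annotations = [], []
    pos = 0      # read position in the inline string
    offset = 0   # write position in the tag-free text
    for match in TIMEX2_RE.finditer(inline):
        before = inline[pos:match.start()]
        plain_parts.append(before)
        offset += len(before)
        extent = match.group(2)
        annotations.append({
            "start": offset,
            "end": offset + len(extent),
            "text": extent,
            "attrs": dict(ATTR_RE.findall(match.group(1))),
        })
        plain_parts.append(extent)
        offset += len(extent)
        pos = match.end()
    plain_parts.append(inline[pos:])
    return "".join(plain_parts), annotations


text, anns = to_standoff(
    'In <TIMEX2 VAL="1978-12">December 1978</TIMEX2> a treaty was secured.'
)
print(text)
print(anns)
```

Keeping offsets relative to the tag-free text means the source document can stay untouched while annotations are revised separately, which is the usual motivation for a stand-off format.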

43 One instance of this phenomenon relates to the fact that the TIMEX2 guidelines state that the provision of some attribute values for what are called event-based expressions (such as three weeks after the siege of Boston began or the first year of the American invasion) is optional. [sent-123, score-0.452]

44 Second, time zone information is supposed to be marked only for expressions which have it explicitly stated. [sent-130, score-0.339]

45 We also found that, in a not insignificant number of cases, it is impossible to provide a precise and correct value for a temporal expression. [sent-132, score-0.509]

46 For example, the TIMEX2 guidelines stipulate that the anchors of durations cannot have a MOD attribute, so if the anchor is mid-August, the value of the anchor must refer to August, which is not entirely correct as the semantics of mid- is lost. [sent-133, score-0.088]

47 TIMEX2 only supports nonspecific expressions which have explicit information about granularity. [sent-134, score-0.273]

48 One might consider using the typical durations of events of the corresponding types in such cases, but this solution also has problems (see (Pan et al. [sent-136, score-0.104]

49 As is acknowledged in the TIMEX2 guidelines, the treatment of set expressions (i. [sent-138, score-0.273]

50 One rule states that set expressions should not be anchored (Ferro et al. [sent-143, score-0.273]

51 42); this has the consequence that the full semantics of the expression annually since 1955 cannot be provided, and the expression is therefore treated as two separate expressions, annually and 1955. [sent-145, score-0.226]

52 Finally, alternative calendars are not supported, so an expression like February in the pre-revolutionary Russian calendar cannot receive a value unless it appears in an appositive construction which provides an alternative description. [sent-146, score-0.152]

53 Here, 18 Brumaire of the Year VIII is a date in an alternative calendar used in France, but we annotated only the Year VIII based on the trigger year. [sent-148, score-0.106]

54 5 Corpus Statistics The corpus contains 22 documents with a total of almost 120,000 tokens and 2,671 temporal expressions annotated in TIMEX2 format. [sent-151, score-0.858]

55 WikiWars has an order of magnitude more temporal expressions in each document, and a slightly higher density of temporal expressions than the other corpora. [sent-155, score-1.564]
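The figures in sentences 54 and 55 can be turned into per-document and per-token rates with a few lines of arithmetic. The token count below is the rounded "almost 120,000" from the text, so the density is an approximation (slightly low if the true count is smaller); the variable names are only for illustration.

```python
documents = 22
tokens = 120_000   # "almost 120,000" in the text, used here as a round figure
timexes = 2_671

per_document = timexes / documents
per_1000_tokens = 1000 * timexes / tokens

print(f"{per_document:.1f} expressions per document")
print(f"{per_1000_tokens:.1f} expressions per 1,000 tokens")
```

Roughly 121 expressions per document, set against the seven to nine per document reported for the ACE corpora in sentence 23, is what the "order of magnitude" comparison refers to.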

56 Table 2 presents statistics on the individual documents that make up the corpus. [sent-156, score-0.076]

57 On average, each article was changed almost 52 times per month, with the monthly number of changes for a single article ranging from 1 to 372. [sent-173, score-0.132]

58 The nature of the revision process in Wikipedia leads to some artefacts that may not be typical of other document sources, such as news, where the text is usually carefully prepared by its author and checked by an editor. [sent-177, score-0.131]

59 This may be the result of a number of modifications made by different authors, or it may be due to a lack of writing skill on the part of the person who wrote the paragraph in question. [sent-194, score-0.081]

60 Clearly such instances of incoherence will cause problems for any process that attempts to track the temporal focus. [sent-201, score-0.509]

61 Consider the following example: (3) The Afghan government, having secured a treaty in December 1978 that allowed them to call on Soviet forces, repeatedly requested the introduction of troops in Afghanistan in the spring and summer of 1979. [sent-204, score-0.182]

62 They requested Soviet troops to provide security and to assist in the fight against the mujahideen rebels. [sent-205, score-0.138]

63 ] After a month, the Afghan requests were no longer for individual crews and subunits, but for regiments and larger units. [sent-212, score-0.087]

64 In July, the Afghan government requested that two motorized rifle divisions be sent to Afghanistan. [sent-213, score-0.135]

65 The following day, they requested an airborne division in addition to the earlier requests. [sent-214, score-0.138]

66 Here, in the first paragraph there are four temporal expressions related to the Afghan government asking for troops and equipment. [sent-215, score-0.949]

67 There is also one date related to the Soviets’ reply to these requests and sending of tanks, and one date related to the arrival of an airborne battalion. [sent-216, score-0.236]

68 The second paragraph starts with after a month; the first possible interpretation is that this is a month after the 7th July mentioned in the previous paragraph; i. [sent-217, score-0.219]

69 It is also unclear whether the second paragraph talks about the same request for airborne forces which was mentioned in the first paragraph: both these events are dated July. [sent-222, score-0.245]

70 This may suggest that what at first looks just like a careless and ambiguous use of the expression after a month is in fact a larger problem of lack of coherency in these two paragraphs. [sent-224, score-0.284]

71 3 Use of Deictic Expressions One of the articles, 07_IraqWar, contained a number of deictic temporal expressions, indicative of the fact that the events described were happening contemporaneously to the time of writing (as is often the case in news stories); for example: (4) a. [sent-226, score-0.64]

72 Democrats plan to push legislation this spring that would force the Iraqi government to spend its own surplus to rebuild. [sent-227, score-0.102]

73 Obviously, after some time these expressions will no longer make sense, since there is no ‘at-the-time-of-writing’ time stamp associated with these sentences: for the reader of a Wikipedia article, the reference date is the time of reading. [sent-230, score-0.419]

74 Arguably, these sentences should be corrected, making the temporal expressions fully-specified (e. [sent-232, score-0.782]

75 in spring of that year and the following year) if there is a context in the article which supports their correct interpretation. [sent-236, score-0.183]

76 Of course, not only the temporal expressions need to be revised, but also the tense and aspect of the verbs used in the sentences. [sent-237, score-0.782]

77 In the gold standard annotations, however, we provided the values by interpreting these expressions with respect to the document time stamp (i. [sent-238, score-0.455]

78 The italicized temporal expression is difficult to detect, and it is not clear how it should be annotated. [sent-244, score-0.622]

79 Note also that the expression combines a time zone with a date, rather than with a time. [sent-247, score-0.179]

80 5 Quotes Missing a Time Stamp Occasionally it happens that an article contains a quoted utterance, but there is no indication of when the utterance was made. [sent-250, score-0.099]

81 For example, in the document 05_VietnamWar we find the following: (6) Nixon said in an announcement, “I am tonight announcing plans for the withdrawal of an additional 150,000 American troops to be completed during the spring of next year. [sent-251, score-0.199]

82 ” It is impossible to determine what dates are meant by the three temporal expressions present in the announcement. [sent-253, score-0.782]

83 In some cases this information may be provided in citation footnotes, but this is not always the case; when this is absent, such expressions can only be annotated at the level of textual extent and a localised, context-dependent semantics. [sent-254, score-0.273]

84 1 Vocabulary Differences First, we found differences on the level of the lexical triggers that signal the presence of temporal expressions. [sent-257, score-0.509]

85 Some tokens are combined into what we call trigger classes; for example, all weekday names belong to the class WEEKDAYNAME. [sent-260, score-0.083]

86 the names of temporal units (such as month and year). [sent-263, score-0.717]

87 On the other hand, weekday names are quite frequent in the ACE corpus, but do not occur in the table for WikiWars. [sent-267, score-0.083]
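The idea of trigger classes in sentences 85 to 87 can be sketched as a simple lookup. Only the class name WEEKDAYNAME appears in the text; the MONTHNAME class, the unit list and the function names below are hypothetical, and this is not how DANTE itself is implemented. Months are matched case-sensitively to avoid counting the modal verb "may", although a sentence-initial "May" would still be miscounted, and plural unit names such as "weeks" are not handled.

```python
from collections import Counter

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}
MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}
UNITS = {"day", "week", "month", "year", "decade", "century"}
PUNCT = ".,;:!?()[]\"'"


def trigger_class(token):
    """Map a token to a trigger (class) name, or None if it is not a trigger."""
    word = token.strip(PUNCT)
    if word.lower() in WEEKDAYS:
        return "WEEKDAYNAME"      # class name taken from the text
    if word in MONTHS:
        return "MONTHNAME"        # hypothetical class name
    if word.lower() in UNITS:
        return word.lower()       # unit names are kept as individual triggers
    return None


def trigger_counts(text):
    """Count triggers in whitespace-tokenised text."""
    return Counter(c for c in map(trigger_class, text.split()) if c)


counts = trigger_counts(
    "After a month, the requests changed. In July, two divisions "
    "were requested on a Monday."
)
print(counts)
```

Comparing such counts across corpora is one way to see the kind of vocabulary difference the text describes, for example weekday names being frequent in ACE but absent from the WikiWars table.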

88 2 Temporal Discourse Structure A more interesting property that WikiWars exhibits, and which is noticeably absent from the simpler ACE data, is what we might think of as a discourse mechanism for resetting the temporal focus. [sent-285, score-0.54]

89 In these cases, the discourse does not follow a single global timeline from the beginning to the end of the document, but is rather divided into subdiscourses which describe separate chains of events that often have common temporal starting points. [sent-287, score-0.592]

90 In most cases the switch to a different ‘part of the story’ can be determined not only by analysing the events and their geographic locations, but by recognizing that the first date appearing in the new subdiscourse is generally fully specified. [sent-291, score-0.119]
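The contrast between anchoring on the document time stamp and tracking a temporal focus that is reset by fully specified dates can be sketched with the expressions from example (3). This is a deliberately simplified illustration that tracks only the year; the `Mention` class, both function names and the document year 2010 are assumptions, not a description of DANTE.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Mention:
    text: str
    year: Optional[int]    # None when the expression does not state a year
    month: Optional[int]   # None when the expression does not state a month


Resolved = Tuple[str, int, Optional[int]]


def resolve_with_timestamp(mentions: List[Mention], doc_year: int) -> List[Resolved]:
    """Fill every missing year from the document time stamp."""
    return [(m.text, m.year if m.year is not None else doc_year, m.month)
            for m in mentions]


def resolve_with_focus(mentions: List[Mention], doc_year: int) -> List[Resolved]:
    """Fill a missing year from the most recent explicitly stated year."""
    focus_year = doc_year  # fallback until the text states a year
    resolved = []
    for m in mentions:
        if m.year is not None:
            focus_year = m.year  # an explicit year resets the temporal focus
        resolved.append((m.text, focus_year, m.month))
    return resolved


mentions = [
    Mention("December 1978", 1978, 12),
    Mention("the spring and summer of 1979", 1979, None),
    Mention("July", None, 7),
]

# 2010 is a hypothetical download year used only for this example.
print(resolve_with_timestamp(mentions, 2010)[-1])
print(resolve_with_focus(mentions, 2010)[-1])
```

Under the time-stamp strategy "July" is placed in the download year, whereas the focus-based strategy places it in 1979, which is the reading the narrative intends.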

91 6 Automated Processing of WikiWars After we developed the WikiWars corpus, we used it to evaluate our temporal expression tagger, DANTE, which had been developed for participation in ACE. [sent-304, score-0.622]

92 Performance at finding temporal expressions in text is traditionally reported, for example by (Mani and Wilson, 2000; Negri and Marseglia, 2005; Teissèdre et al. [sent-305, score-0.782]

93 So, we extended DANTE’s coverage with approximately 20 temporal triggers and modifiers to include the more common vocabulary that appeared in the WikiWars data; we also modified the recognition grammar to reduce the number of spurious matches and extent errors. [sent-321, score-0.509]

94 This is most likely because the strategy of using the document time stamp for the interpretation of context-dependent expressions does not work at all for WikiWars documents, whereas it works well for ACE documents, in line with our earlier comments in regard to the genres of the documents. [sent-332, score-0.411]

95 This emphasises the need to develop sophisticated methods for temporal focus tracking if we are to extend current time-stamping technologies beyond the relatively simplistic temporal structures found in currently available corpora. [sent-333, score-1.053]

96 7 Conclusions and Future Work We have presented a new corpus based on the historical descriptions of 22 wars sourced from English Wikipedia, and we have described in detail the methodology adopted to construct the corpus; the corpus can be easily extended in the same way. [sent-334, score-0.16]

97 We annotated temporal expressions in these documents with TIMEX2 tags, which provide both the textual extents and the semantics of the expressions in the context of the whole article. [sent-335, score-1.131]

98 A cascaded machine learning approach to interpreting temporal expressions. [sent-345, score-0.553]

99 Automatic time expression labeling for english and chinese text. [sent-375, score-0.113]

100 Resources for calendar expressions semantic tagging and temporal navigation through texts. [sent-439, score-0.821]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('temporal', 0.509), ('wikiwars', 0.414), ('ace', 0.28), ('expressions', 0.273), ('month', 0.171), ('wars', 0.123), ('dante', 0.118), ('expression', 0.113), ('wikipedia', 0.11), ('timebank', 0.107), ('timeml', 0.092), ('stamp', 0.079), ('pustejovsky', 0.079), ('afghan', 0.077), ('requested', 0.077), ('documents', 0.076), ('year', 0.073), ('date', 0.067), ('article', 0.066), ('zone', 0.066), ('mani', 0.066), ('annotations', 0.065), ('airborne', 0.061), ('boguraev', 0.061), ('brumaire', 0.061), ('mazur', 0.061), ('troops', 0.061), ('document', 0.059), ('government', 0.058), ('annotation', 0.057), ('february', 0.055), ('war', 0.055), ('tagger', 0.054), ('durations', 0.052), ('ahn', 0.052), ('sgml', 0.052), ('forces', 0.052), ('events', 0.052), ('corpora', 0.052), ('december', 0.051), ('paragraph', 0.048), ('july', 0.047), ('afghanistan', 0.046), ('crews', 0.046), ('deictic', 0.046), ('negri', 0.046), ('soviet', 0.046), ('soviets', 0.046), ('weekday', 0.046), ('wroc', 0.046), ('articles', 0.045), ('interpreting', 0.044), ('october', 0.044), ('revised', 0.044), ('spring', 0.044), ('requests', 0.041), ('prepared', 0.041), ('weeks', 0.039), ('april', 0.039), ('calendar', 0.039), ('gate', 0.039), ('inline', 0.039), ('tool', 0.037), ('names', 0.037), ('historical', 0.037), ('guidelines', 0.036), ('tonight', 0.035), ('june', 0.035), ('viii', 0.035), ('ferro', 0.035), ('tracking', 0.035), ('writing', 0.033), ('narrative', 0.033), ('quoted', 0.033), ('files', 0.032), ('request', 0.032), ('discourse', 0.031), ('algeria', 0.031), ('apf', 0.031), ('artefacts', 0.031), ('bagram', 0.031), ('branimir', 0.031), ('castan', 0.031), ('cunningham', 0.031), ('elected', 0.031), ('gaulle', 0.031), ('inderjeet', 0.031), ('ingria', 0.031), ('iraqi', 0.031), ('iraqwar', 0.031), ('markable', 0.031), ('marseglia', 0.031), ('mexico', 0.031), ('pawe', 0.031), ('pawel', 0.031), ('poland', 0.031), ('provision', 0.031), ('saur', 0.031), ('setzer', 0.031), ('southeast', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

2 0.13404831 20 emnlp-2010-Automatic Detection and Classification of Social Events

Author: Apoorv Agarwal ; Owen Rambow

Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.

3 0.066931032 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

4 0.058194641 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni

Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.

5 0.05414471 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

Author: Roi Reichart ; Ari Rappoport

Abstract: Polysemy is a major characteristic of natural languages. Like words, syntactic forms can have several meanings. Understanding the correct meaning of a syntactic form is of great importance to many NLP applications. In this paper we address an important type of syntactic polysemy: the multiple possible senses of tense syntactic forms. We make our discussion concrete by introducing the task of Tense Sense Disambiguation (TSD): given a concrete tense syntactic form present in a sentence, select its appropriate sense among a set of possible senses. Using English grammar textbooks, we compiled a syntactic sense dictionary comprising common tense syntactic forms and semantic senses for each. We annotated thousands of BNC sentences using the defined senses. We describe a supervised TSD algorithm trained on these annotations, which outperforms a strong baseline for the task.

6 0.051237006 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

7 0.049315397 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

8 0.048004109 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

9 0.04648732 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

10 0.044280235 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

11 0.044070475 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

12 0.043652248 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

13 0.043261927 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

14 0.043250527 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora

15 0.042447396 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

16 0.042184684 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

17 0.040852472 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

18 0.04021465 114 emnlp-2010-Unsupervised Parse Selection for HPSG

19 0.039347548 80 emnlp-2010-Modeling Organization in Student Essays

20 0.03830348 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, 0.085), (2, -0.057), (3, 0.129), (4, 0.002), (5, -0.062), (6, 0.015), (7, -0.029), (8, -0.001), (9, -0.082), (10, -0.046), (11, -0.035), (12, 0.067), (13, 0.067), (14, 0.028), (15, -0.051), (16, 0.088), (17, 0.15), (18, -0.085), (19, 0.03), (20, 0.044), (21, -0.038), (22, 0.088), (23, -0.104), (24, -0.096), (25, -0.114), (26, 0.16), (27, -0.156), (28, 0.027), (29, -0.096), (30, -0.098), (31, -0.282), (32, -0.144), (33, 0.007), (34, -0.01), (35, 0.085), (36, 0.025), (37, -0.252), (38, 0.122), (39, 0.095), (40, -0.251), (41, 0.02), (42, -0.038), (43, -0.155), (44, 0.094), (45, -0.012), (46, 0.044), (47, -0.02), (48, -0.048), (49, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9751929 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

2 0.51196975 20 emnlp-2010-Automatic Detection and Classification of Social Events

Author: Apoorv Agarwal ; Owen Rambow

Abstract: In this paper we introduce the new task of social event extraction from text. We distinguish two broad types of social events depending on whether only one or both parties are aware of the social contact. We annotate part of Automatic Content Extraction (ACE) data, and perform experiments using Support Vector Machines with Kernel methods. We use a combination of structures derived from phrase structure trees and dependency trees. A characteristic of our events (which distinguishes them from ACE events) is that the participating entities can be spread far across the parse trees. We use syntactic and semantic insights to devise a new structure derived from dependency trees and show that this plays a role in achieving the best performing system for both social event detection and classification tasks. We also use three data sampling approaches to solve the problem of data skewness. Sampling methods improve the F1-measure for the task of relation detection by over 20% absolute over the baseline.

3 0.45377618 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

Author: Roi Reichart ; Ari Rappoport

Abstract: Polysemy is a major characteristic of natural languages. Like words, syntactic forms can have several meanings. Understanding the correct meaning of a syntactic form is of great importance to many NLP applications. In this paper we address an important type of syntactic polysemy: the multiple possible senses of tense syntactic forms. We make our discussion concrete by introducing the task of Tense Sense Disambiguation (TSD): given a concrete tense syntactic form present in a sentence, select its appropriate sense among a set of possible senses. Using English grammar textbooks, we compiled a syntactic sense dictionary comprising common tense syntactic forms and semantic senses for each. We annotated thousands of BNC sentences using the defined senses. We describe a supervised TSD algorithm trained on these annotations, which outperforms a strong baseline for the task.

4 0.37738115 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

Author: Zahar Prasov ; Joyce Y. Chai

Abstract: In situated dialogue humans often utter linguistic expressions that refer to extralinguistic entities in the environment. Correctly resolving these references is critical yet challenging for artificial agents partly due to their limited speech recognition and language understanding capabilities. Motivated by psycholinguistic studies demonstrating a tight link between language production and human eye gaze, we have developed approaches that integrate naturally occurring human eye gaze with speech recognition hypotheses to resolve exophoric references in situated dialogue in a virtual world. In addition to incorporating eye gaze with the best recognized spoken hypothesis, we developed an algorithm to also handle multiple hypotheses modeled as word confusion networks. Our empirical results demonstrate that incorporating eye gaze with recognition hypotheses consistently outperforms the results obtained from processing recognition hypotheses alone. Incorporating eye gaze with word confusion networks further improves performance.

5 0.26025608 46 emnlp-2010-Evaluating the Impact of Alternative Dependency Graph Encodings on Solving Event Extraction Tasks

Author: Ekaterina Buyko ; Udo Hahn

Abstract: In state-of-the-art approaches to information extraction (IE), dependency graphs constitute the fundamental data structure for syntactic structuring and subsequent knowledge elicitation from natural language documents. The top-performing systems in the BioNLP 2009 Shared Task on Event Extraction all shared the idea to use dependency structures generated by a variety of parsers, either directly or in some converted manner, and optionally modified their output to fit the special needs of IE. As there are systematic differences between various dependency representations being used in this competition, we scrutinize on different encoding styles for dependency information and their possible impact on solving several IE tasks. After assessing more or less established dependency representations such as the Stanford and CoNLL-X dependencies, we will then focus on trimming operations that pave the way to more effective IE. Our evaluation study covers data from a number of constituency- and dependency-based parsers and provides experimental evidence which dependency representations are particularly beneficial for the event extraction task. Based on empirical findings from our study we were able to achieve the performance of 57.2% F-score on the development data set of the BioNLP Shared Task 2009.

6 0.24998274 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

7 0.23369963 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

8 0.23253988 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications

9 0.22867659 114 emnlp-2010-Unsupervised Parse Selection for HPSG

10 0.22545433 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

11 0.21498004 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

12 0.20740508 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams

13 0.20600291 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

14 0.20353667 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

15 0.19463767 80 emnlp-2010-Modeling Organization in Student Essays

16 0.17957987 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning

17 0.17390522 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats

18 0.16484927 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

19 0.16456042 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

20 0.16454992 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.013), (10, 0.011), (12, 0.026), (29, 0.047), (30, 0.01), (32, 0.011), (52, 0.018), (56, 0.053), (62, 0.011), (66, 0.071), (72, 0.586), (76, 0.028), (89, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93142653 122 emnlp-2010-WikiWars: A New Corpus for Research on Temporal Expressions

Author: Pawel Mazur ; Robert Dale

Abstract: The reliable extraction of knowledge from text requires an appropriate treatment of the time at which reported events take place. Unfortunately, there are very few annotated data sets that support the development of techniques for event time-stamping and tracking the progression of time through a narrative. In this paper, we present a new corpus of temporally-rich documents sourced from English Wikipedia, which we have annotated with TIMEX2 tags. The corpus contains around 120000 tokens, and 2600 TIMEX2 expressions, thus comparing favourably in size to other existing corpora used in these areas. We describe the preparation of the corpus, and compare the profile of the data with other existing temporally annotated corpora. We also report the results obtained when we use DANTE, our temporal expression tagger, to process this corpus, and point to where further work is required. The corpus is publicly available for research purposes.

2 0.9180252 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

Author: Valentin Zhikov ; Hiroya Takamura ; Manabu Okumura

Abstract: This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the "CHILDES" corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.

3 0.86850673 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

Author: Kostadin Cholakov ; Gertjan van Noord

Abstract: Unknown words are a hindrance to the performance of hand-crafted computational grammars of natural language. However, words with incomplete and incorrect lexical entries pose an even bigger problem because they can be the cause of a parsing failure despite being listed in the lexicon of the grammar. Such lexical entries are hard to detect and even harder to correct. We employ an error miner to pinpoint words with problematic lexical entries. An automated lexical acquisition technique is then used to learn new entries for those words which allows the grammar to parse previously uncovered sentences successfully. We test our method on a large-scale grammar of Dutch and a set of sentences for which this grammar fails to produce a parse. The application of the method enables the grammar to cover 83.76% of those sentences with an accuracy of 86.15%.

4 0.51501411 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka

Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

5 0.46347335 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei

Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.

6 0.46280032 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

7 0.42262393 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue

8 0.39915237 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

9 0.39504883 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

10 0.39355502 20 emnlp-2010-Automatic Detection and Classification of Social Events

11 0.38944501 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

12 0.38528675 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

13 0.38010889 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

14 0.37799156 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

15 0.37199739 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

16 0.37191153 51 emnlp-2010-Function-Based Question Classification for General QA

17 0.36984172 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping

18 0.36670971 80 emnlp-2010-Modeling Organization in Student Essays

19 0.36298543 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

20 0.35962096 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice