acl acl2011 acl2011-337 knowledge-graph by maker-knowledge-mining

337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History


Source: pdf

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. [sent-3, score-0.526]

2 Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. [sent-4, score-0.164]

3 Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. [sent-5, score-0.394]

4 By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. [sent-6, score-0.865]

5 We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history. [sent-8, score-0.116]

6 The majority of Wikipedia-based NLP algorithms work on single snapshots of Wikipedia. [sent-15, score-0.095]

7 Such a snapshot only represents the state of Wikipedia at a certain fixed point in time, while Wikipedia is actually a dynamic resource that is constantly changed by its millions of editors. [sent-18, score-0.07]

8 This is mainly due to older snapshots becoming unavailable, as there is no official backup server. [sent-21, score-0.121]

9 In this paper, we present a toolkit that solves both issues by reconstructing a certain past state of Wikipedia from its edit history, which is offered by the Wikimedia Foundation in the form of a database dump. [sent-23, score-0.357]

10 Besides reconstructing past states of Wikipedia, the revision history data also constitutes a novel knowledge source for NLP algorithms. [sent-25, score-0.725]

11 The sequence of article edits can be used as training data for data-driven NLP algorithms, such as vandalism detection (Chin et al. [sent-26, score-0.172]

12 , 2010), the expansion of textual entailment corpora (Zanzotto and Pennacchiotti, 2010), or assessing the trustworthiness of Wikipedia articles (Zeng et al. [sent-28, score-0.058]

13 However, efficient access to this new resource has been limited by the immense size of the data. [sent-34, score-0.129]

14 The revisions for all articles in the current English Wikipedia sum up to over 5 terabytes of text. [sent-35, score-0.414]

15 Thus, in Section 4, we present a tool to efficiently access Wikipedia’s edit history. [sent-38, score-0.245]

16 It provides an easy-to-use API for programmatically accessing the revision data and reduces the required storage space to less than 2% of its original size. [sent-39, score-0.767]

17 2 Related Work To our knowledge, there are currently only two alternatives to programmatically access Wikipedia’s revision history. [sent-43, score-0.162]

18 One possibility is to manually parse the original XML revision dump. [sent-44, score-0.469]

19 However, due to the huge size of these dumps, efficient random access is infeasible with this approach. [sent-45, score-0.155]

20 However, using a web service entails that the desired revision for every single article has to be requested from the service, transferred over the Internet and then stored locally in an appropriate format. [sent-47, score-0.667]

21 Access to all revisions of all Wikipedia articles for a large-scale analysis is infeasible with this method because it is strongly constricted by the data transfer speed over the Internet. [sent-48, score-0.414]

22 Better suited for tasks of this kind are APIs that utilize databases for storing and accessing the Wikipedia data. [sent-50, score-0.171]

23 However, current database-driven Wikipedia APIs do not support access to article revisions. [sent-51, score-0.256]

24 That is why we decided to extend an established API with the ability to efficiently access article revisions. [sent-52, score-0.162]

25 Wikipedia Miner (Milne and Witten, 2009) is an open-source toolkit which provides access to Wikipedia with the help of a preprocessed database. [sent-56, score-0.192]

26 It represents articles, categories and redirects as Java classes and provides access to the article content either as MediaWiki markup or as plain text. [sent-57, score-0.256]

27 The toolkit mainly focuses on Wikipedia’s structure, the contained concepts, and semantic relations, but it makes little use of the textual content within the articles. [sent-58, score-0.063]

28 Another open-source API for accessing Wikipedia data from a preprocessed database is JWPL (Zesch et al. [sent-60, score-0.216]

29 In addition to that, JWPL contains a MediaWiki markup parser to further analyze the article contents and make available fine-grained information. [sent-63, score-0.127]

30 We have chosen to extend JWPL with our revision toolkit, as it has better support for accessing article contents, natively supports multiple languages, and seems to have a larger and more active developer community. [sent-67, score-0.767]

31 In the following section, we present the parts of the toolkit which reconstruct past states of Wikipedia, while in Section 4, we describe tools that allow efficient access to Wikipedia’s edit history. [sent-68, score-0.454]

32 3 Reconstructing Past States of Wikipedia Access to arbitrary past states of Wikipedia is required to (i) evaluate the performance of Wikipedia-based NLP algorithms over time, and (ii) reproduce Wikipedia-based research results. [sent-69, score-0.103]

33 For this reason, we have developed a tool called TimeMachine, which addresses both of these issues by making use of the revision dump provided by the Wikimedia Foundation. [sent-70, score-0.523]

34 By iterating over all articles in the revision dump and extracting the desired revision of each article, it is possible to recover the state of Wikipedia at an earlier point in time. [sent-71, score-1.05]

35 The TimeMachine is controlled by a single configuration file, which allows (i) restoring individual Wikipedia snapshots or (ii) generating whole snapshot series. [sent-76, score-0.165]

36 The two timestamps define the start and end time of the snapshot series, while the interval between the snapshots in the series is set by the parameter each. [sent-79, score-0.214]

37 In the example, the TimeMachine recovers 13 snapshots between Jan 01, 2009 at 01. [sent-80, score-0.095]
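
To make the snapshot-series parameters concrete, the following self-contained Java sketch enumerates the timestamps such a series would contain, given a start timestamp, an end timestamp, and the interval set by the each parameter. The concrete dates and the 30-day interval below are made-up example values, and the method only illustrates the idea; it is not the TimeMachine's actual implementation.

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class SnapshotSeriesSketch {
    // Enumerate snapshot timestamps from start to end (inclusive) at a fixed
    // interval in days; illustrates the two-timestamps-plus-interval idea only.
    static List<LocalDateTime> snapshotSeries(LocalDateTime start, LocalDateTime end, int eachDays) {
        List<LocalDateTime> series = new ArrayList<>();
        for (LocalDateTime t = start; !t.isAfter(end); t = t.plusDays(eachDays)) {
            series.add(t);
        }
        return series;
    }

    public static void main(String[] args) {
        // Hypothetical values: one snapshot every 30 days over one year.
        List<LocalDateTime> series = snapshotSeries(
                LocalDateTime.of(2009, 1, 1, 1, 0),
                LocalDateTime.of(2010, 1, 1, 1, 0),
                30);
        System.out.println(series.size() + " snapshots");  // prints: 13 snapshots
    }
}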

38 It can be accessed with the JWPL API in the same way as snapshots created using JWPL itself. [sent-89, score-0.119]

39 Issue of Deleted Articles The past snapshot of Wikipedia created by our toolkit is identical to the state of Wikipedia at that time, with the exception of articles that have been deleted in the meantime. [sent-90, score-0.29]

40 Articles might be deleted only by Wikipedia administrators if they are subject to copyright violations, vandalism, spam or other conditions that violate Wikipedia policies. [sent-91, score-0.064]

41 As a consequence, they are removed from the public view along with all their revision infor- mation, which makes it impossible to recover them from any future publicly available dump. [sent-92, score-0.469]

42 Most of the affected pages are newly created duplicates of already existing articles or spam articles. [sent-94, score-0.084]

43 4 Efficient Access to Revisions Even though article revisions are available from the official Wikipedia revision dumps, accessing this information on a large scale is still a difficult task. [sent-95, score-1.123]

44 First, the revision dump contains all revisions as full text. [sent-97, score-0.879]

45 This results in a massive amount of data and makes structured access very hard. [sent-98, score-0.129]

46 Second, there is no efficient API available so far for accessing article revisions on a large scale. [sent-99, score-0.654]

47 First, we describe our solution to the storage problem. [sent-101, score-0.094]

48 4.1 Revision Storage As each revision of a Wikipedia article stores the full article text, the revision history obviously contains a lot of redundant data. [sent-106, score-1.264]

49 The RevisionMachine makes use of this fact and utilizes a dedicated storage format which stores a revision only by means of the changes that have been made to the previous revision. [sent-107, score-0.598]

50 Therefore, we have developed our own diff algorithm, which is based on a longest common substring search and constitutes the foundation for our revision storage format. [sent-110, score-0.6]

51 The processing of two subsequent revisions can be divided into four steps: • First, the RevisionMachine searches for all common substrings with a user-defined minimal length. [sent-111, score-0.356]

52 • Then, the revisions are divided into blocks of equal length. [sent-112, score-0.356]

53 • In the next step, the current revision is represented by means of a sequence of actions performed on the previous revision. [sent-118, score-0.469]

54 For example, in the adjacent revision pair r1 : This is the very first sentence! [sent-119, score-0.469]

55 In addition to the other operations, they can make use of an additional temporary storage register to save the text that is being moved. [sent-125, score-0.058]

56 • Finally, the string representation of this action sequence is compressed and stored in the database. [sent-126, score-0.07]
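
The following self-contained Java sketch illustrates the general idea behind such a delta encoding: represent a revision by operations against its predecessor, anchored at a longest common substring. It is a deliberately simplified stand-in (a single common block, only COPY and INSERT operations, no MOVE or DELETE, no blocks of equal length, no compression), not the RevisionMachine's actual diff algorithm or storage format; r2 in the example is an invented second revision, since only r1 survives in the extracted text above.

import java.util.ArrayList;
import java.util.List;

public class DeltaSketch {
    // Longest common substring of a and b; returns {startInA, startInB, length}.
    static int[] longestCommonSubstring(String a, String b) {
        int[] best = {0, 0, 0};
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best[2]) {
                        best = new int[]{i - dp[i][j], j - dp[i][j], dp[i][j]};
                    }
                }
            }
        }
        return best;
    }

    // Encode r2 relative to r1: COPY the shared block from r1, INSERT everything else.
    // A real delta format would recurse on the unmatched parts and support MOVE/DELETE.
    static List<String> encode(String r1, String r2, int minLength) {
        List<String> ops = new ArrayList<>();
        int[] m = longestCommonSubstring(r1, r2);
        if (m[2] < minLength) {                  // no usable common block: store r2 verbatim
            ops.add("INSERT 0 " + r2);
            return ops;
        }
        if (m[1] > 0) {
            ops.add("INSERT 0 " + r2.substring(0, m[1]));
        }
        ops.add("COPY " + m[0] + " " + m[2]);    // take m[2] chars from r1 at offset m[0]
        if (m[1] + m[2] < r2.length()) {
            ops.add("INSERT " + (m[1] + m[2]) + " " + r2.substring(m[1] + m[2]));
        }
        return ops;
    }

    public static void main(String[] args) {
        String r1 = "This is the very first sentence!";
        String r2 = "This is the second sentence!";   // invented r2 for illustration
        System.out.println(encode(r1, r2, 3));
        // prints: [COPY 0 12, INSERT 12 second sentence!]
    }
}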

57 With this approach, we are able to reduce the demand for disk space for a recent English Wikipedia dump containing all article revisions from 5470 GB to only 96 GB, i.e. to less than 2% of the original size. [sent-127, score-0.565]

58 The compressed data is stored in a MySQL database, which provides sophisticated indexing mechanisms for high-performance access to the data. [sent-130, score-0.199]

59 Obviously, storing only the changes instead of the full text of each revision trades speed for space. [sent-131, score-0.469]

60 Accessing a certain revision now requires reconstructing the text of the revision from a list of changes. [sent-132, score-1.019]

61 As articles often have several thousand revisions, this might take too long. [sent-133, score-0.058]

62 Thus, in order to speed up the recovery of the revision text, every n-th revision is stored as a full revision. [sent-134, score-0.983]

63 A low value of n decreases the time needed to access a certain revision, but increases the demand for storage space. [sent-135, score-0.247]
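
A minimal, self-contained sketch of this recovery scheme (hypothetical data layout, not the RevisionMachine's actual code): walk back to the nearest full revision, then replay at most n-1 deltas forward.

import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

public class RevisionRecoverySketch {
    // Recover the text of revision `target` when only every n-th revision is stored
    // in full and the others as deltas against their predecessor. Assumes revision 0
    // is stored in full.
    static String recover(int target, List<String> stored, boolean[] isFull,
                          BinaryOperator<String> applyDelta) {
        int base = target;
        while (!isFull[base]) base--;                  // nearest preceding full revision
        String text = stored.get(base);
        for (int i = base + 1; i <= target; i++) {     // replay at most n-1 deltas
            text = applyDelta.apply(text, stored.get(i));
        }
        return text;
    }

    public static void main(String[] args) {
        // Toy "delta": each non-full entry simply appends a suffix to the previous text.
        List<String> stored = Arrays.asList("v0", "+a", "+b", "v0abc", "+d");
        boolean[] isFull = {true, false, false, true, false};
        BinaryOperator<String> applyDelta = (prev, delta) -> prev + delta.substring(1);
        System.out.println(recover(2, stored, isFull, applyDelta));  // prints: v0ab
        System.out.println(recover(4, stored, isFull, applyDelta));  // prints: v0abcd
    }
}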

64 4.2 Revision Access After the converted revisions have been stored in the revision database, it can be used either standalone or combined with the JWPL data and accessed via the standard JWPL API. [sent-139, score-0.894]

65 Upon first access, the database user has to have write permission on the database, as indexes have to be created. [sent-142, score-0.075]

66 The RevisionIterator allows iterating over all revisions in Wikipedia. [sent-145, score-0.356]

67 The RevisionAPI grants access to the revisions of individual articles. [sent-146, score-0.485]

68 If hard disk space is not a limiting factor, the parameter can be set to 1 to avoid the compression of the revisions and maximize performance. [sent-147, score-0.384]

69 Listing 1: Setting up the RevisionMachine [sent-148, score-0.256]

// Set up database connection
DatabaseConfiguration db = new DatabaseConfiguration();
// ... connection parameters are set on db; the Wikipedia object is then obtained via getWikipediaConnection(db) ...
RevisionIterator revIt = new RevisionIterator(db);
RevisionApi revApi = new RevisionApi(db);

70 In addition to that, the Wikipedia object provides access to JWPL functionalities. [sent-155, score-0.513]

71 Processing all article revisions in Wikipedia The first use case focuses on the utilization of the complete set of article revisions in a Wikipedia snapshot. [sent-157, score-0.966]

72 Thereby, the iterator ensures that successive revisions always correspond to adjacent revisions of a single article in chronological order. [sent-159, score-0.876]

73 The start of a new article can easily be detected by checking the timestamp and the article id. [sent-160, score-0.254]
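
As a small illustration of that boundary check, here is a self-contained sketch that uses a stand-in record instead of the toolkit's Revision class; it assumes, as stated above, that the iterator delivers revisions grouped by article and in chronological order.

import java.util.List;

public class ArticleBoundarySketch {
    // Stand-in for the toolkit's Revision object, reduced to the two fields the check needs.
    record Rev(int articleId, long timestamp) {}

    public static void main(String[] args) {
        // Revisions as an iterator would deliver them: grouped by article, chronological.
        List<Rev> revisions = List.of(
                new Rev(1, 100), new Rev(1, 200), new Rev(2, 50), new Rev(2, 300));
        int lastArticle = -1;
        for (Rev rev : revisions) {
            // The article id changes at an article boundary; a drop in the timestamp
            // (revisions are chronological within one article) signals the same thing.
            if (rev.articleId() != lastArticle) {
                System.out.println("new article: " + rev.articleId());
                lastArticle = rev.articleId();
            }
            // process revision ...
        }
    }
}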

74 Processing revisions of individual articles The second use case shows how the RevisionMachine can be used to access the edit history of a specific article. [sent-162, score-0.698]

75 The example in Listing 3 illustrates how all revisions for the article Automobile can be retrieved by first performing a page query with the JWPL API and then retrieving all revision timestamps for this page, which can finally be used to access the revision objects. [sent-163, score-1.627]

76 Accessing the meta data of a revision The third use case illustrates the access to the meta data of individual revisions. [sent-164, score-0.78]

77 The meta data includes the name or IP of the contributor, the additional user comment for the revision and a flag that identifies a revision as minor or major. [sent-165, score-1.029]

78 Listing 4 shows how the number of edits and unique contributors can be used to indicate the level of edit activity for an article. [sent-166, score-0.083]
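
The underlying computation is simple; the sketch below works on plain contributor strings (user name or IP per revision) rather than on the toolkit's Revision objects, so it stays self-contained and does not presume any particular accessor of the API.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EditActivitySketch {
    // Given the contributor (user name or IP) of each revision of one article, report
    // the number of edits and of unique contributors as a crude activity indicator.
    static String activity(List<String> contributorPerRevision) {
        Set<String> unique = new HashSet<>(contributorPerRevision);
        return contributorPerRevision.size() + " edits by "
                + unique.size() + " unique contributors";
    }

    public static void main(String[] args) {
        System.out.println(activity(List.of("Alice", "127.0.0.1", "Alice", "Bob")));
        // prints: 4 edits by 3 unique contributors
    }
}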

79 5 Conclusions In this paper, we presented an open-source toolkit which extends JWPL, an API for accessing Wikipedia, with the ability to reconstruct past states of Wikipedia, and to efficiently access the edit history of Wikipedia articles. [sent-167, score-0.697]

80 Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia, and is also a requirement for the creation of time-based series of Wikipedia snapshots and for assessing the influence of Wikipedia growth on NLP algorithms. [sent-168, score-0.283]

81 Furthermore, Wikipedia’s edit history has been shown to be a valuable knowledge source for NLP, which is hard to access because of the lack of efficient tools for managing the huge amount of revision data. [sent-169, score-0.805]

82 By utilizing a dedicated storage format for the revisions, our toolkit massively decreases the amount of data to be stored. [sent-170, score-0.242]

83 At the same time, it provides an easy-to-use interface to access the revision data. [sent-171, score-0.623]

84 We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history. [sent-172, score-0.116]

85 The toolkit will be made available as part of JWPL, and can be obtained from the project’s website at Google Code. [sent-173, score-0.063]

86 Listing 2: Iteration over all revisions of all articles

// Iterate over all revisions of all articles
while (revIt.hasNext()) {
    Revision rev = revIt.next();
    rev.getArticleID();
    // process revision ...
}

89 Listing 3: Accessing the revisions of a specific article

// Get article with title "Automobile"
Page article = wiki.getPage("Automobile");
int id = article.getPageId();
// Get all revisions for the article
Collection<Timestamp> revisionTimeStamps = revApi.getRevisionTimestamps(id);
for (Timestamp t : revisionTimeStamps) {
    Revision rev = revApi.getRevision(id, t);
    // process revision ...
}

94 Listing 4: Accessing the meta data of a revision

// Meta data provided by the RevisionAPI
StringBuffer s = new StringBuffer();
s.append( ... getNumberOfRevisions(pageId) + " revisions." ... );
// ...
// Meta data provided by the Revision objects
s.append( ... ? "Minor" : "Major") + " revision by: " + rev. ... getComment());

98 References: Si-Chi Chin, W. [sent-210, score-0.469]

99 Detecting Wikipedia vandalism with active learning and statistical language models. [sent-213, score-0.453]

100 Mining Wikipedia’s article revision history for training computational linguistics algorithms. [sent-242, score-0.668]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('revision', 0.469), ('wikipedia', 0.408), ('revisions', 0.356), ('jwpl', 0.243), ('revisionmachine', 0.187), ('accessing', 0.171), ('access', 0.129), ('db', 0.128), ('article', 0.127), ('rev', 0.121), ('revi', 0.115), ('revapi', 0.112), ('api', 0.107), ('mediawiki', 0.099), ('listing', 0.096), ('snapshots', 0.095), ('storage', 0.094), ('meta', 0.091), ('edit', 0.083), ('zesch', 0.082), ('reconstructing', 0.081), ('timemachine', 0.075), ('wikimedia', 0.075), ('history', 0.072), ('snapshot', 0.07), ('toolkit', 0.063), ('past', 0.061), ('io', 0.06), ('articles', 0.058), ('append', 0.057), ('amps', 0.056), ('dumps', 0.056), ('iontime', 0.056), ('dump', 0.054), ('timestamps', 0.049), ('yamangil', 0.049), ('iryna', 0.049), ('torsten', 0.049), ('milne', 0.045), ('vandalism', 0.045), ('nelken', 0.045), ('apis', 0.045), ('database', 0.045), ('stored', 0.045), ('reconstruct', 0.043), ('states', 0.042), ('le', 0.041), ('ic', 0.038), ('deleted', 0.038), ('nlp', 0.038), ('diff', 0.037), ('econfigurat', 0.037), ('getrevi', 0.037), ('ionapi', 0.037), ('iterator', 0.037), ('revit', 0.037), ('ribut', 0.037), ('ringbu', 0.037), ('sourceforge', 0.037), ('id', 0.036), ('dedicated', 0.035), ('st', 0.033), ('amp', 0.033), ('programmatically', 0.033), ('consolidate', 0.033), ('prerequisite', 0.033), ('elif', 0.033), ('wpl', 0.033), ('darmstadt', 0.033), ('efficiently', 0.033), ('chin', 0.03), ('medelyan', 0.03), ('permission', 0.03), ('zanzotto', 0.03), ('ffe', 0.03), ('automobile', 0.03), ('rani', 0.03), ('cle', 0.03), ('reproducing', 0.028), ('disk', 0.028), ('gurevych', 0.028), ('yatskar', 0.028), ('page', 0.028), ('zeng', 0.027), ('simplifications', 0.027), ('service', 0.026), ('older', 0.026), ('managing', 0.026), ('spam', 0.026), ('massively', 0.026), ('huge', 0.026), ('interface', 0.025), ('bj', 0.025), ('compressed', 0.025), ('influence', 0.024), ('accessed', 0.024), ('solves', 0.024), ('decreases', 0.024), ('wiki', 0.023), ('gb', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

2 0.20860989 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams.

3 0.15884267 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

4 0.15495156 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

Author: William Coster ; David Kauchak

Abstract: In this paper we examine the task of sentence simplification which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

5 0.14255044 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

6 0.14050518 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

7 0.12791567 285 acl-2011-Simple supervised document geolocation with geodesic grids

8 0.092555359 52 acl-2011-Automatic Labelling of Topic Models

9 0.065601163 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

10 0.059784781 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

11 0.057460174 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

12 0.051793225 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

13 0.051277716 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

14 0.049183898 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

15 0.047474928 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

16 0.047320053 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

17 0.046643011 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

18 0.044045836 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

19 0.043805797 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

20 0.043770783 135 acl-2011-Faster and Smaller N-Gram Language Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.1), (1, 0.029), (2, -0.06), (3, 0.048), (4, -0.012), (5, -0.023), (6, 0.022), (7, -0.047), (8, -0.124), (9, -0.04), (10, -0.032), (11, 0.042), (12, -0.017), (13, -0.055), (14, 0.054), (15, 0.064), (16, 0.211), (17, -0.052), (18, -0.014), (19, -0.148), (20, 0.118), (21, -0.101), (22, -0.043), (23, -0.15), (24, 0.151), (25, -0.036), (26, 0.028), (27, -0.048), (28, 0.053), (29, -0.011), (30, -0.015), (31, 0.036), (32, -0.012), (33, -0.042), (34, 0.06), (35, -0.005), (36, -0.014), (37, -0.013), (38, 0.004), (39, -0.029), (40, 0.024), (41, 0.021), (42, 0.055), (43, 0.095), (44, 0.062), (45, 0.078), (46, -0.054), (47, -0.104), (48, 0.04), (49, -0.082)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97917676 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

2 0.82137948 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

Author: Lev Ratinov ; Dan Roth ; Doug Downey ; Mike Anderson

Abstract: Disambiguating concepts and entities in a context sensitive way is a fundamental problem in natural language processing. The comprehensiveness of Wikipedia has made the online encyclopedia an increasingly popular target for disambiguation. Disambiguation to Wikipedia is similar to a traditional Word Sense Disambiguation task, but distinct in that the Wikipedia link structure provides additional information about which disambiguations are compatible. In this work we analyze approaches that utilize this information to arrive at coherent sets of disambiguations for a given document (which we call “global” approaches), and compare them to more traditional (local) approaches. We show that previous approaches for global disambiguation can be improved, but even then the local disambiguation provides a baseline which is very hard to beat.

3 0.80010271 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexicosyntactic patterns based on n-grams.

4 0.66858518 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

5 0.62982595 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

6 0.53583872 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

7 0.48052195 254 acl-2011-Putting it Simply: a Context-Aware Approach to Lexical Simplification

8 0.47880715 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

9 0.46598861 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

10 0.45789522 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

11 0.43112162 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges

12 0.4039391 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

13 0.38349164 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base

14 0.33780783 298 acl-2011-The ACL Anthology Searchbench

15 0.32214117 291 acl-2011-SystemT: A Declarative Information Extraction System

16 0.29198438 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

17 0.28768274 84 acl-2011-Contrasting Opposing Views of News Articles on Contentious Issues

18 0.28038558 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

19 0.27900398 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

20 0.25697717 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.052), (16, 0.016), (17, 0.03), (26, 0.057), (27, 0.366), (37, 0.053), (39, 0.042), (41, 0.052), (44, 0.012), (55, 0.014), (59, 0.025), (72, 0.027), (91, 0.044), (96, 0.103), (98, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75460231 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

2 0.57919466 96 acl-2011-Disambiguating temporal-contrastive connectives for machine translation

Author: Thomas Meyer

Abstract: Temporal–contrastive discourse connectives (although, while, since, etc.) signal various types ofrelations between clauses such as temporal, contrast, concession and cause. They are often ambiguous and therefore difficult to translate from one language to another. We discuss several new and translation-oriented experiments for the disambiguation of a specific subset of discourse connectives in order to correct some of the translation errors made by current statistical machine translation systems.

3 0.51913804 101 acl-2011-Disentangling Chat with Local Coherence Models

Author: Micha Elsner ; Eugene Charniak

Abstract: We evaluate several popular models of local discourse coherence for domain and task generality by applying them to chat disentanglement. Using experiments on synthetic multiparty conversations, we show that most models transfer well from text to dialogue. Coherence models improve results overall when good parses and topic models are available, and on a constrained task for real chat data.

4 0.44586045 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translations quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

5 0.38879913 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

Author: Alexander M. Rush ; Michael Collins

Abstract: We describe an exact decoding algorithm for syntax-based statistical translation. The approach uses Lagrangian relaxation to decompose the decoding problem into tractable subproblems, thereby avoiding exhaustive dynamic programming. The method recovers exact solutions, with certificates of optimality, on over 97% of test examples; it has comparable speed to state-of-the-art decoders.

6 0.38628501 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

7 0.38323012 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

8 0.3793126 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

9 0.37912059 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

10 0.37874806 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning

11 0.37854987 133 acl-2011-Extracting Social Power Relationships from Natural Language

12 0.3778626 182 acl-2011-Joint Annotation of Search Queries

13 0.37769699 135 acl-2011-Faster and Smaller N-Gram Language Models

14 0.37739241 258 acl-2011-Ranking Class Labels Using Query Sessions

15 0.37718365 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

16 0.37718275 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

17 0.37712318 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

18 0.37689567 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

19 0.37676245 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

20 0.37664968 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition