acl acl2011 acl2011-285 knowledge-graph by maker-knowledge-mining

285 acl-2011-Simple supervised document geolocation with geodesic grids


Source: pdf

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. [sent-6, score-0.247]

2 We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. [sent-7, score-0.393]

3 All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. [sent-8, score-0.309]

4 We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. [sent-9, score-0.381]

5 Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset. [sent-12, score-0.418]

6 Leidner (2008) provides a systematic overview of geography-based language applications over the previous decade, with a special focus on the problem of toponym resolution—identifying and disambiguating the references to locations in texts. [sent-14, score-0.373]

7 The Perseus project performs automatic toponym resolution on historical texts in order to display a map with each text showing the locations that are mentioned (Smith and Crane, 2001); Google Books also does this for some books, though the toponyms are identified and resolved quite crudely. [sent-21, score-0.593]

8 Eisenstein et al. (2010) investigate questions of dialectal differences and variation in regional interests among Twitter users, using a collection of geotagged tweets. [sent-23, score-0.309]

9 Determining a single location of a document is only a well-posed problem for certain documents, generally of fairly small size, but there are a number of natural situations in which such collections arise. [sent-25, score-0.226]

10 For example, a great number of articles in Wikipedia have been manually geotagged; this allows those articles to appear in their geographic locations while geobrowsing in an application like Google Earth. [sent-26, score-0.426]

11 Overell’s main goal is toponym resolution, for which geolocation serves as an input feature. [sent-29, score-0.497]

12 However, for many document collections, such metadata is unavailable, especially in the case of recently digitized historical documents. [sent-34, score-0.204]

13 Eisenstein et al. (2010) evaluate their geographic topic model by geolocating USA-based Twitter users based on their tweet content. [sent-36, score-0.286]

14 This is essentially a document geolocation task, where each document is a concatenation of all the tweets for a single user. [sent-37, score-0.543]

15 Their geographic topic model receives supervision from many documents/users and predicts locations for unseen documents/users. [sent-38, score-0.302]

16 In this paper, we tackle document geolocation using several simple supervised methods on the textual content of documents and a geodesic grid as a discrete representation of the earth’s surface. [sent-39, score-0.793]

17 Performance is measured both on geotagged Wikipedia articles (Overell, 2009) and tweets (Eisenstein et al., 2010). [sent-44, score-0.424]

18 For the Twitter data set, we obtain a median error of 479 km, which improves on the 494 km error of Eisenstein et al. [sent-48, score-0.329]

19 Wikipedia articles generally cover a single subject; in addition, most articles that refer to geographically localized entities are geotagged. [sent-54, score-0.291]

20 Such articles are well-suited as a source of supervised content for document geolocation purposes. [sent-57, score-0.517]

21 Wikipedia’s geotagged articles encompass more than just cities, geographic formations and landmarks. [sent-59, score-0.521]

22 One such type, articles about ships, is actually quite challenging to geolocate based on the text content: though the ship in question is moored in Boston, most of the page discusses its role in various battles along the eastern seaboard of the USA. [sent-61, score-0.238]

23 However, such articles make up only a small fraction of the geotagged articles. [sent-62, score-0.381]

24 Excluding various types of special-purpose articles used primarily for maintaining the site (specifically, redirect articles and articles outside the main namespace), the dump includes 3,431,722 content-bearing articles, of which 488,269 are geotagged. [sent-65, score-0.419]

25 It is necessary to process the raw dump to obtain the plain text, as well as metadata such as geotagged coordinates. [sent-66, score-0.363]

26 See Lieberman and Lin (2009) for more discussion of a related effort to extract and use the geotagged articles in Wikipedia. [sent-72, score-0.381]

27 The articles were split in round-robin fashion into training, development, and testing sets after randomizing the order of the articles, which preserved the proportion of geotagged articles. [sent-79, score-0.257]

28 Geo-tagged Microblog Corpus As a second evaluation corpus on a different domain, we use the corpus of geotagged tweets collected and used by Eisenstein et al. [sent-85, score-0.3]

29 We use the train/dev/test splits provided with the data; for these, the tweets of each user (a feed) have been concatenated to form a single document, and the location label associated with each document is the location of the first tweet by that user. [sent-88, score-0.417]

30 3 Grid representation for connecting texts to locations: Geolocation involves identifying some spatial region with a unit of text—be it a word, phrase, or document. [sent-94, score-0.194]

31 Eisenstein et al. (2010) use Gaussian distributions to model the locations of Twitter users in the United States of America. [sent-103, score-0.289]

32 This appears to work reasonably well for that restricted region, but it is likely to run into problems when predicting locations anywhere on earth—instead, spherical distributions like the von Mises-Fisher distribution would need to be employed. [sent-104, score-0.283]

33 We take here the simpler alternative of discretizing the earth’s surface with a geodesic grid; this allows us to predict locations with a variety of standard approaches over discrete outcomes. [sent-105, score-0.283]

34 Like Serdyukov et al. (2009), we use the simplest strategy: a grid of square cells of equal degree, such as 1◦ by 1◦. [sent-108, score-0.4]

35 Given that most of the populated regions of interest for us are closer to the equator than not, and that we use cells of quite fine granularity (down to 0.1◦), this simplification is acceptable. [sent-111, score-0.262]

36 With such a discrete representation of the earth’s surface, there are four distributions that form the core of all our geolocation methods. [sent-113, score-0.401]

37 The first is a standard multinomial distribution over the vocabulary for every cell in the grid. [sent-114, score-0.393]

38 Given a grid G with cells ci and a vocabulary V with words wj, we have θcij = P(wj | ci). [sent-115, score-0.53]
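
To make the grid machinery concrete, here is a minimal Python sketch (illustrative helper names; not the authors' TextGrounder code) that maps geotagged training documents to cells and estimates the unsmoothed per-cell word distributions θci:

```python
from collections import Counter, defaultdict

def cell_of(lat, lon, d):
    """Map a (lat, lon) point in degrees to the index of its d-by-d degree cell."""
    return (int((lat + 90.0) // d), int((lon + 180.0) // d))

def train_theta(docs, d):
    """docs: iterable of (lat, lon, tokens). Returns theta[c][w] = P(w | c), unsmoothed."""
    counts = defaultdict(Counter)
    for lat, lon, tokens in docs:
        counts[cell_of(lat, lon, d)].update(tokens)
    theta = {}
    for c, word_counts in counts.items():
        total = sum(word_counts.values())
        theta[c] = {w: n / total for w, n in word_counts.items()}
    return theta
```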

39 This grid representation ignores all higher level regions, such as states, countries, rivers, and mountain ranges, but it is consistent with the geocoding in both the Wikipedia and Twitter datasets. [sent-121, score-0.23]

40 Those for highly focused point-locations will jam up in a few disconnected cells—in the extreme case, toponyms like Springfield which are connected to many specific point locations around the earth. [sent-123, score-0.243]

41 We use grids with cell sizes of varying granularity d×d for d = 0.1◦, 0.5◦, 1◦, 5◦, and 10◦. [sent-124, score-0.413]

42 A cell of size 0.5◦ at the equator is roughly 56x55 km, and at 45◦ latitude it is 39x55 km. [sent-129, score-0.246]
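
These figures can be sanity-checked with a back-of-the-envelope computation (assuming roughly 111.3 km per degree of great-circle arc):

```python
import math

def cell_extent_km(d, lat):
    """Approximate (east-west, north-south) extent in km of a d-degree cell at a latitude."""
    return 111.32 * d * math.cos(math.radians(lat)), 111.32 * d

print(cell_extent_km(0.5, 0.0))   # ~(55.7, 55.7): the ~56x55 km figure at the equator
print(cell_extent_km(0.5, 45.0))  # ~(39.4, 55.7): the ~39x55 km figure at 45 degrees
```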

43 For comparison, at the equator a cell at d=5◦ is about 557x553 km (2,592 cells; 1,747 non-empty), and at d=0.1◦ the cells are far smaller and far more numerous. [sent-131, score-0.544]

44 The geolocation methods predict a cell for a document, and the latitude and longitude of the cell’s degree-midpoint are used as the predicted location. [sent-135, score-1.061]

45 Prediction error is the great-circle distance from these predicted locations to the locations given by the gold standard. [sent-136, score-0.373]
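
A sketch of this evaluation step, reusing the cell_of indexing above; the haversine formula (an assumption here, not specified in the excerpts) is a standard way to compute great-circle distance:

```python
import math

EARTH_RADIUS_KM = 6371.0

def cell_midpoint(c, d):
    """Degree-midpoint (lat, lon) of cell index c = (i, j) on a d-degree grid."""
    i, j = c
    return ((i + 0.5) * d - 90.0, (j + 0.5) * d - 180.0)

def great_circle_km(p, q):
    """Haversine great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, p + q)
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def prediction_error_km(predicted_cell, gold_latlon, d):
    """Error = great-circle distance from the predicted cell's midpoint to the gold location."""
    return great_circle_km(cell_midpoint(predicted_cell, d), gold_latlon)
```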

46 The use of cell midpoints provides a fair comparison for predictions with different cell sizes. [sent-137, score-0.724]

47 This is unlike the metrics of Serdyukov et al. (2009), which are all computed relative to a given grid size. [sent-139, score-0.23]

48 Smaller cells reduce this penalty and permit the word distributions θcij to be much more specific for each cell, but they are harder to predict exactly and suffer more from sparse word counts compared to coarser granularity. [sent-141, score-0.245]

49 4 Supervised models for document geolocation: Our methods use only the text in the documents; predictions are made based on the distributions θ, κ, and γ introduced in the previous section. [sent-144, score-0.498]

50 The word distribution of document dk backs off to the global distribution θDj. [sent-149, score-0.404]
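
The excerpts do not reproduce the paper's exact back-off scheme; as an illustrative stand-in, simple Jelinek-Mercer interpolation with the global distribution captures the same idea:

```python
def smoothed_prob(w, local_dist, global_dist, lam=0.9):
    """Jelinek-Mercer interpolation of a local (document or cell) word distribution
    with the global one; a stand-in for the paper's back-off scheme, not its code."""
    return lam * local_dist.get(w, 0.0) + (1.0 - lam) * global_dist.get(w, 1e-9)
```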

51 Finally, the cell distributions are simply the relative frequency of the number of documents in each cell: γi = |Di| / |D|. A standard set of stop words is ignored. [sent-165, score-0.471]
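
A one-function sketch of this prior, under the |Di|/|D| reading of the formula above (and reusing the hypothetical cell_of helper):

```python
from collections import Counter

def cell_priors(docs, d):
    """gamma[c] = fraction of training documents whose gold location falls in cell c."""
    counts = Counter(cell_of(lat, lon, d) for lat, lon, _tokens in docs)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}
```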

52 Also, all words are lowercased except in the case of the most-common-toponym baselines, where uppercase words serve as a fallback in case a toponym cannot be located in the article. [sent-166, score-0.241]

53 4.2 Kullback-Leibler divergence: Given the distributions for each cell, θci, in the grid, we use an information retrieval approach to choose a location for a test document dk: compute the similarity between its word distribution θdk and that of each cell, and then choose the closest one. [sent-169, score-0.432]

54 The best cell cKL is the one which provides the best encoding for the test document: cKL = arg minci∈G KL(θdk || θci) (8). The fact that KL is not symmetric is desired here: the other direction, KL(θci || θdk), asks which cell is best encoded by the test document’s distribution. [sent-172, score-0.694]
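
A sketch of this retrieval step, assuming the theta and smoothed_prob helpers sketched above (so cell probabilities are never zero); the sum runs over the test document's vocabulary, matching the asymmetric direction KL(θdk || θci):

```python
import math

def kl_divergence(theta_doc, theta_cell, global_dist):
    """KL(doc || cell), summed over the test document's words; smoothing keeps q > 0."""
    return sum(p * math.log(p / smoothed_prob(w, theta_cell, global_dist))
               for w, p in theta_doc.items())

def best_cell_kl(theta_doc, theta, global_dist):
    """Eq. (8): the cell whose word distribution minimizes KL(theta_doc || theta_cell)."""
    return min(theta, key=lambda c: kl_divergence(theta_doc, theta[c], global_dist))
```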

55 As an example for why non-symmetric KL in this order is appropriate, consider geolocating a page in a densely geotagged cell, such as the page for the Washington Monument. [sent-176, score-0.39]

56 Many of those words appear only once in the monument’s page, but this will still be a higher value than for the cell and will weight the contribution accordingly. [sent-178, score-0.347]

57 4.4 Average cell probability: For each word, κji gives the probability of each cell in the grid. [sent-184, score-0.413]
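
A sketch of this method under the same assumptions; here kappa[w][c] = P(c | w), which could be obtained from θ via Bayes' rule with the cell priors:

```python
from collections import defaultdict

def best_cell_acp(tokens, kappa, cells):
    """Average cell probability: score each cell by the mean of P(cell | word)
    over the document's known tokens, then return the highest-scoring cell."""
    known = [w for w in tokens if w in kappa]
    scores = defaultdict(float)
    for w in known:
        for c, p in kappa[w].items():
            scores[c] += p
    n = max(len(known), 1)
    return max(cells, key=lambda c: scores.get(c, 0.0) / n)
```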

58 Random: Choose crand randomly from a uniform distribution over the entire grid G. [sent-189, score-0.276]

59 Cell prior maximum: Choose the cell with the highest prior probability according to γ: ccpm = arg maxci∈G γi. [sent-190, score-0.347]

60 Most frequent toponym: Identify the most frequent toponym in the article and the geotagged Wikipedia articles that match it. [sent-191, score-0.861]

61 Then identify which of those articles has the most incoming links (a measure of its prominence), and choose cmft to be the cell that contains the geotagged location for that article. [sent-192, score-0.847]

62 Note that a toponym matches an article (or equivalently, the article is a candidate for the toponym) if the toponym is the same as the article’s title, or if the title is the toponym followed by a parenthetical or comma-separated disambiguator. [Figure 1: Plot of grid resolution in degrees versus mean error for each method on the Wikipedia dev set.] [sent-194, score-1.217]

63 For example, the toponym Tucson would match articles named Tucson, Tucson (city) or Tucson, Arizona. [sent-196, score-0.335]

64 In this fashion, the set of toponyms, and the list of candidates for each toponym, is generated from the set of all geotagged Wikipedia articles. [sent-197, score-0.257]
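
A sketch of this baseline under the matching rule described above; the candidate lists, in-link counts, and article coordinates are assumed to be precomputed from the geotagged dump:

```python
from collections import Counter

def most_frequent_toponym_cell(doc_toponyms, candidates, inlinks, coords, d):
    """Resolve the document's most frequent toponym to the matching geotagged
    article with the most incoming links, and return that article's grid cell.
    candidates: {toponym: [article]}, inlinks: {article: int},
    coords: {article: (lat, lon)}. Falls through to the next toponym if none match."""
    for toponym, _count in Counter(doc_toponyms).most_common():
        matches = candidates.get(toponym, [])
        if matches:
            best = max(matches, key=lambda a: inlinks.get(a, 0))
            lat, lon = coords[best]
            return cell_of(lat, lon, d)  # cell_of from the grid sketch above
    return None
```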

65 5 Experiments: The approaches described in the previous section are evaluated on both the geotagged Wikipedia and Twitter datasets. [sent-198, score-0.257]

66 Given a predicted cell c for a document, the prediction error is the great-circle distance between the true location and the center of c, as described in section 3. [sent-199, score-0.515]

67 Grid resolution and thresholding: The major parameter of all our methods is the grid resolution. [sent-200, score-0.331]

68 [Figure 2: Histograms of the distribution of error distances (in km) for grid size 0.1◦.] [sent-206, score-0.325]

69 Figure 1 graphs the mean error of each method for different resolutions on the Wikipedia dev set, and Figure 2 graphs the distribution of error distances for grid size 0.1◦. [sent-210, score-0.472]

70 These results indicate that a grid size even smaller than 0.1◦ might improve results further. [sent-212, score-0.23]

71 To test this, we ran experiments using grid sizes of 0.05◦ and 0.01◦. [sent-214, score-0.23]

72 The mean errors on the dev set increased slightly, from 323 km to 348 and 329 km, respectively, indicating that 0.1◦ is a good choice. [sent-217, score-0.217]

73 For the Twitter dataset, we considered both grid size and vocabulary threshold. [sent-219, score-0.23]

74 Table 1 shows mean prediction error using KL divergence, for various combinations of threshold and grid size. [sent-221, score-0.347]

75 Clearly, the larger grid size of 5◦ works better for this dataset than the finer-grained alternatives. [sent-223, score-0.23]

76 Overall, there is a less clear trend for the other methods. [Table 1: Mean prediction error on the Twitter dev set for various combinations of vocabulary threshold (in feeds) and grid size, using the KL divergence strategy.] [sent-226, score-0.351]

77 Our interpretation of this is that there is greater sparsity for the Twitter dataset, and thus it is more sensitive to arbitrary aspects of how different user feeds are captured in different cells at different granularities. [sent-228, score-0.215]

78 In one configuration, our results using KL divergence are slightly worse than theirs: a median error of 516 km and a mean of 986 km. [sent-238, score-0.397]

79 Wikipedia articles tend to use a lot of toponyms and words that correlate strongly with particular places while many, perhaps most, tweets discuss quotidian details such as what the user ate for lunch. [sent-240, score-0.248]

80 Finally, there are orders of magnitude more training examples for Wikipedia, which allows for greater grid resolution and thus more precise location predictions. [sent-242, score-0.45]

81 Threshold makes no difference for cell prior maximum. [sent-246, score-0.347]

82 However, prediction is quite good for ships that were sunk in particular battles which are described in detail on the page; examples are the USS Gambier Bay, USS Hammann (DD-412), and HMS Majestic (1895). [sent-250, score-0.226]

83 Another interesting aspect of geolocating ship articles is that ships tend to end up sunk in remote battle locations, such that their article is the only one located in the cell covering the location in the training set. [sent-253, score-1.002]

84 Ship terminology thus dominates such cells, with the effect that our models often (incorrectly) geolocate test articles about other ships, often ones with similar properties, to such locations. [sent-254, score-0.593]

85 This also leads to generally more accurate geolocation of HMS ships over USS ships; the former seem to have been sunk in more concentrated regions that are themselves less spread out globally. [sent-255, score-0.552]

86 6 Related work: Lieberman and Lin (2009) also work with geotagged Wikipedia articles, but they do so in order to analyze the likely locations of users who edit such articles. [sent-256, score-0.471]

87 Some approaches to document geolocation rely largely or entirely on non-textual metadata, which is often unavailable for many corpora of interest. Nonetheless, our methods could be combined with metadata-based approaches when such information is available. [sent-263, score-0.452]

88 7 Conclusion: We have shown that automatic identification of the location of a document based only on its text can be performed with high accuracy using simple supervised methods and a discrete grid representation of the earth’s surface. [sent-269, score-0.496]

89 Our most effective geolocation strategy finds the grid cell whose word distribution has the smallest KL divergence from that of the test document, and easily beats several effective baselines. [sent-271, score-0.994]

90 We predict the location of Wikipedia pages to a median error of 11.8 km. [sent-272, score-0.251]

91 For Twitter, we obtain a median error of 479 km and mean error of 967 km. [sent-274, score-0.361]

92 Both naive Bayes and a simple averaging of word-level cell distributions also worked well; however, KL was more effective, we believe, because it weights the words in the document most heavily and thus puts less importance on the less specific word distributions of each cell. [sent-275, score-0.646]

93 It could also help with Wikipedia, especially for buildings: for example, the page for Independence Hall in Philadelphia links to geotagged “friend” pages for Philadelphia, the Liberty Bell, and many other nearby locations and buildings. [sent-278, score-0.453]

94 However, we note that we are still primarily interested in geolocation with only text because there are a great many situations in which such linked structure is unavailable. [sent-279, score-0.286]

95 9 The task of identifying a single location for an entire document provides a convenient way of evaluating approaches for connecting texts with locations, but it is not fully coherent in the context of documents that cover multiple locations. [sent-281, score-0.275]

96 Nonetheless, both the average cell probability and naive Bayes models output a distribution over all cells, which could be used to assign multiple locations. [sent-282, score-0.435]

97 Furthermore, these cell distributions could additionally be used to define a document level prior for resolution of individual toponyms. [sent-283, score-0.63]

98 Though we treated the grid resolution as a parameter, the grids themselves form a hierarchy of cells containing finer-grained cells. [sent-287, score-0.567]

99 For example, given a cell of the finest grain, the average cell probability and naive Bayes models could successively back off to the values produced by their coarser-grained containing cells, and KL divergence could be summed from finest-to-coarsest grain. [sent-289, score-0.821]
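
This back-off is proposed rather than implemented in the excerpts; a hedged sketch of the coarse-to-fine interpolation for one word probability might look like:

```python
def hierarchical_prob(w, cell, theta_at, parent_of, lam=0.5):
    """Recursively interpolate a fine cell's word probability with that of its
    coarser containing cells (a sketch of the proposed back-off, not the paper's code).
    theta_at: {(resolution, cell_index): {word: prob}};
    parent_of: maps a (resolution, cell_index) key to the next-coarser key, or None."""
    p = theta_at.get(cell, {}).get(w, 0.0)
    parent = parent_of(cell)
    if parent is None:
        return p
    return lam * p + (1.0 - lam) * hierarchical_prob(w, parent, theta_at, parent_of, lam)
```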

100 Another strategy for making models less sensitive to grid resolution is to smooth the per-cell word distributions over neighboring cells; this strategy improved results on Flickr photo geolocation for Serdyukov et al. [sent-290, score-0.692]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cell', 0.347), ('geolocation', 0.286), ('geotagged', 0.257), ('wikipedia', 0.238), ('grid', 0.23), ('toponym', 0.211), ('dk', 0.205), ('twitter', 0.185), ('cells', 0.17), ('locations', 0.162), ('kl', 0.153), ('km', 0.148), ('geographic', 0.14), ('ci', 0.13), ('ships', 0.129), ('articles', 0.124), ('location', 0.119), ('document', 0.107), ('eisenstein', 0.105), ('resolution', 0.101), ('earth', 0.099), ('serdyukov', 0.097), ('divergence', 0.085), ('median', 0.083), ('geodesic', 0.081), ('toponyms', 0.081), ('distributions', 0.075), ('grids', 0.066), ('arcgi', 0.065), ('coursey', 0.065), ('geolocating', 0.065), ('overell', 0.065), ('ship', 0.065), ('sunk', 0.065), ('wj', 0.064), ('metadata', 0.059), ('article', 0.058), ('lieberman', 0.057), ('cij', 0.053), ('tucson', 0.053), ('users', 0.052), ('austin', 0.051), ('documents', 0.049), ('error', 0.049), ('argci', 0.049), ('dkj', 0.049), ('equator', 0.049), ('geolocate', 0.049), ('geolocated', 0.049), ('latitude', 0.049), ('monument', 0.049), ('dump', 0.047), ('distribution', 0.046), ('mihalcea', 0.045), ('feeds', 0.045), ('regions', 0.043), ('geographically', 0.043), ('flickr', 0.043), ('vdk', 0.043), ('uss', 0.043), ('tweets', 0.043), ('ji', 0.042), ('naive', 0.042), ('discrete', 0.04), ('perseus', 0.039), ('historical', 0.038), ('dev', 0.037), ('backstrom', 0.037), ('coordinates', 0.037), ('threshold', 0.036), ('rada', 0.035), ('nonetheless', 0.035), ('page', 0.034), ('bayes', 0.033), ('texas', 0.033), ('barbecue', 0.032), ('battles', 0.032), ('ckl', 0.032), ('cpi', 0.032), ('dkjlog', 0.032), ('hms', 0.032), ('kino', 0.032), ('longhorn', 0.032), ('longitude', 0.032), ('newsstand', 0.032), ('rauch', 0.032), ('teitler', 0.032), ('textgrounder', 0.032), ('triangular', 0.032), ('unsurprising', 0.032), ('spatial', 0.032), ('mean', 0.032), ('ny', 0.03), ('located', 0.03), ('predictions', 0.03), ('concentrated', 0.029), ('facebook', 0.029), ('york', 0.029), ('tweet', 0.029), ('resolutions', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999923 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

2 0.18433012 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

Author: Nathan Bodenstab ; Aaron Dunlop ; Keith Hall ; Brian Roark

Abstract: Efficient decoding for syntactic parsing has become a necessary research area as statistical grammars grow in accuracy and size and as more NLP applications leverage syntactic analyses. We review prior methods for pruning and then present a new framework that unifies their strengths into a single approach. Using a log linear model, we learn the optimal beam-search pruning parameters for each CYK chart cell, effectively predicting the most promising areas of the model space to explore. We demonstrate that our method is faster than coarse-to-fine pruning, exemplified in both the Charniak and Berkeley parsers, by empirically comparing our parser to the Berkeley parser using the same grammar and under identical operating conditions.

3 0.14823177 129 acl-2011-Extending the Entity Grid with Entity-Specific Features

Author: Micha Elsner ; Eugene Charniak

Abstract: We extend the popular entity grid representation for local coherence modeling. The grid abstracts away information about the entities it models; we add discourse prominence, named entity type and coreference features to distinguish between important and unimportant entities. We improve the best result for WSJ document discrimination by 6%.

4 0.13851191 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

Author: Danuta Ploch

Abstract: Named entity disambiguation is the task of linking an entity mention in a text to the correct real-world referent predefined in a knowledge base, and is a crucial subtask in many areas like information retrieval or topic detection and tracking. Named entity disambiguation is challenging because entity mentions can be ambiguous and an entity can be referenced by different surface forms. We present an approach that exploits Wikipedia relations between entities co-occurring with the ambiguous form to derive a range of novel features for classifying candidate referents. We find that our features improve disambiguation results significantly over a strong popularity baseline, and are especially suitable for recognizing entities not contained in the knowledge base. Our system achieves state-of-the-art results on the TAC-KBP 2009 dataset.

5 0.12942626 177 acl-2011-Interactive Group Suggesting for Twitter

Author: Zhonghua Qu ; Yang Liu

Abstract: The number of users on Twitter has drastically increased in the past years. However, Twitter does not have an effective user grouping mechanism. Therefore tweets from other users can quickly overrun and become inconvenient to read. In this paper, we propose methods to help users group the people they follow using their provided seeding users. Two sources of information are used to build sub-systems: textural information captured by the tweets sent by users, and social connections among users. We also propose a measure of fitness to determine which subsystem best represents the seed users and use it for target user ranking. Our experiments show that our proposed framework works well and that adaptively choosing the appropriate sub-system for group suggestion results in increased accuracy.

6 0.12791567 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

7 0.12315402 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

8 0.10410496 114 acl-2011-End-to-End Relation Extraction Using Distant Supervision from External Semantic Repositories

9 0.1032926 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

10 0.10196809 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names

11 0.09990295 52 acl-2011-Automatic Labelling of Topic Models

12 0.097988024 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

13 0.09705884 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

14 0.095220104 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models

15 0.094239697 109 acl-2011-Effective Measures of Domain Similarity for Parsing

16 0.087581955 292 acl-2011-Target-dependent Twitter Sentiment Classification

17 0.084133461 101 acl-2011-Disentangling Chat with Local Coherence Models

18 0.078391686 283 acl-2011-Simple English Wikipedia: A New Text Simplification Task

19 0.072500139 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

20 0.072427288 64 acl-2011-C-Feel-It: A Sentiment Analyzer for Micro-blogs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.168), (1, 0.092), (2, -0.07), (3, 0.048), (4, -0.003), (5, -0.05), (6, -0.015), (7, -0.034), (8, -0.151), (9, 0.065), (10, -0.134), (11, 0.107), (12, 0.061), (13, -0.101), (14, 0.011), (15, 0.041), (16, 0.076), (17, 0.009), (18, -0.016), (19, -0.114), (20, 0.122), (21, -0.04), (22, -0.046), (23, -0.083), (24, 0.02), (25, 0.03), (26, -0.07), (27, -0.07), (28, -0.014), (29, -0.028), (30, -0.04), (31, 0.07), (32, 0.019), (33, -0.045), (34, 0.131), (35, -0.082), (36, -0.019), (37, 0.067), (38, 0.09), (39, -0.079), (40, 0.032), (41, 0.012), (42, 0.108), (43, 0.053), (44, 0.119), (45, 0.044), (46, 0.056), (47, -0.012), (48, 0.016), (49, -0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9329806 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

2 0.67591804 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History

Author: Oliver Ferschke ; Torsten Zesch ; Iryna Gurevych

Abstract: We present an open-source toolkit which allows (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of provided data. By using a dedicated storage format, our toolkit massively decreases the data volume to less than 2% of the original size, and at the same time provides an easy-to-use interface to access the revision data. The language-independent design allows to process any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia’s edit history.

3 0.55582857 195 acl-2011-Language of Vandalism: Improving Wikipedia Vandalism Detection via Stylometric Analysis

Author: Manoj Harpalani ; Michael Hart ; Sandesh Signh ; Rob Johnson ; Yejin Choi

Abstract: Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work to date relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre where a group of people share a similar linguistic behavior. Experimental results suggest that (1) statistical models give evidence to unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context free grammars (PCFG) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.

4 0.52734184 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

Author: Nathan Bodenstab ; Aaron Dunlop ; Keith Hall ; Brian Roark

Abstract: Efficient decoding for syntactic parsing has become a necessary research area as statistical grammars grow in accuracy and size and as more NLP applications leverage syntactic analyses. We review prior methods for pruning and then present a new framework that unifies their strengths into a single approach. Using a log linear model, we learn the optimal beam-search pruning parameters for each CYK chart cell, effectively predicting the most promising areas of the model space to explore. We demonstrate that our method is faster than coarse-to-fine pruning, exemplified in both the Charniak and Berkeley parsers, by empirically comparing our parser to the Berkeley parser using the same grammar and under identical operating conditions.

5 0.52702069 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

Author: Andrei Popescu-Belis ; Majid Yazdani ; Alexandre Nanchen ; Philip N. Garner

Abstract: The Automatic Content Linking Device is a just-in-time document retrieval system which monitors an ongoing conversation or a monologue and enriches it with potentially related documents, including multimedia ones, from local repositories or from the Internet. The documents are found using keyword-based search or using a semantic similarity measure between documents and the words obtained from automatic speech recognition. Results are displayed in real time to meeting participants, or to users watching a recorded lecture or conversation.

6 0.5180546 213 acl-2011-Local and Global Algorithms for Disambiguation to Wikipedia

7 0.5131489 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing

8 0.5101614 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

9 0.5080303 305 acl-2011-Topical Keyphrase Extraction from Twitter

10 0.4952766 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation

11 0.47448307 177 acl-2011-Interactive Group Suggesting for Twitter

12 0.46744099 298 acl-2011-The ACL Anthology Searchbench

13 0.44332054 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names

14 0.43293491 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

15 0.42364618 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

16 0.41571414 101 acl-2011-Disentangling Chat with Local Coherence Models

17 0.41488132 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

18 0.40335593 129 acl-2011-Extending the Entity Grid with Entity-Specific Features

19 0.39746049 291 acl-2011-SystemT: A Declarative Information Extraction System

20 0.39450833 191 acl-2011-Knowledge Base Population: Successful Approaches and Challenges


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.035), (17, 0.064), (26, 0.029), (37, 0.051), (39, 0.075), (41, 0.05), (53, 0.013), (55, 0.03), (57, 0.282), (59, 0.04), (72, 0.03), (73, 0.015), (91, 0.035), (96, 0.136), (97, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77381098 285 acl-2011-Simple supervised document geolocation with geodesic grids

Author: Benjamin Wing ; Jason Baldridge

Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.

2 0.76966661 243 acl-2011-Partial Parsing from Bitext Projections

Author: Prashanth Mannem ; Aswarth Dara

Abstract: Recent work has shown how a parallel corpus can be leveraged to build syntactic parser for a target language by projecting automatic source parse onto the target sentence using word alignments. The projected target dependency parses are not always fully connected to be useful for training traditional dependency parsers. In this paper, we present a greedy non-directional parsing algorithm which doesn’t need a fully connected parse and can learn from partial parses by utilizing available structural and syntactic information in them. Our parser achieved statistically significant improvements over a baseline system that trains on only fully connected parses for Bulgarian, Spanish and Hindi. It also gave a significant improvement over previously reported results for Bulgarian and set a benchmark for Hindi.

3 0.75258255 305 acl-2011-Topical Keyphrase Extraction from Twitter

Author: Xin Zhao ; Jing Jiang ; Jing He ; Yang Song ; Palakorn Achanauparp ; Ee-Peng Lim ; Xiaoming Li

Abstract: Summarizing and analyzing Twitter content is an important and challenging task. In this paper, we propose to extract topical keyphrases as one way to summarize Twitter. We propose a context-sensitive topical PageRank method for keyword ranking and a probabilistic scoring function that considers both relevance and interestingness of keyphrases for keyphrase ranking. We evaluate our proposed methods on a large Twitter data set. Experiments show that these methods are very effective for topical keyphrase extraction.

4 0.74013221 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style

Author: Amjad Abu-Jbara ; Barbara Rosario ; Kent Lyons

Abstract: In this paper, we address the problem of optimizing the style of textual content to make it more suitable to being listened to by a user as opposed to being read. We study the differences between the written style and the audio style by consulting the linguistics and journalism literatures. Guided by this study, we suggest a number of linguistic features to distinguish between the two styles. We show the correctness of our features and the impact of style transformation on the user experience through statistical analysis, a style classification task, and a user study.

5 0.72093737 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

Author: Manaal Faruqui ; Sebastian Pado

Abstract: In contrast to many languages (like Russian or French), modern English does not distinguish formal and informal (“T/V”) address overtly, for example by pronoun choice. We describe an ongoing study which investigates to what degree the T/V distinction is recoverable in English text, and with what textual features it correlates. Our findings are: (a) human raters can label English utterances as T or V fairly well, given sufficient context; and (b) lexical cues can predict T/V almost at human level.

6 0.55619478 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

7 0.55331743 101 acl-2011-Disentangling Chat with Local Coherence Models

8 0.54922765 178 acl-2011-Interactive Topic Modeling

9 0.54848069 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

10 0.54831785 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

11 0.54747838 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

12 0.54568017 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

13 0.54566634 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation

14 0.54445988 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

15 0.54403615 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

16 0.54388773 192 acl-2011-Language-Independent Parsing with Empty Elements

17 0.54383272 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

18 0.54351604 11 acl-2011-A Fast and Accurate Method for Approximate String Search

19 0.54340369 182 acl-2011-Joint Annotation of Search Queries

20 0.54298168 37 acl-2011-An Empirical Evaluation of Data-Driven Paraphrase Generation Techniques