emnlp emnlp2012 emnlp2012-121 knowledge-graph by maker-knowledge-mining

121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid


Source: pdf

Author: Stephen Roller ; Michael Speriosu ; Sarat Rallapalli ; Benjamin Wing ; Jason Baldridge

Abstract: The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. We also provide a better way of choosing the locations for pseudo-documents. We evaluate these strategies on existing Wikipedia and Twitter corpora, as well as a new, larger Twitter corpus. The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. The two grid constructions can also be combined to produce consistently strong results across all training sets.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. [sent-5, score-0.278]

2 Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. [sent-7, score-1.018]
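
To make the pseudo-document construction above concrete, here is a minimal Python sketch, assuming simple floor-based binning of latitude/longitude into equal-degree cells; the function name and toy data are illustrative, not taken from the paper's implementation.

```python
from collections import defaultdict

def build_uniform_grid(docs, cell_degrees=1.0):
    # Bin (lat, lon, text) training documents into grid cells, then
    # concatenate the texts in each occupied cell into one pseudo-document.
    cells = defaultdict(list)
    for lat, lon, text in docs:
        key = (int(lat // cell_degrees), int(lon // cell_degrees))
        cells[key].append(text)
    return {key: " ".join(texts) for key, texts in cells.items()}

docs = [(40.7, -74.0, "yankees subway bagel"),
        (40.6, -73.9, "brooklyn pizza slice"),
        (51.5, -0.1, "tube westminster queue")]
pseudo_docs = build_uniform_grid(docs, cell_degrees=1.0)
# Two occupied cells remain: one covering both New York documents, one for London.
```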

3 Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. [sent-8, score-0.329]

4 We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. [sent-9, score-0.275]

5 We also provide a better way of choosing the locations for pseudo-documents. [sent-10, score-0.209]

6 The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. [sent-12, score-0.636]

7 The two grid constructions can also be combined to produce consistently strong results across all training sets. [sent-13, score-0.245]

8 It is often desirable to extract summary metadata from such resources, such as the date of writing or the location of the author, yet only a small portion of available documents are explicitly annotated in this fashion. [sent-15, score-0.3]

9 For example, clues to the geographic location of a document may come from a variety of word features, e.g. [sent-17, score-0.415]

10 toponyms (Toronto), geographic features (mountain), culturally local features (hockey), and stylistic or dialectical differences (cool vs. [sent-19, score-0.243]

11 One of the first works on document geolocation is Ding et al. [sent-26, score-0.379]

12 (2010), who geolocate Twitter users by resolving their profile locations against a gazetteer of U. [sent-33, score-0.364]

13 An alternative to using a discrete set of locations from a gazetteer is to use information retrieval (IR) techniques on a set of geolocated training documents. [sent-36, score-0.308]

14 ...training document and a location chosen based on the location(s) of the most similar training document(s). [sent-40, score-0.238]

15 For image geolocation, Chen and Grauman (2011) perform mean-shift clustering over training images to discretize locations, then estimate a test image’s location with weighted voting from the k most similar documents. [sent-41, score-0.24]

16 Additionally, they group documents via a uniform geodesic grid rather than a clustered set of locations. [sent-44, score-0.527]

17 This reduces the number of similarity computations and removes the need to perform location clustering altogether, but introduces a new parameter controlling the granularity of the grid. [sent-45, score-0.199]

18 (2011) predict the locations of tweets and users by comparing text in tweets to language models associated with zip codes and broader geopolitical enclosures. [sent-47, score-0.586]

19 (2012) discretize by simply clustering data points within a small distance threshold, but only perform geolocation within fixed city limits. [sent-49, score-0.367]

20 (2010) predict locations based on Gaussian distributions over the earth’s surface as part of a hierarchical Bayesian model. [sent-51, score-0.209]

21 We build on the IR approach with grids while addressing some of the shortcomings of a uniform grid. [sent-55, score-0.217]

22 Uniform grids are problematic in that they ignore the geographic dispersion of documents and forgo the possibility of greater-granularity geographic resolution in document-rich areas. [sent-56, score-0.683]

23 Instead, we construct a grid using a k-d tree, which adapts to the size of the training set and the geographic dispersion of the documents it contains. [sent-57, score-0.708]

24 It also has the desirable property of generally requiring fewer active cells than a uniform grid, drastically reducing the computation time required to label a test document. [sent-59, score-0.233]

25 In addition, a simple difference in the choice of location for a given grid cell (the centroid of the training documents in the cell, rather than the cell midpoint) results in across-the-board improvements. [sent-61, score-1.157]

26 We also construct and evaluate on a much larger dataset of geolocated tweets than has been used in previous papers, demonstrating the scalability and robustness of our methods and confirming the ability of the adaptive grid to more effectively use larger datasets. [sent-62, score-0.505]

27 A document in this dataset is the concatenation of all tweets by a single user, with a location derived from the earliest tweet with specific, GPS-assigned latitude/longitude coordinates. [sent-68, score-0.425]

28 All tweets of a user are concatenated into a single document, and we use the earliest collected GPS-assigned location as the gold location. [sent-75, score-0.349]

29 To remove many spammers and robots, we only kept users following 5 to 1000 people, followed by at least 5 users, and authoring no more than 1000 tweets in the three-month period. [sent-77, score-0.225]
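
These filtering criteria translate directly into a predicate. A minimal sketch, with hypothetical field names for the per-user counts:

```python
def keep_user(num_following, num_followers, num_tweets):
    # Keep users following 5 to 1000 people, followed by at least 5 users,
    # and authoring no more than 1000 tweets in the three-month period.
    return (5 <= num_following <= 1000
            and num_followers >= 5
            and num_tweets <= 1000)
```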

30 The resulting dataset contains 38 million tweets from 449,694 users, or roughly 85 tweets per user on average. [sent-78, score-0.346]

31 Assume we have a collection d of documents and their associated location labels l. [sent-83, score-0.3]

32 These documents may be actual texts, or they can be pseudo-documents composed of a number of texts grouped via some algorithm (such as the grids discussed in the next section). [sent-84, score-0.291]

33 For a test document di, its similarity to each labeled document is computed, and the location of the most similar document assigned to di. [sent-85, score-0.404]
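
A minimal sketch of this nearest-pseudo-document step, assuming KL divergence between unigram language models as the similarity measure (a standard choice in this IR approach; the summary does not spell out the exact measure used here):

```python
import math
from collections import Counter

def unigram_dist(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, floor=1e-9):
    # KL(p || q); the `floor` is a crude stand-in for the smoothing of the
    # pseudo-document model discussed below (sentence 35).
    return sum(pw * math.log(pw / q.get(w, floor)) for w, pw in p.items())

def geolocate(test_text, pseudo_docs, locations):
    # Assign the location of the pseudo-document with lowest divergence.
    p = unigram_dist(test_text)
    best = min(pseudo_docs,
               key=lambda c: kl_divergence(p, unigram_dist(pseudo_docs[c])))
    return locations[best]
```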

34 In related work on image geolocation, Hays and Efros (2008) use the same general framework, but compute the location based on the k-nearest neighbors (kNN) rather than the top one. [sent-90, score-0.202]

35 We smooth documents using the pseudo-Good-Turing method of W&B, a nonparametric discounting model that backs off from the unsmoothed distribution θ̃_{d_i} of the document to the unsmoothed distribution θ̃_D of all documents. [sent-98, score-0.228]
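
A simplified sketch of such a discounting back-off, with a fixed reserved mass `alpha` standing in for the nonparametric pseudo-Good-Turing estimate, which is not reproduced here:

```python
def smoothed_prob(word, doc_counts, doc_total, global_dist, alpha):
    # Reserve probability mass `alpha` for words unseen in the document and
    # back off to the unsmoothed distribution of all documents. Note this
    # sketch does not renormalize the backed-off part over unseen words,
    # so it is only approximately a distribution.
    if word in doc_counts:
        return (1.0 - alpha) * doc_counts[word] / doc_total
    return alpha * global_dist.get(word, 0.0)
```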

36 A standard strategy to deal with this problem is to collapse groups of geographically nearby documents into larger pseudo-documents. [sent-114, score-0.251]

37 Formally, this involves partitioning the training documents into a set of sets of documents G = {g1, . . . , gn}. [sent-116, score-0.379]

38 This can be chosen based on the partitioning function itself or the locations of the documents in each group. [sent-123, score-0.443]

39 Both W&B and SMvZ use uniform grids consisting of cells of equal degree size to partition documents. [sent-124, score-0.423]

40 We explore an alternative that uses k-d (k-dimensional) trees to construct a non-uniform grid that adapts to training sets of different sizes more gracefully. [sent-125, score-0.316]

41 W&B define the location for a cell to be its geographic center, while SMvZ only perform error analysis in terms of choosing the correct cell. [sent-127, score-0.344]

42 We obtain consistently improved results using the centroid of the cell’s documents, which takes into account where the documents are concentrated. [sent-128, score-0.311]

43 Partitioning geolocated documents using a k-d tree provides finer granularity in dense regions and coarser granularity elsewhere. [sent-137, score-0.345]

44 For example, documents from Queens and Brooklyn may show significant cultural distinctions, while documents separated by the same distance in rural Montana may appear culturally identical. [sent-138, score-0.356]

45 A uniform grid with large cells will mash Queens and Brooklyn together, while small cells will create unnecessarily sparse regions in Montana. [sent-139, score-0.607]

46 An important parameter for a k-d tree is its bucket size, which determines the maximum number of points (documents in our case) that a cell may contain. [sent-140, score-0.534]

47 By varying the bucket size, the cells can be made fine- or coarse-grained. [sent-141, score-0.468]

48 If the number of documents in the node exceeds the bucket size, the node is split into two nodes along a chosen split dimension and point. [sent-146, score-0.54]

49 (1977), we choose to always split a node. (We note that the grid “rectangles” are actually trapezoids due to the nature of the latitude/longitude coordinate system.) [sent-151, score-0.273]
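
A minimal sketch of growing such k-d leaves, assuming the split dimension simply alternates with depth. The midpoint split shown corresponds to the MIDPOINT method; the FRIEDMAN rule of Friedman et al. (1977) chooses the split point differently (roughly, at the data median).

```python
def build_kd_leaves(points, bucket_size, depth=0):
    # Recursively split (lat, lon) points until every leaf holds at most
    # `bucket_size` points.
    if len(points) <= bucket_size:
        return [points]
    dim = depth % 2  # alternate latitude / longitude
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    mid = (lo + hi) / 2.0  # MIDPOINT split of the dimension's range
    left = [p for p in points if p[dim] <= mid]
    right = [p for p in points if p[dim] > mid]
    if not left or not right:  # all points identical along this dimension
        return [points]
    return (build_kd_leaves(left, bucket_size, depth + 1)
            + build_kd_leaves(right, bucket_size, depth + 1))
```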

50 Figure 1: View of North America showing k-d leaves created from GEOWIKI with a bucket size of 600 and the MIDPOINT method, as visualized in Google Earth. [sent-153, score-0.461]

51 Figure 1 shows the leaves of the k-d tree formed over North America using the GEOWIKI dataset, the MIDPOINT node division method, and a bucket size of 600. [sent-164, score-0.535]

52 More densely populated areas of the earth (which in turn tend to have more Wikipedia documents associated with them) contain smaller and more numerous leaf cells. [sent-166, score-0.328]

53 The cells over Manhattan are significantly smaller than those of Queens, the Bronx, and East Jersey, even at such a coarse bucket size. [sent-167, score-0.468]

54 Though the leaves of the k-d tree implicitly cover the entire surface of the earth, our illustrations limit the size of each box by its data, leaving gaps where no training documents exist. [sent-168, score-0.313]

55 W&B use the geographic center of a cell as the geolocation for the pseudo-document it represents. [sent-170, score-0.663]

56 However, this ignores the fact that many cells will have imbalances in the dispersion of the documents they contain: typically, they will be clumpy, with documents clustering around areas of high population or activity. [sent-171, score-0.522]

57 An alternative is to select the centroid of the locations of all the documents contained within a cell. [sent-172, score-0.52]
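
A sketch of computing such a centroid, averaging document locations as 3-D unit vectors so that cells straddling the 180th meridian average sensibly; whether the paper handles this edge case, or simply averages degrees, is not stated in this summary.

```python
import math

def centroid(locations):
    # Mean of (lat, lon) points via 3-D unit vectors on the sphere.
    x = y = z = 0.0
    for lat, lon in locations:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
    n = len(locations)
    x, y, z = x / n, y / n, z / n
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))
```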

58 Uniform grids with small cells are not especially sensitive to this choice since the absolute distance between a center or centroid prediction will not be great, and empty cells are simply discarded. [sent-173, score-0.611]

59 Nonetheless, using the centroid has the benefit of making a uniform grid less sensitive to cell size, such that larger cells can be used more reliably (especially important when there are few training documents). [sent-174, score-0.793]

60 In contrast, when choosing representative locations for the leaves of a k-d tree, it is quite important to use the centroid because the leaves necessarily span the entire earth and none are discarded (since all have a roughly similar number of documents in them). [sent-175, score-0.786]

61 Using the centroid allows these large leaves to be in the mix, while still predicting the locations in them that have the greatest document density. [sent-177, score-0.569]

62 W&B refers to a uniform grid and geographic-center location selection, UNIFCENTROID to a uniform grid with centroid location selection, KDCENTROID to a k-d tree grid with centroid location selection, and UNIFKDCENTROID to the union of pseudo-documents constructed by UNIFCENTROID and KDCENTROID. [sent-181, score-1.786]

63 We also provide two baselines, both of which are based on a uniform grid with centroid location selection. [sent-182, score-0.67]

64 RANDOM predicts a grid cell chosen at random uniformly; MOSTCOMMONCELL always predicts the grid cell containing the most training documents. [sent-183, score-0.788]

65 The specific parameters are (1) the partition location method; (2) the bucket size for k-d partitioning; (3) the node division method for k-d partitioning; (4) the degree size for uniform grid partitioning. [sent-197, score-0.988]

66 Development set results show that the centroid always performs better than the center for all datasets, typically by a wide margin (especially for large partition sizes). [sent-200, score-0.244]

67 Larger bucket sizes tend to produce larger leaves, so documents in a partition will have a higher average distance to the center or centroid point. [sent-205, score-0.802]

68 Conversely, small bucket sizes lead to fewer training documents per partition. [sent-207, score-0.525]

69 A bucket size of one reduces to the situation where no pseudo-documents are used. [sent-208, score-0.379]

70 The graphs in Figure 3 show development set performance when varying bucket size. [sent-213, score-0.339]

71 In the case of plateaus, as was common with the FRIEDMAN method, we chose the middle of the plateau as the bucket size. [sent-215, score-0.339]

72 Overall, we found optimal bucket sizes of 100 for GEOWIKI, 530 for GEOTEXT, 460 for UTGEO2011-SMALL, and 1050 for UTGEO2011-LARGE. [sent-216, score-0.38]

73 That the Wikipedia data requires a smaller bucket size is unsurprising: the documents themselves are generally longer and there are many more of them, so a small bucket size provides good coverage and granularity without sacrificing the ability to estimate good language models for each partition. [sent-217, score-0.947]

74 MIDPOINT is clearly better for GEOWIKI, while FRIEDMAN is better for GEOTEXT in the range of bucket sizes producing the best results. [sent-220, score-0.38]

75 Following W&B, we choose a cell degree size of 0. [sent-226, score-0.189]

76 The results obtained by W&B on GEOWIKI are already very strong, but we do see a clear improvement by changing from the center-based locations for pseudo-documents they used to the centroid-based locations we employ: mean error drops from 221 km to 181 km, and median error from 11. [sent-242, score-0.772]
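
The error distances in km are great-circle distances between predicted and gold coordinates; a standard haversine computation looks like the sketch below (the summary does not state the paper's exact formula or earth radius).

```python
import math

def great_circle_km(p, q, radius_km=6371.0):
    # Haversine distance in km between two (lat, lon) points in degrees.
    (lat1, lon1), (lat2, lon2) = p, q
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2.0) ** 2)
    return 2.0 * radius_km * math.asin(math.sqrt(a))
```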

77 Also, we reduce the mean error further to 176 km for the configuration that combines the uniform grid and the k-d partitions, though at the cost of increasing median error somewhat. [sent-245, score-0.703]

78 The numbers given for W&B were produced from their implementation, and correspond to uniform grid partitioning with locations from centers rather than centroids. [sent-255, score-0.677]

79 For GEOTEXT, the results show that the uniform grid with centroid locations is the most effective of our configurations. [sent-257, score-0.724]

80 (2011) by 69 km with respect to median error, but has 52 km worse performance than their model with respect to mean error. [sent-259, score-0.457]

81 With the small training set, error is worse than with GEOTEXT, reflecting the wider geographic scope of UTGEO2011. [sent-265, score-0.217]

82 KDCENTROID is much more effective than the uniform grids, but combining it with the uniform grid in UNIFKDCENTROID edges it out by a small amount. [sent-266, score-0.453]

83 The bucket size used with the large training set is double that for the small one, but there are many more leaves created since there are 42 times more training documents. [sent-269, score-0.461]

84 With the extra data, the model is able to adapt better to the dispersion of documents and still have strong language models for each leaf that work well even with our greedy winner-takes-all decision method. [sent-270, score-0.265]

85 (2010) limit themselves to users with at least 1,000 tweets, while we have an average of 85 tweets per user. [sent-274, score-0.225]

86 Their reported mean error distance of 862 km (versus our best mean of 860 km on UTGEO2011-LARGE) indicates that their performance is hurt by a relatively small number of extremely incorrect guesses, as ours appears to be. [sent-275, score-0.511]

87 Parameters, especially bucket size, need retuning as data increases, which we hope to estimate automatically in future work. Finally, we note that the KDCENTROID method was faster than other methods. [sent-278, score-0.339]

88 In many cases, landmarks in Australia or New Zealand are predicted in European locations with similarly-named landmarks, or vice versa, e.g. [sent-292, score-0.209]

89 Some of the other large errors stem from incorrect gold labels, in particular due to sign errors in latitude or longitude, which can place documents 10,000 or more km from their correct locations. [sent-297, score-0.377]

90 To investigate which words tend to be good indicators of location, we computed, for each word in a development set, the average error distance of documents containing that word. [sent-304, score-0.218]
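
A sketch of this per-word diagnostic, assuming the development set is available as (text, error_km) pairs:

```python
from collections import defaultdict

def word_error_profile(dev_docs):
    # For each word, average the error distance (km) of the documents
    # containing it; words with low averages are good location indicators.
    totals, counts = defaultdict(float), defaultdict(int)
    for text, error_km in dev_docs:
        for w in set(text.split()):  # count each word once per document
            totals[w] += error_km
            counts[w] += 1
    return {w: totals[w] / counts[w] for w in totals}
```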

91 8 Conclusion We have shown how to construct an adaptive grid with k-d trees that enables robust text geolocation and scales well to large training sets. [sent-318, score-0.583]

92 For example, the pseudo-document word distributions can be smoothed based on nearby documents or on the structure of the k-d tree itself. [sent-320, score-0.231]

93 We also expect predicting locations based on multiple most similar documents (kNN) to be more effective in predicting document location, as the second and third most similar training documents together may sometimes be a better estimation of its distribution than just the first alone. [sent-322, score-0.582]

94 Other possibilities include constructing multiple k-d trees using random subsets of the training data to reduce sensitivity to the bucket size. [sent-324, score-0.339]

95 (2005) show that roughly 70% of social network links can be described using geographic information and that the probability of a social link is inversely proportional to geographic distance. [sent-327, score-0.452]

96 (2010) verify these results on a much larger scale using geolocated Facebook profiles: their algorithm geolocates users with only the social graph and significantly outperforms IP-based geolocation systems. [sent-329, score-0.484]

97 (2012) also show that a combination of textual and social data can accurately geolocate individual tweets when scope is limited to a single city. [sent-332, score-0.25]

98 Tweets are temporally ordered and the geographic distance between consecutive tweeting events is constrained by the author’s movement. [sent-333, score-0.243]

99 For tweet-level geolocation, it will be useful to build on work in geolocation that considers the temporal dimension (Chen and Grauman, 2011; Kalogerakis et al. [sent-334, score-0.296]

100 “I’m eating a sandwich in Glasgow”: Modeling locations with tweets. [sent-408, score-0.209]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bucket', 0.339), ('geolocation', 0.296), ('geotext', 0.246), ('geowiki', 0.246), ('grid', 0.245), ('locations', 0.209), ('km', 0.183), ('geographic', 0.177), ('centroid', 0.166), ('location', 0.155), ('tweets', 0.152), ('cell', 0.149), ('kdcentroid', 0.148), ('midpoint', 0.148), ('documents', 0.145), ('cells', 0.129), ('grids', 0.113), ('friedman', 0.106), ('uniform', 0.104), ('earth', 0.102), ('unifcentroid', 0.099), ('partitioning', 0.089), ('document', 0.083), ('wing', 0.082), ('leaves', 0.082), ('di', 0.08), ('users', 0.073), ('twitter', 0.071), ('dispersion', 0.071), ('eisenstein', 0.07), ('geographically', 0.066), ('geolocated', 0.066), ('sadilek', 0.066), ('serdyukov', 0.066), ('unifkdcentroid', 0.066), ('median', 0.055), ('geographical', 0.051), ('geolocate', 0.049), ('geotagged', 0.049), ('grauman', 0.049), ('hays', 0.049), ('knn', 0.049), ('latitude', 0.049), ('longitude', 0.049), ('queens', 0.049), ('smvz', 0.049), ('social', 0.049), ('leaf', 0.049), ('image', 0.047), ('tree', 0.046), ('cheng', 0.046), ('baldridge', 0.046), ('granularity', 0.044), ('adaptive', 0.042), ('vision', 0.042), ('minutes', 0.042), ('user', 0.042), ('center', 0.041), ('sizes', 0.041), ('error', 0.04), ('size', 0.04), ('nearby', 0.04), ('discretize', 0.038), ('partition', 0.037), ('america', 0.036), ('mean', 0.036), ('tweet', 0.035), ('splitting', 0.035), ('wikipedia', 0.034), ('distance', 0.033), ('activist', 0.033), ('alexei', 0.033), ('bentley', 0.033), ('brooklyn', 0.033), ('comaniciu', 0.033), ('culturally', 0.033), ('efros', 0.033), ('gazetteer', 0.033), ('geodesic', 0.033), ('geolocating', 0.033), ('increments', 0.033), ('kalogerakis', 0.033), ('kinsella', 0.033), ('kulis', 0.033), ('ponte', 0.033), ('pseudodocuments', 0.033), ('roller', 0.033), ('sankaranarayanan', 0.033), ('sarat', 0.033), ('speriosu', 0.033), ('toponyms', 0.033), ('tweeting', 0.033), ('vanessa', 0.033), ('wormholes', 0.033), ('areas', 0.032), ('adapts', 0.03), ('configurations', 0.03), ('gi', 0.029), ('greatest', 0.029), ('node', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

Author: Stephen Roller ; Michael Speriosu ; Sarat Rallapalli ; Benjamin Wing ; Jason Baldridge

Abstract: The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. We also provide a better way of choosing the locations for pseudo-documents. We evaluate these strategies on existing Wikipedia and Twitter corpora, as well as a new, larger Twitter corpus. The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. The two grid constructions can also be combined to produce consistently strong results across all training sets.

2 0.056390766 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

Author: Annie Louis ; Ani Nenkova

Abstract: We introduce a model of coherence which captures the intentional discourse structure in text. Our work is based on the hypothesis that syntax provides a proxy for the communicative goal of a sentence and therefore the sequence of sentences in a coherent discourse should exhibit detectable structural patterns. Results show that our method has high discriminating power for separating out coherent and incoherent news articles reaching accuracies of up to 90%. We also show that our syntactic patterns are correlated with manual annotations of intentional structure for academic conference articles and can successfully predict the coherence of abstract, introduction and related work sections of these articles.

3 0.053754397 134 emnlp-2012-User Demographics and Language in an Implicit Social Network

Author: Katja Filippova

Abstract: We consider the task of predicting the gender of YouTube users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with accuracy above 90%. We also show that the gender can be predicted from language alone (89%). A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. We also investigate how the two views (linguistic and social) can be combined and analyse how prediction accuracy changes over different age groups.

4 0.052656744 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities

Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li

Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model smoothness of one state sequence, we propose a novel function which can capture the state context. The experiments on a large Twitter dataset shows our methods are very effective.

5 0.051352993 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

Author: Kristian Woodsend ; Mirella Lapata

Abstract: Multi-document summarization involves many aspects of content selection and surface realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

6 0.050264783 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

7 0.046517849 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

8 0.045667935 120 emnlp-2012-Streaming Analysis of Discourse Participants

9 0.043207046 86 emnlp-2012-Locally Training the Log-Linear Model for SMT

10 0.041401166 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

11 0.04116223 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

12 0.039624084 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

13 0.039260224 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems

14 0.03886224 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

15 0.038482405 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

16 0.038348779 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

17 0.036794178 1 emnlp-2012-A Bayesian Model for Learning SCFGs with Discontiguous Rules

18 0.036103103 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing

19 0.036038812 35 emnlp-2012-Document-Wide Decoding for Phrase-Based Statistical Machine Translation

20 0.035906762 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.144), (1, 0.031), (2, 0.006), (3, 0.046), (4, -0.062), (5, 0.007), (6, 0.006), (7, -0.012), (8, 0.067), (9, 0.037), (10, 0.051), (11, -0.054), (12, -0.116), (13, 0.031), (14, 0.014), (15, 0.017), (16, 0.042), (17, -0.03), (18, -0.002), (19, 0.031), (20, -0.099), (21, -0.024), (22, 0.125), (23, -0.047), (24, 0.046), (25, 0.063), (26, -0.129), (27, 0.029), (28, 0.027), (29, 0.085), (30, -0.021), (31, 0.008), (32, 0.102), (33, -0.056), (34, 0.006), (35, -0.014), (36, -0.026), (37, 0.068), (38, -0.109), (39, -0.063), (40, -0.021), (41, -0.074), (42, -0.061), (43, -0.059), (44, -0.12), (45, 0.181), (46, 0.047), (47, 0.286), (48, 0.011), (49, 0.206)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93937796 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

Author: Stephen Roller ; Michael Speriosu ; Sarat Rallapalli ; Benjamin Wing ; Jason Baldridge

Abstract: The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. We also provide a better way of choosing the locations for pseudo-documents. We evaluate these strategies on existing Wikipedia and Twitter corpora, as well as a new, larger Twitter corpus. The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. The two grid constructions can also be combined to produce consistently strong results across all training sets.

2 0.46970147 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads: singly-linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

3 0.44690686 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities

Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li

Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model smoothness of one state sequence, we propose a novel function which can capture the state context. The experiments on a large Twitter dataset shows our methods are very effective.

4 0.41038606 139 emnlp-2012-Word Salad: Relating Food Prices and Descriptions

Author: Victor Chahuneau ; Kevin Gimpel ; Bryan R. Routledge ; Lily Scherlis ; Noah A. Smith

Abstract: We investigate the use of language in food writing, specifically on restaurant menus and in customer reviews. Our approach is to build predictive models of concrete external variables, such as restaurant menu prices. We make use of a dataset of menus and customer reviews for thousands of restaurants in several U.S. cities. By focusing on prediction tasks and doing our analysis at scale, our methodology allows quantitative, objective measurements of the words and phrases used to describe food in restaurants. We also explore interactions in language use between menu prices and sentiment as expressed in user reviews.

5 0.39734089 86 emnlp-2012-Locally Training the Log-Linear Model for SMT

Author: Lemao Liu ; Hailong Cao ; Taro Watanabe ; Tiejun Zhao ; Mo Yu ; Conghui Zhu

Abstract: In statistical machine translation, minimum error rate training (MERT) is a standard method for tuning a single weight with regard to a given development data. However, due to the diversity and uneven distribution of source sentences, there are two problems suffered by this method. First, its performance is highly dependent on the choice of a development set, which may lead to an unstable performance for testing. Second, translations become inconsistent at the sentence level since tuning is performed globally on a document level. In this paper, we propose a novel local training method to address these two problems. Unlike a global training method, such as MERT, in which a single weight is learned and used for all the input sentences, we perform training and testing in one step by learning a sentencewise weight for each input sentence. We propose efficient incremental training methods to put the local training into practice. In NIST Chinese-to-English translation tasks, our local training method significantly outperforms MERT with the maximal improvements up to 2.0 BLEU points, meanwhile its efficiency is comparable to that of the global method.

6 0.3703109 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

7 0.36284542 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs

8 0.3433809 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

9 0.30820024 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

10 0.2982012 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

11 0.29278538 134 emnlp-2012-User Demographics and Language in an Implicit Social Network

12 0.28880811 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering

13 0.27103966 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing

14 0.25858462 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

15 0.25092745 36 emnlp-2012-Domain Adaptation for Coreference Resolution: An Adaptive Ensemble Approach

16 0.23577803 78 emnlp-2012-Learning Lexicon Models from Search Logs for Query Expansion

17 0.23178299 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

18 0.23005465 25 emnlp-2012-Bilingual Lexicon Extraction from Comparable Corpora Using Label Propagation

19 0.23001869 120 emnlp-2012-Streaming Analysis of Discourse Participants

20 0.22875866 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.019), (16, 0.022), (25, 0.014), (34, 0.072), (45, 0.014), (60, 0.066), (63, 0.066), (64, 0.017), (65, 0.024), (70, 0.019), (73, 0.012), (74, 0.036), (76, 0.052), (80, 0.017), (86, 0.027), (91, 0.413), (95, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73346788 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

Author: Stephen Roller ; Michael Speriosu ; Sarat Rallapalli ; Benjamin Wing ; Jason Baldridge

Abstract: The geographical properties of words have recently begun to be exploited for geolocating documents based solely on their text, often in the context of social media and online content. One common approach for geolocating texts is rooted in information retrieval. Given training documents labeled with latitude/longitude coordinates, a grid is overlaid on the Earth and pseudo-documents constructed by concatenating the documents within a given grid cell; then a location for a test document is chosen based on the most similar pseudo-document. Uniform grids are normally used, but they are sensitive to the dispersion of documents over the earth. We define an alternative grid construction using k-d trees that more robustly adapts to data, especially with larger training sets. We also provide a better way of choosing the locations for pseudo-documents. We evaluate these strategies on existing Wikipedia and Twitter corpora, as well as a new, larger Twitter corpus. The adaptive grid achieves competitive results with a uniform grid on small training sets and outperforms it on the large Twitter corpus. The two grid constructions can also be combined to produce consistently strong results across all training sets.

2 0.31598556 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

Author: Jianxing Yu ; Zheng-Jun Zha ; Tat-Seng Chua

Abstract: This paper proposes to generate appropriate answers for opinion questions about products by exploiting the hierarchical organization of consumer reviews. The hierarchy organizes product aspects as nodes following their parent-child relations. For each aspect, the reviews and corresponding opinions on this aspect are stored. We develop a new framework for opinion Questions Answering, which enables accurate question analysis and effective answer generation by making use of the hierarchy. In particular, we first identify the (explicit/implicit) product aspects asked in the questions and their sub-aspects by referring to the hierarchy. We then retrieve the corresponding review fragments relevant to the aspects from the hierarchy. In order to generate appropriate answers from the review fragments, we develop a multi-criteria optimization approach for answer generation by simultaneously taking into account review salience, coherence, diversity, and parent-child relations among the aspects. We conduct evaluations on 11 popular products in four domains. The evaluated corpus contains 70,359 consumer reviews and 220 questions on these products. Experimental results demonstrate the effectiveness of our approach.

3 0.31561387 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

4 0.31345788 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum

Abstract: We propose the weakly supervised Multi-Experts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.

5 0.31028157 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

Author: Michael J. Paul

Abstract: Recent work has explored the use of hidden Markov models for unsupervised discourse and conversation modeling, where each segment or block of text such as a message in a conversation is associated with a hidden state in a sequence. We extend this approach to allow each block of text to be a mixture of multiple classes. Under our model, the probability of a class in a text block is a log-linear function of the classes in the previous block. We show that this model performs well at predictive tasks on two conversation data sets, improving thread reconstruction accuracy by up to 15 percentage points over a standard HMM. Additionally, we show quantitatively that the induced word clusters correspond to speech acts more closely than baseline models.

6 0.30886346 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

7 0.30885211 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

8 0.30719978 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

9 0.30640426 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

10 0.30423495 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

11 0.30327296 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

12 0.30253801 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

13 0.30252612 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes

14 0.30233851 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

15 0.30221146 120 emnlp-2012-Streaming Analysis of Discourse Participants

16 0.30177647 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

17 0.30152705 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

18 0.30108541 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM

19 0.30075058 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields

20 0.30048811 97 emnlp-2012-Natural Language Questions for the Web of Data