27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

Source: pdf

Author: Song Feng ; Ritwik Banerjee ; Yejin Choi

Abstract: Many of the writing styles recognized in rhetorical and composition theories involve deep syntactic elements. However, most previous research on computational stylometric analysis has relied on shallow lexico-syntactic patterns. Some very recent work has shown that PCFG models can detect distributional differences in syntactic styles, but without offering much insight into exactly what constitutes the salient stylistic elements in sentence structure that characterize each author. In this paper, we present a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements. We present analytic insights with respect to the authorship attribution task in two different domains.

Reference: text

Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Some very recent work has shown that PCFG models can detect distributional differences in syntactic styles, but without offering much insight into exactly what constitutes the salient stylistic elements in sentence structure that characterize each author. [sent-3, score-0.487]

2 In this paper, we present a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements. [sent-4, score-0.494]

3 We present analytic insights with respect to the authorship attribution task in two different domains. [sent-5, score-0.814]

4 1 Introduction Many of the writing styles recognized in rhetorical and composition theories involve deep syntactic elements in style (e. [sent-6, score-0.579]

5 However, previous research on automatic authorship attribution and computational stylometric analysis has relied mostly on shallow lexico-syntactic patterns (e. [sent-9, score-0.871]

6 Some very recent work has shown that PCFG models can detect distributional differences in sentence structure in gender attribution (Sarawgi et al. [sent-16, score-0.282]

7 However, very little is understood about exactly what constitutes the salient stylistic elements in sentence structure that characterize each author. [sent-19, score-0.334]

8 Although the work of Wong and Dras (2011) extracted the production rules with the highest information gain, their analysis stops short of providing insight any deeper than what simple n-gram-level analysis could also provide. [sent-20, score-0.242]

9 One might even wonder whether PCFG models hinge mostly on leaf production rules, and whether there are indeed deep syntactic differences at all. [sent-21, score-0.383]

10 In contrast, a periodic sentence starts with subordinate phrases and clauses, suspending the most [Footnote 1: For instance, missing determiners in English text written by Chinese speakers, or simple n-gram anomalies such as frequent use of “according to” by Chinese speakers (Wong and Dras, 2011).] [sent-24, score-0.278]

11 2 Periodic sentences were favored in classical times, while loose sentences became more popular in the modern age. [sent-25, score-0.153]

12 [Table: loose vs. periodic sentence examples] Christopher Columbus finally reached the shores of San Salvador after months of uncertainty at sea, the threat of mutiny, and a shortage of food and water. [sent-32, score-0.368]

13 Hence, shallow lexico-syntactic analysis will not be able to catch the pronounced stylistic difference that is clear to a human reader. [sent-36, score-0.21]

14 One might wonder whether we could gain interesting insights simply by looking at the most discriminative production rules in PCFG trees. [sent-37, score-0.277]

15 To address this question, Table 1 shows the top ten most discriminative production rules for authorship attribution for scientific articles, ranked by LIBLINEAR (Fan et al. [sent-38, score-1.09]

16 Note that terminal production rules are excluded so as to focus directly on syntax. [sent-40, score-0.244]

17 We can also observe that none of the top 10 most discriminative production rules for Hobbs includes the SBAR tag, which represents subordinate clauses. [sent-46, score-0.263]

18 Can we unveil something more in deep syntactic structure that can characterize the collective syntactic difference between any two authors? [sent-48, score-0.23]

19 For instance, what can we say about distributional difference between loose and periodic sentences discussed earlier for each author? [sent-49, score-0.294]

20 In general, production rules in CFGs do not directly map to a wide variety of stylistic elements in rhetorical and composition theories. [sent-51, score-0.602]

21 This is only to be expected, however, partly because CFGs are not designed for stylometric analysis in the first place, and also because some syntactic elements can go beyond the scope of context-free grammars. [sent-52, score-0.296]

22 As an attempt to reduce this gap between modern statistical parsers and cognitively recognizable stylistic elements, we explore two complementary approaches: 1. [sent-53, score-0.21]

23 Translating some of the well-known stylistic elements of rhetorical theories into PCFG analysis (Section 3) . [sent-54, score-0.358]

24 2 Data For the empirical analysis of authorship attribution, we use two different datasets described below. [sent-59, score-0.499]

25 Since it is nearly impossible to determine the gold-standard authorship of a paper written by multiple authors, we select 10 authors who have published at least 8 single-authored papers. [sent-63, score-0.554]

26 5 Novels We collect 5 novels from 5 English authors: Charles Dickens, Edward Bulwer-Lytton, Jane Austen, Thomas Hardy and Walter Scott. [sent-65, score-0.198]

27 We point out that authorship attribution is fundamentally different from text categorization in that it is often practically impossible to collect more than several documents for each author. [sent-68, score-0.743]

28 Therefore, it is desirable for attribution algorithms to detect the authors based on very small samples. [sent-69, score-0.299]

29 6 Type-II Identification – Loose/Periodic: A sentence can also be classified as loose or periodic, and we present Algorithm 2 for this identification. [sent-83, score-0.149]

30 Table 3: Sentence Types (%) in scientific data. [sent-110, score-0.141]

31 identification, it labeled all loose sentences correctly, and achieved 90% accuracy on periodic sentences. [sent-111, score-0.294]

32 Discussion Tables 3 & 4 show the sentence type distribution in scientific data and novels, respectively. [sent-112, score-0.179]

33 Notice that all authors use loose sentences much more often than periodic sentences, a known trend in modern English. [sent-117, score-0.349]

34 In Table 4, we see the opposite trend among 19th-century novels: with the exception of Jane Austen, all authors utilize periodic sentences comparatively more often. [sent-118, score-0.238]

35 Can we determine authorship solely based on the distribution of sentence types? [sent-120, score-0.537]

36 [Footnote 8: Due to space limitations, we present analyses based on 4 authors from the scientific data.] [sent-125, score-0.196]

37 Table 4: Sentence Types (%) in Novels. 4 Syntactic Elements Based on Production Rules. In this section, we examine three different aspects of syntactic elements based on production rules. [sent-142, score-0.312]

38 1 Syntactic Variations We conjecture that the variety of syntactic structure, which most previous research in computational stylometry has not paid much attention to, provides an interesting insight into authorship. [sent-144, score-0.155]

39 One way to quantify the degree of syntactic variations is to count the unique production rules. [sent-145, score-0.226]

40 Our default setting is to exclude all lexicalized rules in the productions to focus directly on the syntactic variations. [sent-148, score-0.221]
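The counting of unique production rules described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes parses are available as Penn-Treebank-style bracketed strings, and the helper names `parse_tree`, `production_rules`, and `count_unique_rules` are hypothetical. By default it skips lexicalized (terminal) rules, mirroring the default setting mentioned in the text.

```python
def parse_tree(bracketed):
    """Parse a bracketed string such as "(S (NP (DT the)))" into
    nested (label, children) tuples; words are kept as plain strings."""
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return read_node()


def production_rules(tree, include_lexical=False):
    """List the production rules of a tree in pre-order.
    Rules whose right-hand side consists only of words are treated
    as lexicalized and skipped unless include_lexical is True."""
    label, children = tree
    rules = []
    is_lexical = all(isinstance(child, str) for child in children)
    if include_lexical or not is_lexical:
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        rules.append(f"{label} -> {rhs}")
    for child in children:
        if not isinstance(child, str):
            rules.extend(production_rules(child, include_lexical))
    return rules


def count_unique_rules(trees, include_lexical=False):
    """Number of distinct production rules used across a set of parses."""
    unique = set()
    for tree in trees:
        unique.update(production_rules(tree, include_lexical))
    return len(unique)
```

Calling `count_unique_rules` on all parsed sentences of one author gives a single number per author that can then be compared across authors; how the paper normalizes for differing amounts of text is not stated in the extracted sentences, so that step is left out here.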

41 In our experiments (Section 6) , however, we do augment the rules with (a) ancestor nodes to capture deeper syntactic structure and (b) lexical (leaf) nodes. [sent-149, score-0.178]

42 For instance, we find that McDon employs a wider variety of syntactic structure than others, while Lin’s writing exhibits relatively the least variation. [sent-151, score-0.198]

43 This indicates that Hobbs tends to use a certain subset of production rules much more frequently than Joshi. [sent-196, score-0.206]

44 Similarly, among novels, Jane Austen’s writing has the highest amount of variation, while Walter Scott’s writing style is the least varied. [sent-198, score-0.275]

45 It is interesting to note that the authors with highest coverage – Austen and Dickens – have much lower deviation in their syntactic structure when compared to Hardy and Scott. [sent-201, score-0.17]

46 This indicates that while Austen and Dickens consistently employ a wider variety of sentence structures in their writing, Hardy and Scott follow a relatively more uniform style with sporadic forays into diverse syntactic constructs. [sent-202, score-0.163]

47 1 give us a better and more general insight into the characteristics of each author, its ability to provide insight on deep syntactic structure is still limited, as it covers production rules at all levels of the tree. [sent-205, score-0.426]

48 Tables 6 and 7 present the most discriminative sentence outlines of each author in the scientific data and novels, respectively. [sent-209, score-0.279]

49 5 Syntactic Elements Based on Tree Topology In this section, we investigate quantitative techniques to capture stylistic elements in the tree [Footnote 9: The presence of “FRAG” is not surprising.] [sent-218, score-0.38]

50 10 Notice that sentence (1) is a loose sentence, and sentence (2) is periodic. [sent-246, score-0.187]

51 In general, loose sentences grow deep and unbalanced, while periodic sentences are relatively more balanced and wider. [sent-247, score-0.36]

52 For a tree t rooted at $N_R$ with a height n, let T be the set of leaf nodes, let F be the set of furcation nodes, and let $\xi(N_i, N_j)$ denote the length of the shortest path from $N_i$ to $N_j$. [sent-248, score-0.268]

53 Inspired by the work of Shao (1990), we analyze tree topology with the following four measurements: Leaf height ($h^T = \{h_i^T, N_i \in T\}$), where $h_i^T = \xi(N_i, N_R)$, $N_i \in T$. [sent-249, score-0.224]

54 For instance, the leaf height of “free” $\in T$ . [sent-250, score-0.184]

55 Furcation height ($h^F = \{h_i^F, N_i \in F\}$), where $h_i^F$ is the maximum leaf height within the subtree rooted at $N_i$. [sent-253, score-0.277]

56 In Figure 1, for example, the furcation height of the VP in Tree (2) (marked in triangle) is 3. [sent-254, score-0.13]
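The two height measurements defined above can be sketched in a few lines. This is only an illustration under stated assumptions: trees are nested `(label, children)` tuples with words as plain strings, words count as the leaves, a furcation node is taken to be any node with at least two children (the extracted text does not define it), and furcation height is measured from the furcation node itself down to its deepest leaf. The function names are hypothetical.

```python
def leaf_heights(tree, depth=0):
    """Return, for every leaf (word), the path length from the root
    of the whole tree down to that leaf."""
    if isinstance(tree, str):
        return [depth]
    _, children = tree
    heights = []
    for child in children:
        heights.extend(leaf_heights(child, depth + 1))
    return heights


def furcation_heights(tree):
    """Return (label, height) pairs for every branching node, where
    height is the longest path from that node to a leaf below it.
    Assumption: a furcation node is a node with two or more children."""
    result = []

    def visit(node):
        if isinstance(node, str):
            return 0                  # a leaf sits at height 0
        label, children = node
        height = 1 + max(visit(child) for child in children)
        if len(children) >= 2:
            result.append((label, height))
        return height

    visit(tree)
    return result


# A toy parse of "the cat sat", written directly as nested tuples.
example = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]),
                 ("VP", [("VBD", ["sat"])])])
```

Summary statistics over these lists, such as the maximum leaf height per sentence, are the kind of per-tree values that could feed a topology-based feature set; the exact aggregation used in the paper is not recoverable from the extracted sentences.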

57 Tree (1) and Tree (2) differ in that Tree (1) is highly unbalanced and grows deep, while Tree [Figure 1: Parsed trees] [Table metrics: # of tokens, $\max_i \{h_i^T\}$, $\max_i \{w_i^L\}$, $\max_i \{\sigma_i^H\}$, $\max_i \{\sigma_i^S\}$; Tree (1): 15, 11, 6, 4.] [sent-270, score-0.252]

58 These consist of simple production rules and other syntactic features based on tree-traversals. [sent-280, score-0.288]

59 These sets of production rules and syntax features are used to build SVM classifiers using LIBLINEAR (Fan et al. [sent-282, score-0.206]

60 We would like to point out that the latter configuration is of high practical importance in authorship attribution, since we may not always have sufficient training data in realistic situations, e. [sent-285, score-0.499]

61 Lexical tokens provide strong clues by creating features that are specific to each author: research topics in the scientific data, and proper nouns such as character names in novels. [sent-288, score-0.141]

62 Our experimental results (Tables 11 & 12) show that not only do deep syntactic features perform well on their own, but they also significantly improve over lexical features. [sent-291, score-0.148]

63 Feature sets: pr, synv, synh, synv+h, syn0, syn↓, synl, style11, p̂r, ∗. Features: pr: Rules excluding terminal productions. [sent-293, score-0.505]

64 , VBG → NP (for node VP). synv+h: synv ∪ synh. syn0: No tree traversal. [sent-304, score-0.275]

65 , {VP → VBG, VP → NP}. synl: syn↓ ∪ {edge to parent node}. style11: The set of 11 extra stylistic features. [sent-309, score-0.21]

66 6 values from the distribution of sentence types (Section 3) , and 5 topological metrics (Section 5) characterizing the height, width and imbalance of a tree. [sent-310, score-0.157]

67 Variations Each production rule is augmented with the grandparent node. [sent-311, score-0.184]

68 Illustration: pˆ r∗ denotes the set of production rules pr (including terminal productions) that are augmented with their grandparent nodes. [sent-314, score-0.348]
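A possible reading of the grandparent augmentation can be sketched as below. The assumptions are mine, not the paper's: the "grandparent" is taken relative to the right-hand-side children of a rule, which makes it the parent of the rule's left-hand side; the `^` separator and the `ROOT` placeholder for the top node are illustrative conventions; and `grandparent_rules` is a hypothetical name. Trees are nested `(label, children)` tuples with words as plain strings.

```python
def grandparent_rules(tree, parent_label="ROOT", include_terminals=False):
    """List production rules annotated with the label of the node above
    the rule's left-hand side (the grandparent of the right-hand side).
    Terminal productions are skipped unless include_terminals is True,
    which corresponds to the starred variant described in the text."""
    label, children = tree
    rules = []
    is_terminal = all(isinstance(child, str) for child in children)
    if include_terminals or not is_terminal:
        rhs = " ".join(c if isinstance(c, str) else c[0] for c in children)
        rules.append(f"{label}^{parent_label} -> {rhs}")
    for child in children:
        if not isinstance(child, str):
            rules.extend(grandparent_rules(child, label, include_terminals))
    return rules


# A toy parse of "the cat sat", written directly as nested tuples.
example = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]),
                 ("VP", [("VBD", ["sat"])])])
```

The annotated rule strings can be counted per document in the same way as plain production rules, so the two feature sets stay directly comparable.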

69 To quantify the amount of authorship information carried in the set style11, we experiment with an SVM classifier using only 11 features (one for each metric), and achieve accuracy of 42. [sent-315, score-0.499]

70 ) , and that the classification is based on just 11 features, this experiment demonstrates how effectively the tree topology statistics capture idiolects. [sent-319, score-0.131]

71 This is expected since tokens such as function words play an important role in determining authorship (e. [sent-321, score-0.499]

72 A more important observation, however, is that even after removing the leaf production rules, accuracy as high as 93% (scientific) and 92. [sent-325, score-0.235]

73 Table 11: Authorship attribution with 20% training data. [sent-385, score-0.244]

74 Also notice that using only production rules, we achieve higher accuracy in novels (90. [sent-388, score-0.342]

75 1%) , but the addition of style11 features yields better results with scientific data (93. [sent-389, score-0.141]

76 In the scientific dataset, increasing the amount of training data decreases the average performance difference between lexicalized and unlexicalized features: 13. [sent-392, score-0.176]

77 We further observe that with scientific data, increasing the amount of training data improves the average performance across all unlexicalized feature-sets from 50. [sent-398, score-0.141]

78 While authors such as Dickens or Hardy have their unique writing styles that a classifier can learn based on few documents, capturing idiolects in the more rigid domain of scientific writing is far from obvious with little training data. [sent-405, score-0.514]

79 Table 12: Authorship attribution with 80% training data. [sent-466, score-0.244]

80 Turning to lexicalized features, we note that with more training data, lexical cues perform better in the scientific domain than in novels. [sent-467, score-0.176]

81 Finally, we point out that adding the style features derived from sentence types and tree topologies almost always improves the performance. [sent-474, score-0.165]

82 In scientific data, synv∗+h with style11 features shows the best performance (96%) , while synl∗ yields the best results for novels (95. [sent-475, score-0.339]

83 7 Related Work There are several hurdles in authorship attribution. [sent-480, score-0.499]

84 , 2010), or classical literature such as novels and prose (e. [sent-484, score-0.24]

85 (2003) employed frequency measures on n-grams for authorship attribution. [sent-491, score-0.499]

86 The use of syntactic features from parse trees in authorship attribution was initiated by Baayen et al. [sent-495, score-0.825]

87 Syntactic features from PCFG parse trees have also been used for gender attribution (Sarawgi et al. [sent-498, score-0.244]

88 The primary focus of most previous research, however, was to attain better classification accuracy, rather than providing linguistic interpretations of individual authorship and their stylistic elements. [sent-501, score-0.709]

89 Our work is the first to attempt authorship attribution of scientific papers, a contemporary domain where language is very formal, and the stylistic variations have limited scope. [sent-502, score-1.094]

90 In addition to exploring this new domain, we also present a comparative study expounding the role of syntactic features for authorship attribution in classical literature. [sent-503, score-0.623]

91 Furthermore, our work is also the first to utilize tree topological features (Chan et al. [sent-504, score-0.141]

92 8 Conclusion In this paper, we have presented a comprehensive exploration of syntactic elements in writing styles, with particular emphasis on interpretable characterization of stylistic elements, thus distinguishing our work from other recent work on syntactic stylometric analysis. [sent-506, score-0.704]

93 Our analytical study provides novel statistically supported insights into stylistic elements that have not been computationally analyzed in previous literature. [sent-507, score-0.367]

94 Measuring the usefulness of function words for authorship attribution. [sent-515, score-0.499]

95 Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. [sent-532, score-0.581]

96 Delta: A measure of stylistic difference and a guide to likely authorship. [sent-572, score-0.21]

97 Bigrams of syntactic labels for authorship discrimination of short texts. [sent-608, score-0.581]

98 Authorship attribution and verification with many authors and limited data. [sent-661, score-0.299]

99 Language independent authorship attribution using character level language models. [sent-681, score-0.743]

100 A framework for authorship identification of online messages: Writing-style features and classification techniques. [sent-764, score-0.551]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('authorship', 0.499), ('attribution', 0.244), ('stylistic', 0.21), ('novels', 0.198), ('periodic', 0.183), ('synv', 0.165), ('production', 0.144), ('scientific', 0.141), ('nr', 0.138), ('stylometric', 0.128), ('synl', 0.128), ('writing', 0.116), ('loose', 0.111), ('austen', 0.11), ('stamatatos', 0.11), ('synh', 0.11), ('vp', 0.108), ('pcfg', 0.106), ('height', 0.093), ('dickens', 0.092), ('leaf', 0.091), ('ni', 0.089), ('styles', 0.086), ('elements', 0.086), ('syn', 0.086), ('tree', 0.084), ('syntactic', 0.082), ('literary', 0.082), ('luyckx', 0.073), ('mcdon', 0.073), ('insights', 0.071), ('deep', 0.066), ('pr', 0.064), ('argamon', 0.063), ('baayen', 0.063), ('hardy', 0.063), ('maxi', 0.063), ('rules', 0.062), ('sbar', 0.062), ('rhetorical', 0.062), ('imbalance', 0.062), ('subordinate', 0.057), ('topological', 0.057), ('authors', 0.055), ('garcia', 0.055), ('keselj', 0.055), ('ltkop', 0.055), ('ltop', 0.055), ('mosteller', 0.055), ('sarawgi', 0.055), ('outlines', 0.053), ('hobbs', 0.053), ('raghavan', 0.053), ('shlomo', 0.053), ('identification', 0.052), ('stroudsburg', 0.05), ('dras', 0.049), ('wong', 0.049), ('npp', 0.047), ('daelemans', 0.047), ('author', 0.047), ('jane', 0.047), ('topology', 0.047), ('np', 0.046), ('style', 0.043), ('productions', 0.042), ('classical', 0.042), ('horizontal', 0.041), ('grandparent', 0.04), ('sentence', 0.038), ('terminal', 0.038), ('composition', 0.038), ('walter', 0.037), ('ss', 0.037), ('brook', 0.037), ('diederich', 0.037), ('furcation', 0.037), ('halteren', 0.037), ('hobbsjoshilinmcdon', 0.037), ('houvardas', 0.037), ('mutiny', 0.037), ('pighin', 0.037), ('sebtaurrn', 0.037), ('shores', 0.037), ('stony', 0.037), ('strunk', 0.037), ('stylometry', 0.037), ('threat', 0.037), ('vlado', 0.037), ('wallace', 0.037), ('insight', 0.036), ('lexicalized', 0.035), ('wl', 0.035), ('yejin', 0.035), ('rse', 0.035), ('nodes', 0.034), ('nl', 0.033), ('peng', 0.033), ('vbg', 0.033), ('deviation', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999875 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

Author: Song Feng ; Ritwik Banerjee ; Yejin Choi

2 0.09619464 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

Author: Annie Louis ; Ani Nenkova

Abstract: We introduce a model of coherence which captures the intentional discourse structure in text. Our work is based on the hypothesis that syntax provides a proxy for the communicative goal of a sentence and therefore the sequence of sentences in a coherent discourse should exhibit detectable structural patterns. Results show that our method has high discriminating power for separating out coherent and incoherent news articles reaching accuracies of up to 90%. We also show that our syntactic patterns are correlated with manual annotations of intentional structure for academic conference articles and can successfully predict the coherence of abstract, introduction and related work sections of these articles.

3 0.092510298 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

Author: Jonathan K. Kummerfeld ; David Hall ; James R. Curran ; Dan Klein

Abstract: Constituency parser performance is primarily interpreted through a single metric, F-score on WSJ section 23, that conveys no linguistic information regarding the remaining errors. We classify errors within a set of linguistically meaningful types using tree transformations that repair groups of errors together. We use this analysis to answer a range of questions about parser behaviour, including what linguistic constructions are difficult for state-of-the-art parsers, what types of errors are being resolved by rerankers, and what types are introduced when parsing out-of-domain text.

4 0.078171536 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

Author: Kristian Woodsend ; Mirella Lapata

Abstract: Multi-document summarization involves many aspects of content selection and surface realization. The summaries must be informative, succinct, grammatical, and obey stylistic writing conventions. We present a method where such individual aspects are learned separately from data (without any hand-engineering) but optimized jointly using an integer linear programme. The ILP framework allows us to combine the decisions of the expert learners and to select and rewrite source content through a mixture of objective setting, soft and hard constraints. Experimental results on the TAC-08 data set show that our model achieves state-of-the-art performance using ROUGE and significantly improves the informativeness of the summaries.

5 0.077822395 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson

Abstract: The task of inferring the native language of an author based on texts written in a second language has generally been tackled as a classification problem, typically using as features a mix of n-grams over characters and part of speech tags (for small and fixed n) and unigram function words. To capture arbitrarily long n-grams that syntax-based approaches have suggested are useful, adaptor grammars have some promise. In this work we investigate their extension to identifying n-gram collocations of arbitrary length over a mix of PoS tags and words, using both maxent and induced syntactic language model approaches to classification. After presenting a new, simple baseline, we show that learned collocations used as features in a maxent model perform better still, but that the story is more mixed for the syntactic language model.

6 0.067963004 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

7 0.063525006 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution

8 0.057468463 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing

9 0.055748723 37 emnlp-2012-Dynamic Programming for Higher Order Parsing of Gap-Minding Trees

10 0.053884383 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision

11 0.052564584 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures

12 0.051525831 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

13 0.049100891 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition

14 0.046878364 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

15 0.045507662 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures

16 0.04449863 115 emnlp-2012-SSHLDA: A Semi-Supervised Hierarchical Topic Model

17 0.043257795 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features

18 0.042433787 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis

19 0.041697673 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

20 0.04143079 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, -0.037), (2, 0.053), (3, 0.012), (4, 0.015), (5, 0.043), (6, -0.017), (7, -0.009), (8, -0.047), (9, 0.059), (10, -0.051), (11, 0.018), (12, -0.158), (13, 0.14), (14, 0.029), (15, 0.005), (16, 0.092), (17, -0.061), (18, -0.042), (19, -0.017), (20, -0.062), (21, 0.097), (22, 0.162), (23, -0.051), (24, -0.022), (25, 0.092), (26, 0.058), (27, 0.096), (28, 0.044), (29, -0.129), (30, -0.214), (31, 0.087), (32, 0.031), (33, 0.06), (34, -0.289), (35, -0.092), (36, -0.003), (37, -0.127), (38, -0.18), (39, -0.055), (40, 0.087), (41, -0.002), (42, -0.03), (43, 0.034), (44, -0.217), (45, -0.034), (46, -0.083), (47, -0.116), (48, 0.012), (49, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93663818 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

Author: Song Feng ; Ritwik Banerjee ; Yejin Choi

2 0.60103089 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution

Author: Timothy O'Keefe ; Silvia Pareti ; James R. Curran ; Irena Koprinska ; Matthew Honnibal

Abstract: Quote extraction and attribution is the task of automatically extracting quotes from text and attributing each quote to its correct speaker. The present state-of-the-art system uses gold standard information from previous decisions in its features, which, when removed, results in a large drop in performance. We treat the problem as a sequence labelling task, which allows us to incorporate sequence features without using gold standard information. We present results on two new corpora and an augmented version of a third, achieving a new state-of-the-art for systems using only realistic features.

3 0.51105022 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification

Author: Sze-Meng Jojo Wong ; Mark Dras ; Mark Johnson

4 0.40397882 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

Author: Annie Louis ; Ani Nenkova

5 0.38808873 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision

Author: Joohyun Kim ; Raymond Mooney

Abstract: “Grounded” language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. Börschinger et al. (2011) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (2011). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrate the effectiveness of our approach.

6 0.34025934 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output

7 0.31623366 37 emnlp-2012-Dynamic Programming for Higher Order Parsing of Gap-Minding Trees

8 0.30545077 94 emnlp-2012-Multiple Aspect Summarization Using Integer Linear Programming

9 0.26367116 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

10 0.25076982 59 emnlp-2012-Generating Non-Projective Word Order in Statistical Linearization

11 0.24860297 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

12 0.24755037 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis

13 0.2443132 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

14 0.23991071 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

15 0.23414569 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures

16 0.22572495 17 emnlp-2012-An "AI readability" Formula for French as a Foreign Language

17 0.2184141 10 emnlp-2012-A Statistical Relational Learning Approach to Identifying Evidence Based Medicine Categories

18 0.20548499 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation

19 0.203187 120 emnlp-2012-Streaming Analysis of Discourse Participants

20 0.19525027 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.021), (11, 0.019), (16, 0.026), (34, 0.061), (45, 0.018), (60, 0.08), (63, 0.068), (64, 0.029), (65, 0.024), (70, 0.027), (73, 0.03), (74, 0.066), (76, 0.06), (80, 0.013), (86, 0.024), (88, 0.335), (95, 0.015)]
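
The page does not say how the simValue scores in the list that follows are derived from these topic weights. As a rough illustration only, the sketch below assumes cosine similarity over sparse (topicId, topicWeight) lists. The function name `cosine_sparse` and the second vector `other` are hypothetical and do not come from the source.

```python
import math


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors given as (topicId, weight) pairs."""
    da, db = dict(a), dict(b)
    # Only topics present in both vectors contribute to the dot product.
    dot = sum(w * db[t] for t, w in da.items() if t in db)
    norm_a = math.sqrt(sum(w * w for w in da.values()))
    norm_b = math.sqrt(sum(w * w for w in db.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# LDA topic weights listed above for this paper.
this_paper = [(2, 0.021), (11, 0.019), (16, 0.026), (34, 0.061), (45, 0.018),
              (60, 0.08), (63, 0.068), (64, 0.029), (65, 0.024), (70, 0.027),
              (73, 0.03), (74, 0.066), (76, 0.06), (80, 0.013), (86, 0.024),
              (88, 0.335), (95, 0.015)]

# Hypothetical weights for some other paper (made up for illustration).
other = [(34, 0.05), (88, 0.30), (90, 0.10)]

print(round(cosine_sparse(this_paper, other), 3))
```

Under this measure a vector compared with itself scores 1.0, so the same-paper value below 1 in the list that follows indicates that the site's scores are not plain cosine over the weights shown here.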

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76965785 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

Author: Song Feng ; Ritwik Banerjee ; Yejin Choi

2 0.64955735 13 emnlp-2012-A Unified Approach to Transliteration-based Text Input with Online Spelling Correction

Author: Hisami Suzuki ; Jianfeng Gao

Abstract: This paper presents an integrated, end-to-end approach to online spelling correction for text input. Online spelling correction refers to spelling correction as you type, as opposed to post-editing. The online scenario is particularly important for languages that routinely use transliteration-based text input methods, such as Chinese and Japanese, because the desired target characters cannot be input at all unless they are in the list of candidates provided by an input method, and spelling errors prevent them from appearing in the list. For example, a user might type suesheng by mistake to mean xuesheng 学生 'student' in Chinese; existing input methods fail to convert this misspelled input to the desired target Chinese characters. In this paper, we propose a unified approach to the problem of spelling correction and transliteration-based character conversion using an approach inspired by the phrase-based statistical machine translation framework. At the phrase (substring) level, k most probable pinyin (Romanized Chinese) corrections are generated using a monotone decoder; at the sentence level, input pinyin strings are directly transliterated into target Chinese characters by a decoder using a log-linear model that refers to the features of both levels. A new method of automatically deriving parallel training data from user keystroke logs is also presented. Experiments on Chinese pinyin conversion show that our integrated method reduces the character error rate by 20% (from 8.9% to 7.12%) over the previous state-of-the-art based on a noisy channel model.

3 0.41363901 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

4 0.4117488 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

Author: Valentin I. Spitkovsky ; Hiyan Alshawi ; Daniel Jurafsky

Abstract: We present a new family of models for unsupervised parsing, Dependency and Boundary models, that use cues at constituent boundaries to inform head-outward dependency tree generation. We build on three intuitions that are explicit in phrase-structure grammars but only implicit in standard dependency formulations: (i) Distributions of words that occur at sentence boundaries such as English determiners resemble constituent edges. (ii) Punctuation at sentence boundaries further helps distinguish full sentences from fragments like headlines and titles, allowing us to model grammatical differences between complete and incomplete sentences. (iii) Sentence-internal punctuation boundaries help with longer-distance dependencies, since punctuation correlates with constituent edges. Our models induce state-of-the-art dependency grammars for many languages without special knowledge of optimal input sentence lengths or biased, manually-tuned initializers.

5 0.40635857 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

Author: Lizhen Qu ; Rainer Gemulla ; Gerhard Weikum

Abstract: We propose the weakly supervised Multi-Experts Model (MEM) for analyzing the semantic orientation of opinions expressed in natural language reviews. In contrast to most prior work, MEM predicts both opinion polarity and opinion strength at the level of individual sentences; such fine-grained analysis helps to understand better why users like or dislike the entity under review. A key challenge in this setting is that it is hard to obtain sentence-level training data for both polarity and strength. For this reason, MEM is weakly supervised: It starts with potentially noisy indicators obtained from coarse-grained training data (i.e., document-level ratings), a small set of diverse base predictors, and, if available, small amounts of fine-grained training data. We integrate these noisy indicators into a unified probabilistic framework using ideas from ensemble learning and graph-based semi-supervised learning. Our experiments indicate that MEM outperforms state-of-the-art methods by a significant margin.

6 0.40233192 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

7 0.40222564 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

8 0.39958152 51 emnlp-2012-Extracting Opinion Expressions with semi-Markov Conditional Random Fields

9 0.39595118 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

10 0.3948991 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

11 0.39442813 122 emnlp-2012-Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts

12 0.39419678 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

13 0.39401633 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

14 0.39282542 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars

15 0.39241192 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries

16 0.39231399 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

17 0.39114523 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

18 0.39071724 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

19 0.39036098 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

20 0.38875827 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes