acl acl2010 acl2010-34 knowledge-graph by maker-knowledge-mining

34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars


Source: pdf

Author: Sindhu Raghavan ; Adriana Kovashka ; Raymond Mooney

Abstract: In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. [sent-3, score-0.75]

2 Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. [sent-4, score-0.306]

3 We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy. [sent-5, score-0.215]

4 In the context of written text, such as newspaper articles or short stories, the author’s style could be considered a distinct “language.” [sent-7, score-0.209]

5 Authorship attribution, also referred to as authorship identification or prediction, studies strategies for discriminating between the styles of different authors. [sent-8, score-0.589]

6 The general approach to authorship attribution is to extract a number of style markers from the text and use these style markers as features to train a classifier (Burrows, 1987; Binongo and Smith, 1999; Diederich et al. [sent-11, score-1.03]

7 These style markers could include the frequencies of certain characters, function words, phrases or sentences. [sent-13, score-0.102]

8 (2003) build a character-level n-gram model for each author. [sent-15, score-0.071]

9 (1996) demonstrate that the use of syntactic features from parse trees can improve the accuracy of authorship attribution. [sent-19, score-0.64]

10 While there have been several approaches proposed for authorship attribution, it is not clear if the performance of one is better than the other. [sent-20, score-0.591]

11 For more information on the current state of the art for authorship attribution, we refer the reader to a detailed survey by Stamatatos (2009). [sent-22, score-0.579]

12 Our approach involves building a probabilistic context-free grammar (PCFG) for each author and using this grammar as a language model for classification. [sent-24, score-0.306]

13 Experiments on a variety of corpora including poetry and newspaper articles on a number of topics demonstrate that our PCFG approach performs fairly well, but it only outperforms a bigram language model on a couple of datasets (e. [sent-25, score-0.752]

14 However, combining our approach with other methods results in an ensemble that performs the best on most datasets. [sent-28, score-0.141]

15 2 Authorship Attribution using PCFG We now describe our approach to authorship attribution. [sent-29, score-0.556]

16 Given a training set of documents from different authors, we build a PCFG for each author based on the documents they have written. [sent-30, score-0.482]

17 Given a test document, we parse it using each author’s grammar and assign it to the author whose PCFG produced the highest likelihood for the document. [sent-31, score-0.254]
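The decision rule described here can be written compactly. This is a sketch in notation chosen for illustration: d is the test document, s_1, ..., s_n are its sentences, G_i is the grammar built for author A_i, and t*_ij is the top parse of sentence s_j under G_i. Working in log space turns the product of per-sentence probabilities into a sum, which avoids numerical underflow on long documents.

```latex
\hat{A}(d) \;=\; \arg\max_{A_i} \prod_{j=1}^{n} P_{G_i}\!\left(t^{*}_{ij}\right)
\;=\; \arg\max_{A_i} \sum_{j=1}^{n} \log P_{G_i}\!\left(t^{*}_{ij}\right),
\qquad
t^{*}_{ij} \;=\; \arg\max_{t \,:\, \mathrm{yield}(t) = s_j} P_{G_i}(t)
```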

18 In order to build a PCFG, a standard statistical parser takes a corpus of parse trees of sentences as training input. [sent-32, score-0.107]

19 Since we do not have access to authors’ documents annotated with parse trees, we use a statistical parser trained on a generic [page 38 footer: Proceedings of the ACL 20…, Uppsala, Sweden, 11-16 July 2010] [sent-33, score-0.167]

20 Our approach is summarized below. Input: A training set of documents labeled with author names and a test set of documents with unknown authors. [sent-44, score-0.337]

21 Treebank each training document using the parser trained in Step 1. [sent-48, score-0.09]

22 Train a PCFG Gi for each author Ai using the treebanked documents for that author. [sent-50, score-0.31]

23 For each test document, compute its likelihood for each grammar Gi by multiplying the probability of the top PCFG parse for each sentence. [sent-52, score-0.06]

24 For each test document, find the author Ai whose grammar Gi results in the highest likelihood score. [sent-54, score-0.229]
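The training and classification steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: trees are hand-written nested tuples rather than parser output, rule probabilities are plain relative frequencies, unseen rules receive a small floor probability (an assumption of this sketch), and each test tree is scored as given instead of being re-parsed with every author's grammar. All function names are hypothetical.

```python
import math
from collections import Counter


def rules(tree):
    """Yield (lhs, rhs) productions from a tree given as nested tuples.

    A tree is (label, child, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)


def train_pcfg(trees):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for t in trees for r in rules(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}


def log_likelihood(doc_trees, grammar, floor=1e-6):
    """Sum of log rule probabilities over all sentence trees in a document.

    Rules missing from the grammar get the floor probability (sketch assumption).
    """
    return sum(math.log(grammar.get(r, floor))
               for t in doc_trees for r in rules(t))


def attribute(doc_trees, grammars):
    """Return the author whose grammar gives the highest log-likelihood."""
    return max(grammars, key=lambda a: log_likelihood(doc_trees, grammars[a]))


# Toy treebanks for two hypothetical authors.
t1 = ("S", ("NP", "she"), ("VP", ("V", "runs")))
t2 = ("S", ("VP", ("V", "run")), ("NP", "you"))
grammars = {"A": train_pcfg([t1, t1]), "B": train_pcfg([t2])}
print(attribute([t1], grammars))
```

In a fuller version, the scoring step would run a Viterbi parser with each author's grammar over the raw test sentences, so that the score is the probability of the best parse under that grammar.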

25 1 Data We collected a variety of documents with known authors including news articles on a wide range of topics and literary works like poetry. [sent-58, score-0.435]

26 We downloaded all texts from the Internet and manually removed extraneous information as well as titles, author names, and chapter headings. [sent-59, score-0.217]

27 We collected several news articles from the New York Times online journal (http://global. [sent-60, score-0.153]

28 We also collected news articles on cricket from the ESPN cricinfo website (http://www. [sent-63, score-0.391]

29 In addition, we collected poems from the Project Gutenberg website (http://www. [sent-66, score-0.082]

30 We attempted to collect sets of documents on a shared topic written by multiple authors. [sent-69, score-0.181]

31 This was done to ensure that the datasets truly tested authorship attribution as opposed to topic identification. [sent-70, score-0.981]

32 However, since it is very difficult to find authors that write literary works on the same topic, the Poetry dataset exhibits higher topic variability than our news datasets. [sent-71, score-0.293]

33 We had 5 different datasets in total: Football, Business, Travel, Cricket, and Poetry. [sent-72, score-0.153]

34 The number of authors in our datasets ranged from 3 to 6. [sent-73, score-0.211]

35 For each dataset, we split the documents into training and test sets. [sent-74, score-0.143]

36 , 1999) have observed that having an unequal number of words per author in the training set leads to poor performance for the authors with fewer words. [sent-76, score-0.339]

37 Therefore, we ensured that, in the training set, the total number of words per author was roughly the same. [sent-77, score-0.246]

38 We would like to note that we could have also selected the training set such that the total number of sentences per author was roughly the same. [sent-78, score-0.246]

39 However, since we would like to compare the performance of the PCFG-based approach with a bag-of-words baseline, we decided to normalize the training set based on the number of words, rather than sentences. [sent-79, score-0.062]
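One way to normalize a training set by word count is sketched below. The selection procedure is an assumption for illustration (the text does not say how documents were chosen): whole documents are kept per author until the word total of the smallest author would be exceeded. The function name and the whitespace tokenization are hypothetical.

```python
def balance_by_words(docs_by_author):
    """Keep whole documents per author up to the smallest author's word total.

    docs_by_author maps an author name to a list of document strings.
    The first document of each author is always kept, so totals are only
    roughly equal.
    """
    totals = {a: sum(len(d.split()) for d in docs)
              for a, docs in docs_by_author.items()}
    budget = min(totals.values())
    balanced = {}
    for author, docs in docs_by_author.items():
        kept, used = [], 0
        for doc in docs:
            n = len(doc.split())
            if kept and used + n > budget:
                break
            kept.append(doc)
            used += n
        balanced[author] = kept
    return balanced


corpus = {"A": ["a b c", "d e f", "g h i"], "B": ["x y z w"]}
print(balance_by_words(corpus))
```

Swapping `len(d.split())` for a sentence count would give the alternative, sentence-based normalization mentioned in the text.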

40 For testing, we used 15 documents per author for datasets with news articles and 5 or 10 documents per author for the Poetry dataset. [sent-80, score-0.947]

41 More details about the datasets can be found in Table 1. [sent-81, score-0.153]

42 2 Methodology We evaluated our approach to authorship prediction on the five datasets described above. [sent-88, score-0.739]

43 For news articles, we used the first 10 sections of the WSJ corpus, which consists of annotated news articles on finance, to build the initial statistical parser in Step 1. [sent-89, score-0.233]

44 For Poetry, we used 7 sections of the Brown corpus, which consists of annotated documents from different areas of literature. [sent-90, score-0.116]

45 In the basic approach, we trained a PCFG model for each author based solely on the documents written by that author. [sent-91, score-0.425]

46 However, since the number of documents per author is relatively low, this leads to very sparse training data. [sent-92, score-0.362]

47 We refer to this model as “PCFG-I”, where I stands for interpolation since this effectively exploits linear interpolation with the base corpus to smooth parameters. [sent-94, score-0.126]

48 However, for the task of authorship prediction, we hypothesized that the frequency of specific stop words could provide useful information about the author’s writing style. [sent-98, score-0.634]
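To see why mixing an author's data with a base corpus amounts to linear interpolation, consider a single multinomial distribution estimated from counts. The sketch below is illustrative only: the replication factor `k` and both function names are assumptions, and for a PCFG the same identity holds separately for each left-hand-side nonterminal, with a weight that depends on that nonterminal's counts.

```python
from collections import Counter


def mixed_probs(author_counts, base_counts, k=1):
    """Relative frequencies after pooling k copies of the author's counts
    with the base-corpus counts."""
    pooled = Counter()
    for event, n in author_counts.items():
        pooled[event] += k * n
    for event, n in base_counts.items():
        pooled[event] += n
    total = sum(pooled.values())
    return {event: n / total for event, n in pooled.items()}


def interpolated_probs(author_counts, base_counts, lam):
    """Explicit linear interpolation: lam * P_author + (1 - lam) * P_base."""
    na, nb = sum(author_counts.values()), sum(base_counts.values())
    events = set(author_counts) | set(base_counts)
    return {e: lam * author_counts.get(e, 0) / na
               + (1 - lam) * base_counts.get(e, 0) / nb
            for e in events}


author = {"x": 3, "y": 1}
base = {"x": 2, "y": 6, "z": 2}
k = 2
na, nb = sum(author.values()), sum(base.values())
lam = k * na / (k * na + nb)  # interpolation weight implied by the pooling

pooled = mixed_probs(author, base, k)
interp = interpolated_probs(author, base, lam)
print(all(abs(pooled[e] - interp[e]) < 1e-12 for e in pooled))
```

Pooling counts therefore gives the same estimate as interpolating the two relative-frequency distributions with weight lam = k·n_author / (k·n_author + n_base), so events unseen in the author's data still receive probability mass from the base corpus.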

49 We surmised that a discriminative classifier like MaxEnt might perform better than a generative classifier like Naive Bayes. [sent-101, score-0.128]

50 However, when sufficient training data is not available, generative models are known to perform better than discriminative models (Ng and Jordan, 2001). [sent-102, score-0.097]

51 We used the same background corpus mixing method used for the PCFG-I model to effectively smooth the n-gram models. [sent-106, score-0.096]

52 Since a generative model like Naive Bayes that uses n-gram frequencies is equivalent to an n-gram language model, we also used the Naive Bayes classifier in MALLET to implement the n-gram models. [sent-107, score-0.107]

53 Note that a Naive-Bayes bag-of-words model is equivalent to a unigram language model. [sent-108, score-0.08]
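The equivalence can be made concrete with a small sketch: a multinomial Naive Bayes classifier over bag-of-words counts scores a document by summing per-word log-probabilities, which is exactly what a unigram language model does. The add-one smoothing, the whitespace tokenization, and the toy data are assumptions of this illustration and do not reflect MALLET's internals.

```python
import math
from collections import Counter


def train_unigram(docs, vocab):
    """Add-one smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: math.log((counts[w] + 1) / total) for w in vocab}


def classify(doc, models, log_priors):
    """Naive Bayes decision: argmax of log prior + sum of word log-probs.

    Words outside the vocabulary are ignored.
    """
    def score(author):
        lm = models[author]
        return log_priors[author] + sum(lm[w] for w in doc.split() if w in lm)
    return max(models, key=score)


train = {"A": ["the cat sat", "the cat ran"],
         "B": ["a dog barked", "a dog ran"]}
vocab = {w for docs in train.values() for d in docs for w in d.split()}
models = {a: train_unigram(docs, vocab) for a, docs in train.items()}
log_priors = {a: math.log(1 / len(train)) for a in train}
print(classify("the cat", models, log_priors))
```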

54 While the PCFG model captures the author’s writing style at the syntactic level, it may not accurately capture lexical information. [sent-109, score-0.184]

55 Since both syntactic and lexical information is presumably useful in capturing the author’s overall writing style, we also developed an ensemble using a PCFG model, the bag-of-words MaxEnt classifier, and an n-gram language model. [sent-110, score-0.2]

56 We also developed another ensemble based on MaxEnt and n-gram language models to demonstrate the contribution of the PCFG model to the overall performance of PCFG-E. [sent-113, score-0.187]

57 For each dataset, we report accuracy, the fraction of the test documents whose authors were correctly identified. [sent-114, score-0.174]
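The accuracy measure defined here is simple to compute; a minimal sketch with a hypothetical function name and toy labels follows.

```python
def accuracy(predicted, gold):
    """Fraction of test documents whose predicted author matches the true author."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Three of the four toy predictions match the gold labels.
print(accuracy(["A", "B", "B", "C"], ["A", "B", "C", "C"]))
```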

58 3 Results and Discussion Table 2 shows the accuracy of authorship prediction on different datasets. [sent-116, score-0.586]

59 For the n-gram models, we only report the results for the bigram model with smoothing (Bigram-I) as it was the best performing model for most datasets (except for Cricket and Poetry). [sent-117, score-0.352]

60 For the Cricket dataset, the trigram-I model was the best performing n-gram model with an accuracy of 98. [sent-118, score-0.183]

61 Generally, a higher order n-gram model (n = 3 or higher) performs poorly as it requires a fair amount of smoothing due to the exponential increase in all possible n-gram combinations. [sent-120, score-0.118]
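A quick calculation shows how fast the space of possible n-grams grows. The vocabulary size of 10,000 is a hypothetical figure chosen only for illustration.

```python
def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams over a vocabulary: vocab_size ** n."""
    return vocab_size ** n


for n in (1, 2, 3):
    print(n, possible_ngrams(10_000, n))
```

With a few hundred training documents per dataset, only a tiny fraction of these combinations can ever be observed, which is why higher-order models lean so heavily on smoothing.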

62 Hence, the superior performance of the trigram-I model on the Cricket dataset was a surprising result. [sent-121, score-0.167]

63 For the Poetry dataset, the unigram-I model performed best among the smoothed n-gram models at 81. [sent-122, score-0.072]

64 This is unsurprising because as mentioned above, topic information is strongest in the Poetry dataset, and it is captured well in the unigram model. [sent-124, score-0.069]

65 For bag-of-words methods, we find that the generatively trained Naive Bayes model (unigram language model) performs better than or equal to the discriminatively trained MaxEnt model on most datasets (except for Business). [sent-125, score-0.265]

66 This result is not surprising since our datasets are limited in size, and generative models tend to perform better than discriminative methods when there is very little training data available. [sent-126, score-0.25]

67 Amongst the different baseline models (MaxEnt, Naive Bayes, Bigram-I), we find Bigram-I to be the best performing model (except for Cricket and Poetry). [sent-127, score-0.135]

68 PCFG-E refers to the ensemble based on MaxEnt, Bigram-I, and PCFG-I. [sent-137, score-0.083]

69 MaxEnt+Bigram-I refers to the ensemble based on MaxEnt and Bigram-I. [sent-138, score-0.083]

70 While discussing the performance of the PCFG model and its variants, we consider the best performing baseline model. [sent-140, score-0.17]

71 We observe that the basic PCFG model and the PCFG-I model do not usually outperform the best baseline method (except for Football and Poetry, as discussed below). [sent-141, score-0.179]

72 For Football, the basic PCFG model outperforms the best baseline, while for Poetry, the PCFG-I model outperforms the best baseline. [sent-142, score-0.233]

73 Further, the performance of the basic PCFG model is inferior to that of PCFG-I for most datasets, likely due to the insufficient training data used in the basic model. [sent-143, score-0.182]

74 Ideally one would use more training documents, but in many domains it is impossible to obtain a large corpus of documents written by a single author. [sent-144, score-0.177]

75 For example, as Luyckx and Daelemans (2008) argue, in forensics one would like to identify the authorship of documents based on a limited number of documents written by the author. [sent-145, score-0.822]

76 Hence, we investigated smoothing techniques to improve the performance of the basic PCFG model. [sent-146, score-0.122]

77 We found that the interpolation approach resulted in a substantial improvement in the performance of the PCFG model for all but the Football dataset (discussed below). [sent-147, score-0.174]

78 We combined the best n-gram model (Bigram-I) and PCFG model (PCFG-I) with MaxEnt to build PCFG-E. [sent-151, score-0.143]

79 For the Travel dataset, we find that the performance of the PCFG-E model is equal to that of the best constituent model (Bigram-I). [sent-152, score-0.149]

80 For the remaining datasets, the performance of PCFG-E is better than the best constituent model. [sent-153, score-0.065]

81 Furthermore, for the Football, Cricket and Poetry datasets this improvement is quite substantial. [sent-154, score-0.153]

82 We now find that the performance of some variant of PCFG is always better than or equal to that of the best baseline. [sent-155, score-0.065]

83 While the basic PCFG model outperforms the baseline for the Football dataset, PCFG-E outperforms the best baseline for the Poetry and Business datasets. [sent-156, score-0.213]

84 For the Cricket and Travel datasets, the performance of the PCFG-E model equals that of the best baseline. [sent-157, score-0.107]

85 In order to assess the statistical significance of any performance difference between the best PCFG model and the best baseline, we performed McNemar’s test, a non-parametric test for binomial variables (Rosner, 2005). [sent-158, score-0.137]
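McNemar's test compares two classifiers on the same test documents by looking only at the documents where exactly one of them is correct. The sketch below uses the exact binomial form of the test; the text does not say whether the exact or the chi-square variant was used, so that choice, the function name, and the toy labels are assumptions.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar p-value from paired predictions.

    b counts documents only classifier A gets right, c counts documents only
    classifier B gets right. Under the null hypothesis each discordant
    document is equally likely to fall in either group.
    """
    b = sum(pa == g and pb != g for g, pa, pb in zip(gold, pred_a, pred_b))
    c = sum(pa != g and pb == g for g, pa, pb in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


gold = [1] * 10
pred_a = [1] * 10            # correct on every document
pred_b = [1] * 4 + [0] * 6   # wrong on six documents
p_value = mcnemar_exact(gold, pred_a, pred_b)
print(p_value)
```

Because only discordant pairs enter the statistic, two systems with very different accuracies on a small test set can still fail to reach significance if they disagree on few documents.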

86 The performance of PCFG and PCFG-I is particularly impressive on the Football and Poetry datasets. [sent-161, score-0.069]

87 For the Football dataset, the basic PCFG model is the best performing PCFG model and it performs much better than other methods. [sent-162, score-0.218]

88 It is surprising that smoothing using PCFG-I actually results in a drop in performance on this dataset. [sent-163, score-0.121]

89 We hypothesize that the authors in the Football dataset may have very different syntactic writing styles that are effectively captured by the basic PCFG model. [sent-164, score-0.311]

90 This is impressive given the much looser syntactic structure of poetry compared to news articles, and it indicates the value of syntactic information for distinguishing between literary authors. [sent-167, score-0.592]

91 Finally, we consider the specific contribution of the PCFG-I model towards the performance of the PCFG-E ensemble. [sent-168, score-0.077]

92 Based on comparing the results for PCFG-E and MaxEnt+Bigram-I, we find that there is a drop in performance for most datasets when removing PCFG-I from the ensemble. [sent-169, score-0.226]

93 Thus, it further illustrates the importance of broader syntactic information for the task of authorship attribution. [sent-172, score-0.588]

94 4 Future Work and Conclusions In this paper, we have presented our ongoing work on authorship attribution, describing a novel approach that uses probabilistic context-free grammars. [sent-173, score-0.556]

95 We have demonstrated that both syntactic and lexical information are useful in effectively capturing authors’ overall writing style. [sent-174, score-0.114]

96 To this end, we have developed an ensemble approach that performs better than the baseline models on several datasets. [sent-175, score-0.137]

97 An interesting extension of our current approach is to consider discriminative training of PCFGs for each author. [sent-176, score-0.061]

98 Finally, we would like to compare the performance of our method to other state-of-the-art approaches to authorship prediction. [sent-177, score-0.591]

99 Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. [sent-184, score-0.611]

100 Generative classifiers: A comparison of logistic regression and naive Bayes. [sent-247, score-0.104]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('authorship', 0.556), ('poetry', 0.357), ('pcfg', 0.318), ('attribution', 0.241), ('cricket', 0.208), ('author', 0.194), ('football', 0.188), ('maxent', 0.157), ('datasets', 0.153), ('documents', 0.116), ('luyckx', 0.106), ('naive', 0.104), ('stamatatos', 0.085), ('ensemble', 0.083), ('literary', 0.083), ('mosteller', 0.079), ('articles', 0.07), ('bayes', 0.07), ('dataset', 0.067), ('authors', 0.058), ('style', 0.057), ('news', 0.054), ('travel', 0.054), ('daelemans', 0.054), ('writing', 0.053), ('binongo', 0.053), ('diederich', 0.053), ('juola', 0.053), ('pcfge', 0.053), ('business', 0.052), ('mallet', 0.052), ('smoothing', 0.048), ('federalist', 0.046), ('wsj', 0.045), ('markers', 0.045), ('gi', 0.043), ('wallace', 0.042), ('model', 0.042), ('basic', 0.039), ('unigram', 0.038), ('drop', 0.038), ('adriana', 0.038), ('holmes', 0.038), ('performing', 0.037), ('document', 0.037), ('generative', 0.036), ('baayen', 0.036), ('brown', 0.036), ('grammar', 0.035), ('performance', 0.035), ('impressive', 0.034), ('zheng', 0.034), ('discriminative', 0.034), ('written', 0.034), ('styles', 0.033), ('syntactic', 0.032), ('ngram', 0.032), ('topic', 0.031), ('opennlp', 0.03), ('interpolation', 0.03), ('website', 0.03), ('austin', 0.03), ('prediction', 0.03), ('best', 0.03), ('effectively', 0.029), ('classifier', 0.029), ('collected', 0.029), ('klein', 0.029), ('build', 0.029), ('performs', 0.028), ('peng', 0.027), ('demonstrate', 0.027), ('training', 0.027), ('baseline', 0.026), ('parser', 0.026), ('smooth', 0.025), ('stop', 0.025), ('per', 0.025), ('newspaper', 0.025), ('topics', 0.025), ('parse', 0.025), ('outperforms', 0.025), ('nips', 0.024), ('http', 0.023), ('survey', 0.023), ('fundamentals', 0.023), ('burrows', 0.023), ('cave', 0.023), ('cyber', 0.023), ('ered', 0.023), ('exas', 0.023), ('extraneous', 0.023), ('fuchun', 0.023), ('ime', 0.023), ('kachites', 0.023), ('poems', 0.023), ('reebank', 0.023), ('rior', 0.023), ('rong', 0.023), ('rosner', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars

Author: Sindhu Raghavan ; Adriana Kovashka ; Raymond Mooney

Abstract: In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.

2 0.12159595 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

Author: Mark Johnson

Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as “topic models” to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer Topic Models as well. Adaptor Grammars (AGs) are a hierarchical, non-parametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.

3 0.094735034 5 acl-2010-A Framework for Figurative Language Detection Based on Sense Differentiation

Author: Daria Bogdanova

Abstract: Various text mining algorithms require the process of feature selection. High-level semantically rich features, such as figurative language uses, speech errors etc., are very promising for such problems as e.g. writing style detection, but automatic extraction of such features is a big challenge. In this paper, we propose a framework for figurative language use detection. This framework is based on the idea of sense differentiation. We describe two algorithms illustrating the mentioned idea. We show then how these algorithms work by applying them to Russian language data.

4 0.079930387 162 acl-2010-Learning Common Grammar from Multilingual Corpus

Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada

Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

5 0.072824866 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

Author: Mohit Bansal ; Dan Klein

Abstract: We present a simple but accurate parser which exploits both large tree fragments and symbol refinement. We parse with all fragments of the training set, in contrast to much recent work on tree selection in data-oriented parsing and tree-substitution grammar learning. We require only simple, deterministic grammar symbol refinement, in contrast to recent work on latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery, instead parsing input sentences as character streams. Despite its simplicity, our parser achieves accuracies of over 88% F1 on the standard English WSJ task, which is competitive with substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional specific contributions center on making implicit all-fragments parsing efficient, including a coarse-to-fine inference scheme and a new graph encoding.

6 0.054346897 112 acl-2010-Extracting Social Networks from Literary Fiction

7 0.050446976 204 acl-2010-Recommendation in Internet Forums and Blogs

8 0.049999185 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

9 0.049693666 158 acl-2010-Latent Variable Models of Selectional Preference

10 0.046116244 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

11 0.046056002 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

12 0.045481648 79 acl-2010-Cross-Lingual Latent Topic Extraction

13 0.045462415 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification

14 0.045018509 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

15 0.044898152 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

16 0.044330318 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization

17 0.043427523 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

18 0.042700496 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

19 0.04089167 244 acl-2010-TrustRank: Inducing Trust in Automatic Translations via Ranking

20 0.040559348 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.13), (1, 0.028), (2, -0.019), (3, 0.005), (4, -0.007), (5, -0.028), (6, 0.035), (7, -0.043), (8, 0.06), (9, -0.027), (10, -0.013), (11, -0.017), (12, 0.081), (13, -0.006), (14, 0.005), (15, -0.015), (16, -0.061), (17, -0.03), (18, 0.038), (19, -0.018), (20, -0.029), (21, -0.053), (22, 0.063), (23, -0.014), (24, -0.093), (25, -0.04), (26, 0.04), (27, 0.081), (28, -0.071), (29, -0.031), (30, 0.035), (31, 0.027), (32, -0.052), (33, 0.118), (34, -0.036), (35, 0.042), (36, -0.062), (37, -0.004), (38, -0.073), (39, -0.101), (40, 0.023), (41, 0.036), (42, 0.054), (43, -0.128), (44, -0.011), (45, 0.103), (46, -0.0), (47, 0.072), (48, 0.073), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90658897 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars

Author: Sindhu Raghavan ; Adriana Kovashka ; Raymond Mooney

Abstract: In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.

2 0.65764874 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

Author: Mark Johnson

Abstract: This paper establishes a connection between two apparently very different kinds of probabilistic models. Latent Dirichlet Allocation (LDA) models are used as “topic models” to produce a low-dimensional representation of documents, while Probabilistic Context-Free Grammars (PCFGs) define distributions over trees. The paper begins by showing that LDA topic models can be viewed as a special kind of PCFG, so Bayesian inference for PCFGs can be used to infer Topic Models as well. Adaptor Grammars (AGs) are a hierarchical, non-parametric Bayesian extension of PCFGs. Exploiting the close relationship between LDA and PCFGs just described, we propose two novel probabilistic models that combine insights from LDA and AG models. The first replaces the unigram component of LDA topic models with multi-word sequences or collocations generated by an AG. The second extension builds on the first one to learn aspects of the internal structure of proper names.

3 0.54361194 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

Author: Mohit Bansal ; Dan Klein

Abstract: We present a simple but accurate parser which exploits both large tree fragments and symbol refinement. We parse with all fragments of the training set, in contrast to much recent work on tree selection in data-oriented parsing and tree-substitution grammar learning. We require only simple, deterministic grammar symbol refinement, in contrast to recent work on latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery, instead parsing input sentences as character streams. Despite its simplicity, our parser achieves accuracies of over 88% F1 on the standard English WSJ task, which is competitive with substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional specific contributions center on making implicit all-fragments parsing efficient, including a coarse-to-fine inference scheme and a new graph encoding.

4 0.49028444 162 acl-2010-Learning Common Grammar from Multilingual Corpus

Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada

Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic context-free grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

5 0.47619072 79 acl-2010-Cross-Lingual Latent Topic Extraction

Author: Duo Zhang ; Qiaozhu Mei ; ChengXiang Zhai

Abstract: Probabilistic latent topic models have recently enjoyed much success in extracting and analyzing latent topics in text in an unsupervised way. One common deficiency of existing topic models, though, is that they would not work well for extracting cross-lingual latent topics simply because words in different languages generally do not co-occur with each other. In this paper, we propose a way to incorporate a bilingual dictionary into a probabilistic topic model so that we can apply topic models to extract shared latent topics in text data of different languages. Specifically, we propose a new topic model called Probabilistic Cross-Lingual Latent Semantic Analysis (PCLSA) which extends the Probabilistic Latent Semantic Analysis (PLSA) model by regularizing its likelihood function with soft constraints defined based on a bilingual dictionary. Both qualitative and quantitative experimental results show that the PCLSA model can effectively extract cross-lingual latent topics from multilingual text data.

6 0.43237007 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

7 0.42544445 132 acl-2010-Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data

8 0.42138344 5 acl-2010-A Framework for Figurative Language Detection Based on Sense Differentiation

9 0.4128173 246 acl-2010-Unsupervised Discourse Segmentation of Documents with Inherently Parallel Structure

10 0.41228876 204 acl-2010-Recommendation in Internet Forums and Blogs

11 0.41005111 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation

12 0.40823802 139 acl-2010-Identifying Generic Noun Phrases

13 0.40291187 237 acl-2010-Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection

14 0.39647323 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

15 0.38755196 112 acl-2010-Extracting Social Networks from Literary Fiction

16 0.38677299 158 acl-2010-Latent Variable Models of Selectional Preference

17 0.38224959 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

18 0.38224742 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

19 0.37942037 53 acl-2010-Blocked Inference in Bayesian Tree Substitution Grammars

20 0.37689409 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.035), (59, 0.09), (73, 0.546), (76, 0.012), (78, 0.02), (83, 0.08), (84, 0.01), (98, 0.1)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90309459 259 acl-2010-WebLicht: Web-Based LRT Services for German

Author: Erhard Hinrichs ; Marie Hinrichs ; Thomas Zastrow

Abstract: This software demonstration presents WebLicht (short for: Web-Based Linguistic Chaining Tool), a web-based service environment for the integration and use of language resources and tools (LRT). WebLicht is being developed as part of the D-SPIN project. WebLicht is implemented as a web application so that there is no need for users to install any software on their own computers or to concern themselves with the technical details involved in building tool chains. The integrated web services are part of a prototypical infrastructure that was developed to facilitate chaining of LRT services. WebLicht allows the integration and use of distributed web services with standardized APIs. The nature of these open and standardized APIs makes it possible to access the web services from nearly any programming language, shell script or workflow engine (UIMA, Gate etc.). Additionally, an application for integration of additional services is available, allowing anyone to contribute his own web service.

same-paper 2 0.88346684 34 acl-2010-Authorship Attribution Using Probabilistic Context-Free Grammars

Author: Sindhu Raghavan ; Adriana Kovashka ; Raymond Mooney

Abstract: In this paper, we present a novel approach for authorship attribution, the task of identifying the author of a document, using probabilistic context-free grammars. Our approach involves building a probabilistic context-free grammar for each author and using this grammar as a language model for classification. We evaluate the performance of our method on a wide range of datasets to demonstrate its efficacy.

3 0.85657084 45 acl-2010-Balancing User Effort and Translation Error in Interactive Machine Translation via Confidence Measures

Author: Jesus Gonzalez Rubio ; Daniel Ortiz Martinez ; Francisco Casacuberta

Abstract: This work deals with the application of confidence measures within an interactive-predictive machine translation system in order to reduce human effort. If a small loss in translation quality can be tolerated for the sake of efficiency, user effort can be saved by interactively translating only those initial translations which the confidence measure classifies as incorrect. We apply confidence estimation as a way to achieve a balance between user effort savings and final translation error. Empirical results show that our proposal makes it possible to obtain almost perfect translations while significantly reducing user effort.

4 0.85605049 68 acl-2010-Conditional Random Fields for Word Hyphenation

Author: Nikolaos Trogkanis ; Charles Elkan

Abstract: Finding allowable places in words to insert hyphens is an important practical problem. The algorithm that is used most often nowadays has remained essentially unchanged for 25 years. This method is the TeX hyphenation algorithm of Knuth and Liang. We present here a hyphenation method that is clearly more accurate. The new method is an application of conditional random fields. We create new training sets for English and Dutch from the CELEX European lexical resource, and achieve error rates for English of less than 0.1% for correctly allowed hyphens, and less than 0.01% for Dutch. Experiments show that both the Knuth/Liang method and a leading current commercial alternative have error rates several times higher for both languages.

5 0.83582801 141 acl-2010-Identifying Text Polarity Using Random Walks

Author: Ahmed Hassan ; Dragomir Radev

Abstract: Automatically identifying the polarity of words is a very important task in Natural Language Processing. It has applications in text classification, text filtering, analysis of product reviews, analysis of responses to surveys, and mining online discussions. We propose a method for identifying the polarity of words. We apply a Markov random walk model to a large word relatedness graph, producing a polarity estimate for any given word. A key advantage of the model is its ability to accurately and quickly assign a polarity sign and magnitude to any word. The method could be used both in a semi-supervised setting where a training set of labeled words is used, and in an unsupervised setting where a handful of seeds is used to define the two polarity classes. The method is experimentally tested using a manually labeled set of positive and negative words. It outperforms the state-of-the-art methods in the semi-supervised setting. The results in the unsupervised setting are comparable to the best reported values. However, the proposed method is faster and does not need a large corpus.

6 0.80170387 238 acl-2010-Towards Open-Domain Semantic Role Labeling

7 0.76428652 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

8 0.5609616 230 acl-2010-The Manually Annotated Sub-Corpus: A Community Resource for and by the People

9 0.55504078 121 acl-2010-Generating Entailment Rules from FrameNet

10 0.52711165 134 acl-2010-Hierarchical Sequential Learning for Extracting Opinions and Their Attributes

11 0.52309048 154 acl-2010-Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm

12 0.51905113 82 acl-2010-Demonstration of a Prototype for a Conversational Companion for Reminiscing about Images

13 0.51656699 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

14 0.51212692 204 acl-2010-Recommendation in Internet Forums and Blogs

15 0.50920999 175 acl-2010-Models of Metaphor in NLP

16 0.50890523 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews

17 0.50165492 209 acl-2010-Sentiment Learning on Product Reviews via Sentiment Ontology Tree

18 0.4993372 85 acl-2010-Detecting Experiences from Weblogs

19 0.49881774 227 acl-2010-The Impact of Interpretation Problems on Tutorial Dialogue

20 0.4916923 158 acl-2010-Latent Variable Models of Selectional Preference