acl acl2010 acl2010-117 knowledge-graph by maker-knowledge-mining

117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms


Source: pdf

Author: Zhili Wu ; Katja Markert ; Serge Sharoff

Abstract: Prior use of machine learning in genre classification used a list of labels as classification categories. However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. In this paper we present a method of using the hierarchy of labels to improve the classification accuracy. As a testbed for this approach we use the Brown Corpus as well as a range of other corpora, including the BNC, HGC and Syracuse. The results are not encouraging: apart from the Brown corpus, the improvements of our structural classifier over the flat one are not statistically significant. We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Prior use of machine learning in genre classification used a list of labels as classification categories. [sent-9, score-0.757]

2 However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. [sent-10, score-0.632]

3 The results are not encouraging: apart from the Brown corpus, the improvements of our structural classifier over the flat one are not statistically significant. [sent-15, score-0.314]

4 We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning. [sent-16, score-0.774]

5 Automatic genre identification (AGI) can be traced to the mid-1990s (Karlgren and Cutting, 1994; Kessler et al., 1997). [sent-17, score-0.596]

6 But this research became much more active in recent years, partly because of the explosive growth of the Web, and partly because of the importance of making genre distinctions in NLP applications. [sent-18, score-0.596]

7 In Information Retrieval, given the large number of web pages on any given topic, it is often difficult for users to find relevant pages that are in the right genre (Vidulin et al.). [sent-19, score-0.645]

8 As for other applications, the accuracy of many tasks, such as machine translation, POS tagging (Giesbrecht and Evert, 2009) or the identification of discourse relations (Webber, 2009), relies on defining a language model suitable for the genre of a given text. [sent-21, score-0.661]

9 This interest in genres resulted in a proliferation of studies on corpus development of web genres and comparison of methods for AGI. [sent-27, score-0.702]

10 One reason comes from the limited number of genres present in these two collections (eight genres in KI-04 and seven in Santinis). [sent-39, score-0.622]

11 This paper explores a way of using information on the hierarchy of labels for improving fine-grained genre classification. [sent-53, score-0.808]

12 To the best of our knowledge, this is the first work presenting structural genre classification and distance measures for genres. [sent-54, score-1.038]

13 In Section 2 we present a structural reformulation of Support Vector Machines (SVMs) that can take similarities between different genres into account. [sent-55, score-0.49]

14 This formulation necessitates the development of distance measures between different genres in a hierarchy, of which we present three different types in Section 3, along with possible estimation procedures for these distances. [sent-56, score-0.522]

15 We present experiments with these novel structural SVMs and distance measures on three different corpora in Section 4. [sent-57, score-0.39]

16 In Section 5 we investigate potential reasons for this, including the (im)balance of different genre hierarchies and problems with our distance measures. [sent-60, score-0.83]

17 Linear SVMs on a flat list of labels achieve high efficiency and accuracy in text classification when compared to nonlinear SVMs or other state-of-the-art methods. [sent-63, score-0.309]

18 Also they have not been applied to genre classification. [sent-69, score-0.596]

19 Let x be a document and w_m a weight vector associated with genre class m, in a corpus with k genres at the most fine-grained level. [sent-81, score-1.04]

20 Accurate prediction requires that when a document vector is multiplied with the weight vector associated with its own class, the resulting inner product is larger than its inner products with the weight vectors of all other genre classes m. [sent-83, score-0.785]

21 Let x_i be the i-th training document, and y_i its genre label. [sent-85, score-0.596]

22 To strengthen the constraints, the zero on the right-hand side of the flat-SVM inequality (2) can be replaced by a positive value, corresponding to a distance measure h(y_i, m) between two genre classes, leading to the following constraint: w_{y_i}^T x_i − w_m^T x_i ≥ h(y_i, m), ∀m. [sent-87, score-0.912]
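A minimal sketch of this structure-aware margin constraint written as a hinge loss (assuming numpy; the function and variable names are illustrative, not taken from the paper):

import numpy as np

def structural_hinge_loss(W, x, y, h):
    # W: (k, d) matrix holding one weight vector w_m per genre class
    # x: (d,) document feature vector; y: index of the gold genre
    # h: (k, k) genre distance matrix; with a constant value on all
    #    off-diagonal entries this reduces to the flat multiclass SVM
    scores = W @ x                      # inner products w_m^T x
    margins = scores[y] - scores        # w_y^T x - w_m^T x for every m
    losses = np.maximum(0.0, h[y] - margins)
    losses[y] = 0.0                     # no constraint against the gold class
    return losses.max()                 # most violated constraint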

23 The structural SVM (Section 2) requires a distance measure h between two genres. [sent-94, score-0.36]

24 We can derive such distance measures from the genre hierarchy in a way similar to the word similarity measures invented for lexical hierarchies such as WordNet (see Pedersen et al., 2004). [sent-95, score-1.09]

25 Whereas the information content of a word or concept in a lexical hierarchy has been well-defined (Resnik, 1995), it is less clear how to estimate the information content of a genre label. [sent-99, score-0.849]

26 We will therefore discuss several different ways of estimating information content of nodes in a genre hierarchy. [sent-100, score-0.678]

27 Path-based distance measures work relatively well on balanced hierarchies, such as the one in Figure 1, but fail to treat hierarchies with different levels of granularity well. [sent-114, score-0.516]
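As an illustration of the path-based idea, the distance between two genre nodes can be taken as the length of the shortest path through their lowest common subsumer (LCS); a sketch over a child-to-parent dictionary (names hypothetical):

def path_distance(parent, a, b):
    # parent: dict mapping each genre node to its parent (root -> None)
    def chain_to_root(n):
        chain = []
        while n is not None:
            chain.append(n)
            n = parent.get(n)
        return chain
    up_a, up_b = chain_to_root(a), chain_to_root(b)
    on_b_path = set(up_b)
    # the first ancestor of a that also lies above b is the LCS
    for steps_from_a, node in enumerate(up_a):
        if node in on_b_path:
            return steps_from_a + up_b.index(node)
    raise ValueError("nodes are not in the same tree")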

28 For lexical hierarchies, as a result, several distance measures based on information content have been suggested, where the information content of a concept c in a hierarchy is measured by (Resnik, 1995): IC(c) = −log(freq(c)/freq(root)). [sent-115, score-0.464]
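A sketch of this information content computation, plus Lin's measure built on top of it (frequencies are assumed to already include each node's subgenres; the exact similarity-to-distance conversion behind the paper's IC-lin variants is our assumption):

import math

def information_content(freq, root):
    # IC(c) = -log(freq(c) / freq(root))
    return {c: -math.log(f / freq[root]) for c, f in freq.items()}

def lin_distance(ic, lcs, a, b):
    # Lin's similarity is 2*IC(lcs(a, b)) / (IC(a) + IC(b));
    # here it is turned into a distance by subtracting from 1
    denom = ic[a] + ic[b]
    return 1.0 if denom == 0.0 else 1.0 - 2.0 * ic[lcs] / denom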

29 The notion of the information content of a genre is not straightforward. [sent-126, score-0.645]

30 We can interpret the “frequency” of a genre node simply as the number of all documents belonging to that genre (including any of its subgenres). [sent-129, score-1.27]

31 Unfortunately, there are no estimates for genre frequencies on, for example, a representative sample of web documents. [sent-130, score-0.683]

32 Therefore, we approximate genre frequencies from the document frequencies (dfs) in the training sets used in classification. [sent-131, score-0.704]

33 We can also use the labels/names of the genre nodes as the unit of frequency estimation. [sent-134, score-0.694]

34 Then, the frequency of a genre node is the occurrence frequency of its label in a corpus plus the occurrence frequencies of the labels of all its subnodes. [sent-135, score-0.92]
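A sketch of this subtree aggregation (names illustrative, not from the paper):

def genre_label_frequency(children, label_count, node):
    # frequency of a genre node = occurrence count of its own label
    # plus the label counts of everything below it in the hierarchy
    total = label_count.get(node, 0)
    for child in children.get(node, []):
        total += genre_label_frequency(children, label_count, child)
    return total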

35 Note that there is no direct correspondence between this measure and the document frequency of a genre: measuring the number of times the potential genre label poem occurs in a corpus is not in any way equivalent to the number of poems in that corpus. [sent-136, score-0.812]

36 A higher-level genre label will have a higher frequency (and lower information content) than a lower-level genre label. [sent-139, score-1.359]

37 For label frequency estimation, we manually expand any label abbreviations (such as "newsp" for BNC genre labels), delete stop words and function words, and then use two search methods. [sent-140, score-0.733]

38 For the search method word, we simply look up the frequency of the genre label in a corpus, using three different sources (the BNC, Brown, and Google web search). [sent-141, score-0.746]

39 Since in the BNC and Brown corpora some labels are mentioned very rarely, for these two corpora we also use a search method gram, in which all character 5-grams within the genre label are searched for and their frequencies aggregated. [sent-142, score-0.802]
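A sketch of the 5-gram step (an illustrative helper; the paper's exact label normalisation is not quoted here). The gram frequency of a label would then be the sum of corpus counts over char_ngrams(label):

def char_ngrams(label, n=5):
    # all character n-grams of a whitespace-normalised genre label
    s = " ".join(label.lower().split())
    return [s[i:i + n] for i in range(max(0, len(s) - n + 1))]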

40 Obviously, when using this measure, we rely on genre labels that are meaningful in the sense that lower-level labels were chosen to be more specific and are therefore probably rarer terms in a corpus. [sent-145, score-0.847]

41 The measure could not possibly be useful on a genre hierarchy that would give random names to its genres such as genre 1. [sent-146, score-1.71]

42 The method for measuring genre frequency is indicated last: df when measured via document frequency, and word/gram when measured via the frequency of genre labels. [sent-148, score-1.419]

43 If frequencies of genre labels are used, the corpus for counting their occurrences is also indicated: brown, bnc, or gg (the Web, as estimated by Google hit counts). [sent-149, score-1.535]

44 We use four genre-annotated corpora for genre classification: the Brown Corpus (Kučera and Francis, 1967), BNC (Lee, 2001), HGC (Stubbe and Ringlstetter, 2007) and Syracuse (Crowston et al.). [sent-152, score-0.596]

45 They have a wide variety of genre labels (from 15 in the Brown corpus to 32 in HGC, 70 in the BNC, and 292 in Syracuse) and different types of hierarchies. [sent-154, score-0.995]

46 We use standard classification accuracy (Acc) on the most fine-grained level of target categories in the genre hierarchy. [sent-156, score-0.746]

47 In addition, given a structural distance H, misclassifications can be weighted based on the distance measure. [sent-157, score-0.308]
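The paper's exact SAcc definition is not quoted in this summary; one plausible form of such a distance-weighted accuracy, assuming a distance matrix H normalised to [0, 1] and integer class labels, is:

import numpy as np

def structural_accuracy(y_true, y_pred, H):
    # a misclassification into a nearby genre costs less than one
    # into a distant genre; H is a (k, k) matrix with max(H) == 1
    penalties = np.array([H[t, p] for t, p in zip(y_true, y_pred)])
    return 1.0 - penalties.mean()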

48 In each fold, for each genre class 10% of documents are used for testing. [sent-170, score-0.683]

49 The features used for genre classification are character 4-grams for all algorithms. [sent-184, score-0.692]

50 We used character n-grams because they are very easy to extract, language-independent (no need to rely on parsing or even stemming), and known to have the best performance in genre classification tasks (Kanaris and Stamatatos, 2009; Sharoff et al.). [sent-187, score-0.692]
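A minimal way to obtain such features (sklearn shown purely for illustration; the paper's own extraction pipeline may differ, and the documents variable is a placeholder for a list of raw texts):

from sklearn.feature_extraction.text import CountVectorizer

# character 4-grams: language-independent, no parsing or stemming needed
vectorizer = CountVectorizer(analyzer="char", ngram_range=(4, 4))
X = vectorizer.fit_transform(documents)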

51 In one experiment in (Karlgren and Cutting, 1994) the subgenres under fiction are grouped together, leading to 10 genres to classify. [sent-192, score-0.431]

52 4% whereas the best structural SVM based on Lin's information content distance measure (IC-lin-word-bnc) achieves 68. [sent-195, score-0.409]

53 We perform experiments on all 15 genres at the end level of the Brown corpus. [sent-207, score-0.344]

54 The structural SVMs using the information content measures IC-lin-gram-bnc and IC-resk-word-br also perform equally well. [sent-212, score-0.31]

55 We are also interested in structural accuracy (SAcc) to see whether the structural SVMs make fewer "big" mistakes. [sent-214, score-0.423]

56 However, in our case, Lin’s information content measure and the plen measure perform well under any structural accuracy evaluation measure and outperform flat SVMs. [sent-219, score-0.68]

57 Standard accuracy for the best-performing structural methods on HGC is just the same as for the flat SVM (69. [sent-224, score-0.379]

58 The BNC corpus contains 70 genres and 4053 documents. [sent-229, score-0.342]

59 The Syracuse corpus is a recently developed large collection of 3027 annotated webpages divided into 292 genres (Crowston et al.). [sent-233, score-0.342]

60 Focusing only on genres containing 15 or more examples, we arrived at a corpus of 2293 samples and 52 genres. [sent-235, score-0.342]

61 Given the success of structural learning elsewhere (e.g., Tsochantaridis et al., 2004), the lack of success on genres is surprising. [sent-241, score-0.311]

62 Our best results were achieved on the Brown corpus, whose genre tree has at least three attractive properties. [sent-244, score-0.65]

63 Thus, the genres in HGC are almost represented by a flat list, with just one extra level over 32 categories. [sent-257, score-0.479]

64 Similarly, the vast majority of genres in the Syracuse corpus are also organised in two levels only. [sent-258, score-0.42]

65 Such flat hierarchies do not offer much scope to improve over a completely flat list. [sent-259, score-0.375]

66 Some BNC genres are specified down to the fourth level, e.g., written/national/broadsheet/arts, but many other genres are still only specified to the second level of the hierarchy. [sent-262, score-0.344]

67 To test our hypothesis, we tried to skew the Brown genre tree in two ways. [sent-268, score-0.65]

68 First, we kept the tree relatively balanced visually and distributionally, but flattened it by removing the second layer (Press, Misc, Non-Fiction, Fiction) from the hierarchy, leaving a tree with only two layers. [sent-269, score-0.34]

69 Second, we skewed the visual and distributional balance of the tree by collapsing its three leaf-level genres under Press, and the two under non-fiction, leading to 12 genres to classify (cf. [sent-270, score-0.929]

70 As expected, the structural methods on either skewed or flattened hierarchies are not significantly better than the flat SVM. [sent-274, score-0.53]

71 For the flattened hierarchy of 15 leaf genres the maximal accuracy is 54. [sent-275, score-0.611]

72 To measure the degree of balance of a tree, we introduce two tree balance scores based on entropy. [sent-282, score-0.418]

73 Then, level by level, we calculate an entropy score, either according to how many tree nodes at the next level belong to a node at this level (denoted vb: visual balance), or according to how many end-level documents belong to a node at this level (denoted db: distribution balance). [sent-284, score-0.461]

74 It can be shown that any perfect N-ary tree will have the largest visual balance score of 1. [sent-287, score-0.276]
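The exact per-level aggregation is not quoted here; a sketch of the normalised-entropy idea behind both scores (a perfectly even split over N nodes yields 1.0, consistent with the perfect N-ary tree above):

import math

def level_balance(counts):
    # counts: per node at this level, the number of next-level nodes
    # (visual balance, vb) or of end-level documents (distribution
    # balance, db) belonging to it
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        return 1.0  # a single node is trivially balanced
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))  # normalise to [0, 1]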

75 The first two rows for the Brown corpus have both large visual balance and distribution balance scores. [sent-290, score-0.409]

76 As shown earlier, for those two setups the structural SVMs perform better than the flat approach. [sent-291, score-0.314]

77 In contrast, for the tree hierarchies of Brown that we deformed or flattened, and also BNC and Syracuse, either or both of the two balance scores tend to be lower, and no improvement has been obtained over the flat approach. [sent-292, score-0.45]

78 This may indicate that a further exploration of the relation between tree balance and the performance of structural SVMs is warranted. [sent-293, score-0.389]

79 However, high visual balance and distribution scores do not necessarily imply high performance of structural SVMs, as very flat trees are also visually very balanced. [sent-294, score-0.596]

80 As an example, HGC has a high visual balance score due to a shallow hierarchy and a high distributional balance score due to a roughly equal number of documents contained in each genre. [sent-295, score-0.579]

81 A similar observation on the importance of well-balanced hierarchies comes from a recent PASCAL challenge on large-scale hierarchical text classification, which shows that some flat approaches perform competitively in topic classification with imbalanced hierarchies. [sent-297, score-0.292]

82 Other methods for measuring tree balance (some of which are related to ours) are used in the field of phylogenetic research (Shao and Sokal, 1990), but they are only applicable to visual balance. [sent-299, score-0.276]

83 We also scrutinise our distance measures, as these are crucial for the structural approach. [sent-302, score-0.39]

84 The path-based measures perform well overall; again, for the Brown corpus this is probably due to its balanced hierarchy, which makes path length appropriate. [sent-308, score-0.284]

85 When measured via genre label frequency, we run into at least two problems. [sent-311, score-0.632]

86 First, genre label frequency does not have to correspond to the class frequency of documents. [sent-314, score-0.803]

87 Figure 5 shows several distance matrices on the (original) 15 genre Brown corpus. [sent-326, score-0.759]

88 The plen matrix has clear blocks for the super genres press, informative, imaginative, etc. [sent-327, score-0.457]

89 (Values in brackets are the alignment with the plen matrix.) An alternative to structural distance measures would be distance measures between the genres based on pairwise cosine similarities between them. [sent-339, score-1.058]

90 To assess this, we aggregated all character 4-gram training vectors of each genre and calculated standard cosine similarities. [sent-340, score-0.64]

91 After converting the similarities to distance, we plug the distance matrix into our structural SVM. [sent-342, score-0.358]
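A sketch of that aggregation (assuming a sparse feature matrix X, an integer label array y, and k genres; the exact similarity-to-distance conversion is our assumption):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def genre_cosine_distances(X, y, k):
    # sum the char 4-gram training vectors of each genre into one
    # centroid per class, then turn pairwise cosine similarities
    # into distances usable in place of the hierarchy-derived h
    centroids = np.vstack([np.asarray(X[y == m].sum(axis=0)) for m in range(k)])
    return 1.0 - cosine_similarity(centroids)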

92 This also indicates that the genre structural hierarchy clearly gives information not present in the simple character 4-gram features we use. [sent-345, score-0.974]

93 For a more detailed discussion of the problems of character n-grams, currently the prevalent features for genre classification, we refer the reader to (Sharoff et al.). [sent-346, score-0.64]

94 In this paper, we have evaluated structural learning approaches to genre classification using several different genre distance measures. [sent-348, score-1.552]

95 As potential reasons for this negative result, we suggest that current genre hierarchies are either not of sufficient depth or are visually or distributionally imbalanced. [sent-350, score-0.847]

96 We think further investigation into the relationship between hierarchy balance and structural learning is warranted. [sent-351, score-0.49]

97 Further investigation is also needed into the appropriateness of n-gram features for genre identification as well as good measures of genre distance. [sent-352, score-1.274]

98 For a full assessment of hierarchical learning for genre classification, the field of genre studies needs a testbed similar to the Reuters or 20 Newsgroups datasets used in topic-based IR with a balanced genre hierarchy and a representative corpus of reliably annotated webpages. [sent-354, score-2.027]

99 With regard to algorithms, we are also interested in other formulations for structural SVMs and their large-scale implementation as well as the combination of different distance measures, for example in ensemble learning. [sent-355, score-0.308]

100 Recognizing text genres with simple metrics using discriminant analysis. [sent-443, score-0.339]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('genre', 0.596), ('genres', 0.311), ('brown', 0.19), ('structural', 0.179), ('bnc', 0.16), ('balance', 0.156), ('hierarchy', 0.155), ('hgc', 0.144), ('svms', 0.139), ('flat', 0.135), ('distance', 0.129), ('lcs', 0.126), ('sharoff', 0.112), ('hierarchies', 0.105), ('karlgren', 0.096), ('leeds', 0.096), ('plen', 0.096), ('syracuse', 0.096), ('measures', 0.082), ('flattened', 0.08), ('santinis', 0.08), ('subsumer', 0.08), ('cutting', 0.072), ('fiction', 0.072), ('ic', 0.071), ('visual', 0.066), ('accuracy', 0.065), ('frequency', 0.065), ('wmtxi', 0.064), ('wytixi', 0.064), ('visually', 0.06), ('svm', 0.058), ('labels', 0.057), ('santini', 0.056), ('tree', 0.054), ('balanced', 0.053), ('measure', 0.052), ('classification', 0.052), ('tsochantaridis', 0.051), ('matrix', 0.05), ('content', 0.049), ('web', 0.049), ('crowston', 0.048), ('faqs', 0.048), ('kanaris', 0.048), ('resk', 0.048), ('shao', 0.048), ('sokal', 0.048), ('subgenres', 0.048), ('depth', 0.047), ('similarity', 0.046), ('documents', 0.046), ('path', 0.045), ('character', 0.044), ('keerthi', 0.042), ('misc', 0.042), ('levels', 0.042), ('class', 0.041), ('stamatatos', 0.039), ('dekel', 0.039), ('distributionally', 0.039), ('frequencies', 0.038), ('label', 0.036), ('joachims', 0.036), ('giesbrecht', 0.036), ('evert', 0.036), ('organised', 0.036), ('chodorow', 0.034), ('matrices', 0.034), ('level', 0.033), ('nodes', 0.033), ('francis', 0.033), ('ku', 0.033), ('boser', 0.032), ('eissen', 0.032), ('nciformativeimagnitvenofpcimrteosincsiformatvieimagnitve', 0.032), ('nofpcirmetios', 0.032), ('plsk', 0.032), ('pwupal', 0.032), ('ringlstetter', 0.032), ('stubbe', 0.032), ('vidulin', 0.032), ('westerns', 0.032), ('zu', 0.032), ('document', 0.032), ('lin', 0.032), ('margin', 0.032), ('node', 0.032), ('skewed', 0.031), ('corpus', 0.031), ('resnik', 0.03), ('leacock', 0.03), ('inner', 0.029), ('wu', 0.029), ('jiang', 0.029), ('weight', 0.029), ('dual', 0.028), ('discriminant', 0.028), ('stein', 0.028), ('meyer', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms

Author: Zhili Wu ; Katja Markert ; Serge Sharoff

Abstract: Prior use of machine learning in genre classification used a list of labels as classification categories. However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. In this paper we present a method of using the hierarchy of labels to improve the classification accuracy. As a testbed for this approach we use the Brown Corpus as well as a range of other corpora, including the BNC, HGC and Syracuse. The results are not encouraging: apart from the Brown corpus, the improvements of our structural classifier over the flat one are not statistically significant. We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning.

2 0.080755673 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

3 0.07811109 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information

Author: WenTing Wang ; Jian Su ; Chew Lim Tan

Abstract: Syntactic knowledge is important for discourse relation recognition. Yet only heuristically selected flat paths and 2-level production rules have been used to incorporate such information so far. In this paper we propose using tree kernel based approach to automatically mine the syntactic information from the parse trees for discourse analysis, applying kernel function to the tree structures directly. These structural syntactic features, together with other normal flat features are incorporated into our composite kernel to capture diverse knowledge for simultaneous discourse identification and classification for both explicit and implicit relations. The experiment shows tree kernel approach is able to give statistical significant improvements over flat syntactic path feature. We also illustrate that tree kernel approach covers more structure information than the production rules, which allows tree kernel to further incorporate information from a higher dimension space for possible better discrimination. Besides, we further propose to leverage on temporal ordering information to constrain the interpretation of discourse relation, which also demonstrate statistical significant improvements for discourse relation recognition on PDTB 2.0 for both explicit and implicit as well. University of Singapore Singapore 117417 sg tacl @ comp .nus .edu . sg 1

4 0.074747756 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

Author: Valentin I. Spitkovsky ; Daniel Jurafsky ; Hiyan Alshawi

Abstract: We show how web mark-up can be used to improve unsupervised dependency parsing. Starting from raw bracketings of four common HTML tags (anchors, bold, italics and underlines), we refine approximate partial phrase boundaries to yield accurate parsing constraints. Conversion procedures fall out of our linguistic analysis of a newly available million-word hyper-text corpus. We demonstrate that derived constraints aid grammar induction by training Klein and Manning’s Dependency Model with Valence (DMV) on this data set: parsing accuracy on Section 23 (all sentences) of the Wall Street Journal corpus jumps to 50.4%, beating previous state-of-the- art by more than 5%. Web-scale experiments show that the DMV, perhaps because it is unlexicalized, does not benefit from orders of magnitude more annotated but noisier data. Our model, trained on a single blog, generalizes to 53.3% accuracy out-of-domain, against the Brown corpus nearly 10% higher than the previous published best. The fact that web mark-up strongly correlates with syntactic structure may have broad applicability in NLP.

5 0.073586121 94 acl-2010-Edit Tree Distance Alignments for Semantic Role Labelling

Author: Hector-Hugo Franco-Penya

Abstract: ―Tree SRL system‖ is a Semantic Role Labelling supervised system based on a tree-distance algorithm and a simple k-NN implementation. The novelty of the system lies in comparing the sentences as tree structures with multiple relations instead of extracting vectors of features for each relation and classifying them. The system was tested with the English CoNLL-2009 shared task data set where 79% accuracy was obtained. 1

6 0.07044626 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

7 0.069676906 241 acl-2010-Transition-Based Parsing with Confidence-Weighted Classification

8 0.069445312 32 acl-2010-Arabic Named Entity Recognition: Using Features Extracted from Noisy Data

9 0.069307327 110 acl-2010-Exploring Syntactic Structural Features for Sub-Tree Alignment Using Bilingual Tree Kernels

10 0.067740187 37 acl-2010-Automatic Evaluation Method for Machine Translation Using Noun-Phrase Chunking

11 0.063247934 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

12 0.058206521 133 acl-2010-Hierarchical Search for Word Alignment

13 0.056485377 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

14 0.055032145 158 acl-2010-Latent Variable Models of Selectional Preference

15 0.054507267 66 acl-2010-Compositional Matrix-Space Models of Language

16 0.054153819 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging

17 0.054021228 78 acl-2010-Cross-Language Text Classification Using Structural Correspondence Learning

18 0.053202614 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

19 0.052998159 13 acl-2010-A Rational Model of Eye Movement Control in Reading

20 0.052893877 25 acl-2010-Adapting Self-Training for Semantic Role Labeling


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.169), (1, 0.025), (2, -0.01), (3, 0.01), (4, 0.023), (5, -0.006), (6, 0.011), (7, -0.009), (8, 0.008), (9, -0.005), (10, -0.063), (11, 0.017), (12, 0.013), (13, -0.068), (14, -0.047), (15, 0.04), (16, 0.018), (17, -0.017), (18, 0.077), (19, -0.006), (20, 0.057), (21, 0.043), (22, 0.032), (23, 0.075), (24, -0.018), (25, -0.023), (26, 0.041), (27, 0.08), (28, 0.032), (29, 0.049), (30, -0.016), (31, -0.005), (32, 0.056), (33, 0.088), (34, -0.115), (35, 0.058), (36, -0.023), (37, -0.106), (38, -0.056), (39, -0.042), (40, 0.083), (41, 0.097), (42, 0.06), (43, 0.006), (44, 0.009), (45, -0.02), (46, -0.087), (47, -0.098), (48, -0.038), (49, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94582492 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms

Author: Zhili Wu ; Katja Markert ; Serge Sharoff

Abstract: Prior use of machine learning in genre classification used a list of labels as classification categories. However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. In this paper we present a method of using the hierarchy of labels to improve the classification accuracy. As a testbed for this approach we use the Brown Corpus as well as a range of other corpora, including the BNC, HGC and Syracuse. The results are not encouraging: apart from the Brown corpus, the improvements of our structural classifier over the flat one are not statistically significant. We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning.

2 0.55923307 76 acl-2010-Creating Robust Supervised Classifiers via Web-Scale N-Gram Data

Author: Shane Bergsma ; Emily Pitler ; Dekang Lin

Abstract: In this paper, we systematically assess the value of using web-scale N-gram data in state-of-the-art supervised NLP classifiers. We compare classifiers that include or exclude features for the counts of various N-grams, where the counts are obtained from a web-scale auxiliary corpus. We show that including N-gram count features can advance the state-of-the-art accuracy on standard data sets for adjective ordering, spelling correction, noun compound bracketing, and verb part-of-speech disambiguation. More importantly, when operating on new domains, or when labeled training data is not plentiful, we show that using web-scale N-gram features is essential for achieving robust performance.

3 0.54170078 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

Author: Beata Beigman Klebanov ; Eyal Beigman ; Daniel Diermeier

Abstract: We establish the following characteristics of the task of perspective classification: (a) using term frequencies in a document does not improve classification achieved with absence/presence features; (b) for datasets allowing the relevant comparisons, a small number of top features is found to be as effective as the full feature set and indispensable for the best achieved performance, testifying to the existence of perspective-specific keywords. We relate our findings to research on word frequency distributions and to discourse analytic studies of perspective.

4 0.52033216 263 acl-2010-Word Representations: A Simple and General Method for Semi-Supervised Learning

Author: Joseph Turian ; Lev-Arie Ratinov ; Yoshua Bengio

Abstract: If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http ://metaoptimize com/proj ects/wordreprs/ .

5 0.50596076 212 acl-2010-Simple Semi-Supervised Training of Part-Of-Speech Taggers

Author: Anders Sogaard

Abstract: Most attempts to train part-of-speech taggers on a mixture of labeled and unlabeled data have failed. In this work stacked learning is used to reduce tagging to a classification task. This simplifies semisupervised training considerably. Our prefered semi-supervised method combines tri-training (Li and Zhou, 2005) and disagreement-based co-training. On the Wall Street Journal, we obtain an error reduction of 4.2% with SVMTool (Gimenez and Marquez, 2004).

6 0.49424338 139 acl-2010-Identifying Generic Noun Phrases

7 0.49028632 148 acl-2010-Improving the Use of Pseudo-Words for Evaluating Selectional Preferences

8 0.48657435 204 acl-2010-Recommendation in Internet Forums and Blogs

9 0.47740242 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

10 0.47623083 161 acl-2010-Learning Better Data Representation Using Inference-Driven Metric Learning

11 0.47334927 205 acl-2010-SVD and Clustering for Unsupervised POS Tagging

12 0.46481976 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

13 0.46406856 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

14 0.4611856 66 acl-2010-Compositional Matrix-Space Models of Language

15 0.46109873 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

16 0.45933223 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

17 0.45927602 112 acl-2010-Extracting Social Networks from Literary Fiction

18 0.44771188 19 acl-2010-A Taxonomy, Dataset, and Classifier for Automatic Noun Compound Interpretation

19 0.44295835 43 acl-2010-Automatically Generating Term Frequency Induced Taxonomies

20 0.44076782 155 acl-2010-Kernel Based Discourse Relation Recognition with Temporal Ordering Information


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.297), (14, 0.011), (25, 0.045), (42, 0.02), (59, 0.089), (73, 0.064), (76, 0.024), (78, 0.042), (80, 0.015), (83, 0.1), (84, 0.049), (98, 0.133)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8084107 204 acl-2010-Recommendation in Internet Forums and Blogs

Author: Jia Wang ; Qing Li ; Yuanzhu Peter Chen ; Zhangxi Lin

Abstract: The variety of engaging interactions among users in social medial distinguishes it from traditional Web media. Such a feature should be utilized while attempting to provide intelligent services to social media participants. In this article, we present a framework to recommend relevant information in Internet forums and blogs using user comments, one of the most representative of user behaviors in online discussion. When incorporating user comments, we consider structural, semantic, and authority information carried by them. One of the most important observation from this work is that semantic contents of user comments can play a fairly different role in a different form of social media. When designing a recommendation system for this purpose, such a difference must be considered with caution.

same-paper 2 0.77834982 117 acl-2010-Fine-Grained Genre Classification Using Structural Learning Algorithms

Author: Zhili Wu ; Katja Markert ; Serge Sharoff

Abstract: Prior use of machine learning in genre classification used a list of labels as classification categories. However, genre classes are often organised into hierarchies, e.g., covering the subgenres of fiction. In this paper we present a method of using the hierarchy of labels to improve the classification accuracy. As a testbed for this approach we use the Brown Corpus as well as a range of other corpora, including the BNC, HGC and Syracuse. The results are not encouraging: apart from the Brown corpus, the improvements of our structural classifier over the flat one are not statistically significant. We discuss the relation between structural learning performance and the visual and distributional balance of the label hierarchy, suggesting that only balanced hierarchies might profit from structural learning.

3 0.7620728 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

Author: Shay Cohen ; Noah A Smith

Abstract: We consider the search for a maximum likelihood assignment of hidden derivations and grammar weights for a probabilistic context-free grammar, the problem approximately solved by “Viterbi training.” We show that solving and even approximating Viterbi training for PCFGs is NP-hard. We motivate the use of uniformat-random initialization for Viterbi EM as an optimal initializer in absence of further information about the correct model parameters, providing an approximate bound on the log-likelihood.

4 0.56794953 184 acl-2010-Open-Domain Semantic Role Labeling by Modeling Word Spans

Author: Fei Huang ; Alexander Yates

Abstract: Most supervised language processing systems show a significant drop-off in performance when they are tested on text that comes from a domain significantly different from the domain of the training data. Semantic role labeling techniques are typically trained on newswire text, and in tests their performance on fiction is as much as 19% worse than their performance on newswire text. We investigate techniques for building open-domain semantic role labeling systems that approach the ideal of a train-once, use-anywhere system. We leverage recently-developed techniques for learning representations of text using latent-variable language models, and extend these techniques to ones that provide the kinds of features that are useful for semantic role labeling. In experiments, our novel system reduces error by 16% relative to the previous state of the art on out-of-domain text.

5 0.56660324 158 acl-2010-Latent Variable Models of Selectional Preference

Author: Diarmuid O Seaghdha

Abstract: This paper describes the application of so-called topic models to selectional preference induction. Three models related to Latent Dirichlet Allocation, a proven method for modelling document-word cooccurrences, are presented and evaluated on datasets of human plausibility judgements. Compared to previously proposed techniques, these models perform very competitively, especially for infrequent predicate-argument combinations where they exceed the quality of Web-scale predictions while using relatively little data.

6 0.56481773 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images

7 0.5639618 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

8 0.56357378 153 acl-2010-Joint Syntactic and Semantic Parsing of Chinese

9 0.56353217 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

10 0.56331468 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

11 0.56314176 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser

12 0.56242633 195 acl-2010-Phylogenetic Grammar Induction

13 0.56086946 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

14 0.5608384 39 acl-2010-Automatic Generation of Story Highlights

15 0.56079024 101 acl-2010-Entity-Based Local Coherence Modelling Using Topological Fields

16 0.5607537 102 acl-2010-Error Detection for Statistical Machine Translation Using Linguistic Features

17 0.56060064 13 acl-2010-A Rational Model of Eye Movement Control in Reading

18 0.5589596 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification

19 0.55886292 245 acl-2010-Understanding the Semantic Structure of Noun Phrase Queries

20 0.55798817 251 acl-2010-Using Anaphora Resolution to Improve Opinion Target Identification in Movie Reviews