emnlp emnlp2010 emnlp2010-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arjun Mukherjee ; Bing Liu
Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. [sent-3, score-0.557]
2 The first technique introduces a new class of features which are variable length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. [sent-6, score-0.563]
3 The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. [sent-7, score-0.676]
4 Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly. [sent-8, score-0.359]
5 , blog search, blog topic tracking, and sentiment analysis of people’s opinions on products and services. [sent-14, score-0.426]
6 Gender classification of blog authors is one such study, which also has many commercial applications. [sent-15, score-0.317]
7 In the past few years, several authors have studied the problem of gender classification in the natural language processing and linguistic communities. [sent-20, score-0.448]
8 For instance, blog posts are typically short and unstructured, and consist of mostly informal sentences, which can contain spurious information and are full of grammar errors, abbreviations, slang words and phrases, and wrong spellings. [sent-25, score-0.271]
9 Due to these reasons, gender classification of blog posts is a harder problem than gender classification of traditional formal text. [sent-26, score-1.167]
10 Recent work has also attempted gender classification of blog authors using features such as content words, dictionary based content analysis results, POS (part-of-speech) tags and feature selection along with a supervised learning algorithm (Schler et al. [sent-27, score-0.964]
11 The patterns are frequent sequences of POS tags which can capture complex stylistic characteristics of male and female authors. [sent-32, score-0.415]
12 The patterns are of variable lengths and need to satisfy some criteria in order for them to represent significant regularities. [sent-35, score-0.264]
13 The second technique is a new feature selection algorithm which uses an ensemble of feature selection criteria and methods. [sent-38, score-0.676]
14 It is well known that each individual feature selection criterion and method can be biased and tends to favor certain types of features. [sent-39, score-0.341]
15 Our experimental results based on a real-life blog data set collected from a large number of blog hosting sites show that the two new techniques enable classification algorithms to significantly improve the accuracy of the current state-of-the-art techniques (Argamon et al. [sent-41, score-0.572]
16 2 Related Work There have been several recent papers on gender classification of blogs (e. [sent-48, score-0.501]
17 (Houvardas and Stamatatos, 2006) even applied character (rather than word or tag) n-grams to capture stylistic features for authorship classification of news articles in Reuters. [sent-60, score-0.353]
18 Given the complexity of blog posts, it makes sense to apply all classes of features jointly in order to classify genders. [sent-63, score-0.389]
19 Moreover, having many feature classes is very useful as they provide features with varied granularities and diversities. [sent-64, score-0.287]
20 Following this idea, this paper proposes a new ensemble feature selection method which is capable of extracting good features from different feature classes using multiple criteria. [sent-67, score-0.63]
21 For example, (Tannen, 1990) deals with gender differences in “conversational style” and in “formal written essays”, and (Gefen and Straub, 1997) reports differences in perception of males and females in the use of emails. [sent-69, score-0.474]
22 Furthermore, our POS sequence patterns can take care of n-grams and capture additional sequence regularities. [sent-80, score-0.291]
23 3 Feature Engineering and Mining There are different classes of features that have been experimented with for gender classification, e. [sent-82, score-0.52]
24 , F-measure, stylistic features, gender preferential features, factor analysis and word classes (Nowson et al. [sent-84, score-0.715]
25 We use all these existing features and also propose a new class of features that are POS sequence patterns, which replace existing POS n-grams. [sent-89, score-0.31]
26 Also, as mentioned before, using all feature classes gives us features with varied granularities. [sent-90, score-0.287]
27 Upon extracting all these classes of features, a new ensemble feature selection (EFS) algorithm is proposed to select a subset of good or discriminative features. [sent-91, score-0.462]
28 The style of writing is typically captured by three types of features: part of speech, words, and in the blog context, words such as lol, hmm, and smiley that appear with high frequency. [sent-117, score-0.213]
29 In this work, we use words and blog words as stylistic features. [sent-118, score-0.367]
30 Part of speech features are mined using our POS sequence pattern mining algorithm. [sent-119, score-0.332]
31 3.3 Gender Preferential Features Gender preferential features consist of a set of signals that have been used in an email gender classification task (Corney et al. [sent-124, score-0.603]
32 These features come from various studies that have been undertaken on the issue of gender and language use (Schiffman, 2002). [sent-126, score-0.401]
33 We used the gender preferential features listed in Table 1, which indicate adjectives and adverbs based on the presence of suffixes and apologies as used in (Corney et al. [sent-131, score-0.499]
34 3.5 Proposed POS Sequence Pattern Features We now present the proposed POS sequence pattern features and the mining algorithm. [sent-146, score-0.256]
35 A POS sequence pattern is a sequence of consecutive POS tags that satisfies some constraints (discussed below). [sent-148, score-0.284]
36 Its mining algorithm mines all such patterns that satisfy the user-specified minimum support (minsup) and minimum adherence (minadherence) thresholds or constraints. [sent-155, score-0.326]
37 These thresholds ensure that the mined patterns represent significant regularities. [sent-156, score-0.218]
38 The SCP of a sequence with two elements (x, y) is the product of the conditional probability of each given the other: SCP(x, y) = P(x|y) P(y|x) = P(x, y)^2 / (P(x) P(y)). Given a consecutive sequence of POS tags x1 ... xn, called a POS sequence of length n, a dispersion point defines two subparts of the sequence. [sent-163, score-0.35]
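The adherence criterion used below generalizes SCP from bigrams to longer sequences via dispersion points. As a concrete illustration, here is a minimal Python sketch of SCP and of its generalization averaged over all dispersion points (the fairSCP formulation); the function names, the probability table, and the exact averaging are our reconstruction from the definitions above, not the authors' code.

```python
def scp(p_xy, p_x, p_y):
    """SCP(x, y) = P(x|y) * P(y|x) = P(x, y)^2 / (P(x) * P(y))."""
    return p_xy ** 2 / (p_x * p_y)

def fair_scp(seq, prob):
    """Adherence of a POS sequence x1..xn (assumed fairSCP form):
    its joint probability squared, divided by the average over all
    dispersion points i of P(x1..xi) * P(x(i+1)..xn).

    `prob` maps a tuple of POS tags to its relative frequency.
    """
    n = len(seq)
    joint = prob[tuple(seq)]
    avg = sum(prob[tuple(seq[:i])] * prob[tuple(seq[i:])]
              for i in range(1, n)) / (n - 1)
    return joint ** 2 / avg
```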
39 Output: All POS sequence patterns (stored in SP) mined from D that satisfy minsup and minadherence. [sent-187, score-0.489]
40 At the end of each scan, it determines which candidate sequences have minsup and minadherence (lines 12 - 13). [sent-226, score-0.34]
41 Finally, the algorithm returns the set of all sequence patterns (line 15) that meet the minsup and minadherence thresholds. [sent-228, score-0.489]
42 In our experiments, we used MAX-length = 7, minsup = 30%, and minadherence = 20% to mine all POS sequence patterns. [sent-231, score-0.388]
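To make the mining procedure concrete, below is a hedged Python sketch that counts POS subsequences and keeps those clearing both thresholds. For brevity it enumerates all subsequences directly instead of the paper's level-wise candidate generation with repeated corpus scans; the name mine_pos_patterns and the use of document frequency for the probabilities are our assumptions. It reuses fair_scp from the sketch above.

```python
from collections import Counter

def mine_pos_patterns(docs, min_sup=0.30, min_adh=0.20, max_len=7):
    """Mine variable-length POS sequence patterns from `docs`, a list
    of POS-tag sequences, keeping patterns whose document frequency
    reaches min_sup and whose adherence reaches min_adh."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tags in docs:
        seen = set()
        for k in range(1, max_len + 1):
            for i in range(len(tags) - k + 1):
                seen.add(tuple(tags[i:i + k]))
        doc_freq.update(seen)  # each subsequence counted once per document
    prob = {s: c / n_docs for s, c in doc_freq.items()}

    return [seq for seq, c in doc_freq.items()
            if len(seq) >= 2
            and c / n_docs >= min_sup
            and fair_scp(seq, prob) >= min_adh]
```

For example, calling this with max_len = 7, min_sup = 0.30, and min_adh = 0.20 corresponds to the thresholds reported above.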
43 There are two common approaches to feature selection: the filter and the wrapper approaches (Blum and Langley, 1997; Kohavi and John, 1997). [sent-241, score-0.272]
44 In the filter approach, features are first ranked based on a feature selection criterion such as information gain, the chi-square (χ2) test, and mutual information. [sent-242, score-0.478]
45 In contrast, the wrapper model chooses features and adds them to the current feature pool based on whether the new features improve the classification accuracy. [sent-244, score-0.443]
46 The wrapper approach, however, becomes very time consuming and impractical when the number of features is large, as each feature is tested by building a new classifier. [sent-246, score-0.282]
47 The filter approach often uses only one feature selection criterion (e. [sent-247, score-0.388]
48 In this work, we developed a novel feature selection method that uses multiple criteria, and combines both the wrapper and the filter approaches. [sent-251, score-0.407]
49 It first uses a number of feature selection criteria to rank the features following the filter model. [sent-255, score-0.437]
50 Upon ranking, the algorithm generates some candidate feature subsets which are used to find the final feature set based on classification accuracy using the wrapper model. [sent-256, score-0.565]
51 Since our framework generates far fewer candidate feature subsets than the total number of features, using the wrapper model with candidate feature sets is scalable. [sent-257, score-0.466]
52 Also, since the algorithm generates candidate feature sets using multiple criteria and all feature classes jointly, it is able to capture most of those features which are discriminating. [sent-258, score-0.532]
53 The algorithm takes as input a set of n features F = {f1, …, fn}, a set of t feature selection criteria Θ = {θ1, …, θt}, a set of t thresholds Τ = {τ1, …, τt} corresponding to the criteria in Θ, and a window w. [sent-260, score-0.553]
54 (Fragment of the EFS pseudocode:) for each criterion θi: remove the first feature subset ζi from Ci in order; Λ ← Λ ∪ ζi; endfor; add Λ to OptCandFeatures // Λ is a set of features comprising the features in the feature sets ζi [sent-287, score-0.388]
55 Using a set of different feature selection measures, Θ, we rank all features in our feature pool, F, using the set of criteria (lines 1–3). [sent-295, score-0.501]
56 Each set Ci contains feature subsets, and each subset ζi is the set of top τ features in ξi ranked based on criterion θi in lines 1–2. [sent-298, score-0.306]
57 We vary τ and generate 2w + 1 feature sets and add all such feature sets ζi to Ci (in lines 6–8) in order. [sent-300, score-0.265]
58 In lines 11–20 we generate candidate feature sets using Ci and add each such candidate feature set Λ to OptCandFeatures. [sent-303, score-0.359]
59 Each candidate feature set Λ is a collection of top ranked features based on multiple criteria. [sent-304, score-0.215]
60 It is generated by taking the union of the features in the first feature subset ζi, which is then removed from Ci, for each criterion θi (lines 14–17). [sent-305, score-0.263]
61 Since each Ci has 2w+1 feature subsets ζi, there are a total of 2w+1 candidate feature sets Λ in OptCandFeatures. [sent-307, score-0.305]
62 To counter this, we use the window w to select various feature subsets close to the top τi features in ξi. [sent-317, score-0.239]
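Putting the steps above together, the following Python sketch shows the overall EFS flow: rank the feature pool under each criterion (filter step), slice 2w + 1 nested top-τ subsets per criterion using the window, union the subsets rank-by-rank into candidate feature sets, and keep the candidate with the best held-out accuracy (wrapper step). The function signature, the evaluate callback, and the exact way the window offsets are applied are illustrative assumptions reconstructed from the description, not the authors' implementation.

```python
def efs(features, criteria, taus, w, evaluate):
    """Ensemble feature selection (a sketch). criteria[i] scores a
    feature under criterion i; taus[i] is that criterion's cut-off;
    evaluate trains a classifier on a candidate set and returns its
    held-out accuracy."""
    per_criterion = []
    for theta, tau in zip(criteria, taus):
        ranked = sorted(features, key=theta, reverse=True)  # filter step
        # 2w + 1 nested subsets around the top-tau cut-off (window w).
        per_criterion.append([set(ranked[:tau + d])
                              for d in range(-w, w + 1)])
    # Union the j-th subset of every criterion into one candidate set.
    candidates = [set().union(*(subsets[j] for subsets in per_criterion))
                  for j in range(2 * w + 1)]
    # Wrapper step: keep the candidate with the best accuracy.
    return max(candidates, key=evaluate)
```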
63 Finally, we are aware that there are some existing ensemble feature selection methods in the machine learning literature (Garganté et al. [sent-322, score-0.376]
64 They mainly use ensemble classification methods to help choose good features rather than combining different feature selection criteria and integrating different feature selection approaches as in our method. [sent-326, score-0.837]
65 4.2 Feature Selection Criteria The set of feature selection criteria Θ = {θ1, …, θt} used in our work are those commonly used individual selection criteria in the filter approach. [sent-328, score-0.602]
66 The mutual information MI(f, c) between a class c and a feature f is defined as: MI(f, c) = Σ_{f' ∈ {f, ¬f}} Σ_{c' ∈ {c, ¬c}} P(f', c') log [ P(f', c') / (P(f') P(c')) ]. The scoring function generally used as the criterion is the max among all classes. [sent-333, score-0.274]
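For concreteness, here is a minimal Python sketch of this criterion for one binary feature and one class, computed from a 2×2 document contingency table; the helper name and the small smoothing constant are our additions.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a binary feature f and a class c from document counts:
    n11 = docs containing f in class c, n10 = docs containing f outside c,
    n01 = docs without f in c, n00 = docs without f outside c.
    A tiny epsilon (an assumption) guards against log(0) on empty cells."""
    eps = 1e-12
    n = n11 + n10 + n01 + n00
    cells = [(n11, n11 + n10, n11 + n01),   # (f, c)
             (n10, n11 + n10, n10 + n00),   # (f, not c)
             (n01, n01 + n00, n11 + n01),   # (not f, c)
             (n00, n01 + n00, n10 + n00)]   # (not f, not c)
    mi = 0.0
    for n_fc, n_f, n_c in cells:
        p_fc = (n_fc + eps) / n
        mi += p_fc * math.log(p_fc * n * n / ((n_f + eps) * (n_c + eps)))
    return mi
```

The score for a feature is then the max of this value over the classes (here, male and female), as stated above.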
67 While the Boolean scheme assigns a 1 to the feature value if the feature is present in the document and a 0 otherwise, the TF scheme assigns the relative frequency of the number of times that the feature occurs in the document. [sent-344, score-0.405]
68 The feature value assignment to different classes of features is done as follows: The value of F-measure was assigned based on its actual value. [sent-346, score-0.287]
69 Stylistic features such as words and blog words were assigned values 1 or 0 in the Boolean scheme and the relative frequency in the TF scheme (we experimented with both schemes). [sent-347, score-0.342]
70 Feature values for gender preferential features were also assigned in a similar way. [sent-348, score-0.499]
71 Factor and word class features were assigned values according to the Boolean or TF scheme if any of the words belonging to the feature class exists (factor or word class appeared in that document). [sent-349, score-0.309]
72 Each POS sequence pattern feature was assigned a value according to the Boolean (or TF) scheme based on the appearances of the pattern in the POS tagged document. [sent-350, score-0.354]
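As an illustration of the two weighting schemes described above, here is a short document-vectorization sketch; the token-list interface and the vocabulary argument are placeholders introduced for the example.

```python
def vectorize(tokens, vocabulary, scheme="boolean"):
    """Map one document (a token list) to feature values. The Boolean
    scheme assigns 1 to every feature present in the document; the TF
    scheme assigns its relative frequency in the document."""
    counts = {}
    for tok in tokens:
        if tok in vocabulary:
            counts[tok] = counts.get(tok, 0) + 1
    if scheme == "boolean":
        return {f: 1 for f in counts}
    return {f: c / len(tokens) for f, c in counts.items()}  # TF scheme
```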
73 In all our experiments, we used accuracy as the evaluation measure as the two classes (male and female) are roughly balanced (see the data description below), and both classes are equally important. [sent-365, score-0.28]
74 1 Blog Data Set To keep the problem of gender classification of informal text as general as possible, we collected blog posts from many blog hosting sites and blog search engines, e. [sent-367, score-1.145]
75 Each blog is labeled with the gender of its author. [sent-373, score-0.557]
76 The gender of the author was determined by visiting the profile of the author. [sent-374, score-0.344]
77 Profile pictures or avatars associated with the profile were also helpful in confirming the gender especially when the gender information was not available explicitly. [sent-375, score-0.688]
78 To ensure quality of the labels, one group of students collected the blogs and did the initial labeling, and the other group double-checked the labels by visiting the actual blog pages. [sent-376, score-0.266]
79 2 Results We used all features from different feature classes (Section 3) along with our POS patterns as our 214 pool of features. [sent-382, score-0.388]
80 EFS was compared with three commonly used feature selection methods on SVM classification (denoted by SVM), SVM regression (denoted by SVM_R) and the NB classifier. [sent-386, score-0.419]
81 We tested our system without any feature selection and without using the POS sequence patterns as features. [sent-395, score-0.442]
82 The comparison results with existing algorithms and public domain systems using our real-life blog data set are tabulated in Table 7. [sent-396, score-0.246]
83 Also, to see whether feature selection helps and how many features are optimal, we varied τ and w of the EFS algorithm and plotted the accuracy vs. [sent-397, score-0.345]
84 • Table 5 also shows that our EFS feature selection method brings about 6-10% improvement in accuracy over the other feature selection methods based on SVM classification and SVM regression. [sent-413, score-0.638]
85 • Keeping all other parameters constant, Table 5 also shows that Boolean feature values yielded better results than the TF scheme across all classifiers and feature selection methods. [sent-417, score-0.393]
86 • Row 1 of Table 6 tells us that feature selection is very useful. [sent-418, score-0.246]
87 Without feature selection (All features), SVM regression only achieves 70% accuracy, which is far inferior to the 88. [sent-419, score-0.315]
88 From Tables 5 and 6, we can infer that the overall accuracy improvement using EFS and all feature classes described in Section 3 is about 15% for SVM classification and regression and 10% for NB. [sent-426, score-0.445]
89 From Figure 1, we see that when the number of features selected is small (<100) the classification accuracy is lower than that obtained by using all features (no feature selection). [sent-444, score-0.371]
90 Finally, we would like to mention that (Herring and Paolillo, 2006) examined the relationship between genre and gender classification. [sent-453, score-0.344]
91 Their finding that the subgenre "diary" contains more "female" stylistic features and the subgenre "filter" more "male" ones, independent of the author's gender, may obscure gender classification, as there are many factors to be considered. [sent-454, score-0.692]
92 We are also aware of other factors influencing gender classification, such as genre, age, and ethnicity. [sent-456, score-0.448]
93 Also, since EFS is a general feature selection method for machine learning, it would be useful to perform further experiments to investigate how well it performs on a variety of classification datasets. [sent-459, score-0.35]
94 7 Conclusions This paper studied the problem of gender classification. [sent-461, score-0.344]
95 In particular, we proposed a new class of features which are POS sequence patterns that are able to capture complex stylistic regularities of male and female authors. [sent-464, score-0.602]
96 Since there are a large number of features that have been considered, it is important to find a subset of features that have positive effects on the classification task. [sent-465, score-0.218]
97 Here, we proposed an ensemble feature selection method which takes advantage of many different types of feature selection criteria in feature selection. [sent-466, score-0.787]
98 Experimental results based on a real-life blog data set demonstrated the effectiveness of the proposed techniques. [sent-467, score-0.213]
99 An extensive empirical study of feature selection metrics for text classification. [sent-528, score-0.246]
100 High performing and scalable feature selection for text classification. [sent-626, score-0.246]
wordName wordTfidf (topN-words)
[('efs', 0.359), ('gender', 0.344), ('blog', 0.213), ('pos', 0.211), ('argamon', 0.21), ('minsup', 0.179), ('yan', 0.176), ('endfor', 0.163), ('stylistic', 0.154), ('selection', 0.135), ('ci', 0.126), ('schler', 0.126), ('classes', 0.119), ('minadherence', 0.114), ('optcandfeatures', 0.114), ('wrapper', 0.114), ('feature', 0.111), ('classification', 0.104), ('patterns', 0.101), ('adherence', 0.098), ('nowson', 0.098), ('preferential', 0.098), ('scp', 0.098), ('koppel', 0.098), ('ensemble', 0.097), ('criterion', 0.095), ('sequence', 0.095), ('svm', 0.093), ('criteria', 0.087), ('male', 0.084), ('corney', 0.082), ('fairscp', 0.082), ('mladenic', 0.082), ('mined', 0.076), ('female', 0.076), ('boolean', 0.072), ('ck', 0.07), ('regression', 0.069), ('contextuality', 0.065), ('dispersion', 0.065), ('females', 0.065), ('males', 0.065), ('srikant', 0.065), ('posts', 0.058), ('features', 0.057), ('agrawal', 0.056), ('pennebaker', 0.056), ('pattern', 0.056), ('blogs', 0.053), ('xn', 0.05), ('bookblog', 0.049), ('dewaele', 0.049), ('formality', 0.049), ('genie', 0.049), ('grobelnik', 0.049), ('heylighen', 0.049), ('houvardas', 0.049), ('krawetz', 0.049), ('stamatatos', 0.049), ('writings', 0.049), ('mining', 0.048), ('filter', 0.047), ('candidate', 0.047), ('tf', 0.046), ('nb', 0.044), ('lines', 0.043), ('accuracy', 0.042), ('paolillo', 0.042), ('herring', 0.042), ('baayen', 0.042), ('thresholds', 0.041), ('men', 0.041), ('lengths', 0.038), ('satisfy', 0.038), ('authorship', 0.038), ('scheme', 0.036), ('mi', 0.036), ('subsets', 0.036), ('chicago', 0.036), ('window', 0.035), ('fmeasure', 0.035), ('class', 0.035), ('joachims', 0.033), ('existing', 0.033), ('mutual', 0.033), ('borgelt', 0.033), ('connotations', 0.033), ('downward', 0.033), ('forman', 0.033), ('gargant', 0.033), ('gefen', 0.033), ('guesser', 0.033), ('implicitness', 0.033), ('kohavi', 0.033), ('langley', 0.033), ('mapreduce', 0.033), ('obscure', 0.033), ('preferring', 0.033), ('schiffman', 0.033), ('spk', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 61 emnlp-2010-Improving Gender Classification of Blog Authors
Author: Arjun Mukherjee ; Bing Liu
Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
2 0.099370867 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
Author: Lei Shi ; Rada Mihalcea ; Mingjun Tian
Abstract: In this paper, we introduce a method that automatically builds text classifiers in a new language by training on already labeled data in another language. Our method transfers the classification knowledge across languages by translating the model features and by using an Expectation Maximization (EM) algorithm that naturally takes into account the ambiguity associated with the translation of a word. We further exploit the readily available unlabeled data in the target language via semisupervised learning, and adapt the translated model to better fit the data distribution of the target language.
3 0.079997122 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
Author: Yunliang Jiang ; Cindy Xide Lin ; Qiaozhu Mei
Abstract: In this paper, we conducted a systematic comparative analysis of language in different contexts of bursty topics, including web search, news media, blogging, and social bookmarking. We analyze (1) the content similarity and predictability between contexts, (2) the coverage of search content by each context, and (3) the intrinsic coherence of information in each context. Our experiments show that social bookmarking is a better predictor to the bursty search queries, but news media and social blogging media have a much more compelling coverage. This comparison provides insights on how the search behaviors and social information sharing behaviors of users are correlated to the professional news media in the context of bursty events.
4 0.07789357 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng
Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
5 0.073237784 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu
Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.
6 0.069894627 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
7 0.069212988 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging
8 0.067911498 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
9 0.061594639 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
10 0.056039672 51 emnlp-2010-Function-Based Question Classification for General QA
11 0.05530861 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model
12 0.053511001 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
13 0.053323533 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
14 0.053201981 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
15 0.051230643 114 emnlp-2010-Unsupervised Parse Selection for HPSG
16 0.050516564 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
17 0.050340641 39 emnlp-2010-EMNLP 044
18 0.050271124 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars
19 0.049363624 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
20 0.049166538 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
topicId topicWeight
[(0, 0.177), (1, 0.108), (2, -0.06), (3, -0.008), (4, -0.055), (5, 0.025), (6, -0.009), (7, 0.001), (8, -0.054), (9, 0.023), (10, 0.075), (11, 0.028), (12, 0.043), (13, -0.052), (14, 0.059), (15, 0.108), (16, 0.039), (17, 0.069), (18, -0.098), (19, 0.092), (20, -0.073), (21, -0.046), (22, 0.049), (23, 0.01), (24, 0.116), (25, 0.065), (26, 0.085), (27, 0.003), (28, -0.038), (29, 0.119), (30, -0.11), (31, -0.113), (32, 0.086), (33, -0.057), (34, 0.112), (35, 0.066), (36, 0.152), (37, 0.113), (38, 0.33), (39, -0.302), (40, 0.26), (41, 0.007), (42, 0.04), (43, -0.112), (44, -0.216), (45, 0.004), (46, -0.195), (47, 0.206), (48, 0.072), (49, 0.083)]
simIndex simValue paperId paperTitle
same-paper 1 0.96276027 61 emnlp-2010-Improving Gender Classification of Blog Authors
Author: Arjun Mukherjee ; Bing Liu
Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
2 0.37174076 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
Author: Sankaranarayanan Ananthakrishnan ; Rohit Prasad ; David Stallard ; Prem Natarajan
Abstract: Production of parallel training corpora for the development of statistical machine translation (SMT) systems for resource-poor languages usually requires extensive manual effort. Active sample selection aims to reduce the labor, time, and expense incurred in producing such resources, attaining a given performance benchmark with the smallest possible training corpus by choosing informative, nonredundant source sentences from an available candidate pool for manual translation. We present a novel, discriminative sample selection strategy that preferentially selects batches of candidate sentences with constructs that lead to erroneous translations on a held-out development set. The proposed strategy supports a built-in diversity mechanism that reduces redundancy in the selected batches. Simulation experiments on English-to-Pashto and Spanish-to-English translation tasks demonstrate the superiority of the proposed approach to a number of competing techniques, such as random selection, dissimilarity-based selection, as well as a recently proposed semisupervised active learning strategy.
3 0.35256448 85 emnlp-2010-Negative Training Data Can be Harmful to Text Classification
Author: Xiao-Li Li ; Bing Liu ; See-Kiong Ng
Abstract: This paper studies the effects of training data on binary text classification and postulates that negative training data is not needed and may even be harmful for the task. Traditional binary classification involves building a classifier using labeled positive and negative training examples. The classifier is then applied to classify test instances into positive and negative classes. A fundamental assumption is that the training and test data are identically distributed. However, this assumption may not hold in practice. In this paper, we study a particular problem where the positive data is identically distributed but the negative data may or may not be so. Many practical text classification and retrieval applications fit this model. We argue that in this setting negative training data should not be used, and that PU learning can be employed to solve the problem. Empirical evaluation has been conducted to support our claim. This result is important as it may fundamentally change the current binary classification paradigm.
4 0.29833725 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task
Author: Roi Reichart ; Ari Rappoport
Abstract: Polysemy is a major characteristic of natural languages. Like words, syntactic forms can have several meanings. Understanding the correct meaning of a syntactic form is of great importance to many NLP applications. In this paper we address an important type of syntactic polysemy: the multiple possible senses of tense syntactic forms. We make our discussion concrete by introducing the task of Tense Sense Disambiguation (TSD): given a concrete tense syntactic form present in a sentence, select its appropriate sense among a set of possible senses. Using English grammar textbooks, we compiled a syntactic sense dictionary comprising common tense syntactic forms and semantic senses for each. We annotated thousands of BNC sentences using the defined senses. We describe a supervised TSD algorithm trained on these annotations, which outperforms a strong baseline for the task.
Author: Hugo Hernault ; Danushka Bollegala ; Mitsuru Ishizuka
Abstract: Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of cooccurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
6 0.26804367 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model
7 0.26597369 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
8 0.25887081 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
9 0.25706822 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
10 0.25440279 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
11 0.25392208 26 emnlp-2010-Classifying Dialogue Acts in One-on-One Live Chats
12 0.24844891 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning
13 0.24574213 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media
14 0.24249986 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
15 0.23964135 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
16 0.23275407 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars
17 0.22440892 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
18 0.22085434 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing
20 0.21713151 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging
topicId topicWeight
[(3, 0.464), (10, 0.011), (12, 0.038), (29, 0.068), (30, 0.029), (52, 0.024), (56, 0.092), (66, 0.11), (72, 0.041), (76, 0.016), (79, 0.01), (87, 0.013)]
simIndex simValue paperId paperTitle
1 0.90349352 15 emnlp-2010-A Unified Framework for Scope Learning via Simplified Shallow Semantic Parsing
Author: Qiaoming Zhu ; Junhui Li ; Hongling Wang ; Guodong Zhou
Abstract: This paper approaches the scope learning problem via simplified shallow semantic parsing. This is done by regarding the cue as the predicate and mapping its scope into several constituents as the arguments of the cue. Evaluation on the BioScope corpus shows that the structural information plays a critical role in capturing the relationship between a cue and its dominated arguments. It also shows that our parsing approach significantly outperforms the state-of-the-art chunking ones. Although our parsing approach is only evaluated on negation and speculation scope learning here, it is portable to other kinds of scope learning.
2 0.74449933 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input
Author: Radu Florian ; John Pitrelli ; Salim Roukos ; Imed Zitouni
Abstract: Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text.
same-paper 3 0.72651333 61 emnlp-2010-Improving Gender Classification of Blog Authors
Author: Arjun Mukherjee ; Bing Liu
Abstract: The problem of automatically classifying the gender of a blog author has important applications in many commercial domains. Existing systems mainly use features such as words, word classes, and POS (part-of-speech) n-grams for classification learning. In this paper, we propose two new techniques to improve the current result. The first technique introduces a new class of features which are variable-length POS sequence patterns mined from the training data using a sequence pattern mining algorithm. The second technique is a new feature selection method which is based on an ensemble of several feature selection criteria and approaches. Empirical evaluation using a real-life blog data set shows that these two techniques improve the classification accuracy of the current state-of-the-art methods significantly.
4 0.39783677 44 emnlp-2010-Enhancing Mention Detection Using Projection via Aligned Corpora
Author: Yassine Benajiba ; Imed Zitouni
Abstract: The research question treated in this paper is centered on the idea of exploiting rich resources of one language to enhance the performance of a mention detection system of another one. We successfully achieve this goal by projecting information from one language to another via a parallel corpus. We examine the potential improvement using various degrees of linguistic information in a statistical framework and we show that the proposed technique is effective even when the target language model has access to a significantly rich feature set. Experimental results show up to 2.4F improvement in performance when the system has access to information obtained by projecting mentions from a resource-rich-language mention detection system via a parallel corpus.
5 0.38134921 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu
Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.
6 0.38025504 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
7 0.37959316 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning
8 0.37791091 20 emnlp-2010-Automatic Detection and Classification of Social Events
9 0.3742533 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
10 0.37419099 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
11 0.37333769 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
12 0.36519098 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions
13 0.36075497 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval
14 0.36014423 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
15 0.35837287 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
16 0.35703105 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue
17 0.35534793 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text
18 0.35314843 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
19 0.35155064 84 emnlp-2010-NLP on Spoken Documents Without ASR
20 0.35016724 86 emnlp-2010-Non-Isomorphic Forest Pair Translation