acl acl2011 acl2011-97 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing
Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. [sent-5, score-0.228]
2 Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. [sent-6, score-1.068]
3 By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. [sent-7, score-0.18]
4 First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. [sent-9, score-0.838]
5 Next, we conjoin demographic attributes into features, which we use to predict term frequencies. [sent-10, score-0.917]
6 The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties. [sent-11, score-0.808]
7 Quantitative sociolinguistics usually addresses this question through carefully crafted studies that correlate individual demographic attributes and linguistic variables—for example, the interaction between income and the “dropped r” feature of the New York accent (Labov, 1966). [sent-13, score-0.875]
8 Using multi-output regression with structured sparsity, our method identifies a small subset of lexical items that are most influenced by demographics, and discovers conjunctions of demographic attributes that are especially salient for lexical variation. [sent-16, score-1.047]
9 On the demographic side, the interaction between demographic attributes is often non-linear: for example, gender may negate or amplify class-based language differences (Zhang, 2005). [sent-19, score-0.805]
10 Thus, additive models which assume that each demographic attribute makes a linear contribution are inadequate. [sent-20, score-0.691]
11 In this paper, we explore the large space of potential sociolinguistic associations using structured sparsity. [sent-21, score-0.168]
12 We treat the relationship between language and demographics as a set of multi-input, multioutput regression problems. [sent-22, score-0.41]
13 The regression coefficients are arranged in a matrix, with rows indicating predictors and columns indicating outputs. [sent-23, score-0.341]
14 We apply a composite regularizer that drives entire rows of the coefficient matrix to zero, yielding compact, interpretable models that reuse features across different outputs. [sent-24, score-0.313]
15 If we treat the lexical frequencies as inputs and the author’s demographics as outputs, the induced sparsity pattern reveals the set of lexical items that is most closely tied to demographics. [sent-25, score-0.3]
16 If we treat the demographic attributes as inputs and build a model to predict the text, we can incrementally construct a conjunctive feature space of demographic attributes, capturing key non-linear interactions. [sent-26, score-1.546]
17 ©2011 Association for Computational Linguistics, pages 1365–1374. The primary purpose of this research is exploratory data analysis to identify both the most linguistically salient demographic features and the most demographically salient words. [sent-29, score-0.631]
18 However, this model also enables predictions about demographic features by analyzing raw text, potentially supporting applications in targeted information extraction or advertising. [sent-30, score-0.721]
19 On the task of predicting demographics from text, we find that our sparse model yields performance that is statistically indistinguishable from the full vocabulary, even though model complexity is reduced by an order of magnitude. [sent-31, score-0.208]
20 On the task of predicting text from author demographics, we find that our incrementally constructed feature set obtains significantly better perplexity than a linear model of demographic attributes. [sent-32, score-0.695]
21 2 Data Our dataset is derived from prior work in which we gathered the text and geographical locations of 9,250 microbloggers on the website twitter . [sent-33, score-0.123]
22 (2010) obtained aggregate demographic statistics for these data by mapping geolocations to publicly-available data from the U. [sent-43, score-0.669]
23 The demographic attributes that we consider in this paper are shown in Table 1. [sent-47, score-0.805]
24 The race and ethnicity attributes are not mutually exclusive—individuals can indicate any number of races or ethnicities. [sent-49, score-0.367]
25 [Table 1 rows: % Spanish speakers; % other language speakers] [sent-63, score-0.156]
26 Table 1: The demographic attributes used in this research. [sent-73, score-0.805]
27 “Urban areas” refer to sets of census tracts or census blocks which contain at least 2,500 residents; our “% urban” attribute is the percentage of individuals in each ZCTA who are listed as living in an urban area. [sent-75, score-0.395]
28 While geographical aggregate statistics are frequently used to proxy for individual socioeconomic status in research areas such as public health (e. [sent-77, score-0.277]
29 Polling research suggests that users of both Twitter (Smith and Rainie, 2010) and geolocation services (Zickuhr and Smith, 2010) are much more diverse with respect to age, gender, race and ethnicity than the general population of Internet users. [sent-81, score-0.193]
30 Nonetheless, at present we can only use aggregate statistics to make inferences about the geographic communities in which our authors live, and not the authors themselves. [sent-82, score-0.172]
31 3 Models The selection of both words and demographic features can be framed in terms of multi-output regression with structured sparsity. [sent-86, score-0.913]
32 To select the lexical indicators that best predict demographics, we construct a regression problem in which term frequencies are the predictors and demographic attributes are the outputs; to select the demographic features that predict word use, this arrangement is reversed. [sent-87, score-1.974]
33 Through structured sparsity, we learn models in which entire sets of coefficients are driven to zero; this tells us which words and demographic features can safely be ignored. [sent-88, score-0.77]
34 This section describes the model and implementation for multi-output regression with structured sparsity; in Sections 4 and 5 we give the details of its application to select terms and demographic features. [sent-89, score-0.747]
35 We would like to solve the unconstrained optimization problem $\text{minimize}_B \; \|Y - XB\|_F^2 + \lambda R(B)$ (1), where $\|A\|_F^2$ indicates the squared Frobenius norm $\sum_i \sum_j a_{ij}^2$, and the function $R(B)$ defines a norm on the regression coefficients $B$. [sent-94, score-0.335]
36 Ridge regression applies the ℓ2 norm $R(B) = \sum_{t=1}^{T} \sqrt{\sum_{p=1}^{P} b_{pt}^2}$, and lasso regression applies the ℓ1 norm $R(B) = \sum_{t=1}^{T} \sum_{p=1}^{P} |b_{pt}|$; in both cases, it is possible to decompose the multi-output regression problem, treating each output dimension separately. [sent-95, score-0.654]
37 However, our working hypothesis is that there will be substantial correlations across both the vocabulary and the demographic features—for example, a demographic feature such as the percentage of Spanish speakers will predict a large set of words. [sent-96, score-1.457]
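As an illustration of objective (1) with the ridge and lasso penalties described above, the following is a minimal plain-Python sketch. The function names and the toy matrices are hypothetical and chosen only for this example; they are not taken from the paper's implementation.

```python
import math

def matmul(X, B):
    """Plain-Python product of X (N x P) and B (P x T)."""
    return [[sum(x * b for x, b in zip(row, col)) for col in zip(*B)] for row in X]

def frobenius_sq(A):
    """Squared Frobenius norm: the sum of all squared entries of A."""
    return sum(a * a for row in A for a in row)

def ridge_penalty(B):
    """Sum over output columns t of the l2 norm of column t."""
    return sum(math.sqrt(sum(b * b for b in col)) for col in zip(*B))

def lasso_penalty(B):
    """Sum of the absolute values of all coefficients."""
    return sum(abs(b) for row in B for b in row)

def objective(Y, X, B, lam, penalty):
    """Squared reconstruction error plus lam times the chosen penalty."""
    predicted = matmul(X, B)
    residual = [[y - p for y, p in zip(y_row, p_row)]
                for y_row, p_row in zip(Y, predicted)]
    return frobenius_sq(residual) + lam * penalty(B)

# Toy data (illustrative only): 3 instances, 2 predictors, 2 outputs.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
B = [[3.0, 0.0], [4.0, -2.0]]
Y = matmul(X, B)  # zero residual, so only the penalty contributes

print(objective(Y, X, B, 0.5, ridge_penalty))
print(objective(Y, X, B, 0.5, lasso_penalty))
```

Both penalties are sums over output columns, which is why the problem can be split into one independent regression per output.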
38 Our goal is to select a small set of predictors yielding good performance across all output dimensions. [sent-97, score-0.122]
39 Thus, we desire structured sparsity, in which entire rows of the coefficient matrix B are driven to zero. [sent-98, score-0.19]
40 The lasso gives element-wise sparsity, in which many entries of B are driven to zero, but each predictor may have a non-zero value for some output dimension. [sent-100, score-0.305]
41 This norm, which corresponds to a multioutput lasso regression, has the desired property of driving entire rows of B to zero. [sent-104, score-0.413]
42 If the number of predictors is too large, it may not be possible to store the dense matrix D in memory. [sent-123, score-0.144]
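The sentence defining the composite norm is not part of this excerpt, so the sketch below assumes the usual row-sparsity form of the ℓ1,∞ norm: the sum, over predictor rows, of the largest absolute coefficient in each row. The helper names and the toy matrix are hypothetical.

```python
def l1_inf_norm(B):
    """Assumed form: sum over predictor rows of the max absolute entry in the row."""
    return sum(max(abs(b) for b in row) for row in B)

def active_rows(B, tol=1e-12):
    """Indices of predictors whose row is not entirely (numerically) zero."""
    return [p for p, row in enumerate(B) if max(abs(b) for b in row) > tol]

# Toy coefficient matrix: 4 predictors (rows) by 3 outputs (columns).
B = [[0.0, 0.0, 0.0],
     [0.5, -2.0, 0.0],
     [0.0, 0.0, 0.0],
     [1.5, 0.0, 0.25]]

print(l1_inf_norm(B))   # cost depends only on the largest entry per row
print(active_rows(B))   # predictors that survive selection
```

Under this form, a row pays the same cost whether it has one large coefficient or many, so once a predictor is kept for one output it can be reused across the other outputs at no extra penalty.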
43 At each λi, we solve the sparse multi-output regression; the solution Bi defines a sparse set of predictors for all tasks. [sent-136, score-0.191]
44 We then use this limited set of predictors to construct a new input matrix which serves as the input in a standard ridge regression, thus refitting the model. [sent-137, score-0.271]
45 The tuning set performance of this regression is the score for λi. [sent-138, score-0.161]
46 Such post hoc refitting is often used in tandem with the lasso and related sparse methods; the effectiveness of this procedure has been demonstrated in both theory (Wasserman and Roeder, 2009) and practice (Wu et al. [sent-139, score-0.371]
47 The regularization parameter of the ridge regression is determined by internal cross-validation. [sent-141, score-0.302]
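The refitting step described above can be sketched as follows: keep only the predictors whose rows are non-zero in the sparse solution, then fit an ordinary ridge regression on that reduced input. This is a minimal plain-Python sketch under the assumption of the standard closed-form ridge solution; the function names, the toy data, and the fixed `alpha` are hypothetical (the text chooses the ridge parameter by internal cross-validation, which is not shown).

```python
def ridge_fit(X, Y, alpha):
    """Closed-form ridge: solve (X^T X + alpha*I) B = X^T Y by Gauss-Jordan elimination."""
    P, T = len(X[0]), len(Y[0])
    cols = list(zip(*X))  # columns of X
    # Build the augmented system [X^T X + alpha*I | X^T Y].
    M = []
    for i in range(P):
        gram = [sum(a * b for a, b in zip(cols[i], cols[j])) + (alpha if i == j else 0.0)
                for j in range(P)]
        rhs = [sum(a * y[t] for a, y in zip(cols[i], Y)) for t in range(T)]
        M.append(gram + rhs)
    for c in range(P):
        # Partial pivoting; alpha > 0 guarantees the system is invertible.
        pivot = max(range(c, P), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        scale = M[c][c]
        M[c] = [v / scale for v in M[c]]
        for r in range(P):
            if r != c:
                factor = M[r][c]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[c])]
    return [row[P:] for row in M]  # P x T coefficient matrix

def refit_selected(X, Y, B_sparse, alpha, tol=1e-12):
    """Restrict X to predictors with a non-zero row in B_sparse, then refit with ridge."""
    keep = [p for p, row in enumerate(B_sparse) if max(abs(b) for b in row) > tol]
    X_sel = [[row[p] for p in keep] for row in X]
    return keep, ridge_fit(X_sel, Y, alpha)

# Toy data: 4 instances, 3 predictors, 2 outputs; predictor 1 was zeroed out.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
Y = [[5.0, 0.0], [2.0, 0.5], [1.0, -1.0], [4.0, -1.5]]
B_sparse = [[0.3, -0.1], [0.0, 0.0], [0.2, 0.4]]

keep, B_refit = refit_selected(X, Y, B_sparse, alpha=1.0)
print(keep)
print(B_refit)
```

The sparse solution is used only to decide which predictors to keep; the refit replaces its shrunken coefficient values.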
48 4 Predicting Demographics from Text Sparse multi-output regression can be used to select a subset of vocabulary items that are especially indicative of demographic and geographic differences. [sent-142, score-0.947]
49 Starting from the regression problem (1), the predictors X are set to the term frequencies, with one column for each word type and one row for each author in the dataset. [sent-144, score-0.314]
50 The outputs Y are set to the ten demographic attributes described in Table 1 (we consider much larger demographic feature spaces in the next section). The ℓ1,∞ regularizer will drive entire rows of the coefficient matrix B to zero, eliminating all demographic effects for many words. [sent-145, score-2.267]
51 1 Quantitative Evaluation We evaluate the ability of lexical features to predict the demographic attributes of their authors (as proxied by the census data from the author’s geographical area). [sent-147, score-1.089]
52 In addition, this evaluation establishes a baseline for performance on the demographic prediction task. [sent-149, score-0.631]
53 We perform five-fold cross-validation, using the multi-output lasso to identify a sparse feature set in the training data. [sent-150, score-0.323]
54 We compare against several other dimensionality reduction techniques, matching the number of features obtained by the multioutput lasso at each fold. [sent-151, score-0.408]
55 As before, we perform post hoc refitting on the training data using a standard ridge regression. [sent-155, score-0.127]
56 The regularization constant for the ridge regression is identified using nested five-fold cross validation within the training set. [sent-156, score-0.338]
57 Linguistic features are best at predicting race, ethnicity, language, and the proportion of renters; the other demographic attributes are more difficult to predict. [sent-163, score-0.25]
58 Among feature sets, the highest average correlation is obtained by the full vocabulary, but the multioutput lasso obtains nearly identical performance using a feature set that is an order of magnitude smaller. [sent-164, score-0.364]
59 We find that the multi-output lasso and the full vocabulary regression are not significantly different on any of the attributes. [sent-170, score-0.462]
60 Thus, the multioutput lasso achieves a 93% compression of the feature set without a significant decrease in predictive performance. [sent-171, score-0.364]
61 The multi-output lasso yields higher correlations than the other dimensionality reduction techniques on all of the attributes; these differences are statistically significant in many, but not all, cases. [sent-172, score-0.318]
62 Recall that the regularization coefficient was chosen by nested cross-validation within the training set; the average number of features selected is 394. [sent-174, score-0.175]
63 Computing the truncated SVD of a sparse matrix at very large truncation levels is computationally expensive, so we cannot draw the complete performance curve for this method. [sent-177, score-0.195]
64 The multi-output lasso dominates the alternatives, obtaining a particularly strong advantage with very small feature sets. [sent-178, score-0.269]
65 For each identified term, we apply a significance test on the relationship between the presence of each term and the demographic indicators shown in the columns of the table. [sent-182, score-0.696]
66 The use of sparse multioutput regression for variable selection increases the power of post hoc significance testing, because the Bonferroni correction bases the threshold for statistical significance on the total number of comparisons. [sent-184, score-0.346]
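To see why variable selection helps here, recall that the Bonferroni correction divides the significance level by the number of comparisons. The sketch below uses hypothetical counts (5,000 versus 400 terms, each tested against 10 attributes) purely for illustration; they are not figures from the paper.

```python
def bonferroni_threshold(alpha, num_comparisons):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / num_comparisons

def significant(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]

# Hypothetical sizes: every term is tested against each of 10 attributes.
full_vocab_tests = 5000 * 10   # assumed full-vocabulary size
selected_tests = 400 * 10      # assumed size after sparse selection

print(bonferroni_threshold(0.05, full_vocab_tests))
print(bonferroni_threshold(0.05, selected_tests))
print(significant([0.001, 0.02, 0.04]))
```

Fewer candidate terms means fewer comparisons, so each remaining test faces a less stringent threshold.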
67 Table 3 shows the terms identified by our model which have a significant correlation with at least one of the demographic indicators. [sent-187, score-0.661]
68 Standard English words tend to appear in areas with more English speakers; predictably, Spanish words tend to appear in areas with Spanish speakers and Hispanics. [sent-192, score-0.22]
69 , lmaoo) have a nearly uniform demographic profile, displaying negative correlations with whites and English speakers, and positive correlations with African Americans, Hispanics, renters, Spanish speakers, and areas classified as urban. [sent-196, score-0.832]
70 , dats) appear in areas with high proportions of renters, African Americans, and non-English speakers, though a subset (haha, hahaha, and yep) display the opposite demographic pattern. [sent-199, score-0.702]
71 5 Conjunctive Demographic Features Next, we demonstrate how to select conjunctions of demographic features that predict text. [sent-203, score-0.75]
72 Again, we apply multi-output regression, but now we reverse the direction of inference: the predictors are demographic features, and the outputs are term frequencies. [sent-204, score-0.751]
73 The sparsity-inducing ℓ1,∞ norm will select a subset of demographic features that explain the term frequencies. [sent-205, score-0.814]
74 We create an initial feature set f^(0)(X) by binning each demographic attribute, using five equal-frequency bins. [sent-206, score-0.631]
75 We then construct conjunctive features by applying a procedure inspired by related work in computational biology, called “Screen and Clean” (Wu et al. [sent-207, score-0.118]
76 On iteration i: • Solve the sparse multi-output regression problem Y = ? [sent-209, score-0.215]
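Equal-frequency binning can be sketched as a rank-based assignment: sort the values and cut the ranking into five groups of near-equal size, then encode each bin as a binary indicator feature. The helper names and sample values below are hypothetical, and tie handling (here, by original position) is an assumption of this sketch.

```python
def equal_frequency_bins(values, num_bins=5):
    """Assign each value a bin index in 0..num_bins-1 so bins hold near-equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Tied values are ordered by position, so they may fall in adjacent bins.
        bins[i] = rank * num_bins // len(values)
    return bins

def one_hot(bins, num_bins=5):
    """Turn bin indices into binary indicator features, one column per bin."""
    return [[1 if b == k else 0 for k in range(num_bins)] for b in bins]

# Hypothetical values of one attribute across ten communities.
values = [10, 50, 20, 40, 30, 90, 70, 60, 80, 100]
bins = equal_frequency_bins(values)
print(bins)
print(one_hot(bins)[0])
```

Binary bin indicators are a convenient starting point for conjunctions, since a conjunction of two indicators is simply their product.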
77 In addition to the binned versions of the demographic attributes described in Table 1, we include geographical information. [sent-216, score-0.897]
78 For efficiency, the outputs Y are not set to the raw term frequencies; instead we compute a truncated singular value decomposition of the term frequencies W ≈ UVD^T, and use the basis U. [sent-219, score-0.196]
79 1 Quantitative Evaluation The ability of the induced demographic features to predict text is evaluated using a traditional perplexity metric. [sent-222, score-0.778]
80 We construct a language model from the induced demographic features by training a multi-output ridge regression, which gives a matrix that maps from demographic features to term frequencies across the entire vocabulary. [sent-224, score-1.598]
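Word perplexity for a model that predicts term frequencies can be sketched as below. Regression outputs are not guaranteed to be valid probabilities, so this sketch assumes they are clipped at a small floor and renormalized; that step, the function names, and the toy numbers are assumptions for illustration, not details given in the text.

```python
import math

def to_distribution(scores, floor=1e-6):
    """Clip predicted frequencies at a small floor and renormalize to sum to 1."""
    clipped = [max(s, floor) for s in scores]
    total = sum(clipped)
    return [c / total for c in clipped]

def perplexity(probs, counts):
    """exp of the negative average log-probability per observed token."""
    num_tokens = sum(counts)
    log_lik = sum(c * math.log(p) for p, c in zip(probs, counts) if c > 0)
    return math.exp(-log_lik / num_tokens)

# Toy vocabulary of four words: predicted frequencies and test-document counts.
predicted = [1.0, 1.0, 1.0, 1.0]
test_counts = [3, 1, 0, 2]

probs = to_distribution(predicted)
print(perplexity(probs, test_counts))  # a uniform model over 4 words gives 4
```

Lower perplexity means the model assigns higher probability to the words that actually occur in held-out documents.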
81 Table 5: Word perplexity on test documents, using language models estimated from induced demographic features, raw demographic attributes, and a relative-frequency baseline. [sent-245, score-1.375]
82 The language models induced from demographic data yield small but statistically significant improvements over the baseline (Wilcoxon signed-rank test, p < . [sent-250, score-0.667]
83 Moreover, the model based on conjunctive features significantly outperforms the model constructed from raw attributes (p < . [sent-252, score-0.338]
84 The geographical area of F2 is completely contained by F1; the associated terms are thus very similar, but by having both features, the model can distinguish terms which are used in northeastern areas outside New York City, as well as terms which are especially likely in New York. [sent-264, score-0.289]
85 For example, F9 further refines the New York City area by focusing on communities that have relatively low numbers of Spanish speakers; F17 emphasizes New York neighborhoods that have very high numbers of African Americans and few speakers of languages other than English and Spanish. [sent-266, score-0.192]
86 The regression model can use these features in combination to make fine-grained distinctions about the differences between such neighborhoods. [sent-267, score-0.205]
87 Many of these features conjoined the proportion of African Americans with geographical features, identifying local linguistic styles used predominantly in either African American or white communities. [sent-273, score-0.168]
88 Conversely, F23 selects areas with very few African Americans and Spanish-speakers in the western part of the United States, and F36 selects for similar demographics in the area of Washington and Philadelphia. [sent-275, score-0.261]
89 Other features differentiate between African Americans and Hispanics: F8 identifies regions with many Spanish speakers and Hispanics, but few African Americans; F20 identifies regions with both Spanish speakers and whites, but few African Americans. [sent-278, score-0.268]
90 While race, geography, and language predominate, the socioeconomic attributes appear in far fewer features. [sent-280, score-0.222]
91 The most prevalent attribute is the proportion of renters, which appears in F4 and F7, and in three other features not shown here. [sent-281, score-0.136]
92 This attribute may be a better indicator of the urban/rural divide than the “% urban” attribute, which has a very low threshold for what counts as urban (see Table 1). [sent-282, score-0.171]
93 Overall, the selected features tend to include attributes that are easy to predict from text (compare with Table 2). [sent-284, score-0.254]
94 Logistic regression has been used to identify relationships between demographic features and linguistic variables since the 1970s (Cedergren and Sankoff, 1974). [sent-286, score-0.836]
95 More recent developments include the use of mixed factor models to account for idiosyncrasies of individual speakers (Johnson, 2009), as well as clustering and multidimensional scaling (Nerbonne, 2009) to enable aggregate inference across multiple linguistic variables. [sent-287, score-0.116]
96 However, all of these approaches assume that both the linguistic indicators and demographic attributes have already been identified by the researcher. [sent-288, score-0.833]
97 (2010) applies a similar generative model to demographic data. [sent-301, score-0.631]
98 The model presented here differs in two key ways: first, we use sparsity-inducing regularization to perform variable selection; second, we eschew high-dimensional mixture models in favor of a bottom-up approach of building conjunctions of demographic and geographic attributes. [sent-302, score-0.783]
99 In a mixture model, each component must define a distribution over all demographic variables, which may be difficult to estimate in a high-dimensional setting. [sent-303, score-0.663]
100 7 Conclusion This paper demonstrates how regression with structured sparsity can be applied to select words and conjunctive demographic features that reveal sociolinguistic associations. [sent-308, score-1.131]
wordName wordTfidf (topN-words)
[('demographic', 0.631), ('lasso', 0.269), ('african', 0.217), ('americans', 0.206), ('attributes', 0.174), ('regression', 0.161), ('demographics', 0.154), ('census', 0.112), ('renters', 0.111), ('urban', 0.111), ('ethnicity', 0.098), ('hispanics', 0.095), ('multioutput', 0.095), ('race', 0.095), ('geographical', 0.092), ('geographic', 0.084), ('predictors', 0.083), ('hispanic', 0.079), ('ridge', 0.079), ('speakers', 0.078), ('sparsity', 0.075), ('conjunctive', 0.074), ('areas', 0.071), ('income', 0.07), ('spanish', 0.067), ('eisenstein', 0.065), ('norm', 0.063), ('regularization', 0.062), ('matrix', 0.061), ('associations', 0.061), ('sociolinguistic', 0.06), ('attribute', 0.06), ('regularizer', 0.057), ('sparse', 0.054), ('communities', 0.05), ('correlations', 0.049), ('rows', 0.049), ('coefficients', 0.048), ('connor', 0.048), ('bonferroni', 0.048), ('refitting', 0.048), ('socioeconomic', 0.048), ('turlach', 0.048), ('structured', 0.047), ('raw', 0.046), ('features', 0.044), ('kathryn', 0.042), ('wasserman', 0.042), ('glossary', 0.042), ('vernacular', 0.042), ('truncated', 0.041), ('select', 0.039), ('truncation', 0.039), ('conjoin', 0.039), ('aggregate', 0.038), ('term', 0.037), ('nested', 0.036), ('predictor', 0.036), ('geography', 0.036), ('area', 0.036), ('predict', 0.036), ('variable', 0.036), ('induced', 0.036), ('composite', 0.036), ('frequencies', 0.035), ('identifies', 0.034), ('coefficient', 0.033), ('author', 0.033), ('west', 0.033), ('interpretable', 0.033), ('vocabulary', 0.032), ('mixture', 0.032), ('york', 0.032), ('compact', 0.032), ('blockwise', 0.032), ('bpt', 0.032), ('cedergren', 0.032), ('coast', 0.032), ('dats', 0.032), ('ptt', 0.032), ('quattoni', 0.032), ('rushton', 0.032), ('whites', 0.032), ('zcta', 0.032), ('zip', 0.032), ('proportion', 0.032), ('twitter', 0.031), ('perplexity', 0.031), ('brendan', 0.031), ('quantitative', 0.03), ('terms', 0.03), ('indicators', 0.028), ('proxy', 0.028), ('zickuhr', 0.028), ('compass', 0.028), ('duchi', 0.028), ('emphasizes', 0.028), ('geotagged', 0.028), ('residents', 0.028), ('xtx', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
2 0.067857496 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
Author: Sara Rosenthal ; Kathleen McKeown
Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.
3 0.067302577 285 acl-2011-Simple supervised document geolocation with geodesic grids
Author: Benjamin Wing ; Jason Baldridge
Abstract: We investigate automatic geolocation (i.e. identification of the location, expressed as latitude/longitude coordinates) of documents. Geolocation can be an effective means of summarizing large document collections and it is an important component of geographic information retrieval. We describe several simple supervised methods for document geolocation using only the document’s raw text as evidence. All of our methods predict locations in the context of geodesic grids of varying degrees of resolution. We evaluate the methods on geotagged Wikipedia articles and Twitter feeds. For Wikipedia, our best method obtains a median prediction error of just 11.8 kilometers. Twitter geolocation is more challenging: we obtain a median error of 479 km, an improvement on previous results for the dataset.
4 0.060759865 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
Author: Kevin Gimpel ; Nathan Schneider ; Brendan O'Connor ; Dipanjan Das ; Daniel Mills ; Jacob Eisenstein ; Michael Heilman ; Dani Yogatama ; Jeffrey Flanigan ; Noah A. Smith
Abstract: We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.
5 0.059433211 82 acl-2011-Content Models with Attitude
Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay
Abstract: We present a probabilistic topic model for jointly identifying properties and attributes of social media review snippets. Our model simultaneously learns a set of properties of a product and captures aggregate user sentiments towards these properties. This approach directly enables discovery of highly rated or inconsistent properties of a product. Our model admits an efficient variational meanfield inference algorithm which can be parallelized and run on large snippet collections. We evaluate our model on a large corpus of snippets from Yelp reviews to assess property and attribute prediction. We demonstrate that it outperforms applicable baselines by a considerable margin.
6 0.059098795 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
7 0.059038304 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
8 0.057852101 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
9 0.050121885 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
10 0.048546277 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
11 0.045212016 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
12 0.044994511 133 acl-2011-Extracting Social Power Relationships from Natural Language
13 0.04431827 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
14 0.043204244 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
15 0.042693432 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
16 0.042505164 117 acl-2011-Entity Set Expansion using Topic information
17 0.042457338 44 acl-2011-An exponential translation model for target language morphology
18 0.041828249 204 acl-2011-Learning Word Vectors for Sentiment Analysis
19 0.041809268 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
20 0.041555613 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
topicId topicWeight
[(0, 0.124), (1, 0.031), (2, -0.014), (3, 0.006), (4, -0.027), (5, -0.002), (6, 0.027), (7, -0.011), (8, -0.012), (9, 0.042), (10, -0.044), (11, -0.007), (12, 0.018), (13, 0.035), (14, -0.027), (15, -0.017), (16, -0.029), (17, -0.001), (18, -0.013), (19, -0.057), (20, 0.062), (21, -0.018), (22, -0.013), (23, 0.024), (24, -0.035), (25, -0.023), (26, -0.007), (27, -0.011), (28, -0.033), (29, -0.001), (30, 0.027), (31, 0.008), (32, 0.004), (33, 0.08), (34, 0.05), (35, 0.001), (36, -0.023), (37, 0.013), (38, -0.01), (39, 0.052), (40, 0.108), (41, -0.0), (42, 0.111), (43, -0.067), (44, 0.056), (45, 0.005), (46, 0.0), (47, 0.022), (48, 0.023), (49, -0.032)]
simIndex simValue paperId paperTitle
same-paper 1 0.89489847 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
Author: Sara Rosenthal ; Kathleen McKeown
Abstract: We investigate whether wording, stylistic choices, and online behavior can be used to predict the age category of blog authors. Our hypothesis is that significant changes in writing style distinguish pre-social media bloggers from post-social media bloggers. Through experimentation with a range of years, we found that the birth dates of students in college at the time when social media such as AIM, SMS text messaging, MySpace and Facebook first became popular, enable accurate age prediction. We also show that internet writing characteristics are important features for age prediction, but that lexical content is also needed to produce significantly more accurate results. Our best results allow for 81.57% accuracy.
3 0.63451213 133 acl-2011-Extracting Social Power Relationships from Natural Language
Author: Philip Bramsen ; Martha Escobar-Molano ; Ami Patel ; Rafael Alonso
Abstract: Sociolinguists have long argued that social context influences language use in all manner of ways, resulting in lects. This paper explores a text classification problem we will call lect modeling, an example of what has been termed computational sociolinguistics. In particular, we use machine learning techniques to identify social power relationships between members of a social network, based purely on the content of their interpersonal communication. We rely on statistical methods, as opposed to language-specific engineering, to extract features which represent vocabulary and grammar usage indicative of social power lect. We then apply support vector machines to model the social power lects representing superior-subordinate communication in the Enron email corpus. Our results validate the treatment of lect modeling as a text classification problem – albeit a hard one – and constitute a case for future research in computational sociolinguistics.
4 0.60494179 248 acl-2011-Predicting Clicks in a Vocabulary Learning System
Author: Aaron Michelony
Abstract: We consider the problem of predicting which words a student will click in a vocabulary learning system. Often a language learner will find value in the ability to look up the meaning of an unknown word while reading an electronic document by clicking the word. Highlighting words likely to be unknown to a reader is attractive because it draws his or her attention to them and indicates that information is available. However, this highlighting is usually done manually in vocabulary systems and online encyclopedias such as Wikipedia. Furthermore, it is never done on a per-user basis. This paper presents an automated way of highlighting words likely to be unknown to the specific user. We present related work in search engine ranking, a description of the study used to collect click data, and the experiment we performed using the random forest machine learning algorithm, and finish with a discussion of future work.
5 0.58818507 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
Author: Helen Yannakoudakis ; Ted Briscoe ; Ben Medlock
Abstract: We demonstrate how supervised discriminative machine learning techniques can be used to automate the assessment of ‘English as a Second or Other Language’ (ESOL) examination scripts. In particular, we use rank preference learning to explicitly model the grade relationships between scripts. A number of different features are extracted and ablation tests are used to investigate their contribution to overall performance. A comparison between regression and rank preference models further supports our method. Experimental results on the first publicly available dataset show that our system can achieve levels of performance close to the upper bound for the task, as defined by the agreement between human examiners on the same corpus. Finally, using a set of ‘outlier’ texts, we test the validity of our model and identify cases where the model’s scores diverge from those of a human examiner.
6 0.58062428 55 acl-2011-Automatically Predicting Peer-Review Helpfulness
7 0.57981122 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
8 0.56925488 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
9 0.56849778 341 acl-2011-Word Maturity: Computational Modeling of Word Knowledge
10 0.56263465 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
11 0.54882741 194 acl-2011-Language Use: What can it tell us?
12 0.54207355 314 acl-2011-Typed Graph Models for Learning Latent Attributes from Names
13 0.53316402 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
14 0.53159022 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
15 0.53138846 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
16 0.53123254 223 acl-2011-Modeling Wisdom of Crowds Using Latent Mixture of Discriminative Experts
17 0.52955794 291 acl-2011-SystemT: A Declarative Information Extraction System
18 0.52752572 288 acl-2011-Subjective Natural Language Problems: Motivations, Applications, Characterizations, and Implications
19 0.51443815 74 acl-2011-Combining Indicators of Allophony
20 0.51164341 121 acl-2011-Event Discovery in Social Media Feeds
topicId topicWeight
[(5, 0.031), (17, 0.04), (26, 0.018), (37, 0.065), (39, 0.428), (41, 0.064), (55, 0.022), (59, 0.036), (72, 0.023), (91, 0.038), (96, 0.117), (97, 0.013)]
simIndex simValue paperId paperTitle
1 0.94778097 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser
Author: Yoav Goldberg ; Michael Elhadad
Abstract: We experiment with extending a lattice parsing methodology for parsing Hebrew (Goldberg and Tsarfaty, 2008; Goldberg et al., 2009) to make use of a stronger syntactic model: the PCFG-LA Berkeley Parser. We show that the methodology is very effective: using a small training set of about 5500 trees, we construct a parser which parses and segments unsegmented Hebrew text with an F-score of almost 80%, an error reduction of over 20% over the best previous result for this task. This result indicates that lattice parsing with the Berkeley parser is an effective methodology for parsing over uncertain inputs.
2 0.93955004 1 acl-2011-(11-06-spirl)
Author: (hal)
Abstract: unknown-abstract
3 0.92530382 52 acl-2011-Automatic Labelling of Topic Models
Author: Jey Han Lau ; Karl Grieser ; David Newman ; Timothy Baldwin
Abstract: We propose a method for automatically labelling topics learned via LDA topic models. We generate our label candidate set from the top-ranking topic terms, titles of Wikipedia articles containing the top-ranking topic terms, and sub-phrases extracted from the Wikipedia article titles. We rank the label candidates using a combination of association measures and lexical features, optionally fed into a supervised ranking model. Our method is shown to perform strongly over four independent sets of topics, significantly better than a benchmark method.
4 0.92181551 59 acl-2011-Better Automatic Treebank Conversion Using A Feature-Based Approach
Author: Muhua Zhu ; Jingbo Zhu ; Minghan Hu
Abstract: For the task of automatic treebank conversion, this paper presents a feature-based approach which encodes bracketing structures in a treebank into features to guide the conversion of this treebank to a different standard. Experiments on two Chinese treebanks show that our approach improves conversion accuracy by 1.31% over a strong baseline.
same-paper 5 0.87613988 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing
Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors’ geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
6 0.82374185 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging
7 0.80567938 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation
8 0.77300519 192 acl-2011-Language-Independent Parsing with Empty Elements
9 0.68656611 182 acl-2011-Joint Annotation of Search Queries
10 0.63408667 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
11 0.61762232 316 acl-2011-Unary Constraints for Efficient Context-Free Parsing
12 0.61756957 282 acl-2011-Shift-Reduce CCG Parsing
13 0.61646736 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
14 0.61471748 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
15 0.61248577 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
16 0.61156309 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
17 0.60357636 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
18 0.60176408 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
19 0.59109229 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
20 0.59014428 238 acl-2011-P11-2093 k2opt.pdf