emnlp emnlp2012 emnlp2012-120 knowledge-graph by maker-knowledge-mining

120 emnlp-2012-Streaming Analysis of Discourse Participants


Source: pdf

Author: Benjamin Van Durme

Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. [sent-2, score-0.437]

2 We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. [sent-3, score-0.283]

3 Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification. [sent-4, score-0.750]

4 1 Introduction The rapid growth in social media has led to an equally rapid growth in the desire to mine it for useful information: the content of public discussions, such as found in tweets, or in posts to online forums, can support a variety of data mining tasks. [sent-5, score-0.169]

5 Inferring the underlying properties of those that engage with these platforms, the discourse participants, has become an active topic of research: predicting individual attributes such as age, gender, and political preferences (Rao et al., 2010). [sent-6, score-0.112]

6 Classification with streaming data has usually been taken in the computational linguistics community to mean individual decisions made on items that are presented over time. [sent-16, score-0.309]

7 For example: assigning a label to each newly posted product review as to whether it contains positive or negative sentiment, or whether the latest tweet signals a novel topic that should be tagged for tracking (Petrovic et al., 2010). [sent-17, score-0.044]

8 Here we consider a distinct form of stream-based classification: we wish to assign, then dynamically update, labels on discourse participants based on their associated streaming communications. [sent-19, score-0.397]

9 For instance, rather than classifying individual reviews as to their sentiment polarity, we might wish to classify the underlying author as to whether they are genuine or paid-advertising, and then update that decision as they continue to post new reviews. [sent-20, score-0.123]

10 As the scale of social media continues to grow, we desire that our model be aggressively space efficient, which precludes a naive solution of storing the full communication history for all users. [sent-21, score-0.179]

11 [...] averages, allowing for significant space savings in our classification model. [sent-24, score-0.108]

12 Our running example task is gender prediction, based on spoken communication and microblogs/Twitter feeds. [sent-25, score-0.224]

13 2 Model Assume that each discourse participant (e.g., speaker, author) a has an associated stream of communications (e.g., utterances, tweets). [sent-26, score-0.049] [sent-28, score-0.422]

15 We initially take the model to be linear: author labels are determined by computing the sign of the dot product between a weight vector w and a feature vector f(C), each of dimension d. [sent-39, score-0.074]

16 Note that f(C) is a feature vector over the entire set of communications from a given author. [sent-40, score-0.08]

17 For example, Φ might be trained to classify author gender: Gender(a) = Male if w · f(C) > 0, and Female otherwise. [sent-41, score-0.074]

18 We make explicit how, under certain common restrictions on the feature space, the classification decision can be decomposed into a series of decision updates over the elements of C. [sent-44, score-0.223]

19 Define fˆ(cj) to be the vector containing the local, count-based feature values of communication cj. [sent-45, score-0.061]

20 The running state can be written as the pair $\left(\sum_{i=1}^{t}\sum_{k=1}^{d} w_k \hat{f}_k(c_i),\; z_t\right)$, which pairs the observed rolling sum, $s_t$, with the feature stream length $z_t$. [sent-51, score-0.393]

21 The classifier decision after seeing everything up to and including communication $c_t$ is thus a simple average: $\Phi_t(a) = \mathrm{sign}(s_t / z_t)$. [sent-52, score-0.159]

22 Finally we reach the observation that: $s_t = s_{t-1} + w \cdot \hat{f}(c_t)$ and $z_t = z_{t-1} + |\hat{f}(c_t)|_1$, which means that from an engineering standpoint we can process a stream of communication one element at a time, without the need to preserve the history explicitly. [sent-54, score-0.530]

23 That is: for each author, for each attribute being analyzed, an online system need only maintain a state pair (st, zt), extracting and weighting features locally for each new communication. [sent-55, score-0.090]
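
A minimal Python sketch of the update just described; the feature extractor, the example weights, and the sign-of-average decision rule are illustrative assumptions, not code from the paper.

```python
from collections import Counter

def extract_features(communication):
    """Local, count-based features of one communication (unigram counts here)."""
    return Counter(communication.lower().split())

class StreamingAuthorState:
    """Per-author state pair (s_t, z_t); no communication history is stored."""

    def __init__(self, weights):
        self.w = weights  # dict mapping feature name -> learned weight w_k
        self.s = 0.0      # rolling weighted sum s_t
        self.z = 0        # feature stream length z_t

    def update(self, communication):
        f = extract_features(communication)
        self.s += sum(self.w.get(k, 0.0) * v for k, v in f.items())  # w . f(c_t)
        self.z += sum(f.values())                                    # |f(c_t)|_1

    def decision(self):
        # Sign of the running average s_t / z_t (Male if positive, per Sec. 2).
        return "Male" if self.z and self.s / self.z > 0 else "Female"

state = StreamingAuthorState({"wife": 1.2, "husband": -1.5})
state.update("my wife and i")
print(state.decision())  # updates and decisions interleave, one c_t at a time
```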

24 Figure 1: A streaming analytic model should update its decision with each new communication, becoming more stable in its prediction as evidence is acquired. [sent-65, score-0.437]

25 2.1 Validation As an example of a model decomposed into a stream, we revisit the task of gender classification based on speech transcripts, as explored by Boulis and Ostendorf (2005) and later Garera and Yarowsky (2009). [sent-68, score-0.243]

26 In the original problem definition, one would collect all transcribed utterances from a given speaker in a corpus such as Fisher (Cieri et al., 2004). [sent-69, score-0.122]

27 Then by collapsing these utterances into a single document, one could classify it as to whether it was generated by a male or female. [sent-72, score-0.118]

28 Here we define the task as: starting from scratch, report the classifier probability of the speaker being male, as each utterance is presented. [sent-73, score-0.1]

29 Intuitively we would expect that the more utterances are observed, the better our classification accuracy becomes. [sent-74, score-0.098]

30 Burger et al. (2011) have considered this point, but by comparing the classification accuracy based on the volume of batch data available per author (in that case, tweets): the more prolific the author had been, the better able they were to correctly classify their gender. [sent-76, score-0.187]

31 We confirm here that this can be reframed: as a speaker (author) continues to emit a stream of communication, a dynamic model tends to improve its online prediction. [sent-77, score-0.456]

32 Similar to Boulis and Ostendorf, we extracted unigram and bigram counts as features, but without further [...]. (Footnote 3: note that some non-linear kernels can be maintained online in a similar fashion.) [sent-79, score-0.090]

33 Figure 2: Accuracy on Switchboard gender classification, reported at every fifth utterance, using a dynamic log-linear model with 10-fold cross-validation. [sent-81, score-0.163]

34 Similar to previous work, we found intuitive features such as my husband to be weighted heavily (see Table 1), along with certain non-lexical vocalizations such as transcribed laughter. [sent-87, score-0.088]

35 Table 1. Male: a, wife, is, my wife, right, of, the, uh, actually, [vocalized-noise]. Female: have, and, [laughter], my husband, really, husband, children, are, would. Figure 3: Streaming analysis of eight randomly sampled speakers, four per gender (red-solid: female, blue-dashed: male). [sent-89, score-0.163]

36 Figure 3 highlights the streaming perspective: individual speakers can be viewed as distinct trajectories through [0, 1], based on features of their utterances. [sent-93, score-0.358]

37 3 Randomized Model Now situated within a streaming context, we extract space savings through approximation, extending the approach of Van Durme and Lall (2011), there concerned with online Locality Sensitive Hashing, here initially concerned with taking averages. [sent-94, score-0.429]

38 This will allow us in the subsequent section to map our analytic model to a form [...]. [sent-100, score-0.079]

39 Figure 4: Social media platforms such as Facebook or Twitter deal with a very large number of individuals, each with a variety of implicit attributes (such as gender). [sent-111, score-0.253]

40 3.1 Reservoir Counting Reservoir Counting plays on the folklore algorithm of reservoir sampling, first described in the literature by Vitter (1985). [sent-115, score-0.491]

41 As applied to a stream of arbitrary elements, reservoir sampling maintains a list (reservoir) of length k, where the contents of the reservoir represent a uniform random sample over all elements observed so far. [sent-116, score-1.414]

42 When the stream is a sequence of positive and negative integers, reservoir counting implicitly views each value as being unrolled into a sequence made up of either 1 or -1. [sent-120, score-1.077]

43 For instance, the sequence (3, -2, 1) would be viewed as: (1, 1, 1, -1, -1, 1). Since there are only two distinct values in this stream, the contents of the reservoir can be characterized by knowing the fixed value k, and then s: how many elements in the reservoir are 1. [sent-121, score-1.136]

44 This led to Van Durme and Lall defining a method, ReservoirUpdate, here abbreviated to ResUp, that allows for maintaining an approximate sum, defined as $t\left(\frac{2s}{k} - 1\right)$, through updating these two parameters t and s with each newly observed element. [sent-122, score-0.055]

45 Expected accuracy of the approximation varies with the size of the sample, k. [sent-123, score-0.071]
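
A sketch of implicit reservoir counting over a ±1 stream, consistent with the description above; the class name and the eviction step (the evicted slot is a 1 with probability s/k) are my reading of ResUp, not its published pseudocode.

```python
import random

class ReservoirCounter:
    """Implicit reservoir over a stream of +1/-1 values: only s (how many of
    the k implicit slots hold +1) and t (the stream length) are stored."""

    def __init__(self, k=255):  # k = 2**8 - 1, so s fits in one unsigned byte
        self.k, self.s, self.t = k, 0, 0

    def update(self, e):  # e is +1 or -1
        self.t += 1
        if self.t <= self.k:  # reservoir still filling: keep everything
            self.s += (e == 1)
        elif random.random() < self.k / self.t:  # admit with probability k/t
            evicted_was_one = random.random() < self.s / self.k
            self.s += (e == 1) - evicted_was_one

    def approx_sum(self):
        # t * (2s/k - 1); use the filled size while the reservoir is partial
        filled = min(self.t, self.k)
        return self.t * (2 * self.s / filled - 1) if filled else 0

# The (3, -2, 1) example above unrolls to (1, 1, 1, -1, -1, 1); true sum is 2.
rc = ReservoirCounter(k=7)
for e in (1, 1, 1, -1, -1, 1):
    rc.update(e)
print(rc.approx_sum())  # exact (2) here, since the reservoir never overflowed
```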

46 Reservoir Counting exploits the fact that the reservoir need only be considered implicitly, where s, represented as a b-bit unsigned integer, can be used to characterize a reservoir of size $k = 2^b - 1$. [sent-124, score-0.982]

47 This was used to compress online Locality Sensitive Hashing, at similar levels of accuracy, by replacing explicit 32-bit counting variables with approximate counters of smaller size. [sent-127, score-0.261]

48 Modifying the earlier implicit construction, consider the sequence (3, -2, 1), with m∗ = 3, mapped to the sequence: (1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, -1, -1), where each value x = mσ is replaced with m∗ + m elements of σ, and m∗ - m elements of -σ (here σ = 1). [sent-132, score-0.280]

49 This views x as a sequence of length 2m∗, made up of 1s and -1s, where each x in the discrete range [-m∗, m∗] has a unique number of 1s. [sent-133, score-0.089]

50 With the implicit stream so defined, the average times $\frac{1}{m^*}$ is equal to: $\frac{1}{m^*}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{2nm^*}\sum_{i=1}^{n}\left(\sum_{l=1}^{m^*+m_i}\sigma + \sum_{l=1}^{m^*-m_i}(-\sigma)\right) = \frac{\sum_{i=1}^{n} m_i \sigma}{n m^*}$, where $2nm^*$ is the total number of 1s and -1s observed in the implicit stream, up to and including the mapping of element $x_n$. [sent-137, score-0.110]
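
As a quick check against the (3, -2, 1) example (with σ = 1 and m∗ = 3): the implicit stream has 2nm∗ = 18 elements whose sum is 2(3 - 2 + 1) = 4, so its average is 4/18 = 2/9, which is exactly the true average 2/3 scaled by 1/m∗ = 1/3.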

51 If applying Reservoir Counting, s would then record the sampled number of 1s, as per norm, where t, maintained as the implicit stream length, can also be viewed as storing t = 2nm∗. [sent-138, score-0.591]

52 We extend the definition to work with a stream of fixed-precision floating point variables. [sent-140, score-0.342]

53 Modify the mapping of value x from a sequence of length 2m∗ to a sequence of length g, comprised of $\frac{g}{2}\left(1 + \frac{x}{m^*}\right)$ instances of σ, and $\frac{g}{2}\left(1 - \frac{x}{m^*}\right)$ instances of -σ. [sent-142, score-0.178]

54 For example, with g = 16 a value might call for a number in (7, 8) of instances of 1, followed by however many instances of -1 lead to a sequence of length g, after probabilistic rounding. [sent-146, score-0.089]
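
A sketch of this mapping, assuming the g(m∗ + x)/(2m∗) count reconstructed above; the function name is hypothetical. With x = -0.1, m∗ = 1 and g = 16 it yields an ideal count of 7.2, i.e. 7 or 8 instances of 1 after rounding, consistent with the "(7, 8)" example in the text.

```python
import math
import random

def unroll(x, m_star, g):
    """Map x in [-m_star, m_star] to an implicit length-g sequence of
    +sigma/-sigma elements; returns the rounded count of each."""
    p = g * (m_star + x) / (2 * m_star)  # ideal fractional count of +sigma
    num_pos = math.floor(p) + (random.random() < p - math.floor(p))
    return num_pos, g - num_pos          # the remainder are -sigma elements

print(unroll(-0.1, 1.0, 16))  # (7, 9) or (8, 8), after probabilistic rounding
```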

55 The method enables counting in log-scale by probabilistically incrementing a counter, where it becomes less and less likely to update the counter after each increment. [sent-151, score-0.320]
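
A minimal sketch of such a Morris-style counter of base b; the unbiased estimate (b^c - 1)/(b - 1) is the standard form for this scheme and is my addition, not quoted from the paper.

```python
import random

class MorrisCounter:
    """Morris-style approximate counter: only the small exponent c is kept
    (8 bits suffice for the experiments described here)."""

    def __init__(self, base=2.0):
        self.base, self.c = base, 0

    def increment(self):
        # Bump the exponent with probability base**(-c): ever less likely.
        if random.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        return (self.base ** self.c - 1) / (self.base - 1)

m = MorrisCounter()
for _ in range(10000):
    m.increment()
print(m.estimate())  # roughly 10000, from a single small integer of state
```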

56 This scheme is popularly known and used in a variety of contexts, recently in the community by Talbot (2009) and Van Durme and Lall (2009). Figure 5: Results on averaging randomly generated sequences, with m∗ = 100, g = 100, and using an 8-bit Morris-style counter of base 2. [sent-152, score-0.104]

57 Larger reservoir sizes lead to better approximation, at higher cost in bits. [sent-153, score-0.491]

58 [...] to provide a streaming extension to the Bloom-filter based count-storage mechanism of Talbot and Osborne (2007a) and Talbot and Osborne (2007b). [sent-154, score-0.309]

59 3.3 Experiment We show through experimentation on synthetic data that this approach gives reasonable levels of accuracy at space-efficient sizes of the length and sum parameters. [sent-157, score-0.088]

60 Figure 5 shows results for varying reservoir sizes (using 4, 8 or 12 bits) when g = 100, m∗ = 100, and the length parameter was represented with an 8 bit Morris-style counter of base 2. [sent-159, score-0.646]

61 As Reservoir Counting already allows for keeping an online sum, and pairs it with a length parameter, this would presumably be what is needed to get the average we are focused on. [sent-162, score-0.102]

62 Unfortunately that is not the case: the parameter recording the current stream length, here called t, tracks the length of the implicit stream of 1s and -1s; it does not track the length of the original stream of values that gave rise to the mapped version. [sent-163, score-0.530]

63 Both have the same sum, and would therefore be viewed the same under the pre-existing Reservoir Counting algorithm, giving rise to implicit streams of the same length. [sent-165, score-0.228]

64 But critically the sequences have different averages, which we cannot detect based on the original counting algorithm. [sent-166, score-0.168]

65 4 Application to Classification Going back to our streaming analysis model, we have a situation that can be viewed as a sequence of values, such that we do know m∗. [sent-168, score-0.396]

66 First reinterpret the fraction $\frac{s_t}{z_t}$ equivalently as the normalized sum of a stream of elements sampled from w: $\frac{s_t}{z_t} = \frac{1}{z_t}\sum_{i=1}^{t}\sum_{j=1}^{d}\sum_{l=1}^{\hat{f}_j(c_i)} w_j$. The value $m^*$ is then: $m^* = \max_j |w_j|$, over a sequence of length $z_t$. [sent-169, score-0.575]

67 Rather than updating st and zt through basic addition, we can now use a smaller bit-wise representation for each variable, and update via Reservoir Averaging. [sent-170, score-0.176]
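
Putting the pieces together, a self-contained sketch of one plausible Reservoir Averaging routine; the defaults mirror the experiment settings reported below (g = 100, k = 255), and the composition of the unrolling and reservoir steps is my own, not the paper's code.

```python
import math
import random

def approx_average(values, m_star, g=100, k=255):
    """Approximate the mean of `values` (all within [-m_star, m_star]) by
    unrolling each into g implicit +/-1 elements fed through a reservoir."""
    s, t = 0, 0  # ones in the implicit reservoir; implicit stream length
    for x in values:
        p = g * (m_star + x) / (2 * m_star)      # fractional count of +1s
        pos = math.floor(p) + (random.random() < p - math.floor(p))
        for e in [1] * pos + [-1] * (g - pos):
            t += 1
            if t <= k:
                s += (e == 1)
            elif random.random() < k / t:
                s += (e == 1) - (random.random() < s / k)
    return m_star * (2 * s / min(t, k) - 1)      # rescaled implicit-stream mean

# e.g. averaging the weights observed over one author's feature stream
print(approx_average([0.3, -0.1, 0.25, 0.05], m_star=0.5))  # true mean: 0.125
```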

68 4.1 Problems in Practice Reconsidering the earlier classification experiment, we found this approximation method led to terrible results: while our experiments on synthetic data worked well, those sequences were sampled somewhat uniformly over the range of possible values. [sent-172, score-0.110]

69 In brief: the more the maximum possible update, m∗, can be viewed as an outlier, the more the resulting implicit encoding of g elements per observed weight becomes dominated by “filler”. [sent-174, score-0.225]

70 As few observed elements will in that case require the provided range, the implicit representation will be a mostly balanced set of 1s and -1s. Figure 6: Frequency of individual feature weights ("Weight Values") observed over a full set of communications by a single example speaker. [sent-175, score-0.256]

71 These mostly balanced encodings make it difficult to maintain an adequate approximation of the true average, when reliant on a small, implicit uniform sample. [sent-183, score-0.22]

72 That is, we modify the contents of the implicit reservoir by rewriting history: pretending that earlier elements were larger than they were, but still within the reduced window. [sent-188, score-0.741]

73 As long as we don’t see too many values that are overly large, there will be room to accommodate the overflow without any theoretical damage to the implicit stream: all count mass may still be accounted for. [sent-189, score-0.231]

74 If a moderately high number of overly large elements are observed, then we expect in practice for this to have a negligible impact on downstream performance. [sent-190, score-0.105]

75 If an exceptional number of elements are overly large, then the training data was not representative of the test set. [sent-191, score-0.105]

76 This happens by first estimating the number of 1 values seen thus far in the stream, $\frac{s}{k} n$, then adding in twice the overflow value, which represents removing o instances of -σ from the stream and then adding o instances of σ. [sent-194, score-0.082]

77 The rescaled reservoir count is set by probabilistic rounding: $\lceil pk \rceil$ with probability $pk - \lfloor pk \rfloor$, and $\lfloor pk \rfloor$ otherwise. Figure 7 compares the results seen in Figure 2 to a version of the experiment when using approximation. [sent-201, score-0.082]
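
A sketch of the RewriteHistory step as just described; the function name and argument conventions are assumptions, and the body simply follows sentences 76-77 above.

```python
import math
import random

def rewrite_history(s, k, n, o):
    """Fold overflow o back into the implicit reservoir of size k: pretend o
    past -sigma elements were +sigma (adding 2*o ones), then rescale s."""
    ones = (s / k) * n + 2 * o  # estimated +1s seen so far, plus the overflow
    pk = (ones / n) * k         # ideal fractional reservoir count of +1s
    s_new = math.floor(pk) + (random.random() < pk - math.floor(pk))
    return min(s_new, k)        # probabilistic rounding, clamped to size k
```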

78 Parameters were: g = 100; k = 255; and a Morris-style counter for stream length using 8 bits. Figure 7: Comparison between using explicit counting and approximation on the Switchboard dataset, with bands reflecting 95% confidence. [sent-202, score-0.820]

79 This result shows our ability to replace 2 variables of 32 bits (sum and length) with 2 approximation variables of 8 bits (reservoir status s, and stream length n), leading to a 75% reduction in the cost of maintaining online classifier state, with no significant cost in accuracy. [sent-206, score-0.646]
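
As a check on the arithmetic: the exact state costs 2 × 32 = 64 bits, the approximate state 2 × 8 = 16 bits, and 1 - 16/64 = 0.75, i.e. the claimed 75% reduction.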

80 5.1 Setup Based on the tweet IDs from the data used by Burger et al. (2011) [...]. [sent-208, score-0.044]

81 5 These tweets were then matched against the gender labels established in that prior work. [sent-210, score-0.246]

82 (Footnote 5) Standard practice in Twitter data exchanges is to share only the unique tweet identifications and then requery the content from Twitter, thus allowing, e.g., [...]. [sent-216, score-0.044]

83 While respectful of author privacy, it does pose a challenge for scientific reproducibility. [sent-219, score-0.074]

84 Figure 8: Summed 0/1 loss over all utterances by each speaker in the Switchboard training set, across 10 splits. [sent-220, score-0.122]

85 As seen in Figure 9, results were as in Switchboard: accuracy improves as more data streams in per author, and our approximate model sacrifices perhaps a point of accuracy in return for a 75% reduction in memory requirements per author. [sent-228, score-0.166]

86 (2011), minus those tweets we were unable to retrieve (as previously discussed). [sent-235, score-0.083]

87 Figure 9: Comparison between using explicit counting and approximation, on the Twitter dataset, with bands reflecting 95% confidence. [sent-236, score-0.247]

88 Foreign terms are recognized by their parenthetical translation and 1st- or 2nd-person + Male/Female gender marking. [sent-238, score-0.163]

89 Table 2: Top thirty-five features by gender in Twitter. [sent-241, score-0.163]

90 Male: [...], fuckin, omg omg, cheers, ai n't. Female: obrigada (thank you [1F]), hubby, husband, cute, my husband, ? [sent-243, score-0.041]

91 Within computational linguistics, interest in streaming approaches is a more recent development; we provide here examples of representative work, beyond those described in previous sections. [sent-245, score-0.309]

92 Levenberg and Osborne (2009) gave a streaming variant of the earlier perfect hashing language model of Talbot and Brants (2008), which operated in batch-mode. [sent-246, score-0.361]

93 For further background in predicting author attributes such as gender, see (Garera and Yarowsky, 2009) for an overview of previous work and (nonstreaming) methodology. [sent-251, score-0.137]

94 7 Conclusions and Future Work We have taken the predominately batch-oriented process of analyzing communication data and shown it to be fertile territory for research in large-scale streaming algorithms. [sent-252, score-0.370]

95 Using the example task of automatic gender detection, on both spoken transcripts and microblogs, we showed that classification can be thought of as a continuously running process, becoming more robust as further communications become available. [sent-253, score-0.282]

96 Once positioned within a streaming framework, we presented a novel approximation technique for compressing the streaming memory requirements of the classifier (per author) by 75%. [sent-254, score-0.726]

97 For instance, while here we assumed a static, pre-built classifier which was then applied to streaming data, future work may consider the interplay with online learning, based on methods such as those of Crammer et al. (2006). [sent-256, score-0.397]

98 In the applications arena, one might take the savings provided here to run multiple models in parallel, either for more robust predictions (perhaps “triangulating” on language ID and/or domain over the stream), or predicting additional properties, such as age, nationality, political orientation, and so forth. [sent-259, score-0.069]

99 Finally, we assumed here strictly count-based features; streaming log-counting methods, tailored Bloom-filters for binary feature storage, and other related topics are assuredly applicable, and should give rise to many interesting new results. [sent-260, score-0.309]

100 From tweets to polls: Linking text sentiment to public opinion time series. [sent-348, score-0.083]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('reservoir', 0.491), ('stream', 0.342), ('streaming', 0.309), ('counting', 0.168), ('gender', 0.163), ('durme', 0.156), ('switchboard', 0.14), ('zt', 0.127), ('lall', 0.111), ('implicit', 0.11), ('husband', 0.088), ('tweets', 0.083), ('bpkc', 0.082), ('overflow', 0.082), ('communications', 0.08), ('analytic', 0.079), ('talbot', 0.079), ('van', 0.074), ('author', 0.074), ('approximation', 0.071), ('locality', 0.071), ('savings', 0.069), ('streams', 0.069), ('elements', 0.066), ('randomized', 0.065), ('wife', 0.065), ('burger', 0.065), ('speaker', 0.063), ('attributes', 0.063), ('counter', 0.062), ('boulis', 0.061), ('mua', 0.061), ('rewritehistory', 0.061), ('updateaverage', 0.061), ('ct', 0.061), ('communication', 0.061), ('xn', 0.059), ('utterances', 0.059), ('male', 0.059), ('hash', 0.055), ('approximate', 0.055), ('ci', 0.052), ('hashing', 0.052), ('length', 0.051), ('online', 0.051), ('viewed', 0.049), ('twitter', 0.049), ('discourse', 0.049), ('update', 0.049), ('sup', 0.048), ('ashwin', 0.048), ('bits', 0.047), ('osborne', 0.045), ('tweet', 0.044), ('benjamin', 0.043), ('sensitive', 0.043), ('return', 0.042), ('bit', 0.042), ('yarowsky', 0.042), ('garera', 0.041), ('desire', 0.041), ('accommodated', 0.041), ('bands', 0.041), ('bvc', 0.041), ('cute', 0.041), ('dpke', 0.041), ('godfrey', 0.041), ('incrementing', 0.041), ('jerboa', 0.041), ('jindal', 0.041), ('omg', 0.041), ('ott', 0.041), ('petrovic', 0.041), ('platforms', 0.041), ('sztt', 0.041), ('tired', 0.041), ('username', 0.041), ('decomposed', 0.041), ('miles', 0.041), ('participants', 0.039), ('contents', 0.039), ('media', 0.039), ('maintain', 0.039), ('updates', 0.039), ('maintained', 0.039), ('overly', 0.039), ('classification', 0.039), ('sequence', 0.038), ('social', 0.038), ('explicit', 0.038), ('sum', 0.037), ('pin', 0.037), ('xt', 0.037), ('classifier', 0.037), ('johns', 0.035), ('rajeev', 0.035), ('diehl', 0.035), ('pretending', 0.035), ('ioft', 0.035), ('magnitude', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 120 emnlp-2012-Streaming Analysis of Discourse Participants

Author: Benjamin Van Durme

Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification.

2 0.16947301 134 emnlp-2012-User Demographics and Language in an Implicit Social Network

Author: Katja Filippova

Abstract: We consider the task of predicting the gender of the YouTube1 users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with the accuracy above 90%. We also show that the gender can be predicted from language alone (89%). A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. We also investigate how the two views (linguistic and social) can be combined and analyse how prediction accuracy changes over different age groups.

3 0.13387819 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

Author: Amit Goyal ; Hal Daume III ; Raul Guerra

Abstract: Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorithm with novel fast approximate nearest neighbor search algorithms (variants of PLEB). These algorithms return the approximate nearest neighbors quickly. We show our system’s efficiency in both intrinsic and extrinsic experiments. We further evaluate our fast search algorithms both quantitatively and qualitatively on two NLP applications.

4 0.12718569 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities

Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li

Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model smoothness of one state sequence, we propose a novel function which can capture the state context. The experiments on a large Twitter dataset shows our methods are very effective.

5 0.11363214 117 emnlp-2012-Sketch Algorithms for Estimating Point Queries in NLP

Author: Amit Goyal ; Hal Daume III ; Graham Cormode

Abstract: Many NLP tasks rely on accurate statistics from large corpora. Tracking complete statistics is memory intensive, so recent work has proposed using compact approximate “sketches” of frequency distributions. We describe 10 sketch methods, including existing and novel variants. We compare and study the errors (over-estimation and underestimation) made by the sketches. We evaluate several sketches on three important NLP problems. Our experiments show that one sketch performs best for all the three tasks.

6 0.070707344 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

7 0.061036084 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

8 0.051925458 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

9 0.049681999 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

10 0.047831953 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

11 0.045667935 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

12 0.043707862 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

13 0.042834368 102 emnlp-2012-Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems

14 0.041789845 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage

15 0.041533258 126 emnlp-2012-Training Factored PCFGs with Expectation Propagation

16 0.040879786 101 emnlp-2012-Opinion Target Extraction Using Word-Based Translation Model

17 0.040866036 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

18 0.040363759 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution

19 0.040219184 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis

20 0.039670691 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.169), (1, 0.044), (2, 0.017), (3, 0.101), (4, -0.013), (5, -0.004), (6, -0.005), (7, -0.037), (8, 0.173), (9, 0.121), (10, 0.035), (11, -0.318), (12, -0.153), (13, -0.036), (14, -0.092), (15, 0.199), (16, -0.03), (17, -0.111), (18, -0.106), (19, -0.052), (20, -0.145), (21, -0.033), (22, 0.054), (23, -0.04), (24, 0.023), (25, 0.118), (26, 0.004), (27, 0.182), (28, 0.039), (29, 0.135), (30, -0.055), (31, 0.124), (32, 0.084), (33, 0.047), (34, 0.074), (35, 0.05), (36, -0.149), (37, 0.014), (38, 0.007), (39, -0.052), (40, 0.038), (41, 0.079), (42, 0.059), (43, -0.041), (44, 0.085), (45, -0.075), (46, 0.05), (47, -0.112), (48, 0.023), (49, -0.112)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95983052 120 emnlp-2012-Streaming Analysis of Discourse Participants

Author: Benjamin Van Durme

Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification.

2 0.74595618 134 emnlp-2012-User Demographics and Language in an Implicit Social Network

Author: Katja Filippova

Abstract: We consider the task of predicting the gender of the YouTube1 users and contrast two information sources: the comments they leave and the social environment induced from the affiliation graph of users and videos. We propagate gender information through the videos and show that a user’s gender can be predicted from her social environment with the accuracy above 90%. We also show that the gender can be predicted from language alone (89%). A surprising result of our study is that the latter predictions correlate more strongly with the gender predominant in the user’s environment than with the sex of the person as reported in the profile. We also investigate how the two views (linguistic and social) can be combined and analyse how prediction accuracy changes over different age groups.

3 0.64839387 63 emnlp-2012-Identifying Event-related Bursts via Social Media Activities

Author: Xin Zhao ; Baihan Shu ; Jing Jiang ; Yang Song ; Hongfei Yan ; Xiaoming Li

Abstract: Activities on social media increase at a dramatic rate. When an external event happens, there is a surge in the degree of activities related to the event. These activities may be temporally correlated with one another, but they may also capture different aspects of an event and therefore exhibit different bursty patterns. In this paper, we propose to identify event-related bursts via social media activities. We study how to correlate multiple types of activities to derive a global bursty pattern. To model smoothness of one state sequence, we propose a novel function which can capture the state context. The experiments on a large Twitter dataset shows our methods are very effective.

4 0.41547632 32 emnlp-2012-Detecting Subgroups in Online Discussions by Modeling Positive and Negative Relations among Participants

Author: Ahmed Hassan ; Amjad Abu-Jbara ; Dragomir Radev

Abstract: A mixture of positive (friendly) and negative (antagonistic) relations exist among users in most social media applications. However, many such applications do not allow users to explicitly express the polarity of their interactions. As a result most research has either ignored negative links or was limited to the few domains where such relations are explicitly expressed (e.g. Epinions trust/distrust). We study text exchanged between users in online communities. We find that the polarity of the links between users can be predicted with high accuracy given the text they exchange. This allows us to build a signed network representation of discussions; where every edge has a sign: positive to denote a friendly relation, or negative to denote an antagonistic relation. We also connect our analysis to social psychology theories of balance. We show that the automatically predicted networks are consistent with those theories. Inspired by that, we present a technique for identifying subgroups in discussions by partitioning singed networks representing them.

5 0.41144204 117 emnlp-2012-Sketch Algorithms for Estimating Point Queries in NLP

Author: Amit Goyal ; Hal Daume III ; Graham Cormode

Abstract: Many NLP tasks rely on accurate statistics from large corpora. Tracking complete statistics is memory intensive, so recent work has proposed using compact approximate “sketches” of frequency distributions. We describe 10 sketch methods, including existing and novel variants. We compare and study the errors (over-estimation and underestimation) made by the sketches. We evaluate several sketches on three important NLP problems. Our experiments show that one sketch performs best for all the three tasks.

6 0.40066466 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

7 0.27161086 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories

8 0.26504511 9 emnlp-2012-A Sequence Labelling Approach to Quote Attribution

9 0.22831912 121 emnlp-2012-Supervised Text-based Geolocation Using Language Models on an Adaptive Grid

10 0.19862586 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs

11 0.19791418 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

12 0.19175866 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

13 0.19139907 15 emnlp-2012-Active Learning for Imbalanced Sentiment Classification

14 0.19020142 27 emnlp-2012-Characterizing Stylistic Elements in Syntactic Structure

15 0.18715347 45 emnlp-2012-Exploiting Chunk-level Features to Improve Phrase Chunking

16 0.18476583 74 emnlp-2012-Language Model Rest Costs and Space-Efficient Storage

17 0.17107961 99 emnlp-2012-On Amortizing Inference Cost for Structured Prediction

18 0.16997702 7 emnlp-2012-A Novel Discriminative Framework for Sentence-Level Discourse Analysis

19 0.16981933 43 emnlp-2012-Exact Sampling and Decoding in High-Order Hidden Markov Models

20 0.16929059 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.026), (6, 0.013), (8, 0.015), (9, 0.267), (14, 0.013), (16, 0.028), (34, 0.093), (45, 0.015), (60, 0.076), (63, 0.05), (64, 0.021), (65, 0.027), (68, 0.019), (70, 0.021), (74, 0.052), (76, 0.075), (80, 0.018), (86, 0.026), (87, 0.017), (95, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81103629 120 emnlp-2012-Streaming Analysis of Discourse Participants

Author: Benjamin Van Durme

Abstract: Inferring attributes of discourse participants has been treated as a batch-processing task: data such as all tweets from a given author are gathered in bulk, processed, analyzed for a particular feature, then reported as a result of academic interest. Given the sources and scale of material used in these efforts, along with potential use cases of such analytic tools, discourse analysis should be reconsidered as a streaming challenge. We show that under certain common formulations, the batch-processing analytic framework can be decomposed into a sequential series of updates, using as an example the task of gender classification. Once in a streaming framework, and motivated by large data sets generated by social media services, we present novel results in approximate counting, showing its applicability to space-efficient streaming classification.

2 0.71720719 30 emnlp-2012-Constructing Task-Specific Taxonomies for Document Collection Browsing

Author: Hui Yang

Abstract: Taxonomies can serve as browsing tools for document collections. However, given an arbitrary collection, pre-constructed taxonomies could not easily adapt to the specific topic/task present in the collection. This paper explores techniques to quickly derive task-specific taxonomies supporting browsing in arbitrary document collections. The supervised approach directly learns semantic distances from users to propose meaningful task-specific taxonomies. The approach aims to produce globally optimized taxonomy structures by incorporating path consistency control and usergenerated task specification into the general learning framework. A comparison to stateof-the-art systems and a user study jointly demonstrate that our techniques are highly effective. .

3 0.53113908 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP

Author: Amit Goyal ; Hal Daume III ; Raul Guerra

Abstract: Many natural language processing problems involve constructing large nearest-neighbor graphs. We propose a system called FLAG to construct such graphs approximately from large data sets. To handle the large amount of data, our algorithm maintains approximate counts based on sketching algorithms. To find the approximate nearest neighbors, our algorithm pairs a new distributed online-PMI algorithm with novel fast approximate nearest neighbor search algorithms (variants of PLEB). These algorithms return the approximate nearest neighbors quickly. We show our system’s efficiency in both intrinsic and extrinsic experiments. We further evaluate our fast search algorithms both quantitatively and qualitatively on two NLP applications.

4 0.50154901 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms ofweak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependencyparsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-theart accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

5 0.50123155 77 emnlp-2012-Learning Constraints for Consistent Timeline Extraction

Author: David McClosky ; Christopher D. Manning

Abstract: We present a distantly supervised system for extracting the temporal bounds of fluents (relations which only hold during certain times, such as attends school). Unlike previous pipelined approaches, our model does not assume independence between each fluent or even between named entities with known connections (parent, spouse, employer, etc.). Instead, we model what makes timelines of fluents consistent by learning cross-fluent constraints, potentially spanning entities as well. For example, our model learns that someone is unlikely to start a job at age two or to marry someone who hasn’t been born yet. Our system achieves a 36% error reduction over a pipelined baseline.

6 0.49903682 117 emnlp-2012-Sketch Algorithms for Estimating Point Queries in NLP

7 0.49851876 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

8 0.49581212 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT

9 0.48772508 5 emnlp-2012-A Discriminative Model for Query Spelling Correction with Latent Structural SVM

10 0.48701 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

11 0.48394927 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

12 0.48331627 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

13 0.48268548 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

14 0.48193422 42 emnlp-2012-Entropy-based Pruning for Phrase-based Machine Translation

15 0.48164013 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

16 0.48101184 114 emnlp-2012-Revisiting the Predictability of Language: Response Completion in Social Media

17 0.47942176 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

18 0.47850049 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

19 0.47510371 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

20 0.47498295 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction