acl acl2012 acl2012-91 knowledge-graph by maker-knowledge-mining

91 acl-2012-Extracting and modeling durations for habits and events from Twitter


Source: pdf

Author: Jennifer Williams ; Graham Katz

Abstract: We seek to automatically estimate typical durations for events and habits described in Twitter tweets. A corpus of more than 14 million tweets containing temporal duration information was collected. These tweets were classified as to their habituality status using a bootstrapped decision tree. For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. Summary statistics for 483 verb lemmas and their typical habit and episode durations have been compiled and made available. This automatically generated duration information is broadly comparable to hand-annotation.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Extracting and modeling durations for habits and events from Twitter Jennifer Williams Department of Linguistics Georgetown University Washington, D.C. [sent-1, score-0.655]

2 Abstract We seek to automatically estimate typical durations for events and habits described in Twitter tweets. [sent-4, score-0.809]

3 A corpus of more than 14 million tweets containing temporal duration information was collected. [sent-5, score-1.079]

4 These tweets were classified as to their habituality status using a bootstrapped decision tree. [sent-6, score-0.39]

5 For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. [sent-7, score-0.91]

6 Summary statistics for 483 verb lemmas and their typical habit and episode durations have been compiled and made available. [sent-8, score-0.988]

7 This automatically generated duration information is broadly comparable to hand-annotation. [sent-9, score-0.584]

8 1 Introduction Implicit information about temporal durations is crucial to any natural language processing task involving temporal understanding and reasoning. [sent-10, score-0.711]

9 This information comes in many forms, among them knowledge about typical durations for events and knowledge about typical times at which an event occurs. [sent-11, score-0.826]

10 Hand-annotation of event durations is expensive and slow (Pan et al.). [sent-14, score-0.439]

11 This paper describes a method for automatically extracting information about typical durations for events from tweets posted to the Twitter microblogging site. [sent-19, score-0.942]

12 Twitter is a rich resource for information about everyday events: people post their tweets to Twitter publicly in real-time as they conduct their activities throughout the day, resulting in a significant amount of mundane information about common events. [sent-20, score-0.474]

13 For example, (1) and (2) were used to provide information about how long a work event can last: (1) Had work for an hour and 30 mins now going to disneyland with my cousins :) (2) I play in a loud rock band, I worked at a night club for two years. [sent-21, score-0.276]

14 My ears have never hurt so much @melaniemarnie @giorossi88 @CharlieHi11. In this paper, we sought to use this kind of information to determine likely durations for events and habits of a variety of verbs. [sent-22, score-0.676]

15 This involved two steps: extracting a wide range of tweets such as (1) and (2) and classifying these as to whether they referred to a specific event (as in (1)) or a general habit (as in (2)), then summarizing the duration information associated with each kind of use of a given verb. [sent-23, score-1.261]

16 This paper answers two investigative questions: • How well can we automatically extract fine-grain duration information for events and habits from Twitter? [sent-24, score-0.867]

17 • Can we effectively distinguish episode and habit duration distributions? [sent-25, score-0.899]

18 The results presented here show that Twitter can be mined for fine-grain event duration information with high precision using regular expressions. [sent-26, score-0.651] [sent-28, score-0.023]

20 Additionally, verb uses can be effectively categorized as to their habituality, and duration information plays an important role in this categorization. [sent-29, score-0.725]

21 2 Prior Work Past research on typical durations has made use of standard corpora with texts from literature excerpts, news stories, and full-length weblogs (Pan et al., 2006; 2007; 2011; Kozareva & Hovy, 2011; Gusev et al., 2011). [sent-30, score-0.468]

22 Pan et al. (2011) hand-annotated a portion of the TIMEBANK corpus that consisted of Wall Street Journal articles. [sent-33, score-0.028]

23 For 58 non-financial articles, they annotated over 2,200 events with typical temporal duration, specifying the upper and lower bounds for the duration of each event. [sent-34, score-0.998]

24 In addition, they used their corpus to automatically determine event durations with machine learning, predicting features of the duration on the basis of the verb lemma and local textual context. [sent-35, score-1.14]

25 2% on the coarse-grained task of determining whether an event's duration was longer or shorter than one day (compared with 87. [sent-38, score-0.633]

26 For the fine-grained task of determining the most likely temporal unit (second, minute, hour, day, week, etc.). [sent-40, score-0.229]

27 This shows that lexical information can be effectively leveraged for duration prediction. [sent-44, score-0.577]

28 To compile temporal duration information for a wider range of verbs, Gusev et al. (2011) explored an automatic Web-based query method for harvesting typical durations of events. [sent-45, score-0.766] [sent-46, score-0.504]

30 They note that many verbs have a two-peaked duration distribution, and suggest that this could be a result of usages referring to a habit versus a single episode. [sent-48, score-0.314]

31 (When used with a duration marker, run, for example, is used about 15% of the time with hour-scale and 38% with year-scale duration markers). [sent-49, score-1.106]

32 Rather than making a distinction between habits and episodes in their data, they apply a heuristic to focus on episodes only. [sent-50, score-0.435]

33 224 Kozareva and Hovy (2011) also collected typical durations of events using Web query patterns. [sent-51, score-0.664]

34 They proposed a six-way classification of ways in which events are related to time, but provided only programmatic analyses of a few verbs using Web-based query patterns. [sent-52, score-0.232]

35 They have proposed a compilation of the 5,000 most common verbs along with their typical temporal durations. [sent-53, score-0.396]

36 In each of these efforts, automatically collecting a large amount of reliable data covering a wide range of verbs has been noted as a difficulty. [sent-54, score-0.116]

37 3 Corpus Methodology Our goal was to discover the duration distribution as well as typical habit and typical episode durations for each verb lemma that we found in our collection. [sent-56, score-1.7]

38 A wide range of factors influence typical event durations. [sent-57, score-0.278]

39 For this preliminary work, we ignored the effects of arguments, and focused only on generating duration information for verb lemmas. [sent-59, score-0.701]

40 Also, tweets that were negated, conditional, or in the future tense were set aside. [sent-60, score-0.706]

41 3.1 Data Collection A corpus of tweets was collected from the Twitter web service API using an open-source module called Tweetstream (Halvorsen & Schierkolk, 2010). [sent-62, score-0.368]

42 Tweets were collected that contained reference to a temporal duration. [sent-63, score-0.212]

43 The data collection task began on February 1, 2011 and ended on September 28, 2011. [sent-64, score-0.021]

44 Duplicate tweets were identified by their unique tweet ID provided by Twitter, and were removed from the data set. [sent-65, score-0.386]

45 Also tweets that were marked by Twitter as 'retweets' (tweets that have been reposted to Twitter) were removed. [sent-66, score-0.341]
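
The collection logic in the preceding entries reduces to three filters: keep only tweets that mention a temporal duration, drop duplicates by tweet ID, and drop retweets. Below is a minimal Python sketch of that filtering, assuming a generic `incoming_tweets` iterator of dicts; the field names stand in for whatever the Tweetstream client actually returns and are not taken from the paper.

```python
import re

# Simplified duration-measure pattern: "an hour", "2 weeks", "44 years", ...
DURATION_RE = re.compile(
    r"\b(?:a|an|\d+)\s+(?:second|minute|hour|day|week|month|year)s?\b",
    re.IGNORECASE,
)

def collect(incoming_tweets):
    """Keep tweets mentioning a duration; drop duplicate IDs and retweets."""
    seen_ids = set()
    kept = []
    for tweet in incoming_tweets:  # assumed shape: {"id": ..., "text": ..., "retweet": bool}
        if tweet["id"] in seen_ids or tweet.get("retweet"):
            continue
        if DURATION_RE.search(tweet["text"]):
            seen_ids.add(tweet["id"])
            kept.append(tweet)
    return kept
```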

46 3.2 Extraction Frames To associate each temporal duration with its event, events and durations were identified and extracted using four types of regular expression extraction frames. [sent-70, score-1.303]

47 The patterns applied a heuristic to associate each verb with a temporal expression, similar to the extraction frames used in Gusev et al. (2011). [sent-71, score-0.448]

48 The four types of extraction frames were: (i) verb for duration, (ii) verb in duration, (iii) spend duration verbing, and (iv) takes duration to verb, where verb is the target verb and duration is a duration-measure term. [sent-73, score-3.63]
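
A rough Python rendering of those four frame types as regular expressions is sketched below. The patterns are simplified stand-ins (past-tense, single-token verbs and a small closed set of units), not the authors' actual expressions, which also covered tense and aspect variants and optional arguments.

```python
import re

# A duration-measure term: "44 years", "an hour", "2 weeks", ...
DUR = r"(?P<duration>(?:a|an|\d+)\s+(?:second|minute|hour|day|week|month|year)s?)"

EXTRACTION_FRAMES = [
    re.compile(r"(?P<verb>\w+?)ed\s+for\s+" + DUR),              # verb for duration
    re.compile(r"(?P<verb>\w+?)ed\s+in\s+" + DUR),               # verb in duration
    re.compile(r"spen[dt]\s+" + DUR + r"\s+(?P<verb>\w+?)ing"),  # spend duration verbing
    re.compile(r"takes?\s+" + DUR + r"\s+to\s+(?P<verb>\w+)"),   # takes duration to verb
]

def match_frames(text):
    """Yield a (verb stem, duration term) pair for every frame that fires."""
    for frame in EXTRACTION_FRAMES:
        for m in frame.finditer(text.lower()):
            yield m.group("verb"), m.group("duration")

# list(match_frames("Retired watchmaker worked for 44 years")) -> [('work', '44 years')]
```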

49 In (3), for example, the verb work is associated with the temporal duration term 44 years. [sent-74, score-0.886]

50 (3) Retired watchmaker worked for 44 years without a telephone, to avoid unnecessary interruptions, http://t.co/ox3mB6g [sent-75, score-0.069]

51 These four extraction frame types were also varied to include different tenses, different grammatical aspects, and optional verb arguments, so as to reach a wide range of event mentions and orderings of the verb and the duration clause. [sent-76, score-1.095]

52 For example, the features extracted from (3) were: [work, years, past, simple, 1387584000, FOR]. Tweets with verbal lemmas that occur fewer than 100 times in the extracted corpus were filtered out. [sent-78, score-0.114]
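
The 1387584000 in this vector is the duration normalized to seconds: 44 years at 365 days per year. A sketch of that normalization follows; the year multiplier is implied by the example, while the other unit values (e.g., the 30-day month) are assumptions.

```python
SECONDS_PER_UNIT = {
    "second": 1,
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "week": 604800,
    "month": 2592000,   # 30-day month: an assumption
    "year": 31536000,   # 365-day year: consistent with 44 years -> 1387584000
}

def duration_in_seconds(amount, unit):
    """Normalize an (amount, unit) duration term to seconds."""
    return int(amount) * SECONDS_PER_UNIT[unit.rstrip("s")]

# The feature vector extracted from example (3):
features = ["work", "years", "past", "simple",
            duration_in_seconds(44, "years"), "FOR"]   # 1387584000
```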

53 The resulting data set contained 390,562 feature vectors covering 483 verb lemmas. [sent-79, score-0.172]

54 3.3 Extraction Precision Extraction frame performance was estimated using precision on a random sample of 400 hand-labeled tweets. [sent-81, score-0.084]

55 Each instance in the sample was labeled as correct if the extracted feature vector was correct in its entirety. [sent-82, score-0.058]

56 The overall precision for extraction frames was estimated as 90.25%, calculated using a two-tailed t-test for sample size of proportions with 95% confidence (p=0.05). [sent-83, score-0.114] [sent-84, score-0.096]
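
As a worked check on that figure: with an estimated proportion of 0.9025 on a sample of n = 400, the standard normal-approximation 95% interval for a proportion has a margin of about ±2.9 percentage points. The sketch below shows that textbook interval; it is not necessarily the exact computation the authors ran.

```python
import math

def proportion_margin(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion (z=1.96 for 95%)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

print(proportion_margin(0.9025, 400))   # ~0.029, i.e. 90.25% +/- 2.9 points
```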

58 3.4 Duration Results In order to summarize information about duration for each of the 483 verb lemmas, we calculated the frequency distribution of tweets by duration in seconds. [sent-87, score-1.623]

59 This distribution can be represented in histogram form, as in Figure 1 for the verb lemma search, with bins corresponding to temporal units of measure (seconds, minutes, etc.). [sent-88, score-0.472]
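
Building such a histogram amounts to mapping each duration in seconds to the largest temporal unit it reaches and counting per bin. A minimal sketch, with bin boundaries reusing the unit multipliers assumed earlier:

```python
from collections import Counter

# (bin label, lower bound in seconds), coarsest unit first
UNIT_BINS = [("years", 31536000), ("months", 2592000), ("weeks", 604800),
             ("days", 86400), ("hours", 3600), ("minutes", 60), ("seconds", 0)]

def unit_bin(seconds):
    """Return the temporal-unit histogram bin for a duration in seconds."""
    return next(label for label, lower in UNIT_BINS if seconds >= lower)

def duration_histogram(durations_in_seconds):
    """Frequency distribution over temporal-unit bins."""
    return Counter(unit_bin(s) for s in durations_in_seconds)
```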

60 Figure 1: Frequency distribution for search. This histogram shows the characteristic bimodal distribution noted by Pan et al. [sent-90, score-0.064]

61 4 Episodic/Habitual Classification Most verbs have both episodic and habitual uses, which clearly correspond to different typical durations. [sent-94, score-0.372]

62 In order to draw this distinction we built a system to automatically classify our tweets as to their habituality. [sent-95, score-0.341]

63 The extracted feature vectors were used in a machine learning task to label each tweet in the collection as denoting a habit or an episode, broadly following Mathew & Katz (2009). [sent-96, score-0.355]

64 4.1 Bootstrapping Classifier First, a random sample of 1000 tweets from the extracted corpus was hand-labeled as being either habit or episode (236 habits; 764 episodes). [sent-99, score-0.721]

65 The extracted feature vectors for these tweets were used to train a C4.5 decision tree classifier. [sent-100, score-0.397]

66 We used this classifier and the hand-labeled set to seed the generic Yarowsky Algorithm (Abney, 2004), iteratively inducing a habit or episode label for all the tweets in the collection, using the WEKA output for confidence scoring and a confidence threshold of 0. [sent-105, score-0.764]
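
A schematic of that self-training loop is sketched below, with scikit-learn's decision tree standing in for the WEKA C4.5 learner. The 0.9 threshold is a placeholder (the original value is truncated above), and this monotone variant, which keeps every label once assigned, simplifies the full Yarowsky algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bootstrap_labels(X_seed, y_seed, X_pool, threshold=0.9, max_rounds=10):
    """Self-training: repeatedly retrain, then absorb confidently labeled vectors."""
    X_train, y_train = X_seed, y_seed
    clf = DecisionTreeClassifier()
    for _ in range(max_rounds):
        clf.fit(X_train, y_train)
        if len(X_pool) == 0:
            break
        proba = clf.predict_proba(X_pool)            # per-class confidence scores
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        new_labels = clf.classes_[proba.argmax(axis=1)]
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, new_labels[confident]])
        X_pool = X_pool[~confident]
    return clf
```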

67 The extracted corpus was classified into 94,643 habitual tweets and 295,918 episodic tweets. [sent-107, score-0.555]

68 To estimate the accuracy of the classifier, 400 randomly chosen tweets from the extracted corpus were hand-labeled, giving an estimated accuracy of 85% with 95% confidence, using the two-tailed t-test for sample size of proportions (p=0.05). [sent-108, score-0.435]

69 4.2 Results Clearly the data in Figure 1 represents two combined distributions: one for episodes and one for habits, as we illustrate in Figure 2. [sent-111, score-0.127]

70 We see that the verb search describes episodes that most often last minutes or hours, while it describes habits that go on for years. [sent-112, score-0.527]

71 Figure 2: Duration distribution for search. These two different uses are illustrated in (4) and (5). [sent-113, score-0.028]

72 (4) Obviously I'm the one who found the tiny lost black Lego in 30 seconds after the 3 of them searched for 5 minutes. [sent-114, score-0.069]

73 (5) @jaynecheeseman they've been searching for you for 11 years now. [sent-115, score-0.045]

74 In Table 1 we provide summary information for several verb lemmas, indicating the average duration for each verb and the temporal unit corresponding to the largest bin for each verb. [sent-117, score-1.057]
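
Given one record per tweet of (lemma, habituality label, duration in seconds), the Table 1 quantities reduce to a group-by: mean duration plus the modal temporal-unit bin. A pandas sketch, reusing the `unit_bin` helper from the histogram example above:

```python
import pandas as pd

def summarize(records: pd.DataFrame) -> pd.DataFrame:
    """records has columns: lemma, label ('habit' or 'episode'), seconds."""
    binned = records.assign(unit=records["seconds"].map(unit_bin))
    grouped = binned.groupby(["lemma", "label"])
    return pd.DataFrame({
        "mean_seconds": grouped["seconds"].mean(),
        "modal_unit": grouped["unit"].agg(lambda u: u.mode().iat[0]),
    })
```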

75 Table 1: Mean duration and mode for 6 of the verbs. It is clear that the methodology somewhat overestimates the duration of episodes: our estimates of typical durations are 2-3 times as long as those that come from the annotation in Pan et al. [sent-119, score-1.764]

76 Nevertheless, the modal bin corresponds approximately to that of the hand annotation in Pan et al. (2011) for nearly half (45%) of the verb lemmas. [sent-122, score-0.023] [sent-124, score-0.063]

78 5 Conclusion We have presented a hybrid approach for extracting typical durations of habits and episodes. [sent-125, score-0.649]

79 We are able to extract high-quality information about temporal durations and to effectively classify tweets as to their habituality. [sent-126, score-0.891]

80 It is clear that Twitter tweets contain a lot of unique data about different kinds of events and habits, and mining this data for temporal duration information has turned out to be a fruitful avenue for collecting the kind of world knowledge that we need for robust temporal language processing. [sent-127, score-1.418]

81 “Using query patterns to learn the durations of events” (Gusev et al., 2011). [sent-141, score-0.377]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('duration', 0.553), ('durations', 0.341), ('tweets', 0.341), ('habit', 0.195), ('temporal', 0.185), ('habits', 0.181), ('twitter', 0.165), ('verb', 0.148), ('events', 0.133), ('pan', 0.13), ('episode', 0.127), ('episodes', 0.127), ('typical', 0.127), ('gusev', 0.106), ('event', 0.098), ('georgetown', 0.097), ('habitual', 0.097), ('lasts', 0.085), ('episodic', 0.085), ('minutes', 0.071), ('hour', 0.07), ('seconds', 0.069), ('jerry', 0.064), ('verbs', 0.063), ('weeks', 0.058), ('kozareva', 0.058), ('hours', 0.058), ('day', 0.058), ('frames', 0.056), ('lemma', 0.054), ('lemmas', 0.05), ('chess', 0.049), ('habituality', 0.049), ('halvorsen', 0.049), ('indiana', 0.049), ('mathew', 0.049), ('tweetstream', 0.049), ('zoo', 0.049), ('tweet', 0.045), ('years', 0.045), ('mins', 0.042), ('rutu', 0.042), ('lunch', 0.042), ('days', 0.039), ('months', 0.038), ('histogram', 0.036), ('weka', 0.036), ('proportions', 0.036), ('query', 0.036), ('frame', 0.035), ('extraction', 0.035), ('confidence', 0.034), ('spend', 0.034), ('classifier', 0.033), ('minute', 0.033), ('nltk', 0.033), ('steven', 0.032), ('extracted', 0.032), ('yarowsky', 0.031), ('https', 0.031), ('bird', 0.031), ('broadly', 0.031), ('week', 0.03), ('katz', 0.029), ('game', 0.029), ('distribution', 0.028), ('consisted', 0.028), ('graham', 0.028), ('denoting', 0.028), ('range', 0.028), ('seek', 0.027), ('collected', 0.027), ('api', 0.026), ('sample', 0.026), ('arguments', 0.025), ('wide', 0.025), ('bootstrapping', 0.024), ('associate', 0.024), ('past', 0.024), ('effectively', 0.024), ('vectors', 0.024), ('worked', 0.024), ('tense', 0.024), ('washington', 0.023), ('precision', 0.023), ('bin', 0.023), ('determining', 0.022), ('feng', 0.022), ('kind', 0.021), ('ended', 0.021), ('rock', 0.021), ('club', 0.021), ('compilation', 0.021), ('eibe', 0.021), ('andrey', 0.021), ('divye', 0.021), ('khaitan', 0.021), ('khilnani', 0.021), ('bins', 0.021), ('centuries', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 91 acl-2012-Extracting and modeling durations for habits and events from Twitter

Author: Jennifer Williams ; Graham Katz

Abstract: We seek to automatically estimate typical durations for events and habits described in Twitter tweets. A corpus of more than 14 million tweets containing temporal duration information was collected. These tweets were classified as to their habituality status using a bootstrapped decision tree. For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. Summary statistics for 483 verb lemmas and their typical habit and episode durations have been compiled and made available. This automatically generated duration information is broadly comparable to hand-annotation.

2 0.25537223 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan

Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.

3 0.24262391 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

Author: Rui Yan ; Mirella Lapata ; Xiaoming Li

Abstract: Twitter enables users to send and read text-based posts of up to 140 characters, known as tweets. As one of the most popular micro-blogging services, Twitter attracts millions of users, producing millions of tweets daily. Shared information through this service spreads faster than would have been possible with traditional sources; however, the proliferation of user-generated content poses challenges to browsing and finding valuable information. In this paper we propose a graph-theoretic model for tweet recommendation that presents users with items they may have an interest in. Our model ranks tweets and their authors simultaneously using several networks: the social network connecting the users, the network connecting the tweets, and a third network that ties the two together. Tweet and author entities are ranked following a co-ranking algorithm based on the intuition that there is a mutually reinforcing relationship between tweets and their authors that could be reflected in the rankings. We show that this framework can be parametrized to take into account user preferences, the popularity of tweets and their authors, and diversity. Experimental evaluation on a large dataset shows that our model outperforms competitive approaches by a large margin.

4 0.23474282 167 acl-2012-QuickView: NLP-based Tweet Search

Author: Xiaohua Liu ; Furu Wei ; Ming Zhou ; QuickView Team Microsoft

Abstract: Tweets have become a comprehensive repository for real-time information. However, it is often hard for users to quickly get information they are interested in from tweets, owing to the sheer volume of tweets as well as their noisy and informal nature. We present QuickView, an NLP-based tweet search platform to tackle this issue. Specifically, it exploits a series of natural language processing technologies, such as tweet normalization, named entity recognition, semantic role labeling, sentiment analysis, tweet classification, to extract useful information, i.e., named entities, events, opinions, etc., from a large volume of tweets. Then, non-noisy tweets, together with the mined information, are indexed, on top of which two brand new scenarios are enabled, i.e., categorized browsing and advanced search, allowing users to effectively access either the tweets or fine-grained information they are interested in.

5 0.23319708 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures

Author: Oleksandr Kolomiyets ; Steven Bethard ; Marie-Francine Moens

Abstract: We propose a new approach to characterizing the timeline of a text: temporal dependency structures, where all the events of a narrative are linked via partial ordering relations like BEFORE, AFTER, OVERLAP and IDENTITY. We annotate a corpus of children’s stories with temporal dependency trees, achieving agreement (Krippendorff’s Alpha) of 0.856 on the event words, 0.822 on the links between events, and of 0.700 on the ordering relation labels. We compare two parsing models for temporal dependency structures, and show that a deterministic non-projective dependency parser outperforms a graph-based maximum spanning tree parser, achieving labeled attachment accuracy of 0.647 and labeled tree edit distance of 0.596. Our analysis of the dependency parser errors gives some insights into future research directions.

6 0.15374938 191 acl-2012-Temporally Anchored Relation Extraction

7 0.14252979 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

8 0.13708855 135 acl-2012-Learning to Temporally Order Medical Events in Clinical Text

9 0.11777196 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

10 0.10533819 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

11 0.099030457 192 acl-2012-Tense and Aspect Error Correction for ESL Learners Using Global Context

12 0.089315131 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction

13 0.08539816 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations

14 0.079950005 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions

15 0.079136565 17 acl-2012-A Novel Burst-based Text Representation Model for Scalable Event Detection

16 0.075672857 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

17 0.065843157 98 acl-2012-Finding Bursty Topics from Microblogs

18 0.060902618 33 acl-2012-Automatic Event Extraction with Structured Preference Modeling

19 0.06052411 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources

20 0.058551196 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.138), (1, 0.195), (2, -0.032), (3, 0.191), (4, 0.068), (5, -0.254), (6, 0.351), (7, 0.006), (8, 0.069), (9, 0.026), (10, -0.059), (11, -0.034), (12, 0.108), (13, 0.063), (14, 0.028), (15, -0.037), (16, -0.003), (17, -0.023), (18, 0.028), (19, 0.029), (20, 0.01), (21, -0.039), (22, -0.035), (23, 0.011), (24, 0.056), (25, -0.067), (26, 0.017), (27, -0.058), (28, 0.062), (29, 0.007), (30, -0.082), (31, 0.058), (32, -0.048), (33, -0.004), (34, 0.068), (35, 0.014), (36, 0.027), (37, 0.011), (38, 0.027), (39, -0.039), (40, -0.057), (41, 0.05), (42, -0.009), (43, -0.017), (44, 0.039), (45, 0.039), (46, 0.067), (47, 0.001), (48, 0.051), (49, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96959913 91 acl-2012-Extracting and modeling durations for habits and events from Twitter

Author: Jennifer Williams ; Graham Katz

Abstract: We seek to automatically estimate typical durations for events and habits described in Twitter tweets. A corpus of more than 14 million tweets containing temporal duration information was collected. These tweets were classified as to their habituality status using a bootstrapped decision tree. For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. Summary statistics for 483 verb lemmas and their typical habit and episode durations have been compiled and made available. This automatically generated duration information is broadly comparable to hand-annotation.

2 0.73888105 167 acl-2012-QuickView: NLP-based Tweet Search

Author: Xiaohua Liu ; Furu Wei ; Ming Zhou ; QuickView Team Microsoft

Abstract: Tweets have become a comprehensive repository for real-time information. However, it is often hard for users to quickly get information they are interested in from tweets, owing to the sheer volume of tweets as well as their noisy and informal nature. We present QuickView, an NLP-based tweet search platform to tackle this issue. Specifically, it exploits a series of natural language processing technologies, such as tweet normalization, named entity recognition, semantic role labeling, sentiment analysis, tweet classification, to extract useful information, i.e., named entities, events, opinions, etc., from a large volume of tweets. Then, non-noisy tweets, together with the mined information, are indexed, on top of which two brand new scenarios are enabled, i.e., categorized browsing and advanced search, allowing users to effectively access either the tweets or fine-grained information they are interested in.

3 0.70075446 205 acl-2012-Tweet Recommendation with Graph Co-Ranking

Author: Rui Yan ; Mirella Lapata ; Xiaoming Li

Abstract: Twitter enables users to send and read text-based posts of up to 140 characters, known as tweets. As one of the most popular micro-blogging services, Twitter attracts millions of users, producing millions of tweets daily. Shared information through this service spreads faster than would have been possible with traditional sources; however, the proliferation of user-generated content poses challenges to browsing and finding valuable information. In this paper we propose a graph-theoretic model for tweet recommendation that presents users with items they may have an interest in. Our model ranks tweets and their authors simultaneously using several networks: the social network connecting the users, the network connecting the tweets, and a third network that ties the two together. Tweet and author entities are ranked following a co-ranking algorithm based on the intuition that there is a mutually reinforcing relationship between tweets and their authors that could be reflected in the rankings. We show that this framework can be parametrized to take into account user preferences, the popularity of tweets and their authors, and diversity. Experimental evaluation on a large dataset shows that our model outperforms competitive approaches by a large margin.

4 0.61012405 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan

Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion.

5 0.58176184 135 acl-2012-Learning to Temporally Order Medical Events in Clinical Text

Author: Preethi Raghavan ; Albert Lai ; Eric Fosler-Lussier

Abstract: We investigate the problem of ordering medical events in unstructured clinical narratives by learning to rank them based on their time of occurrence. We represent each medical event as a time duration, with a corresponding start and stop, and learn to rank the starts/stops based on their proximity to the admission date. Such a representation allows us to learn all of Allen’s temporal relations between medical events. Interestingly, we observe that this methodology performs better than a classification-based approach for this domain, but worse on the relationships found in the Timebank corpus. This finding has important implications for styles of data representation and resources used for temporal relation learning: clinical narratives may have different language attributes corresponding to temporal ordering relative to Timebank, implying that the field may need to look at a wider range of domains to fully understand the nature of temporal ordering.

6 0.55625141 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

7 0.53868061 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

8 0.46035048 90 acl-2012-Extracting Narrative Timelines as Temporal Dependency Structures

9 0.44199714 191 acl-2012-Temporally Anchored Relation Extraction

10 0.40734664 173 acl-2012-Self-Disclosure and Relationship Strength in Twitter Conversations

11 0.40075749 126 acl-2012-Labeling Documents with Timestamps: Learning from their Time Expressions

12 0.39877766 60 acl-2012-Coupling Label Propagation and Constraints for Temporal Fact Extraction

13 0.36380386 85 acl-2012-Event Linking: Grounding Event Reference in a News Archive

14 0.27391252 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

15 0.26833403 17 acl-2012-A Novel Burst-based Text Representation Model for Scalable Event Detection

16 0.25120884 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources

17 0.24980588 192 acl-2012-Tense and Aspect Error Correction for ESL Learners Using Global Context

18 0.23838457 2 acl-2012-A Broad-Coverage Normalization System for Social Media Language

19 0.2270962 201 acl-2012-Towards the Unsupervised Acquisition of Discourse Relations

20 0.22136602 166 acl-2012-Qualitative Modeling of Spatial Prepositions and Motion Expressions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.023), (26, 0.462), (28, 0.022), (30, 0.014), (37, 0.024), (39, 0.051), (74, 0.041), (82, 0.045), (84, 0.032), (85, 0.016), (90, 0.075), (92, 0.06), (94, 0.012), (99, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91712636 91 acl-2012-Extracting and modeling durations for habits and events from Twitter

Author: Jennifer Williams ; Graham Katz

Abstract: We seek to automatically estimate typical durations for events and habits described in Twitter tweets. A corpus of more than 14 million tweets containing temporal duration information was collected. These tweets were classified as to their habituality status using a bootstrapped decision tree. For each verb lemma, associated duration information was collected for episodic and habitual uses of the verb. Summary statistics for 483 verb lemmas and their typical habit and episode durations have been compiled and made available. This automatically generated duration information is broadly comparable to hand-annotation.

2 0.86907518 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

Author: Rafal Rak ; BalaKrishna Kolluru ; Sophia Ananiadou

Abstract: Argo is a web-based NLP and text mining workbench with a convenient graphical user interface for designing and executing processing workflows of various complexity. The workbench is intended for specialists and nontechnical audiences alike, and provides the ever expanding library of analytics compliant with the Unstructured Information Management Architecture, a widely adopted interoperability framework. We explore the flexibility of this framework by demonstrating workflows involving three processing components capable of performing self-contained machine learning-based tagging. The three components are responsible for the three distinct tasks of 1) generating observations or features, 2) training a statistical model based on the generated features, and 3) tagging unlabelled data with the model. The learning and tagging components are based on an implementation of conditional random fields (CRF); whereas the feature generation component is an analytic capable of extending basic token information to a comprehensive set of features. Users define the features of their choice directly from Argo’s graphical interface, without resorting to programming (a commonly used approach to feature engineering). The experimental results performed on two tagging tasks, chunking and named entity recognition, showed that a tagger with a generic set of features built in Argo is capable of competing with task-specific solutions.

3 0.8686834 209 acl-2012-Unsupervised Semantic Role Induction with Global Role Ordering

Author: Nikhil Garg ; James Henserdon

Abstract: We propose a probabilistic generative model for unsupervised semantic role induction, which integrates local role assignment decisions and a global role ordering decision in a unified model. The role sequence is divided into intervals based on the notion of primary roles, and each interval generates a sequence of secondary roles and syntactic constituents using local features. The global role ordering consists of the sequence of primary roles only, thus making it a partial ordering.

4 0.80084771 134 acl-2012-Learning to Find Translations and Transliterations on the Web

Author: Joseph Z. Chang ; Jason S. Chang ; Roger Jyh-Shing Jang

Abstract: By identifying such translation counterparts on the Web, we can cope with the OOV problem. In this paper, we present a new method for learning to find translations and transliterations on the Web for a given term. The approach involves using a small set of terms and translations to obtain mixed-code snippets from a search engine, and automatically annotating the snippets with tags and features for training a conditional random field model. At runtime, the model is used to extract translation candidates for a given term. Preliminary experiments and evaluation show our method cleanly combines various features, resulting in a system that outperforms previous work.

5 0.69428796 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition

Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein

Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation; for instance, “the” might be realized as [Di] or [D@] (e.g., an intended /ju want w2n/ surfacing as [j@ w a?P w2n]). Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.

6 0.54315931 187 acl-2012-Subgroup Detection in Ideological Discussions

7 0.53011966 48 acl-2012-Classifying French Verbs Using French and English Lexical Resources

8 0.47374395 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers

9 0.47234711 11 acl-2012-A Feature-Rich Constituent Context Model for Grammar Induction

10 0.46276289 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

11 0.45741981 83 acl-2012-Error Mining on Dependency Trees

12 0.45258337 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars

13 0.45196724 5 acl-2012-A Comparison of Chinese Parsers for Stanford Dependencies

14 0.4432835 211 acl-2012-Using Rejuvenation to Improve Particle Filtering for Bayesian Word Segmentation

15 0.44304579 139 acl-2012-MIX Is Not a Tree-Adjoining Language

16 0.44258353 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

17 0.44200018 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

18 0.44147104 47 acl-2012-Chinese Comma Disambiguation for Discourse Analysis

19 0.44142559 124 acl-2012-Joint Inference of Named Entity Recognition and Normalization for Tweets

20 0.43925366 167 acl-2012-QuickView: NLP-based Tweet Search