emnlp emnlp2013 emnlp2013-46 knowledge-graph by maker-knowledge-mining

46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes


Source: pdf

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes Ruihong Huang and Ellen Riloff School of Computing University of Utah Salt Lake City, UT 84112 {huangrh, riloff}@cs.utah.edu [sent-1, score-0.02]

2 Abstract The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. [sent-3, score-2.052]

3 We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. [sent-4, score-0.948]

4 Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. [sent-5, score-1.14]

5 We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. [sent-6, score-1.602]

6 Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task. [sent-7, score-1.02]

7 1 Introduction Our research focuses on the problem of classify- ing message board posts in the domain of veterinary medicine. [sent-8, score-1.022]

8 Most of the posts in our corpus discuss a case involving a specific patient, which we will call patient-specific posts. [sent-9, score-0.336]

9 But there are also posts that ask a general question, for example to seek advice about different medications, information about new procedures, or how to perform a test. [sent-10, score-0.362]

10 Our goal is to distinguish the patient-specific posts from general posts so that they can be automatically routed to different message board folders. [sent-11, score-0.907]

11 Distinguishing patient-specific posts from general posts is a challenging problem for two reasons. [sent-12, score-0.632]

12 First, virtually any medical topic can appear in either type of post, so the vocabulary is very similar. [sent-13, score-0.073]

13 Second, a highly skewed distribution exists between patient-specific posts and general posts. [sent-14, score-0.353]

14 Almost 90% of the posts in our data are about specific patients. [sent-15, score-0.32]

15 With such a highly skewed distribution, it would seem logical to focus on recognizing instances of the minority class. [sent-16, score-0.041]

16 But the distinguishing characteristic of a general post is the absence of a patient. [sent-17, score-0.087]

17 Two nearly identical posts belong in different categories if one mentions a patient and the other does not. [sent-18, score-0.93]

18 Consequently, our aim is to create features that identify references to a specific patient and use these to more accurately distinguish the two types of posts. [sent-19, score-0.745]

19 Our research explores the use of information extraction (IE) techniques to automatically identify common attributes of veterinary patients, which we use to encode patient information in a text classifier. [sent-20, score-1.264]

20 First, we train a conditional random fields (CRF) tagger to identify seven common types of attributes that are often ascribed to veterinary patients: SPECIES/BREED, NAME, AGE, GENDER, WEIGHT, POSSESSOR, and DISEASE/SYMPTOM. [sent-22, score-0.791]

21 Second, we apply the CRF tagger to a large set of unannotated message board posts, collect its extractions, and harvest the most frequently extracted terms to create a Veterinary Patient Attribute (VPA) Lexicon. [sent-23, score-0.58]

22 Finally, we define three types of features that exploit the harvested VPA lexicon. [sent-24, score-0.049]

23 These features represent the patient attribute terms, types, and combinations of them to help the classifier determine whether a post is discussing a specific patient. [sent-25, score-1.027]

24 We conduct experiments which show that the extracted patient attribute information improves text classifi- cation performance on this task. [sent-26, score-0.965]

25 2 Related Work Our work demonstrates the use of information extraction techniques to benefit a text classification application. [sent-29, score-0.054]

26 There has been a great deal of research on text classification (e. [sent-30, score-0.038]

27 Information extraction techniques have been used previously to create richer features for event-based text classification (Riloff and Lehnert, 1994) and web page classification (Furnkranz et al. [sent-37, score-0.122]

28 Semantic information has also been incorporated for text classification. [sent-39, score-0.018]

29 There is also a rich history of automatic lexicon induction from text corpora (e. [sent-42, score-0.089]

30 The novel aspects of our work are in using an IE tagger to harvest a domain-specific lexicon from unannotated texts, and using the induced lexicon to encode domain-specific features for text classification. [sent-52, score-0.423]

31 3 Text Classification with Extracted Patient Attributes This research studies message board posts from the Veterinary Information Network (VIN), which is a web site (www. [sent-53, score-0.552]

32 VIN hosts forums where veterinarians discuss medical issues, challenging cases, etc. [sent-56, score-0.076]

33 We observed that patient-specific veterinary posts almost always include some basic facts about the patient, such as the animal’s breed, age, or gender. [sent-57, score-0.79]

34 It is also common to mention the patient’s owner (e. [sent-58, score-0.016]

35 , “a new client’s cat”) or a disease or symptom that the patient has (e. [sent-60, score-0.643]

36 Although some of these terms can be found in existing resources such as WordNet (Miller, 1990), our veterinary message board posts are filled with informal and unconventional vocabulary. [sent-64, score-1.104]

37 For example, one might naively assume that “male ” and “female” are sufficient to identify gender. [sent-65, score-0.043]

38 But the gender of animals is often revealed by describing their spayed/neutered status, often indicated with shorthand notations. [sent-66, score-0.087]

39 For example, “m/n” means male and neutered, “fs” means female spayed, “castrated” means neutered and implies male. [sent-67, score-0.101]

40 Shorthand terms and informal jargon are also frequently used for breeds (e. [sent-68, score-0.061]

41 , “doxy” for dachshund, “labx” for Labrador cross, “gshep” for German Shepherd) and ages (e. [sent-70, score-0.02]

42 A particularly creative age expression describes an animal as (say) “a 1999 model” (i. [sent-73, score-0.095]

43 To recognize the idiosyncratic vocabulary in these texts, we use information extraction techniques to identify terms corresponding to seven attributes of veterinary patients: SPECIES/BREED, NAME, AGE, WEIGHT, GENDER, POSSESSOR, and DISEASE/SYMPTOM. [sent-76, score-0.67]

44 First, we train a sequential IE tagger to label veterinary patient attributes using supervised learning. [sent-78, score-1.325]

45 Second, we apply the tagger to 10,000 unannotated message board posts to automatically create a Veterinary Patient Attribute (VPA) Lexicon. [sent-79, score-0.838]

46 Third, we use the VPA Lexicon to encode patient attribute features in a document classifier. [sent-80, score-0.957]

47 [Figure 1: Flowchart for Creating a Patient-Specific vs. General Post Classifier. Step 1: annotated texts are used to train the PI sentence classifier and the patient attribute (CRF) tagger; Step 2: these are applied to unannotated texts to build the VPA Lexicon; Step 3: the VPA Lexicon provides features for the document classifier.] [sent-81, score-0.02]

48 3.1 Patient Attribute Tagger The first component of our system is a tagger that labels veterinary patient attributes. [sent-83, score-1.249]

49 To train the tagger, we need texts labeled with patient attributes. [sent-84, score-0.676]

50 The message board posts can be long and tedious to read (i. [sent-85, score-0.569]

51 , they are often filled with medical history and test results), so manually annotating every word would be arduous. [sent-87, score-0.08]

52 However, the patient is usually described at the beginning of a post, most commonly in 1-2 “introductory” sentences. [sent-88, score-0.626]

53 Therefore we adopted a two stage process, both for manual and automatic tagging of patient attributes. [sent-89, score-0.626]

54 First, we created annotation guidelines to identify “patient introductory” (PI) sentences, which we defined as sentences that introduce a patient to the reader by providing a general (non-medical) description of the animal (e. [sent-90, score-0.74]

55 , “I was presented with a m/n Siamese cat that is lethargic. [sent-92, score-0.032]

56 ”) We randomly selected 300 posts from our text collection and asked two human annotators to manually identify the PI sentences. [sent-93, score-0.396]

57 We measured their inter-annotator agreement using Cohen’s kappa (κ) and their agreement was κ=. [sent-94, score-0.04]

58 The two annotators then adjudicated their differences to create our gold standard set of PI sentence annotations. [sent-96, score-0.093]

59 269 of the 300 posts contained at least one PI sentence, indicating that 89. [sent-97, score-0.304]

60 Second, the annotators manually labeled the words in these PI sentences with respect to the 7 veterinary patient attributes. [sent-101, score-1.145]

61 On 50 randomly selected texts, the annotators achieved an inter-annotator agreement of κ = . [sent-102, score-0.05]

62 The remaining 250 posts were then annotated with patient attributes (in the PI sentences), providing us with gold standard attribute annotations for all 300 posts. [sent-104, score-1.338]

63 To illustrate, the sentence below would have the following labels: Daisy/NAME is a 10yr/AGE old/AGE lab/SPECIES. We used these 300 annotated posts to train both a PI sentence classifier and a patient attribute tagger. [sent-105, score-1.276]

64 The PI sentence classifier is a support vector machine (SVM) with a linear kernel (Keerthi and DeCoste, 2005), unigram and bigram features, and binary feature values. [sent-106, score-0.045]

65 The PI sentences are the positive training instances, and the sentences in the general posts are negative training instances. [sent-107, score-0.366]
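
To make this setup concrete, here is a minimal sketch of a linear-kernel SVM over binary unigram and bigram features. It uses scikit-learn as a stand-in for the SVM implementation cited above, and the example sentences and variable names are illustrative only.

```python
# Minimal sketch of the PI sentence classifier: binary unigram/bigram
# features and a linear-kernel SVM. scikit-learn is a stand-in here; the
# training sentences below are illustrative placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

pi_sentences = ["I was presented with a m/n Siamese cat that is lethargic."]
general_sentences = ["Has anyone tried the new procedure for this test?"]

sentences = pi_sentences + general_sentences
labels = [1] * len(pi_sentences) + [0] * len(general_sentences)

vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)  # binary feature values
X = vectorizer.fit_transform(sentences)

pi_classifier = LinearSVC()  # linear kernel
pi_classifier.fit(X, labels)

def predict_pi(sentence):
    """Return True if the sentence is predicted to introduce a patient."""
    return pi_classifier.predict(vectorizer.transform([sentence]))[0] == 1
```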

66 For the tagger, we trained a single conditional random fields (CRF) model to label all 7 types of patient attributes using the CRF++ package (Lafferty et al. [sent-108, score-0.717]
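
CRF++ is a command-line tool, so a typical workflow is to write the labeled PI sentences to a token-per-line file together with a feature template and then call its crf_learn/crf_test binaries. The sketch below illustrates that workflow; the file names, BIO-style labels, and feature template are assumptions for illustration, not the authors' actual configuration.

```python
# Sketch of training a CRF++ model for the patient attribute tagger.
# File names, BIO-style labels, and the feature template are illustrative
# assumptions, not the configuration used in the paper.
import subprocess

# CRF++ training data: one token per line, last column is the label,
# blank line between sentences.
train_data = "Daisy\tB-NAME\nis\tO\na\tO\n10yr\tB-AGE\nold\tI-AGE\nlab\tB-SPECIES\n\n"

# Tiny feature template: previous, current, and next token.
template = "U00:%x[-1,0]\nU01:%x[0,0]\nU02:%x[1,0]\nB\n"

with open("train.data", "w") as f:
    f.write(train_data)
with open("template", "w") as f:
    f.write(template)

# Test data must have the same column layout (a dummy label is fine).
with open("test.data", "w") as f:
    f.write("a\tO\nm/n\tO\nSiamese\tO\ncat\tO\n\n")

# crf_learn and crf_test are the CRF++ command-line tools.
subprocess.run(["crf_learn", "template", "train.data", "attribute.model"], check=True)
subprocess.run(["crf_test", "-m", "attribute.model", "test.data"], check=True)
```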

67 Given new texts to process, we first apply the PI sentence classifier to identify sentences that introduce a patient. [sent-111, score-0.142]

68 These sentences are given to the patient attribute tagger, which labels the words in those sentences for the 7 patient attribute categories. [sent-112, score-1.892]
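
Putting the two steps together, the processing of a new post can be sketched roughly as follows; the sentence splitter, PI classifier, and attribute tagger are passed in as callables, and all names are illustrative.

```python
# Two-step application sketch: select PI sentences, then tag attribute words.
# split_sentences, predict_pi, and tag_attributes are assumed callables
# standing in for the components described above.
def extract_patient_attributes(post_text, split_sentences, predict_pi, tag_attributes):
    """Return (word, attribute_type) pairs from the post's PI sentences."""
    tagged = []
    for sentence in split_sentences(post_text):
        if predict_pi(sentence):                     # step 1: PI sentence classifier
            tagged.extend(tag_attributes(sentence))  # step 2: CRF attribute tagger
    return tagged
```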

69 To evaluate the performance of the patient attribute tagger, we randomly sampled 200 of the 300 annotated documents to use as training data and used the remaining 100 documents for testing. [sent-113, score-0.943]

70 For this experiment, we only applied the CRF tagger to the gold standard PI sentences, to eliminate any confounding factors from the PI sentence classifier. [sent-114, score-0.188]

71 Table 1 shows the performance of the CRF tagger in terms of Recall (%), Precision (%), and F Score (%). [sent-115, score-0.17]

72 Its precision is consistently high, averaging 91% across all seven attributes. [sent-116, score-0.03]

73 But the average recall is only 47%, with only one attribute (AGE) achieving recall ≥ 80%. [sent-117, score-0.301]

74 Nevertheless, the CRF’s high precision justifies our plan to use the CRF tagger to harvest additional attribute terms from a large collection of unannotated texts. [sent-118, score-0.61]

75 As we will see in Section 4, the additional terms harvested from the unannotated texts provide substantially more attribute information for the document classifier to use. [sent-119, score-0.547]

76 3.2 Creating a Veterinary Patient Attribute (VPA) Lexicon The patient attribute tagger was trained with supervised learning, so its ability to recognize important words is limited by the scope of its training set. [sent-121, score-1.097]

77 Since we had an additional 10,000 unannotated veterinary message board posts, we used the tagger to acquire a large lexicon of patient attribute terms. [sent-122, score-1.937]

78 We applied the PI sentence classifier to all 10,000 texts and then applied the patient attribute tagger to each PI sentence. [sent-123, score-1.175]

79 The patient attribute tagger is not perfect, so we assumed that words tagged with the same attribute value at least five times are most likely to be correct and harvested them to create a veterinary patient attribute (VPA) lexicon. [sent-124, score-2.875]
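
A minimal sketch of this harvesting step, assuming the tagger's output is available as (word, attribute type) pairs; the variable names, lowercasing, and example data are illustrative.

```python
# Sketch of building the VPA Lexicon from CRF tagger output over the
# unannotated posts, keeping (word, attribute) pairs tagged >= 5 times.
# tagged_pairs is an illustrative stand-in for the real tagger output.
from collections import Counter, defaultdict

tagged_pairs = [("doxy", "SPECIES/BREED"), ("m/n", "GENDER"), ("10yr", "AGE")]

MIN_COUNT = 5
counts = Counter((word.lower(), attr) for word, attr in tagged_pairs)

vpa_lexicon = defaultdict(set)
for (word, attr), freq in counts.items():
    if freq >= MIN_COUNT:
        vpa_lexicon[attr].add(word)
```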

80 Table 2 shows examples of learned terms for each attribute, with the total number of learned words in parentheses. [sent-126, score-0.017]

81 3.3 Text Classification with Patient Attributes Our ultimate goal is to incorporate patient attribute information into a text classifier to help it distinguish between patient-specific posts and general posts. [sent-128, score-1.345]

82 We designed three sets of features: Attribute Types: We create one feature for each attribute type, indicating whether a word of that attribute type appeared or not. [sent-129, score-0.666]

83 Attribute Types with Neighbor: For each word labeled as a patient attribute, we create two features by pairing its Attribute Type with a preceding or following word. [sent-130, score-0.674]
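
As a rough sketch of how such lexicon-based features could be encoded for the document classifier; the HAS_/PREV_/NEXT_ feature names and the lexicon structure are assumptions, not the paper's exact encoding.

```python
# Sketch of the lexicon-based features: one feature per attribute type
# found in the post, plus features pairing an attribute type with its
# preceding and following word. Feature names are illustrative.
def attribute_features(tokens, vpa_lexicon):
    features = set()
    for i, token in enumerate(tokens):
        word = token.lower()
        for attr_type, words in vpa_lexicon.items():
            if word in words:
                features.add("HAS_" + attr_type)  # Attribute Type feature
                if i > 0:
                    features.add(attr_type + "_PREV_" + tokens[i - 1].lower())
                if i + 1 < len(tokens):
                    features.add(attr_type + "_NEXT_" + tokens[i + 1].lower())
    return features

# Example with a toy lexicon:
lexicon = {"GENDER": {"m/n", "fs"}, "SPECIES/BREED": {"lab", "doxy"}}
print(attribute_features("a m/n lab presented with vomiting".split(), lexicon))
```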


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('patient', 0.626), ('veterinary', 0.47), ('posts', 0.304), ('attribute', 0.301), ('vpa', 0.165), ('tagger', 0.153), ('board', 0.143), ('pi', 0.128), ('message', 0.105), ('unannotated', 0.085), ('attributes', 0.076), ('crf', 0.069), ('patients', 0.066), ('siamese', 0.061), ('lexicon', 0.054), ('age', 0.052), ('texts', 0.05), ('harvested', 0.049), ('create', 0.048), ('neutered', 0.047), ('classifier', 0.045), ('animal', 0.043), ('ie', 0.042), ('medical', 0.041), ('introductory', 0.041), ('shorthand', 0.041), ('vin', 0.041), ('post', 0.039), ('possessor', 0.037), ('cat', 0.032), ('riloff', 0.03), ('gender', 0.03), ('annotators', 0.03), ('encode', 0.03), ('seven', 0.03), ('harvest', 0.029), ('male', 0.029), ('identify', 0.028), ('distinguish', 0.027), ('female', 0.025), ('skewed', 0.025), ('informal', 0.024), ('general', 0.024), ('distinguishing', 0.024), ('filled', 0.022), ('shepherd', 0.02), ('nigam', 0.02), ('cre', 0.02), ('ages', 0.02), ('cation', 0.02), ('confounding', 0.02), ('flowchart', 0.02), ('hosts', 0.02), ('jargon', 0.02), ('lehnert', 0.02), ('professionals', 0.02), ('ril', 0.02), ('sebastiani', 0.02), ('agreement', 0.02), ('classification', 0.02), ('mcintosh', 0.019), ('species', 0.019), ('ascribed', 0.019), ('justifies', 0.019), ('htoe', 0.019), ('unconventional', 0.019), ('utah', 0.019), ('sentences', 0.019), ('text', 0.018), ('stan', 0.017), ('ruihong', 0.017), ('advice', 0.017), ('br', 0.017), ('tedious', 0.017), ('lake', 0.017), ('disease', 0.017), ('terms', 0.017), ('history', 0.017), ('ask', 0.017), ('recognize', 0.017), ('extraction', 0.016), ('baker', 0.016), ('lsi', 0.016), ('animals', 0.016), ('minority', 0.016), ('salt', 0.016), ('virtually', 0.016), ('specific', 0.016), ('mention', 0.016), ('involving', 0.016), ('collection', 0.016), ('remaining', 0.016), ('kozareva', 0.016), ('idiosyncratic', 0.016), ('almost', 0.016), ('type', 0.016), ('fields', 0.015), ('gold', 0.015), ('forums', 0.015), ('naively', 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

2 0.13008127 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

3 0.10388272 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

4 0.083743952 156 emnlp-2013-Recurrent Continuous Translation Models

Author: Nal Kalchbrenner ; Phil Blunsom

Abstract: We introduce a class of probabilistic continuous translation models called Recurrent Continuous Translation Models that are purely based on continuous representations for words, phrases and sentences and do not rely on alignments or phrasal translation units. The models have a generation and a conditioning aspect. The generation of the translation is modelled with a target Recurrent Language Model, whereas the conditioning on the source sentence is modelled with a Convolutional Sentence Model. Through various experiments, we show first that our models obtain a perplexity with respect to gold translations that is > 43% lower than that of stateof-the-art alignment-based translation models. Secondly, we show that they are remarkably sensitive to the word order, syntax, and meaning of the source sentence despite lacking alignments. Finally we show that they match a state-of-the-art system when rescoring n-best lists of translations.

5 0.075497098 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

6 0.070581995 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

7 0.068905577 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

8 0.067277238 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM

9 0.039867554 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

10 0.03802311 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

11 0.037793893 29 emnlp-2013-Automatic Domain Partitioning for Multi-Domain Learning

12 0.034933396 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition

13 0.033964824 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

14 0.030565478 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation

15 0.029917248 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging

16 0.028583575 112 emnlp-2013-Joint Coreference Resolution and Named-Entity Linking with Multi-Pass Sieves

17 0.027587216 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation

18 0.026787713 24 emnlp-2013-Application of Localized Similarity for Web Documents

19 0.026733929 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

20 0.026702516 161 emnlp-2013-Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.092), (1, 0.043), (2, -0.029), (3, -0.045), (4, -0.014), (5, -0.005), (6, 0.027), (7, 0.078), (8, 0.04), (9, -0.038), (10, -0.051), (11, 0.041), (12, 0.058), (13, 0.06), (14, -0.132), (15, 0.003), (16, -0.005), (17, 0.016), (18, -0.239), (19, 0.029), (20, 0.223), (21, 0.087), (22, 0.026), (23, -0.013), (24, -0.023), (25, -0.018), (26, 0.042), (27, -0.043), (28, 0.058), (29, -0.017), (30, -0.086), (31, -0.002), (32, -0.029), (33, -0.102), (34, -0.05), (35, -0.082), (36, 0.121), (37, -0.048), (38, 0.149), (39, 0.035), (40, 0.042), (41, -0.093), (42, -0.026), (43, 0.072), (44, 0.107), (45, 0.055), (46, 0.043), (47, -0.141), (48, -0.123), (49, 0.084)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97448784 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

2 0.67147261 204 emnlp-2013-Word Level Language Identification in Online Multilingual Communication

Author: Dong Nguyen ; A. Seza Dogruoz

Abstract: Multilingual speakers switch between languages in online and spoken communication. Analyses of large scale multilingual data require automatic language identification at the word level. For our experiments with multilingual online discussions, we first tag the language of individual words using language models and dictionaries. Secondly, we incorporate context to improve the performance. We achieve an accuracy of 98%. Besides word level accuracy, we use two new metrics to evaluate this task.

3 0.52119678 4 emnlp-2013-A Dataset for Research on Short-Text Conversations

Author: Hao Wang ; Zhengdong Lu ; Hang Li ; Enhong Chen

Abstract: Natural language conversation is widely regarded as a highly difficult problem, which is usually attacked with either rule-based or learning-based models. In this paper we propose a retrieval-based automatic response model for short-text conversation, to exploit the vast amount of short conversation instances available on social media. For this purpose we introduce a dataset of short-text conversation based on the real-world instances from Sina Weibo (a popular Chinese microblog service), which will be soon released to public. This dataset provides rich collection of instances for the research on finding natural and relevant short responses to a given short text, and useful for both training and testing of conversation models. This dataset consists of both naturally formed conversations, manually labeled data, and a large repository of candidate responses. Our preliminary experiments demonstrate that the simple retrieval-based conversation model performs reasonably well when combined with the rich instances in our dataset.

4 0.49701053 130 emnlp-2013-Microblog Entity Linking by Leveraging Extra Posts

Author: Yuhang Guo ; Bing Qin ; Ting Liu ; Sheng Li

Abstract: Linking name mentions in microblog posts to a knowledge base, namely microblog entity linking, is useful for text mining tasks on microblog. Entity linking in long text has been well studied in previous works. However few work has focused on short text such as microblog post. Microblog posts are short and noisy. Previous method can extract few features from the post context. In this paper we propose to use extra posts for the microblog entity linking task. Experimental results show that our proposed method significantly improves the linking accuracy over traditional methods by 8.3% and 7.5% respectively.

5 0.31171998 156 emnlp-2013-Recurrent Continuous Translation Models

Author: Nal Kalchbrenner ; Phil Blunsom

Abstract: We introduce a class of probabilistic continuous translation models called Recurrent Continuous Translation Models that are purely based on continuous representations for words, phrases and sentences and do not rely on alignments or phrasal translation units. The models have a generation and a conditioning aspect. The generation of the translation is modelled with a target Recurrent Language Model, whereas the conditioning on the source sentence is modelled with a Convolutional Sentence Model. Through various experiments, we show first that our models obtain a perplexity with respect to gold translations that is > 43% lower than that of stateof-the-art alignment-based translation models. Secondly, we show that they are remarkably sensitive to the word order, syntax, and meaning of the source sentence despite lacking alignments. Finally we show that they match a state-of-the-art system when rescoring n-best lists of translations.

6 0.30396348 89 emnlp-2013-Gender Inference of Twitter Users in Non-English Contexts

7 0.27963644 72 emnlp-2013-Elephant: Sequence Labeling for Word and Sentence Segmentation

8 0.27179781 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students

9 0.25780904 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition

10 0.25287971 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication

11 0.23477517 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM

12 0.22292732 26 emnlp-2013-Assembling the Kazakh Language Corpus

13 0.22194289 23 emnlp-2013-Animacy Detection with Voting Models

14 0.21576987 129 emnlp-2013-Measuring Ideological Proportions in Political Speeches

15 0.21251646 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

16 0.20760266 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction

17 0.19523327 81 emnlp-2013-Exploring Demographic Language Variations to Improve Multilingual Sentiment Analysis in Social Media

18 0.19515182 32 emnlp-2013-Automatic Idiom Identification in Wiktionary

19 0.19435877 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities

20 0.19285701 142 emnlp-2013-Open-Domain Fine-Grained Class Extraction from Web Search Queries


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.038), (18, 0.025), (22, 0.035), (30, 0.079), (50, 0.011), (51, 0.183), (66, 0.019), (71, 0.023), (75, 0.038), (77, 0.017), (80, 0.353), (90, 0.012), (96, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75101757 46 emnlp-2013-Classifying Message Board Posts with an Extracted Lexicon of Patient Attributes

Author: Ruihong Huang ; Ellen Riloff

Abstract: The goal of our research is to distinguish veterinary message board posts that describe a case involving a specific patient from posts that ask a general question. We create a text classifier that incorporates automatically generated attribute lists for veterinary patients to tackle this problem. Using a small amount of annotated data, we train an information extraction (IE) system to identify veterinary patient attributes. We then apply the IE system to a large collection of unannotated texts to produce a lexicon of veterinary patient attribute terms. Our experimental results show that using the learned attribute lists to encode patient information in the text classifier yields improved performance on this task.

2 0.65317285 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

Author: Will Y. Zou ; Richard Socher ; Daniel Cer ; Christopher D. Manning

Abstract: We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.

3 0.51005447 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology

Author: Yiping Jin ; Min-Yen Kan ; Jun-Ping Ng ; Xiangnan He

Abstract: This paper presents DefMiner, a supervised sequence labeling system that identifies scientific terms and their accompanying definitions. DefMiner achieves 85% F1 on a Wikipedia benchmark corpus, significantly improving the previous state-of-the-art by 8%. We exploit DefMiner to process the ACL Anthology Reference Corpus (ARC) – a large, real-world digital library of scientific articles in computational linguistics. The resulting automatically-acquired glossary represents the terminology defined over several thousand individual research articles. We highlight several interesting observations: more definitions are introduced for conference and workshop papers over the years and that multiword terms account for slightly less than half of all terms. Obtaining a list of popular , defined terms in a corpus ofcomputational linguistics papers, we find that concepts can often be categorized into one of three categories: resources, methodologies and evaluation metrics.

4 0.50813335 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks

Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou

Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that, people with similar academic, business or social connections (e.g. co-major, co-university, and cocorporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach. 1

5 0.50793058 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu

Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.

6 0.50746912 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

7 0.50711358 193 emnlp-2013-Unsupervised Induction of Cross-Lingual Semantic Relations

8 0.50710547 69 emnlp-2013-Efficient Collective Entity Linking with Stacking

9 0.50567222 167 emnlp-2013-Semi-Markov Phrase-Based Monolingual Alignment

10 0.50510705 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery

11 0.50496089 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

12 0.50455779 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity

13 0.50388479 152 emnlp-2013-Predicting the Presence of Discourse Connectives

14 0.50377142 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

15 0.50332248 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

16 0.50295252 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation

17 0.50265747 13 emnlp-2013-A Study on Bootstrapping Bilingual Vector Spaces from Non-Parallel Data (and Nothing Else)

18 0.50254333 80 emnlp-2013-Exploiting Zero Pronouns to Improve Chinese Coreference Resolution

19 0.50123948 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

20 0.50106353 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types