
356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia


Source: pdf

Author: Zhigang Wang ; Zhixing Li ; Juanzi Li ; Jie Tang ; Jeff Z. Pan

Abstract: Wikipedia infoboxes are a valuable source of structured knowledge for global knowledge sharing. However, infobox information is very incomplete and imbalanced among the Wikipedias in different languages. It is a promising but challenging problem to utilize the rich structured knowledge from a source language Wikipedia to help complete the missing infoboxes for a target language. In this paper, we formulate the problem of cross-lingual knowledge extraction from multilingual Wikipedia sources, and present a novel framework, called WikiCiKE, to solve this problem. An instance-based transfer learning method is utilized to overcome the problems of topic drift and translation errors. Our experimental results demonstrate that WikiCiKE outperforms the monolingual knowledge extraction method and the translation-based method.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Wikipedia infoboxes are a valuable source of structured knowledge for global knowledge sharing. [sent-14, score-0.353]

2 However, infobox information is very incomplete and imbalanced among the Wikipedias in different languages. [sent-15, score-0.295]

3 It is a promising but challenging problem to utilize the rich structured knowledge from a source language Wikipedia to help complete the missing infoboxes for a target language. [sent-16, score-0.465]

4 In this paper, we formulate the problem of cross-lingual knowledge extraction from multilingual Wikipedia sources, and present a novel framework, called WikiCiKE, to solve this problem. [sent-17, score-0.104]

5 An instance-based transfer learning method is utilized to overcome the problems of topic drift and translation errors. [sent-18, score-0.12]

6 Our experimental results demonstrate that WikiCiKE outperforms the monolingual knowledge extraction method and the translation-based method. [sent-19, score-0.129]

7 1 Introduction In recent years, automatic knowledge extraction using Wikipedia has attracted significant interest in research fields such as the semantic web. [sent-20, score-0.104]

8 As a valuable source of structured knowledge, Wikipedia infoboxes have been utilized to build linked open data (Suchanek et al. [sent-21, score-0.319]

9 However, most infoboxes in different Wikipedia language versions are missing. [sent-33, score-0.255]

10 Figure 1 shows the statistics of article numbers and infobox information for six major Wikipedias. [sent-34, score-0.368]

11 82% of the articles have infoboxes on average, and the numbers of infoboxes for these Wikipedias vary significantly. [sent-36, score-0.634]

12 For instance, the English Wikipedia has 13 times more infoboxes than the Chinese Wikipedia and 3. [sent-37, score-0.255]

13 5 times more infoboxes than the second largest Wikipedia of German language. [sent-38, score-0.255]

14 To solve this problem, KYLIN has been proposed to extract the missing infoboxes from unstructured article texts for the English Wikipedia (Wu and Weld, 2007). [sent-40, score-0.472]

15 KYLIN performs well when sufficient training data are available, and such techniques as shrinkage and retraining have been used to increase recall from English Wikipedia’s long tail of sparse infobox classes (Weld et al. [sent-41, score-0.325]

16 The extraction performance of KYLIN is limited by the number of available training samples. [sent-44, score-0.105]

17 These methods concentrate on translating existing infoboxes from a richer source language version of Wikipedia into the target language. [sent-51, score-0.372]

18 The recall of new target infoboxes is highly limited by the number of equivalent cross-lingual articles and the number of existing source infoboxes. [sent-52, score-0.521]

19 Hence, the challenge remains: how could we supplement the missing infoboxes for the rest 79. [sent-55, score-0.317]

20 On the other hand, the numbers of existing infobox attributes in different languages are highly imbalanced. [sent-57, score-0.344]

21 Table 1 shows the comparison of the numbers of the articles for the attributes in template PERSON between English and Chinese Wikipedia. [sent-58, score-0.309]

22 In this paper, we have the following hypothesis: one can use the rich English (auxiliary) information to assist the Chinese (target) infobox extraction. [sent-61, score-0.268]

23 In general, we address the problem of crosslingual knowledge extraction by using the imbalance between Wikipedias of different languages. [sent-62, score-0.135]

24 For each attribute, we aim to learn an extractor to find the missing value from the unstructured article texts in the target Wikipedia by using the rich information in the source language. [sent-63, score-0.419]

25 Specifically, we treat this cross-lingual information extraction task as a transfer learning-based binary classification problem. [sent-64, score-0.147]

26 We propose a transfer learning-based cross-lingual knowledge extraction framework called WikiCiKE. (Footnote 1: Chinese-English denotes the task of Chinese Wikipedia infobox completion using English Wikipedia.) [sent-66, score-0.477]

27 The extraction performance for the target Wikipedia is improved by using rich infoboxes and textual information in the source language. [sent-67, score-0.445]

28 We propose the TrAdaBoost-based extractor training method to avoid the problems of topic drift and translation errors of the source Wikipedia. [sent-69, score-0.169]

29 Chinese-English experiments for four typical attributes demonstrate that WikiCiKE outperforms both the monolingual extraction method and current translation-based method. [sent-72, score-0.174]

30 47% for recall in the template named person are achieved when only 30 target training articles are available. [sent-75, score-0.402]

31 2 Preliminaries In this section, we introduce some basic concepts regarding Wikipedia, formally defining the key problem of cross-lingual knowledge extraction and providing an overview of the WikiCiKE framework. [sent-82, score-0.104]

32 1 Wiki Knowledge Base and Wiki Article We consider each language version of Wikipedia as a wiki knowledge base, which can be represented as K = {a_i}_{i=1}^{p}, where a_i is a disambiguated article in K and p is the size of K. [sent-84, score-0.136]

33 • tp = {attr_i}_{i=1}^{r} is the infobox template associated with ib, where r is the number of attributes for one specific template, and C denotes the set of categories to which the article a belongs. [sent-86, score-0.483]

34 We will use “name in TEMPLATE PERSON” to refer to the attribute attrname in the template tpPERSON. [sent-89, score-0.216]

35 In this cross-lingual task, we use the source (S) and target (T) languages to denote the languages of auxiliary and target Wikipedias, respectively. [sent-90, score-0.198]

36 For example, KS indicates the source wiki knowledge base, and KT denotes the target wiki knowledge base. [sent-91, score-0.339]

37 2 Problem Formulation Mining new infobox information from unstructured article texts is actually a multi-template, multi-slot information extraction problem. [sent-93, score-0.496]

38 In our task, each template represents an infobox template and each slot denotes an attribute. [sent-94, score-0.486]

39 In the WikiCiKE framework, for each attribute attrT in an infobox template tpT, we treat the task of missing value extraction as a binary classification problem. [sent-95, score-0.649]

40 It predicts whether a particular word (token) from the article text is the extraction target (Finn and Kushmerick, 2004; Lafferty et al. [sent-96, score-0.254]

41 Given an attribute attrT and an instance (word/token) x_i, X_S = {x_i}_{i=1}^{n} and X_T = {x_i}_{i=n+1}^{n+m} are the sets of instances (words/tokens) in the source and the target language, respectively. [sent-98, score-0.224]

42 Given the combined training data set TD, our objective is to estimate a hypothesis f : X → Y that minimizes the prediction error on testing data in the target language. [sent-105, score-0.138]

43 3 WikiCiKE Framework WikiCiKE learns an extractor for a given attribute attrT in the target Wikipedia. [sent-109, score-0.243]

44 As shown in Figure 3, WikiCiKE contains four key components: (1) Automatic Training Data Generation: given the target attribute attrT and two wiki knowledge bases KS and KT, WikiCiKE first generates the training data set TD = TDS ∪ TDT automatically. [sent-110, score-0.331]

45 (3) Template Classification: WikiCiKE then determines proper candidate articles which are suitable to generate the missing value of attrT. [sent-112, score-0.216]

46 (4) WikiCiKE Extraction: given a candidate article a, WikiCiKE uses the learned extractor f to label each word in the text of a, and generate the extraction result in the end. [sent-113, score-0.228]

47 1 Automatic Training Data Generation To generate the training data for the target attribute attrT, we first determine the equivalent cross-lingual attribute attrS. [sent-117, score-0.327]

48 The manual alignment is worthwhile because thousands of articles belonging to the same template may benefit from it, and at the same time it is not very costly. [sent-123, score-0.233]

49 In Chinese Wikipedia, the top 100 templates have covered nearly 80% of the articles which have been assigned a template. [sent-124, score-0.124]

50 Once the aligned attribute mapping attrT ↔ attrS is obtained, we collect the articles from both KS and KT containing the corresponding attr. [sent-125, score-0.231]

51 The collected articles from KS are translated into the target language. [sent-126, score-0.205]

52 For each collected article a = {title, text, ib, tp, C} and its value of attr, we can automatically label each word x in text, according to whether x and its neighbors are contained by the value. [sent-128, score-0.13]

53 The target training data TDT is directly generated from articles in the target language Wikipedia. [sent-214, score-0.318]

54 Articles from the source language Wikipedia are translated into the target language in advance and then transformed into training data TDS. [sent-215, score-0.149]
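
The automatic labeling step described above (marking each word by whether it and its neighbors are contained in the attribute value) could look roughly like the following. This is a minimal sketch under stated assumptions: whitespace tokenization and exact token matching are simplifications chosen here, and the function name `label_tokens` is hypothetical.

```python
def label_tokens(text_tokens, value_tokens):
    """Return a 0/1 label per token: 1 if the token lies inside an exact
    occurrence of the attribute value in the article text, else 0.

    Assumption: exact token-sequence matching; the paper's actual matching
    rule may be more tolerant.
    """
    labels = [0] * len(text_tokens)
    k = len(value_tokens)
    if k == 0:
        return labels
    for start in range(len(text_tokens) - k + 1):
        if text_tokens[start:start + k] == value_tokens:
            for i in range(start, start + k):
                labels[i] = 1
    return labels


# Made-up example sentence and infobox value.
tokens = "He studied at Example University in 1970".split()
labels = label_tokens(tokens, "Example University".split())
```

Applied to articles from the target Wikipedia, this would produce TDT; applied to translated source articles, TDS.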

55 2 WikiCiKE Training Given the attribute attrT, we want to train a classifier f : X → Y that can minimize the prediction error for the testing data in the target language. [sent-218, score-0.213]

56 , 2007), which is an instance-based transfer learning algorithm that was first proposed by Dai to find [...] TrAdaBoost requires that the source training instances XS and target training instances XT be drawn from the same feature space. [sent-220, score-0.255]

57 In WikiCiKE, the source articles are translated into the target language in advance to satisfy this requirement. [sent-221, score-0.241]

58 Specifically, the weight-updating strategy for the source instances is decided by the loss on the target instances. [sent-225, score-0.117]
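
To make the weight-updating idea concrete, here is a sketch of one update round in the commonly cited form of TrAdaBoost: the error is measured on target instances only, misclassified source instances are down-weighted by a fixed factor, and misclassified target instances are up-weighted. The function name, argument layout, and the clamping of the error rate are assumptions made for this illustration, not details taken from the summary above.

```python
import math


def tradaboost_update(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost-style weight update.

    w_src, w_tgt     : current weights of source / target instances
    err_src, err_tgt : 0/1 lists, 1 where the weak learner was wrong
    n_rounds         : total number of boosting rounds N
    Assumes at least one source instance and positive target weights.
    """
    n = len(w_src)
    # Fixed factor for source instances, depends only on n and N.
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_rounds))
    # Weighted error of the weak learner on the target instances only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    # Assumption: clamp eps so that beta_t stays well defined (eps < 0.5).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    # Wrong source instances lose weight; wrong target instances gain weight.
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]
    return new_src, new_tgt


new_src, new_tgt = tradaboost_update(
    w_src=[1.0, 1.0], w_tgt=[1.0, 1.0, 1.0, 1.0],
    err_src=[1, 0], err_tgt=[1, 0, 0, 0], n_rounds=10,
)
```

The intended effect is that translated source instances that disagree with the target data (for example because of topic drift or translation errors) gradually lose influence.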

59 [Figure caption: Weight-updating Strategy of TrAdaBoost.] 3 Template Classification Before using the learned classifier f to extract the missing infobox value for the target attribute attrT, we must select the correct articles to be processed. [sent-232, score-0.672]

60 For example, the article a_{New York} is not a proper article for extracting the missing value of the attribute attr_{birth day}. [sent-233, score-0.399]

61 If a already has an incomplete infobox, it is clear that the correct tp is the template of its own infobox ib. [sent-234, score-0.434]

62 For those articles that have no infoboxes, we use the classical 5-nearest neighbor algorithm to determine their templates (Roussopoulos et al. [sent-235, score-0.124]

63 In this paper, we concentrate on the WikiCiKE training and extraction components. [sent-241, score-0.105]
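
The 5-nearest-neighbor template assignment mentioned above can be sketched as follows. The summary does not say which features or distance are used, so this example assumes, purely for illustration, that articles are compared by Jaccard overlap of their category sets; `jaccard`, `knn_template`, and the sample data are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_template(categories, labeled, k=5):
    """Pick the majority template among the k most similar labeled articles.

    labeled: list of (category_set, template_name) pairs.
    Assumption: similarity is category overlap; this is not from the paper.
    """
    ranked = sorted(labeled, key=lambda item: jaccard(categories, item[0]),
                    reverse=True)
    votes = Counter(template for _, template in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up labeled articles (category set, template).
labeled = [
    ({"people", "scientists"}, "PERSON"),
    ({"people", "writers"}, "PERSON"),
    ({"people"}, "PERSON"),
    ({"cities"}, "CITY"),
    ({"cities", "capitals"}, "CITY"),
    ({"rivers"}, "RIVER"),
]
predicted = knn_template({"people", "scientists"}, labeled)
```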

64 4 WikiCiKE Extraction Given an article a determined by template classification, we generate the missing value of attr from the corresponding text. [sent-243, score-0.369]
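
Once the extractor has labeled each word of a candidate article, the labeled words still have to be turned into a value. The summary does not spell out this step, so the sketch below makes a simple assumption: take the longest contiguous run of positively labeled tokens. The function name `extract_value` and the example are hypothetical.

```python
def extract_value(tokens, labels):
    """Return the longest contiguous run of tokens labeled 1, joined by
    spaces, or None if no token is labeled 1.

    Assumption: longest-run selection and space joining are illustrative
    choices; unsegmented text such as Chinese would need a different join.
    """
    best, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == 1:
            current.append(tok)
            if len(current) > len(best):
                best = list(current)
        else:
            current = []
    return " ".join(best) if best else None


tokens_ex = ["She", "is", "a", "film", "director", "and", "actor"]
value = extract_value(tokens_ex, [0, 0, 0, 1, 1, 0, 1])
```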

65 4 Experiments In this section, we present our experiments to evaluate the effectiveness of WikiCiKE, where we focus on the Chinese-English case; in other words, the target language is Chinese and the source language is English. [sent-258, score-0.117]

66 It is part of our future work to try other language pairs whose Wikipedias are imbalanced in infobox information, such as English-Dutch. [sent-259, score-0.268]

67 For each attribute, we collect both labeled articles (articles that contain the corresponding attribute attr) and unlabeled articles in Chinese. [sent-264, score-0.355]

68 We split the labeled articles into two subsets AT and Atest (AT ∩ Atest = ∅), in which AT is used as target training articles and Atest is used as the first testing set. [sent-265, score-0.262]

69 For the unlabeled articles, represented as A′test, we manually label their infoboxes with their texts and use them as the second testing set. [sent-266, score-0.308]

70 For each attribute, we also collect a set of labeled articles AS in English as the source training data. [sent-267, score-0.192]

71 2 Comparison Methods We compare our WikiCiKE method with two different kinds of methods, the monolingual knowledge extraction method and the translation-based method. [sent-274, score-0.129]

72 It obtains the values by two steps: finding their counterparts (if available) in English using Wikipedia cross-lingual links and attribute alignments, and translating them into Chinese. [sent-282, score-0.107]

73 We compare WikiCiKE with KE-Mon on the first testing data set Atest, where most values can be found in the articles’ texts in those labeled articles, in order to demonstrate the performance improvement by using crosslingual knowledge transfer. [sent-285, score-0.115]

74 We incrementally increase the number of target training articles from 10 to 500 (if available) to compare WikiCiKE with KE-Mon in different situations. [sent-318, score-0.237]

75 We can see that WikiCiKE outperforms KE-Mon on all three attributes, especially when the number of target training samples is small. [sent-321, score-0.113]

76 Although the recall for alma mater and the precision for nationality of WikiCiKE are lower than those of KE-Mon when only 10 target training articles are available, WikiCiKE performs better than KE-Mon if we take into consideration both precision and recall. [sent-322, score-0.372]
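
One common way to "take into consideration both precision and recall" is the F1 score, their harmonic mean. Whether the paper uses exactly this measure is not stated in the summary, so the following is an assumed illustration; the function name and the (article, value) pairs are made up.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 over sets of extracted items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # correctly extracted items
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical (article, extracted value) pairs.
pred = {("a1", "x"), ("a2", "y"), ("a3", "z")}
gold = {("a1", "x"), ("a2", "q"), ("a3", "z"), ("a4", "w")}
p, r, f1 = precision_recall_f1(pred, gold)
```

With two of three predictions correct and two of four gold values found, precision is 2/3, recall is 1/2, and F1 is 4/7.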

77 [Figure 4: Results for TEMPLATE PERSON. Panels recoverable from the source: (c) nationality, (d) average improvements; x-axis: number of target training articles.] [sent-326, score-0.372]

78 We can see that WikiCiKE yields significant improvements when only a few articles are available in target language and the improvements tend to decrease as the number of target articles is increased. [sent-330, score-0.41]

79 In this case, the articles in the target language are sufficient to train the extractors alone. [sent-331, score-0.236]

80 First, because of the limit of cross-lingual links and infoboxes in English Wikipedia, only a very small set of values is found by KE-Tr. [sent-342, score-0.255]

81 WikiCiKE uses translators too, but it has better tolerance to translation errors because the extracted value is from the target article texts instead of the output of translators. [sent-344, score-0.239]

82 When the number of target training articles is less than 100, the χ is much less than 10. [sent-358, score-0.237]

83 When only 30 target training samples are available, WikiCiKE reaches performance comparable to that of KE-Mon using 300-500 target training samples. [sent-374, score-0.226]

84 39%) attributes have less than 30 and 200 labeled articles respectively. [sent-377, score-0.2]

85 We can see that WikiCiKE can save considerable human labor when no sufficient target training samples are available. [sent-378, score-0.113]

86 For attribute occupation when 30 target training samples are used, there are 71 errors. [sent-380, score-0.256]

87 The second category is because of the incomplete infoboxes (36. [sent-385, score-0.282]

88 5 Related Work Some approaches of knowledge extraction from the open Web have been proposed (Wu et al. [sent-392, score-0.104]

89 1 Monolingual Infobox Extraction KYLIN is the first system to autonomously extract the missing infoboxes from the corresponding article texts by using a self-supervised learning method (Wu and Weld, 2007). [sent-397, score-0.445]

90 Different from Wu’s research, WikiCiKE is a cross-lingual knowledge extraction framework, which leverages rich knowledge in the other language to improve extraction performance in the target Wikipedia. [sent-401, score-0.289]

91 2 Cross-lingual Infobox Completion Current translation based methods usually contain two steps: cross-lingual attribute alignment and value translation. [sent-403, score-0.137]

92 The attribute alignment strategies can be grouped into two categories: cross-lingual link based methods (Bouma et al. [sent-404, score-0.107]

93 After the first step, the value in the source language is translated into the target language. [sent-410, score-0.147]

94 However, recall of these methods is limited by the number of equivalent cross-lingual articles and the number of infoboxes in the source language. [sent-414, score-0.44]

95 WikiCiKE attempts to mine the missing infoboxes directly from the article texts and thus achieves a higher recall compared with these methods, as shown in Section 4. [sent-416, score-0.47]

96 However, few efforts have been spent in the information extraction tasks with knowledge transfer. [sent-423, score-0.104]

97 6 Conclusion and Future Work In this paper we proposed a general cross-lingual knowledge extraction framework called WikiCiKE, in which extraction performance in the target Wikipedia is improved by using rich infoboxes in the source language. [sent-424, score-0.549]

98 Chinese-English experimental results on four typical attributes showed that WikiCiKE significantly outperforms both the current translation based methods and the monolingual extraction methods. [sent-426, score-0.174]

99 In theory, WikiCiKE can be applied to any two wiki knowledge bases of different languages. [sent-427, score-0.111]

100 Firstly, more attributes in more infobox templates should be explored to make our results much stronger. [sent-429, score-0.344]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wikicike', 0.713), ('infobox', 0.268), ('infoboxes', 0.255), ('wikipedia', 0.145), ('articles', 0.124), ('attrt', 0.119), ('tds', 0.119), ('template', 0.109), ('attribute', 0.107), ('atest', 0.104), ('article', 0.1), ('pan', 0.099), ('tdt', 0.085), ('target', 0.081), ('wiki', 0.08), ('attributes', 0.076), ('kylin', 0.074), ('tradaboost', 0.074), ('transfer', 0.074), ('extraction', 0.073), ('adar', 0.073), ('attr', 0.068), ('wikipedias', 0.068), ('missing', 0.062), ('adafre', 0.059), ('ferr', 0.059), ('bouma', 0.057), ('jeff', 0.056), ('chinese', 0.055), ('extractor', 0.055), ('wu', 0.053), ('xi', 0.053), ('dai', 0.052), ('ib', 0.047), ('drift', 0.046), ('alma', 0.045), ('mater', 0.045), ('nationality', 0.045), ('sensoy', 0.045), ('weld', 0.044), ('td', 0.041), ('bizer', 0.04), ('lavelli', 0.039), ('fokoue', 0.039), ('juanzi', 0.039), ('gates', 0.039), ('ks', 0.038), ('owl', 0.036), ('occupation', 0.036), ('source', 0.036), ('sigmod', 0.035), ('reasoning', 0.034), ('wt', 0.033), ('training', 0.032), ('knowledge', 0.031), ('crosslingual', 0.031), ('ndez', 0.031), ('extractors', 0.031), ('person', 0.031), ('nguyen', 0.03), ('tp', 0.03), ('value', 0.03), ('artciels', 0.03), ('attri', 0.03), ('aumueller', 0.03), ('fissaha', 0.03), ('haixun', 0.03), ('heino', 0.03), ('hotho', 0.03), ('kicike', 0.03), ('mcilraith', 0.03), ('norman', 0.03), ('rdfs', 0.03), ('roussopoulos', 0.03), ('textl', 0.03), ('traninig', 0.03), ('volkel', 0.03), ('zhigang', 0.03), ('linked', 0.028), ('qiang', 0.028), ('texts', 0.028), ('significance', 0.028), ('jie', 0.027), ('kt', 0.027), ('title', 0.027), ('incomplete', 0.027), ('exploitation', 0.027), ('unstructured', 0.027), ('hogan', 0.026), ('achille', 0.026), ('gosse', 0.026), ('wenyuan', 0.026), ('aidan', 0.026), ('kushmerick', 0.026), ('horrocks', 0.026), ('tang', 0.026), ('monolingual', 0.025), ('testing', 0.025), ('ai', 0.025), ('recall', 0.025)]
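
The page does not say how the tfidf similarity values in the list that follows are computed. A common choice, assumed here only for illustration, is cosine similarity between sparse tf-idf vectors like the one above. In this sketch the first vector reuses a few weights from the list above, while the second vector and the function name `cosine` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# A few top terms of this paper, taken from the list above.
this_paper = dict([('wikicike', 0.713), ('infobox', 0.268),
                   ('infoboxes', 0.255), ('wikipedia', 0.145)])
# Hypothetical vector for another paper, for illustration only.
other_paper = {'wikipedia': 0.5, 'quality': 0.5}

sim_self = cosine(this_paper, this_paper)
sim_other = cosine(this_paper, other_paper)
```

Under this assumption, a paper compared with itself scores 1.0, which matches the `same-paper` row of the tfidf list.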

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

2 0.085205801 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

Author: Oleg Rokhlenko ; Idan Szpektor

Abstract: We introduce the novel task of automatically generating questions that are relevant to a text but do not appear in it. One motivating example of its application is for increasing user engagement around news articles by suggesting relevant comparable questions, such as “is Beyonce a better singer than Madonna?”, for the user to answer. We present the first algorithm for the task, which consists of: (a) offline construction of a comparable question template database; (b) ranking of relevant templates to a given article; and (c) instantiation of templates only with entities in the article whose comparison under the template’s relation makes sense. We tested the suggestions generated by our algorithm via a Mechanical Turk experiment, which showed a significant improvement over the strongest baseline of more than 45% in all metrics.

3 0.082837015 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

Author: Shane Bergsma ; Benjamin Van Durme

Abstract: We describe a novel approach for automatically predicting the hidden demographic properties of social media users. Building on prior work in common-sense knowledge acquisition from third-person text, we first learn the distinguishing attributes of certain classes of people. For example, we learn that people in the Female class tend to have maiden names and engagement rings. We then show that this knowledge can be used in the analysis of first-person communication; knowledge of distinguishing attributes allows us to both classify users and to bootstrap new training examples. Our novel approach enables substantial improvements on the widely-studied task of user gender prediction, obtaining a 20% relative error reduction over the current state-of-the-art.

4 0.078412443 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

Author: Oliver Ferschke ; Iryna Gurevych ; Marc Rittberger

Abstract: With the increasing amount of user generated reference texts in the web, automatic quality assessment has become a key challenge. However, only a small amount of annotated data is available for training quality assessment systems. Wikipedia contains a large amount of texts annotated with cleanup templates which identify quality flaws. We show that the distribution of these labels is topically biased, since they cannot be applied freely to any arbitrary article. We argue that it is necessary to consider the topical restrictions of each label in order to avoid a sampling bias that results in a skewed classifier and overly optimistic evaluation results. We factor out the topic bias by extracting reliable training instances from the revision history which have a topic distribution similar to the labeled articles. This approach better reflects the situation a classifier would face in a real-life application.

5 0.072055504 249 acl-2013-Models of Semantic Representation with Visual Attributes

Author: Carina Silberer ; Vittorio Ferrari ; Mirella Lapata

Abstract: We consider the problem of grounding the meaning of words in the physical world and focus on the visual modality which we represent by visual attributes. We create a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We show that these bimodal models give a better fit to human word association data compared to amodal models and word representations based on handcrafted norming data.

6 0.066223599 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

7 0.057470154 154 acl-2013-Extracting bilingual terminologies from comparable corpora

8 0.05694373 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations

9 0.056912523 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models

10 0.055060416 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

11 0.053075287 242 acl-2013-Mining Equivalent Relations from Linked Data

12 0.050190255 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

13 0.049590871 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text

14 0.04832476 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits

15 0.047892075 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation

16 0.04773597 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

17 0.046637535 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction

18 0.045968834 97 acl-2013-Cross-lingual Projections between Languages from Different Families

19 0.044613346 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

20 0.044541337 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.133), (1, 0.008), (2, -0.006), (3, -0.017), (4, 0.059), (5, 0.008), (6, -0.026), (7, -0.008), (8, 0.028), (9, 0.004), (10, -0.025), (11, -0.046), (12, -0.005), (13, 0.042), (14, 0.01), (15, 0.026), (16, 0.006), (17, 0.024), (18, -0.018), (19, -0.014), (20, 0.01), (21, 0.019), (22, -0.019), (23, 0.055), (24, 0.058), (25, -0.004), (26, 0.048), (27, -0.032), (28, 0.01), (29, 0.025), (30, -0.029), (31, 0.028), (32, 0.082), (33, 0.055), (34, -0.037), (35, 0.044), (36, -0.022), (37, 0.0), (38, -0.042), (39, -0.032), (40, -0.017), (41, 0.052), (42, -0.04), (43, -0.021), (44, 0.003), (45, 0.02), (46, 0.026), (47, -0.046), (48, 0.01), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89720398 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

2 0.68778247 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users

Author: Shane Bergsma ; Benjamin Van Durme

Abstract: We describe a novel approach for automatically predicting the hidden demographic properties of social media users. Building on prior work in common-sense knowledge acquisition from third-person text, we first learn the distinguishing attributes of certain classes of people. For example, we learn that people in the Female class tend to have maiden names and engagement rings. We then show that this knowledge can be used in the analysis of first-person communication; knowledge of distinguishing attributes allows us to both classify users and to bootstrap new training examples. Our novel approach enables substantial improvements on the widely-studied task of user gender prediction, obtaining a 20% relative error reduction over the current state-of-the-art.

3 0.66755015 346 acl-2013-The Impact of Topic Bias on Quality Flaw Prediction in Wikipedia

Author: Oliver Ferschke ; Iryna Gurevych ; Marc Rittberger

Abstract: With the increasing amount of user generated reference texts in the web, automatic quality assessment has become a key challenge. However, only a small amount of annotated data is available for training quality assessment systems. Wikipedia contains a large amount of texts annotated with cleanup templates which identify quality flaws. We show that the distribution of these labels is topically biased, since they cannot be applied freely to any arbitrary article. We argue that it is necessary to consider the topical restrictions of each label in order to avoid a sampling bias that results in a skewed classifier and overly optimistic evaluation results. We factor out the topic bias by extracting reliable training instances from the revision history which have a topic distribution similar to the labeled articles. This approach better reflects the situation a classifier would face in a real-life application.

4 0.6508531 14 acl-2013-A Novel Classifier Based on Quantum Computation

Author: Ding Liu ; Xiaofang Yang ; Minghu Jiang

Abstract: In this article, we propose a novel classifier based on quantum computation theory. Different from existing methods, we consider the classification as an evolutionary process of a physical system and build the classifier by using the basic quantum mechanics equation. The performance of the experiments on two datasets indicates feasibility and potentiality of the quantum classifier.

5 0.63362998 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text

Author: Mohamed Amir Yosef ; Sandro Bauer ; Johannes Hoffart ; Marc Spaniol ; Gerhard Weikum

Abstract: Recent research has shown progress in achieving high-quality, very fine-grained type classification in hierarchical taxonomies. Within such a multi-level type hierarchy with several hundreds of types at different levels, many entities naturally belong to multiple types. In order to achieve high precision in type classification, current approaches are either limited to certain domains or require time consuming multistage computations. As a consequence, existing systems are incapable of performing ad-hoc type classification on arbitrary input texts. In this demo, we present a novel Web-based tool that is able to perform domain independent entity type classification under real time conditions. Thanks to its efficient implementation and compacted feature representation, the system is able to process text inputs on-the-fly while still achieving equally high precision as leading state-of-the-art implementations. Our system offers an online interface where natural-language text can be inserted, which returns semantic type labels for entity mentions. Furthermore, the user interface allows users to explore the assigned types by visualizing and navigating along the type-hierarchy.

6 0.62084597 365 acl-2013-Understanding Tables in Context Using Standard NLP Toolkits

7 0.57940912 340 acl-2013-Text-Driven Toponym Resolution using Indirect Supervision

8 0.57446629 21 acl-2013-A Statistical NLG Framework for Aggregated Planning and Realization

9 0.57076526 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction

10 0.56219941 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

11 0.55999416 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction

12 0.54824483 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing

13 0.54124391 232 acl-2013-Linguistic Models for Analyzing and Detecting Biased Language

14 0.54117042 72 acl-2013-Bridging Languages through Etymology: The case of cross language text categorization

15 0.539738 154 acl-2013-Extracting bilingual terminologies from comparable corpora

16 0.53947932 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities

17 0.53672874 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction

18 0.52710837 315 acl-2013-Semi-Supervised Semantic Tagging of Conversational Understanding using Markov Topic Regression

19 0.52431428 256 acl-2013-Named Entity Recognition using Cross-lingual Resources: Arabic as an Example

20 0.51577687 243 acl-2013-Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (6, 0.027), (11, 0.058), (15, 0.012), (24, 0.035), (26, 0.04), (35, 0.051), (42, 0.036), (48, 0.033), (70, 0.435), (71, 0.019), (88, 0.014), (90, 0.018), (95, 0.079)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9871034 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers

Author: Elia Bruni ; Marco Baroni

Abstract: unknown-abstract

2 0.96002978 89 acl-2013-Computerized Analysis of a Verbal Fluency Test

Author: James O. Ryan ; Serguei Pakhomov ; Susan Marino ; Charles Bernick ; Sarah Banks

Abstract: We present a system for automated phonetic clustering analysis of cognitive tests of phonemic verbal fluency, on which one must name words starting with a specific letter (e.g., ‘F’) for one minute. Test responses are typically subjected to manual phonetic clustering analysis that is labor-intensive and subject to inter-rater variability. Our system provides an automated alternative. In a pilot study, we applied this system to tests of 55 novice and experienced professional fighters (boxers and mixed martial artists) and found that experienced fighters produced significantly longer chains of phonetically similar words, while no differences were found in the total number of words produced. These findings are preliminary, but strongly suggest that our system can be used to detect subtle signs of brain damage due to repetitive head trauma in individuals that are otherwise unimpaired.

3 0.95568383 296 acl-2013-Recognizing Identical Events with Graph Kernels

Author: Goran Glavas ; Jan Snajder

Abstract: Identifying news stories that discuss the same real-world events is important for news tracking and retrieval. Most existing approaches rely on the traditional vector space model. We propose an approach for recognizing identical real-world events based on a structured, event-oriented document representation. We structure documents as graphs of event mentions and use graph kernels to measure the similarity between document pairs. Our experiments indicate that the proposed graph-based approach can outperform the traditional vector space model, and is especially suitable for distinguishing between topically similar, yet non-identical events.

4 0.93568408 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs

Author: Shay B. Cohen ; Mark Johnson

Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.

5 0.92533576 19 acl-2013-A Shift-Reduce Parsing Algorithm for Phrase-based String-to-Dependency Translation

Author: Yang Liu

Abstract: We introduce a shift-reduce parsing algorithm for phrase-based string-to-dependency translation. As the algorithm generates dependency trees for partial translations left-to-right in decoding, it allows for efficient integration of both n-gram and dependency language models. To resolve conflicts in shift-reduce parsing, we propose a maximum entropy model trained on the derivation graph of training data. As our approach combines the merits of phrase-based and string-to-dependency models, it achieves significant improvements over the two baselines on the NIST Chinese-English datasets.

6 0.92420709 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering

7 0.90394431 220 acl-2013-Learning Latent Personas of Film Characters

same-paper 8 0.86548936 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia

9 0.71683609 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities

10 0.67575854 249 acl-2013-Models of Semantic Representation with Visual Attributes

11 0.67134768 329 acl-2013-Statistical Machine Translation Improves Question Retrieval in Community Question Answering via Matrix Factorization

12 0.66171926 380 acl-2013-VSEM: An open library for visual semantics representation

13 0.63749057 274 acl-2013-Parsing Graphs with Hyperedge Replacement Grammars

14 0.63707423 80 acl-2013-Chinese Parsing Exploiting Characters

15 0.62036979 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus

16 0.60360062 169 acl-2013-Generating Synthetic Comparable Questions for News Articles

17 0.60096252 339 acl-2013-Temporal Signals Help Label Temporal Relations

18 0.59717387 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews

19 0.58877516 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation

20 0.5868479 292 acl-2013-Question Classification Transfer