acl acl2013 acl2013-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jan Snajder ; Sebastian Pado ; Zeljko Agic
Abstract: We report on the first structured distributional semantic model for Croatian, DM.HR. It is constructed after the model of the English Distributional Memory (Baroni and Lenci, 2010), from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We give details on the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. The resource is freely available.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We report on the first structured distributional semantic model for Croatian, DM.HR. [sent-4, score-0.173]
2 An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. [sent-8, score-0.037]
3 1 Introduction Most current work in lexical semantics is based on the Distributional Hypothesis (Harris, 1954), which posits a correlation between the degree of words’ semantic similarity and the similarity of the contexts in which they occur. [sent-10, score-0.037]
4 Words are typically represented as vectors whose dimensions correspond to context features. [sent-12, score-0.052]
5 The vector similarities, which are interpreted as semantic similarities, are used in numerous applications (Turney and Pantel, 2010). [sent-13, score-0.037]
6 Most vector spaces in current use are either word-based (co-occurrence defined by surface window, context words as dimensions) or syntax-based (co-occurrence defined syntactically, syntactic objects as dimensions). [sent-14, score-0.073]
7 First, they can model fine-grained types of semantic similarity such as predicate-argument plausibility (Erk et al. [sent-16, score-0.037]
8 DM.HR on a synonym choice task, where it outperforms the standard bag-of-words model for nouns and verbs. [sent-30, score-0.216]
9 2 Related Work Vector space semantic models have been applied to a number of Slavic languages, including Bulgarian (Nakov, 2001a), Czech (Smrž and Rychlý, 2001), Polish (Piasecki, 2009; Broda et al. [sent-31, score-0.037]
10 Previous work on distributional semantic models for Croatian dealt with similarity prediction (Ljubešić et al. [sent-34, score-0.173]
11 , 2012), however using only word-based and not syntax-based models. [sent-37, score-0.038]
12 Each tuple is mapped onto a number by a scoring function σ : W × L × W → R+ that reflects the strength of the association. [sent-48, score-0.103]
13 When a particular task is selected, a vector space for this task can be generated from the tensor by matricization. [sent-49, score-0.147]
14 Regarding the examples from Section 1, synonym discovery would use a word by link-word space (W × LW), which contains vectors for words w represented by pairs ⟨l, w⟩ of a link and a context word. [sent-50, score-0.224]
15 Analogy discovery would use a word-word by link space (WW × L), which represents word pairs ⟨w1, w2⟩ by vectors over links l. [sent-51, score-0.232]
16 However, as noted by Padó and Utt (2012), dependency relations are the most obvious choice. [sent-53, score-0.043]
17 DepDM uses links that correspond to dependency relations, with subcategorization for subject (subj_tr and subj_intr) and object (obj and iobj). [sent-55, score-0.332]
18 Finally, the tensor is symmetrized: for each tuple ⟨w1, l, w2⟩, its inverse ⟨w2, l⁻¹, w1⟩ is included. [sent-59, score-0.295]
19 The other two variants are more complex: LexDM uses more lexicalized links, encoding, e. [sent-60, score-0.089]
20 The corpus is freely available for download, along with a more detailed description of the preprocessing steps. [sent-93, score-0.048]
21 For morphosyntactic (MSD) tagging, lemmatization, and dependency parsing of hrWaC, we use freely available tools with models trained on the new SETimes Corpus of Croatian (SETIMES.HR [sent-95, score-0.128]
22 HR HunPos (POS only) HunPos (full MSD) CST lemmatizer MSTParser Wikipedia 97. [sent-103, score-0.045]
23 8 Table 1: Tagging, lemmatization, and parsing accuracy that are about to be released as parts of another work. [sent-111, score-0.037]
24 It is also used as a basis for a novel formalism for syntactic annotation and dependency parsing of Croatian (Agić and Merkler, 2013). [sent-116, score-0.08]
25 In Table 1 we show lemmatization and tagging accuracy, as well as dependency parsing accuracy in terms of labeled attachment score (LAS). [sent-125, score-0.261]
26 The results show that lemmatization, tagging and parsing accuracy improves on the state of the art for Croatian. [sent-126, score-0.075]
27 We collect the co-occurrence counts of tuples using a set of syntactic patterns. [sent-130, score-0.1]
28 The patterns effectively define the link types, and hence the dimensions of the semantic space. [sent-131, score-0.185]
29 Similar to previous work, we use two sorts of links: unlexicalized and lexicalized. [sent-132, score-0.062]
30 These correspond to the main dependency relations produced by our parser: Pred for predicates, Atr for attributes, Adv for adverbs, Atv for verbal complements, Obj for objects, Prep for prepositions, and Pnom for nominal predicates. [sent-134, score-0.043]
31 We subcategorized the subject relation into Sub_tr (subjects of transitive verbs) [sent-135, score-0.05]
32 Table 2 (header only): Link, P (%), R (%), F1 (%); unlexicalized links Adv, Atr, Atv, Obj, Pnom, Pred, Prep, Sb_tr, Sb_intr, Verb. [sent-137, score-0.186]
33 and Sub_intr (subjects of intransitive verbs). [sent-177, score-0.136]
34 The motivation for this is better modeling of verb semantics by capturing diathesis alternations. [sent-178, score-0.043]
35 In particular, for many Croatian verbs reflexivization introduces a meaning shift, e. [sent-179, score-0.072]
36 With subject subcategorization, reflexive and irreflexive readings will have different tensor representations; e. [sent-183, score-0.147]
37 Finally, similar to Padó and Utt (2012), we use Verb as an underspecified link between subjects and objects linked by non-auxiliary verbs. [sent-187, score-0.131]
38 For lexicalized links, we use two more extraction patterns for prepositions and verbs. [sent-188, score-0.148]
39 We computed the performance of tuple extraction by evaluating a sample of tuples extracted from a parsed version of SETIMES.HR [sent-198, score-0.203]
40 SETIMES.HR gold annotations (we use the same sample as for tagging and parsing performance evaluation). [sent-200, score-0.075]
41 The probabilities are computed from tuple counts as maximum likelihood estimates. [sent-208, score-0.103]
42 We exclude from the tensor all tuples with a negative LMI score. [sent-209, score-0.247]
43 Finally, we symmetrize the tensor by introducing inverse links. [sent-210, score-0.192]
44 3M lemmas, 121M links and 165K link types (including inverse links). [sent-214, score-0.277]
45 DM.HR more sparse than English DM (796 link types), but less sparse than German DM (220K link types; 22 links per lemma). [sent-217, score-0.328]
46 Table 3 shows an example of the extracted tuples for the verb kupiti (to buy). [sent-218, score-0.225]
47 DM.HR on a standard task from distributional semantics, namely synonym choice. [sent-224, score-0.264]
48 7 100 100 100 Table 4: Results on synonym choice task four synonym candidates (one is correct). [sent-243, score-0.296]
49 To make predictions, we compute pairwise cosine similarities of the target word vectors with the four candidates and predict the candidate(s) with maximal similarity (note that there may be ties). [sent-248, score-0.035]
50 Each correct prediction with a single most similar candidate receives full credit (A), while ties for maximal similarity are discounted (B: two-way tie, C: three-way tie, D: four-way tie): A + (1/2)B + (1/3)C + (1/4)D. [sent-252, score-0.039]
51 In our experiments, ties occur when vector similarities are zero for all word pairs (due to vector sparsity). [sent-254, score-0.074]
52 We also compare against BOW-LSA, a state-of-the-art synonym detection model from Karan et al. [sent-260, score-0.128]
53 (2012), which uses 500 latent dimensions and paragraphs as contexts. [sent-261, score-0.052]
54 Table 4 shows the results for the three considered models on nouns (N), adjectives (A), 5Available at: http : / / t ake l . [sent-264, score-0.122]
55 DM.HR outperforms the baseline BOW model for nouns and verbs (differences are significant at p < 0. [sent-270, score-0.12]
56 Conversely, on adjectives BOW-LSA performs slightly better than DM.HR. [sent-274, score-0.074]
57 Nouns occur as heads and dependents of many link types (unlexicalized and lexicalized), and are thus well represented in the semantic space. [sent-278, score-0.133]
58 On the other hand, adjectives seem to be less well modeled. [sent-279, score-0.074]
59 Although the majority of adjectives occur as heads or dependents of the Atr relation, for which extraction accuracy is the highest (cf. [sent-280, score-0.074]
60 Table 2), it is likely that a single link type is not sufficient. [sent-281, score-0.096]
61 The generally low performance on verbs suggests that their semantics is not fully covered in word- and syntax-based spaces. [sent-284, score-0.109]
62 DM.HR, a syntax-based distributional memory for Croatian built from a dependency-parsed web corpus. [sent-286, score-0.218]
63 DM.HR is the first freely available distributional memory for a Slavic language. [sent-288, score-0.266]
64 This work provides a starting point for a systematic study of dependency-based distributional semantics for Croatian and similar languages. [sent-292, score-0.136]
65 Our first priority will be to analyze how corpus preprocessing and the choice of link types relate to model performance on different semantic tasks. [sent-293, score-0.077]
66 Better modeling of adjectives and verbs is also an important topic for future research. [sent-294, score-0.146]
67 Three syntactic formalisms for data-driven dependency parsing of Croatian. [sent-301, score-0.08]
68 Improving part-of-speech tagging accuracy for Croatian by morphological analysis. [sent-305, score-0.038]
69 K-best spanning tree dependency parsing with verb valency lexicon reranking. [sent-314, score-0.123]
70 SuperMatrix: a general tool for lexical semantic knowledge acquisition. [sent-322, score-0.037]
71 Corpus-based semantic relatedness for the construction of Polish WordNet. [sent-327, score-0.037]
72 A mixed method lemmatization algorithm using a hierarchy of linguistic identities (HOLI). [sent-359, score-0.143]
73 Cross-lingual distributional profiles of concepts for measuring semantic distance. [sent-392, score-0.173]
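The tensor construction steps described in the extracted sentences above (LMI weighting from maximum likelihood estimates, removal of tuples with negative LMI, symmetrization with inverse links, and matricization into a word by link-word space) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the LMI formula is an assumption (the extract does not include it), giving the inverse tuple the same score is an assumption, and the function names, the `"-1"` suffix for inverse links, and the toy counts are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def lmi_scores(counts):
    """Score each (w1, link, w2) tuple with Local Mutual Information.

    Assumed formula (not given in the extracted sentences):
    LMI = count * log(P(w1, l, w2) / (P(w1) * P(l) * P(w2))),
    with all probabilities estimated by maximum likelihood from counts.
    """
    total = sum(counts.values())
    w1_tot, link_tot, w2_tot = Counter(), Counter(), Counter()
    for (w1, link, w2), c in counts.items():
        w1_tot[w1] += c
        link_tot[link] += c
        w2_tot[w2] += c
    scores = {}
    for (w1, link, w2), c in counts.items():
        p_joint = c / total
        p_indep = (w1_tot[w1] / total) * (link_tot[link] / total) * (w2_tot[w2] / total)
        scores[(w1, link, w2)] = c * math.log(p_joint / p_indep)
    return scores


def build_tensor(counts):
    """Drop tuples with negative LMI, then add an inverse tuple for each kept one."""
    kept = {t: s for t, s in lmi_scores(counts).items() if s >= 0}
    tensor = dict(kept)
    for (w1, link, w2), s in kept.items():
        # Hypothetical naming: the inverse link is the link name plus "-1".
        # Assumption: the inverse tuple carries the same score.
        tensor[(w2, link + "-1", w1)] = s
    return tensor


def matricize_w_lw(tensor):
    """Unfold the tensor into a W x LW space: word -> {(link, context word): score}."""
    space = defaultdict(dict)
    for (w1, link, w2), s in tensor.items():
        space[w1][(link, w2)] = s
    return dict(space)


# Toy counts, invented for illustration only.
toy_counts = {
    ("kupiti", "Obj", "auto"): 4,
    ("kupiti", "Obj", "knjiga"): 2,
    ("prodati", "Obj", "auto"): 2,
    ("student", "Sub_tr", "kupiti"): 2,
    ("kupiti", "Sub_tr", "auto"): 1,
}
toy_space = matricize_w_lw(build_tensor(toy_counts))
print(sorted(toy_space["kupiti"]))
```

Storing the tensor as a dictionary keyed by tuples keeps the sketch short; a resource with millions of lemmas and links would need a sparse on-disk format instead.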
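The synonym choice evaluation described in the extracted sentences above (cosine similarity between the target and four candidates, with credit discounted for ties as A + (1/2)B + (1/3)C + (1/4)D) can be sketched as below. This is an illustration under stated assumptions: the sparse-dictionary vector format, the function names, and the toy data are hypothetical, and dividing the summed credit by the number of items to obtain an accuracy is an assumption not stated in the extract.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts; 0.0 if either is empty."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(target, candidates, space):
    """Return every candidate that reaches the maximal similarity (ties are kept)."""
    target_vec = space.get(target, {})
    sims = {c: cosine(target_vec, space.get(c, {})) for c in candidates}
    best = max(sims.values())
    return [c for c, s in sims.items() if s == best]


def credit(predicted, gold):
    """Full credit for a unique correct prediction, 1/k for a k-way tie containing it."""
    return 1.0 / len(predicted) if gold in predicted else 0.0


def accuracy(items, space):
    """Average tie-discounted credit over (target, candidates, gold) items."""
    return sum(credit(predict(t, cands, space), gold) for t, cands, gold in items) / len(items)


# Toy vector space and test items, invented for illustration only.
toy_space = {
    "auto": {("Obj-1", "kupiti"): 2.0, ("Obj-1", "prodati"): 1.0},
    "vozilo": {("Obj-1", "kupiti"): 1.0, ("Obj-1", "prodati"): 1.0},
    "knjiga": {("Obj-1", "citati"): 3.0},
}
toy_items = [
    ("auto", ["vozilo", "knjiga", "student", "farmer"], "vozilo"),
    ("nepoznato", ["vozilo", "knjiga", "student", "farmer"], "knjiga"),
]
print(accuracy(toy_items, toy_space))
```

The second toy item has a target with no vector, so all four similarities are zero and the prediction is a four-way tie, which matches the observation above that ties arise when vector similarities are zero for all word pairs.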
wordName wordTfidf (topN-words)
[('croatian', 0.496), ('hrwac', 0.218), ('agi', 0.191), ('lmi', 0.164), ('tensor', 0.147), ('lemmatization', 0.143), ('eljko', 0.136), ('intr', 0.136), ('links', 0.136), ('distributional', 0.136), ('karan', 0.134), ('synonym', 0.128), ('ljube', 0.121), ('depdm', 0.109), ('hunpos', 0.109), ('dm', 0.109), ('slavic', 0.105), ('subj', 0.103), ('tuple', 0.103), ('tuples', 0.1), ('utt', 0.096), ('link', 0.096), ('lexicalized', 0.089), ('broda', 0.089), ('memory', 0.082), ('agic', 0.082), ('kupiti', 0.082), ('lexdm', 0.082), ('plze', 0.082), ('zagreb', 0.08), ('atr', 0.076), ('adjectives', 0.074), ('bulgarian', 0.074), ('tadi', 0.072), ('verbs', 0.072), ('erjavec', 0.071), ('pado', 0.071), ('baroni', 0.069), ('lenci', 0.069), ('polish', 0.069), ('marko', 0.063), ('nikola', 0.063), ('unlexicalized', 0.062), ('prepositions', 0.059), ('najder', 0.057), ('piasecki', 0.057), ('sic', 0.057), ('maciej', 0.055), ('atv', 0.055), ('hstudent', 0.055), ('ingason', 0.055), ('mitrofanova', 0.055), ('pnom', 0.055), ('predati', 0.055), ('setimes', 0.055), ('typedm', 0.055), ('zdravko', 0.055), ('pad', 0.053), ('obj', 0.053), ('dimensions', 0.052), ('pred', 0.052), ('tr', 0.05), ('freely', 0.048), ('cst', 0.048), ('csy', 0.048), ('mstparser', 0.048), ('buy', 0.048), ('nouns', 0.048), ('dialogue', 0.047), ('czech', 0.046), ('inverse', 0.045), ('tie', 0.045), ('mladen', 0.045), ('farmer', 0.045), ('lemmatizer', 0.045), ('dependency', 0.043), ('verb', 0.043), ('bojana', 0.042), ('dalbelo', 0.042), ('bartosz', 0.042), ('croatia', 0.042), ('russian', 0.041), ('choice', 0.04), ('efron', 0.04), ('toma', 0.04), ('anton', 0.04), ('msd', 0.04), ('serbian', 0.04), ('slovene', 0.04), ('ties', 0.039), ('wordbased', 0.038), ('tagging', 0.038), ('parsing', 0.037), ('nakov', 0.037), ('preslav', 0.037), ('adv', 0.037), ('semantic', 0.037), ('objects', 0.035), ('similarities', 0.035), ('bow', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
Author: Jan Snajder ; Sebastian Pado ; Zeljko Agic
Abstract: We report on the first structured distributional semantic model for Croatian, DM.HR. It is constructed after the model of the English Distributional Memory (Baroni and Lenci, 2010), from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We give details on the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. The resource is freely available.
2 0.1562366 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
Author: Sebastian Pado ; Jan Snajder ; Britta Zeller
Abstract: Syntax-based vector spaces are used widely in lexical semantics and are more versatile than word-based spaces (Baroni and Lenci, 2010). However, they are also sparse, with resulting reliability and coverage problems. We address this problem by derivational smoothing, which uses knowledge about derivationally related words (oldish → old) to improve semantic similarity estimates. We develop a set of derivational smoothing methods and evaluate them on two lexical semantics tasks in German. Even for models built from very large corpora, simple derivational smoothing can improve coverage considerably.
3 0.10847215 238 acl-2013-Measuring semantic content in distributional vectors
Author: Aurelie Herbelot ; Mohan Ganesalingam
Abstract: Some words are more contentful than others: for instance, make is intuitively more general than produce and fifteen is more ‘precise’ than a group. In this paper, we propose to measure the ‘semantic content’ of lexical items, as modelled by distributional representations. We investigate the hypothesis that semantic content can be computed using the Kullback-Leibler (KL) divergence, an information-theoretic measure of the relative entropy of two distributions. In a task focusing on retrieving the correct ordering of hyponym-hypernym pairs, the KL divergence achieves close to 80% precision but does not outperform a simpler (linguistically unmotivated) frequency measure. We suggest that this result illustrates the rather ‘intensional’ aspect of distributions.
4 0.097575359 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering
Author: Xipeng Qiu ; Le Tian ; Xuanjing Huang
Abstract: Retrieving similar questions is very important in community-based question answering (CQA). In this paper, we propose a unified question retrieval model based on latent semantic indexing with tensor analysis, which can capture word associations among different parts of CQA triples simultaneously. Thus, our method can reduce lexical chasm of question retrieval with the help of the information of question content and answer parts. The experimental result shows that our method outperforms the traditional methods.
5 0.088962264 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
Author: Britta Zeller ; Jan Snajder ; Sebastian Pado
Abstract: Derivational models are still an under-researched area in computational morphology. Even for German, a rather resource-rich language, there is a lack of large-coverage derivational knowledge. This paper describes a rule-based framework for inducing derivational families (i.e., clusters of lemmas in derivational relationships) and its application to create a high-coverage German resource, DERIVBASE, mapping over 280k lemmas into more than 17k non-singleton clusters. We focus on the rule component and a qualitative and quantitative evaluation. Our approach achieves up to 93% precision and 71% recall. We attribute the high precision to the fact that our rules are based on information from grammar books.
6 0.08680515 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
7 0.08626657 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
8 0.071687795 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity
9 0.069982857 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
10 0.069537528 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics
11 0.066156119 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
12 0.063187592 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
13 0.062477589 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
14 0.06181629 290 acl-2013-Question Analysis for Polish Question Answering
15 0.058600087 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
16 0.057865161 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing
17 0.05768666 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
18 0.055537462 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
19 0.055053297 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
20 0.054574572 192 acl-2013-Improved Lexical Acquisition through DPP-based Verb Clustering
topicId topicWeight
[(0, 0.16), (1, 0.026), (2, -0.017), (3, -0.102), (4, -0.071), (5, -0.063), (6, -0.03), (7, 0.012), (8, 0.042), (9, -0.019), (10, -0.044), (11, 0.014), (12, 0.071), (13, -0.02), (14, -0.017), (15, 0.053), (16, -0.04), (17, -0.079), (18, 0.034), (19, 0.004), (20, -0.073), (21, 0.059), (22, 0.082), (23, -0.074), (24, -0.006), (25, 0.039), (26, -0.019), (27, -0.014), (28, -0.092), (29, 0.034), (30, -0.014), (31, 0.013), (32, 0.035), (33, -0.039), (34, -0.004), (35, -0.029), (36, 0.041), (37, -0.062), (38, -0.004), (39, -0.081), (40, -0.02), (41, -0.011), (42, -0.076), (43, -0.087), (44, -0.036), (45, -0.003), (46, -0.031), (47, -0.019), (48, 0.07), (49, -0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.90282607 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
Author: Jan Snajder ; Sebastian Pado ; Zeljko Agic
Abstract: We report on the first structured distributional semantic model for Croatian, DM.HR. It is constructed after the model of the English Distributional Memory (Baroni and Lenci, 2010), from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We give details on the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. The resource is freely available.
2 0.79781669 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
Author: Sebastian Pado ; Jan Snajder ; Britta Zeller
Abstract: Syntax-based vector spaces are used widely in lexical semantics and are more versatile than word-based spaces (Baroni and Lenci, 2010). However, they are also sparse, with resulting reliability and coverage problems. We address this problem by derivational smoothing, which uses knowledge about derivationally related words (oldish → old) to improve semantic similarity estimates. We develop a set of derivational smoothing methods and evaluate them on two lexical semantics tasks in German. Even for models built from very large corpora, simple derivational smoothing can improve coverage considerably.
3 0.72810394 238 acl-2013-Measuring semantic content in distributional vectors
Author: Aurelie Herbelot ; Mohan Ganesalingam
Abstract: Some words are more contentful than others: for instance, make is intuitively more general than produce and fifteen is more ‘precise’ than a group. In this paper, we propose to measure the ‘semantic content’ of lexical items, as modelled by distributional representations. We investigate the hypothesis that semantic content can be computed using the Kullback-Leibler (KL) divergence, an information-theoretic measure of the relative entropy of two distributions. In a task focusing on retrieving the correct ordering of hyponym-hypernym pairs, the KL divergence achieves close to 80% precision but does not outperform a simpler (linguistically unmotivated) frequency measure. We suggest that this result illustrates the rather ‘intensional’ aspect of distributions.
4 0.70293736 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
Author: Britta Zeller ; Jan Snajder ; Sebastian Pado
Abstract: Derivational models are still an under-researched area in computational morphology. Even for German, a rather resource-rich language, there is a lack of large-coverage derivational knowledge. This paper describes a rule-based framework for inducing derivational families (i.e., clusters of lemmas in derivational relationships) and its application to create a high-coverage German resource, DERIVBASE, mapping over 280k lemmas into more than 17k non-singleton clusters. We focus on the rule component and a qualitative and quantitative evaluation. Our approach achieves up to 93% precision and 71% recall. We attribute the high precision to the fact that our rules are based on information from grammar books.
5 0.66067082 31 acl-2013-A corpus-based evaluation method for Distributional Semantic Models
Author: Abdellah Fourtassi ; Emmanuel Dupoux
Abstract: Evaluation methods for Distributional Semantic Models typically rely on behaviorally derived gold standards. These methods are difficult to deploy in languages with scarce linguistic/behavioral resources. We introduce a corpus-based measure that evaluates the stability of the lexical semantic similarity space using a pseudo-synonym same-different detection task and no external resources. We show that it enables to predict two behavior-based measures across a range of parameters in a Latent Semantic Analysis model.
6 0.64700186 185 acl-2013-Identifying Bad Semantic Neighbors for Improving Distributional Thesauri
8 0.58099574 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
9 0.57123017 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics
10 0.52506405 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
11 0.5188306 227 acl-2013-Learning to lemmatise Polish noun phrases
12 0.51593882 262 acl-2013-Offspring from Reproduction Problems: What Replication Failure Teaches Us
13 0.50015819 299 acl-2013-Reconstructing an Indo-European Family Tree from Non-native English Texts
14 0.4890852 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures
15 0.48679322 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
16 0.48596501 231 acl-2013-Linggle: a Web-scale Linguistic Search Engine for Words in Context
17 0.47927994 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
18 0.47689655 371 acl-2013-Unsupervised joke generation from big data
19 0.47609091 242 acl-2013-Mining Equivalent Relations from Linked Data
20 0.47576106 186 acl-2013-Identifying English and Hungarian Light Verb Constructions: A Contrastive Approach
topicId topicWeight
[(0, 0.043), (6, 0.014), (11, 0.036), (24, 0.035), (26, 0.038), (35, 0.566), (42, 0.032), (48, 0.032), (70, 0.039), (88, 0.025), (90, 0.026), (95, 0.041)]
simIndex simValue paperId paperTitle
1 0.97950011 278 acl-2013-Patient Experience in Online Support Forums: Modeling Interpersonal Interactions and Medication Use
Author: Annie Chen
Abstract: Though there has been substantial research concerning the extraction of information from clinical notes, to date there has been less work concerning the extraction of useful information from patient-generated content. Using a dataset comprised of online support group discussion content, this paper investigates two dimensions that may be important in the extraction of patient-generated experiences from text; significant individuals/groups and medication use. With regard to the former, the paper describes an approach involving the pairing of important figures (e.g. family, husbands, doctors, etc.) and affect, and suggests possible applications of such techniques to research concerning online social support, as well as integration into search interfaces for patients. Additionally, the paper demonstrates the extraction of side effects and sentiment at different phases in patient medication use, e.g. adoption, current use, discontinuation and switching, and demonstrates the utility of such an application for drug safety monitoring in online discussion forums.
2 0.97914219 55 acl-2013-Are Semantically Coherent Topic Models Useful for Ad Hoc Information Retrieval?
Author: Romain Deveaud ; Eric SanJuan ; Patrice Bellot
Abstract: The current topic modeling approaches for Information Retrieval do not allow to explicitly model query-oriented latent topics. More, the semantic coherence of the topics has never been considered in this field. We propose a model-based feedback approach that learns Latent Dirichlet Allocation topic models on the top-ranked pseudo-relevant feedback, and we measure the semantic coherence of those topics. We perform a first experimental evaluation using two major TREC test collections. Results show that retrieval performances tend to be better when using topics with higher semantic coherence.
3 0.9716959 160 acl-2013-Fine-grained Semantic Typing of Emerging Entities
Author: Ndapandula Nakashole ; Tomasz Tylenda ; Gerhard Weikum
Abstract: Methods for information extraction (IE) and knowledge base (KB) construction have been intensively studied. However, a largely under-explored case is tapping into highly dynamic sources like news streams and social media, where new entities are continuously emerging. In this paper, we present a method for discovering and semantically typing newly emerging out-of-KB entities, thus improving the freshness and recall of ontology-based IE and improving the precision and semantic rigor of open IE. Our method is based on a probabilistic model that feeds weights into integer linear programs that leverage type signatures of relational phrases and type correlation or disjointness constraints. Our experimental evaluation, based on crowdsourced user studies, show our method performing significantly better than prior work.
same-paper 4 0.96964455 76 acl-2013-Building and Evaluating a Distributional Memory for Croatian
Author: Jan Snajder ; Sebastian Pado ; Zeljko Agic
Abstract: We report on the first structured distributional semantic model for Croatian, DM.HR. It is constructed after the model of the English Distributional Memory (Baroni and Lenci, 2010), from a dependency-parsed Croatian web corpus, and covers about 2M lemmas. We give details on the linguistic processing and the design principles. An evaluation shows state-of-the-art performance on a semantic similarity task with particularly good performance on nouns. The resource is freely available.
5 0.95977771 311 acl-2013-Semantic Neighborhoods as Hypergraphs
Author: Chris Quirk ; Pallavi Choudhury
Abstract: Ambiguity preserving representations such as lattices are very useful in a number of NLP tasks, including paraphrase generation, paraphrase recognition, and machine translation evaluation. Lattices compactly represent lexical variation, but word order variation leads to a combinatorial explosion of states. We advocate hypergraphs as compact representations for sets of utterances describing the same event or object. We present a method to construct hypergraphs from sets of utterances, and evaluate this method on a simple recognition task. Given a set of utterances that describe a single object or event, we construct such a hypergraph, and demonstrate that it can recognize novel descriptions of the same event with high accuracy.
7 0.95535755 10 acl-2013-A Markov Model of Machine Translation using Non-parametric Bayesian Inference
8 0.94909865 122 acl-2013-Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners
9 0.80114424 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
10 0.7889508 238 acl-2013-Measuring semantic content in distributional vectors
11 0.78165567 113 acl-2013-Derivational Smoothing for Syntactic Distributional Semantics
12 0.7691049 58 acl-2013-Automated Collocation Suggestion for Japanese Second Language Learners
13 0.75227785 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
14 0.74490678 121 acl-2013-Discovering User Interactions in Ideological Discussions
15 0.73638105 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval
16 0.72595978 219 acl-2013-Learning Entity Representation for Entity Disambiguation
17 0.72321647 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
18 0.72043347 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
19 0.719136 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
20 0.71795678 371 acl-2013-Unsupervised joke generation from big data