acl acl2010 acl2010-3 knowledge-graph by maker-knowledge-mining

3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities


Source: pdf

Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa

Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper proposes a Bayesian method for robust distributional word similarities. [sent-6, score-0.053]

2 The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. [sent-7, score-0.429]

3 When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is . [sent-8, score-0.255]

4 the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. [sent-9, score-0.121]

5 For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures. [sent-10, score-0.312]

6 1 Introduction The semantic similarity of words is a longstanding topic in computational linguistics because it is theoretically intriguing and has many applications in the field. [sent-11, score-0.108]

7 Many researchers have conducted studies based on the distributional hypothesis (Harris, 1954), which states that words that occur in the same contexts tend to have similar meanings. [sent-12, score-0.075]

8 A number of semantic similarity measures have been proposed based on this hypothesis (Hindle, 1990; Grefenstette, 1994; Dagan et al. [sent-13, score-0.141]

9 In general, most semantic similarity measures have the following form: sim(w1, w2) = g(v(w1), v(w2)) (1), where v(wi) is a vector that represents the contexts in which wi appears, which we call a context profile of wi. [sent-20, score-0.527]

10 The function g is a function on these context profiles that is expected to produce good similarities. [sent-21, score-0.136]

11 Each dimension of the vector corresponds to a context, fk, which is typically a neighboring word or a word having dependency relations with wi in a corpus. [sent-22, score-0.272]

12 Its value, vk(wi), is typically a co-occurrence frequency c(wi, fk), a conditional probability p(fk|wi), or point-wise mutual information (PMI) between wi and fk, which are all calculated from a corpus. [sent-23, score-0.272]

13 For g, various works have used the cosine, the Jaccard coefficient, or the Jensen-Shannon divergence, to name only a few measures. [sent-24, score-0.062]
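
To make Eq. 1 concrete, the sketch below (an illustration with invented toy counts and helper names, not the paper's code) builds a conditional-probability context profile v(w) from co-occurrence counts c(w, f) and uses the cosine as one possible choice of g.

```python
import math

def context_profile(counts, word):
    """v(w): the conditional-probability profile p(f | w) from co-occurrence counts c(w, f)."""
    total = sum(counts[word].values())
    return {f: c / total for f, c in counts[word].items()}

def cosine(v1, v2):
    """One possible choice for g in Eq. 1."""
    dot = sum(x * v2.get(f, 0.0) for f, x in v1.items())
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy co-occurrence counts c(w, f), invented for illustration.
counts = {
    "dog": {"barks": 3, "runs": 2, "eats": 1},
    "cat": {"meows": 2, "runs": 2, "eats": 2},
}
sim = cosine(context_profile(counts, "dog"), context_profile(counts, "cat"))
```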

14 On the other hand, our approach in this paper is to estimate context profiles (v(wi)) robustly and thus to estimate the similarity robustly. [sent-26, score-0.244]

15 All other things being equal, the similarity with a more frequent word should be larger, since it would be more reliable. [sent-29, score-0.108]

16 In the NLP field, data sparseness has been recognized as a serious problem and tackled in the context of language modeling and supervised machine learning. [sent-31, score-0.097]

17 However, there has been no study that seriously dealt with data sparseness in the context of semantic similarity calculation. [sent-34, score-0.184]

18 In this paper, we apply the Bayesian framework to the calculation of distributional similarity. [sent-38, score-0.093]

19 The uncertainty due to data sparseness is represented by p(v(wi)), and taking the expectation enables us to take this into account. [sent-40, score-0.087]

20 The Bayesian estimation usually gives diverging distributions for infrequent observations and thus decreases the expectation value as expected. [sent-41, score-0.138]

21 The Bayesian estimation and the expectation calculation in Eq. [sent-42, score-0.153]

22 Since our motivation for this research is to calculate good semantic similarities for a large set of words (e. [sent-44, score-0.109]

23 Our technical contribution in this paper is to show that in the case where the context profiles are multinomial distributions, the priors are Dirichlet, and the base similarity measure is the Bhattacharyya coefficient (Bhattacharyya, 1943), we can derive an analytical form for Eq. [sent-47, score-0.561]

24 2, that enables efficient calculation (with some implementation tricks). [sent-48, score-0.059]

25 In Section 2, we briefly introduce the Bayesian estimation and the Bhattacharyya coefficient. [sent-51, score-0.052]

26 Section 3 proposes our new Bayesian Bhattacharyya coefficient for robust similarity calculation. [sent-52, score-0.204]

27 1 Bayesian estimation with Dirichlet prior Assume that we estimate a probabilistic model for the observed data D, p(D|φ), which is parameterized with parameters φ. [sent-56, score-0.052]

28 In the maximum-likelihood estimation (MLE), we find the point estimation φ∗ = argmaxφ p(D|φ). [sent-57, score-0.104]

29 On the other hand, the objective of the Bayesian estimation is to find the distribution of φ given the observed data D, i.e., the posterior distribution. [sent-60, score-0.074]

30 Estimating a conditional probability distribution φk = p(fk|wi) as a context profile for each wi falls into this case. [sent-68, score-0.367]

31 The Dirichlet distribution is parametrized by hyperparameters αk (> 0). [sent-74, score-0.062]
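
A small sketch of this estimation step (illustrative helpers, not the paper's code): with a Dirichlet(α) prior over the multinomial profile of wi, observing the counts c(wi, fk) yields a Dirichlet posterior whose k-th parameter is simply αk + c(wi, fk).

```python
def dirichlet_posterior(alpha, counts):
    """Posterior hyperparameters for a multinomial profile p(f_k | w_i) under a
    Dirichlet(alpha) prior with observed counts c(w_i, f_k).
    `alpha` and `counts` are parallel lists over the context vocabulary."""
    return [a + c for a, c in zip(alpha, counts)]

def posterior_mean(post):
    """Point estimate E[p_k] = post_k / sum(post); MLE is the limit alpha -> 0."""
    total = sum(post)
    return [p / total for p in post]

# Example: symmetric prior alpha_k = 0.5 over three contexts, counts for one word.
post = dirichlet_posterior([0.5, 0.5, 0.5], [3, 1, 0])   # -> [3.5, 1.5, 0.5]
mean = posterior_mean(post)
```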

32 2 Bhattacharyya coefficient When the context profiles are probability distributions, we usually utilize measures on probability distributions, such as the Jensen-Shannon (JS) divergence, to calculate similarities (Dagan et al. [sent-80, score-0.461]

33 Although we found that the JS divergence is a good measure, it is difficult to derive an efficient calculation of Eq. [sent-86, score-0.146]

34 The BC is also a similarity measure on probability distributions and is suitable for our purposes as we describe in the next section. [sent-89, score-0.196]

35 Although BC has not been explored well in the literature on distributional word similarities, it is also a good similarity measure as the experiments show. [sent-90, score-0.186]
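
For reference, the (non-Bayesian) Bhattacharyya coefficient between two distributions p and q is BC(p, q) = Σk √(pk qk); a minimal sketch:

```python
import math

def bhattacharyya(p, q):
    """BC(p, q) = sum_k sqrt(p_k * q_k): 1.0 for identical distributions, 0.0 for disjoint support."""
    return sum(math.sqrt(pk * qk) for pk, qk in zip(p, q))

bc = bhattacharyya([0.7, 0.2, 0.1], [0.5, 0.3, 0.2])
```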

36 3 Method In this section, we show that if our base similarity measure is BC and the distributions under which we take the expectation are Dirichlet distributions, then Eq. [sent-91, score-0.263]

37 2 also has an analytical form, allowing efficient calculation. [sent-92, score-0.096]

38 After several derivation steps (see Appendix A), we obtain the following analytical solution for the above. 1 A naive but general way might be to draw samples of v(wi) from p(v(wi)) and approximate the expectation using these samples. [sent-94, score-0.138]

39 In Eq. 7, α and β are the hyperparameters of the priors of w1 and w2, respectively. [sent-97, score-0.064]

40 To put it all together, we can obtain a new Bayesian similarity measure on words, which can be calculated only from the hyperparameters for the Dirichlet prior, α and β, and the observed counts c(wi, fk). [sent-98, score-0.213]

41 We call this new measure the Bayesian Bhattacharyya coefficient (BCb for short). [sent-101, score-0.121]
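
The closed form itself is not reproduced in this summary, so the sketch below is an illustration derived from the setup described above rather than a copy of the paper's Eq. 7, and details may differ. Taking the expectation of BC under two independent Dirichlet posteriors with parameters a_k = αk + c(w1, fk) and b_k = βk + c(w2, fk) uses the standard identity E[√p_k] = Γ(a_k + 1/2) Γ(a_0) / (Γ(a_k) Γ(a_0 + 1/2)), with a_0 = Σk a_k, so only Gamma functions of the hyperparameters and the observed counts are needed, as the summary states.

```python
from math import lgamma, exp

def bcb(alpha, counts1, beta, counts2):
    """E[BC(p, q)] with p ~ Dir(alpha + counts1) and q ~ Dir(beta + counts2), independent.
    Uses E[sqrt(p_k)] = Gamma(a_k + 1/2) Gamma(a_0) / (Gamma(a_k) Gamma(a_0 + 1/2))."""
    a = [x + c for x, c in zip(alpha, counts1)]
    b = [x + c for x, c in zip(beta, counts2)]
    a0, b0 = sum(a), sum(b)
    # Normalizing factor shared by every dimension, kept in log space.
    log_norm = (lgamma(a0) - lgamma(a0 + 0.5)) + (lgamma(b0) - lgamma(b0 + 0.5))
    total = 0.0
    for ak, bk in zip(a, b):
        log_term = (lgamma(ak + 0.5) - lgamma(ak)) + (lgamma(bk + 0.5) - lgamma(bk))
        total += exp(log_term + log_norm)
    return total

# Symmetric priors alpha_k = beta_k = 0.01 over four contexts (toy values).
sim = bcb([0.01] * 4, [5, 2, 0, 0], [0.01] * 4, [4, 3, 1, 0])
```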

42 0, the Bayesian similarity between these words is calculated as BCb(w0, w1) = 0. [sent-110, score-0.108]

43 We can see that similarities are different according to the number of observations, as expected. [sent-113, score-0.081]

44 4 Implementation Issues Although we have derived the analytical form (Eq. [sent-122, score-0.096]

45 2 Second, the calculation of the log Gamma function is heavier than operations such as simple multiplication, which is used in existing measures. [sent-128, score-0.082]

46 Because c(wi, fk) is usually sparse, that technique speeds up the calculation of the existing measures drastically and makes it practical. [sent-133, score-0.092]

47 In this study, the above problem is solved by pre-computing the required log Gamma values, assuming that we calculate similarities for a large set of words, and pre-computing default values for cases where c(wi, fk) = 0. [sent-134, score-0.132]

48 In the calculation of BCb(w1 , w2), we first assume that all c(wi, fk) = 0 and set the output variable to the default value. [sent-137, score-0.059]
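
A hedged sketch of that bookkeeping (again an illustration of the trick described above, not the paper's code; it reuses the expectation from the previous sketch): with symmetric priors, every dimension where both counts are zero contributes one and the same default term, so the sum can start from K times that default and be corrected only at the dimensions with non-zero counts, with the log Gamma values cacheable across word pairs.

```python
from math import lgamma, exp

def bcb_sparse(alpha, beta, K, counts1, counts2):
    """Same expectation as the dense sketch, but with sparse count dicts
    {context_id: count} and symmetric priors alpha_k = alpha, beta_k = beta over K contexts."""
    a0 = K * alpha + sum(counts1.values())          # posterior totals
    b0 = K * beta + sum(counts2.values())
    log_norm = (lgamma(a0) - lgamma(a0 + 0.5)) + (lgamma(b0) - lgamma(b0 + 0.5))

    def log_half(x):                                # lgamma(x + 1/2) - lgamma(x); cacheable
        return lgamma(x + 0.5) - lgamma(x)

    default = exp(log_half(alpha) + log_half(beta) + log_norm)   # term when both counts are 0
    total = K * default                             # start as if every dimension were default
    for k in set(counts1) | set(counts2):           # then correct only the non-zero dimensions
        ak = alpha + counts1.get(k, 0)
        bk = beta + counts2.get(k, 0)
        total += exp(log_half(ak) + log_half(bk) + log_norm) - default
    return total

# Toy usage: 100,000 contexts, sparse counts for two words.
sim = bcb_sparse(0.01, 0.01, 100000, {3: 5, 17: 2}, {3: 4, 42: 1})
```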

49 1 Evaluation setting We evaluated our method in the calculation of similarities between nouns in Japanese. [sent-147, score-0.17]

50 Because human evaluation of word similarities is very difficult and costly, we conducted automatic evaluation in the set expansion setting, following previous studies such as Pantel et al. [sent-148, score-0.081]

51 Given a word set, which is expected to contain similar words, we assume that a good similarity measure should output, for each word in the set, the other words in the set as similar words. [sent-150, score-0.152]

52 We output a ranked list of 500 similar words for each word using a given similarity measure and checked whether they are included in the answers. [sent-152, score-0.152]

53 δ(wi ∈ ans) returns 1 if the output word wi is in the answers, and returns 0 otherwise. [sent-156, score-0.272]
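
As a sketch of how such an evaluation can be scored (the summary names MAP and MP at top-k; the concrete implementation below is an assumption for illustration, not taken from the paper):

```python
def precision_at(ranked, answers, k):
    """MP@k component for one query word: fraction of the top-k outputs that are answers."""
    return sum(1 for w in ranked[:k] if w in answers) / k

def average_precision(ranked, answers):
    """Average precision over the ranked list, using delta(w in ans) as relevance."""
    hits, score = 0, 0.0
    for i, w in enumerate(ranked, start=1):
        if w in answers:
            hits += 1
            score += hits / i
    return score / len(answers) if answers else 0.0

ranked = ["w3", "w7", "w1", "w9"]       # system output for one query word (toy)
answers = {"w3", "w9"}                  # the other words in the same set
p_at_1 = precision_at(ranked, answers, 1)
ap = average_precision(ranked, answers)
```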

54 2 Collecting context profiles Dependency relations are used as context profiles as in Kazama and Torisawa (2008) and Kazama et al. [sent-160, score-0.272]

55 , 2008) (100 million documents), where each sentence has a dependency parse, we extracted noun-verb and noun-noun dependencies with relation types and then calculated their frequencies in the corpus. [sent-163, score-0.065]

56 We extracted about 470 million unique dependencies from the corpus, containing 31 million unique nouns (including compound nouns as determined by our filters) and 22 million unique contexts, fk. [sent-170, score-0.285]

57 We sorted the nouns according to the number of unique co-occurring contexts and the contexts according to the number of unique co-occurring nouns, and then we selected the top one million nouns and 100,000 contexts. [sent-171, score-0.234]

58 We used only 260 million dependency pairs that contained both the selected nouns and the selected contexts. [sent-172, score-0.074]

59 Note that we do not deal with ambiguities either in the construction of these sets or in the calculation of similarities. [sent-188, score-0.08]

60 That is, a word can be contained in several sets, and the answers for such a word are the union of the words in the sets it belongs to (excluding the word itself). [sent-189, score-0.057]

61 Set “A” contained 3,740 words that are actually evaluated, with about 115 answers on average, and “B” contained 3,657 words with about 65 answers on average. [sent-192, score-0.072]

62 4 Compared similarity measures We compared our Bayesian Bhattacharyya similarity measure, BCb, with the following similarity measures. [sent-195, score-0.357]

63 JS Jensen-Shannon divergence between p(fk |w1) and p(fk |w2) (Dagan et al. [sent-196, score-0.062]

64 PMI-cos The cosine of the context profile vectors, where the k-th dimension is the pointwise mutual information (PMI) between wi and fk, defined as PMI(wi, fk) = log [ p(wi, fk) / (p(wi) p(fk)) ] (Pantel and Lin, 2002; Pantel et al. [sent-199, score-0.854]
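
A minimal sketch of this PMI weighting (illustrative code; dropping negative PMI values is an assumed, common choice rather than something stated here). The resulting vector can be fed to the cosine g from the earlier sketch.

```python
import math

def pmi_vector(cooc, w):
    """Context profile whose k-th dimension is PMI(w, f_k) = log p(w, f_k) / (p(w) p(f_k)),
    estimated from co-occurrence counts; negative values are dropped (an assumed choice)."""
    grand_total = sum(c for contexts in cooc.values() for c in contexts.values())
    word_total = sum(cooc[w].values())
    context_total = {}
    for contexts in cooc.values():
        for f, c in contexts.items():
            context_total[f] = context_total.get(f, 0) + c
    vec = {}
    for f, c in cooc[w].items():
        pmi = math.log(c * grand_total / (word_total * context_total[f]))
        if pmi > 0:
            vec[f] = pmi
    return vec
```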

65 (2009) proposed using the Jensen-Shannon divergence between hidden class distributions, p(c|w1) and p(c|w2), which are obtained by using an EM-based clustering of dependency relations with a model p(wi, fk) = Σc p(wi|c) p(fk|c) p(c) (Kazama and Torisawa, 2008). [sent-202, score-0.102]

66 To alleviate the effect of local minima of the EM clustering, they proposed averaging the similarities over several different clustering results, which can be obtained by using different initial parameters. [sent-204, score-0.121]

67 BC The Bhattacharyya coefficient (Bhattacharyya, 1943) between p(fk |w1) and p(fk |w2). [sent-207, score-0.077]

68 In calculating p(fk|wi), we subtract the discounting value, α, from c(wi, fk) and equally distribute the residual probability mass to the contexts whose frequency is zero. [sent-210, score-0.073]
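
A hedged sketch of that discounted estimate (an illustration with invented helper names; it assumes every non-zero count exceeds α):

```python
def discounted_profile(counts, n_contexts, alpha):
    """p(f_k | w_i) with absolute discounting: subtract alpha from each non-zero c(w_i, f_k)
    and give the freed probability mass, in equal shares, to the zero-frequency contexts.
    `counts` is a sparse dict {context_id: count}; `n_contexts` is the total number of contexts.
    Assumes every observed count is larger than alpha."""
    total = sum(counts.values())
    n_seen = len(counts)
    n_zero = n_contexts - n_seen
    freed = alpha * n_seen / total                # residual probability mass
    default = freed / n_zero if n_zero else 0.0   # share given to each unseen context
    profile = {k: (c - alpha) / total for k, c in counts.items()}
    return profile, default                       # zero-count contexts get `default`
```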

69 Since it is very costly to calculate the similarities with all of the other words (one million in our case), we used the following approximation method that exploits the sparseness of c(wi, fk). [sent-212, score-0.198]

70 6 We merge all of the words above as candidate words and calculate the similarity only for the candidate words. [sent-218, score-0.136]

71 4 In the case of EM clustering, the number of unique contexts, fk, was also set to one million instead of 100,000, following Kazama et al. [sent-224, score-0.068]

72 5 It is possible that the number of contexts with non-zero counts is less than L. [sent-226, score-0.062]

73 In that case, all of the contexts with non-zero counts are used. [sent-227, score-0.062]
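
One way to realize this candidate filtering is an inverted index from contexts to words; the sketch below is an assumption about the described approximation, not the paper's exact procedure.

```python
from collections import defaultdict

def build_inverted_index(cooc):
    """Map each context f_k to the set of words that co-occur with it."""
    index = defaultdict(set)
    for w, contexts in cooc.items():
        for f in contexts:
            index[f].add(w)
    return index

def candidates(cooc, index, w, L):
    """Words sharing at least one of w's top-L contexts (by count); similarity is then
    computed only for these candidates instead of the full one-million-word vocabulary."""
    top = sorted(cooc[w], key=lambda f: cooc[w][f], reverse=True)[:L]
    cands = set()
    for f in top:
        cands |= index[f]
    cands.discard(w)
    return cands
```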

74 The MAP and the MPs at the top 1, 5, 10, and 20 are shown for each similarity measure. [sent-319, score-0.108]

75 Because tuning hyperparameters involves the possibility of overfitting, its robustness should be assessed. [sent-332, score-0.066]

76 Although Cls-JS showed very good performance for Set C, note that the EM clustering is very time-consuming (Kazama and Torisawa, 2008), and it took about one week with 24 CPU cores to get one clustering result in our computing environment. [sent-355, score-0.102]

77 Number of query words improved / unchanged / degraded: Set A: 755 / 2,585 / 400; Set B: 643 / 2,610 / 404; Set C: 3,153 / 3,962 / 1,738. [sent-521, score-0.067]

78 We can see that BCb surely outputs more low-ID words than BC, and BC more than Cls-JS and JS. [sent-539, score-0.056]

79 Clearly, we need more analysis on what caused the improvement by the proposed method and how that affects the efficacy in real applications of similarity measures. [sent-544, score-0.108]

80 The proposed Bayesian similarity measure outperformed the baseline Bhattacharyya coefficient. 8 This suggests the use of different αs depending on ID ranges (e. [sent-545, score-0.229]

81 However, as noted above, there has been no serious attempt to assess the effect of smoothing in the context of word similarity calculation. [sent-564, score-0.204]

82 Recent studies have pointed out that the Bayesian framework derives state-of-the-art smoothing methods such as Kneser-Ney smoothing as a special case (Teh, 2006; Mochihashi et al. [sent-565, score-0.088]

83 Instead, we used the obtained analytical form directly with the assumption that αk = α, where α can be tuned directly by using a simple grid search with a small subset of the vocabulary as the development set. [sent-573, score-0.096]

84 In terms of calculation procedure, BCb has the same form as other similarity measures, which is basically the same as the inner product of sparse vectors. [sent-576, score-0.167]

85 Thus, with some effort, as we described in Section 4, it can be as fast as other similarity measures when our aim is to calculate similarities between words in a fixed large vocabulary. [sent-577, score-0.25]

86 For example, BCb took about 100 hours to calculate the top 500 similar nouns for all of the one million nouns (using 16 CPU cores), while JS took about 57 hours. [sent-578, score-0.132]

87 The limitation of our method is that it cannot be used efficiently with similarity measures other than the Bhattacharyya coefficient, although that choice seems good as shown in the experiments. [sent-580, score-0.141]

88 For example, it seems difficult to use the Jensen-Shannon divergence as the base similarity because the analytical form cannot be derived. [sent-581, score-0.291]

89 In another direction, we will be able to use a “weighted” Bhattacharyya coefficient, ∑k µ(w1, fk) µ(w2, fk) √(p1k × p2k), where the weights, µ(wi, fk), do not depend on pik, as the base similarity measure. [sent-585, score-0.108]

90 The analytical form for it will be a weighted version of BCb. [sent-586, score-0.096]

91 BCb can also be generalized to the case where the base similarity is BCd(p1, p2) = ∑k (p1k p2k)^d, where d > 0. [sent-587, score-0.133]

92 Finally, note that our BCb is different from the Bhattacharyya distance measure on Dirichlet distributions of the following form described in Rauber et al. [sent-592, score-0.088]

93 (2008) in its motivation and analytical form: √( Γ(α′0) Γ(β′0) / ( Πk Γ(α′k) Γ(β′k) ) ) × Πk Γ((α′k + β′k)/2) / Γ(½ Σk (α′k + β′k)). [sent-593, score-0.096]

94 7 Conclusion We proposed a Bayesian method for robust distributional word similarities. [sent-595, score-0.053]

95 Our method uses a distribution of context profiles obtained by Bayesian (footnote 10: Our preliminary experiments show that calculating similarity using Eq.) [sent-596, score-0.266]

96 estimation and takes the expectation of a base similarity measure under that distribution. [sent-601, score-0.271]

97 We showed that, in the case where the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form, permitting efficient calculation. [sent-602, score-0.376]

98 Experimental results show that the proposed measure gives better word similarities than a non-Bayesian Bhattacharyya coefficient, other well-known similarity measures such as Jensen-Shannon divergence and the cosine with PMI weights, and the Bhattacharyya coefficient with absolute discounting. [sent-603, score-0.425]

99 Appendix A Here, we give the analytical form for the generalized case (BCbd) in Section 6. [sent-604, score-0.096]

100 On a measure of divergence between two statistical populations defined by their probability distributions. [sent-613, score-0.106]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bcb', 0.576), ('fk', 0.489), ('wi', 0.272), ('bhattacharyya', 0.215), ('bc', 0.171), ('bca', 0.147), ('mp', 0.121), ('similarity', 0.108), ('profiles', 0.105), ('kazama', 0.1), ('bayesian', 0.1), ('analytical', 0.096), ('js', 0.095), ('dirichlet', 0.086), ('similarities', 0.081), ('kk', 0.081), ('gamma', 0.08), ('dagan', 0.08), ('ln', 0.079), ('coefficient', 0.077), ('xk', 0.063), ('divergence', 0.062), ('id', 0.062), ('calculation', 0.059), ('mps', 0.054), ('estimation', 0.052), ('dir', 0.049), ('mochihashi', 0.047), ('map', 0.047), ('pantel', 0.046), ('sparseness', 0.045), ('distributions', 0.044), ('smoothing', 0.044), ('measure', 0.044), ('million', 0.044), ('pmi', 0.043), ('degraded', 0.043), ('expectation', 0.042), ('profile', 0.042), ('contexts', 0.041), ('hyperparameters', 0.04), ('pavg', 0.04), ('clustering', 0.04), ('sim', 0.04), ('torisawa', 0.038), ('japanese', 0.036), ('answers', 0.036), ('masaki', 0.035), ('siblings', 0.035), ('distributional', 0.034), ('measures', 0.033), ('discounting', 0.032), ('murata', 0.032), ('surely', 0.032), ('context', 0.031), ('nouns', 0.03), ('ido', 0.028), ('calculate', 0.028), ('mm', 0.028), ('dk', 0.027), ('bcbd', 0.027), ('bcdb', 0.027), ('crl', 0.027), ('kuroda', 0.027), ('rauber', 0.027), ('shinzato', 0.027), ('terada', 0.027), ('zzdir', 0.027), ('multinomial', 0.026), ('ans', 0.026), ('appendix', 0.026), ('tuning', 0.026), ('exp', 0.025), ('base', 0.025), ('derive', 0.025), ('priors', 0.024), ('unique', 0.024), ('outputs', 0.024), ('kentaro', 0.024), ('unchanged', 0.024), ('saeger', 0.023), ('simb', 0.023), ('stijn', 0.023), ('log', 0.023), ('distribution', 0.022), ('cores', 0.022), ('counts', 0.021), ('dependencies', 0.021), ('ichi', 0.021), ('serious', 0.021), ('thesaurus', 0.021), ('sets', 0.021), ('ids', 0.02), ('kl', 0.02), ('cortes', 0.02), ('edr', 0.02), ('kx', 0.02), ('cosine', 0.02), ('robust', 0.019), ('correlate', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa

Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.

2 0.098851748 257 acl-2010-WSD as a Distributed Constraint Optimization Problem

Author: Siva Reddy ; Abhilash Inumella

Abstract: This work models the Word Sense Disambiguation (WSD) problem as a Distributed Constraint Optimization Problem (DCOP). To model WSD as a DCOP, we view information from various knowledge sources as constraints. DCOP algorithms have the remarkable property of jointly maximizing over a wide range of utility functions associated with these constraints. We show how utility functions can be designed for various knowledge sources. For the purpose of evaluation, we modelled all-words WSD as a simple DCOP problem. The results are competitive with state-of-the-art knowledge-based systems.

3 0.098164566 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures

Author: Carlos Gomez-Rodriguez ; Joakim Nivre

Abstract: Finding a class of structures that is rich enough for adequate linguistic representation yet restricted enough for efficient computational processing is an important problem for dependency parsing. In this paper, we present a transition system for 2-planar dependency trees (trees that can be decomposed into at most two planar graphs) and show that it can be used to implement a classifier-based parser that runs in linear time and outperforms a state-of-the-art transition-based parser on four data sets from the CoNLL-X shared task. In addition, we present an efficient method for determining whether an arbitrary tree is 2-planar and show that 99% or more of the trees in existing treebanks are 2-planar.

4 0.092445523 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

Author: Boxing Chen ; George Foster ; Roland Kuhn

Abstract: This paper proposes new algorithms to compute the sense similarity between two units (words, phrases, rules, etc.) from parallel corpora. The sense similarity scores are computed by using the vector space model. We then apply the algorithms to statistical machine translation by computing the sense similarity between the source and target side of translation rule pairs. Similarity scores are used as additional features of the translation model to improve translation performance. Significant improvements are obtained over a state-of-the-art hierarchical phrase-based machine translation system. 1

5 0.0823346 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

Author: Xiao-Li Li ; Lei Zhang ; Bing Liu ; See-Kiong Ng

Abstract: Distributional similarity is a classic technique for entity set expansion, where the system is given a set of seed entities of a particular class, and is asked to expand the set using a corpus to obtain more entities of the same class as represented by the seeds. This paper shows that a machine learning model called positive and unlabeled learning (PU learning) can model the set expansion problem better. Based on the test results of 10 corpora, we show that a PU learning technique outperformed distributional similarity significantly. 1

6 0.08151143 197 acl-2010-Practical Very Large Scale CRFs

7 0.081112474 162 acl-2010-Learning Common Grammar from Multilingual Corpus

8 0.077937238 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

9 0.074032597 17 acl-2010-A Structured Model for Joint Learning of Argument Roles and Predicate Senses

10 0.071402729 27 acl-2010-An Active Learning Approach to Finding Related Terms

11 0.066713735 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images

12 0.065822408 8 acl-2010-A Hybrid Hierarchical Model for Multi-Document Summarization

13 0.063843153 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization

14 0.06037166 242 acl-2010-Tree-Based Deterministic Dependency Parsing - An Application to Nivre's Method -

15 0.05932162 249 acl-2010-Unsupervised Search for the Optimal Segmentation for Statistical Machine Translation

16 0.055016931 158 acl-2010-Latent Variable Models of Selectional Preference

17 0.048925173 214 acl-2010-Sparsity in Dependency Grammar Induction

18 0.04758063 10 acl-2010-A Latent Dirichlet Allocation Method for Selectional Preferences

19 0.044725291 70 acl-2010-Contextualizing Semantic Representations Using Syntactically Enriched Vector Models

20 0.043824837 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.129), (1, 0.026), (2, -0.008), (3, 0.006), (4, 0.062), (5, -0.02), (6, 0.039), (7, -0.042), (8, 0.028), (9, 0.008), (10, -0.096), (11, 0.023), (12, 0.052), (13, -0.004), (14, 0.088), (15, -0.061), (16, -0.037), (17, -0.113), (18, -0.086), (19, -0.035), (20, -0.022), (21, -0.075), (22, -0.058), (23, 0.045), (24, 0.017), (25, -0.004), (26, -0.009), (27, -0.054), (28, 0.088), (29, 0.008), (30, 0.029), (31, 0.022), (32, -0.017), (33, -0.059), (34, -0.026), (35, 0.068), (36, 0.17), (37, 0.024), (38, -0.045), (39, 0.052), (40, 0.008), (41, 0.135), (42, 0.104), (43, 0.172), (44, -0.106), (45, -0.101), (46, -0.116), (47, 0.051), (48, 0.088), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93383038 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa

Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.

2 0.58557874 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images

Author: Yansong Feng ; Mirella Lapata

Abstract: In this paper we tackle the problem of automatic caption generation for news images. Our approach leverages the vast resource of pictures available on the web and the fact that many of them are captioned. Inspired by recent work in summarization, we propose extractive and abstractive caption generation models. They both operate over the output of a probabilistic image annotation model that preprocesses the pictures and suggests keywords to describe their content. Experimental results show that an abstractive model defined over phrases is superior to extractive methods.

3 0.51753598 27 acl-2010-An Active Learning Approach to Finding Related Terms

Author: David Vickrey ; Oscar Kipersztok ; Daphne Koller

Abstract: We present a novel system that helps nonexperts find sets of similar words. The user begins by specifying one or more seed words. The system then iteratively suggests a series of candidate words, which the user can either accept or reject. Current techniques for this task typically bootstrap a classifier based on a fixed seed set. In contrast, our system involves the user throughout the labeling process, using active learning to intelligently explore the space of similar words. In particular, our system can take advantage of negative examples provided by the user. Our system combines multiple preexisting sources of similarity data (a standard thesaurus, WordNet, contextual similarity), enabling it to capture many types of similarity groups (“synonyms of crash,” “types of car,” etc.). We evaluate on a hand-labeled evaluation set; our system improves over a strong baseline by 36%.

4 0.51140559 197 acl-2010-Practical Very Large Scale CRFs

Author: Thomas Lavergne ; Olivier Cappe ; Francois Yvon

Abstract: Conditional Random Fields (CRFs) are a widely-used approach for supervised sequence labelling, notably due to their ability to handle large description spaces and to integrate structural dependency between labels. Even for the simple linear-chain model, taking structure into account implies a number of parameters and a computational effort that grows quadratically with the cardinality of the label set. In this paper, we address the issue of training very large CRFs, containing up to hundreds of output labels and several billion features. Efficiency stems here from the sparsity induced by the use of an ℓ1 penalty term. Based on our own implementation, we compare three recent proposals for implementing this regularization strategy. Our experiments demonstrate that very large CRFs can be trained efficiently and that very large models are able to improve the accuracy, while delivering compact parameter sets.

5 0.50035429 183 acl-2010-Online Generation of Locality Sensitive Hash Signatures

Author: Benjamin Van Durme ; Ashwin Lall

Abstract: Motivated by the recent interest in streaming algorithms for processing large text collections, we revisit the work of Ravichandran et al. (2005) on using the Locality Sensitive Hash (LSH) method of Charikar (2002) to enable fast, approximate comparisons of vector cosine similarity. For the common case of feature updates being additive over a data stream, we show that LSH signatures can be maintained online, without additional approximation error, and with lower memory requirements than when using the standard offline technique.

6 0.45668235 257 acl-2010-WSD as a Distributed Constraint Optimization Problem

7 0.42297477 20 acl-2010-A Transition-Based Parser for 2-Planar Dependency Structures

8 0.41177711 201 acl-2010-Pseudo-Word for Phrase-Based Machine Translation

9 0.40153214 51 acl-2010-Bilingual Sense Similarity for Statistical Machine Translation

10 0.39516014 242 acl-2010-Tree-Based Deterministic Dependency Parsing - An Application to Nivre's Method -

11 0.3598997 89 acl-2010-Distributional Similarity vs. PU Learning for Entity Set Expansion

12 0.35928082 62 acl-2010-Combining Orthogonal Monolingual and Multilingual Sources of Evidence for All Words WSD

13 0.34406069 214 acl-2010-Sparsity in Dependency Grammar Induction

14 0.31935441 256 acl-2010-Vocabulary Choice as an Indicator of Perspective

15 0.31594214 96 acl-2010-Efficient Optimization of an MDL-Inspired Objective Function for Unsupervised Part-Of-Speech Tagging

16 0.3097991 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.

17 0.29494479 129 acl-2010-Growing Related Words from Seed via User Behaviors: A Re-Ranking Based Approach

18 0.29051182 232 acl-2010-The S-Space Package: An Open Source Package for Word Space Models

19 0.28734186 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

20 0.28342658 50 acl-2010-Bilingual Lexicon Generation Using Non-Aligned Signatures


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.026), (25, 0.062), (42, 0.014), (44, 0.021), (51, 0.281), (59, 0.087), (71, 0.013), (73, 0.05), (76, 0.016), (78, 0.024), (80, 0.015), (83, 0.079), (84, 0.024), (98, 0.16)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90293813 137 acl-2010-How Spoken Language Corpora Can Refine Current Speech Motor Training Methodologies

Author: Daniil Umanski ; Federico Sangati

Abstract: The growing availability of spoken language corpora presents new opportunities for enriching the methodologies of speech and language therapy. In this paper, we present a novel approach for constructing speech motor exercises, based on linguistic knowledge extracted from spoken language corpora. In our study with the Dutch Spoken Corpus, syllabic inventories were obtained by means of automatic syllabification of the spoken language data. Our experimental syllabification method exhibited a reliable performance, and allowed for the acquisition of syllabic tokens from the corpus. Consequently, the syllabic tokens were integrated in a tool for clinicians, a result which holds the potential of contributing to the current state of speech motor training methodologies.

same-paper 2 0.77765024 3 acl-2010-A Bayesian Method for Robust Estimation of Distributional Similarities

Author: Jun'ichi Kazama ; Stijn De Saeger ; Kow Kuroda ; Masaki Murata ; Kentaro Torisawa

Abstract: Existing word similarity measures are not robust to data sparseness since they rely only on the point estimation of words’ context profiles obtained from a limited amount of data. This paper proposes a Bayesian method for robust distributional word similarities. The method uses a distribution of context profiles obtained by Bayesian estimation and takes the expectation of a base similarity measure under that distribution. When the context profiles are multinomial distributions, the priors are Dirichlet, and the base measure is the Bhattacharyya coefficient, we can derive an analytical form that allows efficient calculation. For the task of word similarity estimation using a large amount of Web data in Japanese, we show that the proposed measure gives better accuracies than other well-known similarity measures.

3 0.68301201 145 acl-2010-Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment

Author: Marine Carpuat ; Yuval Marton ; Nizar Habash

Abstract: We study the challenges raised by Arabic verb and subject detection and reordering in Statistical Machine Translation (SMT). We show that post-verbal subject (VS) constructions are hard to translate because they have highly ambiguous reordering patterns when translated to English. In addition, implementing reordering is difficult because the boundaries of VS constructions are hard to detect accurately, even with a state-of-the-art Arabic dependency parser. We therefore propose to reorder VS constructions into SV order for SMT word alignment only. This strategy significantly improves BLEU and TER scores, even on a strong large-scale baseline and despite noisy parses.

4 0.59850645 133 acl-2010-Hierarchical Search for Word Alignment

Author: Jason Riesa ; Daniel Marcu

Abstract: We present a simple yet powerful hierarchical search algorithm for automatic word alignment. Our algorithm induces a forest of alignments from which we can efficiently extract a ranked k-best list. We score a given alignment within the forest with a flexible, linear discriminative model incorporating hundreds of features, and trained on a relatively small amount of annotated data. We report results on Arabic-English word alignment and translation tasks. Our model outperforms a GIZA++ Model-4 baseline by 6.3 points in F-measure, yielding a 1.1 BLEU score increase over a state-of-the-art syntax-based machine translation system.

5 0.59631866 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

Author: Xianpei Han ; Jun Zhao

Abstract: Name ambiguity problem has raised urgent demands for efficient, high-quality named entity disambiguation methods. In recent years, the increasing availability of large-scale, rich semantic knowledge sources (such as Wikipedia and WordNet) creates new opportunities to enhance the named entity disambiguation by developing algorithms which can exploit these knowledge sources at best. The problem is that these knowledge sources are heterogeneous and most of the semantic knowledge within them is embedded in complex structures, such as graphs and networks. This paper proposes a knowledge-based method, called Structural Semantic Relatedness (SSR), which can enhance the named entity disambiguation by capturing and leveraging the structural semantic knowledge in multiple knowledge sources. Empirical results show that, in comparison with the classical BOW based methods and social network based methods, our method can significantly improve the disambiguation performance by respectively 8.7% and 14.7%. 1

6 0.59618235 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

7 0.596053 93 acl-2010-Dynamic Programming for Linear-Time Incremental Parsing

8 0.59411919 146 acl-2010-Improving Chinese Semantic Role Labeling with Rich Syntactic Features

9 0.59346879 79 acl-2010-Cross-Lingual Latent Topic Extraction

10 0.5927242 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

11 0.59262609 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

12 0.59170008 202 acl-2010-Reading between the Lines: Learning to Map High-Level Instructions to Commands

13 0.59105563 87 acl-2010-Discriminative Modeling of Extraction Sets for Machine Translation

14 0.59100473 83 acl-2010-Dependency Parsing and Projection Based on Word-Pair Classification

15 0.59029907 5 acl-2010-A Framework for Figurative Language Detection Based on Sense Differentiation

16 0.59024715 80 acl-2010-Cross Lingual Adaptation: An Experiment on Sentiment Classifications

17 0.59006822 52 acl-2010-Bitext Dependency Parsing with Bilingual Subtree Constraints

18 0.59002304 116 acl-2010-Finding Cognate Groups Using Phylogenies

19 0.58908451 71 acl-2010-Convolution Kernel over Packed Parse Forest

20 0.58884883 48 acl-2010-Better Filtration and Augmentation for Hierarchical Phrase-Based Translation Rules