emnlp emnlp2010 emnlp2010-7 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. [sent-4, score-0.817]
2 Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e. [sent-5, score-0.603]
3 the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. [sent-7, score-0.379]
4 Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. [sent-8, score-0.495]
5 We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental. [sent-9, score-0.627]
6 Such vector-space representations of meaning induce measures of word similarity that can be tuned to correlate well with judgements made by humans. [sent-13, score-0.246]
7 In this paper, we introduce tiered clustering, a novel probabilistic model of the shared structure often neglected in clustering problems. [sent-30, score-0.859]
8 Tiered clustering performs soft feature selection, allocating features between a Dirichlet Process clustering model and a background model consisting of a single component. [sent-31, score-0.682]
9 The background model accounts for features commonly shared by all occurrences (i. [sent-32, score-0.291]
10 context-independent feature variation), while the clustering model accounts for variation in word usage (i. [sent-34, score-0.327]
11 Using the tiered clustering model, we derive a multi-prototype representation capable of capturing varying degrees of sharing between word senses, and demonstrate its effectiveness in lexical semantic tasks where such sharing is desirable. [sent-37, score-0.956]
12 In particular we show that tiered clustering outperforms the multi-prototype approach for (1) selectional preference (Resnik, 1997; Pantel et al. [sent-38, score-1.041]
13 predicting the typical filler of an argument slot of a verb, and (2) word-relatedness in the presence of highly polysemous words. [sent-43, score-0.186]
14 Such models can be evaluated based on their correlation with human-reported lexical similarity judgements using e. [sent-54, score-0.371]
15 Each boxed set shows the most common background (shared) features, and each prototype captures one thematic usage of the word. [sent-65, score-0.387]
16 For example, wizard is broken up into a background cluster describing features common to all usages of the word (e. [sent-66, score-0.302]
17 Distributional methods have also proven to be a powerful approach to modeling selectional preference (Padó et al. [sent-72, score-0.236]
18 In this section we briefly introduce a version of the multi-prototype model based on the Dirichlet Process Mixture Model (DPMM), capable of automatically inferring the number of prototypes necessary for each word (Rasmussen, 2000). [sent-80, score-0.218]
19 Multiple prototypes for each word w are generated by clustering feature vectors derived from each occurrence c ∈ C(w) in a large textual corpus and collecting the resulting cluster centroids π_k, k ∈ [1, K_w]. [sent-82, score-0.401]
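The occurrence feature vectors above are built from the words surrounding each occurrence. Below is a minimal sketch of that step, assuming a simple unigram context window; the window size, tokenization, and lack of weighting are illustrative placeholders rather than the authors' exact feature pipeline.

```python
# A hedged sketch of deriving an occurrence feature vector from a unigram
# context window, as assumed by the multi-prototype setup described above.
from collections import Counter

def occurrence_vector(tokens, position, window=5):
    """Bag-of-words counts of the unigrams around tokens[position]."""
    left = tokens[max(0, position - window):position]
    right = tokens[position + 1:position + 1 + window]
    return Counter(left + right)

# Example: one occurrence of "wizard"
sent = "the old wizard cast a powerful spell with his wand".split()
print(occurrence_vector(sent, sent.index("wizard")))
```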
20 The DPMM is an infinite capacity model capable of assigning data to a variable, but finite number of clusters Kw, with probability of assignment to cluster k proportional to the number of data points previously assigned to k. [sent-86, score-0.271]
21 Using this model, the number of clusters no longer needs to be fixed a priori, allowing the model to allocate expressivity dynamically to concepts with richer structure. [sent-88, score-0.174]
22 Such a model naturally allows the word representation to allocate additional capacity for highly polysemous words, with the number of clusters growing logarithmically with the number of occurrences. [sent-89, score-0.312]
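As a rough illustration of the behavior described here (assignment probability proportional to cluster size, with the number of clusters growing roughly logarithmically in the number of occurrences), the following sketch samples from the Chinese Restaurant Process prior only; the likelihood term of the actual DPMM is omitted, and eta = 1.0 is an illustrative value, not the paper's setting.

```python
# A minimal sketch of the CRP prior underlying the DPMM multi-prototype model:
# each occurrence joins an existing cluster with probability proportional to
# its current size, or opens a new cluster with probability proportional to eta.
import random
from collections import Counter

def crp_assignments(num_occurrences, eta=1.0, seed=0):
    rng = random.Random(seed)
    sizes = Counter()            # cluster id -> number of occurrences assigned
    assignments = []
    for _ in range(num_occurrences):
        clusters = list(sizes)
        weights = [sizes[k] for k in clusters] + [eta]    # existing tables + new table
        choice = rng.choices(clusters + [len(sizes)], weights=weights)[0]
        sizes[choice] += 1
        assignments.append(choice)
    return assignments, sizes

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        _, sizes = crp_assignments(n)
        print(n, "occurrences ->", len(sizes), "clusters")   # grows roughly like eta * log(n)
```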
23 4 Tiered Clustering Tiered clustering allocates features between two submodels: a (context-dependent) DPMM and a single (context-independent) background component. [sent-92, score-0.462]
24 This model is similar structurally to the feature selective clustering model proposed by Law et al. [sent-93, score-0.263]
25 However, instead of allocating entire feature dimensions between the model and background components, the tiered model allocates individual feature occurrences. (Figure 1: the tiered clustering graphical model.) [sent-95, score-0.252]
26 At a high level, the tiered model can be viewed as a combination of a multi-prototype model and a single-prototype back-off model. [sent-99, score-0.573]
27 Concretely, each word occurrence wd first selects a cluster φd from the DPMM; then each feature wi,d is generated from either the background model φback or the selected cluster φd, determined by the tier indicator zi,d. [sent-101, score-0.58]
28 Since the background topic is shared across all occurrences, it can account for features with context-independent variance, such as stop words and other high-frequency noise, as well as the central tendency of the collection (Table 1). [sent-104, score-0.273]
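A minimal generative sketch of this tiered process follows, with the DPMM over clusters replaced by a fixed finite set for brevity; the Bernoulli mixing weight lam standing in for the background/clustering allocation smoother is an illustrative assumption, not the paper's parameterization.

```python
# Sketch of the tiered generative process: each occurrence d picks a cluster
# phi_d, then every feature is drawn either from the shared background
# distribution phi_back or from phi_d, according to a tier indicator z.
import numpy as np

def generate_occurrence(phi_back, phi_clusters, cluster_weights, lam, n_features, rng):
    d_cluster = rng.choice(len(phi_clusters), p=cluster_weights)    # phi_d
    features, tiers = [], []
    for _ in range(n_features):
        z = rng.random() < lam                                      # tier indicator z_{i,d}
        dist = phi_back if z else phi_clusters[d_cluster]
        features.append(rng.choice(len(dist), p=dist))
        tiers.append("background" if z else "cluster")
    return d_cluster, features, tiers

rng = np.random.default_rng(0)
V, K = 10, 3                                   # toy vocabulary and cluster count
phi_back = rng.dirichlet(np.ones(V))           # shared (context-independent) tier
phi_clusters = rng.dirichlet(np.ones(V), K)    # context-dependent tiers
print(generate_occurrence(phi_back, phi_clusters, np.ones(K) / K, 0.4, 8, rng))
```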
29 5 Measuring Semantic Similarity Due to its richer representational structure, computing similarity in the multi-prototype model is less straightforward than in the single prototype case. [sent-108, score-0.265]
30 Reisinger and Mooney (2010) found that simply averaging all similarity scores over all pairs of prototypes (sampled from the cluster distributions) performs reasonably well and is robust to noise. [sent-109, score-0.331]
31 As cluster sizes become more uniform, AvgSim tends towards the single prototype similarity,1 hence the effectiveness of AvgSim stems from boosting the influence of small clusters. [sent-115, score-0.286]
32 Tiered clustering representations offer more possibilities for computing semantic similarity than multi-prototype, as the background prototype can be treated separately from the other prototypes. [sent-116, score-0.788]
33 We make use of a simple sum of the distance between the two background components, and the AvgSim of the two sets of clustering components. [sent-117, score-0.419]
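A hedged sketch of these similarity computations: AvgSim averages cosine similarity over all prototype pairs, and the tiered score adds the similarity of the two background components, following the description above. The function names and the plain unweighted sum are assumptions, not the authors' code.

```python
# Sketch of AvgSim over prototypes plus a background term for tiered clustering.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_sim(prototypes_w, prototypes_v):
    # average cosine similarity over all cross pairs of cluster centroids
    sims = [cosine(p, q) for p in prototypes_w for q in prototypes_v]
    return sum(sims) / len(sims)

def tiered_sim(background_w, prototypes_w, background_v, prototypes_v):
    # background component term + AvgSim of the clustering components
    return cosine(background_w, background_v) + avg_sim(prototypes_w, prototypes_v)
```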
34 2 Evaluation Methodology We evaluate the tiered clustering model on two problems from lexical semantics: word relatedness and selectional preference. [sent-126, score-1.043]
35 For the word relatedness (footnote 1: This can be problematic for certain clustering methods that specify uniform priors over cluster sizes; however, the DPMM naturally exhibits a linear decay in cluster sizes, with E[# clusters of size M] ∝ η/M.) [sent-127, score-0.65]
36 evaluation, we compared the predicted similarity of word pairs from each model to two collections of human similarity judgements: WordSim-353 (Finkelstein et al. [sent-132, score-0.226]
37 For selectional preference, we employ the Padó dataset, which contains 211 verb-noun pairs with human similarity judgements for how plausible the noun is for each argument of the verb (2 arguments per verb, corresponding roughly to subject and object). [sent-139, score-0.433]
38 In all cases correlation with human judgements is computed using Spearman’s nonparametric rank correlation (ρ) with average human judgements (Agirre et al. [sent-143, score-0.554]
39 Finally, semantic similarity between word pairs is computed using cosine distance (ℓ2-normalized dot product). [sent-148, score-0.196]
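A minimal sketch of this evaluation protocol, assuming word vectors are already available in a dictionary; the data loading and handling of out-of-vocabulary pairs are left out, and scipy's spearmanr is used for the rank correlation.

```python
# Sketch: compare model similarity scores to averaged human ratings via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def l2_normalize(v):
    return v / np.linalg.norm(v)

def evaluate(model_vectors, pairs, human_ratings):
    # model_vectors: dict word -> np.ndarray; pairs: list of (w1, w2) tuples
    model_scores = [float(np.dot(l2_normalize(model_vectors[a]),
                                 l2_normalize(model_vectors[b])))
                    for a, b in pairs]
    rho, _ = spearmanr(model_scores, human_ratings)
    return rho
```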
40 4 Feature Pruning Feature pruning is one of the most significant factors in obtaining high correlation with human similarity judgements using vector-space models, and has been suggested as one way to improve sense disambiguation for polysemous verbs (Xue et al. [sent-150, score-0.585]
41 In this section, we calibrate the single prototype and multiprototype methods on WS-353, reaching the limit of human and oracle performance and demonstrating robust performance gains even with semantically impoverished features. [sent-152, score-0.258]
42 ρ = 0.75 correlation on WS-353 using only unigram collocations and [sent-154, score-0.165]
43 ρ = 0.77 using a fixed-K multi-prototype representation (Figure 3; Reisinger and Mooney, 2010). [sent-155, score-0.171]
44 This result rivals average human performance, obtaining correlation near that of the supervised oracle approach of Agirre et al. [sent-156, score-0.165]
45 The optimal pruning cutoff depends on the feature weighting and number of prototypes as well as the feature representation. [sent-158, score-0.214]
46 Figure 4 breaks down the similarity pairs into four quantiles for each data set and then shows correlation separately for each quantile. [sent-161, score-0.398]
47 In general ratings for highly similar (dissimilar) pairs are more predictable (quantiles 1 and 4) than middle similarity pairs (quantiles 2, 3). [sent-166, score-0.202]
48 (... in semantic distance are easier for those ...) Feature pruning improves correlations in quantiles 2–4 while reducing correlation in quantile 1 (lowest similarity). [sent-169, score-0.482]
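A small sketch of the per-quantile analysis described above: word pairs are split into four quantiles by their averaged human rating, and Spearman's rho is computed separately within each quantile. The quantile boundaries and data layout are illustrative.

```python
# Sketch of per-quantile Spearman correlation for a word-relatedness dataset.
import numpy as np
from scipy.stats import spearmanr

def per_quantile_rho(human_ratings, model_scores, n_quantiles=4):
    human_ratings = np.asarray(human_ratings)
    model_scores = np.asarray(model_scores)
    order = np.argsort(human_ratings)          # sort pairs by human rating
    rhos = []
    for chunk in np.array_split(order, n_quantiles):
        rho, _ = spearmanr(human_ratings[chunk], model_scores[chunk])
        rhos.append(rho)
    return rhos                                # quantile 1 = lowest-similarity pairs
```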
49 7 Results We evaluate four models: (1) the standard singleprototype approach, (2) the DPMM multi-prototype approach outlined in §3, (3) a simple combinataipopnr oofa cthhe multi-prototype a (3nd) single-prototype approaches and (4) the tiered clustering approach (§4). [sent-171, score-0.805]
50 Unless otherwise specified, both DPMM multi-prototype and tiered clustering (MP+SP)4 ... (footnote 3: The fact that the per-quantile correlation is significantly lower than the full correlation, e. [sent-174, score-1.166]
51 (Footnote 4: MP+SP) Tiered clustering’s ability to model both shared and idiosyncratic structure can be easily approximated by using the single-prototype model as the shared component and the multi-prototype model as the clustering. [sent-177, score-0.279]
52 However, unlike in the tiered model, all features are assigned to both components. [sent-178, score-0.573]
53 In general tf-idf features are the most sensitive to pruning level, yielding the highest correlation for moderate levels of pruning and significantly lower correlation than other representations without pruning. [sent-184, score-0.576]
54 1, and tiered clustering uses α = 10 for the background/clustering allocation smoother. [sent-190, score-0.805]
55 In general, the approaches incorporating multiple prototypes outperform the single prototype (ρ = 0. [sent-193, score-0.255]
56 The tiered clustering model does not significantly outperform either the multi-prototype or MP+SP models on the full set, but yields significantly higher correlation on the high-polysemy set. [sent-197, score-1.032]
57 The tiered model generates more clusters than DPMM multi-prototype (27. [sent-198, score-0.682]
58 8), despite using the same hyperparameter settings: Since words commonly shared across clusters have been allocated to the background component, the cluster components have less overlap and hence the model naturally allocates more clusters. [sent-201, score-0.586]
59 Examples of the tiered clusterings for several words are shown. (Table: ErCs and background percentage per method, on the full and high-polysemy sets.) [sent-202, score-1.351]
60 All refers to the full set of pairs, high polysemy refers to the top 20% of pairs, ranked by sense count. [sent-238, score-0.232]
61 ErCs is the average number of clusters employed by each method, and background is the average percentage of features allocated by the tiered model to the background cluster. [sent-239, score-0.995]
62 In general the background component does indeed capture commonalities between all the sense clusters (e. [sent-242, score-0.387]
63 all wizards use magic) and hence the tiered clusters are more semantically pure. [sent-244, score-0.682]
64 Compared to the tiered clustering results in Table 1, the multi-prototype clusters are significantly less pure for thematically polysemous words such as radio and wizard. [sent-250, score-0.945]
65 On WN-Evocation, the single prototype and multi-prototype do not differ significantly in terms of correlation (ρ = 0. [sent-253, score-0.367]
66 201 respectively; Table 5), while SP+MP yields significantly lower correlation (ρ = 0. [sent-255, score-0.196]
67 176), and the tiered model yields significantly higher correlation (ρ = 0. [sent-256, score-0.769]
68 Restricting to the top 20% of pairs with highest human similarity judgements yields similar outcomes, with single prototype, multi-prototype and SP+MP statistically indistinguishable (ρ = 0. [sent-258, score-0.244]
69 235), and tiered clustering yielding significantly higher correlation (ρ = 0. [sent-261, score-0.235]
70 Likewise tiered clustering achieves the most significant gains on the high polysemy subset. [sent-263, score-0.997]
71 (Table: ρ for Single prototype and Multi-prototype, on the full and high-polysemy sets.) [sent-319, score-0.555]
72 (Table, continued: ρ for MP+SP and Tiered, on the full and high-polysemy sets.) [sent-336, score-0.384]
73 The background component of the tiered clustering model can capture such general argument structure. [sent-358, score-0.992]
74 We model each verb argument slot in the Padó set with a separate tiered clustering model, separating terms co-occurring with the target verb according to which slot they fill. [sent-359, score-0.961]
75 On the Padó set, the performance of the DPMM multi-prototype approach breaks down and it yields significantly lower correlation with human norms than the single prototype (ρ = 0. [sent-360, score-0.367]
76 Furthermore, combining with the single prototype does not significantly change its performance (ρ = 0. [sent-364, score-0.202]
77 Moving to the tiered model, however, yields significant improvements in correlation over the other models (ρ = 0. [sent-366, score-0.738]
78 294), primarily improving correlation in the case of highly polysemous verbs and arguments. [sent-367, score-0.303]
79 8 Discussion and Future Work We have demonstrated a novel model for distributional lexical semantics capable of capturing both shared (context-independent) and idiosyncratic (context-dependent) structure in a set of word occurrences. [sent-368, score-0.203]
80 The benefits of this tiered model were most pronounced on a selectional preference task, where there is significant shared structure imposed by conditioning on the verb. [sent-369, score-0.863]
81 Although our results on the Padó set are not state of the art,6 we believe this to be due to the impoverished vector-space design; tiered clustering can be applied to more expressive vector spaces, such as those incorporating dependency parse and FrameNet features. [sent-370, score-0.805]
82 One potential explanation for the superior performance of the tiered model vs. [sent-371, score-0.573]
83 the DPMM multiprototype model is simply that it allocates more clusters to represent each word (Reisinger and Mooney, 2010). [sent-372, score-0.239]
84 The additional clusters do not provide more semantic content due to significant background similarity. [sent-375, score-0.36]
85 Finally, the DPMM multi-prototype and tiered clustering models allocate clusters based on the variance of the underlying data set. [sent-376, score-0.979]
86 33) between the number of clusters allocated by the DPMM and the number of word senses found in WordNet. [sent-379, score-0.195]
87 This result is most likely due to our use of unigram context window features, which induce clustering based on thematic rather than syntactic differences. [sent-380, score-0.232]
88 (Future Work) The word similarity experiments can be expanded by breaking pairs down further into highly homonymous and highly polysemous pairs, using e. [sent-382, score-0.336]
89 With this data it would be interesting to validate the hypothesis that the percentage of features allocated to the background cluster is correlated with the degree of homonymy. [sent-385, score-0.35]
90 The basic tiered clustering can be extended with additional background tiers, allocating more expressivity to model background feature variation. [sent-386, score-1.244]
91 topic model (all background tiers) and a pure clustering model and may be reasonable when there is believed to be more background structure (e. [sent-392, score-0.638]
92 9 Conclusions This paper introduced a simple probabilistic model of tiered clustering inspired by feature selective clustering that leverages feature exchangeability to allocate data features between a clustering model and a shared component. [sent-408, score-1.484]
93 The ability to model background variation, or shared structure, is shown to be beneficial for modeling words with high polysemy, yielding increased correlation with human similarity judgements when modeling word relatedness and selectional preference. [sent-409, score-0.889]
94 Furthermore, the tiered clustering model is shown to significantly outperform related models, yielding qualitatively more precise clusters. [sent-410, score-0.875]
95 Word occurrence features are then drawn from a combination of a single cluster component indicated by cd and the background topic. [sent-415, score-0.356]
96 The update rule for the latent tier indicator z is similar to the update rule for 2-topic LDA, with the background component as the first topic and the second topic being determined by the per-word-occurrence cluster indicator c. [sent-421, score-0.409]
97 z_{-(i,d)} is shorthand for the set z \ {z_{i,d}}; n_w^{(t)} is the number of occurrences of word w in topic t, not counting w_{i,d}; and n_t^{(d)} is the number of features in occurrence d assigned to topic t, not counting w_{i,d}. [sent-430, score-0.168]
98 Likewise sampling the cluster indicators conditioned on the data p(c_d | w, c_{-d}, ... [sent-431, score-0.199]
99 n_k^{(-d)} is the number of occurrences assigned to k, not including d; ñ_k^{(d)} is the vector of counts of words from occurrence w_d assigned to ... (Footnote 7: Effectively, the tiered clustering model is a special case of the nested Chinese Restaurant Process with the tree depth fixed to two (Blei et al. [sent-452, score-0.978]
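A hedged sketch of the collapsed Gibbs update for the tier indicator z described above, written like a 2-topic LDA step in which topic 0 is the shared background component and the alternative is the cluster indicated by c_d; the count-array layout and the hyperparameters alpha and beta are schematic assumptions, and the exact normalization in the paper may differ.

```python
# Sketch of resampling the tier indicator z for one feature token (w, d).
import numpy as np

def sample_tier(w, d, c_d, n_word_topic, n_topic, n_occ_tier, alpha, beta, V, rng):
    # topic 0 = shared background component; topics 1..K = cluster components
    topic_of_tier = [0, 1 + c_d]           # row indices into the count tables
    probs = np.empty(2)
    for tier in (0, 1):
        t = topic_of_tier[tier]
        word_term = (n_word_topic[t, w] + beta) / (n_topic[t] + V * beta)
        tier_term = n_occ_tier[d, tier] + alpha
        probs[tier] = word_term * tier_term
    probs /= probs.sum()
    return rng.choice(2, p=probs)          # 0 = background tier, 1 = cluster tier
```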
100 A study on similarity and relatedness using distributional and Wordnet-based approaches. [sent-461, score-0.232]
wordName wordTfidf (topN-words)
[('tiered', 0.573), ('dpmm', 0.236), ('clustering', 0.232), ('polysemy', 0.192), ('background', 0.187), ('prototype', 0.171), ('correlation', 0.165), ('selectional', 0.159), ('pad', 0.154), ('cluster', 0.115), ('judgements', 0.112), ('clusters', 0.109), ('polysemous', 0.106), ('mp', 0.104), ('reisinger', 0.104), ('quantiles', 0.101), ('similarity', 0.094), ('multiprototype', 0.087), ('ppcd', 0.084), ('quantile', 0.084), ('prototypes', 0.084), ('relatedness', 0.079), ('preference', 0.077), ('sp', 0.073), ('pruning', 0.068), ('avgsim', 0.068), ('ercs', 0.068), ('evocation', 0.068), ('allocate', 0.065), ('semantic', 0.064), ('mooney', 0.061), ('gabrilovich', 0.06), ('distributional', 0.059), ('finkelstein', 0.058), ('pantel', 0.058), ('occurrence', 0.054), ('shared', 0.054), ('commonalities', 0.051), ('markovitch', 0.051), ('occurrences', 0.05), ('allocated', 0.048), ('slot', 0.048), ('capable', 0.047), ('austin', 0.045), ('spearman', 0.045), ('allocates', 0.043), ('tier', 0.043), ('semantics', 0.043), ('restaurant', 0.042), ('dirichlet', 0.042), ('dp', 0.041), ('degrees', 0.04), ('sense', 0.04), ('representations', 0.04), ('yielding', 0.039), ('law', 0.039), ('senses', 0.038), ('pairs', 0.038), ('clusterings', 0.036), ('erk', 0.036), ('wd', 0.035), ('variation', 0.035), ('allocating', 0.034), ('dq', 0.034), ('exchangeability', 0.034), ('gdelen', 0.034), ('gorman', 0.034), ('guadalupe', 0.034), ('herda', 0.034), ('hindle', 0.034), ('homonymous', 0.034), ('ilarity', 0.034), ('magic', 0.034), ('nq', 0.034), ('ntpdq', 0.034), ('pado', 0.034), ('parsons', 0.034), ('raters', 0.034), ('sanborn', 0.034), ('selectionally', 0.034), ('shafto', 0.034), ('ulrike', 0.034), ('agirre', 0.034), ('cp', 0.034), ('pas', 0.034), ('sch', 0.033), ('highly', 0.032), ('accounting', 0.032), ('sebastian', 0.032), ('topic', 0.032), ('lda', 0.032), ('feature', 0.031), ('significantly', 0.031), ('griffiths', 0.031), ('verb', 0.03), ('wordnet', 0.03), ('hyperparameter', 0.03), ('landauer', 0.03), ('patrick', 0.03), ('usage', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
2 0.14307366 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman
Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abil- ities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on nonEnglish corpora.
3 0.14107576 77 emnlp-2010-Measuring Distributional Similarity in Context
Author: Georgiana Dinu ; Mirella Lapata
Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.
4 0.1102255 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
Author: Roberto Navigli ; Giuseppe Crisafulli
Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
5 0.11008683 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
Author: Longhua Qian ; Guodong Zhou
Abstract: Seed sampling is critical in semi-supervised learning. This paper proposes a clusteringbased stratified seed sampling approach to semi-supervised learning. First, various clustering algorithms are explored to partition the unlabeled instances into different strata with each stratum represented by a center. Then, diversity-motivated intra-stratum sampling is adopted to choose the center and additional instances from each stratum to form the unlabeled seed set for an oracle to annotate. Finally, the labeled seed set is fed into a bootstrapping procedure as the initial labeled data. We systematically evaluate our stratified bootstrapping approach in the semantic relation classification subtask of the ACE RDC (Relation Detection and Classification) task. In particular, we compare various clustering algorithms on the stratified bootstrapping performance. Experimental results on the ACE RDC 2004 corpus show that our clusteringbased stratified bootstrapping approach achieves the best F1-score of 75.9 on the subtask of semantic relation classification, approaching the one with golden clustering.
6 0.093331322 84 emnlp-2010-NLP on Spoken Documents Without ASR
7 0.071966708 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
8 0.071644284 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
9 0.069865316 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
10 0.063857757 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
11 0.058464974 22 emnlp-2010-Automatic Evaluation of Translation Quality for Distant Language Pairs
12 0.056735184 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging
13 0.055594094 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
14 0.055044468 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web
15 0.05312404 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
16 0.0514239 95 emnlp-2010-SRL-Based Verb Selection for ESL
17 0.050421331 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
18 0.049371284 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
19 0.048982434 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
20 0.048052356 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
topicId topicWeight
[(0, 0.174), (1, 0.127), (2, -0.083), (3, 0.048), (4, 0.034), (5, 0.034), (6, -0.221), (7, -0.001), (8, 0.191), (9, 0.092), (10, 0.096), (11, -0.081), (12, -0.102), (13, -0.117), (14, -0.014), (15, -0.224), (16, -0.023), (17, -0.104), (18, 0.117), (19, -0.106), (20, -0.007), (21, 0.083), (22, -0.034), (23, -0.064), (24, 0.125), (25, 0.03), (26, 0.087), (27, 0.045), (28, 0.043), (29, 0.003), (30, 0.001), (31, -0.001), (32, -0.035), (33, 0.081), (34, 0.017), (35, 0.075), (36, -0.022), (37, 0.128), (38, -0.149), (39, -0.021), (40, -0.009), (41, 0.141), (42, 0.034), (43, -0.139), (44, -0.0), (45, 0.038), (46, 0.101), (47, -0.019), (48, 0.133), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.95785034 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
2 0.54374713 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman
Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abil- ities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on nonEnglish corpora.
3 0.51255447 77 emnlp-2010-Measuring Distributional Similarity in Context
Author: Georgiana Dinu ; Mirella Lapata
Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.
4 0.49725235 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering
Author: Roberto Navigli ; Giuseppe Crisafulli
Abstract: In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). We first acquire the senses (i.e., meanings) of a query by means of a graphbased clustering algorithm that exploits cycles (triangles and squares) in the co-occurrence graph of the query. Then we cluster the search results based on their semantic similarity to the induced word senses. Our experiments, conducted on datasets of ambiguous queries, show that our approach improves search result clustering in terms of both clustering quality and degree of diversification.
5 0.47826037 84 emnlp-2010-NLP on Spoken Documents Without ASR
Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church
Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-ofvocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼ 1 sec) repetitions in speech, fainndd scl luostnegrs t∼he 1m sinecto) pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudoterms; performance on a Switchboard task approaches a baseline using gold standard man- ual transcriptions.
6 0.45564505 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification
7 0.424449 124 emnlp-2010-Word Sense Induction Disambiguation Using Hierarchical Random Graphs
8 0.38534552 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
9 0.33444846 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution
10 0.31999022 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
11 0.31122687 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors
12 0.29917216 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval
13 0.29050234 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs
14 0.22656067 13 emnlp-2010-A Simple Domain-Independent Probabilistic Approach to Generation
15 0.22194968 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective
16 0.21849765 112 emnlp-2010-Unsupervised Discovery of Negative Categories in Lexicon Bootstrapping
17 0.21716569 12 emnlp-2010-A Semi-Supervised Method to Learn and Construct Taxonomies Using the Web
18 0.20589216 52 emnlp-2010-Further Meta-Evaluation of Broad-Coverage Surface Realization
19 0.20029587 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
20 0.19827789 95 emnlp-2010-SRL-Based Verb Selection for ESL
topicId topicWeight
[(3, 0.014), (10, 0.012), (12, 0.035), (17, 0.014), (22, 0.01), (27, 0.113), (28, 0.132), (29, 0.175), (30, 0.048), (32, 0.021), (52, 0.027), (56, 0.064), (62, 0.016), (66, 0.121), (72, 0.04), (76, 0.025), (77, 0.012), (87, 0.024), (89, 0.011)]
simIndex simValue paperId paperTitle
1 0.89332289 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation
Author: Liang Huang ; Haitao Mi
Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in averagecase polynomial-time in theory, and lineartime with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).
same-paper 2 0.88007718 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
Author: Joseph Reisinger ; Raymond Mooney
Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.
3 0.77765667 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts
Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng
Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.
4 0.76707947 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities
Author: Adria de Gispert ; Juan Pino ; William Byrne
Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignmentmodel. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteri- ors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-totarget and target-to-source alignment models is to build two separate systems and combine their output translation lattices.
5 0.76667237 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation
Author: Taesun Moon ; Katrin Erk ; Jason Baldridge
Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for partof-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.
6 0.76366627 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
7 0.76320535 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
8 0.76109445 77 emnlp-2010-Measuring Distributional Similarity in Context
9 0.75446314 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar
10 0.75044876 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging
11 0.74723744 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars
12 0.74713093 63 emnlp-2010-Improving Translation via Targeted Paraphrasing
13 0.74674392 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction
14 0.74634564 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions
15 0.74622744 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification
16 0.74464869 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
17 0.74359947 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding
18 0.74352747 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
19 0.74176973 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment
20 0.74135566 109 emnlp-2010-Translingual Document Representations from Discriminative Projections