emnlp emnlp2010 emnlp2010-87 knowledge-graph by maker-knowledge-mining

87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space


Source: pdf

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space Marco Baroni and Roberto Zamparelli Center for Mind/Brain Sciences, University of Trento Rovereto (TN), Italy {marco . [sent-1, score-0.415]

2 Abstract We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. [sent-3, score-0.878]

3 Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. [sent-4, score-0.256]

4 We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task. [sent-6, score-0.816]

5 In the simplest case, the meaning of an attributive adjective-noun (AN) constituent can be obtained as the intersection of the adjective and noun extensions A∩N: ⟦red car⟧ = ⟦red⟧ ∩ ⟦car⟧. [sent-16, score-0.685]

6 Even for red, the manner in which the color combines with a noun will be different in red Ferrari (the outside), red watermelon (the inside), red traffic light (the signal). [sent-41, score-0.39]

7 These problems have prompted a more flexible FS representation for attributive adjectives: functions from the meaning of a noun onto the meaning of a modified noun (Montague, 1970a). [sent-42, score-0.769]

8 This mapping could now be sensitive to the particular noun the adjective receives, and it does not need to return a subset of the noun denotation. [sent-43, score-0.52]

9 ), and because very frequent adjectives such as different are at the border between content and function words. [sent-50, score-0.234]

10 Following the insight of FS, we treat attributive adjectives as functions over noun meanings; however, noun meanings are vectors, not sets, and the functions are learnt from corpus-based noun-AN vector pairs. [sent-51, score-0.742]

11 Original contribution: We propose and evaluate a new method to derive distributional representations for ANs, where an adjective is a linear function from a vector (the noun representation) to another vector (the AN representation). [sent-52, score-0.804]

12 The linear map for a specific adjective is learnt, using linear regression, from pairs of noun and AN vectors extracted from a corpus. [sent-53, score-0.856]

13 Section 5 provides some empirical justification for using corpusharvested AN vectors as the target of our function learning and evaluation benchmark. [sent-57, score-0.256]

14 In Section 6, we show that our model outperforms other approaches at the task of approximating such vectors for unseen ANs. [sent-58, score-0.256]

15 In Section 7, we discuss how adjectival meaning can be represented in our model and evaluate this representation in an adjective clustering task. [sent-59, score-0.475]

16 2 Related work. The literature on compositionality in vector-based semantics encompasses various related topics, some of them not of direct interest here, such as how to encode word order information in context vectors (Jones and Mewhort, 2007; Sahlgren et al. [sent-61, score-0.377]

17 , 2008) or sophisticated composition methods based on tensor products, quantum logic, etc. [sent-62, score-0.275]

18 Closer to our current purposes is the general framework for vector composition proposed by Mitchell and Lapata (2008), subsuming various earlier proposals. [sent-64, score-0.247]

19 Given two vectors u and v, they identify two general classes of composition models: (linear) additive models, p = Au + Bv (1), where A and B are weight matrices, and multiplicative models, p = Cuv, where C is a weight tensor projecting the tensor product of u and v onto the space of p. [sent-65, score-0.932]

20 Their simplified additive model p = αu + βv was a common approach to composition in the earlier literature, typically with the scalar weights set to 1 or to normalizing constants (Foltz et al. [sent-67, score-0.37]

21 Mitchell and Lapata also consider a constrained version of the multiplicative approach that reduces to componentwise multiplication, where the i-th component of the composed vector is given by: p_i = u_i v_i. [sent-69, score-0.291]

22 The simplified additive model produces a sort of (statistical) union of features, whereas component-wise multiplication has an intersective effect. [sent-70, score-0.262]

23 They also evaluate a weighted combination of the simplified additive and multiplicative functions. [sent-71, score-0.315]

24 Guevara adopts the full additive composition form from Equation (1) and estimates the A and B weights using partial least squares regression. [sent-78, score-0.388]

25 Guevara compares his model to the simplified additive and multiplicative models of Mitchell and Lapata. [sent-80, score-0.315]

26 Observed ANs are nearer, in the space of observed and predicted test set ANs, to the ANs generated by his model than to those from the alternative approaches. [sent-81, score-0.244]

27 The additive model, on the other hand, is best in terms of shared neighbor count between observed and predicted ANs. [sent-82, score-0.342]

28 In our empirical tests, we compare our approach to the simplified additive and multiplicative models of Mitchell and Lapata (the former with normalization constants as scalar weights) as well as to Guevara’s approach. [sent-83, score-0.353]

29 3 Adjectives as linear maps. As discussed in the introduction, we will take adjectives in attributive position to be functions from one noun meaning to another. [sent-84, score-0.59]

30 In the framework of Mitchell and Lapata, our approach derives from the additive form in Equation (1) with the matrix multiplying the adjective vector (say, A) set to 0: p = Bv, where p is the observed AN vector, B the weight matrix representing the adjective at hand, and v a noun vector. [sent-86, score-1.448]

31 In our approach, the weight matrix B is specific to a single adjective; as we will see in Section 7 below, it is our representation of the meaning of the adjective. [sent-87, score-0.566]

32 In our case, the independent variables for the regression equations are the dimensions of the corpus-based vectors of the component nouns, whereas the AN vectors provide the dependent variables. [sent-89, score-0.645]
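
As an illustration of this setup, the sketch below fits an adjective-specific matrix from corpus-observed noun-AN vector pairs using ordinary least squares; the paper itself estimates the coefficients with partial least squares regression (see Section 4), and all function and variable names here are ours, not the authors'.

```python
import numpy as np

def fit_adjective_matrix(noun_vecs, an_vecs):
    """Estimate an adjective matrix B such that B @ noun approximates AN.

    noun_vecs: (k, d) array, one row per noun observed with this adjective.
    an_vecs:   (k, d) array, the corresponding corpus-observed AN vectors.
    """
    # Least-squares solution of noun_vecs @ B.T ~ an_vecs: each AN dimension
    # is modeled as a linear combination of the noun dimensions.
    Bt, _, _, _ = np.linalg.lstsq(noun_vecs, an_vecs, rcond=None)
    return Bt.T

def compose(B, noun_vec):
    """Predict the AN vector for a (possibly unseen) noun: p = B v."""
    return B @ noun_vec
```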

33 First, although we use a supervised learning method (least squares regression), we do not need hand-annotated data, since the target AN vectors are automatically collected from the corpus just like vectors for single words are. [sent-92, score-0.582]

34 Second, our approach rests on the assumption that the corpus-derived AN vectors are interesting objects that should constitute the target of what a composition process tries to approximate. [sent-94, score-0.424]

35 Fourth, the approach is naturally syntax-sensitive, since we train it on observed data for a specific syntactic position: we would train separate linear models for, say, the same adjective in attributive (AN) and predicative (N is A) position. [sent-99, score-0.639]

36 Finally, although adjective representations are not directly harvested from corpora, we can still meaningfully compare adjectives to each other or other words by using their estimated matrix, or an average vector for the ANs that contain them: both options are tested in Section 7 below. [sent-101, score-0.778]

37 2 Vocabulary. We could in principle limit ourselves to collecting vectors for the ANs to be analyzed (the AN test set) and their components. [sent-119, score-0.256]

38 However, to make the analysis more challenging and interesting, we populate the semantic space where we will look at the behaviour of the ANs with a large number of adjectives and nouns, as well as further ANs not in the test set. [sent-120, score-0.366]

39 We refer to the overall list of items we build semantic vectors for as the extended vocabulary. [sent-121, score-0.412]

40 We use a subset of the extended vocabulary containing only nouns and adjectives (the core vocabulary) for feature selection and dimensionality reduction, so that we do not implicitly bias the structure of the semantic space by our choice of ANs. [sent-122, score-0.583]

41 We were careful to include intersective cases such as electronic as well as non-intersective adjectives that are almost function words (the modals, different, etc.). [sent-124, score-0.267]

42 By crossing the selected adjectives and nouns, we constructed a test set containing 26,440 ANs, all attested in the sample corpus (734 ANs per adjective on average, ranging from 1,337 for new to 202 for mental). [sent-127, score-0.642]

43 The core vocabulary contains the top 8K most frequent noun lemmas and top 4K adjective lemmas from the concatenated corpus (excluding the top 50 most frequent nouns and adjectives). [sent-128, score-0.805]

44 In total, the extended vocabulary contains 40,999 entries: 8,043 nouns, 4,016 adjectives and 28,940 ANs. [sent-130, score-0.311]

45 3 Semantic space construction. Full co-occurrence matrix: the 10K lemmas (nouns, adjectives or verbs) that co-occur with the largest number of items in the core vocabulary constitute the dimensions (columns) of our co-occurrence matrix. [sent-132, score-0.702]

46 To avoid bias in favour of dimensions that capture variance in the test set ANs, we applied SVD to the core vocabulary subset of the co-occurrence matrix (containing only adjective and noun rows). [sent-138, score-0.795]

47 The other row vectors of the full co-occurrence matrix (including the ANs) were projected onto the same reduced space by multiplying them by a matrix containing the first n right singular vectors as columns. [sent-141, score-0.536]
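
A minimal sketch of this reduction-and-projection step, assuming the core-vocabulary rows and the full matrix are available as dense numpy arrays (function and variable names are ours; the paper does not publish code):

```python
import numpy as np

def reduce_and_project(core_rows, all_rows, n_dims=300):
    """SVD on the core (noun and adjective) rows only, then project every
    row of the full co-occurrence matrix into the same reduced space."""
    # Thin SVD of the core submatrix: core_rows ~ U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(core_rows, full_matrices=False)
    V_n = Vt[:n_dims].T            # first n right singular vectors as columns
    return all_rows @ V_n          # e.g. a 40,999 x 300 reduced matrix
```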

49 Merging the items used to compute the SVD and those projected onto the resulting space, we obtain a 40,999 × 300 matrix representing 8,043 nouns, 4,016 adjectives and 28,940 ANs. [sent-142, score-0.428]

50 This reduced matrix constitutes a realistically sized semantic space, that also contains many items that are not part of our test set, but will be potential neighbors of the observed and predicted test ANs in the experiments to follow. [sent-143, score-0.402]

51 4 Composition methods. In the proposed adjective-specific linear map (alm) method, an AN is generated by multiplying an adjective weight matrix with a noun (column) vector. [sent-150, score-0.7]

52 The j weights in the i-th row of the matrix are the coefficients of a linear regression predicting the values of the i-th dimension of the AN vector as a linear combination of the j dimensions of the component noun. [sent-151, score-0.507]

53 The linear regression coefficients are estimated separately for each of the 36 tested adjectives from the corpus-observed noun-AN pairs containing that adjective (observed adjective vectors are not used). [sent-152, score-1.463]

54 Since we are working in the 300-dimensional right singular vector space, for each adjective we have 300 regression problems with 300 independent variables, and the training data (the noun-AN pairs available for each test set adjective) range from about 200 to more than 1K items. [sent-153, score-0.601]

55 We estimate the coefficients using (multivariate) partial least squares regression (PLSR) as implemented in the R pls package (Mevik and Wehrens, 2007). [sent-154, score-0.253]

56 We picked instead 50 latent variables, by the rule-of-thumb reasoning that for any adjective we can use at least 200 noun-AN pairs for training, and the independent-variable-to-training-item ratio will thus never be above 1/4. [sent-160, score-0.441]

57 We adopt a leave-one-out training regime, so that each target AN is generated by an adjective matrix that was estimated from all the other ANs with the same adjective, minus the target. [sent-161, score-0.499]
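
The authors use the R pls package; as a hedged illustration of the alm training regime just described (50 latent variables, leave-one-out over the noun-AN pairs of a single adjective), here is roughly analogous logic using scikit-learn's PLSRegression as a stand-in. The helper name and data layout are our assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def loo_alm_predictions(noun_vecs, an_vecs, n_components=50):
    """Leave-one-out alm predictions for a single adjective.

    noun_vecs, an_vecs: (k, 300) arrays of corpus-observed noun/AN pairs.
    Each AN is predicted by a model trained on all the other pairs.
    """
    k = noun_vecs.shape[0]
    preds = np.empty_like(an_vecs)
    for i in range(k):
        keep = np.arange(k) != i
        pls = PLSRegression(n_components=n_components)
        pls.fit(noun_vecs[keep], an_vecs[keep])
        preds[i] = pls.predict(noun_vecs[i:i + 1])[0]
    return preds
```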

58 We use PLSR with 50 latent variables also for our re-implementation of Guevara’s (2010) single linear map (slm) approach, in which a single regression matrix is estimated for all ANs across adjectives. [sent-162, score-0.244]

59 The training data in this case are given by the concatenation of the observed adjective and noun vectors (600 independent variables) coupled with the corresponding AN vectors (300 dependent variables). [sent-163, score-1.125]

60 For each target AN, we randomly sample 2,000 other adjective-noun-AN tuples for training (with larger training sets we run into memory problems), and use the resulting coefficient matrix to generate the AN vector from the concatenated target adjective and noun vectors. [sent-164, score-0.727]
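
A hedged sketch of this slm re-implementation as described: concatenated adjective and noun vectors give 600 input dimensions, AN vectors give 300 outputs, and a random sample of 2,000 training tuples keeps memory manageable. Names are ours, and the estimator is again a scikit-learn stand-in for the R pls package.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def slm_predict(adj_vec, noun_vec, train_adj, train_noun, train_an,
                n_components=50, sample_size=2000, seed=0):
    """Single linear map (slm): one regression shared across adjectives."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_an), size=min(sample_size, len(train_an)),
                     replace=False)
    # Inputs: [adjective; noun] concatenations; outputs: AN vectors.
    X = np.hstack([train_adj[idx], train_noun[idx]])
    pls = PLSRegression(n_components=n_components).fit(X, train_an[idx])
    return pls.predict(np.concatenate([adj_vec, noun_vec])[None, :])[0]
```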

61 Additive AN vectors (add method) are obtained by summing the corresponding adjective and noun vectors after normalizing them (non-normalized addition was also tried, but it did not work nearly as well as the normalized variant). [sent-165, score-1.032]

62 Multiplicative vectors (mult method) were obtained by componentwise multiplication of the adjective and noun vectors (normalization does not matter here since it amounts to multiplying the composite vector by a scalar, and the cosine similarity measure we use is scale-invariant). [sent-166, score-1.351]

63 Finally, the adj and noun baselines use the adjective and noun vectors, respectively, as surrogates of the AN vector. [sent-167, score-0.66]

64 We tried to alleviate this problem by assigning a 0 to composite dimensions where the two input vectors had different signs. [sent-171, score-0.425]
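
For concreteness, a small sketch of the add and mult composition functions and the sign-masking variant just mentioned (function names are ours; the normalization choices and the scale-invariance of cosine follow the description above):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def add_compose(adj_vec, noun_vec):
    # Normalized addition (the variant reported to work best).
    return normalize(adj_vec) + normalize(noun_vec)

def mult_compose(adj_vec, noun_vec):
    # Component-wise multiplication; scale does not matter under cosine.
    return adj_vec * noun_vec

def mult_signfix(adj_vec, noun_vec):
    # Variant described above: zero out dimensions where the two inputs
    # disagree in sign (an issue once vectors live in SVD space).
    p = adj_vec * noun_vec
    return np.where(np.sign(adj_vec) == np.sign(noun_vec), p, 0.0)

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```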

65 Our method relies on the hypothesis that the semantics of AN composition does not depend on the independent distribution of adjectives themselves, but on how adjectives transform the distribution of nouns, as evidenced by observed pairs of noun-AN vectors. [sent-175, score-0.775]

66 Moreover, coherently with this view, our evaluation below will be based on how closely the models approximate the observed vectors of unseen ANs. [sent-176, score-0.349]

67 That our goal in modeling composition should be to approximate the vectors of observed ANs is in a sense almost trivial. [sent-177, score-0.517]

68 Whether we synthesize an AN for generation or decoding purposes, we would want the synthetic AN to look as much as possible like a real AN in its natural usage contexts, and co-occurrence vectors of observed ANs are a summary of their usage in actual linguistic contexts. [sent-178, score-0.349]

69 However, it might be the case that the specific resources we used for our vector construction procedure are not appropriate, so that the specific observed AN vectors we extract are not reliable (e. [sent-179, score-0.428]

70 First, we computed centroids from the normalized SVD space vectors of all the ANs that share the same adjective (e.g., the normalized vectors of American adult, American menu, etc.). [sent-183, score-0.78]

72 We looked at the nearest neighbors of these centroids in semantic space among the 41K items (adjectives, nouns and ANs) in our extended vocabulary (here and in all experiments below, similarity is quantified by the cosine of the angle between two vectors). [sent-187, score-0.567]
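
A minimal sketch of the centroid and nearest-neighbor computations described here, assuming the 41K reduced vectors are stored as rows of a matrix with a parallel list of labels (our layout, not the authors'):

```python
import numpy as np

def centroid(vectors):
    """Centroid of length-normalized vectors, e.g. all ANs sharing an adjective."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return normed.mean(axis=0)

def nearest_neighbors(query, space, labels, k=3):
    """Top-k items of the semantic space by cosine similarity to the query."""
    sims = (space @ query) / (np.linalg.norm(space, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]
    return [(labels[i], float(sims[i])) for i in top]
```

For example, `nearest_neighbors(centroid(american_an_vecs), space, labels)` would return the three items closest to the American-AN centroid.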

73 Table 2 reports the nearest 3 neighbors of 9 randomly selected ANs involving different adjectives (we inspected a larger random set, coming to similar conclusions to the ones emerging from this table). [sent-191, score-0.405]

74 The nearest neighbors of the corpus-based AN vectors in Table 2 make in general intuitive sense. [sent-193, score-0.427]

75 Importantly, the neighbors pick up the composite meaning rather than that of the adjective or noun alone. [sent-194, score-0.783]

76 Indeed, we will argue in the next section that there are cases in which the corpus-derived AN vector might not be a good approximation to our semantic intuitions about the AN, and a model-composed AN vector is a better semantic surrogate. [sent-199, score-0.282]

77 Still, the neighbors of average and AN-specific vectors of Tables 1 and 2 suggest that, for the bulk of ANs, such corpus-based co-occurrence vectors are semantically reasonable. [sent-201, score-0.615]

78 We use nearness to the corpus-observed vectors of held-out ANs as a very direct way to evaluate the quality of model-generated ANs, since we just saw that the observed ANs look reasonable (but see the caveats at the end of this section). [sent-203, score-0.382]
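
A sketch of this evaluation, assuming the semantic space is a matrix of row vectors and the observed AN is identified by its row index (our layout); the rank of the observed AN among the cosine neighbors of the model-predicted vector is the quantity summarized by the median ranks discussed below:

```python
import numpy as np

def rank_of_observed(predicted, observed_idx, space):
    """Rank of the observed AN among all items, ordered by cosine similarity
    to the model-predicted vector (1 = the observed AN is the nearest item)."""
    sims = (space @ predicted) / (np.linalg.norm(space, axis=1)
                                  * np.linalg.norm(predicted))
    order = np.argsort(-sims)
    return int(np.where(order == observed_idx)[0][0]) + 1
```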

79 The difference with the second best model, add (the only other model that does better than the non-trivial baseline of using the component noun vector as a surrogate for AN), is highly statistically significant (Wilcoxon signed rank test, p < 0. [sent-215, score-0.266]

80 If we randomly downsample the AN set to keep an equal number of ANs per adjective (200), the difference is still significant with p below the same threshold, indicating that the general result is not due to a better performance of alm on a few common adjectives. [sent-217, score-0.52]

81 Multiplicative models are, in general, better than additive ones in compositionality tasks (see Section 2 above). [sent-220, score-0.343]

82 The single linear mapping model (slm) proposed by Guevara (2010) is doing even worse than the multiplicative method, suggesting that a single set of weights does not provide enough flexibility to model a variety of adjective transformations successfully. [sent-222, score-0.599]

83 This is at odds with Guevara’s experiment in which slm outperformed mult and add on the task of ranking predicted ANs with respect to a target observed AN. [sent-223, score-0.295]

84 There is a high inverse correlation between median rank and adjective frequency (Spearman’s ρ = −0. [sent-228, score-0.492]

85 Although, in relative terms and considering the difficulty of the task, alm performs well, it is still far from perfect: for 27% of alm-predicted ANs, the observed vector is not even in the top 1K neighbor set! [sent-232, score-0.242]

86 The left side of Table 4 compares the nearest neighbors (excluding each other) of the observed and alm-predicted vectors in 10 randomly selected cases where the observed AN is the nearest neighbor of the predicted one. [sent-234, score-0.66]

87 The right side of Table 4 shows the nearest neighbors of predicted and observed ANs for a random set of cases where the observed AN ranks far from the predicted one. [sent-238, score-0.485]

89 Moving to the right, we see 10 random examples of ANs where the observed AN was at least 999 neighbors apart from the alm prediction. [sent-245, score-0.341]

90 Second, at least subjectively, we find that in many cases the nearest neighbor of the predicted AN is actually more sensible than that of the observed AN: current element (vs. [sent-250, score-0.326]

91 In the other cases, the predicted AN neighbor is at least not obviously worse than the observed AN neighbor. [sent-255, score-0.258]

92 48), suggesting that our model is worse at approximating the observed vectors of rare forms, which might, in turn, be those for which the corpus-based representation is less reliable. [sent-257, score-0.256]

93 In these cases, dissimilarities between observed and expected vectors, rather than signaling problems with the model, might indicate that the predicted vector, based on a composition function learned from many examples, 1191 is better than the one directly extracted from the corpus. [sent-258, score-0.342]

94 7 Study 3: Comparing adjectives. If adjectives are functions, and not corpus-derived vectors, is it still possible to compare them meaningfully? [sent-260, score-0.468]

95 , representing the adjectives directly with their corpus co-occurrence profile vectors (in our case, projected onto the SVD-reduced space). [sent-264, score-0.53]
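
The two model-internal representations can be compared with standard similarity machinery. The sketch below illustrates both options; unfolding the estimated matrix into a long vector before taking cosines is our assumption about a reasonable procedure, not necessarily the authors' exact one.

```python
import numpy as np

def matrix_similarity(B1, B2):
    """Compare two adjectives via their estimated matrices by unfolding
    each matrix into a vector and taking the cosine."""
    u, v = B1.ravel(), B2.ravel()
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def centroid_representation(an_vectors):
    """Alternative representation: centroid of the normalized vectors of
    all the ANs that contain the adjective."""
    normed = an_vectors / np.linalg.norm(an_vectors, axis=1, keepdims=True)
    return normed.mean(axis=0)
```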

96 Both representations achieve similar performance, and they are (slightly) better than the traditional method based on adjective co-occurrence vectors. [sent-273, score-0.408]

97 We conclude that, although our approach does not provide a direct encoding of adjective meaning in terms of such independently collected vectors, it does have meaningful ways to represent their semantic properties. [sent-274, score-0.537]

98 Evaluation-wise, the differences between observed and predicted ANs must be analyzed more extensively, to support the claim that, when their vectors differ, model-based prediction improves on the observed vector. [sent-283, score-0.523]

99 For example, to account for re- prefixation we do not need to collect a re- vector (required by all other approaches to composition), but simply vectors for a set of V/reV pairs, where both members of the pairs are words (e. [sent-286, score-0.364]

100 The pls package: Principal component and partial least squares regression in R. [sent-380, score-0.248]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ans', 0.451), ('adjective', 0.408), ('vectors', 0.256), ('adjectives', 0.234), ('guevara', 0.228), ('composition', 0.168), ('multiplicative', 0.151), ('additive', 0.117), ('noun', 0.112), ('alm', 0.112), ('neighbors', 0.103), ('attributive', 0.098), ('observed', 0.093), ('composite', 0.093), ('matrix', 0.091), ('green', 0.087), ('svd', 0.087), ('regression', 0.084), ('predicted', 0.081), ('red', 0.08), ('vector', 0.079), ('nouns', 0.078), ('dimensions', 0.076), ('compositionality', 0.075), ('mult', 0.072), ('lapata', 0.071), ('space', 0.07), ('squares', 0.07), ('nearest', 0.068), ('mitchell', 0.068), ('meaning', 0.067), ('tensor', 0.065), ('thomason', 0.065), ('ukwac', 0.065), ('multiplication', 0.065), ('items', 0.063), ('semantic', 0.062), ('core', 0.062), ('fs', 0.058), ('distributional', 0.057), ('young', 0.056), ('neighbor', 0.051), ('constructions', 0.049), ('montague', 0.049), ('reprinted', 0.049), ('sahlgren', 0.049), ('slm', 0.049), ('multiplying', 0.049), ('simplified', 0.047), ('rank', 0.047), ('centroids', 0.046), ('erk', 0.046), ('vocabulary', 0.046), ('semantics', 0.046), ('kintsch', 0.042), ('cluto', 0.042), ('quantum', 0.042), ('black', 0.041), ('onto', 0.04), ('linear', 0.04), ('historical', 0.039), ('functions', 0.039), ('scalar', 0.038), ('color', 0.038), ('purity', 0.038), ('concatenated', 0.037), ('median', 0.037), ('pad', 0.037), ('dimension', 0.036), ('dog', 0.035), ('least', 0.033), ('componentwise', 0.033), ('fake', 0.033), ('initiative', 0.033), ('intersective', 0.033), ('lmi', 0.033), ('mevik', 0.033), ('modelgenerated', 0.033), ('nella', 0.033), ('nri', 0.033), ('pls', 0.033), ('plsr', 0.033), ('rubenstein', 0.033), ('coefficients', 0.033), ('lemmas', 0.031), ('extended', 0.031), ('matrices', 0.03), ('singular', 0.03), ('excluding', 0.03), ('meanings', 0.029), ('formal', 0.029), ('landauer', 0.029), ('cooccurrence', 0.029), ('representations', 0.029), ('variables', 0.029), ('collect', 0.029), ('component', 0.028), ('hastie', 0.028), ('meaningfully', 0.028), ('adj', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999815 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

2 0.15425037 77 emnlp-2010-Measuring Distributional Similarity in Context

Author: Georgiana Dinu ; Mirella Lapata

Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.

3 0.082893692 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributionalonly, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

4 0.072044045 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon

Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms ofperplexity and on a large-scale translation task.

5 0.071966708 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

6 0.06735947 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

7 0.063557252 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

8 0.05788409 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

9 0.05253578 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

10 0.051991455 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

11 0.05161979 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

12 0.050953381 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

13 0.045572415 84 emnlp-2010-NLP on Spoken Documents Without ASR

14 0.045243289 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

15 0.044558864 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa

16 0.043399662 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

17 0.043309405 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

18 0.043166678 95 emnlp-2010-SRL-Based Verb Selection for ESL

19 0.042836607 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

20 0.042269785 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.175), (1, 0.096), (2, -0.044), (3, 0.036), (4, 0.008), (5, 0.056), (6, -0.113), (7, -0.045), (8, 0.111), (9, 0.025), (10, 0.049), (11, 0.005), (12, -0.07), (13, -0.066), (14, 0.021), (15, -0.155), (16, -0.098), (17, -0.058), (18, -0.065), (19, -0.097), (20, 0.125), (21, 0.094), (22, -0.103), (23, -0.107), (24, -0.094), (25, -0.259), (26, 0.144), (27, -0.237), (28, 0.122), (29, 0.039), (30, 0.086), (31, 0.034), (32, -0.066), (33, 0.068), (34, 0.09), (35, -0.058), (36, 0.053), (37, 0.129), (38, -0.019), (39, 0.012), (40, 0.215), (41, -0.125), (42, -0.003), (43, 0.066), (44, -0.162), (45, 0.069), (46, -0.029), (47, 0.006), (48, -0.006), (49, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97536534 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

2 0.74289382 77 emnlp-2010-Measuring Distributional Similarity in Context

Author: Georgiana Dinu ; Mirella Lapata

Abstract: The computation of meaning similarity as operationalized by vector-based models has found widespread use in many tasks ranging from the acquisition of synonyms and paraphrases to word sense disambiguation and textual entailment. Vector-based models are typically directed at representing words in isolation and thus best suited for measuring similarity out of context. In his paper we propose a probabilistic framework for measuring similarity in context. Central to our approach is the intuition that word meaning is represented as a probability distribution over a set of latent senses and is modulated by context. Experimental results on lexical substitution and word similarity show that our algorithm outperforms previously proposed models.

3 0.51820248 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon

Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms ofperplexity and on a large-scale translation task.

4 0.41341257 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

Author: Joseph Reisinger ; Raymond Mooney

Abstract: We introduce tiered clustering, a mixture model capable of accounting for varying degrees of shared (context-independent) feature structure, and demonstrate its applicability to inferring distributed representations of word meaning. Common tasks in lexical semantics such as word relatedness or selectional preference can benefit from modeling such structure: Polysemous word usage is often governed by some common background metaphoric usage (e.g. the senses of line or run), and likewise modeling the selectional preference of verbs relies on identifying commonalities shared by their typical arguments. Tiered clustering can also be viewed as a form of soft feature selection, where features that do not contribute meaningfully to the clustering can be excluded. We demonstrate the applicability of tiered clustering, highlighting particular cases where modeling shared structure is beneficial and where it can be detrimental.

5 0.37376413 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributionalonly, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

6 0.3554402 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

7 0.30618349 117 emnlp-2010-Using Unknown Word Techniques to Learn Known Words

8 0.29479441 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

9 0.2803064 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

10 0.26344308 11 emnlp-2010-A Semi-Supervised Approach to Improve Classification of Infrequent Discourse Relations Using Feature Vector Extension

11 0.25765726 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing

12 0.23856765 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

13 0.23697315 80 emnlp-2010-Modeling Organization in Student Essays

14 0.21640736 95 emnlp-2010-SRL-Based Verb Selection for ESL

15 0.20839603 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models

16 0.19771855 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

17 0.18930444 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

18 0.18816479 31 emnlp-2010-Constraints Based Taxonomic Relation Classification

19 0.18587682 120 emnlp-2010-What's with the Attitude? Identifying Sentences with Attitude in Online Discussions

20 0.18437469 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.017), (10, 0.017), (12, 0.045), (29, 0.155), (30, 0.032), (32, 0.054), (37, 0.264), (52, 0.042), (56, 0.055), (62, 0.013), (66, 0.095), (72, 0.053), (76, 0.031), (79, 0.013), (82, 0.013), (87, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79654145 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

2 0.58895338 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

Author: Chang Liu ; Daniel Dahlmeier ; Hwee Tou Ng

Abstract: We present PEM, the first fully automatic metric to evaluate the quality of paraphrases, and consequently, that of paraphrase generation systems. Our metric is based on three criteria: adequacy, fluency, and lexical dissimilarity. The key component in our metric is a robust and shallow semantic similarity measure based on pivot language N-grams that allows us to approximate adequacy independently of lexical similarity. Human evaluation shows that PEM achieves high correlation with human judgments.

3 0.58726579 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

Author: Samidh Chatterjee ; Nicola Cancedda

Abstract: Minimum Error Rate Training is the algorithm for log-linear model parameter training most used in state-of-the-art Statistical Machine Translation systems. In its original formulation, the algorithm uses N-best lists output by the decoder to grow the Translation Pool that shapes the surface on which the actual optimization is performed. Recent work has been done to extend the algorithm to use the entire translation lattice built by the decoder, instead of N-best lists. We propose here a third, intermediate way, consisting in growing the translation pool using samples randomly drawn from the translation lattice. We empirically measure a systematic im- provement in the BLEU scores compared to training using N-best lists, without suffering the increase in computational complexity associated with operating with the whole lattice.

4 0.58661044 57 emnlp-2010-Hierarchical Phrase-Based Translation Grammars Extracted from Alignment Posterior Probabilities

Author: Adria de Gispert ; Juan Pino ; William Byrne

Abstract: We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignmentmodel. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteri- ors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-totarget and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

5 0.58576953 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

Author: Zhongqiang Huang ; Martin Cmejrek ; Bowen Zhou

Abstract: In this paper, we present a novel approach to enhance hierarchical phrase-based machine translation systems with linguistically motivated syntactic features. Rather than directly using treebank categories as in previous studies, we learn a set of linguistically-guided latent syntactic categories automatically from a source-side parsed, word-aligned parallel corpus, based on the hierarchical structure among phrase pairs as well as the syntactic structure of the source side. In our model, each X nonterminal in a SCFG rule is decorated with a real-valued feature vector computed based on its distribution of latent syntactic categories. These feature vectors are utilized at decod- ing time to measure the similarity between the syntactic analysis of the source side and the syntax of the SCFG rules that are applied to derive translations. Our approach maintains the advantages of hierarchical phrase-based translation systems while at the same time naturally incorporates soft syntactic constraints.

6 0.58462137 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

7 0.57731313 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

8 0.57640868 77 emnlp-2010-Measuring Distributional Similarity in Context

9 0.57621413 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

10 0.57394695 18 emnlp-2010-Assessing Phrase-Based Translation Models with Oracle Decoding

11 0.56866795 99 emnlp-2010-Statistical Machine Translation with a Factorized Grammar

12 0.56789708 65 emnlp-2010-Inducing Probabilistic CCG Grammars from Logical Form with Higher-Order Unification

13 0.56529713 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

14 0.56515545 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

15 0.56505108 96 emnlp-2010-Self-Training with Products of Latent Variable Grammars

16 0.56434268 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

17 0.5638153 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

18 0.56371665 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

19 0.56338054 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

20 0.56336069 84 emnlp-2010-NLP on Spoken Documents Without ASR