emnlp emnlp2012 emnlp2012-61 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. [sent-4, score-0.201]
2 In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. [sent-5, score-0.867]
3 Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). [sent-6, score-0.246]
4 Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two. [sent-10, score-1.068]
5 Beyond language acquisition, there is considerable evidence across both behavioral experiments and neuroimaging studies that the perceptual associates of words play an important role in language processing (for a review see Barsalou (2008)). [sent-25, score-0.906]
6 An important question in the formulation of such models concerns the provenance of perceptual information. [sent-27, score-0.811]
7 A few models use feature norms as a proxy for sensorimotor experience (Howell et al. [sent-28, score-0.243]
8 (2011) learn textual and visual representations independently from distinct data sources. [sent-40, score-0.163]
9 Aside from the type of data used to capture perceptual information, another important issue concerns how the two modalities (perceptual and textual) are integrated. [sent-41, score-0.951]
10 , 2011) or to infer one modality by means of the other (Johns and Jones, 2012) and to arrive at a grounded representation simply by concatenating the two. [sent-43, score-0.183]
11 We examine three models with different assumptions regarding the integration of perceptual and linguistic data. [sent-48, score-0.855]
12 It simultaneously considers the distribution of words across contexts in a text corpus and the distribution of words across perceptual features and extracts joint information from both data sources. [sent-52, score-0.855]
13 Our second model is based on Johns and Jones (2012) who represent the meaning of a word as the concatenation of its textual and its perceptual vector. [sent-53, score-0.961]
14 Interestingly, their model makes it possible to infer a perceptual vector for words without feature norms, simply by taking into account similar words for which perceptual information is available. [sent-54, score-1.702]
15 In all three models we use feature norms as a proxy for perceptual information. [sent-61, score-1.022]
16 In other words, feature norms can serve as an upper bound of what can be achieved when integrating detailed perceptual information with vanilla text-based distributional models. [sent-67, score-1.05]
17 2 Perceptually Grounded Models In this study we examine semantic representation models that rely on linguistic and perceptual data. [sent-69, score-0.888]
18 As mentioned earlier, we resort to feature norms as a proxy for perceptual information. [sent-71, score-1.022]
19 In the remainder of this section we will describe our models and how they arrive at an integrated perceptual and linguistic representation. [sent-76, score-0.833]
20 In addition, those words of a document that are also included in the feature norms are paired with one of their features, where a feature is sampled according to the feature distribution given that word. [sent-82, score-0.258]
21 ,xC} ∈ C; a component xc comprises a latent discourse topic coupled with a feature cluster originating from the feature norms. [sent-93, score-0.19]
22 For each xc = xji, a word wji is drawn from distribution φc and a feature fji is drawn from distribution ψc. [sent-103, score-0.276]
23 , wjnj }, a component xc = xji is drawn from πj; wji is then drawn from the corresponding distribution φc. [sent-108, score-0.241]
24 Figure 2: Example of the representation of the meaning of apple with the model of (Andrews et al. [sent-131, score-0.164]
25 Assuming a uniform distribution over components xc in D, PX|W can be approximated as: P(X = xc | W = wi) = P(wi | xc) P(xc) / P(wi) ≈ P(wi | xc) / Σ_{l=1..C} P(wi | xl) (4), where C is the total number of components. [sent-223, score-0.201]
26 The probability distribution PF|W over features given a word wi is simply inferred by summing over all components xc for each feature fk: P(fk | W = wi) = Σ_{c=1..C} P(fk | xc) P(xc | wi) (5). [sent-225, score-0.291]
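The two quantities above are cheap to compute once the model's component distributions are known. The sketch below is only an illustration of equations (4) and (5), not the authors' implementation: it assumes an already trained feature-topic model whose component-word distributions P(w|xc) and component-feature distributions P(f|xc) are stored as matrices, and all variable names are hypothetical.

```python
import numpy as np

# p_w_given_x: C x V matrix of P(w | x_c); p_f_given_x: C x F matrix of P(f | x_c).
# Both are assumed to come from an already-trained feature-topic model.

def component_posterior(p_w_given_x, word_id):
    """P(x_c | w_i) under a uniform prior over components, as in equation (4)."""
    likelihood = p_w_given_x[:, word_id]          # P(w_i | x_c) for every component c
    return likelihood / likelihood.sum()          # normalise over the C components

def feature_distribution(p_w_given_x, p_f_given_x, word_id):
    """P(f_k | w_i) obtained by summing over components, as in equation (5)."""
    posterior = component_posterior(p_w_given_x, word_id)   # shape (C,)
    return posterior @ p_f_given_x                           # shape (F,)

# Toy usage with random, properly normalised distributions:
rng = np.random.default_rng(0)
C, V, F = 10, 100, 50
p_w_given_x = rng.dirichlet(np.ones(V), size=C)
p_f_given_x = rng.dirichlet(np.ones(F), size=C)
p_f_given_w = feature_distribution(p_w_given_x, p_f_given_x, word_id=3)
assert abs(p_f_given_w.sum() - 1.0) < 1e-9       # the result is again a distribution
```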
27 2.2 Global Similarity Model Johns and Jones (2012) propose an approach for generating perceptual representations for words by means of global lexical similarity. [sent-226, score-0.889]
28 Their model does not place so much emphasis on the integration of perceptual and linguistic information; rather, its main focus is on inducing perceptual representations for words with no perceptual correlates. [sent-227, score-2.53]
29 Their idea is to assume that lexically similar words also share perceptual features and hence it should be possible to transfer perceptual information onto words that have none from their linguistically similar neighbors. [sent-228, score-1.655]
30 Let P ∈ [0, 1]N×F denote a perceptual matrix, representing a probability distribution over features for each word (see Table 1). [sent-230, score-0.833]
31 A word’s meaning is represented by the concatenation of its textual and perceptual vectors (see Figure 3). [sent-231, score-1.03]
32 If a word has not been normed, its perceptual vector will be all zeros. [sent-232, score-0.831]
33 Johns and Jones (2012) propose a two-step estimation process for words without perceptual vectors. [sent-233, score-0.811]
34 The process is repeated a second time, so as to incorporate the inferred perceptual vector in the computation of the inferred vectors of all other words. [sent-235, score-0.99]
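As a rough illustration of this two-step scheme (and not Johns and Jones' exact weighting, which is more elaborate), the sketch below fills missing perceptual vectors with a cosine-weighted average of the perceptual vectors of words that have them, and runs a second pass so that newly inferred vectors also contribute; the matrix names T and P are assumptions.

```python
import numpy as np

# T: N x D matrix of textual vectors; P: N x F matrix of perceptual feature
# distributions, with all-zero rows for words that lack feature norms.

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def infer_perceptual(T, P, passes=2):
    P = P.copy()
    unnormed = np.where(P.sum(axis=1) == 0)[0]
    for _ in range(passes):                        # the second pass reuses inferred vectors
        available = P.sum(axis=1) > 0
        for i in unnormed:
            w = np.array([max(cosine(T[i], T[j]), 0.0) if (available[j] and j != i) else 0.0
                          for j in range(len(T))])
            if w.sum() > 0:
                P[i] = (w @ P) / w.sum()           # similarity-weighted average
    return P

def meaning(T, P, i):
    # a word's meaning is the concatenation of its textual and perceptual vectors
    return np.concatenate([T[i], P[i]])
```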
35 (2004)) to learn a joint semantic representation from the textual and perceptual views. [sent-315, score-0.955]
36 Given two random variables x and y (or two sets of vectors), CCA can be seen as determining two sets of basis vectors in such a way that the correlation between the projections of the variables onto these bases is mutually maximized (Borga, 2001). [sent-316, score-0.197]
37 The perceptual view is captured by a perceptual matrix, P ∈ [0, 1]N×F, representing words as a probability distribution over normed features. [sent-319, score-1.665]
38 Since the correlation between the linguistic and perceptual views may exist in some nonlinear relationship, we used a kernelized version of CCA (Hardoon et al. [sent-321, score-0.935]
39 After applying CCA we obtain two matrices projected onto L basis vectors, Ct ∈ RN×L, resulting from the projection of the textual matrix T onto the new basis, and Cp ∈ RN×L, resulting from the projection of the corresponding perceptual feature matrix. [sent-324, score-1.022]
40 The meaning of a word can thus be represented by its projected textual vector in CT, its projected perceptual vector in CP or their concatenation. [sent-325, score-1.046]
41 Figure 4 shows an example of the textual and perceptual vectors for the word apple which were used as input for CCA (first row) and their new representation after the projection onto new basis vectors (second row). [sent-326, score-1.175]
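For concreteness, here is a minimal sketch of the projection step using the plain linear CCA implementation from scikit-learn rather than the kernelized variant of Hardoon et al. (2004) used in the paper; the toy matrices T and P and the number of components L are illustrative assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Toy stand-ins for the textual matrix T (N x D) and the perceptual matrix P
# (N x F) of the words that have feature norms.
rng = np.random.default_rng(0)
N, D, F, L = 200, 100, 80, 30
T = rng.random((N, D))
P = rng.dirichlet(np.ones(F), size=N)

cca = CCA(n_components=L, max_iter=1000)
Ct, Cp = cca.fit_transform(T, P)     # rows of Ct and Cp live in the shared L-dimensional space

# A word's meaning can then be its row in Ct, its row in Cp, or their concatenation.
apple = 0                            # hypothetical row index for "apple"
joint = np.concatenate([Ct[apple], Cp[apple]])
```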
42 The CCA model as sketched above will only obtain full representations for words with perceptual features available. [sent-327, score-0.864]
43 One solution would be to apply the method from Johns and Jones (2012) to infer the perceptual vectors and then perform CCA on the inferred vectors. [sent-328, score-0.962]
44 Another approach which we assess experimentally (see Section 4) is to create a perceptual vector for a word that has none from its k-most (textually) similar neighbors, simply by taking the average of their perceptual vectors. [sent-329, score-1.642]
45 This inference procedure can be applied to the original vectors or the projected vectors in CT and CP, respectively, once CCA has taken place. [sent-330, score-0.197]
46 2.4 Discussion Johns and Jones (2012) primarily present a model of perceptual inference, where textual data is used to infer perceptual information for words not included in feature norms. [sent-332, score-1.771]
47 There is no means in this model to obtain a joint representation resulting from the mutual influence of the perceptual and textual views. [sent-333, score-0.927]
48 Rather than simply adding perceptual information to textual data, it integrates both modalities jointly in a single representation, which is desirable, at least from a cognitive perspective. [sent-337, score-1.067]
49 Similarly to Johns and Jones (2012), Andrews et al.'s (2009) feature-topic model can also infer perceptual representations for words that have none. [sent-340, score-0.901]
50 In CCA, textual and perceptual data represent two different views of the same objects and the model operates on these views directly without combining or manipulating any of them a priori. [sent-342, score-0.962]
51 A drawback of the model lies in the need for additional methods for inferring perceptual representations for words not available in feature norms. [sent-344, score-0.908]
52 (2005) were used as a proxy for perceptual information. [sent-347, score-0.832]
53 's feature norms consist of 541 words and 2,526 features; 824 of these features occur with at least two different words. [sent-350, score-0.19]
54 In order to simulate word association, we used the human norms collected by (Nelson et al. [sent-353, score-0.167]
55 For each cue word, the norms provide a set of associates and the frequencies with which they were named. [sent-360, score-0.222]
56 Analogously, we can estimate the degree of similarity between a cue and its associates using our models (see the following section for details on the similarity measures we employed). [sent-362, score-0.145]
57 The norms contain 63,619 unique normed cue-associate pairs in total. [sent-363, score-0.21]
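A hedged sketch of this evaluation is given below: the model's similarity for each cue-associate pair (here simply the cosine between learned feature distributions) is correlated with the human association frequency; `meanings` and `pairs` are hypothetical stand-ins for the learned representations and the Nelson norms.

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def association_fit(meanings, pairs):
    """Correlate model similarities with human association frequencies.

    meanings: dict mapping a word to its feature vector (np.ndarray).
    pairs:    list of (cue, associate, frequency) triples.
    """
    model_sims = [cosine(meanings[cue], meanings[assoc]) for cue, assoc, _ in pairs]
    human_freqs = [freq for _, _, freq in pairs]
    r, _ = pearsonr(model_sims, human_freqs)
    return r
```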
58 Our third task assessed the models’ ability to infer perceptual vectors for words that have none. [sent-388, score-0.917]
59 We treated the perceptual vectors in each test fold as unseen, and used the data in the corresponding training fold together with the models presented in Section 2 to infer them. [sent-391, score-0.917]
60 where P(xc) is uniform, a single component xc is sampled from the distribution P(xc | w1), and an overall estimate is obtained by averaging over all C components. [sent-405, score-0.166]
61 Johns and Jones’ (2012) model uses binary textual vectors to represent word meaning. [sent-406, score-0.158]
62 The textual and perceptual matrices were projected onto 410 vectors. [sent-412, score-0.969]
63 3, CCA does not naturally lend itself to inferring perceptual vectors, yet a perceptual vector for a word can be created from its k-nearest neighbors. [sent-414, score-1.663]
64 We inferred a perceptual vector by averaging over the perceptual vectors of the word's k most similar words; textual similarity between two words was measured using the cosine of the angle between the two vectors representing them. [sent-415, score-1.959]
65 The highest correlation was achieved with k = 2 when the perceptual vectors were created prior to CCA and k = 8 when they were inferred on the projected textual and perceptual matrices. [sent-417, score-1.932]
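The inference just described amounts to a simple k-nearest-neighbour average; a hedged sketch, with hypothetical matrices T and P and a boolean mask `normed` marking the words that have feature norms, is:

```python
import numpy as np

def infer_by_knn(T, P, normed, target, k=2):
    """Average the perceptual vectors of the k textually most similar normed words."""
    t = T[target]
    candidates = np.where(normed)[0]
    sims = T[candidates] @ t / (np.linalg.norm(T[candidates], axis=1) * np.linalg.norm(t) + 1e-12)
    top = candidates[np.argsort(sims)[-k:]]   # k most similar words with feature norms
    return P[top].mean(axis=0)
```

Following the tuning above, k = 2 would be used when vectors are inferred before CCA and k = 8 on the projected matrices.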
66 4 Results Our experiments were designed to answer three questions: (1) Does the integration of perceptual and textual information yield a better fit with behavioral data compared to a model that considers only one data source? [sent-418, score-0.992]
67 (3) How accurately can we approximate the perceptual information when the latter is absent? [sent-422, score-0.811]
68 To answer the first question, we assessed the models’ performance when textual and perceptual information are both available. [sent-423, score-0.9]
69 (1998) norms when taking into account the textual and perceptual modalities on their own (+t−p and −t+p) and in combination (+t+p). [sent-426, score-1.207]
70 norms (520 cue-associate pairs) that also appeared in McRae et al. [sent-429, score-0.187]
71 (2005) and for which a perceptual vector was present. [sent-430, score-0.831]
72 The table shows different instantiations of the three models depending on the type of modality taken into account: textual, perceptual or both. [sent-431, score-0.865]
73 's (2009) feature-topic model provides a better fit with the association data when both modalities are taken into account (+t+p). [sent-433, score-0.164]
74 's (2005) feature norms (−t+p) yields substantially lower correlations. [sent-435, score-0.167]
75 Concatenation of perceptual and textual vectors yields the best fit with the norming data, relying on perceptual information alone (−t+p) comes close, whereas textual information on its own seems to have a weaker effect (+t−p). [sent-437, score-1.826]
76 The CCA model takes perceptual and textual [sent-438, score-0.811]
77 information as input in order to find a projection onto basis vectors that are maximally correlated. [sent-439, score-0.145]
78 tual matrix (+t−p), the perceptual matrix (−t+p), or their concatenation (+t+p). [sent-442, score-0.859]
79 We obtain best results with the latter representation; again we observe that the perceptual information is more dominant. [sent-443, score-0.811]
80 Recall that the feature-topic model (+t+p) represents words as distributions over components, whereas the global similarity model simply concatenates the textual and perceptual vectors. [sent-448, score-0.97]
81 In sum, the higher correlation with human judgments indicates that jointly integrating the textual and perceptual modalities is preferable to concatenation. [sent-450, score-1.111]
82 One might even argue that the comparison is slightly unfair as the global similarity model is more geared towards inferring perceptual vectors rather than integrating the two modalities in the best possible way. [sent-459, score-1.111]
83 This entails that the models will infer perceptual vectors for the (footnote 7: this excludes the data used as a development set for tuning the k-nearest neighbors for CCA). [sent-461, score-0.917]
84 This is hardly surprising as perceptual information is approximate and in several cases likely to be wrong. [sent-482, score-0.811]
85 Interestingly, we observe similar modeling trends, irrespective of whether the models are performing perceptual inference or not. [sent-483, score-0.834]
86 In order to isolate the influence of the inference method from the resulting semantic representation we evaluated the inferred perceptual vectors on their own by computing their correlation with the original feature distributions in McRae et al. [sent-499, score-1.115]
87 CCA has in fact none, whereas in the feature-topic model the inference of missing perceptual information is a by-product of the generative process. [sent-504, score-0.834]
88 The results in Table 4 indicate that the perceptual vectors are not reconstructed very accurately (the highest correlation coefficient is . [sent-505, score-0.951]
89 25) and that better inference mechanisms are required for perceptual information to have a positive impact on semantic representation. [sent-506, score-0.882]
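As an illustration of the evaluation described above (not the authors' exact script), each inferred feature distribution can be correlated with the corresponding gold distribution from the norms and the coefficients averaged; the array names are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_correlation(inferred, gold):
    """Average Pearson correlation between inferred and gold feature vectors.

    inferred, gold: M x F arrays over the same M held-out words.
    """
    rs = []
    for i in range(len(gold)):
        r, _ = pearsonr(inferred[i], gold[i])
        rs.append(r)
    return float(np.mean(rs))
```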
90 greater impact on the resulting semantic representations compared to the mechanism by which missing perceptual information is inferred. [sent-520, score-0.892]
91 5 Conclusions In this paper, we have presented a comparative study of semantic representation models which compute word meaning on the basis of linguistic and perceptual information. [sent-521, score-0.946]
92 , 2009), the textual and perceptual views are integrated via a set of latent components that are inferred from the joint distribution of textual words and perceptual features. [sent-524, score-1.886]
93 , 2004) integrates the two views by deriving a consensus representation based on the correlation between the linguistic and perceptual modalities. [sent-526, score-0.962]
94 In addition, it uses the linguistic representations of words to infer perceptual information when the latter is absent. [sent-528, score-0.923]
95 Experiments on word association and similarity show that all models benefit from the integration of perceptual data. [sent-529, score-0.878]
96 We have also examined how these models perform on the perceptual inference task which has implications for the wider applicability of grounded semantic representation models. [sent-531, score-0.954]
97 ’s (2005) norms without any extensive feature engineering other than applying a frequency cut-off. [sent-534, score-0.19]
98 In the future we plan to experiment with feature selection methods in an attempt to represent perceptual information more succinctly. [sent-535, score-0.834]
99 Although feature norms are a useful first approximation of perceptual data, the effort involved in eliciting them limits the scope of any computational model based on normed data. [sent-539, score-0.233]
100 A natural avenue for future work would be to develop semantic representation models that exploit perceptual data that is both naturally occurring and easily accessible (e. [sent-540, score-0.866]
wordName wordTfidf (topN-words)
[('perceptual', 0.811), ('cca', 0.263), ('norms', 0.167), ('mcrae', 0.151), ('xc', 0.144), ('modalities', 0.14), ('jones', 0.109), ('johns', 0.106), ('apple', 0.103), ('andrews', 0.093), ('textual', 0.089), ('nelson', 0.074), ('correlation', 0.071), ('vectors', 0.069), ('grounded', 0.065), ('fk', 0.062), ('hardoon', 0.054), ('modality', 0.054), ('representations', 0.053), ('bnc', 0.046), ('behavioral', 0.046), ('inferred', 0.045), ('similarity', 0.045), ('crunchy', 0.043), ('normed', 0.043), ('wji', 0.043), ('coefficients', 0.037), ('infer', 0.037), ('participants', 0.036), ('projected', 0.036), ('meaning', 0.034), ('dj', 0.034), ('fruit', 0.033), ('onto', 0.033), ('components', 0.033), ('fangs', 0.032), ('modelspearson', 0.032), ('perceptually', 0.032), ('px', 0.032), ('sensorimotor', 0.032), ('xji', 0.032), ('views', 0.031), ('canonical', 0.031), ('red', 0.03), ('associates', 0.029), ('semantic', 0.028), ('bruni', 0.028), ('perceived', 0.028), ('concatenation', 0.027), ('representation', 0.027), ('lda', 0.027), ('cue', 0.026), ('green', 0.026), ('round', 0.026), ('distributional', 0.026), ('perception', 0.025), ('pf', 0.025), ('global', 0.025), ('wi', 0.024), ('basis', 0.024), ('matrix', 0.024), ('approximated', 0.024), ('fit', 0.024), ('feature', 0.023), ('finkelstein', 0.023), ('vanilla', 0.023), ('inference', 0.023), ('linguistic', 0.022), ('integration', 0.022), ('cp', 0.022), ('distribution', 0.022), ('barsalou', 0.022), ('bornstein', 0.022), ('celery', 0.022), ('featuretopic', 0.022), ('fji', 0.022), ('glenberg', 0.022), ('howell', 0.022), ('landau', 0.022), ('norming', 0.022), ('pinf', 0.022), ('voorspoels', 0.022), ('row', 0.021), ('inferring', 0.021), ('proxy', 0.021), ('visual', 0.021), ('review', 0.02), ('vector', 0.02), ('appeared', 0.02), ('mechanisms', 0.02), ('yellow', 0.02), ('projection', 0.019), ('correlations', 0.019), ('dd', 0.019), ('reliability', 0.019), ('physical', 0.019), ('psychonomic', 0.018), ('isolate', 0.018), ('juice', 0.018), ('mix', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
2 0.078740604 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
Author: William Blacoe ; Mirella Lapata
Abstract: In this paper we address the problem of modeling compositional meaning for phrases and sentences using distributional methods. We experiment with several possible combinations of representation and composition, exhibiting varying degrees of sophistication. Some are shallow while others operate over syntactic structure, rely on parameter learning, or require access to very large corpora. We find that shallow approaches are as good as more computationally intensive alternatives with regards to two particular tests: (1) phrase similarity and (2) paraphrase detection. The sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.
3 0.070996419 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
4 0.061585329 19 emnlp-2012-An Entity-Topic Model for Entity Linking
Author: Xianpei Han ; Le Sun
Abstract: Entity Linking (EL) has received considerable attention in recent years. Given many name mentions in a document, the goal of EL is to predict their referent entities in a knowledge base. Traditionally, there have been two distinct directions of EL research: one focusing on the effects of mention’s context compatibility, assuming that “the referent entity of a mention is reflected by its context”; the other dealing with the effects of document’s topic coherence, assuming that “a mention ’s referent entity should be coherent with the document’ ’s main topics”. In this paper, we propose a generative model called entitytopic model, to effectively join the above two complementary directions together. By jointly modeling and exploiting the context compatibility, the topic coherence and the correlation between them, our model can – accurately link all mentions in a document using both the local information (including the words and the mentions in a document) and the global knowledge (including the topic knowledge, the entity context knowledge and the entity name knowledge). Experimental results demonstrate the effectiveness of the proposed model. 1
5 0.056407146 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces
Author: Richard Socher ; Brody Huval ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Single-word vector space models have been very successful at learning lexical information. However, they cannot capture the compositional meaning of longer phrases, preventing them from a deeper understanding of language. We introduce a recursive neural network (RNN) model that learns compositional vector representations for phrases and sentences of arbitrary syntactic type and length. Our model assigns a vector and a matrix to every node in a parse tree: the vector captures the inherent meaning of the constituent, while the matrix captures how it changes the meaning of neighboring words or phrases. This matrix-vector RNN can learn the meaning of operators in propositional logic and natural language. The model obtains state of the art performance on three different experiments: predicting fine-grained sentiment distributions of adverb-adjective pairs; classifying sentiment labels of movie reviews and classifying semantic relationships such as cause-effect or topic-message between nouns using the syntactic path between them.
6 0.04655556 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
7 0.043486483 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
8 0.041175228 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
9 0.037628297 53 emnlp-2012-First Order vs. Higher Order Modification in Distributional Semantics
10 0.037053492 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
11 0.036047261 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
12 0.034978911 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
13 0.034364078 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
14 0.03218266 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
15 0.03142342 8 emnlp-2012-A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes
16 0.031028816 16 emnlp-2012-Aligning Predicates across Monolingual Comparable Texts using Graph-based Clustering
17 0.028353648 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
18 0.02835202 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
19 0.028151549 80 emnlp-2012-Learning Verb Inference Rules from Linguistically-Motivated Evidence
20 0.027528524 90 emnlp-2012-Modelling Sequential Text with an Adaptive Topic Model
topicId topicWeight
[(0, 0.107), (1, 0.04), (2, 0.007), (3, 0.057), (4, -0.035), (5, 0.088), (6, 0.042), (7, 0.023), (8, 0.082), (9, 0.008), (10, -0.092), (11, 0.02), (12, -0.012), (13, 0.002), (14, -0.022), (15, -0.038), (16, 0.121), (17, -0.05), (18, 0.061), (19, 0.009), (20, -0.063), (21, 0.087), (22, -0.025), (23, 0.027), (24, -0.068), (25, -0.089), (26, 0.068), (27, -0.039), (28, -0.022), (29, -0.007), (30, 0.033), (31, 0.034), (32, 0.153), (33, -0.042), (34, -0.008), (35, 0.007), (36, -0.056), (37, 0.051), (38, 0.03), (39, 0.037), (40, -0.039), (41, 0.236), (42, 0.041), (43, -0.226), (44, 0.285), (45, 0.253), (46, -0.162), (47, -0.113), (48, 0.059), (49, 0.169)]
simIndex simValue paperId paperTitle
same-paper 1 0.93820029 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
2 0.47911307 133 emnlp-2012-Unsupervised PCFG Induction for Grounded Language Learning with Highly Ambiguous Supervision
Author: Joohyun Kim ; Raymond Mooney
Abstract: “Grounded” language learning employs training data in the form of sentences paired with relevant but ambiguous perceptual contexts. B ¨orschinger et al. (201 1) introduced an approach to grounded language learning based on unsupervised PCFG induction. Their approach works well when each sentence potentially refers to one of a small set of possible meanings, such as in the sportscasting task. However, it does not scale to problems with a large set of potential meanings for each sentence, such as the navigation instruction following task studied by Chen and Mooney (201 1). This paper presents an enhancement of the PCFG approach that scales to such problems with highly-ambiguous supervision. Experimental results on the navigation task demonstrates the effectiveness of our approach.
3 0.37048954 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
Author: Afra Alishahi ; Grzegorz Chrupala
Abstract: Learning the meaning of words from ambiguous and noisy context is a challenging task for language learners. It has been suggested that children draw on syntactic cues such as lexical categories of words to constrain potential referents of words in a complex scene. Although the acquisition of lexical categories should be interleaved with learning word meanings, it has not previously been modeled in that fashion. In this paper, we investigate the interplay of word learning and category induction by integrating an LDA-based word class learning module with a probabilistic word learning model. Our results show that the incrementally induced word classes significantly improve word learning, and their contribution is comparable to that of manually assigned part of speech categories. 1 Learning the Meaning of Words For young learners of a natural language, mapping each word to its correct meaning is a challenging task. Words are often used as part of an utterance rather than in isolation. The meaning of an utterance must be inferred from among numerous possible interpretations that the (usually complex) surrounding scene offers. In addition, the linguistic and visual context in which words are heard and used is often noisy and highly ambiguous. Particularly, many words in a language are polysemous and have different meanings. Various learning mechanisms have been proposed for word learning. One well-studied mechanism is cross-situational learning, a bottom-up strategy based on statistical co-occurrence of words and referents across situations (Quine 1960, Pinker 1989). 643 Grzegorz Chrupała gchrupala@ l .uni-s aarland .de sv Spoken Language Systems Saarland University, Germany Several experimental studies have shown that adults and children are sensitive to cross-situational evidence and use this information for mapping words to objects, actions and properties (Smith and Yu 2007, Monaghan and Mattock 2009). A number of computational models have been developed based on this principle, demonstrating that cross-situational learning is a powerful and efficient mechanism for learning the correct mappings between words and meanings from noisy input (e.g. Siskind 1996, Yu 2005, Fazly et al. 2010). Another potential source of information that can help the learner to constrain the relevant aspects of a scene is the sentential context of a word. It has been suggested that children draw on syntactic cues provided by the linguistic context in order to guide word learning, a hypothesis known as syntactic bootstrapping (Gleitman 1990). There is substantial evidence that children are sensitive to the structural regularities of language from a very young age, and that they use these structural cues to find the referent of a novel word (e.g. Naigles and Hoff-Ginsberg 1995, Gertner et al. 2006). In particular, young children have robust knowledge of some of the abstract lexical categories such as nouns and verbs (e.g. Gelman and Taylor 1984, Kemp et al. 2005). Recent studies have examined the interplay of cross-situational learning and sentence-level learning mechanisms, showing that adult learners of an artificial language can successfully and simultaneously apply cues and constraints from both sources of information when mapping words to their referents (Gillette et al. 1999, Lidz et al. 2010, Koehne and Crocker 2010; 2011). 
Several computational models have also investigated this interaction by adding manually annotated part-of-speech tags as PLraoncge uadgineg Lse oafr tnhineg 2,0 p1a2g Jeosin 64t C3–o6n5f4e,re Jnecjue Iosnla Enmd,p Kiroicraela, M 1e2t–h1o4ds Ju ilny N 20a1tu2r.a ?lc L2a0n1g2ua Agseso Pcrioactieosnsi fnogr a Cnodm Cpoumtaptiuotna tilo Lnianlg Nuaist uircasl input to word learning algorithms, and suggesting that integration of lexical categories can boost the performance of a cross-situational model (Yu 2006, Alishahi and Fazly 2010). However, none of the existing experimental or computational studies have examined the acquisition of word meanings and lexical categories in parallel. They all make the simplifying assumption that prior to the onset of word learning, the categoriza- tion module has already formed a relatively robust set of lexical categories. This assumption can be justified in the case of adult learners of a second or artificial language. But children’s acquisition of categories is most probably interleaved with the acquisition of word meaning, and these two processes must ultimately be studied simultaneously. In this paper, we investigate concurrent acquisition of word meanings and lexical categories. We use an online version of the LDA algorithm to induce a set of word classes from child-directed speech, and integrate them into an existing probabilistic model of word learning which combines cross-situational evidence with cues from lexical categories. Through a number of simulations of a word learning scenario, we show that our automatically and incrementally induced categories significantly improve the performance ofthe word learning model, and are closely comparable to a set of goldstandard, manually-annotated part of speech tags. 2 A Word Learning Model We want to investigate whether lexical categories (i.e. word classes) that are incrementally induced from child-directed speech can improve the performance of a cross-situational word learning model. For this purpose, we use the model of Alishahi and Fazly (2010). This model uses a probabilistic learning algorithm for combining evidence from word– referent co-occurrence statistics and the meanings associated with a set of pre-defined categories. They use child-directed utterances, manually annotated with a small set of part of speech tags, from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995). Their experimental results show that integrating these goldstandard categories into the algorithm boosts its performance over a pure cross-situational version. 644 The model of Alishahi and Fazly (2010) has the suitable architecture for our goal: it provides an integrated learning mechanism which combines evidence from word-referent co-occurrence with cues from the meaning representation associated with word categories. However, the model has two major shortcomings. First, it assumes that lexical categories are formed and finalized prior to the onset of word learning and that a correct and unique category for a target word can be identified at each point in time, assumptions that are highly unlikely. Second, it does not handle any ambiguity in the meaning of a word. Instead, each word is assumed to have only one correct meaning. Considering the high level of lexical ambiguity in most natural languages, this assumption unreasonably simplifies the word learning problem. 
To investigate the plausibility of integrating word and category learning, we use an online algorithm for automatically and incrementally inducing a set of lexical categories. Moreover, we use each word in its original form instead of lemmatizing them, which implies that categories contain different morphological forms of the same word. By applying these changes, we are able to study the contribution of lexical categories to word learning in a more realistic scenario. Representation of input. The input to the model consists of a sequence of utterances, each paired with a representation of an observed scene. We represent an utterance as a set of words, U = {w} (e.g. {she, went, home, ... }), aonfd w wthored corresponding scene as a set not,f hsoemmaen, .ti.c.} features, S c = {f} (e.g. {ANIMATE, HUMAN, FEMALE, ...}). Word and category meaning. We represent the meaning of a word as a time-dependent probability distribution over all the semantic features, where (f|w) |isw t)h oev probability mofa fnetaictur feea f rbees-, ing associa(tefd|w ww)i tish twhoerd p w aatb tiilmitye t o. fI fne tahteu raeb sfen bceeof any prior knowledge, the model assumes a uniform distribution over all features as the meaning of a novel word. Also, a function (w) gives us the category to which a word w in utterance belongs. At each point in time, a category c contains a set of word tokens. We assign a meaning to each category as a weighted sum of the meaning learned so far for each of its members, or p(t) (f|c) = (1/|c|) Pw∈c (f|w), where |c| is the nu(fm|bc)er o=f (w1o/r|dc |t)oPkenws∈ icn c a(tf t|hwe )c,u wrrheenrte em |co|m iesn tht. p(t) p(t)(·|w) cat(t) U(t) p(t) Learning algorithm. Given an utterance-scene pair (U(t) , S(t) ) received at time t, the model first calculates an alignment score a for each word w ∈ laantde se aanch a sleigmnamnetinct f secaoturere a f ∈ Sea(tc)h . Aw oserdm want ∈ic Ufea- U(t) × taunrde can h b see aligned fteoa a wreo frd ∈ according to the meaning acquired for that word from previous observations (word-based alignment, or aw). Alternatively, distributional clues of the word can be used to determine its category, and the semantic features can be aligned to the word according to the meaning associated to its category (category-based alignment, or ac). We combine these two sources of evidence when estimating an alignment score: a(w|f, U(t), S(t)) = λ(w) +(1 − λ(w)) U(t), S(t)) ac(w|f, U(t), S(t)) aw(w|f, (1) where the word-based and category-based alignment scores are estimated based on the acquired meanings of the word and its category, respectively: aw(w|f,U(t),S(t)) = Xp(t−p1(t)(−f1|)w(f)|wk) wkX∈U(t) ac(w|f,U(t),S(t)) = Xp(t−p1()t(−f1)|(cfat|c(awt)()wk)) wkX∈U(t) The relative contribution ofthe word-based versus the category-based alignment is determined by the weight function λ(w). Cross-situational evidence is a reliable cue for frequent words; on the other hand, the category-based score is most informative when the model encounters a low-frequency word (See Alishahi and Fazly (2010) for a full analysis of the frequency effect). Therefore, we define λ(w) as a function of the frequency of the word n(w) : λ(w) = n(w)/(n(w) + 1) Once an alignment score is calculated for each word w ∈ and each feature f ∈ S(t) , the model rweovirsdes w t ∈he U meanings cohf aeallt utrhee fw ∈ord Ss in and U(t) U(t) 645 their corresponding categories as follows: assoc(t)(w, f) = assoc(t−1)(w, f) + a(w|f, U(t), S(t)) where assoc(t−1) (w, f) is zero if w and f have not co-occurred before. 
These association scores are then used to update the meaning of the words in the current input: × p(t)(f|w) =Xasasossco(tc)( tf),(f wj,) w) fXj X∈F (2) where F is the set of all features seen so far. We use a hsmeroeo Fthe isd thveers sieotn o of fa ltlh fies aftourrmesul sea eton asocc faorm.m Woed uastee noisy or rare input. This process is repeated for all the input pairs, one at a time. Uniform categories. Adding the category-based alignment as a new factor to Eqn. (1) might imply that the role of categories in this model is nothing more than smoothing the cross-situational-based alignment of words and referents. In order to investigate this issue, we use the following alignment formula as an informed baseline in our experiments, where we replace ac(· |f, U(t) , S(t)) with a uniform distribution:1 a(w|f, U(t), S(t)) = λ(w) U(t), S(t)) +(1 − λ(w)) ×|U1(t)| aw(w|f, (3) where aw (w| f, U(t) , S(t) ) and λ(w) are estimated as before. In( our experiments in Section 4, we refer to this baseline as the ‘uniform’ condition. 3 Online induction of word classes with LDA Empirical findings suggest that young children form their knowledge of abstract categories, such as verbs, nouns, and adjectives, gradually (e.g. Gelman and Taylor 1984, Kemp et al. 2005). In addition, several unsupervised computational models have been proposed for inducing categories of words which resemble part-of-speech categories, by 1We thank an anonymous reviewers for suggesting this dition as an informed baseline. con- drawing on distributional properties of their context (see for example Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). However, explicit accounts of how such categories can be integrated in a crosssituational model of word learning have been rare. Here we adopt an online version of the model proposed in Chrupała (201 1), a method of soft word class learning using Latent Dirichlet Allocation. The approach is much more efficient than the commonly used alternative (Brown clustering, (Brown et al. 1992)) while at the same time matching or outperforming it when the word classes are used as automatically learned features for supervised learning of various language understanding tasks. Here we adopt this model as our approach to learning lexical categories. In Section 3.1 we describe the LDA model for word classes; in Section 3.2 we discuss the online Gibbs sampler we use for inference. 3.1 Word class learning with LDA Latent Dirichlet Allocation (LDA) was introduced by Blei et al. (2003) and is most commonly used for modeling the topic structure in document collections. It is a generative, probabilistic hierarchical Bayesian model that induces a set of latent variables, which correspond to the topics. The topics themselves are multinomial distributions over words. The generative structure of the LDA model is the following: φk ∼ Dirichlet(β) , k ∈ [1, K] zθdnd∼ D Ciraitcehgloetr(icαa),l(θd), wnd ∼ dnd ∈∈ [1 [1,D,N]d] Categorical(φznd ) , nd (4) ∈ [1, Nd] Chrupała (201 1) reinterprets the LDA model in terms of word classes as follows: K is the number of classes, D is the number of unique word types, Nd is the number ofcontext features (such as right or left neighbor) associated with word type d, znd is the class ofword type d in the ntdh context, and wnd is the ntdh context feature of word type d. Hyperparameters α and β control the sparseness of the vectors θd and φk. 
646 Wordtype Features HowdoR do you HowL doL youR youL doR Table 1: Matrix of context features 1.8M words (CHILDES) 100M words (BNC) tsbgmrhioav oinekseycrbhaolrb ilnetbhgiesbJcmula nsceinkswMlwaionhalrmgituceahnge Table 2: Most similar word pairs As an example consider the small corpus consisting of the single sentence How do you do. The rows in Table 1 show the features w1 . . . wNd for each word type d if we use each word’s left and right neighbors as features, and subscript words with L and R to indicate left and right. After inference, the θd parameters correspond to word class probability distributions given a word type while the φk correspond to feature distributions given a word class: the model provides a probabilistic representation for word types independently of their context, and also for contexts independently of the word type. Probabilistic, soft word classes are more expressive than hard categories. First, they make it easy and efficient to express shared ambiguities: Chrupała (201 1) gives an example of words used as either first names or surnames, and this shared ambiguity is reflected in the similarity of their word class distributions. Second, with soft word classes it becomes easy to express graded similarity between words: as an example, Table 2 shows a random selection out of the 100 most similar word pairs according to the Jensen-Shannon divergence between their word class distributions, according to a word class model with 25 classes induced from (i) 1.8 million words of the CHILDES corpus or (ii) 100 million word of the BNC corpus. The similarities were measured between each of the 1000 most frequent CHILDES or BNC words. 3.2 Online Gibbs sampling for LDA There have been a number of attempts to develop online inference algorithms for topic modeling with LDA. A simple modification of the standard Gibbs sampler (o-LDA) was proposed by Song et al. (2005) and Banerjee and Basu (2007). Canini et al. (2009) experiment with three sampling algorithms for online topic inference: (i) oLDA, (ii) incremental Gibbs sampler, and (iii) a par- ticle filter. Only o-LDA is truly online in the sense that it does not revisit previously seen documents. The other two, the incremental Gibbs sampler and the particle filter, keep seen documents and periodically resample them. In Canini et al.’s experiments all of the online algorithms perform worse than the standard batch Gibbs sampler on a document clustering task. Hoffman et al. (2010) develop an online version of the variational Bayes (VB) optimization method for inference for topic modeling with LDA. Their method achieves good empirical results compared to batch VB as measured by perplexity on heldout data, especially when used with large minibatch sizes. Online VB for LDA is appropriate when streaming documents: with online VB documents are represented as word count tables. In our scenario where we apply LDA to modeling word classes we need to process context features from sentences arriving in a stream: i.e. we need to sample entries from a table like Table 1 in order of arrival rather than row by row. This means that online VB is not directly applicable to online word-class induction. However it also means that one issue with o-LDA identified by Canini et al. (2009) is ameliorated. When sampling in a topic modeling setting, documents are unique and are never seen again. Thus, the topics associated with old documents get stale and need to be periodically rejuvenated (i.e. resampled). 
This is the reason why the incremental Gibbs sampler and the particle filter algorithms in Canini et al. (2009) need to keep old documents around and cannot run in a true online fashion. Since for word class modeling we stream context features as they arrive, we will continue to see features associated with the seen word types, and will automatically resample their class assignments. In exploratory ex647 periments we have seen that this narrows the performance gap between the o-LDA sampler and the batch collapsed Gibbs sampler. We present our version of the o-LDA sampler in Algorithm 1. For each incoming sentence t we run J passes of sampling, updating the counts tables after each sampling step. We sample the class assignment zti for feature wti according to: P(zt|zt−1,wt,dt) ∝(nztt−,d1Pt+jV=t− α11)n ×ztt−, (w1njtz−t+,1wt β+ β), (5) where stands for the nPumber of times class z co-occurred with word type d up to step t, and similarly ntz,w is the number of times feature w was assigned to class z. Vt is the number of unique features seen up to step t, while α and β are the LDA hyperparameters. There are two differences between the original o-LDA and our version: we do not initialize the algorithm with a batch run over a prefix of the data, and we allow more than one sampling pass per sentence.2 Exploratory experiments have shown that batch initialization is unnecessary, and that multiple passes typically improve the quality of the induced word classes. ntz,d Algorithm 1 Online Gibbs sampler for word class induction with LDA for t = 1 → ∞ do fto =r j = →1 → dJo do fjor = =i = →1 → It do sample zti ∼ P(zti |zti−1 , wti , dti ) increment ntzti,wti and ntzti,dti Figure 1 shows the top 10 words for each of the 10 word classes induced with our online Gibbs sampler from 1.8 million words of CHILDES. Similarly, Figure 2 shows the top 10 words for 5 randomly chosen topics out of 50, learned online from 100 million words of the BNC. The topics are relatively coherent and at these levels of granularity express mostly part of speech and subcategorization frame information. Note that for each word class we show the words most frequently assigned to it while Gibbs sampling. 2Note that we do not allow multiple passes over the stream of sentences. Rather, while processing the current sentence, we allow the words in this sentence to be sampled more than once. CHILDES taIwhsyeatiofheuiwsmh’esorihntea shdoeam,tyihrewa thseliasrenhao wT,othYwaueotlbdietHcsdhaieyuors eIaitdwfmboyerwhat Figure 2: Top 10 words of 5 randomly chosen classes learned from BNC Since we are dealing with soft classes, most wordtypes have non-zero assignment probabilities for many classes. Thus frequently occurring words such as not will typically be listed for several classes. 4 Evaluation 4.1 Experimental setup As training data, we extract utterances from the Manchester corpus (Theakston et al. 2001) in the CHILDES database (MacWhinney 1995), a corpus that contains transcripts of conversations with children between the ages of 1 year, 8 months and 3 years. We use the mother’s speech from transcripts of 12 children (henceforth referred to by children’s names). 
We run word class induction while simultaneously outputting the highest scoring word-class label for each word: for a new sentence, we sample class assignments for each feature (doing J passes), update the counts, and then for each word dti output the highest scoring class label according to argmaxz ntz,dti (where ntz,dti stands for the num- 648 ber of times class z co-occurred with word type dti up to step t). During development we ran the online word class induction module on data for Aran, Becky, Carl and Anne and then started the word learning module for the Anne portion while continuing inducing categories. We then evaluated word learning on Anne. We chose the parameters of the word class induction module based on those development results: α = 10, β = 0.1, K = 10 and J = 20. We used cross-validation for the final evaluation. PFor each of six data files (Anne, Aran, Becky, Carl, Dominic and Gail), we ran word-class induction on the whole corpus with the chosen file last, and then started applying the word-learning algorithm on this last chosen file (while continuing with category induction). We evaluated how well word meanings were learned in those six cases. We follow Alishahi and Fazly (2010) in the construction of the input. We need a semantic representation paired with each utterance. Such a representation is not available from the corpus and has to be P1K=1 constructed. We automatically construct a gold lexicon for all nouns and verbs in this corpus as follows. For each word, we extract all hypernyms for its first sense in the appropriate (verb or noun) hierarchy in WordNet (Fellbaum 1998), and add the first word in the synset of each hypernym to the set of semantic features for the target word. For verbs, we also extract features from VerbNet (Kipper et al. 2006). A small subset of words (pronouns and frequent quantifiers) are also manually added. This lexicon represents the true meaning of each word, and is used in generating the scene representations in the input and in evaluation. For each utterance in the input corpus, we form the union of the feature representations of all its words. Words not found in the lexicon (i.e. for which we could not extract a semantic representation from WordNet and VerbNet) are removed from the utterance (only for the word learning module). In order to simulate the high level of noise that children receive from their environment, we follow Alishahi and Fazly (2010) and pair each utterance with a combination of its own scene representation and the scene representation for the following utter- ance. This decision was based on the intuition that consequent utterances are more likely to be about re- SUctenra:ce{BmCPAOLRoNAImOTMSCmEAU,yTOMELaPB,ItTeJH,EIVUObCEMrNToG,cAE. NTo.C,Al}Ti.BILO},EN. Figure 3: A sample input item to the word learning model lated topics and scenes. This results in a (roughly) 200% ambiguity. In addition, we remove the meaning of one random word from the scene representation of every second utterance in an attempt to simulate cases where the referent of an uttered word is not within the perception field (such as ‘daddy is not home yet’). A sample utterance and its corresponding scene are shown in Figure 3. As mentioned before, many words in our input corpus are polysemous. For such words, we extract different sets of features depending on their manually tagged part of speech and keep them in the lexicon (e.g. the lexicon contains two different entries for set:N and set:V). 
When constructing a scene representation for an utterance which contains an ambiguous word, we choose the correct sense from our lexicon according to the word’s part of speech tag in Manchester corpus. In the experiments reported in the next section, we assess the performance of our model on learning words at each point in time: for each target word, we compare its set of features in the lexicon with its probability distribution over the semantic features that the model has learned. We use mean average precision (MAP) to measure how well (· |w) ranks the features of w. p(t) 4.2 Learning curves To understand whether our categories contribute to learning of word–meaning mappings, we compare the pattern of word learning over time in four conditions. The first condition represents our baseline, in which we do not use category-based alignment in the word learning model by setting λ(w) = 1 in Eqn. (1). In the second condition we use a set of uniformly distributed categories for alignment, as estimated by Eqn. (3) on page 3 (this condition is introduced to examine whether categories act as more than a simple smoothing factor in the align649 UCNoantinefego rmryAvg.0 M0. 6 A3236PStd.0 . D.0 e3 v2 . TabPlLeDO3AS:FinalMeanAv0 e.6 r5a7g29ePrecis0 io.n0 32s09cores ment process.) In the third condition we use the categories induced by online LDA in the word learning model. The fourth condition represents the performance ceiling, in which we use the pre-defined and manually annotated part of speech categories from the Manchester corpus. Table 3 shows the average and the standard deviation of the final MAP scores across the six datasets, for the four conditions (no categories, uniform categories, LDA categories and gold part-of-speech tags). The differences between LDA and None, and between LDA and Uniform are statistically signif- icant according to the paired t test (p < 0.01), while the difference between LDA and POS is not (p = 0.16). Figure 4 shows the learning curves in each condition, averaged over the six splits explained in the previous section. The top panel shows the average learning curve over the minimum number of sentences across the six sub-corpora (8800 sentences). The curves show that our LDA categories significantly improve the performance of the model over both baselines. That means that using these categories can improve word learning compared to not using them and relying on cross-situational evidence alone. Moreover, LDA-induced categories are not merely acting as a smoothing function the way the ‘uniform’ categories are. Our results show that they are bringing relevant information to the task at hand, that is, improving word learning by using the sentential context. In fact, this improvement is comparable to the improvement achieved by integrating the ‘gold-standard’ POS categories. The middle and bottom panels of Figure 4 zoom in on shorter time spans (5000 and 1000 sentences, respectively). These diagrams suggest that the pat- tern of improvement over baseline is relatively constant, even at very early stages of learning. In fact, once the model receives enough input data, crosssituational evidence becomes stronger (since fewer words in the input are encountered for the first time) and the contribution of the categories becomes less significant. 4.3 Class granularity In Figure 5 we show the influence of the number of word classes used on the performance in word learning. 
4.2 Learning curves

To understand whether our categories contribute to the learning of word–meaning mappings, we compare the pattern of word learning over time in four conditions. The first condition represents our baseline, in which we do not use category-based alignment in the word learning model, setting λ(w) = 1 in Eqn. (1). In the second condition we use a set of uniformly distributed categories for alignment, as estimated by Eqn. (3) on page 3 (this condition is introduced to examine whether categories act as more than a simple smoothing factor in the alignment process). In the third condition we use the categories induced by online LDA in the word learning model. The fourth condition represents the performance ceiling, in which we use the pre-defined and manually annotated part-of-speech categories from the Manchester corpus.

[Table 3: Final Mean Average Precision scores (average and standard deviation across the six datasets) for the None, Uniform, LDA and POS conditions.]

Table 3 shows the average and the standard deviation of the final MAP scores across the six datasets for the four conditions (no categories, uniform categories, LDA categories and gold part-of-speech tags). The differences between LDA and None, and between LDA and Uniform, are statistically significant according to a paired t-test (p < 0.01), while the difference between LDA and POS is not (p = 0.16).

Figure 4 shows the learning curves in each condition, averaged over the six splits explained in the previous section. The top panel shows the average learning curve over the minimum number of sentences across the six sub-corpora (8800 sentences). The curves show that our LDA categories significantly improve the performance of the model over both baselines. That is, using these categories improves word learning compared to not using them and relying on cross-situational evidence alone. Moreover, LDA-induced categories are not merely acting as a smoothing function the way the ‘uniform’ categories are. Our results show that they bring relevant information to the task at hand, that is, they improve word learning by exploiting the sentential context. In fact, this improvement is comparable to the improvement achieved by integrating the ‘gold-standard’ POS categories. The middle and bottom panels of Figure 4 zoom in on shorter time spans (5000 and 1000 sentences, respectively). These diagrams suggest that the pattern of improvement over the baseline is relatively constant, even at very early stages of learning. In fact, once the model receives enough input data, cross-situational evidence becomes stronger (since fewer words in the input are encountered for the first time) and the contribution of the categories becomes less significant.

[Figure 4: Mean average precision for all observed words at each point in time for four conditions: with gold POS categories, with LDA categories, with uniform categories, and without using categories. Each panel displays a different time span.]

4.3 Class granularity

In Figure 5 we show the influence of the number of word classes on word learning performance. It is evident that in the range between 5 and 20 classes the performance of the word learning module is quite stable and insensitive to the exact class granularity. Even with only 5 classes the model can still roughly distinguish noun-like words from verb-like and pronoun-like words, and this helps it learn the meaning elements derived from the higher levels of the WordNet hierarchy. Notwithstanding that, we would ideally like to avoid having to pre-specify the number of classes for the word class induction module: we thus plan to investigate non-parametric models such as the Hierarchical Dirichlet Process for this purpose.

[Figure 5: Mean average precision for all observed words at each point in time in four conditions: using online LDA categories with 20, 10 and 5 classes, and without using categories.]

5 Related Work

This paper investigates the interplay between two language learning tasks which have so far been studied in isolation: the acquisition of lexical categories from distributional clues, and the learning of the mapping between words and meanings. Previous models have shown that lexical categories can be learned from unannotated text, mainly drawing on distributional properties of words (e.g., Redington et al. 1998, Clark 2000, Mintz 2003, Parisien et al. 2008, Chrupała and Alishahi 2010). Independently, several computational models have exploited cross-situational evidence in learning the correct mappings between words and meanings, using rule-based inference (Siskind 1996), neural networks (Li et al. 2004, Regier 2005), hierarchical Bayesian models (Frank et al. 2007) and probabilistic alignment inspired by machine translation models (Yu 2005, Fazly et al. 2010).

There are only a few existing computational models that explore the role of syntax in word learning. Maurits et al. (2009) investigate the joint acquisition of word meaning and word order using a batch model. This model is tested on an artificial language with a simple first-order predicate representation of meaning and limited built-in possibilities for word order. The model of Niyogi (2002) simulates the mutual bootstrapping effects of syntactic and semantic knowledge in verb learning, that is, the use of syntax to aid in inducing the semantics of a verb, and the use of semantics to narrow down the possible syntactic frames in which a verb can participate. However, this model relies on manually assigned priors for associations between syntactic and semantic features, and is tested on a toy language with a very limited vocabulary and a constrained syntax. Yu (2006) integrates automatically induced syntactic word categories into his model of cross-situational word learning, showing that they can improve the model's performance. Yu's model also processes input utterances in batch mode, and its evaluation is limited to situations in which only a coarse distinction between referring words (words that could potentially refer to objects in a scene, e.g., concrete nouns) and non-referring words (words that cannot possibly refer to objects, e.g., function words) is sufficient. It is thus not clear whether information about finer-grained categories (e.g., verbs and nouns) can indeed help word learning in a more naturalistic incremental setting.
On the other hand, the model of Alishahi and Fazly (2010) integrates manually annotated part-of-speech tags into an incremental word learning algorithm, and shows that these tags boost overall word learning performance, especially for infrequent words.

In a different line of research, a number of models have been proposed which study the acquisition of the link between syntax and semantics within the Combinatory Categorial Grammar (CCG) framework (Briscoe 1997, Villavicencio 2002, Buttery 2006, Kwiatkowski et al. 2012). These approaches set the parameters of a semantic parser on a corpus of utterances paired with a logical form as their meaning. These models bring in extensive and detailed prior assumptions about the nature of the syntactic representation (i.e., atomic categories such as S and NP, and built-in rules which govern their combination), as well as about the representation of meaning via the formalism of lambda calculus. This is fundamentally different from the approach taken in this paper, which by comparison assumes only very simple syntactic and semantic representations. We view word and category learning as stand-alone cognitive tasks with independent representations (word meanings as probabilistic collections of properties or features as opposed to single symbols; categories as sets of word tokens with similar context distributions), and we do not bring in any prior knowledge of specific atomic categories.

6 Conclusion

In this paper, we show the plausibility of using automatically and incrementally induced categories while learning word meanings. Our results suggest that the sentential context in which a word appears across its different uses can be used as a complementary source of guidance for mapping it to its featural meaning representation. In Section 4 we show that the improvement achieved by our categories is comparable to that gained by integrating gold POS categories. This result is very encouraging, since manually assigned POS tags are typically believed to set the upper bound on the usefulness of category information. We believe that automatically induced categories have the potential to do even better: Chrupała and Alishahi (2010) have shown that categories induced from usage data in an unsupervised fashion can be used more effectively than POS categories in a number of tasks. In our experiments on the development data we observed some improvements over POS categories. This advantage can result from the fact that our categories are more fine-grained (if also noisier) than POS categories, which sometimes yields more accurate predictions.

One important characteristic of the category induction algorithm used in this paper is that it provides a soft categorization scheme, where each word is associated with a probability distribution over all categories. In future work, we plan to exploit this feature: when estimating the category-based alignment, we can interpolate the predictions of the multiple categories to which a word belongs, weighted by the probabilities associated with its membership in each category.

Acknowledgements

Grzegorz Chrupała was funded by the German Federal Ministry of Education and Research (BMBF) under grant number 01IC10S01O as part of the Software-Cluster project EMERGENT (www.software-cluster.org).

References

Alishahi, A. and Fazly, A. (2010). Integrating Syntactic Knowledge into a Model of Cross-situational Word Learning. In Proceedings of the 32nd Annual Conference of the Cognitive Science Society.
Banerjee, A. and Basu, S. (2007). Topic models over text streams: A study of batch and online unsupervised learning. In SIAM Data Mining.
Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.
Briscoe, T. (1997). Co-evolution of language and of the language acquisition device. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 418–427. Association for Computational Linguistics.
Brown, P. F., Mercer, R. L., Della Pietra, V. J., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.
Buttery, P. (2006). Computational models for first language acquisition. Computer Laboratory, University of Cambridge, Tech. Rep. UCAM-CL-TR-675.
Canini, K., Shi, L., and Griffiths, T. (2009). Online inference of topics with latent Dirichlet allocation. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
Chrupała, G. (2011). Efficient induction of probabilistic word classes with LDA. In International Joint Conference on Natural Language Processing.
Chrupała, G. and Alishahi, A. (2010). Online Entropy-based Model of Lexical Category Acquisition. In CoNLL 2010.
Clark, A. (2000). Inducing syntactic categories by context distribution clustering. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, pages 91–94. Association for Computational Linguistics, Morristown, NJ, USA.
Fazly, A., Alishahi, A., and Stevenson, S. (2010). A Probabilistic Computational Model of Cross-Situational Word Learning. Cognitive Science, 34(6):1017–1063.
Fellbaum, C., editor (1998). WordNet, An Electronic Lexical Database. MIT Press.
Frank, M. C., Goodman, N. D., and Tenenbaum, J. B. (2007). A Bayesian framework for cross-situational word-learning. In Advances in Neural Information Processing Systems, volume 20.
Gelman, S. and Taylor, M. (1984). How two-year-old children interpret proper and common names for unfamiliar objects. Child Development, pages 1535–1540.
Gertner, Y., Fisher, C., and Eisengart, J. (2006). Learning words and rules: Abstract knowledge of word order in early sentence comprehension. Psychological Science, 17(8):684–691.
Gillette, J., Gleitman, H., Gleitman, L., and Lederer, A. (1999). Human simulations of vocabulary learning. Cognition, 73(2):135–176.
Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1:135–176.
Hoffman, M., Blei, D., and Bach, F. (2010). Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems.
Kemp, N., Lieven, E., and Tomasello, M. (2005). Young Children’s Knowledge of the “Determiner” and “Adjective” Categories. Journal of Speech, Language and Hearing Research, 48(3):592–609.
Kipper, K., Korhonen, A., Ryant, N., and Palmer, M. (2006). Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress.
Koehne, J. and Crocker, M. W. (2010). Sentence processing mechanisms influence cross-situational word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.
Koehne, J. and Crocker, M. W. (2011). The interplay of multiple mechanisms in word learning. In Proceedings of the Annual Conference of the Cognitive Science Society.
Kwiatkowski, T., Goldwater, S., Zettlemoyer, L., and Steedman, M. (2012). A probabilistic model of syntactic and semantic acquisition from child-directed utterances and their meanings. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.
Li, P., Farkas, I., and MacWhinney, B. (2004). Early lexical development in a self-organizing neural network. Neural Networks, 17:1345–1362.
Lidz, J., Bunger, A., Leddon, E., Baier, R., and Waxman, S. R. (2010). When one cue is better than two: lexical vs. syntactic cues to verb learning. Unpublished manuscript.
MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates, second edition.
Maurits, L., Perfors, A. F., and Navarro, D. J. (2009). Joint acquisition of word order and word reference. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.
Mintz, T. (2003). Frequent frames as a cue for grammatical categories in child-directed speech. Cognition, 90(1):91–117.
Monaghan, P. and Mattock, K. (2009). Cross-situational language learning: The effects of grammatical categories as constraints on referential labeling. In Proceedings of the 31st Annual Conference of the Cognitive Science Society.
Naigles, L. and Hoff-Ginsberg, E. (1995). Input to Verb Learning: Evidence for the Plausibility of Syntactic Bootstrapping. Developmental Psychology, 31(5):827–37.
Niyogi, S. (2002). Bayesian learning at the syntax–semantics interface. In Proceedings of the 24th Annual Conference of the Cognitive Science Society, pages 697–702.
Parisien, C., Fazly, A., and Stevenson, S. (2008). An incremental Bayesian model for learning syntactic categories. In Proceedings of the Twelfth Conference on Computational Natural Language Learning.
Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.
Quine, W. (1960). Word and Object. Cambridge University Press, Cambridge, MA.
Redington, M., Chater, N., and Finch, S. (1998). Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science: A Multidisciplinary Journal, 22(4):425–469.
Regier, T. (2005). The emergence of words: Attentional learning in form and meaning. Cognitive Science, 29:819–865.
Siskind, J. M. (1996). A computational study of cross-situational techniques for learning word-to-meaning mappings. Cognition, 61:39–91.
Smith, L. and Yu, C. (2007). Infants rapidly learn words from noisy data via cross-situational statistics. In Proceedings of the 29th Annual Conference of the Cognitive Science Society.
Song, X., Lin, C., Tseng, B., and Sun, M. (2005). Modeling and predicting personal information dissemination behavior. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 479–488. ACM.
Theakston, A. L., Lieven, E. V., Pine, J. M., and Rowland, C. F. (2001). The role of performance limitations in the acquisition of verb-argument structure: An alternative account. Journal of Child Language, 28:127–152.
Villavicencio, A. (2002). The acquisition of a unification-based generalised categorial grammar. In Proceedings of the Third CLUK Colloquium, pages 59–66.
Yu, C. (2005). The emergence of links between lexical acquisition and object categorization: A computational study. Connection Science, 17(3–4):381–397.
Yu, C. (2006). Learning syntax–semantics mappings to bootstrap word learning. In Proceedings of the 28th Annual Conference of the Cognitive Science Society.
4 0.33339307 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
Author: Wen-tau Yih ; Geoffrey Zweig ; John Platt
Abstract: Existing vector space models typically map synonyms and antonyms to similar word vectors, and thus fail to represent antonymy. We introduce a new vector space representation where antonyms lie on opposite sides of a sphere: in the word vector space, synonyms have cosine similarities close to one, while antonyms are close to minus one. We derive this representation with the aid of a thesaurus and latent semantic analysis (LSA). Each entry in the thesaurus (a word sense along with its synonyms and antonyms) is treated as a “document,” and the resulting document collection is subjected to LSA. The key contribution of this work is to show how to assign signs to the entries in the co-occurrence matrix on which LSA operates, so as to induce a subspace with the desired property. We evaluate this procedure with the Graduate Record Examination questions of Mohammed et al. (2008) and find that the method improves on the results of that study. Further improvements result from refining the subspace representation with discriminative training, and from augmenting the training data with general newspaper text. Altogether, we improve on the best previous results by 11 points absolute in F-measure.
5 0.31722623 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: We consider the problem of inducing grapheme-to-phoneme mappings for unknown languages written in a Latin alphabet. First, we collect a data-set of 107 languages with known grapheme-phoneme relationships, along with a short text in each language. We then cast our task in the framework of supervised learning, where each known language serves as a training example, and predictions are made on unknown languages. We induce an undirected graphical model that learns phonotactic regularities, thus relating textual patterns to plausible phonemic interpretations across the entire range of languages. Our model correctly predicts grapheme-phoneme pairs with over 88% F1-measure.
6 0.28975031 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
7 0.28111529 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
8 0.2581411 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
9 0.24523431 19 emnlp-2012-An Entity-Topic Model for Entity Linking
10 0.22283384 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
11 0.22120674 116 emnlp-2012-Semantic Compositionality through Recursive Matrix-Vector Spaces
12 0.20106497 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
13 0.19341229 53 emnlp-2012-First Order vs. Higher Order Modification in Distributional Semantics
14 0.18483496 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
15 0.17974785 49 emnlp-2012-Exploring Topic Coherence over Many Models and Many Topics
16 0.17542495 87 emnlp-2012-Lyrics, Music, and Emotions
17 0.17200272 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
18 0.1661779 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP
19 0.16609496 99 emnlp-2012-On Amortizing Inference Cost for Structured Prediction
20 0.16028062 103 emnlp-2012-PATTY: A Taxonomy of Relational Patterns with Semantic Types
topicId topicWeight
[(2, 0.012), (16, 0.02), (25, 0.018), (34, 0.05), (45, 0.045), (60, 0.484), (63, 0.058), (64, 0.015), (65, 0.024), (70, 0.025), (73, 0.01), (74, 0.033), (76, 0.054), (80, 0.022), (86, 0.024), (95, 0.022)]
simIndex simValue paperId paperTitle
1 0.99068475 58 emnlp-2012-Generalizing Sub-sentential Paraphrase Acquisition across Original Signal Type of Text Pairs
Author: Aurelien Max ; Houda Bouamor ; Anne Vilnat
Abstract: This paper describes a study on the impact of the original signal (text, speech, visual scene, event) of a text pair on the task of both manual and automatic sub-sentential paraphrase acquisition. A corpus of 2,500 annotated sentences in English and French is described, and performance on this corpus is reported for an efficient system combination exploiting a large set of features for paraphrase recognition. A detailed quantified typology of subsentential paraphrases found in our corpus types is given.
2 0.9887017 84 emnlp-2012-Linking Named Entities to Any Database
Author: Avirup Sil ; Ernest Cronin ; Penghai Nie ; Yinfei Yang ; Ana-Maria Popescu ; Alexander Yates
Abstract: Existing techniques for disambiguating named entities in text mostly focus on Wikipedia as a target catalog of entities. Yet for many types of entities, such as restaurants and cult movies, relational databases exist that contain far more extensive information than Wikipedia. This paper introduces a new task, called Open-Database Named-Entity Disambiguation (Open-DB NED), in which a system must be able to resolve named entities to symbols in an arbitrary database, without requiring labeled data for each new database. We introduce two techniques for Open-DB NED, one based on distant supervision and the other based on domain adaptation. In experiments on two domains, one with poor coverage by Wikipedia and the other with near-perfect coverage, our Open-DB NED strategies outperform a state-of-the-art Wikipedia NED system by over 25% in accuracy.
3 0.98728788 41 emnlp-2012-Entity based QA Retrieval
Author: Amit Singh
Abstract: Bridging the lexical gap between the user’s question and the question-answer pairs in the Q&A archives has been a major challenge for Q&A retrieval. State-of-the-art approaches address this issue by implicitly expanding the queries with additional words using statistical translation models. While useful, the effectiveness of these models is highly dependent on the availability of a quality corpus, in the absence of which they are troubled by noise. Moreover, these models perform word-based expansion in a context-agnostic manner, resulting in translations that might be mixed and fairly general. This results in degraded retrieval performance. In this work we address the above issues by extending the lexical word-based translation model to incorporate semantic concepts (entities). We explore strategies to learn the translation probabilities between words and concepts using the Q&A archives and a popular entity catalog. Experiments conducted on large-scale real data show that the proposed techniques are promising.
4 0.98441589 68 emnlp-2012-Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation
Author: Wenbin Jiang ; Fandong Meng ; Qun Liu ; Yajuan Lu
Abstract: In this paper we first describe the technology of automatic annotation transformation, which is based on the annotation adaptation algorithm (Jiang et al., 2009). It can automatically transform a human-annotated corpus from one annotation guideline to another. We then propose two optimization strategies, iterative training and predict-self reestimation, to further improve the accuracy of annotation guideline transformation. Experiments on Chinese word segmentation show that the iterative training strategy together with predict-self reestimation brings significant improvement over the simple annotation transformation baseline, and leads to classifiers with significantly higher accuracy and several times faster processing than annotation adaptation does. On the Penn Chinese Treebank 5.0, it achieves an F-measure of 98.43%, significantly outperforming previous work despite using a single classifier with only local features.
same-paper 5 0.98371196 61 emnlp-2012-Grounded Models of Semantic Representation
Author: Carina Silberer ; Mirella Lapata
Abstract: A popular tradition of studying semantic representation has been driven by the assumption that word meaning can be learned from the linguistic environment, despite ample evidence suggesting that language is grounded in perception and action. In this paper we present a comparative study of models that represent word meaning based on linguistic and perceptual data. Linguistic information is approximated by naturally occurring corpora and sensorimotor experience by feature norms (i.e., attributes native speakers consider important in describing the meaning of a word). The models differ in terms of the mechanisms by which they integrate the two modalities. Experimental results show that a closer correspondence to human data can be obtained by uncovering latent information shared among the textual and perceptual modalities rather than arriving at semantic knowledge by concatenating the two.
6 0.95470625 48 emnlp-2012-Exploring Adaptor Grammars for Native Language Identification
7 0.90749562 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities
8 0.88328308 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
9 0.87801117 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
10 0.86618382 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
11 0.86107618 137 emnlp-2012-Why Question Answering using Sentiment Analysis and Word Classes
12 0.85486889 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?
13 0.85466343 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
14 0.85158181 19 emnlp-2012-An Entity-Topic Model for Entity Linking
15 0.84872961 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction
16 0.83483839 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents
17 0.82275683 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
18 0.82236516 72 emnlp-2012-Joint Inference for Event Timeline Construction
19 0.81327337 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
20 0.81196779 128 emnlp-2012-Translation Model Based Cross-Lingual Language Model Adaptation: from Word Models to Phrase Models