emnlp emnlp2013 emnlp2013-11 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal mod- els to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
Reference: text
sentIndex sentText sentNum sentScore
1 In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al. [sent-4, score-0.193]
2 (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. [sent-7, score-0.285]
3 We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure. [sent-10, score-0.51]
4 1 Introduction In recent years, an increasing body of work has been devoted to multimodal or “grounded” models of language. [sent-11, score-0.193]
5 Some efforts have tackled tasks such as automatic image caption generation (Feng and Lapata, 2010a; Ordonez et al. [sent-18, score-0.181]
6 Another line of research approaches grounded language knowledge by augmenting distributional approaches of word meaning with perceptual information (Andrews et al. [sent-23, score-0.208]
7 In this paper, we explore various ways to integrate new perceptual information into a multimodal distributional model of word meaning, through novel computational modeling of this grounded knowledge. [sent-33, score-0.363]
8 While prior work has used the model only with feature norms and visual attributes, we show that low-level image features are directly compatible with the model and provide improved representations of word meaning. [sent-40, score-0.694]
9 We also show how simple, unsupervised clusters of images can act as a semantically useful and qualitatively interesting set of features. [sent-41, score-0.325]
10 We find that each modality provides useful but disjoint information for describing word meaning, and that a hybrid integration of multiple modalities provides significant improvements in the representations of word meaning. [sent-43, score-0.659]
11 Many approaches to multimodal research have succeeded by abstracting away raw perceptual information and using high-level representations instead. [sent-52, score-0.314]
12 (2004)), and feature norms, where subjects are given a cue word and asked to describe typical properties of the cue concept (e. [sent-60, score-0.196]
13 (2007) helped pave the path for cognitive-linguistic multimodal research, showing that Latent Dirichlet Allocation outperformed Latent Semantic Analysis (Deerwester et al. [sent-65, score-0.193]
14 (2009) furthered this work by showing that a bimodal topic model, consisting of both text and feature norms, outperformed models using only one modality on the prediction of association norms, word substitution errors, and semantic interference tasks. [sent-68, score-0.51]
15 Johns and Jones (2012) take an entirely different approach by showing that one can successfully infer held out feature norms from weighted mixtures based on textual similarity. [sent-70, score-0.355]
16 Silberer and Lapata (2012) introduce a new method of multimodal integration based on Canonical Correlation Analysis, and perform a systematic comparison between their CCA-based model and others on association norm prediction, held-out feature prediction, and word similarity. [sent-71, score-0.338]
17 As computer vision techniques have improved over the past decade, other research has begun directly using visual information in place of feature norms. [sent-72, score-0.297]
18 The topic model using the bimodal vocabulary outperforms a purely text-based model in word association and word similarity prediction. [sent-75, score-0.262]
19 Contextual visual features (i.e., the visual features around an object, rather than of the object itself) are even more useful at times, suggesting the plausibility of a sort of distributional hypothesis for images. [sent-81, score-0.233]
20 (2013) show that visual attribute classifiers, which have been immensely successful in object recognition (Farhadi et al. [sent-83, score-0.192]
21 We also introduce two new overlapping data sets: a collection of feature norms and a collection of images for a number of German nouns. [sent-96, score-0.578]
22 2 Cognitive Modalities Association Norms (AN) is a collection of association norms collected by Schulte im Walde et al. [sent-104, score-0.336]
23 In association norm experiments, subjects are presented with a cue word and asked to list the first few words that come to mind. [sent-106, score-0.208]
24 With enough subjects and responses, association norms can provide a common and detailed view of the meaning components of cue words. [sent-107, score-0.449]
25 Feature Norms (FN) is our new collection of feature norms for a group of 569 German nouns. [sent-109, score-0.347]
26 Note that the difference between association norms and feature norms is subtle, but important. [sent-116, score-0.623]
27 ImageNet is a large-scale and widely used image database, built on top of WordNet, which maps words into groups of images, called synsets (Deng et al. [sent-120, score-0.181]
28 For example, ImageNet contains two different synsets for the word mouse: one contains images of the animal, while the other contains images of the computer peripheral. [sent-123, score-0.402]
29 This BilderNetle data set provides mappings from German noun types to images of the nouns via ImageNet. [sent-124, score-0.201]
30 Finally, the German speakers review samples of the images for each word to ensure the quality of the mapping. (For brevity, we include the full details of the spammer identification, cleansing process and normalization techniques in the Supplementary Materials.) [sent-128, score-0.201]
31 After extracting sections of images using bounding boxes when provided by ImageNet (and using the entire image when bounding boxes are unavailable), the data set contains 1,305,602 images. [sent-133, score-0.382]
32 1 Image Processing After the collection of all the images, we extracted simple, low-level computer vision features to use as modalities in our experiments. [sent-136, score-0.432]
33 First, we compute a simple Bag of Visual Words (BoVW) model for our images using SURF keypoints (Bay et al. [sent-137, score-0.244]
34 We compute SURF keypoints for every image in our data set using SimpleCV3 and randomly sample 1% of the keypoints. [sent-141, score-0.224]
35 The keypoints are clustered into 5,000 visual codewords (centroids) using k-means clustering (Sculley, 2010), and images are then quantized over the 5,000 codewords. [sent-142, score-0.434]
36 All images for a given word are summed together to provide an average representation for the word. [sent-143, score-0.201]
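The BoVW pipeline just described (sample SURF keypoints, cluster them into 5,000 visual codewords, quantize each image over the codewords, and sum over a word's images) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: descriptor extraction via SimpleCV is assumed to have happened upstream, and scikit-learn's MiniBatchKMeans stands in for the scalable k-means of Sculley (2010).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, n_codewords=5000, sample_frac=0.01, seed=0):
    """Cluster a random sample of local descriptors (e.g. SURF) into visual codewords."""
    rng = np.random.default_rng(seed)
    sample = descriptors[rng.random(len(descriptors)) < sample_frac]
    return MiniBatchKMeans(n_clusters=n_codewords, random_state=seed).fit(sample)

def bovw_histogram(image_descriptors, codebook):
    """Quantize one image: count how often each visual codeword (centroid) occurs."""
    assignments = codebook.predict(image_descriptors)
    return np.bincount(assignments, minlength=codebook.n_clusters)

def word_representation(images_for_word, codebook):
    """Sum the BoVW histograms over all images associated with a word."""
    return sum(bovw_histogram(d, codebook) for d in images_for_word)
```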
37 To test that similar concepts should share similar visual codewords, we cluster the BoVW representations for all our images into 500 clusters with kmeans clustering, and represent each word as membership over the image clusters, forming the SURF Clusters modality. [sent-146, score-0.702]
38 Ideally, each cluster should have a common object or clear visual attribute, and words are expressed in terms of these visual commonalities. [sent-148, score-0.353]
39 We also compute GIST vectors (Oliva and Torralba, 2001) for every image using LearGIST (Douze et al. [sent-150, score-0.181]
40 Finally, as with the SURF features, we clustered the GIST representations for our images into 500 clusters, and represented words as membership in the clusters, forming the GIST Clusters modality. [sent-156, score-0.236]
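Both cluster-based modalities (SURF Clusters and GIST Clusters) follow the same recipe: cluster the per-image vectors into 500 groups and represent each word by how many of its images fall into each cluster. A minimal sketch under that reading; the per-image vectors and the word-to-image mapping are hypothetical inputs:

```python
import numpy as np
from sklearn.cluster import KMeans

def image_cluster_modality(image_vectors, images_per_word, n_clusters=500, seed=0):
    """Cluster per-image vectors (BoVW or GIST) into image clusters, then
    represent each word by how many of its images fall into each cluster.
    images_per_word maps a word to the row indices of its images."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(image_vectors)
    return {word: np.bincount(labels[idxs], minlength=n_clusters)
            for word, idxs in images_per_word.items()}
```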
41 4 Model Definition Our experiments are based on the multimodal extension of Latent Dirichlet Allocation developed by Andrews et al. [sent-157, score-0.193]
42 Previously LDA has been successfully used to infer unsupervised joint topic distributions over words and feature norms together (Andrews et al. [sent-159, score-0.412]
43 It has also been shown to be useful in joint inference of text with visual attributes obtained using visual classifiers (Silberer et al. [sent-161, score-0.351]
44 These multimodal LDA models (hereafter, mLDA) have been shown to be qualitatively sensible and highly predictive of several psycholinguistic tasks (Andrews et al. [sent-163, score-0.193]
45 However, prior work using mLDA is limited to two modalities at a time. [sent-165, score-0.308]
46 (2009) extend LDA to allow for the inference of document and topic distributions in a multimodal corpus. [sent-185, score-0.317]
47 For the ith (word, feature) pair in the document, (a) A topic assignment zi ∼ θd is drawn; (b) a word wi ∼ βzi is drawn; (c) a feature fi ∼ ψzi is drawn; (d) the pair (wi, fi) is observed. [sent-191, score-0.238]
48 The joint probability of the pair then marginalizes over topics: p(wi, fi | θd) = Σk θd,k βk,wi ψk,fi. The key aspect to notice is that the observed word wi and feature fi are conditionally independent given the topic selection, zi. [sent-193, score-0.183]
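Under this conditional independence, the likelihood of an observed (word, feature) pair in a document is the marginal above. A small numpy sketch of that quantity, with illustrative parameter names rather than the authors' implementation:

```python
import numpy as np

def pair_likelihood(w, f, theta_d, beta, psi):
    """p(w, f | d) = sum_k theta_d[k] * beta[k, w] * psi[k, f],
    where theta_d is the document-topic distribution, beta the per-topic
    word distributions, and psi the per-topic feature distributions."""
    return float(np.sum(theta_d * beta[:, w] * psi[:, f]))
```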
49 This powerful extension allows for joint inference over both words and features. (Here, and elsewhere, “feature” and f simply refer to a token from a nontextual modality and should not be confused with the machine learning sense of feature.) [sent-194, score-0.293]
50 3 3D Multimodal LDA We can easily extend the bimodal LDA model to incorporate three or more modalities by simply performing inference over n-tuples instead of pairs, and still mandating that each modality is conditionally independent given the topic. [sent-197, score-0.684]
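Generatively, this n-modality extension keeps a single shared topic per observation and draws one token per modality from it. A sketch of one such draw, assuming row-stochastic per-topic distributions for each modality (hypothetical names):

```python
import numpy as np

def sample_tuple(theta_d, modality_params, rng=None):
    """Draw one shared topic z ~ theta_d, then one token per modality from
    that topic's distribution. modality_params is a list of (K, V_m) arrays,
    one per modality (text, feature norms, SURF, ...)."""
    rng = rng or np.random.default_rng()
    z = rng.choice(len(theta_d), p=theta_d)
    return tuple(rng.choice(params.shape[1], p=params[z]) for params in modality_params)
```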
51 4 Hybrid Multimodal LDA: 3D Multimodal LDA assumes that all modalities share the same latent topic structure, θd. [sent-208, score-0.425]
52 It is possible, however, that the modalities do not all share a single latent structure, but can still combine to enhance word meaning. [sent-209, score-0.677]
53 In this setting, we perform separate, bimodal mLDA inference according to Section 4. [sent-214, score-0.204]
54 In this way, Hybrid mLDA assumes that every modality shares some latent structure with the text in the corpus, but the latent structures are not shared between non-textual modalities. [sent-216, score-0.294]
55 For example, to generate a hybrid model for text, feature norms and SURF, we separately perform bimodal mLDA for the text/feature norms modalities and the text/SURF modalities. [sent-217, score-1.172]
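Operationally, the hybrid model trains one bimodal mLDA per (text, modality) pair and then joins the separately inferred per-word topic distributions. The sketch below joins them by concatenation, which is an assumption for illustration rather than an operation stated in these sentences:

```python
import numpy as np

def hybrid_word_representations(bimodal_topic_dists):
    """bimodal_topic_dists: one dict per bimodal mLDA model (text+feature norms,
    text+SURF, ...), each mapping a word to its topic distribution in that model.
    Joins the separately inferred latent spaces by concatenation -- one plausible
    choice, not necessarily the paper's joining operation."""
    shared_words = set.intersection(*(set(d) for d in bimodal_topic_dists))
    return {w: np.concatenate([d[w] for d in bimodal_topic_dists]) for w in shared_words}
```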
56 1 Generating Multimodal Corpora In order to evaluate our algorithms, we first need to generate multimodal corpora for each of our nontextual modalities. [sent-231, score-0.243]
57 (2009) for generating our multimodal corpora: for each word token in the text corpus, a feature is selected stochastically from the word’s feature distribution, creating a word-feature pair. [sent-233, score-0.277]
58 The 3D text-feature-association norm corpus is generated slightly differently: for each word in the original text corpus, we check the existence of multimodal features in either modality. [sent-237, score-0.265]
59 If the word had only feature norms, but no associations, it is generated as (word, feature, placeholderAN), and similarly for association norms without feature norms. [sent-239, score-0.39]
60 This allows association norms and feature norms to influence each other via the document mixtures θ, but avoids falsely labeling explicit relationships between randomly selected feature norms and associations. [sent-241, score-0.978]
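A minimal sketch of this corpus-generation step, using the placeholder token names that appear in the paper and a hypothetical feature_dists structure for the per-word feature distributions:

```python
import random

PLACEHOLDERS = {"FN": "placeholderFN", "AN": "placeholderAN"}  # token names as in the paper

def pair_tokens(text_tokens, feature_dists, modalities=("FN", "AN"), seed=0):
    """For every word token, stochastically draw one feature per modality from the
    word's feature distribution; emit that modality's placeholder token when the
    word has no entry, so placeholders never assert a word-feature relationship.
    feature_dists[m][word] -> (features, weights)."""
    rng = random.Random(seed)
    corpus = []
    for w in text_tokens:
        observation = [w]
        for m in modalities:
            dist = feature_dists[m].get(w)
            if dist is None:
                observation.append(PLACEHOLDERS[m])
            else:
                feats, weights = dist
                observation.append(rng.choices(feats, weights=weights, k=1)[0])
        corpus.append(tuple(observation))
    return corpus
```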
61 2 Evaluation We evaluate each of our models with two data sets: a set of compositionality ratings for a number of German noun-noun compounds, and the same association norm data set used as one of our training modalities in some settings. [sent-244, score-0.591]
62 Compositionality Ratings is a data set of compositionality ratings originally collected by von der Heide and Borgwaldt (2009). [sent-245, score-0.233]
63 The data set consists of 450 concrete, depictable German noun compounds along with compositionality ratings with regard to their constituents. [sent-246, score-0.22]
64 309 of the targets have images (the entire image data set); 563 have feature norms; and all 571 have association norms. [sent-254, score-0.455]
65 In order to optimize the number of topics K, we run five trials of each modality for 2000 iterations for K = {50, 100, 150, 200, 250} (a total of 25 runs per setup). [sent-280, score-0.214]
66 The 2D models employing feature norms and association norms do significantly better than the text-only model (two-tailed t-test). [sent-284, score-0.623]
67 These features, which are usually more useful for comparing overall image likeness than object likeness, do not individually contain semantic information useful for compositionality prediction. [sent-290, score-0.38]
68 The performance of the visual modalities reverses when we look at our cluster-based models. [sent-291, score-0.469]
69 Text combined with SURF clusters is our worst-performing system, indicating that our clusters of images with common visual words are actively working against us. [sent-298, score-0.61]
70 The clusters based on GIST, on the other hand, provide a minor improvement in compositionality prediction. [sent-299, score-0.258]
71 The model combining text, feature norms, and association norms is especially surprising: despite the excellent performance of each of the bimodal models, the 3D model performs significantly worse than either of its components (p < . [sent-301, score-0.523]
72 This indicates that these modalities provide new insight into word meaning, but cannot be forced into the same latent structure. [sent-303, score-0.413]
73 Indeed, our five-modality hybrid model obtains a performance nearly twice that of the text-only model. [sent-309, score-0.269]
74 Furthermore, improvements generally continue to grow significantly with each additional modality we incorporate into the hybrid model (p < . [sent-312, score-0.269]
75 Clearly, there is a great deal to learn from combining three, four and even five modalities, but the modalities are learning disjoint knowledge which cannot be forced into a shared, latent structure. [sent-316, score-0.46]
76 Here we see that feature norms do not seem to be improving performance on the association norms. [sent-319, score-0.348]
77 This is slightly unexpected, but consistent with the result that feature norms provide helpful semantic information that is disjoint from that of the association norms. [sent-320, score-0.429]
78 We see that the image modalities are much more useful than they are in compositionality prediction. [sent-321, score-0.623]
79 Since the SURF and GIST image features tend to capture object-likeness and scene-likeness respectively, it is possible that words which share associates are likely related through common settings and objects that appear with them. [sent-323, score-0.215]
80 This supports Bruni et al. (2012b)’s suggestion that something like a distributional hypothesis of images is plausible. [sent-325, score-0.242]
81 Once again, the clusters of images using SURF cause a dramatic drop in performance. [sent-326, score-0.325]
82 Combined with the evidence from the compositionality assessment, this shows that the SURF clusters are actively confusing the models and not providing semantic information. [sent-327, score-0.292]
83 Once again, we see that the 3D models are ineffective compared to their bimodal components, but the hybrid models provide at least as much information as their components. [sent-330, score-0.272]
84 As with the compositionality evaluation, we conclude that the image and feature norm models are providing disjoint semantic information that cannot be forced into a shared latent structure, but still augment each other when combined. [sent-333, score-0.615]
85 One nice property of the cluster-based modalities is that we may represent each cluster as its prototypical images, and examine whether the prototypes are related to the topics. [sent-335, score-0.34]
86 We examine only the GIST clusters for two primary reasons: first, the SURF clusters did not perform well in our evaluations, and second, preliminary investigation into the SURF clusters shows that the majority of SURF clusters are nearly identical. [sent-337, score-0.372]
87 We select our single best Text + GIST Clusters trial from the Compositionality evaluation and look at the topic distributions for words and image clusters. [sent-339, score-0.276]
88 We extract the images closest to the cluster centroids for the five highest clusters of the topic, p(c|ψ), and select two topics whose prototypical images are the most interesting and informative. [sent-341, score-0.476]
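The prototype selection can be sketched as picking, for a given topic, the image clusters with the highest probability under ψ and then the images nearest each selected cluster's centroid. Variable names below are assumptions for illustration:

```python
import numpy as np

def topic_prototypes(psi_k, centroids, image_vectors, n_top_clusters=2, n_images=5):
    """For one topic, pick the image clusters with the highest probability under
    psi_k, then return the indices of the images closest to each cluster centroid."""
    top_clusters = np.argsort(psi_k)[::-1][:n_top_clusters]
    prototypes = {}
    for c in top_clusters:
        distances = np.linalg.norm(image_vectors - centroids[c], axis=1)
        prototypes[int(c)] = np.argsort(distances)[:n_images]
    return prototypes
```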
89 It seems that the GIST cluster does not tend to group images of water, but rather nature scenes that may contain water. [sent-346, score-0.201]
90 We see GIST has a preference toward clustering images based on the predominant shape of the image. [sent-352, score-0.201]
91 Here we see the clusters of GIST images are not providing a definite semantic relationship, but an overwhelming visual one. [sent-353, score-0.52]
92 8 Conclusions In this paper, we evaluated the role of low-level image features, SURF and GIST, for their compatibility with the multimodal Latent Dirichlet Allocation model of Andrews et al. [sent-354, score-0.374]
93 Both feature sets were directly compatible with multimodal LDA and provided significant gains in their ability to predict association norms over traditional text-only LDA. [sent-358, score-0.499]
94 We also showed that words may be represented in terms of membership of image clusters based on the low-level image features. [sent-360, score-0.486]
95 Finally, we showed two methods for extending multimodal LDA to three or more modalities: the first as a 3D model with a shared latent structure between all modalities, and the second where latent structures were inferred separately for each modality and joined together into a hybrid model. [sent-362, score-0.584]
96 Although the 3D model was unable to compete with its bimodal components, we found the hybrid model consistently improved performance over its component modalities. [sent-363, score-0.272]
97 We conclude that the combination of many modalities provides the best representation of word meaning, and that each nontextual modality is discovering disjoint information about word meaning that cannot be forced into a global latent structure. [sent-364, score-0.72]
98 Distributional semantics with eyes: Using image analysis to improve computational representations of word meaning. [sent-407, score-0.216]
99 Semantic feature production norms for a large set of living and nonliving things. [sent-512, score-0.317]
100 Combining feature norms and text data with topic models. [sent-584, score-0.373]
wordName wordTfidf (topN-words)
[('surf', 0.331), ('modalities', 0.308), ('gist', 0.3), ('norms', 0.275), ('images', 0.201), ('multimodal', 0.193), ('andrews', 0.188), ('image', 0.181), ('bimodal', 0.175), ('modality', 0.172), ('visual', 0.161), ('mlda', 0.138), ('compositionality', 0.134), ('bruni', 0.125), ('silberer', 0.125), ('clusters', 0.124), ('imagenet', 0.1), ('hybrid', 0.097), ('vision', 0.094), ('bovw', 0.088), ('perceptual', 0.086), ('lda', 0.084), ('norm', 0.072), ('skl', 0.072), ('german', 0.07), ('kl', 0.069), ('hoffman', 0.063), ('latent', 0.061), ('roller', 0.058), ('vbi', 0.058), ('subjects', 0.056), ('topic', 0.056), ('zi', 0.055), ('dirichlet', 0.055), ('von', 0.053), ('nontextual', 0.05), ('cue', 0.049), ('disjoint', 0.047), ('ratings', 0.046), ('pictures', 0.046), ('schulte', 0.046), ('walde', 0.046), ('allocation', 0.044), ('forced', 0.044), ('wi', 0.044), ('grounded', 0.043), ('ahornblatt', 0.043), ('borgwaldt', 0.043), ('heide', 0.043), ('keypoints', 0.043), ('motor', 0.043), ('oliva', 0.043), ('placeholderan', 0.043), ('rohrbach', 0.043), ('topics', 0.042), ('feature', 0.042), ('distributional', 0.041), ('lapata', 0.041), ('fi', 0.041), ('compounds', 0.04), ('distributions', 0.039), ('mixtures', 0.038), ('meaning', 0.038), ('marco', 0.038), ('percentile', 0.038), ('clock', 0.038), ('elia', 0.038), ('farhadi', 0.038), ('divergence', 0.037), ('fn', 0.036), ('responses', 0.035), ('representations', 0.035), ('vittorio', 0.034), ('wing', 0.034), ('objects', 0.034), ('semantic', 0.034), ('commands', 0.032), ('prototypical', 0.032), ('association', 0.031), ('object', 0.031), ('instructions', 0.031), ('matuszek', 0.03), ('collection', 0.03), ('baroni', 0.029), ('inference', 0.029), ('ahorn', 0.029), ('bernt', 0.029), ('bildernetle', 0.029), ('blatt', 0.029), ('codeword', 0.029), ('codewords', 0.029), ('deselaers', 0.029), ('dewac', 0.029), ('douze', 0.029), ('ferrari', 0.029), ('giacomo', 0.029), ('ller', 0.029), ('maple', 0.029), ('mathe', 0.029), ('placeholderfn', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal mod- els to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
2 0.20086411 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.
3 0.18953513 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image descrip- tion task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained out- comes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that imagebased distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
5 0.12715086 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
Author: Shize Xu ; Shanshan Wang ; Yan Zhang
Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.
6 0.10280869 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
7 0.097614728 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
8 0.081906721 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
9 0.080026507 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
10 0.078145966 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
11 0.071380056 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
12 0.067525819 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
13 0.06271106 138 emnlp-2013-Naive Bayes Word Sense Induction
14 0.06183356 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
15 0.061092895 134 emnlp-2013-Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
16 0.060346503 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
17 0.054393038 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
18 0.05023494 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
19 0.050121844 158 emnlp-2013-Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
20 0.048402354 121 emnlp-2013-Learning Topics and Positions from Debatepedia
topicId topicWeight
[(0, -0.182), (1, 0.058), (2, -0.084), (3, 0.048), (4, -0.045), (5, 0.197), (6, -0.007), (7, -0.096), (8, -0.151), (9, -0.116), (10, -0.326), (11, -0.095), (12, 0.022), (13, -0.019), (14, -0.168), (15, -0.035), (16, 0.024), (17, 0.031), (18, 0.074), (19, -0.049), (20, -0.059), (21, 0.041), (22, -0.037), (23, -0.041), (24, 0.016), (25, 0.031), (26, 0.004), (27, 0.059), (28, 0.02), (29, -0.037), (30, 0.031), (31, 0.018), (32, -0.033), (33, 0.039), (34, -0.039), (35, -0.021), (36, 0.016), (37, 0.007), (38, 0.057), (39, 0.094), (40, -0.034), (41, -0.159), (42, -0.035), (43, 0.106), (44, -0.019), (45, 0.083), (46, -0.019), (47, -0.053), (48, -0.028), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.92156637 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal mod- els to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained out- comes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that imagebased distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
3 0.83298874 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.
4 0.80328447 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image descrip- tion task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
5 0.51256239 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
Author: Shize Xu ; Shanshan Wang ; Yan Zhang
Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.
6 0.43419647 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
7 0.3969458 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
8 0.37883255 138 emnlp-2013-Naive Bayes Word Sense Induction
9 0.37807754 100 emnlp-2013-Improvements to the Bayesian Topic N-Gram Models
10 0.34714347 77 emnlp-2013-Exploiting Domain Knowledge in Aspect Extraction
11 0.34246814 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
12 0.33073482 134 emnlp-2013-Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
13 0.32617858 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
14 0.32411656 191 emnlp-2013-Understanding and Quantifying Creativity in Lexical Composition
15 0.30809796 121 emnlp-2013-Learning Topics and Positions from Debatepedia
16 0.29604867 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
17 0.28464755 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
18 0.28251114 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
19 0.28045923 177 emnlp-2013-Studying the Recursive Behaviour of Adjectival Modification with Compositional Distributional Semantics
20 0.279535 148 emnlp-2013-Orthonormal Explicit Topic Analysis for Cross-Lingual Document Matching
topicId topicWeight
[(3, 0.032), (6, 0.01), (9, 0.017), (18, 0.026), (22, 0.03), (30, 0.04), (50, 0.065), (51, 0.187), (54, 0.024), (66, 0.032), (71, 0.028), (75, 0.033), (77, 0.018), (88, 0.305), (90, 0.013), (95, 0.012), (96, 0.02), (97, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.76353592 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal mod- els to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained out- comes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that imagebased distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
3 0.54898286 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
Author: Douwe Kiela ; Stephen Clark
Abstract: We present a novel unsupervised approach to detecting the compositionality of multi-word expressions. We compute the compositionality of a phrase through substituting the constituent words with their “neighbours” in a semantic vector space and averaging over the distance between the original phrase and the substituted neighbour phrases. Several methods of obtaining neighbours are presented. The results are compared to existing supervised results and achieve state-of-the-art performance on a verb-object dataset of human compositionality ratings.
4 0.54639119 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Author: Kuzman Ganchev ; Dipanjan Das
Abstract: We present a framework for cross-lingual transfer of sequence information from a resource-rich source language to a resourceimpoverished target language that incorporates soft constraints via posterior regularization. To this end, we use automatically word aligned bitext between the source and target language pair, and learn a discriminative conditional random field model on the target side. Our posterior regularization constraints are derived from simple intuitions about the task at hand and from cross-lingual alignment information. We show improvements over strong baselines for two tasks: part-of-speech tagging and namedentity segmentation.
5 0.54585332 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
Author: Hrushikesh Mohapatra ; Siddhanth Jain ; Soumen Chakrabarti
Abstract: Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth the number of entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.
6 0.54377091 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
7 0.54263282 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
8 0.54226065 98 emnlp-2013-Image Description using Visual Dependency Representations
9 0.5416438 73 emnlp-2013-Error-Driven Analysis of Challenges in Coreference Resolution
10 0.54155248 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
11 0.54132384 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
12 0.54031211 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
13 0.53948718 152 emnlp-2013-Predicting the Presence of Discourse Connectives
14 0.53938448 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
15 0.53909981 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
16 0.53786784 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
17 0.53721368 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology
18 0.53695941 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?
19 0.53673518 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
20 0.53618681 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery