emnlp emnlp2013 emnlp2013-98 knowledge-graph by maker-knowledge-mining

98 emnlp-2013-Image Description using Visual Dependency Representations


Source: pdf

Author: Desmond Elliott ; Frank Keller

Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. [sent-4, score-0.75]

2 In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. [sent-6, score-1.106]

3 In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements. [sent-9, score-0.954]

4 1 Introduction Humans are readily able to produce a description of an image that correctly identifies the objects and actions depicted. [sent-10, score-0.787]

5 A common aspect of existing work is that an image is represented as a bag of image regions. [sent-25, score-1.178]

6 Bags of regions encode which objects co-occur in an image, but they are unable to express how the regions relate to each other, which makes it hard to describe what is happening. [sent-26, score-0.822]

7 If the man was instead repairing the bike, then the bag-of-regions representation would be the same, even though the image would depict a different action and would have to be described differently. [sent-28, score-0.835]

8 This type of co-occurrence of regions indicates the need for a more structured image representation; an image description system that has access to structured representations ... [sent-29, score-1.719]

9 [Figure 1 residue: example VDR with regions bike, car, man, road, trees and spatial relations such as "on" and "above" linking them to ROOT] A man is riding a bike down the road. [sent-33, score-0.917]

10 These relationships make it possible to infer that the man is riding a bike down the road, which corresponds 1293 to the first sentence of the human-generated image description in Figure 1b. [sent-40, score-1.235]

11 In order to test the hypothesis that structured image representations are useful for description generation, we present a series of template-based image description models. [sent-41, score-1.486]

12 The other two models use visual dependency representations, either on their own or in conjunction with gold-standard image descriptions at training time. [sent-43, score-1.042]

13 The BLEU score improvements are found at bi-, tri-, and four-gram levels, and humans rate VDR-based image descriptions 1. [sent-45, score-0.746]

14 Finally, we also show that the benefit of the visual dependency representation is maintained when image descriptions are generated from automatically parsed VDRs. [sent-47, score-1.129]

15 Note that throughout the paper, we work with gold-standard region annotations; this makes it possible to explore the effect of structured image representations independently of automatic object detection. [sent-51, score-1.089]

16 2 Visual Dependency Representation In analogy to dependency grammar for natural language syntax, we define Visual Dependency Grammar to describe the spatial relations between pairs of image regions. [sent-52, score-0.826]

17 A directed arc between two regions is labelled with the spatial relationship between those regions, defined in terms of three geometric properties: pixel overlap, the angle between regions, and the distance between regions. [sent-53, score-0.607]

18 A visual dependency representation of an image is constructed by creating a directed acyclic graph ... (Table 1 relation) X --on--> Y: More than 50% of the pixels of region X overlap with region Y. [sent-55, score-1.667]

19 (Table 1 relation) X --surrounds--> Y: The entirety of region X overlaps with region Y. [sent-56, score-0.716]

20 To simplify explanation, all regions are circles, where X is the grey region and Y is the white region. [sent-64, score-0.721]
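These geometric definitions can be operationalized directly. The following Python sketch (the function name and angle thresholds are illustrative, not taken from the paper) assigns one of the grammar's labels to an ordered region pair from pixel overlap and the angle between the region centroids:

```python
def spatial_relation(overlap_ratio, x_inside_y, angle_deg):
    """Assign a Visual Dependency Grammar label to an ordered region pair (X, Y).

    overlap_ratio: fraction of X's pixels that also fall inside Y
    x_inside_y:    True if the entirety of X overlaps with Y
    angle_deg:     angle of the centroid-to-centroid vector from X to Y,
                   measured counter-clockwise from the positive x-axis
    The angle bands below are illustrative thresholds, not the paper's exact ones.
    """
    if x_inside_y:
        return "surrounds"
    if overlap_ratio > 0.5:        # more than 50% pixel overlap
        return "on"
    angle = angle_deg % 360
    if 45 <= angle < 135:
        return "above"
    if 225 <= angle < 315:
        return "below"
    return "beside"

print(spatial_relation(0.0, False, 10.0))   # -> "beside"
```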

21 over the set of regions in an image using the spatial relationships in the Visual Dependency Grammar. [sent-67, score-1.126]

22 It is created from a region-annotated image and a corresponding image description by first identifying the central actor of the image. [sent-68, score-1.342]

23 The remaining regions are then attached based on their relationship with either the actor or the other regions in the image as they are ... (footnote 1: As per the PASCAL VOC definition of overlap in the object detection task, Everingham et al.) [sent-71, score-1.525]

24 Each arc introduced is labelled with one of the spatial relations defined in the grammar, or with no label if the region is not described in relation to anything else in the image. [sent-74, score-0.554]

25 The region corresponding to MAN is therefore attached to ROOT without a spatial relation. [sent-77, score-0.509]

26 If these regions were attached to other regions, such as CAR --above--> ROAD, then this would imply structure in the image that is not conveyed in the description. [sent-80, score-0.66]

27 We annotated the data set in a three-step process: (1) collect a description for each image; (2) annotate the regions in the image; and (3) create a visual dependency representation of the image. [sent-85, score-0.868]

28 Note that Steps (2) and (3) are dependent on the image description, as both the region labels and the relations between them are derived from the description. [sent-86, score-1.009]

29 2 Image Descriptions We collected three descriptions of each image in our data set from Amazon Mechanical Turk. [sent-88, score-0.746]

30 Workers were asked to describe an image in two sentences. [sent-89, score-0.608]

31 The first sentence describes the action in the image, the person performing the action and the region involved in the action; the second sentence describes any other regions in the image not directly involved in the action. [sent-90, score-1.428]

32 A total of 2,424 images were described by three workers each, resulting in a total of 7,272 image descriptions. (Figure 2: Top 20 annotated regions.) [sent-92, score-0.753]

33 04 per image and it took on average 67 123 seconds to describe a single image. [sent-96, score-0.608]

34 3 Region Annotations We trained two annotators to draw polygons around the outlines of the regions in an image using the LabelMe annotation tool (Russell et al. [sent-104, score-0.975]

35 The regions annotated for a given image were limited to those mentioned in the description paired with the image. [sent-106, score-1.114]

36 Region annotation was performed on a subset of 341 images and resulted in a total of 5,034 annotated regions with a mean of 4. [sent-107, score-0.506]

37 Figure 2 shows the distribution of the top 20 region annotations in the data; people-type regions are the most commonly annotated regions. [sent-113, score-0.755]

38 This normalization process reduced the size of the region label vocabulary from 496 labels to 362 labels. (Figure 3: Distribution of the spatial relations.) [sent-117, score-0.497]

39 The process for creating a visual dependency representation of an image is described earlier in this section of the paper. [sent-123, score-0.932]

40 We induced an alignment between the annotated region labels and words in the image description using simple lexical matching augmented with WordNet hyponym lookup. [sent-137, score-1.145]
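As a rough illustration of this alignment step, the sketch below uses NLTK's WordNet interface (the helper names are ours, not the paper's) to match a region label against description tokens either lexically or through the hyponym hierarchy:

```python
from nltk.corpus import wordnet as wn   # requires nltk and the 'wordnet' corpus download

def is_hyponym_of(word, label):
    """True if some noun sense of `word` is a hyponym of some noun sense of `label`,
    e.g. word='terrier' matches label='dog'."""
    label_synsets = set(wn.synsets(label, pos=wn.NOUN))
    for syn in wn.synsets(word, pos=wn.NOUN):
        ancestors = set(syn.closure(lambda s: s.hypernyms()))  # walk up the hierarchy
        if ancestors & label_synsets:
            return True
    return False

def align_label(label, description_tokens):
    """Indices of description tokens aligned to a region label by exact match
    or hyponym lookup (a rough stand-in for the paper's alignment step)."""
    label = label.lower()
    return [i for i, tok in enumerate(description_tokens)
            if tok.lower() == label or is_hyponym_of(tok.lower(), label)]

print(align_label("vehicle", ["A", "man", "is", "riding", "a", "bike"]))  # likely [5]
```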

41 3 Image Description Models We present four template-based models for generating image descriptions in this section. [sent-139, score-0.77]

42 presents an overview of the amount of information available to each model at training time, ranging from only the annotated regions of an image to using visual dependency representation of an image aligned with the syntactic dependency representation of its description. [sent-141, score-2.054]

43 At test time, all models have access to image regions and their labels, and use these to generate image descriptions. [sent-142, score-1.611]

44 Two of the models also have access to VDRs at test time, allowing us to test the hypothesis that image structure is useful for generating good image descriptions. [sent-143, score-1.275]

45 The aim of each model is to determine what is happening in the image, which regions are important for describing it, and how these regions relate to each other. [sent-144, score-0.752]

46 A good description therefore is one that relates the main actors depicted in the image to each other, typically through a verb; a mere enumeration of the regions in the image is not sufficient. [sent-146, score-1.727]

47 It has access to only the annotated image regions and their labels. [sent-152, score-1.036]

48 } be the set of possible region–region relationships, found by calculating the nearest neighbour of each region in Euclidean space between the centroids of the polygons that mark the region boundaries. [sent-166, score-1.192]

49 The tuple with the subject closest to the centre of the image is used to describe what is happening in the image, and the remaining regions are used to describe the background. [sent-167, score-1.013]
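A minimal sketch of this pairing step (the data structures are assumptions; real region annotations are LabelMe-style polygons) computes centroids, finds each region's nearest neighbour, and promotes the pair whose subject lies closest to the image centre:

```python
import math

def centroid(polygon):
    """Approximate a region's centroid as the mean of its polygon vertices."""
    xs, ys = zip(*polygon)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def proximity_pairs(regions, image_centre):
    """regions: list of (label, polygon) pairs, polygons as [(x, y), ...].

    Pairs each region with its nearest neighbour by centroid distance, then sorts
    the pairs so the one whose subject is closest to the image centre comes first
    (that pair is used for the main sentence; the rest describe the background)."""
    cents = [(label, centroid(poly)) for label, poly in regions]
    pairs = []
    for i, (li, ci) in enumerate(cents):
        dist, lj = min((math.dist(ci, cj), lj)
                       for j, (lj, cj) in enumerate(cents) if j != i)
        pairs.append((li, lj, ci))
    pairs.sort(key=lambda p: math.dist(p[2], image_centre))
    return [(subj, obj) for subj, obj, _ in pairs]
```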

50 oi is the label of the subject region and oj is the label of the object region. [sent-169, score-0.63]

51 DT is a simple determiner chosen from {the, a}, depending on whether the region label is a plural noun; AUX is either {is, are}, depending on the number of the region label; and REL is a word to describe the relationship between the regions. [sent-170, score-0.759]

52 For this model, REL is the spatial relationship between the centroids, chosen from {above, below, beside} depending on the angle formed between the region centroids, using the definitions in Table 1. [sent-171, score-0.559]
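Combining the pieces, a hedged sketch of the template realization (the plural test and casing are deliberately crude, and the function name is ours; the paper's templates are richer) might look like this, with REL supplied by the geometric classification sketched earlier:

```python
def realize(subj, obj, rel):
    """Fill the PROXIMITY-style template 'DT o_i AUX REL DT o_j'."""
    def dt(label):                       # crude plural test on the label
        return "the" if label.endswith("s") else "a"
    def aux(label):                      # auxiliary agrees with the label's number
        return "are" if label.endswith("s") else "is"
    sentence = f"{dt(subj)} {subj} {aux(subj)} {rel} {dt(obj)} {obj}."
    return sentence[0].upper() + sentence[1:]

print(realize("man", "phone", "beside"))     # -> "A man is beside a phone."
print(realize("horses", "beach", "beside"))  # -> "The horses are beside a beach."
```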

53 At training time, the model has access to the annotated image regions and labels, and to the dependency-parsed version of the English Gigaword Corpus (Napoles et al. [sent-185, score-1.036]

54 } is determined by searching for the most likely o*_j, v* given an o_i, over a set of verbs V extracted from the corpus and the other regions in the image. [sent-191, score-0.498]

55 However, since noun co-occurrence in the corpus controls which regions can be mentioned in the description, this model will be prone to relating regions simply because their labels occur together frequently in the corpus. [sent-201, score-0.762]
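A hedged sketch of the CORPUS search described above (the count table is an assumed data structure; the paper extracts its statistics from the dependency-parsed Gigaword corpus) picks the verb and object region that co-occur most often with the subject label:

```python
from collections import defaultdict

def best_verb_object(subject, candidate_objects, verbs, cooc):
    """Return (verb*, object*) maximizing a simple co-occurrence score.

    cooc[(subj, verb, obj)] -> count of subj-verb-obj dependency triples
    observed in the parsed corpus (an assumed data structure)."""
    best, best_score = None, -1
    for obj in candidate_objects:
        for verb in verbs:
            score = cooc.get((subject, verb, obj), 0)
            if score > best_score:
                best, best_score = (verb, obj), score
    return best

counts = defaultdict(int, {("man", "ride", "bike"): 120, ("man", "repair", "bike"): 15})
print(best_verb_object("man", ["bike", "road"], ["ride", "repair"], counts))  # ('ride', 'bike')
```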

56 3 STRUCTURE The model STRUCTURE exploits the visual dependency representation of an image to generate language for only the relationships that hold between pairs of regions. [sent-203, score-1.023]

57 It has access to the image regions, the region labels, and the visual dependency representation of an image. [sent-204, score-1.34]

58 The VDR of an image is traversed and language fragments are generated and then combined depending on the number of children of a node in the tree. [sent-206, score-0.687]

59 In these cases, the nearest region is realized in direct relation to the head using either T3 (two children) or T1 (more than two children), and the remaining regions form a separate sentence using T5. [sent-212, score-0.721]
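The traversal-and-combination logic can be sketched as a recursive walk over the VDR. The node structure and the stand-in realizations for templates T1, T3 and T5 below are assumptions; only the branching on the number of children follows the description above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    relation: str = ""                    # label on the arc from the head, may be empty
    children: List["Node"] = field(default_factory=list)

# Stand-in realizations for the paper's templates T1, T3 and T5 (assumed forms).
def t3(head, child):
    return f"A {head.label} is {child.relation or 'near'} a {child.label}."

def t1(head, child):
    return t3(head, child)                # same crude form for the many-children case

def t5(rest):
    return "There is " + " and ".join(f"a {n.label}" for n in rest) + "."

def describe(node):
    """Walk the VDR top-down, emitting fragments according to the number of children."""
    frags = []
    kids = node.children
    if len(kids) == 1:
        frags.append(t3(node, kids[0]))
    elif len(kids) >= 2:
        nearest, rest = kids[0], kids[1:]  # assume children sorted by distance to the head
        frags.append(t3(node, nearest) if len(kids) == 2 else t1(node, nearest))
        frags.append(t5(rest))             # remaining regions form a separate sentence
    for child in kids:
        frags.extend(describe(child))
    return frags

root = Node("man", children=[Node("bike", "above"), Node("road", "on")])
print(describe(root))
```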

60 In comparison to PROXIMITY, this model can exploit a representation of an image that encodes the relationships between regions in an image (the VDR). [sent-217, score-1.679]

61 4 PARALLEL The model PARALLEL is an extension of STRUCTURE that uses the image descriptions available to predict verbs that relate regions in parent-child relationships in a VDR. [sent-220, score-1.227]

62 At training time it has access to the annotated regions and labels, the visual dependency representations, and the gold-standard image descriptions. [sent-221, score-1.332]

63 (2005) and alignments were calculated between the nodes in the VDRs and the words in the parsed image descriptions. [sent-224, score-0.608]

64 We estimate two distributions from the image descriptions using the alignments: p(verb | o_head, o_child, rel_head-child) and p(verb | o_head, o_child). [sent-225, score-0.746]
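A hedged sketch of estimating and backing off between these two distributions (the count tables and class name are illustrative; the counts would come from the VDR-description alignments):

```python
from collections import Counter

class VerbModel:
    """p(verb | o_head, o_child, rel) with a backoff to p(verb | o_head, o_child)."""

    def __init__(self):
        self.with_rel = Counter()     # (head, child, rel, verb) -> count
        self.no_rel = Counter()       # (head, child, verb)      -> count

    def observe(self, head, child, rel, verb):
        self.with_rel[(head, child, rel, verb)] += 1
        self.no_rel[(head, child, verb)] += 1

    def best_verb(self, head, child, rel, verbs):
        scored = [(self.with_rel[(head, child, rel, v)], v) for v in verbs]
        if max(scored)[0] == 0:       # back off if the relation-conditioned counts are empty
            scored = [(self.no_rel[(head, child, v)], v) for v in verbs]
        return max(scored)[1]

m = VerbModel()
m.observe("man", "bike", "above", "ride")
print(m.best_verb("man", "bike", "above", ["ride", "repair"]))   # -> "ride"
```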

65 The remaining fragments revert back to spatial relationships to avoid generating language that places the subject region in multiple relationships with other regions. [sent-230, score-0.698]

66 In comparison to CORPUS, this model generates descriptions in which the relations between the regions are determined by the image itself and not by an external corpus. [sent-233, score-1.175]

67 In comparison to PROXIMITY and STRUCTURE, this model generates descriptions that express meaningful relations between the regions and not simple spatial relationships. [sent-234, score-0.669]

68 In this section we describe an image parser that can induce VDRs automatically from region-annotated images, providing the input for the STRUCTURE-PARSED and PARALLEL-PARSED models at test time. [sent-236, score-0.641]

69 In our notation, x_vis is the set of annotated regions and y_vis is a visual dependency representation of the image; (i, j) is a directed arc from node i to node j in x_vis, f(i, j) is a feature representation of the arc (i, j), and w is a vector of feature weights to be learned by the model. [sent-240, score-0.93]

70 The overall score of a visual dependency representation is: s(x_vis, y_vis) = Σ_{(i,j) ∈ y_vis} w · f(i, j)   (2). The features in the model are defined over region labels in the visual dependency representation as well as the relationship labels. [sent-241, score-1.123]

71 As our dependency representations are unordered, none of the features encode the linear order of region labels, unlike the feature set of the original model. [sent-242, score-0.499]

72 Unigram features describe how likely individual region labels are to appear as either heads or arguments, and bigram features capture which region labels are in head-argument relationships. [sent-243, score-0.807]
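Because the score in Equation (2) is arc-factored, a hedged sketch of it looks like standard edge-factored dependency parsing with unordered unigram and bigram features (the feature strings and weight representation are assumptions):

```python
from collections import defaultdict

def arc_features(head_label, child_label, relation):
    """Unigram and bigram features over region labels plus the relation label;
    no linear-order information, since VDRs are unordered."""
    return [
        f"head={head_label}",
        f"child={child_label}",
        f"head={head_label}|child={child_label}",
        f"head={head_label}|child={child_label}|rel={relation}",
    ]

def score_vdr(arcs, weights):
    """s(x_vis, y_vis) = sum over arcs (i, j) in y_vis of w . f(i, j)."""
    return sum(weights.get(feat, 0.0)
               for arc in arcs
               for feat in arc_features(*arc))

w = defaultdict(float, {"head=ROOT|child=man": 0.7, "head=man|child=bike": 1.5})
arcs = [("ROOT", "man", ""), ("man", "bike", "above")]
print(score_vdr(arcs, w))   # 0.7 + 1.5 = 2.2
```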

73 A baseline was calculated by attaching all image regions to the root node; this is the most frequent form of attachment in our data. [sent-257, score-0.971]

74 5 Language Generation Experiments We evaluate the image description models in an automatic setting and with human judgements. [sent-258, score-0.717]

75 In ... (footnote 2: Different visual dependency representations of the same image are never split between the training and test data.) [sent-259, score-0.937]

76 05) in the descriptions generated by PARALLEL-PARSED compared to models that operate over an unstructured bag-of-image-regions representation. [sent-412, score-0.541]

77 The PROXIMITY and CORPUS models have access to gold-standard region labels and region boundaries at test time. [sent-418, score-0.802]

78 These representations are either the gold-standard, or in the case of STRUCTURE-PARSED and PARALLEL-PARSED, produced by the image parser described in Section 4. [sent-420, score-0.674]

79 The image parser used for models STRUCTURE-PARSED and PARALLEL-PARSED is trained on the gold-standard VDRs of the training splits, and then predicts VDRs on the development and test splits. [sent-424, score-0.622]

80 PARALLEL, the model with access to both image structure and aligned image descriptions at training time, outperforms all other models on higher-order BLEU measures. [sent-430, score-1.408]

81 The probability associated with each fragment generated for nodes with multiple children also tends to lead to a more accurate order of mentioning image regions. [sent-432, score-0.631]

82 It can also be seen that PARALLEL-PARSED remains significantly better than the other models when the VDRs of images are predicted by an image parser, rather than being gold-standard. [sent-433, score-0.698]

83 (footnote 3) Recall that gold-standard VDRs and the output of the image parser described in Section 4 ... [table residue: PARALLEL uses gold-standard VDRs; PARALLEL-PARSED uses parsed VDRs] [sent-434, score-0.622]

84 STRUCTURE uses the VDR of an image to generate the description, and this leads to an improvement over PROXIMITY on some of the BLEU metrics; however, it is not sufficient to outperform CORPUS. [sent-439, score-0.609]

85 Scene: give high scores if the description correctly describes the rest of the image (background, other objects, etc). [sent-448, score-0.736]

86 PROXIMITY, CORPUS, and STRUCTURE all perform badly with mean judgements around two; PARALLEL, which uses both image structure and aligned descriptions, significantly outperforms all other models with the exception of PARALLEL-PARSED, which has very similar performance. [sent-458, score-0.662]

87 The fact that PARALLEL and PARALLEL-PARSED perform similarly on all three human measures confirms that automatically parsed VDRs are as useful for image description as gold-standard VDRs. [sent-459, score-0.762]

88 This is probably due to the fact that they all have access to gold-standard region labels, which enables them to correctly refer to regions in the scene most of the time. [sent-461, score-0.869]

89 The additional information about the relationships between regions that STRUCTURE and PARALLEL have access to does not improve the quality of the background scene description. [sent-462, score-0.537]

90 6 Related Work Previous work on image description can be grouped into three approaches: description-by-retrieval, description using language models, and template-based description. [sent-463, score-0.845]

91 (2012) generate descriptions by retrieving the most similar image from a large data set of images paired with descriptions. [sent-467, score-0.875]

92 Both approaches first determine the attributes and relationships between regions in an image as region–preposition–region triples. [sent-472, score-1.023]

93 The disadvantage of relying on region–preposition–region triples is that they cannot distinguish between the main event of the image and the ... [table residue: example outputs from the PROXIMITY, CORPUS, STRUCTURE, PARALLEL, and GOLD systems, e.g. "A man is beside a phone."] [sent-473, score-0.831]

94 Previous research has relied extensively on automatically detecting object regions in an image using state-of-the art object detectors (Felzenszwalb et al. [sent-503, score-1.08]

95 We use gold-standard region annotations to remove this noisy component from the description generation pipeline, allowing us to focus on the utility of image structure for description generation. [sent-505, score-1.251]

96 7 Conclusion In this paper we introduced a novel representation of an image as a set of dependencies over its annotated regions. [sent-506, score-0.67]

97 This visual dependency representation encodes which regions are related to each other in an image, and can be used to infer the action or event that is depicted. [sent-507, score-0.726]

98 We found that image description models based on visual dependency representations significantly outperform competing models in both automatic and human evaluations. [sent-508, score-1.065]

99 We showed that visual dependency representations can be induced automatically using a standard dependency parser and that the descriptions generated from the induced representations are as good as the ones generated from gold-standard representations. [sent-509, score-0.721]

100 Future work will focus on improvements to the image parser, on exploring this representation in opendomain data sets, and on using the output of an object detector to obtain a fully automated model. [sent-510, score-0.7]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('image', 0.589), ('regions', 0.363), ('region', 0.358), ('bike', 0.21), ('visual', 0.207), ('descriptions', 0.157), ('description', 0.128), ('proximity', 0.124), ('beside', 0.122), ('man', 0.12), ('riding', 0.117), ('vdrs', 0.117), ('oi', 0.114), ('images', 0.109), ('spatial', 0.103), ('vdr', 0.093), ('dependency', 0.089), ('oj', 0.071), ('relationships', 0.071), ('object', 0.064), ('car', 0.064), ('berg', 0.06), ('action', 0.059), ('scene', 0.053), ('road', 0.053), ('representations', 0.052), ('dt', 0.052), ('objects', 0.051), ('kulkarni', 0.05), ('rel', 0.05), ('judgements', 0.05), ('access', 0.05), ('attached', 0.048), ('fragments', 0.048), ('representation', 0.047), ('woman', 0.046), ('aux', 0.046), ('relationship', 0.043), ('tamara', 0.041), ('ordonez', 0.041), ('depicted', 0.039), ('parallel', 0.036), ('labelled', 0.036), ('actor', 0.036), ('labels', 0.036), ('everingham', 0.035), ('horse', 0.035), ('annotated', 0.034), ('parser', 0.033), ('arc', 0.031), ('angle', 0.031), ('farhadi', 0.03), ('girish', 0.03), ('realised', 0.03), ('node', 0.029), ('bleu', 0.028), ('verb', 0.027), ('relate', 0.026), ('goldstandard', 0.026), ('kuznetsova', 0.026), ('relations', 0.026), ('generation', 0.025), ('generating', 0.024), ('centroids', 0.024), ('beach', 0.024), ('unlabelled', 0.024), ('template', 0.024), ('elliott', 0.023), ('felzenszwalb', 0.023), ('horses', 0.023), ('labelme', 0.023), ('polygons', 0.023), ('siming', 0.023), ('xvis', 0.023), ('bags', 0.023), ('mcdonald', 0.023), ('wall', 0.023), ('subject', 0.023), ('trees', 0.023), ('structure', 0.023), ('generated', 0.021), ('child', 0.021), ('verbs', 0.021), ('children', 0.021), ('workers', 0.021), ('vicente', 0.02), ('depict', 0.02), ('napoles', 0.02), ('generate', 0.02), ('generates', 0.02), ('encodes', 0.02), ('yejin', 0.02), ('external', 0.02), ('relates', 0.019), ('parsed', 0.019), ('correctly', 0.019), ('templates', 0.019), ('overlap', 0.019), ('root', 0.019), ('describe', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000008 98 emnlp-2013-Image Description using Visual Dependency Representations

Author: Desmond Elliott ; Frank Keller

Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.

2 0.30128336 78 emnlp-2013-Exploiting Language Models for Visual Recognition

Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi

Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.

3 0.22531928 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni

Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.

4 0.18953513 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities

Author: Stephen Roller ; Sabine Schulte im Walde

Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.

5 0.13396943 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

Author: Shize Xu ; Shanshan Wang ; Yan Zhang

Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.

6 0.10717177 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation

7 0.076958209 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation

8 0.058064725 58 emnlp-2013-Dependency Language Models for Sentence Completion

9 0.050731909 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

10 0.045547783 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

11 0.042709336 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior

12 0.042450365 187 emnlp-2013-Translation with Source Constituency and Dependency Trees

13 0.04066686 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs

14 0.039834898 191 emnlp-2013-Understanding and Quantifying Creativity in Lexical Composition

15 0.039534457 181 emnlp-2013-The Effects of Syntactic Features in Automatic Prediction of Morphology

16 0.039352108 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction

17 0.037000787 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes

18 0.036369219 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing

19 0.036292892 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

20 0.03582431 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.138), (1, 0.033), (2, -0.051), (3, 0.099), (4, -0.069), (5, 0.2), (6, -0.047), (7, -0.114), (8, -0.131), (9, -0.055), (10, -0.397), (11, -0.013), (12, 0.078), (13, -0.087), (14, -0.269), (15, -0.022), (16, -0.1), (17, 0.085), (18, 0.18), (19, -0.002), (20, -0.038), (21, 0.028), (22, 0.015), (23, -0.035), (24, -0.013), (25, 0.093), (26, 0.016), (27, 0.103), (28, 0.093), (29, -0.024), (30, 0.023), (31, 0.031), (32, 0.035), (33, 0.027), (34, -0.05), (35, 0.029), (36, 0.059), (37, 0.073), (38, 0.029), (39, 0.041), (40, -0.036), (41, -0.026), (42, -0.046), (43, -0.077), (44, -0.047), (45, 0.022), (46, 0.03), (47, -0.028), (48, -0.034), (49, 0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97861987 98 emnlp-2013-Image Description using Visual Dependency Representations

Author: Desmond Elliott ; Frank Keller

Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.

2 0.90158039 78 emnlp-2013-Exploiting Language Models for Visual Recognition

Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi

Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.

3 0.83606273 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni

Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.

4 0.67647719 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities

Author: Stephen Roller ; Sabine Schulte im Walde

Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.

5 0.42182311 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

Author: Shize Xu ; Shanshan Wang ; Yan Zhang

Abstract: The rapid development of Web2.0 leads to significant information redundancy. Especially for a complex news event, it is difficult to understand its general idea within a single coherent picture. A complex event often contains branches, intertwining narratives and side news which are all called storylines. In this paper, we propose a novel solution to tackle the challenging problem of storylines extraction and reconstruction. Specifically, we first investigate two requisite properties of an ideal storyline. Then a unified algorithm is devised to extract all effective storylines by optimizing these properties at the same time. Finally, we reconstruct all extracted lines and generate the high-quality story map. Experiments on real-world datasets show that our method is quite efficient and highly competitive, which can bring about quicker, clearer and deeper comprehension to readers.

6 0.3921515 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation

7 0.29242721 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs

8 0.2723518 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics

9 0.2637603 58 emnlp-2013-Dependency Language Models for Sentence Completion

10 0.2526713 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior

11 0.21138175 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time

12 0.19831079 33 emnlp-2013-Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese

13 0.19553636 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity

14 0.19500028 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation

15 0.18653572 183 emnlp-2013-The VerbCorner Project: Toward an Empirically-Based Semantic Decomposition of Verbs

16 0.18277255 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations

17 0.18048175 62 emnlp-2013-Detection of Product Comparisons - How Far Does an Out-of-the-Box Semantic Role Labeling System Take You?

18 0.17406392 26 emnlp-2013-Assembling the Kazakh Language Corpus

19 0.16064918 168 emnlp-2013-Semi-Supervised Feature Transformation for Dependency Parsing

20 0.15606509 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.037), (18, 0.034), (22, 0.033), (30, 0.064), (45, 0.016), (50, 0.126), (51, 0.134), (66, 0.044), (71, 0.013), (75, 0.035), (77, 0.024), (90, 0.013), (93, 0.263), (95, 0.013), (96, 0.024), (97, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77747142 98 emnlp-2013-Image Description using Visual Dependency Representations

Author: Desmond Elliott ; Frank Keller

Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.

2 0.6163823 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology

Author: Jan A. Botha ; Phil Blunsom

Abstract: This paper contributes an approach for expressing non-concatenative morphological phenomena, such as stem derivation in Semitic languages, in terms of a mildly context-sensitive grammar formalism. This offers a convenient level of modelling abstraction while remaining computationally tractable. The nonparametric Bayesian framework of adaptor grammars is extended to this richer grammar formalism to propose a probabilistic model that can learn word segmentation and morpheme lexicons, including ones with discontiguous strings as elements, from unannotated data. Our experiments on Hebrew and three variants of Arabic data find that the additional expressiveness to capture roots and templates as atomic units improves the quality of concatenative segmentation and stem identification. We obtain 74% accuracy in identifying triliteral Hebrew roots, while performing morphological segmentation with an F1-score of 78.1.

3 0.60165042 78 emnlp-2013-Exploiting Language Models for Visual Recognition

Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi

Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistic community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models are compatible to the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that the language models built from general text corpora can be used instead of expensive annotated images and even outperform the image model when testing on a big general dataset.

4 0.59409565 159 emnlp-2013-Regularized Minimum Error Rate Training

Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova

Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers, ℓ0 and a modification of ℓ2, and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT's exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.

5 0.5441758 140 emnlp-2013-Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts

Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni

Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.

6 0.52831852 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging

7 0.52058643 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction

8 0.51622581 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization

9 0.51516151 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models

10 0.51199496 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation

11 0.51170713 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation

12 0.51160955 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries

13 0.50978726 175 emnlp-2013-Source-Side Classifier Preordering for Machine Translation

14 0.50899947 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities

15 0.50698197 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability

16 0.50664634 143 emnlp-2013-Open Domain Targeted Sentiment

17 0.50630152 30 emnlp-2013-Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora

18 0.50623149 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction

19 0.50616914 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types

20 0.50486147 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors