emnlp emnlp2013 emnlp2013-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that language models built from general text corpora can be used instead of expensive annotated images and can even outperform the image model when tested on a large, general dataset.
Reference: text
sentIndex sentText sentNum sentScore
1 Exploiting language models for visual recognition Dieu-Thu Le DISI, University of Trento Povo, 38123, Italy . [sent-1, score-0.466]
2 Abstract The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. [sent-3, score-0.126]
3 However, little is known about the performance of these language models when applied to the computer vision domain. [sent-4, score-0.33]
4 In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. [sent-5, score-1.193]
5 We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. [sent-6, score-0.216]
6 We determine the usefulness of different language models in aiding the two visual recognition tasks. [sent-7, score-0.516]
7 The study shows that language models built from general text corpora can be used instead of expensive annotated images and can even outperform the image model when tested on a large, general dataset. [sent-8, score-0.447]
8 1 Introduction Computational linguistics has created many tools for automatic knowledge acquisition which have been successfully applied in many tasks inside the language domain, such as question answering, machine translation, and the semantic web. [sent-9, score-0.168]
9 In this paper we ask whether such knowledge generalizes to the observed reality outside the language domain, where we use well-known image datasets as a proxy for observed reality. [sent-10, score-0.383]
10 In particular, we aim to determine which language model yields knowledge that is most suitable for use [sent-11, score-0.182]
11 Therefore we test a variety of language models and a linguistically mined knowledge base within two computer vision scenarios: Human action recognition: Recognizing (verb, object, scene) triples based on objects (e. [sent-16, score-1.099]
12 , car, horse) and scenes (the place where the actions occur, e. [sent-18, score-0.502]
13 In this scenario, we only consider images with human actions, so the “human” subject is always present. [sent-21, score-0.552]
14 Objects in context: Predicting the most likely identity of an object given its context as expressed in terms of co-occurring objects. [sent-22, score-0.083]
15 Computer vision can greatly benefit from natural language processing as learning from images requires a prohibitively expensive annotation effort. [sent-23, score-0.583]
16 A major goal of natural language processing is to obtain general knowledge from text and in this paper we test which model provides the best knowledge for use in the visual domain. [sent-24, score-0.502]
17 We test the language models in two ways: (1) We directly compare the statistics of the linguistic models with statistics extracted from the visual domain. [sent-26, score-0.348]
18 (2) We compare the linguistic models inside the two computer vision applications, leading to a direct estimation of their usefulness. [sent-29, score-0.376]
19 To summarize, our main research questions are: (1) Is the knowledge from language compatible with the knowledge from vision? [sent-30, score-0.216]
20 (2) Can the knowledge extracted from language help in computer vision scenarios? [sent-31, score-0.407]
21 2 Related Work Using high-level knowledge to aid image understanding has recently attracted interest in the computer vision community. [sent-32, score-0.597]
22 Objects, actions and scenes are detected and localized in images using low-level features. [sent-33, score-0.796]
23 This detection and localization process is guided by reasoning and knowledge. [sent-34, score-0.152]
24 Such knowledge is employed to disambiguate locations between objects in (Gupta and Davis, 2008). [sent-35, score-0.435]
25 , above, below, brighter, smaller), the system constrains which region in an image corresponds to which object/noun. [sent-38, score-0.326]
26 , 2005) exploit ontologies extracted from WordNet to associate words and images and image regions. [sent-40, score-0.41]
27 , 2011) employ relations between scenes and objects, introducing an active model to recognize scenes through objects. [sent-42, score-0.612]
28 The reasoning knowledge limits the detector to search for an object within a particular region rather than on the whole image. [sent-43, score-0.383]
29 Language models have also been employed to generate descriptive sentences for images. [sent-44, score-0.08]
30 Similarly, from objects and scenes detected in an image, (Yang et al. [sent-47, score-0.491]
31 , 2011) estimated a sentence structure to generate a sentence description composed of a noun, verb, scene and preposition. [sent-48, score-0.115]
32 , 2012), the Gigaword corpus is used to extract relationships between tools and actions (e. [sent-53, score-0.417]
33 , knife - cut, cup - drink) by counting their co-occurrences. [sent-55, score-0.106]
34 These relationships are used to constrain and select the most plausible actions within a predefined set of actions in cooking videos. [sent-56, score-0.882]
35 Instead of using this knowledge as guidance during recognition, we compare different language models and build a general framework that is able to detect unseen actions through their components (verb - object - scene), hence our method does not limit the number of actions in images. [sent-57, score-0.932] (A toy sketch of this kind of co-occurrence-based triple scoring appears after the sentence list below.)
36 They can detect animals without having seen training examples by manually defining the attributes of the target animal. [sent-63, score-0.133]
37 In this work, rather than relying on manual definitions, our aim is to find the best language models built automatically from available corpora to extract relations from natural language. [sent-64, score-0.071]
38 Currently, human action recognition is popular and mostly studied in video using the Bag-of-Visual-Words method (Delaitre et al. [sent-65, score-0.509]
39 In this method, one extracts small local visual patches of, say, 24 by 24 pixels by 10 frames, sampled at every 12th pixel and every 5th frame. [sent-70, score-0.615]
40 For each patch, histograms of local gradients or local movement (optical flow) are calculated. [sent-71, score-0.253] (An illustrative sketch of this bag-of-visual-words pipeline appears after the sentence list below.)
41 Then these local visual features are mapped to abstract, predefined “visual words”, previously obtained using k-means clustering on a set of random features. [sent-72, score-0.47]
42 While results are good, there are two main drawbacks with this approach. [sent-73, score-0.041]
43 First of all, human actions are semantic and more naturally recognized through their components (human, objects, scene) rather than through a bag of local gradient/motion patterns. [sent-74, score-0.554]
44 Hence we use a component-based method for human action recognition. [sent-75, score-0.308]
45 Second, the number of possible human actions is huge (the number of objects times the number of verbs). [sent-76, score-0.653]
46 Obtaining annotated visual examples for each action is therefore prohibitively expensive. [sent-77, score-0.685]
47 So we learn from language models how components combine into human actions. [sent-78, score-0.119]
48 3 Two Visual Recognition Scenarios We now describe the two computer vision scenarios: human action recognition and objects in context. [sent-79, score-0.996]
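The window-based co-occurrence counting summarized in sentences 32-35 (knife - cut, cup - drink) and the triple-based action recognition of sentences 11 and 35 can be illustrated with a small sketch. This is a toy illustration, not the paper's implementation (the paper compares a window-based model, a topic model, a distributional memory and ConceptNet); the corpus, window size, and additive triple score below are assumptions.

```python
# Toy sketch of window-based verb-object-scene co-occurrence scoring.
# The corpus, window size, and additive score are illustrative assumptions,
# not the paper's actual models (window-based, LDA, DM, ConceptNet).
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that co-occur within a fixed token window."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                counts[frozenset((w, v))] += 1
    return counts


def triple_score(verb, obj, scene, counts):
    """Score a (verb, object, scene) triple by summed pairwise co-occurrence."""
    return (counts[frozenset((verb, obj))]
            + counts[frozenset((verb, scene))]
            + counts[frozenset((obj, scene))])


if __name__ == "__main__":
    # Tiny stand-in corpus; a real setup would lemmatize and use a large corpus.
    corpus = [
        "the man rides a horse in the field".split(),
        "she cuts bread with a knife in the kitchen".split(),
        "he drinks water from a cup in the kitchen".split(),
    ]
    counts = cooccurrence_counts(corpus)
    candidates = [("rides", "horse", "field"), ("rides", "knife", "kitchen")]
    print(max(candidates, key=lambda t: triple_score(*t, counts)))
    # -> ('rides', 'horse', 'field') on this toy corpus
```

On real data one would lemmatize tokens, count over a large corpus such as Gigaword, and normalize the raw counts (for example with PMI) before scoring triples.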
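Sentences 39-41 describe the bag-of-visual-words baseline: dense local patches, gradient or optical-flow histograms, and k-means quantization into visual words. The sketch below covers only the spatial case with a crude gradient descriptor; the patch size, stride, vocabulary size, descriptor, and use of scikit-learn's KMeans are illustrative assumptions rather than the paper's exact setup.

```python
# Illustrative bag-of-visual-words sketch (spatial patches only): dense patch
# extraction, a crude gradient-orientation descriptor, and k-means "visual
# words". Patch size, stride, descriptor, and vocabulary size are assumptions.
import numpy as np
from sklearn.cluster import KMeans


def extract_patches(frame, size=24, stride=12):
    """Densely sample size x size patches every `stride` pixels."""
    h, w = frame.shape
    return [frame[y:y + size, x:x + size]
            for y in range(0, h - size + 1, stride)
            for x in range(0, w - size + 1, stride)]


def gradient_histogram(patch, bins=8):
    """Orientation histogram of local gradients (a crude HOG-like descriptor)."""
    gy, gx = np.gradient(patch.astype(float))
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=bins, range=(-np.pi, np.pi))
    return hist / (hist.sum() + 1e-8)


def build_vocabulary(frames, n_words=50):
    """Cluster all patch descriptors into `n_words` visual words."""
    descriptors = np.array([gradient_histogram(p)
                            for f in frames for p in extract_patches(f)])
    return KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(descriptors)


def bag_of_visual_words(frame, vocab):
    """Represent one frame as a histogram over the learned visual words."""
    feats = np.array([gradient_histogram(p) for p in extract_patches(frame)])
    return np.bincount(vocab.predict(feats), minlength=vocab.n_clusters)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.random((120, 160)) for _ in range(3)]  # stand-in grayscale frames
    vocab = build_vocabulary(frames)
    print(bag_of_visual_words(frames[0], vocab).shape)  # (50,)
```

A video version would sample spatio-temporal patches (e.g., 24 by 24 pixels by 10 frames) and add optical-flow histograms, as the sentences above describe.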
wordName wordTfidf (topN-words)
[('visual', 0.348), ('actions', 0.316), ('vision', 0.273), ('action', 0.247), ('objects', 0.24), ('scenarios', 0.216), ('image', 0.19), ('scenes', 0.186), ('disi', 0.177), ('images', 0.175), ('povo', 0.161), ('conceptnet', 0.135), ('lampert', 0.135), ('italy', 0.119), ('recognition', 0.118), ('scene', 0.115), ('trento', 0.115), ('prohibitively', 0.09), ('object', 0.083), ('teo', 0.079), ('knowledge', 0.077), ('region', 0.077), ('di', 0.071), ('detected', 0.065), ('local', 0.063), ('reasoning', 0.062), ('compatible', 0.062), ('human', 0.061), ('predefined', 0.059), ('constrains', 0.059), ('shah', 0.059), ('dle', 0.059), ('knife', 0.059), ('optical', 0.059), ('patches', 0.059), ('components', 0.058), ('computer', 0.057), ('relationships', 0.056), ('recognized', 0.056), ('jasper', 0.054), ('uijlings', 0.054), ('localization', 0.054), ('localized', 0.054), ('pixel', 0.054), ('commonsense', 0.054), ('bernardi', 0.054), ('cooking', 0.054), ('pixels', 0.054), ('raffaella', 0.054), ('eaghdha', 0.05), ('aiding', 0.05), ('fish', 0.05), ('rr', 0.05), ('davis', 0.05), ('drink', 0.05), ('horse', 0.05), ('attributes', 0.049), ('memory', 0.048), ('lenci', 0.047), ('animals', 0.047), ('cup', 0.047), ('within', 0.047), ('inside', 0.046), ('guidance', 0.045), ('reddy', 0.045), ('locations', 0.045), ('water', 0.045), ('ontologies', 0.045), ('unitn', 0.045), ('histograms', 0.045), ('tools', 0.045), ('expensive', 0.045), ('gupta', 0.043), ('descriptive', 0.043), ('reality', 0.043), ('studied', 0.042), ('gradients', 0.041), ('movement', 0.041), ('animal', 0.041), ('video', 0.041), ('drawbacks', 0.041), ('flow', 0.04), ('mined', 0.04), ('distributional', 0.039), ('generalizes', 0.037), ('detector', 0.037), ('frames', 0.037), ('car', 0.037), ('employed', 0.037), ('detect', 0.037), ('corpora', 0.037), ('guided', 0.036), ('proxy', 0.036), ('ber', 0.036), ('eat', 0.036), ('disambiguate', 0.036), ('cut', 0.035), ('aim', 0.034), ('plausible', 0.034), ('baroni', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that language models built from general text corpora can be used instead of expensive annotated images and can even outperform the image model when tested on a large, general dataset.
2 0.30128336 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
4 0.20086411 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
5 0.15057139 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
Author: Rui Fang ; Changsong Liu ; Lanbo She ; Joyce Y. Chai
Abstract: In situated dialogue, humans and agents have mismatched capabilities of perceiving the shared environment. Their representations of the shared world are misaligned. Thus referring expression generation (REG) will need to take this discrepancy into consideration. To address this issue, we developed a hypergraph-based approach to account for group-based spatial relations and uncertainties in perceiving the environment. Our empirical results have shown that this approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This big performance gap calls for new solutions to REG that can mediate a shared perceptual basis in situated dialogue.
6 0.12796071 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
7 0.11535744 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
8 0.098652817 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs
9 0.081295304 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior
10 0.06574706 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes
11 0.059727866 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
12 0.053072285 171 emnlp-2013-Shift-Reduce Word Reordering for Machine Translation
13 0.045370974 154 emnlp-2013-Prior Disambiguation of Word Tensors for Constructing Sentence Vectors
14 0.042423464 99 emnlp-2013-Implicit Feature Detection via a Constrained Topic Model and SVM
15 0.040453583 60 emnlp-2013-Detecting Compositionality of Multi-Word Expressions using Nearest Neighbours in Vector Space Models
16 0.039768517 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
17 0.037440661 42 emnlp-2013-Building Specialized Bilingual Lexicons Using Large Scale Background Knowledge
18 0.035679109 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
19 0.032503713 12 emnlp-2013-A Semantically Enhanced Approach to Determine Textual Similarity
20 0.032331992 194 emnlp-2013-Unsupervised Relation Extraction with General Domain Knowledge
topicId topicWeight
[(0, -0.143), (1, 0.055), (2, -0.062), (3, 0.102), (4, -0.059), (5, 0.234), (6, -0.041), (7, -0.099), (8, -0.131), (9, -0.073), (10, -0.447), (11, -0.054), (12, 0.099), (13, -0.063), (14, -0.258), (15, -0.014), (16, -0.107), (17, 0.035), (18, 0.131), (19, -0.006), (20, -0.042), (21, 0.006), (22, -0.003), (23, -0.035), (24, 0.003), (25, 0.021), (26, 0.066), (27, 0.057), (28, 0.037), (29, -0.026), (30, -0.003), (31, 0.039), (32, -0.003), (33, 0.028), (34, 0.043), (35, 0.031), (36, 0.016), (37, 0.029), (38, 0.041), (39, 0.029), (40, -0.027), (41, 0.025), (42, -0.05), (43, -0.019), (44, -0.007), (45, -0.012), (46, -0.002), (47, 0.027), (48, 0.005), (49, -0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.97391951 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that language models built from general text corpora can be used instead of expensive annotated images and can even outperform the image model when tested on a large, general dataset.
2 0.91575152 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
4 0.708417 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
Author: Stephen Roller ; Sabine Schulte im Walde
Abstract: Recent investigations into grounded models of language have shown that holistic views of language and perception can provide higher performance than independent views. In this work, we improve a two-dimensional multimodal version of Latent Dirichlet Allocation (Andrews et al., 2009) in various ways. (1) We outperform text-only models in two different evaluations, and demonstrate that low-level visual features are directly compatible with the existing model. (2) We present a novel way to integrate visual features into the LDA model using unsupervised clusters of images. The clusters are directly interpretable and improve on our evaluation tasks. (3) We provide two novel ways to extend the bimodal models to support three or more modalities. We find that the three-, four-, and five-dimensional models significantly outperform models using only one or two modalities, and that nontextual modalities each provide separate, disjoint knowledge that cannot be forced into a shared, latent structure.
5 0.59666836 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
Author: Rui Fang ; Changsong Liu ; Lanbo She ; Joyce Y. Chai
Abstract: In situated dialogue, humans and agents have mismatched capabilities of perceiving the shared environment. Their representations of the shared world are misaligned. Thus referring expression generation (REG) will need to take this discrepancy into consideration. To address this issue, we developed a hypergraph-based approach to account for group-based spatial relations and uncertainties in perceiving the environment. Our empirical results have shown that this approach outperforms a previous graph-based approach with an absolute gain of 9%. However, while these graph-based approaches perform effectively when the agent has perfect knowledge or perception of the environment (e.g., 84%), they perform rather poorly when the agent has imperfect perception of the environment (e.g., 45%). This big performance gap calls for new solutions to REG that can mediate a shared perceptual basis in situated dialogue.
6 0.40848741 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction
7 0.40349558 153 emnlp-2013-Predicting the Resolution of Referring Expressions from User Behavior
8 0.39621991 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs
9 0.35847673 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
11 0.19710237 116 emnlp-2013-Joint Parsing and Disfluency Detection in Linear Time
12 0.18208298 58 emnlp-2013-Dependency Language Models for Sentence Completion
13 0.18180433 23 emnlp-2013-Animacy Detection with Voting Models
14 0.17105329 132 emnlp-2013-Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
15 0.17011234 191 emnlp-2013-Understanding and Quantifying Creativity in Lexical Composition
16 0.16789535 106 emnlp-2013-Inducing Document Plans for Concept-to-Text Generation
17 0.16148356 24 emnlp-2013-Application of Localized Similarity for Web Documents
18 0.15945409 108 emnlp-2013-Interpreting Anaphoric Shell Nouns using Antecedents of Cataphoric Shell Nouns as Training Data
19 0.15908396 92 emnlp-2013-Growing Multi-Domain Glossaries from a Few Seeds using Probabilistic Topic Models
20 0.15894186 134 emnlp-2013-Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
topicId topicWeight
[(3, 0.016), (22, 0.021), (30, 0.066), (50, 0.589), (51, 0.128), (66, 0.026), (71, 0.013), (75, 0.029), (90, 0.01), (96, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.85569263 78 emnlp-2013-Exploiting Language Models for Visual Recognition
Author: Dieu-Thu Le ; Jasper Uijlings ; Raffaella Bernardi
Abstract: The problem of learning language models from large text corpora has been widely studied within the computational linguistics community. However, little is known about the performance of these language models when applied to the computer vision domain. In this work, we compare representative models: a window-based model, a topic model, a distributional memory and a commonsense knowledge database, ConceptNet, in two visual recognition scenarios: human action recognition and object prediction. We examine whether the knowledge extracted from texts through these models is compatible with the knowledge represented in images. We determine the usefulness of different language models in aiding the two visual recognition tasks. The study shows that language models built from general text corpora can be used instead of expensive annotated images and can even outperform the image model when tested on a large, general dataset.
2 0.84740645 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ℓ2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—ℓ0 and a modification of ℓ2—and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
3 0.72297293 19 emnlp-2013-Adaptor Grammars for Learning Non-Concatenative Morphology
Author: Jan A. Botha ; Phil Blunsom
Abstract: This paper contributes an approach for expressing non-concatenative morphological phenomena, such as stem derivation in Semitic languages, in terms of a mildly context-sensitive grammar formalism. This offers a convenient level of modelling abstraction while remaining computationally tractable. The nonparametric Bayesian framework of adaptor grammars is extended to this richer grammar formalism to propose a probabilistic model that can learn word segmentation and morpheme lexicons, including ones with discontiguous strings as elements, from unannotated data. Our experiments on Hebrew and three variants of Arabic data find that the additional expressiveness to capture roots and templates as atomic units improves the quality of concatenative segmentation and stem identification. We obtain 74% accuracy in identifying triliteral Hebrew roots, while performing morphological segmentation with an F1-score of 78.1.
4 0.4949795 98 emnlp-2013-Image Description using Visual Dependency Representations
Author: Desmond Elliott ; Frank Keller
Abstract: Describing the main event of an image involves identifying the objects depicted and predicting the relationships between them. Previous approaches have represented images as unstructured bags of regions, which makes it difficult to accurately predict meaningful relationships between regions. In this paper, we introduce visual dependency representations to capture the relationships between the objects in an image, and hypothesize that this representation can improve image description. We test this hypothesis using a new data set of region-annotated images, associated with visual dependency representations and gold-standard descriptions. We describe two template-based description generation models that operate over visual dependency representations. In an image description task, we find that these models outperform approaches that rely on object proximity or corpus information to generate descriptions on both automatic measures and on human judgements.
Author: Andrew J. Anderson ; Elia Bruni ; Ulisse Bordignon ; Massimo Poesio ; Marco Baroni
Abstract: Traditional distributional semantic models extract word meaning representations from cooccurrence patterns of words in text corpora. Recently, the distributional approach has been extended to models that record the cooccurrence of words with visual features in image collections. These image-based models should be complementary to text-based ones, providing a more cognitively plausible view of meaning grounded in visual perception. In this study, we test whether image-based models capture the semantic patterns that emerge from fMRI recordings of the neural signal. Our results indicate that, indeed, there is a significant correlation between image-based and brain-based semantic similarities, and that image-based models complement text-based ones, so that the best correlations are achieved when the two modalities are combined. Despite some unsatisfactory, but explained outcomes (in particular, failure to detect differential association of models with brain areas), the results show, on the one hand, that image-based distributional semantic models can be a precious new tool to explore semantic representation in the brain, and, on the other, that neural data can be used as the ultimate test set to validate artificial semantic models in terms of their cognitive plausibility.
6 0.37068909 105 emnlp-2013-Improving Web Search Ranking by Incorporating Structured Annotation of Queries
7 0.36246541 185 emnlp-2013-Towards Situated Dialogue: Revisiting Referring Expression Generation
8 0.35185346 11 emnlp-2013-A Multimodal LDA Model integrating Textual, Cognitive and Visual Modalities
9 0.34218141 8 emnlp-2013-A Joint Learning Model of Word Segmentation, Lexical Acquisition, and Phonetic Variability
10 0.34204963 30 emnlp-2013-Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora
11 0.33565086 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
12 0.33267656 44 emnlp-2013-Centering Similarity Measures to Reduce Hubs
13 0.31941646 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
14 0.31903937 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
15 0.31738678 15 emnlp-2013-A Systematic Exploration of Diversity in Machine Translation
16 0.31601781 2 emnlp-2013-A Convex Alternative to IBM Model 2
17 0.31501356 192 emnlp-2013-Unsupervised Induction of Contingent Event Pairs from Film Scenes
18 0.31462297 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
19 0.30916393 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
20 0.30829978 179 emnlp-2013-Summarizing Complex Events: a Cross-Modal Solution of Storylines Extraction and Reconstruction