iccv iccv2013 iccv2013-428 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
Reference: text
sentIndex sentText sentNum sentScore
1 In order to provide natural language descriptions for visual content, this paper combines two important ingredients. [sent-2, score-0.674]
2 And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. [sent-7, score-1.205]
3 For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. [sent-8, score-0.879]
4 We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. [sent-9, score-0.931]
5 Thus, this work addresses the problem of generating textual descriptions for videos. [sent-16, score-0.417]
6 This task has a wide range of applications in the domain of human-computer/robot interaction, generating summary descriptions of (web-)videos, and automating movie descriptions for visually impaired people. [sent-17, score-0.691]
7 Furthermore, being able to convert visual content to language is an important step in understanding the relationship between visual and linguistic information which are the richest interaction modalities available to humans. [sent-18, score-0.43]
8 Generating natural language descriptions of visual content is an intriguing task but requires combining the fundamental research problems of visual recognition and natural language generation (NLG). [sent-19, score-1.18]
9 While for descriptions of images, recent approaches have proposed to statistically model the conversion from images to text [5, 15, 16, 18], most approaches for video description use rules and templates to generate video descriptions [14, 9, 2, 10, 11, 26, 3, 8]. [sent-20, score-1.068]
10 To address the first question we suggest to learn the conversion from video to language descriptions in a two-step approach. [sent-25, score-0.749]
11 In the first step we learn an intermediate SR using a probabilistic model, following ideas used to generate image descriptions [5, 15]. [sent-26, score-0.399]
12 Then, given the SR, we propose to phrase the problem of NLG as a translation problem, that is translating the SRs to natural language descriptions. [sent-27, score-0.66]
13 In contrast to related work on video description, we learn both the SR as well as the language descriptions from an aligned parallel corpus containing videos, semantic annotations and textual descriptions. [sent-28, score-1.084]
14 We compare our approach to related work and baselines using no intermediate SR and/or language model. [sent-29, score-0.384]
15 Instead we learn from a parallel training corpus the most relevant information to verbalize and how to verbalize it. [sent-31, score-0.325]
16 This has been shown for language translation, where statistical machine translation has generally replaced rule-based approaches [12]. [sent-49, score-0.539]
17 First, we phrase video description as a translation problem from video content to natural language descriptions (Sec. [sent-53, score-1.201]
18 The SR, when using ground truth annotations, allows generating language that is close to human performance. [sent-60, score-0.378]
19 Third, annotations as well as intermediate outputs and final descriptions are released on our website to allow for comparisons to our work or for building on our SR. [sent-63, score-0.44]
20 Machine translation aims to translate from one natural language to another. [sent-66, score-0.558]
21 Based on sentence-aligned corpora of source and target language a translation model is estimated. [sent-71, score-0.552]
22 Additionally, a model for the target language is learnt to generate a fluent and grammatical output. [sent-72, score-0.417]
23 This setting is different from ours as we want to generate descriptions at test time from visual content only. [sent-80, score-0.43]
24 (2) Given a SR extracted from visual content it is possible to generate language using manually defined rules and templates. [sent-81, score-0.472]
25 From the CRF predictions they generate descriptions based on simple templates (or an n-gram model, which falls into (4)). [sent-84, score-0.428]
26 We also use a CRF to predict an intermediate SR but we show that our translation system generates descriptions more similar to human descriptions. [sent-85, score-0.6]
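As a hedged illustration of MAP inference in such a CRF over semantic-representation components, the sketch below scores joint label assignments with unary (classifier) and pairwise (compatibility) potentials; the node set, label vocabularies, and potential values are hypothetical placeholders, not the paper's actual model.

```python
import itertools
import math

# Hypothetical SR components and label vocabularies (placeholders).
LABELS = {
    "activity": ["cut", "wash", "take out"],
    "tool":     ["knife", "hands"],
    "object":   ["carrot", "pot"],
}

def map_inference(unary, pairwise):
    """Brute-force MAP over a tiny fully connected CRF.

    unary[node][label]         : score of a label from a visual classifier (assumed given)
    pairwise[(a, b)][(la, lb)] : compatibility score between labels of two components
    Returns the jointly highest-scoring assignment, i.e. the predicted SR.
    """
    nodes = list(LABELS)
    best, best_score = None, -math.inf
    for combo in itertools.product(*(LABELS[n] for n in nodes)):
        labeling = dict(zip(nodes, combo))
        score = sum(unary[n][labeling[n]] for n in nodes)
        score += sum(pairwise.get((a, b), {}).get((labeling[a], labeling[b]), 0.0)
                     for a, b in itertools.combinations(nodes, 2))
        if score > best_score:
            best, best_score = labeling, score
    return best, best_score
```

Brute-force enumeration only works here because the toy label space is tiny; the actual model would use its own structure, learned potentials, and a proper inference routine.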
27 [26] learns audio-visual concepts and generates a video description for three different activities using rules to combine action, scene, and audio concepts with glue words. [sent-88, score-0.35]
28 The best-suited triple is used to generate multiple sentences based on a template, which are again scored against an n-gram language model. [sent-98, score-0.488]
29 Similarly, our translation approach weights resulting sentences according to a language model. [sent-99, score-0.654]
30 (3) The third group of approaches reduces the generation process to retrieving sentences from a training corpus based on locally [20] or globally [5] similar images. [sent-101, score-0.444]
31 (4) The fourth line of work, which also includes this work, goes beyond retrieving existing descriptions by learning a language model to compose novel descriptions. [sent-104, score-0.637]
32 [15] learns an n-gram language model to predict function words for their SR. [sent-105, score-0.33]
33 Two recent approaches use an aligned corpus of images and descriptions as a basis for generating novel descriptions for images using state-of-the art language generation techniques. [sent-109, score-1.221]
34 While they hand-craft constraints to translate from the image, we learn a statistical translation model. [sent-112, score-0.29]
35 In contrast to their Tree-adjoining-grammar (TAG)-like natural language generation approach, we use flat, co-occurrence-based techniques from SMT. [sent-115, score-0.395]
36 Video description as a translation problem In this section we present a two-step approach which describes video content with natural language. [sent-117, score-0.476]
37 We assume that for training we have a parallel corpus which contains a set of video snippets and sentences. [sent-118, score-0.315]
38 Video snippets represented by the video descriptor xi are aligned with a sentence zi, i. [sent-119, score-0.534]
39 At test time we first predict the SR y∗ for a new video (descriptor) x∗ and then generate a sentence z∗ from y∗ . [sent-123, score-0.464]
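As a hedged sketch of this two-step test-time procedure, the code below wires a CRF-like predictor and a translation component together; `crf`, `smt`, and the SR field layout are assumed stand-ins, not the authors' actual interfaces.

```python
def describe_video(x_star, crf, smt):
    """Two-step description of a test video descriptor x*:
    (1) predict the semantic representation y* with the CRF,
    (2) translate y* into a natural-language sentence z* with the SMT system.
    Both `crf` and `smt` are placeholders for trained components."""
    y_star = crf.predict(x_star)               # e.g. (activity, tool, object, source, target)
    z_star = smt.translate(linearize_sr(y_star))
    return z_star

def linearize_sr(sr):
    """Turn the SR tuple into the word string the translation system consumes."""
    return " ".join(str(field) for field in sr if field)
```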
40 (yi^1, . . . , yi^{Li}, zi), where Li is the number of SR annotations for sentence zi. [sent-141, score-0.454]
41 Translating from a SR to a description Converting a SR to descriptions (SR → D) has many similarities to translating from a source to a target language (LS → LT) in machine translation. [sent-183, score-0.9]
42 In a natural description of video not necessarily all semantic concepts are verbalized, e. [sent-191, score-0.305]
43 When translating LS → LT a language model of LT is used to achieve a grammatically correct and fluent target sentence, same for D in SR → D. [sent-204, score-0.451]
44 Motivated by these similarities, we propose to use established techniques for statistical machine translation (SMT) to learn a translation model from a parallel corpus of SRs and descriptions. [sent-205, score-0.635]
45 In TACoS we encounter the problem that one sentence can be aligned to multiple SRs, i. [sent-208, score-0.421]
46 For all SR annotations aligned to a sentence we create a separate training example, i. [sent-216, score-0.525]
47 We estimate the highest word overlap between the sentence and the string of the SR: argmax_{li} |Lemma(yi^{li}) ∩ Lemma(zi)|, where Lemma refers to lemmatizing, i. [sent-227, score-0.427]
48 While we do not have an annotated SR for the sentence level, we can predict one SR yi^{li}, the annotation with the largest lemma overlap |Lemma(yi^{li}) ∩ Lemma(zi)|, for each sentence, i. [sent-233, score-0.386]
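A possible implementation of this overlap criterion is sketched below; the lemmatizer and the SR field format are assumptions.

```python
def select_sr_by_lemma_overlap(sentence_tokens, candidate_srs, lemma):
    """Pick, among the SR annotations aligned to a sentence, the one whose
    lemmatized words overlap most with the lemmatized sentence, mirroring
    the |Lemma(yi^{li}) ∩ Lemma(zi)| criterion above. `lemma` is any
    function mapping a token to its base form (assumed given)."""
    sentence_lemmas = {lemma(t) for t in sentence_tokens}

    def overlap(sr):
        sr_lemmas = {lemma(t) for field in sr for t in str(field).split()}
        return len(sr_lemmas & sentence_lemmas)

    return max(candidate_srs, key=overlap)

# Toy usage with a crude suffix-stripping "lemmatizer":
# select_sr_by_lemma_overlap("the person cuts the carrot".split(),
#                            [("cut", "knife", "carrot"), ("wash", "hands", "pot")],
#                            lemma=lambda w: w.lower().rstrip("s"))
```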
49 While this will be noisier during training time it also reflects better the situation at test time where we also have predictions at sentence level as annotations are unavailable. [sent-236, score-0.542]
50 SMT expects an input string as the source-language expression. [sent-237, score-0.371]
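One way to produce that line-aligned input is sketched below; the file names and the SR field order are assumptions, not requirements of any particular toolkit.

```python
def write_parallel_corpus(pairs, src_path="train.sr", tgt_path="train.desc"):
    """Write one line per (SR, sentence) training pair in the sentence-aligned
    two-file format phrase-based SMT training typically consumes: the
    linearized SR plays the role of the source language, the description
    the target language."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for sr, sentence in pairs:
            src.write(" ".join(str(f) for f in sr if f) + "\n")
            tgt.write(sentence.strip().lower() + "\n")
```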
51 To estimate the fluency of the descriptions we use IRSTLM [6] which is based on n-gram statistics of TACoS. [sent-248, score-0.319]
52 The final step involves optimizing a linear model between the probabilities from the language model, phrase tables, and reordering model, as well as word, phrase, and rule counts [13]. [sent-249, score-0.353]
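The sketch below shows the kind of log-linear combination such tuning optimizes; the feature names and weight values are illustrative, not the ones used in the paper or in any specific toolkit.

```python
import math

def loglinear_score(features, weights):
    """Score of one candidate translation as a weighted sum of (mostly
    log-domain) feature values: language-model probability, phrase-table
    probabilities, reordering score, and word/phrase counts. Tuning searches
    for the weights that maximize an automatic metric on held-out data."""
    return sum(weights[name] * value for name, value in features.items())

candidate_features = {                  # hypothetical feature values
    "language_model": math.log(1e-4),
    "phrase_table":   math.log(0.3),
    "reordering":     math.log(0.8),
    "word_count":     -6.0,             # proportional to output length
}
feature_weights = {"language_model": 0.5, "phrase_table": 0.3,
                   "reordering": 0.1, "word_count": -0.1}
print(loglinear_score(candidate_features, feature_weights))
```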
53 For testing, we apply our translation model to the SR y∗ predicted by the CRF for a given input video x∗ . [sent-252, score-0.294]
54 Sentence retrieval An alternative to generating novel descriptions is to retrieve the most likely sentence from a training corpus [5]. [sent-260, score-0.957]
55 Given a test video x∗ we search for the closest training video xi and output the sentence z∗ = zi (in case there are several we choose the first). [sent-261, score-0.631]
56 While the raw video features tend to be too noisy to compute reliable distances, it has been shown that using the vector of attribute classifier outputs instead of the raw video features improves similarity estimates between videos [23]. [sent-266, score-0.315]
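A hedged sketch of this retrieval baseline, representing each video by its vector of attribute-classifier scores; the distance metric is an assumption.

```python
import numpy as np

def retrieve_sentence(test_attr, train_attrs, train_sentences):
    """Sentence-retrieval baseline: return the sentence of the training video
    whose attribute-classifier score vector is closest to that of the test
    video (ties broken by taking the first match, as in the excerpt above)."""
    dists = np.linalg.norm(train_attrs - test_attr, axis=1)  # Euclidean distance
    return train_sentences[int(np.argmin(dists))]

# train_attrs: (N, D) array of classifier scores, test_attr: (D,) array,
# train_sentences: list of N aligned training sentences.
```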
57 NLG with N-grams While we keep the same SR, we replace the SMT pipeline by learning an n-gram language model on the training set of the descriptions. [sent-272, score-0.325]
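A minimal count-based trigram model of the kind such a baseline could use is sketched below; a real system would add proper smoothing, backoff, and a beam search over candidate filler words.

```python
import math
from collections import Counter

def train_trigram_lm(descriptions):
    """Collect trigram and bigram counts from the training descriptions."""
    tri, bi = Counter(), Counter()
    for sentence in descriptions:
        toks = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    return tri, bi

def trigram_logprob(tokens, tri, bi, vocab_size):
    """Add-one-smoothed log-probability, usable to compare candidate fillings
    of function words between the predicted SR keywords."""
    toks = ["<s>", "<s>"] + tokens + ["</s>"]
    return sum(math.log((tri[tuple(toks[i - 2:i + 1])] + 1) /
                        (bi[tuple(toks[i - 2:i])] + vocab_size))
               for i in range(2, len(toks)))
```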
58 Evaluation: Translating video to text We evaluate our video description approach on the TACoS dataset [23] which contains videos with aligned SR annotations and sentence descriptions. [sent-295, score-0.835]
59 We test our approach on a subset of 490 video snippet / sentence pairs. [sent-302, score-0.501]
60 For manual evaluation, we follow [16] and ask 10 human subjects to rate grammatical correctness (independent of video content), correctness, and relevance (the latter two independent of grammatical correctness). [sent-313, score-0.42]
61 Correctness rates whether the sentences are correct with respect to the video, and relevance judges whether the sentence describes the most salient activity and objects. [sent-314, score-0.705]
62 We present the human judges with the different sentences from our systems in a random order for each video and explicitly ask them to make consistent relative judgments between the different sentences. [sent-317, score-0.363]
63 Results: Translating video to text Results of the various baselines and from our translation system are provided in Table 2 and typical sample outputs of our approach and baseline systems are shown in Table 4. [sent-326, score-0.355]
64 When retrieving the closest sentence from the training data based on the raw video features (first row in Table 2), we obtain BLEU@4 of 6. [sent-329, score-0.566]
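For reference, BLEU@4 against the aligned reference descriptions can be computed roughly as sketched below (using NLTK here; whether the authors used this exact tooling or smoothing is an assumption).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_at_4(reference_sets, hypotheses):
    """Corpus-level BLEU@4 of generated descriptions.
    reference_sets: list (one per video) of lists of reference sentences.
    hypotheses:     list of generated sentences, aligned with reference_sets."""
    refs = [[r.lower().split() for r in ref_set] for ref_set in reference_sets]
    hyps = [h.lower().split() for h in hypotheses]
    return corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=SmoothingFunction().method1)
```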
65 Modeling the language statistics with an n-gram model to fill function words between predicted keywords of the SR leads to a further improvement to 16% with n = 3 and a search span of up to 10 words. [sent-334, score-0.354]
66 Second, using our translation approach clearly improves over sentence retrieval (+9. [sent-343, score-0.605]
67 Comparing our different variants, it is interesting to see that it is important how a SR is matched with descriptions when training the SMT model. [sent-349, score-0.39]
68 When a sentence is aligned to multiple SRs, just matching all SRs to it leads to a noisy model (11. [sent-350, score-0.421]
69 9%), or the largest semantic overlap between a SR and training sentence (18. [sent-353, score-0.479]
70 Table 2: Evaluating generated descriptions on the TACoS video-description corpus. [sent-419, score-0.522]
71 In contrast to the SRs based on annotations, the predictions are on sentence intervals. [sent-424, score-0.438]
72 This indicates that a SR at the same level of granularity as the sentence is most powerful. [sent-425, score-0.438]
73 As an upper bound we report the BLEU score for the human descriptions which is 36. [sent-433, score-0.355]
74 Starting with the last column (relevance, 6th column) the two main trends suggested by the BLEU scores are confirmed: our proposed approach using training on sentence level predictions outperforms all baselines; and using our SR based on annotations is encouragingly close to human performance (4. [sent-436, score-0.578]
75 The human judgments about correctness (5th column) show scores for overall correctness (first number) followed by the scores for activities, objects (including tools and ingredients), and location (covering source and target location, see Table 1). [sent-440, score-0.423]
76 9), only our training on sentence-level predictions performs higher, with an average score of 3. [sent-444, score-0.474]
77 1 as it can recover from errors by learning the CRF's typical errors during training (see also the examples in Table 4). Footnote 1: Computed only on a 272-sentence subset where the corpus contains more than a single reference sentence for the same video. [sent-445, score-0.944]
78 It is interesting to look at the 4th column which judges the grammatical correctness of the produced sentences disregarding the visual input. [sent-448, score-0.434]
79 6), indicating that our system learned a better language model than most human descriptions have. [sent-451, score-0.644]
80 For our evaluation we choose the Pascal sentence dataset [5], which consists of 1,000 images, each paired with 5 different one-sentence descriptions. [sent-457, score-0.705]
81 We learn our translation approach on the training set of triples and image descriptions. [sent-459, score-0.311]
82 We evaluate on a subset of 323 images for which predicted descriptions are available from both related approaches [5, 15]. [sent-460, score-0.343]
83 [18] also predicts sentences for this dataset but only example sentences were available to us. [sent-462, score-0.346]
84 [15] reports 15% BLEU@1. [sent-466, score-0.483]
Table 3: Evaluating generated descriptions on the Pascal Sentence dataset. Approaches compared (BLEU@4 and BLEU@1): Related Work (Template-based generation [15], MRF + sentence retrieval [5]); Translation (this work) (MRF + translation, MRF + adjective extension + translation); Upper Bound (Human descriptions).
87 This is not surprising as the templates produce very different text compared to descriptions by humans. [sent-486, score-0.419]
88 Nevertheless, we outperform the best reported BLEU-score result on this dataset, 30% @1, by 5% (note that the test sets are not identical) for language-model-based generation or meaning representation [15]. [sent-500, score-0.359]
89 Conclusion Automatically describing videos with natural language is both a compelling and a challenging task. [sent-503, score-0.413]
90 This work proposes to learn the conversion from visual content to natural descriptions from a parallel corpus of videos and textual descriptions rather than using rules and templates to generate language. [sent-504, score-1.223]
91 Our model is a two-step approach, first learning an intermediate representation of semantic labels from the video, and then translating it to natural language adopting techniques from statistical machine translation. [sent-505, score-0.597]
92 In order to form a natural description of the content as humans would give it, our model learns which words should be added even though they are not directly present in the visual content. [sent-507, score-0.303]
93 In an extensive experimental evaluation we show improvements of our approach compared to the retrieval- and n-gram-based sentence generation used in prior work. [sent-508, score-0.509]
94 The application of our approach to sentence descriptions shows clear improvements over [15] and [5] using BLEU-score evaluation, indicating that we produce descriptions more similar to human descriptions. [sent-510, score-1.06]
95 To handle the different levels of granularity in the SR compared to the description we compare different variants of our model, showing that an estimation of the largest semantic overlap between the SR and the description during training performs best. [sent-511, score-0.358]
96 While we show the benefits of phrasing video description as a translation problem, there are many possibilities to improve our work. [sent-512, score-0.359]
97 Further directions include modeling temporal dependencies in both the SR and the language generation, as well as modeling the uncertainty of the visual input explicitly in the generation process, which has similarities to translating from uncertain speech input. [sent-513, score-0.493]
98 IRSTLM: an open source toolkit for handling large scale language models. [sent-576, score-0.375]
99 Automated textual descriptions for a wide range of video events with 48 human actions. [sent-609, score-0.478]
100 Natural language description of human activities from video images based on concept hierarchy of actions. [sent-644, score-0.536]
wordName wordTfidf (topN-words)
[('sr', 0.387), ('sentence', 0.386), ('descriptions', 0.319), ('language', 0.289), ('bleu', 0.274), ('translation', 0.192), ('tacos', 0.175), ('sentences', 0.173), ('crf', 0.156), ('corpus', 0.136), ('smt', 0.127), ('correctness', 0.113), ('srs', 0.113), ('translating', 0.104), ('ylii', 0.095), ('judgments', 0.09), ('description', 0.089), ('content', 0.081), ('nlg', 0.079), ('verbalized', 0.079), ('video', 0.078), ('activity', 0.073), ('grammatical', 0.07), ('generation', 0.07), ('annotations', 0.068), ('moses', 0.059), ('videos', 0.058), ('templates', 0.057), ('semantic', 0.057), ('triples', 0.056), ('carrot', 0.056), ('zi', 0.053), ('generating', 0.053), ('intermediate', 0.053), ('predictions', 0.052), ('granularity', 0.052), ('acl', 0.051), ('rules', 0.049), ('hob', 0.048), ('judges', 0.048), ('verbalization', 0.048), ('verbalize', 0.048), ('toolkit', 0.045), ('concepts', 0.045), ('textual', 0.045), ('activities', 0.044), ('text', 0.043), ('knife', 0.042), ('baselines', 0.042), ('words', 0.041), ('translate', 0.041), ('string', 0.041), ('source', 0.041), ('phrase', 0.039), ('phrases', 0.038), ('snippet', 0.037), ('raw', 0.037), ('natural', 0.036), ('conversion', 0.036), ('human', 0.036), ('training', 0.036), ('variants', 0.035), ('snippets', 0.035), ('aligned', 0.035), ('tool', 0.035), ('cooking', 0.034), ('nj', 0.032), ('bertoldi', 0.032), ('irstlm', 0.032), ('pot', 0.032), ('stove', 0.032), ('xlii', 0.032), ('counter', 0.031), ('target', 0.03), ('parallel', 0.03), ('statistical', 0.03), ('describing', 0.03), ('visual', 0.03), ('retrieving', 0.029), ('machine', 0.028), ('fluent', 0.028), ('ask', 0.028), ('learn', 0.027), ('retrieval', 0.027), ('attribute', 0.027), ('federico', 0.026), ('ngram', 0.026), ('regneri', 0.026), ('humans', 0.026), ('plate', 0.026), ('relevance', 0.025), ('ls', 0.025), ('reordering', 0.025), ('dice', 0.025), ('cken', 0.025), ('alignment', 0.024), ('predicted', 0.024), ('prepositions', 0.023), ('manually', 0.023), ('pan', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 428 iccv-2013-Translating Video Content to Natural Language Descriptions
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
2 0.32996103 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-ofthe-art method on the TRECVID-MED11 event kit, despite weaker supervision.
3 0.30013892 246 iccv-2013-Learning the Visual Interpretation of Sentences
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
5 0.16716771 293 iccv-2013-Nonparametric Blind Super-resolution
Author: Tomer Michaeli, Michal Irani
Abstract: Super resolution (SR) algorithms typically assume that the blur kernel is known (either the Point Spread Function ‘PSF’ of the camera, or some default low-pass filter, e.g. a Gaussian). However, the performance of SR methods significantly deteriorates when the assumed blur kernel deviates from the true one. We propose a general framework for “blind” super resolution. In particular, we show that: (i) Unlike the common belief, the PSF of the camera is the wrong blur kernel to use in SR algorithms. (ii) We show how the correct SR blur kernel can be recovered directly from the low-resolution image. This is done by exploiting the inherent recurrence property of small natural image patches (either internally within the same image, or externally in a collection of other natural images). In particular, we show that recurrence of small patches across scales of the low-res image (which forms the basis for single-image SR), can also be used for estimating the optimal blur kernel. This leads to significant improvement in SR results.
6 0.16154277 35 iccv-2013-Accurate Blur Models vs. Image Priors in Single Image Super-resolution
7 0.151737 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
8 0.1354842 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
9 0.11372752 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
10 0.11206681 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
11 0.10723884 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
12 0.093723141 96 iccv-2013-Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition
13 0.08552134 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
14 0.076617792 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
15 0.075270168 53 iccv-2013-Attribute Dominance: What Pops Out?
16 0.072525062 234 iccv-2013-Learning CRFs for Image Parsing with Adaptive Subgradient Descent
17 0.072037801 166 iccv-2013-Finding Actors and Actions in Movies
18 0.069243535 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
19 0.068083532 257 iccv-2013-Log-Euclidean Kernels for Sparse Representation and Dictionary Learning
20 0.067041278 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
topicId topicWeight
[(0, 0.16), (1, 0.116), (2, 0.011), (3, 0.015), (4, 0.059), (5, 0.026), (6, -0.001), (7, -0.104), (8, 0.026), (9, -0.028), (10, 0.064), (11, -0.122), (12, 0.028), (13, -0.004), (14, -0.001), (15, 0.015), (16, -0.051), (17, -0.029), (18, -0.109), (19, -0.032), (20, -0.021), (21, 0.044), (22, 0.035), (23, 0.077), (24, -0.01), (25, -0.009), (26, 0.122), (27, -0.102), (28, -0.262), (29, -0.031), (30, 0.061), (31, -0.299), (32, 0.104), (33, -0.084), (34, -0.074), (35, -0.059), (36, -0.157), (37, 0.083), (38, -0.269), (39, -0.045), (40, -0.039), (41, 0.087), (42, -0.068), (43, 0.012), (44, -0.013), (45, -0.005), (46, 0.085), (47, -0.064), (48, -0.014), (49, 0.08)]
simIndex simValue paperId paperTitle
same-paper 1 0.95666349 428 iccv-2013-Translating Video Content to Natural Language Descriptions
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko
Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in ” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.
3 0.69868469 246 iccv-2013-Learning the Visual Interpretation of Sentences
Author: C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende
Abstract: Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences’ visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
4 0.56630176 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal
Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.
5 0.56423467 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions
Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei
Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-ofthe-art method on the TRECVID-MED11 event kit, despite weaker supervision.
6 0.54554039 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
7 0.4965252 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
8 0.38909456 35 iccv-2013-Accurate Blur Models vs. Image Priors in Single Image Super-resolution
9 0.37525058 446 iccv-2013-Visual Semantic Complex Network for Web Images
10 0.37520847 145 iccv-2013-Estimating the Material Properties of Fabric from Video
11 0.37134156 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
12 0.36771137 293 iccv-2013-Nonparametric Blind Super-resolution
13 0.36554322 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
14 0.36280662 156 iccv-2013-Fast Direct Super-Resolution by Simple Functions
15 0.35024691 166 iccv-2013-Finding Actors and Actions in Movies
16 0.3336255 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
17 0.31031087 248 iccv-2013-Learning to Rank Using Privileged Information
18 0.30753741 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
19 0.29612648 96 iccv-2013-Coupled Dictionary and Feature Space Learning with Applications to Cross-Domain Image Synthesis and Recognition
20 0.29448736 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
topicId topicWeight
[(2, 0.077), (7, 0.01), (8, 0.041), (12, 0.07), (26, 0.061), (31, 0.048), (42, 0.114), (64, 0.048), (71, 0.223), (73, 0.021), (86, 0.026), (89, 0.141)]
simIndex simValue paperId paperTitle
same-paper 1 0.78137255 428 iccv-2013-Translating Video Content to Natural Language Descriptions
Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele
Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.
2 0.74466467 264 iccv-2013-Minimal Basis Facility Location for Subspace Segmentation
Author: Choon-Meng Lee, Loong-Fah Cheong
Abstract: In contrast to the current motion segmentation paradigm that assumes independence between the motion subspaces, we approach the motion segmentation problem by seeking the parsimonious basis set that can represent the data. Our formulation explicitly looks for the overlap between subspaces in order to achieve a minimal basis representation. This parsimonious basis set is important for the performance of our model selection scheme because the sharing of basis results in savings of model complexity cost. We propose the use of affinity propagation based method to determine the number of motion. The key lies in the incorporation of a global cost model into the factor graph, serving the role of model complexity. The introduction of this global cost model requires additional message update in the factor graph. We derive an efficient update for the new messages associated with this global cost model. An important step in the use of affinity propagation is the subspace hypotheses generation. We use the row-sparse convex proxy solution as an initialization strategy. We further encourage the selection of subspace hypotheses with shared basis by integrat- ing a discount scheme that lowers the factor graph facility cost based on shared basis. We verified the model selection and classification performance of our proposed method on both the original Hopkins 155 dataset and the more balanced Hopkins 380 dataset.
3 0.7277168 27 iccv-2013-A Robust Analytical Solution to Isometric Shape-from-Template with Focal Length Calibration
Author: Adrien Bartoli, Daniel Pizarro, Toby Collins
Abstract: We study the uncalibrated isometric Shape-fromTemplate problem, that consists in estimating an isometric deformation from a template shape to an input image whose focal length is unknown. Our method is the first that combines the following features: solving for both the 3D deformation and the camera ’s focal length, involving only local analytical solutions (there is no numerical optimization), being robust to mismatches, handling general surfaces and running extremely fast. This was achieved through two key steps. First, an ‘uncalibrated’ 3D deformation is computed thanks to a novel piecewise weak-perspective projection model. Second, the camera’s focal length is estimated and enables upgrading the 3D deformation to metric. We use a variational framework, implemented using a smooth function basis and sampled local deformation models. The only degeneracy which we easily detect– for focal length estimation is a flat and fronto-parallel surface. Experimental results on simulated and real datasets show that our method achieves a 3D shape accuracy – slightly below state of the art methods using a precalibrated or the true focal length, and a focal length accuracy slightly below static calibration methods.
4 0.71792305 41 iccv-2013-Active Learning of an Action Detector from Untrimmed Videos
Author: Sunil Bandla, Kristen Grauman
Abstract: Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the “untrimmed” nature of real video data.
5 0.69858009 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions
Author: Mohamed Elhoseiny, Babak Saleh, Ahmed Elgammal
Abstract: The main question we address in this paper is how to use purely textual description of categories with no training images to learn visual classifiers for these categories. We propose an approach for zero-shot learning of object categories where the description of unseen categories comes in the form of typical text such as an encyclopedia entry, without the need to explicitly defined attributes. We propose and investigate two baseline formulations, based on regression and domain adaptation. Then, we propose a new constrained optimization formulation that combines a regression function and a knowledge transfer function with additional constraints to predict the classifier parameters for new classes. We applied the proposed approach on two fine-grained categorization datasets, and the results indicate successful classifier prediction.
6 0.69191325 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
7 0.68632674 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
8 0.68280506 413 iccv-2013-Target-Driven Moire Pattern Synthesis by Phase Modulation
9 0.67993224 180 iccv-2013-From Where and How to What We See
10 0.67803836 338 iccv-2013-Randomized Ensemble Tracking
11 0.67477691 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
12 0.67191076 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
13 0.67174447 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information
14 0.67172182 328 iccv-2013-Probabilistic Elastic Part Model for Unsupervised Face Detector Adaptation
15 0.67080605 59 iccv-2013-Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation
17 0.67001158 44 iccv-2013-Adapting Classification Cascades to New Domains
18 0.66991329 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
19 0.66980171 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
20 0.66964591 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps