acl acl2013 acl2013-175 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate descriptions of new video.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a method that learns representations for word meanings from short video clips paired with sentences. [sent-3, score-0.75]
2 Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. [sent-4, score-0.585]
3 The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. [sent-6, score-0.876]
4 Language is grounded by mapping words, phrases, and sentences to meaning representations referring to the world. [sent-9, score-0.172]
5 Siskind (1996) has shown that even with referential uncertainty and noise, a system based on cross-situational learning can robustly acquire a lexicon, mapping words to word-level meanings from sentences paired with sentence-level meanings. [sent-10, score-0.313]
6 However, it did so only for symbolic representations of word- and sentence-level meanings that were not perceptually grounded. [sent-11, score-0.212]
7 An ideal system would not require detailed word-level labelings to acquire word meanings from video but rather could learn language in a largely unsupervised fashion, just as a child does, from video paired with sentences. [sent-12, score-1.016]
8 Yu and Ballard (2004) paired training images containing multiple objects with spoken name candidates for the objects to find the correspondence between lexical items and visual features. [sent-16, score-0.325]
9 (2012) present an approach that learns Montague-grammar representations of word meanings together with a combinatory categorial grammar (CCG) from child-directed sentences paired with first-order formulas that represent their meaning. [sent-20, score-0.267]
10 Although most of these methods succeed in learning word meanings from sentential descriptions they do so only for symbolic or simple visual input (often synthesized); they fail to bridge the gap between language and computer vision, i. [sent-21, score-0.25]
11 On the other hand, there has been research on training object and event models from large corpora of complex images and video in the computer-vision community (Kuznetsova et al. [sent-24, score-0.675]
12 (objects delineated via bounding boxes in images as nouns and events that occur in short video clips as verbs). [sent-33, score-0.584]
13 There is no attempt to model phrasal or sentential meaning, let alone acquire the object or event models from training data labeled with phrasal or sentential annotations. [sent-34, score-0.368]
14 In this paper, we present a method that learns representations for word meanings from short video clips paired with sentences. [sent-38, score-0.75]
15 First, our input consists of realistic video filmed in an outdoor environment. [sent-40, score-0.527]
16 Second, we learn the entire lexicon, including nouns, verbs, prepositions, adjectives, and adverbs, simultaneously from video described with whole sentences. [sent-41, score-0.393]
17 Third, we adopt a uniform representation for the meanings of words in all parts of speech, namely Hidden Markov Models (HMMs) whose states and distributions allow for multiple possible interpretations of a word or a sentence in an ambiguous perceptual context. [sent-42, score-0.254]
18 We employ the following representation to ground the meanings of words, phrases, and sentences in video clips. [sent-43, score-0.524]
19 We first run an object detector on each video frame to yield a set of detections, each a subregion of the frame. [sent-44, score-0.801]
20 In principle, the object detector need just detect the objects rather than classify them. [sent-45, score-0.355]
21 In practice, we employ a collection of class-, shape-, pose-, and viewpoint-specific detectors and pool the detections to account for objects whose shape, pose, and viewpoint may vary over time. [sent-46, score-0.488]
22 Our methods can learn to associate a single noun with detections produced by multiple detectors. [sent-47, score-0.352]
23 We then string together detections from individual frames to yield tracks for objects that temporally span the video clip. [sent-48, score-1.069]
24 We also compute features between pairs of tracks to encode the relative position and motion of the pairs of objects that participate in events that involve two participants. [sent-51, score-0.387]
25 Nouns, like person, can be represented as one-state HMMs over image features that correlate with the object classes denoted by those nouns. [sent-58, score-0.257]
26 Adjectives, like red, round, and big, can be represented as one-state HMMs over region color, shape, and size features that correlate with object properties denoted by such adjectives. [sent-59, score-0.207]
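To make the uniform HMM representation concrete, the sketch below models a word meaning as a discrete-observation HMM in Python. The class name, the uniform initialization, and the feature layout are assumptions of this sketch, not the paper's implementation (the paper estimates the HMM parameters from data); a one-state HMM, as used here for nouns and adjectives, reduces to a single per-feature observation distribution.

    import numpy as np

    class WordHMM:
        """Meaning of one word: an HMM over quantized feature observations.
        Illustrative sketch; names and initialization are assumptions."""

        def __init__(self, n_states, n_bins_per_feature):
            self.n_states = n_states
            # uniform log transition matrix and initial state distribution
            self.log_a = np.log(np.full((n_states, n_states), 1.0 / n_states))
            self.log_pi = np.log(np.full(n_states, 1.0 / n_states))
            # per-feature, per-state discrete output distributions
            self.log_b = [np.log(np.full((n_states, k), 1.0 / k))
                          for k in n_bins_per_feature]

        def _log_emit(self, frame_bins):
            # sum of per-feature log probabilities, one entry per state
            return sum(self.log_b[n][:, v] for n, v in enumerate(frame_bins))

        def log_score(self, obs):
            """Viterbi (max-product) log score of a T x N sequence of feature bins."""
            delta = self.log_pi + self._log_emit(obs[0])
            for frame_bins in obs[1:]:
                delta = (delta[:, None] + self.log_a).max(axis=0) + self._log_emit(frame_bins)
            return float(delta.max())

    # A noun such as "person" can be a one-state HMM over, say, a detector-index
    # feature with four possible values; verbs would get more states and motion features.
    person = WordHMM(n_states=1, n_bins_per_feature=[4])
    print(person.log_score([[2], [2], [3]]))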
27 This involves computing the associated feature vector for that HMM over the detections in the tracks chosen to fill its arguments. [sent-73, score-0.542]
28 Thus a sentence like The person to the left of the backpack approached the trash-can would be represented as a conjunction of person(p0), to-the-left-of(p0, p1), backpack(p1), approached(p0, p2), and trash-can(p2) over the three participants p0, p1, and p2. [sent-77, score-0.409]
29 This whole sentence is then grounded in a particular video by mapping these participants to particular tracks and instantiating the associated HMMs over those tracks, by computing the feature vectors for each HMM from the tracks chosen to fill its arguments. [sent-78, score-0.984]
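With each word represented as an HMM, grounding a whole sentence reduces to choosing a participant-to-track assignment and summing the per-word scores computed from the assigned argument tracks. The sketch below is illustrative only; the predicate encoding and the word_features helper are hypothetical, and the paper searches over assignments rather than taking one as given.

    def sentence_log_score(predicates, assignment, tracks, lexicon, word_features):
        """predicates: list of (word, participant indices), e.g. ("approached", (0, 2));
        assignment: maps each participant index to a track id;
        lexicon: maps each word to a WordHMM (see the sketch above);
        word_features(word, arg_tracks): per-frame quantized features for that word."""
        total = 0.0
        for word, args in predicates:
            arg_tracks = [tracks[assignment[p]] for p in args]
            total += lexicon[word].log_score(word_features(word, arg_tracks))
        return total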
30 Second, we pair individual sentences each with a short video clip that depicts that sentence. [sent-81, score-0.527]
31 The algorithm is not able to determine the alignment between multiple sentences and longer video segments. [sent-82, score-0.393]
32 Note that there is no requirement that the video depict only that sentence. [sent-83, score-0.448]
33 Moreover, our algorithm potentially can handle a small amount of noise, where a video clip is paired with an incorrect sentence that the video does not depict. [sent-86, score-1.07]
34 Third, we assume that we already have (pre-trained) low-level object detectors capable of detecting instances of our target event participants in individual frames of the video. [sent-87, score-0.382]
35 We allow such detections to be unreliable; our method can handle a moderate amount of false positives and false negatives. [sent-88, score-0.352]
36 Fifth, we assume that we know the total number of distinct participants that collectively fill all of the arguments for all of the words in each training sentence. [sent-94, score-0.197]
37 For example, for the sentence The person to the left of the backpack approached the trash-can, we assume that we know that there are three distinct objects that participate in the event denoted. [sent-95, score-0.591]
38 It consists of a video clip paired with a sentence, where the arguments of the words in the sentence are mapped to participants. [sent-103, score-0.724]
39 From a sequence of such training samples, our learner determines the object tracks and the mapping from participants to those tracks, together with the meanings of the words. [sent-104, score-0.588]
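The shape of one training sample can be summarized with a small record; the field names below are illustrative assumptions, since the paper does not define such a data structure.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class TrainingSample:
        """One video clip paired with one sentence (hypothetical field names)."""
        detections: List[List[tuple]]                  # per-frame pools of scored detections
        predicates: List[Tuple[str, Tuple[int, ...]]]  # (word, participant indices)
        n_participants: int                            # assumed known for each sentence

    # For "The person to the left of the backpack approached the trash-can":
    # predicates = [("person", (0,)), ("to-the-left-of", (0, 1)), ("backpack", (1,)),
    #               ("approached", (0, 2)), ("trash-can", (2,))], n_participants = 3.
    # The learner must jointly infer the object tracks, the participant-to-track
    # mapping, and the HMM parameters of every word.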
40 Section 3 introduces our work on the sentence tracker, a method for jointly tracking the motion of multiple objects in a video that participate in a sententially-specified event. [sent-107, score-0.638]
41 ,p3) in the event described by the sentence, and each participant can be assigned to any object track in the video. [sent-116, score-0.417]
42 The training corpus is a sequence (D1, . . . , DR) of video clips Dr, each paired with a sentence Sr from a sequence S = (S1, . . . , SR). [sent-126, score-0.701]
43 Let us assume, for a moment, that we can process each video clip Dr to yield a sequence (τr,1, τr,2, . . . ) of object tracks. [sent-138, score-0.648]
44 Let us also assume that Dr is paired with a sentence Sr = The person approached the chair, specified to have two participants, pr,0 and pr,1, with the mapping person(pr,0), chair(pr,1), and approached(pr,0, pr,1). [sent-142, score-0.413]
45 Let us further assume, for a moment, that we are given a mapping from participants to object tracks, say pr,0 → τr,39 and pr,1 → τr,51 . [sent-143, score-0.283]
46 This would allow us to instantiate the HMMs with object tracks for a given video clip: person(τr,39), chair(τr,51), and approached(τr,39, τr,51). [sent-144, score-0.748]
47 Let us further assume that we can score each such instantiated HMM and aggregate the scores for all of the words in a sentence to yield a sentence score and further aggregate the scores for all of the sentences in the corpus to yield a corpus score. [sent-145, score-0.249]
48 The problem is to simultaneously determine (a) those parameters along with (b) the object tracks and (c) the mapping from participants to object tracks. [sent-148, score-0.638]
49 (2012a) presented a method that first determines object tracks from a single video clip and then uses these fixed tracks with HMMs to recognize actions corresponding to verbs and construct sentential descriptions with templates. [sent-151, score-1.147]
50 (2012b) addressed the problem of solving (b) and (c), for a single object track constrained by a single intransitive verb, without solving (a), in the context of a single video clip. [sent-153, score-0.736]
51 Our group has generalized this work to yield an algorithm called the sentence tracker which operates by way of a factorial HMM framework. [sent-154, score-0.276]
52 We run an object detector on each frame to yield a set Drt of detections. [sent-157, score-0.408]
53 Since our object detector is unreliable, we bias it to have high recall but low precision, yielding multiple detections in each frame. [sent-158, score-0.626]
54 We form an object track by selecting a single detection for that track for each frame. [sent-159, score-0.484]
55 For a moment, let us consider a single video clip with length T, with detections Dt in frame t. [sent-160, score-0.96]
56 Further, let us assume that we seek a single object track in that video clip. [sent-161, score-0.728]
57 Let jt denote the index of the detection from Dt in frame t that is selected to form the track. [sent-162, score-0.283]
58 Moreover, we wish the track to be temporally coherent; we want the objects in a track to move smoothly over time and not jump around the field of view. [sent-165, score-0.38]
59 Let G(Dt−1 ,jt−1 , Dt, jt) denote some measure of coherence between two detections in adjacent frames. [sent-166, score-0.414]
60 One can select the detections to yield a track that maximizes both the aggregate detection score and the aggregate temporal coherence score. [sent-168, score-0.595]
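Maximizing the aggregate detection score plus the aggregate temporal-coherence score over the per-frame detection pools is a standard Viterbi-style dynamic program; a minimal sketch follows, with F and G passed in as callables standing for the scores just described (the function itself is an illustration, not the paper's code).

    def best_track(detections, F, G):
        """Pick one detection index per frame to maximize
        sum_t F(d_t) + sum_{t>1} G(d_{t-1}, d_t)."""
        score = [F(d) for d in detections[0]]
        backpointers = []
        for t in range(1, len(detections)):
            new_score, bp = [], []
            for d in detections[t]:
                prev = max(range(len(detections[t - 1])),
                           key=lambda i: score[i] + G(detections[t - 1][i], d))
                bp.append(prev)
                new_score.append(score[prev] + G(detections[t - 1][prev], d) + F(d))
            score, backpointers = new_score, backpointers + [bp]
        track = [max(range(len(score)), key=lambda i: score[i])]
        for bp in reversed(backpointers):
            track.append(bp[track[-1]])
        return list(reversed(track))   # chosen detection index j_t for each frame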
61 A given video clip may depict multiple objects, each moving along its own trajectory. [sent-181, score-0.62]
62 The key insight of the sentence tracker is to bias the selection of a track so that it matches an HMM. [sent-184, score-0.266]
63 This jointly selects the optimal detections to form the track, together with the optimal state sequence, and scores that combination. [sent-190, score-0.394]
64 maxj,q Σt=1..T [F(Dt, jt) + B(Dt, jt, qt, λ)] + Σt=2..T [G(Dt−1, jt−1, Dt, jt) + A(qt−1, qt)] (3) While we formulated the above around a single track and a word that contains a single participant, it is straightforward to extend this so that it supports multiple tracks and words of higher arity by forming a larger cross product. [sent-192, score-0.374]
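The cross product mentioned above can be sketched for the simplest case (one track, one single-participant word): the dynamic program of the previous sketch is run over joint nodes (j_t, q_t), so the detection choice and the HMM state are optimized together. F, G, A, B are placeholders for the detection, coherence, transition, and emission scores; this is an illustration of the idea behind (3), not the paper's implementation.

    def sentence_tracker_step(detections, n_states, F, G, A, B):
        """Maximize sum_t [F(d_t) + B(d_t, q_t)]
                   + sum_{t>1} [G(d_{t-1}, d_t) + A(q_{t-1}, q_t)]
        jointly over detection choices j_t and HMM states q_t."""
        score = {(j, q): F(detections[0][j]) + B(detections[0][j], q)
                 for j in range(len(detections[0])) for q in range(n_states)}
        backpointers = []
        for t in range(1, len(detections)):
            new_score, bp = {}, {}
            for j, d in enumerate(detections[t]):
                for q in range(n_states):
                    prev = max(score, key=lambda jq: score[jq]
                               + G(detections[t - 1][jq[0]], d) + A(jq[1], q))
                    bp[(j, q)] = prev
                    new_score[(j, q)] = (score[prev]
                                         + G(detections[t - 1][prev[0]], d)
                                         + A(prev[1], q) + F(d) + B(d, q))
            score, backpointers = new_score, backpointers + [bp]
        best = max(score, key=score.get)
        path = [best]
        for bp in reversed(backpointers):
            path.append(bp[path[-1]])
        return list(reversed(path)), score[best]   # [(j_t, q_t)] per frame, joint score

For multiple tracks and higher-arity words, the node (j_t, q_t) becomes a tuple of detection indices and one state per word, which is the larger cross product referred to in the text.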
65 When doing so, we generalize jt to denote a sequence of detections from Dt, one for each of the tracks. [sent-193, score-0.561]
66 When doing so, we generalize qt to denote a sequence (qt1, . . . , qtL) of states qtl, one for each word l in the sentence, [sent-204, score-0.212]
67 and use ql to denote the sequence (q1l, . . . , qTl). [sent-207, score-0.2]
68 This allows selection of an optimal sequence of tracks that yields the highest score for the sentential meaning of a sequence of words. [sent-215, score-0.444]
69 Modeling the meaning of a sentence through a sequence of words whose meanings are modeled by HMMs defines a factorial HMM for that sentence, since the overall Markov process for that sentence can be factored into independent component processes (Brand et al. [sent-216, score-0.43]
70 In this view, q denotes the state sequence for the combined factorial HMM and ql denotes the factor of that state sequence for word l. [sent-218, score-0.34]
71 4 Detailed Problem Formulation We adapt the sentence tracker to train on a corpus of R video clips, each paired with a sentence. [sent-221, score-0.629]
72 Thus we augment our notation, generalizing jt to jtr and qlt to qrt,l. [sent-222, score-0.213]
73 Let Nc denote the length of the feature vector for part of speech c, xr,l denote the time-series (x1r,l, . . . , xTr,l) [sent-242, score-0.181]
74 of feature vectors xrt,l associated with Sr,l (which, recall, is some entry m), and xr denote the sequence (xr,1, . . . ). [sent-245, score-0.237]
75 Like before, we could have a distinct Im for each entry m but instead have ICm depend only on the part of speech Cm of entry m, and assume that we know the fixed Ic for each part of speech c. [sent-258, score-0.283]
76 Towards this end, we define the score of a video clip Dr paired with sentence Sr given the parameter set λ to characterize how well this training sample is explained. [sent-261, score-0.677]
77 By definition, P(jr|Dr) = V(Dr, jr) / Σjr′ V(Dr, jr′), where the numerator is the score of a particular track sequence jr while the denominator sums the scores over all possible track sequences. [sent-267, score-0.249]
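In log space this normalization is a logsumexp; a minimal sketch follows (illustrative only: in practice the denominator is computed by dynamic programming over the tracking lattice rather than by enumerating track sequences).

    import math

    def track_posteriors(log_scores):
        """Given log V(Dr, jr) for candidate track sequences jr, return
        P(jr | Dr) = V(Dr, jr) / sum over jr' of V(Dr, jr')."""
        m = max(log_scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in log_scores))
        return [math.exp(s - log_z) for s in log_scores]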
78 Similarly, when counting the frequency of being at state i while observing h as the nth feature in frame t for the lth word of entry m, i. [sent-296, score-0.195]
79 Acquisition of a word meaning is driven across sentences by entries that appear in more than one training sample and within sentences by the requirement that the meanings of all of the individual words in a sentence be consistent with the collective sentential meaning. [sent-305, score-0.3]
80 6 Experiment We filmed 61 video clips (each 3–5 seconds at 640×480 resolution and 40 fps) that depict a variety of different compound events. [sent-306, score-0.605]
81 These clips were filmed in three different outdoor environments which we use for cross validation. [sent-310, score-0.224]
82 We manually annotated each video with several sentences that describe what occurs in that video. [sent-311, score-0.393]
83 We use an off-the-shelf object detector (Felzenszwalb et al. [sent-318, score-0.274]
84 , 2010b) which outputs detections in the form of scored axis-aligned rectangles. [sent-320, score-0.352]
85 We trained four object detectors, one for each of the four object classes. (Our code, videos, and sentential annotations are available at http://haonanyu.) [sent-321, score-0.405]
86 Table 2: Arguments and model configurations for different parts of speech c, listing for each part of speech (N, V, P, ADV, PM) its arity, its number of states Ic, and its feature set Φc drawn from detector index, VEL MAG, VEL ORIENT, α-β DIST, α-β size ratio, and α-β x-position. [sent-323, score-0.257]
87 For each frame, we pick the two highest-scoring detections produced by each object detector and pool the results, yielding eight detections per frame. [sent-326, score-0.978]
88 Having a larger pool of detections per frame can better compensate for false negatives in the object detection and potentially yield smoother tracks but it increases the size of the lattice and the concomitant running time and does not lead to appreciably better performance on our corpus. [sent-327, score-0.902]
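The pooling step itself is simple to sketch; the detection format (score, box) is an assumption of this illustration.

    def pool_detections(per_detector_detections, top_k=2):
        """Keep the top_k highest-scoring detections from each detector and pool them;
        with four detectors and top_k=2 this yields at most eight detections per frame."""
        pooled = []
        for dets in per_detector_detections:
            pooled.extend(sorted(dets, key=lambda d: d[0], reverse=True)[:top_k])
        return pooled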
89 We compute continuous features, such as velocity, distance, size ratio, and x-position, solely from the detection rectangles and quantize the features into bins as follows. Velocity: to reduce noise, we compute the velocity of a participant by averaging the optical flow in the detection rectangle. [sent-328, score-0.428]
90 The velocity magnitude is quantized into 5 levels: absolutely stationary, stationary, moving, fast moving, and quickly. [sent-329, score-0.2]
91 The velocity orientation is quantized into 4 directions: left, up, right, and down. [sent-330, score-0.2]
92 Size ratio: we compute the ratio of the detection area of the first participant to that of the second participant, quantized into 2 possibilities: larger or smaller. [sent-332, score-0.274]
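The quantization described in the preceding items amounts to simple thresholding; the bin boundaries below are placeholders, since the paper states only the number of bins, not their boundaries.

    def quantize_velocity_magnitude(v, thresholds=(0.5, 2.0, 5.0, 10.0)):
        """Map a velocity magnitude to one of 5 levels (threshold values assumed)."""
        return sum(v > t for t in thresholds)      # 0..4

    def quantize_velocity_orientation(dx, dy):
        """Map a velocity vector to one of 4 directions: 0=left, 1=up, 2=right, 3=down
        (image coordinates, with y growing downward)."""
        if abs(dx) >= abs(dy):
            return 0 if dx < 0 else 2
        return 1 if dy < 0 else 3

    def quantize_size_ratio(area_first, area_second):
        """2 possibilities: 1 if the first participant's detection is larger, else 0."""
        return int(area_first > area_second)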
93 We perform a three-fold cross-validation, taking the test data for each fold to be the videos filmed in a given outdoor environment and the training data for that fold to be all training samples that contain the other videos. [sent-340, score-0.178]
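The split is leave-one-environment-out; a minimal sketch, assuming each sample records the environment in which it was filmed:

    def environment_folds(samples, environments=("env-1", "env-2", "env-3")):
        """Three-fold cross-validation by outdoor environment
        (`sample.environment` and the labels are assumed attributes)."""
        for env in environments:
            test = [s for s in samples if s.environment == env]
            train = [s for s in samples if s.environment != env]
            yield train, test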
94 We score each testing video paired with every sentence in both NV and ALL. [sent-344, score-0.543]
95 However, the score depends on the sentence length, the collective numbers of states and features in the HMMs for words in that sentence, and the length of the video clip. [sent-348, score-0.48]
96 7 Conclusion We presented a method that learns word meanings from video paired with sentences. [sent-374, score-0.623]
97 Unlike prior work, our method deals with realistic video scenes labeled with whole sentences, not individual words labeling hand-delineated objects or events. [sent-375, score-0.474]
98 We believe that this will allow learning larger lexicons from more complex video without excessive amounts of training data. [sent-380, score-0.393]
99 A. An Upper Bound on the F1 Score of any Blind Method. A Blind algorithm makes identical decisions on the same sentence paired with different video clips. [sent-388, score-0.543]
100 Learning to talk about events from narrated video in a construction grammar framework. [sent-479, score-0.43]
wordName wordTfidf (topN-words)
[('video', 0.393), ('detections', 0.352), ('hmms', 0.251), ('tracks', 0.19), ('object', 0.165), ('hmm', 0.135), ('clip', 0.134), ('dr', 0.134), ('baum', 0.134), ('meanings', 0.131), ('sr', 0.131), ('track', 0.129), ('backpack', 0.118), ('velocity', 0.118), ('detector', 0.109), ('paired', 0.099), ('approached', 0.098), ('chair', 0.094), ('clips', 0.09), ('tracker', 0.086), ('factorial', 0.086), ('icm', 0.084), ('qt', 0.082), ('quantized', 0.082), ('frame', 0.081), ('objects', 0.081), ('jt', 0.079), ('blind', 0.079), ('sentential', 0.075), ('person', 0.074), ('siskind', 0.074), ('dt', 0.074), ('entry', 0.072), ('participant', 0.07), ('sequence', 0.068), ('participants', 0.068), ('filmed', 0.067), ('jrt', 0.067), ('jtr', 0.067), ('outdoor', 0.067), ('qlt', 0.067), ('images', 0.064), ('denote', 0.062), ('jr', 0.061), ('detection', 0.061), ('vision', 0.06), ('felzenszwalb', 0.059), ('vel', 0.059), ('speech', 0.057), ('detectors', 0.055), ('depict', 0.055), ('arity', 0.055), ('nv', 0.055), ('event', 0.053), ('yield', 0.053), ('barbu', 0.051), ('sentence', 0.051), ('prepositions', 0.05), ('dinkelbach', 0.05), ('mag', 0.05), ('qrt', 0.05), ('image', 0.05), ('mapping', 0.05), ('intransitive', 0.049), ('ieee', 0.048), ('arguments', 0.047), ('motion', 0.045), ('ordonez', 0.044), ('symbolic', 0.044), ('videos', 0.044), ('meaning', 0.043), ('correlate', 0.042), ('grounded', 0.042), ('state', 0.042), ('know', 0.041), ('assume', 0.041), ('markov', 0.041), ('jump', 0.041), ('reestimation', 0.041), ('log', 0.041), ('cm', 0.038), ('moving', 0.038), ('events', 0.037), ('berg', 0.037), ('representations', 0.037), ('states', 0.036), ('parts', 0.036), ('xr', 0.035), ('transitive', 0.035), ('computes', 0.034), ('participate', 0.034), ('tracking', 0.034), ('ql', 0.034), ('backback', 0.033), ('crosssituational', 0.033), ('dominey', 0.033), ('drt', 0.033), ('haonan', 0.033), ('leftward', 0.033), ('michaux', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 175 acl-2013-Grounded Language Learning from Video Described with Sentences
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate descriptions of new video.
2 0.13498825 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis
Author: Veronica Perez-Rosas ; Rada Mihalcea ; Louis-Philippe Morency
Abstract: During real-life interactions, people are naturally gesturing and modulating their voice to emphasize specific points or to express their emotions. With the recent growth of social websites such as YouTube, Facebook, and Amazon, video reviews are emerging as a new source of multimodal and natural opinions that has been left almost untapped by automatic opinion analysis techniques. This paper presents a method for multimodal sentiment classification, which can identify the sentiment expressed in utterance-level visual datastreams. Using a new multimodal dataset consisting of sentiment annotated utterances extracted from video reviews, we show that multimodal sentiment analysis can be effectively performed, and that the joint use of visual, acoustic, and linguistic modalities can lead to error rate reductions of up to 10.5% as compared to the best performing individual modality.
3 0.10703844 311 acl-2013-Semantic Neighborhoods as Hypergraphs
Author: Chris Quirk ; Pallavi Choudhury
Abstract: Ambiguity preserving representations such as lattices are very useful in a number of NLP tasks, including paraphrase generation, paraphrase recognition, and machine translation evaluation. Lattices compactly represent lexical variation, but word order variation leads to a combinatorial explosion of states. We advocate hypergraphs as compact representations for sets of utterances describing the same event or object. We present a method to construct hypergraphs from sets of utterances, and evaluate this method on a simple recognition task. Given a set of utterances that describe a single object or event, we construct such a hypergraph, and demonstrate that it can recognize novel descriptions of the same event with high accuracy.
4 0.10333509 249 acl-2013-Models of Semantic Representation with Visual Attributes
Author: Carina Silberer ; Vittorio Ferrari ; Mirella Lapata
Abstract: We consider the problem of grounding the meaning of words in the physical world and focus on the visual modality which we represent by visual attributes. We create a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We show that these bimodal models give a better fit to human word association data compared to amodal models and word representations based on handcrafted norming data.
5 0.095716938 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus
Author: Polina Kuznetsova ; Vicente Ordonez ; Alexander Berg ; Tamara Berg ; Yejin Choi
Abstract: The ever growing amount of web images and their associated texts offers new opportunities for integrative models bridging natural language processing and computer vision. However, the potential benefits of such data are yet to be fully realized due to the complexity and noise in the alignment between image content and text. We address this challenge with contributions in two folds: first, we introduce the new task of image caption generalization, formulated as visually-guided sentence compression, and present an efficient algorithm based on dynamic beam search with dependency-based constraints. Second, we release a new large-scale corpus with 1 million image-caption pairs achieving tighter content alignment between images and text. Evaluation results show the intrinsic quality of the generalized captions and the extrinsic utility of the new imagetext parallel corpus with respect to a concrete application of image caption transfer.
6 0.088064119 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers
7 0.085213855 380 acl-2013-VSEM: An open library for visual semantics representation
8 0.072681978 62 acl-2013-Automatic Term Ambiguity Detection
9 0.071205512 224 acl-2013-Learning to Extract International Relations from Political Context
10 0.070481896 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
11 0.068682708 265 acl-2013-Outsourcing FrameNet to the Crowd
12 0.067520618 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model
13 0.066980131 217 acl-2013-Latent Semantic Matching: Application to Cross-language Text Categorization without Alignment Information
14 0.064257436 318 acl-2013-Sentiment Relevance
15 0.064047433 56 acl-2013-Argument Inference from Relevant Event Mentions in Chinese Argument Extraction
16 0.061523031 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
17 0.061280914 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
18 0.059891477 310 acl-2013-Semantic Frames to Predict Stock Price Movement
19 0.056601014 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic
20 0.05600768 206 acl-2013-Joint Event Extraction via Structured Prediction with Global Features
topicId topicWeight
[(0, 0.173), (1, 0.035), (2, -0.016), (3, -0.055), (4, -0.056), (5, -0.026), (6, 0.056), (7, 0.018), (8, -0.035), (9, 0.077), (10, -0.122), (11, -0.126), (12, -0.006), (13, 0.017), (14, 0.049), (15, -0.032), (16, -0.02), (17, 0.018), (18, 0.044), (19, -0.042), (20, 0.014), (21, -0.059), (22, 0.03), (23, 0.012), (24, 0.029), (25, -0.029), (26, -0.088), (27, 0.01), (28, -0.002), (29, 0.03), (30, 0.025), (31, 0.01), (32, 0.021), (33, -0.0), (34, -0.001), (35, -0.062), (36, -0.052), (37, 0.017), (38, -0.033), (39, 0.051), (40, -0.025), (41, -0.041), (42, 0.007), (43, -0.037), (44, 0.007), (45, 0.103), (46, -0.002), (47, 0.074), (48, -0.04), (49, -0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.89828306 175 acl-2013-Grounded Language Learning from Video Described with Sentences
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate descriptions of new video.
2 0.65205592 321 acl-2013-Sign Language Lexical Recognition With Propositional Dynamic Logic
Author: Arturo Curiel ; Christophe Collet
Abstract: This paper explores the use of Propositional Dynamic Logic (PDL) as a suitable formal framework for describing Sign Language (SL), the language of deaf people, in the context of natural language processing. SLs are visual, complete, standalone languages which are just as expressive as oral languages. Signs in SL usually correspond to sequences of highly specific body postures interleaved with movements, which make reference to real world objects, characters or situations. Here we propose a formal representation of SL signs that will help us with the analysis of automatically-collected hand tracking data from French Sign Language (FSL) video corpora. We further show how such a representation could help us with the design of computer aided SL verification tools, which in turn would bring us closer to the development of an automatic recognition system for these languages.
3 0.64690995 249 acl-2013-Models of Semantic Representation with Visual Attributes
Author: Carina Silberer ; Vittorio Ferrari ; Mirella Lapata
Abstract: We consider the problem of grounding the meaning of words in the physical world and focus on the visual modality which we represent by visual attributes. We create a new large-scale taxonomy of visual attributes covering more than 500 concepts and their corresponding 688K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We show that these bimodal models give a better fit to human word association data compared to amodal models and word representations based on handcrafted norming data.
4 0.63210064 380 acl-2013-VSEM: An open library for visual semantics representation
Author: Elia Bruni ; Ulisse Bordignon ; Adam Liska ; Jasper Uijlings ; Irina Sergienya
Abstract: VSEM is an open library for visual semantics. Starting from a collection of tagged images, it is possible to automatically construct an image-based representation of concepts by using off-the-shelf VSEM functionalities. VSEM is entirely written in MATLAB and its object-oriented design allows a large flexibility and reusability. The software is accompanied by a website with supporting documentation and examples.
5 0.61855775 370 acl-2013-Unsupervised Transcription of Historical Documents
Author: Taylor Berg-Kirkpatrick ; Greg Durrett ; Dan Klein
Abstract: We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.
6 0.60693115 384 acl-2013-Visual Features for Linguists: Basic image analysis techniques for multimodally-curious NLPers
7 0.60267138 224 acl-2013-Learning to Extract International Relations from Political Context
8 0.5932737 349 acl-2013-The mathematics of language learning
9 0.59263241 311 acl-2013-Semantic Neighborhoods as Hypergraphs
10 0.58991838 167 acl-2013-Generalizing Image Captions for Image-Text Parallel Corpus
11 0.56850886 149 acl-2013-Exploring Word Order Universals: a Probabilistic Graphical Model Approach
12 0.55063534 212 acl-2013-Language-Independent Discriminative Parsing of Temporal Expressions
13 0.54500341 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics
14 0.54334778 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
15 0.54104984 228 acl-2013-Leveraging Domain-Independent Information in Semantic Parsing
16 0.53048557 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars
17 0.52663976 364 acl-2013-Typesetting for Improved Readability using Lexical and Syntactic Information
18 0.51285851 379 acl-2013-Utterance-Level Multimodal Sentiment Analysis
19 0.5030539 213 acl-2013-Language Acquisition and Probabilistic Models: keeping it simple
20 0.4921461 29 acl-2013-A Visual Analytics System for Cluster Exploration
topicId topicWeight
[(0, 0.051), (6, 0.052), (11, 0.045), (15, 0.012), (21, 0.194), (24, 0.034), (26, 0.034), (28, 0.018), (35, 0.107), (42, 0.046), (48, 0.068), (64, 0.013), (70, 0.069), (82, 0.031), (88, 0.03), (90, 0.024), (95, 0.07)]
simIndex simValue paperId paperTitle
1 0.92354763 293 acl-2013-Random Walk Factoid Annotation for Collective Discourse
Author: Ben King ; Rahul Jha ; Dragomir Radev ; Robert Mankoff
Abstract: In this paper, we study the problem of automatically annotating the factoids present in collective discourse. Factoids are information units that are shared between instances of collective discourse and may have many different ways ofbeing realized in words. Our approach divides this problem into two steps, using a graph-based approach for each step: (1) factoid discovery, finding groups of words that correspond to the same factoid, and (2) factoid assignment, using these groups of words to mark collective discourse units that contain the respective factoids. We study this on two novel data sets: the New Yorker caption contest data set, and the crossword clues data set.
same-paper 2 0.84964073 175 acl-2013-Grounded Language Learning from Video Described with Sentences
Author: Haonan Yu ; Jeffrey Mark Siskind
Abstract: We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence between words and concepts in the video is learned in an unsupervised fashion, even when the video depicts simultaneous events described by multiple sentences or when different aspects of a single event are described with multiple sentences. The learned word meanings can be subsequently used to automatically generate descriptions of new video.
3 0.84036416 352 acl-2013-Towards Accurate Distant Supervision for Relational Facts Extraction
Author: Xingxing Zhang ; Jianwen Zhang ; Junyu Zeng ; Jun Yan ; Zheng Chen ; Zhifang Sui
Abstract: Distant supervision (DS) is an appealing learning method which learns from existing relational facts to extract more from a text corpus. However, the accuracy is still not satisfying. In this paper, we point out and analyze some critical factors in DS which have great impact on accuracy, including valid entity type detection, negative training examples construction and ensembles. We propose an approach to handle these factors. By experimenting on Wikipedia articles to extract the facts in Freebase (the top 92 relations), we show the impact of these three factors on the accuracy of DS and the remarkable improvement led by the proposed approach.
4 0.83384627 282 acl-2013-Predicting and Eliciting Addressee's Emotion in Online Dialogue
Author: Takayuki Hasegawa ; Nobuhiro Kaji ; Naoki Yoshinaga ; Masashi Toyoda
Abstract: While there have been many attempts to estimate the emotion of an addresser from her/his utterance, few studies have explored how her/his utterance affects the emotion of the addressee. This has motivated us to investigate two novel tasks: predicting the emotion of the addressee and generating a response that elicits a specific emotion in the addressee’s mind. We target Japanese Twitter posts as a source of dialogue data and automatically build training data for learning the predictors and generators. The feasibility of our approaches is assessed by using 1099 utterance-response pairs that are built by . five human workers.
5 0.76771706 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
Author: Akihiro Tamura ; Taro Watanabe ; Eiichiro Sumita ; Hiroya Takamura ; Manabu Okumura
Abstract: This paper proposes a nonparametric Bayesian method for inducing Part-of-Speech (POS) tags in dependency trees to improve the performance of statistical machine translation (SMT). In particular, we extend the monolingual infinite tree model (Finkel et al., 2007) to a bilingual scenario: each hidden state (POS tag) of a source-side dependency tree emits a source word together with its aligned target word, either jointly (joint model), or independently (independent model). Evaluations of Japanese-to-English translation on the NTCIR-9 data show that our induced Japanese POS tags for dependency trees improve the performance of a forest-to-string SMT system. Our independent model gains over 1 point in BLEU by resolving the sparseness problem introduced in the joint model.
6 0.71426284 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
7 0.68227893 17 acl-2013-A Random Walk Approach to Selectional Preferences Based on Preference Ranking and Propagation
9 0.68162692 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
10 0.67819554 275 acl-2013-Parsing with Compositional Vector Grammars
11 0.67539531 283 acl-2013-Probabilistic Domain Modelling With Contextualized Distributional Semantic Vectors
12 0.67129624 249 acl-2013-Models of Semantic Representation with Visual Attributes
13 0.67048514 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
14 0.67043608 172 acl-2013-Graph-based Local Coherence Modeling
15 0.6681391 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
16 0.66776544 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval
17 0.6676954 224 acl-2013-Learning to Extract International Relations from Political Context
18 0.6676181 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
19 0.66736931 215 acl-2013-Large-scale Semantic Parsing via Schema Matching and Lexicon Extension
20 0.6657216 46 acl-2013-An Infinite Hierarchical Bayesian Model of Phrasal Translation