emnlp emnlp2011 emnlp2011-34 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yezhou Yang ; Ching Teo ; Hal Daume III ; Yiannis Aloimonos
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions directly from still images is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
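The abstract frames sentence generation as decoding an HMM. The sketch below is an illustration, not the authors' implementation. It assumes a left-to-right chain START → NN → NV → S → P and uses the terms named in the extracted sentences:

- a uniform start over |N|·(|N|+1) noun pairs;
- noun emissions Pr(n1|I)Pr(n2|I);
- a verb term Pr(v|n1, n2) from the language model;
- a scene transition Pr(s|n, v) with emission Pr(s|I);
- Pr(p|s) for prepositions.

The function name `viterbi_quadruplet`, the callable interfaces, and the `null_prob` emission factor for a NULL second object are hypothetical choices made for this example.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """Log-probability that maps zero to -inf so impossible paths never win."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_quadruplet(p_noun, p_scene, verbs, preps,
                       p_verb, p_scene_lm, p_prep, null_prob=1.0):
    """Return the best (n1, n2, v, s, p) and its log-probability.

    p_noun:     dict noun -> Pr(n|I) from the object detectors
    p_scene:    dict scene -> Pr(s|I) from the scene classifiers
    p_verb:     callable (pair, v) -> Pr(v|n1, n2), language-model estimate
    p_scene_lm: callable (pair, v, s) -> Pr(s|n, v), corpus estimate
    p_prep:     callable (s, p) -> Pr(p|s), corpus estimate
    null_prob:  hypothetical emission factor for a NULL second object
    """
    nouns = list(p_noun)
    # NN layer: n1 in N, n2 in N or NULL; uniform start 1 / (|N| * (|N| + 1)).
    pairs = [(n1, n2) for n1 in nouns for n2 in nouns + [None]]
    log_start = -math.log(len(pairs))
    nn = {}
    for n1, n2 in pairs:
        emit = p_noun[n1] * (null_prob if n2 is None else p_noun[n2])
        nn[(n1, n2)] = log_start + safe_log(emit)

    # NV layer: each state remembers its noun pair, so it has one predecessor.
    nv = {(pair, v): nn[pair] + safe_log(p_verb(pair, v))
          for pair in pairs for v in verbs}

    # S layer: best NV predecessor per scene, plus the scene emission Pr(s|I).
    s_score, s_back = {}, {}
    for s in p_scene:
        def via(key, s=s):
            return nv[key] + safe_log(p_scene_lm(key[0], key[1], s))
        best = max(nv, key=via)
        s_score[s] = via(best) + safe_log(p_scene[s])
        s_back[s] = best

    # P layer: prepositions emit nothing from the image, only Pr(p|s).
    p_score, p_back = {}, {}
    for p in preps:
        best = max(s_score, key=lambda s: s_score[s] + safe_log(p_prep(s, p)))
        p_score[p] = s_score[best] + safe_log(p_prep(best, p))
        p_back[p] = best

    # Termination and backtracking.
    p_star = max(p_score, key=p_score.get)
    s_star = p_back[p_star]
    (n1, n2), v_star = s_back[s_star]
    return (n1, n2, v_star, s_star, p_star), p_score[p_star]


# Toy inputs (made up for illustration only).
VERB_LM = {"walk": 0.7, "swim": 0.3}
SCENE_LM = {("walk", "street"): 0.8, ("walk", "sea"): 0.2,
            ("swim", "street"): 0.1, ("swim", "sea"): 0.9}
PREP_LM = {("street", "on"): 0.7, ("street", "in"): 0.3,
           ("sea", "on"): 0.4, ("sea", "in"): 0.6}

best, logp = viterbi_quadruplet(
    p_noun={"dog": 0.9, "person": 0.8},
    p_scene={"street": 0.6, "sea": 0.4},
    verbs=["walk", "swim"],
    preps=["on", "in"],
    p_verb=lambda pair, v: VERB_LM[v],
    p_scene_lm=lambda pair, v, s: SCENE_LM[(v, s)],
    p_prep=lambda s, p: PREP_LM[(s, p)],
)
print(best)  # ('dog', None, 'walk', 'street', 'on')
```

Each NV state carries its noun pair, so only the S and P layers need a max over predecessors. This is cheaper than scoring every quadruplet separately. With `null_prob=1.0`, single-object pairs are never penalized relative to two-object pairs. How the paper actually scores NULL is not stated in the extracted text.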
Reference: text
sentIndex sentText sentNum sentScore
1 The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. [sent-2, score-0.846]
2 As predicting actions directly from still images is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. [sent-3, score-0.878]
3 We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. [sent-4, score-0.659]
4 Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone. [sent-5, score-0.238]
5 This description of an image is the output of an extremely complex process that involves: 1) perception in the Visual space, 2) grounding to World Knowledge in the Language Space and 3) speech/text production (see Fig. [sent-8, score-0.396]
6 Our hypothesis is based on the assumption that natural images accurately reflect common everyday scenarios which are captured in language. [sent-16, score-0.456]
7 For example, knowing that boats usually occur over water will enable us to constrain the possible scenes in which a boat can occur and exclude highly unlikely ones such as street or highway. [sent-17, score-0.408]
8 It also enables us to predict likely actions (Verbs) given the current object detections in the image: detecting a dog with a person will likely induce walk rather than swim, jump or fly. [sent-18, score-0.572]
9 Key to our approach is the use of a large generic corpus such as the English Gigaword [Graff, 2003] as the semantic grounding to predict and correct the initial and often noisy visual detections of an image to produce a reasonable sentence that succinctly describes the image. [sent-19, score-0.692]
10 (Figure caption, partially garbled) Perceptual challenges for ...: (a) Different images with ...; (b) Pose relates ambigu... See text for details. [sent-25, score-0.42]
11 Based on our observations of annotated image data (see Fig. [sent-27, score-0.315]
12 The key challenge is that detecting objects, actions and scenes directly from images is often noisy and unreliable. [sent-32, score-0.838]
13 We illustrate this using example images from the Pascal-Visual Object Classes (VOC) 2008 challenge [Everingham et al. [sent-33, score-0.42]
14 Fig. 2(a) shows the variability of images in their raw image representations: pixels, edges and local features. [sent-36, score-0.735]
15 , 2009] to reliably detect important objects in the scene: boat, humans and water. Average precision scores reported in [Felzenszwalb et al. [sent-39, score-0.307]
16 , 2010] are around 42% for humans and only 11% for boat over a dataset of almost 5000 images in 20 object categories. [sent-40, score-0.709]
17 Yet, these images are semantically similar in terms of their high level description. [sent-41, score-0.42]
18 Clearly, this assumption is weak because 1) similar actions may be represented by different poses due to the inherent dynamic nature of the action itself: e.g. [sent-44, score-0.23]
19 walking a dog, and 2) different actions may have the same pose: e.g. [sent-46, score-0.227]
20 , 2010] that used poses for recognition of actions achieved 70% and 61% accuracy respectively under extremely limited testing conditions with only 5-6 action classes each. [sent-52, score-0.265]
21 Finally, state-of-the-art scene detectors [Oliva and Torralba, 2001; Torralba et al. [sent-53, score-0.385]
22 , 2003] need to have enough representative training examples of scenes from pre-defined scene classes for a classification to be successful with a reported average precision of 83. [sent-54, score-0.641]
23 Our focus instead is to show that with the addition of language to ground the noisy initial visual detections, we are able to improve the quality of the generated sentence as a faithful description of the image. [sent-57, score-0.214]
24 In particular, we show that it is possible to avoid predicting actions directly from images, which is still unreliable, and to use the corpus instead to guide our predictions. [sent-58, score-0.584]
25 Our proposed strategy is also generic, that is, we make no prior assumptions on the image domain considered. [sent-59, score-0.35]
26 2) depend on strong annotations between images and text to ground their predictions (and to remove wrong sentences), we show that a large generic corpus is also able to provide the same grounding over larger domains of images. [sent-61, score-0.507]
27 Here, we do not require “labeled” data containing images and captions but only separate data from each side. [sent-64, score-0.451]
28 Another contribution is a computationally feasible way, via dynamic programming, to determine the most likely quadruplet T∗ = {n∗, v∗, s∗, p∗} that describes the image, for generating possible sentences. [sent-65, score-0.407]
29 2 Related Work. Recently, several works from the Computer Vision domain have attempted to use language to aid image scene understanding. [sent-66, score-0.658]
30 , 2009] extended this work to associate poses detected from images with the verbs in the captions. [sent-71, score-0.558]
31 Both approaches use annotated examples from a limited news caption corpus to learn a joint image-text model so that one can annotate new unknown images with textual information easily. [sent-72, score-0.42]
32 Neither of these works has been tested on complex everyday images, where the large variations in objects and poses make it nearly impossible to learn a more general model. [sent-73, score-0.716]
33 , 2010] attempts to “generate” sentences by first learning from a set of human annotated examples, and producing the same sentence if both images and sentence share common properties in terms of their triplets: (Nouns-Verbs-Scenes). [sent-76, score-0.517]
34 No attempt was made to generate novel sentences from images beyond what has been annotated by humans. [sent-77, score-0.449]
35 The input is a test image where we detect objects and scenes using trained detection algorithms [Felzenszwalb et al. [sent-91, score-0.834]
36 To keep the framework computationally tractable, we limit the elements of the quadruplet (Nouns-Verbs-Scenes-Prepositions) to come from finite sets of object (N), action (V), scene (S) and preposition (P) classes that are commonly encountered. [sent-94, score-0.767]
37 In addition, the sentence that is generated for each image is limited to at most two objects occurring in a unique scene. [sent-97, score-0.577]
38 Denoting the current test image as I, the initial visual processing first detects objects n ∈ N and scenes s ∈ S using these detectors to compute Pr(n|I) and Pr(s|I), the probabilities that object n and scene s exist under I. [sent-102, score-1.435]
39 Lm is also used to compute Pr(s|n, v), the predicted scene using the corpus given the object and verb, and Pr(p|s), the predicted preposition given the scene. [sent-104, score-0.611]
40 In the following sections, we first introduce the image dataset used for testing, followed by details of how these components are derived. [sent-107, score-0.376]
41 It contains 1000 images taken from a subset of the Pascal-VOC 2008 challenge image dataset, hand-annotated by paid human annotators on Amazon Mechanical Turk with sentences that describe each image. [sent-114, score-1.112]
42 We randomly selected 900 images (4500 sentences) as the learning corpus to construct the verb and scene sets {V, S}, as described in Sec. [sent-118, score-0.801]
43 , 2008] of 20 common everyday object classes that are defined in N. [sent-123, score-0.226]
44 Each detector is trained on a large number of the objects’ image representations from a large variety of sources. [sent-125, score-0.315]
45 (b) Examples of GIST gradients: (left) an outdoor scene vs (right) an indoor scene [Torralba et al. [sent-134, score-0.686]
46 humans, cars and plants) makes them particularly important for our task, since humans tend to describe these common objects as well. [sent-138, score-0.276]
47 a cow, into its constituent parts: head, torso, legs, which are shared by other objects in a hierarchical manner. [sent-143, score-0.228]
48 This model’s intuition lies in the assumption that objects can be deformed but the relative position of each constituent part should remain the same. [sent-148, score-0.228]
49 We convert the object detection scores to probabilities using Platt’s method [Lin et al. [sent-149, score-0.214]
50 For detecting scenes defined in S, we use the GIST-based scene descriptor of [Torralba et al. [sent-152, score-0.677]
51 Averaging out these responses over larger spatial regions gives us a set of global image properties. [sent-156, score-0.404]
52 This representation forms the GIST descriptor of an image (Fig. [sent-158, score-0.355]
53 5(b)), which is used to train a set of SVM classifiers, one for each scene class in S. [sent-159, score-0.343]
54 The set of common scenes defined in S is learned [... 2007]. [sent-162, score-0.263]
55 3 Corpus-Guided Predictions. Figure 6: (a) Selecting the ROOT verb (ride) from the dependency parse reveals its subject (woman) and direct object (bicycle). [sent-167, score-0.246]
56 (b) Selecting the head noun (PMOD) as the scene (street) reveals ADV as the preposition (on). Predicting Verbs: The key component of our approach is the trained language model Lm that predicts the most likely verb v associated with the objects Nk detected in the image. [sent-168, score-0.753]
57 Since different verbs may take varying numbers of object arguments, we limit ourselves to verbs that take at most two objects (or, more specifically, two noun phrase arguments) as a simplifying assumption: Nk = {n1, n2}, where n2 can be NULL. [sent-169, score-0.515]
58 That is, n1 and n2 are the subject and direct object associated with v ∈ V. [sent-170, score-0.228]
59 the training images from the UIUC Pascal-VOC dataset (Sec. [sent-174, score-0.453]
60 Next, we process all the parses to select verbs which are marked as ROOT and check the existence of a subject (DEP) and direct object (PMOD, OBJ) that are linked to the ROOT verb (see Fig. [sent-178, score-0.259]
61 will be reduced to the stemmed sequence dog chase cat cat run owner, from which we obtain the target trigram relationships {dog chase cat} and {cat run owner}, as these trigrams respect the (n1, v, n2) ordering. [sent-194, score-0.243]
62 Predicting Scenes: Just as an action is strongly related to the objects that participate in it, a scene can be predicted from the objects and verbs that occur in the image. [sent-198, score-0.974]
63 For example, detecting Nk = {boat, person} with v = {row} would predict the scene s = {coast}, since boats usually occur in water regions. [sent-199, score-0.44]
64 To learn this relationship from the corpus, we use the UIUC dataset to discover the common scenes that should be included in S. [sent-200, score-0.296]
65 We [...] rank the remaining scenes in terms of their frequency to select the top 8 scenes used in S. [sent-203, score-0.526]
66 To improve recall and generalization, we expand each of the 8 scene classes using their WordNet synsets ⟨s⟩ (up to a maximum of three hyponym levels). [sent-204, score-0.462]
67 As described above, we compute the log-likelihood ratio of ordered bigrams {n, ⟨s⟩} and {v, ⟨s⟩}, λns and λvs, by reducing the corpus sentences to the target nouns, verbs and scenes defined in N, V and S. [sent-206, score-0.435]
68 7(b), we are able to predict scenes that co-locate with the nouns and verbs with reasonable correctness. [sent-212, score-0.324]
69 Using ordered bigrams {p, ⟨s⟩}, we find the prepositions in P that co-locate with the scene synonyms over the corpus. [sent-217, score-0.433]
70 From START, we assume all object pair detections are equiprobable: Pr(NN|START) = 1/(|N|∗(|N|+1)), where we have added an additional NULL value for objects (at most 1). [sent-241, score-0.582]
71 At each NN, the HMM emits a detection from the image and, by independence, we have: Pr(nn|NN) = Pr(n1|I)Pr(n2|I). [sent-242, score-0.343]
72 The HMM then transitions from NV to S with Pr(S|NV) = Pr(s|n, v), computed from the corpus, and emits the scene detection score from the image: Pr(s|S) = Pr(s|I). [sent-245, score-0.374]
73 The best (most likely) path through the HMM given the image observations is found using the Viterbi algorithm, which can be done in O(10^5) time, significantly faster than the naive approach. [sent-250, score-0.315]
74 When several possible choices are available, a random choice is made that depends on the object detection scores: “the” is preferred when we are confident of the detections, while “a” is preferred otherwise. [sent-270, score-0.382]
75 The sentence structure is therefore of the form NP-VP-PP, with variations when only one object, or multiple detections of the same object, are detected. [sent-275, score-0.616]
76 A special case is when no objects are detected (below the predefined threshold). [sent-276, score-0.268]
77 In this case, we simply generate a sentence that describes the scene only, e.g. [sent-278, score-0.377]
78 1) since they do not fully describe the image content in terms of the objects and actions. [sent-283, score-0.543]
79 As a baseline, we simply generated T∗ directly from images without using the corpus. [sent-290, score-0.42]
80 There are two variants of this baseline, with which we seek to determine whether listing all objects in the image is crucial for scene description. [sent-291, score-0.886]
81 Tb1 is a baseline that uses all possible objects and the scene detected: Tb1 = {n1, n2, · · · , nm, s}, and our sentence will be of the form {Object 1, object 2 and object 3 are IN the scene}. [sent-292, score-0.904]
82 For the second baseline, Tb2, we limit the number of objects to just any two: Tb2 = {n1, n2, s}, and the sentence generated will be of the form {Object 1 and object 2 are IN the scene}. [sent-294, score-0.373]
83 All experiments were performed on the 100 unseen testing images from the UIUC dataset and we used only the most likely (top) sentence generated for all evaluation. [sent-298, score-0.487]
84 In this work, the short descriptive sentence of an image can be viewed as summarizing the image content and ROUGE-1 is able to capture how well this sentence can describe the image by comparing it with the human annotated ground truth of the UIUC dataset. [sent-301, score-1.137]
85 The second evaluation metric, Relevance and Readability, is therefore proposed as an empirical measure of how much the sentence: 1) conveys the image content (relevance) in terms of the objects, actions and scene predicted and 2) is grammatically correct (readability). [sent-309, score-0.817]
86 Firstly, the first baseline, Tb1∗, is considered the most relevant description of the image and, at the same time, the least readable. [sent-315, score-0.388]
87 This is most likely because this recall-oriented strategy will almost certainly describe some objects, but lacks any verb description and produces longer sentences that average 8. [sent-316, score-0.33]
88 It is also possible that humans tend to penalize irrelevant objects less than missing objects, and further evaluations are necessary to confirm this. [sent-318, score-0.276]
89 Since Tb2∗ is limited to two objects, just like the proposed HMM, it is a more suitable baseline for comparison. [sent-319, score-0.255]
90 Finally, in terms of readability, T∗ generates the most readable sentences, and this is achieved by leveraging the corpus to guide our predictions of the most reasonable nouns, verbs, scenes and prepositions that agree with the detections in the image. [sent-321, score-0.558]
91 5 Future Work. In this work, we have introduced a computationally feasible framework that integrates visual perception together with semantic grounding obtained from a large textual corpus for the purpose of generating a descriptive sentence of an image. [sent-322, score-0.23]
92 Compared to human gold standards, therefore, much work still remains in terms of detecting these objects and scenes with high precision. [sent-327, score-0.522]
93 Currently, at most two object classes are used to generate simple sentences, which the results showed penalized the relevance score of our approach. [sent-328, score-0.262]
94 Another interesting direction of future work would be to detect salient objects, learned from training image+corpus or eye-movement data, and to verify if these objects aid in improving the descriptive sentences we generate. [sent-330, score-0.309]
95 of representing images using T∗ is that we can easily sort and retrieve images that are similar in terms of their semantic content. [sent-332, score-0.84]
96 This would enable us to retrieve, for example, more relevant images given a verbal search query such as {ride, sit, fly}, returning images where these verbs are found in T∗. [sent-333, score-0.84]
97 Some results of retrieved images based on their verbal components are shown in Fig. [sent-334, score-0.448]
98 10: many images with dissimilar visual content are correctly classified based on their semantic meaning. [sent-335, score-0.52]
99 Recognizing human actions from still images with latent poses. [sent-523, score-0.544]
100 Grouplet: a structured image representation for recognizing human and object interactions. [sent-528, score-0.47]
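The extracted sentences describe scoring noun-scene and verb-scene co-occurrence (λns, λvs) with a log-likelihood ratio over ordered bigrams, once corpus sentences are reduced to the target vocabulary. The exact formula is not given here. The sketch below assumes Dunning's G² statistic on a 2x2 contingency table, one standard log-likelihood ratio for collocations, and it counts only adjacent pairs in each reduced sequence. `llr` and `bigram_llr` are hypothetical helper names.

```python
import math
from collections import Counter


def llr(k11, k12, k21, k22):
    """Dunning's G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: count(a then b), k12: count(a then not-b),
    k21: count(not-a then b), k22: count(not-a then not-b).
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    g2 = 0.0
    for k, i, j in ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1)):
        if k > 0:  # the 0 * log(0) terms are taken to be 0
            g2 += k * math.log(k * n / (rows[i] * cols[j]))
    return 2.0 * g2


def bigram_llr(sequences, first, second):
    """LLR of the ordered bigram (first, second), e.g. a noun then a scene word.

    Assumes each sequence is already reduced to the target vocabulary
    (nouns, verbs, scene synonyms) and counts adjacent pairs only.
    """
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    total = sum(pairs.values())
    k11 = pairs[(first, second)]
    first_total = sum(c for (x, _), c in pairs.items() if x == first)
    second_total = sum(c for (_, y), c in pairs.items() if y == second)
    k12 = first_total - k11
    k21 = second_total - k11
    k22 = total - k11 - k12 - k21
    return llr(k11, k12, k21, k22)


# Toy reduced corpus (made up for illustration only).
seqs = [["boat", "coast"], ["boat", "coast"], ["car", "street"], ["car", "street"]]
print(round(bigram_llr(seqs, "boat", "coast"), 3))  # 5.545
```

G² grows with the strength of association in either direction, so `bigram_llr(seqs, "boat", "street")` is equally large on this toy data. A practical scorer would also check that the observed count k11 exceeds its expected value before treating a bigram as a positive co-location.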
wordName wordTfidf (topN-words)
[('images', 0.42), ('scene', 0.343), ('image', 0.315), ('pr', 0.3), ('scenes', 0.263), ('objects', 0.228), ('detections', 0.199), ('object', 0.155), ('uiuc', 0.145), ('actions', 0.124), ('felzenszwalb', 0.122), ('hmm', 0.116), ('torralba', 0.107), ('visual', 0.1), ('quadruplet', 0.092), ('nv', 0.077), ('nvn', 0.077), ('action', 0.074), ('vision', 0.069), ('verbs', 0.066), ('dog', 0.063), ('nouns', 0.061), ('reet', 0.061), ('prepositions', 0.06), ('cat', 0.054), ('hsi', 0.053), ('ride', 0.053), ('boat', 0.053), ('spatial', 0.053), ('descriptive', 0.052), ('nk', 0.052), ('nn', 0.051), ('generation', 0.049), ('ect', 0.049), ('humans', 0.048), ('ob', 0.047), ('pose', 0.046), ('equiprobable', 0.046), ('farhadi', 0.046), ('readability', 0.045), ('grounding', 0.044), ('relevance', 0.043), ('ground', 0.043), ('preposition', 0.043), ('detectors', 0.042), ('graff', 0.042), ('yao', 0.042), ('predicting', 0.04), ('detected', 0.04), ('berg', 0.04), ('descriptor', 0.04), ('gist', 0.04), ('walking', 0.04), ('verb', 0.038), ('description', 0.037), ('everyday', 0.036), ('chase', 0.036), ('readable', 0.036), ('responses', 0.036), ('choi', 0.036), ('predicted', 0.035), ('classes', 0.035), ('strategy', 0.035), ('sentence', 0.034), ('dataset', 0.033), ('pmod', 0.033), ('platt', 0.033), ('poses', 0.032), ('adv', 0.031), ('water', 0.031), ('probabilities', 0.031), ('gigaword', 0.031), ('captions', 0.031), ('cene', 0.031), ('cow', 0.031), ('everingham', 0.031), ('forsyth', 0.031), ('girshick', 0.031), ('golland', 0.031), ('hyponymns', 0.031), ('kojima', 0.031), ('kourtzi', 0.031), ('legs', 0.031), ('oliva', 0.031), ('torso', 0.031), ('transits', 0.031), ('urgesi', 0.031), ('detecting', 0.031), ('synonyms', 0.03), ('sentences', 0.029), ('truth', 0.029), ('yang', 0.029), ('detection', 0.028), ('components', 0.028), ('teh', 0.027), ('adn', 0.026), ('nnd', 0.026), ('owner', 0.026), ('card', 0.026), ('coast', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images
2 0.07098037 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
Author: Amit Dubey ; Frank Keller ; Patrick Sturt
Abstract: This paper introduces a psycholinguistic model of sentence processing which combines a Hidden Markov Model noun phrase chunker with a co-reference classifier. Both models are fully incremental and generative, giving probabilities of lexical elements conditional upon linguistic structure. This allows us to compute the information theoretic measure of surprisal, which is known to correlate with human processing effort. We evaluate our surprisal predictions on the Dundee corpus of eye-movement data and show that our model achieves a better fit with human reading times than a syntax-only model which does not have access to co-reference information.
3 0.067583658 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling
Author: Vivek Srikumar ; Dan Roth
Abstract: This paper presents a model that extends semantic role labeling. Existing approaches independently analyze relations expressed by verb predicates or those expressed as nominalizations. However, sentences express relations via other linguistic phenomena as well. Furthermore, these phenomena interact with each other, thus restricting the structures they articulate. In this paper, we use this intuition to define a joint inference model that captures the inter-dependencies between verb semantic role labeling and relations expressed using prepositions. The scarcity of jointly labeled data presents a crucial technical challenge for learning a joint model. The key strength of our model is that we use existing structure predictors as black boxes. By enforcing consistency constraints between their predictions, we show improvements in the performance of both tasks without retraining the individual models.
4 0.065184109 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
Author: Jette Viethen ; Robert Dale ; Markus Guhe
Abstract: Traditional computational approaches to referring expression generation operate in a deliberate manner, choosing the attributes to be included on the basis of their ability to distinguish the intended referent from its distractors. However, work in psycholinguistics suggests that speakers align their referring expressions with those used previously in the discourse, implying less deliberate choice and more subconscious reuse. This raises the question as to which is a more accurate characterisation of what people do. Using a corpus of dialogues containing 16,358 referring expressions, we explore this question via the generation of subsequent references in shared visual scenes. We use a machine learning approach to referring expression generation and demonstrate that incorporating features that correspond to the computational tradition does not match human referring behaviour as well as using features corresponding to the process of alignment. The results support the view that the traditional model of referring expression generation that is widely assumed in work on natural language generation may not in fact be correct; our analysis may also help explain the oft-observed redundancy found in human-produced referring expressions.
5 0.060317039 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh
Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.
6 0.056743577 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion
7 0.053461831 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser
8 0.043938518 38 emnlp-2011-Data-Driven Response Generation in Social Media
9 0.038372967 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
10 0.036647003 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
11 0.035582937 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
12 0.034890153 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
13 0.034208901 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
14 0.034140732 92 emnlp-2011-Minimally Supervised Event Causality Identification
15 0.034018088 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
16 0.033749953 128 emnlp-2011-Structured Relation Discovery using Generative Models
17 0.033659283 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
18 0.033621613 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
19 0.033440039 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus
20 0.033361595 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification
topicId topicWeight
[(0, 0.138), (1, -0.035), (2, -0.037), (3, 0.01), (4, 0.006), (5, -0.018), (6, -0.02), (7, 0.017), (8, 0.015), (9, -0.039), (10, -0.031), (11, -0.088), (12, 0.004), (13, -0.008), (14, -0.027), (15, 0.034), (16, 0.029), (17, 0.024), (18, -0.021), (19, -0.005), (20, 0.01), (21, 0.002), (22, 0.054), (23, -0.082), (24, -0.038), (25, 0.052), (26, 0.125), (27, -0.013), (28, 0.041), (29, -0.045), (30, -0.121), (31, 0.172), (32, -0.045), (33, -0.133), (34, 0.062), (35, 0.145), (36, -0.125), (37, 0.126), (38, 0.166), (39, 0.035), (40, 0.064), (41, 0.142), (42, 0.067), (43, 0.286), (44, 0.048), (45, 0.084), (46, -0.161), (47, -0.252), (48, -0.089), (49, -0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96253455 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images
2 0.4340561 62 emnlp-2011-Generating Subsequent Reference in Shared Visual Scenes: Computation vs Re-Use
3 0.38845798 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
4 0.38054758 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling
5 0.36559147 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
Author: Edward Grefenstette ; Mehrnoosh Sadrzadeh
Abstract: Modelling compositional meaning for sentences using empirical distributional methods has been a challenge for computational linguists. We implement the abstract categorical model of Coecke et al. (2010) using data from the BNC and evaluate it. The implementation is based on unsupervised learning of matrices for relational words and applying them to the vectors of their arguments. The evaluation is based on the word disambiguation task developed by Mitchell and Lapata (2008) for intransitive sentences, and on a similar new experiment designed for transitive sentences. Our model matches the results of its competitors in the first experiment, and betters them in the second. The general improvement in results with increase in syntactic complexity showcases the compositional power of our model.
6 0.34725809 91 emnlp-2011-Literal and Metaphorical Sense Identification through Concrete and Abstract Context
7 0.32681355 11 emnlp-2011-A Simple Word Trigger Method for Social Tag Suggestion
8 0.29848713 32 emnlp-2011-Computing Logical Form on Regulatory Texts
9 0.27174988 3 emnlp-2011-A Correction Model for Word Alignments
10 0.24007586 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
11 0.23615913 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus
12 0.2235443 103 emnlp-2011-Parser Evaluation over Local and Non-Local Deep Dependencies in a Large Corpus
13 0.22205921 42 emnlp-2011-Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora
14 0.21371587 79 emnlp-2011-Lateen EM: Unsupervised Training with Multiple Objectives, Applied to Dependency Grammar Induction
15 0.2043874 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
16 0.20285241 110 emnlp-2011-Ranking Human and Machine Summarization Systems
17 0.189014 67 emnlp-2011-Hierarchical Verb Clustering Using Graph Factorization
18 0.17932031 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
19 0.17658867 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition
20 0.1733629 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
topicId topicWeight
[(23, 0.097), (36, 0.021), (37, 0.026), (45, 0.057), (54, 0.025), (57, 0.014), (62, 0.02), (64, 0.018), (66, 0.025), (69, 0.028), (79, 0.501), (82, 0.018), (87, 0.012), (90, 0.014), (96, 0.027), (98, 0.024)]
simIndex simValue paperId paperTitle
1 0.99024636 121 emnlp-2011-Semi-supervised CCG Lexicon Extension
Author: Emily Thomforde ; Mark Steedman
Abstract: This paper introduces Chart Inference (CI), an algorithm for deriving a CCG category for an unknown word from a partial parse chart. It is shown to be faster and more precise than a baseline brute-force method, and to achieve wider coverage than a rule-based system. In addition, we show the application of CI to a domain adaptation task for question words, which are largely missing in the Penn Treebank. When used in combination with self-training, CI increases the precision of the baseline StatCCG parser over subject-extraction questions by 50%. An error analysis shows that CI contributes to the increase by expanding the number of category types available to the parser, while self-training adjusts the counts.
2 0.93859476 115 emnlp-2011-Relaxed Cross-lingual Projection of Constituent Syntax
Author: Wenbin Jiang ; Qun Liu ; Yajuan Lv
Abstract: We propose a relaxed correspondence assumption for cross-lingual projection of constituent syntax, which allows a supposed constituent of the target sentence to correspond to an unrestricted treelet in the source parse. Such a relaxed assumption fundamentally tolerates the syntactic non-isomorphism between languages, and enables us to learn the target-language-specific syntactic idiosyncrasy rather than a strained grammar directly projected from the source language syntax. Based on this assumption, a novel constituency projection method is also proposed in order to induce a projected constituent treebank from the source-parsed bilingual corpus. Experiments show that the parser trained on the projected treebank dramatically outperforms previous projected and unsupervised parsers.
3 0.90479058 36 emnlp-2011-Corroborating Text Evaluation Results with Heterogeneous Measures
Author: Enrique Amigo ; Julio Gonzalo ; Jesus Gimenez ; Felisa Verdejo
Abstract: Automatically produced texts (e.g. translations or summaries) are usually evaluated with n-gram based measures such as BLEU or ROUGE, while the wide set of more sophisticated measures that have been proposed in recent years remains largely ignored for practical purposes. In this paper we first present an in-depth analysis of the state of the art in order to clarify this issue. After this, we formalize and verify empirically a set of properties that every text evaluation measure based on similarity to human-produced references satisfies. These properties imply that corroborating system improvements with additional measures always increases the overall reliability of the evaluation process. In addition, the greater the heterogeneity of the measures (which is measurable), the higher their combined reliability. These results support the use of heterogeneous measures in order to consolidate text evaluation results.
same-paper 4 0.90120083 34 emnlp-2011-Corpus-Guided Sentence Generation of Natural Images
Author: Yezhou Yang ; Ching Teo ; Hal Daume III ; Yiannis Aloimonos
Abstract: We propose a sentence generation strategy that describes images by predicting the most likely nouns, verbs, scenes and prepositions that make up the core sentence structure. The inputs are initial noisy estimates of the objects and scenes detected in the image using state-of-the-art trained detectors. As predicting actions directly from still images is unreliable, we use a language model trained on the English Gigaword corpus to obtain their estimates, together with probabilities of co-located nouns, scenes and prepositions. We use these estimates as parameters of an HMM that models the sentence generation process, with hidden nodes as sentence components and image detections as the emissions. Experimental results show that our strategy of combining vision and language produces readable and descriptive sentences compared to naive strategies that use vision alone.
5 0.60385418 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
Author: Tom Kwiatkowski ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman
Abstract: We consider the problem of learning factored probabilistic CCG grammars for semantic parsing from data containing sentences paired with logical-form meaning representations. Traditional CCG lexicons list lexical items that pair words and phrases with syntactic and semantic content. Such lexicons can be inefficient when words appear repeatedly with closely related lexical content. In this paper, we introduce factored lexicons, which include both lexemes to model word meaning and templates to model systematic variation in word usage. We also present an algorithm for learning factored CCG lexicons, along with a probabilistic parse-selection model. Evaluations on benchmark datasets demonstrate that the approach learns highly accurate parsers, whose generalization performance benefits greatly from the lexical factoring.
6 0.55746335 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
7 0.55194855 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference
8 0.53717232 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases
9 0.53415686 57 emnlp-2011-Extreme Extraction - Machine Reading in a Week
10 0.53297478 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
11 0.52429384 22 emnlp-2011-Better Evaluation Metrics Lead to Better Machine Translation
12 0.51799744 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars
13 0.51278955 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification
14 0.50504225 20 emnlp-2011-Augmenting String-to-Tree Translation Models with Fuzzy Use of Source-side Syntax
15 0.50102049 70 emnlp-2011-Identifying Relations for Open Information Extraction
16 0.50077641 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming
17 0.49910489 83 emnlp-2011-Learning Sentential Paraphrases from Bilingual Parallel Corpora for Text-to-Text Generation
18 0.49438363 147 emnlp-2011-Using Syntactic and Semantic Structural Kernels for Classifying Definition Questions in Jeopardy!
19 0.49309278 38 emnlp-2011-Data-Driven Response Generation in Social Media
20 0.49269778 136 emnlp-2011-Training a Parser for Machine Translation Reordering