acl acl2013 acl2013-275 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem, at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information, such as PP attachments.
Reference: text
sentIndex sentText sentNum sentScore
1 Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. [sent-6, score-0.725]
2 It is fast to train and, implemented approximately as an efficient reranker, is about 20% faster than the current Stanford factored parser. [sent-10, score-0.13]
3 For example, much work has shown the usefulness of syntactic representations for subsequent tasks such as relation extraction, semantic role labeling (Gildea and Palmer, 2002) and paraphrase detection (Callison-Burch, 2008). [sent-13, score-0.152]
4 Syntactic descriptions standardly use coarse discrete categories such as NP for noun phrases or PP for prepositional phrases. [sent-14, score-0.237]
5 However, recent work has shown that parsing results can be greatly improved by defining more fine-grained syntactic categories. [sent-15, score-0.13]
6 Figure 1: Example of a CVG tree with (category, vector) representations at each node. [sent-18, score-0.18]
7 The vectors for nonterminals are computed via a new type of recursive neural network which is conditioned on syntactic categories from a PCFG. [sent-19, score-0.518]
8 Unlike them, it jointly learns how to parse and how to represent phrases as both discrete categories and continuous vectors as illustrated in Fig. [sent-40, score-0.405]
9 CVGs combine the advantages of standard probabilistic context-free grammars (PCFGs) with those of recursive neural networks (RNNs). [sent-42, score-0.342]
10 The former can capture the discrete categorization of phrases into NP or PP while the latter can capture fine-grained syntactic and compositional-semantic information on phrases and words. [sent-43, score-0.34]
11 Previous RNN-based parsers used the same (tied) weights at all nodes to compute the vector representing a constituent (Socher et al. [sent-47, score-0.212]
12 This requires the composition function to be extremely powerful, since it has to combine phrases with different syntactic head words, and it is hard to optimize since the parameters form a very deep neural network. [sent-49, score-0.443]
13 We generalize the fully tied RNN to one with syntactically untied weights. [sent-50, score-0.232]
14 The weights at each node are conditionally dependent on the categories of the child constituents. [sent-51, score-0.211]
15 This allows different composition functions when combining different types of phrases and is shown to result in a large improvement in parsing accuracy. [sent-52, score-0.236]
16 Our compositional distributed representation allows a CVG parser to make accurate parsing decisions and capture similarities between phrases and sentences. [sent-53, score-0.387]
17 2 Related Work The CVG is inspired by two lines of research: Enriching PCFG parsers through more diverse sets of discrete states and recursive deep learning models that jointly learn classifiers and continuous feature representations for variable-sized inputs. [sent-61, score-0.52]
18 Improving Discrete Syntactic Representations As mentioned in the introduction, there are several approaches to improving discrete representations for parsing. [sent-62, score-0.212]
19 (2006) use a learning algorithm that splits and merges the syntactic categories in order to maximize likelihood on the treebank. [sent-64, score-0.133]
20 Another approach is lexicalized parsers (Collins, 2003; Charniak, 2000) that describe each category with a lexical item, usually the head word. [sent-66, score-0.143]
21 More recently, Hall and Klein (2012) combine several such annotation schemes in a factored parser. [sent-67, score-0.13]
22 We extend the above ideas from discrete representations to richer continuous ones. [sent-68, score-0.309]
23 The CVG can be seen as factoring discrete and continuous parsing in one model. [sent-69, score-0.258]
24 We also borrow ideas from this line of research in that our parser combines the generative PCFG model with discriminatively learned RNNs. [sent-73, score-0.166]
25 Collobert and Weston (2008) showed that neural networks can perform well on sequence labeling language processing tasks while also learning appropriate features. [sent-76, score-0.203]
26 Henderson (2003) was the first to show that neural networks can be successfully used for large scale parsing. [sent-81, score-0.203]
27 He introduced a left-corner parser to estimate the probabilities of parsing decisions conditioned on the parsing history. [sent-82, score-0.183]
28 Both the original parsing system and its probabilistic interpretation (Titov and Henderson, 2007) learn features that represent the parsing history and do not provide a principled linguistic representation like our phrase representations. [sent-84, score-0.156]
29 Other related work includes (Henderson, 2004), who discriminatively trains a parser based on synchrony networks and (Titov and Henderson, 2006), who use an SVM to adapt a generative parser to different domains. [sent-85, score-0.299]
30 (2003) apply recursive neural networks to re-rank possible phrase attachments in an incremental parser. [sent-87, score-0.363]
31 For their results on full sentence parsing, they rerank candidate trees created by the Collins parser (Collins, 2003). [sent-91, score-0.139]
32 Similar to their work, we use the idea of letting discrete categories reduce the search space during inference. [sent-92, score-0.193]
33 Our syntactically untied RNNs outperform them by a significant margin. [sent-94, score-0.19]
34 The main differences are (i) the dual representation of nodes as discrete categories and vectors, (ii) the combination with a PCFG, and (iii) the syntactic untying of weights based on child categories. [sent-99, score-0.346]
35 We directly compare models with fully tied and untied weights. [sent-100, score-0.182]
36 3 Compositional Vector Grammars This section introduces Compositional Vector Grammars (CVGs), a model to jointly find syntactic structure and capture compositional semantic information. [sent-103, score-0.212]
37 Therefore we combine syntactic and semantic information by giving the parser access to rich syntactico-semantic information in the form of distributional word vectors and compute compositional semantic vector representations for longer phrases (Costa et al. [sent-108, score-0.597]
38 The CVG model merges ideas from both generative models that assume discrete syntactic categories and discriminative models that are trained using continuous vectors. [sent-112, score-0.342]
39 We will first briefly introduce single word vector representations and then describe the CVG objective function, tree scoring and inference. [sent-113, score-0.325]
40 1 Word Vector Representations In most systems that use a vector representation for words, such vectors are based on cooccurrence statistics of each word and its context (Turney and Pantel, 2010). [sent-115, score-0.147]
41 Another line of research to learn distributional word vectors is based on neural language models (Bengio et al. [sent-116, score-0.217]
42 These vector representations capture interesting linear relationships (up to some accuracy), such as king − man + woman ≈ queen (Mikolov et al. [sent-118, score-0.22]
43 The idea is to construct a neural network that outputs high scores for windows that occur in a large unlabeled corpus and low scores for windows where one word is replaced by a random word. [sent-121, score-0.203]
44 When such a network is optimized via gradient ascent the derivatives backpropagate into the word embedding matrix X. [sent-122, score-0.16]
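To make the window-scoring idea concrete, here is a minimal, non-authoritative sketch in NumPy: a tiny one-hidden-layer scorer over concatenated word vectors from an embedding matrix X, plus a hinge-style ranking loss that prefers an observed window over one whose middle word was replaced at random. The dimensions, network shape, and variable names are illustrative assumptions, not the cited systems' exact architectures.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, win, h = 10000, 50, 5, 100               # vocab size, embedding dim, window size, hidden units
X = 0.01 * rng.standard_normal((n, V))          # word embedding matrix X, one column per word
W1 = 0.01 * rng.standard_normal((h, n * win))   # hidden-layer weights
w2 = 0.01 * rng.standard_normal(h)              # scoring weights

def score(window_ids):
    """Score a window of word indices with a one-hidden-layer network."""
    x = X[:, window_ids].reshape(-1, order="F")  # concatenate the window's word vectors
    return float(w2 @ np.tanh(W1 @ x))

def ranking_loss(true_ids, corrupt_ids):
    """Hinge-style loss: the observed window should outscore the corrupted one by a margin of 1."""
    return max(0.0, 1.0 - score(true_ids) + score(corrupt_ids))

true_window = [12, 85, 7, 430, 2]
corrupt_window = list(true_window)
corrupt_window[2] = int(rng.integers(V))         # replace one word with a random word
print(ranking_loss(true_window, corrupt_window))
```

Minimizing this loss by gradient descent sends gradients through W1 into the columns of X, which is how the embedding matrix ends up encoding co-occurrence information.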
45 In order to predict correct scores the vectors in the matrix capture co-occurrence statistics. [sent-123, score-0.181]
46 This index is used to retrieve the word's vector representation aw using a simple multiplication with a binary vector e, which is zero everywhere, except at the ith index. [sent-130, score-0.152]
47 Now that we have discrete and continuous representations for all words, we can continue with the approach for computing tree structures and vectors for nonterminal nodes. [sent-136, score-0.431]
48 The set of all possible labeled trees for a given sentence xi is defined as Y(xi) and the correct tree for a sentence is yi. [sent-139, score-0.137]
49 , 2011b) trains the CVG so that the highest scoring tree will be the correct tree: gθ(xi) = yi, and its score will be larger, up to a margin, than that of other possible trees ŷ ∈ Y(xi): s(CVG(θ, xi, yi)) ≥ s(CVG(θ, xi, ŷ)) + ∆(yi, ŷ). [sent-151, score-0.206]
50 Intuitively, to minimize this objective, the score of the correct tree yi is increased and the score of the highest scoring incorrect tree ŷ is decreased. [sent-155, score-0.252]
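Equation (3), garbled in the extraction above, is presumably the standard structured max-margin (hinge) objective implied by the margin condition in the previous sentence; a hedged reconstruction in LaTeX:

```latex
J(\theta) = \sum_i r_i(\theta), \qquad
r_i(\theta) = \max_{\hat{y} \in Y(x_i)}
  \Big( s\big(\mathrm{CVG}(\theta, x_i, \hat{y})\big) + \Delta(y_i, \hat{y}) \Big)
  - s\big(\mathrm{CVG}(\theta, x_i, y_i)\big)
```

When the margin condition holds for every candidate tree, r_i(θ) is zero; otherwise the highest-scoring violating tree drives the update, which is exactly the intuition stated above.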
51 We define the word representations as (vector, POS) pairs: ((a, A) , (b, B) , (c, C)), where the vectors are defined as in Sec. [sent-161, score-0.171]
52 The standard RNN essentially ignores all POS tags and syntactic categories and each nonterminal node is associated with the same neural network (i. [sent-164, score-0.4]
53 Each such triplet denotes that a parent node p has two children and each ck can be either a word vector or a non-terminal node in the tree. [sent-169, score-0.258]
54 Note that in order to replicate the neural network and compute node representations in a bottom-up fashion, the parent must have the same dimensionality as the children: p ∈ Rn. [sent-172, score-0.272]
55 Given this tree structure, we can now compute activations for each node from the bottom up. [sent-173, score-0.177]
56 Figure 2: An example tree with a simple Recursive Neural Network: the same weight matrix is replicated and used to compute all non-terminal node representations. [sent-192, score-0.243]
57 In order to compute a score of how plausible a syntactic constituent a parent is, the RNN uses a single-unit linear layer for all i: s(p(i)) = vT p(i), where v ∈ Rn is a vector of parameters that need to be trained. [sent-194, score-0.236]
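A minimal sketch of the standard, fully tied RNN composition and scoring just described: a single matrix W is reused at every node and each parent vector is scored by a single linear unit. The dimensionality, the choice of tanh as f, and the toy tree are illustrative assumptions.

```python
import numpy as np

n = 50                                         # shared dimensionality of words and phrases
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((n, 2 * n))     # one composition matrix, tied across all nodes
v = 0.01 * rng.standard_normal(n)              # scoring vector of the single-unit linear layer

def compose(b, c):
    """Standard RNN: p = f(W [b; c]), with the same W regardless of the children's categories."""
    return np.tanh(W @ np.concatenate([b, c]))

def node_score(p):
    """s(p) = v^T p."""
    return float(v @ p)

# bottom-up computation over a toy tree (a (b c))
a, b, c = (rng.standard_normal(n) for _ in range(3))
p1 = compose(b, c)
p2 = compose(a, p1)
tree_score = node_score(p1) + node_score(p2)
```

Because W is shared everywhere, this single matrix has to model every kind of composition, which is the limitation the following sentences address.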
58 The standard RNN requires a single composition function to capture all types of compositions: adjectives and nouns, verbs and nouns, adverbs and adjectives, etc. [sent-198, score-0.158]
59 Even though this function is a powerful one, we find a single neural network weight matrix cannot fully capture the richness of compositionality. [sent-199, score-0.346]
60 Several extensions are possible: A two-layered RNN would provide more expressive power; however, it is much harder to train because the resulting neural network becomes very deep and suffers from vanishing gradient problems. [sent-200, score-0.258]
61 The matrix is then applied to the sibling node’s vector during the composition. [sent-203, score-0.142]
62 While this results in a powerful composition function that essentially depends on the words being combined, the number of model parameters explodes and the composition functions do not capture the syntactic commonalities between similar POS tags or syntactic categories. [sent-204, score-0.376]
63 Hence, CVGs combine discrete, syntactic rule probabilities and continuous vector compositions. [sent-207, score-0.196]
64 The idea is that the syntactic categories of the children determine what composition function to use for computing the vector of their parents. [sent-208, score-0.366]
65 While not perfect, a dedicated composition function for each rule RHS can well capture common composition processes such as adjective or adverb modification versus noun or clausal complementation. [sent-209, score-0.272]
66 In contrast, the CVG uses a syntactically untied RNN (SU-RNN) which has a set of such weights. [sent-212, score-0.19]
67 Figure 3 shows an example SU-RNN that computes parent vectors with syntactically untied weights. [sent-215, score-0.336]
68 The CVG computes the first parent vector via the SU-RNN: p(1) = f(W(B,C) [b; c]), [sent-216, score-0.151]
69 where W(B,C) ∈ Rn×2n is now a matrix that depends on the categories of the two children. [sent-220, score-0.147]
70 In this bottom-up procedure, the score for each node consists of summing two elements: first, a single linear unit that scores the parent vector and, second, the log probability of the PCFG for the rule that combines these two children: s(p(1)) = (v(B,C))T p(1) + log P(P1 → B C). [sent-221, score-0.215]
71 Assuming that node p(1) has syntactic category P1, we compute the second parent vector via: p(2) = f(W(A,P1) [a; p(1)]). [sent-227, score-0.337]
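Putting the last few sentences together, here is a minimal sketch of SU-RNN composition: the matrix and scoring vector are looked up by the children's syntactic categories, and each node's score adds the log probability of the corresponding PCFG rule. The category inventory, dictionary layout, and toy rule log-probabilities are illustrative assumptions, not the trained model.

```python
import numpy as np

n = 50
rng = np.random.default_rng(0)
init = lambda *shape: 0.01 * rng.standard_normal(shape)

# one composition matrix and scoring vector per ordered pair of child categories (toy inventory)
W = {("B", "C"): init(n, 2 * n), ("A", "P1"): init(n, 2 * n)}
v = {("B", "C"): init(n), ("A", "P1"): init(n)}
log_pcfg = {("P1", "B", "C"): -1.2, ("P2", "A", "P1"): -0.7}   # toy rule log-probabilities

def su_compose(parent_cat, left_cat, lvec, right_cat, rvec):
    """SU-RNN node: pick W and v by the child categories; score = v^T p + log P(parent -> left right)."""
    key = (left_cat, right_cat)
    p = np.tanh(W[key] @ np.concatenate([lvec, rvec]))
    s = float(v[key] @ p) + log_pcfg[(parent_cat, left_cat, right_cat)]
    return p, s

a, b, c = (rng.standard_normal(n) for _ in range(3))
p1, s1 = su_compose("P1", "B", b, "C", c)       # p(1) = f(W(B,C) [b; c])
p2, s2 = su_compose("P2", "A", a, "P1", p1)     # p(2) = f(W(A,P1) [a; p(1)])
cvg_tree_score = s1 + s2                        # CVG score of the tree: sum of node scores
```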
72 The goodness of a tree is measured in terms of its score, and the CVG score of a complete tree is the sum of the scores at each node: s(CVG(θ, x, ŷ)) = Σd∈N(ŷ) s(pd). [sent-240, score-0.16]
73 A (category, vector) node representation is dependent on all the words in its span and hence to find the true global optimum, we would have to compute the scores for all binary trees. [sent-247, score-0.127]
74 This is similar to a re-ranking setup but with one main difference: the SU-RNN rule score computation at each node still only has access to its child vectors, not the whole tree or other global features. [sent-261, score-0.181]
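As a rough, non-authoritative illustration of that inference setup, the sketch below re-scores candidate trees from a base PCFG with the CVG and keeps the best one. `kbest_pcfg_trees` and `cvg_score` are hypothetical stand-ins (defined here only as trivial stubs) for the base parser's candidate generator and the bottom-up SU-RNN scorer; as the sentence above notes, the real system scores each node using only its child vectors while searching, rather than re-ranking complete trees with global features.

```python
def kbest_pcfg_trees(sentence, k):
    """Hypothetical stub for the base PCFG's k-best / beam candidate generator."""
    return [("tree_%d" % i, -float(i)) for i in range(k)]      # (tree, base log-prob) placeholders

def cvg_score(tree, sentence):
    """Hypothetical stub for the SU-RNN scorer, which sees only each node's child vectors."""
    return tree[1]                                              # placeholder: reuse the base score

def cvg_parse(sentence, k=200):
    """Keep the candidate with the highest CVG score among the base parser's top-k candidates."""
    candidates = kbest_pcfg_trees(sentence, k)
    return max(candidates, key=lambda t: cvg_score(t, sentence))

best = cvg_parse("He eats spaghetti with a spoon .")
```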
75 The derivative of tree i has to be taken with respect to all parameter matrices that appear in it. [sent-273, score-0.145]
76 The main difference between backpropagation in standard RNNs and SU-RNNs is that the derivatives at each node only add to the overall derivative of the specific matrix at that node. [sent-274, score-0.221]
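A small sketch of that bookkeeping point: during backpropagation each node contributes a gradient only to the particular untied matrix it used, keyed by its children's categories. The node attributes (`key`, `delta`, `child_vec`) are assumptions made for illustration, not the authors' data structures.

```python
from collections import defaultdict
import numpy as np

def accumulate_surnn_grads(nodes):
    """Sum per-node outer-product gradients into whichever matrix each node selected.
    `node.key` is the (left, right) category pair, `node.delta` the error signal at the parent,
    and `node.child_vec` the concatenated child vector [b; c] (all assumed attributes)."""
    grads = defaultdict(lambda: 0.0)
    for node in nodes:
        grads[node.key] = grads[node.key] + np.outer(node.delta, node.child_vec)
    return grads
```

In a fully tied RNN a single accumulator would receive every node's contribution; here each category-pair key collects only the nodes that used its matrix.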
77 Let θ = (X, W(··), v(··)) ∈ RM be a vector of all M model parameters, where we denote W(··) as the set of matrices that appear in the training set. [sent-280, score-0.141]
78 These include split categories, such as parent annotation categories like VPˆS. [sent-311, score-0.156]
79 However, since the vectors will capture lexical and semantic information, even simple base PCFGs can be substantially improved. [sent-316, score-0.151]
80 Testing on the full WSJ section 22 dev set (1700 sentences) takes roughly 470 seconds with the simple base PCFG, 1320 seconds with our new CVG and 1600 seconds with the currently published Stanford factored parser. [sent-318, score-0.166]
81 Table 1: Comparison of parsers with richer state representations on the WSJ. [sent-345, score-0.174]
82 We hypothesize that the larger word vector sizes, while capturing more semantic knowledge, result in too many SU-RNN matrix parameters to train and hence perform worse. [sent-348, score-0.172]
83 (2006), which bootstraps and parses additional large corpora multiple times, Charniak-RS: the state of the art self-trained and discriminatively re-ranked Charniak-Johnson parser combining (Charniak, 2000; McClosky et al. [sent-355, score-0.137]
84 (2012) and compare to the previous version of the Stanford factored parser as well as to the Berkeley and Charniak-reranked-self-trained parsers (defined above). [sent-364, score-0.309]
85 One of the largest sources of improved performance over the original Stanford factored parser is in the correct placement of PP phrases. [sent-408, score-0.235]
86 We then continue to train both parsers on two similar sentences and then analyze if the parsers correctly transferred the knowledge. [sent-425, score-0.148]
87 The training sentences are He eats spaghetti with a fork. [sent-426, score-0.148]
88 The very similar test sentences are He eats spaghetti with a spoon. [sent-428, score-0.148]
89 After training, the CVG parses both correctly, while the factored Stanford parser incorrectly attaches both PPs to spaghetti. [sent-431, score-0.235]
90 The CVG’s ability to transfer the correct PP attachments is due to the semantic word vector similarity between the words in the sentences. [sent-432, score-0.125]
91 In contrast, the Stanford parser could not distinguish the PP attachments based on the word semantics. [sent-437, score-0.154]
92 Figure 5: Three binary composition matrices showing that head words dominate the composition. [sent-445, score-0.211]
93 5 Conclusion We introduced Compositional Vector Grammars (CVGs), a parsing model that combines the speed of small-state PCFGs with the semantic richness of neural word representations and compositional phrase vectors. [sent-448, score-0.473]
94 The compositional vectors are learned with a new syntactically untied recursive neural network. [sent-449, score-0.634]
95 This model is linguistically more plausible since it chooses different composition functions for a parent node based on the syntactic categories of its children. [sent-450, score-0.386]
96 A unified architecture for natural language processing: deep neural networks with multitask learning. [sent-510, score-0.258]
97 Towards incremental parsing of natural language using recursive neural networks. [sent-518, score-0.335]
98 Fast exact inference with a factored model for natural language parsing. [sent-611, score-0.13]
99 Wide coverage natural language processing using kernel methods and neural networks for structured data. [sent-656, score-0.203]
100 Learning continuous phrase representations and syntactic parsing with recursive neural networks. [sent-694, score-0.555]
wordName wordTfidf (topN-words)
[('cvg', 0.649), ('pcfg', 0.205), ('rnn', 0.188), ('neural', 0.146), ('untied', 0.14), ('socher', 0.134), ('factored', 0.13), ('rnns', 0.124), ('cvgs', 0.123), ('compositional', 0.116), ('composition', 0.114), ('discrete', 0.112), ('recursive', 0.111), ('parser', 0.105), ('representations', 0.1), ('pcfgs', 0.095), ('eats', 0.086), ('henderson', 0.083), ('pp', 0.083), ('stanford', 0.081), ('categories', 0.081), ('tree', 0.08), ('parsing', 0.078), ('vector', 0.076), ('parent', 0.075), ('parsers', 0.074), ('vp', 0.074), ('vectors', 0.071), ('continuous', 0.068), ('matrix', 0.066), ('matrices', 0.065), ('subgradient', 0.065), ('node', 0.064), ('wsj', 0.063), ('klein', 0.063), ('spaghetti', 0.062), ('network', 0.057), ('xi', 0.057), ('yi', 0.057), ('networks', 0.057), ('deep', 0.055), ('charniak', 0.054), ('kummerfeld', 0.054), ('backpropagation', 0.054), ('costa', 0.054), ('np', 0.053), ('syntactic', 0.052), ('syntactically', 0.05), ('attachments', 0.049), ('manning', 0.047), ('menchetti', 0.047), ('ratliff', 0.047), ('collins', 0.045), ('capture', 0.044), ('phrases', 0.044), ('children', 0.043), ('goller', 0.043), ('titov', 0.042), ('gt', 0.042), ('tied', 0.042), ('adagrad', 0.04), ('taskar', 0.04), ('cb', 0.038), ('attach', 0.038), ('beam', 0.037), ('child', 0.037), ('derivatives', 0.037), ('category', 0.037), ('base', 0.036), ('scoring', 0.035), ('diagonals', 0.035), ('frasconi', 0.035), ('kartsaklis', 0.035), ('nppp', 0.035), ('spnagnhset', 0.035), ('ssn', 0.035), ('udon', 0.035), ('untying', 0.035), ('vbznp', 0.035), ('ymax', 0.035), ('collobert', 0.034), ('trees', 0.034), ('objective', 0.034), ('richness', 0.033), ('petrov', 0.033), ('compute', 0.033), ('discriminatively', 0.032), ('turian', 0.032), ('head', 0.032), ('uchler', 0.031), ('grammar', 0.031), ('rn', 0.031), ('hence', 0.03), ('ideas', 0.029), ('mcclosky', 0.029), ('learns', 0.029), ('weights', 0.029), ('multiplied', 0.029), ('finkel', 0.029), ('grammars', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 275 acl-2013-Parsing with Compositional Vector Grammars
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem, at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information, such as PP attachments.
2 0.17578694 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
Author: Karl Moritz Hermann ; Phil Blunsom
Abstract: Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is a fundamental task of Natural Language Processing. In this paper we draw upon recent advances in the learning of vector space representations of sentential semantics and the transparent interface between syntax and semantics provided by Combinatory Categorial Grammar to introduce Combinatory Categorial Autoencoders. This model leverages the CCG combinatory operators to guide a non-linear transformation of meaning within a sentence. We use this model to learn high dimensional embeddings for sentences and evaluate them in a range of tasks, demonstrating that the incorporation of syntax allows a concise model to learn representations that are both effective and general.
3 0.14633545 44 acl-2013-An Empirical Examination of Challenges in Chinese Parsing
Author: Jonathan K. Kummerfeld ; Daniel Tse ; James R. Curran ; Dan Klein
Abstract: Aspects of Chinese syntax result in a distinctive mix of parsing challenges. However, the contribution of individual sources of error to overall difficulty is not well understood. We conduct a comprehensive automatic analysis of error types made by Chinese parsers, covering a broad range of error types for large sets of sentences, enabling the first empirical ranking of Chinese error types by their performance impact. We also investigate which error types are resolved by using gold part-of-speech tags, showing that improving Chinese tagging only addresses certain error types, leaving substantial outstanding challenges.
4 0.14584221 348 acl-2013-The effect of non-tightness on Bayesian estimation of PCFGs
Author: Shay B. Cohen ; Mark Johnson
Abstract: Probabilistic context-free grammars have the unusual property of not always defining tight distributions (i.e., the sum of the “probabilities” of the trees the grammar generates can be less than one). This paper reviews how this non-tightness can arise and discusses its impact on Bayesian estimation of PCFGs. We begin by presenting the notion of “almost everywhere tight grammars” and show that linear CFGs follow it. We then propose three different ways of reinterpreting non-tight PCFGs to make them tight, show that the Bayesian estimators in Johnson et al. (2007) are correct under one of them, and provide MCMC samplers for the other two. We conclude with a discussion of the impact of tightness empirically.
5 0.13906401 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
6 0.13512713 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
7 0.12582542 40 acl-2013-Advancements in Reordering Models for Statistical Machine Translation
8 0.12512089 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
9 0.11130268 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
10 0.11111302 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
11 0.10754169 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching
12 0.10540833 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
13 0.10197872 80 acl-2013-Chinese Parsing Exploiting Characters
14 0.10184218 294 acl-2013-Re-embedding words
15 0.10183837 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling
16 0.10073154 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics
17 0.097626381 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
18 0.096314259 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
19 0.093450405 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
20 0.086951315 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
topicId topicWeight
[(0, 0.21), (1, -0.093), (2, -0.107), (3, -0.005), (4, -0.134), (5, -0.021), (6, 0.083), (7, -0.033), (8, -0.047), (9, 0.044), (10, -0.008), (11, -0.009), (12, 0.162), (13, -0.164), (14, -0.018), (15, 0.114), (16, -0.086), (17, 0.01), (18, -0.039), (19, -0.132), (20, 0.029), (21, -0.034), (22, -0.047), (23, -0.007), (24, -0.018), (25, -0.065), (26, 0.066), (27, -0.061), (28, 0.037), (29, 0.063), (30, -0.037), (31, -0.021), (32, 0.023), (33, -0.063), (34, 0.024), (35, 0.005), (36, -0.013), (37, -0.006), (38, 0.059), (39, 0.046), (40, -0.002), (41, -0.042), (42, 0.07), (43, -0.04), (44, -0.042), (45, -0.006), (46, -0.024), (47, 0.017), (48, -0.004), (49, -0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.93455285 275 acl-2013-Parsing with Compositional Vector Grammars
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem, at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information, such as PP attachments.
2 0.68991715 349 acl-2013-The mathematics of language learning
Author: Andras Kornai ; Gerald Penn ; James Rogers ; Anssi Yli-Jyra
Abstract: unkown-abstract
3 0.6785621 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
Author: Karl Moritz Hermann ; Phil Blunsom
Abstract: Modelling the compositional process by which the meaning of an utterance arises from the meaning of its parts is a fundamental task of Natural Language Processing. In this paper we draw upon recent advances in the learning of vector space representations of sentential semantics and the transparent interface between syntax and semantics provided by Combinatory Categorial Grammar to introduce Combinatory Categorial Autoencoders. This model leverages the CCG combinatory operators to guide a non-linear transformation of meaning within a sentence. We use this model to learn high dimensional embeddings for sentences and evaluate them in a range of tasks, demonstrating that the incorporation of syntax allows a concise model to learn representations that are both effective and general.
4 0.64619774 294 acl-2013-Re-embedding words
Author: Igor Labutov ; Hod Lipson
Abstract: We present a fast method for re-purposing existing semantic word vectors to improve performance in a supervised task. Recently, with an increase in computing resources, it became possible to learn rich word embeddings from massive amounts of unlabeled data. However, some methods take days or weeks to learn good embeddings, and some are notoriously difficult to train. We propose a method that takes as input an existing embedding, some labeled data, and produces an embedding in the same space, but with a better predictive performance in the supervised task. We show improvement on the task of sentiment classification with respect to several baselines, and observe that the approach is most useful when the training set is sufficiently small.
5 0.59219736 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network
Author: Nan Yang ; Shujie Liu ; Mu Li ; Ming Zhou ; Nenghai Yu
Abstract: In this paper, we explore a novel bilingual word alignment approach based on DNN (Deep Neural Network), which has been proven to be very effective in various machine learning tasks (Collobert et al., 2011). We describe in detail how we adapt and extend the CD-DNNHMM (Dahl et al., 2012) method introduced in speech recognition to the HMMbased word alignment model, in which bilingual word embedding is discriminatively learnt to capture lexical translation information, and surrounding words are leveraged to model context information in bilingual sentences. While being capable to model the rich bilingual correspondence, our method generates a very compact model with much fewer parameters. Experiments on a large scale EnglishChinese word alignment task show that the proposed method outperforms the HMM and IBM model 4 baselines by 2 points in F-score.
6 0.59036177 35 acl-2013-Adaptation Data Selection using Neural Language Models: Experiments in Machine Translation
7 0.58671278 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
8 0.57634699 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning
9 0.56103897 313 acl-2013-Semantic Parsing with Combinatory Categorial Grammars
10 0.55484456 84 acl-2013-Combination of Recurrent Neural Networks and Factored Language Models for Code-Switching Language Modeling
11 0.55463099 38 acl-2013-Additive Neural Networks for Statistical Machine Translation
12 0.55195278 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit
13 0.54759103 32 acl-2013-A relatedness benchmark to test the role of determiners in compositional distributional semantics
14 0.51986307 176 acl-2013-Grounded Unsupervised Semantic Parsing
15 0.51791799 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models
16 0.50839078 219 acl-2013-Learning Entity Representation for Entity Disambiguation
17 0.50405711 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
18 0.4994489 22 acl-2013-A Structured Distributional Semantic Model for Event Co-reference
19 0.49593881 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
20 0.49223709 26 acl-2013-A Transition-Based Dependency Parser Using a Dynamic Parsing Strategy
topicId topicWeight
[(0, 0.054), (2, 0.159), (6, 0.046), (11, 0.061), (14, 0.018), (15, 0.019), (24, 0.031), (26, 0.063), (29, 0.021), (35, 0.088), (42, 0.077), (48, 0.083), (67, 0.024), (70, 0.079), (71, 0.014), (88, 0.023), (90, 0.024), (95, 0.047)]
simIndex simValue paperId paperTitle
1 0.95490152 261 acl-2013-Nonparametric Bayesian Inference and Efficient Parsing for Tree-adjoining Grammars
Author: Elif Yamangil ; Stuart M. Shieber
Abstract: In the line of research extending statistical parsing to more expressive grammar formalisms, we demonstrate for the first time the use of tree-adjoining grammars (TAG). We present a Bayesian nonparametric model for estimating a probabilistic TAG from a parsed corpus, along with novel block sampling methods and approximation transformations for TAG that allow efficient parsing. Our work shows performance improvements on the Penn Treebank and finds more compact yet linguistically rich representations of the data, but more importantly provides techniques in grammar transformation and statistical inference that make practical the use of these more expressive systems, thereby enabling further experimentation along these lines.
same-paper 2 0.86004019 275 acl-2013-Parsing with Compositional Vector Grammars
Author: Richard Socher ; John Bauer ; Christopher D. Manning ; Andrew Y. Ng
Abstract: Natural language parsing has typically been done with small sets of discrete categories such as NP and VP, but this representation does not capture the full syntactic or semantic richness of linguistic phrases, and attempts to improve on this by lexicalizing phrases or splitting categories only partly address the problem, at the cost of huge feature spaces and sparseness. Instead, we introduce a Compositional Vector Grammar (CVG), which combines PCFGs with a syntactically untied recursive neural network that learns syntactico-semantic, compositional vector representations. The CVG improves the PCFG of the Stanford Parser by 3.8% to obtain an F1 score of 90.4%. It is fast to train and, implemented approximately as an efficient reranker, is about 20% faster than the current Stanford factored parser. The CVG learns a soft notion of head words and improves performance on the types of ambiguities that require semantic information, such as PP attachments.
3 0.85172558 73 acl-2013-Broadcast News Story Segmentation Using Manifold Learning on Latent Topic Distributions
Author: Xiaoming Lu ; Lei Xie ; Cheung-Chi Leung ; Bin Ma ; Haizhou Li
Abstract: We present an efficient approach for broadcast news story segmentation using a manifold learning algorithm on latent topic distributions. The latent topic distribution estimated by Latent Dirichlet Allocation (LDA) is used to represent each text block. We employ Laplacian Eigenmaps (LE) to project the latent topic distributions into low-dimensional semantic representations while preserving the intrinsic local geometric structure. We evaluate two approaches employing LDA and probabilistic latent semantic analysis (PLSA) distributions respectively. The effects of different amounts of training data and different numbers of latent topics on the two approaches are studied. Experimental results show that our proposed LDA-based approach can outperform the corresponding PLSA-based approach. The proposed approach provides the best performance with the highest F1-measure of 0.7860.
4 0.83045769 4 acl-2013-A Context Free TAG Variant
Author: Ben Swanson ; Elif Yamangil ; Eugene Charniak ; Stuart Shieber
Abstract: We propose a new variant of Tree-Adjoining Grammar that allows adjunction of full wrapping trees but still bears only context-free expressivity. We provide a transformation to context-free form, and a further reduction in probabilistic model size through factorization and pooling of parameters. This collapsed context-free form is used to implement efficient grammar estimation and parsing algorithms. We perform parsing experiments on the Penn Treebank and draw comparisons to Tree Substitution Grammars and between different variations in probabilistic model design. Examination of the most probable derivations reveals examples of the linguistically relevant structure that our variant makes possible.
5 0.81733149 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
Author: Dan Garrette ; Jason Mielens ; Jason Baldridge
Abstract: Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.
6 0.75822836 57 acl-2013-Arguments and Modifiers from the Learner's Perspective
7 0.74396104 82 acl-2013-Co-regularizing character-based and word-based models for semi-supervised Chinese word segmentation
8 0.74295551 276 acl-2013-Part-of-Speech Induction in Dependency Trees for Statistical Machine Translation
9 0.74232966 225 acl-2013-Learning to Order Natural Language Texts
10 0.73872346 80 acl-2013-Chinese Parsing Exploiting Characters
11 0.73242551 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
12 0.73180783 132 acl-2013-Easy-First POS Tagging and Dependency Parsing with Beam Search
14 0.73007292 347 acl-2013-The Role of Syntax in Vector Space Models of Compositional Semantics
15 0.72982085 175 acl-2013-Grounded Language Learning from Video Described with Sentences
16 0.72917813 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
17 0.72870409 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing
18 0.72857195 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
19 0.72846174 318 acl-2013-Sentiment Relevance
20 0.72832131 224 acl-2013-Learning to Extract International Relations from Political Context