acl acl2011 acl2011-295 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. [sent-3, score-0.128]
2 The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. [sent-4, score-0.151]
3 The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. [sent-5, score-0.579]
4 Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively. [sent-6, score-0.026]
5 1 Introduction There has been significant interest recently in machine learning methods that induce generative models with high-dimensional hidden representations, including neural networks (Bengio et al. [sent-9, score-0.112]
6 , 2003; Collobert and Weston, 2008), Bayesian networks (Titov and Henderson, 2007a), and Deep Belief Networks (Hinton et al. [sent-10, score-0.062]
7 In this paper, we investigate how these models can be applied to dependency parsing. [sent-12, score-0.052]
8 We focus on Shift-Reduce transition-based parsing proposed by Nivre et al. [sent-13, score-0.066]
9 In this class of algorithms, at any given step, the parser has to choose among a set of possible actions, each representing an incremental modification to the partially built tree. [sent-15, score-0.058]
10 work, Ratnaparkhi (1999) proposed a maximum entropy model for transition-based constituency parsing. [sent-21, score-0.076]
11 Of these approaches, only ISBNs induce high-dimensional latent representations to encode parse history, but suffer from either very approximate or slow inference procedures. [sent-22, score-0.434]
12 We propose to address the problem of inference in a high-dimensional latent space by using an undirected graphical model, Restricted Boltzmann Machines (RBMs), to model the individual parsing decisions. [sent-23, score-0.523]
13 Unlike the Sigmoid Belief Networks (SBNs) used in ISBNs, RBMs have tractable inference procedures for both forward and backward reasoning, which allows us to efficiently infer both the probability of the decision given the latent variables and vice versa. [sent-24, score-0.638]
14 The key structural difference between the two models is that the directed connections between latent and decision vectors in SBNs become undirected in RBMs. [sent-25, score-0.571]
15 A complete parsing model consists of a sequence of RBMs interlinked via directed edges, which gives us a form of Temporal Restricted Boltzmann Machines (TRBM) (Taylor et al. [sent-26, score-0.15]
16 In this paper, we analyze and contrast ISBNs with TRBMs and show that the latter provide an accurate and theoretically sound model for parsing with high-dimensional latent variables. [sent-28, score-0.385]
17 2 An ISBN Parsing Model Our TRBM parser uses the same history-based probability model as the ISBN parser of Titov and Henderson (2007b): P(tree) = Π_t P(v_t | v_1, …, v_{t−1}). [sent-29, score-0.077]
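As a schematic of this history-based factorization (a sketch of ours, not code from the paper; `decision_prob` stands in for whatever per-step model supplies P(v_t | history)), the tree probability accumulates over the decision sequence:

```python
import math

def tree_log_prob(decisions, decision_prob):
    """log P(tree) = sum_t log P(v_t | v_1, ..., v_{t-1})."""
    logp, history = 0.0, []
    for v_t in decisions:
        logp += math.log(decision_prob(v_t, history))  # P(v_t | history)
        history.append(v_t)
    return logp
```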
18 Shaded nodes represent decision variables and ‘H’ represents a vector of latent variables. [sent-35, score-0.478]
19 W^(c)_{HH} denotes the weight matrix for the directed connection of type c between two latent vectors. [sent-36, score-0.343]
20 v_t is a parser decision of the type Left-Arc, Right-Arc, Reduce or Shift. [sent-37, score-0.292]
21 In the ISBN model shown in Figure 1, the decisions are shown as boxes and the sub-decisions as shaded circles. [sent-46, score-0.05]
22 At each decision step, the ISBN model also includes a vector of latent variables, denoted by ‘H’, which act as latent features of the parse history. [sent-47, score-0.631]
23 As explained in (Titov and Henderson, 2007b), the temporal connections between latent variables are constructed to take into account the structural locality in the partial dependency structure. [sent-48, score-0.732]
24 The model parameters are learned by backpropagating likelihood gradients. [sent-49, score-0.051]
25 Because decision probabilities are conditioned on the history, once a decision is made the corresponding variable becomes observed, or visible. [sent-50, score-0.162]
26 In an ISBN, the directed edges to these visible variables and the large numbers of heavily inter-connected latent variables make exact inference of decision probabilities intractable. [sent-51, score-0.908]
27 Titov and Henderson (2007a) proposed two approximation procedures for inference. [sent-52, score-0.039]
28 The first was a feed-forward approximation where latent variables were allowed to depend only on their parent variables, and hence did not take into account the current or future observations. [sent-53, score-0.587]
29 Due to this limitation, the authors proposed to make latent variables conditionally dependent also on a set of explicit features derived from the parsing history, specifically, the base features defined in (Nivre et al. [sent-54, score-0.507]
30 As shown in our experiments, this addition results in a big improvement for the parsing task. [sent-56, score-0.066]
31 The second approximate inference procedure, called the incremental mean field approximation, extended the feed-forward approximation by updating the current time step’s latent variables after each sub-decision. [sent-57, score-0.585]
32 Although this approximation is more accurate than the feed-forward one, there is no analytical way to maximize likelihood w.r.t. [sent-58, score-0.065]
33 the means of the latent variables, which requires an iterative numerical method and thus makes inference very slow, restricting the model to only shorter sentences. [sent-61, score-0.352]
34 3 Temporal Restricted Boltzmann Machines In the proposed TRBM model, RBMs provide an analytical way to do exact inference within each time step. [sent-62, score-0.071]
35 Although information passing between time steps is still approximated, TRBM inference is more accurate than the ISBN approximations. [sent-63, score-0.097]
36 3.1 Restricted Boltzmann Machines (RBM) An RBM is an undirected graphical model with a set of binary visible variables v, a set of binary latent variables h, and a weight matrix W for bipartite connections between v and h. [sent-65, score-0.965]
37 Given the visible variables, the latent variables are conditionally independent of each other, and vice versa: p(h_j = 1|v) = σ(b_j + Σ_i v_i w_ij) (1), p(v_i = 1|h) = σ(a_i + Σ_j h_j w_ij) (2), where σ(x) = 1/(1 + e^{−x}) (the logistic sigmoid). [sent-67, score-0.581]
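To make equations 1 and 2 concrete, here is a minimal NumPy sketch (our own illustration; the toy sizes and random weights are arbitrary) of the two conditional distributions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))   # w_ij: visible i -> latent j
a, b = np.zeros(5), np.zeros(3)          # visible / latent biases

v = rng.integers(0, 2, size=5).astype(float)

p_h = sigmoid(b + v @ W)                 # equation 1: p(h_j = 1 | v)
h = (rng.random(3) < p_h).astype(float)  # sample the latent layer
p_v = sigmoid(a + W @ h)                 # equation 2: p(v_i = 1 | h)
```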
38 , 2006) and TRBMs for modeling motion capture data (Taylor et al. [sent-69, score-0.03]
39 , 2007) can be used to model sequences where the decision at each step requires some context information from the past. [sent-75, score-0.122]
40 The directed temporal connections between time steps contribute a bias to the latent layer inference in the current step. [sent-78, score-0.742]
41 Figure 2 shows our proposed TRBM model with latent-to-latent connections between time steps. [sent-79, score-0.656]
42 Each step has an RBM with weights W_RBM composed of smaller weight matrices corresponding to different sub-decisions. [sent-80, score-0.035]
43 For instance, for the action Left-Arc, W_RBM consists of RBM weights between the latent vector and the sub-decisions: “Left-Arc” and “Label”. [sent-81, score-0.256]
44 , h^(C)), where v_1^T denotes the set of visible vectors from time steps 1 to T, i.e. [sent-86, score-0.166]
45 h^(c) denotes the latent vector in the past time step that is connected to the current latent vector through a connection of type c. [sent-90, score-0.602]
46 μ_j^t ≈ σ(b_j + Σ_c Σ_l w^(c)_{H(c)H, lj} μ_l^(c) + Σ_i v_i^t w_ij) (4) Here, μ denotes the mean of the corresponding latent variable. [sent-92, score-0.284]
47 To keep inference tractable, we do not do any backward reasoning across directed connections to update μ^(c). [sent-93, score-0.304]
48 Thus, the inference procedure for latent variables takes into account both the parse history and the current observation, but no future observations. [sent-94, score-0.651]
49 The limited set of possible values for the visible layer makes it possible to marginalize out latent variables in linear time to compute the exact likelihood. [sent-95, score-0.624]
50 Let v^t(k) denote a vector with v_k^t = 1 and v_i^t = 0 for i ≠ k. [sent-96, score-0.043]
51 where v_i^t and h_j^t denote the i-th visible and j-th latent variable respectively at time step t. [sent-103, score-0.572]
52 h_l^(c) denotes a latent variable in the past time step, and w^(c)_{H(c)H, lj} the corresponding connection weight. [sent-104, score-0.028]
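Because the visible layer is one-hot, the latent units can be summed out in closed form, giving the exact decision distribution in O(K·J) time for K decisions and J latent units. The following sketch (our reconstruction of the computation behind equation 5, not the paper's code) uses the standard identity Σ_h p(v(k), h) ∝ exp(a_k + Σ_j log(1 + exp(b′_j + w_kj))):

```python
import numpy as np

def decision_log_probs(W, a, b_prime):
    """Exact log p(v^t(k) | history) for all K decisions.

    W: (K, J) weights to the one-hot visible layer; a: (K,) visible
    biases; b_prime: (J,) latent biases b'_j including the temporal
    bias from past latent means. Latent units factorize, so each is
    marginalized analytically via softplus = logaddexp(0, .).
    """
    scores = a + np.logaddexp(0.0, b_prime[None, :] + W).sum(axis=1)
    return scores - np.logaddexp.reduce(scores)  # normalize over k
```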
53 Section 3.1 describes an RBM where visible variables can take binary values. [sent-107, score-0.3]
54 , 2007), we have multi-valued visible variables which we represent as one-hot binary vectors and model via a softmax distribution: p(v_k^t = 1|h^t) = exp(a_k + Σ_j h_j^t w_kj) / Σ_i exp(a_i + Σ_j h_j^t w_ij) (3) Latent variable inference is similar to equation 1 with an additional bias due to the temporal connections. [sent-109, score-0.542]
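A sketch of the corresponding mean-field update (our reading of equation 4; the names are ours): the latent means combine the static bias, a temporal bias from each connection type c, and the current one-hot observation, with past means held fixed:

```python
import numpy as np

def latent_means(v_t, W, b, past_mus, temporal_Ws):
    """mu_j^t ~= sigma(b_j + sum_c mu^(c) W_HH^(c) + v^t W), eq. 4 style.

    v_t: (K,) one-hot decision; W: (K, J) RBM weights; b: (J,) biases;
    past_mus / temporal_Ws: per-connection-type past latent means (J,)
    and directed weight matrices (J, J). No backward update of mu^(c).
    """
    bias = b + sum(mu @ Wc for mu, Wc in zip(past_mus, temporal_Ws))
    return 1.0 / (1.0 + np.exp(-(bias + v_t @ W)))
```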
55 4 TRBM Training The gradient of an RBM is given by: ∂ log p(v)/∂w_ij = ⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model (6) where ⟨·⟩_d denotes the expectation under distribution d. [sent-115, score-0.07]
56 In general, computing the exact gradient is intractable, and previous work proposed a Contrastive Divergence (CD) based learning procedure that approximates the above gradient using only a one-step reconstruction (Hinton, 2002). [sent-116, score-0.153]
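For reference, a generic CD-1 update for a binary RBM looks like the sketch below (standard Hinton-style CD, not the paper's training code; the paper instead uses the exact per-step gradient of equation 7):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step: approximate the model expectation in equation 6
    with a single Gibbs reconstruction (Hinton, 2002)."""
    p_h0 = sigmoid(b + v0 @ W)                        # positive phase
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    v1 = sigmoid(a + W @ h0)                          # reconstruction means
    p_h1 = sigmoid(b + v1 @ W)
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a += lr * (v0 - v1)
    b += lr * (p_h0 - p_h1)
```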
57 ∂ log p(v^t(k)|history^t) / ∂w_ij = (δ_ki − p(v^t(i)|history^t)) σ(b′_j + w_ij) (7) Further, the weights on the temporal connections are learned by back-propagating the likelihood gradients through the directed links between steps. [sent-119, score-0.32]
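In code, the exact gradient of equation 7 for the one-hot visible layer is cheap to evaluate (our sketch, matching the reconstructed equation above; `probs` comes from the equation-5 computation):

```python
import numpy as np

def softmax_visible_gradient(k, probs, b_prime, W):
    """d log p(v^t(k)|history) / d w_ij = (delta_ki - p(v^t(i)|history))
    * sigma(b'_j + w_ij), i.e. equation 7, for all i, j at once."""
    delta = np.zeros_like(probs)
    delta[k] = 1.0
    sig = 1.0 / (1.0 + np.exp(-(b_prime[None, :] + W)))  # (K, J)
    return (delta - probs)[:, None] * sig
```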
58 The back-propagated gradient from future time steps is also used to train the current RBM weights. [sent-120, score-0.095]
59 However, unlike their model, we do not use CD at each step to compute gradients. [sent-123, score-0.035]
60 Given a derivation prefix, its partial parse tree and associated TRBM, the decoder adds a step to the TRBM for calculating the probabilities of hypothesized next decisions using equation 5. [sent-126, score-0.097]
61 If the decoder selects a decision for addition to the candidate list, then the current step’s latent variable means are inferred using equation 4, given that the chosen decision is now visible. [sent-127, score-0.475]
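The decoding loop can be sketched as a generic beam search (our illustration; `expand`, `is_final`, and the state object are placeholders for the parser's transition system): each expansion scores hypothesized decisions with equation 5 and, once a decision joins the candidate list, fixes it as visible and updates the latent means with equation 4.

```python
import heapq

def beam_decode(initial_state, beam_size, expand):
    """Keep the beam_size most probable derivations at each step.
    expand(state) yields (log_prob, next_state) pairs."""
    beam = [(0.0, 0, initial_state)]  # (neg log prob, tiebreak, state)
    uid = 1
    while not all(state.is_final() for _, _, state in beam):
        candidates = []
        for neg_lp, _, state in beam:
            if state.is_final():
                candidates.append((neg_lp, uid, state)); uid += 1
                continue
            for lp, nxt in expand(state):
                candidates.append((neg_lp - lp, uid, nxt)); uid += 1
        beam = heapq.nsmallest(beam_size, candidates)
    return min(beam)[2]  # best complete derivation
```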
62 4 Experiments & Results We used syntactic dependencies from the English section of the CoNLL 2009 shared task dataset (Hajič et al. [sent-129, score-0.044]
63 Row a shows that a simple ISBN model without features, using the feed-forward inference procedure, does not work well. [sent-137, score-0.172]
64 As explained in section 2, this is expected since in the absence of explicit features, the latent variables in a given layer do not take into account the observations in the previous layers. [sent-138, score-0.513]
65 The huge improvement in performance on adding the features (row b) shows that the feed-forward inference procedure for ISBNs relies heavily on these feature connections to compensate for the lack of backward inference. [sent-139, score-0.381]
66 The TRBM model avoids this problem as the inference procedure takes into account the current observation, which makes the latent variables much more informed. [sent-140, score-0.602]
67 However, as row c shows, the TRBM model without features falls a bit short of the ISBN performance, indicating that features are indeed a powerful substitute for backward inference in sequential latent variable models. [sent-141, score-0.478]
68 TRBM models would still be preferred in cases where such feature engineering is difficult or expensive, or where the objective is to compute the latent features themselves. [sent-142, score-0.256]
69 The improved inference in TRBM does however come at the cost of increased training and testing time. [sent-144, score-0.071]
70 We also tried a Contrastive Divergence based training procedure for TRBM instead of equation 7, but that resulted in an LAS about 10% (absolute) lower. [sent-151, score-0.064]
71 Further, we also tried a very simple model without latent variables where temporal connections are between decision variables themselves. [sent-152, score-0.898]
72 46%, which indicates that without latent variables, it is very difficult to capture the parse history. [sent-154, score-0.288]
73 For comparison, we also include the performance numbers for some state-of-the-art dependency parsing systems. [sent-155, score-0.118]
74 Surdeanu and Manning (2010) compare different parsing models using CoNLL 2008 shared task dataset (Surdeanu et al. [sent-156, score-0.11]
75 Row j shows the best syntactic model in the CoNLL 2009 shared task. [sent-159, score-0.069]
76 2 Latent Layer Analysis We analyzed the latent layers in our models to see if they captured semantic patterns. [sent-164, score-0.256]
77 A latent layer is a vector of 100 latent variables. [sent-165, score-0.58]
78 Every Shift operation gives a latent representation for the corresponding word. [sent-166, score-0.256]
79 We took all the verbs in the development set (footnote 2) and partitioned their representations into 50 clusters using the k-means algorithm. [sent-167, score-0.037]
80 The partitions look semantically meaningful but to get a quantitative analysis, we computed pairwise semantic similarity between all word pairs in a given cluster and aggregated this number over all the clusters. [sent-169, score-0.032]
81 The semantic similarity was calculated using two different similarity measures on the WordNet corpus (Miller et al. [sent-170, score-0.064]
82 path similarity is a score between 0 and 1, equal to the inverse of the shortest path length between the two word senses. [sent-172, score-0.032]
83 lin similarity (Lin, 1998) is a score between 0 and 1 based on the Information Content of the two word senses and of the Least Common Subsumer. [sent-173, score-0.032]
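A possible implementation of this evaluation, using scikit-learn's k-means and NLTK's WordNet interface (our tooling choice; the paper does not name libraries, and restricting to first verb senses is a simplification):

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from nltk.corpus import wordnet as wn

def mean_cluster_similarity(verbs, vectors, n_clusters=50, seed=0):
    """Cluster latent verb vectors, then average pairwise WordNet
    path similarity between first verb senses within each cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(vectors)
    sims = []
    for c in range(n_clusters):
        words = [w for w, l in zip(verbs, labels) if l == c]
        for w1, w2 in combinations(words, 2):
            s1 = wn.synsets(w1, pos=wn.VERB)
            s2 = wn.synsets(w2, pos=wn.VERB)
            if s1 and s2:
                sim = s1[0].path_similarity(s2[0])
                if sim is not None:
                    sims.append(sim)
    return float(np.mean(sims)) if sims else 0.0
```

For the lin measure, NLTK's `lin_similarity` additionally needs an information-content corpus (e.g. `wordnet_ic.ic('ic-brown.dat')`) passed as a second argument.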
84 We observe that TRBM latent representations give a slightly better clustering than ISBN models. [sent-175, score-0.293]
85 Again, this is because the inference procedure in TRBMs takes the current observation into account. [sent-176, score-0.161]
86 However, at the same time, the similarity numbers for ISBN with features (footnote 2: verbs are words corresponding to POS tags VB, VBD, VBG, VBN, VBP, VBZ) [sent-177, score-0.032]
87 Footnote 3: To account for randomness in k-means clustering, the clustering was performed 10 times with random initializations; similarity scores were computed for each run and a mean was taken. [sent-179, score-0.061]
88 are not very low, which shows that features are a powerful way to compensate for the lack of backward inference. [sent-185, score-0.081]
89 This is in agreement with their good performance on the parsing task. [sent-186, score-0.066]
90 5 Conclusions & Future Work We have presented a Temporal Restricted Boltzmann Machines based model for dependency parsing. [sent-187, score-0.077]
91 The model shows how undirected graphical models can be used to generate latent representations of local parsing actions, which can then be used as features for later decisions. [sent-188, score-0.489]
92 The TRBM model for dependency parsing could be extended to a Deep Belief Network by adding one more latent layer on top of the existing one (Hinton et al. [sent-189, score-0.467]
93 Parser latent representations could also help other tasks such as Semantic Role Labeling (Henderson et al. [sent-193, score-0.293]
94 A unified architecture for natural language processing: Deep neural networks with multitask learning. [sent-219, score-0.112]
95 The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. [sent-235, score-0.044]
96 A latent variable model of synchronous parsing for syntactic and semantic dependencies. [sent-257, score-0.385]
97 Training products of experts by minimizing contrastive divergence. [sent-273, score-0.036]
98 Learning to parse natural language with maximum entropy models. [sent-347, score-0.032]
99 The CoNLL-2008 shared task on joint parsing of syntactic and semantic dependencies. [sent-378, score-0.11]
100 Fast and robust multilingual dependency parsing with a generative latent variable model. [sent-407, score-0.412]
wordName wordTfidf (topN-words)
[('trbm', 0.586), ('rbm', 0.282), ('latent', 0.256), ('vt', 0.204), ('isbn', 0.203), ('boltzmann', 0.172), ('variables', 0.16), ('hinton', 0.145), ('visible', 0.14), ('historyt', 0.13), ('rbms', 0.13), ('henderson', 0.122), ('connections', 0.119), ('temporal', 0.116), ('isbns', 0.108), ('nivre', 0.102), ('titov', 0.1), ('trbms', 0.087), ('belief', 0.075), ('undirected', 0.075), ('bj', 0.072), ('inference', 0.071), ('salakhutdinov', 0.07), ('wij', 0.07), ('layer', 0.068), ('parsing', 0.066), ('hjt', 0.065), ('machines', 0.062), ('decision', 0.062), ('networks', 0.062), ('directed', 0.059), ('restricted', 0.057), ('backward', 0.055), ('sigmoid', 0.054), ('mnih', 0.053), ('dependency', 0.052), ('conll', 0.052), ('ch', 0.051), ('neural', 0.05), ('surdeanu', 0.05), ('shared', 0.044), ('hch', 0.043), ('ivitwij', 0.043), ('ljh', 0.043), ('recurrent', 0.043), ('sutskever', 0.043), ('vkt', 0.043), ('wrbm', 0.043), ('feed', 0.042), ('history', 0.042), ('gradient', 0.042), ('deep', 0.041), ('haji', 0.039), ('approximation', 0.039), ('highdimensional', 0.038), ('sbns', 0.038), ('vit', 0.038), ('unige', 0.038), ('variable', 0.038), ('representations', 0.037), ('las', 0.036), ('contrastive', 0.036), ('collobert', 0.035), ('twelfth', 0.035), ('step', 0.035), ('forward', 0.034), ('procedure', 0.034), ('nilsson', 0.033), ('shift', 0.033), ('johansson', 0.033), ('row', 0.033), ('garg', 0.033), ('softmax', 0.033), ('parse', 0.032), ('incremental', 0.032), ('similarity', 0.032), ('hall', 0.03), ('motion', 0.03), ('equation', 0.03), ('actions', 0.03), ('graphical', 0.03), ('taylor', 0.029), ('account', 0.029), ('denotes', 0.028), ('bengio', 0.028), ('current', 0.027), ('lc', 0.026), ('meyers', 0.026), ('thirteenth', 0.026), ('compensate', 0.026), ('likelihood', 0.026), ('parser', 0.026), ('steps', 0.026), ('arquez', 0.026), ('geneva', 0.026), ('switzerland', 0.026), ('uas', 0.026), ('model', 0.025), ('conditionally', 0.025), ('shaded', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
2 0.16620846 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
Author: Ivan Titov
Abstract: We consider a semi-supervised setting for domain adaptation where only unlabeled data is available for the target domain. One way to tackle this problem is to train a generative model with latent variables on the mixture of data from the source and target domains. Such a model would cluster features in both domains and ensure that at least some of the latent variables are predictive of the label on the source domain. The danger is that these predictive clusters will consist of features specific to the source domain only and, consequently, a classifier relying on such clusters would perform badly on the target domain. We introduce a constraint enforcing that marginal distributions of each cluster (i.e., each latent variable) do not vary significantly across domains. We show that this constraint is effec- tive on the sentiment classification task (Pang et al., 2002), resulting in scores similar to the ones obtained by the structural correspondence methods (Blitzer et al., 2007) without the need to engineer auxiliary tasks.
3 0.14438005 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
Author: Jinho D. Choi ; Martha Palmer
Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-theart performance with respect to other parsing approaches evaluated on the same data set.
4 0.099477679 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
Author: Gholamreza Haffari ; Marzieh Razavi ; Anoop Sarkar
Abstract: We combine multiple word representations based on semantic clusters extracted from the (Brown et al., 1992) algorithm and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005). We also provide an ensemble method for combining diverse cluster-based models. The two contributions together significantly improves unlabeled dependency accuracy from 90.82% to 92. 13%.
5 0.096864484 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
Author: Yue Zhang ; Joakim Nivre
Abstract: Transition-based dependency parsers generally use heuristic decoding algorithms but can accommodate arbitrarily rich feature representations. In this paper, we show that we can improve the accuracy of such parsers by considering even richer feature sets than those employed in previous systems. In the standard Penn Treebank setup, our novel features improve attachment score form 91.4% to 92.9%, giving the best results so far for transitionbased parsing and rivaling the best results overall. For the Chinese Treebank, they give a signficant improvement of the state of the art. An open source release of our parser is freely available.
6 0.088643454 279 acl-2011-Semi-supervised latent variable models for sentence-level sentiment analysis
7 0.084902167 167 acl-2011-Improving Dependency Parsing with Semantic Classes
8 0.081091173 18 acl-2011-A Latent Topic Extracting Method based on Events in a Document and its Application
9 0.07536862 294 acl-2011-Temporal Evaluation
10 0.074936129 204 acl-2011-Learning Word Vectors for Sentiment Analysis
11 0.068577804 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
12 0.066582374 198 acl-2011-Latent Semantic Word Sense Induction and Disambiguation
13 0.065714955 150 acl-2011-Hierarchical Text Classification with Latent Concepts
14 0.064556733 10 acl-2011-A Discriminative Model for Joint Morphological Disambiguation and Dependency Parsing
15 0.062025335 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
16 0.060342088 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
17 0.059645858 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
18 0.059216876 282 acl-2011-Shift-Reduce CCG Parsing
19 0.058181778 161 acl-2011-Identifying Word Translations from Comparable Corpora Using Latent Topic Models
20 0.056830537 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
topicId topicWeight
[(0, 0.148), (1, 0.008), (2, -0.036), (3, -0.114), (4, -0.0), (5, -0.04), (6, -0.006), (7, 0.102), (8, 0.031), (9, 0.018), (10, 0.062), (11, 0.01), (12, 0.072), (13, -0.013), (14, 0.008), (15, 0.011), (16, -0.019), (17, -0.035), (18, -0.008), (19, -0.028), (20, -0.047), (21, -0.022), (22, 0.058), (23, -0.004), (24, 0.022), (25, -0.071), (26, -0.004), (27, -0.073), (28, -0.056), (29, -0.018), (30, -0.038), (31, 0.012), (32, -0.002), (33, 0.035), (34, -0.001), (35, -0.029), (36, -0.038), (37, 0.101), (38, 0.046), (39, 0.064), (40, 0.002), (41, -0.042), (42, 0.007), (43, -0.004), (44, -0.084), (45, -0.039), (46, -0.045), (47, -0.032), (48, 0.061), (49, -0.078)]
simIndex simValue paperId paperTitle
same-paper 1 0.92553133 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
2 0.76344699 107 acl-2011-Dynamic Programming Algorithms for Transition-Based Dependency Parsers
Author: Marco Kuhlmann ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We develop a general dynamic programming technique for the tabulation of transition-based dependency parsers, and apply it to obtain novel, polynomial-time algorithms for parsing with the arc-standard and arc-eager models. We also show how to reverse our technique to obtain new transition-based dependency parsers from existing tabular methods. Additionally, we provide a detailed discussion of the conditions under which the feature models commonly used in transition-based parsing can be integrated into our algorithms.
3 0.74637341 143 acl-2011-Getting the Most out of Transition-based Dependency Parsing
Author: Jinho D. Choi ; Martha Palmer
Abstract: This paper suggests two ways of improving transition-based, non-projective dependency parsing. First, we add a transition to an existing non-projective parsing algorithm, so it can perform either projective or non-projective parsing as needed. Second, we present a bootstrapping technique that narrows down discrepancies between gold-standard and automatic parses used as features. The new addition to the algorithm shows a clear advantage in parsing speed. The bootstrapping technique gives a significant improvement to parsing accuracy, showing near state-of-theart performance with respect to other parsing approaches evaluated on the same data set.
4 0.68103677 186 acl-2011-Joint Training of Dependency Parsing Filters through Latent Support Vector Machines
Author: Colin Cherry ; Shane Bergsma
Abstract: Graph-based dependency parsing can be sped up significantly if implausible arcs are eliminated from the search-space before parsing begins. State-of-the-art methods for arc filtering use separate classifiers to make pointwise decisions about the tree; they label tokens with roles such as root, leaf, or attaches-tothe-left, and then filter arcs accordingly. Because these classifiers overlap substantially in their filtering consequences, we propose to train them jointly, so that each classifier can focus on the gaps of the others. We integrate the various pointwise decisions as latent variables in a single arc-level SVM classifier. This novel framework allows us to combine nine pointwise filters, and adjust their sensitivity using a shared threshold based on arc length. Our system filters 32% more arcs than the independently-trained classifiers, without reducing filtering speed. This leads to faster parsing with no reduction in accuracy.
Author: Gholamreza Haffari ; Marzieh Razavi ; Anoop Sarkar
Abstract: We combine multiple word representations based on semantic clusters extracted from the (Brown et al., 1992) algorithm and syntactic clusters obtained from the Berkeley parser (Petrov et al., 2006) in order to improve discriminative dependency parsing in the MSTParser framework (McDonald et al., 2005). We also provide an ensemble method for combining diverse cluster-based models. The two contributions together significantly improves unlabeled dependency accuracy from 90.82% to 92. 13%.
6 0.65498471 243 acl-2011-Partial Parsing from Bitext Projections
7 0.63223118 309 acl-2011-Transition-based Dependency Parsing with Rich Non-local Features
8 0.61770296 230 acl-2011-Neutralizing Linguistically Problematic Annotations in Unsupervised Dependency Parsing Evaluation
9 0.60402155 236 acl-2011-Optimistic Backtracking - A Backtracking Overlay for Deterministic Incremental Parsing
10 0.59606892 342 acl-2011-full-for-print
11 0.57904398 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
12 0.57122087 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
13 0.56926191 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
14 0.55007148 150 acl-2011-Hierarchical Text Classification with Latent Concepts
15 0.54445273 167 acl-2011-Improving Dependency Parsing with Semantic Classes
16 0.52619314 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
17 0.51585102 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
18 0.50396824 234 acl-2011-Optimal Head-Driven Parsing Complexity for Linear Context-Free Rewriting Systems
19 0.47954872 282 acl-2011-Shift-Reduce CCG Parsing
20 0.47600678 210 acl-2011-Lexicographic Semirings for Exact Automata Encoding of Sequence Models
topicId topicWeight
[(5, 0.023), (15, 0.017), (17, 0.037), (26, 0.026), (37, 0.137), (39, 0.021), (41, 0.099), (44, 0.227), (55, 0.027), (59, 0.058), (72, 0.028), (88, 0.021), (91, 0.033), (96, 0.13), (97, 0.019), (98, 0.01)]
simIndex simValue paperId paperTitle
1 0.82895666 95 acl-2011-Detection of Agreement and Disagreement in Broadcast Conversations
Author: Wen Wang ; Sibel Yaman ; Kristin Precoda ; Colleen Richey ; Geoffrey Raymond
Abstract: We present Conditional Random Fields based approaches for detecting agreement/disagreement between speakers in English broadcast conversation shows. We develop annotation approaches for a variety of linguistic phenomena. Various lexical, structural, durational, and prosodic features are explored. We compare the performance when using features extracted from automatically generated annotations against that when using human annotations. We investigate the efficacy of adding prosodic features on top of lexical, structural, and durational features. Since the training data is highly imbalanced, we explore two sampling approaches, random downsampling and ensemble downsampling. Overall, our approach achieves 79.2% (precision), 50.5% (recall), 61.7% (F1) for agreement detection and 69.2% (precision), 46.9% (recall), and 55.9% (F1) for disagreement detection, on the English broadcast conversation data.
same-paper 2 0.82057309 295 acl-2011-Temporal Restricted Boltzmann Machines for Dependency Parsing
Author: Nikhil Garg ; James Henderson
Abstract: We propose a generative model based on Temporal Restricted Boltzmann Machines for transition based dependency parsing. The parse tree is built incrementally using a shiftreduce parse and an RBM is used to model each decision step. The RBM at the current time step induces latent features with the help of temporal connections to the relevant previous steps which provide context information. Our parser achieves labeled and unlabeled attachment scores of 88.72% and 91.65% respectively, which compare well with similar previous models and the state-of-the-art.
3 0.81543404 150 acl-2011-Hierarchical Text Classification with Latent Concepts
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
Abstract: Recently, hierarchical text classification has become an active research topic. The essential idea is that the descendant classes can share the information of the ancestor classes in a predefined taxonomy. In this paper, we claim that each class has several latent concepts and its subclasses share information with these different concepts respectively. Then, we propose a variant Passive-Aggressive (PA) algorithm for hierarchical text classification with latent concepts. Experimental results show that the performance of our algorithm is competitive with the recently proposed hierarchical classification algorithms.
4 0.81095755 135 acl-2011-Faster and Smaller N-Gram Language Models
Author: Adam Pauls ; Dan Klein
Abstract: N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
5 0.77652776 12 acl-2011-A Generative Entity-Mention Model for Linking Entities with Knowledge Base
Author: Xianpei Han ; Le Sun
Abstract: Linking entities with knowledge base (entity linking) is a key issue in bridging the textual data with the structural knowledge base. Due to the name variation problem and the name ambiguity problem, the entity linking decisions are critically depending on the heterogenous knowledge of entities. In this paper, we propose a generative probabilistic model, called entity-mention model, which can leverage heterogenous entity knowledge (including popularity knowledge, name knowledge and context knowledge) for the entity linking task. In our model, each name mention to be linked is modeled as a sample generated through a three-step generative story, and the entity knowledge is encoded in the distribution of entities in document P(e), the distribution of possible names of a specific entity P(s|e), and the distribution of possible contexts of a specific entity P(c|e). To find the referent entity of a name mention, our method combines the evidences from all the three distributions P(e), P(s|e) and P(c|e). Experimental results show that our method can significantly outperform the traditional methods.
6 0.69201446 265 acl-2011-Reordering Modeling using Weighted Alignment Matrices
7 0.68549335 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
8 0.68253672 103 acl-2011-Domain Adaptation by Constraining Inter-Domain Variability of Latent Feature Representation
9 0.68130386 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
10 0.6760056 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation
11 0.67438674 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
12 0.67416656 311 acl-2011-Translationese and Its Dialects
13 0.67111325 73 acl-2011-Collective Classification of Congressional Floor-Debate Transcripts
14 0.67104781 292 acl-2011-Target-dependent Twitter Sentiment Classification
15 0.67010683 92 acl-2011-Data point selection for cross-language adaptation of dependency parsers
16 0.66974461 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
17 0.66931665 277 acl-2011-Semi-supervised Relation Extraction with Large-scale Word Clustering
18 0.66895133 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation
19 0.66864264 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
20 0.66705358 289 acl-2011-Subjectivity and Sentiment Analysis of Modern Standard Arabic