emnlp emnlp2011 emnlp2011-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for multilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
Reference: text
sentIndex sentText sentNum sentScore
1 We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. [sent-4, score-0.322]
2 In this model, inference on sequences is modeled as cascaded decision. [sent-5, score-0.192]
3 However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. [sent-6, score-0.583]
4 It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. [sent-7, score-0.238]
5 We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. [sent-8, score-0.29]
6 We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method. [sent-11, score-0.348]
7 Sequence labeling is the simplest subclass of structured prediction problems. [sent-15, score-0.257]
8 In sequence labeling, the most likely one among all the possible label sequences is predicted for a given input. [sent-16, score-0.297]
9 Although sequence labeling is the simplest subclass, a lot of real-world tasks are modeled as problems of this simplest subclass. [sent-17, score-0.405]
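To make the prediction problem described above concrete, the following is a minimal Viterbi sketch for a generic linear-chain model; the array shapes and score parameterization are illustrative assumptions, not code from the paper.

```python
import numpy as np

def viterbi(node_score, trans_score):
    """node_score: (T, L) log-potentials per position and label;
    trans_score: (L, L) log-potentials for label transitions.
    Returns the most likely label sequence as a list of label indices."""
    T, L = node_score.shape
    best = np.zeros((T, L))
    back = np.zeros((T, L), dtype=int)
    best[0] = node_score[0]
    for t in range(1, T):
        # cand[i, j]: best score ending with label i at t-1 followed by label j at t.
        cand = best[t - 1][:, None] + trans_score + node_score[t][None, :]
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0)
    path = [int(best[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```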
10 Many models have been proposed for sequence labeling tasks, such as Hidden Markov Models (HMM), Conditional Random Fields (CRF) (Lafferty et al. [sent-19, score-0.301]
11 A cascade of predictions here means the situation in which some of the predictions are made based upon the results of other predictions. [sent-25, score-0.31]
12 For example, in NLP, we perform named entity recognition or base-phrase chunking for given sentences based on part-of-speech (POS) labels predicted by another sequence labeler. [sent-27, score-0.436]
13 Therefore, many tasks in NLP are modeled as a cascade of sequence predictions. [sent-29, score-0.319]
14 If a prediction is based upon the result of another prediction, we call the former upper stage and the latter lower stage. [sent-30, score-0.44]
15 Methods pursued for a cascade of predictions (including sequence predictions, of course) are desired to provide certain types of capability. [sent-31, score-0.384]
16 Another is backward information propagation, that is, the rich annotated data on an upper stage should affect the models on lower stages retroactively. [sent-36, score-0.638]
17 Many current systems for a cascade of sequence predictions adopt a simple 1-best feed-forward approach. [sent-37, score-0.384]
18 They simply take the most likely output at each prediction stage and transfer it to the next upper stage. [sent-38, score-0.473]
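A minimal sketch of this 1-best feed-forward cascade; `pos_tagger` and `chunker` are hypothetical stand-ins for independently trained sequence labelers, and only the data flow is shown.

```python
def one_best_cascade(words, pos_tagger, chunker):
    # Lower stage: keep only the single most likely POS sequence.
    pos_tags = pos_tagger.predict(words)            # e.g. ["PRP", "VBZ", "DT", ...]
    # Upper stage: the chunker sees the 1-best tags as if they were observed input,
    # so any POS error is silently propagated upward.
    return chunker.predict(list(zip(words, pos_tags)))
```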
19 Such a framework can maximize reusability of existing sequence labeling systems. [sent-39, score-0.301]
20 The essence of this orientation is that the labeler on an upper stage utilizes the information of all the possible output candidates on lower stages. [sent-44, score-0.673]
21 However, the size of the output space can become quite large in sequence labeling. [sent-45, score-0.23]
22 (2006), a cascade of sequence predictions is viewed as a Bayesian network, and sample sequences are drawn at each stage according to the output distribution. [sent-49, score-0.599]
23 In the method proposed in Bunescu (2008), an upper labeler uses the probabilities marginalized on the parts of the output sequences on lower stages as weights for the features. [sent-51, score-0.857]
24 The weighted features are integrated in the model of the labeler on the upper stage. [sent-52, score-0.369]
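The contrast with the 1-best pipeline can be sketched as follows; the fragment encoding and the `marginals` dictionary are illustrative assumptions, not the formulation used in the cited work.

```python
def marginalized_feature(fragment, marginals):
    """Value of an upper-stage feature that fires on a lower-stage fragment,
    weighted by that fragment's marginal probability under the lower model P1."""
    return marginals.get(fragment, 0.0)

# Instead of a hard 0/1 value from a single predicted tag, the chunker feature
# "POS at position 3 is NN" contributes P1(tag_3 = NN | x), e.g. 0.8:
value = marginalized_feature((3, "NN"), {(3, "NN"): 0.8, (3, "JJ"): 0.2})
```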
25 It enables simultaneous learning and estimation of multiple sequence labelings on the same input sequences, where time slices of all the output sequences are regularly aligned. [sent-60, score-0.439]
26 Moreover, it only considers the cases where labels of an input sequence and all output sequences are regularly aligned. [sent-63, score-0.48]
27 It is not clear how to build a joint labeling model which handles irregular output label sequences like semi-Markov models (Sarawagi and Cohen, 2005). [sent-64, score-0.411]
28 In this paper, we propose a middle ground for a cascade of sequence predictions. [sent-65, score-0.319]
29 We first assume that the model on all the sequence labeling stages is a probabilistic one. [sent-67, score-0.435]
30 In modeling of an upper stage, a feature is weighted by the marginal probability of the fragment of the outputs from a lower stage. [sent-68, score-0.375]
31 Features integrated in the model on each stage are weighted by the marginal probabilities of the fragments of the outputs on lower stages. [sent-71, score-0.36]
32 So, if the output distributions on lower stages change, the marginal probabilities of any fragments also change, and this in turn can change the value of the features on the upper stage. [sent-72, score-0.524]
33 In other words, the features on an upper stage indirectly depend on the models on the lower stages. [sent-73, score-0.49]
34 Based on this intuition, the learning procedure of the model on an upper stage can affect not only direct model parameters, but also the weights of the features by changing the model on the lower stages. [sent-74, score-0.598]
35 Supervised learning based on annotated data on an upper stage may affect the model or model parameters on the lower stages. [sent-75, score-0.481]
36 It could be said that the information of annotation data on an upper stage is propagated back to the model on lower stages. [sent-76, score-0.48]
37 In Section 3, we propose an optimization procedure according to the intuition noted above. [sent-78, score-0.23]
38 The proposed method shows some improvements on a real-world task in comparison with ordinary methods. [sent-80, score-0.23]
39 Hereafter, for the sake of simplicity, we only describe the simplest case in which there are just two stages, one lower stage of sequence labeling named L1 and one upper stage of sequence labeling named L2. [sent-82, score-1.351]
40 L2 is also a sequence labeling stage for the same input x and the output of L1. [sent-84, score-0.655]
41 The model for L1 per se is the same as ordinary ones for sequence labeling. [sent-89, score-0.369]
42 It is worth noting that this formalization subsumes both directed and undirected linear-chain graphical models, which are the most typical models for sequence labeling, including HMM and CRF. [sent-109, score-0.297]
43 In such a configuration, all the possible successful paths defined in our notation have a strict one-to-one correspondence to all the possible joint assignments of labels in linear-chain graphical models. [sent-111, score-0.186]
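As a toy illustration of this one-to-one correspondence, every successful path through a length-T lattice picks exactly one label per position, so the number of paths equals |Y|^T; this is a generic example, not the paper's graph construction.

```python
from itertools import product

def all_label_sequences(T, labels):
    # Each tuple is one joint assignment, i.e. one successful path in the lattice.
    return list(product(labels, repeat=T))

print(len(all_label_sequences(3, ["B", "I", "O"])))   # 27 = 3**3 paths
```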
44 Next, we formalize the probabilistic model on the upper stage L2. [sent-113, score-0.42]
45 A feature on an arc e2 ∈ E2 can access local characteristics of the confidence-rated superposition of the L1’s outputs, in addition to the information of the input x. [sent-117, score-0.329]
46 To formulate local characteristics of the superposition of the L1’s outputs, we first define output features of L1, denoted by h⟨1,k1′,e1⟩ ∈ R (k′1 ∈ K1′, e1 ∈ E1). [sent-118, score-0.238]
47 Before the output features are integrated into the model for L2, they all are confidence-rated with respect to P1, that is, each output feature h⟨1,k1′,e1⟩ is numerically rated by the estimated probabilities summed over the sequences emitting that feature. [sent-120, score-0.374]
48 Here, the notation ∑y1∼e1 represents the summation over sequences consistent with an arc e1 ∈ E1, that is, the summation over the set {y1 ∈ Y1 | e1 ∈ y1}. [sent-122, score-0.199]
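Reading the surrounding description, the marginalized output feature of (3) presumably takes the following form: the output feature on an arc weighted by that arc's marginal probability under P1 (a reconstruction from the text above, not a verbatim copy of the paper's equation).

```latex
\bar{h}_{\langle 1,k'_1,e_1\rangle}(\theta_1)
  \;=\; \sum_{y_1 \sim e_1} P_1(y_1 \mid x;\theta_1)\, h_{\langle 1,k'_1,e_1\rangle}
  \;=\; P_1(e_1 \mid x;\theta_1)\, h_{\langle 1,k'_1,e_1\rangle},
\qquad
P_1(e_1 \mid x;\theta_1) \;=\; \sum_{\{y_1 \in Y_1 \,\mid\, e_1 \in y_1\}} P_1(y_1 \mid x;\theta_1).
```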
49 The input features for P2 on an arc e2 ∈ E2 are permitted to arbitrarily combine the information of x and the L1’s marginalized output features h̄1, in addition to the local characteristics of the arc at hand e2. [sent-124, score-0.742]
50 In summary, an input feature for L2 on an arc e2 ∈ E2 is of the form f⟨2,k2,e2,x⟩(h̄1(θ1)) ∈ R (k2 ∈ K2), (5) where K2 is the index set of the input feature types for L2. [sent-125, score-0.333]
51 To make the optimization procedure feasible, a smoothness condition on any of L2’s input features is assumed with respect to all the L1 output features, that is, ∂f⟨2,k2,e2,x⟩/∂h̄⟨1,k′1,e1⟩ is always guaranteed to exist for ∀k′1, e1, k2, e2. [sent-126, score-0.477]
52 P2 is viewed not only as the function of the ordinary direct parameters θ2 but also as the function of θ1, which represents the parameters for the L1’s model, through the intermediate variables ¯h1. [sent-130, score-0.23]
53 So the optimization procedure on P2 may affect the determination of the values not only of the direct parameters θ2 but also of the indirect ones θ1. [sent-131, score-0.31]
54 3 Optimization Algorithm In this section, we describe the optimization procedure for the model formulated in the previous section. [sent-135, score-0.23]
55 The optimization procedure repeatedly searches for an ascent direction of the objective function in the parameter space, and updates the parameter values in that direction by small steps. [sent-179, score-0.272]
56 Many existing optimization routines, like steepest descent or conjugate gradient, do that job given only the objective value and gradients at the parameter values to be updated. [sent-180, score-0.435]
57 So, the optimization problem here boils down to the calculation of the objective value and gradients on given parameter values. [sent-181, score-0.421]
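A sketch of how such a routine is driven in practice, using SciPy's quasi-Newton interface; `objective_and_grad` is a hypothetical placeholder for the dynamic-programming computation described in this section, with a dummy quadratic standing in for the real objective.

```python
import numpy as np
from scipy.optimize import minimize

def objective_and_grad(theta):
    # Placeholder: a real implementation would split theta into (theta1, theta2),
    # run the forward-backward passes, and return the negated objective and gradient.
    value = 0.5 * float(np.dot(theta, theta))   # dummy stand-in
    grad = theta                                # dummy stand-in
    return value, grad

theta0 = np.zeros(20)
result = minimize(objective_and_grad, theta0, jac=True, method="L-BFGS-B")
print(result.fun)
```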
58 Before entering the detailed description of the algorithm for calculating the objective function and gradients, we note the functional relations among the objective function and previously defined variables. [sent-182, score-0.304]
59 The diagram shown in Figure 1 illustrates the functional relations among the parameters, input and output feature functions, models, and objective function. [sent-183, score-0.328]
60 The value of the objective function on given parameter values can be calculated in order of the arrows shown in the diagram. [sent-185, score-0.203]
61 The functional relations illustrated in the Figure 1 ensure some forms of the chain rule of differentiation among the variables. [sent-187, score-0.204]
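One form of this chain rule, written for the gradient of the L2 objective with respect to the L1 parameters through the intermediate variables h̄1, is reconstructed below from the notation above (not the paper's numbered equation); the sums can be accumulated either forward along the arrows of the diagram or backward against them.

```latex
\frac{\partial L_2}{\partial \theta_{\langle 1,k_1\rangle}}
  \;=\;
  \sum_{e_1 \in E_1} \sum_{k'_1 \in K'_1}
  \frac{\partial L_2}{\partial \bar{h}_{\langle 1,k'_1,e_1\rangle}}
  \cdot
  \frac{\partial \bar{h}_{\langle 1,k'_1,e_1\rangle}}{\partial \theta_{\langle 1,k_1\rangle}} .
```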
62 These two directions of stepwise computation are analogous to the forward and back propagation for multilayer feedforward neural networks, respectively. [sent-189, score-0.27]
63 Algorithm 1 shows the whole picture of the gradient-based optimization procedure for our model. [sent-190, score-0.23]
64 The values of marginalized output features h¯⟨1,x⟩ can be calculated by (3). [sent-192, score-0.412]
65 Because they are the simple marginals of features, the ordinary forward-backward algorithm (hereafter abbreviated as “F-B”) on G1 offers an efficient way to calculate their values. [sent-193, score-0.441]
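A compact sketch of this forward-backward computation of arc marginals on a plain linear chain; the array shapes and potential parameterization are illustrative assumptions rather than the paper's graph-based formulation.

```python
import numpy as np

def arc_marginals(node_pot, trans_pot):
    """node_pot: (T, L) non-negative per-position label potentials;
    trans_pot: (L, L) non-negative transition potentials.
    Returns a (T-1, L, L) array with P(y_t = i, y_{t+1} = j | x)."""
    T, L = node_pot.shape
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = node_pot[0]
    for t in range(1, T):                       # forward pass
        alpha[t] = node_pot[t] * (alpha[t - 1] @ trans_pot)
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = trans_pot @ (node_pot[t + 1] * beta[t + 1])
    Z = alpha[T - 1].sum()                      # partition function
    marg = np.empty((T - 1, L, L))
    for t in range(T - 1):
        marg[t] = (alpha[t][:, None] * trans_pot
                   * (node_pot[t + 1] * beta[t + 1])[None, :]) / Z
    return marg
```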
66 ... because they are no different from the ordinary log-likelihood computation. [sent-196, score-0.23]
67 The terms in (11) become the same forms that appear in the ordinary CRF optimization, i.e., (12). [sent-199, score-0.23]
68 These calculations are performed by the ordinary F-B on G1 and G2, respectively. [sent-202, score-0.23]
69 As described in the previous section, it is assumed that the values of the second factor ∂f⟨2,x⟩/∂h̄1 are guaranteed to exist for any given θ1, and the procedure for calculating them is fixed in advance. [sent-207, score-0.243]
70 In other words, ∂L2/∂θ⟨1,k1⟩ becomes the covariance between the k1-th input feature for L1 and the hypothetical feature h′⟨1,e1⟩ ≝ ∂L2/∂h̄⟨1,e1⟩ · h⟨1,e1⟩. [sent-214, score-0.23]
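Spelled out, this covariance form presumably reads as follows, where both sums run over the arcs of a path y1 and the covariance is taken under P1 (a reconstruction consistent with the text, not the paper's numbered equation).

```latex
\frac{\partial L_2}{\partial \theta_{\langle 1,k_1\rangle}}
  \;=\;
  \mathrm{Cov}_{P_1}\!\left(
      \sum_{e_1 \in y_1} f_{\langle 1,k_1,e_1\rangle}\,,\;
      \sum_{e_1 \in y_1} h'_{\langle 1,e_1\rangle}
  \right),
\qquad
h'_{\langle 1,e_1\rangle}
  \;\overset{\mathrm{def}}{=}\;
  \frac{\partial L_2}{\partial \bar{h}_{\langle 1,e_1\rangle}} \cdot h_{\langle 1,e_1\rangle}.
```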
71 The second term of (18) can be calculated by the ordinary F-B because it consists of the marginals of arc features. [sent-216, score-0.464]
72 It is easy to calculate the value: AD transforms this F-B into another algorithm for calculating the differentiation w. [sent-226, score-0.199]
73 see Li and Eisner (2009); the arithmetic operations and the exponential function are generalized to the dual numbers, and the ordinary F-B is also generalized to the dual numbers. [sent-233, score-0.068]
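A minimal dual-number sketch of the forward-mode automatic differentiation mentioned here: arithmetic and exp are overloaded on pairs (value, derivative), so running the same F-B code on such numbers yields a derivative as a by-product. The class below is a generic illustration, not code from the paper.

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 == 0; the eps slot carries the derivative."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def _lift(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))
    def __add__(self, o):
        o = self._lift(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = self._lift(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__
    def exp(self):
        e = math.exp(self.val)
        return Dual(e, e * self.eps)

# d/dx [x * exp(x)] at x = 2: seed the derivative slot with 1 and read it back.
x = Dual(2.0, 1.0)
y = x * x.exp()
print(y.val, y.eps)          # y.eps == 3 * exp(2), the exact derivative
```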
74 The final line in the loop of Algorithm 1 can be implemented by various optimization routines and line search algorithms. [sent-236, score-0.223]
75 The time and space complexity to compute the objective and gradient values for given parameter vectors θ1, θ2 is the same as that for Bunescu (2008), up to a constant factor. [sent-237, score-0.212]
76 The task is to annotate the POS tags and to perform base-phrase chunking on English sentences. [sent-240, score-0.255]
77 Base-phrase chunking is a task to classify continuous subsequences of words into syntactic categories. [sent-241, score-0.255]
78 This task is performed by annotating a chunking label on each word (Ramshaw and Marcus, 1995). [sent-242, score-0.314]
79 The types of chunking label consist of “Begin-Category”, which represents the beginning of a chunk, “Inside-Category”, which represents the inside of a chunk, and “Other.” [sent-243, score-0.314]
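A tiny concrete example of this label scheme on a CoNLL-2000-style sentence fragment (illustrative; not taken from the paper).

```python
words  = ["He",   "reckons", "the",  "current", "account", "deficit"]
chunks = ["B-NP", "B-VP",    "B-NP", "I-NP",    "I-NP",    "I-NP"]
# "B-X" marks the first word of a chunk of category X, "I-X" a word inside it,
# and "O" (not used above) marks words outside any chunk.
```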
80 Usually, POS labeling runs first before base-phrase chunking is performed. [sent-244, score-0.417]
81 Therefore, this task is a typical interesting case where a sequence labeling depends on the output from other sequence labelers. [sent-245, score-0.531]
82 The number of the label types used in base-phrase chunking is equal to 23. [sent-251, score-0.375]
83 We compare the proposed method to two existing sequence labeling methods as baselines. [sent-252, score-0.301]
84 This labeler is a simple CRF and is learned by an ordinary optimization procedure. [sent-254, score-0.535]
85 A simple CRF model is learned for the chunking labeling, on the input sentences and the most likely POS label sequences predicted by the already learned POS labeler. [sent-256, score-0.471]
86 ” The other baseline method has a CRF model for the chunking labeling, which uses the marginalized features offered by the POS labeler. [sent-258, score-0.499]
87 However, the parameters of the POS labeler are fixed in the training of the chunking model. [sent-259, score-0.397]
88 In “CRF + CRF-BP,” the objective function for joint learning (10) is not guaranteed to be convex, so the optimization procedure is sensitive to the initial configuration of the model parameters. [sent-267, score-0.371]
89 Although we only described the formalization and optimization procedure of the models with arc features, we use node features in the experiment. [sent-270, score-0.542]
90 All node features are combined with the corresponding node label (POS or chunking label) feature. [sent-272, score-0.464]
91 All arc features are combined with the feature of the corresponding arc label pair. [sent-273, score-0.301]
92 † features are instantiated on each time slice in five character window. [sent-274, score-0.331]
93 ‡ features are not used in POS labeler, and marginalized as output features for "CRF + CRF-MF" and "CRF + CRF-BP." [sent-275, score-0.294]
94 From Table 2, the proposed method significantly outperforms two baseline methods on chunking performance. [sent-279, score-0.255]
95 Although the improvement on POS labeling performance by the proposed method “CRF + CRF-BP” is not significant, it might show that the optimization procedure provides some form of backward information propagation in comparison to “CRF + CRF-MF. [sent-280, score-0.535]
96 ” 5 Conclusions In this paper, we adopt the method of weighting features on an upper sequence labeling stage by the marginalized probabilities estimated by the model on lower stages. [sent-281, score-0.985]
97 We also point out that the model on an upper stage is considered to depend on the model on lower stages indirectly. [sent-282, score-0.536]
98 In addition, we propose an optimization procedure that enables the joint optimization of the multiple models on the different levels of stages. [sent-283, score-0.393]
99 Conditional random fields: Probabilistic models for segmenting and labeling sequence data. [sent-328, score-0.301]
100 Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. [sent-366, score-0.339]
wordName wordTfidf (topN-words)
[('ef', 0.264), ('chunking', 0.255), ('crf', 0.249), ('ordinary', 0.23), ('stage', 0.205), ('marginalized', 0.194), ('cascade', 0.18), ('upper', 0.177), ('optimization', 0.163), ('labeling', 0.162), ('labeler', 0.142), ('dag', 0.142), ('sequence', 0.139), ('arc', 0.131), ('differentiation', 0.117), ('bunescu', 0.105), ('sequences', 0.099), ('stages', 0.096), ('calculate', 0.095), ('cascaded', 0.093), ('output', 0.091), ('jacobian', 0.09), ('multilayer', 0.09), ('gradients', 0.087), ('objective', 0.086), ('propagation', 0.082), ('calculating', 0.082), ('formalization', 0.081), ('slice', 0.077), ('prev', 0.077), ('pos', 0.07), ('dual', 0.068), ('notation', 0.068), ('procedure', 0.067), ('predictions', 0.065), ('marginals', 0.065), ('omitted', 0.062), ('exp', 0.061), ('backward', 0.061), ('equal', 0.061), ('corliss', 0.06), ('covariances', 0.06), ('dlog', 0.06), ('routines', 0.06), ('sink', 0.06), ('superposition', 0.06), ('label', 0.059), ('input', 0.058), ('forward', 0.058), ('lower', 0.058), ('guaranteed', 0.055), ('sake', 0.052), ('expectation', 0.052), ('simplest', 0.052), ('marginal', 0.052), ('imaginary', 0.051), ('regularly', 0.051), ('forwardbackward', 0.051), ('functional', 0.05), ('hereafter', 0.05), ('features', 0.05), ('node', 0.05), ('gradient', 0.047), ('featu', 0.047), ('slices', 0.047), ('sarawagi', 0.047), ('nara', 0.047), ('outputs', 0.045), ('calculation', 0.045), ('src', 0.043), ('subscripts', 0.043), ('covariance', 0.043), ('subclass', 0.043), ('fh', 0.043), ('hypothetical', 0.043), ('sutton', 0.043), ('feature', 0.043), ('labels', 0.042), ('fields', 0.042), ('affect', 0.041), ('dynamic', 0.041), ('fl', 0.041), ('back', 0.04), ('directed', 0.04), ('parameter', 0.04), ('successful', 0.039), ('values', 0.039), ('calculated', 0.038), ('dot', 0.038), ('chunk', 0.038), ('arithmetic', 0.038), ('ramshaw', 0.038), ('probabilistic', 0.038), ('networks', 0.038), ('generalized', 0.037), ('chain', 0.037), ('marcus', 0.037), ('characteristics', 0.037), ('graphical', 0.037), ('totally', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for mul- tilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
2 0.1046921 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
Author: Marco Dinarelli ; Sophie Rosset
Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU basiline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous 1104 Sophie Rosset LIMSI-CNRS B.P. 133, 91403 Orsay Cedex France ro s set @ l ims i fr . re-ranking models.
3 0.097998723 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
Author: Zhenghua Li ; Min Zhang ; Wanxiang Che ; Ting Liu ; Wenliang Chen ; Haizhou Li
Abstract: Part-of-speech (POS) is an indispensable feature in dependency parsing. Current research usually models POS tagging and dependency parsing independently. This may suffer from error propagation problem. Our experiments show that parsing accuracy drops by about 6% when using automatic POS tags instead of gold ones. To solve this issue, this paper proposes a solution by jointly optimizing POS tagging and dependency parsing in a unique model. We design several joint models and their corresponding decoding algorithms to incorporate different feature sets. We further present an effective pruning strategy to reduce the search space of candidate POS tags, leading to significant improvement of parsing speed. Experimental results on Chinese Penn Treebank 5 show that our joint models significantly improve the state-of-the-art parsing accuracy by about 1.5%. Detailed analysis shows that the joint method is able to choose such POS tags that are more helpful and discriminative from parsing viewpoint. This is the fundamental reason of parsing accuracy improvement.
4 0.097028896 129 emnlp-2011-Structured Sparsity in Structured Prediction
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1regularization; both ignore the structure of the feature space, preventing practicioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
5 0.088117473 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
Author: Reut Tsarfaty ; Joakim Nivre ; Evelina Andersson
Abstract: unkown-abstract
6 0.072153889 51 emnlp-2011-Exact Decoding of Phrase-Based Translation Models through Lagrangian Relaxation
7 0.071895011 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
8 0.067889243 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
9 0.067070171 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
10 0.066083334 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling
11 0.064202577 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
12 0.063985199 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
13 0.062998526 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
14 0.062418684 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
15 0.062355414 125 emnlp-2011-Statistical Machine Translation with Local Language Models
16 0.061054472 52 emnlp-2011-Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing
17 0.060509995 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
18 0.059461202 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
19 0.058924701 9 emnlp-2011-A Non-negative Matrix Factorization Based Approach for Active Dual Supervision from Document and Word Labels
20 0.057039116 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
topicId topicWeight
[(0, 0.221), (1, -0.001), (2, -0.035), (3, 0.05), (4, 0.022), (5, -0.006), (6, -0.061), (7, -0.202), (8, -0.027), (9, 0.008), (10, 0.026), (11, -0.054), (12, -0.022), (13, 0.044), (14, -0.12), (15, -0.084), (16, -0.043), (17, -0.01), (18, 0.025), (19, 0.042), (20, 0.023), (21, 0.027), (22, 0.042), (23, -0.009), (24, 0.137), (25, -0.001), (26, -0.006), (27, 0.088), (28, -0.056), (29, -0.007), (30, 0.053), (31, 0.109), (32, -0.092), (33, -0.015), (34, 0.16), (35, 0.051), (36, 0.022), (37, 0.094), (38, 0.058), (39, 0.049), (40, 0.028), (41, -0.093), (42, -0.125), (43, -0.013), (44, 0.289), (45, -0.016), (46, 0.036), (47, 0.059), (48, -0.062), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.95672208 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for mul- tilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
2 0.67785144 129 emnlp-2011-Structured Sparsity in Structured Prediction
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1regularization; both ignore the structure of the feature space, preventing practicioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
3 0.47055548 26 emnlp-2011-Class Label Enhancement via Related Instances
Author: Zornitsa Kozareva ; Konstantin Voevodski ; Shanghua Teng
Abstract: Class-instance label propagation algorithms have been successfully used to fuse information from multiple sources in order to enrich a set of unlabeled instances with class labels. Yet, nobody has explored the relationships between the instances themselves to enhance an initial set of class-instance pairs. We propose two graph-theoretic methods (centrality and regularization), which start with a small set of labeled class-instance pairs and use the instance-instance network to extend the class labels to all instances in the network. We carry out a comparative study with state-of-the-art knowledge harvesting algorithm and show that our approach can learn additional class labels while maintaining high accuracy. We conduct a comparative study between class-instance and instance-instance graphs used to propagate the class labels and show that the latter one achieves higher accuracy.
4 0.46261755 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
Author: Cane Wing-ki Leung ; Jing Jiang ; Kian Ming A. Chai ; Hai Leong Chieu ; Loo-Nin Teow
Abstract: We address the task of automatic discovery of information extraction template from a given text collection. Our approach clusters candidate slot fillers to identify meaningful template slots. We propose a generative model that incorporates distributional prior knowledge to help distribute candidates in a document into appropriate slots. Empirical results suggest that the proposed prior can bring substantial improvements to our task as compared to a K-means baseline and a Gaussian mixture model baseline. Specifically, the proposed prior has shown to be effective when coupled with discriminative features of the candidates.
5 0.46076599 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
Author: Marco Dinarelli ; Sophie Rosset
Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU basiline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous 1104 Sophie Rosset LIMSI-CNRS B.P. 133, 91403 Orsay Cedex France ro s set @ l ims i fr . re-ranking models.
6 0.42909446 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction
7 0.40600103 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
8 0.38400984 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
9 0.38067773 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
10 0.37178552 7 emnlp-2011-A Joint Model for Extended Semantic Role Labeling
11 0.36156449 12 emnlp-2011-A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents
12 0.34732556 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
13 0.34038082 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
14 0.33676633 145 emnlp-2011-Unsupervised Semantic Role Induction with Graph Partitioning
15 0.33031473 82 emnlp-2011-Learning Local Content Shift Detectors from Document-level Information
16 0.32979202 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
17 0.32956204 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
18 0.32844934 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
19 0.32601759 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation
20 0.32296389 52 emnlp-2011-Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing
topicId topicWeight
[(23, 0.105), (36, 0.52), (37, 0.024), (45, 0.036), (53, 0.014), (54, 0.011), (57, 0.013), (62, 0.016), (64, 0.029), (66, 0.021), (69, 0.022), (79, 0.036), (82, 0.029), (90, 0.016), (96, 0.035), (98, 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.90681303 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for mul- tilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
2 0.87355703 102 emnlp-2011-Parse Correction with Specialized Models for Difficult Attachment Types
Author: Enrique Henestroza Anguiano ; Marie Candito
Abstract: This paper develops a framework for syntactic dependency parse correction. Dependencies in an input parse tree are revised by selecting, for a given dependent, the best governor from within a small set of candidates. We use a discriminative linear ranking model to select the best governor from a group of candidates for a dependent, and our model includes a rich feature set that encodes syntactic structure in the input parse tree. The parse correction framework is parser-agnostic, and can correct attachments using either a generic model or specialized models tailored to difficult attachment types like coordination and pp-attachment. Our experiments show that parse correction, combining a generic model with specialized models for difficult attachment types, can successfully improve the quality of predicted parse trees output by sev- eral representative state-of-the-art dependency parsers for French.
3 0.76262075 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
Author: Yulia Tsvetkov ; Shuly Wintner
Abstract: We propose an architecture for expressing various linguistically-motivated features that help identify multi-word expressions in natural language texts. The architecture combines various linguistically-motivated classification features in a Bayesian Network. We introduce novel ways for computing many of these features, and manually define linguistically-motivated interrelationships among them, which the Bayesian network models. Our methodology is almost entirely unsupervised and completely languageindependent; it relies on few language resources and is thus suitable for a large number of languages. Furthermore, unlike much recent work, our approach can identify expressions of various types and syntactic con- structions. We demonstrate a significant improvement in identification accuracy, compared with less sophisticated baselines.
4 0.47510698 138 emnlp-2011-Tuning as Ranking
Author: Mark Hopkins ; Jonathan May
Abstract: We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large scale data scenarios.
5 0.42443737 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
Author: Marco Dinarelli ; Sophie Rosset
Abstract: Reranking models have been successfully applied to many tasks of Natural Language Processing. However, there are two aspects of this approach that need a deeper investigation: (i) Assessment of hypotheses generated for reranking at classification phase: baseline models generate a list of hypotheses and these are used for reranking without any assessment; (ii) Detection of cases where reranking models provide a worst result: the best hypothesis provided by the reranking model is assumed to be always the best result. In some cases the reranking model provides an incorrect hypothesis while the baseline best hypothesis is correct, especially when baseline models are accurate. In this paper we propose solutions for these two aspects: (i) a semantic inconsistency metric to select possibly more correct n-best hypotheses, from a large set generated by an SLU basiline model. The selected hypotheses are reranked applying a state-of-the-art model based on Partial Tree Kernels, which encode SLU hypotheses in Support Vector Machines with complex structured features; (ii) finally, we apply a decision strategy, based on confidence values, to select the final hypothesis between the first ranked hypothesis provided by the baseline SLU model and the first ranked hypothesis provided by the re-ranker. We show the effectiveness of these solutions presenting comparative results obtained reranking hypotheses generated by a very accurate Conditional Random Field model. We evaluate our approach on the French MEDIA corpus. The results show significant improvements with respect to current state-of-the-art and previous 1104 Sophie Rosset LIMSI-CNRS B.P. 133, 91403 Orsay Cedex France ro s set @ l ims i fr . re-ranking models.
6 0.40005854 134 emnlp-2011-Third-order Variational Reranking on Packed-Shared Dependency Forests
7 0.39602122 132 emnlp-2011-Syntax-Based Grammaticality Improvement using CCG and Guided Search
8 0.38857651 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation
9 0.38028058 65 emnlp-2011-Heuristic Search for Non-Bottom-Up Tree Structure Prediction
10 0.37773472 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
11 0.36891302 136 emnlp-2011-Training a Parser for Machine Translation Reordering
12 0.36736909 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
13 0.35782751 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
14 0.34453899 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs
15 0.34346625 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing
16 0.34198183 47 emnlp-2011-Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
17 0.34193066 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
18 0.3410503 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing
19 0.34091857 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
20 0.34046334 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification