emnlp emnlp2011 emnlp2011-129 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1-regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
Reference: text
sentIndex sentText sentNum sentScore
1 While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. [sent-16, score-0.238]
2 We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. [sent-18, score-0.417]
3 Our focus is on methods which embed this selection into the learning problem via the regularization term. [sent-30, score-0.238]
4 For example, we want to be able to select entire feature templates (rather than features individually), or to make the inclusion of some features depend on the inclusion of other features. [sent-32, score-0.321]
5 We achieve the goal stated above by employing regularizers which promote structured sparsity. [sent-33, score-0.417]
6 Such regularizers are able to encode prior knowledge and guide the selection of features by modeling the structure of the feature space. [sent-34, score-0.37]
7 Lately, this type of regularizer has received a lot of attention in computer vision, signal processing, and computational biology (Zhao et al. [sent-35, score-0.219]
8 To achieve this goal, linear models are usually trained by solving a problem of the form θ̂ = argmin_θ Ω(θ) + (1/N) Σ_{i=1}^{N} L(θ, x_i, y_i), (2) where Ω is a regularizer and L is a loss function. [sent-60, score-0.236]
9 , 2004), L_SVM(θ, x, y) = max_{y′∈Y(x)} θ · δφ(y′) + c(y′, y), (4) where δφ(y′) = φ(x, y′) − φ(x, y); and the loss underlying the structured perceptron (Collins, 2002a), L_SP(θ, x, y) = max_{y′∈Y(x)} θ · δφ(y′). [sent-64, score-0.219]
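To make the losses above and the learning problem in Eq. (2) concrete, here is a minimal sketch (ours, not the authors' implementation). The helpers features, candidates, and cost are hypothetical stand-ins for the feature map φ(x, y), the output set Y(x) (enumerated explicitly, which is only feasible for toy problems; real decoders use dynamic programming or other combinatorial search), and the cost function c(y′, y).

```python
import numpy as np

# Illustrative sketch only (not the authors' code).  Assumed hypothetical helpers:
#   features(x, y)  -> np.ndarray      # the feature vector phi(x, y)
#   candidates(x)   -> iterable of y'  # the output set Y(x), enumerated explicitly
#   cost(y_pred, y) -> float           # the cost c(y', y), e.g. a Hamming cost

def structured_hinge_loss(theta, x, y, features, candidates, cost):
    """L_SVM(theta, x, y) = max_{y' in Y(x)} [theta . (phi(x, y') - phi(x, y)) + c(y', y)]."""
    phi_gold = features(x, y)
    return max(theta @ (features(x, yp) - phi_gold) + cost(yp, y)
               for yp in candidates(x))

def structured_perceptron_loss(theta, x, y, features, candidates):
    """L_SP(theta, x, y) = max_{y' in Y(x)} theta . (phi(x, y') - phi(x, y))."""
    phi_gold = features(x, y)
    return max(theta @ (features(x, yp) - phi_gold) for yp in candidates(x))

def regularized_objective(theta, data, loss, regularizer):
    """Eq. (2): Omega(theta) + (1/N) sum_i L(theta, x_i, y_i).

    data: list of (x, y) pairs; loss(theta, x, y) -> float; regularizer(theta) -> float."""
    return regularizer(theta) + np.mean([loss(theta, x, y) for x, y in data])
```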
10 None of the regularizers just described "looks" at the structure of the feature space, since they all treat each dimension independently—we call them unstructured regularizers, as opposed to the structured ones that we describe next. [sent-83, score-0.399]
11 3 Structured Sparsity We are interested in regularizers that share with ΩτL1 the ability to promote sparsity, so that they can be used for selecting features. [sent-84, score-0.289]
12 In addition, we want to endow the feature space RD with additional structure, so that features are not penalized individually (as in the L1-case) but collectively, encouraging en- tire groups of features to be discarded. [sent-85, score-0.383]
13 The choice of groups will allow encoding prior knowledge regarding the kind of sparsity patterns that are intended in the model. [sent-86, score-0.358]
14 1 The Group Lasso To capture the structure of the feature space, we group our D features into M groups G1, . . . , GM. [sent-89, score-0.543]
15 Ahead, we discuss meaningful ways of choosing group decompositions; for now, let us assume a sensible choice is obvious to the model designer. [sent-96, score-0.204]
16 (8) These regularizers were first proposed by Bakin (1999) and Yuan and Lin (2006) in the context of regression. [sent-102, score-0.219]
17 These regularizers subsume the L1 and L2 cases, which correspond to trivial choices of groups: • If each group is a singleton, i.e. [sent-108, score-0.423]
18 Let us first consider the case where G is a partition of the feature space: the groups cover all the features (⋃_m G_m = {1, . . . , D}). [sent-129, score-0.339]
19 It encourages sparsity patterns in which entire groups are discarded. [sent-134, score-0.358]
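As a concrete illustration of the non-overlapping group-Lasso regularizer in Eq. (8), the sketch below (ours, not the paper's code) evaluates Ω(θ) = Σ_m d_m ‖θ_Gm‖2 from a list of index arrays defining the partition; with singleton groups it reduces to the L1 norm and with a single group to the L2 norm, as noted above.

```python
import numpy as np

def group_lasso_penalty(theta, groups, d):
    """Omega_GL(theta) = sum_m d_m * ||theta_{G_m}||_2 over a partition of the features.

    groups: list of index arrays G_1, ..., G_M (a partition of {0, ..., D-1})
    d:      per-group weights d_1, ..., d_M
    """
    return sum(dm * np.linalg.norm(theta[g]) for g, dm in zip(groups, d))

# The two trivial group choices mentioned in the text:
theta = np.array([0.5, -1.0, 0.0, 2.0])
singletons = [np.array([i]) for i in range(len(theta))]
print(group_lasso_penalty(theta, singletons, [1.0] * 4))          # == ||theta||_1
print(group_lasso_penalty(theta, [np.arange(len(theta))], [1.0])) # == ||theta||_2
```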
20 A judicious choice of groups can lead to very compact models. In the statistics literature, such mixed-norm regularizers, which group features and then apply a separate norm for each group, are called composite absolute penalties (Zhao et al. [sent-135, score-0.569]
21 However, it can be shown that both regularizers lead to identical learning problems (Eq. [sent-143, score-0.219]
22 In this case, all groups have the same size and we typically set d1 = . . . = dM. [sent-155, score-0.243]
23 In §5, we learn feature templates from the data, by associating each group to a feature template, and letting that group contain all features that are instantiations of this template. [sent-164, score-0.685]
24 Since groups have different sizes, it is a good idea to let dm increase with the group size, so that larger groups pay a larger penalty for being included. [sent-165, score-0.801]
25 More generally, we may let the groups in G overlap but be nested, i.e. [sent-167, score-0.243]
26 we may want them to form a hierarchy (two distinct groups either have empty intersection or one is contained in the other). [sent-169, score-0.243]
27 An edge is drawn from group Ga to group Gb if Gb ⊂ Ga and there is no b′ s.t. [sent-173, score-0.408]
28 When the groups are nested, this diagram is a forest (a union of directed trees). [sent-176, score-0.316]
29 The corresponding regularizer enforces sparsity patterns (footnote 3: the same idea is also used in multitask learning, where labels correspond to tasks (Caruana, 1997)) [sent-177, score-0.26]
30 where a group of features is only selected if all its ancestors are also selected. [sent-178, score-0.248]
31 (2010): Ω^{SGL}_{d,τ}(θ) = Σ_{m=1}^{M0} (d_m ‖θ_m‖2 + τ_m ‖θ_m‖1), (9) where the total number of groups is M = M0 + D, and the components θ1, . . . [sent-189, score-0.243]
32 This regularizer promotes sparsity at both group and feature levels (i.e. [sent-193, score-0.516]
33 it eliminates entire groups and sparsifies within each group). [sent-195, score-0.243]
34 In general, the groups in G may overlap without being nested. [sent-197, score-0.243]
35 As in the tree-structured case, a group of features is only selected if all its ancestors are also selected. [sent-199, score-0.248]
36 (2009) suggested a way of reverse engineering the groups from the desired sparsity pattern. [sent-201, score-0.358]
37 We next describe a strategy for coarse-to-fine feature template selection that directly builds on that idea. [sent-202, score-0.235]
38 Suppose that we are given M feature templates T = {T1, . . . , TM}. [sent-203, score-0.233]
39 the same shape; we say that a group of features Gm is selected if some feature in Gm (but not necessarily all) has a nonzero weight. [sent-231, score-0.32]
40 The result is a “coarse-to-fine” regularizer, which prefers to select feature templates that are coarser before zooming into finer features. [sent-234, score-0.233]
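One simple way to realize such nested, tree-structured "coarse-to-fine" groups is sketched below. This is our own illustration under an assumed data layout (a parent pointer per template and a list of feature indices per template), not the construction used in the paper: each template's group contains its own features plus those of all its descendants, so finer templates can only carry nonzero weights if their coarser ancestors are also selected.

```python
# Minimal sketch of building nested, tree-structured groups from feature
# templates (hypothetical data layout; not the paper's code).

def nested_groups(template_parent, template_features):
    """template_parent:   dict template -> parent template (None for roots);
                          every parent is assumed to be listed as a template itself
    template_features: dict template -> list of feature indices owned by it
    Returns one group per template, containing the template's own features plus
    those of all its descendants (zeroing an ancestor group zeroes the descendants)."""
    children = {t: [] for t in template_parent}
    for t, p in template_parent.items():
        if p is not None:
            children[p].append(t)

    def collect(t):
        feats = list(template_features[t])
        for c in children[t]:
            feats.extend(collect(c))
        return feats

    return {t: sorted(set(collect(t))) for t in template_parent}

# Example: a coarse POS template refined by a lexicalized one.
parents = {"pos": None, "pos+word": "pos"}
feats = {"pos": [0, 1], "pos+word": [2, 3, 4]}
print(nested_groups(parents, feats))
# {'pos': [0, 1, 2, 3, 4], 'pos+word': [2, 3, 4]}
```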
41 This reflects the fact that some groups should have their feature weights shrunk more towards zero than others. [sent-250, score-0.295]
42 Digits below each node are the group indices where each template belongs. [sent-269, score-0.332]
43 (13) For the non-overlapping group Lasso case, the proximity operator is given by [prox_{Ω^{GL}_d}(θ)]_m = 0 if ‖θ_m‖2 ≤ d_m, and ((‖θ_m‖2 − d_m)/‖θ_m‖2) θ_m otherwise. [sent-274, score-0.282]
44 13: if the L2-norm of the m-th group is less than dm, the entire group is discarded; otherwise it is scaled so that its L2-norm decreases by an amount of dm. [sent-276, score-0.408]
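The closed form above is groupwise soft-thresholding; a minimal NumPy sketch for the non-overlapping case (illustrative only, not the paper's implementation):

```python
import numpy as np

def prox_group_lasso(theta, groups, d):
    """Proximity operator of the non-overlapping group Lasso.

    For each group G_m: zero it out if ||theta_m||_2 <= d_m, otherwise scale it
    so that its L2-norm decreases by d_m (groupwise soft-thresholding)."""
    out = theta.copy()
    for g, dm in zip(groups, d):
        norm = np.linalg.norm(out[g])
        if norm <= dm:
            out[g] = 0.0
        else:
            out[g] *= (norm - dm) / norm
    return out

# Tiny usage example with two groups:
theta = np.array([3.0, 4.0, 0.1, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_group_lasso(theta, groups, d=[1.0, 1.0]))
# first group shrunk from norm 5 to norm 4 -> [2.4, 3.2]; second group zeroed out
```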
45 When groups overlap, the proximity operator lacks a closed form. [sent-277, score-0.321]
46 Eq. 2 is solved by alternating between online (sub-)gradient steps with respect to the loss term and proximal steps with respect to the regularizer. [sent-293, score-0.474]
47 , GJ such that ⋃_j Gj = G and the groups in each Gj are non-overlapping. [sent-307, score-0.243]
48 The proximal steps are then applied sequentially, one for each Ωj. [sent-308, score-0.299]
49 Each proximal step is linear in the number of groups M, and does not need to be performed every round (as we will see later). [sent-313, score-0.502]
50 Entire groups of features can be deleted after each proximal step. [sent-317, score-0.546]
51 Furthermore, only the features which correspond to nonzero entries in the gradient vector need to be inserted in the active set; for some losses (LSVM and LSP) many irrelevant features are never instantiated. [sent-318, score-0.254]
52 (2009) and set ⟨σjt⟩_{j=1}^{J} to zero for all t which is not a multiple of some prespecified integer K; this way, proximal steps need only be performed every K rounds, yielding a significant speed-up when the number of groups M is large. [sent-335, score-0.542]
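Putting the pieces together, the following heavily simplified sketch shows the alternation of online (sub-)gradient steps and delayed proximal steps applied every K instances. The sub-gradient routine, step-size schedule, and proximal operator are placeholders; this is our condensed rendering of the description above, not a faithful reproduction of the paper's algorithm (which also maintains an active feature set and handles overlapping group decompositions).

```python
import numpy as np

def online_proximal_gradient(data, loss_subgradient, prox, theta0,
                             epochs=5, eta0=0.1, K=1000):
    """Simplified sketch: alternate sub-gradient steps on the loss with proximal
    steps on the regularizer, the latter only every K instances.

    data:             list of (x, y) pairs
    loss_subgradient: (theta, x, y) -> np.ndarray, a sub-gradient of L at (x, y)
    prox:             (theta, eta)  -> np.ndarray, proximal step scaled by eta
                      (e.g. the groupwise soft-thresholding sketched earlier)
    """
    theta = theta0.copy()
    t = 0
    for _ in range(epochs):
        for x, y in data:
            t += 1
            eta = eta0 / np.sqrt(t)            # placeholder step-size schedule
            theta -= eta * loss_subgradient(theta, x, y)
            if t % K == 0:                     # delayed proximal step
                theta = prox(theta, eta)
    return theta
```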
53 (2009) would set σjt = Kσj for those rounds; however, we have observed that such a strategy makes the number of groups vary substantially in early epochs. [sent-337, score-0.243]
54 We use a different strategy: for each Gj, we specify a budget of Bj ≥ 0 groups (this may take into consideration practical limitations, such as the available memory). [sent-338, score-0.321]
55 Otherwise, sort the groups in Gj by decreasing order of their L2-norms. [sent-343, score-0.243]
56 At the end of this step, no more than Bj groups will remain nonzero. [sent-348, score-0.243]
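One plausible implementation of budget-driven shrinkage consistent with this description (the paper's exact rule may differ): sort the groups by L2-norm and use the (B+1)-th largest norm as the shrinkage amount, so that at most B groups survive the groupwise soft-thresholding.

```python
import numpy as np

def budget_shrinkage(theta, groups, budget):
    """Sketch of budget-driven shrinkage (our reading, not necessarily the paper's
    exact rule): if more than `budget` groups are nonzero, shrink every group by
    the L2-norm of the (budget+1)-th largest group, zeroing the rest."""
    norms = np.array([np.linalg.norm(theta[g]) for g in groups])
    if int(np.sum(norms > 0)) <= budget:
        return theta                          # already within budget: do nothing
    order = np.argsort(-norms)                # groups sorted by decreasing L2-norm
    tau = norms[order[budget]]                # (budget+1)-th largest norm
    out = theta.copy()
    for g, nrm in zip(groups, norms):
        if nrm <= tau:
            out[g] = 0.0                      # discarded group
        else:
            out[g] *= (nrm - tau) / nrm       # surviving group, norm reduced by tau
    return out
```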
57 Hence, we have shifted the control of the amount of regularization to the budget constants Bj, which unlike the σj have a clear meaning and can be chosen under practical considerations. [sent-350, score-0.261]
58 1have a scaling effect on each group, which affects all features belonging to that group (see Eq. [sent-353, score-0.248]
59 • The first strategy is suitable when M is large and only a few groups (? [sent-357, score-0.243]
60 , to delay the update of all features in a group until at least one of them fires; then apply a cumulative penalty. [sent-366, score-0.248]
61 The second strategy is suitable when M is small and some groups are very populated; this is the typical case of template-based groups (§3. [sent-368, score-0.486]
62 Two operations need to be performed: updating each feature weight (in the gradient steps), and scaling entire groups (in the proximal steps). [sent-370, score-0.648]
63 The number of actually selected groups may be less than B, however, since in this case some groups can be shrunk more than once. [sent-378, score-0.486]
64 This representation allows performing the two operations above in constant time, and it keeps track of the group L2-norms, necessary in the proximal updates. [sent-382, score-0.463]
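A sketch of the kind of representation this alludes to, as we read it (the concrete data structure below is our own, not the paper's): each group stores a raw weight vector, a lazy scale factor, and a cached squared L2-norm, so that a single-feature gradient update and a whole-group proximal scaling both take constant time.

```python
import numpy as np

class ScaledGroup:
    """One group's weights stored as scale * raw, with ||raw||_2^2 cached, so that
    single-feature updates and whole-group scalings are O(1) (illustrative sketch)."""

    def __init__(self, size):
        self.raw = np.zeros(size)   # unscaled weights
        self.scale = 1.0            # lazy multiplicative factor for the whole group
        self.sq_norm = 0.0          # cached ||raw||_2^2

    def weight(self, i):
        return self.scale * self.raw[i]

    def add(self, i, delta):
        """Gradient-step update of one feature: w_i += delta, in O(1)."""
        new_raw = self.raw[i] + delta / self.scale
        self.sq_norm += new_raw ** 2 - self.raw[i] ** 2
        self.raw[i] = new_raw

    def l2_norm(self):
        return self.scale * np.sqrt(max(self.sq_norm, 0.0))

    def shrink(self, d):
        """Proximal step for this group: zero it or reduce its L2-norm by d, in O(1)
        (zeroing touches the whole array, but only when the group is discarded)."""
        norm = self.l2_norm()
        if norm <= d:
            self.raw[:] = 0.0
            self.scale, self.sq_norm = 1.0, 0.0
        else:
            self.scale *= (norm - d) / norm
```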
65 Only features that, at some point, intervene in the gradient computed in line 5 need to be instantiated; and all features that receive zero weights after some proximal step can be deleted from the model (cf. [sent-384, score-0.441]
66 While Alg. 1 allows us to simultaneously select features and learn the model parameters, it has been observed in the sparse modeling literature that Lasso-like regularizers usually have a strong bias which may harm predictive performance. [sent-389, score-0.357]
67 A post-processing stage is usually performed (called debiasing), in which the model is refitted without any regularization, using only the selected features (Wright et al. [sent-390, score-0.227]
68 We call Alg. 1 with a group-Lasso regularizer and the loss LSP the sparseptron. [sent-403, score-0.236]
69 Refit the model without any regularization and using the loss L which one wants to optimize. [sent-407, score-0.274]
70 7 To see why this is the case, note that both gradient and proximal updates come scaled by η0; and that the gradient of the loss is ∇LSP(θ, xt, yt) = φ(xt, ŷt) − φ(xt, yt), where ŷt is the prediction under the current model, which is insensitive to the scaling of θ. [sent-408, score-0.617]
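The debiasing stage can be summarized by the sketch below; retrain is a placeholder for unregularized training (with whatever loss one actually wants to optimize) restricted to the selected support, and the function is an illustration rather than the paper's code.

```python
import numpy as np

def debias(theta_sparse, retrain):
    """Debiasing sketch: keep only the selected (nonzero) features and refit them
    without regularization.  `retrain(support) -> np.ndarray` is a placeholder for
    unregularized training restricted to the features indexed by `support`."""
    support = np.flatnonzero(theta_sparse)      # indices of selected features
    theta_refit = np.zeros_like(theta_sparse)
    theta_refit[support] = retrain(support)
    return theta_refit
```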
71 5 Experiments We present experiments in three structured prediction tasks for several group choices. [sent-410, score-0.373]
72 We defined unigram features by conjoining these templates with each of the 22 output labels. [sent-418, score-0.225]
73 An additional template was defined to account for label bigrams—features in this template do not look at the input string, but only at consecutive pairs of labels. [sent-419, score-0.256]
74 8 We evaluate the ability of group-Lasso regularization to perform feature template selection. [sent-420, score-0.363]
75 To do that, we ran 5 epochs of the sparseptron algorithm with template-based groups and budget-driven shrinkage (budgets of 10, 20, 30, 40, and 50 templates were tried). [sent-421, score-0.725]
76 For each group Gm, we set dm = log2 |Gm|, which is the average number of bits necessary to encode a feature in that group, if all features were equiprobable. [sent-422, score-0.315]
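In code, this choice of group weights is a one-liner (a sketch; groups is assumed to be a list of per-template feature-index arrays):

```python
import numpy as np

# d_m = log2 |G_m|: larger templates pay a larger penalty for being included.
def group_weights(groups):
    return [np.log2(len(g)) for g in groups]
```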
77 We set K = 1000 (the number of instances between consecutive proximal steps). [sent-423, score-0.259]
78 The oscillation in the first 5 epochs (bottom line) comes from the proximal steps every K = 1000 rounds. [sent-455, score-0.371]
79 We do feature template selection (same setting as before) for budget sizes of 100, 200, and 300. [sent-479, score-0.313]
80 We compare with both MIRA (using all the features) and the sparseptron with a standard Lasso regularizer ΩτL1 , for several values of C = 1/(τN). [sent-480, score-0.305]
81 Note also that the ability to discard feature templates (rather than individual features) yields faster test runtime than models regularized with the standard Lasso: fewer templates will need to be instantiated, with a speed-up in score computation. [sent-483, score-0.449]
82 Figure 3: Comparison between non-overlapping group-Lasso, coarse-to-fine group-Lasso (C2F), and a filter-based method based on information gain for selecting feature templates in multilingual dependency parsing. [sent-495, score-0.233]
83 The x-axis is the total number of features at different regularization levels, and the y-axis is the unlabeled attachment score. [sent-496, score-0.227]
84 We defined M = 684 feature templates for each candidate arc by conjoining the words, shapes, lemmas, and POS of the head and the modifier, as well as the contextual POS, and the distance and direction of attachment. [sent-501, score-0.233]
85 We followed the same two-stage approach as before, and compared with a baseline which selects feature templates by ranking them according to the information gain criterion. [sent-502, score-0.227]
86 We observe that for all but one language (Spanish is the exception), nonoverlapping group-Lasso regularization is more effective at selecting feature templates than the information gain criterion, and slightly better than coarse-to-fine group-Lasso. [sent-507, score-0.416]
87 Table 3 shows what kind of feature templates were most selected for each language. [sent-509, score-0.233]
88 6 Related Work A variant of the online proximal gradient algorithm used in this paper was proposed by Martins et al. [sent-512, score-0.397]
89 [Table 3 header row: feature-template columns including Middle POS, POS → POS, Shape, Direction, and Distance; cells are '+' marks indicating which templates were selected for each language.] Table 3: Variation of feature templates that were selected across languages. [sent-523, score-0.233]
90 Each line groups together similar templates, involving lexical, contextual POS, word shape information, as well as attachment direction and length. [sent-524, score-0.243]
91 The focus there, however, was multiple kernel learning, hence overlapping groups were not considered in their experiments. [sent-527, score-0.283]
92 Budget-driven shrinkage and the sparseptron are novel techniques, to the best of our knowledge. [sent-528, score-0.229]
93 (2011), the only work we are aware of which combines structured sparsity with structured prediction is Schmidt and Murphy (2010); however, their goal is to predict the structure of graphical models, while we are mostly interested in the structure of the feature space. [sent-530, score-0.464]
94 Mixed norm regularization has been used for a while in statistics as a means to promote structured sparsity. [sent-532, score-0.459]
95 We have shown how the latter can be useful in model design, through the use of regularizers which promote structured sparsity. [sent-548, score-0.417]
96 Our algorithm, which specializes to the sparseptron, yields a mechanism for selecting entire groups of features. [sent-550, score-0.243]
97 We apply the sparseptron to select feature templates in three structured prediction tasks, with advantages over filter-based methods, L1, and L2 regularization in terms of performance, compactness, and model interpretability. [sent-551, score-0.745]
98 Lazy sparse stochastic gradient descent for regularized multinomial logistic regression. [sent-607, score-0.261]
99 A note on the group lasso and a sparse group lasso. [sent-677, score-0.674]
100 Tree-guided group lasso for multi-task regression with structured sparsity. [sent-739, score-0.54]
wordName wordTfidf (topN-words)
[('proximal', 0.259), ('groups', 0.243), ('jenatton', 0.22), ('regularizers', 0.219), ('group', 0.204), ('regularization', 0.183), ('templates', 0.181), ('lasso', 0.172), ('sparseptron', 0.16), ('regularizer', 0.145), ('jt', 0.141), ('martins', 0.128), ('template', 0.128), ('structured', 0.128), ('gm', 0.122), ('prox', 0.12), ('sparsity', 0.115), ('dm', 0.111), ('obozinski', 0.1), ('sparse', 0.094), ('gradient', 0.094), ('gj', 0.094), ('bach', 0.094), ('loss', 0.091), ('bj', 0.087), ('lsp', 0.086), ('lsvm', 0.08), ('mairal', 0.08), ('budget', 0.078), ('norm', 0.078), ('diagram', 0.073), ('mira', 0.073), ('epochs', 0.072), ('nonzero', 0.072), ('promote', 0.07), ('langford', 0.069), ('lavergne', 0.069), ('shrinkage', 0.069), ('sang', 0.064), ('instituto', 0.062), ('dgl', 0.06), ('hasse', 0.06), ('schmidt', 0.06), ('taskar', 0.058), ('gb', 0.058), ('selection', 0.055), ('windows', 0.054), ('memory', 0.053), ('feature', 0.052), ('ye', 0.052), ('treestructured', 0.052), ('budgets', 0.052), ('figueiredo', 0.052), ('gravity', 0.052), ('elastic', 0.052), ('tsochantaridis', 0.051), ('jmlr', 0.051), ('xing', 0.049), ('wright', 0.047), ('aguiar', 0.047), ('stepsize', 0.047), ('ga', 0.047), ('eisenstein', 0.045), ('tb', 0.045), ('features', 0.044), ('online', 0.044), ('slovene', 0.043), ('nonnegative', 0.043), ('shapes', 0.043), ('compactness', 0.043), ('operator', 0.043), ('prediction', 0.041), ('exp', 0.041), ('yuan', 0.041), ('rounds', 0.041), ('petrov', 0.04), ('kernel', 0.04), ('altun', 0.04), ('bakin', 0.04), ('debiasing', 0.04), ('endowing', 0.04), ('guyon', 0.04), ('jtijj', 0.04), ('lanckriet', 0.04), ('lcrf', 0.04), ('pdd', 0.04), ('pjj', 0.04), ('poset', 0.04), ('quattoni', 0.04), ('refit', 0.04), ('sokolovska', 0.04), ('unregularized', 0.04), ('steps', 0.04), ('spanish', 0.039), ('stochastic', 0.038), ('neural', 0.038), ('xt', 0.038), ('regression', 0.036), ('regularized', 0.035), ('proximity', 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000018 129 emnlp-2011-Structured Sparsity in Structured Prediction
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1-regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
2 0.11782721 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Dual decomposition has been recently proposed as a way of combining complementary models, with a boost in predictive power. However, in cases where lightweight decompositions are not readily available (e.g., due to the presence of rich features or logical constraints), the original subgradient algorithm is inefficient. We sidestep that difficulty by adopting an augmented Lagrangian method that accelerates model consensus by regularizing towards the averaged votes. We show how first-order logical constraints can be handled efficiently, even though the corresponding subproblems are no longer combinatorial, and report experiments in dependency parsing, with state-of-the-art results. 1
3 0.11024031 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
Author: Keith Hall ; Ryan McDonald ; Jason Katz-Brown ; Michael Ringgaard
Abstract: We present an online learning algorithm for training parsers which allows for the inclusion of multiple objective functions. The primary example is the extension of a standard supervised parsing objective function with additional loss-functions, either based on intrinsic parsing quality or task-specific extrinsic measures of quality. Our empirical results show how this approach performs for two dependency parsing algorithms (graph-based and transition-based parsing) and how it achieves increased performance on multiple target tasks including reordering for machine translation and parser adaptation.
4 0.097028896 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for mul- tilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
5 0.09105207 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
Author: Tom Kwiatkowski ; Luke Zettlemoyer ; Sharon Goldwater ; Mark Steedman
Abstract: We consider the problem of learning factored probabilistic CCG grammars for semantic parsing from data containing sentences paired with logical-form meaning representations. Traditional CCG lexicons list lexical items that pair words and phrases with syntactic and semantic content. Such lexicons can be inefficient when words appear repeatedly with closely related lexical content. In this paper, we introduce factored lexicons, which include both lexemes to model word meaning and templates to model systematic variation in word usage. We also present an algorithm for learning factored CCG lexicons, along with a probabilistic parse-selection model. Evaluations on benchmark datasets demonstrate that the approach learns highly accurate parsers, whose generalization performance benefits greatly from the lexical factoring.
6 0.081720822 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
7 0.07474079 95 emnlp-2011-Multi-Source Transfer of Delexicalized Dependency Parsers
8 0.07239034 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
9 0.068079539 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
10 0.065076552 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
11 0.060721144 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction
12 0.060255095 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training
13 0.059927311 77 emnlp-2011-Large-Scale Cognate Recovery
14 0.058988106 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
15 0.056045678 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
16 0.055399731 4 emnlp-2011-A Fast, Accurate, Non-Projective, Semantically-Enriched Parser
17 0.055129405 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
18 0.054276191 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
19 0.052681927 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances
20 0.052033916 50 emnlp-2011-Evaluating Dependency Parsing: Robust and Heuristics-Free Cross-Annotation Evaluation
topicId topicWeight
[(0, 0.209), (1, 0.002), (2, -0.02), (3, 0.053), (4, 0.033), (5, 0.052), (6, -0.045), (7, -0.166), (8, -0.062), (9, 0.038), (10, -0.049), (11, -0.073), (12, -0.061), (13, -0.055), (14, -0.08), (15, -0.087), (16, -0.019), (17, -0.025), (18, -0.038), (19, -0.064), (20, -0.051), (21, -0.014), (22, 0.092), (23, -0.022), (24, 0.247), (25, 0.017), (26, -0.039), (27, 0.075), (28, -0.027), (29, -0.01), (30, 0.042), (31, 0.019), (32, -0.147), (33, -0.121), (34, 0.041), (35, -0.036), (36, -0.058), (37, 0.038), (38, 0.09), (39, 0.09), (40, -0.096), (41, -0.211), (42, -0.042), (43, -0.208), (44, 0.061), (45, 0.069), (46, -0.136), (47, 0.208), (48, -0.146), (49, 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 0.95275426 129 emnlp-2011-Structured Sparsity in Structured Prediction
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1-regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
2 0.63386685 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
Author: Dani Yogatama ; Michael Heilman ; Brendan O'Connor ; Chris Dyer ; Bryan R. Routledge ; Noah A. Smith
Abstract: We consider the problem of predicting measurable responses to scientific articles based primarily on their text content. Specifically, we consider papers in two fields (economics and computational linguistics) and make predictions about downloads and within-community citations. Our approach is based on generalized linear models, allowing interpretability; a novel extension that captures first-order temporal effects is also presented. We demonstrate that text features significantly improve accuracy of predictions over metadata features like authors, topical categories, and publication venues.
3 0.62192267 96 emnlp-2011-Multilayer Sequence Labeling
Author: Ai Azuma ; Yuji Matsumoto
Abstract: In this paper, we describe a novel approach to cascaded learning and inference on sequences. We propose a weakly joint learning model on cascaded inference on sequences, called multilayer sequence labeling. In this model, inference on sequences is modeled as cascaded decision. However, the decision on a sequence labeling sequel to other decisions utilizes the features on the preceding results as marginalized by the probabilistic models on them. It is not novel itself, but our idea central to this paper is that the probabilistic models on succeeding labeling are viewed as indirectly depending on the probabilistic models on preceding analyses. We also propose two types of efficient dynamic programming which are required in the gradient-based optimization of an objective function. One of the dynamic programming algorithms resembles back propagation algorithm for mul- tilayer feed-forward neural networks. The other is a generalized version of the forwardbackward algorithm. We also report experiments of cascaded part-of-speech tagging and chunking of English sentences and show effectiveness of the proposed method.
4 0.44750524 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
Author: Puyang Xu ; Asela Gunawardana ; Sanjeev Khudanpur
Abstract: We propose an efficient way to train maximum entropy language models (MELM) and neural network language models (NNLM). The advantage of the proposed method comes from a more robust and efficient subsampling technique. The original multi-class language modeling problem is transformed into a set of binary problems where each binary classifier predicts whether or not a particular word will occur. We show that the binarized model is as powerful as the standard model and allows us to aggressively subsample negative training examples without sacrificing predictive performance. Empirical results show that we can train MELM and NNLM at 1% ∼ 5% of the strtaaninda MrdE complexity LwMith a no %los ∼s 5in% performance.
5 0.43861842 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Dual decomposition has been recently proposed as a way of combining complementary models, with a boost in predictive power. However, in cases where lightweight decompositions are not readily available (e.g., due to the presence of rich features or logical constraints), the original subgradient algorithm is inefficient. We sidestep that difficulty by adopting an augmented Lagrangian method that accelerates model consensus by regularizing towards the averaged votes. We show how first-order logical constraints can be handled efficiently, even though the corresponding subproblems are no longer combinatorial, and report experiments in dependency parsing, with state-of-the-art results. 1
6 0.41986373 32 emnlp-2011-Computing Logical Form on Regulatory Texts
7 0.39906141 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
8 0.35266131 137 emnlp-2011-Training dependency parsers by jointly optimizing multiple objectives
9 0.3442792 100 emnlp-2011-Optimal Search for Minimum Error Rate Training
10 0.33765042 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP
11 0.33034432 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
12 0.31589612 73 emnlp-2011-Improving Bilingual Projections via Sparse Covariance Matrices
13 0.29915693 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
14 0.29807615 148 emnlp-2011-Watermarking the Outputs of Structured Prediction with an application in Statistical Machine Translation.
15 0.29172593 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation
16 0.27950394 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
17 0.27348417 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data
18 0.27000234 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing
19 0.26735428 77 emnlp-2011-Large-Scale Cognate Recovery
20 0.26225969 26 emnlp-2011-Class Label Enhancement via Related Instances
topicId topicWeight
[(23, 0.087), (36, 0.034), (37, 0.024), (45, 0.061), (53, 0.016), (54, 0.015), (62, 0.02), (64, 0.052), (66, 0.023), (69, 0.017), (79, 0.031), (82, 0.479), (87, 0.01), (90, 0.013), (96, 0.027), (98, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.88809496 129 emnlp-2011-Structured Sparsity in Structured Prediction
Author: Andre Martins ; Noah Smith ; Mario Figueiredo ; Pedro Aguiar
Abstract: Linear models have enjoyed great success in structured prediction in NLP. While a lot of progress has been made on efficient training with several loss functions, the problem of endowing learners with a mechanism for feature selection is still unsolved. Common approaches employ ad hoc filtering or L1-regularization; both ignore the structure of the feature space, preventing practitioners from encoding structural prior knowledge. We fill this gap by adopting regularizers that promote structured sparsity, along with efficient algorithms to handle them. Experiments on three tasks (chunking, entity recognition, and dependency parsing) show gains in performance, compactness, and model interpretability.
2 0.87964696 52 emnlp-2011-Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing
Author: Shay B. Cohen ; Carlos Gomez-Rodriguez ; Giorgio Satta
Abstract: We describe a generative model for nonprojective dependency parsing based on a simplified version of a transition system that has recently appeared in the literature. We then develop a dynamic programming parsing algorithm for our model, and derive an insideoutside algorithm that can be used for unsupervised learning of non-projective dependency trees.
3 0.65000308 107 emnlp-2011-Probabilistic models of similarity in syntactic context
Author: Diarmuid O Seaghdha ; Anna Korhonen
Abstract: This paper investigates novel methods for incorporating syntactic information in probabilistic latent variable models of lexical choice and contextual similarity. The resulting models capture the effects of context on the interpretation of a word and in particular its effect on the appropriateness of replacing that word with a potentially related one. Evaluating our techniques on two datasets, we report performance above the prior state of the art for estimating sentence similarity and ranking lexical substitutes.
4 0.36720565 66 emnlp-2011-Hierarchical Phrase-based Translation Representations
Author: Gonzalo Iglesias ; Cyril Allauzen ; William Byrne ; Adria de Gispert ; Michael Riley
Abstract: This paper compares several translation representations for a synchronous context-free grammar parse including CFGs/hypergraphs, finite-state automata (FSA), and pushdown automata (PDA). The representation choice is shown to determine the form and complexity of target LM intersection and shortest-path algorithms that follow. Intersection, shortest path, FSA expansion and RTN replacement algorithms are presented for PDAs. Chinese-toEnglish translation experiments using HiFST and HiPDT, FSA and PDA-based decoders, are presented using admissible (or exact) search, possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For large rulesets with large LMs, we introduce a two-pass search strategy which we then analyze in terms of search errors and translation performance.
5 0.36133918 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge
Author: Cane Wing-ki Leung ; Jing Jiang ; Kian Ming A. Chai ; Hai Leong Chieu ; Loo-Nin Teow
Abstract: We address the task of automatic discovery of information extraction template from a given text collection. Our approach clusters candidate slot fillers to identify meaningful template slots. We propose a generative model that incorporates distributional prior knowledge to help distribute candidates in a document into appropriate slots. Empirical results suggest that the proposed prior can bring substantial improvements to our task as compared to a K-means baseline and a Gaussian mixture model baseline. Specifically, the proposed prior has shown to be effective when coupled with discriminative features of the candidates.
6 0.35206452 31 emnlp-2011-Computation of Infix Probabilities for Probabilistic Context-Free Grammars
7 0.35083172 46 emnlp-2011-Efficient Subsampling for Training Complex Language Models
8 0.34775999 87 emnlp-2011-Lexical Generalization in CCG Grammar Induction for Semantic Parsing
9 0.34740648 77 emnlp-2011-Large-Scale Cognate Recovery
10 0.3453972 56 emnlp-2011-Exploring Supervised LDA Models for Assigning Attributes to Adjective-Noun Phrases
11 0.34451172 53 emnlp-2011-Experimental Support for a Categorical Compositional Distributional Model of Meaning
12 0.33938977 116 emnlp-2011-Robust Disambiguation of Named Entities in Text
13 0.33910981 59 emnlp-2011-Fast and Robust Joint Models for Biomedical Event Extraction
14 0.33825764 45 emnlp-2011-Dual Decomposition with Many Overlapping Components
15 0.33804065 68 emnlp-2011-Hypotheses Selection Criteria in a Reranking Framework for Spoken Language Understanding
16 0.33115047 49 emnlp-2011-Entire Relaxation Path for Maximum Entropy Problems
17 0.33067113 106 emnlp-2011-Predicting a Scientific Communitys Response to an Article
18 0.32904598 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation
19 0.32784238 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model
20 0.32289127 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation