acl acl2013 acl2013-334 knowledge-graph by maker-knowledge-mining

334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint


Source: pdf

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. [sent-2, score-0.482]

2 The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. [sent-3, score-0.622]

3 Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i. [sent-4, score-0.44]

4 This significant benefit enables us to provide compact model representation, which is especially useful in actual use. [sent-7, score-0.041]

5 This paper focuses on the topic of supervised model learning, which is typically represented as the following form of the optimization problem: ŵ = argmin_w O(w; D), where O(w; D) = L(w; D) + Ω(w). [sent-9, score-0.218]

6 w is an N-dimensional vector representation of a set of optimization variables, which are also interpreted as feature weights. [sent-12, score-0.288]

7 L(w; D) and Ω(w) represent a loss function and a regularization term, respectively. [sent-13, score-0.071]
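
To make this setup concrete, here is a minimal sketch (not from the paper) that instantiates L(w; D) as an averaged logistic loss and Ω(w) as an L1 penalty; the function names and data layout are illustrative assumptions only.

```python
import numpy as np

def logistic_loss(w, X, y):
    """L(w; D): averaged logistic loss over examples (rows of X, labels y in {-1, +1})."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def l1_penalty(w, lam1):
    """Omega(w): L1 regularizer, which pushes many weights exactly to zero."""
    return lam1 * np.sum(np.abs(w))

def objective(w, X, y, lam1):
    """O(w; D) = L(w; D) + Omega(w), the quantity minimized over w."""
    return logistic_loss(w, X, y) + l1_penalty(w, lam1)
```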

8 The reason is that L1-regularizers encourage feature weights to be zero as much as possible in model learning, which makes the resultant model a sparse solution (many zero-weights exist). [sent-22, score-0.258]

9 We can discard all features whose weight is zero from the trained model without any loss. [sent-23, score-0.096]

10 Therefore, L1-regularizers have the ability to easily and automatically yield compact models without strong concern over feature selection. [sent-24, score-0.147]

11 Given this background, our aim is to establish a model learning framework that can reduce the model complexity beyond that possible by simply applying L1-regularizers. [sent-26, score-0.143]

12 To achieve our goal, we focus on the recently developed concept of automatic feature grouping (Tibshirani et al. [sent-27, score-0.346]

13 We introduce a model learning framework that achieves feature grouping by incorporating a discrete constraint during model learning. [sent-29, score-0.406]

14 2 Feature Grouping Concept Going beyond L1-regularized sparse modeling, the idea of ‘automatic feature grouping’ has recently been developed. [sent-30, score-0.148]

15 Well-known examples include the fused lasso (Tibshirani et al., 2005), grouping pursuit (Shen and Huang, 2010), and OSCAR (Bondell and Reich, 2008). [sent-32, score-0.299]

16 The concept of automatic feature grouping is to find accurate models that have fewer degrees of freedom. [sent-33, score-0.464]

17 This is equivalent to enforcing the optimization variables to be equal to each other as much as possible. [sent-34, score-0.28]

18 There are several merits to reducing the degrees of freedom. (Footnote 1: this paper refers to a model after completion of (supervised) model learning as a "trained model".) [sent-46, score-0.084]

19 For example, previous studies clarified that it can reduce the chance of over-fitting to the training data (Shen and Huang, 2010). [sent-49, score-0.043]

20 This is an important property for many NLP tasks since they are often modeled with a high-dimensional feature space, and thus, the over-fitting problem is readily triggered. [sent-50, score-0.106]

21 It has also been reported that it can improve the stability of selecting non-zero features beyond that possible with the standard L1-regularizer given the existence of many highly correlated features (Jörnsten and Yu, 2003; Zou and Hastie, 2005). [sent-51, score-0.072]

22 This is because we can merge all features whose feature weight values are equivalent in the trained model into a single feature cluster without any loss. [sent-53, score-0.345]
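
As a small illustration of this merging (a sketch with made-up feature ids, not the paper's implementation), a linear score computed with one weight per feature equals the score computed with one weight per group plus per-group counts of active features.

```python
import numpy as np

def score_full(w, active):
    """Linear score with one weight per feature: sum of w[n] over active feature ids."""
    return sum(w[n] for n in active)

def score_grouped(group_of, group_weight, active):
    """Equivalent score after merging features that share a weight value:
    count active features per group, then use a single weight per group."""
    counts = {}
    for n in active:
        g = group_of[n]
        counts[g] = counts.get(g, 0) + 1
    return sum(group_weight[g] * c for g, c in counts.items())

# Toy check with illustrative values: features 0 and 1 share the weight 0.5.
w = np.array([0.5, 0.5, -1.0])
group_of = {0: 0, 1: 0, 2: 1}      # feature id -> group id
group_weight = {0: 0.5, 1: -1.0}   # group id -> shared weight
active = [0, 1, 2]
assert abs(score_full(w, active) - score_grouped(group_of, group_weight, active)) < 1e-12
```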

23 3 Modeling with Feature Grouping This section describes our proposal for obtaining a feature grouping solution. [sent-54, score-0.346]

24 1 Integration of a Discrete Constraint Let S be a finite set of discrete values, i.e. [sent-56, score-0.183]

25 How we define S is described in the experiments section, since it deeply depends on the training data. [sent-67, score-0.055]

26 Then, we define the objective that can simultaneously achieve feature grouping and model learning as follows: ŵ = argmin_w O(w; D) s.t. w ∈ S^N. [sent-68, score-0.431]

27 This constraint means that each variable (feature weight) in trained models must take a value in S, that is, ŵn ∈ S, where ŵn is the n-th factor of ŵ, and n ∈ {1, ..., N}. [sent-75, score-0.056]

28 As a result, feature weights in trained models take values only from S. [sent-79, score-0.146]

29 This is the basic idea of feature grouping proposed in this paper. [sent-83, score-0.346]

30 It is hard to directly optimize Eq. 2 since it involves an NP-hard combinatorial optimization problem. [sent-85, score-0.273]

31 The time complexity of the direct optimization is exponential in N. [sent-86, score-0.236]

32 We rewrite Eq. 2 by using the dual decomposition technique (Everett, 1963): O(w, u; D) = L(w; D) + Ω(w) + Υ(u) s.t. w = u and u ∈ S^N. [sent-95, score-0.322]

33 Eq. 3 has an additional term Υ(u), which is similar to the regularizer Ω(w), and whose optimization variables w and u are tightened with the equality constraint w = u. [sent-100, score-0.467]

34 This objective can also be viewed as the decomposition of the standard loss minimization problem shown in Eq. [sent-102, score-0.138]

35 1 and the additional discrete-constraint regularizer, Υ(u) = λ1||u||1 + (λ2/2)||u||2^2 with λ1 ≥ 0 and λ2 ≥ 0, by the dual decomposition technique. [sent-103, score-0.576]

36 To solve Eq. 3, we leverage the alternating direction method of multipliers (ADMM) (Gabay and Mercier, 1976; Boyd et al., 2011). [sent-105, score-0.164]

37 ADMM provides a very efficient optimization framework for the problem in the dual decomposition form. [sent-107, score-0.504]

38 Here, α represents the dual variables for the equality constraint w = u. [sent-108, score-0.399]

39 Eq. 3 can be converted into a series of iterative optimization problems. [sent-111, score-0.182]

40 Figure 1 shows the entire model learning framework of our proposed method. [sent-115, score-0.046]

41 The remarkable point is that ADMM works by iteratively computing one of the three optimization variable sets w, u, and α while holding the other variables fixed, over the iterations t = 1, 2, .... [sent-116, score-0.367]
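
A minimal sketch of this iteration structure is given below. It assumes the standard (unscaled) ADMM dual update with step size ξρ and the usual primal/dual residual test; the w_update and u_update callables stand in for Step1 and Step2 described next, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def dc_admm(w_update, u_update, rho, xi, N, max_iter=100, eps=1e-4):
    """Skeleton of the three-block iteration (Figure 1 style): repeatedly update w,
    then u, then the dual variables alpha, holding the other blocks fixed."""
    w = np.zeros(N)
    u = np.zeros(N)
    alpha = np.zeros(N)
    for t in range(1, max_iter + 1):
        w = w_update(u, alpha)                       # Step1: biased L2-regularized learning
        u_new = u_update(w, alpha)                   # Step2: closed form + projection onto S^N
        alpha = alpha + xi * rho * (w - u_new)       # Step3: gradient ascent on the dual variables
        primal_res = np.linalg.norm(w - u_new)       # Step4: convergence check using the
        dual_res = rho * np.linalg.norm(u_new - u)   #        primal and dual residuals
        u = u_new
        if primal_res < eps and dual_res < eps:
            break
    return u  # the output u(t+1), whose entries all lie in S
```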

42 Step1 (w-update): This part of the optimization problem shown in Eq. [sent-120, score-0.182]

43 ‘bias’ means here that the direction of regularization is toward point a instead of the origin. [sent-123, score-0.163]

44 We can select any learning algorithm that can handle the L2-regularizer for this part of the optimization. [sent-125, score-0.087]
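
For instance, a sketch of such a Step1 solver is shown below, assuming a logistic loss and that the bias point a is formed from u and α as in standard ADMM; plain full-batch gradient descent is used only for brevity, and every name here is hypothetical.

```python
import numpy as np

def w_update_biased_l2(X, y, a, rho, lr=0.1, n_steps=200):
    """Step1 sketch: minimize L(w; D) + (rho/2) * ||w - a||^2 by gradient descent.
    The quadratic term pulls w toward the point a (built from u and alpha) rather
    than toward the origin, which is the 'bias' mentioned in the text."""
    w = a.copy()
    for _ in range(n_steps):
        margins = y * (X @ w)
        sigma = 0.5 * (1.0 - np.tanh(0.5 * margins))   # stable form of 1 / (1 + exp(margins))
        grad_loss = -(X.T @ (y * sigma)) / X.shape[0]  # gradient of the averaged logistic loss
        w -= lr * (grad_loss + rho * (w - a))
    return w
```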

45 Step2 (u-update): This part of the optimization problem shown in Eq. [sent-126, score-0.182]

46 (Footnote 3: standard dual decomposition can be viewed as the case ρ = 0.) Input: training data D, parameters ρ, ξ, ... [sent-128, score-0.322]

47 Initialize: w(1) = 0, u(1) = 0, α(1) = 0, and t = 1. [sent-130, score-0.222]

48 Output: u(t+1). Figure 1: Entire learning framework of our method derived from ADMM (Boyd et al., 2011). [sent-139, score-0.046]

49 The remarkable point of Eq. 5 is that the costly combinatorial optimization problem disappears; instead, we are only required to perform two feature-wise calculations whose total time complexity is O(N log |S|) and which are fully parallelizable. [sent-150, score-0.523]

50 A similar technique has been introduced in Zhong and Kwok (2011) for discarding a costly combinatorial problem from the optimization with OSCAR-regularizers with the help of proximal gradient methods, i.e. [sent-151, score-0.358]

51 Eq. 8 is a convex and also symmetric function with respect to û0, where û0 is the optimal solution of Eq. [sent-158, score-0.148]

52 Therefore, the optimal solution û is at the point where ... (Figure 2 Input: b0 = (b0n) for n = 1, ..., N; λ′1; and S.) [sent-160, score-0.159]

53 The optimization of mixed L2- and L1-norms is known to have a closed-form solution, i.e. [sent-164, score-0.182]

54 2. Find the nearest valid point in S^N from û0 in terms of the L2-distance; ûn = argmin_{u ∈ S} (û0n − u)^2. [sent-167, score-0.108]

55 This can be performed by a binary search, whose time complexity is generally O(log |S|), where û = (ûn) for n = 1, ..., N. [sent-168, score-0.094]

56 Output: û. Figure 2: Procedure for solving Step2: find the nearest valid point in S^N from û0 in terms of the L2-distance. [sent-169, score-0.108]

57 The valid points are located at the vertices of axis-aligned orthotopes (hyperrectangles) in the parameter space of feature weights. [sent-171, score-0.106]

58 Thus, the solution û, which is the nearest valid point from û0, can be obtained by individually taking the nearest value in S from û0n for all n. [sent-172, score-0.238]
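
Putting the two stages of Figure 2 together, the sketch below first applies the closed-form shrinkage step and then snaps each coordinate to its nearest value in S by binary search. The shrinkage form of the mixed L2/L1 solution and the inputs b0 and λ1 are assumptions taken as given, since the exact equations are not reproduced in this summary.

```python
import bisect
import numpy as np

def soft_threshold(b0, lam1):
    """Closed-form solution of the mixed L2/L1 subproblem (assumed to be the usual
    shrinkage operator; the exact scaling of b0 and lam1 is taken as given)."""
    return np.sign(b0) * np.maximum(np.abs(b0) - lam1, 0.0)

def project_to_S(u0, S_sorted):
    """Figure 2, second stage: for each coordinate, take the nearest value in the
    sorted candidate set S (binary search, O(log |S|) per feature)."""
    u = np.empty_like(u0, dtype=float)
    for n, v in enumerate(u0):
        i = bisect.bisect_left(S_sorted, v)
        candidates = [S_sorted[j] for j in (i - 1, i) if 0 <= j < len(S_sorted)]
        u[n] = min(candidates, key=lambda s: (v - s) ** 2)
    return u

def u_update(b0, lam1, S_sorted):
    """Step2 sketch: shrink, then snap each weight to the nearest allowed value."""
    return project_to_S(soft_threshold(b0, lam1), S_sorted)
```

Sorting S once up front keeps each per-feature lookup at O(log |S|), matching the complexity stated above, and the per-feature loops are trivially parallelizable.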

59 Step3 (α-update): We perform gradient ascent on dual variables to tighten the constraint w = u. [sent-173, score-0.445]

60 Note that ξ is the learning rate; we can simply set it to 1. [sent-174, score-0.046]

61 Step4 (convergence check): We can evaluate both the primal and dual residuals as defined in Eq. [sent-177, score-0.318]

62 3 Online Learning We can select an online learning algorithm for Step1 since the ADMM framework does not require exact minimization of Eq. [sent-182, score-0.178]

63 Note that the total calculation cost of our method does not increase much over the original online learning algorithm, since the calculation cost of Steps 2 through 4 is relatively much smaller than that of Step1. [sent-186, score-0.218]

64 Our decoding models are the Viterbi algorithm on CRF (Lafferty et al. [sent-192, score-0.041]

65 Features are automatically generated according to the predefined feature templates widely used in previous studies. [sent-194, score-0.106]

66 Evaluation measures: The purpose of our experiments is to investigate the effectiveness of our proposed method in terms of both its performance and the complexity of the trained model. [sent-197, score-0.11]

67 Task performance was mainly evaluated in terms of the complete sentence accuracy (COMP) since the objective of all model learning methods evaluated in our experiments is to maximize COMP. [sent-199, score-0.085]

68 Model complexity is evaluated by the number of non-zero active features (#nzF) and the degree of freedom (#DoF) (Zhong and Kwok, 2011). [sent-201, score-0.281]

69 #nzF is the number of features whose corresponding feature weight is non-zero in the trained model, and #DoF is the number of unique non-zero feature weights. [sent-202, score-0.308]
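
As a quick sketch of these two measures (a hypothetical helper, with a tolerance argument added for numerical zero-checking):

```python
import numpy as np

def model_complexity(w, tol=0.0):
    """#nzF: number of features with non-zero weight; #DoF: number of unique
    non-zero weight values in the trained model."""
    nonzero = w[np.abs(w) > tol]
    return nonzero.size, np.unique(nonzero).size

# Example: four active features but only two distinct non-zero values -> #nzF = 4, #DoF = 2.
print(model_complexity(np.array([0.5, -0.25, 0.5, 0.0, -0.25])))
```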

70 Baseline methods: Our main baseline is L1-regularized sparse modeling. [sent-203, score-0.042]

71 To cover both batch and online learning, we selected L1-regularized CRF (L1CRF) (Lafferty et al. [sent-204, score-0.092]

72 , 2001) optimized by OWL-QN (Andrew and Gao, 2007) for the NER experiment, and the L1-regularized regularized dual averaging (L1RDA) method (Xiao, 2010)4 for DEPAR. [sent-205, score-0.258]

73 Additionally, we also evaluated L2-regularized CRF (L2CRF) with L-BFGS (Liu and Nocedal, 1989) for NER, and the passive-aggressive algorithm (L2PA) (Crammer et al. [sent-206, score-0.041]

74 For a fair comparison, we applied the procedure of Step2 as a simple quantization method to trained models obtained from L1-regularized model learning, which we refer to as QT. [sent-209, score-0.109]

75 1 Configurations of Our Method Base learning algorithm: The settings of our method in our experiments imitate the L1-regularized learning algorithms, since the purpose of our experiments is to investigate the effectiveness against standard L1-regularized learning algorithms. [sent-215, score-0.179]

76 Then, we have the following two possible settings: DC-ADMM: we leveraged the baseline L1-regularized learning algorithm to solve Step1, and set λ1 = 0 and λ2 = 0 for Step2. [sent-216, score-0.172]

77 DCwL1ADMM: we leveraged the baseline L2-regularized learning algorithm, but without the L2-regularizer, to solve Step1, and set λ1 > 0 and λ2 = 0 for Step2. [sent-217, score-0.131]

78 However, we have to carefully select it since it deeply affects the performance. [sent-223, score-0.055]

79 Actually, this is the most important consideration in our method. [sent-224, score-0.048]

80 Here, we introduce an example of a template which is suitable for a large feature set. [sent-226, score-0.106]

81 Then, we define a finite set of values S as follows: Sη,δ,κ,ζ = {fη,δ,κ(x, y) | (x, y) ∈ Sζ × σ} ∪ {0}, where Sζ is the set of non-negative integers from zero to ζ − 1, that is, Sζ = {0, 1, ..., ζ − 1}. For example, if we set η = 0. [sent-228, score-0.045]

82 The distribution of the feature weights in a trained model often takes a form similar to that of the 'power law' in the case of large feature sets. [sent-242, score-0.308]

83 (Figure 3 plot area: x-axis is the # of degrees of freedom (#DoF) [log-scale]; panels (a) NER and (b) DEPAR.) Figure 3: Performance vs. [sent-251, score-0.614]

84 degree of freedom in the trained model for the development data. Note that we can control the upper bound of #DoF in the trained model by ζ; namely, if ζ = 4 then the upper bound of #DoF is 8 (doubled by the positive and negative sides). [sent-252, score-0.427]
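
The exact value template fη,δ,κ cannot be recovered from this summary, so the sketch below only assumes an exponentially spaced positive set mirrored to the negative side plus zero; it is meant purely to illustrate how ζ caps #DoF at 2ζ (8 when ζ = 4), not to reproduce the paper's definition.

```python
def build_candidate_set(eta, delta, zeta):
    """Illustrative candidate set: zeta positive values eta * delta**x for x = 0..zeta-1,
    mirrored to the negative side, plus zero."""
    positives = [eta * (delta ** x) for x in range(zeta)]
    return sorted({0.0} | set(positives) | {-v for v in positives})

S = build_candidate_set(eta=0.1, delta=2.0, zeta=4)
print(S)           # 2 * zeta non-zero values plus 0
print(len(S) - 1)  # upper bound on #DoF: 8 when zeta = 4
```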

85 Figure 3 shows the task performance on the development data against the model complexities in terms of the degrees of freedom in the trained models. [sent-257, score-0.41]

86 The plots of the standard L1-regularized methods are given by changing the regularization constants λ1. [sent-259, score-0.114]

87 According to the figure and table, the most remarkable point is that DC-ADMM successfully maintained the task performance even if #DoF (the degree of freedom) was 8, and the performance drop-offs were surprisingly limited even if #DoF was 2, which is the upper bound of feature grouping. [sent-262, score-0.353]

88 The reason may be that such low degrees of freedom prevent over-fitting to the training data. [sent-264, score-0.307]

89 Surprisingly, the simple quantization method (QT) provided fairly good results. [sent-265, score-0.053]

90 In contrast, DC-ADMM can truly provide the optimal solution of Eq. [sent-267, score-0.111]

91 3 since the discrete constraint is also considered during the model learning. [sent-268, score-0.254]

92 In general, a trained model consists of two parts: [sent-269, score-0.056]

93 feature weights and an indexed structure of feature strings, which are used as the key for obtaining the corresponding feature weight. [sent-273, score-0.358]

94 This paper mainly discussed how to reduce the size of the former part, and described its successful reduction. [sent-274, score-0.043]

95 We note that it is also possible to reduce the latter part, especially if the feature string structure is a trie. [sent-275, score-0.149]

96 We omit the details here since it is not the main topic of this paper, but by merging feature strings that have the same feature weights, the size of the entire trained models in our DEPAR case can be made about 10 times smaller than those obtained by standard L1-regularization, i.e. [sent-276, score-0.268]
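
A sketch of the underlying idea (a hypothetical dictionary-based layout; the trie structure itself is not shown): store one small table of distinct weights and map each feature string to an index into it, so all features sharing a weight share a single float entry.

```python
def compress_model(feature_weights):
    """Map each feature string to an index into a small table of distinct weights.
    With #DoF distinct values, the float table has at most #DoF entries, so most of
    the remaining size is the string-to-index structure (which a trie would shrink further)."""
    weight_table = sorted(set(feature_weights.values()))
    index_of = {w: i for i, w in enumerate(weight_table)}
    feat_to_idx = {f: index_of[w] for f, w in feature_weights.items()}
    return weight_table, feat_to_idx

def lookup(weight_table, feat_to_idx, feature):
    """Recover a feature's weight from the compressed representation (0.0 if absent)."""
    return weight_table[feat_to_idx[feature]] if feature in feat_to_idx else 0.0

# Toy example with made-up feature strings.
weights = {"w[0]=the": 0.5, "w[0]=a": 0.5, "pos[0]=NN": -0.25}
table, idx = compress_model(weights)
assert lookup(table, idx, "w[0]=the") == 0.5
```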

97 5 Conclusion This paper proposed a model learning framework that simultaneously realizes feature grouping and model learning by incorporating a simple discrete constraint into the model learning optimization. [sent-281, score-0.692]

98 This paper also introduced a feasible algorithm, DC-ADMM, which eliminates the intractable combinatorial optimization part from the entire learning algorithm with the help of the ADMM technique. [sent-282, score-0.36]

99 Experiments showed that DC-ADMM drastically reduced model complexity in terms of the degrees of freedom in trained models while maintaining the performance. [sent-283, score-0.417]

100 There may exist theoretically cleverer approaches to feature grouping, but the performance of DC-ADMM is close to the upper bound. [sent-284, score-0.15]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('admm', 0.325), ('depar', 0.255), ('dof', 0.255), ('grouping', 0.24), ('dual', 0.222), ('freedom', 0.189), ('optimization', 0.182), ('discrete', 0.138), ('boyd', 0.132), ('ner', 0.126), ('degrees', 0.118), ('constraint', 0.116), ('bondell', 0.108), ('feature', 0.106), ('decomposition', 0.1), ('primal', 0.096), ('combinatorial', 0.091), ('beck', 0.083), ('zhong', 0.076), ('remarkable', 0.076), ('qt', 0.076), ('tsuruoka', 0.073), ('sn', 0.073), ('dcadmm', 0.072), ('gabay', 0.072), ('nzf', 0.072), ('ornsten', 0.072), ('teboulle', 0.072), ('regularization', 0.071), ('solution', 0.07), ('regularizer', 0.068), ('tibshirani', 0.065), ('alternating', 0.065), ('reich', 0.064), ('kwok', 0.064), ('gao', 0.061), ('variables', 0.061), ('nearest', 0.06), ('zou', 0.059), ('ntt', 0.059), ('fused', 0.059), ('leaning', 0.059), ('sang', 0.059), ('tjong', 0.059), ('pursuit', 0.059), ('trained', 0.056), ('multiplier', 0.055), ('deeply', 0.055), ('complexity', 0.054), ('crammer', 0.054), ('online', 0.053), ('galen', 0.053), ('quantization', 0.053), ('lw', 0.049), ('duchi', 0.049), ('point', 0.048), ('duh', 0.047), ('complexities', 0.047), ('gradient', 0.046), ('learning', 0.046), ('xavier', 0.046), ('finite', 0.045), ('upper', 0.044), ('lafferty', 0.044), ('direction', 0.044), ('solve', 0.044), ('reduce', 0.043), ('masaaki', 0.043), ('plots', 0.043), ('simultaneous', 0.042), ('sparse', 0.042), ('algorithm', 0.041), ('compact', 0.041), ('crf', 0.041), ('leveraged', 0.041), ('surprisingly', 0.041), ('optimal', 0.041), ('koby', 0.04), ('whose', 0.04), ('weights', 0.04), ('batch', 0.039), ('jun', 0.039), ('costly', 0.039), ('calculation', 0.039), ('carreras', 0.039), ('objective', 0.039), ('minimization', 0.038), ('koo', 0.038), ('degree', 0.038), ('suzuki', 0.037), ('convex', 0.037), ('jianfeng', 0.037), ('equivalent', 0.037), ('royal', 0.037), ('dependency', 0.036), ('memory', 0.036), ('mcdonald', 0.036), ('regularized', 0.036), ('supervised', 0.036), ('shen', 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999911 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

2 0.14238411 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

Author: Mengqiu Wang ; Wanxiang Che ; Christopher D. Manning

Abstract: Translated bi-texts contain complementary language cues, and previous work on Named Entity Recognition (NER) has demonstrated improvements in performance over monolingual taggers by promoting agreement of tagging decisions between the two languages. However, most previous approaches to bilingual tagging assume word alignments are given as fixed input, which can cause cascading errors. We observe that NER label information can be used to correct alignment mistakes, and present a graphical model that performs bilingual NER tagging jointly with word alignment, by combining two monolingual tagging models with two unidirectional alignment models. We introduce additional cross-lingual edge factors that encourage agreements between tagging and alignment decisions. We design a dual decomposition inference algorithm to perform joint decoding over the combined alignment and NER output space. Experiments on the OntoNotes dataset demonstrate that our method yields significant improvements in both NER and word alignment over state-of-the-art monolingual baselines.

3 0.12613244 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

Author: Andre Martins ; Miguel Almeida ; Noah A. Smith

Abstract: We present fast, accurate, direct nonprojective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).

4 0.11906013 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

Author: Chris Quirk

Abstract: The notion of fertility in word alignment (the number of words emitted by a single state) is useful but difficult to model. Initial attempts at modeling fertility used heuristic search methods. Recent approaches instead use more principled approximate inference techniques such as Gibbs sampling for parameter estimation. Yet in practice we also need the single best alignment, which is difficult to find using Gibbs. Building on recent advances in dual decomposition, this paper introduces an exact algorithm for finding the single best alignment with a fertility HMM. Finding the best alignment appears important, as this model leads to a substantial improvement in alignment quality.

5 0.11373106 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

Author: Rui Xia ; Tao Wang ; Xuelei Hu ; Shoushan Li ; Chengqing Zong

Abstract: Bag-of-words (BOW) is now the most popular way to model text in machine learning based sentiment classification. However, the performance of such approach sometimes remains rather limited due to some fundamental deficiencies of the BOW model. In this paper, we focus on the polarity shift problem, and propose a novel approach, called dual training and dual prediction (DTDP), to address it. The basic idea of DTDP is to first generate artificial samples that are polarity-opposite to the original samples by polarity reversion, and then leverage both the original and opposite samples for (dual) training and (dual) prediction. Experimental results on four datasets demonstrate the effectiveness of the proposed approach for polarity classification.

6 0.112572 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

7 0.10316277 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

8 0.098286442 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

9 0.094751358 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

10 0.090965919 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

11 0.083063722 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

12 0.082377486 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

13 0.081866242 237 acl-2013-Margin-based Decomposed Amortized Inference

14 0.078978568 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

15 0.075986326 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

16 0.071442127 358 acl-2013-Transition-based Dependency Parsing with Selectional Branching

17 0.066689596 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

18 0.065813631 112 acl-2013-Dependency Parser Adaptation with Subtrees from Auto-Parsed Target Domain Data

19 0.064742118 38 acl-2013-Additive Neural Networks for Statistical Machine Translation

20 0.063778415 173 acl-2013-Graph-based Semi-Supervised Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.189), (1, -0.031), (2, -0.052), (3, 0.014), (4, -0.001), (5, -0.016), (6, 0.048), (7, -0.028), (8, -0.069), (9, -0.022), (10, 0.028), (11, -0.081), (12, -0.055), (13, -0.13), (14, -0.021), (15, 0.059), (16, -0.019), (17, 0.031), (18, 0.065), (19, -0.033), (20, 0.063), (21, 0.107), (22, -0.013), (23, 0.125), (24, -0.005), (25, 0.056), (26, -0.038), (27, -0.008), (28, -0.007), (29, -0.01), (30, 0.148), (31, -0.01), (32, -0.036), (33, 0.056), (34, 0.006), (35, 0.035), (36, 0.008), (37, 0.022), (38, 0.145), (39, 0.054), (40, -0.013), (41, -0.043), (42, -0.017), (43, -0.042), (44, -0.068), (45, -0.055), (46, 0.042), (47, -0.007), (48, -0.004), (49, -0.072)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9081611 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

2 0.74621189 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

Author: Matthew R. Gormley ; Jason Eisner

Abstract: Many models in NLP involve latent variables, such as unknown parses, tags, or alignments. Finding the optimal model parameters is then usually a difficult nonconvex optimization problem. The usual practice is to settle for local optimization methods such as EM or gradient ascent. We explore how one might instead search for a global optimum in parameter space, using branch-and-bound. Our method would eventually find the global maximum (up to a user-specified ?) if run for long enough, but at any point can return a suboptimal solution together with an upper bound on the global maximum. As an illustrative case, we study a generative model for dependency parsing. We search for the maximum-likelihood model parameters and corpus parse, subject to posterior constraints. We show how to formulate this as a mixed integer quadratic programming problem with nonlinear constraints. We use the Reformulation Linearization Technique to produce convex relaxations during branch-and-bound. Although these techniques do not yet provide a practical solution to our instance of this NP-hard problem, they sometimes find better solutions than Viterbi EM with random restarts, in the same time.

3 0.7287727 237 acl-2013-Margin-based Decomposed Amortized Inference

Author: Gourab Kundu ; Vivek Srikumar ; Dan Roth

Abstract: Given that structured output prediction is typically performed over entire datasets, one natural question is whether it is possible to re-use computation from earlier inference instances to speed up inference for future instances. Amortized inference has been proposed as a way to accomplish this. In this paper, first, we introduce a new amortized inference algorithm called the Margin-based Amortized Inference, which uses the notion of structured margin to identify inference problems for which previous solutions are provably optimal. Second, we introduce decomposed amortized inference, which is designed to address very large inference problems, where earlier amortization methods become less effective. This approach works by decomposing the output structure and applying amortization piece-wise, thus increasing the chance that we can re-use previous solutions for parts of the output structure. These parts are then combined to a global coherent solution using Lagrangian relaxation. In our experiments, using the NLP tasks of semantic role labeling and entity-relation extraction, we demonstrate that with the margin-based algorithm, we need to call the inference engine only for a third of the test examples. Further, we show that the decomposed variant of margin-based amortized inference achieves a greater reduction in the number of inference calls.

4 0.705185 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

Author: Andre Martins ; Miguel Almeida ; Noah A. Smith

Abstract: We present fast, accurate, direct nonprojective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).

5 0.66684353 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation

Author: Vladimir Eidelman ; Yuval Marton ; Philip Resnik

Abstract: Recent advances in large-margin learning have shown that better generalization can be achieved by incorporating higher order information into the optimization, such as the spread of the data. However, these solutions are impractical in complex structured prediction problems such as statistical machine translation. We present an online gradient-based algorithm for relative margin maximization, which bounds the spread of the projected data while maximizing the margin. We evaluate our optimizer on Chinese-English and Arabic-English translation tasks, each with small and large feature sets, and show that our learner is able to achieve significant improvements of 1.2-2 BLEU and 1.7-4.3 TER on average over state-of-the-art optimizers with the large feature set.

6 0.66207832 251 acl-2013-Mr. MIRA: Open-Source Large-Margin Structured Learning on MapReduce

7 0.65639257 382 acl-2013-Variational Inference for Structured NLP Models

8 0.64231467 143 acl-2013-Exact Maximum Inference for the Fertility Hidden Markov Model

9 0.62037134 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

10 0.57992965 210 acl-2013-Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition

11 0.56433368 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

12 0.53927147 156 acl-2013-Fast and Adaptive Online Training of Feature-Rich Translation Models

13 0.53791839 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification

14 0.515854 349 acl-2013-The mathematics of language learning

15 0.4880054 221 acl-2013-Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines

16 0.46982786 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

17 0.46391353 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing

18 0.46267366 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

19 0.45481354 332 acl-2013-Subtree Extractive Summarization via Submodular Maximization

20 0.4535594 370 acl-2013-Unsupervised Transcription of Historical Documents


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.053), (6, 0.046), (11, 0.051), (24, 0.022), (26, 0.05), (28, 0.013), (35, 0.039), (42, 0.032), (48, 0.514), (70, 0.038), (88, 0.017), (90, 0.014), (95, 0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96512443 334 acl-2013-Supervised Model Learning with Feature Grouping based on a Discrete Constraint

Author: Jun Suzuki ; Masaaki Nagata

Abstract: This paper proposes a framework of supervised model learning that realizes feature grouping to obtain lower complexity models. The main idea of our method is to integrate a discrete constraint into model learning with the help of the dual decomposition technique. Experiments on two well-studied NLP tasks, dependency parsing and NER, demonstrate that our method can provide state-of-the-art performance even if the degrees of freedom in trained models are surprisingly small, i.e., 8 or even 2. This significant benefit enables us to provide compact model representation, which is especially useful in actual use.

2 0.93958914 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors

Author: Volkan Cirik

Abstract: We study substitute vectors to solve the part-of-speech ambiguity problem in an unsupervised setting. Part-of-speech tagging is a crucial preliminary process in many natural language processing applications. Because many words in natural languages have more than one part-of-speech tag, resolving part-of-speech ambiguity is an important task. We claim that part-of-speech ambiguity can be solved using substitute vectors. A substitute vector is constructed with possible substitutes of a target word. This study is built on previous work which has proven that word substitutes are very fruitful for part-of-speech induction. Experiments show that our methodology works for words with high ambiguity.

3 0.93156505 54 acl-2013-Are School-of-thought Words Characterizable?

Author: Xiaorui Jiang ; Xiaoping Sun ; Hai Zhuge

Abstract: School of thought analysis is an important yet not-well-elaborated scientific knowledge discovery task. This paper makes the first attempt at this problem. We focus on one aspect of the problem: do characteristic school-of-thought words exist and whether they are characterizable? To answer these questions, we propose a probabilistic generative School-Of-Thought (SOT) model to simulate the scientific authoring process based on several assumptions. SOT defines a school of thought as a distribution of topics and assumes that authors determine the school of thought for each sentence before choosing words to deliver scientific ideas. SOT distinguishes between two types of school-of-thought words for either the general background of a school of thought or the original ideas each paper contributes to its school of thought. Narrative and quantitative experiments show positive and promising results to the questions raised above.

4 0.87299001 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates

Author: Tiziano Flati ; Roberto Navigli

Abstract: We present SPred, a novel method for the creation of large repositories of semantic predicates. We start from existing collocations to form lexical predicates (e.g., break ∗) and learn the semantic classes that best fit the ∗ argument. To do this, we extract all the occurrences in Wikipedia which match the predicate and abstract its arguments to general semantic classes (e.g., break BODY PART, break AGREEMENT, etc.). Our experiments show that we are able to create a large collection of semantic predicates from the Oxford Advanced Learner's Dictionary with high precision and recall, and perform well against the most similar approach.

5 0.86407167 354 acl-2013-Training Nondeficient Variants of IBM-3 and IBM-4 for Word Alignment

Author: Thomas Schoenemann

Abstract: We derive variants of the fertility based models IBM-3 and IBM-4 that, while maintaining their zero and first order parameters, are nondeficient. Subsequently, we proceed to derive a method to compute a likely alignment and its neighbors as well as give a solution of EM training. The arising M-step energies are non-trivial and handled via projected gradient ascent. Our evaluation on gold alignments shows substantial improvements (in weighted F-measure) for the IBM-3. For the IBM-4 there are no consistent improvements. Training the nondeficient IBM-5 in the regular way gives surprisingly good results. Using the resulting alignments for phrase-based translation systems offers no clear insights w.r.t. BLEU scores.

6 0.85596085 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics

7 0.62959212 237 acl-2013-Margin-based Decomposed Amortized Inference

8 0.62375885 188 acl-2013-Identifying Sentiment Words Using an Optimization-based Model without Seed Words

9 0.61700386 7 acl-2013-A Lattice-based Framework for Joint Chinese Word Segmentation, POS Tagging and Parsing

10 0.60727584 103 acl-2013-DISSECT - DIStributional SEmantics Composition Toolkit

11 0.59663689 294 acl-2013-Re-embedding words

12 0.58521658 260 acl-2013-Nonconvex Global Optimization for Latent-Variable Models

13 0.57792634 62 acl-2013-Automatic Term Ambiguity Detection

14 0.57784992 275 acl-2013-Parsing with Compositional Vector Grammars

15 0.57468617 109 acl-2013-Decipherment Complexity in 1:1 Substitution Ciphers

16 0.5669207 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis

17 0.55912089 91 acl-2013-Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning

18 0.54902667 175 acl-2013-Grounded Language Learning from Video Described with Sentences

19 0.54860407 191 acl-2013-Improved Bayesian Logistic Supervised Topic Models with Data Augmentation

20 0.54707265 264 acl-2013-Online Relative Margin Maximization for Statistical Machine Translation