emnlp emnlp2013 emnlp2013-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. [sent-6, score-0.191]
2 A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. [sent-7, score-0.306]
3 We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. [sent-8, score-0.537]
4 We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. [sent-9, score-0.121]
5 We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. [sent-10, score-0.483]
6 The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. [sent-11, score-0.56]
7 As a result, balancing overfitting versus underfitting through good weight regularization remains a key issue for achieving optimal performance. [sent-14, score-0.191]
8 Traditionally, L2 or L1 regularization is employed, but these simple types of regularization penalize all features in a uniform way without taking into account the properties of the actual model. [sent-15, score-0.33]
9 An alternative approach to regularization is to generate fake training data by adding random noise to the input features of the original training data. [sent-16, score-0.306]
10 , 2013), but working directly with many corrupted copies of a dataset can be computationally prohibitive. [sent-21, score-0.078]
11 Fortunately, feature noising ideas often lead to tractable deterministic objectives that can be optimized directly. [sent-22, score-0.529]
12 Sometimes, training with corrupted features reduces to a special form of regularization (Matsuoka, 1992; Bishop, 1995; Rifai et al. [sent-23, score-0.243]
13 For example, Bishop (1995) showed that training with features that have been corrupted with additive Gaussian noise is equivalent to a form of L2 regularization in the low noise limit. [sent-26, score-0.419]
14 In other cases it is possible to develop a new objective function by marginalizing over the artificial noise (Wang and Manning, 2013; van der Maaten et al. [sent-27, score-0.114]
15 The central contribution of this paper is to show how to efficiently simulate training with artificially noised features in the context of log-linear structured prediction, without actually having to generate noised data. [sent-29, score-0.307]
16 , 2012), a recently popularized form of artificial feature noise where a random subset of features is omitted independently for each training example. [sent-31, score-0.133]
17 Dropout and its variants have been shown to outperform L2 regularization on various tasks (Hinton et al. [sent-32, score-0.165]
18 Dropout is similar in spirit to feature bagging in the deliberate removal of features, but performs the removal in a preset way rather than randomly (Bryll et al. [sent-35, score-0.127]
19 Our approach is based on a second-order approximation to feature noising developed among others by Bishop (1995) and Wager et al. [sent-41, score-0.604]
20 (2013), which allows us to convert dropout noise into a form of adaptive regularization. [sent-42, score-0.387]
21 This method is suitable for structured prediction in log-linear models where second derivatives are computable. [sent-43, score-0.096]
22 In particular, it can be used for multiclass classification with maximum entropy models (a.k.a. [sent-44, score-0.117]
23 softmax or multinomial logistic regression) and for the sequence models that are ubiquitous in NLP, via linear chain Conditional Random Fields (CRFs). [sent-47, score-0.089]
24 For linear chain CRFs, we additionally show how we can use a noising scheme that takes advantage of the clique structure so that the resulting noising regularizer can be computed in terms of the pairwise marginals. [sent-48, score-1.417]
25 A simple forward-backward-type dynamic program can then be used to compute the gradient tractably. [sent-49, score-0.108]
26 For ease of implementation and scalability to semi-supervised learning, we also outline an even faster approximation to the regularizer. [sent-50, score-0.075]
27 The general approach also works in other clique structures in addition to the linear chain when the clique marginals can be computed efficiently. [sent-51, score-0.157]
28 Finally, we extend feature noising for structured prediction to a transductive or semi-supervised setting. [sent-52, score-0.738]
29 The regularizer induced by feature noising is label-independent for log-linear models, and so we can use unlabeled data to learn a better regularizer. [sent-53, score-0.951]
30 NLP sequence labeling tasks are especially well suited to a semi-supervised approach, as input features are numerous but sparse, and labeled data is expensive to obtain but unlabeled data is abundant (Li and McCallum, 2005; Jiao et al. [sent-54, score-0.077]
31 (2013) showed that semi-supervised dropout training for logistic regression captures a similar intuition to techniques such as entropy regularization (Grandvalet and Bengio, 2005) and transductive SVMs (Joachims, 1999), which encourage confident predictions on the unlabeled data. [sent-57, score-0.709]
32 , s|Y|) be a vector of scores for each output, with sy = f(y, x) · θ. [sent-75, score-0.083]
33 The key idea behind feature noising is to artificially corrupt the feature vector f(y, x) randomly into some f̃(y, x) and then maximize the average log-likelihood of y given these corrupted features—the motivation is to choose predictors θ that are robust to noise (missing words for example). [sent-78, score-0.764]
34 We will also assume the feature noising preserves the mean: E[f̃(y, x)] = f(y, x), so that E[s̃] = s. [sent-80, score-0.507]
35 This can always be done by scaling the noised features as described in the list of noising schemes. [sent-81, score-0.61]
36 It is useful to view feature noising as a form of regularization. [sent-82, score-0.529]
37 Since feature noising preserves the mean, the feature noising objective can be written as the original log-likelihood minus the difference in log-normalization constants: E[log p̃(y | x; θ)] = E[s̃y − A(s̃)] = log p(y | x; θ) − R(θ, x), where R(θ, x) def= E[A(s̃)] − A(s). [sent-83, score-1.206]
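As a sanity check on the identity above, R(θ, x) can be estimated by brute force: sample noised feature vectors, average A(s̃), and subtract A(s). The sketch below does this for a single example under the multiclass specialization used later in the section (one weight row per class, a shared feature vector g(x)); the dropout rate, the choice of an independent mask per class score, and all names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def log_partition(s):
    """A(s) = log sum_y exp(s_y), computed stably."""
    m = s.max()
    return m + np.log(np.exp(s - m).sum())

def mc_noising_regularizer(theta, g, delta=0.3, n_samples=2000, seed=0):
    """Monte Carlo estimate of R(theta, x) = E[A(s_noised)] - A(s) for one
    multiclass example with score s_y = theta[y] . g. Dropout keeps each
    feature with probability 1 - delta and rescales by 1/(1 - delta), so
    the noised scores are unbiased: E[s_noised] = s. Drawing a separate
    mask per class score is an assumption made for this sketch.
    """
    rng = np.random.default_rng(seed)
    K, d = theta.shape
    s = theta @ g
    total = 0.0
    for _ in range(n_samples):
        keep = rng.random((K, d)) > delta
        s_noised = (theta * keep) @ g / (1.0 - delta)
        total += log_partition(s_noised)
    # Jensen's inequality (A is convex) makes the estimate non-negative
    # up to Monte Carlo error.
    return total / n_samples - log_partition(s)

# Usage with made-up numbers.
rng = np.random.default_rng(1)
theta, g = rng.normal(size=(3, 10)), rng.normal(size=10)
print(mc_noising_regularizer(theta, g))
```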
38 Computing the regularizer (4) requires summing over all possible noised feature vectors, which can imply exponential effort in the number of features. [sent-86, score-0.546]
39 (2013), we approximate R(θ, x) using a second-order Taylor expansion, which will allow us to work with only means and covariances of the noised features. [sent-89, score-0.126]
40 We take a quadratic approximation of the log-partition function A(·) of the noised score vector s̃ around the unnoised score vector s: A(s̃) ≈ A(s) + ∇A(s)ᵀ(s̃ − s) + ½(s̃ − s)ᵀ∇²A(s)(s̃ − s). (5) [sent-90, score-0.291]
41 Plugging (5) into (4), we obtain a new regularizer Rq(θ, x), which we will use as an approximation to R(θ, x): Rq(θ, x) = ½ E[(s̃ − s)ᵀ∇²A(s)(s̃ − s)] (6) = ½ tr(∇²A(s) Cov(s̃)). (7) [sent-91, score-0.45]
42 This expression still has two sources of potential intractability, a sum over an exponential number of noised score vectors and a sum over the |Y| components of s. [sent-92, score-0.234]
43 The regularizer Rq(θ, x) involves the product of two variance terms, the first of which is non-convex in θ and the second of which is quadratic in θ. [sent-94, score-0.465]
44 Note that to reduce the regularization, we will favor models that (i) predict confidently and (ii) have stable scores in the presence of feature noise. [sent-95, score-0.076]
45 For multiclass classification, we can explicitly sum over each y ∈ Y to compute the regularizer, but this will be intractable for structured prediction. [sent-96, score-0.123]
46 To specialize to multiclass classification for the moment, let us assume that we have a separate weight vector for each output y applied to the same feature vector g(x); that is, the score sy = θy · g(x). [sent-97, score-0.245]
47 Noising schemes We now give some examples of possible noise schemes for generating f̃(y, x) given the original features f(y, x). [sent-101, score-0.156]
48 This distribution affects the regularization through the variance term Var[ s˜y]. [sent-102, score-0.165]
49 In this case, the contribution to the regularizer from noising is Var[s̃y] = Σj σ²θyj². [sent-105, score-0.859]
50 Note that under our second-order approximation Rq(θ, x), the multiplicative Gaussian and dropout schemes are equivalent, but they differ under the original regularizer R(θ, x). [sent-115, score-0.783]
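Putting the multiclass pieces together, the following is a minimal closed-form sketch of the quadratic dropout regularizer as we read Eq. (8), with dropout variance Var[s̃y] = δ/(1−δ) Σj θyj² gj²; the function name, dropout rate, and inputs are illustrative.

```python
import numpy as np

def quadratic_dropout_regularizer(theta, g, delta=0.3):
    """Closed-form second-order regularizer for multiclass logistic
    regression with score s_y = theta[y] . g, following our reading of
    Eq. (8):
        Rq(theta, x) = 0.5 * sum_y p_y (1 - p_y) Var[s_y_noised],
        Var[s_y_noised] = delta / (1 - delta) * sum_j theta[y, j]^2 g[j]^2.
    """
    s = theta @ g
    p = np.exp(s - s.max())
    p /= p.sum()
    var_s = (delta / (1.0 - delta)) * ((theta ** 2) @ (g ** 2))
    return 0.5 * float(np.sum(p * (1.0 - p) * var_s))

# The penalty shrinks when predictions are confident (p_y(1 - p_y) -> 0)
# or when the scores are insensitive to feature noise (Var -> 0).
rng = np.random.default_rng(1)
theta, g = rng.normal(size=(3, 10)), rng.normal(size=10)
print(quadratic_dropout_regularizer(theta, g))
```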
51 , 2013) is that the noising regularizer R (8), while involving a sum over examples, is independent of the output y. [sent-118, score-0.901]
52 , un}, then we can define a regularizer that is a linear combination of the regularizer estimated on both datasets, with α tuning the tradeoff between the two: R∗(θ, D, Dunlabeled) def= n/(n + αm) ( Σx∈D R(θ, x) + α Σu∈Dunlabeled R(θ, u) ). (11) [sent-126, score-0.75]
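Because the per-example term depends only on the input, unlabeled or test inputs can be pooled into the same penalty. Below is a minimal sketch under our reconstruction of Eq. (11) above; the weighting, helper names, and dropout rate are assumptions of this sketch.

```python
import numpy as np

def combined_noising_regularizer(theta, labeled_feats, unlabeled_feats,
                                 alpha=1.0, delta=0.3):
    """Pool the label-independent regularizer over labeled and unlabeled
    inputs, weighted as in our reconstruction of Eq. (11):
        R* = n / (n + alpha * m) * (sum_D R(theta, x) + alpha * sum_Du R(theta, u)).
    labeled_feats / unlabeled_feats are lists of feature vectors; the
    per-example term is the quadratic dropout regularizer sketched above.
    """
    def per_example(g):
        s = theta @ g
        p = np.exp(s - s.max())
        p /= p.sum()
        var_s = (delta / (1.0 - delta)) * ((theta ** 2) @ (g ** 2))
        return 0.5 * float(np.sum(p * (1.0 - p) * var_s))

    n, m = len(labeled_feats), len(unlabeled_feats)
    total = sum(per_example(g) for g in labeled_feats)
    total += alpha * sum(per_example(g) for g in unlabeled_feats)
    return n / (n + alpha * m) * total

# Usage: 4 labeled and 6 unlabeled (e.g. test) inputs with made-up features.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 10))
labeled = [rng.normal(size=10) for _ in range(4)]
unlabeled = [rng.normal(size=10) for _ in range(6)]
print(combined_noising_regularizer(theta, labeled, unlabeled))
```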
53 3 Feature Noising in Linear-Chain CRFs So far, we have developed a regularizer that works for all log-linear models, but—in its current form— is only practical for multiclass classification. [sent-129, score-0.456]
54 We now exploit the decomposable structure in CRFs to define a new noising scheme which does not require us to explicitly sum over all possible outputs y ∈ Y. [sent-130, score-0.526]
55 The key idea will be to noise each local feature vector (which implicitly affects many y) rather than noise each y independently. [sent-131, score-0.088]
56 In linear chain CRFs, the feature vector f decomposes into a sum of local feature vectors gt: f(y, x) = Σ_{t=1}^{T} gt(yt−1, yt, x), (12) where gt(a, b, x) is defined on a pair of consecutive tags a, b for positions t − 1 and t. [sent-136, score-0.268]
57 Rather than working with a score sy for each y ∈ Y, we define a collection of local scores s = {sa,b,t}, for each tag pair (a, b) and position t = 1, . [sent-137, score-0.083]
58 We consider noising schemes which independently set g̃t(a, b, x) for each a, b, t. [sent-141, score-0.484]
59 The first derivative yields the edge marginals under the model, µa,b,t = pθ(yt−1 = a, yt = b | x), and the diagonal elements of the Hessian ∇²A(s) yield the marginal variances. [sent-144, score-0.401]
60 Again, minimizing the regularizer means making confident predictions and having stable scores under feature noise. [sent-146, score-0.42]
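A minimal sketch of this edge-marginal form of the regularizer, as we read Eq. (14): ½ Σa,b,t µa,b,t(1 − µa,b,t) Var[s̃a,b,t]. It assumes the edge marginals and a dense local feature tensor are given; in practice the features are sparse and the marginals come from forward-backward.

```python
import numpy as np

def crf_quadratic_regularizer(mu, local_feats, theta, delta=0.3):
    """Quadratic dropout regularizer for a linear-chain CRF, assuming the
    edge marginals mu[t, a, b] = p(y_{t-1}=a, y_t=b | x) are already
    available (e.g. from forward-backward) and local_feats[t, a, b] holds
    the local feature vector g_t(a, b, x). Dropout on each local feature
    vector gives Var[s_{a,b,t}] = delta/(1-delta) * sum_j (theta_j * g_j)^2,
    and the regularizer is 0.5 * sum_{a,b,t} mu (1 - mu) Var[s_{a,b,t}].
    Dense tensors are used only to keep the sketch short.
    """
    var_s = (delta / (1.0 - delta)) * np.einsum(
        "tabj,j->tab", local_feats ** 2, theta ** 2)
    return 0.5 * float(np.sum(mu * (1.0 - mu) * var_s))

# Made-up example: T=4 edges, K=3 tags, d=6 binary features. The Dirichlet
# draws give valid per-edge distributions; a real CRF's marginals would be
# globally consistent and come from forward-backward.
rng = np.random.default_rng(0)
T, K, d = 4, 3, 6
mu = rng.dirichlet(np.ones(K * K), size=T).reshape(T, K, K)
local_feats = rng.integers(0, 2, size=(T, K, K, d)).astype(float)
theta = rng.normal(size=d)
print(crf_quadratic_regularizer(mu, local_feats, theta))
```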
61 Computing partial derivatives So far, we have defined the regularizer Rq(θ, x) based on feature noising. [sent-147, score-0.45]
62 Using the fact that ∇µa,b,t = µa,b,t∇ log µa,b,t and the fact that Var[s̃a,b,t] is a quadratic function in θ, we can simply apply the product rule to derive the final gradient ∇Rq(θ, x). [sent-151, score-0.244]
63 First, it will be convenient to define the partial sum of the local feature vector from positions i to j as follows: Gi:j = Σ_{t=i}^{j} gt(yt−1, yt, x). (16) [sent-156, score-0.087]
64 Consider the task of computing the feature expectation Epθ(y|yt−1=a,yt=b)[f(y, x)] for a fixed (a, b, t). [sent-157, score-0.086]
65 Running the resulting dynamic program takes O(K²Tq) time and requires O(KTq) storage, where K is the number of tags, T is the sequence length, and q is the number of active features. [sent-163, score-0.149]
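For reference, the standard O(K²T) forward-backward recursion that produces the edge marginals µa,b,t is sketched below; the paper's dynamic program additionally propagates conditional feature expectations on top of recursions of this kind, which is where the extra factor of q enters. The dense potential layout and the uniform treatment of the first tag are simplifying assumptions of this sketch.

```python
import numpy as np

def edge_marginals(log_potentials):
    """Forward-backward for a linear chain whose potentials are the local
    edge scores: log_potentials[t, a, b] scores tag pair (a, b) at the
    t-th edge. Returns mu[t, a, b] = p(y_t = a, y_{t+1} = b | x). The
    first tag is marginalized uniformly here; a dedicated start state is
    a straightforward variant.
    """
    E, K, _ = log_potentials.shape           # E edges, K tags
    alpha = np.zeros((E + 1, K))              # forward log-messages
    beta = np.zeros((E + 1, K))               # backward log-messages
    for t in range(E):
        scores = alpha[t][:, None] + log_potentials[t]
        m = scores.max(axis=0)
        alpha[t + 1] = m + np.log(np.exp(scores - m).sum(axis=0))
    for t in range(E - 1, -1, -1):
        scores = log_potentials[t] + beta[t + 1][None, :]
        m = scores.max(axis=1)
        beta[t] = m + np.log(np.exp(scores - m[:, None]).sum(axis=1))
    logZ = np.logaddexp.reduce(alpha[E])
    return np.exp(alpha[:-1, :, None] + log_potentials + beta[1:, None, :] - logZ)

# Made-up potentials for a sequence with 5 positions (4 edges) and 3 tags.
rng = np.random.default_rng(0)
mu = edge_marginals(rng.normal(size=(4, 3, 3)))
print(mu.sum(axis=(1, 2)))   # each edge's marginals sum to 1
```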
66 4 Fast Gradient Computations In this section, we provide two ways to further improve the efficiency of the gradient calculation based on ignoring long-range interactions and based on exploiting feature sparsity. [sent-165, score-0.084]
67 1 Exploiting Feature Sparsity and Co-occurrence In each forward-backward pass over a training example, we need to compute the conditional expectations for all features active in that example. [sent-167, score-0.116]
68 Naively applying the dynamic program in Section 3 is O(K2T) for each active feature. [sent-168, score-0.119]
69 As a result, we can collapse such a group of features into a single feature as a preprocessing step to avoid computing identical expectations for each of the features. [sent-177, score-0.084]
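A minimal sketch of that preprocessing step: features that fire on exactly the same (position, tag-pair) slots of an example receive identical conditional expectations, so one representative per group suffices in the dynamic program. The slot-set signature used as the grouping key is an illustrative choice, not the paper's exact implementation.

```python
from collections import defaultdict

def collapse_cooccurring_features(active_slots_by_feature):
    """Group features whose activation pattern within one example is
    identical: if two features fire on exactly the same (t, a, b) slots,
    their conditional expectations coincide, so the dynamic program only
    needs to be run once per group.
    """
    groups = defaultdict(list)
    for feat_id, slots in active_slots_by_feature.items():
        groups[frozenset(slots)].append(feat_id)
    return list(groups.values())

# Made-up example: two lexical features that always fire together.
pattern = {
    "w=bank":       {(3, 1, 2), (7, 0, 2)},
    "w.lower=bank": {(3, 1, 2), (7, 0, 2)},
    "shape=Xxxx":   {(5, 2, 2)},
}
print(collapse_cooccurring_features(pattern))
```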
70 In our case, the dynamic program from Section 3 together with the trick described above ran in a manageable amount of time. [sent-182, score-0.094]
71 only has to consider the sum of the local feature vectors from i−r to i+r, which is captured by Gi−r:i+r: Epθ(y|yt−1=a,yt=b,x)[f(y, x)] − Epθ(y|x)[f(y, x)] ≈ Epθ(y|yt−1=a,yt=b,x)[Gt−r:t+r] − Epθ(y|x)[Gt−r:t+r]. [sent-188, score-0.111]
72 CoNLL has a development set of size 51578, which we used to tune regularization parameters. [sent-200, score-0.165]
73 1 Multiclass Classification We begin by testing our regularizer in the simple case of classification where Y = {1, 2, . [sent-203, score-0.411]
74 , K}, and test the noising regularizer in both the fully supervised setting as well as the transductive learning setting. [sent-210, score-0.518]
75 In the transductive learning setting, the learner is allowed to inspect the test features at train time (without the labels). [sent-211, score-0.143]
76 Table 2: Classification performance and transductive learning results on some standard datasets. [sent-236, score-0.143]
77 None: use no regularization, Drop: quadratic approximation to the dropout noise (8), +Test: also use the test set to estimate the noising regularizer (11). [sent-237, score-1.411]
78 1 Semi-supervised Learning with Feature Noising In the transductive setting, we used test data (without labels) to learn a better regularizer. [sent-240, score-0.143]
79 In most cases, our semi-supervised accuracies are lower than the transductive accuracies given in Table 2; this is normal in our setup, because we used less labeled data to train the semi-supervised classifier than the transductive one. [sent-245, score-0.286]
80 2 The Second-Order Approximation The results reported above all rely on the approximate dropout regularizer (8) that is based on a second-order Taylor expansion. [sent-248, score-0.674]
81 To test the validity of this approximation we compare it to the Gaussian method developed by Wang and Manning (2013) on a two-class classification task. [sent-249, score-0.111]
82 The CoNLL results look somewhat surprising, as the semi-supervised results are better than the transductive ones. [sent-254, score-0.197]
83 Effect of λ on the test set performance: plotted is accuracy with logistic regression as a function of λ for the L2 regularizer, Gaussian dropout (Wang and Manning, 2013) + additional L2, and quadratic dropout (8) + L2 described in this paper. [sent-266, score-0.715]
84 The default noising regularizer is quite good, and additional L2 does not help. [sent-267, score-0.859]
85 Over a broad range of λ values, we find that dropout plus L2 regularization performs far better than using just L2 regularization for any value of λ. [sent-270, score-0.629]
86 We see that Gaussian dropout appears to perform slightly better than the quadratic approximation discussed in this paper. [sent-271, score-0.464]
87 However, our quadratic approximation extends easily to the multiclass case and to structured prediction in general, while Gaussian dropout does not. [sent-272, score-0.611]
88 Thus, it appears that our approximation presents a reasonable trade-off between computational efficiency and prediction accuracy. [sent-273, score-0.11]
89 2 CRF Experiments We evaluate the quadratic dropout regularizer in linear-chain CRFs on two sequence tagging tasks: the CoNLL 2003 NER shared task (Tjong Kim Sang and De Meulder, 2003) and the SANCL 2012 POS tagging task (Petrov and McDonald, 2012) . [sent-275, score-0.858]
90 None: no regularization, Drop: quadratic dropout regularization (14) described in this paper. [sent-296, score-0.554]
91 We obtained a small but consistent improvement using the quadratic dropout regularizer in (14) over the L2-regularized CRFs baseline. [sent-298, score-0.764]
92 This is also interesting because here is a situation where the features are extremely sparse, L2 regularization gave no improvement, and where regularization overall matters less. [sent-301, score-0.33]
93 6 Conclusion We have presented a new regularizer for learning log-linear models such as multiclass logistic regression and conditional random fields. [sent-302, score-0.538]
94 This regularizer is based on a second-order approximation of feature noising schemes, and attempts to favor models that predict confidently and are robust to noise in the data. [sent-303, score-1.098]
95 In order to apply our method to CRFs, we tackle the key challenge of dealing with feature correlations that arise in the structured prediction setting in several ways. [sent-304, score-0.111]
96 In addition, we show that the regularizer can be applied naturally in the semisupervised setting. [sent-305, score-0.429]
98 Table 6: CoNLL NER results broken down by tags and by precision, recall, and Fβ=1. Top: development set. [sent-345, score-0.196]
99 Investigating how to better optimize this non-convex regularizer online and convincingly scale it to the semi-supervised setting seem to be promising future directions. [sent-347, score-0.429]
100 Adding noise to the input of a model trained with a regularized objective. [sent-424, score-0.088]
wordName wordTfidf (topN-words)
[('noising', 0.484), ('regularizer', 0.375), ('yt', 0.361), ('dropout', 0.299), ('regularization', 0.165), ('wager', 0.144), ('transductive', 0.143), ('rq', 0.142), ('var', 0.14), ('noised', 0.126), ('sancl', 0.125), ('quadratic', 0.09), ('fta', 0.09), ('ep', 0.089), ('noise', 0.088), ('sy', 0.083), ('gt', 0.081), ('multiclass', 0.081), ('corrupted', 0.078), ('crfs', 0.077), ('approximation', 0.075), ('rifai', 0.072), ('gaussian', 0.066), ('arxiv', 0.057), ('gi', 0.057), ('semisupervised', 0.054), ('cov', 0.054), ('sida', 0.054), ('yann', 0.054), ('bishop', 0.053), ('fake', 0.053), ('active', 0.05), ('unlabeled', 0.047), ('btb', 0.047), ('crf', 0.046), ('ner', 0.046), ('xt', 0.045), ('feature', 0.045), ('shape', 0.044), ('maaten', 0.043), ('clique', 0.043), ('tjong', 0.043), ('sum', 0.042), ('expectation', 0.041), ('sang', 0.04), ('preprint', 0.04), ('marginals', 0.04), ('expectations', 0.039), ('gradient', 0.039), ('conll', 0.038), ('classification', 0.036), ('bryll', 0.036), ('dunlabeled', 0.036), ('grandvalet', 0.036), ('salah', 0.036), ('simard', 0.036), ('ytf', 0.036), ('pj', 0.036), ('program', 0.035), ('prediction', 0.035), ('schemes', 0.034), ('dynamic', 0.034), ('bagging', 0.034), ('tagging', 0.032), ('meulder', 0.031), ('tangent', 0.031), ('confidently', 0.031), ('efron', 0.031), ('gj', 0.031), ('ularization', 0.031), ('structured', 0.031), ('chain', 0.031), ('wang', 0.03), ('yi', 0.03), ('sequence', 0.03), ('derivatives', 0.03), ('yoshua', 0.029), ('recurrence', 0.028), ('jiao', 0.028), ('uncertainty', 0.028), ('hinton', 0.028), ('logistic', 0.028), ('regression', 0.027), ('conditional', 0.027), ('py', 0.027), ('mengqiu', 0.027), ('overfitting', 0.026), ('der', 0.026), ('manning', 0.025), ('sutton', 0.025), ('trick', 0.025), ('log', 0.025), ('petrov', 0.025), ('vectors', 0.024), ('preventing', 0.024), ('removal', 0.024), ('artificially', 0.024), ('bengio', 0.024), ('mann', 0.023), ('preserves', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.000001 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
2 0.12912703 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
Author: Byron C. Wallace ; Thomas A Trikalinos ; M. Barton Laws ; Ira B. Wilson ; Eugene Charniak
Abstract: We develop a novel generative model of conversation that jointly captures both the topical content and the speech act type associated with each utterance. Our model expresses both token emission and state transition probabilities as log-linear functions of separate components corresponding to topics and speech acts (and their interactions). We apply this model to a dataset comprising annotated patient-physician visits and show that the proposed joint approach outperforms a baseline univariate model.
3 0.091963679 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
4 0.086436622 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
Author: Kuzman Ganchev ; Dipanjan Das
Abstract: We present a framework for cross-lingual transfer of sequence information from a resource-rich source language to a resourceimpoverished target language that incorporates soft constraints via posterior regularization. To this end, we use automatically word aligned bitext between the source and target language pair, and learn a discriminative conditional random field model on the target side. Our posterior regularization constraints are derived from simple intuitions about the task at hand and from cross-lingual alignment information. We show improvements over strong baselines for two tasks: part-of-speech tagging and namedentity segmentation.
5 0.063009761 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
Author: Thomas Mueller ; Helmut Schmid ; Hinrich Schutze
Abstract: Training higher-order conditional random fields is prohibitive for huge tag sets. We present an approximated conditional random field using coarse-to-fine decoding and early updating. We show that our implementation yields fast and accurate morphological taggers across six languages with different morphological properties and that across languages higher-order models give significant improvements over 1st-order models.
6 0.062128063 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
7 0.050245672 40 emnlp-2013-Breaking Out of Local Optima with Count Transforms and Model Recombination: A Study in Grammar Induction
8 0.050105516 120 emnlp-2013-Learning Latent Word Representations for Domain Adaptation using Supervised Word Clustering
9 0.049524464 82 emnlp-2013-Exploring Representations from Unlabeled Data with Co-training for Chinese Word Segmentation
10 0.048614446 169 emnlp-2013-Semi-Supervised Representation Learning for Cross-Lingual Text Classification
11 0.045122523 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
12 0.043520316 9 emnlp-2013-A Log-Linear Model for Unsupervised Text Normalization
13 0.043213986 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
14 0.040504254 184 emnlp-2013-This Text Has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics
15 0.039883539 66 emnlp-2013-Dynamic Feature Selection for Dependency Parsing
16 0.039457887 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
17 0.039361689 119 emnlp-2013-Learning Distributions over Logical Forms for Referring Expression Generation
18 0.039183781 113 emnlp-2013-Joint Language and Translation Modeling with Recurrent Neural Networks
19 0.03870935 139 emnlp-2013-Noise-Aware Character Alignment for Bootstrapping Statistical Machine Transliteration from Bilingual Corpora
20 0.036371995 158 emnlp-2013-Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
topicId topicWeight
[(0, -0.14), (1, -0.009), (2, -0.026), (3, -0.031), (4, -0.038), (5, 0.018), (6, 0.051), (7, 0.035), (8, -0.066), (9, 0.035), (10, -0.018), (11, -0.054), (12, -0.031), (13, 0.02), (14, 0.041), (15, -0.051), (16, 0.009), (17, 0.039), (18, -0.062), (19, 0.025), (20, 0.097), (21, 0.066), (22, 0.052), (23, 0.136), (24, 0.04), (25, 0.106), (26, -0.109), (27, -0.166), (28, 0.04), (29, 0.008), (30, 0.005), (31, 0.058), (32, -0.077), (33, 0.042), (34, 0.031), (35, -0.024), (36, 0.107), (37, 0.052), (38, -0.051), (39, 0.101), (40, 0.016), (41, -0.012), (42, 0.005), (43, -0.05), (44, -0.071), (45, -0.199), (46, 0.188), (47, 0.051), (48, -0.176), (49, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.91004437 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
2 0.62293857 6 emnlp-2013-A Generative Joint, Additive, Sequential Model of Topics and Speech Acts in Patient-Doctor Communication
Author: Byron C. Wallace ; Thomas A Trikalinos ; M. Barton Laws ; Ira B. Wilson ; Eugene Charniak
Abstract: We develop a novel generative model of conversation that jointly captures both the topical content and the speech act type associated with each utterance. Our model expresses both token emission and state transition probabilities as log-linear functions of separate components corresponding to topics and speech acts (and their interactions). We apply this model to a dataset comprising annotated patient-physician visits and show that the proposed joint approach outperforms a baseline univariate model.
3 0.4824484 159 emnlp-2013-Regularized Minimum Error Rate Training
Author: Michel Galley ; Chris Quirk ; Colin Cherry ; Kristina Toutanova
Abstract: Minimum Error Rate Training (MERT) remains one of the preferred methods for tuning linear parameters in machine translation systems, yet it faces significant issues. First, MERT is an unregularized learner and is therefore prone to overfitting. Second, it is commonly used on a noisy, non-convex loss function that becomes more difficult to optimize as the number of parameters increases. To address these issues, we study the addition of a regularization term to the MERT objective function. Since standard regularizers such as ‘2 are inapplicable to MERT due to the scale invariance of its objective function, we turn to two regularizers—‘0 and a modification of‘2— and present methods for efficiently integrating them during search. To improve search in large parameter spaces, we also present a new direction finding algorithm that uses the gradient of expected BLEU to orient MERT’s exact line searches. Experiments with up to 3600 features show that these extensions of MERT yield results comparable to PRO, a learner often used with large feature sets.
Author: William Yang Wang ; Edward Lin ; John Kominek
Abstract: We propose a Laplacian structured sparsity model to study computational branding analytics. To do this, we collected customer reviews from Starbucks, Dunkin’ Donuts, and other coffee shops across 38 major cities in the Midwest and Northeastern regions of USA. We study the brand related language use through these reviews, with focuses on the brand satisfaction and gender factors. In particular, we perform three tasks: automatic brand identification from raw text, joint brand-satisfaction prediction, and joint brandgender-satisfaction prediction. This work extends previous studies in text classification by incorporating the dependency and interaction among local features in the form of structured sparsity in a log-linear model. Our quantitative evaluation shows that our approach which combines the advantages of graphical modeling and sparsity modeling techniques significantly outperforms various standard and stateof-the-art text classification algorithms. In addition, qualitative analysis of our model reveals important features of the language uses associated with the specific brands.
5 0.46014473 91 emnlp-2013-Grounding Strategic Conversation: Using Negotiation Dialogues to Predict Trades in a Win-Lose Game
Author: Anais Cadilhac ; Nicholas Asher ; Farah Benamara ; Alex Lascarides
Abstract: This paper describes a method that predicts which trades players execute during a winlose game. Our method uses data collected from chat negotiations of the game The Settlers of Catan and exploits the conversation to construct dynamically a partial model of each player’s preferences. This in turn yields equilibrium trading moves via principles from game theory. We compare our method against four baselines and show that tracking how preferences evolve through the dialogue and reasoning about equilibrium moves are both crucial to success.
6 0.44949269 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
7 0.41957292 70 emnlp-2013-Efficient Higher-Order CRFs for Morphological Tagging
8 0.40575558 2 emnlp-2013-A Convex Alternative to IBM Model 2
9 0.40262571 195 emnlp-2013-Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion
10 0.3426283 198 emnlp-2013-Using Soft Constraints in Joint Inference for Clinical Concept Recognition
11 0.33859214 172 emnlp-2013-Simple Customization of Recursive Neural Networks for Semantic Relation Classification
12 0.33193216 28 emnlp-2013-Automated Essay Scoring by Maximizing Human-Machine Agreement
13 0.3215397 199 emnlp-2013-Using Topic Modeling to Improve Prediction of Neuroticism and Depression in College Students
15 0.31609333 176 emnlp-2013-Structured Penalties for Log-Linear Language Models
16 0.29695779 18 emnlp-2013-A temporal model of text periodicities using Gaussian Processes
17 0.29326895 26 emnlp-2013-Assembling the Kazakh Language Corpus
18 0.28550604 190 emnlp-2013-Ubertagging: Joint Segmentation and Supertagging for English
19 0.27444655 50 emnlp-2013-Combining PCFG-LA Models with Dual Decomposition: A Case Study with Function Labels and Binarization
20 0.27218375 35 emnlp-2013-Automatically Detecting and Attributing Indirect Quotations
topicId topicWeight
[(0, 0.011), (3, 0.046), (18, 0.043), (22, 0.036), (27, 0.296), (30, 0.075), (45, 0.016), (47, 0.016), (50, 0.026), (51, 0.15), (66, 0.028), (71, 0.026), (75, 0.032), (77, 0.021), (90, 0.011), (95, 0.019), (96, 0.035), (97, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.73709297 86 emnlp-2013-Feature Noising for Log-Linear Structured Prediction
Author: Sida Wang ; Mengqiu Wang ; Stefan Wager ; Percy Liang ; Christopher D. Manning
Abstract: NLP models have many and sparse features, and regularization is key for balancing model overfitting versus underfitting. A recently repopularized form of regularization is to generate fake training data by repeatedly adding noise to real data. We reinterpret this noising as an explicit regularizer, and approximate it with a second-order formula that can be used during training without actually generating fake data. We show how to apply this method to structured prediction using multinomial logistic regression and linear-chain CRFs. We tackle the key challenge of developing a dynamic program to compute the gradient of the regularizer efficiently. The regularizer is a sum over inputs, so we can estimate it more accurately via a semi-supervised or transductive extension. Applied to text classification and NER, our method provides a > 1% absolute performance gain over use of standard L2 regularization.
2 0.68489337 187 emnlp-2013-Translation with Source Constituency and Dependency Trees
Author: Fandong Meng ; Jun Xie ; Linfeng Song ; Yajuan Lu ; Qun Liu
Abstract: We present a novel translation model, which simultaneously exploits the constituency and dependency trees on the source side, to combine the advantages of two types of trees. We take head-dependents relations of dependency trees as backbone and incorporate phrasal nodes of constituency trees as the source side of our translation rules, and the target side as strings. Our rules hold the property of long distance reorderings and the compatibility with phrases. Large-scale experimental results show that our model achieves significantly improvements over the constituency-to-string (+2.45 BLEU on average) and dependencyto-string (+0.91 BLEU on average) models, which only employ single type of trees, and significantly outperforms the state-of-theart hierarchical phrase-based model (+1.12 BLEU on average), on three Chinese-English NIST test sets.
3 0.53084803 48 emnlp-2013-Collective Personal Profile Summarization with Social Networks
Author: Zhongqing Wang ; Shoushan LI ; Fang Kong ; Guodong Zhou
Abstract: Personal profile information on social media like LinkedIn.com and Facebook.com is at the core of many interesting applications, such as talent recommendation and contextual advertising. However, personal profiles usually lack organization confronted with the large amount of available information. Therefore, it is always a challenge for people to find desired information from them. In this paper, we address the task of personal profile summarization by leveraging both personal profile textual information and social networks. Here, using social networks is motivated by the intuition that, people with similar academic, business or social connections (e.g. co-major, co-university, and cocorporation) tend to have similar experience and summaries. To achieve the learning process, we propose a collective factor graph (CoFG) model to incorporate all these resources of knowledge to summarize personal profiles with local textual attribute functions and social connection factors. Extensive evaluation on a large-scale dataset from LinkedIn.com demonstrates the effectiveness of the proposed approach. 1
4 0.52990407 56 emnlp-2013-Deep Learning for Chinese Word Segmentation and POS Tagging
Author: Xiaoqing Zheng ; Hanyang Chen ; Tianyu Xu
Abstract: This study explores the feasibility of performing Chinese word segmentation (CWS) and POS tagging by deep learning. We try to avoid task-specific feature engineering, and use deep layers of neural networks to discover relevant features to the tasks. We leverage large-scale unlabeled data to improve internal representation of Chinese characters, and use these improved representations to enhance supervised word segmentation and POS tagging models. Our networks achieved close to state-of-theart performance with minimal computational cost. We also describe a perceptron-style algorithm for training the neural networks, as an alternative to maximum-likelihood method, to speed up the training process and make the learning algorithm easier to be implemented.
5 0.52634424 110 emnlp-2013-Joint Bootstrapping of Corpus Annotations and Entity Types
Author: Hrushikesh Mohapatra ; Siddhanth Jain ; Soumen Chakrabarti
Abstract: Web search can be enhanced in powerful ways if token spans in Web text are annotated with disambiguated entities from large catalogs like Freebase. Entity annotators need to be trained on sample mention snippets. Wikipedia entities and annotated pages offer high-quality labeled data for training and evaluation. Unfortunately, Wikipedia features only one-ninth the number of entities as Freebase, and these are a highly biased sample of well-connected, frequently mentioned “head” entities. To bring hope to “tail” entities, we broaden our goal to a second task: assigning types to entities in Freebase but not Wikipedia. The two tasks are synergistic: knowing the types of unfamiliar entities helps disambiguate mentions, and words in mention contexts help assign types to entities. We present TMI, a bipartite graphical model for joint type-mention inference. TMI attempts no schema integration or entity resolution, but exploits the above-mentioned synergy. In experiments involving 780,000 people in Wikipedia, 2.3 million people in Freebase, 700 million Web pages, and over 20 professional editors, TMI shows considerable annotation accuracy improvement (e.g., 70%) compared to baselines (e.g., 46%), especially for “tail” and emerging entities. We also compare with Google’s recent annotations of the same corpus with Freebase entities, and report considerable improvements within the people domain.
6 0.52440578 114 emnlp-2013-Joint Learning and Inference for Grammatical Error Correction
7 0.52406216 53 emnlp-2013-Cross-Lingual Discriminative Learning of Sequence Models with Posterior Regularization
8 0.52370578 51 emnlp-2013-Connecting Language and Knowledge Bases with Embedding Models for Relation Extraction
10 0.52177429 64 emnlp-2013-Discriminative Improvements to Distributional Sentence Similarity
11 0.52118087 69 emnlp-2013-Efficient Collective Entity Linking with Stacking
12 0.52091599 143 emnlp-2013-Open Domain Targeted Sentiment
13 0.52055883 36 emnlp-2013-Automatically Determining a Proper Length for Multi-Document Summarization: A Bayesian Nonparametric Approach
14 0.51932502 107 emnlp-2013-Interactive Machine Translation using Hierarchical Translation Models
15 0.51917118 164 emnlp-2013-Scaling Semantic Parsers with On-the-Fly Ontology Matching
16 0.51873207 47 emnlp-2013-Collective Opinion Target Extraction in Chinese Microblogs
17 0.51870495 38 emnlp-2013-Bilingual Word Embeddings for Phrase-Based Machine Translation
18 0.51863313 157 emnlp-2013-Recursive Autoencoders for ITG-Based Translation
19 0.51848638 79 emnlp-2013-Exploiting Multiple Sources for Open-Domain Hypernym Discovery
20 0.51814485 65 emnlp-2013-Document Summarization via Guided Sentence Compression