acl acl2011 acl2011-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. [sent-3, score-0.203]
2 We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. [sent-5, score-0.962]
3 Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling. [sent-6, score-0.378]
4 Standard n-gram generative language models have been extended in several ways. [sent-8, score-0.142]
5 Generative factored language models (Bilmes and Kirchhoff, 2003) represent each token by multiple factors, such as part-of-speech, lemma and surface form, and capture linguistic patterns in the target language at the appropriate level of abstraction. [sent-9, score-0.636]
6 Instead of estimating likelihood, discriminative language models (Roark et al. [sent-10, score-0.29]
7 , 2007; Li and Khudanpur, 2008) directly model fluency by casting the task as a binary classification or a ranking problem. [sent-12, score-0.091]
8 We use factored features to capture linguistic patterns and discriminative learning for directly modeling fluency. [sent-14, score-0.791]
9 We define highly overlapping and correlated factored [sent-15, score-0.436]
10 features, and extend a robust learning algorithm to handle them and cope with a high rate of label noise. [sent-18, score-0.147]
11 For discriminatively learning language models, we use confidence-weighted learning (Dredze et al. [sent-19, score-0.122]
12 , 2008), an extension of the perceptron-based online learning used in previous work on discriminative language models. [sent-20, score-0.422]
13 Furthermore, we extend confidence-weighted learning with soft margin to handle the case where training data labels are noisy, as is typically the case in discriminative language modeling. [sent-21, score-0.602]
14 In Section 2, we introduce factored features for discriminative language models. [sent-23, score-0.675]
15 Section 4 describes its extension for the case where training data are noisy. [sent-25, score-0.092]
16 2 Factored features Factored features are n-gram features where each component in the n-gram can be characterized by different linguistic dimensions of words such as surface, lemma, part of speech (POS). [sent-28, score-0.24]
17 An example of a factored feature is “pick PRON up”, where PRON is the part of speech (POS) tag for pronouns. [sent-30, score-0.425]
18 Appropriately weighted, this feature can capture the fact that this pattern is often fluent in English. [sent-31, score-0.093]
19 Compared to traditional surface n-gram features like “pick her up”, “pick me up” etc. [sent-32, score-0.124]
20 , the feature “pick PRON up” generalizes the pattern better. [sent-33, score-0.056]
21 On the other hand, this feature is more precise [sent-34, score-0.056]
22 than the corresponding POS n-gram feature “VERB PRON PREP” since the latter also promotes undesirable patterns such as “pick PRON off” and “go PRON in”. [sent-37, score-0.09]
23 So, constructing features with components from different abstraction levels allows linguistic patterns to be captured better. [sent-38, score-0.055]
24 In this study, we use tri-gram factored features to learn a discriminative language model for English, where each token is characterized by three factors including surface, POS, and extended POS. [sent-39, score-0.841]
25 In other words, we will use all possible trigrams where each element is either a surface form, a POS, or an extended POS. [sent-41, score-0.111]
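To make that feature space concrete, the sketch below (Python; not from the paper, the token representation, tag values and function name are assumptions) enumerates the factored trigram features when each token carries the three factors just described, yielding 3^3 = 27 features per trigram position.

    from itertools import product

    def factored_trigram_features(tokens):
        """tokens: list of dicts with 'surface', 'pos' and 'xpos' keys (assumed input format).
        Returns every trigram feature in which each position is abstracted to one factor."""
        factor_names = ("surface", "pos", "xpos")
        features = []
        for i in range(len(tokens) - 2):
            window = tokens[i:i + 3]
            # pick one factor level independently for each of the three positions: 3^3 = 27 features
            for choice in product(factor_names, repeat=3):
                features.append(tuple(tok[name] for tok, name in zip(window, choice)))
        return features

    # Example (illustrative tags): the trigram "pick her up" yields, among its 27 features,
    # ('pick', 'PRON', 'up') and ('VERB', 'PRON', 'PREP').
    # tokens = [{'surface': 'pick', 'pos': 'VERB', 'xpos': 'VB'},
    #           {'surface': 'her',  'pos': 'PRON', 'xpos': 'PRP'},
    #           {'surface': 'up',   'pos': 'PREP', 'xpos': 'RP'}]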
26 3 Confidence-weighted Learning Online learning algorithms scale well to large datasets, and are thus well adapted to discriminative language modeling. [sent-42, score-0.324]
27 On the other hand, the perceptron and Passive Aggressive (PA) algorithms1 (Crammer et al. [sent-43, score-0.11]
28 , 2006) can be ill-suited for learning tasks where there is a long tail of rare significant features as in the case of language modeling. [sent-44, score-0.195]
29 Motivated by this, we adopt a simplified version of the CW algorithm of (Dredze et al. [sent-45, score-0.098]
30 We introduce a score, based on the number of times a feature has been observed in training, indicating how confident the algorithm is in the current estimate w_i for the weight of feature i. [sent-47, score-0.228]
31 Instead of changing all feature weights equally upon a mistake, the algorithm now changes the weights it is less confident in more aggressively. [sent-48, score-0.41]
32 At iteration t, if the algorithm mis-ranks the pair of positive and negative instances (p_t, n_t), it updates the weight vector by solving the optimization in Eq. [sent-49, score-0.218]
33 (1): 1The popular MIRA algorithm is a particular PA algorithm, suitable for the linearly-separable case. [sent-50, score-0.062]
34 $w^\top \Delta_t \ge 1$ (2), where $\Delta_t = \phi(p_t) - \phi(n_t)$, $\phi(x)$ is the vector representation of sentence x in factored feature space, and $\Lambda_t$ is a diagonal matrix with confidence scores. [sent-53, score-0.66]
35 The algorithm thus updates weights aggressively enough to correctly rank the current pair of instances (i. [sent-54, score-0.32]
36 In the special case when Λt = I this is the update of the Passive-Aggressive algorithm of (Crammer et al. [sent-59, score-0.132]
37 So, to avoid over-fitting to the current instance pair (and thus generalize better to the others), the difference between w and w_t is weighted by the confidence matrix Λ in the objective function. [sent-63, score-0.42]
38 (1), we form the corresponding Lagrangian: $L(w,\tau) = \frac{1}{2}(w - w_t)^\top \Lambda_t^{2} (w - w_t) + \tau\,(1 - w^\top \Delta_t)$ (3) where τ is the Lagrange multiplier corresponding to the constraint in Eq. [sent-65, score-0.06]
39 Setting the partial derivatives of L with respect to w to zero, and then setting the derivative of L with respect to τ to zero, we get: $\tau = \frac{1 - w_t^\top \Delta_t}{\lVert \Lambda_t^{-1} \Delta_t \rVert^{2}}$ (4) Given this, we obtain Algorithm 1 for confidence-weighted passive-aggressive learning (Figure 1). [sent-67, score-0.117]
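For completeness, here is the quadratic program being solved and a short worked derivation of Eq. (4); this is a sketch reconstructed from the Lagrangian in Eq. (3) and the constraint in Eq. (2), not copied verbatim from the paper:

    \text{(1)}\quad w_{t+1} = \arg\min_{w}\ \tfrac{1}{2}\,(w - w_t)^\top \Lambda_t^{2}\,(w - w_t)
        \quad \text{s.t.}\quad w^\top \Delta_t \ge 1.

    \frac{\partial L}{\partial w} = \Lambda_t^{2}(w - w_t) - \tau\,\Delta_t = 0
        \;\Rightarrow\; w = w_t + \tau\,\Lambda_t^{-2}\Delta_t.

    \text{Substituting into the active constraint } w^\top \Delta_t = 1:

    w_t^\top \Delta_t + \tau\,\Delta_t^\top \Lambda_t^{-2}\Delta_t = 1
        \;\Rightarrow\; \tau = \frac{1 - w_t^\top \Delta_t}{\Delta_t^\top \Lambda_t^{-2}\Delta_t}
        = \frac{1 - w_t^\top \Delta_t}{\lVert \Lambda_t^{-1}\Delta_t \rVert^{2}} \quad \text{(4)},

    \text{since } \Lambda_t \text{ is diagonal, and the resulting update is } w_{t+1} = w_t + \tau\,\Lambda_t^{-2}\Delta_t.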
40 In the algorithm, P_i and N_i are sets of fluent and non-fluent sentences that can be contrasted, e.g. [sent-68, score-0.103]
41 P_i is a set of fluent translations and N_i is a set of non-fluent translations of the same source sentence s_i. [sent-70, score-0.205]
42 In our work, we set $\Lambda_{ii}$ to the logarithm of the number of times the algorithm has seen feature i, but alternative choices are possible. [sent-73, score-0.118]
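Algorithm 1 (Figure 1) is referenced above but not reproduced here, so the Python sketch below gives one plausible reading of the confidence-weighted passive-aggressive update it describes. The choice Λ_ii = log(count of feature i) follows the text; the flooring of Λ_ii at 1, the update condition, the iteration order and all names are assumptions.

    import math
    from collections import defaultdict

    def cw_pa_train(pairs, feature_fn, epochs=1):
        """pairs: iterable of (fluent_sentence, nonfluent_sentence) tuples to contrast.
        feature_fn: maps a sentence to a dict {factored_feature: value}, i.e. phi(x)."""
        w = defaultdict(float)       # point estimates of the feature weights
        counts = defaultdict(int)    # how many times each feature has been seen
        for _ in range(epochs):
            for p, n in pairs:
                phi_p, phi_n = feature_fn(p), feature_fn(n)
                delta = defaultdict(float)            # Delta_t = phi(p_t) - phi(n_t)
                for f, v in phi_p.items():
                    delta[f] += v
                for f, v in phi_n.items():
                    delta[f] -= v
                for f in delta:
                    counts[f] += 1
                # confidence Lambda_ii = log(times feature i has been seen), floored at 1 (assumption)
                lam = {f: max(1.0, math.log(counts[f] + 1)) for f in delta}
                margin = sum(w[f] * v for f, v in delta.items())
                if margin < 1.0:                      # mis-ranked pair (or inside the margin)
                    denom = sum((v / lam[f]) ** 2 for f, v in delta.items())  # ||Lambda^-1 Delta||^2
                    if denom > 0.0:
                        tau = (1.0 - margin) / denom  # Eq. (4), hard margin
                        for f, v in delta.items():
                            w[f] += tau * v / (lam[f] ** 2)   # w <- w_t + tau * Lambda^-2 * Delta
        return w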
43 4 Extension to soft margin In many practical situations, training data is noisy. [sent-74, score-0.266]
44 This is particularly true for language modeling, where even human experts will argue about whether a given sentence is fluent or not. [sent-75, score-0.103]
45 Instead, collecting fluency judgments is often done in a less expensive and thus less reliable manner. [sent-77, score-0.057]
46 One way is to rank translations in n-best lists by NIST or BLEU scores, then take the top ones as fluent instances and bottom ones as non-fluent instances. [sent-78, score-0.253]
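A minimal sketch of that labeling heuristic follows (the function name and the cutoff k are illustrative assumptions, not values stated in the text):

    def label_nbest(nbest, sentence_score, k=5):
        """nbest: candidate translations of one source sentence.
        sentence_score: sentence-level NIST or BLEU against the reference(s).
        Returns (P_i, N_i): pseudo-fluent and pseudo-non-fluent training instances."""
        ranked = sorted(nbest, key=sentence_score, reverse=True)
        return ranked[:k], ranked[-k:]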
47 Therefore, in our setting it is crucial to be robust to noise in the training labels. [sent-81, score-0.088]
48 The update rule derived in the previous section always forces the new weights to satisfy the constraint (Corrective updates): mislabeled training instances could make feature weights change erratically. [sent-82, score-0.397]
49 To increase robustness to noise, we propose a soft-margin variant of confidence-weighted learning. [sent-83, score-0.139]
50 Solving the optimization problem, we obtain, for the Lagrange multiplier: $\tau = \frac{1 - w_t^\top \Delta_t}{\Delta_t^\top \Lambda_t^{-2} \Delta_t + \frac{1}{2C}}$ (7) Thus, the training algorithm with soft margins is the same as Algorithm 1, but using Eq. [sent-87, score-0.139]
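Relative to the hard-margin sketch above, the soft-margin extension only changes how the multiplier is computed, adding the 1/(2C) term of Eq. (7) to the denominator (the trade-off parameter C is taken from the formula; the helper function itself is an illustrative assumption):

    def tau_soft_margin(margin, delta, lam, C=1.0):
        """Eq. (7): tau = (1 - w_t . Delta_t) / (Delta_t^T Lambda_t^-2 Delta_t + 1/(2C))."""
        denom = sum((v / lam[f]) ** 2 for f, v in delta.items()) + 1.0 / (2.0 * C)
        return (1.0 - margin) / denom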
51 We first measured the effectiveness of the algorithms in deciding, given a pair of candidate translations for the same source sentence, whether the first candidate is more fluent than the second. [sent-90, score-0.188]
52 In a second experiment we used the score provided by the trained DLM as an additional feature in an n-best list reranking task and compared algorithms in terms of impact on NIST and BLEU. [sent-91, score-0.09]
53 , 2005), including a trigram generative language model with Kneser-Ney smoothing. [sent-95, score-0.061]
54 We then obtain training data for the discriminative language model as follows. [sent-96, score-0.285]
55 Using this dataset, we trained discriminative language models by standard perceptron, confidence-weighted learning and confidence-weighted learning with soft margin. [sent-109, score-0.585]
56 We then trained the weights of a re-ranker over eight features (seven from the baseline Matrax plus one from the DLM) using a simple structured perceptron algorithm on the development set. [sent-110, score-0.32]
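To illustrate how the DLM score enters re-ranking, here is a sketch under assumed data structures (not the authors' code): each candidate is scored with the seven Matrax features plus the DLM score under the learned re-ranker weights, and the top-scoring candidate is returned.

    def rerank(nbest, reranker_weights, matrax_features, dlm_score):
        """nbest: candidate translations; matrax_features(c) returns the 7 baseline features;
        dlm_score(c) is the trained discriminative LM score; reranker_weights has 8 entries."""
        def score(candidate):
            feats = list(matrax_features(candidate)) + [dlm_score(candidate)]
            return sum(w * f for w, f in zip(reranker_weights, feats))
        return max(nbest, key=score)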
57 For testing, we used the same trained Matrax model to generate an n-best list of size 1,000 for each source sentence. [sent-111, score-0.048]
58 The score is used with seven standard Matrax features for re-ranking. [sent-113, score-0.108]
59 Finally, we measure the quality of the translations re-ranked to the top. [sent-114, score-0.051]
60 (2001)) on the target side of the training corpus used for creating the phrase-table, and extended the phrase-table format so as to record, for each token, all its factors. [sent-116, score-0.076]
61 2 Results In the first experiment, we measure the quality of the re-ranked n-best lists by classification error rate. [sent-118, score-0.048]
62 The error rate is computed as the fraction of pairs from a test set that are ranked incorrectly according to their fluency scores (approximated here by the NIST score). [sent-119, score-0.057]
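For concreteness, a sketch of that pairwise evaluation (tie handling and the exact comparison direction are assumptions):

    def pairwise_error_rate(pairs, model_score, fluency_score):
        """pairs: list of (a, b) candidate pairs drawn from the same n-best list.
        A pair counts as an error when the model orders it differently from the
        fluency score (approximated by sentence-level NIST)."""
        errors = sum(
            1 for a, b in pairs
            if (model_score(a) - model_score(b)) * (fluency_score(a) - fluency_score(b)) < 0
        )
        return errors / len(pairs)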
63 For the baseline, we use the seven default Matrax features, including a generative language model score. [sent-121, score-0.114]
64 DLM* are discriminative language models trained using, respectively, POS features only [sent-122, score-0.345]
65 [Table 3: NIST and BLEU scores for the baseline and for the baseline + DLM upon n-best list re-ranking with the proposed discriminative language models.] [sent-124, score-0.251]
66 (DLM 0) or factored features by standard perceptron (DLM 1), confidence-weighted learning (DLM 2) and confidence-weighted learning with soft margin (DLM 3). [sent-125, score-0.844]
67 All discriminative language models strongly reduce the error rate compared to the baseline (9. [sent-126, score-0.29]
68 Recall that the training set for these discriminative language models is a relatively small subset of the one used to train Matrax’s integrated generative language model. [sent-131, score-0.385]
69 Amongst the four discriminative learning algorithms, we see that factored features are slightly better than POS features, confidence-weighted learning is slightly better than perceptron, and confidence-weighted learning with soft margin is the best (9. [sent-132, score-1.024]
70 04% better than perceptron and confidence-weighted learning with hard margin). [sent-134, score-0.149]
71 Again, all three discriminative language models improve performance over the baseline. [sent-138, score-0.29]
72 Amongst the three, confidence-weighted learning with soft margin performs best. [sent-139, score-0.271]
73 6 Related Work This work is related to several existing directions: generative factored language models, discriminative language models, online passive-aggressive learning, and confidence-weighted learning. [sent-140, score-0.794]
74 Generative factored language models are proposed by (Bilmes and Kirchhoff, 2003). [sent-141, score-0.408]
75 In this work, factors are used to define alternative backoff paths in case surface-form n-grams are not observed a sufficient number of times in the training corpus. [sent-142, score-0.076]
76 Unlike ours, this model cannot simultaneously consider multiple factored features derived from the same token n-gram, and thus cannot integrate all available information sources. [sent-143, score-0.467]
77 Discriminative language models have also been studied in speech recognition and statistical machine translation (Roark et al. [sent-144, score-0.08]
78 An attempt to combine factored features and discriminative language modeling is presented in (Mahé and Cancedda, 2009). [sent-146, score-0.715]
79 Unlike us, they combine instances from multiple, generally not comparable, n-best lists when forming positive and negative instances. [sent-147, score-0.051]
80 Also, they use an SVM to train the DLM, as opposed to the proposed online algorithms. [sent-148, score-0.074]
81 , 2006) and the CW online algorithm proposed by (Dredze et al. [sent-150, score-0.136]
82 propose an online learning algorithm with soft margins to handle noise in training data. [sent-153, score-0.505]
83 However, the work does not consider the confidence associated with estimated feature weights. [sent-154, score-0.182]
84 On the other hand, the CW online algorithm in the latter does not consider the case where the training data is noisy. [sent-155, score-0.17]
85 While developed independently, our soft-margin extension is closely related to the AROW(project) algorithm of (Crammer et al. [sent-156, score-0.12]
86 The cited work models classifiers as non-correlated Gaussian distributions over weights, while our approach uses point estimates for weights coupled with confidence scores. [sent-158, score-0.372]
87 Despite the different conceptual modeling, though, in practice the algorithms are similar, with point estimates playing the same role as the mean vector, and our (squared) confidence score matrix the same role as the precision (inverse covariance) matrix. [sent-159, score-0.262]
88 Unlike in the cited work, however, in our proposal confidence scores are also updated upon correct classification of training examples, and not only on mistakes. [sent-160, score-0.27]
89 The rationale for this is that correctly classifying an example could also increase the confidence in the current model. [sent-161, score-0.126]
90 Thus, the update formulas are also different compared to the work cited above. [sent-162, score-0.145]
91 7 Conclusions We proposed a novel approach to discriminative language models. [sent-163, score-0.251]
92 First, we introduced the idea of using factored features in the discriminative language modeling framework. [sent-164, score-0.715]
93 Factored features allow the language model to capture linguistic patterns at multiple levels of abstraction. [sent-165, score-0.092]
94 Moreover, the discriminative framework is appropriate for handling highly overlapping features, as is the case with factored features. [sent-166, score-0.653]
95 While we did not experiment with this, a natural extension consists in using all n-grams up to a certain order, thus providing back-off features and enabling the use of higher-order n-grams. [sent-167, score-0.113]
96 Second, for learning factored language models discriminatively, we adopt a simple confidence-weighted algorithm, limiting the problem of poor estimation of weights for rare features. [sent-168, score-0.632]
97 Finally, we extended confidence-weighted learning with soft margins to handle the case where labels of training data are noisy. [sent-169, score-0.357]
98 This is typically the case in discriminative language modeling, where labels are obtained only indirectly. [sent-170, score-0.251]
99 Large-scale discriminative n-gram language models for statistical machine translation. [sent-199, score-0.29]
100 Discriminative language modeling with conditional random fields and the perceptron algorithm. [sent-208, score-0.15]
wordName wordTfidf (topN-words)
[('factored', 0.369), ('dlm', 0.313), ('matrax', 0.274), ('discriminative', 0.251), ('wt', 0.231), ('pron', 0.179), ('crammer', 0.178), ('cancedda', 0.156), ('nist', 0.144), ('soft', 0.139), ('confidence', 0.126), ('nisti', 0.117), ('perceptron', 0.11), ('roark', 0.105), ('fluent', 0.103), ('ni', 0.097), ('margin', 0.093), ('weights', 0.093), ('pi', 0.09), ('pick', 0.085), ('dredze', 0.084), ('koby', 0.08), ('confidenceweighted', 0.078), ('iowa', 0.078), ('mah', 0.078), ('nbesti', 0.078), ('bilmes', 0.075), ('cited', 0.075), ('online', 0.074), ('cw', 0.071), ('update', 0.07), ('surface', 0.069), ('pos', 0.067), ('simard', 0.063), ('matrix', 0.063), ('algorithm', 0.062), ('updates', 0.062), ('bleu', 0.061), ('generative', 0.061), ('saraclar', 0.06), ('xerox', 0.06), ('multiplier', 0.06), ('extension', 0.058), ('fluency', 0.057), ('lagrange', 0.057), ('margins', 0.057), ('feature', 0.056), ('rare', 0.056), ('features', 0.055), ('noise', 0.054), ('confident', 0.054), ('nips', 0.054), ('murat', 0.054), ('seven', 0.053), ('aggressively', 0.052), ('kirchhoff', 0.052), ('translations', 0.051), ('instances', 0.051), ('amongst', 0.049), ('lists', 0.048), ('pj', 0.047), ('diagonal', 0.046), ('khudanpur', 0.046), ('handle', 0.046), ('tail', 0.045), ('discriminatively', 0.044), ('token', 0.043), ('optimization', 0.043), ('extended', 0.042), ('factors', 0.042), ('translation', 0.041), ('modeling', 0.04), ('brian', 0.039), ('models', 0.039), ('estimates', 0.039), ('learning', 0.039), ('characterized', 0.039), ('lemma', 0.037), ('capture', 0.037), ('adopt', 0.036), ('dimensions', 0.036), ('gaussian', 0.036), ('tt', 0.036), ('updated', 0.035), ('training', 0.034), ('corrective', 0.034), ('salah', 0.034), ('casting', 0.034), ('chemin', 0.034), ('claude', 0.034), ('dymetman', 0.034), ('ifrom', 0.034), ('promotes', 0.034), ('risky', 0.034), ('spanishenglish', 0.034), ('tdl', 0.034), ('xip', 0.034), ('algorithms', 0.034), ('overlapping', 0.033), ('nt', 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999934 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.
2 0.12958959 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
3 0.1132296 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
Author: Nguyen Bach ; Fei Huang ; Yaser Al-Onaizan
Abstract: State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident it is about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective to improve post-editor productivity.
4 0.10764544 150 acl-2011-Hierarchical Text Classification with Latent Concepts
Author: Xipeng Qiu ; Xuanjing Huang ; Zhao Liu ; Jinlong Zhou
Abstract: Recently, hierarchical text classification has become an active research topic. The essential idea is that the descendant classes can share the information of the ancestor classes in a predefined taxonomy. In this paper, we claim that each class has several latent concepts and its subclasses share information with these different concepts respectively. Then, we propose a variant Passive-Aggressive (PA) algorithm for hierarchical text classification with latent concepts. Experimental results show that the performance of our algorithm is competitive with the recently proposed hierarchical classification algorithms.
5 0.096699834 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
Author: Bing Xiang ; Abraham Ittycheriah
Abstract: In this paper we present a novel discriminative mixture model for statistical machine translation (SMT). We model the feature space with a log-linear combination of multiple mixture components. Each component contains a large set of features trained in a maximum-entropy framework. All features within the same mixture component are tied and share the same mixture weights, where the mixture weights are trained discriminatively to maximize the translation performance. This approach aims at bridging the gap between the maximum-likelihood training and the discriminative training for SMT. It is shown that the feature space can be partitioned in a variety of ways, such as based on feature types, word alignments, or domains, for various applications. The proposed approach improves the translation performance significantly on a large-scale Arabic-to-English MT task.
6 0.093345806 44 acl-2011-An exponential translation model for target language morphology
7 0.092034884 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
8 0.090620145 325 acl-2011-Unsupervised Word Alignment with Arbitrary Features
9 0.087014265 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
10 0.086411715 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
11 0.084729739 313 acl-2011-Two Easy Improvements to Lexical Weighting
12 0.083935328 75 acl-2011-Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction
13 0.079625539 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents
14 0.073868662 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
15 0.069362059 39 acl-2011-An Ensemble Model that Combines Syntactic and Semantic Clustering for Discriminative Dependency Parsing
16 0.067146443 333 acl-2011-Web-Scale Features for Full-Scale Parsing
17 0.06526795 171 acl-2011-Incremental Syntactic Language Models for Phrase-based Translation
18 0.065012604 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
19 0.064873084 290 acl-2011-Syntax-based Statistical Machine Translation using Tree Automata and Tree Transducers
20 0.063969068 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations
topicId topicWeight
[(0, 0.202), (1, -0.04), (2, 0.017), (3, 0.015), (4, 0.007), (5, -0.014), (6, 0.047), (7, 0.012), (8, 0.013), (9, 0.054), (10, -0.001), (11, -0.042), (12, -0.006), (13, 0.002), (14, 0.005), (15, 0.045), (16, -0.073), (17, -0.041), (18, 0.006), (19, -0.008), (20, 0.033), (21, -0.089), (22, 0.047), (23, -0.036), (24, -0.049), (25, -0.027), (26, -0.002), (27, -0.001), (28, 0.032), (29, 0.001), (30, -0.005), (31, 0.024), (32, -0.038), (33, 0.069), (34, 0.038), (35, 0.122), (36, 0.017), (37, -0.035), (38, 0.04), (39, -0.03), (40, 0.139), (41, -0.048), (42, 0.023), (43, -0.064), (44, -0.045), (45, -0.048), (46, 0.11), (47, -0.05), (48, 0.084), (49, 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.93826717 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.
2 0.71916264 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
Author: Joel Lang
Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.
3 0.64044905 79 acl-2011-Confidence Driven Unsupervised Semantic Parsing
Author: Dan Goldwasser ; Roi Reichart ; James Clarke ; Dan Roth
Abstract: Current approaches for semantic parsing take a supervised approach requiring a considerable amount of training data which is expensive and difficult to obtain. This supervision bottleneck is one of the major difficulties in scaling up semantic parsing. We argue that a semantic parser can be trained effectively without annotated data, and introduce an unsupervised learning algorithm. The algorithm takes a self training approach driven by confidence estimation. Evaluated over Geoquery, a standard dataset for this task, our system achieved 66% accuracy, compared to 80% of its fully supervised counterpart, demonstrating the promise of unsupervised approaches for this task.
4 0.63240457 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
Author: Jacob Eisenstein ; Noah A. Smith ; Eric P. Xing
Abstract: We present a method to discover robust and interpretable sociolinguistic associations from raw geotagged text data. Using aggregate demographic statistics about the authors' geographic communities, we solve a multi-output regression problem between demographics and lexical frequencies. By imposing a composite ℓ1,∞ regularizer, we obtain structured sparsity, driving entire rows of coefficients to zero. We perform two regression studies. First, we use term frequencies to predict demographic attributes; our method identifies a compact set of words that are strongly associated with author demographics. Next, we conjoin demographic attributes into features, which we use to predict term frequencies. The composite regularizer identifies a small number of features, which correspond to communities of authors united by shared demographic and linguistic properties.
5 0.62771213 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben
Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.
6 0.62012661 199 acl-2011-Learning Condensed Feature Representations from Large Unsupervised Data Sets for Supervised Learning
7 0.61140645 60 acl-2011-Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
8 0.60407668 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence
9 0.59840947 100 acl-2011-Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation
10 0.59654576 301 acl-2011-The impact of language models and loss functions on repair disfluency detection
11 0.59489 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals
12 0.58916146 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
13 0.58913451 20 acl-2011-A New Dataset and Method for Automatically Grading ESOL Texts
14 0.58803809 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech
15 0.57967919 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
16 0.57846361 150 acl-2011-Hierarchical Text Classification with Latent Concepts
17 0.57595706 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
18 0.55934465 267 acl-2011-Reversible Stochastic Attribute-Value Grammars
19 0.55012804 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
20 0.54544288 313 acl-2011-Two Easy Improvements to Lexical Weighting
topicId topicWeight
[(5, 0.019), (17, 0.028), (26, 0.026), (31, 0.012), (37, 0.087), (39, 0.034), (41, 0.048), (55, 0.423), (59, 0.027), (72, 0.03), (91, 0.061), (96, 0.142)]
simIndex simValue paperId paperTitle
1 0.89267015 24 acl-2011-A Scalable Probabilistic Classifier for Language Modeling
Author: Joel Lang
Abstract: We present a novel probabilistic classifier, which scales well to problems that involve a large number of classes and require training on large datasets. A prominent example of such a problem is language modeling. Our classifier is based on the assumption that each feature is associated with a predictive strength, which quantifies how well the feature can predict the class by itself. The predictions of individual features can then be combined according to their predictive strength, resulting in a model, whose parameters can be reliably and efficiently estimated. We show that a generative language model based on our classifier consistently matches modified Kneser-Ney smoothing and can outperform it if sufficiently rich features are incorporated.
same-paper 2 0.87331265 78 acl-2011-Confidence-Weighted Learning of Factored Discriminative Language Models
Author: Viet Ha Thuc ; Nicola Cancedda
Abstract: Language models based on word surface forms only are unable to benefit from available linguistic knowledge, and tend to suffer from poor estimates for rare features. We propose an approach to overcome these two limitations. We use factored features that can flexibly capture linguistic regularities, and we adopt confidence-weighted learning, a form of discriminative online learning that can better take advantage of a heavy tail of rare features. Finally, we extend the confidence-weighted learning to deal with label noise in training data, a common case with discriminative language modeling.
3 0.86250484 275 acl-2011-Semi-Supervised Modeling for Prenominal Modifier Ordering
Author: Margaret Mitchell ; Aaron Dunlop ; Brian Roark
Abstract: In this paper, we argue that ordering prenominal modifiers, typically pursued as a supervised modeling task, is particularly well-suited to semi-supervised approaches. By relying on automatic parses to extract noun phrases, we can scale up the training data by orders of magnitude. This minimizes the predominant issue of data sparsity that has informed most previous approaches. We compare several recent approaches, and find improvements from additional training data across the board; however, none outperform a simple n-gram model.
4 0.85442531 124 acl-2011-Exploiting Morphology in Turkish Named Entity Recognition System
Author: Reyyan Yeniterzi
Abstract: Turkish is an agglutinative language with complex morphological structures, therefore using only word forms is not enough for many computational tasks. In this paper we analyze the effect of morphology in a Named Entity Recognition system for Turkish. We start with the standard word-level representation and incrementally explore the effect of capturing syntactic and contextual properties of tokens. Furthermore, we also explore a new representation in which roots and morphological features are represented as separate tokens instead of representing only words as tokens. Using syntactic and contextual properties with the new representation provide an 7.6% relative improvement over the baseline.
5 0.78857535 144 acl-2011-Global Learning of Typed Entailment Rules
Author: Jonathan Berant ; Ido Dagan ; Jacob Goldberger
Abstract: Extensive knowledge bases of entailment rules between predicates are crucial for applied semantic inference. In this paper we propose an algorithm that utilizes transitivity constraints to learn a globally-optimal set of entailment rules for typed predicates. We model the task as a graph learning problem and suggest methods that scale the algorithm to larger graphs. We apply the algorithm over a large data set of extracted predicate instances, from which a resource of typed entailment rules has been recently released (Schoenmackers et al., 2010). Our results show that using global transitivity information substantially improves performance over this resource and several baselines, and that our scaling methods allow us to increase the scope of global learning of entailment-rule graphs.
6 0.7654689 237 acl-2011-Ordering Prenominal Modifiers with a Reranking Approach
7 0.72031325 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives
8 0.58196253 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
9 0.58042598 175 acl-2011-Integrating history-length interpolation and classes in language modeling
10 0.57951337 150 acl-2011-Hierarchical Text Classification with Latent Concepts
11 0.57459259 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
13 0.55889416 17 acl-2011-A Large Scale Distributed Syntactic, Semantic and Lexical Language Model for Machine Translation
14 0.55826688 280 acl-2011-Sentence Ordering Driven by Local and Global Coherence for Summary Generation
15 0.5532366 85 acl-2011-Coreference Resolution with World Knowledge
16 0.55233979 135 acl-2011-Faster and Smaller N-Gram Language Models
17 0.55051637 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
18 0.54963028 197 acl-2011-Latent Class Transliteration based on Source Language Origin
19 0.54800379 38 acl-2011-An Empirical Investigation of Discounting in Cross-Domain Language Models
20 0.54779637 163 acl-2011-Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes