emnlp emnlp2010 emnlp2010-30 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Avihai Mejer ; Koby Crammer
Abstract: Confidence-Weighted linear classifiers (CW) and their successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach to sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the correctness of each predicted label in the output sequence. We show that our approach provides reliable relative correctness information, as it outperforms other alternatives in ranking label predictions according to their error. We also show empirically that our methods output estimates close to the absolute error. Finally, we show how to use this information to improve active learning.
Reference: text
sentIndex sentText sentNum sentScore
1 In this work we fill this gap by proposing a few alternatives to compute confidence in the output of discriminative non-probabilistic algorithms. [sent-22, score-0.26]
2 However, they also compute additional labelings, which are used to compute the per-word confidence in its labeling. [sent-24, score-0.26]
3 , 2009b) and induce a distribution over labelings from the distribution maintained over weight-vectors. [sent-27, score-0.163]
4 We show how to compute confidence estimates in the label predicted per word, such that the confidence reflects the probability that the label is not correct. [sent-28, score-0.665]
5 We then use this confidence information to rank all labeled words (in all sentences). [sent-29, score-0.306]
6 This can be thought of as a retrieval of the erroneous words, which can then be passed to a human annotator for examination, either to correct these mistakes or as a quality control component. [sent-30, score-0.255]
7 We evaluate our methods on four NP chunking and NER datasets and demonstrate the usefulness of our methods. [sent-32, score-0.19]
8 The mean µ ∈ Rd contains the current estimate for the best weight vector, whereas the Gaussian covariance matrix Σ ∈ Rd×d captures the confidence in this estimate. [sent-45, score-0.17]
9 More precisely, the diagonal elements Σp,p capture the confidence in the value of the corresponding weight µp; the smaller the value of Σp,p is, the more confident is the model in the value of µp. [sent-46, score-0.494]
10 When the data is of large dimension, such as in natural language processing, a model that maintains a full covariance matrix is not feasible and we back off to diagonal covariance matrices. [sent-48, score-0.283]
11 At each round, the new mean and covariance of the weight vector distribution are chosen to be the solution of an optimization problem (see (Crammer et al. [sent-50, score-0.142]
12 do: Get xi ∈ X. Predict the best labeling ŷi = arg maxz µi−1 · Φ(xi, z). Get the correct labeling yi ∈ Y. Define ∆i,y,ŷ = Φ(x, yi) − Φ(x, ŷi). Compute αi and βi (Eq. [sent-61, score-0.361]
13 Given an input instance x and a model µ ∈ Rd we predict the labeling with the highest score, ŷ = arg maxz µ · Φ(x, z). [sent-74, score-0.201]
14 A brute-force approach evaluates the value of the score µ · Φ(x, z) for each possible labeling z ∈ Yn, which is not feasible for large values of n. [sent-75, score-0.132]
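The practical remedy is dynamic programming over the label sequence. Below is a minimal Viterbi sketch in Python, assuming the score µ · Φ(x, z) decomposes into per-position emission scores and first-order transition scores; the array names and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def viterbi(emit, trans):
    """Best labeling under a first-order chain decomposition of the score.

    emit:  (n, L) array, emit[p, y] = score contribution of label y at word p.
    trans: (L, L) array, trans[y, y2] = score of the transition y -> y2.
    Returns arg max_z of the total score as a list of label indices.
    """
    n, L = emit.shape
    delta = np.empty((n, L))            # best score of a prefix ending in label y
    back = np.zeros((n, L), dtype=int)  # backpointers
    delta[0] = emit[0]
    for p in range(1, n):
        cand = delta[p - 1][:, None] + trans + emit[p][None, :]
        back[p] = cand.argmax(axis=0)
        delta[p] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]
    for p in range(n - 1, 0, -1):
        path.append(int(back[p, path[-1]]))
    return path[::-1]
```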
15 1) to sequence labeling by a reduction to binary classification. [sent-85, score-0.128]
16 We first define the difference between the feature vector associated with the correct labeling yi and the feature vector associated with some labeling z to be ∆i,y,z = Φ(x, yi) − Φ(x, z), and in particular, when we use the prediction ŷi we get ∆i,y,ŷ = Φ(x, yi) − Φ(x, ŷi). [sent-86, score-0.34]
17 The CW update is µi = µi−1 + αiΣi−1∆i,y,ŷ , Σi⁻¹ = Σi−1⁻¹ + βi∆i,y,ŷ∆i,y,ŷ⊤ , (2) where the two scalars αi and βi are set using the update rule defined by (Crammer et al. [sent-87, score-0.175]
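For concreteness, here is a minimal sketch of this update with a diagonal covariance, the back-off the paper uses for high-dimensional NLP data. The closed forms of αi and βi come from the CW update rule and are not reproduced here; they are taken as given inputs.

```python
import numpy as np

def cw_update(mu, sigma, phi_gold, phi_pred, alpha, beta):
    """One diagonal CW-style update (Eq. 2), a sketch.

    mu, sigma : (d,) mean and diagonal of the covariance Sigma.
    phi_gold, phi_pred : (d,) feature vectors Phi(x, y_i) and Phi(x, yhat_i).
    alpha, beta : the scalars alpha_i, beta_i from the CW update rule.
    """
    delta = phi_gold - phi_pred        # Delta_{i,y,yhat}
    mu = mu + alpha * sigma * delta    # mu_i = mu_{i-1} + alpha_i Sigma_{i-1} Delta
    # diagonal of Sigma_i^{-1} = Sigma_{i-1}^{-1} + beta_i Delta Delta^T
    sigma = 1.0 / (1.0 / sigma + beta * delta ** 2)
    return mu, sigma
```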
18 We proceed with the confidence parameters in (Crammer et al. [sent-127, score-0.26]
19 4 Evaluation: For the experiments described in this paper we used four large sequential classification datasets taken from the CoNLL-2000, 2002 and 2003 shared tasks: noun-phrase (NP) chunking (Kim et al. [sent-145, score-0.19]
20 Although our primary goal is estimating confidence in prediction and not the actual performance itself, we first report the results of using AROW and CW for sequence learning. [sent-150, score-0.396]
21 Three options are possible: compute a full Σ and then take its diagonal elements; compute a full inverse Σ, take its diagonal elements and then compute its inverse; assume that Σ is diagonal and compute the optimal update for this choice. [sent-175, score-0.231]
22 The F-measure of the four algorithms after 10 iterations over the four datasets is summarized in Table 2. [sent-182, score-0.243]
23 Each panel summarizes the results on a single dataset, and in each panel a single set of connected points corresponds to one algorithm. [sent-189, score-0.163]
24 Second, the performance of all algorithms converges in about 10 iterations, as indicated by the fact that the points in the top-right of the plot are close to each other. [sent-194, score-0.144]
25 5 Confidence in the Prediction: Most large-margin-based training algorithms output models whose prediction is a single labeling of the input, with no additional confidence information about the correctness of that prediction. [sent-214, score-0.537]
26 This situation is not acceptable when the output of the system is used as the input of another system that is sensitive to the correctness of the specific prediction or that integrates various input sources. [sent-216, score-0.137]
27 In such cases, additional confidence information about the correctness of these feeds for a specific input can be used to improve the total output quality. [sent-217, score-0.317]
28 The confidence information can be used to direct the check to a small number of suspected predictions, as opposed to a random check, which may miss errors if their rate is small. [sent-219, score-0.327]
29 This information can be used to rank all predictions according to their confidence score, which can be used to direct a quality control component to detect errors in the prediction. [sent-221, score-0.295]
30 Note that the confidence score is meaningless by itself; in fact, any monotonic transformation of the confidence scores yields equivalent confidence information. [sent-222, score-0.812]
31 Other methods provide confidence in the predicted output as absolute information, that is, the probability of a prediction being correct. [sent-223, score-0.421]
32 When taking a large set of events (predictions) with a similar confidence value ν of being correct, we expect that about a ν fraction of the predictions in the group will be correct. [sent-225, score-0.36]
33 First, a method generates a set of K possible labelings for the input sentence (instead of a single prediction). [sent-227, score-0.163]
34 Then, the confidence in a predicted labeling for a specific word is defined to be the proportion of labelings which are consistent with the predicted label. [sent-228, score-0.625]
35 Let z(1), . . . , z(K) be the K labelings for some input x, and let ŷ be the actual prediction for the input. [sent-232, score-0.271]
36 The confidence in the label ŷp of word p = 1, . . . , n is the fraction of the K labelings z(i) for which z(i)p = ŷp. [sent-234, score-0.307]
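This per-word vote is straightforward to compute; the sketch below assumes each labeling is an equal-length list of label indices.

```python
def per_word_confidence(labelings, predicted):
    """Fraction of the K labelings that agree with the predicted label
    at each word position (the per-word vote described above)."""
    K = len(labelings)
    return [sum(z[p] == yhat_p for z in labelings) / K
            for p, yhat_p in enumerate(predicted)]
```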
37 Figure 3: Total number of detected erroneous words vs. the number of ranked words. [sent-246, score-0.263]
38 In other words, the lines in the bottom panels show the number of additional erroneous words detected compared to the Delta method. [sent-248, score-0.47]
39 In this case, we draw K labelings from this distribution. [sent-252, score-0.163]
40 Specifically, we exploit the Gaussian distribution over weight vectors w ∼ N(µ, Σ) maintained by AROW and CW, by inducing a distribution over labelings given an input. [sent-253, score-0.199]
41 The algorithm samples K weight vectors according to this Gaussian distribution and outputs the best labeling with respect to each weight vector. [sent-254, score-0.172]
42 Formally, we define the set Z = {z(i) : z(i) = arg maxz w · Φ(x, z), where w ∼ N(µ, Σ)}. The predictions of algorithms that use the mean weight vector, ŷ = arg maxz µ · Φ(x, z), are invariant to the value of Σ (as noted by (Crammer et al. [sent-255, score-0.344]
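A minimal sketch of this sampling scheme, assuming a diagonal covariance and a hypothetical decode(w, feats) callback (e.g. the Viterbi sketch above) that returns arg maxz w · Φ(x, z):

```python
import numpy as np

def k_draws(mu, sigma, feats, decode, K=30, seed=0):
    """Draw K weight vectors w ~ N(mu, diag(sigma)) and decode each one,
    yielding the set Z of K sampled labelings."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(sigma)
    return [decode(rng.normal(mu, std), feats) for _ in range(K)]
```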
43 However, for the purpose of confidence estimation, the specific value of Σ has a huge effect. [sent-257, score-0.328]
44 Then, after the training is completed, we try a few scalings sΣ of the final covariance for some positive scalar s, and choose the best value of s using the training set. [sent-262, score-0.137]
45 The second method to estimate confidence follows the same conceptual steps, except that we use an isotropic covariance matrix, Σ = sI, for some positive scalar s. [sent-264, score-0.366]
46 This method is especially appealing, since it can be used in combination with training algorithms that do not maintain confidence information, such as the Perceptron or PA. [sent-267, score-0.3]
47 We modified the Viterbi algorithm to output the K distinct labelings with the highest scores (computed using the mean weight vector in the case of CW or AROW). [sent-269, score-0.231]
48 The third method assigns uniform importance to each of the K labelings, ignoring the actual score values. [sent-270, score-0.223]
49 We thus propose the fourth method, in which we assign an importance weight ωi to each labeling z(i). [sent-272, score-0.136]
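The exact form of ωi is not preserved in this extraction; the sketch below assumes a softmax normalization of the K-best Viterbi scores, which matches the stated intent of letting higher-scoring labelings count more in the per-word vote.

```python
import numpy as np

def weighted_vote_confidence(labelings, scores, predicted):
    """WKBV-style per-word confidence, a sketch: each of the K-best
    labelings votes with weight w_i instead of uniformly (KBV)."""
    w = np.exp(scores - np.max(scores))  # assumed softmax weighting over scores
    w = w / w.sum()
    return [float(sum(wi for wi, z in zip(w, labelings) if z[p] == yhat_p))
            for p, yhat_p in enumerate(predicted)]
```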
50 This method provides a confidence score that is only relative and not absolute; namely, its output can be used to compare the confidence in two labelings, yet there are no semantics defined over the scores. [sent-284, score-0.552]
51 Given an input sentence x to be labeled and a model µ, we define the confidence in the prediction associated with the pth word to be the difference between the highest score and the closest score obtained when the label of that word is set to anything but the highest-scoring label. [sent-285, score-0.512]
52 We refer to this method as Delta, where the confidence information is a difference, a.k.a. a delta, between two score values. [sent-287, score-0.292]
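One way to compute this gap for every word in a single pass is via max-marginals; the sketch below reuses the first-order chain scores assumed in the Viterbi sketch above and is an illustrative reconstruction, not the paper's code.

```python
import numpy as np

def delta_confidences(emit, trans):
    """Per-word Delta: best overall score minus the best score achievable
    when word p is forced to any label other than the predicted one."""
    n, L = emit.shape
    fwd = np.empty((n, L)); fwd[0] = emit[0]
    for p in range(1, n):                    # forward max over prefixes
        fwd[p] = (fwd[p - 1][:, None] + trans).max(axis=0) + emit[p]
    bwd = np.zeros((n, L))
    for p in range(n - 2, -1, -1):           # backward max over suffixes
        bwd[p] = (trans + (bwd[p + 1] + emit[p + 1])[None, :]).max(axis=1)
    mm = fwd + bwd                           # mm[p, y]: best full score with label y at p
    best = mm[0].max()                       # Viterbi score of the prediction
    out = []
    for p in range(n):
        yhat = int(mm[p].argmax())
        out.append(best - np.delete(mm[p], yhat).max())
    return out
```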
53 Finally, as an additional baseline, we used a sixth method based on the confidence values for single words produced by a CRF model. [sent-288, score-0.26]
54 We trained a classifier using the CW algorithm, running for ten (10) iterations on three-fourths of the data, and applied it to the remaining one-fourth to get a labeling of the test set. [sent-294, score-0.131]
55 The value of K, the number of labelings used in the first four methods (KD-PC, KD-Fixed, KBV, WKBV), and the weighting scalar s used in KD-PC and KD-Fixed were tuned for each dataset in a single evaluation on a subset of the training set according to the best measured average precision. [sent-299, score-0.208]
56 We also trained a CRF on the same training sets and applied it to label and assign confidence values to all the words in the test sets. [sent-308, score-0.307]
57 Relative Confidence: For each of the datasets, we first trained a model using the CW algorithm and applied each of the confidence methods to the output, ranking all the words of the test set from low to high according to the confidence in the prediction associated with them. [sent-310, score-0.633]
58 This task can be thought of as a retrieval task of the erroneous words. [sent-312, score-0.255]
59 The average precision is the average of the precision values computed at all ranks of erroneous words. [sent-313, score-0.273]
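As a concrete reading of this metric, a short sketch; words are assumed to be sorted from least to most confident, with True marking a mislabeled word.

```python
def average_precision(ranked_is_error):
    """Average of precision@rank taken at the rank of every erroneous word."""
    hits, precisions = 0, []
    for rank, is_err in enumerate(ranked_is_error, start=1):
        if is_err:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```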
60 The average precision for ranking the words of the test set according to the confidence in the prediction, for seven methods, appears in the top-left panel of Fig. [sent-314, score-0.469]
61 We see that when ordering the words randomly, the average precision is about the frequency of erroneous words, which is the lowest average precision. [sent-317, score-0.245]
62 Thus, taking the actual score value into consideration improves the ability to detect erroneous words. [sent-319, score-0.308]
63 All confidence estimation methods can be used except KD-PC, which does take the confidence information into consideration. [sent-341, score-0.557]
64 To better understand the behavior of the various methods we plot the total number of detected erroneous words vs. [sent-344, score-0.309]
65 the number of ranked words (first 5,000 ranked words) in the top panels of Fig. [sent-345, score-0.229]
66 The bottom panels show the relative additional number of words each method detects on top of the margin-based Delta method. [sent-347, score-0.207]
67 Clearly, KD-Fixed and KD-PC detect erroneous words better than the other CW-based methods, finding about 100 more words than Delta (when ranking 5,000 words), which is about 8% of the total number of erroneous words. [sent-348, score-0.467]
68 We emphasize that all methods except CRF were based on the same exact weight vector, ranking the same predictions, while CRF used an alternative weight vector that yields a different number of erroneous words. [sent-350, score-0.322]
69 In detail, we observe some correlation between the percentage of erroneous words in the entire set and the number of erroneous words detected among the first 5,000 ranked words. [sent-351, score-0.507]
70 For NP chunking and NER English datasets, CRF has more erroneous words compared to CW and it detects more erroneous words compared to K-Draws. [sent-352, score-0.525]
71 For the NER Dutch dataset, CRF and CW have almost the same number of erroneous words and almost the same number of erroneous words detected; finally, on the NER Spanish dataset CRF has fewer erroneous words and it detected fewer erroneous words. [sent-353, score-0.914]
72 In other words, where there are more erroneous words to find (e. [sent-354, score-0.217]
73 CRF in NP chunking), the task of ranking erroneous words is easier, and vice-versa. [sent-356, score-0.25]
74 We hypothesize that part of the performance differences we see between the K-Draws and CRF methods is due to the difference in the number of erroneous words in the ranked set. [sent-357, score-0.244]
75 Absolute Confidence: Our next goal is to evaluate how reliable the absolute confidence values output by the proposed methods are. [sent-361, score-0.29]
76 As before, the confidence estimation methods (KD-PC, KD-Fixed, KBV, WKBV and CRF) were applied to the entire set of predicted labels. [sent-362, score-0.348]
77 (The Delta method is omitted as the confidence score it produces is not in [0, 1].) [sent-363, score-0.292]
78 For each of the four datasets and the five algorithms we grouped the words according to the value of their confidence. [sent-364, score-0.17]
79 Specifically, we used twenty (20) bins, dividing the confidence range uniformly into intervals of size 0.05. [sent-365, score-0.305]
80 Formally, bin indexed j contains words with confidence value in the range [(j − 1)/20, j/20) for j = 1, . . . , 20. [sent-369, score-0.336]
81 Let bj be the center value of bin j, that is, bj = j/20 − 1/40. [sent-373, score-0.19]
82 Denote by cj the fraction of words with confidence ν ∈ [(j − 1)/20, j/20) whose assigned label is correct. [sent-375, score-0.346]
83 Ultimately, these two values should be the same, bj = cj, meaning that the confidence information is a good estimator of the frequency of correct labels. [sent-376, score-0.355]
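A minimal sketch of this binning, assuming per-word confidences in [0, 1) and matching correctness flags:

```python
import numpy as np

def calibration_bins(conf, correct, n_bins=20):
    """Return (b_j, c_j) pairs: bin centers vs. empirical fraction correct.
    Empty bins are skipped."""
    conf = np.asarray(conf)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pairs = []
    for j in range(n_bins):
        mask = (conf >= edges[j]) & (conf < edges[j + 1])
        if mask.any():
            b_j = (edges[j] + edges[j + 1]) / 2  # equals j/20 - 1/40 for 20 bins
            pairs.append((b_j, correct[mask].mean()))
    return pairs
```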
84 Methods for which cj > bj are too pessimistic, predicting too high a frequency of erroneous labels, while methods for which cj < bj are too optimistic, predicting too low a frequency of erroneous words. [sent-377, score-0.728]
85 The results are summarized in Fig. 4, one panel per dataset, where we plot the value of the center-of-bin bj vs. [sent-378, score-0.268]
86 We hypothesize that its superiority is because it makes use of the uncertainty information captured in the covariance matrix Σ, which is part of the Gaussian distribution. [sent-387, score-0.134]
87 Finally, these bin plots do not reflect the fact that the bins were not populated uniformly; the bins with higher values were more heavily populated. [sent-388, score-0.135]
88 The success of KD-PC and KD-Fixed in evaluating confidence led us to experiment with using similar techniques for inference. [sent-404, score-0.26]
89 Given an input sentence, the inference algorithm samples K times from the Gaussian distribution and outputs the best labeling according to each sampled weight vector. [sent-405, score-0.136]
90 , 2009a), as they output the most frequent labeling in a set, while the predicted label of our algorithm may not even belong to the set of predictions. [sent-409, score-0.198]
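For contrast, a sketch of the most-frequent-labeling variant mentioned here; labelings are converted to tuples so they can be counted.

```python
from collections import Counter

def most_frequent_labeling(labelings):
    """Return the labeling that occurs most often among the K sampled decodes."""
    counts = Counter(tuple(z) for z in labelings)
    return list(counts.most_common(1)[0][0])
```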
91 6 Active Learning: Encouraged by the success of the KD-PC and KD-Fixed algorithms in estimating the confidence in the prediction, we apply these methods to the task of active learning. [sent-410, score-0.44]
92 Many active learning algorithms first compute a prediction for each of the unlabeled examples, which is then used to choose new examples to be labeled. [sent-416, score-0.18]
93 In the previous section, we used the confidence estimation algorithms to choose individual words to be annotated by a human. [sent-423, score-0.337]
94 A similar approach, motivated by (Dredze and Crammer, 2008), normalizes the MinMargin score using the confidence information extracted from the Gaussian covariance matrix; we call this method MinConfMargin. [sent-432, score-0.398]
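A sketch of both selection rules; the normalization by the margin's standard deviation under w ∼ N(µ, diag(σ)) is an assumption in the spirit of Dredze and Crammer (2008), and score2 is a hypothetical helper returning the top-two labeling scores and their feature difference.

```python
import numpy as np

def select_examples(pool, mu, sigma, score2, budget, normalized=True):
    """Rank unlabeled examples by (normalized) margin and return the
    `budget` least-certain ones for annotation."""
    scored = []
    for x in pool:
        s1, s2, dfeat = score2(mu, x)   # top-two scores and feature difference
        margin = s1 - s2                # MinMargin criterion
        if normalized:                  # MinConfMargin criterion (assumed form)
            margin /= np.sqrt(np.sum(sigma * dfeat ** 2) + 1e-12)
        scored.append((margin, x))
    scored.sort(key=lambda t: t[0])     # least certain first
    return [x for _, x in scored[:budget]]
```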
95 The top panels show the results for up to 10,000 labeled words, while the bottom panels show the results for more than 10k labeled words. [sent-451, score-0.474]
96 The bottom panels show the results for more than 10k training words. [sent-452, score-0.207]
97 Related Work: Most previous work has focused on confidence estimation for an entire example or some fields of an entry (Culotta and McCallum, 2004) using CRFs. [sent-454, score-0.297]
98 , 2004) show the utility of confidence estimation in the extracted fields of an interactive information extraction system by highlighting low-confidence fields for the user. [sent-456, score-0.557]
99 , 2001) estimate the confidence of a single token label in an HMM-based information extraction system by a method similar to the Delta method we used. [sent-458, score-0.307]
100 (Ueffing and Ney, 2007) propose several methods for word-level confidence estimation for the task of machine translation. [sent-459, score-0.297]
wordName wordTfidf (topN-words)
[('cw', 0.379), ('arow', 0.365), ('crammer', 0.265), ('confidence', 0.26), ('ner', 0.243), ('crf', 0.238), ('erroneous', 0.217), ('wkbv', 0.191), ('panels', 0.175), ('delta', 0.163), ('labelings', 0.163), ('kbv', 0.143), ('covariance', 0.106), ('dredze', 0.106), ('labeling', 0.1), ('bj', 0.095), ('chunking', 0.091), ('dutch', 0.085), ('prediction', 0.08), ('tjong', 0.079), ('update', 0.072), ('panel', 0.068), ('kdfixed', 0.064), ('pa', 0.062), ('maxz', 0.061), ('gaussian', 0.061), ('active', 0.06), ('yi', 0.06), ('correctness', 0.057), ('datasets', 0.054), ('cj', 0.052), ('predicted', 0.051), ('np', 0.048), ('kristjansson', 0.048), ('minmargin', 0.048), ('pessimistic', 0.048), ('culotta', 0.047), ('spanish', 0.047), ('label', 0.047), ('plot', 0.046), ('detected', 0.046), ('mcdonald', 0.046), ('labeled', 0.046), ('bins', 0.045), ('bin', 0.045), ('four', 0.045), ('diagonal', 0.043), ('shimizu', 0.041), ('algorithms', 0.04), ('online', 0.04), ('arg', 0.04), ('thought', 0.038), ('viterbi', 0.037), ('rd', 0.037), ('optimistic', 0.037), ('scheffer', 0.037), ('ueffing', 0.037), ('zp', 0.037), ('estimation', 0.037), ('round', 0.036), ('max', 0.036), ('weight', 0.036), ('predictions', 0.035), ('wick', 0.034), ('mallet', 0.034), ('sha', 0.034), ('mislabeled', 0.034), ('fraction', 0.034), ('line', 0.034), ('ranking', 0.033), ('score', 0.032), ('delt', 0.032), ('haifa', 0.032), ('hyper', 0.032), ('kdpc', 0.032), ('labeeld', 0.032), ('minconfmargin', 0.032), ('pipes', 0.032), ('pjnj', 0.032), ('suspected', 0.032), ('koby', 0.032), ('tong', 0.032), ('confident', 0.032), ('marginals', 0.032), ('bottom', 0.032), ('iterations', 0.031), ('value', 0.031), ('rule', 0.031), ('elements', 0.03), ('averaged', 0.03), ('absolute', 0.03), ('perceptron', 0.03), ('summarized', 0.028), ('sequence', 0.028), ('matrix', 0.028), ('actual', 0.028), ('precision', 0.028), ('ranked', 0.027), ('points', 0.027), ('yq', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
Author: Avihai Mejer ; Koby Crammer
Abstract: Confidence-Weighted linear classifiers (CW) and their successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach to sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the correctness of each predicted label in the output sequence. We show that our approach provides reliable relative correctness information, as it outperforms other alternatives in ranking label predictions according to their error. We also show empirically that our methods output estimates close to the absolute error. Finally, we show how to use this information to improve active learning.
2 0.18794401 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
Author: Mark Dredze ; Tim Oates ; Christine Piatko
Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
3 0.16817994 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan
Abstract: Named-entity recognition (NER) is an important task required in a wide variety of applications. While rule-based systems are appealing due to their well-known “explainability,” most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. Motivated by these results, we explore the following natural question in this paper: Are rule-based systems still a viable approach to named-entity recognition? Specifically, we have designed and implemented a high-level language NERL on top of SystemT, a general-purpose algebraic information extraction system. NERL is tuned to the needs of NER tasks and simplifies the process of building, understanding, and customizing complex rule-based named-entity annotators. We show that these customized annotators match or outperform the best published results achieved with machine learning techniques. These results confirm that we can reap the benefits of rule-based extractors’ explainability without sacrificing accuracy. We conclude by discussing lessons learned while building and customizing complex rule-based annotators and outlining several research directions towards facilitating rule development.
4 0.15042952 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira
Abstract: We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRF) and its application to partof-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar ngrams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain. Standard inference can be used at test time. Our approach is able to scale to very large problems and yields significantly improved target domain accuracy.
5 0.10387012 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.
6 0.087889463 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
7 0.077455968 84 emnlp-2010-NLP on Spoken Documents Without ASR
8 0.071674727 104 emnlp-2010-The Necessity of Combining Adaptation Methods
9 0.069149904 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping
10 0.062263172 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
11 0.059409052 33 emnlp-2010-Cross Language Text Classification by Model Translation and Semi-Supervised Learning
12 0.057632774 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
13 0.051553447 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
14 0.05107281 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text
15 0.050770875 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa
16 0.050527208 118 emnlp-2010-Utilizing Extra-Sentential Context for Parsing
17 0.048098188 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?
18 0.047374472 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech
19 0.045138355 77 emnlp-2010-Measuring Distributional Similarity in Context
20 0.044982757 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
topicId topicWeight
[(0, 0.188), (1, 0.108), (2, -0.013), (3, 0.005), (4, -0.171), (5, 0.01), (6, 0.176), (7, 0.115), (8, -0.069), (9, 0.161), (10, 0.018), (11, 0.037), (12, -0.14), (13, 0.16), (14, -0.03), (15, -0.061), (16, -0.014), (17, -0.281), (18, 0.022), (19, -0.13), (20, 0.027), (21, -0.059), (22, 0.156), (23, -0.097), (24, -0.195), (25, 0.002), (26, -0.024), (27, -0.075), (28, -0.045), (29, -0.244), (30, 0.064), (31, -0.023), (32, 0.251), (33, -0.117), (34, 0.22), (35, 0.027), (36, 0.106), (37, -0.028), (38, -0.037), (39, 0.014), (40, -0.01), (41, 0.054), (42, 0.012), (43, 0.035), (44, -0.078), (45, -0.025), (46, -0.067), (47, -0.063), (48, -0.067), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.9555074 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
Author: Avihai Mejer ; Koby Crammer
Abstract: Confidence-Weighted linear classifiers (CW) and their successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach to sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the correctness of each predicted label in the output sequence. We show that our approach provides reliable relative correctness information, as it outperforms other alternatives in ranking label predictions according to their error. We also show empirically that our methods output estimates close to the absolute error. Finally, we show how to use this information to improve active learning.
2 0.7367565 37 emnlp-2010-Domain Adaptation of Rule-Based Annotators for Named-Entity Recognition Tasks
Author: Laura Chiticariu ; Rajasekar Krishnamurthy ; Yunyao Li ; Frederick Reiss ; Shivakumar Vaithyanathan
Abstract: Named-entity recognition (NER) is an important task required in a wide variety of applications. While rule-based systems are appealing due to their well-known “explainability,” most, if not all, state-of-the-art results for NER tasks are based on machine learning techniques. Motivated by these results, we explore the following natural question in this paper: Are rule-based systems still a viable approach to named-entity recognition? Specifically, we have designed and implemented a high-level language NERL on top of SystemT, a general-purpose algebraic information extraction system. NERL is tuned to the needs of NER tasks and simplifies the process of building, understanding, and customizing complex rule-based named-entity annotators. We show that these customized annotators match or outperform the best published results achieved with machine learning techniques. These results confirm that we can reap the benefits of rule-based extractors’ explainability without sacrificing accuracy. We conclude by discussing lessons learned while building and customizing complex rule-based annotators and outlining several research directions towards facilitating rule development.
3 0.52622998 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
Author: Mark Dredze ; Tim Oates ; Christine Piatko
Abstract: Domain adaptation, the problem of adapting a natural language processing system trained in one domain to perform well in a different domain, has received significant attention. This paper addresses an important problem for deployed systems that has received little attention – detecting when such adaptation is needed by a system operating in the wild, i.e., performing classification over a stream of unlabeled examples. Our method uses A-distance, a metric for detecting shifts in data streams, combined with classification margins to detect domain shifts. We empirically show effective domain shift detection on a variety of data sets and shift conditions.
4 0.4190886 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models
Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira
Abstract: We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRF) and its application to partof-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar ngrams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain. Standard inference can be used at test time. Our approach is able to scale to very large problems and yields significantly improved target domain accuracy.
5 0.334649 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
Author: Wei Lu ; Hwee Tou Ng
Abstract: This paper focuses on the task of inserting punctuation symbols into transcribed conversational speech texts, without relying on prosodic cues. We investigate limitations associated with previous methods, and propose a novel approach based on dynamic conditional random fields. Different from previous work, our proposed approach is designed to jointly perform both sentence boundary and sentence type prediction, and punctuation prediction on speech utterances. We performed evaluations on a transcribed conversational speech domain consisting of both English and Chinese texts. Empirical results show that our method outperforms an approach based on linear-chain conditional random fields and other previous approaches.
6 0.28955552 84 emnlp-2010-NLP on Spoken Documents Without ASR
7 0.22003576 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference
8 0.21684122 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text
9 0.21383916 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications
10 0.20809239 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
11 0.20690039 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction
12 0.19625986 43 emnlp-2010-Enhancing Domain Portability of Chinese Segmentation Model Using Chi-Square Statistics and Bootstrapping
13 0.18566211 104 emnlp-2010-The Necessity of Combining Adaptation Methods
14 0.18558794 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues
15 0.17323162 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
16 0.17087698 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
17 0.16565591 83 emnlp-2010-Multi-Level Structured Models for Document-Level Sentiment Classification
18 0.16151537 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech
19 0.1544058 21 emnlp-2010-Automatic Discovery of Manner Relations and its Applications
20 0.14889054 121 emnlp-2010-What a Parser Can Learn from a Semantic Role Labeler and Vice Versa
topicId topicWeight
[(12, 0.032), (29, 0.109), (30, 0.014), (32, 0.031), (52, 0.031), (56, 0.068), (62, 0.014), (66, 0.087), (72, 0.035), (76, 0.016), (77, 0.013), (79, 0.011), (87, 0.444), (89, 0.012)]
simIndex simValue paperId paperTitle
1 0.79517174 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing
Author: Eugene Charniak
Abstract: We present a new syntactic parser that works left-to-right and top-down, thus maintaining a fully-connected parse tree for a few alternative parse hypotheses. All of the commonly used statistical parsers use context-free dynamic programming algorithms and as such work bottom up on the entire sentence. Thus they only find a complete fully connected parse at the very end. In contrast, both subjective and experimental evidence show that people understand a sentence word-to-word as they go along, or close to it. The constraint that the parser keeps one or more fully connected syntactic trees is intended to operationalize this cognitive fact. Our parser achieves a new best result for top-down parsers of 89.4%, a 20% error reduction over the previous single-parser best result for parsers of this type of 86.8% (Roark, 2001). The improved performance is due to embracing the very large feature set available in exchange for giving up dynamic programming.
same-paper 2 0.78106987 30 emnlp-2010-Confidence in Structured-Prediction Using Confidence-Weighted Models
Author: Avihai Mejer ; Koby Crammer
Abstract: Confidence-Weighted linear classifiers (CW) and their successors were shown to perform well on binary and multiclass NLP problems. In this paper we extend the CW approach to sequence learning and show that it achieves state-of-the-art performance on four noun phrase chunking and named entity recognition tasks. We then derive a few algorithmic approaches to estimate the correctness of each predicted label in the output sequence. We show that our approach provides reliable relative correctness information, as it outperforms other alternatives in ranking label predictions according to their error. We also show empirically that our methods output estimates close to the absolute error. Finally, we show how to use this information to improve active learning.
3 0.70808107 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition
Author: Zhiyuan Liu ; Wenyi Huang ; Yabin Zheng ; Maosong Sun
Abstract: Existing graph-based ranking methods for keyphrase extraction compute a single importance score for each word via a single random walk. Motivated by the fact that both documents and words can be represented by a mixture of semantic topics, we propose to decompose traditional random walk into multiple random walks specific to various topics. We thus build a Topical PageRank (TPR) on word graph to measure word importance with respect to different topics. After that, given the topic distribution of the document, we further calculate the ranking scores of words and extract the top ranked ones as keyphrases. Experimental results show that TPR outperforms state-of-the-art keyphrase extraction methods on two datasets under various evaluation metrics.
4 0.48258251 86 emnlp-2010-Non-Isomorphic Forest Pair Translation
Author: Hui Zhang ; Min Zhang ; Haizhou Li ; Eng Siong Chng
Abstract: This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypothesis using tree-to-string model and evaluate its syntactic goodness using tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experiment results on the benchmark data show our proposed two solutions are very effective, achieving significant performance improvement over baselines when applying to different translation models.
5 0.42631567 42 emnlp-2010-Efficient Incremental Decoding for Tree-to-String Translation
Author: Liang Huang ; Haitao Mi
Abstract: Syntax-based translation models should in principle be efficient with polynomially-sized search space, but in practice they are often embarrassingly slow, partly due to the cost of language model integration. In this paper we borrow from phrase-based decoding the idea to generate a translation incrementally left-to-right, and show that for tree-to-string models, with a clever encoding of derivation history, this method runs in average-case polynomial-time in theory, and linear-time with beam search in practice (whereas phrase-based decoding is exponential-time in theory and quadratic-time in practice). Experiments show that, with comparable translation quality, our tree-to-string system (in Python) can run more than 30 times faster than the phrase-based system Moses (in C++).
6 0.42323002 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields
7 0.42173746 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice
9 0.40074608 105 emnlp-2010-Title Generation with Quasi-Synchronous Grammar
10 0.39973822 119 emnlp-2010-We're Not in Kansas Anymore: Detecting Domain Changes in Streams
11 0.39325535 60 emnlp-2010-Improved Fully Unsupervised Parsing with Zoomed Learning
12 0.39270338 84 emnlp-2010-NLP on Spoken Documents Without ASR
13 0.3895421 20 emnlp-2010-Automatic Detection and Classification of Social Events
14 0.38558638 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks
15 0.38546252 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space
16 0.38498685 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics
17 0.38362348 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference
18 0.38320389 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation
19 0.38035175 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation
20 0.37701118 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction