emnlp emnlp2012 emnlp2012-138 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds the best unsupervised and parallel text methods. We achieve the highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using the fully supervised Penn Treebank.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the useful accuracies required by many language processing tasks. [sent-4, score-0.106]
2 Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. [sent-5, score-0.127]
3 In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. [sent-7, score-0.158]
4 Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds the best unsupervised and parallel text methods. [sent-8, score-0.242]
5 We achieve the highest accuracy reported for several languages and show that our [sent-9, score-0.129]
6 approach yields better out-of-domain taggers than those trained using the fully supervised Penn Treebank. [sent-10, score-0.147]
7 Supervised learning of taggers from POS-annotated training text is a well-studied task, with several methods achieving near-human tagging accuracy (Ratnaparkhi, 1996; Toutanova et al. [sent-12, score-0.164]
8 However, while English and a handful of other languages have large annotated corpora such as the Penn Treebank (Marcus et al. [sent-15, score-0.114]
9 , 1993), most of the world’s languages have no labeled corpora. [sent-19, score-0.113]
10 Unsupervised induction of POS taggers offers the possibility of avoiding costly annotation, but despite recent progress, the accuracy of unsupervised POS taggers still falls far behind supervised systems, and is not suitable for most applications (Berg-Kirkpatrick et al. [sent-24, score-0.279]
11 Using additional information, in the form of tag dictionaries or parallel text, seems unavoidable at present. [sent-28, score-0.309]
12 Early work on using tag dictionaries used a labeled corpus to extract all allowed word-tag pairs (Merialdo, 1994), which is quite an unrealistic scenario. [sent-29, score-0.295]
13 More recent work has used a subset of the observed word-tag pairs and focused on generalizing dictionary entries (Smith and Eisner, 2005; Haghighi and Klein, 2006; Toutanova and Johnson, 2007; Goldwater and Griffiths, 2007). [sent-30, score-0.181]
14 Recent work by Das and Petrov (2011) builds a dictionary for a particular language by transferring annotated data from a resource-rich language through the use of word alignments in parallel text. [sent-32, score-0.219]
15 The main idea is to rely on existing dictionaries for some languages (e.g. [sent-35, score-0.157]
16 English) and use parallel data to build a dictionary in the desired language and extend the dictionary coverage using label propagation. [sent-37, score-0.536]
17 However, parallel text does not exist for many pairs of languages and the proposed bilingual projection algorithms are fairly complex. [sent-38, score-0.201]
18 In this work we use the Wiktionary, a freely available, high-coverage, and constantly growing dictionary for a large number of languages. [sent-39, score-0.38]
19 We outperform the best current results using parallel text supervision across 8 different languages, even when the word type coverage is as low as 20%. [sent-43, score-0.228]
20 Furthermore, using the Brown corpus as out-of-domain data, we show that using the Wiktionary produces better taggers than using the Penn Treebank dictionary (88. [sent-44, score-0.25]
21 The source code, the dictionary mappings and the trained models described in this work are available at http://code. [sent-48, score-0.024]
22 2 Related Work: The scarcity of labeled corpora for resource-poor languages and the challenges of domain adaptation have led to several efforts to build systems for unsupervised POS tagging. [sent-51, score-0.229]
23 (2011) proposed replacing the multinomial emission distributions of standard HMMs by maximum entropy (ME) feature-based distributions. [sent-63, score-0.168]
24 Despite these improvements, fully unsupervised systems require an oracle to map clusters to true tags and the performance still fails to be of practical use. [sent-65, score-0.158]
25 In this paper we follow a different line of work where we rely on a prior tag dictionary indicating for each word type what POS tags it can take on (Merialdo, 1994). [sent-66, score-0.494]
26 Even when using a tag dictionary, disambiguating from all possible tags is still a hard problem and the accuracy of these methods still falls far behind their supervised counterparts. [sent-68, score-0.372]
27 In this paper, we argue that the Wiktionary can serve as an effective and much less biased tag dictionary. [sent-70, score-0.203]
28 We note that most of the previous dictionary-based approaches can be applied using the Wiktionary and would likely lead to accuracy increases similar to those we show in this paper. [sent-71, score-0.221]
29 Models can also be trained jointly using parallel corpora in several languages, exploiting the fact that different languages present different ambiguities (Snyder et al. [sent-73, score-0.127]
30 The coverage of the Wiktionary varies greatly between languages: currently there are around 75 languages for which there exist more than 1,000 word types, and 27 for which there exist more than 10,000 word types. [sent-81, score-0.225]
31 As with Wikipedia, the questions of accuracy, bias, consistency across languages, and selective coverage are paramount. [sent-85, score-0.159]
32 In this section, we explore these concerns by comparing Wiktionary to dictionaries derived from tagged corpora. [sent-86, score-0.101]
33 3.1 Labeled corpora and Universal tags: We collected part-of-speech tagged corpora for 9 languages from the CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al. [sent-88, score-0.112]
34 In this work we use the Universal POS tag set (Petrov et al. [sent-90, score-0.203]
35 , 2011) that defines 12 universal categories with a relatively stable functional definition across languages. [sent-91, score-0.128]
36 Figure 1: Growth of the Wiktionary over the last three years, showing the total number of entries for all languages and for the 9 languages we consider (left axis). [sent-94, score-0.178]
37 We also show the corresponding increase in average accuracy (right axis) achieved by our model across the 9 languages (see details below). [sent-95, score-0.152]
38 In Spanish, the fine-level tag for date (“w”) is mapped to the universal tag NUM, while it should be mapped to NOUN. [sent-98, score-0.567]
39 After examining the corpus guidelines and the mapping more closely, we found that the tags AC (cardinal numeral) and AO (ordinal numeral) are mapped to ADJ. [sent-100, score-0.244]
40 Although the corpus guidelines indicate a category SsCatgram ‘adjective’ that encompasses ‘normal’ adjectives (AN) as well as cardinal numerals (AC) and ordinal numerals (AO), we decided to tag AC and AO as NUM, since this assignment better fits the existing mapping. [sent-101, score-0.31]
41 We also reassigned all punctuation marks, which were erroneously mapped to X, to PUNC, and the tag U, which is used for the words at, de and som, to PRT. [sent-102, score-0.244]
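Concretely, these corrections can be expressed as a small override table applied before the default fine-to-universal mapping; a minimal sketch, assuming the default table is available as a dict (function and variable names are ours, not the paper's):

```python
# Hand corrections described above, layered over the default mapping.
FIXES = {
    "w": "NOUN",   # date tag, originally mapped to NUM
    "AC": "NUM",   # cardinal numeral, originally mapped to ADJ
    "AO": "NUM",   # ordinal numeral, originally mapped to ADJ
    "U": "PRT",    # tag used for the words "at", "de" and "som"
}

def map_fine_tag(fine_tag, default_mapping, is_punct=False):
    """Apply the corrections before falling back to the default mapping."""
    if is_punct:
        return "PUNC"  # punctuation marks were erroneously mapped to X
    return FIXES.get(fine_tag, default_mapping.get(fine_tag, "X"))
```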
42 3.2 Wiktionary to Universal tags: There are a total of 330 distinct POS-type tags in the Wiktionary across all languages, which we have mapped to the Universal tagset. [sent-104, score-0.311]
43 Most of the mapping was straightforward since the tags used in the Wiktionary are in fact close to the Universal tag set. [sent-105, score-0.282]
44 We also mapped relatively rare tags such as “Interjection” and “Symbol” to the “X” tag. [sent-109, score-0.12]
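A minimal sketch of this normalization step, assuming each Wiktionary entry exposes its POS labels as strings; the table shown covers only a handful of the 330 labels mentioned above, with unknown or rare labels folded into X as described:

```python
# Illustrative subset of the Wiktionary-to-Universal mapping; the label
# strings are examples of the kinds of entries named in the text, not the
# paper's full table.
WIK_TO_UNIVERSAL = {
    "Noun": "NOUN", "Verb": "VERB", "Adjective": "ADJ", "Adverb": "ADV",
    "Pronoun": "PRON", "Determiner": "DET", "Preposition": "ADP",
    "Numeral": "NUM", "Conjunction": "CONJ", "Particle": "PRT",
    "Interjection": "X", "Symbol": "X",  # rare tags folded into X
}

def universal_tag_set(wiktionary_labels):
    """Map a word's Wiktionary POS labels to a Universal tag set."""
    return {WIK_TO_UNIVERSAL.get(label, "X") for label in wiktionary_labels}
```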
45 3.3 Wiktionary coverage: There are two kinds of coverage of interest: type coverage and token coverage. [sent-115, score-0.479]
46 We define type coverage as the proportion of word types in the corpus that simply appear in the Wiktionary (the accuracy of the tag sets is considered in the next subsection). [sent-116, score-0.37]
47 Token coverage is defined similarly as the portion of all word tokens in the corpus that appear in the Wiktionary. [sent-117, score-0.136]
48 These statistics reflect two aspects of the usefulness of a dictionary that affect learning in different ways: token coverage increases the density of supervised signal while type coverage increases the diversity of word shape supervision. [sent-118, score-0.574]
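Both statistics can be computed in a single pass over the corpus; a minimal sketch, where lowercasing stands in for whatever word-type normalization is used (the text does not spell this out):

```python
from collections import Counter

def dictionary_coverage(corpus_tokens, dictionary_types):
    """Type and token coverage of a dictionary over a corpus.

    corpus_tokens: iterable of word tokens (strings).
    dictionary_types: set of word types present in the dictionary.
    """
    counts = Counter(w.lower() for w in corpus_tokens)
    covered = [w for w in counts if w in dictionary_types]
    type_coverage = len(covered) / len(counts)
    token_coverage = sum(counts[w] for w in covered) / sum(counts.values())
    return type_coverage, token_coverage
```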
49 At one extreme, with 100% word and token coverage, we recover the POS tag disambiguation scenario and, on the other extreme of 0% coverage, we recover the unsupervised POS induction scenario. [sent-119, score-0.294]
50 The type and token coverage of Wiktionary for each of the languages we are using for evaluation is shown in Figure 2. [sent-120, score-0.296]
51 We plot the coverage bar for three different versions of Wiktionary (v20100326, v20110321, v20120320), arranged chronologically. [sent-121, score-0.168]
52 As expected, the newer versions of the Wiktionary generally have larger coverage both on the type level and the token level. [sent-123, score-0.239]
53 Nevertheless, even for languages whose type coverage is relatively low, such as Greek (el), the token level coverage is still quite good (more than half of the tokens are covered). [sent-124, score-0.432]
54 This trend is even more evident when we break up the coverage by frequency of the words. [sent-126, score-0.136]
55 Figure 2: Type-level (top) and token-level (bottom) coverage for the nine languages in three versions of the Wiktionary. [sent-128, score-0.153]
56 We also compared the coverage provided by the Wiktionary versus a dictionary extracted from the Penn Treebank (PTB) on the Brown corpus. [sent-129, score-0.317]
57 Figure 4 shows that the Wiktionary provides greater coverage for all sections of the Brown corpus, and is hence a better dictionary for tagging English text in general. [sent-130, score-0.372]
58 This is also reflected in the gain in accuracy on Brown over the taggers learned from the PTB dictionary in our experiments. [sent-131, score-0.29]
59 3.4 Wiktionary accuracy: A more refined notion of quality is the accuracy of the tag sets for covered words, as measured against dictionaries extracted from labeled tree bank corpora. [sent-133, score-0.559]
60 We consider word types that are in both the Wiktionary (W) and the tree bank dictionaries (T). [sent-134, score-0.207]
61 For each word type, we compare the two tag sets and distinguish five different possibilities: 1. [sent-135, score-0.203]
62 discard entry. Table 1: Examples of constructing Universal POS tag sets from the Wiktionary. [sent-145, score-0.203]
63 Figure 3: Word type coverage by normalized frequency: words are grouped by word count / highest word count ratio: low [0, 0. [sent-146, score-0.167]
64 Most of the tag sets (around 90%) in the Wiktionary are identical to or supersets of the tree bank tag sets for our nine languages, which is surprisingly accurate. [sent-154, score-0.602]
65 About 10% of the Wiktionary tag sets are subsets of, partially overlapping with, or disjoint from the tree bank tag sets. [sent-155, score-0.571]
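The five relations between a word's Wiktionary tag set W and its tree bank tag set T reduce to simple set comparisons; a sketch (function name ours):

```python
def tagset_relation(W, T):
    """Classify the relation between a word's Wiktionary tag set W and
    its tree bank tag set T into the five cases discussed above."""
    W, T = set(W), set(T)
    if W == T:
        return "identical"
    if W > T:
        return "superset"   # W strictly contains T
    if W < T:
        return "subset"     # W strictly contained in T
    if W & T:
        return "overlap"    # partial overlap
    return "disjoint"
```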
66 Our learning methods, which assume the given tag sets are correct, may be somewhat hurt by these word types, as we discuss in Section 5. [sent-156, score-0.203]
67 We also used feature-based max-ent emission models with both (HMM-ME and SHMM-ME). [sent-159, score-0.1]
68 Below, we denote the sequence of words in a sentence as boldface x and the sequence of hidden states which correspond to part-of-speech tags as boldface y. [sent-160, score-0.19]
69 To simplify notation, we assume that every tag sequence is prefixed with start tags. [sent-161, score-0.203]
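Under this notation, the HMM factorizes the joint probability into transition and emission terms; the following sketch scores a tagged sentence, with `p_trans` and `p_emit` assumed to be plain probability lookups rather than the paper's exact parameterization:

```python
import math

def log_joint(words, tags, p_trans, p_emit, order=2, start="<s>"):
    """Log p(x, y) for an HMM whose tag sequence is prefixed with start
    symbols; order=2 gives the second-order (SHMM) factorization
    p_t(y_i | y_{i-1}, y_{i-2}) * p_o(x_i | y_i)."""
    padded = [start] * order + list(tags)
    logp = 0.0
    for i, (x, y) in enumerate(zip(words, tags)):
        context = tuple(padded[i:i + order])     # previous `order` tags
        logp += math.log(p_trans[(context, y)])  # transition term
        logp += math.log(p_emit[(y, x)])         # emission term
    return logp
```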
70 Figure 4: Wiktionary type coverage across sections of the Brown corpus. [sent-168, score-0.19]
71 In this work, we compare multinomial and maximum entropy (log-linear) emission models. [sent-171, score-0.168]
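A maximum entropy emission defines p_o(x | y) as a log-linear model over word features, normalized across the vocabulary; a minimal sketch in which the feature function is a stand-in for the word-shape features such models typically use (names ours):

```python
import math

def me_emission_probs(theta, feature_fn, tag, vocab):
    """p_o(x | y) proportional to exp(theta . f(x, y)), normalized over
    the vocabulary; theta maps feature names to weights and feature_fn
    returns a dict of feature names to values."""
    def score(word):
        return sum(theta.get(name, 0.0) * value
                   for name, value in feature_fn(word, tag).items())
    exp_scores = {w: math.exp(score(w)) for w in vocab}
    z = sum(exp_scores.values())
    return {w: s / z for w, s in exp_scores.items()}
```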
72 Around 90% of the Wiktionary tag sets are identical to or subsume the tree bank tag sets. [sent-176, score-0.57]
73 For each tag y, the observation probabilities po(x | y) were initialized randomly for every word type that is allowed to take tag y according to the Wiktionary, and zero otherwise. [sent-185, score-0.437]
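A sketch of this initialization, with `allowed` mapping each word type to its Wiktionary tag set (names ours):

```python
import random

def init_emissions(vocab, tags, allowed):
    """Random emission initialization restricted by the dictionary:
    p_o(x | y) gets random mass only for word types whose Wiktionary
    tag set contains y, and zero otherwise."""
    emissions = {}
    for y in tags:
        weights = {x: (random.random() if y in allowed.get(x, ()) else 0.0)
                   for x in vocab}
        z = sum(weights.values()) or 1.0  # guard against an empty tag
        emissions[y] = {x: w / z for x, w in weights.items()}
    return emissions
```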
74 We found that EM achieved higher accuracy across languages compared to the direct gradient approach (Berg-Kirkpatrick et al. [sent-188, score-0.152]
75 We trained the SHMM-ME model with a dictionary built from the training and test tree bank (All TBD) and also with the tree bank dictionary intersected with the Wiktionary (Covered TBD). [sent-197, score-0.64]
76 The Covered TBD dictionary is more supervised than the Wiktionary in the sense that some of the tag set mismatches of the Wiktionary are cleaned using the true corpus tags. [sent-198, score-0.434]
77 As a lower bound we include the results for unsupervised systems: a regular HMM model trained with EM (Johnson, 2007) and an HMM model using a ME emission model trained using direct gradient (Berg-Kirkpatrick et al. [sent-203, score-0.151]
78 These approaches build a dictionary by transferring labeled data from a resource-rich language (English) to a resource-poor language (Das and Petrov, 2011). [sent-207, score-0.3]
79 The first, projection, builds a dictionary by directly transferring the POS tags. (Footnote: values for these systems were taken from the D&P paper.) [sent-209, score-0.275]
80 The second method, D&P, is the current state-of-the-art system, and runs label propagation on the dictionary resulting from the projection method. [sent-211, score-0.181]
81 We note that the results are not directly comparable since both the Unsupervised and the Bilingual results use a different setup, using the number of fine-grained tags for each language as hidden states instead of 12 (as we do). [sent-215, score-0.151]
82 The first two observations are that using the ME emission model always improves over the standard multinomial model, and using a second-order model always performs better. [sent-217, score-0.168]
83 The most common errors are due to tag set idiosyncrasies. [sent-219, score-0.203]
84 Other common mistakes for English include tagging “to” as an adposition (preposition) instead of a particle and tagging “which” as a pronoun instead of a determiner. [sent-221, score-0.142]
85 Finally, for English we also trained the SHMM-ME model using the Celex2 dictionary available from the LDC (catalog Id LDC96L14). [sent-223, score-0.181]
86 Celex2 coverage for the PTB corpus is much smaller than the coverage provided by the Wiktionary (43. [sent-224, score-0.272]
87 Wiktionary ambiguity: While many words overwhelmingly appear with one tag in a given genre, in the Wiktionary a large proportion of words are annotated with several tags, even when those are extremely rare events. [sent-232, score-0.203]
88 35% of word types in English have more than one tag according to the Wiktionary. [sent-238, score-0.234]
89 This increases the difficulty of predicting the correct tag as compared to having a corpus-based dictionary, where words have a smaller level of ambiguity. [sent-239, score-0.203]
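The ambiguity statistic behind the 35% figure quoted above is just the fraction of dictionary word types carrying more than one tag; a minimal sketch:

```python
def ambiguity_rate(tag_dict):
    """Fraction of word types with more than one tag in a dictionary;
    tag_dict maps word type -> set of tags."""
    ambiguous = sum(1 for tags in tag_dict.values() if len(tags) > 1)
    return ambiguous / len(tag_dict)
```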
90 One advantage of the Wiktionary is that it is a general-purpose dictionary and not tailored for a particular domain. [sent-247, score-0.181]
91 To illustrate this, we compared several models on the Brown corpus: the SHMM-ME model using the Wiktionary (Wik), a model trained using a dictionary extracted from the PTB corpus (PTBD), and a model trained fully supervised on the PTB corpus (PTB). [sent-248, score-0.259]
92 Unsupervised models are trained without a dictionary and use an oracle to map clusters to tags. [sent-299, score-0.26]
93 Bilingual systems are trained using a dictionary transferred from English into the target language using word alignments. [sent-300, score-0.181]
94 The Projection model uses a dictionary built directly from the part-of-speech projection. [sent-301, score-0.181]
95 The D&P model extends the Projection model dictionary by using label propagation. [sent-302, score-0.181]
96 Supervised models are trained using tree bank information with SHMM-ME: Covered TBD uses tree bank tag sets only for words that are also in the Wiktionary, and All TBD uses tree bank tag sets for all words. [sent-303, score-0.823]
97 In Section 3.4 we discussed the accuracy of the Wiktionary tag sets and, as Table 2 shows, a dictionary with better tag set quality generally (except for Greek) improves the POS tagging accuracy. [sent-311, score-0.682]
98 The largest source of error across languages is out-of-vocabulary (oov) word types, at around 45% of the errors, followed by tag set mismatch types (subset, overlap, disjoint), which together comprise another 50% of the errors. [sent-314, score-0.315]
99 The largest source of error across languages is out-of-vocabulary (oov) word types, followed by tag set mismatch types: subset, overlap, disjoint. [sent-317, score-0.315]
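This breakdown can be reproduced by bucketing each mistagged token by the dictionary status of its word type; a diagnostic sketch (names ours; `wik` and `treebank` map word types to tag sets):

```python
def error_bucket(word, wik, treebank):
    """Assign a mistagged token to one of the error categories above."""
    if word not in wik:
        return "oov"
    W, T = set(wik[word]), set(treebank.get(word, ()))
    if not W & T:
        return "disjoint"
    if W < T:
        return "subset"
    if W != T and not W >= T:
        return "overlap"    # tag sets intersect, neither contains the other
    return "other"          # identical or superset tag sets
```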
100 Weakly supervised part-of-speech tagging for Chinese using label propagation. [sent-384, score-0.105]
wordName wordTfidf (topN-words)
[('wiktionary', 0.784), ('tag', 0.203), ('dictionary', 0.181), ('tbd', 0.147), ('coverage', 0.136), ('ptb', 0.105), ('bank', 0.104), ('emission', 0.1), ('gra', 0.096), ('languages', 0.089), ('universal', 0.079), ('tags', 0.079), ('taggers', 0.069), ('dictionaries', 0.068), ('yi', 0.068), ('pos', 0.065), ('num', 0.058), ('brown', 0.055), ('ptbd', 0.055), ('punc', 0.055), ('shmm', 0.055), ('wik', 0.055), ('tagging', 0.055), ('hmm', 0.054), ('unsupervised', 0.051), ('supervised', 0.05), ('oov', 0.049), ('bilingual', 0.048), ('numeral', 0.047), ('bergkirkpatrick', 0.047), ('toutanova', 0.047), ('hidden', 0.047), ('covered', 0.045), ('po', 0.044), ('ca', 0.043), ('arxiv', 0.042), ('mapped', 0.041), ('token', 0.04), ('accuracy', 0.04), ('superset', 0.04), ('goldwater', 0.039), ('parallel', 0.038), ('cardinal', 0.037), ('chesley', 0.037), ('krizhanovsky', 0.037), ('shmmme', 0.037), ('entropy', 0.037), ('johnson', 0.036), ('tree', 0.035), ('ao', 0.034), ('penn', 0.033), ('tagged', 0.033), ('petrov', 0.033), ('resource', 0.033), ('versions', 0.032), ('nine', 0.032), ('growing', 0.032), ('boldface', 0.032), ('scarcity', 0.032), ('lamar', 0.032), ('adposition', 0.032), ('prt', 0.032), ('preprint', 0.032), ('das', 0.031), ('multinomial', 0.031), ('type', 0.031), ('cat', 0.031), ('freely', 0.031), ('haghighi', 0.03), ('navarro', 0.029), ('ldc', 0.029), ('hyphen', 0.029), ('uller', 0.029), ('transferring', 0.029), ('fully', 0.028), ('ganchev', 0.028), ('markov', 0.028), ('overlap', 0.027), ('disjoint', 0.026), ('greek', 0.026), ('ry', 0.026), ('collaborative', 0.026), ('merialdo', 0.026), ('categories', 0.026), ('projection', 0.026), ('identical', 0.025), ('handful', 0.025), ('abeill', 0.025), ('snyder', 0.025), ('axis', 0.025), ('grained', 0.025), ('labeled', 0.024), ('code', 0.024), ('em', 0.024), ('upenn', 0.023), ('ordinal', 0.023), ('det', 0.023), ('griffiths', 0.023), ('shen', 0.023), ('across', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds the best unsupervised and parallel text methods. We achieve the highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using the fully supervised Penn Treebank.
2 0.24991457 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Author: Dan Garrette ; Jason Baldridge
Abstract: Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
3 0.1665711 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the context of multilingual parsing.
4 0.14173771 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
Author: Greg Durrett ; Adam Pauls ; Dan Klein
Abstract: We consider the problem of using a bilingual dictionary to transfer lexico-syntactic information from a resource-rich source language to a resource-poor target language. In contrast to past work that used bitexts to transfer analyses of specific sentences at the token level, we instead use features to transfer the behavior of words at a type level. In a discriminative dependency parsing framework, our approach produces gains across a range of target languages, using two different lowresource training methodologies (one weakly supervised and one indirectly supervised) and two different dictionary sources (one manually constructed and one automatically constructed).
5 0.092467159 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
Author: Jiayi Zhao ; Xipeng Qiu ; Shu Zhang ; Feng Ji ; Xuanjing Huang
Abstract: In modern Chinese articles or conversations, it is very popular to involve a few English words, especially in emails and Internet literature. Therefore, it becomes an important and challenging topic to analyze Chinese-English mixed texts. The underlying problem is how to tag part-of-speech (POS) for the English words involved. Due to the lack of specially annotated corpus, most of the English words are tagged as the oversimplified type, “foreign words”. In this paper, we present a method using dynamic features to tag POS of mixed texts. Experiments show that our method achieves higher performance than traditional sequence labeling methods. Meanwhile, our method also boosts the performance of POS tagging for pure Chinese texts.
6 0.092280366 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing
7 0.091020495 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
8 0.087321952 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
9 0.080248617 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
10 0.064903095 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
11 0.064549193 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
12 0.064037919 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
13 0.057182323 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
14 0.054450508 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
15 0.049669519 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
16 0.04936979 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing
17 0.047492709 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
18 0.04569668 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables
19 0.044600051 131 emnlp-2012-Unified Dependency Parsing of Chinese Morphological and Syntactic Structures
20 0.044112407 105 emnlp-2012-Parser Showdown at the Wall Street Corral: An Empirical Investigation of Error Types in Parser Output
topicId topicWeight
[(0, 0.194), (1, -0.096), (2, 0.142), (3, -0.022), (4, 0.05), (5, 0.116), (6, 0.082), (7, 0.014), (8, 0.095), (9, -0.254), (10, 0.178), (11, -0.017), (12, 0.052), (13, -0.027), (14, -0.225), (15, -0.197), (16, 0.007), (17, -0.032), (18, -0.022), (19, 0.026), (20, 0.175), (21, -0.097), (22, 0.05), (23, 0.051), (24, 0.098), (25, 0.027), (26, -0.149), (27, 0.014), (28, 0.057), (29, -0.092), (30, -0.07), (31, 0.114), (32, -0.093), (33, 0.069), (34, 0.069), (35, 0.005), (36, 0.051), (37, -0.125), (38, 0.156), (39, -0.199), (40, 0.085), (41, -0.069), (42, 0.018), (43, -0.085), (44, -0.004), (45, -0.063), (46, 0.014), (47, -0.046), (48, 0.022), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.95443517 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds the best unsupervised and parallel text methods. We achieve the highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using the fully supervised Penn Treebank.
2 0.89128363 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Author: Dan Garrette ; Jason Baldridge
Abstract: Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
3 0.67848247 81 emnlp-2012-Learning to Map into a Universal POS Tagset
Author: Yuan Zhang ; Roi Reichart ; Regina Barzilay ; Amir Globerson
Abstract: We present an automatic method for mapping language-specific part-of-speech tags to a set of universal tags. This unified representation plays a crucial role in cross-lingual syntactic transfer of multilingual dependency parsers. Until now, however, such conversion schemes have been created manually. Our central hypothesis is that a valid mapping yields POS annotations with coherent linguistic properties which are consistent across source and target languages. We encode this intuition in an objective function that captures a range of distributional and typological characteristics of the derived mapping. Given the exponential size of the mapping space, we propose a novel method for optimizing over soft mappings, and use entropy regularization to drive those towards hard mappings. Our results demonstrate that automatically induced mappings rival the quality of their manually designed counterparts when evaluated in the context of multilingual parsing.
4 0.46123546 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
Author: Greg Durrett ; Adam Pauls ; Dan Klein
Abstract: We consider the problem of using a bilingual dictionary to transfer lexico-syntactic information from a resource-rich source language to a resource-poor target language. In contrast to past work that used bitexts to transfer analyses of specific sentences at the token level, we instead use features to transfer the behavior of words at a type level. In a discriminative dependency parsing framework, our approach produces gains across a range of target languages, using two different lowresource training methodologies (one weakly supervised and one indirectly supervised) and two different dictionary sources (one manually constructed and one automatically constructed).
5 0.35561463 21 emnlp-2012-Assessment of ESL Learners' Syntactic Competence Based on Similarity Measures
Author: Su-Youn Yoon ; Suma Bhat
Abstract: This study presents a novel method that measures English language learners' syntactic competence towards improving automated speech scoring systems. In contrast to most previous studies which focus on the length of production units such as the mean length of clauses, we focused on capturing the differences in the distribution of morpho-syntactic features or grammatical expressions across proficiency. We estimated the syntactic competence through the use of corpus-based NLP techniques. Assuming that the range and sophistication of grammatical expressions can be captured by the distribution of Part-of-Speech (POS) tags, vector space models of POS tags were constructed. We use a large corpus of English learners' responses that are classified into four proficiency levels by human raters. Our proposed feature measures the similarity of a given response with the most proficient group and then estimates the learner's syntactic competence level. Widely outperforming the state-of-the-art measures of syntactic complexity, our method attained a significant correlation with human-rated scores. The correlation between human-rated scores and features based on manual transcription was 0.43 and the same based on ASR-hypothesis was slightly lower, 0.42. An important advantage of our method is its robustness against speech recognition errors not to mention the simplicity of feature generation that captures a reasonable set of learner-specific syntactic errors.
6 0.34022933 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
7 0.3160179 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
8 0.28142792 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
9 0.27794215 106 emnlp-2012-Part-of-Speech Tagging for Chinese-English Mixed Texts with Dynamic Features
10 0.25118184 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables
11 0.24939509 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets
12 0.24514319 12 emnlp-2012-A Transition-Based System for Joint Part-of-Speech Tagging and Labeled Non-Projective Dependency Parsing
13 0.24437273 64 emnlp-2012-Improved Parsing and POS Tagging Using Inter-Sentence Consistency Constraints
14 0.23912147 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing
15 0.23379037 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
16 0.21678777 46 emnlp-2012-Exploiting Reducibility in Unsupervised Dependency Parsing
17 0.21100043 29 emnlp-2012-Concurrent Acquisition of Word Meaning and Lexical Categories
18 0.19717214 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns
19 0.19310997 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling
20 0.18381487 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation
topicId topicWeight
[(2, 0.021), (11, 0.01), (16, 0.033), (25, 0.018), (29, 0.023), (34, 0.089), (39, 0.036), (41, 0.011), (45, 0.235), (60, 0.111), (63, 0.039), (64, 0.046), (65, 0.032), (70, 0.031), (73, 0.015), (74, 0.05), (76, 0.038), (80, 0.01), (86, 0.046), (95, 0.021)]
simIndex simValue paperId paperTitle
1 0.87615716 4 emnlp-2012-A Comparison of Vector-based Representations for Semantic Composition
Author: William Blacoe ; Mirella Lapata
Abstract: In this paper we address the problem of modeling compositional meaning for phrases and sentences using distributional methods. We experiment with several possible combinations of representation and composition, exhibiting varying degrees of sophistication. Some are shallow while others operate over syntactic structure, rely on parameter learning, or require access to very large corpora. We find that shallow approaches are as good as more computationally intensive alternatives with regards to two particular tests: (1) phrase similarity and (2) paraphrase detection. The sizes of the involved training corpora and the generated vectors are not as important as the fit between the meaning representation and compositional method.
2 0.81993264 135 emnlp-2012-Using Discourse Information for Paraphrase Extraction
Author: Michaela Regneri ; Rui Wang
Abstract: Previous work on paraphrase extraction using parallel or comparable corpora has generally not considered the documents’ discourse structure as a useful information source. We propose a novel method for collecting paraphrases relying on the sequential event order in the discourse, using multiple sequence alignment with a semantic similarity measure. We show that adding discourse information boosts the performance of sentence-level paraphrase acquisition, which consequently gives a tremendous advantage for extracting phraselevel paraphrase fragments from matched sentences. Our system beats an informed baseline by a margin of 50%.
3 0.80600989 130 emnlp-2012-Unambiguity Regularization for Unsupervised Learning of Probabilistic Grammars
Author: Kewei Tu ; Vasant Honavar
Abstract: We introduce a novel approach named unambiguity regularization for unsupervised learning of probabilistic natural language grammars. The approach is based on the observation that natural language is remarkably unambiguous in the sense that only a tiny portion of the large number of possible parses of a natural language sentence are syntactically valid. We incorporate an inductive bias into grammar learning in favor of grammars that lead to unambiguous parses on natural language sentences. The resulting family of algorithms includes the expectation-maximization algorithm (EM) and its variant, Viterbi EM, as well as a so-called softmax-EM algorithm. The softmax-EM algorithm can be implemented with a simple and computationally efficient extension to standard EM. In our experiments of unsupervised dependency grammar learning, we show that unambiguity regularization is beneficial to learning, and in combination with annealing (of the regularization strength) and sparsity priors it leads to improvement over the current state of the art.
same-paper 4 0.79451078 138 emnlp-2012-Wiki-ly Supervised Part-of-Speech Tagging
Author: Shen Li ; Joao Graca ; Ben Taskar
Abstract: Despite significant recent work, purely unsupervised techniques for part-of-speech (POS) tagging have not achieved the useful accuracies required by many language processing tasks. Use of parallel text between resource-rich and resource-poor languages is one source of weak supervision that significantly improves accuracy. However, parallel text is not always available and techniques for using it require multiple complex algorithmic steps. In this paper we show that we can build POS-taggers exceeding state-of-the-art bilingual methods by using simple hidden Markov models and a freely available and naturally growing resource, the Wiktionary. Across eight languages for which we have labeled data to evaluate results, we achieve accuracy that significantly exceeds the best unsupervised and parallel text methods. We achieve the highest accuracy reported for several languages and show that our approach yields better out-of-domain taggers than those trained using the fully supervised Penn Treebank.
5 0.64910626 129 emnlp-2012-Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Author: Dan Garrette ; Jason Baldridge
Abstract: Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
6 0.6206283 81 emnlp-2012-Learning to Map into a Universal POS Tagset
7 0.61819655 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon
8 0.61082441 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs
9 0.60830027 24 emnlp-2012-Biased Representation Learning for Domain Adaptation
10 0.60323119 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers
11 0.59683359 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis
12 0.59596759 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction
13 0.58928782 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation
14 0.58736354 52 emnlp-2012-Fast Large-Scale Approximate Graph Construction for NLP
15 0.58688325 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games
16 0.5854041 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context
17 0.58469391 118 emnlp-2012-Source Language Adaptation for Resource-Poor Machine Translation
18 0.58351278 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration
19 0.57888007 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation
20 0.57868373 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents