emnlp emnlp2010 emnlp2010-34 knowledge-graph by maker-knowledge-mining

34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation


Source: pdf

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. [sent-2, score-0.517]

2 This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. [sent-3, score-0.19]

3 In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. [sent-4, score-0.445]

4 This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. [sent-5, score-0.248]

5 We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective. [sent-6, score-0.484]

6 They have been applied to part-of-speech (POS) tagging in supervised (Brants, 2000), semi-supervised (Goldwater and Griffiths, 2007; Ravi and Knight, 2009) and unsupervised (Johnson, 2007) training scenarios. [sent-8, score-0.193]

7 , 2003) learning, there has been only limited work on unsupervised discriminative sequence models (e. [sent-10, score-0.151]

8 The tagging accuracy of purely unsupervised HMMs is far below that of supervised and semisupervised HMMs; this is unsurprising. [sent-14, score-0.193]

9 This also applies to linguistically motivated HMMs for recovering states and sequences that correspond more closely to those implicitly defined by linguists when they label sentences with parts-of-speech. [sent-19, score-0.245]

10 One way in which a basic HMM’s structure is a poor model for POS tagging is that there is no inherent distinction between (open-class) content words and (closed-class) function words. [sent-20, score-0.233]

11 The first, HMM+, is a very simple modification where two different hyperparameters are posited for content states and function states, respectively. [sent-22, score-0.568]

12 The other is the crouching Dirichlet, hidden Markov model (CDHMM), an extended HMM that captures this dichotomy based on the statistical evidence that comes from context. [sent-23, score-0.192]

13 Content states display greater variance across local context (e.g. [sent-24, score-0.347]

14 sentences, paragraphs, documents), and we capture this variance by adding a component to the model for content states that is based on latent Dirichlet allocation (Blei et al., 2003). [sent-26, score-0.423]

15 Both models are composite in that two distributions do not mix with each other. [sent-30, score-0.151]

16 Unlike the LDAHMM, the generation of content states is folded into the CDHMM process. [sent-31, score-0.347]

17 The CDHMM easily outperforms all other models, including HMM+, across three measures (accuracy, F-score, and variation of information) for unsupervised POS tagging on most data sets. [sent-35, score-0.219]

18 The state transitions are generated by Mult(δt) whose prior δt is generated by Dir(γ) with a symmetric hyperparameter γ. [sent-38, score-0.23]

19 Emissions are generated by Mult(ψt) with a prior ψt generated by Dir(ξ) with a symmetric hyperparameter ξ. [sent-41, score-0.22]

20 This is particularly appropriate in unsupervised POS tagging with regard to novel data since there won’t be a priori grounds for favoring certain distributions over others. [sent-44, score-0.238]
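
These draws are straightforward to express in code. Below is a minimal numpy sketch of the standard HMM's global draws; it is our illustration, not the authors' code, and the state count, vocabulary size, and hyperparameter values are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
T, W = 40, 5000          # number of states and vocabulary size (illustrative)
gamma, xi = 0.1, 0.1     # symmetric hyperparameters (placeholder values)

# One transition distribution per conditioning state and one emission
# distribution per state; a symmetric Dirichlet means every component of
# the concentration vector is the same scalar.
delta = rng.dirichlet(np.full(T, gamma), size=T)   # delta[t] ~ Dir(gamma)
psi = rng.dirichlet(np.full(W, xi), size=T)        # psi[t]   ~ Dir(xi)

# One step of the chain: next state from Mult(delta_t), word from Mult(psi_t).
t = rng.integers(T)
t_next = rng.choice(T, p=delta[t])
w = rng.choice(W, p=psi[t_next])
```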

21 There is considerable work on extensions to HMM-based unsupervised POS tagging (see §6), but here we concentrate on the LDAHMM (Griffiths et al., 2005). [sent-45, score-0.193]

22 The model is a composite of a probabilistic topic model and an HMM in which a single state is allocated for words generated from the topic model. [sent-47, score-0.318]

23 While the topic model component still uses the bag-of-words assumption, the joint model infers which words are more likely to carry topical content and which words are more likely to contribute to the local sequence. [sent-49, score-0.178]

24 Though many content states such as adjectives, verbs, and nouns can vary a great deal across documents, the topic state groups these words together. [sent-53, score-0.537]

25 This paper shows that a model that conflates the LDAHMM topics with content states can significantly improve POS tagging. [sent-55, score-0.347]

26 a document) and that certain tags have more peaked emission distributions—in a sequence model. [sent-67, score-0.249]

27 To do this, we define the crouching Dirichlet, hidden Markov model (CDHMM). [sent-68, score-0.158]

28 This model, like LDAHMM, captures items of high variance across contexts, but it does so without losing sequence information. (Footnote: We call our model a “crouching Dirichlet” model since it involves a Dirichlet prior that generates distributions for certain states as if it were “crouching” on the side.) [sent-69, score-0.46]

29 Observed word wi is dependent on hidden state ti. [sent-71, score-0.194]

30 The edge to transition prior δ is always activated. [sent-73, score-0.155]

31 We also define the HMM+, a simple adaptation of a basic HMM which accounts for the latter property by using different priors for emissions from content and function states. [sent-78, score-0.253]

32 Like the LDAHMM, the model is composite in that distributions over a single random variable are composed of several different distribution functions which depend on the value of the underlying variable. [sent-82, score-0.152]

33 T is a union of disjoint content states C and function states F. [sent-92, score-0.617]

34 In this composite model, the priors for the emission and transition for each step in the sequence depend on whether the state t at step i satisfies t∈C or t∈F. [sent-93, score-0.499]

35 If t∈C, the word emission is dependent on φ (the content word prior) and the state transition is dependent on θ (the “topic” prior) and δ (the transition prior). [sent-94, score-0.23]

36 If t∈F, the word emission probability is dependent on ψ (the function word prior) and the state transition on δ (again, the transition prior). [sent-95, score-0.405]

37 To elaborate, three prior distributions are defined globally for this model: (1) δt, the transition prior, such that p(tˆ|t, δt) = δtˆ|t; (2) ψt, the function word prior, such that p(w|t, ψt) = ψw|t; (3) φt, the content word prior, such that p(w|t, φt) = φw|t. [sent-98, score-0.531]

38 Locally for each context d (documents in our case), we define θd, the topic prior, such that p(t|θd) = θt|d for t∈C. [sent-99, score-0.144]

39 1. For each state t∈T: (a) draw a distribution over states δt ∼ Dir(γ); (b) if t∈C, draw a distribution over words φt ∼ Dir(β); (c) if t∈F, draw a distribution over words ψt ∼ Dir(ξ). [sent-101, score-0.53]

40 2. For each context d: (a) draw a distribution θd ∼ Dir(α) over states t∈C; (b) for each word wi in d: i. draw a state ti from δti−1 ◦ θd. [sent-102, score-0.325]

41 ii. if ti∈C, then draw wi from φti, else draw wi from ψti. For each context d, we draw a prior distribution θd—formally identical to the LDA topic prior—that is defined only for the states t∈C. [sent-104, score-0.63]

42 The state ti at each word is then drawn from δti−1 ◦ θd, where we have defined the vector-valued operation ◦ as follows: (δti−1 ◦ θd)ti = (1/Z1) δti|ti−1 · θti|d if ti∈C, and (1/Z1) δti|ti−1 if ti∈F, where Z1 is the normalizing constant and (δt ◦ θd)ti is the element corresponding to state ti in the vector δt ◦ θd. [sent-106, score-0.272]
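
To make the generative story and the ◦ composition concrete, here is a hedged numpy sketch of steps 1 and 2. Everything except |C| = 5 (the setting used in the experiments below) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
T, W = 40, 5000                                   # states, vocabulary (illustrative)
C = np.zeros(T, dtype=bool); C[:5] = True         # five content states, as in the experiments
alpha, beta, xi, gamma = 1.0, 0.1, 0.0001, 0.1    # placeholder hyperparameter values

# Step 1: global draws.
delta = rng.dirichlet(np.full(T, gamma), size=T)           # delta[t] = transitions out of t
emit = np.empty((T, W))
emit[C] = rng.dirichlet(np.full(W, beta), size=C.sum())    # phi_t for content states
emit[~C] = rng.dirichlet(np.full(W, xi), size=(~C).sum())  # psi_t for function states

def compose(delta_row, theta_d):
    """(delta_{t_{i-1}} o theta_d): multiply the local topic prior into the
    content-state entries, leave function-state entries alone, renormalize."""
    v = delta_row.copy()
    v[C] *= theta_d                    # theta_d is defined only over content states
    return v / v.sum()

# Step 2: local draws for one context d, then the word sequence.
theta_d = rng.dirichlet(np.full(C.sum(), alpha))
t_prev = rng.integers(T)
for _ in range(10):                    # generate ten words of d
    t = rng.choice(T, p=compose(delta[t_prev], theta_d))
    wi = rng.choice(W, p=emit[t])
    t_prev = t
```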

43 p(ti|t−i, w) ∝ [(Nwi|ti + ξ)/(Nti + Wξ)] · (Nti|ti−1 + γ) · [(Nti+1|ti + I[ti−1 = ti = ti+1] + γ)/(Nti + I[ti−1 = ti] + Tγ)] for ti∈F, and p(ti|t−i, w) ∝ [(Nwi|ti + β)/(Nti + Wβ)] · [(Nti|d + α)/(Nd|C + |C|α − 1)] · (Nti|ti−1 + γ) · [(Nti+1|ti + I[ti−1 = ti = ti+1] + γ)/(Nti + I[ti−1 = ti] + Tγ)] for ti∈C. Figure 2: Conditional distribution for ti in the CDHMM. [sent-108, score-0.213]

44 The important thing to note is that the draw for states at each word is proportional to a composite of (a) the product of the individual elements of the topic and transition priors when ti∈C and (b) the transition priors when ti∈F. [sent-109, score-0.762]

45 The draw is proportional to the product of topic and transition priors when ti∈C because we have made a product of experts (PoE) factorization assumption (Hinton, 2002) for tractability and to reduce the size of our model. [sent-110, score-0.172]

46 Furthermore, this combination of a composite hidden state space with a product of experts assumption allows us to capture high variance for certain states. [sent-112, score-0.326]

47 To summarize, the CDHMM is a composite model where both the observed token and the hidden state variable are composite distributions. [sent-113, score-0.329]

48 For the hidden state, this means that there is a “topical” element with high variance across contexts that is embedded in the state sequence for a subset of events. [sent-114, score-0.281]

49 We embed this element through a PoE assumption where transitions into content states are modeled as a product of the transition probability and the local probability of the content state. [sent-115, score-0.57]

50 The default method is to sample Λ based on the posterior, then sample each ti based on the conditional distribution. [sent-129, score-0.27]

51 Another approach is to sample directly from the conditional distribution without sampling from the posterior since the conditional distribution incorporates the posterior through integration. [sent-130, score-0.224]

52 The full conditional distribution for tag transitions for the Gibbs sampler is given in Figure 2. [sent-132, score-0.199]

53 At each time step, we decrement all counts for the current value of ti, sample a new value for ti from a multinomial proportional to the conditional distribution and assign that value to ti. [sent-133, score-0.271]

54 β, ξ are the hyperparameters for the word emission priors of the content states and function states, respectively. [sent-134, score-0.698]

55 γ is the hyperparameter for the state transition priors. [sent-135, score-0.287]

56 α is the hyperparameter for the state prior given that it is in some context d. [sent-136, score-0.268]

57 Notation such as Nti|ti−1 refers to the counts of the events indicated by the subscript, minus the current token and tag under consideration. [sent-139, score-0.159]

58 Nti|ti−1 is the number of times ti has occurred after ti−1, minus the tag for wi. [sent-140, score-0.313]

59 Nwi|ti is the number of times wi has occurred with ti minus the current value. [sent-141, score-0.284]

60 Nti and Ndi are the counts for the given tag and document minus the current value. [sent-142, score-0.157]
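
Putting the counts and the reconstructed Figure 2 conditional together, a schematic collapsed-Gibbs update for one position could look as follows. This is our sketch, not the authors' sampler: the array and dictionary names are invented, boundary positions are ignored, and the indicator corrections follow the standard collapsed-HMM derivation:

```python
import numpy as np

def gibbs_resample(i, t, w, d, counts, hp, C, rng):
    """One collapsed-Gibbs update for tag t[i]. Sketch only: assumes
    0 < i < len(t) - 1 and hp = {'alpha', 'beta', 'xi', 'gamma'};
    C is a boolean mask marking the content states."""
    T, W = counts['emit'].shape
    old, wi, di = t[i], w[i], d[i]
    t_prev, t_next = t[i - 1], t[i + 1]

    # 1. Decrement every count that involves the current value of t[i].
    counts['emit'][old, wi] -= 1
    counts['tag'][old] -= 1
    counts['trans'][t_prev, old] -= 1
    counts['trans'][old, t_next] -= 1
    if C[old]:
        counts['doc'][di, old] -= 1

    # 2. Unnormalized conditional for every candidate tag k (vectorized).
    k = np.arange(T)
    em_hp = np.where(C, hp['beta'], hp['xi'])    # beta for content, xi for function
    emission = (counts['emit'][:, wi] + em_hp) / (counts['tag'] + W * em_hp)
    trans_in = counts['trans'][t_prev, :] + hp['gamma']
    trans_out = (counts['trans'][:, t_next]
                 + ((t_prev == k) & (k == t_next)) + hp['gamma']) \
        / (counts['tag'] + (t_prev == k) + T * hp['gamma'])
    nC = C.sum()
    topic = np.where(C, (counts['doc'][di] + hp['alpha'])
                     / (counts['doc'][di].sum() + nC * hp['alpha']), 1.0)
    p = emission * trans_in * trans_out * topic

    # 3. Sample the new value, assign it, and increment the counts.
    new = rng.choice(T, p=p / p.sum())
    t[i] = new
    counts['emit'][new, wi] += 1
    counts['tag'][new] += 1
    counts['trans'][t_prev, new] += 1
    counts['trans'][new, t_next] += 1
    if C[new]:
        counts['doc'][di, new] += 1
```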

61 1) is nearly identical to that of an HMM with the exception that conditional probabilities for a subset of the states—the content states—are local. [sent-144, score-0.163]

62 2 HMM+ The CDHMM explicitly posits two different types of states: function states and content states. [sent-147, score-0.372]

63 Having made this distinction, we can capture the difference in emission distributions for function and content states within an otherwise standard HMM in a very simple way: posit different hyperparameters for the two types. [sent-148, score-0.715]

64 One type has a small hyperparameter to model a sparse distribution for function words and the other has a relatively large hyperparameter to model a distribution with broader support. [sent-149, score-0.307]

65 The conditional distribution for tag transitions for this model is identical to that in fig. 2. [sent-152, score-0.173]
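
In code, the HMM+ change is a one-line substitution in a sampler like the one sketched above: replace the single symmetric emission hyperparameter with a per-state vector. Only ξ = 0.0001 and |C| = 5 are taken from the experiments; the β value is a placeholder:

```python
import numpy as np

T = 40
C = np.zeros(T, dtype=bool); C[:5] = True   # five content states, as in the experiments

# Standard HMM: one symmetric emission hyperparameter for every state.
em_hp_hmm = np.full(T, 0.1)                 # placeholder value

# HMM+: a small hyperparameter gives sparse emissions for function-word
# states; a larger one gives broader support for content-word states.
em_hp_hmmplus = np.where(C, 0.1, 0.0001)    # beta (placeholder) vs. xi = 0.0001
```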

66 Table 2: Number of tokens, documents, average tokens per document, and total tag types for each corpus (WSJ, Brown, Tiger, Floresta, Uspanteko). [sent-157, score-0.141]

67 • Accuracy: We use a greedy search algorithm to map each unsupervised tag t to a gold label such that accuracy is maximized. [sent-175, score-0.193]

68 We evaluate on a 1-to-1 mapping between unsupervised tags and gold labels, as well as many-to-1 (M-to-1), corresponding to the evaluation mappings used in Johnson (2007). [sent-176, score-0.146]

69 tp is the number of token pairs that share a tag in M as well as in G, fp is the number of token pairs that share the same tag in M but have different tags in G, and fn is the number of token pairs assigned a different tag in M but the same in G (Meila, 2007). [sent-244, score-0.467]
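
All three measures can be computed directly from the induced and gold tag sequences. The sketch below is ours; in particular, the greedy 1-to-1 mapping is the simple highest-count-first variant, which may differ in detail from the paper's search:

```python
import math
from collections import Counter

def many_to_one_accuracy(pred, gold):
    """M-to-1: map each induced tag to its most frequent gold label."""
    best = {}
    for (p, g), n in Counter(zip(pred, gold)).items():
        if n > best.get(p, (0, None))[0]:
            best[p] = (n, g)
    return sum(best[p][1] == g for p, g in zip(pred, gold)) / len(gold)

def one_to_one_accuracy(pred, gold):
    """Greedy 1-to-1: repeatedly take the highest-count unused (tag, label) pair."""
    mapping, used_gold = {}, set()
    for (p, g), _ in Counter(zip(pred, gold)).most_common():
        if p not in mapping and g not in used_gold:
            mapping[p] = g
            used_gold.add(g)
    return sum(mapping.get(p) == g for p, g in zip(pred, gold)) / len(gold)

def pairwise_f_score(pred, gold):
    """F-score over token pairs via cluster-size counts instead of the
    O(n^2) pair enumeration: tp+fp from pred clusters, tp+fn from gold."""
    pairs = lambda c: sum(n * (n - 1) // 2 for n in c.values())
    tp = pairs(Counter(zip(pred, gold)))
    sp, sg = pairs(Counter(pred)), pairs(Counter(gold))
    prec = tp / sp if sp else 0.0
    rec = tp / sg if sg else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def variation_of_information(pred, gold):
    """VI = H(pred) + H(gold) - 2 I(pred; gold); lower is better."""
    n = len(gold)
    cp, cg, joint = Counter(pred), Counter(gold), Counter(zip(pred, gold))
    h = lambda c: -sum(v / n * math.log(v / n) for v in c.values())
    mi = sum(v / n * math.log(v * n / (cp[p] * cg[g])) for (p, g), v in joint.items())
    return h(cp) + h(cg) - 2 * mi
```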

70 1 Models and Parameters We compare four different models: • HMM: a standard HMM. • HMM+: an HMM in which the hyperparameters for the word emissions are asymmetric, such that content states have different word emission priors compared to function states. [sent-252, score-0.757]

71 • LDAHMM: an HMM with a distinguished state that generates words from a topic model (Griffiths et al., 2005). • CDHMM, as defined above. [sent-253, score-0.164]

72 Figure 3: Averaged many-to-one accuracy on the full tagset for the four models (HMM, HMM+, LDAHMM, CDHMM) when the number of states is set at 20, 30, 40 and 50 states. [sent-254, score-0.273]

73 For all models, the transition hyperparameters γ are set to 0. [sent-256, score-0.203]

74 For the LDAHMM and HMM all emission hyperparameters are set to 0. [sent-258, score-0.259]

75 For the models that distinguish content and function states (HMM+, CDHMM), we fixed the number of content states at 5 and set the function state emission hyperparameters ξ = 0.0001. [sent-261, score-1.119]

76 The content state emission hyperparameters were set to β = 0. [sent-262, score-0.449]

77 For the models with an LDA or LDA-like component (LDAHMM, CDHMM), we set the topic or content-state hyperparameter α = 1. [sent-264, score-0.216]

78 For decoding, we use maximum posterior decoding to obtain a single sample after the required burnin, as has been done in other unsupervised HMM experiments. [sent-265, score-0.154]

79 This is striking because the non-CDHMM models have much higher standard deviation on other corpora but have sharply reduced standard deviation only for Uspanteko. [sent-279, score-0.14]

80 Experiments in this table use state sizes that correspond more closely to the size of the tag sets in the respective corpora. [sent-313, score-0.168]

81 Figure 3 shows the change in accuracy for the different models for different corpora when the overall number of states is varied between 20 and 50. [sent-318, score-0.273]

82 All models with the exception of HMM+ show improvements as the number of states is increased. [sent-320, score-0.304]

83 This brings up the valid concern (Clark, 2003; Johnson, 2007) that a model could posit a very large number of states and obtain high M-to-1 scores. [sent-321, score-0.284]

84 Looking at the results by number of states on VI and f-score for CDHMM (Figure 5), it is clear that Floresta displays the reverse pattern of all other data sets, where performance monotonically deteriorates as state sizes are increased. [sent-324, score-0.333]

85 We therefore wondered whether positing a state size that more closely approximated the size of the gold tag set performs better. [sent-326, score-0.194]

86 Since the discrepancy is greatest for Uspanteko and Floresta, we present tabulated results for experiments with state settings of 100 and 20 states respectively (table 3). [sent-327, score-0.333]

87 With the exception of VI (where lower is better) for Uspanteko, the scores generally improve when the model state size is closer to the gold size. [sent-328, score-0.145]

88 M-to-1 goes down for Floresta when 20 states are posited, but this is to be expected since this score is defined, to a certain extent, to do better with larger models.

89 Figure 5: f-score and VI for CDHMM by number of states. [sent-331, score-0.245]

90 This is not surprising for LDAHMM, since it has fifty topic parameters in addition to the number of states posited, and random initial conditions would have greater effect on the outcome than for the other models. [sent-336, score-0.321]

91 The model shares the large content emission hyperparameter β = 0. [sent-338, score-0.357]

92 At this point, it can only be assumed that the additional LDA component acts as a regularization factor for CDHMM and reduces the volatility of having a large emission hyperparameter. [sent-340, score-0.143]

93 Nonparametric HMMs for unsupervised POS tag induction (Snyder et al. [sent-368, score-0.2]

94 Though they did not concentrate on unsupervised methods, Haghighi and Klein (2006) conducted an unsupervised experiment that utilized certain token features (e. [sent-376, score-0.204]

95 (2007) is an interesting variant of unsupervised POS tagging where a parse tree is assumed and POS tags are induced from this structure non-parametrically. [sent-384, score-0.226]

96 (2005) is the work that inspired the current approach where a set of states is designated to capture variance across contexts. [sent-388, score-0.347]

97 As such, distinguishing between topic states such that they model different syntactic states was not attempted, and we have seen in sec. [sent-390, score-0.566]

98 one that sampled emission hyperparameters for each state rather than a single symmetric hyperparameter. [sent-402, score-0.387]

99 7 Conclusion We have shown that a hidden Markov model that allocates a subset of the states to have distributions conditioned on localized domains can significantly improve performance in unsupervised part-of-speech tagging. [sent-405, score-0.432]

100 We have also demonstrated that significant performance gains are possible simply by setting a different emission hyperparameter for a subgroup of the states. [sent-406, score-0.255]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cdhmm', 0.496), ('hmm', 0.36), ('ldahmm', 0.257), ('states', 0.245), ('uspanteko', 0.24), ('floresta', 0.188), ('ti', 0.184), ('emission', 0.143), ('hyperparameters', 0.116), ('hyperparameter', 0.112), ('hmms', 0.11), ('tagging', 0.106), ('crouching', 0.103), ('pos', 0.102), ('content', 0.102), ('tiger', 0.093), ('state', 0.088), ('unsupervised', 0.087), ('transition', 0.087), ('dir', 0.084), ('tag', 0.08), ('composite', 0.078), ('griffiths', 0.077), ('topic', 0.076), ('variance', 0.076), ('prior', 0.068), ('priors', 0.067), ('vi', 0.064), ('emissions', 0.059), ('reichart', 0.057), ('goldwater', 0.056), ('deviation', 0.056), ('draw', 0.055), ('dirichlet', 0.055), ('hidden', 0.055), ('mult', 0.054), ('mayan', 0.051), ('posited', 0.051), ('wi', 0.051), ('johnson', 0.05), ('minus', 0.049), ('distributions', 0.045), ('abend', 0.044), ('nti', 0.044), ('headden', 0.043), ('wsj', 0.043), ('tp', 0.041), ('symmetric', 0.04), ('graca', 0.04), ('portuguese', 0.04), ('posit', 0.039), ('brown', 0.039), ('posterior', 0.039), ('markov', 0.038), ('mcmc', 0.037), ('peaked', 0.037), ('sequence', 0.036), ('transitions', 0.034), ('afonso', 0.034), ('dichotomy', 0.034), ('endez', 0.034), ('meila', 0.034), ('pixabaj', 0.034), ('poe', 0.034), ('stopword', 0.034), ('teichert', 0.034), ('vicente', 0.034), ('fp', 0.034), ('gibbs', 0.033), ('induction', 0.033), ('proceedings', 0.033), ('tags', 0.033), ('closed', 0.031), ('exception', 0.031), ('ravi', 0.031), ('pages', 0.031), ('token', 0.03), ('conditional', 0.03), ('german', 0.03), ('distribution', 0.029), ('moon', 0.029), ('gael', 0.029), ('tii', 0.029), ('experts', 0.029), ('brants', 0.029), ('fn', 0.029), ('iii', 0.029), ('modification', 0.029), ('blei', 0.029), ('curves', 0.028), ('sample', 0.028), ('models', 0.028), ('document', 0.028), ('winning', 0.026), ('across', 0.026), ('gold', 0.026), ('sampler', 0.026), ('function', 0.025), ('distinctions', 0.025), ('erk', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

2 0.34117016 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

Author: Yoong Keok Lee ; Aria Haghighi ; Regina Barzilay

Abstract: Part-of-speech (POS) tag distributions are known to exhibit sparsity: a word is likely to take a single predominant tag in a corpus. Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy. However, in existing systems, this expansion comes with a steep increase in model complexity. This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments. In addition, this formulation results in a dramatic reduction in the number of model parameters, thereby enabling unusually rapid training. Our experiments consistently demonstrate that this model architecture yields substantial performance gains over more complex tagging counterparts. On several languages, we report performance exceeding that of more complex state-of-the-art systems.

3 0.17177151 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

4 0.17041594 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay

Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1

5 0.1373284 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

6 0.13014077 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

7 0.10092946 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

8 0.097838357 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

9 0.096775264 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

10 0.092952058 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

11 0.083292812 114 emnlp-2010-Unsupervised Parse Selection for HPSG

12 0.082706489 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

13 0.082634017 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

14 0.081382528 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

15 0.079857305 77 emnlp-2010-Measuring Distributional Similarity in Context

16 0.079280637 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

17 0.0700766 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

18 0.064710371 113 emnlp-2010-Unsupervised Induction of Tree Substitution Grammars for Dependency Parsing

19 0.060643002 81 emnlp-2010-Modeling Perspective Using Adaptor Grammars

20 0.056437533 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.243), (1, 0.195), (2, 0.033), (3, -0.357), (4, -0.123), (5, -0.051), (6, -0.201), (7, -0.011), (8, 0.175), (9, -0.0), (10, 0.098), (11, -0.033), (12, 0.071), (13, -0.053), (14, 0.064), (15, 0.138), (16, 0.079), (17, 0.061), (18, 0.076), (19, 0.138), (20, -0.167), (21, 0.014), (22, 0.195), (23, 0.038), (24, -0.149), (25, -0.04), (26, 0.055), (27, 0.059), (28, 0.11), (29, 0.009), (30, 0.013), (31, 0.047), (32, -0.043), (33, -0.079), (34, 0.007), (35, -0.057), (36, -0.024), (37, -0.105), (38, -0.037), (39, 0.079), (40, -0.01), (41, -0.039), (42, 0.057), (43, -0.033), (44, -0.027), (45, -0.013), (46, 0.032), (47, -0.062), (48, -0.016), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96512246 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

2 0.88437444 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

Author: Yoong Keok Lee ; Aria Haghighi ; Regina Barzilay

Abstract: Part-of-speech (POS) tag distributions are known to exhibit sparsity: a word is likely to take a single predominant tag in a corpus. Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy. However, in existing systems, this expansion comes with a steep increase in model complexity. This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments. In addition, this formulation results in a dramatic reduction in the number of model parameters, thereby enabling unusually rapid training. Our experiments consistently demonstrate that this model architecture yields substantial performance gains over more complex tagging counterparts. On several languages, we report performance exceeding that of more complex state-of-the-art systems.

3 0.64299649 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributional-only, fully unsupervised POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

4 0.57556546 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

5 0.44864127 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

Author: Christina Sauper ; Aria Haghighi ; Regina Barzilay

Abstract: In this paper, we investigate how modeling content structure can benefit text analysis applications such as extractive summarization and sentiment analysis. This follows the linguistic intuition that rich contextual information should be useful in these tasks. We present a framework which combines a supervised text analysis application with the induction of latent content structure. Both of these elements are learned jointly using the EM algorithm. The induced content structure is learned from a large unannotated corpus and biased by the underlying text analysis task. We demonstrate that exploiting content structure yields significant improvements over approaches that rely only on local context.1

6 0.38872147 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

7 0.34902093 24 emnlp-2010-Automatically Producing Plot Unit Representations for Narrative Text

8 0.32959631 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

9 0.31167558 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

10 0.30157262 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

11 0.29900783 3 emnlp-2010-A Fast Fertility Hidden Markov Model for Word Alignment Using MCMC

12 0.29853696 29 emnlp-2010-Combining Unsupervised and Supervised Alignments for MT: An Empirical Study

13 0.27668083 114 emnlp-2010-Unsupervised Parse Selection for HPSG

14 0.27191165 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

15 0.26694795 77 emnlp-2010-Measuring Distributional Similarity in Context

16 0.26266262 100 emnlp-2010-Staying Informed: Supervised and Semi-Supervised Multi-View Topical Analysis of Ideological Perspective

17 0.25205651 23 emnlp-2010-Automatic Keyphrase Extraction via Topic Decomposition

18 0.24923077 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

19 0.23034637 48 emnlp-2010-Exploiting Conversation Structure in Unsupervised Topic Segmentation for Emails

20 0.21065621 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.213), (10, 0.012), (12, 0.022), (29, 0.22), (30, 0.033), (32, 0.024), (52, 0.031), (56, 0.093), (62, 0.014), (66, 0.14), (72, 0.046), (76, 0.018), (79, 0.019), (83, 0.014), (87, 0.011), (89, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.86567324 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

Author: Rushin Shah ; Paramveer S. Dhillon ; Mark Liberman ; Dean Foster ; Mohamed Maamouri ; Lyle Ungar

Abstract: We describe a model for the lexical analysis of Arabic text, using the lists of alternatives supplied by a broad-coverage morphological analyzer, SAMA, which include stable lemma IDs that correspond to combinations of broad word sense categories and POS tags. We break down each of the hundreds of thousands of possible lexical labels into its constituent elements, including lemma ID and part-of-speech. Features are computed for each lexical token based on its local and document-level context and used in a novel, simple, and highly efficient two-stage supervised machine learning algorithm that overcomes the extreme sparsity of label distribution in the training data. The resulting system achieves accuracy of 90.6% for its first choice, and 96.2% for its top two choices, in selecting among the alternatives provided by the SAMA lexical analyzer. We have successfully used this system in applications such as an online reading helper for intermediate learners of the Arabic language, and a tool for improving the productivity of Arabic Treebank annotators.
wind/N s/PLURAL; 3) Lemma analysis, which involves recognizing that the stem wind in ‘winds’ might be any of the four lemmas listed above – perhaps with a further listing of senses or other sub-entries for each of them; and 4) Lemma disambiguation, deciding, for example, that + + + ProceMedITin,g Ms oasfs thaceh 2u0se1t0ts C,o UnSfAer,e n9c-e11 on O Ectmobpeir ic 2a0l1 M0.e ?tc ho2d0s10 in A Nsastouciraatlio Lnan fogru Cagoem Ppruotcaetisosninagl, L pinaggeusis 7t2ic5s–735, the phrase “the four winds” probably involves the lemma wind n1. Confusingly, the standard word-analysis tasks in computational linguistics involve various combinations of pieces of these logically-distinguished operations. Thus, “part of speech (POS) tagging” is mainly what we’ve called “morphological disambiguation”, except that it doesn’t necessarily require identifying the specific stems and affixes involved. In some cases, it also may require a small amount of “lemma disambiguation”, for example to distinguish a proper noun from a common noun. “Sense disambiguation” is basically a form of what we’ve called “lemma disambiguation”, except that the sense disambiguation task may assume that the part of speech is known, and may break down lexical identity more finely than our system happens to do. “Lemmatization” generally refers to a radically simplified form of “lemma analysis” and “lemma disambiguation”, where the goal is simply to collapse different inflected forms of any similarly-spelled stems, so that the strings ‘wind’, ‘winds’, ‘winded’, ‘winding’ will all be treated as instances of the same thing, without in fact making any attempt to determine the identity of “lemmas” in the traditional sense of dictionary entries. Linguists use the term morphology to include all aspects of lexical analysis under discussion here. But in most computational applications, “morphological analysis” does not include the disambiguation of lemmas, because most morphological analyzers do not reference a set of stable lemma IDs. So for the purposes of this paper, we will continue to discuss lemma analysis and disambiguation as conceptually distinct from morphological analysis and disambiguation, although, in fact, our system disambiguates both of these aspects of lexical analysis at the same time. The lexical analysis of textual character-strings is a more complex and consequential problem in Arabic than it is in English, for several reasons. First, Arabic inflectional morphology is more complex than English inflectional morphology is. Where an English verb has five basic forms, for example, an Arabic verb in principle may have dozens. Second, the Arabic orthographic system writes elements such as prepositions, articles, and possessive pronouns without setting them off by spaces, roughly 726 as if the English phrase “in a way” were written “inaway”. This leads to an enormous increase in the number of distinct “orthographic words”, and a substantial increase in ambiguity. Third, short vowels are normally omitted in Arabic text, roughly as if English “in a way” were written “nway”. As a result, a whitespace/punctuation-delimited letter-string in Arabic text typically has many more alternative analyses than a comparable English letter-string does, and these analyses have many more parts, drawn from a much larger vocabulary of form-classes. While an English “tagger” can specify the morphosyntactic status of a word by choosing from a few dozen tags, an equivalent level of detail in Arabic would require thousands of alternatives. 
Similarly, the number of lemmas that might play a role in a given letter-sequence is generally much larger in Arabic than in English. We start our labeling of Arabic text with the alternative analyses provided by SAMA v. 3.1, the Standard Arabic Morphological Analyzer (Maamouri et al., 2009). SAMA is an updated version of the earlier Buckwalter analyzers (Buckwalter, 2004), with a number of significant differences in analysis to make it compatible with the LDC Arabic Treebank 3-v3.2 (Maamouri et al., 2004). The input to SAMA is an Arabic orthographic word (a string of letters delimited by whitespace or punctuation), and the output of SAMA is a set of alternative analyses, as shown in Table 1. For a typical word, SAMA produces approximately a dozen alternative analyses, but for certain highly ambiguous words it can produce hundreds of alternatives. The SAMA analyzer has good coverage; for typical texts, the correct analysis of an orthographic word can be found somewhere in SAMA’s list of alternatives about 95% of the time. However, this broad coverage comes at a cost; the list of analytic alternatives must include a long Zipfian tail of rare or contextually-implausible analyses, which collectively are correct often enough to make a large contribution to the coverage statistics. Furthermore, SAMA’s long lists of alternative analyses are not evaluated or ordered in terms of overall or contextual plausibility. This makes the results less useful in most practical applications. Our goal is to rank these alternative analyses so that the correct answer is as near to the top of the list as possible. Despite some risk of confusion, we’ll refer to SAMA’s list of alternative analyses for an orthographic word as potential labels for that word. And despite a greater risk ofconfusion, we’ll refer to the assignment of probabilities to the set of SAMA labels for a particular Arabic word in a particular textual context as tagging, by analogy to the operation of a stochastic part-of-speech tagger, which similarly assigns probabilities to the set of labels available for a word in textual context. Although our algorithms have been developed for the particular case of Arabic and the particular set of lexical-analysis labels produced by SAMA, they should be applicable without modification to the sets of labels produced by any broad-coverage lexical analyzer for the orthographic words of any highlyinflected language. In choosing our approach, we have been moti- vated by two specific applications. One application aims to help learners of Arabic in reading text, by offering a choice of English glosses with associated Arabic morphological analyses and vocalizations. SAMA’s excellent coverage is an important basis for this help; but SAMA’s long, unranked list of alternative analyses for a particular letter-string, where many analyses may involve rare words or alternatives that are completely implausible in the context, will be confusing at best for a learner. It is much more helpful for the list to be ranked so that the correct answer is almost always near the top, and is usually one of the top two or three alternatives. In our second application, this same sort of ranking is also helpful for the linguistically expert native speakers who do Arabic Treebank analysis. These 727 annotators understand the text without difficulty, but find it time-consuming and fatiguing to scan a long list of rare or contextually-implausible alternatives for the correct SAMA output. 
Their work is faster and more accurate if they start with a list that is ranked accurately in order of contextual plausibility. Other applications are also possible, such as vocalization of Arabic text for text-to-speech synthesis, or lexical analysis for Arabic parsing. However, our initial goals have been to rank the list of SAMA outputs for human users. We note in passing that the existence of set of stable “lemma IDs” is an unusual feature of SAMA, which in our opinion ought to be emulated by approaches to lexical analysis in other languages. The lack of such stable lemma IDs has helped to disguise the fact that without lemma analysis and disambiguation, morphological analyses and disambiguation is only a partial solution to the problem of lexical analysis. In principle, it is obvious that lemma disambiguation and morphological disambiguation are mutually beneficial. If we know the answer to one of the questions, the other one is easier to answer. However, these two tasks require rather different sets of contextual features. Lemma disambiguation is similar to the problem of word-sense disambiguation on some definitions, they are identical and as a result, it benefits from paragraph-level and documentlevel bag-of-words attributes that help to character– – ize what the text is “about” and therefore which lemmas are more likely to play a role in it. In contrast, morphological disambiguation mainly depends on features of nearby words, which help to characterize how inflected forms of these lemmas might fit into local phrasal structures. 2 Problem and Methodology Consider a collection oftokens (observations), ti, referred to by index i∈ {1, . . . , n}, where each token fise raressdo tcoia bteyd i nwdiethx a s∈et { of p features, xij, efaocr hth teo k jethn feature, and a label, li, which is a combination of a lemma and a morphological analysis. We use indicator functions yik to indicate whether or not the kth label for the ith token is present. We represent the complete set of features and labels for the entire training data using matrix notation as X and Y , respectively. Our goal is to predict the label l (or equivalently, the vector y for a given feature vector x. A standard linear regression model of this problem would be y = xβ + ? (1) The standard linear regression estimate of β (ig- ×× × noring, for simplicity the fact that the ys are 0/1) is: βˆ = (XTtrainXtrain)−1XtTrainYtrain (2) where Ytrain is an n h matrix containing 0s and 1s indicating whise tahner n or nho mt aetarcixh coofn tthaien ihn possible labels is the correct label (li) for each of the n tokens ti, Xtrain is an n p matrix of context features for each of thei n tokens, pth mea ctoriexff oifcie cnotnst are p hs .f However, this is a large, sparse, multiple l hab.el problem, and the above formulation is neither statistically nor computationally efficient. Each observation (x, y) consists of thousands of features associated with thousands of potential labels, almost all of which are zero. Worse, the matrix of coefficients β, to be estimated is large (p h) and one should thus use some soatretd do ifs tr laarngsefe (pr learning dto o nshea srheo strength across the different labels. We present a novel principled and highly computationally efficient method of estimating this multilabel model. 
We use a two stage procedure, first using a subset (Xtrain1 , Ytrain1) of training data to give a fast approximate estimate of β; we then use a second smaller subset of the training data (Xtrain2, Ytrain2,) to “correct” these estimates in a eβˆx way that we will show can be viewed as a specialized shrinkage. Our first stage estimation approximates β, but avoids the expensive computa728 tion of (XTtrainXtrain)−1. Our second stage corrects (shrinks) these initial estimates in a manner specialized to this problem. The second stage takes advantage of the fact that we only need to consider those candidate labels produced by SAMA. Thus, only dozens of the thousands of possible labels are considered for each token. We now present our algorithm. We start with a corpus D of documents d of labeled Arabic text. As described above, each token, ti is associated with a set of features characterizing its context, computed from the other words in the same document, and a label, li = (lemmai, morphologyi), which is a combination of a lemma and a morphological analysis. As described below, we introduce a novel factorization of the morphology into 15 different components. Our estimation algorithm, shown in Algorithm 1, has two stages. We partition the training corpus into × two subsets, one of which (Xtrain1) is used to estimate the coefficients βs and the other of which (Xtrain2) is used to optimally “shrink” these coefficient estimates to reduce variance and prevent overfitting due to data sparsity. For the first stage of our estimation procedure, we simplify the estimate of the (β) matrix (Equation 2) to avoid the inversion of the very high dimensional (p p) matrix (XTX) by approximating (XTX) by (itps diagonal, Var(X), the inverse of which is trivial to compute; i.e. we estimate β using βˆ = Var(Xtrain1)−1XtTrain1Ytrain1 (3) For the second stage, we assume that the coefficients for each feature can be shrunk differently, but that coefficients for each feature should be shrunk the same regardless of what label they are predicting. Thus, for a given observation we predict: ˆgik=Xpwjβˆjkxij (4) Xj=1 where the weights wj indicate how much to shrink each of the p features. In practice, we fold the variance of each of the j features into the weight, giving a slightly modified equation: ˆgik=Xj=p1αjβj∗kxij (5) where β∗ = XtTrain1Ytrain1 is just a matrix of the counts of how often each context feature shows up with each label in the first training set. The vector α, which we will estimate by regression, is just the shrinkage weights w rescaled by the feature variance. Note that the formation here is different from the first stage. Instead of having each observation be a token, we now let each observation be a (token, label) pair, but only include those labels that were output by SAMA. For a given token ti and potential label lk, our goal is to approximate the indicator function g(i, k), which is 1 if the kth label of token ti is present, and 0 otherwise. We find candidate labels using a morphological analyzer (namely SAMA), which returns a set of possible candidate labels, say C(t), for each Arabic token t. Our pre- dicted label for ti is then argmaxk∈C(ti)g(i, k). The regression model for learning tthe weights αj in the second stage thus has a row for each label g(i, k) associated with a SAMA candidate for each token i = ntrain1+1 . . . ntrain2 in the second training set. The value of g(i, k) is predicted as a function of the feature vector zijk = βj∗kxij. 
The shrinkage coefficients, αj, could be estimated from theory, using a version of James-Stein shrinkage (James and Stein, 1961), but in practice, superior results are obtained by estimating them empirically. Since there are only p of them (unlike the p ∗ h βs), a relatively asmreal oln training sheetm mis ( usunflfi kceie tnhte. Wp ∗e hfou βnsd), that regression-SVMs work slightly better than linear regression and significantly better than standard classification SVMs for this problem. Prediction is then done in the obvious way by taking the tokens in a test corpus Dtest, generating context features and candidate SAMA labels for each token ti, and selected the candidate label with the highest score ˆ g(i, k) that we set out to learn. More formally, The model parameters β∗ and α produced by the algorithm allow one to estimate the most likely label for a new token ti out of a set of can- didate labels C(ti) using kpred= argmaxk∈C(ti)jX=p1αjβj∗kxij (6) The most expensive part of the procedure is estimating β∗, which requires for each token in cor729 Algorithm 1 Training algorithm. Input: A training corpusDtrainof n observations (Xtrain, Ytrain) Partition Dtrain into two sets, D1 and D2, of sizes ntrain1 and ntrain2 = n − ntrain1 observations // Using D1, estimat=e β∗ βj∗k = Pin=tr1ain1 xijyik for the jth feature and kth label // Using D2, estimate αj // Generate new “features” Z and the true labels g(i, k) for each of the SAMA candidate labels for each of the tokens in D2 zijk = βj∗kxij for iin i= ntrain1 + 1...ntrain2 Estimate αj for the above (feature,label) pairs (zijk, g(i, k)) using Regression SVMs Output: α and β∗ pus D1, (a subset of D), finding the co-occurrence frequencies of each label element (a lemma, or a part of the morphological segmentation) with the target token and jointly with the token and with other tokens or characters in the context of the token of interest. For example, given an Arabic token, “yHlm”, we count what fraction of the time it is associated with each lemma (e.g. Halamu 1), count(lemma=Halam-u 1, token=yHlm) and each segment (e.g. “ya”), count(segment=ya, token=yHlm). (Of course, most tokens never show up with most lemmas or segments; this is not a problem.) We also find the base rates of the components of the labels (e.g., count(lemma=Halam-u 1), and what fraction of the time the label shows up in various contexts, e.g. count(lemma=Halam-u 1, previous token = yHlm). We describe these features in more detail below. 3 Features and Labels used for Training Our approach to tagging Arabic differs from conventional approaches in the two-part shrinkage-based method used, and in the choice of both features and labels used in our model. For features, we study both local context variables, as described above, and document-level word frequencies. For the labels, the key question is what labels are included and how they are factored. Standard “taggers” work by doing an n-way classification of all the alternatives, which is not feasible here due to the thousands of possible labels. Standard approaches such as Conditional Random Fields (CRFs) are intractable with so many labels. Moreover, few if any taggers do any lemma disambiguation; that is partly because one must start with some standard inventory of lemmas, which are not available for most languages, perhaps because the importance of lemma disambiguation has been underestimated. We make a couple of innovations to deal with these issues. First, we perform lemma disambiguation in addition to “tagging”. 
As mentioned above, lemmas and morphological information are not independent; the choice of lemma often influences morphology and vice versa. For example, Table 1 contains two analyses for the word qbl. For the first analysis, where the lemma is qabil-a 1 and the gloss is accept/receive/approve + he/it [verb], the word is a verb. However, for the second analysis, where the lemma is qabol 1 and the gloss is before, the word is a noun. Simultaneous lemma disambiguation and tagging introduces additional complexity: An analysis of ATB and SAMA shows that there are approximately 2,200 possible morphological analyses (“tags”) and 40,000 possible lemmas; even accounting for the fact that most combinations of lemmas and morphological analyses don’t occur, the size of the label space is still in the order of tens of thousands. To deal with data sparsity, our second innovation is to factor the labels. We factor each label linto a set of 16 label elements (LEs). These include lemmas, as well as morphological elements such as basic partof-speech, suffix, gender, number, mood, etc. These are explained in detail below. Thus, since each label l is a set of 15 categorical variables, each y in the first learning stage is actually a vector with 16 nonzero components and thousands of zeros. Since we do simultaneous estimation of the entire set of label elements, the value g(i, k) being predicted in the second learning phase is 1 if the entire label set is correct, and zero otherwise. We do not learn separate models for each label. 3.1 Label Elements (LEs) The fact that there are tens of thousands of possible labels presents the problem of extreme sparsity of label distribution in the training data. We find that a model that estimates coefficients β∗ to predict a sin730 data on basic POS include whether a noun is proper or common, whether a verb is transitive or not, etc. Both the basic POS and its suffix may have person, gender and number data. gle label (a label being in the Cartesian product of the set of label elements) yields poor performance. Therefore, as just mentioned, we factor each label l into a set of label elements (LEs), and learn the correlations β∗ between features and label elements, rather than features and entire label sets. This reduces, but does not come close to eliminating, the problem sparsity. A complete list of these LEs and their possible values is detailed in Table 2. 3.2 Features 3.2.1 Local Context Features We take (t, l) pairs from D2, and for each such pair generate features Z based on co-occurrence statistics β∗ in D1, as mentioned in Algorithm 2. These statistics include unigram co-occurrence frequencies of each label with the target token and bigram co-occurrence of the label with the token and with other tokens or characters in the context of the target token. We define them formally in Table 3. Let Zbaseline denote the set of all such basic features based on the local context statistics of the target token, namely the words and letters preceding and following it. We will use this set to create a baseline model. generate feature sets for our regression SVMs. For each label element (LE) e, we define a set of features Ze similar to Zbaseline; these features are based on co-occurrence frequencies of the particular LE e, not the entire label l. Finally, we define an aggregate feature set Zaggr as follows: Zaggr = Zbaseline [ {Ze} (7) where e ∈ {lemma, pre1, pre2, det, pos, dpos, suf, perpos, numpos, genpos, persuf, numsuf, gensuf, mood, pron}. 
3.2.2 Document Level Features

When trying to predict the lemma, it is useful to include not just the words and characters immediately adjacent to the target token, but also all the words in the document. These words capture the "topic" of the document and help to disambiguate different lemmas, which tend to be used or not used depending on the topic being discussed, similarly to the way that word sense disambiguation systems for English sometimes use the "bag of words" of the document to distinguish, for example, a "bank" for depositing money from a "bank" of a river. More precisely, we augment the features for each target token with the counts of each word in the document (the "term frequency" tf) in which the token occurs with a given label.

Zfull = Zaggr ∪ Ztf (8)

This set Zfull is our final feature set. We use Zfull to train an SVM model Mfull; this is our final predictive model.

3.3 Corpora used for Training and Testing

We use three modules of the Penn Arabic Treebank (ATB) (Maamouri et al., 2004), namely ATB1, ATB2 and ATB3, as our corpus of labeled Arabic text, D. Each ATB module is a collection of newswire data from a particular agency. ATB1 uses the Associated Press as a source, ATB2 uses Ummah, and ATB3 uses Annahar. D contains a total of 1,835 documents, accounting for approximately 350,000 words. We construct the training and testing sets Dtrain and Dtest from D using 10-fold cross validation, and we construct D1 and D2 from Dtrain by randomly performing a 9:1 split. As mentioned earlier, we use the SAMA morphological analyzer to obtain candidate labels C(t) for each token t while training and testing an SVM model on D2 and Dtest respectively. A sample output of SAMA is shown in Table 1. To improve coverage, we also add to C(t) all the labels l seen for t in D1:

C(t) = SAMA(t) ∪ {l | (t, l) ∈ D1} (9)

We find that doing so improves coverage to 98%. This is an upper bound on the accuracy of our model.

4 Results

We use two metrics of accuracy: A1, which measures the percentage of tokens for which the model assigns the highest score to the correct label or LE value (or E1 = 100 − A1, the corresponding percentage error), and A2, which measures the percentage of tokens for which the correct label or LE value is one of the two highest ranked choices returned by the model (or E2 = 100 − A2). We test our model Mfull on Dtest and achieve A1 and A2 scores of 90.6% and 96.2% respectively. The accuracy achieved by our Mfull model is, to the best of our knowledge, higher than prior approaches have been able to achieve so far for the problem of combined morphological and lemma disambiguation. This is all the more impressive considering that the upper bound on accuracy for our model is 98% because, as described above, our set of candidate labels is incomplete. In order to analyze how well different LEs can be predicted, we train an SVM model Me for each LE e using the feature set Ze, and test all such models on Dtest. The results for all the LEs are reported in the form of error percentages E1 and E2 in Table 4; the reported numbers are 10-fold cross-validation test accuracies, and no parameters have been tuned on them.
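Both the candidate-set construction of equation (9) and the A1/A2 metrics are easy to state in code. The sketch below assumes a callable SAMA wrapper and per-token ranked candidate lists; both interfaces are hypothetical.

```python
def candidate_set(token, sama, d1_labels):
    # Equation (9): C(t) = SAMA(t) ∪ {l | (t, l) ∈ D1}
    return set(sama(token)) | d1_labels.get(token, set())

def accuracy_metrics(ranked, gold):
    """ranked[i]: candidate labels for token i sorted by model score
    (best first); gold[i]: the correct label.
    Returns (A1, A2); E1 = 100 - A1 and E2 = 100 - A2."""
    n = len(gold)
    a1 = 100.0 * sum(r[0] == g for r, g in zip(ranked, gold)) / n
    a2 = 100.0 * sum(g in r[:2] for r, g in zip(ranked, gold)) / n
    return a1, a2
```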
A comparison of the results for Mfull with the results for Mlemma and Mpos is particularly informative. We see that Mfull is able to achieve a substantially lower E1 error score (9.4%) than Mlemma (11.1%) and Mpos (23.4%); in other words, we find that our full model is able to predict lemmas and basic parts-of-speech more accurately than the individual models for each of these elements. We examine the effect of varying the size of D2, i.e. the number of SVM training instances, on the performance of Mfull on Dtest, and find that with increasing sizes of D2, E1 drops only slightly, from 9.5% to 9.4%, and shows no improvement thereafter. We also find that the use of document-level features in Mlemma reduces the E1 and E2 percentages for Mlemma by 5.7% and 3.2% respectively.

4.1 Comparison to Alternate Approaches

4.1.1 Structured Prediction Models

Preliminary experiments showed that knowing the predicted labels (lemma + morphology) of the surrounding words can slightly improve the predictive accuracy of our model. To further investigate this effect, we tried running experiments using different structured models, namely CRFs (Conditional Random Fields) (Lafferty et al., 2001), (Structured) MIRA (Margin Infused Relaxed Algorithm) (Crammer et al., 2006) and the Structured Perceptron (Collins, 2002). We used linear-chain CRFs as implemented in the MALLET Toolbox (McCallum, 2001), and for Structured MIRA and the Perceptron we used the implementations from the EDLIN Toolbox (Ganchev and Georgiev, 2009). However, given the vast label space of our problem, running these methods proved infeasible. The time complexity of these methods scales badly with the number of labels; it took a week to train a linear-chain CRF for only ∼50 labels, and though MIRA and the Perceptron are online algorithms, they also become intractable beyond a few hundred labels. Since our label space contains combinations of lemmas and morphologies, even after factoring, the dimension of the label space is on the order of thousands. We also tried a naïve version (a two-pass approximation) of these structured models: in addition to the features in Zfull, we include the predicted labels for the tokens preceding and following the target token as features, as sketched below. This new model is not only slow to train, but also achieves only slightly lower error rates (1.2% lower E1 and 1.0% lower E2) than Mfull. This provides an upper bound on the benefit of using the more complex structured models, and suggests that, given their computational demands, our (unstructured) model Mfull is a better choice.
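A minimal sketch of this two-pass approximation, assuming generic extract_feats and predict helpers standing in for the Zfull feature extractor and Mfull:

```python
# Two-pass approximation of a structured tagger: pass one predicts every
# token independently; pass two re-predicts with the neighbors' first-pass
# labels added as extra features.
def two_pass_tag(tokens, extract_feats, predict):
    first = [predict(extract_feats(tokens, i)) for i in range(len(tokens))]
    second = []
    for i in range(len(tokens)):
        feats = dict(extract_feats(tokens, i))
        feats["prev_label"] = first[i - 1] if i > 0 else "<s>"
        feats["next_label"] = first[i + 1] if i + 1 < len(tokens) else "</s>"
        second.append(predict(feats))
    return second
```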
4.1.2 MADA

(Habash and Rambow, 2005) perform morphological disambiguation using a morphological analyzer. (Roth et al., 2008) augment this with lemma disambiguation; they call their system MADA. Our work differs from theirs in a number of respects. Firstly, they don't use the two-step regression procedure that we use. Secondly, they use only "unigram" features. Also, they do not learn a single model from a feature set based on labels and LEs; instead, they combine models for individual elements by using weighted agreement. We trained and tested MADA v2.32 using its full feature set on the same Dtrain and Dtest. We should point out that this is not an exact comparison, since MADA uses the older Buckwalter morphological analyzer. (A new version of MADA was released very close to the submission deadline for this conference.)

4.1.3 Other Alternatives

Unfactored Labels: To illustrate the benefit obtained by breaking down each label l into LEs, we contrast the performance of our Mfull model with an SVM model Mbaseline trained using only the feature set Zbaseline, which contains only features based on entire labels, not those based on individual LEs.

Independent lemma and morphology prediction: Another alternative approach is to predict lemmas and morphological analyses separately. We construct a feature set Zlemma0 = Zfull − Zlemma and train an SVM model Mlemma0 using this feature set. Labels are then predicted by simply combining the results predicted independently by Mlemma and Mlemma0. Let Mind denote this approach; a sketch follows below.

Unigram Features: Finally, we also consider a context-less approach, i.e. using only "unigram" features for labels as well as LEs. We call this feature set Zuni, and the corresponding SVM model Muni.

The results of these various models, along with those of Mfull, are summarized in Table 5. We see that Mfull has roughly half the error rate of the state-of-the-art MADA system. Note: the results reported are 10-fold cross-validation test accuracies, no parameters have been tuned on them, and we used the same train-test splits for all the datasets.
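For concreteness, a sketch of the independent-prediction variant Mind referenced above; the two model objects and their score method are hypothetical stand-ins for Mlemma and Mlemma0.

```python
# Independent lemma / morphology prediction (the M_ind variant): the lemma
# and the remaining label elements are predicted by separate models and the
# two argmax decisions are simply concatenated.
def predict_independent(m_lemma, m_morph, feats, candidates):
    # candidates: iterable of (lemma, morphology) pairs from SAMA
    lemmas = {l for l, _ in candidates}
    morphs = {m for _, m in candidates}
    best_lemma = max(lemmas, key=lambda l: m_lemma.score(feats, l))
    best_morph = max(morphs, key=lambda m: m_morph.score(feats, m))
    # Note: nothing forces (best_lemma, best_morph) to be a licensed pair,
    # one reason the joint model M_full can outperform this combination.
    return best_lemma, best_morph
```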
5 Related Work

(Hajic, 2000) show that for highly inflectional languages, the use of a morphological analyzer improves the accuracy of disambiguation. (Diab et al., 2004) perform tokenization, POS tagging and base phrase chunking using an SVM-based learner. (Ahmed and Nürnberger, 2008) perform word-sense disambiguation using a Naive Bayesian model and rely on parallel corpora and matching schemes instead of a morphological analyzer. (Kulick, 2010) performs simultaneous tokenization and part-of-speech tagging for Arabic by separating closed- and open-class items and focusing on the likelihood of possible stems of open-class words. (Mohamed and Kübler, 2010) present a hybrid method between word-based and segment-based POS tagging for Arabic and report good results. (Toutanova and Cherry, 2009) perform joint lemmatization and part-of-speech tagging for English, Bulgarian, Czech and Slovene, but they do not use the two-step estimation-shrinkage model described in this paper; nor do they factor labels. The idea of joint lemmatization and part-of-speech tagging has also been discussed in the context of Hungarian in (Kornai, 1994). A substantial amount of relevant work has been done previously for Hebrew. (Adler and Elhadad, 2006) perform Hebrew morphological disambiguation using an unsupervised morpheme-based HMM, but they report lower scores than those achieved by our model. Moreover, their analysis doesn't include lemma IDs, which is a novelty of our model. (Goldberg et al., 2008) extend the work of (Adler and Elhadad, 2006) by using an EM algorithm, and achieve an accuracy of 88% for full morphological analysis, but again, this does not include lemma IDs. To the best of our knowledge, there is no existing research for Hebrew that does what we did for Arabic, namely to use simultaneous lemma and morphological disambiguation to improve both. (Dinur et al., 2009) show that prepositions and function words can be accurately segmented using unsupervised methods. However, by using this method as a preprocessing step, we would lose the power of a simultaneous solution for these problems. Our method is closer in style to a CRF, giving much of the accuracy gains of a simultaneous solution, while being about four orders of magnitude easier to train. We believe that our use of factored labels is novel for the problem of simultaneous lemma and morphological disambiguation; however, (Smith et al., 2005) and (Hatori et al., 2008) have previously made use of features based on parts of labels in CRF models for morphological disambiguation and word-sense disambiguation respectively. Also, we note that there is a similarity between our two-stage machine learning approach and log-linear models in machine translation that break the data into two parts, estimating log-probabilities of generative models from one part and discriminatively re-weighting the models using the second part.

6 Conclusions

We introduced a new approach to accurately predicting labels consisting of both lemmas and morphological analyses for Arabic text. We obtained an accuracy of over 90%, substantially higher than current state-of-the-art systems. Key to our success is the factoring of labels into a lemma and a large set of morphosyntactic elements, and the use of an algorithm that computes a simple initial estimate of the coefficient relating each contextual feature to each label element (simply by counting co-occurrences) and then regularizes these estimates by shrinking each coefficient by an amount determined by supervised learning, using only the candidate label sets produced by SAMA. We also showed that using features of word n-grams is preferable to using features of only individual tokens of data. Finally, we showed that a model using a full feature set based on labels as well as factored components of labels, which we call label elements (LEs), works better than a model created by combining individual models for each LE. We believe that the approach we have used to create our model can be successfully applied not just to Arabic but also to other languages, such as Turkish, Hungarian and Finnish, that have highly inflectional morphology. The current accuracy of our model, which places the correct answer among the top two choices 96.2% of the time, is high enough to be highly useful for tasks such as aiding the manual annotation of Arabic text; more complete automation would require that accuracy for the single top choice.

Acknowledgments

We would like to thank everyone at the Linguistic Data Consortium, especially Christopher Cieri, David Graff, Seth Kulick, Ann Bies, Wajdi Zaghouani and Basma Bouziri, for their help. We also wish to thank the anonymous reviewers for their comments and suggestions.

References

Meni Adler and Michael Elhadad. 2006. An Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics.
Farag Ahmed and Andreas Nürnberger. 2008. Arabic/English Word Translation Disambiguation using Parallel Corpora and Matching Schemes. In Proceedings of EAMT'08, Hamburg, Germany.
Tim Buckwalter. 2004. Buckwalter Arabic Morphological Analyzer version 2.0.
Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of EMNLP'02.
Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online Passive-Aggressive Algorithms. Journal of Machine Learning Research, 7:551–585.
Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic Tagging of Arabic Text: From Raw Text to Base Phrase Chunks. In Proceedings of the 5th Meeting of the North American Chapter of the Association for Computational Linguistics / Human Language Technologies Conference (HLT-NAACL'04).
Elad Dinur, Dmitry Davidov, and Ari Rappoport. 2009. Unsupervised Concept Discovery in Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic. In Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages.
Kuzman Ganchev and Georgi Georgiev. 2009. Edlin: An Easy to Read Linear Learning Framework. In Proceedings of RANLP'09.
Yoav Goldberg, Meni Adler, and Michael Elhadad. 2008. EM Can Find Pretty Good HMM POS-Taggers (When Given a Good Start). In Proceedings of ACL'08.
Nizar Habash and Owen Rambow. 2005. Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop. In Proceedings of ACL'05, Ann Arbor, MI, USA.
Jan Hajic. 2000. Morphological Tagging: Data vs. Dictionaries. In Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL'00).
Jun Hatori, Yusuke Miyao, and Jun'ichi Tsujii. 2008. Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields. In Proceedings of COLING'08.
W. James and Charles Stein. 1961. Estimation with Quadratic Loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1.
András Kornai. 1994. On Hungarian Morphology (Linguistica, Series A: Studia et Dissertationes 14). Linguistics Institute of the Hungarian Academy of Sciences, Budapest.
Seth Kulick. 2010. Simultaneous Tokenization and Part-of-Speech Tagging for Arabic without a Morphological Analyzer. In Proceedings of ACL'10.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of ICML'01, pages 282–289.
Mohamed Maamouri, Ann Bies, and Tim Buckwalter. 2004. The Penn Arabic Treebank: Building a Large-Scale Annotated Arabic Corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools.
Mohamed Maamouri, David Graff, Basma Bouziri, Sondos Krouna, and Seth Kulick. 2009. LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0.
Andrew McCallum. 2001. MALLET: A Machine Learning for Language Toolkit. Software available at http://mallet.cs.umass.edu.
Emad Mohamed and Sandra Kübler. 2010. Arabic Part of Speech Tagging. In Proceedings of LREC'10.
Ryan Roth, Owen Rambow, Nizar Habash, Mona Diab, and Cynthia Rudin. 2008. Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of ACL'08, Columbus, Ohio, USA.
Noah A. Smith, David A. Smith, and Roy W. Tromble. 2005. Context-Based Morphological Disambiguation with Random Fields. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP).
Kristina Toutanova and Colin Cherry. 2009. A Global Model for Joint Lemmatization and Part-of-Speech Prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing, pages 486–494.
