emnlp emnlp2010 emnlp2010-71 knowledge-graph by maker-knowledge-mining

71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction


Source: pdf

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributional-only, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. [sent-8, score-0.287]

2 These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. [sent-9, score-0.196]

3 1 Introduction. Part-of-speech (POS) tagging is a fundamental natural-language-processing problem, and POS tags are used as input to many important applications. [sent-12, score-0.182]

4 Several authors addressed this gap using limited supervision, such as a dictionary of tags for each word (Goldwater and Griffiths, 2007; Ravi and Knight, 2009), or a list of word prototypes for each tag (Haghighi and Klein, 2006). [sent-15, score-0.146]

5 Even in light of all these advancements, there is still interest in a completely unsupervised method for POS induction for several reasons. [sent-16, score-0.128]

6 Third, while several widely used tag sets do exist, researchers do not agree upon any specific set of tags across languages or even within one language. [sent-19, score-0.146]

7 For these reasons, a fully unsupervised induction algorithm has both a practical and a theoretical value. [sent-24, score-0.161]

8 In the past decade, there has been a steady improvement on the completely unsupervised version of POS induction (Schütze, 1995; Clark, 2001; Clark, 2003; Johnson, 2007; Gao and Johnson, 2008; Graça et al. [sent-25, score-0.196]

9 These methods assign the same tag to all tokens of a word type, rather than attempting to disambiguate words in context. [sent-41, score-0.145]

10 As in all nondisambiguating distributional approaches, the goal, loosely stated, is to assign the same tag to words whose contexts in the corpus are similar. [sent-45, score-0.178]

11 Our approach, which we call Latent-Descriptor Clustering, or LDC, is an iterative algorithm, in the spirit of the K-means clustering algorithm and of the EM algorithm for the estimation of a mixture of Gaussians. [sent-46, score-0.291]

12 In conventional K-means clustering, one is given a collection of N objects described as N data points in an r-dimensional Euclidean space, and one seeks a clustering that minimizes the sum of intra-cluster squared distances. [sent-47, score-0.297]

13 That is, the sum, over all data points, of the squared distance between that point and the centroid of its assigned cluster. [sent-49, score-0.124]

14 We likewise seek a cluster assignment, A, that minimizes the sum of intra-cluster squared distances. [sent-52, score-0.299]

15 However, unlike in conventional K-means, the N objects to be clustered are themselves described by vectors—in a suitable manifold—that depend on the clustering A. [sent-53, score-0.199]

16 These context vectors are the counts of the K different tags occurring, under tagging A, to the left and right of tokens of word type w in the corpus. [sent-58, score-0.434]

17 We normalize each of these context vectors to unit length, producing, for each word type w, two points LA(w) and RA(w) on the (K–1)-dimensional unit sphere. [sent-59, score-0.231]

18 The latent descriptor for w consists of the pair (LA(w), RA(w))—more details in Section 2. [sent-60, score-0.316]

19 A straightforward approach to this latent-descriptor K-means problem is to adapt the classical iterative K-means algorithm so as to handle the latent descriptors. [sent-61, score-0.208]

20 Thus, rather than the hard assignment A, we use a soft-assignment matrix P. [sent-64, score-0.251]

21 Pwk, interpreted as the probability of assigning word w to cluster k, is, essentially, proportional to exp{–dwk²/(2σ²)}, where dwk is the distance between the latent descriptor for w and the mean of Gaussian k. [sent-65, score-0.39]

22 Unlike the Gaussian-mixture model however, we use the same mixture coefficient and the same Gaussian width for all k. [sent-68, score-0.158]

23 Further, we let the Gaussian width σ decrease gradually during the iterative process. [sent-69, score-0.207]

24 As is well-known, the EM algorithm for Gaussian mixtures reduces in the limit of small σ to the simpler K-means clustering algorithm. [sent-70, score-0.133]

25 The soft assignment used earlier in the process lends robustness to the algorithm. [sent-72, score-0.174]

26 For simplicity, induced tags are henceforth referred to as labels, while tags will be reserved for the gold-standard tags, to be used later for evaluation. [sent-78, score-0.175]

27 Given a labeling A, we define L~A(w1), the left-label context of word type w1, as the K-dimensional vector whose k-th component is the number of bigrams (ti–1, ti) in the corpus such that w(ti) = w1 and A(w(ti–1)) = k. [sent-98, score-0.168]

28 We define the left descriptor of word type w as LA(w) = λ(L~A(w)), where λ denotes scaling to unit length. [sent-99, score-0.332]

29 We similarly define R~A(w1), the right-label context of w1, as the K-dimensional vector whose k-th component is the number of bigrams (ti, ti+1) such that w(ti) = w1 and A(w(ti+1)) = k, and we define the right descriptor of word type w as RA(w) = λ(R~A(w)). [sent-100, score-0.357]
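
To make this construction concrete, here is a minimal NumPy sketch of how the label-context counts and the unit-length descriptors might be computed from a hard labeling; the function and variable names (tokens, labeling, and so on) are illustrative assumptions, not the authors' code.

import numpy as np

def hard_descriptors(tokens, labeling, n_types, K):
    # tokens: the corpus as a 1-D array of word-type ids, in order
    # labeling: array of length n_types with values in {0, ..., K-1}
    L_ctx = np.zeros((n_types, K))          # left-label contexts L~A
    R_ctx = np.zeros((n_types, K))          # right-label contexts R~A
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        L_ctx[cur, labeling[prev]] += 1     # bigram (t_{i-1}, t_i)
        R_ctx[prev, labeling[cur]] += 1     # bigram (t_i, t_{i+1})
    def to_unit(M):                         # lambda(.): scale each row to unit length
        return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    return to_unit(L_ctx), to_unit(R_ctx)   # LA, RA: rows lie on the unit sphere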

30 Thus, CL(k) is the projection on S^{K–1} of the weighted average of the left descriptors of the word types labeled k. [sent-104, score-0.366]

31 We sometimes refer to CL(k) as the left centroid of cluster k. [sent-105, score-0.178]
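
A matching sketch of the left and right centroids follows; weighting the average by word-type frequency f(w) is an assumption here, made consistent with the weights Pwk·f(w) that the soft version uses for the Gaussian means in Section 3.

import numpy as np

def centroids(LA, RA, labeling, f, K):
    # f: word-type frequencies; LA, RA: descriptor matrices from the sketch above
    C_L = np.zeros((K, K))
    C_R = np.zeros((K, K))
    for k in range(K):
        members = (labeling == k)
        if members.any():
            w = f[members][:, None]
            C_L[k] = (w * LA[members]).sum(axis=0)
            C_R[k] = (w * RA[members]).sum(axis=0)
    to_unit = lambda M: M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), 1e-12)
    return to_unit(C_L), to_unit(C_R)       # projected back onto the unit sphere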

32 Informally, we seek a labeling A such that, for any two word types w1 and w2 in W, w1 and w2 are labeled the same if and only if LA(w1) and LA(w2) are close to each other on S^{K–1} and so are RA(w1) and RA(w2). [sent-108, score-0.117]

33 Formally, our goal is to find a labeling A that minimizes the objective function F(A) = Σ_w (||LA(w) – CL(A(w))||² + ||RA(w) – CR(A(w))||²), where the sum runs over all word types w in W. [sent-109, score-0.168]

34 Note that, just as in conventional K-means clustering, F(A) is the sum of the intra-cluster squared distances. [sent-110, score-0.156]
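
Written out with the helpers sketched above, F(A) is just the summed squared distances; whether each term should additionally carry a frequency weight is not stated in this extract, so the plain sum over word types is an assumption.

import numpy as np

def F(LA, RA, C_L, C_R, labeling):
    # sum over word types of the squared distance to the assigned centroids
    d_left  = LA - C_L[labeling]
    d_right = RA - C_R[labeling]
    return float((d_left ** 2).sum() + (d_right ** 2).sum())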

35 However, unlike conventional K-means clustering, the descriptors of the objects to be clustered depend themselves on the clustering. [sent-111, score-0.411]

36 We accordingly refer to LA and RA as latent descriptors, and to the method described in the next section as Latent-Descriptor Clustering, or LDC. [sent-112, score-0.126]

37 This global minimum, 0, is obtained by the trivial assignment that maps all word types to a unique label. [sent-114, score-0.155]

38 Instead, we seek a minimum under the constraint that the labeling be non-trivial. [sent-115, score-0.117]

39 Each iteration of the algorithm includes an E phase and an M phase. [sent-125, score-0.246]

40 The E phase consists of computing, based on the current θ, a probabilistic assignment of each of the N observations to the K Gaussian distributions. [sent-126, score-0.245]

41 These probabilistic assignments form an N×K stochastic matrix P, i.e., each of its rows sums to one. [sent-127, score-0.167]

42 The M phase consists of updating the model parameters θ, based on the current assignments P. [sent-130, score-0.221]

43 Thus, each iteration of LDC consists of an E phase and an M phase. [sent-135, score-0.213]

44 As observations are replaced by latent descriptors, an iteration of LDC is best viewed as starting with the M phase. [sent-136, score-0.174]

45 The M phase starts by building a pair of latent-descriptor matrices LP and RP from the soft assignments obtained in the previous iteration. [sent-137, score-0.335]

46 Note that these descriptors are now indexed by P, the matrix of probabilistic assignments, rather than by hard assignments A as in the previous section. [sent-138, score-0.539]

47 Thus, the latent descriptors consist of the left-word and right-word contexts (recall that these are given by matrices L and R), mapped into left-label and right-label contexts through multiplication by the assignment matrix P, and scaled to unit length: LP = λ(L·P) and RP = λ(R·P). [sent-140, score-0.753]

48 With these latent descriptors in hand, we proceed with the M phase of the algorithm as usual. [sent-141, score-0.548]

49 Thus, the left mean µLk for Gaussian k is the weighted average of the left latent descriptors LP(w), scaled to unit length. [sent-142, score-0.566]

50 Note that the definition of the Gaussian mean µLk parallels the definition of the cluster centroid CL(k) given in the previous section; if the assignment P happens to be a hard assignment, µLk is actually identical to CL(k). [sent-144, score-0.357]
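
The whole M phase therefore reduces to two matrix products and a few row normalizations. The sketch below assumes L and R are the word-by-word left and right context count matrices mentioned above, P is the current soft assignment, and f holds word-type frequencies; as elsewhere, the names are illustrative, not the authors' code.

import numpy as np

def unit_rows(M, eps=1e-12):
    return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)

def m_phase(L, R, P, f):
    LP = unit_rows(L @ P)          # latent left descriptors  L_P = lambda(L P)
    RP = unit_rows(R @ P)          # latent right descriptors R_P = lambda(R P)
    W = P * f[:, None]             # weights P_wk * f(w)
    mu_L = unit_rows(W.T @ LP)     # left Gaussian means, rescaled to unit length
    mu_R = unit_rows(W.T @ RP)     # right Gaussian means
    return LP, RP, mu_L, mu_R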

51 The E phase of the iteration takes the latent descriptors and the Gaussian means, and computes a new N×K matrix of probabilistic assignments P. [sent-147, score-0.774]

52 σ is a parameter of the model, which, as mentioned, is gradually decreased to enforce convergence of P to a hard assignment. [sent-149, score-0.138]
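
Under the same assumptions, the E phase is the standard isotropic-Gaussian responsibility computation with equal mixture weights. Summing the left and right squared distances into a single exponent mirrors the definition of G(P) below; the extract does not give the σ-schedule, so the caller is left to supply σ.

import numpy as np

def e_phase(LP, RP, mu_L, mu_R, sigma):
    # squared distances to every Gaussian mean, left plus right
    dL = ((LP[:, None, :] - mu_L[None, :, :]) ** 2).sum(axis=-1)
    dR = ((RP[:, None, :] - mu_R[None, :, :]) ** 2).sum(axis=-1)
    logits = -(dL + dR) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)          # N x K, rows sum to one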

53 The description of the M phase given above does not apply to the first iteration, since the M phase uses P from the previous iteration. [sent-150, score-0.242]

54 To create a set of left and right descriptor vectors in the M phase of the first iteration, we use the left-word and right-word contexts L and R. [sent-153, score-0.503]

55 The left and right descriptors for the first iteration are obtained by scaling each row of matrices L1 and R1 to unit length. [sent-159, score-0.621]

56 The Gaussian centers µLk and µRk, k = 1, …, K, are set equal to the left and right descriptors of the K most frequent words in the corpus. [sent-160, score-0.401]
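
A sketch of this initialization: rows of the (possibly rank-reduced, see the SVD remark later in the text) word-context matrices are scaled to unit length, and the K most frequent word types donate their descriptors as the initial Gaussian centers. Variable names are illustrative.

import numpy as np

def initialize(L, R, f, K):
    def unit_rows(M, eps=1e-12):
        return M / np.maximum(np.linalg.norm(M, axis=1, keepdims=True), eps)
    L1, R1 = unit_rows(L), unit_rows(R)      # first-iteration descriptors
    top = np.argsort(-f)[:K]                 # K most frequent word types
    return L1, R1, L1[top].copy(), R1[top].copy()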

57 The soft-assignment version of LDC does not directly attempt to minimize F(A), nor can it be viewed as likelihood maximization—as is EM for a Gaussian mixture—since the use of latent descriptors precludes the definition of a generative model for the data. [sent-164, score-0.394]

58 The first tool is an objective function G(P) that parallels the definition of F(A) for hard assignments. [sent-168, score-0.156]

59 For a probabilistic assignment P, we define G(P) to be the weighted average, over all w and all k, of ||LP(w) – µLk||² + ||RP(w) – µRk||²; the weight used in this average is Pwk·f(w), just as in the computation of the Gaussian means. [sent-169, score-0.124]

60 The second tool will allow us to compute a tagging accuracy for soft assignments. [sent-172, score-0.171]

61 For this purpose, we simply create, for any probabilistic assignment P, the obvious labeling A = A*(P) that maps w to k with highest Pwk. [sent-173, score-0.235]
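
Both tools are straightforward to write down. The sketch below assumes G(P) is normalized by the total weight (the extract calls it a weighted average without giving the normalizer explicitly), and A*(P) is simply a row-wise argmax.

import numpy as np

def G(LP, RP, mu_L, mu_R, P, f):
    dL = ((LP[:, None, :] - mu_L[None, :, :]) ** 2).sum(axis=-1)
    dR = ((RP[:, None, :] - mu_R[None, :, :]) ** 2).sum(axis=-1)
    W = P * f[:, None]                       # weights P_wk * f(w)
    return float((W * (dL + dR)).sum() / W.sum())

def a_star(P):
    return P.argmax(axis=1)                  # map each w to its most probable label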

62 4 Results. In order to evaluate the performance of LDC, we apply it to the Wall Street Journal portion of the Penn Treebank. (Footnote 1: The LDC code, including tagging accuracy evaluation, is available at http://www.)

63 In order to compare the labels generated by the unsupervised model with the tags of each tagset, we use two map-based criteria; the first, MTO, is many-to-one tagging accuracy. [sent-184, score-0.291]

64 MTO is the fraction of correctly-tagged tokens in the corpus under the so-called many-to-one mapping, which takes each induced tag to the gold-standard POS tag with which it co-occurs most frequently. [sent-186, score-0.331]
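
A minimal sketch of the MTO score over token sequences; gold and induced are per-token tag and label ids, and the names are assumptions for illustration.

import numpy as np

def mto_accuracy(gold, induced, n_tags, K):
    counts = np.zeros((K, n_tags), dtype=np.int64)   # label-by-tag co-occurrences
    for g, a in zip(gold, induced):
        counts[a, g] += 1
    best_tag = counts.argmax(axis=1)                 # many-to-one mapping
    correct = sum(counts[k, best_tag[k]] for k in range(K))
    return correct / len(gold)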

65 OTO: the best tagging accuracy achievable under a so-called one-to-one mapping. [sent-189, score-0.121]

66 That is, a mapping such that at most one induced tag is sent to any POS tag. [sent-191, score-0.138]
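
The OTO criterion is a maximum-weight bipartite matching between labels and tags; one common way to compute it, assumed here rather than taken from the paper, is the Hungarian algorithm via scipy.optimize.linear_sum_assignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def oto_accuracy(gold, induced, n_tags, K):
    counts = np.zeros((K, n_tags), dtype=np.int64)
    for g, a in zip(gold, induced):
        counts[a, g] += 1
    rows, cols = linear_sum_assignment(-counts)      # maximize matched token count
    return counts[rows, cols].sum() / len(gold)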

67 Top curve: Many-to-one tagging accuracy of labeling A*(P), evaluated against PTB17. [sent-200, score-0.201]

68 Top curve shows the MTO accuracy of the labeling evaluated against PTB45. [sent-202, score-0.123]

69 With the σ-schedules used in these experiments, P typically converges to a hard assignment in about 45 iterations, σ being then 10^–5. [sent-207, score-0.221]

70 While the objective function G(P) mostly decreases, it does show a hump for K = 50 around iteration 9. [sent-208, score-0.139]

71 This forced choice, in turn, produces more coherent descriptor vectors for all word types, and yields a steady increase in tagging accuracy. [sent-216, score-0.482]

72 Table 1 compares the tagging accuracy of LDC with several recent models of Gao and Johnson (2008) and Lamar et al. [sent-221, score-0.121]

73 Each run was halted at iteration 15, and the score reported uses the labeling A*(P) defined at the end of Section 3. [sent-229, score-0.172]

74 The LDC results shown in the bottom half of the table, which uses the OTO criterion, were obtained with a variant of the LDC algorithm, in which the M phase estimates not only the Gaussian means but also the mixture coefficients. [sent-230, score-0.198]

75 Black bars indicate mislabeled words when 17 clusters are used. [sent-244, score-0.146]

76 Gray bars indicate words that continue to be mislabeled even when every word type is free to choose its own label, as if each type were in its own cluster—which defines the theoretically best possible non-disambiguating model. [sent-245, score-0.234]

77 Bottom: fraction of tokens of each tag that are mislabeled. [sent-247, score-0.193]

78 Many of the infrequent tags are 100% mislabeled because no induced label is mapped to these tags under MTO. [sent-248, score-0.361]

79 Over 8% of the tokens in the corpus are mislabeled adjectives, roughly one-third of all mislabeled tokens (25. [sent-251, score-0.447]

80 Similarly, nearly 4% of the mislabeled tokens are adverbs, but every adverb in the corpus is mislabeled because no label is mapped to this tag – a common occurrence. Figure 4: The confusion matrix for LDC's labeling under PTB17. [sent-254, score-0.671]

81 The diamonds indicate the induced tag under the MTO mapping. [sent-256, score-0.138]

82 Several labels are mapped to N (Noun), and one of these labels causes appreciable confusion between nouns and adjectives. [sent-257, score-0.212]

83 Because multiple labels are dedicated to a single tag (N, V and PREP), several tags (in this case 7) are left with no label. [sent-258, score-0.24]

84 Element (i, j) of this matrix stores the fraction of all tokens of POS tag i that are given label j by the model. [sent-261, score-0.26]

85 By requiring the Gaussians to be isotropic with uniform width and by allowing that width to shrink to zero, the algorithm forces the soft assignments to converge to a set of hard assignments. [sent-271, score-0.405]

86 As in Lamar et al. (2010), each word type is mapped into a descriptor vector derived from its left and right tag contexts. [sent-283, score-0.492]

87 Accordingly, the objective function is that of the K-means clustering problem, namely a sum of intra-cluster squared distances. [sent-284, score-0.257]

88 This objective function, unlike the likelihood under an HMM, takes into account both left and right contexts. [sent-285, score-0.136]

89 It also makes use in a crucial way of cluster centroids (or Gaussian means), a notion that has no counterpart in the HMM approach. [sent-286, score-0.123]

90 This would not necessarily preclude the use of an iteration-dependent scaling factor, which would achieve the goal of gradually forcing the tagging to become deterministic. [sent-294, score-0.199]

91 Reduced-rank SVD is used in the initialization of the descriptor vectors, for the optimization to get off the ground. [sent-296, score-0.267]
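
What such an initialization could look like, as a hypothetical sketch: truncate the SVD of the raw context matrices to a small rank r before scaling rows to unit length; the rank value in the usage comment is a placeholder, not the paper's setting.

import numpy as np

def reduced_rank(M, r):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]            # best rank-r approximation

# e.g. L1 could be built from reduced_rank(L, r=10) before row normalization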

92 For instance, using only the 400 most frequent words in the corpus—instead of all words—in the construction of the left-word and right-word context vectors in iteration 1 causes no appreciable change in performance. [sent-298, score-0.196]

93 The probabilistic-assignment algorithm was found to be much more robust against parameter changes than the hard-assignment version of LDC, which parallels the classical K-means clustering algorithm (see Section 1). [sent-299, score-0.215]

94 We experimented with this hard-assignment latent-descriptor clustering algorithm (data not shown), and found that a number of additional devices were necessary in order to make it work properly. [sent-300, score-0.21]

95 In particular, we found it necessary to use reduced-rank SVD on each iteration of the algorithm—as opposed to just the first iteration in the version presented here—and to gradually increase the rank r. [sent-301, score-0.262]

96 Central to the success of LDC is the dynamic interplay between the progressively harder cluster assignments and the updated latent descriptor vectors. [sent-306, score-0.523]

97 We operate under the assumption that if all word types were labeled optimally, words that share a label should have similar descriptor vectors arising from this optimal labeling. [sent-307, score-0.293]

98 The LDC algorithm demonstrates that, despite starting far from this optimal labeling, the alternation between vector updates and assignment updates is able to produce steadily improving clusters, as seen by the steady increase of tagging accuracy. [sent-309, score-0.381]

99 It is a relatively simple matter to extend the descriptor vectors to include context outside the nearest neighbors, which may well improve performance. [sent-311, score-0.293]

100 The unsupervised induction of stochastic context-free grammars using distributional clustering. [sent-330, score-0.176]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ldc', 0.474), ('descriptors', 0.312), ('descriptor', 0.234), ('mto', 0.228), ('lamar', 0.175), ('gaussian', 0.174), ('mislabeled', 0.146), ('assignment', 0.124), ('phase', 0.121), ('tagging', 0.121), ('pwk', 0.114), ('clustering', 0.1), ('assignments', 0.1), ('ti', 0.098), ('abend', 0.097), ('pos', 0.094), ('iteration', 0.092), ('ra', 0.09), ('lp', 0.087), ('tag', 0.085), ('cl', 0.083), ('la', 0.083), ('latent', 0.082), ('width', 0.081), ('labeling', 0.08), ('gradually', 0.078), ('mixture', 0.077), ('lk', 0.076), ('reichart', 0.076), ('squared', 0.074), ('cluster', 0.074), ('gaussians', 0.07), ('unsupervised', 0.069), ('rp', 0.069), ('steady', 0.068), ('matrix', 0.067), ('unit', 0.064), ('matrices', 0.064), ('sk', 0.062), ('tags', 0.061), ('svd', 0.061), ('tokens', 0.06), ('hard', 0.06), ('induction', 0.059), ('vectors', 0.059), ('oto', 0.058), ('elie', 0.058), ('em', 0.058), ('left', 0.054), ('clustered', 0.053), ('gao', 0.053), ('induced', 0.053), ('tagsets', 0.053), ('johnson', 0.05), ('soft', 0.05), ('tagset', 0.05), ('centroid', 0.05), ('centroids', 0.049), ('rk', 0.049), ('parallels', 0.049), ('fraction', 0.048), ('distributional', 0.048), ('iterative', 0.048), ('clark', 0.047), ('confusion', 0.047), ('objective', 0.047), ('conventional', 0.046), ('appreciable', 0.045), ('essen', 0.045), ('latentdescriptor', 0.045), ('nondisambiguating', 0.045), ('hmm', 0.044), ('bigrams', 0.044), ('accordingly', 0.044), ('type', 0.044), ('ptb', 0.043), ('curve', 0.043), ('minimizes', 0.041), ('labels', 0.04), ('mapped', 0.04), ('maron', 0.039), ('yariv', 0.039), ('euclidean', 0.039), ('converges', 0.037), ('seek', 0.037), ('sum', 0.036), ('adjectives', 0.035), ('steadily', 0.035), ('tsuruoka', 0.035), ('right', 0.035), ('updated', 0.033), ('initialization', 0.033), ('sch', 0.033), ('algorithm', 0.033), ('ney', 0.033), ('devices', 0.032), ('roi', 0.032), ('notations', 0.032), ('gra', 0.032), ('maps', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributional-only, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

2 0.22791767 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

Author: Yoong Keok Lee ; Aria Haghighi ; Regina Barzilay

Abstract: Part-of-speech (POS) tag distributions are known to exhibit sparsity: a word is likely to take a single predominant tag in a corpus. Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy. However, in existing systems, this expansion comes with a steep increase in model complexity. This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments. In addition, this formulation results in a dramatic reduction in the number of model parameters, thereby enabling unusually rapid training. Our experiments consistently demonstrate that this model architecture yields substantial performance gains over more complex tagging counterparts. On several languages, we report performance exceeding that of more complex state-of-the-art systems.

3 0.1428446 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

4 0.13014077 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

5 0.098877415 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

Author: Xian Qian ; Qi Zhang ; Yaqian Zhou ; Xuanjing Huang ; Lide Wu

Abstract: Many sequence labeling tasks in NLP require solving a cascade of segmentation and tagging subtasks, such as Chinese POS tagging, named entity recognition, and so on. Traditional pipeline approaches usually suffer from error propagation. Joint training/decoding in the cross-product state space could cause too many parameters and high inference complexity. In this paper, we present a novel method which integrates graph structures of two subtasks into one using virtual nodes, and performs joint training and decoding in the factorized state space. Experimental evaluations on CoNLL 2000 shallow parsing data set and Fourth SIGHAN Bakeoff CTB POS tagging data set demonstrate the superiority of our method over cross-product, pipeline and candidate reranking approaches.

6 0.082893692 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

7 0.080045022 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

8 0.079840869 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

9 0.079640888 77 emnlp-2010-Measuring Distributional Similarity in Context

10 0.078673773 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

11 0.077830993 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

12 0.072445922 64 emnlp-2010-Incorporating Content Structure into Text Analysis Applications

13 0.071644284 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

14 0.068042651 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

15 0.06702958 114 emnlp-2010-Unsupervised Parse Selection for HPSG

16 0.063784815 84 emnlp-2010-NLP on Spoken Documents Without ASR

17 0.061823271 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

18 0.061136089 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

19 0.06067656 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

20 0.057771683 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.225), (1, 0.167), (2, 0.055), (3, -0.164), (4, -0.183), (5, 0.032), (6, -0.2), (7, 0.052), (8, 0.132), (9, 0.051), (10, 0.081), (11, 0.014), (12, -0.022), (13, -0.092), (14, 0.026), (15, -0.066), (16, -0.029), (17, -0.018), (18, 0.019), (19, 0.11), (20, -0.106), (21, 0.051), (22, 0.112), (23, -0.005), (24, -0.057), (25, -0.083), (26, 0.097), (27, -0.05), (28, 0.037), (29, -0.013), (30, 0.04), (31, -0.076), (32, -0.005), (33, 0.028), (34, 0.031), (35, 0.004), (36, -0.002), (37, 0.036), (38, 0.01), (39, -0.016), (40, 0.004), (41, -0.059), (42, -0.15), (43, 0.081), (44, 0.029), (45, -0.017), (46, -0.146), (47, 0.057), (48, -0.002), (49, -0.075)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96703696 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributional-only, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

2 0.82423109 97 emnlp-2010-Simple Type-Level Unsupervised POS Tagging

Author: Yoong Keok Lee ; Aria Haghighi ; Regina Barzilay

Abstract: Part-of-speech (POS) tag distributions are known to exhibit sparsity: a word is likely to take a single predominant tag in a corpus. Recent research has demonstrated that incorporating this sparsity constraint improves tagging accuracy. However, in existing systems, this expansion comes with a steep increase in model complexity. This paper proposes a simple and effective tagging method that directly models tag sparsity and other distributional properties of valid POS tag assignments. In addition, this formulation results in a dramatic reduction in the number of model parameters, thereby enabling unusually rapid training. Our experiments consistently demonstrate that this model architecture yields substantial performance gains over more complex tagging counterparts. On several languages, we report performance exceeding that of more complex state-of-the-art systems.

3 0.67536563 111 emnlp-2010-Two Decades of Unsupervised POS Induction: How Far Have We Come?

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Many different methods have been proposed, yet comparisons are difficult to make since there is little consensus on evaluation framework, and many papers evaluate against only one or two competitor systems. Here we evaluate seven different POS induction systems spanning nearly 20 years of work, using a variety of measures. We show that some of the oldest (and simplest) systems stand up surprisingly well against more recent approaches. Since most of these systems were developed and tested using data from the WSJ corpus, we compare their generalization abilities by testing on both WSJ and the multilingual Multext-East corpus. Finally, we introduce the idea of evaluating systems based on their ability to produce cluster prototypes that are useful as input to a prototype-driven learner. In most cases, the prototype-driven learner outperforms the unsupervised system used to initialize it, yielding state-of-the-art results on WSJ and improvements on non-English corpora.

4 0.67532867 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

Author: Taesun Moon ; Katrin Erk ; Jason Baldridge

Abstract: We define the crouching Dirichlet, hidden Markov model (CDHMM), an HMM for part-of-speech tagging which draws state prior distributions for each local document context. This simple modification of the HMM takes advantage of the dichotomy in natural language between content and function words. In contrast, a standard HMM draws all prior distributions once over all states and it is known to perform poorly in unsupervised and semisupervised POS tagging. This modification significantly improves unsupervised POS tagging performance across several measures on five data sets for four languages. We also show that simply using different hyperparameter values for content and function word states in a standard HMM (which we call HMM+) is surprisingly effective.

5 0.4461568 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon

Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.

6 0.43384105 77 emnlp-2010-Measuring Distributional Similarity in Context

7 0.4199262 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

8 0.38985112 9 emnlp-2010-A New Approach to Lexical Disambiguation of Arabic Text

9 0.38240081 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

10 0.37939775 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

11 0.37438232 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

12 0.37423414 84 emnlp-2010-NLP on Spoken Documents Without ASR

13 0.36004329 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

14 0.340886 88 emnlp-2010-On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

15 0.34075207 114 emnlp-2010-Unsupervised Parse Selection for HPSG

16 0.32538524 61 emnlp-2010-Improving Gender Classification of Blog Authors

17 0.31362426 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

18 0.28994566 116 emnlp-2010-Using Universal Linguistic Knowledge to Guide Grammar Induction

19 0.28871205 27 emnlp-2010-Clustering-Based Stratified Seed Sampling for Semi-Supervised Relation Classification

20 0.27729693 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.017), (10, 0.015), (12, 0.021), (29, 0.148), (30, 0.015), (32, 0.396), (52, 0.03), (56, 0.052), (62, 0.012), (66, 0.124), (72, 0.052), (76, 0.017), (77, 0.012), (87, 0.015), (89, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83565277 108 emnlp-2010-Training Continuous Space Language Models: Some Practical Issues

Author: Hai Son Le ; Alexandre Allauzen ; Guillaume Wisniewski ; Francois Yvon

Abstract: Using multi-layer neural networks to estimate the probabilities of word sequences is a promising research area in statistical language modeling, with applications in speech recognition and statistical machine translation. However, training such models for large vocabulary tasks is computationally challenging which does not scale easily to the huge corpora that are nowadays available. In this work, we study the performance and behavior of two neural statistical language models so as to highlight some important caveats of the classical training algorithms. The induced word embeddings for extreme cases are also analysed, thus providing insight into the convergence issues. A new initialization scheme and new training techniques are then introduced. These methods are shown to greatly reduce the training time and to significantly improve performance, both in terms of perplexity and on a large-scale translation task.

same-paper 2 0.8332392 71 emnlp-2010-Latent-Descriptor Clustering for Unsupervised POS Induction

Author: Michael Lamar ; Yariv Maron ; Elie Bienenstock

Abstract: We present a novel approach to distributional-only, fully unsupervised, POS tagging, based on an adaptation of the EM algorithm for the estimation of a Gaussian mixture. In this approach, which we call Latent-Descriptor Clustering (LDC), word types are clustered using a series of progressively more informative descriptor vectors. These descriptors, which are computed from the immediate left and right context of each word in the corpus, are updated based on the previous state of the cluster assignments. The LDC algorithm is simple and intuitive. Using standard evaluation criteria for unsupervised POS tagging, LDC shows a substantial improvement in performance over state-of-the-art methods, along with a several-fold reduction in computational cost.

3 0.56128758 87 emnlp-2010-Nouns are Vectors, Adjectives are Matrices: Representing Adjective-Noun Constructions in Semantic Space

Author: Marco Baroni ; Roberto Zamparelli

Abstract: We propose an approach to adjective-noun composition (AN) for corpus-based distributional semantics that, building on insights from theoretical linguistics, represents nouns as vectors and adjectives as data-induced (linear) functions (encoded as matrices) over nominal vectors. Our model significantly outperforms the rivals on the task of reconstructing AN vectors not seen in training. A small post-hoc analysis further suggests that, when the model-generated AN vector is not similar to the corpus-observed AN vector, this is due to anomalies in the latter. We show moreover that our approach provides two novel ways to represent adjective meanings, alternative to its representation via corpus-based co-occurrence vectors, both outperforming the latter in an adjective clustering task.

4 0.52076232 109 emnlp-2010-Translingual Document Representations from Discriminative Projections

Author: John Platt ; Kristina Toutanova ; Wen-tau Yih

Abstract: Representing documents by vectors that are independent of language enhances machine translation and multilingual text categorization. We use discriminative training to create a projection of documents from multiple languages into a single translingual vector space. We explore two variants to create these projections: Oriented Principal Component Analysis (OPCA) and Coupled Probabilistic Latent Semantic Analysis (CPLSA). Both of these variants start with a basic model of documents (PCA and PLSA). Each model is then made discriminative by encouraging comparable document pairs to have similar vector representations. We evaluate these algorithms on two tasks: parallel document retrieval for Wikipedia and Europarl documents, and cross-lingual text classification on Reuters. The two discriminative variants, OPCA and CPLSA, significantly outperform their corresponding baselines. The largest differences in performance are observed on the task of retrieval when the documents are only comparable and not parallel. The OPCA method is shown to perform best.

5 0.51073372 84 emnlp-2010-NLP on Spoken Documents Without ASR

Author: Mark Dredze ; Aren Jansen ; Glen Coppersmith ; Ken Church

Abstract: There is considerable interest in interdisciplinary combinations of automatic speech recognition (ASR), machine learning, natural language processing, text classification and information retrieval. Many of these boxes, especially ASR, are often based on considerable linguistic resources. We would like to be able to process spoken documents with few (if any) resources. Moreover, connecting black boxes in series tends to multiply errors, especially when the key terms are out-of-vocabulary (OOV). The proposed alternative applies text processing directly to the speech without a dependency on ASR. The method finds long (∼1 sec) repetitions in speech and clusters them into pseudo-terms (roughly phrases). Document clustering and classification work surprisingly well on pseudo-terms; performance on a Switchboard task approaches a baseline using gold standard manual transcriptions.

6 0.50031525 78 emnlp-2010-Minimum Error Rate Training by Sampling the Translation Lattice

7 0.4923515 7 emnlp-2010-A Mixture Model with Sharing for Lexical Semantics

8 0.48818606 75 emnlp-2010-Lessons Learned in Part-of-Speech Tagging of Conversational Speech

9 0.48284274 34 emnlp-2010-Crouching Dirichlet, Hidden Markov Model: Unsupervised POS Tagging with Context Local Tag Generation

10 0.48198959 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

11 0.48010167 98 emnlp-2010-Soft Syntactic Constraints for Hierarchical Phrase-Based Translation Using Latent Syntactic Distributions

12 0.47987318 86 emnlp-2010-Non-Isomorphic Forest Pair Translation

13 0.47845301 4 emnlp-2010-A Game-Theoretic Approach to Generating Spatial Descriptions

14 0.47834867 103 emnlp-2010-Tense Sense Disambiguation: A New Syntactic Polysemy Task

15 0.47757319 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

16 0.477368 25 emnlp-2010-Better Punctuation Prediction with Dynamic Conditional Random Fields

17 0.47618768 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

18 0.47604635 2 emnlp-2010-A Fast Decoder for Joint Word Segmentation and POS-Tagging Using a Single Discriminative Model

19 0.47594008 89 emnlp-2010-PEM: A Paraphrase Evaluation Metric Exploiting Parallel Texts

20 0.47301027 53 emnlp-2010-Fusing Eye Gaze with Speech Recognition Hypotheses to Resolve Exophoric References in Situated Dialogue