emnlp emnlp2011 emnlp2011-99 knowledge-graph by maker-knowledge-mining

99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases


Source: pdf

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. [sent-4, score-0.477]

2 Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. [sent-5, score-0.466]

3 To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. [sent-6, score-0.462]

4 We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. [sent-7, score-0.312]

5 For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. [sent-8, score-0.49]

6 Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer. [sent-9, score-0.674]

7 1 Introduction: Word segmentation is the first step of natural language processing for Japanese, Chinese and Thai because they do not delimit words by white-space. [sent-10, score-0.427]

8 Among them, encyclopedias are especially important in that they contain a lot of terms that a morphological dictionary fails to cover. [sent-18, score-0.342]

9 According to our segmentation criteria, it consists of two words “常山” (tsuneyama) and “城” (jou). [sent-21, score-0.427]

10 However, the morphological analyzer wrongly segments it into “常” (tsune) and “山城” (yamashiro) because “常山” (tsuneyama) is an unknown word. [sent-22, score-0.523]

11 To do this, we examine the main text of the entry, on the assumption that if the noun phrase in question consists of more than one word, its constituents appear in the main text either freely or as part of other noun phrases. [sent-25, score-0.415]

12 The bigram model alleviates a problem of the unigram model, that is, a tendency to misidentify a sequence of words in common collocations as a single word. [sent-33, score-0.381]

13 However, type-based sampling is not easily applicable to the bigram model owing to sparsity and its dependence on latent assignments. [sent-36, score-0.438]

14 We propose a hybrid type-based sampling procedure, which combines the Metropolis-Hastings algorithm with Gibbs sampling. [sent-37, score-0.332]

15 This greatly eases the sampling procedure while retaining the efficiency of type-based sampling. [sent-40, score-0.329]

16 Experiments show that the proposed method quickly corrects the initial segmentation given by a morphological analyzer. [sent-41, score-0.674]

17 2 Related Work: Japanese Morphological Analysis and Lexical Acquisition. Word segmentation for Japanese is usually solved as the joint task of segmentation and part-of-speech tagging, which is called morphological analysis (Kurohashi et al. [sent-42, score-1.054]

18 Since, unlike Chinese and Thai, Japanese is rich in morphology, morphological regularity can be used to determine if an unknown word candidate in text is indeed the word to be acquired. [sent-53, score-0.316]

19 Noun phrases can hardly be distinguished from single nouns because in Japanese, no morphological marker is attached to join nouns to form a noun phrase. [sent-55, score-0.406]

20 The assumption that language is a sequence of invariant words fails to capture rich morphology, as our segmentation criteria specify that each verb or adjective consists of an invariant stem and an ending that changes its form according to its grammatical roles. [sent-59, score-0.497]

21 NER performance may be affected by segmentation errors in morphological analysis involving unknown words. [sent-69, score-0.743]

22 Chinese word segmentation is often formalized as a character tagging problem (Xue, 2003). [sent-70, score-0.476]

23 3 Japanese Noun Phrase Segmentation: Our goal is to overcome the unknown word problem in morphological analysis by utilizing existing resources such as dictionaries and encyclopedias for human readers. [sent-86, score-0.397]

24 If the noun phrase in question consists of more than one word, its constituents would appear in the text either freely or as part of other noun phrases. [sent-96, score-0.415]

25 We obtain the segmentation of an entry noun phrase by considering the segmentation of the whole text. (Footnote 1: http://sourceforge.) [sent-97, score-1.136]

26 One may instead consider a pipeline approach in which we first extract noun phrases in text and then identify boundaries within these noun phrases. [sent-100, score-0.439]

27 However, noun phrases in text are not trivially identifiable in the case that they contain unknown words as their constituents. [sent-101, score-0.322]

28 For example, the analyzer erroneously segments the word “ちんすこう” (chiNsukou) into “ちん” (chiN) and “すこう” (sukou), and since the latter is misidentified as a verb, the incorrect noun phrase “ちん” (chiN) is extracted. [sent-102, score-0.408]

29 We have a morphological analyzer with a dictionary that covers frequent words. [sent-103, score-0.468]

30 For this reason, we would like to use the segmentation given by the analyzer as the initial state and to make small changes to it to get a desired output. [sent-105, score-0.728]

31 As the annotated corpus encodes our segmentation criteria, it can be used to force the models to stick to those criteria. [sent-107, score-0.933]

32 We concentrate on segmentation in this paper, but we also need to assign a POS tag to each constituent word and to incorporate segmented noun phrases into the dictionary of the morphological analyzer. [sent-108, score-0.968]

33 4 Non-parametric Bayesian Language Models: To correct the initial segmentation given by the analyzer, we use non-parametric Bayesian language models that have been applied to unsupervised word segmentation (Goldwater et al. [sent-110, score-0.901]

34 1 Unigram Model: In the unigram model, a word in the corpus wi is generated as follows: G | α0, P0 ∼ DP(α0, P0); wi | G ∼ G. (Footnote 3: Fortunately, the morphological analyzer JUMAN is capable of handling phrases, each of which consists of more than one word.) [sent-115, score-0.587]
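
For reference, the collapsed (Chinese restaurant process) view of this model yields the standard Dirichlet process predictive probability. The display below is a reconstruction in the usual notation of Goldwater et al., not copied from this paper, so the count symbols are our own:

    P(w_i = w \mid \mathbf{w}_{-i}) = \frac{n^{-i}_{w} + \alpha_0 P_0(w)}{n^{-i} + \alpha_0}

Here n^{-i}_w is the number of times w occurs in the rest of the corpus, n^{-i} is the total number of other word tokens, and P0 is the character-level base distribution over word spellings.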

35 In preliminary experiments, we found that the unigram model often interpreted a noun phrase as a single word, even in the case that its constituents frequently appeared in text. [sent-123, score-0.375]

36 2 Bigram Model: The problem of the unigram model can be alleviated by the bigram model based on a hierarchical Dirichlet process (Goldwater et al. [sent-125, score-0.304]

37 In the bigram model, word wi is generated as follows: G | α0, P0 ∼ DP(α0, P0); Hl | α1, G ∼ DP(α1, G); wi | wi−1 = l, Hl ∼ Hl. Marginalizing out G and Hl, we can again explain the model with the Chinese restaurant process. [sent-127, score-0.322]
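
Marginalizing as described yields a hierarchical Chinese restaurant process with one restaurant per context word l, backing off to a shared base restaurant. The following predictive probabilities are a hedged reconstruction in standard hierarchical-DP notation; the table counts t below are determined by exactly the latent table assignments z−i discussed next:

    P(w_i = w \mid w_{i-1} = l, \mathbf{w}_{-i}, \mathbf{z}_{-i}) = \frac{n^{-i}_{\langle l,w \rangle} + \alpha_1 P_1(w)}{n^{-i}_{l} + \alpha_1},
    \qquad P_1(w) = \frac{t^{-i}_{w} + \alpha_0 P_0(w)}{t^{-i} + \alpha_0}

Here n_{⟨l,w⟩} counts occurrences of the bigram, n_l counts tokens following l, t_w counts tables labeled w across all context restaurants, and t is their total.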

38 Unlike the unigram model, however, the bigram model depends on the latent table assignments z−i. [sent-128, score-0.304]

39 4 Mixing an Annotated Corpus: An annotated corpus can be used to force the models to stick to our segmentation criteria. [sent-150, score-0.506]

40 A straightforward way to do this is to mix it with raw text while fixing the segmentation during inference (Mochihashi et al. [sent-151, score-0.427]

41 Similarly, the back-off mixing bigram model replaces P1 in (2) with P1BM = λIP P1 + (1 − λIP) P2REF. [sent-161, score-0.41]
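
Restated in display form under our reading of the fragment above (the paper's equation (2) and the exact definitions of P1 and P2REF are not reproduced in this summary):

    P_1^{\mathrm{BM}}(w) = \lambda_{\mathrm{IP}}\, P_1(w) + (1 - \lambda_{\mathrm{IP}})\, P_2^{\mathrm{REF}}(w)

where P2REF is an estimate taken from the annotated reference corpus and λIP is an interpolation weight, so the model backs off to the annotated corpus instead of mixing it directly into the training data.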

42 5 Inference: Collapsed Gibbs sampling is widely used to find an optimal segmentation (Goldwater et al. [sent-162, score-0.675]

43 In this section, we first show that simple collapsed sampling can hardly escape the initial segmentation. [sent-164, score-0.495]

44 To address this problem, we apply a block sampling algorithm named type-based sampling (Liang et al. [sent-165, score-0.549]

45 Since type-based sampling is not applicable to the bigram model, we propose a novel sampling procedure for the bigram model, which we call hybrid type-based sampling. [sent-167, score-0.96]

46 1 Collapsed Sampling: In collapsed Gibbs sampling, the sampler repeatedly samples every possible boundary position, conditioned on the current state of the rest of the corpus. [sent-169, score-0.354]
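
As a rough, self-contained illustration of this boundary-wise sampler, here is a minimal Python sketch for the unigram case. It is not the authors' code: the CRPUnigram class, the geometric character-level base measure, and the toy bookkeeping are our own scaffolding, and Goldwater et al.'s utterance-boundary variables are omitted for brevity.

    import random
    from collections import Counter

    class CRPUnigram:
        """Collapsed DP unigram model: P(w) = (n_w + a0 * P0(w)) / (n + a0)."""
        def __init__(self, alpha0=1.0, n_chars=50, p_stop=0.5):
            self.alpha0 = alpha0
            self.p_char, self.p_stop = 1.0 / n_chars, p_stop
            self.counts, self.total = Counter(), 0

        def p0(self, w):
            # geometric word-length base measure over a uniform character model
            return (self.p_char * (1 - self.p_stop)) ** (len(w) - 1) * self.p_char * self.p_stop

        def prob(self, w):
            return (self.counts[w] + self.alpha0 * self.p0(w)) / (self.total + self.alpha0)

        def add(self, w):
            self.counts[w] += 1
            self.total += 1

        def remove(self, w):
            self.counts[w] -= 1
            self.total -= 1

    def gibbs_pass(chars, bounds, model, rng):
        """One sweep: resample each gap i (between chars[i] and chars[i+1]),
        conditioned on the current segmentation of everything else.
        'bounds' is the set of active gap indices."""
        n = len(chars)
        for i in range(n - 1):
            l0 = max((j for j in bounds if j < i), default=-1) + 1     # left word start
            r1 = min((j for j in bounds if j > i), default=n - 1) + 1  # right word end (exclusive)
            left = ''.join(chars[l0:i + 1])
            right = ''.join(chars[i + 1:r1])
            merged = left + right
            # take the current analysis of this span out of the counts
            if i in bounds:
                model.remove(left)
                model.remove(right)
            else:
                model.remove(merged)
            p_merge = model.prob(merged)
            p_split = model.prob(left)
            model.add(left)                 # the right word is generated after the left one
            p_split *= model.prob(right)
            model.remove(left)
            if rng.random() < p_split / (p_split + p_merge):
                bounds.add(i)
                model.add(left)
                model.add(right)
            else:
                bounds.discard(i)
                model.add(merged)

Starting from the analyzer's segmentation then amounts to initializing bounds and the model counts from that output before running gibbs_pass repeatedly; as the surrounding text explains, such a sampler makes only one local decision at a time.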

47 This property is especially problematic in our settings where the initial segmentation is given by a morphological analyzer. [sent-174, score-0.674]

48 Since the analyzer deterministically segments text using pre-defined parameters, the resultant segmentation is fairly consistent. [sent-175, score-0.634]

49 For this reason, the initial segmentation is usually chosen at random (Goldwater et al. [sent-179, score-0.474]

50 Sentence-based block sampling is also susceptible to consistent initialization (Liang et al. [sent-181, score-0.355]

51 2 Type-based Sampling: To achieve fast convergence, we adopt a block sampling algorithm named type-based sampling (Liang et al. [sent-184, score-0.549]

52 Type-based sampling takes advantage of the exchangeability of multiple positions with the same type. [sent-188, score-0.338]
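
The exchangeability just mentioned is what makes the block move tractable: under the unigram CRP, all sites of the same type are interchangeable, so the joint posterior over their boundary settings depends only on how many of them are split. Below is a hedged Python sketch, reusing the CRPUnigram interface from the previous snippet; the sites/corpus bookkeeping helpers are hypothetical, and this follows Liang et al. (2010) only at the level of the idea:

    import math
    import random

    def log_joint(model, left, right, m, k):
        """Log probability of adding m split analyses (left, right) and k - m
        merged analyses to the CRP via the chain rule; by exchangeability the
        order does not matter. Restores the counts before returning."""
        merged, added, logp = left + right, [], 0.0
        for _ in range(m):
            for w in (left, right):
                logp += math.log(model.prob(w))
                model.add(w)
                added.append(w)
        for _ in range(k - m):
            logp += math.log(model.prob(merged))
            model.add(merged)
            added.append(merged)
        for w in added:
            model.remove(w)
        return logp

    def type_based_sample(sites, left, right, model, corpus, rng):
        """Jointly resample all k sites of one type: remove their analyses,
        draw m with P(m) proportional to C(k, m) * exp(log_joint(m)), then
        split a random subset of m sites."""
        k = len(sites)
        for s in sites:
            corpus.remove_analysis(s, model)          # hypothetical helper
        logw = [math.lgamma(k + 1) - math.lgamma(m + 1) - math.lgamma(k - m + 1)
                + log_joint(model, left, right, m, k) for m in range(k + 1)]
        mx = max(logw)
        weights = [math.exp(x - mx) for x in logw]
        r, m = rng.random() * sum(weights), 0
        while m < k and r > weights[m]:
            r -= weights[m]
            m += 1
        rng.shuffle(sites)
        for s in sites[:m]:
            corpus.set_split(s, left, right, model)   # hypothetical helper
        for s in sites[m:]:
            corpus.set_merged(s, left + right, model) # hypothetical helper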

53 Although Liang et al. (2010) used random initialization, we take particular note of the possibility of efficiently correcting the consistent segmentation by the analyzer. [sent-194, score-0.427]

54 Type-based sampling is, however, not applicable to the bigram model for two reasons. [sent-195, score-0.438]

55 Strictly speaking, we need to update the model counts even when sampling one position, because the observation of the bigram ⟨wl w1⟩, for example, may affect the probability P2(w2 | h−, ⟨wl w1⟩). [sent-202, score-0.438]

56 (2009) approximate the probability by not updating the model counts in collapsed Gibbs sampling (i. [sent-204, score-0.385]

57 This is motivated by the observation that although the joint sampling of a large number of positions is computationally expensive, the proposal is accepted very infrequently. [sent-255, score-0.454]
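
Putting the pieces together, hybrid type-based sampling as described here can be pictured as Metropolis-Hastings wrapped around the fixed-count approximation: the approximate, type-based computation supplies a cheap joint proposal, and an accept/reject step against the exact bigram model corrects for the approximation. The skeleton below shows only that accept/reject logic; every interface on the model object (snapshot, log_prob, apply) and the proposal function are hypothetical scaffolding, not the paper's implementation:

    import math
    import random

    def hybrid_type_based_step(sites, model, propose_type_based, rng):
        """One block move over all sites of a type under the bigram model."""
        old_state = model.snapshot(sites)                        # hypothetical
        # proposal drawn without updating counts, plus its forward/backward
        # log proposal probabilities
        new_state, log_q_new, log_q_old = propose_type_based(sites, model, rng)
        log_p_old = model.log_prob(sites, old_state)             # exact model
        log_p_new = model.log_prob(sites, new_state)
        log_accept = (log_p_new - log_p_old) + (log_q_old - log_q_new)
        if rng.random() < math.exp(min(0.0, log_accept)):
            model.apply(sites, new_state)     # accept the joint block move
        else:
            model.apply(sites, old_state)     # reject: keep the current state

A rejected proposal leaves the chain unchanged; combined with Gibbs sampling of individual positions, this retains the efficiency of type-based moves while remaining correct for the bigram model.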

58 Similarly, we can impose our trivial rules of segmentation on the model. [sent-262, score-0.427]

59 For each entry of Wikipedia, we regarded the title as a noun phrase and used both the title and main text for segmentation. [sent-266, score-0.418]

60 We separately applied our segmentation procedure to each entry. [sent-267, score-0.427]

61 We applied both the title and main text to the morphological analyzer JUMAN to get an initial segmentation. [sent-274, score-0.522]

62 If the resultant segmentation conflicted with markup information, we overrode the former. [sent-275, score-0.488]

63 The initial segmentation was also used as the baseline. [sent-276, score-0.474]

64 The first condition ensures that there are segmentation ambiguities. [sent-293, score-0.427]

65 Models: We compared the unigram and bigram models. [sent-301, score-0.304]

66 As for inference procedures, we used collapsed Gibbs sampling (CL) for both models, type-based sampling (TB) for the unigram model and hybrid type-based sampling (HTB) for the bigram model. [sent-302, score-1.35]

67 We tested two mixing methods of the annotated corpus, direct mixing (DM) and back-off mixing (BM). [sent-303, score-0.704]

68 The unigram model has one Dirichlet process concentration hyperparameter α0 and the bigram model has α0 and α1. [sent-307, score-0.403]

69 (Footnote: Kyoto University Text Corpus.) Table 1: Results of segmentation of entry titles (F-score (precision/recall)). [sent-329, score-0.508]

70 Evaluation Metrics: We evaluated the segmentation accuracy of 500 entry titles. [sent-330, score-0.508]

71 We report the score of the most frequent segmentation among 10 samples. [sent-332, score-0.427]

72 2 Results: Table 1 shows the segmentation accuracy of various models. [sent-340, score-0.427]

73 As suggested by relatively low precision, unknown words tend to be over-segmented by the morphological analyzer. [sent-344, score-0.316]

74 In the best hyperparameter settings, the back-off mixing bigram model with hybrid type-based sampling (bigram + HTB + BM) significantly outperformed the baseline and achieved the best F-score. [sent-345, score-0.549]

75 This is simply because it did not change the initial segmentation much. [sent-349, score-0.474]

76 In contrast, type-based sampling (+TB) brought large moves to the unigram model and significantly hurt accuracy. [sent-350, score-0.362]

77 When combined with (hybrid) type-based sampling (+TB/+HTB), back-off mixing (+BM) increased accuracy over the corresponding non-mixing models. [sent-352, score-0.468]

78 To our surprise, collapsed sampling with mixing models (+CL, +DM/+BM) outperformed the baseline. [sent-355, score-0.605]

79 A diff is defined as the number of character-based disagreements between the baseline segmentation and a model output. [sent-361, score-0.427]
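
One plausible reading of this metric in code (our interpretation; here a segmentation is represented as the set of active gap indices between characters, and the paper's exact convention may differ):

    def diff(baseline_bounds, model_bounds, n_chars):
        """Number of character gaps on which two segmentations disagree."""
        return sum((g in baseline_bounds) != (g in model_bounds)
                   for g in range(n_chars - 1))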

80 We can see that collapsed sampling was almost unable to escape the initial state. [sent-363, score-0.495]

81 With type-based sampling (+TB), the unigram model went further than the bigram model, but in an undesired direction. [sent-364, score-0.587]

82 The bigram model with hybrid type-based sampling (bigram + HTB) converged in a few iterations. [sent-365, score-0.522]

83 Although the model with random initialization (+RAND) converged to a nearby point, the initial segmentation by the morphological analyzer yielded slightly faster convergence and better accuracy. [sent-366, score-0.935]

84 However, this seems wasteful, given that a large portion of text has only marginal influence on the segmentation of the noun phrase in question. [sent-374, score-0.628]

85 We sampled a boundary only if the corresponding local area contained a substring of the noun phrase in question. [sent-376, score-0.3]
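
A minimal sketch of that skip test, under the simplifying assumption that "contains a substring of the noun phrase" reduces to sharing at least one character (the window size and the exact substring criterion are not specified in this summary):

    def should_sample(window, phrase):
        """Resample a boundary only if its local window shares material
        with the target noun phrase; otherwise skip the position."""
        return any(ch in window for ch in phrase)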

86 However, the difference in convergence speed is obvious in the iteration-based comparison, although (hybrid) type-based sampling takes several times longer than collapsed sampling in the current naïve implementation. [sent-381, score-0.633]

87 5 Discussion: Figure 4 shows some segmentations corrected by the back-off mixing bigram model with hybrid type-based sampling. [sent-391, score-0.616]

88 In Japanese, people often change the script to derive a proper noun from a common noun, which a naïve analyzer fails to recognize. [sent-396, score-0.361]

89 As hiragana is mainly used to write function words and other basic words, segmentation errors concerning hiragana often have disastrous effects on applications of morphological analysis. [sent-413, score-0.907]

90 Most improvements come from the correction of over-segmentation because the initial segmentation by the analyzer shows a tendency toward over-segmentation. [sent-415, score-0.717]

91 On the other hand, the segmentation failed when our assumption about constituents did not hold. [sent-418, score-0.487]

92 We adopted nonparametric Bayesian language models and proposed hybrid type-based sampling, which can efficiently correct the segmentation given by the morphological analyzer. [sent-421, score-0.997]

93 Although supervised segmentation is very competitive, we showed that it can be supplemented with our unsupervised approach. [sent-422, score-0.427]

94 For example, in unknown word acquisition (Murawaki and Kurohashi, 2008), noun phrases are often acquired from text as single words. [sent-425, score-0.362]

95 In the future we will assign a POS tag to each word in order to use segmented noun phrases in morphological analysis. [sent-427, score-0.48]

96 Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. [sent-513, score-0.427]

97 Online acquisition of Japanese unknown morphemes using morphological constraints. [sent-518, score-0.356]

98 Chinese segmentation and new word detection using conditional random fields. [sent-544, score-0.427]

99 The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary. [sent-558, score-0.316]

100 Bayesian semi-supervised Chinese word segmentation for statistical machine translation. [sent-563, score-0.427]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('segmentation', 0.427), ('sampling', 0.248), ('mixing', 0.22), ('analyzer', 0.207), ('japanese', 0.206), ('morphological', 0.2), ('bigram', 0.19), ('noun', 0.154), ('goldwater', 0.142), ('hiragana', 0.14), ('collapsed', 0.137), ('tsuneyama', 0.122), ('unknown', 0.116), ('unigram', 0.114), ('mochihashi', 0.102), ('murawaki', 0.102), ('zerogram', 0.102), ('boundary', 0.099), ('positions', 0.09), ('bm', 0.088), ('hybrid', 0.084), ('entry', 0.081), ('encyclopedias', 0.081), ('htb', 0.081), ('typebased', 0.081), ('proposal', 0.08), ('boundaries', 0.079), ('segmented', 0.074), ('kurohashi', 0.073), ('bayesian', 0.073), ('sampler', 0.071), ('asahara', 0.07), ('katakana', 0.07), ('gibbs', 0.07), ('title', 0.068), ('wi', 0.066), ('yuji', 0.063), ('escape', 0.063), ('kudo', 0.062), ('dictionary', 0.061), ('markup', 0.061), ('tsune', 0.061), ('constituents', 0.06), ('chinese', 0.059), ('hl', 0.059), ('ip', 0.055), ('hyperparameter', 0.055), ('initialization', 0.054), ('block', 0.053), ('acceptance', 0.052), ('phrases', 0.052), ('segment', 0.051), ('jp', 0.05), ('character', 0.049), ('skip', 0.047), ('initial', 0.047), ('phrase', 0.047), ('state', 0.047), ('tb', 0.046), ('annotated', 0.044), ('concentration', 0.044), ('iob', 0.044), ('liang', 0.043), ('optimum', 0.042), ('encyclopedic', 0.041), ('segmentations', 0.041), ('entries', 0.041), ('chin', 0.041), ('escobar', 0.041), ('iterat', 0.041), ('jr', 0.041), ('kansai', 0.041), ('misidentify', 0.041), ('shinsuke', 0.041), ('tsuboi', 0.041), ('yamashiro', 0.041), ('yugo', 0.041), ('acquisition', 0.04), ('hyperparameters', 0.04), ('matsumoto', 0.04), ('median', 0.039), ('hh', 0.039), ('external', 0.039), ('nonparametric', 0.038), ('dp', 0.038), ('wl', 0.037), ('sadao', 0.037), ('accepted', 0.036), ('sharon', 0.036), ('tendency', 0.036), ('gazetteer', 0.035), ('adaptor', 0.035), ('stick', 0.035), ('undesired', 0.035), ('kyot', 0.035), ('characterbased', 0.035), ('wiki', 0.035), ('station', 0.035), ('invariant', 0.035), ('juman', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999809 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

2 0.20101565 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

3 0.18006976 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

Author: Nobuhiro Kaji ; Masaru Kitsuregawa

Abstract: Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.

4 0.13043705 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

Author: Christos Christodoulopoulos ; Sharon Goldwater ; Mark Steedman

Abstract: In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (context and alignment features, the latter from parallel corpora). Using only context features, our system yields results comparable to state-of-the-art, far better than a similar model without the one-class-per-type constraint. Using the additional features provides added benefit, and our final system outperforms the best published results on most of the 25 corpora tested.

5 0.12758477 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction

Author: Young-Bum Kim ; Joao Graca ; Benjamin Snyder

Abstract: In this paper, we consider the problem of unsupervised morphological analysis from a new angle. Past work has endeavored to design unsupervised learning methods which explicitly or implicitly encode inductive biases appropriate to the task at hand. We propose instead to treat morphological analysis as a structured prediction problem, where languages with labeled data serve as training examples for unlabeled languages, without the assumption of parallel data. We define a universal morphological feature space in which every language and its morphological analysis reside. We develop a novel structured nearest neighbor prediction method which seeks to find the morphological analysis for each unlabeled language which lies as close as possible in the feature space to a training language. We apply our model to eight inflecting languages, and induce nominal morphology with substantially higher accuracy than a traditional, MDL-based approach. Our analysis indicates that accuracy continues to improve substantially as the number of training languages increases.

6 0.1145964 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation

7 0.10817471 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

8 0.087913185 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

9 0.087150469 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

10 0.072319746 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

11 0.071999989 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns

12 0.057781879 119 emnlp-2011-Semantic Topic Models: Combining Word Distributional Statistics and Dictionary Definitions

13 0.05216768 125 emnlp-2011-Statistical Machine Translation with Local Language Models

14 0.051638428 18 emnlp-2011-Analyzing Methods for Improving Precision of Pivot Based Bilingual Dictionaries

15 0.049867641 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

16 0.049652949 146 emnlp-2011-Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance

17 0.047790591 58 emnlp-2011-Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training

18 0.047547668 131 emnlp-2011-Syntactic Decision Tree LMs: Random Selection or Intelligent Design?

19 0.047369625 10 emnlp-2011-A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions

20 0.047326166 17 emnlp-2011-Active Learning with Amazon Mechanical Turk


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.206), (1, -0.029), (2, -0.052), (3, -0.011), (4, -0.032), (5, 0.049), (6, -0.131), (7, 0.006), (8, -0.294), (9, 0.039), (10, 0.034), (11, 0.05), (12, -0.077), (13, 0.094), (14, -0.2), (15, -0.143), (16, -0.152), (17, -0.109), (18, 0.284), (19, 0.212), (20, -0.009), (21, -0.097), (22, -0.104), (23, 0.016), (24, -0.097), (25, -0.075), (26, -0.106), (27, 0.051), (28, -0.006), (29, 0.015), (30, 0.026), (31, -0.018), (32, 0.099), (33, -0.117), (34, -0.013), (35, 0.084), (36, 0.081), (37, -0.089), (38, 0.009), (39, -0.127), (40, -0.049), (41, 0.008), (42, 0.172), (43, 0.037), (44, -0.068), (45, 0.043), (46, -0.008), (47, -0.055), (48, 0.09), (49, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97234648 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

2 0.74300474 48 emnlp-2011-Enhancing Chinese Word Segmentation Using Unlabeled Data

Author: Weiwei Sun ; Jia Xu

Abstract: This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to include features derived from unlabeled data to a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea about transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words which appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words respectively.

3 0.63907951 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

Author: Nobuhiro Kaji ; Masaru Kitsuregawa

Abstract: Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.

4 0.50485009 88 emnlp-2011-Linear Text Segmentation Using Affinity Propagation

Author: Anna Kazantseva ; Stan Szpakowicz

Abstract: This paper presents a new algorithm for linear text segmentation. It is an adaptation of Affinity Propagation, a state-of-the-art clustering algorithm in the framework of factor graphs. Affinity Propagation for Segmentation, or APS, receives a set of pairwise similarities between data points and produces segment boundaries and segment centres: data points which best describe all other data points within the segment. APS iteratively passes messages in a cyclic factor graph, until convergence. Each iteration works with information on all available similarities, resulting in high-quality results. APS scales linearly for realistic segmentation tasks. We derive the algorithm from the original Affinity Propagation formulation, and evaluate its performance on topical text segmentation in comparison with two state-of-the-art segmenters. The results suggest that APS performs on par with or outperforms these two very competitive baselines.

5 0.45384684 140 emnlp-2011-Universal Morphological Analysis using Structured Nearest Neighbor Prediction

Author: Young-Bum Kim ; Joao Graca ; Benjamin Snyder

Abstract: In this paper, we consider the problem of unsupervised morphological analysis from a new angle. Past work has endeavored to design unsupervised learning methods which explicitly or implicitly encode inductive biases appropriate to the task at hand. We propose instead to treat morphological analysis as a structured prediction problem, where languages with labeled data serve as training examples for unlabeled languages, without the assumption of parallel data. We define a universal morphological feature space in which every language and its morphological analysis reside. We develop a novel structured nearest neighbor prediction method which seeks to find the morphological analysis for each unlabeled language which lies as close as possible in the feature space to a training language. We apply our model to eight inflecting languages, and induce nominal morphology with substantially higher accuracy than a traditional, MDL-based approach. Our analysis indicates that accuracy continues to improve substantially as the number of training languages increases.

6 0.43674898 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

7 0.42329344 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

8 0.29766867 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study

9 0.28077865 23 emnlp-2011-Bootstrapped Named Entity Recognition for Product Attribute Extraction

10 0.26073304 2 emnlp-2011-A Cascaded Classification Approach to Semantic Head Recognition

11 0.24621654 78 emnlp-2011-Large-Scale Noun Compound Interpretation Using Bootstrapping and the Web as a Corpus

12 0.2378787 93 emnlp-2011-Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation

13 0.22992511 111 emnlp-2011-Reducing Grounded Learning Tasks To Grammatical Inference

14 0.22500969 75 emnlp-2011-Joint Models for Chinese POS Tagging and Dependency Parsing

15 0.22044644 19 emnlp-2011-Approximate Scalable Bounded Space Sketch for Large Data NLP

16 0.2160053 143 emnlp-2011-Unsupervised Information Extraction with Distributional Prior Knowledge

17 0.20842959 69 emnlp-2011-Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

18 0.20179939 96 emnlp-2011-Multilayer Sequence Labeling

19 0.20107 113 emnlp-2011-Relation Acquisition using Word Classes and Partial Patterns

20 0.19835757 54 emnlp-2011-Exploiting Parse Structures for Native Language Identification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(15, 0.387), (23, 0.124), (36, 0.034), (37, 0.017), (45, 0.074), (53, 0.018), (54, 0.028), (57, 0.034), (62, 0.015), (64, 0.024), (66, 0.039), (69, 0.015), (79, 0.044), (82, 0.016), (87, 0.012), (90, 0.011), (96, 0.024), (98, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79482591 99 emnlp-2011-Non-parametric Bayesian Segmentation of Japanese Noun Phrases

Author: Yugo Murawaki ; Sadao Kurohashi

Abstract: A key factor of high quality word segmentation for Japanese is a high-coverage dictionary, but it is costly to manually build such a lexical resource. Although external lexical resources for human readers are potentially good knowledge sources, they have not been utilized due to differences in segmentation criteria. To supplement a morphological dictionary with these resources, we propose a new task of Japanese noun phrase segmentation. We apply non-parametric Bayesian language models to segment each noun phrase in these resources according to the statistical behavior of its supposed constituents in text. For inference, we propose a novel block sampling procedure named hybrid type-based sampling, which has the ability to directly escape a local optimum that is not too distant from the global optimum. Experiments show that the proposed method efficiently corrects the initial segmentation given by a morphological analyzer.

2 0.75067329 120 emnlp-2011-Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions

Author: Richard Socher ; Jeffrey Pennington ; Eric H. Huang ; Andrew Y. Ng ; Christopher D. Manning

Abstract: We introduce a novel machine learning framework based on recursive autoencoders for sentence-level prediction of sentiment label distributions. Our method learns vector space representations for multi-word phrases. In sentiment prediction tasks these representations outperform other state-of-the-art approaches on commonly used datasets, such as movie reviews, without using any pre-defined sentiment lexica or polarity shifting rules. We also evaluate the model’s ability to predict sentiment distributions on a new dataset based on confessions from the experience project. The dataset consists of personal user stories annotated with multiple labels which, when aggregated, form a multinomial distribution that captures emotional reactions. Our algorithm can more accurately predict distributions over such labels compared to several competitive baselines.

3 0.58024341 8 emnlp-2011-A Model of Discourse Predictions in Human Sentence Processing

Author: Amit Dubey ; Frank Keller ; Patrick Sturt

Abstract: This paper introduces a psycholinguistic model of sentence processing which combines a Hidden Markov Model noun phrase chunker with a co-reference classifier. Both models are fully incremental and generative, giving probabilities of lexical elements conditional upon linguistic structure. This allows us to compute the information theoretic measure of surprisal, which is known to correlate with human processing effort. We evaluate our surprisal predictions on the Dundee corpus of eye-movement data and show that our model achieves a better fit with human reading times than a syntax-only model which does not have access to co-reference information.

4 0.43846118 124 emnlp-2011-Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words

Author: Nobuhiro Kaji ; Masaru Kitsuregawa

Abstract: Word boundaries within noun compounds are not marked by white spaces in a number of languages, unlike in English, and it is beneficial for various NLP applications to split such noun compounds. In the case of Japanese, noun compounds made up of katakana words (i.e., transliterated foreign words) are particularly difficult to split, because katakana words are highly productive and are often out-of-vocabulary. To overcome this difficulty, we propose using monolingual and bilingual paraphrases of katakana noun compounds for identifying word boundaries. Experiments demonstrated that splitting accuracy is substantially improved by extracting such paraphrases from unlabeled textual data, the Web in our case, and then using that information for constructing splitting models.

5 0.43652767 39 emnlp-2011-Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

Author: Markus Dreyer ; Jason Eisner

Abstract: We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50–100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.

6 0.42859969 1 emnlp-2011-A Bayesian Mixture Model for PoS Induction Using Multiple Features

7 0.41282722 85 emnlp-2011-Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming

8 0.41231754 126 emnlp-2011-Structural Opinion Mining for Graph-based Sentiment Representation

9 0.40306693 97 emnlp-2011-Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French

10 0.40168613 117 emnlp-2011-Rumor has it: Identifying Misinformation in Microblogs

11 0.40127274 63 emnlp-2011-Harnessing WordNet Senses for Supervised Sentiment Classification

12 0.39627576 77 emnlp-2011-Large-Scale Cognate Recovery

13 0.3936165 104 emnlp-2011-Personalized Recommendation of User Comments via Factor Models

14 0.39011103 28 emnlp-2011-Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances

15 0.38946918 142 emnlp-2011-Unsupervised Discovery of Discourse Relations for Eliminating Intra-sentence Polarity Ambiguities

16 0.38942751 35 emnlp-2011-Correcting Semantic Collocation Errors with L1-induced Paraphrases

17 0.38848245 108 emnlp-2011-Quasi-Synchronous Phrase Dependency Grammars for Machine Translation

18 0.38709751 6 emnlp-2011-A Generate and Rank Approach to Sentence Paraphrasing

19 0.38677585 136 emnlp-2011-Training a Parser for Machine Translation Reordering

20 0.38471842 98 emnlp-2011-Named Entity Recognition in Tweets: An Experimental Study