acl acl2011 acl2011-6 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe
Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).
Reference: text
sentIndex sentText sentNum sentScore
1 A Comprehensive Dictionary of Multiword Expressions Kosho Shudo1, Akira Kurahone2, and Toshifumi Tanabe1 1Fukuoka University, Nanakuma, Jonan-ku, Fukuoka, 814-0180, JAPAN { shudo tanabe } @ fukuoka-u . [sent-1, score-0.117]
2 j p , Abstract It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. [sent-6, score-0.304]
3 This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. [sent-7, score-0.293]
4 The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. [sent-8, score-0.201]
5 The dictionary contains about 104,000 expressions, potentially 750,000 expressions. [sent-9, score-0.068]
6 This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. [sent-10, score-0.091]
7 1 Introduction Linguistically idiosyncratic multiword expressions occur in authentic sentences with an unexpectedly high frequency. [sent-13, score-0.41]
8 2002), we have become aware that a proper solution of idiosyncratic multiword expressions (MWEs) is one of the most difficult and intriguing problems in NLP. [sent-15, score-0.432]
9 In principle, the nature of the idiosyncrasy of MWEs is twofold: one is idiomaticity, i. [sent-16, score-0.029]
10 Many attempts have been made to extract these expressions from corpora, mainly using automated methods that exploit statistical means. [sent-19, score-0.128]
11 Recognizing the crucial importance of such expressions, one of the authors of the current paper began in the 1970s to construct a Japanese electronic dictionary with comprehensive inclusion of idioms, idiom-like expressions, and probabilistically idiosyncratic expressions for general use. [sent-21, score-0.348]
12 It has approximately 104,000 dictionary entries and covers potentially at least 750,000 expressions. [sent-23, score-0.137]
13 A large notational, syntactic, and semantic diversity of contained expressions 2. [sent-25, score-0.128]
14 A detailed description of syntactic function and structure for each entry expression 3. [sent-26, score-0.176]
15 An indication of the syntactic flexibility of entry expressions (i. [sent-27, score-0.211]
16 , possibility of internal modification of constituent words) of entry expressions. [sent-29, score-0.132]
17 In section 3, we propose and describe the criteria for selecting MWEs and introduce a number of classes of multiword expressions. [sent-31, score-0.201]
18 In section 4, we outline the format and contents of the JDMWE, discussing the information on notational variants, syntactic functions, syntactic structures, and the syntactic flexibility of MWEs. [sent-32, score-0.209]
19 In section 5, we describe and explain the contextual conditions stipulated in the JDMWE. [sent-33, score-0.026]
20 In section 6, we illustrate some important statistical properties of the JDMWE by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, the LDC2009T08, generated by Google Inc. [sent-34, score-0.068]
21 2 Related Work Gross (1986) analyzed French compound adverbs and compound verbs. [sent-38, score-0.142]
22 Jackendoff (1997) notes that an English speaker’s lexicon would contain as many MWEs as single words. [sent-42, score-0.032]
23 (2002) pointed out that 41% of the entries of WordNet 1. [sent-44, score-0.039]
24 (2003) reported that 44% of Japanese verbs are VV-type compounds. [sent-46, score-0.028]
25 These and other similar observations underscore the great need for a well-designed, extensive MWE lexicon for practical natural language processing. [sent-47, score-0.032]
26 Examples include the following: Gross (1986) reported on a dictionary of French verbal MWEs with description of 22 syntactic structures; Kuiper et al. [sent-49, score-0.188]
27 (2003) constructed a database of 13,000 English idioms tagged with syntactic structures; Villavicencio (2004) attempted to compile lexicons of English idioms and verb-particle constructions (VPCs) by augmenting existing single-word dictionaries with specific tables; Baptista et al. [sent-50, score-0.265]
28 (2004) reported on a dictionary of 3,500 Portuguese verbal MWEs with ten syntactic structures; Fellbaum et al. [sent-51, score-0.157]
29 (2006) reported corpus-based studies in developing German verb phrase idiom resources; and recently, Laporte et al. [sent-52, score-0.126]
30 (2008) have reported on a dictionary of 6,800 French adverbial MWEs annotated with 15 syntactic structures. [sent-53, score-0.14]
31 Our JDMWE approach differs from these studies in that it can treat more comprehensive types of MWEs. [sent-54, score-0.044]
32 (2006) attempted to extract English verb-object type idioms by recognizing their structural fixedness in terms of mutual information and relative entropy. [sent-60, score-0.136]
33 In spite of these and many similar efforts, it is still difficult to adequately extract MWEs from corpora using a statistical approach, because regarding the types of multiword expressions, realistically speaking, the corpus-wide distribution can be far from exhaustive. [sent-62, score-0.201]
34 Paradoxically, to compile an MWE lexicon we need a reliable standard MWE lexicon, as it is impossible to evaluate the automatic extraction by recall rate without such a reference. [sent-63, score-0.059]
35 The conventional idiom dictionaries published for human readers have been occasionally used for the evaluation of automatic extraction methods in some past studies. [sent-64, score-0.071]
36 In addition, they provide no systematic information on the notational variants, syntactic functions, or syntactic structures of the entry expressions. [sent-66, score-0.237]
37 (1980) compiled a lexicon of 3,500 functional multiword expressions and used the lexicon for a morphological analysis of Japanese. [sent-69, score-0.423]
38 (2009) studied a disambiguation method of semantically ambiguous idioms using 146 basic idioms. [sent-80, score-0.098]
39 In view of this, we have manually extracted multiword expressions that have definite syntactic, semantic, or communicative functions and are linguistically idiosyncratic from a variety of publications, such as newspaper articles, journals, magazines, novels, and dictionaries. [sent-82, score-0.513]
40 In principle, the idiosyncrasy of MWEs is twofold: first, the semantic non-compositionality (i. [sent-83, score-0.029]
41 “red stranger”) is selected because it has a definite nominal meaning of “complete stranger” and neither 真紅(w1’)-の-他人 sinku-no-tanin nor レッド(w1’)-の-他人 reddo-no-tanin means “complete stranger”. [sent-98, score-0.04]
42 This and the following transition probability condition constitute another criterion that we adopt to define what an MWE is. [sent-104, score-0.029]
43 With this definition, for example, 手-を-拱く te-wo-komaneku “fold arms” is selected as an MWE because it is a well-formed verb phrase and pb(手 | を 拱く) is judged empirically to be very high. [sent-106, score-0.055]
44 Although the probabilistic judgment was performed, for each expression in turn, on the basis of the developer’s empirical language model, the resulting dataset is consistent with this criterion on 1 These classes are not necessarily disjoint. [sent-108, score-0.091]
45 lower load from the shoulder) Phrase “take a big load off one’s mind” Table 2: Probabilistically Idiosyncratic Expressions With entries like these, an NLP system can use the JDMWE as a reliable reference while effectively disambiguating the structures in the syntactic analysis process. [sent-116, score-0.11]
46 Of the MWEs in the JDMWE, approximately 38% and 92% of them were judged to meet criterion 3. [sent-117, score-0.079]
47 Figure 2: Approximate constituent ratio of noncompositional MWEs and probabilistically bound MWEs Figure 3: Example JDMWE entry 2 These classes are not necessarily disjoint. [sent-121, score-0.157]
48 4 Contents of the JDMWE The JDMWE has approximately 104,000 entries, one for each MWE, composed of six fields, namely, Field-H, -N, -F, -S, -Cf, and -Cb. [sent-122, score-0.03]
49 The dictionary entry form of an MWE is stated in Field-H in the form of a non-segmented hira-kana (phonetic character) string. [sent-123, score-0.132]
50 1 Notational Information (Field-N) Japanese has three notational options: hira-kana, kata-kana, and kanji. [sent-126, score-0.083]
51 As we have many kanji characters that are both homophonic and synonymous, sentences can contain kanji replaceable by others. [sent-129, score-0.088]
52 In addition, the inflectional suffix of some verbs can be absent in some contexts. [sent-130, score-0.028]
53 Therefore, the entry whose Field-H (the first field) is きのいいやつ ki-no-ii-yatu (lit. [sent-133, score-0.041]
54 2 Functional Information (Field-F) Linguistic functions of MWEs can be simply classified by means of codes, as shown in Tables 3 and 4. [sent-141, score-0.026]
55 Field-F is filled with one of those codes which corresponds to a root node label in the syntactic tree representation of a MWE. [sent-142, score-0.042]
56 For example, an idiom 真っ赤-な-嘘 makka-na-uso (lit. [sent-156, score-0.071]
57 This description represents the structure shown in Figure 4, where K00 and N are POS symbols denoting an adjective-verb stem and a noun, respectively. [sent-158, score-0.031]
58 The JDMWE contains 49,000 verbal entries, making this the largest functional class in the JDMWE. [sent-162, score-0.077]
59 For these verbal entries, more than 90 patterns are actually used as structural descriptors in Field-S. [sent-163, score-0.085]
60 This fact can indicate the broadness of the structural spectrum of Japanese verbal MWEs. [sent-164, score-0.085]
61 “being suddenly overcome with fatigue” Table 5: Examples of structural types of verbal MWEs (N: noun, V23: verb (adverbial form), V30: verb (end form), Adv: adverb; wo, ga, ni, no, de, te, and ba: particles) 4. [sent-179, score-0.155]
62 2 Coordinate Structure Approximately 2,500 MWEs in the JDMWE contain internal coordinate structures. [sent-181, score-0.076]
63 The coordinative phrase specification usually requires that the conjuncts must be parallel with respect to the syntactic function of the constituents appearing in the bracketed description. [sent-183, score-0.042]
64 For example, an expression 後-は-野-と-なれ-山-と-なれ ato-ha-no-to-nare-yama-to-nare (lit. [sent-184, score-0.062]
65 “the rest might become either a field or a mountain”) “what will be, will be”, has an internal coordinate structure. [sent-185, score-0.076]
66 This description represents the structure shown in Figure 5, where V60 denotes an imperative form of the verb. [sent-187, score-0.031]
67 Figure 5: Example of the coordinate shown by “<” and “>” in Field-S 4. [sent-188, score-0.031]
68 3 Non-phrasal Structure Approximately 250 MWEs in the JDMWE are syntactically ill-formed in the sense of context-free grammar but still form a syntactic unit on their own. [sent-190, score-0.042]
69 For example, 揺り籠-から-墓場-まで yurikago-kara-hakaba-made “from the cradle to the grave” is an adjunct of two postpositional phrases but is often used as a state-describing noun as in 揺り籠-から-墓場-まで-の-保証 yurikago-kara-hakaba-made-no-hoshou (lit. [sent-191, score-0.044]
70 security of from cradle to grave) “security from the cradle to the grave”. [sent-192, score-0.088]
71 Thus Field-F and Field-S have a functional code Nk and a description [[N kara][[N made] $]], respectively. [sent-193, score-0.061]
72 The symbol “$” denotes a null constituent occupying the position of the governor on which this MWE depends. [sent-194, score-0.046]
73 Figure 6: Example of a non-phrasal expression with a null constituent marked with “$” in Field-S The total number of structural types specified in Field-S is nearly 6,000. [sent-196, score-0.146]
74 This indicates that Japanese MWEs present a wide structural variety. [sent-197, score-0.038]
75 In our system, this aspect is captured by prefixing a modifiable element of the structural description stated in the Field-S with an asterisk “*”. [sent-201, score-0.128]
76 An adverbial MWE 上-に-述べ-た-様-に ue-ni-nobe-ta-you-ni “as I explained above” is one such MWE and thus has a description [[[[[N ni] *V23] ta] N] ni] in Field-S, meaning that the third element V23 is a verb that can be modified internally by adverb phrases. [sent-202, score-0.096]
77 Since the asterisk designates such optional phrasal modification, our system allows a derivative expression like 理由-を-上-に-詳しく-述べ-た-様-に riyuu-wo-ue-ni-kuwasiku-nobe-ta-you-ni “as I explained in detail the reason above”, which contains two additional, internal modifiers. [sent-203, score-0.143]
78 4 Figure 7: Example of internal modifiability marked by “*” in Field-S 4 The positions to be taken by an internal modifier can be easily decided by the structural description given in Field-S along with the nest structure requirement. [sent-205, score-0.218]
79 Roughly speaking, 30,000 MWEs in the JDMWE have no asterisk in their Field-S. [sent-206, score-0.036]
80 Our rigid examination reveals that internal modification is not allowed for them. [sent-207, score-0.045]
81 , they require co-occurrence of a particular syntactic phrase in the context that immediately precedes them. [sent-210, score-0.042]
82 This adnominal modifier co-occurrence requirement is stipulated in Field-Cf by a code . [sent-216, score-0.082]
83 Similarly, backward contextual requirements, of which there are about 70, are stated in Field-Cb. [sent-218, score-0.045]
84 However, we can confirm that 3,600 Japanese standard idioms that Sato (2007) listed from five Japanese idiom dictionaries published for human readers are included in the JDMWE as a proper subset. [sent-221, score-0.169]
85 In addition, the JDMWE contains the information about their syntactic functions, structures, and flexibilities. [sent-222, score-0.042]
86 We will refer to trigram w1w2w3 as an NpV-trigram only when w1 and w3 are restricted to a noun and a verb (end form), respectively, and w2 is one of the following case-particles: accusative を wo, subjective が ga, or dative に ni. [sent-228, score-0.035]
87 5 We write the number of occurrences of an expression x, counted in the GND, as C(x). [sent-229, score-0.062]
88 First, we obtain from the GND sets G, T, D, B, and Ri’s defined below, using a Japanese word dictionary IPADIC (Asahara et al. [sent-230, score-0.068]
89 2% =(|R1|/|B|)×100 of trigrams in T have verbs that occur most frequently in the GND, succeeding the individual bigrams. [sent-235, score-0.028]
90 This provides a measure of the flatness of the pf(w3|w1w2) distribution canceling out the influence of the number N of verb types w3’s. [sent-253, score-0.035]
91 $H_f(w_3 \mid w_1 w_2) = -\dfrac{\sum_{w_3} p_f(w_3 \mid w_1 w_2)\,\log p_f(w_3 \mid w_1 w_2)}{\log N}$ After arranging 110,822 bigrams in D in ascending order of Hf(w3|w1w2), we divided them into 20 intervals A1, A2, ..., A20, each with an equal number of bigrams (5,542). [sent-254, score-0.076]
92 We then examined how many bigrams in B were included in each interval. [sent-255, score-0.059]
93 Figures 9(a) and (b) plot the resulting constituent ratio of the bigrams in B and the mean value of Hf(w3|w1w2)’s in each interval, respectively. [sent-256, score-0.107]
94 We found, for example, that 1,262 out of 5,542 bigrams are in B for the first interval, i. [sent-257, score-0.038]
95 From this, we realize the macroscopic tendency that the larger the entropy Hf(w3|w1w2), or equivalently the perplexity of the succeeding verb w3, a bigram w1w2 has, the less likely it is adopted as a prefix of a trigram in T. [sent-264, score-0.035]
96 Taking the results in Figure 8 and Figure 9 together, we can presume that not only frequently but also exclusively occurring verbs would be the preferred choice in T. [sent-265, score-0.028]
97 However, the results imply a general validity of the JDMWE since the same criteria for selection were applied to all kinds of multiword expressions. [sent-268, score-0.224]
98 7 Concluding Remarks The JDMWE is a slotted tree bank for idiosyncratic multiword expressions, annotated with detailed notational, syntactic information. [sent-273, score-0.324]
99 For example, the usage of Japanese onomatopoeic adverbs, which are mostly bound probabilistically to specific verbs or adjectives, is extensively catalogued in the JDMWE. [sent-278, score-0.119]
100 7 The time required to compile this dictionary is estimated at 24,000 working hours. [sent-292, score-0.095]
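The normalized entropy Hf(w3|w1w2) and the 20-interval analysis described in sentences 90 to 94 can be illustrated with a small sketch. This is not the authors' code: the function names, the romanized toy bigrams, and all counts below are hypothetical, and treating a bigram followed by a single verb type as having zero entropy is an assumption (the formula is undefined when N = 1).

```python
import math


def normalized_entropy(verb_counts):
    """Entropy of the verbs following one bigram, divided by log N.

    verb_counts maps each verb w3 to its count after the bigram w1w2;
    N is the number of distinct verb types with a positive count.
    """
    counts = [c for c in verb_counts.values() if c > 0]
    n_types = len(counts)
    if n_types < 2:
        # Assumption: log N = 0 here, so we define the value as 0.0.
        return 0.0
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts)
    return h / math.log(n_types)


def equal_intervals(sorted_items, k):
    """Split a sorted list into consecutive intervals of ceil(len/k) items.

    The last interval may be shorter, and fewer than k intervals can
    result when the list is very short.
    """
    size = math.ceil(len(sorted_items) / k)
    return [sorted_items[i:i + size] for i in range(0, len(sorted_items), size)]


def ratio_in_reference(intervals, reference):
    """Fraction of items in each interval that also occur in the reference set."""
    return [sum(1 for x in iv if x in reference) / len(iv) for iv in intervals]


# Hypothetical toy data: bigram (noun, particle) -> counts of following verbs.
toy_counts = {
    ("te", "wo"): {"komaneku": 98, "arau": 2},
    ("hon", "wo"): {"yomu": 40, "kau": 35, "kaku": 25},
    ("mizu", "wo"): {"nomu": 50, "kumu": 50},
    ("ki", "ni"): {"suru": 90, "naru": 10},
}
# Hypothetical stand-in for B: bigrams that prefix some dictionary trigram.
toy_reference = {("te", "wo"), ("ki", "ni")}

ranked = sorted(toy_counts, key=lambda bg: normalized_entropy(toy_counts[bg]))
intervals = equal_intervals(ranked, 2)
ratios = ratio_in_reference(intervals, toy_reference)
print(ratios)
```

With 110,822 items and k = 20, `equal_intervals` yields intervals of 5,542 items (the last one slightly shorter), which is in line with the interval size quoted in sentence 91.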
wordName wordTfidf (topN-words)
[('jdmwe', 0.572), ('mwes', 0.491), ('mwe', 0.349), ('multiword', 0.201), ('japanese', 0.196), ('expressions', 0.128), ('idioms', 0.098), ('gnd', 0.088), ('shudo', 0.088), ('notational', 0.083), ('idiosyncratic', 0.081), ('hf', 0.077), ('idiom', 0.071), ('dictionary', 0.068), ('pf', 0.063), ('expression', 0.062), ('compound', 0.06), ('stranger', 0.059), ('ga', 0.057), ('probabilistically', 0.047), ('verbal', 0.047), ('interval', 0.046), ('constituent', 0.046), ('internal', 0.045), ('sag', 0.045), ('cradle', 0.044), ('grave', 0.044), ('ipadic', 0.044), ('kanji', 0.044), ('onomatopoeic', 0.044), ('yoshimura', 0.044), ('ni', 0.042), ('syntactic', 0.042), ('entry', 0.041), ('definite', 0.04), ('entries', 0.039), ('wi', 0.039), ('bigrams', 0.038), ('structural', 0.038), ('communicative', 0.037), ('wo', 0.037), ('adv', 0.036), ('asterisk', 0.036), ('sad', 0.036), ('verb', 0.035), ('kudo', 0.034), ('lexicon', 0.032), ('wn', 0.032), ('coordinate', 0.031), ('description', 0.031), ('adverbial', 0.03), ('gross', 0.03), ('approximately', 0.03), ('functional', 0.03), ('modifier', 0.03), ('structures', 0.029), ('baptista', 0.029), ('fazly', 0.029), ('idiomaticity', 0.029), ('idiosyncrasy', 0.029), ('iexplained', 0.029), ('jackendoff', 0.029), ('koyama', 0.029), ('kuiper', 0.029), ('laporte', 0.029), ('mimetic', 0.029), ('modifiability', 0.029), ('tanabe', 0.029), ('uchiyama', 0.029), ('villavicencio', 0.029), ('vpcs', 0.029), ('criterion', 0.029), ('verbs', 0.028), ('compile', 0.027), ('fellbaum', 0.027), ('functions', 0.026), ('accumulative', 0.026), ('adnominal', 0.026), ('akira', 0.026), ('hashimoto', 0.026), ('stipulated', 0.026), ('wind', 0.026), ('french', 0.025), ('baldwin', 0.025), ('comprehensive', 0.024), ('asahara', 0.024), ('forward', 0.023), ('ratio', 0.023), ('stated', 0.023), ('validity', 0.023), ('compilation', 0.022), ('guy', 0.022), ('intriguing', 0.022), ('backward', 0.022), ('adverbs', 0.022), ('examined', 0.021), ('yoshida', 0.021), ('face', 0.021), 
('studies', 0.02), ('judged', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions
Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe
Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).
2 0.44611287 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions
Author: Annette Hautli ; Sebastian Sulger
Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
3 0.09195365 238 acl-2011-P11-2093 k2opt.pdf
Author: empty-author
Abstract: We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging. Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set. We also find that the method is both robust to out-of-domain data, and can be easily adapted through the use of a combination of partial annotation and active learning.
4 0.066668361 193 acl-2011-Language-independent compound splitting with morphological operations
Author: Klaus Macherey ; Andrew Dai ; David Talbot ; Ashok Popat ; Franz Och
Abstract: Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach.
5 0.064254805 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation
Author: Xianchao Wu ; Takuya Matsuzaki ; Jun'ichi Tsujii
Abstract: In the present paper, we propose the effective usage of function words to generate generalized translation rules for forest-based translation. Given aligned forest-string pairs, we extract composed tree-to-string translation rules that account for multiple interpretations of both aligned and unaligned target function words. In order to constrain the exhaustive attachments of function words, we limit to bind them to the nearby syntactic chunks yielded by a target dependency parser. Therefore, the proposed approach can not only capture source-tree-to-target-chunk correspondences but can also use forest structures that compactly encode an exponential number of parse trees to properly generate target function words during decoding. Extensive experiments involving large-scale English-to-Japanese translation revealed a significant improvement of 1.8 points in BLEU score, as compared with a strong forest-to-string baseline system.
6 0.056935873 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution
7 0.051380321 197 acl-2011-Latent Class Transliteration based on Source Language Origin
8 0.042001035 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation
9 0.041053869 132 acl-2011-Extracting Paraphrases from Definition Sentences on the Web
10 0.040448517 247 acl-2011-Pre- and Postprocessing for Statistical Machine Translation into Germanic Languages
11 0.039579697 131 acl-2011-Extracting Opinion Expressions and Their Polarities - Exploration of Pipelines and Joint Models
12 0.03935283 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
13 0.038711105 333 acl-2011-Web-Scale Features for Full-Scale Parsing
14 0.037285857 255 acl-2011-Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
15 0.034891598 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary
16 0.033878528 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
17 0.033605229 127 acl-2011-Exploiting Web-Derived Selectional Preference to Improve Statistical Dependency Parsing
18 0.032750301 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words
19 0.032705791 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
20 0.032481875 87 acl-2011-Corpus Expansion for Statistical Machine Translation with Semantic Role Label Substitution Rules
topicId topicWeight
[(0, 0.105), (1, -0.0), (2, -0.018), (3, -0.015), (4, -0.027), (5, 0.006), (6, 0.05), (7, -0.004), (8, -0.003), (9, -0.012), (10, -0.041), (11, -0.009), (12, -0.022), (13, 0.022), (14, 0.018), (15, -0.059), (16, -0.004), (17, 0.036), (18, 0.056), (19, -0.006), (20, 0.024), (21, 0.048), (22, -0.04), (23, -0.094), (24, -0.033), (25, -0.034), (26, 0.025), (27, -0.011), (28, -0.132), (29, 0.002), (30, 0.163), (31, 0.069), (32, 0.178), (33, -0.099), (34, -0.13), (35, 0.251), (36, -0.078), (37, -0.111), (38, -0.22), (39, -0.05), (40, 0.029), (41, -0.053), (42, 0.321), (43, 0.299), (44, -0.196), (45, -0.172), (46, 0.213), (47, 0.099), (48, -0.119), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.9515 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions
Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe
Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).
2 0.94529766 134 acl-2011-Extracting and Classifying Urdu Multiword Expressions
Author: Annette Hautli ; Sebastian Sulger
Abstract: This paper describes a method for automatically extracting and classifying multiword expressions (MWEs) for Urdu on the basis of a relatively small unannotated corpus (around 8.12 million tokens). The MWEs are extracted by an unsupervised method and classified into two distinct classes, namely locations and person names. The classification is based on simple heuristics that take the co-occurrence of MWEs with distinct postpositions into account. The resulting classes are evaluated against a hand-annotated gold standard and achieve an f-score of 0.5 and 0.746 for locations and persons, respectively. A target application is the Urdu ParGram grammar, where MWEs are needed to generate a more precise syntactic and semantic analysis.
3 0.26665235 238 acl-2011-P11-2093 k2opt.pdf
Author: empty-author
Abstract: We present a pointwise approach to Japanese morphological analysis (MA) that ignores structure information during learning and tagging. Despite the lack of structure, it is able to outperform the current state-of-the-art structured approach for Japanese MA, and achieves accuracy similar to that of structured predictors using the same feature set. We also find that the method is both robust to out-of-domain data, and can be easily adapted through the use of a combination of partial annotation and active learning.
4 0.24499677 193 acl-2011-Language-independent compound splitting with morphological operations
Author: Klaus Macherey ; Andrew Dai ; David Talbot ; Ashok Popat ; Franz Och
Abstract: Translating compounds is an important problem in machine translation. Since many compounds have not been observed during training, they pose a challenge for translation systems. Previous decompounding methods have often been restricted to a small set of languages as they cannot deal with more complex compound forming processes. We present a novel and unsupervised method to learn the compound parts and morphological operations needed to split compounds into their compound parts. The method uses a bilingual corpus to learn the morphological operations required to split a compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set of compound part candidates. We evaluate our method within a machine translation task and show significant improvements for various languages to show the versatility of the approach.
5 0.20835437 102 acl-2011-Does Size Matter - How Much Data is Required to Train a REG Algorithm?
Author: Mariet Theune ; Ruud Koolen ; Emiel Krahmer ; Sander Wubben
Abstract: In this paper we investigate how much data is required to train an algorithm for attribute selection, a subtask of Referring Expressions Generation (REG). To enable comparison between different-sized training sets, a systematic training method was developed. The results show that depending on the complexity of the domain, training on 10 to 20 items may already lead to a good performance.
6 0.18598162 219 acl-2011-Metagrammar engineering: Towards systematic exploration of implemented grammars
7 0.18420573 239 acl-2011-P11-5002 k2opt.pdf
8 0.18374871 249 acl-2011-Predicting Relative Prominence in Noun-Noun Compounds
9 0.18279266 297 acl-2011-That's What She Said: Double Entendre Identification
10 0.18041883 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
11 0.17693241 291 acl-2011-SystemT: A Declarative Information Extraction System
12 0.17634588 9 acl-2011-A Cross-Lingual ILP Solution to Zero Anaphora Resolution
13 0.16688623 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
14 0.16224584 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
15 0.15755881 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models
16 0.15547706 156 acl-2011-IMASS: An Intelligent Microblog Analysis and Summarization System
17 0.1483984 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
18 0.14755732 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis
19 0.14514863 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
20 0.1446415 151 acl-2011-Hindi to Punjabi Machine Translation System
topicId topicWeight
[(1, 0.306), (5, 0.042), (17, 0.039), (26, 0.029), (31, 0.02), (37, 0.077), (39, 0.035), (41, 0.052), (53, 0.022), (55, 0.028), (59, 0.056), (72, 0.028), (88, 0.011), (91, 0.05), (96, 0.095), (97, 0.019)]
simIndex simValue paperId paperTitle
Author: Kenneth Hild ; Umut Orhan ; Deniz Erdogmus ; Brian Roark ; Barry Oken ; Shalini Purwar ; Hooman Nezamfar ; Melanie Fried-Oken
Abstract: Event related potentials (ERP) corresponding to stimuli in electroencephalography (EEG) can be used to detect the intent of a person for brain computer interfaces (BCI). This paradigm is widely used to build letter-by-letter text input systems using BCI. Nevertheless, using a BCI-typewriter depending only on EEG responses will not be sufficiently accurate for single-trial operation in general, and existing systems utilize many-trial schemes to achieve accuracy at the cost of speed. Hence incorporation of a language model based prior or additional evidence is vital to improve accuracy and speed. In this demonstration we will present a BCI system for typing that integrates a stochastic language model with ERP classification to achieve speedups, via the rapid serial visual presentation (RSVP) paradigm.
same-paper 2 0.75228214 6 acl-2011-A Comprehensive Dictionary of Multiword Expressions
Author: Kosho Shudo ; Akira Kurahone ; Toshifumi Tanabe
Abstract: It has been widely recognized that one of the most difficult and intriguing problems in natural language processing (NLP) is how to cope with idiosyncratic multiword expressions. This paper presents an overview of the comprehensive dictionary (JDMWE) of Japanese multiword expressions. The JDMWE is characterized by a large notational, syntactic, and semantic diversity of contained expressions as well as a detailed description of their syntactic functions, structures, and flexibilities. The dictionary contains about 104,000 expressions, potentially 750,000 expressions. This paper shows that the JDMWE’s validity can be supported by comparing the dictionary with a large-scale Japanese N-gram frequency dataset, namely the LDC2009T08, generated by Google Inc. (Kudo et al. 2009).
3 0.61854124 203 acl-2011-Learning Sub-Word Units for Open Vocabulary Speech Recognition
Author: Carolina Parada ; Mark Dredze ; Abhinav Sethy ; Ariya Rastrow
Abstract: Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.
4 0.60318416 158 acl-2011-Identification of Domain-Specific Senses in a Machine-Readable Dictionary
Author: Fumiyo Fukumoto ; Yoshimi Suzuki
Abstract: This paper focuses on domain-specific senses and presents a method for assigning a category/domain label to each sense of the words in a dictionary. The method first maps each sense of a word in the dictionary to its corresponding category. We used a text classification technique to select appropriate senses for each domain. Then, senses were scored by computing rank scores with a Markov Random Walk (MRW) model. The method was tested on English and Japanese resources, WordNet 3.0 and the EDR Japanese dictionary. To evaluate the method, we compared the English results with the Subject Field Codes (SFC) resources. We also compared both the English and the Japanese results to the first sense heuristics in the WSD task. These results suggest that identification of domain-specific senses (IDSS) may actually be of benefit.
5 0.58177495 281 acl-2011-Sentiment Analysis of Citations using Sentence Structure-Based Features
Author: Awais Athar
Abstract: Sentiment analysis of citations in scientific papers and articles is a new and interesting problem due to the many linguistic differences between scientific texts and other genres. In this paper, we focus on the problem of automatic identification of positive and negative sentiment polarity in citations to scientific papers. Using a newly constructed annotated citation sentiment corpus, we explore the effectiveness of existing and novel features, including n-grams, specialised science-specific lexical features, dependency relations, sentence splitting and negation features. Our results show that 3-grams and dependencies perform best in this task; they outperform the sentence splitting, science lexicon and negation based features.
6 0.55469471 2 acl-2011-AM-FM: A Semantic Framework for Translation Quality Assessment
7 0.4636392 164 acl-2011-Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
8 0.46268797 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
9 0.45957267 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction
10 0.45767352 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization
11 0.4570781 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
12 0.4566704 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
13 0.45625931 311 acl-2011-Translationese and Its Dialects
14 0.4561573 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
15 0.45558685 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
16 0.45525074 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
17 0.45458609 190 acl-2011-Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations
18 0.45361757 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation
19 0.45326614 3 acl-2011-A Bayesian Model for Unsupervised Semantic Parsing
20 0.45297217 128 acl-2011-Exploring Entity Relations for Named Entity Disambiguation