acl acl2013 acl2013-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Burak Kerim Akkuş ; Ruket Cakici
Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.
Reference: text
sentIndex sentText sentNum sentScore
1 Categorization of Turkish News Documents with Morphological Analysis. Burak Kerim Akkuş, Computer Engineering Department, Middle East Technical University, Ankara, Turkey, burakkerim@ceng. [sent-1, score-0.068]
2 Abstract. Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. [sent-4, score-0.33]
3 In this study, we examine the effects of morphological analysis on text categorization task in Turkish. [sent-5, score-0.57]
4 We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. [sent-6, score-0.861]
5 We aim to show the effects of using varying degrees of morphological information. [sent-7, score-0.466]
6 1 Introduction The goal of text classification is to find the category or the topic of a text. [sent-8, score-0.113]
7 Text categorization has popular applications in daily life such as email routing, spam detection, language identification, audience detection or genre detection, and plays a major part in information retrieval tasks. [sent-9, score-0.162]
8 The aim of this study is to explain the impact of morphological analysis and POS tagging on Turkish text classification task. [sent-10, score-0.497]
9 Turkish NLP tasks have been proven to benefit from morphological analysis or segmentation of some sort (Eryiğit et al. [sent-12, score-0.33]
10 Two different settings are used throughout the paper to represent different degrees of stemming and involvement of morphological information. [sent-14, score-0.421]
11 Prefix lengths from 4 to 7 characters are compared to find the optimal length for data representation. [sent-16, score-0.23]
12 This acts as the baseline for word segmentation in order to make the limited amount of data less sparse. (Second author: Ruket Çakıcı, Computer Engineering Department, Middle East Technical University, Ankara, Turkey, ruken@ceng.) [sent-17, score-0.068]
13 The second setting involves word stems that are extracted with a morphological analysis followed by disambiguation. [sent-21, score-0.588]
14 The effects of part of speech tagging are also explored. [sent-22, score-0.104]
15 Disambiguated morphological data are used along with the part of speech tags as informative features about the word category. [sent-23, score-0.484]
16 Extracting an n-character prefix is simple and considerably cheaper than the complex state-of-the-art morphological analysis and disambiguation process. [sent-24, score-0.535]
17 Therefore, we may choose to use a cheap approximation instead of a more accurate representation if there is no significant sacrifice in the success of the system. [sent-26, score-0.112]
18 Because Turkish morphology is almost exclusively suffixing, approximate stems that are extracted with fixed-size stemming rarely contain any affixes. [sent-28, score-0.372]
19 The training data used in this study consist of news articles taken from the Milliyet Corpus, which contains 80293 news articles published in the newspaper Milliyet (Hakkani-Tür et al. [sent-29, score-0.26]
20 The articles we use for training contain a subset of documents indexed from 1000-5000 and have at least 500 characters. [sent-31, score-0.128]
21 The data used in this study have been analyzed with the morphological analyser described in Oflazer (1993) and disambiguated with Sak et al. [sent-33, score-0.517]
22 The annotated data is made available for public use. (Footnote 1: Turkish has only one prefix, used for intensifying adjectives and adverbs: sımsıcak, 'very hot'.) [sent-36, score-0.125]
23 There are other prefixes adopted from foreign languages such as anormal (abnormal), antisosyal (antisocial) or namert (not brave). [sent-38, score-0.122]
24 Section 2 briefly describes the classification methods used, Section 3 explains how these methods are used in implementation, and finally the paper is concluded with experimental results. [sent-47, score-0.068]
25 2 Background Supervised and unsupervised methods have been used for text classification in different languages (Amasyalı and Diri, 2006; Beil et al. [sent-48, score-0.113]
26 , 1997), k-nearest neighbour classifiers (Lim, 2004) and support-vector machines (Shanahan and Roma, 2003). [sent-52, score-0.078]
27 The bag-of-words model is one of the more intuitive ways to represent text files in text classification. [sent-53, score-0.09]
28 Each document is represented as an unordered list of words, and each word frequency in the collection becomes a feature representing the document. [sent-55, score-0.095]
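For illustration, a minimal bag-of-words sketch in Python using scikit-learn (which the paper cites via Pedregosa et al.); the toy documents and the exact API calls are our own choices, not taken from the paper.

    # Toy bag-of-words sketch; documents are invented, not from the Milliyet corpus.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["ekonomi piyasa borsa", "futbol mac gol", "ekonomi borsa dolar"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # document-term count matrix
    print(vectorizer.get_feature_names_out())   # the unordered vocabulary
    print(X.toarray())                          # each row represents one document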
29 The bag-of-words approach is intuitive and popular in document classification tasks (Scott and Matwin, 1998; Joachims, 1997). [sent-56, score-0.125]
30 Another way of representing documents with term weights is to use term frequency - inverse document frequency (Sparck Jones, 1988). [sent-57, score-0.202]
31 TFIDF captures the intuition that a term is valuable for a document if it occurs frequently in that document but is not common in the rest of the collection. [sent-58, score-0.157]
32 The TFIDF score of a term t in a document d in a collection D is calculated as tfidf(t, d, D) = tf(t, d) × idf(t, D), where tf(t, d) is the number of times t occurs in d and idf(t, D) is the number of documents in D divided by the number of documents that contain t. [sent-59, score-0.216]
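A direct transcription of this definition into Python follows. Note that the formula as stated uses the raw document ratio for idf, whereas most implementations take its logarithm; the sketch follows the text rather than any particular library.

    from collections import Counter

    def tfidf(term, doc, collection):
        # tf(t,d) * idf(t,D) with idf(t,D) = |D| / |{d in D : t in d}|,
        # exactly as defined above (no logarithm).
        tf = Counter(doc)[term]                        # times t occurs in d
        df = sum(1 for d in collection if term in d)   # documents containing t
        return tf * (len(collection) / df) if df else 0.0

    docs = [["ekonomi", "borsa"], ["futbol", "gol"], ["ekonomi", "dolar"]]
    print(tfidf("ekonomi", docs[0], docs))  # 1 * (3 / 2) = 1.5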
33 The idea behind bag of words and TFIDF is to find a mapping from words to numbers which can also be described as finding a mathematical representation for text files. [sent-60, score-0.127]
34 One way is to use dot product since each document is represented as a vector (Manning et al. [sent-67, score-0.094]
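A small sketch of that similarity computation, with made-up TFIDF vectors: the dot product, and its length-normalised variant, cosine similarity.

    import numpy as np

    d1 = np.array([1.5, 0.0, 0.7])   # invented TFIDF vectors for two documents
    d2 = np.array([1.5, 0.0, 0.0])
    dot = d1 @ d2
    cosine = dot / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(dot, cosine)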
35 A number of different dimensions in vector spaces are compared in this study to find the optimal performance. [sent-69, score-0.137]
36 Morphemes may carry semantic or syntactic information, but morphological ambiguity makes it hard to pass this information on to other levels in a trivial manner, especially for languages with productive morphology such as Turkish. [sent-73, score-0.371]
37 An example of possible morphological analyses of a single word in Turkish is presented in Table 1. [sent-74, score-0.33]
38 We aim to examine the effects of morphological information in a bag-of-words model in the context of text classification. [sent-76, score-0.479]
39 A relevant study explores the prefixing versus morphological analysis/stemming effect on information retrieval in Can et al. [sent-77, score-0.422]
40 Several stemmers for Turkish are presented for the indexing problem in information retrieval. [sent-79, score-0.09]
41 They use Oflazer's morphological analyzer (Oflazer, 1993); however, they do not use a disambiguator. [sent-80, score-0.33]
42 Their results show that among the fixed-length stemmers the 5-character prefix is the best, and the lemmatizer-based stemmer is slightly better than the fixed-length stemmer with five characters. [sent-82, score-0.625]
43 67% accuracy by Eryiğit (2012). (Figure 1: Learning curves with first five characters. Figure 2: Learning curves with stems.) 3 Implementation. In the first setting, up to the first N characters of each word are extracted as the feature set. [sent-87, score-0.944]
44 A comparison between 4, 5, 6 and 7 characters is performed to choose the best N. [sent-88, score-0.18]
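The fixed-length prefix feature itself is a one-liner; a minimal sketch with an invented Turkish example word follows.

    def prefix_stem(token, n):
        # Fixed-length "stemming": keep at most the first n characters.
        return token[:n]

    for n in (4, 5, 6, 7):   # the prefix lengths compared in the paper
        print(n, prefix_stem("kitaplardan", n))   # e.g. n=5 -> "kitap" (book)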
45 Each word in the documents is analysed morphologically with the morphological analyser from Oflazer (1993), and word stems are extracted for each term. [sent-90, score-0.777]
46 Sak’s morphological disambiguator for Turkish is used at this step to choose the correct analysis (Sak et al. [sent-91, score-0.47]
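The shape of this stem-extraction step is sketched below; analyze() and disambiguate() are hypothetical stand-ins for Oflazer's analyzer and Sak's disambiguator (both external tools, not Python functions), and the analysis string format is only assumed.

    def analyze(token):            # stand-in: would return all candidate analyses
        return ["kitap+Noun+A3pl+Pnon+Abl"]

    def disambiguate(analyses):    # stand-in: would pick one analysis in context
        return analyses[0]

    def stem_of(token):
        # Assumes analyses of the form "stem+Tag+Tag+...".
        return disambiguate(analyze(token)).split("+", 1)[0]

    print(stem_of("kitaplardan"))  # -> "kitap"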
47 We compare these settings in order to see how well morphological analysis with disambiguation performs against a simple baseline of fixed length stemming with a bag-of-words approach. [sent-95, score-0.532]
48 Both stem bags and the first N-character bags are transformed into vector space with TFIDF scoring. [sent-96, score-0.226]
49 Then, different feature space dimensions are used, ranking features by the highest term frequency scores. [sent-97, score-0.127]
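One plausible rendering of this frequency-based feature selection; the corpus and k are toy values (the paper's dimensions range from 1200 to 7200).

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["ekonomi borsa ekonomi", "futbol gol", "ekonomi dolar"]
    # Both vectorizers build the same alphabetically sorted vocabulary,
    # so column indices line up between the count and TFIDF matrices.
    freq = np.asarray(CountVectorizer().fit_transform(docs).sum(axis=0)).ravel()
    k = 2
    keep = np.argsort(freq)[::-1][:k]   # indices of the k most frequent terms
    X = TfidfVectorizer().fit_transform(docs)[:, keep]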
50 A range of different dimension sizes from 1200 to 7200 was explored to find the optimal dimension size for this study (Table 2). [sent-98, score-0.216]
51 k-Nearest Neighbours was implemented with distance-weighted voting over the 25 nearest neighbours, and the Support Vector Machine was implemented with a linear kernel and default parameters. [sent-100, score-0.208]
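In scikit-learn terms (the library the paper cites via Pedregosa et al.), these configurations would look roughly as follows; the choice of the multinomial event model for Naive Bayes is our assumption, as the paper does not specify it.

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC

    knn = KNeighborsClassifier(n_neighbors=25, weights="distance")  # distance-weighted voting
    nb = MultinomialNB()             # event model assumed; paper does not specify
    svm = SVC(kernel="linear")       # linear kernel, default parameters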
52 Training data contains 872 articles labelled and divided into four categories as follows: 235 articles on politics, 258 articles about social news such as culture, education or health, 177 articles on economics and 202 about sports. [sent-103, score-0.356]
53 Test data consists of 160 articles with 40 in each class. [sent-107, score-0.069]
54 4 Experiments. Experiments begin with a search for the optimal prefix length for words with different classifiers. [sent-109, score-0.213]
55 After that, stems are used as features and evaluated with the same classifiers. [sent-110, score-0.295]
56 Finally, morphological information is added to these features and the effects of the extra information are inspected in Section 4. [sent-113, score-0.471]
57 4.1 Optimal Number of Characters. This experiment aims to find out the optimal prefix length for the first N-character feature to represent text documents in Turkish. [sent-116, score-0.355]
58 We conjecture that we can simulate stemming by taking a fixed-length prefix of each word. [sent-117, score-0.281]
59 Table 2 shows the results of the experiments where columns represent the number of characters used and rows represent the number of features used for classification. [sent-119, score-0.179]
60 The best performance is achieved using the first five characters of each word for the TFIDF transformation for all classifiers. [sent-120, score-0.22]
61 Can et al. (2008) also reported that the five-character prefix in the fixed-length stemmer performed the best in their experiments. [sent-122, score-0.403]
62 Learning curves for 5-character prefixes are presented in Figure 1. [sent-123, score-0.265]
63 4.2 Stems. Another experiment was conducted with the word stems extracted with a morphological analyser and a disambiguator (Sak et al. [sent-127, score-0.789]
64 kNN, Naive Bayes and SVM were trained with different feature sizes with increasing training data sizes. [sent-129, score-0.084]
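One way to produce learning curves of this kind (scoring a classifier at increasing training-set sizes) is sketched below with random stand-in data; in the actual experiments X and y would be the TFIDF matrix and the four category labels.

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import MultinomialNB

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(200, 30))   # stand-in term-count matrix
    y = rng.integers(0, 4, size=200)         # stand-in labels for four categories
    sizes, train_scores, test_scores = learning_curve(
        MultinomialNB(), X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
    print(sizes, test_scores.mean(axis=1))   # accuracy as training size grows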
65 As the number of features used in classification increases, the number of samples needed for an adequate classification also increases for Naive Bayes. [sent-133, score-0.167]
66 As the training size increases, the feature space dimension becomes irrelevant and the results converge to a similar point for Naive Bayes. [sent-136, score-0.135]
67 4.3 5-Character Prefixes vs Stems. This section provides a comparison between the two main features used in this study with three different classifiers. [sent-140, score-0.091]
68 F1 scores for the best and worst configurations with each of the three classifiers are presented in Table 3. [sent-141, score-0.099]
69 Using five character prefixes gives better results than using stems. [sent-142, score-0.234]
70 Naive Bayes with stems and five character prefixes disagree only on six instances out of 160 test instances with F1 scores of 0. [sent-143, score-0.492]
71 Similarly, the difference between the best and the worst configurations for SVM with stems is considered not statistically significant. [sent-147, score-0.314]
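The significance test implied here (McNemar's test on paired decisions over the same 160 test items; 'mcnemar' also appears in the term-weight list below) can be sketched as follows, with invented correctness vectors.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    a_correct = np.array([1, 1, 0, 1, 0, 1], dtype=bool)   # invented outcomes
    b_correct = np.array([1, 0, 0, 1, 1, 1], dtype=bool)
    table = [[np.sum(a_correct & b_correct), np.sum(a_correct & ~b_correct)],
             [np.sum(~a_correct & b_correct), np.sum(~a_correct & ~b_correct)]]
    print(mcnemar(table, exact=True).pvalue)   # exact binomial test on discordant pairs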
72 (Figure 3: Learning curves for SVM: (a) without tags, (b) with stem tags, (c) with word tags. Table 2: F1-scores with different prefix lengths and dimensions.) [sent-154, score-1.062]
73 4.4 SVM with POS Tags. The final experiment examines the effects of POS tags that are extracted via morphological analysis. [sent-156, score-0.52]
74 Two different features are extracted and compared with the baselines of classifiers with stems and first five characters without tags. [sent-157, score-0.558]
75 The stem tag is the first tag of the first derivation, the word tag is the tag of the last derivation, and example features are given in Table 4. [sent-158, score-0.197]
76 Since derivational morphemes are also present in the morphological analyses, word tags may differ from stem tags. [sent-159, score-0.594]
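Assuming the '^DB'-delimited analysis format conventionally produced by Oflazer-style analyzers (the example string is illustrative, not taken from the paper), extracting the two tags could look like this:

    def stem_and_word_tags(analysis):
        # Stem tag: first tag of the first derivation; word tag: first tag
        # of the last derivation. Derivations are separated by "^DB".
        derivations = analysis.split("^DB")
        stem_tag = derivations[0].split("+")[1]
        word_tag = derivations[-1].split("+")[1]
        return stem_tag, word_tag

    print(stem_and_word_tags("oku+Verb^DB+Noun+Inf2+A3sg+Pnon+Nom"))  # ('Verb', 'Noun')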
77 Using POS tags with stems increases the success rate especially when the number of features is low. [sent-163, score-0.483]
78 However, using tags of the stems does not make significant changes on average. [sent-164, score-0.344]
79 The best and the worst results differ from the baseline by less than 0. [sent-165, score-0.087]
80 This may be due to the fact that the same stem has a higher chance of being in the same category even though the derived final form is different. [sent-167, score-0.107]
81 Adding stem or word tags to the first five characters increases the success when the number of training instances is low; however, it has no significant effect on the highest score. [sent-169, score-0.515]
82 Using tags with five characters has positive effects when the number of features is low and negative effects when the number of features is high. [sent-170, score-0.588]
83 5 Conclusion. In this study, we use k-Nearest Neighbours, Naive Bayes and Support Vector Machine classifiers to examine the effects of morphological information on the task of classifying Turkish news articles. [sent-171, score-0.511]
84 We have compared their performances on different sizes of training data, different numbers of features and different feature sets. [sent-172, score-0.121]
85 Results suggest that the first five characters of each word can be used for TFIDF transformation to represent text documents in classification tasks. [sent-173, score-0.392]
86 Stems are extracted with a morphological analyser, which is computationally expensive and slow compared to extracting the first characters of a word. [sent-175, score-0.571]
87 Although different test sets and training data may change the final results, using a simple approximation with the first five characters to represent documents, instead of the results of an expensive morphological analysis process, gives similar or better results at much less cost. [sent-176, score-0.609]
88 Experiments also suggest that there is more room for growth if more training data becomes available, as most of the learning curves presented in the experiments indicate. [sent-177, score-0.143]
89 Actual word categories and meanings may differ, and using POS tags may solve this problem, but sparsity of the data is more prominent at the moment. [sent-179, score-0.117]
90 The future work includes repeating these experiments with larger data sets to explore the effects of the data size. [sent-180, score-0.104]
91 Automatic Turkish text categorization in terms of author, genre and gender. [sent-191, score-0.169]
92 The impact of automatic morphological analysis & disambiguation on dependency parsing of Turkish. [sent-221, score-0.376]
93 An extensive empirical study of feature selection metrics for text classification. [sent-231, score-0.137]
94 A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. [sent-259, score-0.2]
95 Improving kNN based text classification with well estimated parameters. [sent-276, score-0.113]
96 A comparison of event models for naive bayes text classification. [sent-303, score-0.387]
97 Feature selection, perceptron learning, and a usability case study for text categorization. [sent-311, score-0.099]
98 Techniques for improving the performance of naive bayes for text classification. [sent-348, score-0.387]
99 Boosting support vector machines for text classification through parameter-free threshold relaxation. [sent-357, score-0.185]
100 A comparative study on feature selection in text categorization. [sent-373, score-0.137]
wordName wordTfidf (topN-words)
[('morphological', 0.33), ('turkish', 0.329), ('stems', 0.258), ('sak', 0.237), ('naive', 0.203), ('knn', 0.16), ('oflazer', 0.16), ('tfidf', 0.155), ('curves', 0.143), ('characters', 0.142), ('bayes', 0.139), ('prefix', 0.125), ('prefixes', 0.122), ('kemal', 0.107), ('stem', 0.107), ('eryi', 0.104), ('neighbours', 0.104), ('effects', 0.104), ('disambiguator', 0.102), ('milliyet', 0.102), ('analyser', 0.099), ('categorization', 0.091), ('stemmers', 0.09), ('tags', 0.086), ('svm', 0.084), ('five', 0.078), ('loper', 0.074), ('articles', 0.069), ('stemmer', 0.069), ('classification', 0.068), ('amasyal', 0.068), ('ankara', 0.068), ('ceng', 0.068), ('ruket', 0.068), ('sparck', 0.068), ('mcnemar', 0.062), ('increases', 0.062), ('stemming', 0.059), ('documents', 0.059), ('turkey', 0.058), ('document', 0.057), ('worst', 0.056), ('beil', 0.055), ('etino', 0.055), ('wseas', 0.055), ('fixed', 0.055), ('ak', 0.054), ('study', 0.054), ('ny', 0.053), ('shanahan', 0.052), ('pos', 0.052), ('agglutinative', 0.05), ('pedregosa', 0.05), ('bag', 0.049), ('pal', 0.047), ('yiming', 0.047), ('disambiguation', 0.046), ('optimal', 0.046), ('sizes', 0.046), ('labelled', 0.046), ('text', 0.045), ('sigir', 0.044), ('term', 0.043), ('classifiers', 0.043), ('length', 0.042), ('nltk', 0.042), ('fourteenth', 0.042), ('morphology', 0.041), ('publishing', 0.041), ('cicling', 0.041), ('bags', 0.041), ('al', 0.04), ('success', 0.04), ('tag', 0.04), ('morphemes', 0.04), ('york', 0.039), ('icml', 0.039), ('ul', 0.039), ('retrieval', 0.038), ('choose', 0.038), ('feature', 0.038), ('stroudsburg', 0.037), ('east', 0.037), ('features', 0.037), ('vector', 0.037), ('machines', 0.035), ('dimension', 0.035), ('cheap', 0.034), ('hinrich', 0.034), ('bird', 0.034), ('disambiguated', 0.034), ('news', 0.034), ('character', 0.034), ('behind', 0.033), ('genre', 0.033), ('degrees', 0.032), ('eighth', 0.032), ('informative', 0.031), ('differ', 0.031), ('morphologically', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
Author: Burak Kerim Akkuş ; Ruket Cakici
Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.
2 0.17868996 87 acl-2013-Compositional-ly Derived Representations of Morphologically Complex Words in Distributional Semantics
Author: Angeliki Lazaridou ; Marco Marelli ; Roberto Zamparelli ; Marco Baroni
Abstract: Speakers of a language can construct an unlimited number of new words through morphological derivation. This is a major cause of data sparseness for corpus-based approaches to lexical semantics, such as distributional semantic models of word meaning. We adapt compositional methods originally developed for phrases to the task of deriving the distributional meaning of morphologically complex words from their parts. Semantic representations constructed in this way beat a strong baseline and can be of higher quality than representations directly constructed from corpus data. Our results constitute a novel evaluation of the proposed composition methods, in which the full additive model achieves the best performance, and demonstrate the usefulness of a compositional morphology component in distributional semantics.
3 0.17179357 330 acl-2013-Stem Translation with Affix-Based Rule Selection for Agglutinative Languages
Author: Zhiyang Wang ; Yajuan Lu ; Meng Sun ; Qun Liu
Abstract: Current translation models are mainly designed for languages with limited morphology, which are not readily applicable to agglutinative languages as the difference in the way lexical forms are generated. In this paper, we propose a novel approach for translating agglutinative languages by treating stems and affixes differently. We employ stem as the atomic translation unit to alleviate data spareness. In addition, we associate each stemgranularity translation rule with a distribution of related affixes, and select desirable rules according to the similarity of their affix distributions with given spans to be translated. Experimental results show that our approach significantly improves the translation performance on tasks of translating from three Turkic languages to Chinese.
4 0.17095038 303 acl-2013-Robust multilingual statistical morphological generation models
Author: Ondrej Dusek ; Filip Jurcicek
Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionarybased baseline.
5 0.098320179 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German
Author: Marion Weller ; Alexander Fraser ; Sabine Schulte im Walde
Abstract: This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on sourceside syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-toGerman translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.
6 0.096933782 80 acl-2013-Chinese Parsing Exploiting Characters
7 0.091590703 102 acl-2013-DErivBase: Inducing and Evaluating a Derivational Morphology Resource for German
8 0.081840426 342 acl-2013-Text Classification from Positive and Unlabeled Data using Misclassified Data Correction
9 0.08131659 300 acl-2013-Reducing Annotation Effort for Quality Estimation via Active Learning
10 0.078160696 351 acl-2013-Topic Modeling Based Classification of Clinical Reports
11 0.077867776 47 acl-2013-An Information Theoretic Approach to Bilingual Word Clustering
12 0.077383973 309 acl-2013-Scaling Semi-supervised Naive Bayes with Feature Marginals
13 0.076336607 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
14 0.07522437 154 acl-2013-Extracting bilingual terminologies from comparable corpora
16 0.072867192 74 acl-2013-Building Comparable Corpora Based on Bilingual LDA Model
17 0.072149709 193 acl-2013-Improving Chinese Word Segmentation on Micro-blog Using Rich Punctuations
18 0.06987486 359 acl-2013-Translating Dialectal Arabic to English
19 0.064277329 28 acl-2013-A Unified Morpho-Syntactic Scheme of Stanford Dependencies
20 0.064195238 39 acl-2013-Addressing Ambiguity in Unsupervised Part-of-Speech Induction with Substitute Vectors
topicId topicWeight
[(0, 0.199), (1, 0.018), (2, -0.03), (3, -0.033), (4, 0.037), (5, -0.084), (6, -0.009), (7, 0.012), (8, 0.028), (9, 0.023), (10, -0.006), (11, 0.008), (12, 0.062), (13, 0.018), (14, -0.155), (15, 0.023), (16, -0.125), (17, -0.109), (18, -0.02), (19, -0.014), (20, -0.123), (21, 0.096), (22, 0.059), (23, 0.039), (24, -0.025), (25, -0.026), (26, -0.057), (27, -0.155), (28, -0.034), (29, -0.049), (30, 0.037), (31, 0.017), (32, 0.003), (33, 0.081), (34, -0.051), (35, -0.053), (36, -0.108), (37, 0.077), (38, 0.025), (39, -0.053), (40, -0.079), (41, 0.037), (42, 0.066), (43, 0.098), (44, 0.006), (45, 0.002), (46, 0.11), (47, 0.006), (48, -0.021), (49, 0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.91853637 78 acl-2013-Categorization of Turkish News Documents with Morphological Analysis
Author: Burak Kerim Akkuş ; Ruket Cakici
Abstract: Morphologically rich languages such as Turkish may benefit from morphological analysis in natural language tasks. In this study, we examine the effects of morphological analysis on text categorization task in Turkish. We use stems and word categories that are extracted with morphological analysis as main features and compare them with fixed length stemmers in a bag of words approach with several learning algorithms. We aim to show the effects of using varying degrees of morphological information.
2 0.83064973 303 acl-2013-Robust multilingual statistical morphological generation models
Author: Ondrej Dusek ; Filip Jurcicek
Abstract: We present a novel method of statistical morphological generation, i.e. the prediction of inflected word forms given lemma, part-of-speech and morphological features, aimed at robustness to unseen inputs. Our system uses a trainable classifier to predict “edit scripts” that are then used to transform lemmas into inflected word forms. Suffixes of lemmas are included as features to achieve robustness. We evaluate our system on 6 languages with a varying degree of morphological richness. The results show that the system is able to learn most morphological phenomena and generalize to unseen inputs, producing significantly better results than a dictionarybased baseline.
3 0.65598255 227 acl-2013-Learning to lemmatise Polish noun phrases
Author: Adam Radziszewski
Abstract: We present a novel approach to noun phrase lemmatisation where the main phase is cast as a tagging problem. The idea draws on the observation that the lemmatisation of almost all Polish noun phrases may be decomposed into transformation of singular words (tokens) that make up each phrase. We perform evaluation, which shows results similar to those obtained earlier by a rule-based system, while our approach allows to separate chunking from lemmatisation.
4 0.65404552 378 acl-2013-Using subcategorization knowledge to improve case prediction for translation to German
Author: Marion Weller ; Alexander Fraser ; Sabine Schulte im Walde
Abstract: This paper demonstrates the need and impact of subcategorization information for SMT. We combine (i) features on sourceside syntactic subcategorization and (ii) an external knowledge base with quantitative, dependency-based information about target-side subcategorization frames. A manual evaluation of an English-toGerman translation task shows that the subcategorization information has a positive impact on translation quality through better prediction of case.
Author: Tirthankar Dasgupta
Abstract: In this work we present psycholinguistically motivated computational models for the organization and processing of Bangla morphologically complex words in the mental lexicon. Our goal is to identify whether morphologically complex words are stored as a whole or are they organized along the morphological line. For this, we have conducted a series of psycholinguistic experiments to build up hypothesis on the possible organizational structure of the mental lexicon. Next, we develop computational models based on the collected dataset. We observed that derivationally suffixed Bangla words are in general decomposed during processing and compositionality between the stem and the suffix plays an important role in the decomposition process. We observed the same phenomena for Bangla verb sequences where experiments showed noncompositional verb sequences are in general stored as a whole in the ML and low traces of compositional verbs are found in the mental lexicon.