acl acl2013 acl2013-292 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Anne-Laure Ligozat
Abstract: Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility of using existing English corpora and transferring a classification by translating the questions and their labels. By translating the training corpus, we obtain results close to a monolingual setting.
Reference: text
sentIndex sentText sentNum sentScore
1 Question Classification Transfer Anne-Laure Ligozat LIMSI-CNRS / BP133, 91403 Orsay cedex, France ENSIIE / 1, square de la résistance, Evry, France firstname . [sent-1, score-0.027]
2 Abstract Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. [sent-3, score-0.094]
3 In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility of using existing English corpora and transferring a classification by translating the questions and their labels. [sent-4, score-1.473]
4 By translating the training corpus, we obtain results close to a monolingual setting. [sent-5, score-0.318]
5 1 Introduction In question answering (QA), as in most Natural Language Processing domains, English is the best-resourced language, in terms of corpora, lexicons, or systems. [sent-6, score-0.379]
6 While developing a question answering system for French, we were thus limited by the lack of resources for this language. [sent-8, score-0.409]
7 Some were created, for example for answer validation (Grappy et al.). [sent-9, score-0.119]
8 Yet, for question classification, although question corpora in French exist, only a small part of them is annotated with question classes, and such an annotation is costly. [sent-11, score-0.907]
9 We thus wondered if it was possible to use existing English corpora, in this case the data used in (Li and Roth, 2002), to create a classification module for French. [sent-12, score-0.272]
10 Transferring knowledge from one language to another is usually done by exploiting parallel corpora; yet in this case, few such corpora exist (CLEF QA datasets could be used, but question classes are not very precise). [sent-13, score-0.587]
11 We thus investigated the possibility of using machine translation to create a parallel corpus, as has been done for spoken language understanding (Jabaian et al., 2011). [sent-14, score-0.12]
12 The idea is that using machine translation would enable us to have a large training corpus, either by using the English one and translating the test corpus, or by translating the training corpus. [sent-16, score-0.438]
13 One of the questions posed was whether the quality of present machine translation systems would make it possible to learn the classification properly. [sent-17, score-0.538]
14 This paper presents a question classification transfer method, whose results are close to those of a monolingual system. [sent-18, score-0.713]
15 The contributions of the paper are the following: • comparison of train-on-target and test-on-source strategies for question classification; • creation of an effective question classification system for French, with minimal annotation effort. [sent-19, score-0.81]
16 2 Problem definition A Question Answering (QA) system aims at returning a precise answer to a natural language question: if asked "How large is the Lincoln Memorial?", a QA system should return the answer "164 acres" as well as a justifying snippet. [sent-24, score-0.113]
18 Most systems include a question classification step which determines the expected answer type, for example area in the previous case. [sent-26, score-0.639]
19 This type can then be used to extract the correct answer from documents. [sent-27, score-0.125]
20 Detecting the answer type is usually considered as a multiclass classification problem, with each answer type representing a class. [sent-28, score-0.485]
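As an illustration of how the predicted type is used downstream, here is a toy sketch of type-based answer filtering; the per-type checkers and all names are hypothetical, not part of the paper:

```python
import re

# Hypothetical filters keyed by coarse answer type.
TYPE_FILTERS = {
    "NUM": lambda s: bool(re.search(r"\d", s)),   # numeric answers
    "HUM": lambda s: s[:1].isupper(),             # person-like answers
}

def filter_candidates(answer_type, candidates):
    # Keep only candidate answers compatible with the predicted type.
    check = TYPE_FILTERS.get(answer_type, lambda s: True)
    return [c for c in candidates if check(c)]

print(filter_candidates("NUM", ["164 acres", "Lincoln", "nearby"]))
```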
21 In this paper, we wish to learn such a system for French, without having to manually annotate thousands of questions. [sent-32, score-0.032]
22 3 Transferring question classification The two methods tested for transferring the classification, following (Jabaian et al., 2011), are presented in Figure 1: [sent-33, score-0.681]
23 • The first one (on the left), called test-on-source, consists in learning a classification model in English and translating the test corpus from French to English, in order to apply the English model to the translated test corpus. [sent-34, score-0.228]
24 • The second one (on the right), called train-on-target, consists in translating the training corpus from English to French. [sent-35, score-0.236]
25 We obtain a labeled French corpus, on which it is possible to learn a classification model (a minimal sketch of both strategies is given below). [sent-36, score-0.314]
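Here is a minimal sketch of the two strategies in Python, assuming scikit-learn for the classifier; the translate() helper is a placeholder for the MT system (the paper uses Google Translate), and all names and the n-gram order are assumptions of the sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def translate(questions, src, tgt):
    # Placeholder for a machine translation system (the paper uses
    # the Google Translate online interface).
    raise NotImplementedError

def train_classifier(questions, labels):
    # Word n-gram features plus an SVM, roughly matching the paper's
    # setup (the n-gram order here is an assumption).
    model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), SVC())
    return model.fit(questions, labels)

def test_on_source(en_train, en_labels, fr_test):
    # Learn on the original English corpus; translate the French test set.
    model = train_classifier(en_train, en_labels)
    return model.predict(translate(fr_test, "fr", "en"))

def train_on_target(en_train, en_labels, fr_test):
    # Translate the English training corpus; learn a model for French.
    model = train_classifier(translate(en_train, "en", "fr"), en_labels)
    return model.predict(fr_test)
```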
26 In the first case, classification is learned on well written questions; yet, as the test corpus is translated, translation errors may disturb the classifier. [sent-37, score-0.42]
27 In the second case, the classification model will be learned on less well written questions, but the corpus may be large enough to compensate for the loss in quality. [sent-38, score-0.278]
28 Figure 2: Some of the question categories proposed by (Li and Roth, 2002). [sent-39, score-0.285]
29 4 Experiments 4.1 Question classes We used the question taxonomy proposed by (Li and Roth, 2002), which enabled us to compare our results to those obtained by (Zhang and Lee, 2003) on English. [sent-40, score-0.498]
30 This taxonomy contains two levels: one with 6 coarse grained categories and one with 50 fine grained categories. [sent-41, score-1.498]
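In this taxonomy a fine grained label encodes its coarse grained parent as a prefix (e.g. the ABBR:exp class mentioned later falls under ABBR), so the coarse class can be read off the fine label; a minimal sketch, assuming the usual COARSE:fine label spelling:

```python
def coarse_of(fine_label: str) -> str:
    # Li and Roth labels are written COARSE:fine, e.g. "ABBR:exp" or
    # "NUM:date"; the coarse class is the prefix before the colon.
    return fine_label.split(":", 1)[0]

assert coarse_of("ABBR:exp") == "ABBR"
assert coarse_of("NUM:date") == "NUM"
```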
31 4.2 Corpora For English, we used the data from (Li and Roth, 2002), which was assembled from USC, UIUC and TREC collections, and has been manually labeled according to their taxonomy. [sent-44, score-0.036]
32 The training set contains 5,500 labeled questions, and the testing set contains 500 questions. [sent-45, score-0.068]
33 For French, we gathered questions from several evaluation campaigns: QA@CLEF 2005, 2006, 2007, EQueR and Quæro 2008, 2009 and 2010. [sent-46, score-0.222]
34 After elimination of duplicated questions, we obtained a corpus of 1,421 questions, which were divided into a training set of 728 questions and a test set of 693 questions. [sent-47, score-0.35]
35 Some of these questions were already labeled, and we manually annotated the rest of them. [sent-48, score-0.222]
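A sketch of this corpus preparation step (duplicate elimination followed by a train/test split); the normalization used to detect duplicates and the in-order split are assumptions, as neither is specified above:

```python
def dedup_and_split(questions, train_size=728):
    # Drop duplicate questions using a case- and whitespace-insensitive
    # key (an assumed normalization), then split into train and test.
    seen, unique = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique[:train_size], unique[train_size:]
```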
36 Translation was performed with the Google Translate online interface, which had satisfactory performance on interrogative forms; these are not well handled by all machine translation systems. [sent-49, score-0.127]
37 Footnote 2: We tested other translation systems, but Google Translate gave the best results. [sent-51, score-0.076]
38 Table 1: Precision for both levels of the hierarchy (features = word n-grams, classifier = libsvm). [sent-58, score-0.218]
39 4.3 Classification parameters The classifier used was LibSVM (Chang and Lin, 2011) with default parameters, which offers one-vs-one multiclass classification, and which (Zhang and Lee, 2003) showed to be most effective for this task. [sent-59, score-0.1]
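A sketch of this configuration with scikit-learn, whose SVC is backed by libsvm and performs one-vs-one multiclass classification by default; the toy questions, labels, and n-gram order are assumptions of the sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for the real corpora (hypothetical examples and labels).
train_q = ["How large is the Lincoln Memorial ?", "What is BPH ?",
           "Who wrote Hamlet ?", "What does NATO stand for ?"]
train_y = ["NUM:dist", "ABBR:exp", "HUM:ind", "ABBR:exp"]
test_q, test_y = ["What is DNA ?"], ["ABBR:exp"]

# SVC wraps libsvm and uses one-vs-one voting for multiclass problems.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), SVC())
clf.fit(train_q, train_y)
print(accuracy_score(test_y, clf.predict(test_q)))  # proportion correct
```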
40 4.4 Results and discussion Table 1 shows the results obtained with the basic configuration, for both transfer methods. [sent-64, score-0.074]
41 Precision is the proportion of correctly classified questions among the test questions. [sent-67, score-0.501]
42 Using word n-grams, monolingual English classification obtains .798 correct classification for the fine grained classes, and .90 for the coarse grained classes, results which are very close to those obtained by (Zhang and Lee, 2003). [sent-68, score-0.306]
45 [...] .84 for coarse grained classes, probably mostly due to the smaller size of the training corpus: (Zhang and Lee, 2003) had a precision of .65 for the fine grained classification with a 1,000-question training corpus. [sent-73, score-0.789]
47 When translating test questions from French to English, classification precision decreases, as was expected from (Cumbreras et al., 2006). [sent-75, score-0.697]
48 Yet, when translating the training corpus from English to French and learning the classification model on it, results remain close to the monolingual setting. (Footnote 3: We measured the significance of precision differences (Student t-test, p=.05), for each level of the hierarchy between each test, and, unless indicated otherwise, comparable results are significantly different in each condition.) [sent-77, score-0.484]
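A sketch of this kind of significance check, assuming per-question 0/1 correctness vectors for two systems on the same test set; the footnote does not specify the exact test variant, so a paired t-test over matched questions is one plausible reading:

```python
import numpy as np
from scipy.stats import ttest_rel

# 0/1 correctness of two systems on the same test questions
# (hypothetical data).
sys_a = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
sys_b = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 1])

t_stat, p_value = ttest_rel(sys_a, sys_b)
significant = p_value < 0.05  # the threshold used in the paper
print(t_stat, p_value, significant)
```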
50 One possible explanation is that the condition where test questions are translated is very sensitive to translation errors: if one of the test questions is not correctly translated, the classifier will have a hard time categorizing it. [sent-86, score-0.709]
51 If the training corpus is translated, translation errors can be counterbalanced by correct translations. [sent-87, score-0.224]
52 Table 2 shows the classification performance with this additional information (part-of-speech tags). [sent-90, score-0.24]
53 Classification is slightly improved, but only for coarse grained classes (the difference is not significant for fine grained classes). [sent-91, score-1.639]
54 When analyzing the results, we noted that most confusion errors were due to the type of features given as inputs: for example, to correctly classify the question "What is BPH?" as a question expecting an expression corresponding to an abbreviation (the ABBR:exp class in the hierarchy), it is necessary to know that "BPH" is an abbreviation. [sent-92, score-0.348]
56 We thus added a specific feature to detect if a question word is an abbreviation, simply by testing if it contains only upper case letters, and normalizing such words. [sent-94, score-0.319]
57 Table 3: Precision for both levels of the hierarchy (features = word n-grams with abbreviations, classifier = libsvm). [sent-98, score-0.218]
58 Table 3 gives the results with this additional feature (we only kept the method with translation of the training corpus since results were much higher). [sent-99, score-0.146]
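A minimal sketch of the abbreviation feature as described (all-upper-case test plus normalization); the placeholder token and the guard against single capital letters are assumptions of the sketch:

```python
def normalize_abbreviations(tokens):
    # A word consisting only of upper-case letters is treated as an
    # abbreviation and replaced by a placeholder token (the token
    # spelling and the length > 1 guard are assumptions).
    return ["_ABBR_" if t.isalpha() and t.isupper() and len(t) > 1 else t
            for t in tokens]

assert normalize_abbreviations(["What", "is", "BPH", "?"]) == \
       ["What", "is", "_ABBR_", "?"]
```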
59 Precision is improved for both levels of the hierarchy: for fine grained classes, results increase from [...]. [sent-100, score-0.796]
60 5 Related work Most question answering systems include question classification, which is generally based on supervised learning. [sent-106, score-0.664]
61 (Li and Roth, 2002) trained the SNoW hierarchical classifier for question classification, with a fine grained hierarchy of 50 classes and a coarse grained one of 6 classes. [sent-107, score-1.976]
62 [...] 8% correct classification of the coarse grained classes, and 95% on the fine grained one. [sent-110, score-1.744]
63 This hierarchy was widely used by other QA systems. [sent-111, score-0.115]
64 (Zhang and Lee, 2003) studied the classification performance according to the classifier and training dataset size, as well as the contribution of question parse trees. [sent-112, score-0.609]
65 Their results are 87% correct classification on coarse grained classes and 80% on fine grained classes with vectorial attributes, and 90% correct classification on coarse grained classes and 80% on fine grained classes with structured input and tree kernels. [sent-113, score-4.23]
66 Adapting the methods to other languages requires annotating large corpora of questions. [sent-115, score-0.052]
67 In order to classify questions in different languages, (Solorio et al., 2004) proposed an internet-based approach to determine the expected type. [sent-116, score-0.222]
69 By combining this information with question words, they obtain 84% correct classification for English, 84% for Spanish and 89% for Italian, with cross-validation on a 450-question corpus with 7 question classes. [sent-118, score-1.249]
70 One of the limitations raised by the authors is the lack of large labeled corpora for all languages. [sent-119, score-0.118]
71 A possibility to overcome this lack of resources is to use existing English resources. [sent-120, score-0.074]
72 (Cumbreras et al., 2006) developed a QA system for Spanish, based on an English QA system, by translating the questions from Spanish to English. [sent-122, score-0.356]
73 They obtain a 65% precision for Spanish question classification, while English questions are correctly classified with an 80% precision. [sent-123, score-0.63]
74 Crosslingual QA systems, in which the question is in a different language than the documents, also usually rely on English systems and, for example, translate the answers (Bos and Nissim, 2006; Bowden et al.). [sent-125, score-0.369]
75 6 Conclusion This paper presents a comparison between two transfer modes to adapt question classification from English to French. [sent-127, score-0.599]
76 Results show that translating the training corpus gives better results than translating the test corpus. [sent-128, score-0.368]
77 Only part-of-speech information was used, but since (Zhang and Lee, 2003) showed that best results are obtained with parse trees and tree kernels, it could be interesting to test this additional information; yet, parsing translated questions may prove unreliable. [sent-129, score-0.329]
78 Finally, as interrogative forms occur rarely in corpora, their translation is usually of a slightly lower quality. [sent-130, score-0.158]
79 A possible future direction for this work could be to use a specific model of translation for questions in order to learn question classification on higher quality translations. [sent-131, score-0.823]
80 Multilingual question answering through intermediate translation: LCC's PowerAnswer at QA@CLEF 2007. [sent-145, score-0.379]
81 [...] using machine translation and an English classifier. [sent-160, score-0.134]
82 Selecting answers to questions from web documents by a robust validation process. [sent-165, score-0.258]
83 Combination of stochastic understanding and machine translation systems for language portability of dialogue systems. [sent-169, score-0.076]
84 In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26–32. [sent-192, score-0.03]
wordName wordTfidf (topN-words)
[('grained', 0.523), ('question', 0.285), ('classification', 0.24), ('questions', 0.222), ('fine', 0.222), ('coarse', 0.194), ('classes', 0.177), ('qa', 0.177), ('french', 0.156), ('transfering', 0.156), ('translating', 0.134), ('cumbreras', 0.117), ('jabaian', 0.117), ('hierarchy', 0.115), ('libsvm', 0.111), ('answering', 0.094), ('answer', 0.083), ('clef', 0.082), ('bowden', 0.078), ('bph', 0.078), ('licsohrpus', 0.078), ('ligozat', 0.078), ('reconrcphus', 0.078), ('tesft', 0.078), ('trainein', 0.078), ('roth', 0.078), ('translated', 0.077), ('translation', 0.076), ('transfer', 0.074), ('solorio', 0.069), ('grappy', 0.069), ('monolingual', 0.066), ('lee', 0.064), ('zhang', 0.062), ('qs', 0.06), ('english', 0.058), ('spanish', 0.055), ('translate', 0.053), ('corpora', 0.052), ('classifier', 0.052), ('ue', 0.051), ('interrogative', 0.051), ('levels', 0.051), ('abbreviation', 0.049), ('close', 0.048), ('multiclass', 0.048), ('fr', 0.047), ('bos', 0.044), ('possibility', 0.044), ('yet', 0.042), ('abbreviations', 0.042), ('correct', 0.042), ('precision', 0.04), ('corpus', 0.038), ('france', 0.038), ('obtain', 0.038), ('li', 0.038), ('validation', 0.036), ('taxonomy', 0.036), ('errors', 0.036), ('labeled', 0.036), ('lef', 0.034), ('rtr', 0.034), ('nissim', 0.034), ('vectorial', 0.034), ('tilingual', 0.034), ('acres', 0.034), ('justifying', 0.034), ('expecting', 0.034), ('ort', 0.034), ('thousands', 0.032), ('mera', 0.032), ('cedex', 0.032), ('lincoln', 0.032), ('wondered', 0.032), ('opez', 0.032), ('lcc', 0.032), ('brigitte', 0.032), ('grau', 0.032), ('dco', 0.032), ('memorial', 0.032), ('training', 0.032), ('usually', 0.031), ('expected', 0.031), ('lack', 0.03), ('test', 0.03), ('precise', 0.03), ('precisions', 0.03), ('silva', 0.03), ('arnaud', 0.03), ('informaion', 0.03), ('otn', 0.03), ('elimination', 0.028), ('trans', 0.028), ('orsay', 0.028), ('correctly', 0.027), ('tra', 0.027), ('lastname', 0.027), ('firstname', 0.027), ('usc', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 292 acl-2013-Question Classification Transfer
Author: Anne-Laure Ligozat
Abstract: Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility of using existing English corpora and transferring a classification by translating the questions and their labels. By translating the training corpus, we obtain results close to a monolingual setting.
2 0.21648271 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
Author: Xuchen Yao ; Benjamin Van Durme ; Peter Clark
Abstract: Information Retrieval (IR) and Answer Extraction are often designed as isolated or loosely connected components in Question Answering (QA), with repeated overengineering on IR, and not necessarily performance gain for QA. We propose to tightly integrate them by coupling automatically learned features for answer extraction to a shallow-structured IR model. Our method is very quick to implement, and significantly improves IR for QA (measured in Mean Average Precision and Mean Reciprocal Rank) by 10%-20% against an uncoupled retrieval baseline in both document and passage retrieval, which further leads to a downstream 20% improvement in QA F1.
Author: Guangyou Zhou ; Fang Liu ; Yang Liu ; Shizhu He ; Jun Zhao
Abstract: Community question answering (CQA) has become an increasingly popular research topic. In this paper, we focus on the problem of question retrieval. Question retrieval in CQA can automatically find the most relevant and recent questions that have been solved by other users. However, the word ambiguity and word mismatch problems bring about new challenges for question retrieval in CQA. State-of-the-art approaches address these issues by implicitly expanding the queried questions with additional words or phrases using monolingual translation models. While useful, the effectiveness of these models is highly dependent on the availability of quality parallel monolingual corpora (e.g., question-answer pairs) in the absence of which they are troubled by noise issue. In this work, we propose an alternative way to address the word ambiguity and word mismatch problems by taking advantage of potentially rich semantic information drawn from other languages. Our proposed method employs statistical machine translation to improve question retrieval and enriches the question representation with the translated words from other languages via matrix factorization. Experiments conducted on a real CQA data show that our proposed approach is promising.
4 0.17914751 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
Author: Wen-tau Yih ; Ming-Wei Chang ; Christopher Meek ; Andrzej Pastusiak
Abstract: In this paper, we study the answer sentence selection problem for question answering. Unlike previous work, which primarily leverages syntactic analysis through dependency tree matching, we focus on improving the performance using models of lexical semantic resources. Experiments show that our systems can be consistently and significantly improved with rich lexical semantic information, regardless of the choice of learning algorithms. When evaluated on a benchmark dataset, the MAP and MRR scores are increased by 8 to 10 points, compared to one of our baseline systems using only surface-form matching. Moreover, our best system also outperforms pervious work that makes use of the dependency tree structure by a wide margin.
5 0.17812884 290 acl-2013-Question Analysis for Polish Question Answering
Author: Piotr Przybyla
Abstract: This study is devoted to the problem of question analysis for a Polish question answering system. The goal of the question analysis is to determine its general structure, type of an expected answer and create a search query for finding relevant documents in a textual knowledge base. The paper contains an overview of available solutions of these problems, description of their implementation and presents an evaluation based on a set of 1137 questions from a Polish quiz TV show. The results help to understand how an environment of a Slavonic language affects the performance of methods created for English.
6 0.17467886 241 acl-2013-Minimum Bayes Risk based Answer Re-ranking for Question Answering
7 0.17052706 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
8 0.16911332 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
9 0.13921921 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text
10 0.12293123 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
11 0.11543189 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering
12 0.095340773 107 acl-2013-Deceptive Answer Prediction with User Preference Graph
13 0.083573163 266 acl-2013-PAL: A Chatterbot System for Answering Domain-specific Questions
14 0.078095414 98 acl-2013-Cross-lingual Transfer of Semantic Role Labeling Models
15 0.070155866 223 acl-2013-Learning a Phrase-based Translation Model from Monolingual Data with Application to Domain Adaptation
16 0.065155901 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews
17 0.064165182 306 acl-2013-SPred: Large-scale Harvesting of Semantic Predicates
18 0.062397834 357 acl-2013-Transfer Learning for Constituency-Based Grammars
19 0.061851069 345 acl-2013-The Haves and the Have-Nots: Leveraging Unlabelled Corpora for Sentiment Analysis
20 0.061657019 351 acl-2013-Topic Modeling Based Classification of Clinical Reports
topicId topicWeight
[(0, 0.176), (1, 0.039), (2, 0.039), (3, -0.083), (4, 0.06), (5, 0.047), (6, -0.008), (7, -0.312), (8, 0.172), (9, 0.024), (10, 0.129), (11, -0.082), (12, -0.015), (13, 0.01), (14, 0.01), (15, 0.019), (16, -0.018), (17, -0.034), (18, 0.023), (19, 0.103), (20, 0.057), (21, 0.009), (22, 0.039), (23, -0.059), (24, 0.031), (25, -0.052), (26, 0.01), (27, -0.097), (28, 0.044), (29, 0.014), (30, 0.02), (31, -0.005), (32, 0.045), (33, 0.077), (34, -0.019), (35, 0.006), (36, -0.025), (37, -0.001), (38, -0.002), (39, -0.045), (40, -0.054), (41, 0.043), (42, 0.017), (43, -0.035), (44, 0.001), (45, 0.052), (46, 0.086), (47, -0.013), (48, 0.05), (49, 0.054)]
simIndex simValue paperId paperTitle
same-paper 1 0.97461993 292 acl-2013-Question Classification Transfer
Author: Anne-Laure Ligozat
Abstract: Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility of using existing English corpora and transferring a classification by translating the questions and their labels. By translating the training corpus, we obtain results close to a monolingual setting.
2 0.89234489 60 acl-2013-Automatic Coupling of Answer Extraction and Information Retrieval
Author: Xuchen Yao ; Benjamin Van Durme ; Peter Clark
Abstract: Information Retrieval (IR) and Answer Extraction are often designed as isolated or loosely connected components in Question Answering (QA), with repeated overengineering on IR, and not necessarily performance gain for QA. We propose to tightly integrate them by coupling automatically learned features for answer extraction to a shallow-structured IR model. Our method is very quick to implement, and significantly improves IR for QA (measured in Mean Average Precision and Mean Reciprocal Rank) by 10%-20% against an uncoupled retrieval baseline in both document and passage retrieval, which further leads to a downstream 20% improvement in QA F1.
3 0.8440448 241 acl-2013-Minimum Bayes Risk based Answer Re-ranking for Question Answering
Author: Nan Duan
Abstract: This paper presents two minimum Bayes risk (MBR) based Answer Re-ranking (MBRAR) approaches for the question answering (QA) task. The first approach re-ranks single QA system’s outputs by using a traditional MBR model, by measuring correlations between answer candidates; while the second approach reranks the combined outputs of multiple QA systems with heterogenous answer extraction components by using a mixture model-based MBR model. Evaluations are performed on factoid questions selected from two different domains: Jeopardy! and Web, and significant improvements are achieved on all data sets.
4 0.84394699 290 acl-2013-Question Analysis for Polish Question Answering
Author: Piotr Przybyla
Abstract: This study is devoted to the problem of question analysis for a Polish question answering system. The goal of the question analysis is to determine its general structure, type of an expected answer and create a search query for finding relevant documents in a textual knowledge base. The paper contains an overview of available solutions of these problems, description of their implementation and presents an evaluation based on a set of 1137 questions from a Polish quiz TV show. The results help to understand how an environment of a Slavonic language affects the performance of methods created for English.
5 0.83597529 218 acl-2013-Latent Semantic Tensor Indexing for Community-based Question Answering
Author: Xipeng Qiu ; Le Tian ; Xuanjing Huang
Abstract: Retrieving similar questions is very important in community-based question answering(CQA) . In this paper, we propose a unified question retrieval model based on latent semantic indexing with tensor analysis, which can capture word associations among different parts of CQA triples simultaneously. Thus, our method can reduce lexical chasm of question retrieval with the help of the information of question content and answer parts. The experimental result shows that our method outperforms the traditional methods.
7 0.76297605 266 acl-2013-PAL: A Chatterbot System for Answering Domain-specific Questions
8 0.75128734 272 acl-2013-Paraphrase-Driven Learning for Open Question Answering
9 0.70859796 291 acl-2013-Question Answering Using Enhanced Lexical Semantic Models
10 0.69121397 169 acl-2013-Generating Synthetic Comparable Questions for News Articles
11 0.62316155 107 acl-2013-Deceptive Answer Prediction with User Preference Graph
12 0.60868818 254 acl-2013-Multimodal DBN for Predicting High-Quality Answers in cQA portals
13 0.49143046 387 acl-2013-Why-Question Answering using Intra- and Inter-Sentential Causal Relations
14 0.48986581 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE
15 0.46257377 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation
16 0.46195811 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia
17 0.45961118 158 acl-2013-Feature-Based Selection of Dependency Paths in Ad Hoc Information Retrieval
18 0.44120386 159 acl-2013-Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
19 0.43549061 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
20 0.40419453 168 acl-2013-Generating Recommendation Dialogs by Extracting Information from User Reviews
topicId topicWeight
[(0, 0.049), (6, 0.026), (11, 0.053), (22, 0.228), (24, 0.041), (26, 0.071), (35, 0.067), (42, 0.054), (48, 0.043), (64, 0.012), (70, 0.096), (88, 0.067), (90, 0.015), (95, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.7704457 292 acl-2013-Question Classification Transfer
Author: Anne-Laure Ligozat
Abstract: Question answering systems have been developed for many languages, but most resources were created for English, which can be a problem when developing a system in another language such as French. In particular, for question classification, no labeled question corpus is available for French, so this paper studies the possibility of using existing English corpora and transferring a classification by translating the questions and their labels. By translating the training corpus, we obtain results close to a monolingual setting.
2 0.76325166 369 acl-2013-Unsupervised Consonant-Vowel Prediction over Hundreds of Languages
Author: Young-Bum Kim ; Benjamin Snyder
Abstract: In this paper, we present a solution to one aspect of the decipherment task: the prediction of consonants and vowels for an unknown language and alphabet. Adopting a classical Bayesian perspective, we perform posterior inference over hundreds of languages, leveraging knowledge of known languages and alphabets to uncover general linguistic patterns of typologically coherent language clusters. We achieve average accuracy in the unsupervised consonant/vowel prediction task of 99% across 503 languages. We further show that our methodology can be used to predict more fine-grained phonetic distinctions. On a three-way classification task between vowels, nasals, and nonnasal consonants, our model yields unsupervised accuracy of 89% across the same set of languages.
3 0.69007599 194 acl-2013-Improving Text Simplification Language Modeling Using Unsimplified Text Data
Author: David Kauchak
Abstract: In this paper we examine language modeling for text simplification. Unlike some text-to-text translation tasks, text simplification is a monolingual translation task allowing for text in both the input and output domain to be used for training the language model. We explore the relationship between normal English and simplified English and compare language models trained on varying amounts of text from each. We evaluate the models intrinsically with perplexity and extrinsically on the lexical simplification task from SemEval 2012. We find that a combined model using both simplified and normal English data achieves a 23% improvement in perplexity and a 24% improvement on the lexical simplification task over a model trained only on simple data. Post-hoc analysis shows that the additional unsimplified data provides better coverage for unseen and rare n-grams.
4 0.63592738 80 acl-2013-Chinese Parsing Exploiting Characters
Author: Meishan Zhang ; Yue Zhang ; Wanxiang Che ; Ting Liu
Abstract: Characters play an important role in the Chinese language, yet computational processing of Chinese has been dominated by word-based approaches, with leaves in syntax trees being words. We investigate Chinese parsing from the character-level, extending the notion of phrase-structure trees by annotating internal structures of words. We demonstrate the importance of character-level information to Chinese processing by building a joint segmentation, part-of-speech (POS) tagging and phrase-structure parsing system that integrates character-structure features. Our joint system significantly outperforms a state-of-the-art word-based baseline on the standard CTB5 test, and gives the best published results for Chinese parsing.
5 0.63064623 144 acl-2013-Explicit and Implicit Syntactic Features for Text Classification
Author: Matt Post ; Shane Bergsma
Abstract: Syntactic features are useful for many text classification tasks. Among these, tree kernels (Collins and Duffy, 2001) have been perhaps the most robust and effective syntactic tool, appealing for their empirical success, but also because they do not require an answer to the difficult question of which tree features to use for a given task. We compare tree kernels to different explicit sets of tree features on five diverse tasks, and find that explicit features often perform as well as tree kernels on accuracy and always in orders of magnitude less time, and with smaller models. Since explicit features are easy to generate and use (with publicly available tools), we suggest they should always be included as baseline comparisons in tree kernel method evaluations.
6 0.62953496 134 acl-2013-Embedding Semantic Similarity in Tree Kernels for Domain Adaptation of Relation Extraction
7 0.62881535 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing
8 0.62761986 155 acl-2013-Fast and Accurate Shift-Reduce Constituent Parsing
9 0.62750018 252 acl-2013-Multigraph Clustering for Unsupervised Coreference Resolution
10 0.62644011 131 acl-2013-Dual Training and Dual Prediction for Polarity Classification
11 0.62396419 153 acl-2013-Extracting Events with Informal Temporal References in Personal Histories in Online Communities
12 0.62357187 356 acl-2013-Transfer Learning Based Cross-lingual Knowledge Extraction for Wikipedia
13 0.62327003 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction
14 0.62289101 373 acl-2013-Using Conceptual Class Attributes to Characterize Social Media Users
16 0.62156832 318 acl-2013-Sentiment Relevance
17 0.62136573 288 acl-2013-Punctuation Prediction with Transition-based Parsing
18 0.62131262 196 acl-2013-Improving pairwise coreference models through feature space hierarchy learning
19 0.6200707 343 acl-2013-The Effect of Higher-Order Dependency Features in Discriminative Phrase-Structure Parsing
20 0.62002385 97 acl-2013-Cross-lingual Projections between Languages from Different Families