acl acl2011 acl2011-42 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall
Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.
Reference: text
sentIndex sentText sentNum sentScore
1 This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). [sent-8, score-0.473]
2 The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. [sent-9, score-0.127]
3 The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. [sent-10, score-0.512]
4 The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. [sent-11, score-0.334]
5 The general functionality of the interface is described in relation to the use cases that necessitated its creation. [sent-12, score-0.433]
6 1 Introduction In the days when public libraries were the center of information exchange, the job of the librarian was to serve as an interface between the complex library system and the average user. [sent-13, score-0.651]
7 The librarian made it possible for one to access specific sources of information without memorizing the Dewey Decimal System or flipping through the card catalog. [sent-14, score-0.493]
8 Analogous to the great librarians of yesteryear, the Annotation Librarian serves the average Java developer in the creation and management of annotations within natural language processing (NLP) projects built using the open source Apache Unstructured Information Management Architecture (UIMA)1. [sent-15, score-0.403]
9 Many NLP tasks are performed in processing steps that build upon one another. [sent-16, score-0.03]
10 Systems designed in this fashion are called pipelines because text is processed and then passed from one step to the next like water flowing through a pipe. [sent-17, score-0.053]
11 (Footnote 1: Apache UIMA is available from http://uima. org/) [sent-19, score-0.071]
12 Each step in the pipeline adds structured data on top of the text called annotations. [sent-20, score-0.067]
13 An annotation can be as simple as a classification of a span of text or complex with attributes and mappings to coded values. [sent-21, score-0.248]
14 As pipeline systems have caught on, the ability to standardize functionality in and even across pipelines has emerged. [sent-22, score-0.361]
15 UIMA provides a powerful infrastructure for the storage, transport, and retrieval of document and annotation knowledge accumulated in NLP pipeline systems (Ferrucci 2004). [sent-23, score-0.28]
16 Because UIMA provides the underlying data model for storing meta-data and annotations with document text and the interface for interacting between processing steps, it has become a popular platform for the development of reusable NLP systems (D’Avolio 2010, Coden 2009, Savova 2008). [sent-25, score-0.482]
17 The most notable example of UIMA capabilities is Watson, the question-answering system that competed and won two Jeopardy! [sent-26, score-0.032]
18 In addition to its successful implementations in NLP, UIMA supports all types of unstructured information – video, audio, images, etc – and so all UIMA constructs generalize beyond text. [sent-28, score-0.126]
19 While handling multiple data types increases the utility of the framework, developers new to UIMA may feel they need to understand the entire framework before being able to distinguish and focus solely on text. [sent-29, score-0.219]
20 The Annotation Librarian aids both novice and experienced UIMA developers by providing intuitive and NLP-centric functionality. [sent-30, score-0.315]
21 It provides convenience methods that mirror Java String manipulation, allowing developers to seamlessly combine document text and annotations with the same commands familiar to anyone who has parsed a String or written a regular expression. [sent-36, score-0.659]
22 Advanced functionality allows developers to examine spatial relationships among annotations and perform annotation pattern matching. [sent-37, score-0.784]
23 In this demonstration, we present the general functionality of the Annotation Librarian in the context of the health care research projects that necessitated the creation of the interface. [sent-38, score-0.507]
24 The interface does not replace the need for NLP algorithms – developers have a plethora of patterns and decision rules, symbolic grammars, and machine learning techniques to create annotations. [sent-39, score-0.442]
25 The Annotation Toolkit, though, provides a convenient way for developers to use existing annotations in their algorithms. [sent-40, score-0.346]
26 This feeds the pipeline workflow that allows more complex annotations to be built in later processing steps using the annotations created in earlier steps. [sent-41, score-0.49]
27 The Annotation Librarian was developed and modified in response to four research projects in the health care domain that relied on NLP extraction of concepts from clinical text. [sent-42, score-0.251]
28 The diversity of the different tasks in each of these use cases allowed the interface to include functionality common to different types of NLP system development. [sent-43, score-0.358]
29 Interface functionality will be described as groups of related methods in the context of the four research projects and cover pattern matching, span overlap, relative position, annotation modification, and retrieval. [sent-44, score-0.534]
30 All projects received Institutional Review Board approval for data use and only synthetic documents, not real patient records, are shown in the examples presented in this paper. [sent-45, score-0.238]
31 3 Pattern Matching Named entity recognition and semantic classification tasks often require advanced concept identification techniques. [sent-46, score-0.034]
32 Identifying mentions of prescriptions in a document using regular expressions, for example, would require hundreds of thousands of patterns for names of medicines and have to account for misspelling, abbreviations, and acronyms. [sent-47, score-0.143]
33 Regular expressions are commonly used to solve simple NLP tasks, though, and can be utilized as part of a more complex information extraction strategy, such as understanding the context in which a term is used in the text (Garvin 2011, McCrae 2008, Frenz 2007, Chapman 2001). [sent-48, score-0.035]
34 Negex (Chapman 2001) is an algorithm for identifying words before or after a term that suggest, for example, that a particular symptom is not present in a patient: “the patient has no fever. [sent-49, score-0.15]
35 ” Other methods for understanding the context around terms include the use of an inclusion and exclusion list (Akbar 2009), temporal locality search (Grouin 2009), window search (Li 2009), and combinations of the above techniques (Hamon 2009). [sent-50, score-0.031]
36 The Annotation Librarian allows patterns to be built using existing annotations along with document text. [sent-51, score-0.247]
37 This functionality combines the power of finding concepts that require complex means with the simplicity of regular expressions. [sent-52, score-0.296]
38 The syntax mirrors that of the Java Pattern and Matcher classes, but allows for an extended regular expression grammar to identify Annotations. [sent-53, score-0.136]
39 Pattern matching is accomplished in three phases: the input pattern is compiled, the document and annotations are analyzed for matches, and matches are returned along with span information. [sent-54, score-0.413]
40 A project identifying positive microbiology cultures will illustrate the use of pattern matching with the Annotation Librarian. [sent-55, score-0.353]
41 Clinicians order microbiology cultures to determine whether a patient has a bacterial infection and which antibiotics would be most effective at treating the infection. [sent-56, score-0.305]
42 Susceptibility is the measure of whether an antibiotic can effectively treat an organism or whether the organism is resistant to it. [sent-57, score-0.291]
43 A sample of microbiology report text is shown in Figure 1 and visualized annotations for the same sample are shown in Figure 2. [sent-58, score-0.286]
44 To demonstrate pattern matching in this sample, the simple pattern of a drug annotation followed by an equals sign and then by a susceptibility annotation will be used. [sent-65, score-0.686]
45 1 Pattern Compilation The pattern matching process begins when a new instance of an AnnotationPattern is created from the static compile method. [sent-67, score-0.161]
46 AnnotationPattern susceptibilityPattern = AnnotationPattern. [sent-69, score-0.104]
47 compile("pattern"); The method takes advantage of the UIMA implementation of annotations. [sent-70, score-0.125]
48 Each annotation is an instance of a class that inherits from the UIMA class Annotation. [sent-71, score-0.171]
49 UIMA allows developers to create new types of annotations (in this example Organism, Antibiotic, and Susceptibility) that become Java classes. [sent-72, score-0.384]
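The three phases described in sentences 39 and 44 (compile a pattern, analyze the document and its annotations, return matches with span information) can be illustrated with a minimal Java sketch. This is not the Annotation Librarian API, whose signatures are not shown in the extracted text. The class AnnotationPatternSketch, the record Span, the method match, and the result type name "DrugSusceptibility" are all hypothetical names introduced here; only java.util and java.util.regex are used, and a Java version with records (16 or later) is assumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

public class AnnotationPatternSketch {

    /** Hypothetical stand-in for a UIMA annotation: a typed span over the document text. */
    public record Span(String type, int begin, int end) {}

    /** Phase 1 (compile): the literal text allowed between the two annotations. */
    private static final Pattern EQUALS = Pattern.compile("\\s*=\\s*");

    /**
     * Phases 2 and 3 (analyze, return): find each Antibiotic annotation that is
     * immediately followed by "=" and a Susceptibility annotation, and return
     * one combined span per match.
     */
    public static List<Span> match(String text, List<Span> annotations) {
        // Work on a sorted copy so adjacent list entries are adjacent in the text.
        List<Span> sorted = new ArrayList<>(annotations);
        sorted.sort(Comparator.comparingInt(Span::begin));

        List<Span> matches = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            Span drug = sorted.get(i);
            Span susc = sorted.get(i + 1);
            boolean typesFit = drug.type().equals("Antibiotic")
                    && susc.type().equals("Susceptibility");
            if (!typesFit || drug.end() > susc.begin()) {
                continue; // wrong annotation types, or overlapping spans
            }
            // The text between the two annotations must be just an equals sign.
            String gap = text.substring(drug.end(), susc.begin());
            if (EQUALS.matcher(gap).matches()) {
                matches.add(new Span("DrugSusceptibility", drug.begin(), susc.end()));
            }
        }
        return matches;
    }
}
```

Under these assumptions, calling match on the text "AMPICILLIN = R, GENTAMICIN: S" with the two drug names annotated as Antibiotic and the letters R and S annotated as Susceptibility returns a single span covering "AMPICILLIN = R"; the second pair is skipped because a colon, not an equals sign, separates the annotations. The design point mirrors the paper's argument: the drug names come from earlier annotation steps, so the pattern itself stays as simple as a regular expression.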
wordName wordTfidf (topN-words)
[('uima', 0.552), ('librarian', 0.425), ('developers', 0.219), ('interface', 0.191), ('functionality', 0.167), ('java', 0.142), ('annotation', 0.134), ('microbiology', 0.127), ('organism', 0.127), ('susceptibility', 0.127), ('annotations', 0.127), ('patient', 0.113), ('pattern', 0.099), ('unstructured', 0.093), ('apache', 0.092), ('documented', 0.092), ('projects', 0.091), ('nlp', 0.086), ('annotationpattern', 0.085), ('balaji', 0.085), ('compi', 0.085), ('familiar', 0.077), ('necessitated', 0.075), ('management', 0.075), ('pipeline', 0.067), ('health', 0.067), ('chapman', 0.065), ('cultures', 0.065), ('regular', 0.064), ('care', 0.063), ('matching', 0.062), ('ferrucci', 0.062), ('pipelines', 0.053), ('xml', 0.05), ('rapid', 0.049), ('platform', 0.048), ('document', 0.045), ('creation', 0.044), ('span', 0.043), ('analogous', 0.043), ('audio', 0.041), ('images', 0.04), ('rn', 0.04), ('allows', 0.038), ('demonstration', 0.038), ('mirroring', 0.037), ('caught', 0.037), ('annotat', 0.037), ('jeopardy', 0.037), ('duvall', 0.037), ('flowing', 0.037), ('ginter', 0.037), ('inherits', 0.037), ('patte', 0.037), ('resistant', 0.037), ('reusable', 0.037), ('savova', 0.037), ('standardize', 0.037), ('symptom', 0.037), ('transport', 0.037), ('built', 0.037), ('matches', 0.037), ('attributes', 0.036), ('complex', 0.035), ('mirrors', 0.034), ('cept', 0.034), ('commands', 0.034), ('approval', 0.034), ('cation', 0.034), ('decimal', 0.034), ('flipping', 0.034), ('infrastructure', 0.034), ('medicines', 0.034), ('memorizing', 0.034), ('negex', 0.034), ('novice', 0.034), ('utah', 0.034), ('water', 0.034), ('development', 0.034), ('supports', 0.033), ('compilation', 0.032), ('visualized', 0.032), ('anyone', 0.032), ('competed', 0.032), ('mirror', 0.032), ('plethora', 0.032), ('experienced', 0.031), ('aids', 0.031), ('salt', 0.031), ('drug', 0.031), ('exclusion', 0.031), ('manipulation', 0.031), ('http', 0.031), ('steps', 0.03), ('architecture', 0.03), ('concepts', 0.03), ('su', 0.03), 
('developer', 0.029), ('seamlessly', 0.029), ('workflow', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall
Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.
2 0.11556838 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
Author: Youngjun Kim ; Ellen Riloff ; Stephane Meystre
Abstract: We present an NLP system that classifies the assertion type of medical problems in clinical notes used for the Fourth i2b2/VA Challenge. Our classifier uses a variety of linguistic features, including lexical, syntactic, lexicosyntactic, and contextual features. To overcome an extremely unbalanced distribution of assertion types in the data set, we focused our efforts on adding features specifically to improve the performance of minority classes. As a result, our system reached 94.17% micro-averaged and 79.76% macro-averaged F1-measures, and showed substantial recall gains on the minority classes.
3 0.072430775 182 acl-2011-Joint Annotation of Search Queries
Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith
Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.
4 0.055911761 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis
Author: Amjad Abu-Jbara ; Dragomir Radev
Abstract: In this paper we present Clairlib, an open-source toolkit for Natural Language Processing, Information Retrieval, and Network Analysis. Clairlib provides an integrated framework intended to simplify a number of generic tasks within and across those three areas. It has a command-line interface, a graphical interface, and a documented API. Clairlib is compatible with all the common platforms and operating systems. In addition to its own functionality, it provides interfaces to external software and corpora. Clairlib comes with comprehensive documentation and a rich set of tutorials and visual demos.
5 0.05328612 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
Author: Lonneke van der Plas ; Paola Merlo ; James Henderson
Abstract: Broad-coverage semantic annotations for training statistical learners are only available for a handful of languages. Previous approaches to cross-lingual transfer of semantic annotations have addressed this problem with encouraging results on a small scale. In this paper, we scale up previous efforts by using an automatic approach to semantic annotation that does not rely on a semantic ontology for the target language. Moreover, we improve the quality of the transferred semantic annotations by using a joint syntactic-semantic parser that learns the correlations between syntax and semantics of the target language and smooths out the errors from automatic transfer. We reach a labelled F-measure for predicates and arguments of only 4% and 9% points, respectively, lower than the upper bound from manual annotations.
6 0.051516313 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output
7 0.042955946 315 acl-2011-Types of Common-Sense Knowledge Needed for Recognizing Textual Entailment
8 0.041822866 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
9 0.04121371 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
10 0.037529405 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search
11 0.037262991 308 acl-2011-Towards a Framework for Abstractive Summarization of Multimodal Documents
12 0.03598316 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
13 0.035285491 298 acl-2011-The ACL Anthology Searchbench
14 0.034212917 86 acl-2011-Coreference for Learning to Extract Relations: Yes Virginia, Coreference Matters
15 0.033633981 242 acl-2011-Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
16 0.033209711 291 acl-2011-SystemT: A Declarative Information Extraction System
17 0.032351594 48 acl-2011-Automatic Detection and Correction of Errors in Dependency Treebanks
18 0.031963948 115 acl-2011-Engkoo: Mining the Web for Language Learning
19 0.031899117 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus
20 0.031587355 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation
topicId topicWeight
[(0, 0.091), (1, 0.028), (2, -0.034), (3, 0.007), (4, -0.024), (5, 0.006), (6, 0.003), (7, -0.027), (8, 0.004), (9, -0.013), (10, -0.022), (11, -0.015), (12, -0.022), (13, 0.013), (14, -0.009), (15, 0.001), (16, 0.009), (17, -0.02), (18, 0.003), (19, -0.043), (20, 0.022), (21, 0.01), (22, -0.001), (23, -0.027), (24, 0.007), (25, 0.031), (26, 0.018), (27, 0.019), (28, -0.016), (29, -0.04), (30, 0.016), (31, 0.042), (32, 0.064), (33, -0.0), (34, -0.016), (35, 0.034), (36, -0.007), (37, -0.019), (38, -0.046), (39, 0.038), (40, 0.099), (41, 0.076), (42, -0.031), (43, -0.053), (44, 0.007), (45, -0.004), (46, -0.017), (47, -0.013), (48, 0.034), (49, 0.085)]
simIndex simValue paperId paperTitle
same-paper 1 0.91695982 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall
Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.
2 0.60784864 165 acl-2011-Improving Classification of Medical Assertions in Clinical Notes
Author: Youngjun Kim ; Ellen Riloff ; Stephane Meystre
Abstract: We present an NLP system that classifies the assertion type of medical problems in clinical notes used for the Fourth i2b2/VA Challenge. Our classifier uses a variety of linguistic features, including lexical, syntactic, lexicosyntactic, and contextual features. To overcome an extremely unbalanced distribution of assertion types in the data set, we focused our efforts on adding features specifically to improve the performance of minority classes. As a result, our system reached 94.17% micro-averaged and 79.76% macro-averaged F1-measures, and showed substantial recall gains on the minority classes.
3 0.59120011 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
Author: Alexis Nasr ; Frederic Bechet ; Jean-Francois Rey ; Benoit Favre ; Joseph Le Roux
Abstract: MACAON is a tool suite for standard NLP tasks developed for French. MACAON has been designed to process both human-produced text and highly ambiguous word-lattices produced by NLP tools. MACAON is made of several native modules for common tasks such as tokenization, part-of-speech tagging or syntactic parsing, all communicating with each other through XML files. In addition, exchange protocols with external tools are easily definable. MACAON is a fast, modular and open tool, distributed under GNU Public License.
4 0.56844717 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
Author: Svetlana Kiritchenko ; Colin Cherry
Abstract: The automatic coding of clinical documents is an important task for today’s healthcare providers. Though it can be viewed as multi-label document classification, the coding problem has the interesting property that most code assignments can be supported by a single phrase found in the input document. We propose a Lexically-Triggered Hidden Markov Model (LT-HMM) that leverages these phrases to improve coding accuracy. The LT-HMM works in two stages: first, a lexical match is performed against a term dictionary to collect a set of candidate codes for a document. Next, a discriminative HMM selects the best subset of codes to assign to the document by tagging candidates as present or absent. By confirming codes proposed by a dictionary, the LT-HMM can share features across codes, enabling strong performance even on rare codes. In fact, we are able to recover codes that do not occur in the training set at all. Our approach achieves the best ever performance on the 2007 Medical NLP Challenge test set, with an F-measure of 89.84.
5 0.53584373 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements
Author: Oliver Schneider ; Alex Garnett
Abstract: We present ConsentCanvas, a system which structures and “texturizes” End-User License Agreement (EULA) documents to be more readable. The system aims to help users better understand the terms under which they are providing their informed consent. ConsentCanvas receives unstructured text documents as input and uses unsupervised natural language processing methods to embellish the source document using a linked stylesheet. Unlike similar usable security projects which employ summarization techniques, our system preserves the contents of the source document, minimizing the cognitive and legal burden for both the end user and the licensor. Our system does not require a corpus for training.
6 0.50507098 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
7 0.50435841 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature
8 0.50293398 291 acl-2011-SystemT: A Declarative Information Extraction System
9 0.49811122 214 acl-2011-Lost in Translation: Authorship Attribution using Frame Semantics
10 0.49403352 212 acl-2011-Local Histograms of Character N-grams for Authorship Attribution
11 0.48430702 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis
12 0.48329252 306 acl-2011-Towards Style Transformation from Written-Style to Audio-Style
13 0.47680661 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity
14 0.47666621 125 acl-2011-Exploiting Readymades in Linguistic Creativity: A System Demonstration of the Jigsaw Bard
15 0.47489917 317 acl-2011-Underspecifying and Predicting Voice for Surface Realisation Ranking
16 0.47478208 133 acl-2011-Extracting Social Power Relationships from Natural Language
17 0.46266803 138 acl-2011-French TimeBank: An ISO-TimeML Annotated Reference Corpus
18 0.4544819 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis
19 0.44984809 74 acl-2011-Combining Indicators of Allophony
20 0.44329908 31 acl-2011-Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations
topicId topicWeight
[(5, 0.052), (13, 0.374), (17, 0.023), (26, 0.048), (37, 0.051), (39, 0.05), (41, 0.074), (55, 0.026), (59, 0.031), (61, 0.017), (72, 0.012), (91, 0.054), (96, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.76938319 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA
Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall
Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.
2 0.53123277 63 acl-2011-Bootstrapping coreference resolution using word associations
Author: Hamidreza Kobdani ; Hinrich Schuetze ; Michael Schiehlen ; Hans Kamp
Abstract: In this paper, we present an unsupervised framework that bootstraps a complete coreference resolution (CoRe) system from word associations mined from a large unlabeled corpus. We show that word associations are useful for CoRe – e.g., the strong association between Obama and President is an indicator of likely coreference. Association information has so far not been used in CoRe because it is sparse and difficult to learn from small labeled corpora. Since unlabeled text is readily available, our unsupervised approach addresses the sparseness problem. In a self-training framework, we train a decision tree on a corpus that is automatically labeled using word associations. We show that this unsupervised system has better CoRe performance than other learning approaches that do not use manually labeled data.
3 0.50610894 11 acl-2011-A Fast and Accurate Method for Approximate String Search
Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang
Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.
4 0.47896272 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs
Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat
Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.
5 0.44082481 137 acl-2011-Fine-Grained Class Label Markup of Search Queries
Author: Joseph Reisinger ; Marius Pasca
Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.
6 0.40929323 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling
7 0.40227643 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation
8 0.40220082 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing
9 0.39990819 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding
10 0.39960706 291 acl-2011-SystemT: A Declarative Information Extraction System
11 0.39829761 119 acl-2011-Evaluating the Impact of Coder Errors on Active Learning
12 0.39740923 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering
13 0.39736927 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing
14 0.3960034 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices
15 0.39499521 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models
16 0.39494842 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction
17 0.39466155 246 acl-2011-Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
18 0.39436311 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora
19 0.39364302 269 acl-2011-Scaling up Automatic Cross-Lingual Semantic Role Annotation
20 0.39341325 178 acl-2011-Interactive Topic Modeling