215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices


Source: pdf

Author: Alexis Nasr ; Frederic Bechet ; Jean-Francois Rey ; Benoit Favre ; Joseph Le Roux

Abstract: MACAON is a tool suite for standard NLP tasks developed for French. MACAON has been designed to process both human-produced text and highly ambiguous word lattices produced by NLP tools. MACAON is made of several native modules for common tasks such as tokenization, part-of-speech tagging or syntactic parsing, all communicating with each other through XML files. In addition, exchange protocols with external tools are easily definable. MACAON is a fast, modular and open tool, distributed under the GNU Public License.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 MACAON is a tool suite for standard NLP tasks developed for French. [sent-8, score-0.099]

2 MACAON has been designed to process both human-produced text and highly ambiguous word-lattices produced by NLP tools. [sent-9, score-0.075]

3 MACAON is made of several native modules for common tasks such as tokenization, part-of-speech tagging or syntactic parsing, all communicating with each other through XML files. [sent-10, score-0.254]

4 In addition, exchange protocols with external tools are easily definable. [sent-11, score-0.275]

5 MACAON is a fast, modular and open tool, distributed under GNU Public License. [sent-12, score-0.028]

6 Unlike native texts (texts produced by humans), this new kind of text is the result of imperfect processors and is made up of several hypotheses, usually weighted with confidence measures. [sent-14, score-0.164]

7 Automatic text production systems can produce these weighted hypotheses as n-best lists, word lattices, or confusion networks. [sent-15, score-0.04]
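
As an illustration of the relationship between a word lattice and an n-best list, here is a minimal Python sketch. It is not MACAON code: the lattice layout, the state numbers, the words and the costs are all assumptions made up for the example, with costs standing in for confidence measures (lower is better).

```python
import heapq

# Hypothetical toy lattice: state -> list of (next_state, word, cost).
# The costs are illustrative values, not real recognizer output.
LATTICE = {
    0: [(1, "time", 0.4), (1, "thyme", 1.2)],
    1: [(2, "flies", 0.3), (2, "flees", 1.5)],
    2: [],
}

def n_best(lattice, start, final, n):
    """Return up to n lowest-cost word sequences from start to final.

    Assumes an acyclic lattice with non-negative arc costs, so partial
    paths can be expanded in best-first order.
    """
    heap = [(0.0, start, ())]
    results = []
    while heap and len(results) < n:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            results.append((round(cost, 6), list(words)))
            continue
        for nxt, word, arc_cost in lattice[state]:
            heapq.heappush(heap, (cost + arc_cost, nxt, words + (word,)))
    return results

TWO_BEST = n_best(LATTICE, 0, 2, 2)
```

The lattice stores four hypotheses in four arcs; the n-best list is a flattened, truncated view of the same space.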

8 It is crucial for this space of ambiguous solutions to be kept for later processing, since the ambiguities of the lower levels can sometimes be resolved during high-level processing stages. [sent-16, score-0.299]

9 This work has been funded by the French Agence Nationale pour la Recherche, through the projects SEQUOIA (ANR-08EMER-013) and DECODA (2009-CORD-005-01). [sent-18, score-0.029]

10 MACAON is a suite of tools developed to process ambiguous input and extend inference of input modules within a global scope. [sent-20, score-0.403]

11 It consists of several modules that perform classical NLP tasks (tokenization, word recognition, part-of-speech tagging, lemmatization, morphological analysis, partial or full parsing) on either native text or word lattices. [sent-21, score-0.208]

12 MACAON is distributed under GNU public licence and can be downloaded from http : / /www . [sent-22, score-0.055]

13 From a general point of view, a MACAON module can be seen as an annotation device¹ which adds a new level of annotation to its input that generally depends on annotations from preceding modules. [sent-27, score-0.273]

14 The modules communicate through XML files that allow the representation of different layers of annotation as well as ambiguities at each layer. [sent-28, score-0.325]

15 Moreover, the initial XML structuring of the processed files (logical structuring of a document, information from the Automatic Speech Recognition module . [sent-29, score-0.226]

16 As already mentioned, one of the main characteristics of MACAON is the ability for each module to accept ambiguous inputs and produce ambiguous outputs, in such a way that ambiguities can be resolved at a later stage of processing. [sent-33, score-0.359]

17 The compact representation of ambiguous structures is at the heart of the MACAON exchange format, described in section 2. [sent-34, score-0.333]

18 Furthermore, every module can weight the solutions it produces. [sent-35, score-0.105]

19 Such weights can be used to rank solutions or limit their number for later stages. ¹Annotation must be taken here in a general sense, which includes tagging, segmentation or the construction of more complex objects such as syntagmatic or dependency trees. [sent-36, score-0.118]
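
The idea of ranking weighted solutions and limiting their number can be sketched as follows. This is only an illustration under assumptions: the function name `prune`, its parameters and the (solution, cost) pair layout are hypothetical and do not come from MACAON; costs are treated as "lower is better".

```python
def prune(hypotheses, max_count=None, beam=None):
    """Rank (solution, cost) pairs by cost and optionally limit them.

    max_count keeps only the k best solutions; beam keeps the solutions
    whose cost is within `beam` of the best one. Both are hypothetical
    knobs chosen for this sketch.
    """
    ranked = sorted(hypotheses, key=lambda h: h[1])
    if beam is not None and ranked:
        best = ranked[0][1]
        ranked = [h for h in ranked if h[1] - best <= beam]
    if max_count is not None:
        ranked = ranked[:max_count]
    return ranked

# Illustrative tag hypotheses for one word, with made-up costs.
TAGS = [("VB", 1.0), ("JJ", 3.0), ("NN", 0.2)]
TOP_TWO = prune(TAGS, max_count=2)
IN_BEAM = prune(TAGS, beam=1.0)
```

A later module then only sees the retained hypotheses, which trades some recall for a smaller search space.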

20 Several processing tool suites already exist for French, among which SXPIPE (Sagot and Boullier, 2008), OUTILEX (Blanc et al. [sent-40, score-0.085]

21 A general comparison of MACAON with these tools is beyond the scope of this paper. [sent-42, score-0.056]

22 Let us just mention that MACAON shares with most of them the use of finite state machines as core data representation. [sent-43, score-0.128]

23 Some modules are implemented as standard operations on finite state machines. [sent-44, score-0.197]

24 MACAON can also be compared to the numerous frameworks for developing processing tools, such as GATE, FREELING, ELLOGON or LINGPIPE, that are usually limited to the processing of native texts. [sent-45, score-0.055]

25 The MACAON exchange format shares a certain number of features with linguistic annotation scheme standards such as the Text Encoding Initiative, XCES, or EAGLES. [sent-46, score-0.501]

26 They all aim at defining standards for various types of corpus annotations. [sent-47, score-0.084]

27 The main difference between MACAON and these approaches is that MACAON defines an exchange format between NLP modules and not an annotation format. [sent-48, score-0.513]

28 More precisely, this format is dedicated to the compact representation of ambiguity: some of the information represented in the exchange format is to be interpreted by MACAON modules and would not be part of an annotation format. [sent-49, score-0.766]

29 Moreover, the MACAON exchange format was defined from the bottom up, originating from the authors’ need to use several existing tools and adapt their input/output formats in order for them to be compatible. [sent-50, score-0.404]

30 Still, MACAON shares several characteristics with the LAF (Ide and Romary, 2004) which aims at defining high level standards for exchanging linguistic data. [sent-52, score-0.244]

31 [Section 2: The MACAON exchange format] The MACAON exchange format is based on four concepts: segment, attribute, annotation level and segmentation. [sent-74, score-0.786]

32 A segment refers to a portion of the text or speech signal that is to be processed, such as a sentence, a clause, a syntactic constituent, a lexical unit, a named entity . [sent-75, score-0.371]

33 A segment can be equipped with attributes that describe some of its aspects. [sent-78, score-0.202]

34 A syntactic constituent, for example, will define the attribute type which specifies its syntactic type (Noun Phrase, Verb Phrase . [sent-79, score-0.031]

35 A segment is made of one or more smaller segments. [sent-83, score-0.173]

36 A sequence of segments covering a whole sentence for written text, or a spoken utterance for oral data, is called a segmentation. [sent-84, score-0.199]

37 An annotation level groups together segments of the same type, as well as segmentations defined on these segments. [sent-86, score-0.371]

38 Four levels are currently defined: pre-lexical, lexical, morpho-syntactic and syntactic. [sent-87, score-0.039]
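
The four concepts (segment, attribute, annotation level, segmentation) can be pictured with a small data model. The sketch below is an assumption-laden illustration, not the MACAON implementation: the class names, field names and the example tags are hypothetical; only the four level names and the `type` attribute come from the text above.

```python
from dataclasses import dataclass, field

# The four annotation levels named in the text.
LEVELS = ("pre-lexical", "lexical", "morpho-syntactic", "syntactic")

@dataclass
class Segment:
    """A portion of the input, with attributes and smaller sub-segments."""
    level: str
    attributes: dict = field(default_factory=dict)
    parts: list = field(default_factory=list)  # smaller segments it is made of

    def __post_init__(self):
        if self.level not in LEVELS:
            raise ValueError(f"unknown annotation level: {self.level}")

@dataclass
class Segmentation:
    """A sequence of same-level segments covering a sentence or utterance."""
    level: str
    segments: list

    def __post_init__(self):
        if any(s.level != self.level for s in self.segments):
            raise ValueError("all segments must belong to the segmentation's level")

# Hypothetical example: a noun phrase made of a determiner and a noun.
dt = Segment("morpho-syntactic", {"type": "DT"})
nn = Segment("morpho-syntactic", {"type": "NN"})
np_segment = Segment("syntactic", {"type": "NP"}, [dt, nn])
tag_sequence = Segmentation("morpho-syntactic", [dt, nn])
```

Keeping the level on every segment makes it cheap to check that a segmentation never mixes levels.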

39 Two relations are defined on segments: the precedence relation, which linearly orders the segments of a given level into segmentations, and the dominance relation, which describes how a segment is decomposed into smaller segments either of the same level or of a lower level. [sent-88, score-0.869]

40 Figure 2 gives a schematic representation of the analysis of the reconstructed output that a speech recognizer would produce on the input "time flies like an arrow". [sent-89, score-0.192]

41 Three annotation levels have been represented, lexical, morpho-syntactic and syntactic. [sent-90, score-0.105]

42 Each level is represented by a finite-state automaton which models the precedence relation defined over the segments of this level. [sent-91, score-0.376]

43 The segments are implicitly represented by the labels of the automaton’s arcs. [sent-93, score-0.224]
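
To make the "one automaton per level, segments as arc labels" idea concrete, here is a minimal Python sketch that reads every segmentation off such an automaton. The dictionary layout and the state numbers are assumptions for illustration; the lexical arcs loosely follow the "time flies like an arrow" example but are not copied from the paper's figure.

```python
# Hypothetical lexical-level automaton: state -> list of (label, next_state).
LEXICAL = {
    0: [("time", 1), ("thyme", 1)],
    1: [("flies", 2)],
    2: [("like", 3)],
    3: [("an", 4)],
    4: [("arrow", 5)],
}

def segmentations(arcs, start, final):
    """Yield every label sequence on a path from start to final.

    Assumes an acyclic automaton with a single final state that has no
    outgoing arcs; each path is one segmentation of the level.
    """
    if start == final:
        yield []
        return
    for label, nxt in arcs.get(start, []):
        for rest in segmentations(arcs, nxt, final):
            yield [label] + rest

ALL_SEGMENTATIONS = list(segmentations(LEXICAL, 0, 5))
```

Here the precedence relation is exactly the arc adjacency: one segment precedes another when its arc ends where the other starts.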

44 The dominance relations are represented with dashed lines that link segments of different levels. [sent-95, score-0.373]

45 This example illustrates the different ambiguity cases and the way they are represented. [sent-97, score-0.094]

46 Figure 1: Three annotation levels for a sample sentence. [sent-99, score-0.105]

47 Plain lines represent annotation hypotheses within a level while dashed lines represent links between levels. [sent-100, score-0.261]

48 Triangles with the tip up are “and” nodes and triangles with the tip down are “or” nodes. [sent-101, score-0.352]

49 In the chunking layer, segments that span multiple part-of-speech tags are linked to them through “and” nodes. [sent-103, score-0.174]

50 The most immediate ambiguity phenomenon is the segmentation ambiguity: several segmentations are possible at every level. [sent-104, score-0.205]

51 This ambiguity is represented in a compact way through the factoring of segments that participate in different segmentations, by way of a finite state automaton. [sent-105, score-0.459]
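
A rough way to see why factoring shared segments in an automaton is compact: the number of segmentations can grow exponentially while the number of arcs grows linearly. The sketch below counts paths without enumerating them; the automaton layout (state -> list of (label, next_state)) and the toy data are assumptions for illustration only.

```python
from functools import lru_cache

def count_segmentations(arcs, start, final):
    """Count the paths from start to final in an acyclic automaton."""
    @lru_cache(maxsize=None)
    def count(state):
        if state == final:
            return 1
        return sum(count(nxt) for _, nxt in arcs.get(state, ()))
    return count(start)

# Hypothetical automaton: 10 positions, each with 2 competing segments.
CHAIN = {i: [("a", i + 1), ("b", i + 1)] for i in range(10)}
ARC_COUNT = sum(len(out) for out in CHAIN.values())
PATH_COUNT = count_segmentations(CHAIN, 0, 10)
```

With these toy values, 20 arcs stand for 1024 distinct segmentations, which is what an explicit n-best list would have to spell out one by one.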

52 The second ambiguity phenomenon is the dominance ambiguity, where a segment can be decomposed in several ways into lower level segments. [sent-106, score-0.462]

53 Such a case appears in the preceding example, where the NN segment appearing in one of the outgoing transitions of the initial state of the morpho-syntactic level dominates both the thyme and time segments of the lexical level. [sent-107, score-0.564]

54 The triangle with the tip down is an “or” node, modeling the fact that NN corresponds to time or thyme. [sent-108, score-0.16]

55 They model the fact that the PP segment of the syntactic level dominates segments IN, DT and NN of the morpho-syntactic level. [sent-110, score-0.439]
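
The "and"/"or" nodes of the dominance relation can be sketched as a small tree that is expanded into concrete decompositions. The tuple encoding and the function name `expand` are assumptions made for this illustration; the two examples (NN over time or thyme, PP over IN, DT and NN) come from the sentences above.

```python
from itertools import product

def expand(node):
    """Return every concrete sequence of lower-level segments under a node.

    A node is either a string (a lower-level segment) or a pair
    (kind, children) where kind is "or" (alternatives) or "and"
    (all children, in order). This encoding is hypothetical.
    """
    if isinstance(node, str):
        return [[node]]
    kind, children = node
    expanded = [expand(child) for child in children]
    if kind == "or":
        return [seq for alternatives in expanded for seq in alternatives]
    if kind == "and":
        return [sum(combo, []) for combo in product(*expanded)]
    raise ValueError(f"unknown node kind: {kind}")

# NN dominates either "time" or "thyme" (an "or" node).
NN_NODE = ("or", ["time", "thyme"])
# PP dominates IN, DT and NN together (an "and" node).
PP_NODE = ("and", ["IN", "DT", "NN"])

NN_CHOICES = expand(NN_NODE)
PP_CHOICES = expand(PP_NODE)
```

Nesting an "or" node inside an "and" node multiplies out the alternatives, which is how a single compact structure can stand for several decompositions.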

56 [Section 2.1: XML representation] The MACAON exchange format is implemented in XML. [sent-112, score-0.346]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('macaon', 0.796), ('exchange', 0.187), ('segments', 0.174), ('segment', 0.173), ('format', 0.132), ('modules', 0.128), ('tip', 0.128), ('xml', 0.107), ('triangles', 0.096), ('ambiguity', 0.094), ('segmentations', 0.081), ('dominance', 0.079), ('ambiguous', 0.075), ('roux', 0.072), ('thyme', 0.072), ('suite', 0.068), ('annotation', 0.066), ('module', 0.066), ('nn', 0.06), ('ambiguities', 0.059), ('shares', 0.059), ('standards', 0.057), ('tools', 0.056), ('native', 0.055), ('precedence', 0.052), ('automaton', 0.05), ('level', 0.05), ('represented', 0.05), ('favre', 0.048), ('gnu', 0.045), ('files', 0.045), ('fr', 0.044), ('compact', 0.044), ('structuring', 0.043), ('dominates', 0.042), ('finite', 0.041), ('hypotheses', 0.04), ('solutions', 0.039), ('nlp', 0.039), ('levels', 0.039), ('org', 0.039), ('lattices', 0.039), ('rey', 0.038), ('layer', 0.038), ('decomposed', 0.036), ('dashed', 0.035), ('lines', 0.035), ('resolved', 0.035), ('tokenization', 0.033), ('html', 0.032), ('processors', 0.032), ('agence', 0.032), ('umr', 0.032), ('flies', 0.032), ('triangle', 0.032), ('objet', 0.032), ('laf', 0.032), ('boullier', 0.032), ('developped', 0.032), ('echet', 0.032), ('nasr', 0.032), ('protocols', 0.032), ('sagot', 0.032), ('untouched', 0.032), ('tool', 0.031), ('attribute', 0.031), ('phenomenon', 0.03), ('originating', 0.029), ('reconstructed', 0.029), ('cnrs', 0.029), ('schematic', 0.029), ('objets', 0.029), ('equipped', 0.029), ('noo', 0.029), ('pour', 0.029), ('suites', 0.029), ('processed', 0.029), ('state', 0.028), ('nas', 0.028), ('factoring', 0.028), ('highlevel', 0.028), ('distributed', 0.028), ('public', 0.027), ('representation', 0.027), ('defining', 0.027), ('texts', 0.026), ('cois', 0.026), ('syntagmatic', 0.026), ('exchanging', 0.026), ('tagging', 0.026), ('preceding', 0.025), ('speech', 0.025), ('french', 0.025), ('constituent', 0.025), ('oral', 0.025), ('imperfect', 0.025), ('beno', 0.025), ('characteristics', 0.025), ('later', 0.024), ('alexis', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices


2 0.050598487 62 acl-2011-Blast: A Tool for Error Analysis of Machine Translation Output

Author: Sara Stymne

Abstract: We present BLAST, an open source tool for error analysis of machine translation (MT) output. We believe that error analysis, i.e., to identify and classify MT errors, should be an integral part of MT development, since it gives a qualitative view, which is not obtained by standard evaluation methods. BLAST can aid MT researchers and users in this process, by providing an easy-to-use graphical user interface. It is designed to be flexible, and can be used with any MT system, language pair, and error typology. The annotation task can be aided by highlighting similarities with a reference translation.

3 0.049298182 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model

Author: Elijah Mayfield ; Carolyn Penstein Rose

Abstract: We present a novel computational formulation of speaker authority in discourse. This notion, which focuses on how speakers position themselves relative to each other in discourse, is first developed into a reliable coding scheme (0.71 agreement between human annotators). We also provide a computational model for automatically annotating text using this coding scheme, using supervised learning enhanced by constraints implemented with Integer Linear Programming. We show that this constrained model’s analyses of speaker authority correlate very strongly with expert human judgments (r2 coefficient of 0.947).

4 0.048755888 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

5 0.046114601 339 acl-2011-Word Alignment Combination over Multiple Word Segmentation

Author: Ning Xi ; Guangchao Tang ; Boyuan Li ; Yinggong Zhao

Abstract: In this paper, we present a new word alignment combination approach on language pairs where one language has no explicit word boundaries. Instead of combining word alignments of different models (Xiang et al., 2010), we try to combine word alignments over multiple monolingually motivated word segmentations. Our approach is based on a link confidence score defined over multiple segmentations, thus the combined alignment is more robust to inappropriate word segmentation. Our combination algorithm is simple, efficient, and easy to implement. In the Chinese-English experiment, our approach effectively improved word alignment quality as well as translation performance on all segmentations simultaneously, which showed that word alignment can benefit from complementary knowledge due to the diversity of multiple and monolingually motivated segmentations.

6 0.043132331 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

7 0.03917481 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

8 0.037258249 170 acl-2011-In-domain Relation Discovery with Meta-constraints via Posterior Regularization

9 0.036775015 294 acl-2011-Temporal Evaluation

10 0.036552034 77 acl-2011-Computing and Evaluating Syntactic Complexity Features for Automated Scoring of Spontaneous Non-Native Speech

11 0.036133748 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

12 0.03598316 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

13 0.035730246 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

14 0.033784918 182 acl-2011-Joint Annotation of Search Queries

15 0.033152245 333 acl-2011-Web-Scale Features for Full-Scale Parsing

16 0.032977771 72 acl-2011-Collecting Highly Parallel Data for Paraphrase Evaluation

17 0.032313619 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

18 0.03204396 329 acl-2011-Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition

19 0.031872105 117 acl-2011-Entity Set Expansion using Topic information

20 0.030780246 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.087), (1, -0.005), (2, -0.022), (3, -0.009), (4, -0.033), (5, 0.022), (6, 0.019), (7, -0.012), (8, -0.01), (9, 0.007), (10, -0.007), (11, 0.012), (12, -0.038), (13, -0.003), (14, -0.021), (15, 0.004), (16, 0.014), (17, -0.018), (18, 0.02), (19, 0.072), (20, 0.029), (21, 0.026), (22, 0.008), (23, -0.028), (24, 0.046), (25, 0.001), (26, 0.002), (27, 0.032), (28, -0.007), (29, -0.017), (30, 0.018), (31, 0.037), (32, 0.071), (33, -0.044), (34, 0.006), (35, 0.025), (36, 0.007), (37, 0.009), (38, -0.003), (39, -0.018), (40, 0.046), (41, 0.027), (42, 0.01), (43, -0.036), (44, 0.014), (45, -0.022), (46, -0.039), (47, -0.072), (48, 0.016), (49, 0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90277666 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices


2 0.61336786 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall

Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.

3 0.55677819 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

Author: Weiwei Sun

Abstract: The large combined search space of joint word segmentation and Part-of-Speech (POS) tagging makes efficient decoding very hard. As a result, effective high order features representing rich contexts are inconvenient to use. In this work, we propose a novel stacked subword model for this task, concerning both efficiency and effectiveness. Our solution is a two step process. First, one word-based segmenter, one character-based segmenter and one local character classifier are trained to produce coarse segmentation and POS information. Second, the outputs of the three predictors are merged into sub-word sequences, which are further bracketed and labeled with POS tags by a fine-grained sub-word tagger. The coarse-to-fine search scheme is efficient, while in the sub-word tagging step rich contextual features can be approximately derived. Evaluation on the Penn Chinese Treebank shows that our model yields improvements over the best system reported in the literature.

4 0.54602414 284 acl-2011-Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Author: Elias Ponvert ; Jason Baldridge ; Katrin Erk

Abstract: We consider a new subproblem of unsupervised parsing from raw text, unsupervised partial parsing—the unsupervised version of text chunking. We show that addressing this task directly, using probabilistic finite-state methods, produces better results than relying on the local predictions of a current best unsupervised parser, Seginer’s (2007) CCL. These finite-state models are combined in a cascade to produce more general (full-sentence) constituent structures; doing so outperforms CCL by a wide margin in unlabeled PARSEVAL scores for English, German and Chinese. Finally, we address the use of phrasal punctuation as a heuristic indicator of phrasal boundaries, both in our system and in CCL.

5 0.53784406 140 acl-2011-Fully Unsupervised Word Segmentation with BVE and MDL

Author: Daniel Hewlett ; Paul Cohen

Abstract: Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select between them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.

6 0.50535691 184 acl-2011-Joint Hebrew Segmentation and Parsing using a PCFGLA Lattice Parser

7 0.49673727 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

8 0.48115727 298 acl-2011-The ACL Anthology Searchbench

9 0.47762436 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework

10 0.47433928 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

11 0.46917078 222 acl-2011-Model-Portability Experiments for Textual Temporal Analysis

12 0.46336997 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

13 0.461555 291 acl-2011-SystemT: A Declarative Information Extraction System

14 0.46016243 66 acl-2011-Chinese sentence segmentation as comma classification

15 0.45761502 138 acl-2011-French TimeBank: An ISO-TimeML Annotated Reference Corpus

16 0.44867277 229 acl-2011-NULEX: An Open-License Broad Coverage Lexicon

17 0.44773388 260 acl-2011-Recognizing Authority in Dialogue with an Integer Linear Programming Constrained Model

18 0.44716766 187 acl-2011-Jointly Learning to Extract and Compress

19 0.43803799 321 acl-2011-Unsupervised Discovery of Rhyme Schemes

20 0.42788008 238 acl-2011-P11-2093 k2opt.pdf


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.046), (17, 0.043), (26, 0.066), (31, 0.012), (37, 0.044), (39, 0.096), (41, 0.051), (55, 0.015), (59, 0.011), (62, 0.015), (66, 0.255), (72, 0.031), (91, 0.058), (96, 0.148), (97, 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79753089 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices


2 0.62610155 111 acl-2011-Effects of Noun Phrase Bracketing in Dependency Parsing and Machine Translation

Author: Nathan Green

Abstract: Flat noun phrase structure was, up until recently, the standard in annotation for the Penn Treebanks. With the recent addition of internal noun phrase annotation, dependency parsing and applications down the NLP pipeline are likely affected. Some machine translation systems, such as TectoMT, use deep syntax as a language transfer layer. It is proposed that changes to the noun phrase dependency parse will have a cascading effect down the NLP pipeline and in the end, improve machine translation output, even with a reduction in parser accuracy that the noun phrase structure might cause. This paper examines this noun phrase structure’s effect on dependency parsing, in English, with a maximum spanning tree parser and shows a 2.43%, 0.23 Bleu score, improvement for English to Czech machine translation.

3 0.61109567 182 acl-2011-Joint Annotation of Search Queries

Author: Michael Bendersky ; W. Bruce Croft ; David A. Smith

Abstract: Marking up search queries with linguistic annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of query processing and understanding in information retrieval systems. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing NLP tools. To address this challenge, we propose a probabilistic approach for performing joint query annotation. First, we derive a robust set of unsupervised independent annotations, using queries and pseudo-relevance feedback. Then, we stack additional classifiers on the independent annotations, and exploit the dependencies between them to further improve the accuracy, even with a very limited amount of available training data. We evaluate our method using a range of queries extracted from a web search log. Experimental results verify the effectiveness of our approach for both short keyword queries, and verbose natural language queries.

4 0.61017191 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

Author: Andrei Popescu-Belis ; Majid Yazdani ; Alexandre Nanchen ; Philip N. Garner

Abstract: The Automatic Content Linking Device is a just-in-time document retrieval system which monitors an ongoing conversation or a monologue and enriches it with potentially related documents, including multimedia ones, from local repositories or from the Internet. The documents are found using keyword-based search or using a semantic similarity measure between documents and the words obtained from automatic speech recognition. Results are displayed in real time to meeting participants, or to users watching a recorded lecture or conversation.

5 0.60969639 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

6 0.60072708 258 acl-2011-Ranking Class Labels Using Query Sessions

7 0.59944516 241 acl-2011-Parsing the Internal Structure of Words: A New Paradigm for Chinese Word Segmentation

8 0.59846079 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

9 0.59554696 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

10 0.59554088 97 acl-2011-Discovering Sociolinguistic Associations with Structured Sparsity

11 0.59464455 27 acl-2011-A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

12 0.59120989 300 acl-2011-The Surprising Variance in Shortest-Derivation Parsing

13 0.59104574 117 acl-2011-Entity Set Expansion using Topic information

14 0.58948529 192 acl-2011-Language-Independent Parsing with Empty Elements

15 0.5890345 178 acl-2011-Interactive Topic Modeling

16 0.58891678 88 acl-2011-Creating a manually error-tagged and shallow-parsed learner corpus

17 0.5859133 145 acl-2011-Good Seed Makes a Good Crop: Accelerating Active Learning Using Language Modeling

18 0.58499318 11 acl-2011-A Fast and Accurate Method for Approximate String Search

19 0.58467317 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

20 0.58464193 207 acl-2011-Learning to Win by Reading Manuals in a Monte-Carlo Framework