acl acl2012 acl2012-196 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Brian Roark ; Richard Sproat ; Cyril Allauzen ; Michael Riley ; Jeffrey Sorensen ; Terry Tai
Abstract: In this paper, we present a new collection of open-source software libraries that provides command line binary utilities and library classes and functions for compiling regular expression and context-sensitive rewrite rules into finite-state transducers, and for n-gram language modeling. The OpenGrm libraries use the OpenFst library to provide an efficient encoding of grammars and general algorithms for building, modifying and applying models.
Reference: text
sentIndex sentText sentNum sentScore
1 The OpenGrm open-source finite-state grammar software libraries Brian Roark† Richard Sproat†◦ Cyril Allauzen◦ Michael Riley◦ Jeffrey Sorensen◦ & Terry Tai◦ †Oregon Health & Science University, Portland, Oregon ◦Google, Inc. [sent-1, score-0.371]
2 The OpenGrm libraries use the OpenFst library to provide an efficient encoding of grammars and general algorithms for building, modifying and applying models. [sent-3, score-0.495]
3 1 Introduction The OpenGrm libraries1 are a (growing) collection of open-source software libraries for building and applying various kinds of formal grammars. [sent-4, score-0.348]
4 The C++ libraries use the OpenFst library2 for the underlying finite-state representation, which allows for easy inspection of the resulting grammars and models, as well as straightforward combination with other finite-state transducers. [sent-5, score-0.404]
5 Like OpenFst, there are easy-to-use command line binaries for frequently used operations, as well as a C++ library interface, allowing library users to create their own algorithms from the basic classes and functions provided. [sent-6, score-0.532]
6 The libraries can be used for a range of common string processing tasks, such as text normalization, as well as for building and using large statistical models for applications like speech recognition. [sent-7, score-0.347]
7 In the rest of the paper, we will present each of the two libraries, starting with the Thrax grammar compiler and then the NGram library. [sent-8, score-0.22]
8 First, though, we will briefly present some preliminary (informal) background on weighted finite-state transducers (WFST), just as needed for this paper. [sent-9, score-0.164]
9 2 Informal WFST preliminaries A weighted finite-state transducer consists of a set of states and transitions between states. [sent-14, score-0.289]
10 There is an initial state and a subset of states are final. [sent-15, score-0.138]
11 Each transition is labeled with an input symbol from an input alphabet; an output symbol from an output alphabet; an origin state; a destination state; and a weight. [sent-16, score-0.797]
12 A path in the WFST is a sequence of transitions where each transition’s destination state is the next transition’s origin state. [sent-18, score-0.423]
13 A valid path through the WFST is a path where the origin state of the first transition is an initial state, and the last transition is to a final state. [sent-19, score-0.774]
14 Weights combine along the path according to the semiring of the WFST. [sent-20, score-0.237]
15 If every transition in the transducer has the same input and output symbol, then the WFST represents a weighted finite-state automaton. [sent-21, score-0.382]
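The path and weight definitions above can be sketched in a few lines of Python. This is an illustrative toy in the tropical semiring (weights add along a path), not the OpenFst API; the transition tuples and function names are invented:

```python
# Toy WFST in the tropical semiring: weights combine along a path by
# addition (and alternative paths would combine by minimum).

# Each transition: (origin, input symbol, output symbol, destination, weight).
transitions = [
    (0, "a", "x", 1, 0.5),
    (1, "b", "y", 2, 1.0),
    (1, "b", "z", 2, 0.25),
]
initial, finals = 0, {2}

def path_weight(path):
    """Sum the weights along a path of transitions (tropical semiring)."""
    return sum(w for (_, _, _, _, w) in path)

def accepts(path):
    """A valid path starts at the initial state, ends in a final state,
    and each transition's destination is the next transition's origin."""
    if not path or path[0][0] != initial or path[-1][3] not in finals:
        return False
    return all(p[3] == q[0] for p, q in zip(path, path[1:]))

p = [transitions[0], transitions[2]]   # 0 -a:x-> 1 -b:z-> 2
assert accepts(p)
print(path_weight(p))  # 0.75
```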
16 The ε (or epsilon) symbol represents the empty string, which allows the transition to be traversed without consuming any symbol. [sent-24, score-0.637]
17 The φ (or failure) symbol on a transition also allows it to be traversed without consuming any symbol, but it differs from ε [sent-25, score-0.604]
18 in only allowing traversal if the symbol being matched does not label any other transition leaving the same state, i.e., it is taken only when no other transition matches. [sent-26, score-0.501]
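The difference between ε and φ can be sketched as a toy matching function over one state's outgoing arcs (again illustrative only, not the OpenFst matcher API; the arc table and names are invented):

```python
# Toy illustration of epsilon vs. phi (failure) arcs at a single state.
EPS, PHI = "<eps>", "<phi>"

# Outgoing arcs at one state: label -> destination state.
arcs = {"a": 1, "b": 2, PHI: 3, EPS: 4}

def matching_arcs(symbol):
    """Destinations reachable when `symbol` is the next input symbol."""
    out = []
    if symbol in arcs and symbol not in (EPS, PHI):
        out.append(arcs[symbol])      # ordinary match: consumes the symbol
    if EPS in arcs:
        out.append(arcs[EPS])         # epsilon: always traversable, consumes nothing
    if PHI in arcs and symbol not in arcs:
        out.append(arcs[PHI])         # phi: only if no other arc matches the symbol
    return out

print(matching_arcs("a"))  # [1, 4] -- the "a" arc plus the epsilon arc
print(matching_arcs("c"))  # [4, 3] -- epsilon plus the failure arc
```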
19 For a more detailed presentation of WFSTs, see Allauzen et al. [sent-29, score-0.037]
20 3 The Thrax Grammar Compiler The Thrax grammar compiler3 compiles grammars that consist of regular expressions and context-dependent rewrite rules into FST archives (fars) of weighted finite-state transducers. [sent-31, score-0.496]
21 Grammars may be split over multiple files and imported into other grammars.3 [sent-32, score-0.166]
22 3The compiler is named after Dionysius Thrax (170–90 BCE), the reputed first Greek grammarian. [sent-35, score-0.07]
23 Strings in the rules may be parsed in one of three different ways: as a sequence of bytes (the default), as utf8 encodings, or according to a user-provided symbol table. [sent-36, score-0.292]
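The practical difference between the byte and utf8 parse modes is visible even outside Thrax: a non-ASCII string yields a different symbol sequence under each mode. A small Python illustration (not Thrax code):

```python
# A non-ASCII string parsed as bytes vs. as UTF-8 code points.
s = "café"
byte_symbols = list(s.encode("utf-8"))   # one symbol per byte
utf8_symbols = [ord(c) for c in s]       # one symbol per code point

print(len(byte_symbols))  # 5 -- "é" is two bytes in UTF-8
print(len(utf8_symbols))  # 4 -- but a single code point
```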
24 With the --save_symbols flag, the transducers can be saved out into fars with appropriate symbol tables. [sent-37, score-0.709]
25 The Thrax libraries provide full support for different weight (semiring) classes. [sent-38, score-0.262]
26 The command-line flag --semiring allows one to set the semiring, currently to one of the tropical (default), log, or log64 semirings. [sent-39, score-0.209]
27 3.1 General Description Thrax revolves around rules which typically construct an FST based on a given input. [sent-41, score-0.037]
28 Thrax provides a set of built-in functions that aid in the construction of more complex expressions. [sent-43, score-0.102]
29 We have already seen the disjunction "|" in the previous section. [sent-44, score-0.048]
30 Other than disjunction, the regular operations are expr*, expr+, expr? [sent-46, score-0.038]
31 and expr{m,n}, the latter repeating expr between m and n times, inclusive. [sent-47, score-0.317]
32 Composition is notated with “@” so that expr1 @ expr2 denotes the composition of expr1 and expr2. [sent-48, score-0.106]
33 Rewriting is denoted with “:” where expr1 : expr2 rewrites strings that match expr1 into expr2. [sent-49, score-0.096]
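Putting these operators together, a minimal hypothetical grammar fragment might look as follows (the file and rule names are invented for illustration and are not taken from the paper):

```
# hypothetical.grm -- names invented for illustration
digit = "0" | "1" | "2" | "3";                    # disjunction with "|"
number = digit{1,3};                              # expr{m,n}: one to three digits, inclusive
spell_zero = ("0" : "zero");                      # ":" rewrites "0" into "zero"
export SPELLED = number @ (spell_zero | digit)*;  # "@" composes the two expressions
```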
wordName wordTfidf (topN-words)
[('thrax', 0.442), ('expr', 0.276), ('libraries', 0.262), ('opengrm', 0.221), ('pear', 0.221), ('wfst', 0.22), ('symbol', 0.207), ('openfst', 0.205), ('transition', 0.193), ('compiler', 0.166), ('kiwi', 0.166), ('semiring', 0.166), ('library', 0.123), ('origin', 0.116), ('fst', 0.111), ('fars', 0.11), ('traversed', 0.11), ('transducers', 0.106), ('state', 0.104), ('transducer', 0.098), ('fsts', 0.096), ('flag', 0.088), ('enclosed', 0.088), ('allauzen', 0.082), ('command', 0.077), ('grammars', 0.076), ('destination', 0.074), ('identifier', 0.074), ('path', 0.071), ('functions', 0.069), ('consuming', 0.065), ('strings', 0.059), ('weighted', 0.058), ('composition', 0.058), ('transitions', 0.058), ('rewrite', 0.058), ('informal', 0.055), ('software', 0.055), ('string', 0.054), ('grammar', 0.054), ('tropical', 0.048), ('sorensen', 0.048), ('bet', 0.048), ('quently', 0.048), ('disjunction', 0.048), ('wfsts', 0.048), ('notated', 0.048), ('uni', 0.048), ('bytes', 0.048), ('oregon', 0.045), ('ring', 0.044), ('contextdependent', 0.044), ('utilities', 0.044), ('ave', 0.044), ('repeating', 0.041), ('operators', 0.041), ('encodings', 0.041), ('preliminaries', 0.041), ('imported', 0.041), ('regular', 0.039), ('compiling', 0.039), ('operations', 0.038), ('rules', 0.037), ('sproat', 0.037), ('traversal', 0.037), ('greek', 0.037), ('presentation', 0.037), ('riley', 0.037), ('rewrites', 0.037), ('archives', 0.037), ('health', 0.037), ('inspection', 0.037), ('saved', 0.035), ('custom', 0.035), ('roark', 0.035), ('modifying', 0.034), ('terry', 0.034), ('allowing', 0.034), ('states', 0.034), ('default', 0.033), ('expressions', 0.033), ('represents', 0.033), ('aid', 0.033), ('alphabet', 0.032), ('failure', 0.032), ('building', 0.031), ('classes', 0.031), ('operator', 0.031), ('cyril', 0.031), ('rewriting', 0.03), ('hs', 0.03), ('leaving', 0.03), ('files', 0.029), ('allows', 0.029), ('simplest', 0.028), ('line', 0.027), ('ngram', 0.026), ('final', 0.026), ('brian', 0.026), ('finite', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
Author: Xingyuan Peng ; Dengfeng Ke ; Bo Xu
Abstract: Conventional Automated Essay Scoring (AES) measures may cause severe problems when directly applied in scoring Automatic Speech Recognition (ASR) transcription as they are error sensitive and unsuitable for the characteristic of ASR transcription. Therefore, we introduce a framework of Finite State Transducer (FST) to avoid the shortcomings. Compared with the Latent Semantic Analysis with Support Vector Regression (LSA-SVR) method (stands for the conventional measures), our FST method shows better performance especially towards the ASR transcription. In addition, we apply the synonyms similarity to expand the FST model. The final scoring performance reaches an acceptable level of 0.80 which is only 0.07 lower than the correlation (0.87) between human raters.
3 0.094572082 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater
Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.
4 0.079538248 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
Author: Hiroyuki Shindo ; Yusuke Miyao ; Akinori Fujino ; Masaaki Nagata
Abstract: We propose Symbol-Refined Tree Substitution Grammars (SR-TSGs) for syntactic parsing. An SR-TSG is an extension of the conventional TSG model where each nonterminal symbol can be refined (subcategorized) to fit the training data. We aim to provide a unified model where TSG rules and symbol refinement are learned from training data in a fully automatic and consistent fashion. We present a novel probabilistic SR-TSG model based on the hierarchical Pitman-Yor Process to encode backoff smoothing from a fine-grained SR-TSG to simpler CFG rules, and develop an efficient training method based on Markov Chain Monte Carlo (MCMC) sampling. Our SR-TSG parser achieves an F1 score of 92.4% in the Wall Street Journal (WSJ) English Penn Treebank parsing task, which is a 7.7 point improvement over a conventional Bayesian TSG parser, and better than state-of-the-art discriminative reranking parsers.
5 0.068468124 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
Author: Micha Elsner ; Sharon Goldwater ; Jacob Eisenstein
Abstract: During early language acquisition, infants must learn both a lexicon and a model of phonetics that explains how lexical items can vary in pronunciation—for instance “the” might be realized as [Di] or [D@]. Previous models of acquisition have generally tackled these problems in isolation, yet behavioral evidence suggests infants acquire lexical and phonetic knowledge simultaneously. We present a Bayesian model that clusters together phonetic variants of the same lexical item while learning both a language model over lexical items and a log-linear model of pronunciation variability based on articulatory features. The model is trained on transcribed surface pronunciations, and learns by bootstrapping, without access to the true lexicon. We test the model using a corpus of child-directed speech with realistic phonetic variation and either gold standard or automatically induced word boundaries. In both cases modeling variability improves the accuracy of the learned lexicon over a system that assumes each lexical item has a unique pronunciation.
6 0.048960056 139 acl-2012-MIX Is Not a Tree-Adjoining Language
7 0.039373964 78 acl-2012-Efficient Search for Transformation-based Inference
8 0.03858982 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
9 0.037175391 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
10 0.034874689 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
11 0.031917877 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
12 0.030403718 184 acl-2012-String Re-writing Kernel
13 0.028890729 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery
14 0.028855663 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations
15 0.028827488 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition
16 0.02737119 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
17 0.027243813 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
18 0.025368644 154 acl-2012-Native Language Detection with Tree Substitution Grammars
19 0.024962163 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation
20 0.02369426 197 acl-2012-Tokenization: Returning to a Long Solved Problem A Survey, Contrastive Experiment, Recommendations, and Toolkit
topicId topicWeight
[(0, -0.077), (1, 0.009), (2, -0.026), (3, -0.007), (4, -0.04), (5, 0.031), (6, 0.006), (7, 0.061), (8, 0.029), (9, 0.005), (10, -0.049), (11, -0.04), (12, -0.053), (13, 0.038), (14, 0.024), (15, -0.073), (16, 0.014), (17, 0.016), (18, 0.029), (19, 0.053), (20, 0.048), (21, -0.034), (22, -0.057), (23, -0.003), (24, -0.09), (25, 0.013), (26, 0.12), (27, 0.027), (28, 0.028), (29, 0.078), (30, 0.086), (31, 0.069), (32, -0.107), (33, 0.018), (34, -0.068), (35, -0.004), (36, -0.073), (37, 0.075), (38, 0.14), (39, 0.034), (40, 0.079), (41, -0.11), (42, 0.046), (43, 0.064), (44, 0.066), (45, -0.059), (46, 0.104), (47, 0.041), (48, 0.067), (49, 0.126)]
simIndex simValue paperId paperTitle
same-paper 1 0.96732777 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
Author: Xingyuan Peng ; Dengfeng Ke ; Bo Xu
Abstract: Conventional Automated Essay Scoring (AES) measures may cause severe problems when directly applied in scoring Automatic Speech Recognition (ASR) transcription as they are error sensitive and unsuitable for the characteristic of ASR transcription. Therefore, we introduce a framework of Finite State Transducer (FST) to avoid the shortcomings. Compared with the Latent Semantic Analysis with Support Vector Regression (LSA-SVR) method (stands for the conventional measures), our FST method shows better performance especially towards the ASR transcription. In addition, we apply the synonyms similarity to expand the FST model. The final scoring performance reaches an acceptable level of 0.80 which is only 0.07 lower than the correlation (0.87) between human raters.
3 0.63650531 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater
Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.
Author: Khe Chai Sim
Abstract: This paper presents a probabilistic framework that combines multiple knowledge sources for Haptic Voice Recognition (HVR), a multimodal input method designed to provide efficient text entry on modern mobile devices. HVR extends the conventional voice input by allowing users to provide complementary partial lexical information via touch input to improve the efficiency and accuracy of voice recognition. This paper investigates the use of the initial letter of the words in the utterance as the partial lexical information. In addition to the acoustic and language models used in automatic speech recognition systems, HVR uses the haptic and partial lexical models as additional knowledge sources to reduce the recognition search space and suppress confusions. Experimental results show that both the word error rate and runtime factor can be re- duced by a factor of two using HVR.
5 0.44663754 185 acl-2012-Strong Lexicalization of Tree Adjoining Grammars
Author: Andreas Maletti ; Joost Engelfriet
Abstract: Recently, it was shown (KUHLMANN, SATTA: Tree-adjoining grammars are not closed under strong lexicalization. Comput. Linguist., 2012) that finitely ambiguous tree adjoining grammars cannot be transformed into a normal form (preserving the generated tree language), in which each production contains a lexical symbol. A more powerful model, the simple context-free tree grammar, admits such a normal form. It can be effectively constructed and the maximal rank of the nonterminals only increases by 1. Thus, simple context-free tree grammars strongly lexicalize tree adjoining grammars and themselves.
6 0.39573565 41 acl-2012-Bootstrapping a Unified Model of Lexical and Phonetic Acquisition
7 0.38940164 139 acl-2012-MIX Is Not a Tree-Adjoining Language
8 0.30914566 95 acl-2012-Fast Syntactic Analysis for Statistical Language Modeling via Substructure Sharing and Uptraining
9 0.27603605 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
10 0.27464971 57 acl-2012-Concept-to-text Generation via Discriminative Reranking
11 0.27100903 75 acl-2012-Discriminative Strategies to Integrate Multiword Expression Recognition and Parsing
12 0.26768702 215 acl-2012-WizIE: A Best Practices Guided Development Environment for Information Extraction
13 0.26126352 74 acl-2012-Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
14 0.2351101 108 acl-2012-Hierarchical Chunk-to-String Translation
15 0.23480743 107 acl-2012-Heuristic Cube Pruning in Linear Time
16 0.23450302 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
17 0.2206313 127 acl-2012-Large-Scale Syntactic Language Modeling with Treelets
18 0.20268422 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors
19 0.20215419 16 acl-2012-A Nonparametric Bayesian Approach to Acoustic Model Discovery
topicId topicWeight
[(25, 0.027), (26, 0.048), (28, 0.019), (30, 0.025), (32, 0.327), (37, 0.013), (39, 0.046), (43, 0.011), (60, 0.12), (74, 0.023), (82, 0.023), (84, 0.014), (85, 0.022), (86, 0.012), (90, 0.072), (92, 0.078), (99, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.76953065 196 acl-2012-The OpenGrm open-source finite-state grammar software libraries
2 0.62036091 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora
Author: Richard Eckart de Castilho ; Sabine Bartsch ; Iryna Gurevych
Abstract: We present CSNIPER (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotationby-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.
3 0.38771924 99 acl-2012-Finding Salient Dates for Building Thematic Timelines
Author: Remy Kessler ; Xavier Tannier ; Caroline Hagege ; Veronique Moriceau ; Andre Bittar
Abstract: We present an approach for detecting salient (important) dates in texts in order to automatically build event timelines from a search query (e.g. the name of an event or person, etc.). This work was carried out on a corpus of newswire texts in English provided by the Agence France Presse (AFP). In order to extract salient dates that warrant inclusion in an event timeline, we first recognize and normalize temporal expressions in texts and then use a machine-learning approach to extract salient dates that relate to a particular topic. We focused only on extracting the dates and not the events to which they are related.
4 0.35056785 174 acl-2012-Semantic Parsing with Bayesian Tree Transducers
Author: Bevan Jones ; Mark Johnson ; Sharon Goldwater
Abstract: Many semantic parsing models use tree transformations to map between natural language and meaning representation. However, while tree transformations are central to several state-of-the-art approaches, little use has been made of the rich literature on tree automata. This paper makes the connection concrete with a tree transducer based semantic parsing model and suggests that other models can be interpreted in a similar framework, increasing the generality of their contributions. In particular, this paper further introduces a variational Bayesian inference algorithm that is applicable to a wide class of tree transducers, producing state-of-the-art semantic parsing results while remaining applicable to any domain employing probabilistic tree transducers.
5 0.34224424 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules
Author: Wei He ; Hua Wu ; Haifeng Wang ; Ting Liu
Abstract: unkown-abstract
6 0.33623409 84 acl-2012-Estimating Compact Yet Rich Tree Insertion Grammars
7 0.33431441 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment
8 0.33409145 31 acl-2012-Authorship Attribution with Author-aware Topic Models
9 0.33130011 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition
10 0.33129632 205 acl-2012-Tweet Recommendation with Graph Co-Ranking
11 0.33041468 38 acl-2012-Bayesian Symbol-Refined Tree Substitution Grammars for Syntactic Parsing
12 0.32791853 154 acl-2012-Native Language Detection with Tree Substitution Grammars
13 0.32791418 167 acl-2012-QuickView: NLP-based Tweet Search
14 0.32351804 187 acl-2012-Subgroup Detection in Ideological Discussions
15 0.32289714 208 acl-2012-Unsupervised Relation Discovery with Sense Disambiguation
16 0.32156694 86 acl-2012-Exploiting Latent Information to Predict Diffusions of Novel Topics on Social Networks
17 0.31978503 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning
18 0.31817436 28 acl-2012-Aspect Extraction through Semi-Supervised Modeling
19 0.31780821 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle
20 0.31780422 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures