acl acl2010 acl2010-217 knowledge-graph by maker-knowledge-mining

217 acl-2010-String Extension Learning


Source: pdf

Author: Jeffrey Heinz

Abstract: This paper provides a unified, learningtheoretic analysis of several learnable classes of languages discussed previously in the literature. The analysis shows that for these classes an incremental, globally consistent, locally conservative, set-driven learner always exists. Additionally, the analysis provides a recipe for constructing new learnable classes. Potential applications include learnable models for aspects of natural language and cognition.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract This paper provides a unified, learningtheoretic analysis of several learnable classes of languages discussed previously in the literature. [sent-2, score-0.379]

2 The analysis shows that for these classes an incremental, globally consistent, locally conservative, set-driven learner always exists. [sent-3, score-0.446]

3 1 These learners are called String Extension Learners because each string in the language can be mapped (extended) to an element of the grammar, which in every case, is conceived as a finite set of elements. [sent-8, score-0.529]

4 These learners have desirable properties: they are incremental, globally consistent, and locally conservative. [sent-9, score-0.269]

5 Classes of languages learnable in the harder, distribution-free, positive-evidenceonly settings are due to structural properties of the language classes that permit generalization (Angluin, 1980b; Blumer et al. [sent-13, score-0.379]

6 (Strictly Local, SL) (McNaughton and Papert, 1971 ; Rogers and Pullum, to appear), the Piecewise Testable (PT) languages (Simon, 1975), the Piecewise Testable languages in the Strict Sense (Strictly Piecewise, SP) (Rogers et al. [sent-16, score-0.402]

7 , 2009), the Strongly Testable languages (Beauquier and Pin, 1991), the Definite languages (Brzozowski, 1962), and the Finite languages, among others. [sent-17, score-0.402]

8 It is expected string extension learning will have applications in linguistic and cognitive models. [sent-23, score-0.418]

9 As mentioned, the SP languages already provide a novel hypothesis of how long-distance dependencies in sound patterns are learned. [sent-24, score-0.201]

10 Another example is the Strictly Local (SL) languages which are the categorical, symbolic version of n-gram models, which are widely used in natural language processing (Jurafsky and Martin, 2008). [sent-25, score-0.201]

11 Since the SP languages also admit a probabilistic variant which describe an efficiently estimable class of distributions (Heinz and Rogers, 2010), it is plausible to expect the other classes will as well, though this is left for future research. [sent-26, score-0.372]

12 String extension learners are also simple, mak897 Proce dinUgsp osfa tlhae, 4S8wthed Aen n,u 1a1l-1 M6e Jeutilnyg 2 o0f1 t0h. [sent-27, score-0.252]

13 §3 d2e gfinoeess string extension grammars, languages, 3an dde lianen-s guage classes and proves some of their fundamental properties. [sent-32, score-0.515]

14 §4 defines string extension learners panrodp proves t §h4eir d e bfienheasvi sotrri. [sent-33, score-0.499]

15 §5 sxhteownssi hnow le aimrn-portant subregular cirla bseshesa are string heoxwtesns hioown l iamn-guage classes. [sent-34, score-0.35]

16 §6 gives examples of nonregular agunda ienfi cnliatses language vcelsass eexsa mwphliecsh are string extension learnable. [sent-35, score-0.462]

17 sA, set π oAf nonempty s (nubosteets f of S is a partition of S iff the elements of π (called blocks) are pairwise disjoint and their union equals S. [sent-43, score-0.216]

18 Let Σn, Σ≤n, Σ∗, Σ+ denote all strings formed over this alphabet of length n, of length less than or equal to n, of any finite length, and of any finite length strictly greater than zero, respectively. [sent-45, score-0.668]

19 The range of a string w is the set of symbols which are in w. [sent-47, score-0.247]

20 The empty string is the unique string of length zero denoted Thus range(λ) = ∅. [sent-48, score-0.551]

21 The length of a string u is dernoantedge by |u|, e. [sent-49, score-0.304]

22 A learner converges on a text t iff there exists i ∈ ? [sent-89, score-0.282]

23 A learner identifies a language L in the limit iff for any positive text t for L, converges on t to grammar G and L(G) = L. [sent-92, score-0.531]

24 Finally, a learner identifies a class of languages L φ φ φ φ nina tlyhe, ali lmeairt nieffr fφo ird any iLe ∈ cLla, φ oifde lanntigfiueasg eLs iLn tihne t hlime ilti. [sent-93, score-0.476]

25 m Angluin (1980b) provides necessary Lan idn sufficient properties of language classes which are identifiable in the limit from positive data. [sent-94, score-0.291]

26 A learner φ of language class L is globally consisAten lte airffn fror φ oefac lah ig uanagde f colra sasll L Lte ixst gsl ot faolrl some L ∈ L, content(t[i]) ⊆ L(φ(t[i])). [sent-95, score-0.284]

27 A learner φ is locally c coonnsteernvat(titv[ie] )if ⊆f f Lor( φea(tch[i i) )a. [sent-96, score-0.3]

28 A string extension function is a total function f : Σ∗ → Pfin(A). [sent-105, score-0.524]

29 Enasc whh string veextt ehnissi goenn efruanlct fioormn i sS naturally associated with some formal class of grammars and languages. [sent-108, score-0.386]

30 These functions, grammars, and languages are called string extension functions, gram- mars, and languages, respectively. [sent-109, score-0.619]

31 The class of languages obtained by all possible grammars is Lf = {Lf(G) : G ∈ Pfin(A)} The subscript f is omitted when it is understood from context. [sent-115, score-0.34]

32 Proof: Follows directly from the definition of ∼f aPnrdo otfhe: fFinoiltloewnesss d oiref string oemxte thnseio dne grammars. [sent-119, score-0.247]

33 2 String extension language classes are not in general closed under union or reversal (counterexamples to union closure are given in §5. [sent-127, score-0.319]

34 f(L) = [ f(w) (1) w[∈L An element g of grammar G for language L = Lf (G) is useful iff g ∈ f(L). [sent-131, score-0.256]

35 2 4 String Extension Learning Learning string extension classes is simple. [sent-150, score-0.515]

36 Eaarncehr i φndividual string in the text reveals, by extensio? [sent-155, score-0.247]

37 2 The key to the proof that φf identifies Lf in the limit from positive data is the fidineintteinfeiesss Lof G for all L(G) ∈ L. [sent-168, score-0.227]

38 ic Thh every ael iesm theantt tohfe trhee i grammar has been seen because (1) there are only finitely many useful elements of G, and (2) the learner is guaranteed to see a word in L which yields (via f) each element of G at some point (since the learner receives a positive text for L). [sent-170, score-0.555]

39 This cdaronp bpee sde iefn hwehe qnu aonliefi ecro n“sinide Lrs the identity function and the class of finite languages. [sent-172, score-0.321]

40 (The identity function is a string extension function, see §6. [sent-173, score-0.508]

41 899 the learner φ is guaranteed to have converged to the target G as no additional words will add any more elements to the learner’s grammar. [sent-177, score-0.204]

42 2 An immediate corollary is the efficiency of φf in the length of the sample, provided f is efficient in the length of the string (de la Higuera, 1997). [sent-192, score-0.361]

43 Corollary 1 φf is efficient in the length of the sample iff f is efficiently computable in the length of a string. [sent-193, score-0.235]

44 To summarize: string extension grammars are finite subsets of some set A. [sent-194, score-0.64]

45 The class of languages they generate are determined by a function f which maps strings to finite subsets of A (chunks of grammars). [sent-195, score-0.568]

46 Since the size of the canonical grammars is finite, a learner which develops a grammar on the basis of the observed words and the function f identifies this class exactly in the limit from positive data. [sent-196, score-0.638]

47 It also follows that if f is efficient in the length of the string then φf is efficient in the length of the sample and that φf is globally consistent, locally conservative, and setdriven. [sent-197, score-0.549]

48 5 Subregular examples This section shows how classes which make up the subregular hierarchies (McNaughton and Papert, 1971) are string extension language classes. [sent-199, score-0.618]

49 1 K-factor languages The k-factors of a word are ? [sent-205, score-0.201]

50 , let fack (w) = {x ∈ Σk : ∃u, v ∈ Σ∗ such that w = uxv} when k |w| and {w} otherwise Following the earlier definitions, for some k, a grammar G is a subset of Σ≤k and a word w belongs to the language of G iff fack (w) ⊆ G. [sent-210, score-0.493]

51 Since fack ∈ SEF, it follows immediately from the results ∈in §§3-4 tth faotl othwe k i-mfamcteodri languages are crelossueltds uinnd §e§r3 intersection, a-nfadc eoarc hla nhgausa a sch aarreacteristic sample. [sent-216, score-0.318]

52 Th theeor ceamno 5n itchaalt gthraem cmlasars oit-f k-factor languages is identifiable in the limit by φfack . [sent-222, score-0.34]

53 T,hbbe,nb L1 ∪ L2 exclu=des L string aba, bau,ta ibn,cblub}d)e. [sent-226, score-0.247]

54 Kno-tf pacostosirbs are u asneyd Lto ∈ d Lefine other language classes, such as the Strictly Local and Locally Testable languages (McNaughton and Papert, 1971), discussed in §5. [sent-228, score-0.201]

55 2 Strictly k-Piecewise languages The Strictly k-Piecewise (SPk) languages (Rogers et al. [sent-232, score-0.402]

56 As seen from Example 2, SP languages encode long-distance dependencies. [sent-261, score-0.201]

57 (IfP Tk) i isf fixed, tPhiee ke-wPiiseec Tewesistaeb Tlees ftoarb sloe languages are identifiable in the limit from positive data (Garc ´ıa and Ruiz, 1996; Garc ı´a and Ruiz, 2004). [sent-277, score-0.395]

58 More recently, the Piecewise Testable languages has been shown to be linearly separable with a subsequence kernel (Kontorovich et al. [sent-278, score-0.333]

59 The k-Piecewise Testable languages can also be described with the function SPk⋄. [sent-280, score-0.254]

60 te list of sets of subsequences up to length k that may occur in words in the language. [sent-283, score-0.205]

61 This reflects the fact that the k-Piecewise Testable languages are the boolean closure of the Strictly k-Piecewise languages. [sent-284, score-0.252]

62 4 Strictly k-Local languages To define the Strictly k-Local languages, it is necessary to make a pointwise extension to the definitions in §3. [sent-286, score-0.372]

63 The class of languages obtained by all such possible grammars G is Lf. [sent-306, score-0.34]

64 Then the (left-edge) preDfixe foinf length k, txhe k (right-edge) suffix of length k, and the interior k-factors of a word w are Lk (w) = {u ∈ Σk : ∃v ∈ Σ∗ such that w = uv} Rk(w) = {u ∈ Σk : ∃v ∈ Σ∗ such that w = vu} Ik (w) = fack (w)\(Lk (w) Rk (w)) Example 3 Suppose w = abcba. [sent-325, score-0.231]

65 This can now be expressed as the string extension function: LRIk(w) = (Lk(w), Rk(w), Ik(w)) Thus for some k, a grammar G is triple formed by taking subsets of Σk, Σk, and Σ≤k, respectively. [sent-333, score-0.509]

66 5 Locally k-Testable languages The Locally k-testable languages (k-LT) are orig- inally defined in McNaughton and Papert (1971) and are the subject of several studies (Brzozowski and Simon, 1973; McNaughton, 1974; Kim et al. [sent-339, score-0.402]

67 It is known that the k-LT languages are the boolean closure of the k-SL (McNaughton and Papert, 1971). [sent-344, score-0.252]

68 6 Generalized subsequence languages Here we introduce generalized subsequence functions, a general class of functions to which the SPk and fack functions belong. [sent-351, score-0.825]

69 Like those functions, generalized subsequence functions map words to a set of subsequences found within the words. [sent-352, score-0.385]

70 , unxnun+1 and |xi | = vi for all 0 i n} when k |w|, and{w} otherwise ≤ ≤ ≤ The following examples help make the generalized subsequence functions clear. [sent-377, score-0.237]

71 x pThleen 8 Lfh3,2,1i 3in,c2lu,d1eis a languages ,w eh,i fc∈h prohibit strings w which contain subsequences abcdef where abc and de must be contiguous in w and abcdef is a subsequence of w. [sent-389, score-0.602]

72 Generalized subsequence languages make different kinds of distinctions to be made than PT and LT languages. [sent-390, score-0.333]

73 Generalized subsequence languages properly include the k-SP and k-SL classes (Examples 6 and 7), and the boolean closure of the subsequence languages (f v~⋄) properly includes the LT and PT classes. [sent-392, score-0.814]

74 Since for any f v~ and f~ v⋄ are string extension functions the learning results in §4 apply. [sent-393, score-0.482]

75 h eNreo ke is the length ocofm mthpeu mtabaxleim ina tli subsequences determined by v. [sent-395, score-0.205]

76 There are also infinite language classes that are string extension language classes. [sent-415, score-0.569]

77 Arguably the simplest example is the class of finite languages, denoted Lfin. [sent-416, score-0.231]

78 5 A grammar G is then a finite subset of Σ∗, a}n. [sent-420, score-0.248]

79 It can be easily seen that the function id induces the trivial partition over Σ∗, and languages are just finite unions of these blocks. [sent-423, score-0.538]

80 There are other more interesting infinite string extension classes. [sent-425, score-0.472]

81 For all a ∈ Σ, let fa(w) b me athpe set containing n wFoherr ael n ais t∈he Σ number of times the letter a occurs in the string w. [sent-427, score-0.294]

82 For 5Strictly speaking, this is not the identity function per se, but it is as close to the identity function as one can get since string extension functions are defined as mappings from strings to sets. [sent-428, score-0.745]

83 Thus fa is a total function mapping strings t {o2 singleton sets of natural numbers, so it is a string extension function. [sent-431, score-0.607]

84 In this case, a grammar G is a finite subset of ? [sent-442, score-0.248]

85 Thus Lfa is an infinite class, which contains languages of infinite size, which is easily identified in the limit from positive data by φfa . [sent-449, score-0.427]

86 This section gave examples of nonregular and nonfinite string extension classes by pursuing the implications of Theorem 1, which established that f ∈ SEF partition Σ∗ into blocks of which languages are fpianritteit uonnio Σns thereof. [sent-450, score-0.872]

87 The string extension function f provides an effective way of encoding all languages L in Lf because f(L) enccooddeinsg a aflilnit lean set, tghees grammar. [sent-451, score-0.672]

88 7 Conclusion and open questions One contribution of this paper is a unified way of thinking about many formal language classes, all of which have been shown to be identifiable in the limit from positive data by a string extension learner. [sent-452, score-0.612]

89 Another contribution is a recipe for defining classes of languages identifiable in the limit from positive data by this kind of learner. [sent-453, score-0.542]

90 Additionally, the learner is guaranteed to be efficient in the size of the sample, provided the function f itself is efficient in the length of the string. [sent-456, score-0.271]

91 The string extension learner for this class is essentially two learners: φLRI3 and φP2, operating simultaneously. [sent-459, score-0.653]

92 Instead of defining the function f as a map from strings to finite subsets, let f be a function from strings to elements of a lattice. [sent-470, score-0.519]

93 A grammar G is an element of the lattice and the language of the G are all strings w such that f maps w to a grammar less than G. [sent-471, score-0.309]

94 8 Kasprzik and K ¨otzing (2010) develop this idea and demonstrate additional properties of string extension classes and learning, and show that the pattern languages (Angluin, 1980a) form a string extension class. [sent-473, score-1.134]

95 Finally, since the stochastic counterpart of kSL class is the n-gram model, it is plausible that probabilistic string extension language classes can form the basis of new natural language processing techniques. [sent-478, score-0.589]

96 is a finite set of strings rep=rehs eLnting, ⊇thie. [sent-513, score-0.24]

97 904 how to efficiently estimate k-SP distributions, and it is conjectured that the other string extension language classes can be recast as classes of distributions, which can also be successfully estimated from positive evidence. [sent-515, score-0.667]

98 Phonological features in infants phonotactic learning: Evidence from artificial grammar learning. [sent-592, score-0.208]

99 Learning k-testable and k-piecewise testable languages from positive data. [sent-619, score-0.535]

100 Learning indexed families of recursive languages from positive data: A survey. [sent-697, score-0.256]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lf', 0.284), ('testable', 0.279), ('string', 0.247), ('rogers', 0.205), ('languages', 0.201), ('extension', 0.171), ('learner', 0.161), ('finite', 0.157), ('subsequences', 0.148), ('heinz', 0.147), ('locally', 0.139), ('lrik', 0.132), ('mcnaughton', 0.132), ('subsequence', 0.132), ('iff', 0.121), ('papert', 0.117), ('piecewise', 0.117), ('fack', 0.117), ('pfin', 0.117), ('sef', 0.117), ('theorem', 0.115), ('spk', 0.103), ('subregular', 0.103), ('strictly', 0.1), ('classes', 0.097), ('grammar', 0.091), ('strings', 0.083), ('learners', 0.081), ('learnable', 0.081), ('identifiable', 0.076), ('class', 0.074), ('phonotactic', 0.073), ('pullum', 0.073), ('rk', 0.071), ('proof', 0.069), ('lk', 0.068), ('garc', 0.067), ('angluin', 0.067), ('llrik', 0.067), ('otzing', 0.067), ('whitney', 0.067), ('grammars', 0.065), ('functions', 0.064), ('limit', 0.063), ('blocks', 0.06), ('timo', 0.059), ('ruiz', 0.059), ('length', 0.057), ('jain', 0.057), ('positive', 0.055), ('infinite', 0.054), ('function', 0.053), ('fa', 0.053), ('ik', 0.053), ('partition', 0.052), ('jeffrey', 0.052), ('closure', 0.051), ('ba', 0.051), ('seq', 0.05), ('lange', 0.05), ('lfack', 0.05), ('moelius', 0.05), ('recipe', 0.05), ('globally', 0.049), ('sl', 0.049), ('ab', 0.048), ('theoretical', 0.047), ('let', 0.047), ('aa', 0.045), ('suppose', 0.044), ('element', 0.044), ('automata', 0.044), ('brzozowski', 0.044), ('delaware', 0.044), ('edlefsen', 0.044), ('garcia', 0.044), ('sanjay', 0.044), ('onishi', 0.044), ('infants', 0.044), ('learnability', 0.044), ('nonregular', 0.044), ('elements', 0.043), ('generalized', 0.041), ('strict', 0.041), ('sp', 0.04), ('dana', 0.04), ('identifies', 0.04), ('id', 0.039), ('contiguous', 0.038), ('identity', 0.037), ('conservative', 0.037), ('canonical', 0.036), ('induces', 0.036), ('steffen', 0.036), ('bb', 0.036), ('jos', 0.036), ('simon', 0.035), ('characteristic', 0.034), ('aughton', 0.033), ('blumer', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 217 acl-2010-String Extension Learning

Author: Jeffrey Heinz

Abstract: This paper provides a unified, learningtheoretic analysis of several learnable classes of languages discussed previously in the literature. The analysis shows that for these classes an incremental, globally consistent, locally conservative, set-driven learner always exists. Additionally, the analysis provides a recipe for constructing new learnable classes. Potential applications include learnable models for aspects of natural language and cognition.

2 0.38027453 103 acl-2010-Estimating Strictly Piecewise Distributions

Author: Jeffrey Heinz ; James Rogers

Abstract: Strictly Piecewise (SP) languages are a subclass of regular languages which encode certain kinds of long-distance dependencies that are found in natural languages. Like the classes in the Chomsky and Subregular hierarchies, there are many independently converging characterizations of the SP class (Rogers et al., to appear). Here we define SP distributions and show that they can be efficiently estimated from positive data.

3 0.078185037 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

Author: Shay Cohen ; Noah A Smith

Abstract: We consider the search for a maximum likelihood assignment of hidden derivations and grammar weights for a probabilistic context-free grammar, the problem approximately solved by “Viterbi training.” We show that solving and even approximating Viterbi training for PCFGs is NP-hard. We motivate the use of uniformat-random initialization for Viterbi EM as an optimal initializer in absence of further information about the correct model parameters, providing an approximate bound on the log-likelihood.

4 0.07602261 162 acl-2010-Learning Common Grammar from Multilingual Corpus

Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada

Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic contextfree grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

5 0.066922732 195 acl-2010-Phylogenetic Grammar Induction

Author: Taylor Berg-Kirkpatrick ; Dan Klein

Abstract: We present an approach to multilingual grammar induction that exploits a phylogeny-structured model of parameter drift. Our method does not require any translated texts or token-level alignments. Instead, the phylogenetic prior couples languages at a parameter level. Joint induction in the multilingual model substantially outperforms independent learning, with larger gains both from more articulated phylogenies and as well as from increasing numbers of languages. Across eight languages, the multilingual approach gives error reductions over the standard monolingual DMV averaging 21. 1% and reaching as high as 39%.

6 0.06509126 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

7 0.064104982 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

8 0.061710458 23 acl-2010-Accurate Context-Free Parsing with Combinatory Categorial Grammar

9 0.058830019 66 acl-2010-Compositional Matrix-Space Models of Language

10 0.058615178 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

11 0.05760666 228 acl-2010-The Importance of Rule Restrictions in CCG

12 0.055846136 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two

13 0.055416949 61 acl-2010-Combining Data and Mathematical Models of Language Change

14 0.055046033 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

15 0.05189877 53 acl-2010-Blocked Inference in Bayesian Tree Substitution Grammars

16 0.051636238 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

17 0.049736015 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

18 0.045864414 247 acl-2010-Unsupervised Event Coreference Resolution with Rich Linguistic Features

19 0.045682661 84 acl-2010-Detecting Errors in Automatically-Parsed Dependency Relations

20 0.04538881 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.141), (1, 0.008), (2, 0.02), (3, -0.036), (4, -0.03), (5, -0.078), (6, 0.103), (7, 0.004), (8, 0.115), (9, -0.057), (10, -0.043), (11, 0.043), (12, 0.044), (13, -0.042), (14, -0.051), (15, -0.033), (16, 0.008), (17, 0.03), (18, -0.002), (19, 0.117), (20, -0.125), (21, -0.135), (22, -0.211), (23, -0.24), (24, 0.157), (25, -0.006), (26, -0.188), (27, 0.169), (28, 0.143), (29, 0.152), (30, -0.126), (31, -0.03), (32, -0.089), (33, 0.166), (34, 0.173), (35, 0.071), (36, -0.039), (37, -0.036), (38, -0.021), (39, 0.041), (40, 0.2), (41, 0.033), (42, -0.182), (43, 0.041), (44, 0.081), (45, -0.04), (46, 0.072), (47, 0.005), (48, -0.024), (49, 0.169)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96843374 217 acl-2010-String Extension Learning

Author: Jeffrey Heinz

Abstract: This paper provides a unified, learningtheoretic analysis of several learnable classes of languages discussed previously in the literature. The analysis shows that for these classes an incremental, globally consistent, locally conservative, set-driven learner always exists. Additionally, the analysis provides a recipe for constructing new learnable classes. Potential applications include learnable models for aspects of natural language and cognition.

2 0.96151322 103 acl-2010-Estimating Strictly Piecewise Distributions

Author: Jeffrey Heinz ; James Rogers

Abstract: Strictly Piecewise (SP) languages are a subclass of regular languages which encode certain kinds of long-distance dependencies that are found in natural languages. Like the classes in the Chomsky and Subregular hierarchies, there are many independently converging characterizations of the SP class (Rogers et al., to appear). Here we define SP distributions and show that they can be efficiently estimated from positive data.

3 0.56387961 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

Author: Andres Osvaldo Porta

Abstract: The aim of this work is to present some preliminary results of an investigation in course on the typology of the morphology of the native South American languages from the point of view of the formal language theory. With this object, we give two contrasting examples of descriptions of two Aboriginal languages finite verb forms morphology: Argentinean Quechua (quichua santiague n˜o) and Toba. The description of the morphology of the finite verb forms of Argentinean quechua, uses finite automata and finite transducers. In this case the construction is straightforward using two level morphology and then, describes in a very natural way the Argentinean Quechua morphology using a regular language. On the contrary, the Toba verbs morphology, with a system that simultaneously uses prefixes and suffixes, has not a natural description as regular language. Toba has a complex system of causative suffixes, whose successive applications determinate the use of prefixes belonging different person marking prefix sets. We adopt the solution of Creider et al. (1995) to naturally deal with this and other similar morphological processes which involve interactions between prefixes and suffixes and then we describe the toba morphology using linear context-free languages.1 .

4 0.50272018 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two

Author: Benoit Sagot ; Giorgio Satta

Abstract: Linear Context-Free Rewriting Systems (LCFRSs) are a grammar formalism capable of modeling discontinuous phrases. Many parsing applications use LCFRSs where the fan-out (a measure of the discontinuity of phrases) does not exceed 2. We present an efficient algorithm for optimal reduction of the length of production right-hand side in LCFRSs with fan-out at most 2. This results in asymptotical running time improvement for known parsing algorithms for this class.

5 0.35417423 162 acl-2010-Learning Common Grammar from Multilingual Corpus

Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada

Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic contextfree grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

6 0.32413507 255 acl-2010-Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization

7 0.31634381 67 acl-2010-Computing Weakest Readings

8 0.30746302 40 acl-2010-Automatic Sanskrit Segmentizer Using Finite State Transducers

9 0.30440482 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web

10 0.30304402 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

11 0.29699135 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

12 0.29431853 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

13 0.28240409 61 acl-2010-Combining Data and Mathematical Models of Language Change

14 0.27870592 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

15 0.26690942 29 acl-2010-An Exact A* Method for Deciphering Letter-Substitution Ciphers

16 0.26301143 195 acl-2010-Phylogenetic Grammar Induction

17 0.26182598 66 acl-2010-Compositional Matrix-Space Models of Language

18 0.25407299 182 acl-2010-On the Computational Complexity of Dominance Links in Grammatical Formalisms

19 0.23388004 64 acl-2010-Complexity Assumptions in Ontology Verbalisation

20 0.23119889 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.011), (14, 0.014), (25, 0.036), (33, 0.439), (39, 0.014), (44, 0.011), (59, 0.056), (73, 0.026), (78, 0.039), (83, 0.06), (84, 0.113), (98, 0.073)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85445839 217 acl-2010-String Extension Learning

Author: Jeffrey Heinz

Abstract: This paper provides a unified, learningtheoretic analysis of several learnable classes of languages discussed previously in the literature. The analysis shows that for these classes an incremental, globally consistent, locally conservative, set-driven learner always exists. Additionally, the analysis provides a recipe for constructing new learnable classes. Potential applications include learnable models for aspects of natural language and cognition.

2 0.63573736 75 acl-2010-Correcting Errors in a Treebank Based on Synchronous Tree Substitution Grammar

Author: Yoshihide Kato ; Shigeki Matsubara

Abstract: This paper proposes a method of correcting annotation errors in a treebank. By using a synchronous grammar, the method transforms parse trees containing annotation errors into the ones whose errors are corrected. The synchronous grammar is automatically induced from the treebank. We report an experimental result of applying our method to the Penn Treebank. The result demonstrates that our method corrects syntactic annotation errors with high precision.

3 0.55641639 185 acl-2010-Open Information Extraction Using Wikipedia

Author: Fei Wu ; Daniel S. Weld

Abstract: Information-extraction (IE) systems seek to distill semantic relations from naturallanguage text, but most systems use supervised learning of relation-specific examples and are thus limited by the availability of training data. Open IE systems such as TextRunner, on the other hand, aim to handle the unbounded number of relations found on the Web. But how well can these open systems perform? This paper presents WOE, an open IE system which improves dramatically on TextRunner’s precision and recall. The key to WOE’s performance is a novel form of self-supervised learning for open extractors using heuris— tic matches between Wikipedia infobox attribute values and corresponding sentences to construct training data. Like TextRunner, WOE’s extractor eschews lexicalized features and handles an unbounded set of semantic relations. WOE can operate in two modes: when restricted to POS tag features, it runs as quickly as TextRunner, but when set to use dependency-parse features its precision and recall rise even higher.

4 0.54250592 14 acl-2010-A Risk Minimization Framework for Extractive Speech Summarization

Author: Shih-Hsiang Lin ; Berlin Chen

Abstract: In this paper, we formulate extractive summarization as a risk minimization problem and propose a unified probabilistic framework that naturally combines supervised and unsupervised summarization models to inherit their individual merits as well as to overcome their inherent limitations. In addition, the introduction of various loss functions also provides the summarization framework with a flexible but systematic way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively. Experiments on speech summarization show that the methods deduced from our framework are very competitive with existing summarization approaches. 1

5 0.50322717 103 acl-2010-Estimating Strictly Piecewise Distributions

Author: Jeffrey Heinz ; James Rogers

Abstract: Strictly Piecewise (SP) languages are a subclass of regular languages which encode certain kinds of long-distance dependencies that are found in natural languages. Like the classes in the Chomsky and Subregular hierarchies, there are many independently converging characterizations of the SP class (Rogers et al., to appear). Here we define SP distributions and show that they can be efficiently estimated from positive data.

6 0.35598147 136 acl-2010-How Many Words Is a Picture Worth? Automatic Caption Generation for News Images

7 0.34907401 216 acl-2010-Starting from Scratch in Semantic Role Labeling

8 0.34840181 220 acl-2010-Syntactic and Semantic Factors in Processing Difficulty: An Integrated Measure

9 0.34475547 53 acl-2010-Blocked Inference in Bayesian Tree Substitution Grammars

10 0.34462047 131 acl-2010-Hierarchical A* Parsing with Bridge Outside Scores

11 0.34338659 126 acl-2010-GernEdiT - The GermaNet Editing Tool

12 0.33821428 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

13 0.3378678 21 acl-2010-A Tree Transducer Model for Synchronous Tree-Adjoining Grammars

14 0.33358014 162 acl-2010-Learning Common Grammar from Multilingual Corpus

15 0.33269387 18 acl-2010-A Study of Information Retrieval Weighting Schemes for Sentiment Analysis

16 0.33060807 95 acl-2010-Efficient Inference through Cascades of Weighted Tree Transducers

17 0.3288728 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

18 0.32749921 186 acl-2010-Optimal Rank Reduction for Linear Context-Free Rewriting Systems with Fan-Out Two

19 0.32461202 109 acl-2010-Experiments in Graph-Based Semi-Supervised Learning Methods for Class-Instance Acquisition

20 0.32386574 67 acl-2010-Computing Weakest Readings