acl acl2010 acl2010-235 knowledge-graph by maker-knowledge-mining

235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web


Source: pdf

Author: Aarne Ranta ; Krasimir Angelov ; Thomas Hallgren

Abstract: This is a system demo for a set of tools for translating texts between multiple languages in real time with high quality. The translation works on restricted languages, and is based on semantic interlinguas. The underlying model is GF (Grammatical Framework), which is an open-source toolkit for multilingual grammar implementations. The demo will cover up to 20 parallel languages. Two related sets of tools are presented: grammarian’s tools helping to build translators for new domains and languages, and translator’s tools helping to translate documents. The grammarian’s tools are designed to make it easy to port the technique to new applications. The translator’s tools are essential in the restricted language context, enabling the author to remain in the fragments recognized by the system. The tools that are demonstrated will be ap- plied and developed further in the European project MOLTO (Multilingual On-Line Translation) which has started in March 2010 and runs for three years. 1 Translation Needs for the Web The best-known translation tools on the web are Google translate1 and Systran2. They are targeted to consumers of web documents: users who want to find out what a given document is about. For this purpose, browsing quality is sufficient, since the user has intelligence and good will, and understands that she uses the translation at her own risk. Since Google and Systran translations can be grammatically and semantically flawed, they don’t reach publication quality, and cannot hence be used by the producers of web documents. For instance, the provider of an e-commerce site cannot take the risk that the product descriptions or selling conditions have errors that change the original intentions. There are very few automatic translation systems actually in use for producers of information. As already 1www .google . com/t rans l e at 2www. systransoft . com noted by Bar-Hillel (1964), machine translation is one of those AI-complete tasks that involves a trade-off between coverage and precision, and the current mainstream systems opt for coverage. This is also what web users expect: they want to be able to throw just anything at the translation system and get something useful back. Precision-oriented approaches, the prime example of which is METEO (Chandioux 1977), have not been popular in recent years. However, from the producer’s point of view, large coverage is not essential: unlike the consumer’s tools, their input is predictable, and can be restricted to very specific domains, and to content that the producers themselves are creating in the first place. But even in such tasks, two severe problems remain: • • The development cost problem: a large amount oTfh ew dorekv eisl onpemedeendt f coors building tmra:n asl laatorgrse afomr new domains and new languages. The authoring problem: since the method does nTohte ew aourkth foorri nalgl input, etmhe: :asu tihnocer othfe eth me source toexest of translation may need special training to write in a way that can be translated at all. These two problems have probably been the main obstacles to making high-quality restricted language translation more wide-spread in tasks where it would otherwise be applicable. We address these problems by providing tools that help developers of translation systems on the one hand, and authors and translators—i.e. the users of the systems—on the other. In the MOLTO project (Multilingual On-Line Translation)3, we have the goal to improve both the development and use of restricted language translation by an order of magnitude, as compared with the state of the art. As for development costs, this means that a system for many languages and with adequate quality can be built in a matter of days rather than months. As for authoring, this means that content production does not require the use of manuals or involve trial and error, both of which can easily make the work ten times slower than normal writing. In the proposed system demo, we will show how some of the building blocks for MOLTO can already now be used in web-based translators, although on a 3 www.molto-project .eu 66 UppsalaP,r Sowceeeddenin,g 1s3 o Jfu tlhye 2 A0C1L0. 2 ?c 01200 S1y0s Atesmso Dcieamtioonns ftorart Cioonms,p puatagteiso 6n6a–l7 L1in,guistics Figure 1: A multilingual GF grammar with reversible mappings from a common abstract syntax to the 15 languages currently available in the GF Resource Grammar Library. smaller scale as regards languages and application domains. A running demo system is available at http : / / grammat i cal framework .org : 4 1 9 6. 2 2 Multilingual Grammars The translation tools are based on GF, Grammatical Framework4 (Ranta 2004). GF is a grammar formalism—that is, a mathematical model of natural language, equipped with a formal notation for writing grammars and a computer program implementing parsing and generation which are declaratively defined by grammars. Thus GF is comparable with formalism such as HPSG (Pollard and Sag 1994), LFG (Bresnan 1982) or TAG (Joshi 1985). The novel feature of GF is the notion of multilingual grammars, which describe several languages simultaneously by using a common representation called abstract syntax; see Figure 1. In a multilingual GF grammar, meaning-preserving translation is provided as a composition of parsing and generation via the abstract syntax, which works as an interlingua. This model of translation is different from approaches based on other comparable grammar formalisms, such as synchronous TAGs (Shieber and Schabes 1990), Pargram (Butt & al. 2002, based on LFG), LINGO Matrix (Bender and Flickinger 2005, based on HPSG), and CLE (Core Language Engine, Alshawi 1992). These approaches use transfer rules between individual languages, separate for each pair of languages. Being interlingua-based, GF translation scales up linearly to new languages without the quadratic blowup of transfer-based systems. In transfer-based sys- 4www.grammaticalframework.org tems, as many as n(n − 1) components (transfer functtieomnss), are naeneyde ads nto( cover a)l cl language pairs nisnf bero tfhu ndci-rections. In an interlingua-based system, 2n + 1components are enough: the interlingua itself, plus translations in both directions between each language and the interlingua. However, in GF, n + 1 components are sufficient, because the mappings from the abstract syntax to each language (the concrete syntaxes) are reversible, i.e. usable for both generation and parsing. Multilingual GF grammars can be seen as an implementation of Curry’s distinction between tectogrammatical and phenogrammatical structure (Curry 1961). In GF, the tectogrammatical structure is called abstract syntax, following standard computer science terminology. It is defined by using a logical framework (Harper & al. 1993), whose mathematical basis is in the type theory of Martin-L o¨f (1984). Two things can be noted about this architecture, both showing im- provements over state-of-the-art grammar-based translation methods. First, the translation interlingua (the abstract syntax) is a powerful logical formalism, able to express semantical structures such as context-dependencies and anaphora (Ranta 1994). In particular, dependent types make it more expressive than the type theory used in Montague grammar (Montague 1974) and employed in the Rosetta translation project (Rosetta 1998). Second, GF uses a framework for interlinguas, rather than one universal interlingua. This makes the interlingual approach more light-weight and feasible than in systems assuming one universal interlingua, such as Rosetta and UNL, Universal Networking Language5 . It also gives more precision to special-purpose translation: the interlingua of a GF translation system (i.e. the abstract syntax of a multilingual grammar) can encode precisely those structures and distinctions that are relevant for the task at hand. Thus an interlingua for mathematical proofs (Hallgren and Ranta 2000) is different from one for commands for operating an MP3 player (Perera and Ranta 2007). The expressive power of the logical framework is sufficient for both kinds of tasks. One important source of inspiration for GF was the WYSIWYM system (Power and Scott 1998), which used domain-specific interlinguas and produced excellent quality in multilingual generation. But the generation components were hard-coded in the program, instead of being defined declaratively as in GF, and they were not usable in the direction of parsing. 3 Grammars and Ontologies Parallel to the first development efforts of GF in the late 1990’s, another framework idea was emerging in web technology: XML, Extensible Mark-up Language, which unlike HTML is not a single mark-up language but a framework for creating custom mark-up lan5www .undl .org 67 guages. The analogy between GF and XML was seen from the beginning, and GF was designed as a formalism for multilingual rendering of semantic content (Dymetman and al. 2000). XML originated as a format for structuring documents and structured data serialization, but a couple ofits descendants, RDF(S) and OWL, developed its potential to formally express the semantics of data and content, serving as the fundaments of the emerging Semantic Web. Practically any meaning representation format can be converted into GF’s abstract syntax, which can then be mapped to different target languages. In particular the OWL language can be seen as a syntactic sugar for a subset of Martin-L o¨f’s type theory so it is trivial to embed it in GF’s abstract syntax. The translation problem defined in terms of an ontology is radically different from the problem of translating plain text from one language to another. Many of the projects in which GF has been used involve precisely this: a meaning representation formalized as GF abstract syntax. Some projects build on previously existing meaning representation and address mathematical proofs (Hallgren and Ranta 2000), software specifications (Beckert & al. 2007), and mathematical exercises (the European project WebALT6). Other projects start with semantic modelling work to build meaning representations from scratch, most notably ones for dialogue systems (Perera and Ranta 2007) in the European project TALK7. Yet another project, and one closest to web translation, is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages. To take an example, the OWL-to-GF mapping trans- lates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows: Clas s (pp : intege r . . . ) m catm integer Ob j e ctP roperty ( pp :div domain (pp : int ege r ) range ( pp :integer ) ) m funm div : int eger -> 4 int ege r -> prop Grammar Engineer’s Tools In the GF setting, building a multilingual translation system is equivalent to building a multilingual GF 6EDC-22253, webalt .math .he l inki . fi s 7IST-507802, 2004–2006, www .t alk-pro j e ct .org grammar, which in turn consists of two kinds of components: • a language-independent abstract syntax, giving tahe l snegmuaangtei-ci nmdeopdeenl dveinat tw ahbisctrha ctrtan ssylnattiaoxn, gisi performed; • for each language, a concrete syntax mapping abfstorrac eta syntax turaegese ,t oa strings ien s tyhnatta language. While abstract syntax construction is an extra task compared to many other kinds of translation methods, it is technically relatively simple, and its cost is moreover amortized as the system is extended to new languages. Concrete syntax construction can be much more demanding in terms of programming skills and linguistic knowledge, due to the complexity of natural languages. This task is where GF claims perhaps the highest advantage over other approaches to special-purpose grammars. The two main assets are: • • Programming language support: GF is a modern fPuroncgtriaomnaml programming language, w isith a a powerful type system and module system supporting modular and collaborative programming and reuse of code. RGL, the GF Resource Grammar Library, implementing Fthe R bsoausicrc linguistic dre Ltaiiblsr orfy l iamn-guages: inflectional morphology and syntactic combination functions. The RGL covers fifteen languages at the moment, shown in Figure 1; see also Khegai 2006, El Dada and Ranta 2007, Angelov 2008, Ranta 2009a,b, and Enache et al. 2010. To give an example of what the library provides, let us first consider the inflectional morphology. It is presented as a set of lexicon-building functions such as, in English, mkV : St r -> V i.e. function mkV, which takes a string (St r) as its argument and returns a verb (V) as its value. The verb is, internally, an inflection table containing all forms of a verb. The function mkV derives all these forms from its argument string, which is the infinitive form. It predicts all regular variations: (mkV

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Tools for Multilingual Grammar-Based Translation on the Web aarne Aarne Ranta and Krasimir Angelov and Thomas Hallgren Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg @ chalmers . [sent-1, score-0.211]

2 se Abstract This is a system demo for a set of tools for translating texts between multiple languages in real time with high quality. [sent-4, score-0.212]

3 The translation works on restricted languages, and is based on semantic interlinguas. [sent-5, score-0.138]

4 The underlying model is GF (Grammatical Framework), which is an open-source toolkit for multilingual grammar implementations. [sent-6, score-0.199]

5 Two related sets of tools are presented: grammarian’s tools helping to build translators for new domains and languages, and translator’s tools helping to translate documents. [sent-8, score-0.327]

6 The grammarian’s tools are designed to make it easy to port the technique to new applications. [sent-9, score-0.084]

7 The translator’s tools are essential in the restricted language context, enabling the author to remain in the fragments recognized by the system. [sent-10, score-0.13]

8 The tools that are demonstrated will be ap- plied and developed further in the European project MOLTO (Multilingual On-Line Translation) which has started in March 2010 and runs for three years. [sent-11, score-0.084]

9 1 Translation Needs for the Web The best-known translation tools on the web are Google translate1 and Systran2. [sent-12, score-0.226]

10 They are targeted to consumers of web documents: users who want to find out what a given document is about. [sent-13, score-0.081]

11 For this purpose, browsing quality is sufficient, since the user has intelligence and good will, and understands that she uses the translation at her own risk. [sent-14, score-0.092]

12 Since Google and Systran translations can be grammatically and semantically flawed, they don’t reach publication quality, and cannot hence be used by the producers of web documents. [sent-15, score-0.137]

13 There are very few automatic translation systems actually in use for producers of information. [sent-17, score-0.145]

14 com noted by Bar-Hillel (1964), machine translation is one of those AI-complete tasks that involves a trade-off between coverage and precision, and the current mainstream systems opt for coverage. [sent-22, score-0.092]

15 This is also what web users expect: they want to be able to throw just anything at the translation system and get something useful back. [sent-23, score-0.173]

16 However, from the producer’s point of view, large coverage is not essential: unlike the consumer’s tools, their input is predictable, and can be restricted to very specific domains, and to content that the producers themselves are creating in the first place. [sent-25, score-0.099]

17 The authoring problem: since the method does nTohte ew aourkth foorri nalgl input, etmhe: :asu tihnocer othfe eth me source toexest of translation may need special training to write in a way that can be translated at all. [sent-27, score-0.142]

18 These two problems have probably been the main obstacles to making high-quality restricted language translation more wide-spread in tasks where it would otherwise be applicable. [sent-28, score-0.138]

19 We address these problems by providing tools that help developers of translation systems on the one hand, and authors and translators—i. [sent-29, score-0.176]

20 In the MOLTO project (Multilingual On-Line Translation)3, we have the goal to improve both the development and use of restricted language translation by an order of magnitude, as compared with the state of the art. [sent-32, score-0.138]

21 As for development costs, this means that a system for many languages and with adequate quality can be built in a matter of days rather than months. [sent-33, score-0.066]

22 c 01200 S1y0s Atesmso Dcieamtioonns ftorart Cioonms,p puatagteiso 6n6a–l7 L1in,guistics Figure 1: A multilingual GF grammar with reversible mappings from a common abstract syntax to the 15 languages currently available in the GF Resource Grammar Library. [sent-39, score-0.401]

23 A running demo system is available at http : / / grammat i cal framework . [sent-41, score-0.154]

24 2 2 Multilingual Grammars The translation tools are based on GF, Grammatical Framework4 (Ranta 2004). [sent-43, score-0.176]

25 GF is a grammar formalism—that is, a mathematical model of natural language, equipped with a formal notation for writing grammars and a computer program implementing parsing and generation which are declaratively defined by grammars. [sent-44, score-0.265]

26 Thus GF is comparable with formalism such as HPSG (Pollard and Sag 1994), LFG (Bresnan 1982) or TAG (Joshi 1985). [sent-45, score-0.031]

27 The novel feature of GF is the notion of multilingual grammars, which describe several languages simultaneously by using a common representation called abstract syntax; see Figure 1. [sent-46, score-0.179]

28 In a multilingual GF grammar, meaning-preserving translation is provided as a composition of parsing and generation via the abstract syntax, which works as an interlingua. [sent-47, score-0.205]

29 This model of translation is different from approaches based on other comparable grammar formalisms, such as synchronous TAGs (Shieber and Schabes 1990), Pargram (Butt & al. [sent-48, score-0.178]

30 These approaches use transfer rules between individual languages, separate for each pair of languages. [sent-50, score-0.039]

31 Being interlingua-based, GF translation scales up linearly to new languages without the quadratic blowup of transfer-based systems. [sent-51, score-0.158]

32 In an interlingua-based system, 2n + 1components are enough: the interlingua itself, plus translations in both directions between each language and the interlingua. [sent-55, score-0.123]

33 However, in GF, n + 1 components are sufficient, because the mappings from the abstract syntax to each language (the concrete syntaxes) are reversible, i. [sent-56, score-0.141]

34 Multilingual GF grammars can be seen as an implementation of Curry’s distinction between tectogrammatical and phenogrammatical structure (Curry 1961). [sent-59, score-0.048]

35 1993), whose mathematical basis is in the type theory of Martin-L o¨f (1984). [sent-62, score-0.049]

36 Two things can be noted about this architecture, both showing im- provements over state-of-the-art grammar-based translation methods. [sent-63, score-0.092]

37 First, the translation interlingua (the abstract syntax) is a powerful logical formalism, able to express semantical structures such as context-dependencies and anaphora (Ranta 1994). [sent-64, score-0.215]

38 In particular, dependent types make it more expressive than the type theory used in Montague grammar (Montague 1974) and employed in the Rosetta translation project (Rosetta 1998). [sent-65, score-0.178]

39 It also gives more precision to special-purpose translation: the interlingua of a GF translation system (i. [sent-68, score-0.215]

40 the abstract syntax of a multilingual grammar) can encode precisely those structures and distinctions that are relevant for the task at hand. [sent-70, score-0.214]

41 Thus an interlingua for mathematical proofs (Hallgren and Ranta 2000) is different from one for commands for operating an MP3 player (Perera and Ranta 2007). [sent-71, score-0.172]

42 One important source of inspiration for GF was the WYSIWYM system (Power and Scott 1998), which used domain-specific interlinguas and produced excellent quality in multilingual generation. [sent-73, score-0.148]

43 But the generation components were hard-coded in the program, instead of being defined declaratively as in GF, and they were not usable in the direction of parsing. [sent-74, score-0.035]

44 3 Grammars and Ontologies Parallel to the first development efforts of GF in the late 1990’s, another framework idea was emerging in web technology: XML, Extensible Mark-up Language, which unlike HTML is not a single mark-up language but a framework for creating custom mark-up lan5www . [sent-75, score-0.05]

45 The analogy between GF and XML was seen from the beginning, and GF was designed as a formalism for multilingual rendering of semantic content (Dymetman and al. [sent-78, score-0.144]

46 The translation problem defined in terms of an ontology is radically different from the problem of translating plain text from one language to another. [sent-83, score-0.092]

47 Some projects build on previously existing meaning representation and address mathematical proofs (Hallgren and Ranta 2000), software specifications (Beckert & al. [sent-85, score-0.049]

48 Yet another project, and one closest to web translation, is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). [sent-88, score-0.163]

49 In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). [sent-89, score-0.097]

50 Any change made in any of the languages gets automatically translated to the other languages. [sent-90, score-0.066]

51 As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). [sent-92, score-0.14]

52 While abstract syntax construction is an extra task compared to many other kinds of translation methods, it is technically relatively simple, and its cost is moreover amortized as the system is extended to new languages. [sent-102, score-0.193]

53 Concrete syntax construction can be much more demanding in terms of programming skills and linguistic knowledge, due to the complexity of natural languages. [sent-103, score-0.133]

54 The two main assets are: • • Programming language support: GF is a modern fPuroncgtriaomnaml programming language, w isith a a powerful type system and module system supporting modular and collaborative programming and reuse of code. [sent-105, score-0.064]

55 RGL, the GF Resource Grammar Library, implementing Fthe R bsoausicrc linguistic dre Ltaiiblsr orfy l iamn-guages: inflectional morphology and syntactic combination functions. [sent-106, score-0.047]

56 The RGL covers fifteen languages at the moment, shown in Figure 1; see also Khegai 2006, El Dada and Ranta 2007, Angelov 2008, Ranta 2009a,b, and Enache et al. [sent-107, score-0.066]

57 To give an example of what the library provides, let us first consider the inflectional morphology. [sent-109, score-0.047]

58 The verb is, internally, an inflection table containing all forms of a verb. [sent-113, score-0.04]

59 For irregular English verbs, RGL gives a three-argument function taking forms such as sing,sang,sung, but it also has a fairly complete lexicon of irregular verbs, so that the normal application programmer who builds a lexicon only needs the regular mkV function. [sent-116, score-0.07]

60 Extending a lexicon with domain-specific vocabulary is typically the main part of the work of a con- crete syntax author. [sent-117, score-0.101]

61 Considerable work has been put into RGL’s inflection functions to make them as “intelligent” as possible and thereby ease the work of the 68 users of the library, who don’t know the linguistic details of morphology. [sent-118, score-0.071]

62 For instance, even Finnish, whose verbs have hundreds of forms and are conjugated in accordance with around 50 conjugations, has a oneargument function mkV that yields the correct inflection table for 90% of Finnish verbs. [sent-119, score-0.04]

63 It returns a sentence as value: pred : A2 -> NP -> NP -> S This function is available in all languages of RGL, even though the details of sentence formation are vastly different in them. [sent-122, score-0.109]

64 The syntactic combinations of the RGL have their own abstract syntax, but this abstract syntax is not the interlingua of translation: it is only used as a library for implementing the semantic interlingua, which is based on an ontology and abstracts away from syntactic structure. [sent-124, score-0.318]

65 Thus the translation equivalents in a multilingual grammar need not use the same syntactic combinations in different languages. [sent-125, score-0.291]

66 Assume, for the sake of argument, that x is divisible by y is expressed in Swedish by the transitive verb construction y delar x (literally, ”y divides x”). [sent-126, score-0.07]

67 This can be expressed easily by using the transitive verb predication function of the RGL and switching the subject and object, div x y = pred (mkV2 " de l " ) y x a Thus, even though GF translation is interlingua-based, there is a component of transfer between English and Swedish. [sent-127, score-0.28]

68 In general, the use ofthe large-coverage RGL as a library for restricted grammars is called grammar specialization. [sent-129, score-0.227]

69 The way GF performs grammar specialization is based on techniques for optimizing functional programming languages, in particular partial evaluation (Ranta 2004, 2007). [sent-130, score-0.118]

70 GF also gives a possibility to run-time transfer via semantic actions on abstract syntax trees, but this option has rarely been needed in previous applications, which helps to keep translation systems simple and efficient. [sent-131, score-0.232]

71 Figure 2: French word prediction in GF parser, suggesting feminine adjectives that agree with the subject la femme. [sent-132, score-0.063]

72 As shown in Figure 1, the RGL is currently available for 15 languages, of which 12 are official languages of the European Union. [sent-133, score-0.066]

73 A similar number of new languages are under construction in this collaborative open-source project. [sent-134, score-0.066]

74 The real challenge is to help the author to keep inside the restricted language. [sent-137, score-0.046]

75 Thus it does not suggest all existing words, but only those words that are grammatically correct in the context. [sent-141, score-0.034]

76 The author has started a sentence as la femme qui remplit le formulaire est co (”the woman who fills the form is co”), and a menu shows a list of words beginning with co that are given in the French grammar and possible in the context at hand; all these words are adjectives in the feminine form. [sent-143, score-0.517]

77 Predictive parsing is a good way to help users produce translatable content in the first place. [sent-145, score-0.031]

78 For instance, if the word femme (”woman”) in the previous example is changed to homme, the preceding article la has to be changed to l’, and the adjective has to be changed to the masculine form: thus connue (”known”) would become connu, and so on. [sent-154, score-0.213]

79 This is where another utility of the abstract syntax comes in: in the abstract syntax tree, all that is changed is the noun, and the regenerated concrete syntax string automatically obeys all the agreement rules. [sent-156, score-0.374]

80 This functionality is implemented in the GF syntax editor (Khegai & al. [sent-159, score-0.101]

81 Restricted languages in the sense of GF are close to controlled languages, such as Attempto (Fuchs & al. [sent-161, score-0.115]

82 2008); the examples shown in this section are actually taken from a GF implementation that generalizes Attempto Controlled English to five languages (Angelov and Ranta 2009). [sent-162, score-0.066]

83 However, unlike typical controlled languages, GF does not require the absence of ambiguity. [sent-163, score-0.049]

84 In fact, when a controlled language is generalized to new languages, lexical ambiguities in particular are hard to avoid. [sent-164, score-0.049]

85 The translation tool snapshot in Figure 2 is from an actual web-based prototype. [sent-168, score-0.092]

86 The translation is performed using GF in a server, which is called via HTTP. [sent-171, score-0.092]

87 Also client-side translators, with similar user interfaces, can be built by converting the whole GF grammar to JavaScript (Meza Moreno and Bringert 2008). [sent-172, score-0.086]

88 All the 2 demonstrated tools are available as open-source software from http : / / grammat i cal framework . [sent-175, score-0.176]

89 ME´TE´O: un syst e`me op´ erationnel pour la traduction automatique des bulletins m ´et ´ereologiques destin´ es au grand public. [sent-254, score-0.032]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('gf', 0.573), ('ranta', 0.44), ('angelov', 0.176), ('rgl', 0.176), ('aarne', 0.141), ('url', 0.128), ('interlingua', 0.123), ('mkv', 0.123), ('multilingual', 0.113), ('cle', 0.113), ('div', 0.106), ('syntax', 0.101), ('translation', 0.092), ('bringert', 0.088), ('chalme', 0.088), ('grammar', 0.086), ('tools', 0.084), ('translators', 0.075), ('translator', 0.072), ('chalmers', 0.07), ('divisible', 0.07), ('durch', 0.07), ('grammarian', 0.07), ('hallgren', 0.07), ('khegai', 0.07), ('molto', 0.07), ('owl', 0.066), ('languages', 0.066), ('demo', 0.062), ('rosetta', 0.062), ('femme', 0.053), ('formulaire', 0.053), ('grammat', 0.053), ('meza', 0.053), ('moreno', 0.053), ('nordstr', 0.053), ('perera', 0.053), ('producers', 0.053), ('remplit', 0.053), ('teilbar', 0.053), ('authoring', 0.05), ('web', 0.05), ('controlled', 0.049), ('mathematical', 0.049), ('grammars', 0.048), ('implementing', 0.047), ('library', 0.047), ('rs', 0.046), ('attempto', 0.046), ('qui', 0.046), ('woman', 0.046), ('restricted', 0.046), ('art', 0.045), ('xml', 0.043), ('pred', 0.043), ('curry', 0.042), ('concrete', 0.04), ('inflection', 0.04), ('montague', 0.04), ('transfer', 0.039), ('http', 0.039), ('lncs', 0.038), ('beckert', 0.035), ('chandioux', 0.035), ('connu', 0.035), ('connue', 0.035), ('dada', 0.035), ('declaratively', 0.035), ('ege', 0.035), ('enache', 0.035), ('harper', 0.035), ('homme', 0.035), ('ink', 0.035), ('interlinguas', 0.035), ('irregular', 0.035), ('plotkin', 0.035), ('reversible', 0.035), ('springerl', 0.035), ('grammatically', 0.034), ('programming', 0.032), ('ist', 0.032), ('la', 0.032), ('changed', 0.031), ('int', 0.031), ('formalism', 0.031), ('compl', 0.031), ('fuchs', 0.031), ('feminine', 0.031), ('fil', 0.031), ('javascript', 0.031), ('gotal', 0.031), ('gz', 0.031), ('gelbukh', 0.031), ('users', 0.031), ('resource', 0.03), ('est', 0.03), ('co', 0.029), ('fills', 0.029), ('bender', 0.028), ('bresnan', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web

Author: Aarne Ranta ; Krasimir Angelov ; Thomas Hallgren

Abstract: This is a system demo for a set of tools for translating texts between multiple languages in real time with high quality. The translation works on restricted languages, and is based on semantic interlinguas. The underlying model is GF (Grammatical Framework), which is an open-source toolkit for multilingual grammar implementations. The demo will cover up to 20 parallel languages. Two related sets of tools are presented: grammarian’s tools helping to build translators for new domains and languages, and translator’s tools helping to translate documents. The grammarian’s tools are designed to make it easy to port the technique to new applications. The translator’s tools are essential in the restricted language context, enabling the author to remain in the fragments recognized by the system. The tools that are demonstrated will be ap- plied and developed further in the European project MOLTO (Multilingual On-Line Translation) which has started in March 2010 and runs for three years. 1 Translation Needs for the Web The best-known translation tools on the web are Google translate1 and Systran2. They are targeted to consumers of web documents: users who want to find out what a given document is about. For this purpose, browsing quality is sufficient, since the user has intelligence and good will, and understands that she uses the translation at her own risk. Since Google and Systran translations can be grammatically and semantically flawed, they don’t reach publication quality, and cannot hence be used by the producers of web documents. For instance, the provider of an e-commerce site cannot take the risk that the product descriptions or selling conditions have errors that change the original intentions. There are very few automatic translation systems actually in use for producers of information. As already 1www .google . com/t rans l e at 2www. systransoft . com noted by Bar-Hillel (1964), machine translation is one of those AI-complete tasks that involves a trade-off between coverage and precision, and the current mainstream systems opt for coverage. This is also what web users expect: they want to be able to throw just anything at the translation system and get something useful back. Precision-oriented approaches, the prime example of which is METEO (Chandioux 1977), have not been popular in recent years. However, from the producer’s point of view, large coverage is not essential: unlike the consumer’s tools, their input is predictable, and can be restricted to very specific domains, and to content that the producers themselves are creating in the first place. But even in such tasks, two severe problems remain: • • The development cost problem: a large amount oTfh ew dorekv eisl onpemedeendt f coors building tmra:n asl laatorgrse afomr new domains and new languages. The authoring problem: since the method does nTohte ew aourkth foorri nalgl input, etmhe: :asu tihnocer othfe eth me source toexest of translation may need special training to write in a way that can be translated at all. These two problems have probably been the main obstacles to making high-quality restricted language translation more wide-spread in tasks where it would otherwise be applicable. We address these problems by providing tools that help developers of translation systems on the one hand, and authors and translators—i.e. the users of the systems—on the other. In the MOLTO project (Multilingual On-Line Translation)3, we have the goal to improve both the development and use of restricted language translation by an order of magnitude, as compared with the state of the art. As for development costs, this means that a system for many languages and with adequate quality can be built in a matter of days rather than months. As for authoring, this means that content production does not require the use of manuals or involve trial and error, both of which can easily make the work ten times slower than normal writing. In the proposed system demo, we will show how some of the building blocks for MOLTO can already now be used in web-based translators, although on a 3 www.molto-project .eu 66 UppsalaP,r Sowceeeddenin,g 1s3 o Jfu tlhye 2 A0C1L0. 2 ?c 01200 S1y0s Atesmso Dcieamtioonns ftorart Cioonms,p puatagteiso 6n6a–l7 L1in,guistics Figure 1: A multilingual GF grammar with reversible mappings from a common abstract syntax to the 15 languages currently available in the GF Resource Grammar Library. smaller scale as regards languages and application domains. A running demo system is available at http : / / grammat i cal framework .org : 4 1 9 6. 2 2 Multilingual Grammars The translation tools are based on GF, Grammatical Framework4 (Ranta 2004). GF is a grammar formalism—that is, a mathematical model of natural language, equipped with a formal notation for writing grammars and a computer program implementing parsing and generation which are declaratively defined by grammars. Thus GF is comparable with formalism such as HPSG (Pollard and Sag 1994), LFG (Bresnan 1982) or TAG (Joshi 1985). The novel feature of GF is the notion of multilingual grammars, which describe several languages simultaneously by using a common representation called abstract syntax; see Figure 1. In a multilingual GF grammar, meaning-preserving translation is provided as a composition of parsing and generation via the abstract syntax, which works as an interlingua. This model of translation is different from approaches based on other comparable grammar formalisms, such as synchronous TAGs (Shieber and Schabes 1990), Pargram (Butt & al. 2002, based on LFG), LINGO Matrix (Bender and Flickinger 2005, based on HPSG), and CLE (Core Language Engine, Alshawi 1992). These approaches use transfer rules between individual languages, separate for each pair of languages. Being interlingua-based, GF translation scales up linearly to new languages without the quadratic blowup of transfer-based systems. In transfer-based sys- 4www.grammaticalframework.org tems, as many as n(n − 1) components (transfer functtieomnss), are naeneyde ads nto( cover a)l cl language pairs nisnf bero tfhu ndci-rections. In an interlingua-based system, 2n + 1components are enough: the interlingua itself, plus translations in both directions between each language and the interlingua. However, in GF, n + 1 components are sufficient, because the mappings from the abstract syntax to each language (the concrete syntaxes) are reversible, i.e. usable for both generation and parsing. Multilingual GF grammars can be seen as an implementation of Curry’s distinction between tectogrammatical and phenogrammatical structure (Curry 1961). In GF, the tectogrammatical structure is called abstract syntax, following standard computer science terminology. It is defined by using a logical framework (Harper & al. 1993), whose mathematical basis is in the type theory of Martin-L o¨f (1984). Two things can be noted about this architecture, both showing im- provements over state-of-the-art grammar-based translation methods. First, the translation interlingua (the abstract syntax) is a powerful logical formalism, able to express semantical structures such as context-dependencies and anaphora (Ranta 1994). In particular, dependent types make it more expressive than the type theory used in Montague grammar (Montague 1974) and employed in the Rosetta translation project (Rosetta 1998). Second, GF uses a framework for interlinguas, rather than one universal interlingua. This makes the interlingual approach more light-weight and feasible than in systems assuming one universal interlingua, such as Rosetta and UNL, Universal Networking Language5 . It also gives more precision to special-purpose translation: the interlingua of a GF translation system (i.e. the abstract syntax of a multilingual grammar) can encode precisely those structures and distinctions that are relevant for the task at hand. Thus an interlingua for mathematical proofs (Hallgren and Ranta 2000) is different from one for commands for operating an MP3 player (Perera and Ranta 2007). The expressive power of the logical framework is sufficient for both kinds of tasks. One important source of inspiration for GF was the WYSIWYM system (Power and Scott 1998), which used domain-specific interlinguas and produced excellent quality in multilingual generation. But the generation components were hard-coded in the program, instead of being defined declaratively as in GF, and they were not usable in the direction of parsing. 3 Grammars and Ontologies Parallel to the first development efforts of GF in the late 1990’s, another framework idea was emerging in web technology: XML, Extensible Mark-up Language, which unlike HTML is not a single mark-up language but a framework for creating custom mark-up lan5www .undl .org 67 guages. The analogy between GF and XML was seen from the beginning, and GF was designed as a formalism for multilingual rendering of semantic content (Dymetman and al. 2000). XML originated as a format for structuring documents and structured data serialization, but a couple ofits descendants, RDF(S) and OWL, developed its potential to formally express the semantics of data and content, serving as the fundaments of the emerging Semantic Web. Practically any meaning representation format can be converted into GF’s abstract syntax, which can then be mapped to different target languages. In particular the OWL language can be seen as a syntactic sugar for a subset of Martin-L o¨f’s type theory so it is trivial to embed it in GF’s abstract syntax. The translation problem defined in terms of an ontology is radically different from the problem of translating plain text from one language to another. Many of the projects in which GF has been used involve precisely this: a meaning representation formalized as GF abstract syntax. Some projects build on previously existing meaning representation and address mathematical proofs (Hallgren and Ranta 2000), software specifications (Beckert & al. 2007), and mathematical exercises (the European project WebALT6). Other projects start with semantic modelling work to build meaning representations from scratch, most notably ones for dialogue systems (Perera and Ranta 2007) in the European project TALK7. Yet another project, and one closest to web translation, is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages. To take an example, the OWL-to-GF mapping trans- lates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows: Clas s (pp : intege r . . . ) m catm integer Ob j e ctP roperty ( pp :div domain (pp : int ege r ) range ( pp :integer ) ) m funm div : int eger -> 4 int ege r -> prop Grammar Engineer’s Tools In the GF setting, building a multilingual translation system is equivalent to building a multilingual GF 6EDC-22253, webalt .math .he l inki . fi s 7IST-507802, 2004–2006, www .t alk-pro j e ct .org grammar, which in turn consists of two kinds of components: • a language-independent abstract syntax, giving tahe l snegmuaangtei-ci nmdeopdeenl dveinat tw ahbisctrha ctrtan ssylnattiaoxn, gisi performed; • for each language, a concrete syntax mapping abfstorrac eta syntax turaegese ,t oa strings ien s tyhnatta language. While abstract syntax construction is an extra task compared to many other kinds of translation methods, it is technically relatively simple, and its cost is moreover amortized as the system is extended to new languages. Concrete syntax construction can be much more demanding in terms of programming skills and linguistic knowledge, due to the complexity of natural languages. This task is where GF claims perhaps the highest advantage over other approaches to special-purpose grammars. The two main assets are: • • Programming language support: GF is a modern fPuroncgtriaomnaml programming language, w isith a a powerful type system and module system supporting modular and collaborative programming and reuse of code. RGL, the GF Resource Grammar Library, implementing Fthe R bsoausicrc linguistic dre Ltaiiblsr orfy l iamn-guages: inflectional morphology and syntactic combination functions. The RGL covers fifteen languages at the moment, shown in Figure 1; see also Khegai 2006, El Dada and Ranta 2007, Angelov 2008, Ranta 2009a,b, and Enache et al. 2010. To give an example of what the library provides, let us first consider the inflectional morphology. It is presented as a set of lexicon-building functions such as, in English, mkV : St r -> V i.e. function mkV, which takes a string (St r) as its argument and returns a verb (V) as its value. The verb is, internally, an inflection table containing all forms of a verb. The function mkV derives all these forms from its argument string, which is the infinitive form. It predicts all regular variations: (mkV

2 0.083161913 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

Author: Emily M. Bender ; Scott Drellishak ; Antske Fokkens ; Michael Wayne Goodman ; Daniel P. Mills ; Laurie Poulson ; Safiyyah Saleem

Abstract: This demonstration presents the LinGO Grammar Matrix grammar customization system: a repository of distilled linguistic knowledge and a web-based service which elicits a typological description of a language from the user and yields a customized grammar fragment ready for sustained development into a broad-coverage grammar. We describe the implementation of this repository with an emphasis on how the information is made available to users, including in-browser testing capabilities.

3 0.082436264 162 acl-2010-Learning Common Grammar from Multilingual Corpus

Author: Tomoharu Iwata ; Daichi Mochihashi ; Hiroshi Sawada

Abstract: We propose a corpus-based probabilistic framework to extract hidden common syntax across languages from non-parallel multilingual corpora in an unsupervised fashion. For this purpose, we assume a generative model for multilingual corpora, where each sentence is generated from a language dependent probabilistic contextfree grammar (PCFG), and these PCFGs are generated from a prior grammar that is common across languages. We also develop a variational method for efficient inference. Experiments on a non-parallel multilingual corpus of eleven languages demonstrate the feasibility of the proposed method.

4 0.074824139 166 acl-2010-Learning Word-Class Lattices for Definition and Hypernym Extraction

Author: Roberto Navigli ; Paola Velardi

Abstract: Definition extraction is the task of automatically identifying definitional sentences within texts. The task has proven useful in many research areas including ontology learning, relation extraction and question answering. However, current approaches mostly focused on lexicosyntactic patterns suffer from both low recall and precision, as definitional sentences occur in highly variable syntactic structures. In this paper, we propose WordClass Lattices (WCLs), a generalization of word lattices that we use to model textual definitions. Lattices are learned from a dataset of definitions from Wikipedia. Our method is applied to the task of definition and hypernym extraction and compares favorably to other pattern general– – ization methods proposed in the literature.

5 0.070757784 64 acl-2010-Complexity Assumptions in Ontology Verbalisation

Author: Richard Power

Abstract: We describe the strategy currently pursued for verbalising OWL ontologies by sentences in Controlled Natural Language (i.e., combining generic rules for realising logical patterns with ontology-specific lexicons for realising atomic terms for individuals, classes, and properties) and argue that its success depends on assumptions about the complexity of terms and axioms in the ontology. We then show, through analysis of a corpus of ontologies, that although these assumptions could in principle be violated, they are overwhelmingly respected in practice by ontology developers.

6 0.065455742 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

7 0.059472553 169 acl-2010-Learning to Translate with Source and Target Syntax

8 0.05485693 243 acl-2010-Tree-Based and Forest-Based Translation

9 0.05115293 260 acl-2010-Wide-Coverage NLP with Linguistically Expressive Grammars

10 0.050893039 195 acl-2010-Phylogenetic Grammar Induction

11 0.050648279 77 acl-2010-Cross-Language Document Summarization Based on Machine Translation Quality Prediction

12 0.049080789 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

13 0.046121467 130 acl-2010-Hard Constraints for Grammatical Function Labelling

14 0.045303557 118 acl-2010-Fine-Grained Tree-to-String Translation Rule Extraction

15 0.044915468 217 acl-2010-String Extension Learning

16 0.044699833 265 acl-2010-cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models

17 0.044449735 203 acl-2010-Rebanking CCGbank for Improved NP Interpretation

18 0.043712795 200 acl-2010-Profiting from Mark-Up: Hyper-Text Annotations for Guided Parsing

19 0.042502992 119 acl-2010-Fixed Length Word Suffix for Factored Statistical Machine Translation

20 0.042303048 44 acl-2010-BabelNet: Building a Very Large Multilingual Semantic Network


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.129), (1, -0.016), (2, -0.003), (3, -0.021), (4, -0.012), (5, -0.061), (6, 0.054), (7, 0.012), (8, -0.004), (9, -0.021), (10, 0.071), (11, 0.032), (12, 0.018), (13, -0.02), (14, -0.031), (15, 0.001), (16, 0.02), (17, 0.003), (18, 0.033), (19, 0.038), (20, -0.076), (21, -0.1), (22, -0.011), (23, -0.093), (24, 0.021), (25, 0.001), (26, -0.021), (27, 0.102), (28, -0.022), (29, -0.013), (30, -0.059), (31, -0.08), (32, 0.047), (33, 0.028), (34, 0.073), (35, -0.043), (36, -0.052), (37, -0.108), (38, 0.015), (39, -0.012), (40, -0.124), (41, 0.016), (42, 0.148), (43, 0.012), (44, 0.026), (45, 0.079), (46, -0.097), (47, -0.01), (48, 0.06), (49, -0.005)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92048484 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web

Author: Aarne Ranta ; Krasimir Angelov ; Thomas Hallgren

Abstract: This is a system demo for a set of tools for translating texts between multiple languages in real time with high quality. The translation works on restricted languages, and is based on semantic interlinguas. The underlying model is GF (Grammatical Framework), which is an open-source toolkit for multilingual grammar implementations. The demo will cover up to 20 parallel languages. Two related sets of tools are presented: grammarian’s tools helping to build translators for new domains and languages, and translator’s tools helping to translate documents. The grammarian’s tools are designed to make it easy to port the technique to new applications. The translator’s tools are essential in the restricted language context, enabling the author to remain in the fragments recognized by the system. The tools that are demonstrated will be ap- plied and developed further in the European project MOLTO (Multilingual On-Line Translation) which has started in March 2010 and runs for three years. 1 Translation Needs for the Web The best-known translation tools on the web are Google translate1 and Systran2. They are targeted to consumers of web documents: users who want to find out what a given document is about. For this purpose, browsing quality is sufficient, since the user has intelligence and good will, and understands that she uses the translation at her own risk. Since Google and Systran translations can be grammatically and semantically flawed, they don’t reach publication quality, and cannot hence be used by the producers of web documents. For instance, the provider of an e-commerce site cannot take the risk that the product descriptions or selling conditions have errors that change the original intentions. There are very few automatic translation systems actually in use for producers of information. As already 1www .google . com/t rans l e at 2www. systransoft . com noted by Bar-Hillel (1964), machine translation is one of those AI-complete tasks that involves a trade-off between coverage and precision, and the current mainstream systems opt for coverage. This is also what web users expect: they want to be able to throw just anything at the translation system and get something useful back. Precision-oriented approaches, the prime example of which is METEO (Chandioux 1977), have not been popular in recent years. However, from the producer’s point of view, large coverage is not essential: unlike the consumer’s tools, their input is predictable, and can be restricted to very specific domains, and to content that the producers themselves are creating in the first place. But even in such tasks, two severe problems remain: • • The development cost problem: a large amount oTfh ew dorekv eisl onpemedeendt f coors building tmra:n asl laatorgrse afomr new domains and new languages. The authoring problem: since the method does nTohte ew aourkth foorri nalgl input, etmhe: :asu tihnocer othfe eth me source toexest of translation may need special training to write in a way that can be translated at all. These two problems have probably been the main obstacles to making high-quality restricted language translation more wide-spread in tasks where it would otherwise be applicable. We address these problems by providing tools that help developers of translation systems on the one hand, and authors and translators—i.e. the users of the systems—on the other. In the MOLTO project (Multilingual On-Line Translation)3, we have the goal to improve both the development and use of restricted language translation by an order of magnitude, as compared with the state of the art. As for development costs, this means that a system for many languages and with adequate quality can be built in a matter of days rather than months. As for authoring, this means that content production does not require the use of manuals or involve trial and error, both of which can easily make the work ten times slower than normal writing. In the proposed system demo, we will show how some of the building blocks for MOLTO can already now be used in web-based translators, although on a 3 www.molto-project .eu 66 UppsalaP,r Sowceeeddenin,g 1s3 o Jfu tlhye 2 A0C1L0. 2 ?c 01200 S1y0s Atesmso Dcieamtioonns ftorart Cioonms,p puatagteiso 6n6a–l7 L1in,guistics Figure 1: A multilingual GF grammar with reversible mappings from a common abstract syntax to the 15 languages currently available in the GF Resource Grammar Library. smaller scale as regards languages and application domains. A running demo system is available at http : / / grammat i cal framework .org : 4 1 9 6. 2 2 Multilingual Grammars The translation tools are based on GF, Grammatical Framework4 (Ranta 2004). GF is a grammar formalism—that is, a mathematical model of natural language, equipped with a formal notation for writing grammars and a computer program implementing parsing and generation which are declaratively defined by grammars. Thus GF is comparable with formalism such as HPSG (Pollard and Sag 1994), LFG (Bresnan 1982) or TAG (Joshi 1985). The novel feature of GF is the notion of multilingual grammars, which describe several languages simultaneously by using a common representation called abstract syntax; see Figure 1. In a multilingual GF grammar, meaning-preserving translation is provided as a composition of parsing and generation via the abstract syntax, which works as an interlingua. This model of translation is different from approaches based on other comparable grammar formalisms, such as synchronous TAGs (Shieber and Schabes 1990), Pargram (Butt & al. 2002, based on LFG), LINGO Matrix (Bender and Flickinger 2005, based on HPSG), and CLE (Core Language Engine, Alshawi 1992). These approaches use transfer rules between individual languages, separate for each pair of languages. Being interlingua-based, GF translation scales up linearly to new languages without the quadratic blowup of transfer-based systems. In transfer-based sys- 4www.grammaticalframework.org tems, as many as n(n − 1) components (transfer functtieomnss), are naeneyde ads nto( cover a)l cl language pairs nisnf bero tfhu ndci-rections. In an interlingua-based system, 2n + 1components are enough: the interlingua itself, plus translations in both directions between each language and the interlingua. However, in GF, n + 1 components are sufficient, because the mappings from the abstract syntax to each language (the concrete syntaxes) are reversible, i.e. usable for both generation and parsing. Multilingual GF grammars can be seen as an implementation of Curry’s distinction between tectogrammatical and phenogrammatical structure (Curry 1961). In GF, the tectogrammatical structure is called abstract syntax, following standard computer science terminology. It is defined by using a logical framework (Harper & al. 1993), whose mathematical basis is in the type theory of Martin-L o¨f (1984). Two things can be noted about this architecture, both showing im- provements over state-of-the-art grammar-based translation methods. First, the translation interlingua (the abstract syntax) is a powerful logical formalism, able to express semantical structures such as context-dependencies and anaphora (Ranta 1994). In particular, dependent types make it more expressive than the type theory used in Montague grammar (Montague 1974) and employed in the Rosetta translation project (Rosetta 1998). Second, GF uses a framework for interlinguas, rather than one universal interlingua. This makes the interlingual approach more light-weight and feasible than in systems assuming one universal interlingua, such as Rosetta and UNL, Universal Networking Language5 . It also gives more precision to special-purpose translation: the interlingua of a GF translation system (i.e. the abstract syntax of a multilingual grammar) can encode precisely those structures and distinctions that are relevant for the task at hand. Thus an interlingua for mathematical proofs (Hallgren and Ranta 2000) is different from one for commands for operating an MP3 player (Perera and Ranta 2007). The expressive power of the logical framework is sufficient for both kinds of tasks. One important source of inspiration for GF was the WYSIWYM system (Power and Scott 1998), which used domain-specific interlinguas and produced excellent quality in multilingual generation. But the generation components were hard-coded in the program, instead of being defined declaratively as in GF, and they were not usable in the direction of parsing. 3 Grammars and Ontologies Parallel to the first development efforts of GF in the late 1990’s, another framework idea was emerging in web technology: XML, Extensible Mark-up Language, which unlike HTML is not a single mark-up language but a framework for creating custom mark-up lan5www .undl .org 67 guages. The analogy between GF and XML was seen from the beginning, and GF was designed as a formalism for multilingual rendering of semantic content (Dymetman and al. 2000). XML originated as a format for structuring documents and structured data serialization, but a couple ofits descendants, RDF(S) and OWL, developed its potential to formally express the semantics of data and content, serving as the fundaments of the emerging Semantic Web. Practically any meaning representation format can be converted into GF’s abstract syntax, which can then be mapped to different target languages. In particular the OWL language can be seen as a syntactic sugar for a subset of Martin-L o¨f’s type theory so it is trivial to embed it in GF’s abstract syntax. The translation problem defined in terms of an ontology is radically different from the problem of translating plain text from one language to another. Many of the projects in which GF has been used involve precisely this: a meaning representation formalized as GF abstract syntax. Some projects build on previously existing meaning representation and address mathematical proofs (Hallgren and Ranta 2000), software specifications (Beckert & al. 2007), and mathematical exercises (the European project WebALT6). Other projects start with semantic modelling work to build meaning representations from scratch, most notably ones for dialogue systems (Perera and Ranta 2007) in the European project TALK7. Yet another project, and one closest to web translation, is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages. To take an example, the OWL-to-GF mapping trans- lates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows: Clas s (pp : intege r . . . ) m catm integer Ob j e ctP roperty ( pp :div domain (pp : int ege r ) range ( pp :integer ) ) m funm div : int eger -> 4 int ege r -> prop Grammar Engineer’s Tools In the GF setting, building a multilingual translation system is equivalent to building a multilingual GF 6EDC-22253, webalt .math .he l inki . fi s 7IST-507802, 2004–2006, www .t alk-pro j e ct .org grammar, which in turn consists of two kinds of components: • a language-independent abstract syntax, giving tahe l snegmuaangtei-ci nmdeopdeenl dveinat tw ahbisctrha ctrtan ssylnattiaoxn, gisi performed; • for each language, a concrete syntax mapping abfstorrac eta syntax turaegese ,t oa strings ien s tyhnatta language. While abstract syntax construction is an extra task compared to many other kinds of translation methods, it is technically relatively simple, and its cost is moreover amortized as the system is extended to new languages. Concrete syntax construction can be much more demanding in terms of programming skills and linguistic knowledge, due to the complexity of natural languages. This task is where GF claims perhaps the highest advantage over other approaches to special-purpose grammars. The two main assets are: • • Programming language support: GF is a modern fPuroncgtriaomnaml programming language, w isith a a powerful type system and module system supporting modular and collaborative programming and reuse of code. RGL, the GF Resource Grammar Library, implementing Fthe R bsoausicrc linguistic dre Ltaiiblsr orfy l iamn-guages: inflectional morphology and syntactic combination functions. The RGL covers fifteen languages at the moment, shown in Figure 1; see also Khegai 2006, El Dada and Ranta 2007, Angelov 2008, Ranta 2009a,b, and Enache et al. 2010. To give an example of what the library provides, let us first consider the inflectional morphology. It is presented as a set of lexicon-building functions such as, in English, mkV : St r -> V i.e. function mkV, which takes a string (St r) as its argument and returns a verb (V) as its value. The verb is, internally, an inflection table containing all forms of a verb. The function mkV derives all these forms from its argument string, which is the infinitive form. It predicts all regular variations: (mkV

2 0.68807685 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

Author: Emily M. Bender ; Scott Drellishak ; Antske Fokkens ; Michael Wayne Goodman ; Daniel P. Mills ; Laurie Poulson ; Safiyyah Saleem

Abstract: This demonstration presents the LinGO Grammar Matrix grammar customization system: a repository of distilled linguistic knowledge and a web-based service which elicits a typological description of a language from the user and yields a customized grammar fragment ready for sustained development into a broad-coverage grammar. We describe the implementation of this repository with an emphasis on how the information is made available to users, including in-browser testing capabilities.

3 0.64256567 64 acl-2010-Complexity Assumptions in Ontology Verbalisation

Author: Richard Power

Abstract: We describe the strategy currently pursued for verbalising OWL ontologies by sentences in Controlled Natural Language (i.e., combining generic rules for realising logical patterns with ontology-specific lexicons for realising atomic terms for individuals, classes, and properties) and argue that its success depends on assumptions about the complexity of terms and axioms in the ontology. We then show, through analysis of a corpus of ontologies, that although these assumptions could in principle be violated, they are overwhelmingly respected in practice by ontology developers.

4 0.56093842 226 acl-2010-The Human Language Project: Building a Universal Corpus of the World's Languages

Author: Steven Abney ; Steven Bird

Abstract: We present a grand challenge to build a corpus that will include all of the world’s languages, in a consistent structure that permits large-scale cross-linguistic processing, enabling the study of universal linguistics. The focal data types, bilingual texts and lexicons, relate each language to one of a set of reference languages. We propose that the ability to train systems to translate into and out of a given language be the yardstick for determining when we have successfully captured a language. We call on the computational linguistics community to begin work on this Universal Corpus, pursuing the many strands of activity described here, as their contribution to the global effort to document the world’s linguistic heritage before more languages fall silent.

5 0.50512975 260 acl-2010-Wide-Coverage NLP with Linguistically Expressive Grammars

Author: Julia Hockenmaier ; Yusuke Miyao ; Josef van Genabith

Abstract: unkown-abstract

6 0.49899387 222 acl-2010-SystemT: An Algebraic Approach to Declarative Information Extraction

7 0.48687539 259 acl-2010-WebLicht: Web-Based LRT Services for German

8 0.4654229 182 acl-2010-On the Computational Complexity of Dominance Links in Grammatical Formalisms

9 0.45872229 162 acl-2010-Learning Common Grammar from Multilingual Corpus

10 0.43400308 7 acl-2010-A Generalized-Zero-Preserving Method for Compact Encoding of Concept Lattices

11 0.4318321 224 acl-2010-Talking NPCs in a Virtual Game World

12 0.42933917 234 acl-2010-The Use of Formal Language Models in the Typology of the Morphology of Amerindian Languages

13 0.42053461 67 acl-2010-Computing Weakest Readings

14 0.41488567 105 acl-2010-Evaluating Multilanguage-Comparability of Subjectivity Analysis Systems

15 0.4084897 138 acl-2010-Hunting for the Black Swan: Risk Mining from Text

16 0.39966959 66 acl-2010-Compositional Matrix-Space Models of Language

17 0.39571309 61 acl-2010-Combining Data and Mathematical Models of Language Change

18 0.38873613 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

19 0.3807905 92 acl-2010-Don't 'Have a Clue'? Unsupervised Co-Learning of Downward-Entailing Operators.

20 0.37766969 139 acl-2010-Identifying Generic Noun Phrases


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.027), (25, 0.061), (32, 0.014), (34, 0.367), (39, 0.02), (42, 0.022), (44, 0.017), (56, 0.01), (59, 0.074), (73, 0.052), (78, 0.036), (80, 0.012), (83, 0.051), (84, 0.039), (98, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.7493701 235 acl-2010-Tools for Multilingual Grammar-Based Translation on the Web

Author: Aarne Ranta ; Krasimir Angelov ; Thomas Hallgren

Abstract: This is a system demo for a set of tools for translating texts between multiple languages in real time with high quality. The translation works on restricted languages, and is based on semantic interlinguas. The underlying model is GF (Grammatical Framework), which is an open-source toolkit for multilingual grammar implementations. The demo will cover up to 20 parallel languages. Two related sets of tools are presented: grammarian’s tools helping to build translators for new domains and languages, and translator’s tools helping to translate documents. The grammarian’s tools are designed to make it easy to port the technique to new applications. The translator’s tools are essential in the restricted language context, enabling the author to remain in the fragments recognized by the system. The tools that are demonstrated will be ap- plied and developed further in the European project MOLTO (Multilingual On-Line Translation) which has started in March 2010 and runs for three years. 1 Translation Needs for the Web The best-known translation tools on the web are Google translate1 and Systran2. They are targeted to consumers of web documents: users who want to find out what a given document is about. For this purpose, browsing quality is sufficient, since the user has intelligence and good will, and understands that she uses the translation at her own risk. Since Google and Systran translations can be grammatically and semantically flawed, they don’t reach publication quality, and cannot hence be used by the producers of web documents. For instance, the provider of an e-commerce site cannot take the risk that the product descriptions or selling conditions have errors that change the original intentions. There are very few automatic translation systems actually in use for producers of information. As already 1www .google . com/t rans l e at 2www. systransoft . com noted by Bar-Hillel (1964), machine translation is one of those AI-complete tasks that involves a trade-off between coverage and precision, and the current mainstream systems opt for coverage. This is also what web users expect: they want to be able to throw just anything at the translation system and get something useful back. Precision-oriented approaches, the prime example of which is METEO (Chandioux 1977), have not been popular in recent years. However, from the producer’s point of view, large coverage is not essential: unlike the consumer’s tools, their input is predictable, and can be restricted to very specific domains, and to content that the producers themselves are creating in the first place. But even in such tasks, two severe problems remain: • • The development cost problem: a large amount oTfh ew dorekv eisl onpemedeendt f coors building tmra:n asl laatorgrse afomr new domains and new languages. The authoring problem: since the method does nTohte ew aourkth foorri nalgl input, etmhe: :asu tihnocer othfe eth me source toexest of translation may need special training to write in a way that can be translated at all. These two problems have probably been the main obstacles to making high-quality restricted language translation more wide-spread in tasks where it would otherwise be applicable. We address these problems by providing tools that help developers of translation systems on the one hand, and authors and translators—i.e. the users of the systems—on the other. In the MOLTO project (Multilingual On-Line Translation)3, we have the goal to improve both the development and use of restricted language translation by an order of magnitude, as compared with the state of the art. As for development costs, this means that a system for many languages and with adequate quality can be built in a matter of days rather than months. As for authoring, this means that content production does not require the use of manuals or involve trial and error, both of which can easily make the work ten times slower than normal writing. In the proposed system demo, we will show how some of the building blocks for MOLTO can already now be used in web-based translators, although on a 3 www.molto-project .eu 66 UppsalaP,r Sowceeeddenin,g 1s3 o Jfu tlhye 2 A0C1L0. 2 ?c 01200 S1y0s Atesmso Dcieamtioonns ftorart Cioonms,p puatagteiso 6n6a–l7 L1in,guistics Figure 1: A multilingual GF grammar with reversible mappings from a common abstract syntax to the 15 languages currently available in the GF Resource Grammar Library. smaller scale as regards languages and application domains. A running demo system is available at http : / / grammat i cal framework .org : 4 1 9 6. 2 2 Multilingual Grammars The translation tools are based on GF, Grammatical Framework4 (Ranta 2004). GF is a grammar formalism—that is, a mathematical model of natural language, equipped with a formal notation for writing grammars and a computer program implementing parsing and generation which are declaratively defined by grammars. Thus GF is comparable with formalism such as HPSG (Pollard and Sag 1994), LFG (Bresnan 1982) or TAG (Joshi 1985). The novel feature of GF is the notion of multilingual grammars, which describe several languages simultaneously by using a common representation called abstract syntax; see Figure 1. In a multilingual GF grammar, meaning-preserving translation is provided as a composition of parsing and generation via the abstract syntax, which works as an interlingua. This model of translation is different from approaches based on other comparable grammar formalisms, such as synchronous TAGs (Shieber and Schabes 1990), Pargram (Butt & al. 2002, based on LFG), LINGO Matrix (Bender and Flickinger 2005, based on HPSG), and CLE (Core Language Engine, Alshawi 1992). These approaches use transfer rules between individual languages, separate for each pair of languages. Being interlingua-based, GF translation scales up linearly to new languages without the quadratic blowup of transfer-based systems. In transfer-based sys- 4www.grammaticalframework.org tems, as many as n(n − 1) components (transfer functtieomnss), are naeneyde ads nto( cover a)l cl language pairs nisnf bero tfhu ndci-rections. In an interlingua-based system, 2n + 1components are enough: the interlingua itself, plus translations in both directions between each language and the interlingua. However, in GF, n + 1 components are sufficient, because the mappings from the abstract syntax to each language (the concrete syntaxes) are reversible, i.e. usable for both generation and parsing. Multilingual GF grammars can be seen as an implementation of Curry’s distinction between tectogrammatical and phenogrammatical structure (Curry 1961). In GF, the tectogrammatical structure is called abstract syntax, following standard computer science terminology. It is defined by using a logical framework (Harper & al. 1993), whose mathematical basis is in the type theory of Martin-L o¨f (1984). Two things can be noted about this architecture, both showing im- provements over state-of-the-art grammar-based translation methods. First, the translation interlingua (the abstract syntax) is a powerful logical formalism, able to express semantical structures such as context-dependencies and anaphora (Ranta 1994). In particular, dependent types make it more expressive than the type theory used in Montague grammar (Montague 1974) and employed in the Rosetta translation project (Rosetta 1998). Second, GF uses a framework for interlinguas, rather than one universal interlingua. This makes the interlingual approach more light-weight and feasible than in systems assuming one universal interlingua, such as Rosetta and UNL, Universal Networking Language5 . It also gives more precision to special-purpose translation: the interlingua of a GF translation system (i.e. the abstract syntax of a multilingual grammar) can encode precisely those structures and distinctions that are relevant for the task at hand. Thus an interlingua for mathematical proofs (Hallgren and Ranta 2000) is different from one for commands for operating an MP3 player (Perera and Ranta 2007). The expressive power of the logical framework is sufficient for both kinds of tasks. One important source of inspiration for GF was the WYSIWYM system (Power and Scott 1998), which used domain-specific interlinguas and produced excellent quality in multilingual generation. But the generation components were hard-coded in the program, instead of being defined declaratively as in GF, and they were not usable in the direction of parsing. 3 Grammars and Ontologies Parallel to the first development efforts of GF in the late 1990’s, another framework idea was emerging in web technology: XML, Extensible Mark-up Language, which unlike HTML is not a single mark-up language but a framework for creating custom mark-up lan5www .undl .org 67 guages. The analogy between GF and XML was seen from the beginning, and GF was designed as a formalism for multilingual rendering of semantic content (Dymetman and al. 2000). XML originated as a format for structuring documents and structured data serialization, but a couple ofits descendants, RDF(S) and OWL, developed its potential to formally express the semantics of data and content, serving as the fundaments of the emerging Semantic Web. Practically any meaning representation format can be converted into GF’s abstract syntax, which can then be mapped to different target languages. In particular the OWL language can be seen as a syntactic sugar for a subset of Martin-L o¨f’s type theory so it is trivial to embed it in GF’s abstract syntax. The translation problem defined in terms of an ontology is radically different from the problem of translating plain text from one language to another. Many of the projects in which GF has been used involve precisely this: a meaning representation formalized as GF abstract syntax. Some projects build on previously existing meaning representation and address mathematical proofs (Hallgren and Ranta 2000), software specifications (Beckert & al. 2007), and mathematical exercises (the European project WebALT6). Other projects start with semantic modelling work to build meaning representations from scratch, most notably ones for dialogue systems (Perera and Ranta 2007) in the European project TALK7. Yet another project, and one closest to web translation, is the multilingual Wiki system presented in (Meza Moreno and Bringert 2008). In this system, users can add and modify reviews of restaurants in three languages (English, Spanish, and Swedish). Any change made in any of the languages gets automatically translated to the other languages. To take an example, the OWL-to-GF mapping trans- lates OWL’s classes to GF’s categories and OWL’s properties to GF’s functions that return propositions. As a running example in this and the next section, we will use the class of integers and the two-place property of being divisible (“x is divisible by y”). The correspondences are as follows: Clas s (pp : intege r . . . ) m catm integer Ob j e ctP roperty ( pp :div domain (pp : int ege r ) range ( pp :integer ) ) m funm div : int eger -> 4 int ege r -> prop Grammar Engineer’s Tools In the GF setting, building a multilingual translation system is equivalent to building a multilingual GF 6EDC-22253, webalt .math .he l inki . fi s 7IST-507802, 2004–2006, www .t alk-pro j e ct .org grammar, which in turn consists of two kinds of components: • a language-independent abstract syntax, giving tahe l snegmuaangtei-ci nmdeopdeenl dveinat tw ahbisctrha ctrtan ssylnattiaoxn, gisi performed; • for each language, a concrete syntax mapping abfstorrac eta syntax turaegese ,t oa strings ien s tyhnatta language. While abstract syntax construction is an extra task compared to many other kinds of translation methods, it is technically relatively simple, and its cost is moreover amortized as the system is extended to new languages. Concrete syntax construction can be much more demanding in terms of programming skills and linguistic knowledge, due to the complexity of natural languages. This task is where GF claims perhaps the highest advantage over other approaches to special-purpose grammars. The two main assets are: • • Programming language support: GF is a modern fPuroncgtriaomnaml programming language, w isith a a powerful type system and module system supporting modular and collaborative programming and reuse of code. RGL, the GF Resource Grammar Library, implementing Fthe R bsoausicrc linguistic dre Ltaiiblsr orfy l iamn-guages: inflectional morphology and syntactic combination functions. The RGL covers fifteen languages at the moment, shown in Figure 1; see also Khegai 2006, El Dada and Ranta 2007, Angelov 2008, Ranta 2009a,b, and Enache et al. 2010. To give an example of what the library provides, let us first consider the inflectional morphology. It is presented as a set of lexicon-building functions such as, in English, mkV : St r -> V i.e. function mkV, which takes a string (St r) as its argument and returns a verb (V) as its value. The verb is, internally, an inflection table containing all forms of a verb. The function mkV derives all these forms from its argument string, which is the infinitive form. It predicts all regular variations: (mkV

2 0.68764472 131 acl-2010-Hierarchical A* Parsing with Bridge Outside Scores

Author: Adam Pauls ; Dan Klein

Abstract: Hierarchical A∗ (HA∗) uses of a hierarchy of coarse grammars to speed up parsing without sacrificing optimality. HA∗ prioritizes search in refined grammars using Viterbi outside costs computed in coarser grammars. We present Bridge Hierarchical A∗ (BHA∗), a modified Hierarchial A∗ algorithm which computes a novel outside cost called a bridge outside cost. These bridge costs mix finer outside scores with coarser inside scores, and thus constitute tighter heuristics than entirely coarse scores. We show that BHA∗ substantially outperforms HA∗ when the hierarchy contains only very coarse grammars, while achieving comparable performance on more refined hierarchies.

3 0.57481462 17 acl-2010-A Structured Model for Joint Learning of Argument Roles and Predicate Senses

Author: Yotaro Watanabe ; Masayuki Asahara ; Yuji Matsumoto

Abstract: In predicate-argument structure analysis, it is important to capture non-local dependencies among arguments and interdependencies between the sense of a predicate and the semantic roles of its arguments. However, no existing approach explicitly handles both non-local dependencies and semantic dependencies between predicates and arguments. In this paper we propose a structured model that overcomes the limitation of existing approaches; the model captures both types of dependencies simultaneously by introducing four types of factors including a global factor type capturing non-local dependencies among arguments and a pairwise factor type capturing local dependencies between a predicate and an argument. In experiments the proposed model achieved competitive results compared to the stateof-the-art systems without applying any feature selection procedure.

4 0.39242369 128 acl-2010-Grammar Prototyping and Testing with the LinGO Grammar Matrix Customization System

Author: Emily M. Bender ; Scott Drellishak ; Antske Fokkens ; Michael Wayne Goodman ; Daniel P. Mills ; Laurie Poulson ; Safiyyah Saleem

Abstract: This demonstration presents the LinGO Grammar Matrix grammar customization system: a repository of distilled linguistic knowledge and a web-based service which elicits a typological description of a language from the user and yields a customized grammar fragment ready for sustained development into a broad-coverage grammar. We describe the implementation of this repository with an emphasis on how the information is made available to users, including in-browser testing capabilities.

5 0.38982743 172 acl-2010-Minimized Models and Grammar-Informed Initialization for Supertagging with Highly Ambiguous Lexicons

Author: Sujith Ravi ; Jason Baldridge ; Kevin Knight

Abstract: We combine two complementary ideas for learning supertaggers from highly ambiguous lexicons: grammar-informed tag transitions and models minimized via integer programming. Each strategy on its own greatly improves performance over basic expectation-maximization training with a bitag Hidden Markov Model, which we show on the CCGbank and CCG-TUT corpora. The strategies provide further error reductions when combined. We describe a new two-stage integer programming strategy that efficiently deals with the high degree of ambiguity on these datasets while obtaining the full effect of model minimization.

6 0.38847131 162 acl-2010-Learning Common Grammar from Multilingual Corpus

7 0.38842696 65 acl-2010-Complexity Metrics in an Incremental Right-Corner Parser

8 0.38746506 55 acl-2010-Bootstrapping Semantic Analyzers from Non-Contradictory Texts

9 0.387348 71 acl-2010-Convolution Kernel over Packed Parse Forest

10 0.38707143 169 acl-2010-Learning to Translate with Source and Target Syntax

11 0.38706702 211 acl-2010-Simple, Accurate Parsing with an All-Fragments Grammar

12 0.38691592 120 acl-2010-Fully Unsupervised Core-Adjunct Argument Classification

13 0.38688529 191 acl-2010-PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names

14 0.38570428 158 acl-2010-Latent Variable Models of Selectional Preference

15 0.38555455 113 acl-2010-Extraction and Approximation of Numerical Attributes from the Web

16 0.38552332 248 acl-2010-Unsupervised Ontology Induction from Text

17 0.38513121 218 acl-2010-Structural Semantic Relatedness: A Knowledge-Based Method to Named Entity Disambiguation

18 0.38436037 261 acl-2010-Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

19 0.38435176 130 acl-2010-Hard Constraints for Grammatical Function Labelling

20 0.38377982 121 acl-2010-Generating Entailment Rules from FrameNet