acl acl2012 acl2012-138 knowledge-graph by maker-knowledge-mining

138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation


Source: pdf

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation for developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 LetsMT!: A Cloud-Based Platform for Do-It-Yourself Machine Translation. Andrejs Vasiļjevs, TILDE, Vienības gatve 75a, Riga LV-1004, LATVIA, andrejs@tilde.com [sent-2, score-0.065]

2 Raivis Skadiņš, TILDE, Vienības gatve 75a, Riga LV-1004, LATVIA, raivis.skadins@tilde.lv [sent-3, score-0.034]

3 To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. [sent-9, score-0.171]

4 The platform is developed in the EU collaboration project LetsMT! [sent-10, score-0.195]

5 This system demonstration paper presents the motivation for developing the LetsMT! [sent-12, score-0.038]

6 LetsMT! develops a user-driven MT “factory in the cloud” enabling web users to get customised MT that better fits their needs. [sent-16, score-0.114]

7 Harnessing the huge potential of the web together with open statistical machine translation (SMT) technologies, LetsMT! [sent-17, score-0.138]

8 has created an online collaborative platform for data sharing and MT building. [sent-18, score-0.137]

9 The LetsMT! project aims to facilitate the use of open source SMT tools and to involve users in the collection of training data. [sent-20, score-0.158]

10 The LetsMT! project extends the use of existing state-of-the-art SMT methods by providing them as cloud-based services. [sent-22, score-0.058]

11 An easy-to-use web interface empowers users to participate in data collection and MT customisation to increase the quality, domain coverage, and usage of MT. [sent-23, score-0.159]

12 The project partners are the companies TILDE (coordinator), Moravia, and SemLab, and the Universities of Copenhagen, Edinburgh, Zagreb, and Uppsala. [sent-25, score-0.097]

13 The LetsMT! platform gathers public and user-provided MT training data and enables the generation of multiple MT systems by combining and prioritising this data. [sent-28, score-0.082]

14 Users can upload their parallel corpora to an online repository and generate user-tailored SMT systems based on data selected by the user. [sent-29, score-0.185]

15 Authenticated users with appropriate permissions can also store private corpora that can be seen and used only by that user (or a designated user group). [sent-30, score-0.247]

16 Data in the repository is kept in an internal format, and only its metadata is provided to the user. [sent-32, score-0.166]

17 The uploaded data can only be used for SMT training. [sent-34, score-0.059]

18 A user creates an SMT system definition by specifying a few basic parameters, such as the system name, source/target languages, and domain, and by choosing the corpora (parallel for translation models or monolingual for language models) to use for the particular system. [sent-36, score-0.229]
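
To make the shape of such a definition concrete, the sketch below shows what it might look like as a small structured record. This is an illustration only; the field names and values are hypothetical and are not the actual LetsMT! schema.

```python
# Hypothetical sketch of an SMT system definition as it might be sent to a
# do-it-yourself MT platform; field names are illustrative, not the real schema.
import json

system_definition = {
    "name": "EN-LV IT domain v1",               # system name
    "source_language": "en",                    # source language
    "target_language": "lv",                    # target language
    "domain": "information-technology",
    "parallel_corpora": ["my-tm-2011.tmx"],     # for the translation model
    "monolingual_corpora": ["lv-it-news.txt"],  # for the language model
    "tuning_data": "auto",                      # auto-extract from training corpora
    "evaluation_data": "auto",
    "access": "private",                        # public, private, or a user group
}

print(json.dumps(system_definition, indent=2))
```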

19 Tuning and evaluation data can be automatically extracted from the training corpora or specified by the user. [sent-37, score-0.032]

20 The access level of the system can also be specified - whether it will be public or accessible only to the particular user or user group. [sent-38, score-0.255]

21 It currently contains 730 million parallel sentences in almost 50 languages. [sent-46, score-0.053]

22 When the system is specified, the user can begin training it. [sent-50, score-0.127]

23 Progress of the training can be monitored on the dynamic training chart (Figure 1). [sent-51, score-0.125]

24 It provides a detailed visualisation of the training process showing (i) steps queued for execution of a particular training task, (ii) current execution status of active training steps, and (iii) steps where any errors have occurred. [sent-52, score-0.213]

25 The training chart remains available after training to facilitate analysis of completed training runs. [sent-53, score-0.096]

26 The last step of the training task is automatic evaluation using BLEU, NIST, TER, and METEOR scores. [sent-54, score-0.032]
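
As an illustration of this kind of automatic scoring, the sketch below computes corpus-level BLEU with the third-party sacrebleu package; it is not the scorer used inside LetsMT!, and the NIST, TER, and METEOR metrics mentioned above are not shown.

```python
# Sketch of corpus-level BLEU scoring with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["the platform trains custom translation systems",
              "users upload parallel corpora to the repository"]
references = [["the platform trains customised translation systems",
               "users upload their parallel corpora to the repository"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```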

27 A successfully trained SMT system can be started and used for translation in several ways: on the translation webpage of LetsMT! [sent-55, score-0.23]

28 plug-ins in computer-assisted translation (CAT) tools for professional translation; integrating the LetsMT! [sent-57, score-0.131]

29 plug-ins for IE and Firefox to integrate translation into the browsers; using LetsMT! [sent-59, score-0.096]

30 LetsMT! allows several system instances to run simultaneously to speed up translation and balance the workload from numerous translation requests. [sent-62, score-0.23]

31 User authentication and authorisation mechanisms control access rights to private resources. The SMT training and decoding facilities of LetsMT! [sent-64, score-0.269]

32 One of the important achievements of the project is the adaptation of the Moses toolkit to fit into the rapid training, updating, and interactive access environment of the LetsMT! [sent-66, score-0.138]

33 The Moses toolkit (Koehn et al., 2007) provides a complete statistical translation system distributed under the LGPL license. [sent-69, score-0.176]

34 Moses includes all of the components needed to preprocess data and to train language and translation models. [sent-70, score-0.096]

35 While the use of the software is not closely monitored, Moses is known to be in commercial use by companies such as Systran (Dugast et al. [sent-72, score-0.039]

36 SMT training is automated using the Moses experiment management system (Koehn, 2010). [sent-82, score-0.07]

37 , 2009); a server mode version of the Moses decoder and multithreaded decoding; multiple translation models; distributed language models (Brants et al. [sent-86, score-0.221]
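
A server-mode Moses decoder is typically queried over XML-RPC; the sketch below shows a minimal client under the assumption that such a server is running locally. The host, port, and parameter set are assumptions that depend on how the server was started, not details taken from this paper.

```python
# Sketch of querying a Moses decoder running in server (XML-RPC) mode.
# URL and parameters are illustrative; check the server options of a
# given installation for the actual interface.
import xmlrpc.client

MOSES_URL = "http://localhost:8080/RPC2"   # assumed host/port

def translate(text: str) -> str:
    proxy = xmlrpc.client.ServerProxy(MOSES_URL)
    result = proxy.translate({"text": text})   # assumed to return a dict with "text"
    return result["text"]

if __name__ == "__main__":
    print(translate("das ist ein kleiner test"))
```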

38 Many improvements in the Moses experiment management system were implemented to speed up SMT system training and to use the full potential of the HPC cluster. [sent-88, score-0.108]

39 We revised and improved Moses training routines (i) by finding tasks that are executed sequentially but can be executed in parallel and (ii) by splitting big training tasks into smaller ones and executing them in parallel. [sent-89, score-0.175]
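
The idea of splitting one large sequential step into chunks that run in parallel can be sketched with Python's standard library; the chunked "task" below is only a placeholder for a real training sub-task and is not taken from the revised Moses routines themselves.

```python
# Sketch of splitting one big sequential task into chunks that run in parallel.
from concurrent.futures import ProcessPoolExecutor
from typing import List

def process_chunk(sentences: List[str]) -> List[str]:
    # Placeholder for a real training sub-task (e.g. tokenisation of a corpus slice).
    return [s.lower().strip() for s in sentences]

def split(corpus: List[str], n_chunks: int) -> List[List[str]]:
    size = max(1, len(corpus) // n_chunks)
    return [corpus[i:i + size] for i in range(0, len(corpus), size)]

def process_parallel(corpus: List[str], n_workers: int = 4) -> List[str]:
    chunks = split(corpus, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    return [sentence for chunk in results for sentence in chunk]

if __name__ == "__main__":
    corpus = ["A Cloud-Based Platform ", "for Do-It-Yourself MT "] * 1000
    print(len(process_parallel(corpus)))
```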

40 The LetsMT! system has a multitier architecture: (i) an interface layer implementing the user interface and APIs with external systems, (ii) an application logic layer for the system logic, (iii) a data storage layer consisting of file and database storage, and (iv) a high performance computing (HPC) cluster. [sent-92, score-0.906]

41 The system performs various time- and resource-consuming tasks; these tasks are defined by the application logic and data storage layers and are sent to the HPC cluster for execution. [sent-94, score-0.43]

42 Human users can access the system through web browsers by using the LetsMT! [sent-98, score-0.205]

43 External systems such as Computer Aided Translation (CAT) tools and web browser plug-ins can access the LetsMT! [sent-100, score-0.13]

44 The public API is available through both REST/JSON and SOAP protocol web services. [sent-102, score-0.131]

45 HTTPS is used to ensure secure user authentication and secure data transfer. [sent-103, score-0.229]
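
A hypothetical client call against such a REST/JSON web service over HTTPS might look like the sketch below; the base URL, endpoint path, parameter names, and authentication scheme are all assumptions for illustration, not the documented LetsMT! API.

```python
# Hypothetical sketch of calling a REST/JSON translation web service over HTTPS;
# endpoint, field names, and auth scheme are illustrative only.
import requests

API_BASE = "https://example-letsmt-host/ws"      # assumed base URL
API_TOKEN = "my-api-token"                       # assumed per-user credential

def translate(text: str, system_id: str) -> str:
    response = requests.get(
        f"{API_BASE}/translate",
        params={"systemID": system_id, "text": text},
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["translation"]        # assumed response field
```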

46 The application logic layer contains a set of modules responsible for the main functionality and logic of the system. [sent-104, score-0.396]

47 It receives queries and commands from the interface layer and prepares answers or performs tasks using data storage and the HPC cluster. [sent-105, score-0.34]

48 This layer contains several modules such as the Resource Repository Adapter, the User Manager, the SMT Training Manager, etc. [sent-106, score-0.195]

49 The interface layer accesses the application logic layer through the REST/JSON and SOAP protocol web services. [sent-107, score-0.492]

50 The same protocols are used for communication between modules in the application logic layer. [sent-108, score-0.193]

51 As training data may change (for example, grow), the RR is based on a version-controlled file system (currently we use SVN as the backend system). [sent-110, score-0.267]

52 A key-value store is used to keep metadata and statistics about training data and trained SMT systems. [sent-111, score-0.156]

53 Modules from the application logic layer and HPC cluster access RR through a REST-based web service interface. [sent-112, score-0.388]

54 Modules from the application logic and data storage layers create jobs and send them to the HPC cluster for execution. [sent-114, score-0.37]

55 The HPC cluster is responsible for accepting, scheduling, dispatching, and managing remote and distributed execution of large numbers of standalone, parallel, or interactive jobs. [sent-115, score-0.134]

56 It also manages and schedules the allocation of distributed resources such as processors, memory, and disk space. [sent-116, score-0.076]

57 The HPC cluster is based on the Oracle Grid Engine (SGE). [sent-118, score-0.053]

58 The majority of services run on Linux platforms (Moses, RR, data processing tools, etc.). [sent-121, score-0.055]

59 The Web server and application logic services run on a Microsoft Windows platform. [sent-123, score-0.222]

60 The system hardware architecture is designed to be highly scalable. [sent-124, score-0.112]

61 The system, including the HPC cluster, is hosted on Amazon Web Services infrastructure, which provides easy access to on-demand computing and storage resources. [sent-128, score-0.313]

62 The system has to store and process large amounts of SMT training data (parallel and monolingual corpora) as well as the trained models of SMT systems. [sent-130, score-0.129]

63 The general architecture of the Resource Repository is illustrated in Figure 3. [sent-133, score-0.043]

64 It is implemented as a modular package that can easily be installed in a distributed environment. [sent-134, score-0.042]

65 RR services are provided via Web APIs and secure HTTP requests. [sent-135, score-0.102]

66 Data storage can be distributed over several servers as is illustrated in Figure 3. [sent-136, score-0.24]

67 Storage servers communicate with the central database server that manages all metadata records attached to resources in the RR. [sent-137, score-0.176]

68 Data resources are organised in slots that correspond to file systems with user-specific branches. [sent-138, score-0.06]

69 Currently, the RR package implements two storage backends: a plain file system and a version-controlled file system based on Subversion (SVN). [sent-139, score-0.365]

70 Furthermore, they keep track of modifications and file histories to make it possible to backtrack to prior revisions. [sent-142, score-0.06]

71 Another interesting feature is the possibility to create cheap copies of entire branches that can be used to enable data modifications by other users without compromising data integrity for others. [sent-144, score-0.033]

72 In general, the RR implementation is modular, other storage backends may be added later, and each individual slot can use its own backend type. [sent-146, score-0.306]
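
That kind of modularity, with interchangeable storage backends and each slot bound to its own backend type, can be pictured as a small interface; this is a design sketch with illustrative class and method names, not the actual RR code.

```python
# Design sketch of pluggable storage backends for a resource repository;
# class and method names are illustrative, not taken from the RR implementation.
from abc import ABC, abstractmethod
from pathlib import Path
import subprocess

class StorageBackend(ABC):
    @abstractmethod
    def put(self, slot: Path, relpath: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, slot: Path, relpath: str) -> bytes: ...

class PlainFileBackend(StorageBackend):
    def put(self, slot: Path, relpath: str, data: bytes) -> None:
        target = slot / relpath
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def get(self, slot: Path, relpath: str) -> bytes:
        return (slot / relpath).read_bytes()

class SvnBackend(PlainFileBackend):
    """Version-controlled variant: writes the file, then commits it to an
    existing SVN working copy rooted at the slot."""
    def put(self, slot: Path, relpath: str, data: bytes) -> None:
        super().put(slot, relpath, data)
        subprocess.run(["svn", "add", "--force", "--parents", relpath],
                       cwd=slot, check=True)
        subprocess.run(["svn", "commit", "-m", f"add {relpath}"],
                       cwd=slot, check=True)

# Each slot can be bound to its own backend type:
BACKENDS = {"public-corpora": PlainFileBackend(), "user-data": SvnBackend()}
```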

73 We decided to integrate a modern key-value store into the platform in order to allow a maximum of flexibility. [sent-148, score-0.196]

74 In contrast to traditional relational databases, key-value stores allow the storage of arbitrary data sets based on pairs of keys and values without being restricted to a pre-defined schema or a fixed data model. [sent-149, score-0.214]

75 In particular, we use the table mode of TokyoCabinet that supports storage of arbitrary data records connected to a single key in the database. [sent-151, score-0.234]

76 We use resource URLs in our repository to define unique keys in the database, and data records attached to these keys may include any number of key-value pairs. [sent-152, score-0.242]

77 In this way, we can add any kind of information to each addressable resource in the RR. [sent-153, score-0.051]

78 The software also supports keys with unordered lists of values, which is useful for metadata such as languages (in a data collection) and for many other purposes. [sent-154, score-0.14]

79 Using TokyoCabinet as our backend, we implemented a key-value store for metadata in the RR that can easily be extended and queried from the frontend of the LetsMT! [sent-157, score-0.183]
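
The pattern of keying open-ended metadata records by resource URL can be imitated with Python's standard library as a stand-in for TokyoCabinet's table mode; the sketch below is an analogue of the idea, not the TokyoCabinet API or the RR implementation.

```python
# Stand-in sketch of the metadata key-value store: resource URLs act as keys,
# and the attached record is an open-ended set of key-value pairs.
import dbm
import json

class MetadataStore:
    def __init__(self, path: str = "rr_metadata.db"):
        self.path = path

    def set(self, resource_url: str, **fields) -> None:
        key = resource_url.encode("utf-8")
        with dbm.open(self.path, "c") as db:
            record = json.loads(db[key]) if key in db else {}
            record.update(fields)
            db[key] = json.dumps(record).encode("utf-8")

    def get(self, resource_url: str) -> dict:
        key = resource_url.encode("utf-8")
        with dbm.open(self.path, "c") as db:
            return json.loads(db[key]) if key in db else {}

store = MetadataStore()
store.set("slot/user1/corpus-it.tmx",
          languages=["en", "lv"],            # list-valued metadata field
          sentence_pairs=125000,
          domain="information-technology")
print(store.get("slot/user1/corpus-it.tmx"))
```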

80 Yet another important feature of the RR is the collection of import modules that take care of validation and conversion of user-provided SMT training material. [sent-159, score-0.224]

81 Pre-aligned parallel data can be uploaded in TMX, XLIFF, and Moses formats. [sent-163, score-0.112]

82 We also support the upload of compressed archives in zip and tar format. [sent-165, score-0.062]
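
A minimal sketch of the kind of conversion such an import module performs is shown below: it turns a (simplified) TMX translation memory into the Moses-style plain-text parallel format, one aligned sentence per line and per language. Validation, encoding fixes, and error handling are omitted, and the function is illustrative rather than the actual import code.

```python
# Sketch of converting a (simplified) TMX translation memory into Moses-style
# plain-text parallel files: one sentence per line, source and target aligned.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_moses(tmx_path: str, src_lang: str, trg_lang: str,
                 src_out: str, trg_out: str) -> int:
    tree = ET.parse(tmx_path)
    pairs = 0
    with open(src_out, "w", encoding="utf-8") as src_f, \
         open(trg_out, "w", encoding="utf-8") as trg_f:
        for tu in tree.iter("tu"):
            segs = {}
            for tuv in tu.iter("tuv"):
                lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
                seg = tuv.find("seg")
                if seg is not None and seg.text:
                    segs[lang.split("-")[0]] = " ".join(seg.text.split())
            if src_lang in segs and trg_lang in segs:
                src_f.write(segs[src_lang] + "\n")
                trg_f.write(segs[trg_lang] + "\n")
                pairs += 1
    return pairs

# Example with hypothetical file names:
# tmx_to_moses("memory.tmx", "en", "lv", "corpus.en", "corpus.lv")
```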

83 Furthermore, our system also includes tools for automatic sentence alignment. [sent-167, score-0.073]

84 Finally, we also integrated a general batch-queuing system (SGE) to run off-line processes such as import jobs. [sent-170, score-0.156]

85 In this way, we further increase the scalability of the system by taking the load off repository servers. [sent-171, score-0.17]

86 Data uploads automatically trigger appropriate import jobs that will be queued on the grid engine using a dedicated job web-service API. [sent-172, score-0.279]
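
Queuing an off-line import job on an SGE/Oracle Grid Engine cluster can be sketched as a thin wrapper around qsub; the script name, job name, and options below are assumptions for illustration, not the dedicated LetsMT! job web-service API.

```python
# Sketch of queuing an import job on an SGE/Oracle Grid Engine cluster via qsub;
# script path and options are illustrative.
import subprocess

def submit_import_job(resource_url: str, script: str = "run_import.sh") -> str:
    """Submit an off-line import job for an uploaded resource; returns qsub's
    confirmation message, which contains the job id."""
    cmd = [
        "qsub",
        "-N", "rr-import",          # job name
        "-b", "y",                  # treat the command as a binary/script
        script, resource_url,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Example: submit_import_job("slot/user1/corpus-it.tmx")
```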

87 One of the usage scenarios particularly targeted by the project is application in the localisation and translation industry. [sent-173, score-0.288]

88 Localisation companies usually have collected significant amounts of parallel data in the form of translation memories. [sent-174, score-0.188]

89 They are interested in using this data to create customised MT engines that can increase productivity of translators. [sent-175, score-0.141]

90 In addition to translation candidates from translation memories, translators receive translation suggestions provided by the selected MT engine running on LetsMT! [sent-179, score-0.362]

91 As part of the system evaluation, project partner Moravia used the LetsMT! [sent-181, score-0.096]

92 platform to train and evaluate SMT systems for Polish and Czech. [sent-182, score-0.137]

93 9M parallel sentences coming from Moravia translation memories in the IT and tech domain, and part of the Czech National Corpus. [sent-184, score-0.18]

94 5M parallel sentences from Moravia production data in the IT domain. [sent-188, score-0.053]

95 For evaluation of English-Latvian translation, TILDE created an MT system using a significantly larger corpus of 5. [sent-191, score-0.038]

96 The LetsMT! project is on track to fulfill its goal to democratise the creation and usage of custom SMT systems. [sent-199, score-0.092]

97 The architecture of the platform and the Resource Repository makes the system scalable and able to handle very large amounts of data in a variety of formats. [sent-202, score-0.249]

98 Evaluation shows a strong increase in translation productivity by using LetsMT! [sent-203, score-0.198]

99 The LetsMT! project has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement 250456. [sent-206, score-0.058]

100 Selective addition of corpus-extracted phrasal lexical rules to a rule-based machine translation system. [sent-212, score-0.096]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('letsmt', 0.718), ('hpc', 0.216), ('smt', 0.172), ('storage', 0.169), ('platform', 0.137), ('moses', 0.126), ('rr', 0.124), ('layer', 0.121), ('import', 0.118), ('skadi', 0.103), ('productivity', 0.102), ('repository', 0.101), ('backend', 0.098), ('translation', 0.096), ('logic', 0.082), ('mt', 0.08), ('levenberg', 0.079), ('moravia', 0.079), ('modules', 0.074), ('metadata', 0.065), ('localisation', 0.063), ('file', 0.06), ('frontend', 0.059), ('jevs', 0.059), ('svn', 0.059), ('tilde', 0.059), ('tokyocabinet', 0.059), ('uploaded', 0.059), ('store', 0.059), ('project', 0.058), ('user', 0.057), ('services', 0.055), ('cluster', 0.053), ('parallel', 0.053), ('access', 0.053), ('vasi', 0.051), ('translator', 0.051), ('resource', 0.051), ('interface', 0.05), ('public', 0.05), ('server', 0.048), ('secure', 0.047), ('facilities', 0.047), ('keys', 0.045), ('translators', 0.044), ('architecture', 0.043), ('distributed', 0.042), ('web', 0.042), ('private', 0.041), ('companies', 0.039), ('protocol', 0.039), ('execution', 0.039), ('authentication', 0.039), ('backends', 0.039), ('browsers', 0.039), ('customised', 0.039), ('dugast', 0.039), ('jec', 0.039), ('luxembourg', 0.039), ('multitier', 0.039), ('plitt', 0.039), ('queued', 0.039), ('randomised', 0.039), ('sge', 0.039), ('soap', 0.039), ('versioncontrolled', 0.039), ('vienbas', 0.039), ('system', 0.038), ('application', 0.037), ('tools', 0.035), ('grid', 0.035), ('cat', 0.035), ('mode', 0.035), ('usage', 0.034), ('gatve', 0.034), ('latvia', 0.034), ('riga', 0.034), ('manages', 0.034), ('varga', 0.034), ('users', 0.033), ('api', 0.032), ('chart', 0.032), ('training', 0.032), ('lde', 0.031), ('upload', 0.031), ('compressed', 0.031), ('memories', 0.031), ('hardware', 0.031), ('scalability', 0.031), ('engine', 0.03), ('supports', 0.03), ('manager', 0.029), ('servers', 0.029), ('jobs', 0.029), ('executed', 0.029), ('monitored', 0.029), ('infrastructure', 0.028), ('dedicated', 0.028), ('meets', 0.028), ('toolkit', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation for developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

2 0.10054763 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

Author: Tong Xiao ; Jingbo Zhu ; Hao Zhang ; Qiang Li

Abstract: We present a new open source toolkit for phrase-based and syntax-based machine translation. The toolkit supports several state-of-the-art models developed in statistical machine translation, including the phrase-based model, the hierachical phrase-based model, and various syntaxbased models. The key innovation provided by the toolkit is that the decoder can work with various grammars and offers different choices of decoding algrithms, such as phrase-based decoding, decoding as parsing/tree-parsing and forest-based decoding. Moreover, several useful utilities were distributed with the toolkit, including a discriminative reordering model, a simple and fast language model, and an implementation of minimum error rate training for weight tuning. 1

3 0.089936942 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

Author: Preslav Nakov ; Jorg Tiedemann

Abstract: We propose several techniques for improving statistical machine translation between closely-related languages with scarce resources. We use character-level translation trained on n-gram-character-aligned bitexts and tuned using word-level BLEU, which we further augment with character-based transliteration at the word level and combine with a word-level translation model. The evaluation on Macedonian-Bulgarian movie subtitles shows an improvement of 2.84 BLEU points over a phrase-based word-level baseline.

4 0.084336795 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

Author: Seung-Wook Lee ; Dongdong Zhang ; Mu Li ; Ming Zhou ; Hae-Chang Rim

Abstract: In this paper, we propose a novel method of reducing the size of translation model for hierarchical phrase-based machine translation systems. Previous approaches try to prune infrequent entries or unreliable entries based on statistics, but cause a problem of reducing the translation coverage. On the contrary, the proposed method try to prune only ineffective entries based on the estimation of the information redundancy encoded in phrase pairs and hierarchical rules, and thus preserve the search space of SMT decoders as much as possible. Experimental results on Chinese-toEnglish machine translation tasks show that our method is able to reduce almost the half size of the translation model with very tiny degradation of translation performance.

5 0.077433214 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis

Author: Meritxell Gonzalez ; Jesus Gimenez ; Lluis Marquez

Abstract: Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA toolkit, a rich repository of evaluation measures working at different linguistic levels. The current implementation of the interface shows constituency and dependency trees as well as shallow syntactic and semantic annotations, and word alignments. The intelligent visualization of the linguistic structures used by the metrics, as well as a set of navigational functionalities, may lead towards advanced methods for automatic error analysis.

6 0.074503407 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation

7 0.074075401 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

8 0.072815329 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

9 0.070538186 140 acl-2012-Machine Translation without Words through Substring Alignment

10 0.068650894 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

11 0.068534561 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

12 0.067136757 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

13 0.063282788 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

14 0.063222282 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

15 0.061066568 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

16 0.060326863 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation

17 0.055837043 116 acl-2012-Improve SMT Quality with Automatically Extracted Paraphrase Rules

18 0.051270805 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

19 0.050761741 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

20 0.050722927 163 acl-2012-Prediction of Learning Curves in Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.145), (1, -0.094), (2, 0.069), (3, 0.047), (4, 0.061), (5, -0.002), (6, 0.02), (7, 0.032), (8, 0.005), (9, 0.017), (10, 0.007), (11, 0.057), (12, -0.01), (13, 0.052), (14, -0.024), (15, 0.0), (16, -0.049), (17, 0.026), (18, -0.035), (19, -0.028), (20, -0.014), (21, 0.043), (22, 0.03), (23, -0.036), (24, 0.074), (25, 0.008), (26, 0.036), (27, 0.081), (28, 0.025), (29, 0.006), (30, 0.056), (31, -0.063), (32, 0.049), (33, 0.046), (34, 0.036), (35, -0.008), (36, 0.019), (37, -0.005), (38, 0.008), (39, -0.127), (40, -0.03), (41, 0.083), (42, 0.109), (43, -0.08), (44, -0.055), (45, -0.206), (46, -0.127), (47, -0.091), (48, -0.134), (49, 0.085)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91026247 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation for developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

2 0.67736971 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user ofthe system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

3 0.65285641 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis

Author: Meritxell Gonzalez ; Jesus Gimenez ; Lluis Marquez

Abstract: Error analysis in machine translation is a necessary step in order to investigate the strengths and weaknesses of the MT systems under development and allow fair comparisons among them. This work presents an application that shows how a set of heterogeneous automatic metrics can be used to evaluate a test bed of automatic translations. To do so, we have set up an online graphical interface for the ASIYA toolkit, a rich repository of evaluation measures working at different linguistic levels. The current implementation of the interface shows constituency and dependency trees as well as shallow syntactic and semantic annotations, and word alignments. The intelligent visualization of the linguistic structures used by the metrics, as well as a set of navigational functionalities, may lead towards advanced methods for automatic error analysis.

4 0.57064116 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

Author: Marcis Pinnis ; Radu Ion ; Dan Stefanescu ; Fangzhong Su ; Inguna Skadina ; Andrejs Vasiljevs ; Bogdan Babych

Abstract: The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

5 0.50898534 163 acl-2012-Prediction of Learning Curves in Machine Translation

Author: Prasanth Kolachina ; Nicola Cancedda ; Marc Dymetman ; Sriram Venkatapathy

Abstract: Parallel data in the domain of interest is the key resource when training a statistical machine translation (SMT) system for a specific purpose. Since ad-hoc manual translation can represent a significant investment in time and money, a prior assesment of the amount of training data required to achieve a satisfactory accuracy level can be very useful. In this work, we show how to predict what the learning curve would look like if we were to manually translate increasing amounts of data. We consider two scenarios, 1) Monolingual samples in the source and target languages are available and 2) An additional small amount of parallel corpus is also available. We propose methods for predicting learning curves in both these scenarios.

6 0.473295 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

7 0.46003288 70 acl-2012-Demonstration of IlluMe: Creating Ambient According to Instant Message Logs

8 0.45802119 136 acl-2012-Learning to Translate with Multiple Objectives

9 0.44144651 43 acl-2012-Building Trainable Taggers in a Web-based, UIMA-Supported NLP Workbench

10 0.42212415 160 acl-2012-Personalized Normalization for a Multilingual Chat System

11 0.41729969 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

12 0.40384185 113 acl-2012-INPROwidth.3emiSS: A Component for Just-In-Time Incremental Speech Synthesis

13 0.39609364 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation

14 0.39037341 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence

15 0.38232642 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

16 0.37984437 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

17 0.37618163 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

18 0.35883385 162 acl-2012-Post-ordering by Parsing for Japanese-English Statistical Machine Translation

19 0.35458311 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

20 0.35294336 158 acl-2012-PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.042), (26, 0.051), (28, 0.019), (30, 0.016), (37, 0.039), (39, 0.086), (57, 0.027), (59, 0.013), (74, 0.036), (82, 0.019), (84, 0.021), (85, 0.067), (90, 0.08), (92, 0.036), (93, 0.298), (94, 0.011), (99, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73833311 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation for developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

2 0.47932717 118 acl-2012-Improving the IBM Alignment Models Using Variational Bayes

Author: Darcey Riley ; Daniel Gildea

Abstract: Bayesian approaches have been shown to reduce the amount of overfitting that occurs when running the EM algorithm, by placing prior probabilities on the model parameters. We apply one such Bayesian technique, variational Bayes, to the IBM models of word alignment for statistical machine translation. We show that using variational Bayes improves the performance of the widely used GIZA++ software, as well as improving the overall performance of the Moses machine translation system in terms of BLEU score.

3 0.44866598 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

Author: Gerard de Melo ; Gerhard Weikum

Abstract: We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.

4 0.44632688 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

Author: Roberto Navigli ; Simone Paolo Ponzetto

Abstract: In this paper we present an API for programmatic access to BabelNet a wide-coverage multilingual lexical knowledge base and multilingual knowledge-rich Word Sense Disambiguation (WSD). Our aim is to provide the research community with easy-to-use tools to perform multilingual lexical semantic analysis and foster further research in this direction. – –

5 0.42559785 21 acl-2012-A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle

Author: Hao Wang ; Dogan Can ; Abe Kazemzadeh ; Francois Bar ; Shrikanth Narayanan

Abstract: This paper describes a system for real-time analysis of public sentiment toward presidential candidates in the 2012 U.S. election as expressed on Twitter, a microblogging service. Twitter has become a central site where people express their opinions and views on political parties and candidates. Emerging events or news are often followed almost instantly by a burst in Twitter volume, providing a unique opportunity to gauge the relation between expressed public sentiment and electoral events. In addition, sentiment analysis can help explore how these events affect public opinion. While traditional content analysis takes days or weeks to complete, the system demonstrated here analyzes sentiment in the entire Twitter traffic about the election, delivering results instantly and continuously. It offers the public, the media, politicians and scholars a new and timely perspective on the dynamics of the electoral process and public opinion. 1

6 0.4250167 7 acl-2012-A Computational Approach to the Automation of Creative Naming

7 0.42262647 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

8 0.42152652 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

9 0.41854632 132 acl-2012-Learning the Latent Semantics of a Concept from its Definition

10 0.41819185 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

11 0.41618058 130 acl-2012-Learning Syntactic Verb Frames using Graphical Models

12 0.41611442 133 acl-2012-Learning to "Read Between the Lines" using Bayesian Logic Programs

13 0.41398948 175 acl-2012-Semi-supervised Dependency Parsing using Lexical Affinities

14 0.41394347 187 acl-2012-Subgroup Detection in Ideological Discussions

15 0.41332048 29 acl-2012-Assessing the Effect of Inconsistent Assessors on Summarization Evaluation

16 0.41324058 165 acl-2012-Probabilistic Integration of Partial Lexical Information for Noise Robust Haptic Voice Recognition

17 0.413147 102 acl-2012-Genre Independent Subgroup Detection in Online Discussion Threads: A Study of Implicit Attitude using Textual Latent Semantics

18 0.41277581 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

19 0.41229582 80 acl-2012-Efficient Tree-based Approximation for Entailment Graph Learning

20 0.41210687 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence