51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Source: pdf

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

Abstract: This paper presents AnnoMarket, an open cloud-based platform which enables researchers to deploy, share, and use language processing components and resources, following the data-as-a-service and software-as-a-service paradigms. The focus is on multilingual text analysis resources and services, based on an open-source infrastructure and compliant with relevant NLP standards. We demonstrate how the AnnoMarket platform can be used to develop NLP applications with little or no programming, to index the results for enhanced browsing and search, and to evaluate performance. Utilising AnnoMarket is straightforward, since cloud infrastructural issues are dealt with by the platform, completely transparently to the user: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper presents AnnoMarket, an open cloud-based platform which enables researchers to deploy, share, and use language processing components and resources, following the data-as-a-service and software-as-a-service paradigms. [sent-5, score-0.335]

2 The focus is on multilingual text analysis resources and services, based on an open-source infrastructure and compliant with relevant NLP standards. [sent-6, score-0.111]

3 We demonstrate how the AnnoMarket platform can be used to develop NLP applications with little or no programming, to index the results for enhanced browsing and search, and to evaluate performance. [sent-7, score-0.308]

4 Utilising AnnoMarket is straightforward, since cloud infrastructural issues are dealt with by the platform, completely transparently to the user: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance. [sent-8, score-0.686]

5 1 Introduction Following the Software-as-a-Service (SaaS) paradigm from cloud computing (Dikaiakos et al. [sent-9, score-0.196]

6 , 2009), a number of text processing services have been developed, e. [sent-10, score-0.271]

7 These provide information extraction services, accessible programmatically and charged per number of documents processed. [sent-13, score-0.025]

8 Secondly, the text processing algorithms are pre-packaged: it is not possible for researchers to extend the functional1http://www. [sent-16, score-0.031]

9 com Marin Dimitrov Ontotext AD 47A Tsarigradsko Shosse, Sofia, Bulgaria marin . [sent-20, score-0.037]

10 adapt such a service to recognise new kinds of entities). [sent-24, score-0.13]

11 Additionally, these text processing SaaS sites come with daily rate limits, in terms of number of API calls or documents that can be processed. [sent-25, score-0.025]

12 Consequently, using these services for research is not just limited in terms of text processing functionality offered, but also quickly becomes very expensive on large-scale datasets. [sent-26, score-0.318]

13 A moderately-sized collection of tweets, for example, comprises small but numerous documents, which can lead to unfeasibly high processing costs. [sent-27, score-0.024]

14 , 2009) are a type of cloud computing service which insulates developers from the low-level issues of utilising cloud infrastructures effectively, while providing facilities for efficient development, testing, and deployment of software over the Internet, following the SaaS model. [sent-29, score-0.794]

15 In the context of traditional NLP research and development, and pre-dating cloud computing, similar needs were addressed through NLP infrastructures, such as GATE (Cunningham et al. [sent-30, score-0.196]

16 These infrastructures accelerated significantly the pace of NLP research, through reusable algorithms (e. [sent-32, score-0.084]

17 rule-based pattern matching engines, machine learning algorithms), free tools for low-level NLP tasks, and support for multiple input and output document formats (e. [sent-34, score-0.058]

18 This demonstration introduces the AnnoMarket3 open, cloud-based platform, which has been developed following the PaaS paradigm. [sent-37, score-0.03]

19 It enables researchers to deploy, share, and use language processing components and resources, following the Data-as-a-Service (DaaS) and Software-as-a-Service (SaaS) paradigms. [sent-38, score-0.056]

20 It gives researchers access to an open, standard-compliant NLP infrastructure and enables them 3At the time of writing, a beta version of AnnoMarket is available at http://annomarket. [sent-39, score-0.199]

21 It supports not only NLP algorithm development and execution, but also on-demand collaborative corpus annotation and performance evaluation. [sent-43, score-0.196]

22 Important infrastructural issues are dealt with by the platform, completely transparently for the researcher: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance. [sent-44, score-0.49]

23 In that sense, it combines the ease of use of an NLP SaaS with the openness and comprehensive facilities of the GATE NLP infrastructure. [sent-50, score-0.045]

24 Additionally, as a specialised NLP PaaS, it also supports a bring-your-own-pipeline option, which can be built easily by reusing pre-existing GATE-compatible NLP components and adding some new ones. [sent-52, score-0.048]

25 Moreover, in addition to offering entity extraction services like OpenCalais, our NLP PaaS also supports manual corpus annotation, semantic indexing and search, and performance evaluation. [sent-53, score-0.33]

26 A demonstration of running AnnoMarket multilingual NLP services on large datasets, without programming. [sent-55, score-0.338]

27 The new service deployment facilities will also be shown, including how services can optionally be shared with others. [sent-56, score-0.603]

28 A demonstration on shared research corpora via the AnnoMarket platform, following the data-as-a-service model (the sharer is responsible for ensuring no copyright violations). [sent-58, score-0.03]

29 A demonstration of the large-scale search and browsing interface, which uses the results of the NLP SaaS to offer enhanced, semantic-based functionality. [sent-60, score-0.115]

30 2 The AnnoMarket NLP PaaS This section first discusses the methodology underpinning the AnnoMarket platform, then presents its architecture and key components. [sent-61, score-0.046]

31 1 Development and Deployment Methodology The development of text analysis algorithms and pipelines typically follows a certain methodological pattern, or lifecycle. [sent-63, score-0.147]

32 It is common to use double or triple annotation, where several people perform the annotation task independently and we then measure their level of agreement (Inter-Annotator Agreement, or IAA) to quantify and control the quality of this data (Hovy, 2010). [sent-65, score-0.113]
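
The agreement measurement mentioned in the sentence above can be illustrated with a small calculation. The text does not say which IAA statistic is used; the sketch below assumes Cohen's kappa for two annotators, which is one common choice, and the labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only (not data from the paper).
annotator_1 = ["Person", "Person", "Location", "O", "O", "Person"]
annotator_2 = ["Person", "Location", "Location", "O", "O", "Person"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

With three annotators (triple annotation), a pairwise average of this value or a multi-rater statistic would be needed instead.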

33 The AnnoMarket platform was therefore designed to offer full methodological support for all stages of the text analysis development lifecycle: 1. [sent-66, score-0.322]

34 Create an initial prototype of the NLP pipeline, testing on a small document collection, using the desktop-based GATE user interface (Cunningham et al. [sent-67, score-0.105]

35 If required, collect a gold-standard corpus for evaluation and/or training, using the GATE Teamware collaborative corpus annotation service (Bontcheva et al. [sent-69, score-0.28]

36 Evaluate the performance of the automatic pipeline on the gold standard (either locally in the GATE development environment or on the cloud). [sent-71, score-0.063]
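
Evaluating a pipeline against a gold standard usually comes down to comparing two sets of annotations. The sketch below assumes strict matching (same offsets and same type) and reports precision, recall, and F1; the paper does not specify the matching criteria, and the span tuples are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Strict-match precision, recall and F1 over two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each annotation is (start offset, end offset, type); values are illustrative.
gold_spans = {(0, 5, "Person"), (10, 16, "Location"),
              (20, 27, "Organization"), (30, 34, "Person")}
system_spans = {(0, 5, "Person"), (10, 16, "Organization"),
                (20, 27, "Organization")}
p, r, f = precision_recall_f1(gold_spans, system_spans)
```

Here the system finds two of the four gold annotations and adds one wrong one, so precision is 2/3 and recall is 1/2.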

37 Return to step 1 for further development and evaluation cycles, as needed. [sent-72, score-0.023]

38 Upload the large datasets and deploy the NLP pipeline on the AnnoMarket PaaS; 5. [sent-74, score-0.101]

39 Run the large-scale NLP experiment and download the results as XML or a standard linguistic annotation format (Ide and Romary, 2004). [sent-75, score-0.171]

40 AnnoMarket also offers scalable semantic indexing and search over the linguistic annotations and document content. [sent-76, score-0.102]

41 AnnoMarket is fully compatible with the GATE open-source architecture (Cunningham et al. [sent-79, score-0.046]

42 , 2002), in order to benefit from GATE’s numerous reusable and multilingual text processing components, and also from its infrastructural support for linguistic standards and diverse input formats. [sent-80, score-0.08]

43 2 Architecture The architecture of the AnnoMarket PaaS comprises four layers (see Figure 1: The AnnoMarket Architecture), combining components with related capabilities. [sent-82, score-0.071]

44 The fourth, web user interface layer, contains a number of UI components that allow researchers to use the AnnoMarket platform in various ways, e. [sent-88, score-0.38]

45 to run an already deployed text annotation service on a large dataset, to deploy and share a new service on the platform, or to upload (and optionally share) a document collection (i. [sent-90, score-0.796]

46 There is also support for finding relevant services, deployed on the AnnoMarket platform. [sent-93, score-0.112]

47 Lastly, due to the platform running on the Amazon cloud infrastructure, there are account management interfaces, including billing information, payments, and usage reports. [sent-94, score-0.577]

48 The first vertical aspect is cloud deployment on Amazon. [sent-95, score-0.351]

49 This covers support for automatic up and down-scaling of the allocated Amazon resources, detection of and recovery from Amazon infrastructure failures and network failures, and data backup. [sent-96, score-0.134]

50 Usage monitoring and billing is the second key vertical aspect, since fine-grained pay-as-you-go ability is essential. [sent-97, score-0.086]

51 Even in the case of freely-available annotation services, Amazon usage charges are incurred and thus such functionality is needed. [sent-98, score-0.075]

52 Various usage metrics are monitored and metered so that proper billing can be guaranteed, including: storage space required by language resources and data sets; CPU utilisation of the annotation services; number and size of documents processed. [sent-99, score-0.29]
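
As a rough illustration of how metered usage could be turned into a pay-as-you-go charge, the sketch below combines the metrics listed above into one amount. The record fields, the `RATES` table, and all prices are hypothetical; the paper does not describe AnnoMarket's actual tariff or billing code.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    storage_gb_hours: float     # storage used by resources and data sets
    cpu_hours: float            # CPU time consumed by annotation services
    documents: int              # number of documents processed
    megabytes_processed: float  # total size of documents processed


# Hypothetical unit prices, chosen only to make the arithmetic easy to follow.
RATES = {
    "storage_gb_hour": 0.001,
    "cpu_hour": 0.5,
    "document": 0.0001,
    "megabyte": 0.002,
}


def charge(record, rates=RATES):
    """Total charge for one usage record, rounded to two decimal places."""
    total = (record.storage_gb_hours * rates["storage_gb_hour"]
             + record.cpu_hours * rates["cpu_hour"]
             + record.documents * rates["document"]
             + record.megabytes_processed * rates["megabyte"])
    return round(total, 2)


usage = UsageRecord(storage_gb_hours=100, cpu_hours=4,
                    documents=10_000, megabytes_processed=250)
total = charge(usage)
```

For this record the components are 0.10 (storage), 2.00 (CPU), 1.00 (documents) and 0.50 (size), giving 3.60 in whatever currency the rates are expressed in.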

53 In addition, we have implemented a REST programming API for AnnoMarket, so that data upload and download and running of annotation services can all be done automatically, outside of the web interface. [sent-101, score-0.616]
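
To give a feel for driving such a REST API from a script, the sketch below builds, but deliberately does not send, an HTTP request that would start an annotation job. Everything specific here is an assumption: the base URL is a placeholder, and the `/jobs` path, the JSON field names, and the use of HTTP Basic authentication are hypothetical, since the paper does not document the API.

```python
import base64
import json
import urllib.request

# Placeholder host, NOT the real AnnoMarket endpoint.
BASE_URL = "https://annomarket.example.invalid/api"


def build_job_request(api_key, api_secret, pipeline_id, input_location):
    """Build (without sending) a POST request describing an annotation job."""
    # Hypothetical payload: which pipeline to run and where the input lives.
    payload = json.dumps(
        {"pipeline": pipeline_id, "input": input_location}
    ).encode("utf-8")
    # Hypothetical authentication scheme: HTTP Basic with a key/secret pair.
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8"))
    return urllib.request.Request(
        url=f"{BASE_URL}/jobs",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token.decode("ascii"),
        },
    )


request = build_job_request("my-key", "my-secret",
                            "english-ner", "s3://my-bucket/tweets/")
```

Passing the resulting object to `urllib.request.urlopen` would perform the call; that step is left out because the endpoint above does not exist.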

54 crawled web content, users’ own corpora (private or shared with others), results from running the annotation services, etc. [sent-108, score-0.15]

55 , XML, HTML, JSON, PDF, DOC), based on GATE’s comprehensive format support. [sent-111, score-0.028]

56 In all cases, when a document is being processed by AnnoMarket, the format is analysed and converted into a single unified, graph-based model of annotation: the one of the GATE NLP framework (Cunningham et al. [sent-112, score-0.063]
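
A stand-off annotation model of this kind can be pictured as typed spans over the document text, each with a map of features. The Python classes below are a simplified illustration of that idea, assuming character offsets as span boundaries; they are not GATE's Java API, and the class and method names are invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str    # e.g. "Token", "Person", "Location"
    start: int   # character offset where the span begins
    end: int     # character offset just past the end of the span
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, type_, start, end, **features):
        """Attach a typed span to the text without modifying the text."""
        if not 0 <= start <= end <= len(self.text):
            raise ValueError("offsets outside document text")
        annotation = Annotation(type_, start, end, features)
        self.annotations.append(annotation)
        return annotation

    def covered_text(self, annotation):
        """Return the stretch of text an annotation points at."""
        return self.text[annotation.start:annotation.end]


doc = Document("Hamish Cunningham works in Sheffield.")
person = doc.annotate("Person", 0, 17)
place = doc.annotate("Location", 27, 36, source="gazetteer")
```

Because annotations only reference offsets, the same representation can hold the output of any input format once the text has been extracted, which is what lets one model serve annotation, indexing, and search.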

57 Then this internal annotation format is also used by the collaborative corpus annotation web tool, and for annotation indexing and search. [sent-114, score-0.44]

58 S3 provides a REST service for content access, as well as direct HTTP access, which provides an easy way for AnnoMarket users to upload and download content. [sent-117, score-0.36]

59 While stored on the cloud, data is protected by Amazon’s security procedures. [sent-118, score-0.056]

60 All transfers between the cloud storage, the annotation services, and the user’s computer are done via an encrypted channel, using SSL. [sent-119, score-0.309]

61 4 The Platform Layer The AnnoMarket platform provides an environment where text processing applications can be deployed as annotation services on the cloud. [sent-121, score-0.75]

62 It allows processing pipelines that were produced on a developer's stand-alone computer to be deployed seamlessly on distributed hardware resources (the compute cloud) with the aim of processing large amounts of data in a timely fashion. [sent-122, score-0.25]

63 This process needs to be resilient in the face of failures at the level of the cloud infrastructure, the network communication, errors in the processing pipeline and in the input data. [sent-123, score-0.304]

64 The platform layer determines the optimal number of virtual machines for running a given NLP application, given the size of the document collection to be processed and taking into account the overhead in starting up new virtual machines on demand. [sent-124, score-0.628]
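
The trade-off described above, between adding machines and paying their start-up overhead, can be sketched with a simple estimate. The paper does not give the platform's actual sizing algorithm; the model below is an assumption in which every machine takes the same time to start and processes documents at the same rate, and it picks the smallest machine count that meets a deadline. All parameter names and numbers are hypothetical.

```python
def plan_machines(num_docs, docs_per_hour, startup_hours,
                  deadline_hours, max_machines=20):
    """Smallest machine count whose estimated wall-clock time meets the deadline.

    Returns (machines, estimated_hours), or None if even max_machines is too slow.
    """
    for machines in range(1, max_machines + 1):
        # Start-up happens in parallel, so it is paid once in wall-clock terms.
        wall_clock = startup_hours + num_docs / (docs_per_hour * machines)
        if wall_clock <= deadline_hours:
            return machines, wall_clock
    return None


# One million documents, 50,000 documents per machine-hour, 6 minutes start-up.
plan = plan_machines(num_docs=1_000_000, docs_per_hour=50_000,
                     startup_hours=0.1, deadline_hours=4)
```

With these numbers five machines would need 4.1 hours, just missing the 4-hour deadline, so the sketch settles on six. A real scheduler would also have to account for billing granularity and failed machines.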

65 The implementation is designed to be robust in the face of hardware failures and processing errors. [sent-125, score-0.102]

66 Users can upload any pipelines compliant with the GATE Processing Resource (PR) model and these are automatically deployed as annotation services on the AnnoMarket platform. [sent-130, score-0.785]

67 5 Annotation Services As discussed above, the platform layer in AnnoMarket addresses most of the technical and methodological requirements towards the NLP PaaS, making the deployment, execution, and sharing of annotation services (i. [sent-132, score-0.807]

68 pipelines and algorithms) a straightforward task. [sent-134, score-0.079]

69 While the job is running, a regularly updated execution log is made available in the user’s dashboard. [sent-136, score-0.078]

70 Upon job completion, an email notification is also sent. [sent-137, score-0.041]

71 Most of the implementation details are hidden away from the user, who interacts with the system through a web-based job editor, depicted in Figure 2, or through a REST API. [sent-138, score-0.041]

72 The number of already deployed annotation services on the platform is growing continuously. [sent-139, score-0.75]

73 Figure 3 shows a subset of them, as well as the metadata tags associated with these services, so that users can quickly restrict which types of services they are after and then be shown only the relevant subset. [sent-140, score-0.035]

74 At the time of writing, there are services of the following kinds: • Part-of-Speech-Taggers for English, German, Dutch, and Hungarian. [sent-141, score-0.271]

75 The deployment of new annotation services is done via a web interface (see Figure 4), where an administrator needs to configure some basic details related to the utilisation of the platform layer and provide a self-contained GATE-compatible application. [sent-150, score-1.005]

76 Platform users can only publish their own annotation services by contacting an administrator, who can validate the provided pipeline before making it publicly available to the other users. [sent-151, score-0.459]

77 This step is intended to protect the users community from malicious or poor quality pipelines. [sent-152, score-0.035]

78 3 Search and Browsing of Annotated Corpora The AnnoMarket platform also includes a service for indexing and searching over a collection of semantically annotated documents. [sent-153, score-0.469]

79 The output of an annotation service (see Figure 2) can be fed directly into a search index, which is created as the service is run on the documents. [sent-154, score-0.404]

80 This provides facilities for searching over different views of document text, for example one can search the document’s words, the part-of-speech of those words, or their morphological roots. [sent-155, score-0.091]

81 As well as searching the document text, we also support searches over the documents’ semantic annotations, e. [sent-156, score-0.06]

82 Figure 5 shows a semantic search over 80,000 news web pages from the BBC. [sent-159, score-0.031]

83 They have first been pre-processed with the POS tagging, morphological analysis, and NER services on the platform and the output indexed automatically. [sent-160, score-0.525]

84 The search query is for documents, where entities of type Person are followed by any morphological form of the verb say, i. [sent-161, score-0.031]
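
The kind of query described above can be mimicked over a toy, already-annotated sequence. The sketch below assumes each unit carries its text, an optional entity type, and a morphological root, and it returns every Person that is immediately followed by a form of "say". The data layout and function name are hypothetical; the platform's real query language is not shown in this text.

```python
def person_says(units):
    """Return (person, verb) pairs where a Person is directly followed by a form of 'say'."""
    hits = []
    for current, following in zip(units, units[1:]):
        if current.get("entity") == "Person" and following.get("root") == "say":
            hits.append((current["text"], following["text"]))
    return hits


# Toy annotated sequence; entity mentions are treated as single units.
units = [
    {"text": "Yesterday", "entity": None, "root": "yesterday"},
    {"text": "Jane Smith", "entity": "Person", "root": "Jane Smith"},
    {"text": "said", "entity": None, "root": "say"},
    {"text": "that", "entity": None, "root": "that"},
    {"text": "Sheffield", "entity": "Location", "root": "Sheffield"},
    {"text": "says", "entity": None, "root": "say"},
    {"text": "John Doe", "entity": "Person", "root": "John Doe"},
    {"text": "saying", "entity": None, "root": "say"},
]
matches = person_says(units)
```

Matching on the root rather than the surface word is what lets "said" and "saying" satisfy the same query, while the entity type rules out "Sheffield says".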

85 4 Conclusion This paper described a cloud-based open platform for text mining, which aims to assist the development and deployment of robust, large-scale text processing applications. [sent-164, score-0.433]

86 By supporting the sharing of annotation pipelines, AnnoMarket also promotes reuse and repeatability of experiments. [sent-165, score-0.113]

87 As the number of annotation services offered by the platform has grown, we identified a need for service search, so that users can locate useful NLP services more effectively. [sent-166, score-1.101]

88 We are currently developing a new UI, which offers search and browsing functionality, alongside various criteria, such as functionality (e. [sent-167, score-0.132]

89 POS tagger, named entity recogniser), user ratings, natural language supported). [sent-169, score-0.036]

90 A beta version is currently open to researchers for experimentation. [sent-171, score-0.084]

91 Within the next six months we plan to solicit more shared annotation pipelines to be deployed on the platform by other researchers. [sent-172, score-0.558]

92 Gate: an architecture for development of robust hlt applications. [sent-184, score-0.069]

93 Getting more out of biomedical documents with gate’s full lifecycle open 6See http : / /www . [sent-189, score-0.117]

94 Cloud computing: Distributed internet computing for IT and scientific research. [sent-196, score-0.025]

95 Building the scientific knowledge mine (SciKnowMine): a communitydriven framework for text mining tools in direct service to biocuration. [sent-236, score-0.13]

96 In Proceedings of the ACL-02 workshop on Natural Language Processing in the biomedical domain, 7– 12 July 2002, volume 3, pages 9–13, Philadelphia, PA. [sent-248, score-0.033]

97 A distributed text mining system for online web textual data analysis. [sent-252, score-0.025]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('annomarket', 0.601), ('services', 0.271), ('gate', 0.255), ('platform', 0.254), ('cloud', 0.196), ('paas', 0.166), ('upload', 0.165), ('cunningham', 0.159), ('deployment', 0.131), ('service', 0.13), ('saas', 0.124), ('layer', 0.124), ('annotation', 0.113), ('deployed', 0.112), ('tablan', 0.104), ('hamish', 0.092), ('kalina', 0.085), ('pipelines', 0.079), ('roberts', 0.073), ('nlp', 0.071), ('amazon', 0.07), ('failures', 0.068), ('infrastructure', 0.066), ('billing', 0.062), ('dikaiakos', 0.062), ('deploy', 0.061), ('security', 0.056), ('encryption', 0.055), ('infrastructures', 0.055), ('browsing', 0.054), ('valentin', 0.052), ('infrastructural', 0.051), ('bontcheva', 0.048), ('virtual', 0.048), ('functionality', 0.047), ('storage', 0.046), ('architecture', 0.046), ('facilities', 0.045), ('methodological', 0.045), ('compliant', 0.045), ('xml', 0.044), ('uima', 0.044), ('romary', 0.041), ('tanabe', 0.041), ('utilisation', 0.041), ('utilising', 0.041), ('job', 0.041), ('pipeline', 0.04), ('execution', 0.037), ('running', 0.037), ('administrator', 0.037), ('authentication', 0.037), ('marin', 0.037), ('teamware', 0.037), ('transparently', 0.037), ('collaborative', 0.037), ('user', 0.036), ('indexing', 0.036), ('ide', 0.035), ('users', 0.035), ('document', 0.035), ('interface', 0.034), ('hardware', 0.034), ('fault', 0.034), ('lifecycle', 0.034), ('transport', 0.034), ('biomedical', 0.033), ('json', 0.032), ('doc', 0.032), ('pdf', 0.032), ('search', 0.031), ('researchers', 0.031), ('demonstration', 0.03), ('download', 0.03), ('balancing', 0.029), ('ramakrishnan', 0.029), ('reusable', 0.029), ('access', 0.029), ('machines', 0.029), ('usage', 0.028), ('beta', 0.028), ('format', 0.028), ('offered', 0.027), ('optionally', 0.026), ('ferrucci', 0.026), ('sheffield', 0.025), ('searching', 0.025), ('components', 0.025), ('documents', 0.025), ('internet', 0.025), ('distributed', 0.025), ('open', 0.025), ('load', 0.024), ('collection', 0.024), ('vertical', 0.024), ('supports', 0.023), ('development', 0.023), ('formats', 0.023), ('ui', 0.023), ('ian', 0.023)]
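
Word weights like those above, and the similarity scores in the lists that follow, are typically obtained by weighting terms with tf-idf and comparing documents by cosine similarity. This page does not say which tf-idf variant or tokenisation it uses, so the sketch below is an assumption: relative term frequency, a plain logarithmic idf, and whitespace tokenisation over three toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """One sparse {term: weight} vector per document (tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy documents, not the actual paper texts.
docs = [
    "cloud platform for nlp services",
    "nlp pipelines on a cloud platform",
    "dependency parsing with neural networks",
]
vec = tfidf_vectors(docs)
sim_related = cosine(vec[0], vec[1])
sim_unrelated = cosine(vec[0], vec[2])
```

The first two documents share several terms and score above zero, while the third shares none with the first and scores exactly zero, which mirrors how the ranked lists below order candidate papers.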

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

2 0.11342809 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

3 0.10679118 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

4 0.10155258 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

Author: Seid Muhie Yimam ; Iryna Gurevych ; Richard Eckart de Castilho ; Chris Biemann

Abstract: We present WebAnno, a general purpose web-based annotation tool for a wide range of linguistic annotations. WebAnno offers annotation project management, freely configurable tagsets and the management of users in different roles. WebAnno uses modern web technology for visualizing and editing annotations in a web browser. It supports arbitrarily large documents, pluggable import/export filters, the curation of annotations across various users, and an interface to farming out annotations to a crowdsourcing platform. Currently WebAnno allows part-of-speech, named entity, dependency parsing and co-reference chain annotations. The architecture design allows adding additional modes of visualization and editing, when new kinds of annotations are to be supported.

5 0.077382654 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

Author: Pedro Fialho ; Luisa Coheur ; Sergio Curto ; Pedro Claudio ; Angela Costa ; Alberto Abad ; Hugo Meinedo ; Isabel Trancoso

Abstract: In this paper we describe a platform for embodied conversational agents with tutoring goals, which takes as input written and spoken questions and outputs answers in both forms. The platform is developed within a game environment, and currently allows speech recognition and synthesis in Portuguese, English and Spanish. In this paper we focus on its understanding component that supports in-domain interactions, and also small talk. Most in-domain interactions are answered using different similarity metrics, which compare the perceived utterances with questions/sentences in the agent’s knowledge base; small-talk capabilities are mainly due to AIML, a language largely used by the chatbots’ community. In this paper we also introduce EDGAR, the butler of MONSERRATE, which was developed in the aforementioned platform, and that answers tourists’ questions about MONSERRATE.

6 0.071312882 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

7 0.057347715 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

8 0.055311359 120 acl-2013-Dirt Cheap Web-Scale Parallel Text from the Common Crawl

9 0.053088464 216 acl-2013-Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language

10 0.051213674 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

11 0.049453206 355 acl-2013-TransDoop: A Map-Reduce based Crowdsourced Translation for Complex Domain

12 0.046859492 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

13 0.043172199 219 acl-2013-Learning Entity Representation for Entity Disambiguation

14 0.042703524 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

15 0.041000657 388 acl-2013-Word Alignment Modeling with Context Dependent Deep Neural Network

16 0.040058959 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

17 0.03698501 204 acl-2013-Iterative Transformation of Annotation Guidelines for Constituency Parsing

18 0.03542877 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

19 0.035270154 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

20 0.032936804 368 acl-2013-Universal Dependency Annotation for Multilingual Parsing

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.094), (1, 0.027), (2, -0.016), (3, -0.027), (4, 0.027), (5, -0.016), (6, 0.014), (7, -0.033), (8, 0.05), (9, -0.002), (10, -0.059), (11, 0.017), (12, -0.043), (13, 0.023), (14, -0.057), (15, -0.047), (16, -0.033), (17, 0.042), (18, -0.014), (19, -0.094), (20, -0.078), (21, -0.028), (22, -0.13), (23, -0.005), (24, -0.048), (25, -0.089), (26, 0.044), (27, 0.002), (28, -0.006), (29, -0.023), (30, -0.026), (31, 0.017), (32, -0.126), (33, -0.045), (34, 0.048), (35, 0.143), (36, -0.036), (37, 0.034), (38, 0.08), (39, -0.049), (40, 0.005), (41, -0.008), (42, -0.03), (43, 0.041), (44, -0.036), (45, 0.029), (46, 0.038), (47, -0.003), (48, 0.081), (49, -0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94408339 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

2 0.89472294 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

3 0.84194428 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

4 0.79746372 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

Author: Seid Muhie Yimam ; Iryna Gurevych ; Richard Eckart de Castilho ; Chris Biemann

5 0.57116252 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

Author: Pedro Fialho ; Luisa Coheur ; Sergio Curto ; Pedro Claudio ; Angela Costa ; Alberto Abad ; Hugo Meinedo ; Isabel Trancoso

6 0.51590121 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

7 0.51536572 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

8 0.49345255 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

9 0.48792347 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

10 0.43171665 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

11 0.40377858 367 acl-2013-Universal Conceptual Cognitive Annotation (UCCA)

12 0.39757138 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

13 0.38819152 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

14 0.38577557 270 acl-2013-ParGramBank: The ParGram Parallel Treebank

15 0.36222598 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

16 0.34731102 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

17 0.3470245 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

18 0.34043977 29 acl-2013-A Visual Analytics System for Cluster Exploration

19 0.34034461 302 acl-2013-Robust Automated Natural Language Processing with Multiword Expressions and Collocations

20 0.33990234 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.123), (6, 0.034), (11, 0.041), (15, 0.016), (24, 0.034), (26, 0.052), (28, 0.012), (35, 0.049), (42, 0.066), (48, 0.026), (64, 0.03), (68, 0.331), (70, 0.013), (71, 0.011), (88, 0.016), (90, 0.022), (95, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78016907 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

2 0.68465614 180 acl-2013-Handling Ambiguities of Bilingual Predicate-Argument Structures for Statistical Machine Translation

Author: Feifei Zhai ; Jiajun Zhang ; Yu Zhou ; Chengqing Zong

Abstract: Predicate-argument structure (PAS) has been demonstrated to be very effective in improving SMT performance. However, since a source-side PAS might correspond to multiple different target-side PASs, there usually exist many PAS ambiguities during translation. In this paper, we group PAS ambiguities into two types: role ambiguity and gap ambiguity. Then we propose two novel methods to handle the two PAS ambiguities for SMT accordingly: 1) inside context integration; 2) a novel maximum entropy PAS disambiguation (MEPD) model. In this way, we incorporate rich context information of PAS for disambiguation. Then we integrate the two methods into a PAS-based translation framework. Experiments show that our approach helps to achieve significant improvements on translation quality.

3 0.44385448 326 acl-2013-Social Text Normalization using Contextual Graph Random Walks

Author: Hany Hassan ; Arul Menezes

Abstract: We introduce a social media text normalization system that can be deployed as a preprocessing step for Machine Translation and various NLP applications to handle social media text. The proposed system is based on unsupervised learning of the normalization equivalences from unlabeled text. The proposed approach uses Random Walks on a contextual similarity bipartite graph constructed from n-gram sequences on large unlabeled text corpus. We show that the proposed approach has a very high precision of (92.43) and a reasonable recall of (56.4). When used as a preprocessing step for a state-of-the-art machine translation system, the translation quality on social media text improved by 6%. The proposed approach is domain and language independent and can be deployed as a preprocessing step for any NLP application to handle social media text.

4 0.44294864 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

5 0.44254506 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

Author: Sujith Ravi

Abstract: In this paper, we propose a new Bayesian inference method to train statistical machine translation systems using only nonparallel corpora. Following a probabilistic decipherment approach, we first introduce a new framework for decipherment training that is flexible enough to incorporate any number/type of features (besides simple bag-of-words) as side-information used for estimating translation models. In order to perform fast, efficient Bayesian inference in this framework, we then derive a hash sampling strategy that is inspired by the work of Ahmed et al. (2012). The new translation hash sampler enables us to scale elegantly to complex models (for the first time) and large vocabulary/corpora sizes. We show empirical results on the OPUS data: our method yields the best BLEU scores compared to existing approaches, while achieving significant computational speedups (several orders faster). We also report, for the first time, BLEU score results for a large-scale MT task using only non-parallel data (EMEA corpus).

6 0.44111246 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

7 0.44031471 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions

8 0.43347961 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

9 0.43345636 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

10 0.43269497 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

11 0.42939681 269 acl-2013-PLIS: a Probabilistic Lexical Inference System

12 0.42906904 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model

13 0.42862603 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

14 0.42483923 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

15 0.42332813 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

16 0.42268229 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

17 0.42151719 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

18 0.42149842 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

19 0.41960981 70 acl-2013-Bilingually-Guided Monolingual Dependency Grammar Induction

20 0.41784692 208 acl-2013-Joint Inference for Heterogeneous Dependency Parsing