acl acl2013 acl2013-118 knowledge-graph by maker-knowledge-mining

118 acl-2013-Development and Analysis of NLP Pipelines in Argo


Source: pdf

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. [sent-7, score-0.184]

2 The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. [sent-8, score-0.139]

3 The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. [sent-9, score-0.156]

4 In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. [sent-10, score-0.135]

5 The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. [sent-11, score-0.1]

6 We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. [sent-12, score-0.175]

7 The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. [sent-13, score-0.218]

8 The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools. [sent-14, score-0.556]

9 For instance, the extraction of relationships between named entities in text is preceded by text segmentation, part-of-speech recognition, the recognition of named entities, and dependency parsing. [sent-16, score-0.03]

10 Unstructured Information Management Architecture (UIMA) (Ferrucci and Lally, 2004) is a framework that tackles the problem of interoperability of processing components. [sent-18, score-0.061]

11 UIMA has been gaining much interest from industry and academia alike for the past decade. [sent-20, score-0.139]

12 Notable repositories of UIMA-compliant tools include the U-Compare component library and DKPro (Gurevych et al.). [sent-21, score-0.15]

13 In this work we demonstrate Argo, a Web-based (remotely-accessed) workbench for collaborative development of text-processing workflows. [sent-26, score-0.135]

14 We focus primarily on the process of development and analysis of both individual processing components and workflows composed of such components. [sent-27, score-0.381]

15 Sections 3–5 discuss selected features that are useful in the development and analysis of components and workflows. [sent-40, score-0.141]

16 2 Overview of Argo Argo comes equipped with an ever-growing library of atomic processing components that can be put together by users to form meaningful pipelines or workflows. [sent-42, score-0.204]

17 The processing components range from simple data serialisers to complex text analytics and include text segmentation, part-of-speech tagging, parsing, named entity recognition, and discourse analysis. [sent-43, score-0.184]

18 Users interact with the workbench through a graphical user interface (GUI) that is accessible entirely through a Web browser. [sent-44, score-0.202]

19 Figure 1 shows two views of the interface: the main, resource management window (Figure 1(a)) and the workflow diagramming window (Figure 1(b)). [sent-45, score-0.293]

20 The Documents panel lists primarily user-owned files that are uploaded (through the GUI) by users into their respective personal spaces on the remote host. [sent-47, score-0.222]

21 Documents may also be generated as a result of executing workflows. [sent-48, score-0.244]

22 Generated documents (e.g., XML files containing annotations) are available for users to download. [sent-50, score-0.088]

23 Workflows are the user-defined arrangements of processing components together with their settings. [sent-53, score-0.106]

24 Users compose workflows through a flexible, graphical diagramming editor by connecting the components (represented as blocks) with lines signifying the flow of data between components (see Figure 1(b)). [sent-54, score-0.558]

25 In the simplest case, each participating component has at most one incoming and at most one outgoing connection; however, the system also supports multiple branching and merging points in the workflow. [sent-57, score-0.254]

26 For ease of use, components are categorized into readers, analytics, and consumers, indicating what role they are set to play in a workflow. [sent-59, score-0.106]

27 Readers are responsible for delivering data for processing and have only an outgoing port (represented as a green triangle). [sent-60, score-0.083]

28 The role of analytics is to modify incoming data structures and pass them on to following components in a workflow, and thus they have both incoming and outgoing ports. [sent-62, score-0.371]

29 Finally, the consumers are responsible for serialising or visualising (selected or all) annotations in the data structures without modification, and so they have only an incoming port. [sent-63, score-0.146]

30 The Processes panel lists resources that are created automatically when workflows are submitted for execution by users. [sent-64, score-0.329]

31 Users may follow the progress of the executing workflows (processes) as well as manage the execution from this panel. [sent-65, score-0.312]

32 The processing of workflows is carried out on remote servers, and thus frees users from using their own processing resources. [sent-66, score-0.327]

33 Argo supports and is based upon UIMA, and thus can run any UIMA-compliant processing component. [sent-68, score-0.034]

34 Each such component defines or imports type systems and modifies common analysis structures (CAS). [sent-69, score-0.202]

35 An example is a token with its text boundaries and a part-of-speech tag. [sent-73, score-0.033]

36 Feature structures may, and often do, refer to a subject of annotation (Sofa), a structure that (in text-processing applications) stores the text. [sent-74, score-0.108]

37 One such type is Annotation, which holds a reference to the Sofa the annotation is asserted about, and two features, begin and end, for marking the boundaries of a span of text. [sent-78, score-0.104]

38 A developer is free to extend any of the complex types. [sent-79, score-0.139]
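
For concreteness, the following is a minimal sketch of a UIMA analytic of the kind Argo can run, written against the Apache UIMA and uimaFIT libraries. The class name and the whitespace tokenisation logic are illustrative only and not taken from Argo; a real component would typically declare its own annotation type extending Annotation rather than indexing the base type directly.

    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    // Minimal UIMA analytic: reads the Sofa (document text) from the CAS and adds
    // Annotation feature structures whose begin/end features mark whitespace-delimited tokens.
    public class WhitespaceTokenAnnotator extends JCasAnnotator_ImplBase {
        @Override
        public void process(JCas jcas) throws AnalysisEngineProcessException {
            String text = jcas.getDocumentText();  // the subject of annotation (Sofa)
            int start = -1;
            for (int i = 0; i <= text.length(); i++) {
                boolean boundary = i == text.length() || Character.isWhitespace(text.charAt(i));
                if (!boundary && start < 0) {
                    start = i;                                        // a token begins here
                } else if (boundary && start >= 0) {
                    new Annotation(jcas, start, i).addToIndexes();    // mark the span [start, i)
                    start = -1;
                }
            }
        }
    }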

39 Although the Apache UIMA project provides an implementation of the UIMA framework, Argo incorporates home-grown solutions, especially in terms of the management of workflow processing. [sent-81, score-0.248]

40 This includes features such as workflow branching and merging points, user-interactive components (see Section 4), as well as distributed processing. [sent-82, score-0.354]

41 Additionally, in order to increase computing throughput, we have incorporated cloud computing capabilities into Argo, which is designed to work with various cloud computing providers. [sent-84, score-0.092]

42 Currently, Argo is capable of switching the processing of workflows to a local cluster of over 3,000 processor cores. [sent-86, score-0.215]

43 Further extensions to use the Microsoft Azure and Amazon EC2 cloud platforms are also planned. [sent-87, score-0.082]

44 The Argo platform is available entirely using RESTful Web services (Fielding and Taylor, 2002), and therefore it is possible to gain access to all or selected features of Argo by implementing a compliant client. [sent-88, score-0.061]

45 In fact, the “native” Web interface shown in Figure 1 is an example of such a client. [sent-89, score-0.03]

46 Argo includes a Generic Listener component that permits the execution of a UIMA component running externally to the Argo system. [sent-90, score-0.285]

47 Any component that a user wishes to deploy on the Argo system has to undergo a verification process, which could slow the development lifecycle were it not for the Generic Listener. [sent-96, score-0.206]

48 Generic Listener operates in a reverse manner to a traditional Web service; rather than Argo connecting to the developer’s component, the component connects to Argo. [sent-97, score-0.094]

49 This behaviour was deliberately chosen to avoid network-related issues, such as firewall port blocking, which could become a source of frustration to developers. [sent-98, score-0.068]

50 Argo will prompt the user with a unique URL, which must be supplied to the client component run by the user, allowing it to connect to the Argo workflow and continue its execution. [sent-100, score-0.377]

51 It contains a Maven structure, Eclipse IDE project files, and required libraries, in addition to a number of shell scripts to simplify the running of the component. [sent-102, score-0.029]

52 The project provides both a command-line interface (CLI) and GUI runner applications that take, as arguments, the name of the class of the locally developed component and the URL provided by Argo, upon each run of a workflow containing the remote component. [sent-103, score-0.434]

53 An example of a workflow with a Generic Listener is shown in Figure 2. [sent-104, score-0.214]

54 The workflow is designed for the analysis and evaluation of a solution (in this case, the automatic extraction of biological events) that is being developed locally by the user. [sent-105, score-0.358]

55 The reader (BioNLP ST Data Reader) provides text documents together with gold (i.e., manually created) event annotations. [sent-106, score-0.064]

56 These gold annotations were prepared for the BioNLP Shared Task. [sent-108, score-0.054]

57 The annotations are selectively removed with the Annotation Remover and the remaining data is sent onto the Generic Listener component, and consequently, onto the developer’s machine. [sent-109, score-0.106]

58 Figure 2 shows an example of a workflow for the development, analysis, and evaluation of a user-developed solution for the BioNLP Shared Task. [sent-112, score-0.214]

59 The developer’s locally run component can then connect to Argo, retrieve CASes from the running workflow, and for each CAS recreate the removed annotations as faithfully as possible. [sent-113, score-0.113]

60 The developer can then track the performance of their solution by observing standard information extraction measures (precision, recall, etc.). [sent-114, score-0.139]

61 These measures are computed by the Reference Evaluator component, which compares the original, gold annotations (coming from the reader) against the developer’s annotations (coming from the Generic Listener), and saves the measures for each document/CAS into a tabular-format file. [sent-115, score-0.202]
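
As a rough illustration of the kind of comparison such an evaluator performs (this is not Argo's Reference Evaluator; the span-key format and the sample values are made up for the example), gold and predicted annotations can be reduced to keys and scored per document:

    import java.util.HashSet;
    import java.util.Set;

    // Compares two sets of annotation keys (here "begin-end:Type") and reports
    // precision and recall for a single document.
    public class SpanEvaluation {
        public static double[] precisionRecall(Set<String> gold, Set<String> predicted) {
            Set<String> truePositives = new HashSet<>(predicted);
            truePositives.retainAll(gold);
            double p = predicted.isEmpty() ? 0.0 : (double) truePositives.size() / predicted.size();
            double r = gold.isEmpty() ? 0.0 : (double) truePositives.size() / gold.size();
            return new double[] {p, r};
        }

        public static void main(String[] args) {
            Set<String> gold = Set.of("12-19:Protein", "33-40:Protein", "55-62:Event");
            Set<String> predicted = Set.of("12-19:Protein", "55-62:Event", "70-75:Event");
            double[] pr = precisionRecall(gold, predicted);
            System.out.printf("precision=%.2f recall=%.2f%n", pr[0], pr[1]);  // 0.67 and 0.67
        }
    }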

62 Traditionally, NLP pipelines (including existing UIMA-supporting platforms), once set up, are executed without human involvement. [sent-117, score-0.049]

63 One of the novelties in Argo is the introduction of user-interactive components, a special type of analytic that, if present in a workflow, causes the execution of the workflow to pause. [sent-118, score-0.327]

64 Argo resumes the execution only after receiving input from a user. [sent-119, score-0.068]

65 Examples of user-interactive components include Annotation Editor and Brat BioNLP ST Comparator. [sent-123, score-0.106]

66 Figure 3 shows an example of an annotated fragment of a document visualised with the Brat BioNLP ST Comparator component. [sent-124, score-0.137]

67 The component highlights (in red and green) differences between two sources of annotations. [sent-125, score-0.094]

68 Figure 4 shows an example of manual annotation with the user-interactive Annotation Editor component. [sent-126, score-0.071]

69 The Brat BioNLP ST Comparator component expects two incoming connections from components processing the same subject of annotation. [sent-127, score-0.182]

70 As a result, using brat visualisation (Stenetorp et al., 2012), it will show annotation structures by laying them out above the text. [sent-128, score-0.175]

71 It will also mark differences between the two inputs by colour-coding missing or additional annotations in each input. [sent-129, score-0.162]

72 A sample of visualisation coming from the workflow in Figure 2 is shown in Figure 3. [sent-130, score-0.293]

73 Since in this particular workflow the Brat BioNLP ST Comparator receives gold annotations (from the BioNLP ST Data Reader) as one of its inputs, the highlighted differences are, in fact, false positives and false negatives. [sent-131, score-0.268]

74 Annotation Editor is another example of a user-interactive component that allows the user to add, delete or modify annotations. [sent-132, score-0.178]

75 The user has an option to create a span-of-text annotation by selecting a text fragment and assigning an annotation type. [sent-134, score-0.224]

76 Argo comes with several (de)serialisation components for reading and storing collections of data, such as a generic reader of text (Document Reader) or readers and writers of CASes in XMI format (CAS Reader and CAS Writer). [sent-138, score-0.258]

77 One of the more useful components for annotation analysis, however, is the RDF Writer, together with its counterpart, the RDF Reader. [sent-139, score-0.165]

78 RDF Writer serialises data into RDF files and supports several RDF formats such as RDF/XML, Turtle, and N-Triples. [sent-140, score-0.098]

79 A resulting RDF graph consists of both the data model (type system) and the data itself (CAS) and thus constitutes a self-contained knowledge base. [sent-141, score-0.027]

80 RDF Writer has an option to create a graph for each CAS or a single graph for an entire collection. [sent-142, score-0.054]

81 Figure 5 shows an example of a SPARQL query that is performed on the output of an RDF Writer in the workflow shown in Figure 1(b). [sent-144, score-0.279]

82 This workflow results in several types of annotations including the boundaries of sentences, tokens with part-of-speech tags and lemmas, chunks, as well as biological entities, such as DNA, RNA, cell line and cell type. [sent-145, score-0.412]

83 The SPARQL query is meant to retrieve pairs of seemingly interacting biological entities ranked according to their occurrence in the entire collection. [sent-146, score-0.206]

84 The interaction here is (naïvely) defined as the co-occurrence of two entities in the same sentence. [sent-147, score-0.03]

85 The query includes patterns for retrieving the boundaries of sentences (syn:Sentence) and two biological entities (sem:NamedEntity), and then filters the cross-product of those by ensuring that the two entities fall within the boundaries of the same sentence. [sent-148, score-0.265]

86 As a result, the query returns a list of biological entity pairs accompanied by their categories and the number of appearances, as shown in Figure 5(b). [sent-151, score-0.176]

87 Note that the query itself does not list the four biological categories; instead, it requests their common semantic ancestor sem:NamedEntity. [sent-152, score-0.176]
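
The serialised output can also be queried outside Argo with standard RDF tooling. The sketch below uses Apache Jena to run a co-occurrence query of the kind described above; the file name, namespace prefixes, and property names (syn:, sem:, uima:begin, uima:end) are assumptions made for illustration and do not reproduce Argo's actual RDF vocabulary or the query in Figure 5. It also simplifies the ranking by grouping on category pairs rather than on individual entity mentions.

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class CooccurrenceQuery {
        public static void main(String[] args) {
            // Load a graph produced by the RDF Writer (Turtle file name is illustrative).
            Model model = RDFDataMgr.loadModel("argo-output.ttl");
            String sparql =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "PREFIX syn:  <http://example.org/syntactic#> " +
                "PREFIX sem:  <http://example.org/semantic#> " +
                "PREFIX uima: <http://example.org/uima#> " +
                "SELECT ?cat1 ?cat2 (COUNT(*) AS ?n) WHERE { " +
                "  ?s  a syn:Sentence ; uima:begin ?sb ; uima:end ?se . " +
                "  ?e1 a ?cat1 ; uima:begin ?b1 ; uima:end ?x1 . ?cat1 rdfs:subClassOf* sem:NamedEntity . " +
                "  ?e2 a ?cat2 ; uima:begin ?b2 ; uima:end ?x2 . ?cat2 rdfs:subClassOf* sem:NamedEntity . " +
                // Both entities must lie inside the same sentence; ?b1 < ?b2 avoids duplicate pairs.
                "  FILTER (?b1 >= ?sb && ?x1 <= ?se && ?b2 >= ?sb && ?x2 <= ?se && ?b1 < ?b2) " +
                "} GROUP BY ?cat1 ?cat2 ORDER BY DESC(?n)";
            try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(sparql), model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row);  // category pair and its co-occurrence count
                }
            }
        }
    }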

88 Suppose a user is interested in placing the retrieved biological entity interactions from our running example into the UIMA structure Relationship that simply defines a pair of references to other structures of any type. [sent-155, score-0.216]

89 This can be accomplished, without resorting to programming, by issuing a SPARQL insert query shown in Figure 5(c). [sent-156, score-0.065]

90 The query will create triple statements compliant with the definition of Relationship. [sent-157, score-0.098]

91 The resulting modified RDF graph can then be read back to Argo by the RDF Reader component that will convert the new RDF graph back into a CAS. [sent-158, score-0.148]
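
In the same spirit, the data manipulation step can be sketched with Jena's SPARQL Update support: a hypothetical INSERT materialises one new Relationship node per co-occurring entity pair, and the enriched graph is written back out for the RDF Reader to pick up. As above, the uima:Relationship class and the uima:first/uima:second property names are illustrative assumptions, not Argo's documented schema or the query shown in Figure 5(c).

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;
    import org.apache.jena.riot.RDFFormat;
    import org.apache.jena.update.UpdateAction;
    import java.io.FileOutputStream;

    public class RelationshipInsert {
        public static void main(String[] args) throws Exception {
            Model model = RDFDataMgr.loadModel("argo-output.ttl");  // illustrative file name
            String update =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "PREFIX syn:  <http://example.org/syntactic#> " +
                "PREFIX sem:  <http://example.org/semantic#> " +
                "PREFIX uima: <http://example.org/uima#> " +
                // A blank node in the template creates a fresh Relationship per matched pair.
                "INSERT { _:rel a uima:Relationship ; uima:first ?e1 ; uima:second ?e2 } " +
                "WHERE { " +
                "  ?s  a syn:Sentence ; uima:begin ?sb ; uima:end ?se . " +
                "  ?e1 a ?c1 ; uima:begin ?b1 ; uima:end ?x1 . ?c1 rdfs:subClassOf* sem:NamedEntity . " +
                "  ?e2 a ?c2 ; uima:begin ?b2 ; uima:end ?x2 . ?c2 rdfs:subClassOf* sem:NamedEntity . " +
                "  FILTER (?b1 >= ?sb && ?x1 <= ?se && ?b2 >= ?sb && ?x2 <= ?se && ?b1 < ?b2) }";
            UpdateAction.parseExecute(update, model);
            // Write the enriched graph so it can be read back into a CAS downstream.
            try (FileOutputStream out = new FileOutputStream("argo-output-with-relationships.ttl")) {
                RDFDataMgr.write(out, model, RDFFormat.TURTLE);
            }
        }
    }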

92 Other notable examples of NLP platforms that provide graphical interfaces for managing workflows include GATE (Cunningham et al.) and U-Compare. [sent-159, score-0.284]

93 GATE is a standalone suite of text processing and annotation tools and comes with its own programming interface. [sent-162, score-0.1]

94 In contrast, U-Compare—similarly to Argo—uses UIMA as its base interoperability framework. [sent-163, score-0.061]

95 The key features of Argo that distinguish it from U-Compare are the Web availability of the platform, primarily remote processing of workflows, a multi-user, collaborative architecture, and the availability of user-interactive components. [sent-164, score-0.164]

96 Moreover, the presented annotation viewer and editor, performance evaluator, and lastly RDF (de)serialisers are indispensable for the analysis of processing tasks at hand. [sent-166, score-0.071]

97 Together with the distributed development support for developers wishing to create their own components or run their own tools with the help of resources available in Argo, the workbench becomes a powerful development and analytical NLP tool. [sent-167, score-0.368]

98 GATE: A framework and graphical development environment for robust NLP tools and applications. [sent-175, score-0.097]

99 U-Compare: An integrated language resource evaluation platform including a comprehensive UIMA resource library. [sent-199, score-0.028]

100 Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. [sent-203, score-0.094]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('argo', 0.619), ('uima', 0.316), ('workflows', 0.215), ('workflow', 0.214), ('rdf', 0.206), ('bionlp', 0.142), ('brat', 0.14), ('developer', 0.139), ('listener', 0.129), ('cas', 0.122), ('biological', 0.111), ('sparql', 0.111), ('components', 0.106), ('workbench', 0.1), ('component', 0.094), ('incoming', 0.076), ('annotation', 0.071), ('execution', 0.068), ('query', 0.065), ('reader', 0.064), ('comparator', 0.063), ('remote', 0.063), ('writer', 0.062), ('generic', 0.062), ('interoperability', 0.061), ('annotations', 0.054), ('editor', 0.053), ('gate', 0.052), ('architecture', 0.05), ('gui', 0.05), ('outgoing', 0.05), ('pipelines', 0.049), ('users', 0.049), ('ananiadou', 0.047), ('panel', 0.046), ('cloud', 0.046), ('diagramming', 0.045), ('fielding', 0.045), ('namedent', 0.045), ('serialisation', 0.045), ('serialisers', 0.045), ('userinteractive', 0.045), ('coming', 0.044), ('fragment', 0.043), ('st', 0.043), ('industry', 0.041), ('unstructured', 0.04), ('baumgartner', 0.04), ('ctakes', 0.04), ('kano', 0.04), ('stenetorp', 0.04), ('files', 0.039), ('user', 0.039), ('availability', 0.038), ('structures', 0.037), ('sofa', 0.037), ('manipulation', 0.037), ('evaluator', 0.037), ('hahn', 0.037), ('platforms', 0.036), ('development', 0.035), ('cunningham', 0.035), ('savova', 0.035), ('visualisation', 0.035), ('academia', 0.035), ('frustration', 0.035), ('distributed', 0.034), ('supports', 0.034), ('management', 0.034), ('manchester', 0.034), ('locally', 0.033), ('boundaries', 0.033), ('graphical', 0.033), ('port', 0.033), ('analytics', 0.033), ('rak', 0.033), ('compliant', 0.033), ('consumers', 0.033), ('alike', 0.033), ('repository', 0.032), ('gaining', 0.03), ('entities', 0.03), ('connect', 0.03), ('interface', 0.03), ('executing', 0.029), ('tools', 0.029), ('running', 0.029), ('analytical', 0.029), ('ferrucci', 0.029), ('julie', 0.028), ('platform', 0.028), ('repositories', 0.027), ('nlp', 0.027), ('graph', 0.027), ('apache', 0.026), ('readers', 0.026), ('ensuring', 0.026), ('onto', 0.026), ('primarily', 0.025), ('formats', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000011 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

2 0.37177566 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

3 0.13863626 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

Author: Seid Muhie Yimam ; Iryna Gurevych ; Richard Eckart de Castilho ; Chris Biemann

Abstract: We present WebAnno, a general purpose web-based annotation tool for a wide range of linguistic annotations. WebAnno offers annotation project management, freely configurable tagsets and the management of users in different roles. WebAnno uses modern web technology for visualizing and editing annotations in a web browser. It supports arbitrarily large documents, pluggable import/export filters, the curation of annotations across various users, and an interface to farming out annotations to a crowdsourcing platform. Currently WebAnno allows part-of-speech, named entity, dependency parsing and co-reference chain annotations. The architecture design allows adding additional modes of visualization and editing, when new kinds of annotations are to be supported.

4 0.11486369 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

Author: Tristan Miller ; Nicolai Erbs ; Hans-Peter Zorn ; Torsten Zesch ; Iryna Gurevych

Abstract: Implementations of word sense disambiguation (WSD) algorithms tend to be tied to a particular test corpus format and sense inventory. This makes it difficult to test their performance on new data sets, or to compare them against past algorithms implemented for different data sets. In this paper we present DKPro WSD, a freely licensed, general-purpose framework for WSD which is both modular and extensible. DKPro WSD abstracts the WSD process in such a way that test corpora, sense inventories, and algorithms can be freely swapped. Its UIMA-based architecture makes it easy to add support for new resources and algorithms. Related tasks such as word sense induction and entity linking are also supported.

5 0.11342809 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

Abstract: This paper presents AnnoMarket, an open cloud-based platform which enables researchers to deploy, share, and use language processing components and resources, following the data-as-a-service and software-as-a-service paradigms. The focus is on multilingual text analysis resources and services, based on an opensource infrastructure and compliant with relevant NLP standards. We demonstrate how the AnnoMarket platform can be used to develop NLP applications with little or no programming, to index the results for enhanced browsing and search, and to evaluate performance. Utilising AnnoMarket is straightforward, since cloud infrastructural issues are dealt with by the platform, completely transparently to the user: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance.

6 0.10468567 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

7 0.098507993 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

8 0.067366235 190 acl-2013-Implicatures and Nested Beliefs in Approximate Decentralized-POMDPs

9 0.065355346 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

10 0.052808229 386 acl-2013-What causes a causal relation? Detecting Causal Triggers in Biomedical Scientific Discourse

11 0.045686945 179 acl-2013-HYENA-live: Fine-Grained Online Entity Type Classification from Natural-language Text

12 0.045035455 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

13 0.04269693 99 acl-2013-Crowd Prefers the Middle Path: A New IAA Metric for Crowdsourcing Reveals Turker Biases in Query Segmentation

14 0.041481413 36 acl-2013-Adapting Discriminative Reranking to Grounded Language Learning

15 0.041366108 11 acl-2013-A Multi-Domain Translation Model Framework for Statistical Machine Translation

16 0.039681479 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

17 0.039438944 164 acl-2013-FudanNLP: A Toolkit for Chinese Natural Language Processing

18 0.039088119 123 acl-2013-Discriminative Learning with Natural Annotations: Word Segmentation as a Case Study

19 0.038503088 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

20 0.038413748 209 acl-2013-Joint Modeling of News Reader’s and Comment Writer’s Emotions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.104), (1, 0.028), (2, -0.021), (3, -0.055), (4, 0.025), (5, -0.015), (6, 0.012), (7, -0.038), (8, 0.078), (9, 0.005), (10, -0.096), (11, 0.04), (12, -0.105), (13, 0.043), (14, -0.026), (15, -0.075), (16, -0.003), (17, 0.06), (18, -0.01), (19, -0.083), (20, -0.122), (21, -0.012), (22, -0.176), (23, 0.042), (24, -0.156), (25, -0.149), (26, 0.082), (27, 0.017), (28, -0.084), (29, -0.077), (30, 0.006), (31, 0.029), (32, -0.266), (33, -0.09), (34, 0.043), (35, 0.229), (36, -0.093), (37, 0.033), (38, 0.018), (39, 0.04), (40, 0.008), (41, -0.042), (42, -0.034), (43, 0.046), (44, -0.043), (45, 0.153), (46, -0.007), (47, -0.023), (48, 0.163), (49, -0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95546442 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

2 0.89189929 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

3 0.82414132 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

Author: Valentin Tablan ; Kalina Bontcheva ; Ian Roberts ; Hamish Cunningham ; Marin Dimitrov

Abstract: This paper presents AnnoMarket, an open cloud-based platform which enables researchers to deploy, share, and use language processing components and resources, following the data-as-a-service and software-as-a-service paradigms. The focus is on multilingual text analysis resources and services, based on an opensource infrastructure and compliant with relevant NLP standards. We demonstrate how the AnnoMarket platform can be used to develop NLP applications with little or no programming, to index the results for enhanced browsing and search, and to evaluate performance. Utilising AnnoMarket is straightforward, since cloud infrastructural issues are dealt with by the platform, completely transparently to the user: load balancing, efficient data upload and storage, deployment on the virtual machines, security, and fault tolerance.

4 0.72642916 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

Author: Seid Muhie Yimam ; Iryna Gurevych ; Richard Eckart de Castilho ; Chris Biemann

Abstract: We present WebAnno, a general purpose web-based annotation tool for a wide range of linguistic annotations. WebAnno offers annotation project management, freely configurable tagsets and the management of users in different roles. WebAnno uses modern web technology for visualizing and editing annotations in a web browser. It supports arbitrarily large documents, pluggable import/export filters, the curation of annotations across various users, and an interface to farming out annotations to a crowdsourcing platform. Currently WebAnno allows part-of-speech, named entity, dependency parsing and co-reference chain annotations. The architecture design allows adding additional modes of visualization and editing, when new kinds of annotations are to be supported.

5 0.53042173 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

Author: Pedro Fialho ; Luisa Coheur ; Sergio Curto ; Pedro Claudio ; Angela Costa ; Alberto Abad ; Hugo Meinedo ; Isabel Trancoso

Abstract: In this paper we describe a platform for embodied conversational agents with tutoring goals, which takes as input written and spoken questions and outputs answers in both forms. The platform is developed within a game environment, and currently allows speech recognition and synthesis in Portuguese, English and Spanish. In this paper we focus on its understanding component that supports in-domain interactions, and also small talk. Most in-domain interactions are answered using different similarity metrics, which compare the perceived utterances with questions/sentences in the agent’s knowledge base; small-talk capabilities are mainly due to AIML, a language largely used by the chatbots’ community. In this paper we also introduce EDGAR, the butler of MONSERRATE, which was developed in the aforementioned platform, and that answers tourists’ questions about MONSERRATE.

6 0.4265728 285 acl-2013-Propminer: A Workflow for Interactive Information Extraction and Exploration using Dependency Trees

7 0.42279339 183 acl-2013-ICARUS - An Extensible Graphical Search Tool for Dependency Treebanks

8 0.41102073 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

9 0.39028293 268 acl-2013-PATHS: A System for Accessing Cultural Heritage Collections

10 0.38495189 52 acl-2013-Annotating named entities in clinical text by combining pre-annotation and active learning

11 0.34881276 48 acl-2013-An Open Source Toolkit for Quantitative Historical Linguistics

12 0.34868237 95 acl-2013-Crawling microblogging services to gather language-classified URLs. Workflow and case study

13 0.34393379 141 acl-2013-Evaluating a City Exploration Dialogue System with Integrated Question-Answering and Pedestrian Navigation

14 0.34062111 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

15 0.3190212 29 acl-2013-A Visual Analytics System for Cluster Exploration

16 0.30641842 295 acl-2013-Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

17 0.29855165 163 acl-2013-From Natural Language Specifications to Program Input Parsers

18 0.28542814 161 acl-2013-Fluid Construction Grammar for Historical and Evolutionary Linguistics

19 0.27864128 271 acl-2013-ParaQuery: Making Sense of Paraphrase Collections

20 0.27620381 265 acl-2013-Outsourcing FrameNet to the Crowd


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.212), (6, 0.041), (11, 0.044), (13, 0.013), (15, 0.012), (24, 0.043), (26, 0.044), (35, 0.054), (42, 0.06), (48, 0.022), (62, 0.236), (68, 0.016), (70, 0.032), (71, 0.011), (88, 0.019), (90, 0.019), (95, 0.042)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86703098 118 acl-2013-Development and Analysis of NLP Pipelines in Argo

Author: Rafal Rak ; Andrew Rowley ; Jacob Carter ; Sophia Ananiadou

Abstract: Developing sophisticated NLP pipelines composed of multiple processing tools and components available through different providers may pose a challenge in terms of their interoperability. The Unstructured Information Management Architecture (UIMA) is an industry standard whose aim is to ensure such interoperability by defining common data structures and interfaces. The architecture has been gaining attention from industry and academia alike, resulting in a large volume of UIMA-compliant processing components. In this paper, we demonstrate Argo, a Web-based workbench for the development and processing of NLP pipelines/workflows. The workbench is based upon UIMA, and thus has the potential of using many of the existing UIMA resources. We present features, and show examples, of facilitating the distributed development of components and the analysis of processing results. The latter includes annotation visualisers and editors, as well as serialisation to RDF format, which enables flexible querying in addition to data manipulation thanks to the semantic query language SPARQL. The distributed development feature allows users to seamlessly connect their tools to workflows running in Argo, and thus take advantage of both the available library of components (without the need of installing them locally) and the analytical tools.

2 0.80020756 198 acl-2013-IndoNet: A Multilingual Lexical Knowledge Network for Indian Languages

Author: Brijesh Bhatt ; Lahari Poddar ; Pushpak Bhattacharyya

Abstract: We present IndoNet, a multilingual lexical knowledge base for Indian languages. It is a linked structure of wordnets of 18 different Indian languages, Universal Word dictionary and the Suggested Upper Merged Ontology (SUMO). We discuss various benefits of the network and challenges involved in the development. The system is encoded in Lexical Markup Framework (LMF) and we propose modifications in LMF to accommodate Universal Word Dictionary and SUMO. This standardized version of lexical knowledge base of Indian Languages can now easily be linked to similar global resources.

3 0.71567416 362 acl-2013-Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers

Author: Andre Martins ; Miguel Almeida ; Noah A. Smith

Abstract: We present fast, accurate, direct non-projective dependency parsers with third-order features. Our approach uses AD3, an accelerated dual decomposition algorithm which we extend to handle specialized head automata and sequential head bigram models. Experiments in fourteen languages yield parsing speeds competitive to projective parsers, with state-of-the-art accuracies for the largest datasets (English, Czech, and German).

4 0.71162903 269 acl-2013-PLIS: a Probabilistic Lexical Inference System

Author: Eyal Shnarch ; Erel Segal-haLevi ; Jacob Goldberger ; Ido Dagan

Abstract: This paper presents PLIS, an open source Probabilistic Lexical Inference System which combines two functionalities: (i) a tool for integrating lexical inference knowledge from diverse resources, and (ii) a framework for scoring textual inferences based on the integrated knowledge. We provide PLIS with two probabilistic implementations of this framework. PLIS is available for download and developers of text processing applications can use it as an off-the-shelf component for injecting lexical knowledge into their applications. PLIS is easily configurable, components can be extended or replaced with user generated ones to enable system customization and further research. PLIS includes an online interactive viewer, which is a powerful tool for investigating lexical inference processes.
The integration of resources, each has its own format, is technically complex and the quality 97 ProceedingSsof oiaf, th Beu 5lg1asrtia A,n Anuuaglu Mst 4ee-9tin 2g0 o1f3. th ?ec A20ss1o3ci Aastisoonci faotrio Cno fomrp Cuotamtipountaalti Loinnaglu Lisitnigcsu,is patigcess 97–102, Figure 1: PLIS schema - a text-hypothesis pair is processed by the Lexical Integrator which uses a set of lexical resources to extract inference chains which connect the two. The Lexical Inference component provides probability estimations for the validity of each level of the process. ofthe resulting inference links is often unknown in advance and varies considerably. For coping with this challenge we developed PLIS, a Probabilistic Lexical Inference System1 . PLIS, illustrated in Fig 1, has two main modules: the Lexical Integra- tor (Section 2) accepts a set of lexical resources and a text-hypothesis pair, and finds all the lexical inference relations between any pair of text term ti and hypothesis term hj, based on the available lexical relations found in the resources (and their combination). The Lexical Inference module (Section 3) provides validity scores for these relations. These term-level scores are used to estimate the sentence-level likelihood that the meaning of the hypothesis can be inferred from the text, thus making PLIS a complete lexical inference system. Lexical inference systems do not look into the structure of texts but rather consider them as bag ofterms (words or multi-word expressions). These systems are easy to implement, fast to run, practical across different genres and languages, while maintaining a competitive level of performance. PLIS can be used as a stand-alone efficient inference system or as the lexical component of any NLP application. PLIS is a flexible system, allowing users to choose the set of knowledge resources as well as the model by which inference 1The complete software package is available at http:// www.cs.biu.ac.il/nlp/downloads/PLIS.html and an online interactive viewer is available for examination at http://irsrv2. cs.biu.ac.il/nlp-net/PLIS.html. is done. PLIS can be easily extended with new knowledge resources and new inference models. It comes with a set of ready-to-use plug-ins for many common lexical resources (Section 2.1) as well as two implementation of the scoring framework. These implementations, described in (Shnarch et al., 2011; Shnarch et al., 2012), provide probability estimations for inference. PLIS has an interactive online viewer (Section 4) which provides a visualization of the entire inference process, and is very helpful for analysing lexical inference models and lexical resources usability. 2 Lexical integrator The input for the lexical integrator is a set of lexical resources and a pair of text T and hypothesis H. The lexical integrator extracts lexical inference links from the various lexical resources to connect each text term ti ∈ T with each hypothesis term hj ∈ H2. A lexical i∈nfTer wenicthe elianckh hinydpicoathteess a semantic∈ rHelation between two terms. It could be a directional relation (Columbus→navigator) or a bai ddiirreeccttiioonnaall one (car ←→ automobile). dSirinecceti knowledge resources vary lien) their representation methods, the lexical integrator wraps each lexical resource in a common plug-in interface which encapsulates resource’s inner representation method and exposes its knowledge as a list of inference links. The implemented plug-ins that come with PLIS are described in Section 2.1. 
Adding a new lexical resource and integrating it with the others only demands the implementation of the plug-in interface. As the knowledge needed to connect a pair of terms, ti and hj, may be scattered across few resources, the lexical integrator combines inference links into lexical inference chains to deduce new pieces of knowledge, such as Columbus −r −e −so −u −rc −e →2 −r −e −so −u −rc −e →1 navigator explorer. Therefore, the only assumption −t −he − l−e −x →ica elx integrator makes, regarding its input lexical resources, is that the inferential lexical relations they provide are transitive. The lexical integrator generates lexical infer- ence chains by expanding the text and hypothesis terms with inference links. These links lead to new terms (e.g. navigator in the above chain example and t0 in Fig 1) which can be further expanded, as all inference links are transitive. A transitivity 2Where iand j run from 1 to the length of the text and hypothesis respectively. 98 limit is set by the user to determine the maximal length for inference chains. The lexical integrator uses a graph-based representation for the inference chains, as illustrates in Fig 1. A node holds the lemma, part-of-speech and sense of a single term. The sense is the ordinal number of WordNet sense. Whenever we do not know the sense of a term we implement the most frequent sense heuristic.3 An edge represents an inference link and is labeled with the semantic relation of this link (e.g. cytokine→protein is larbeellaetdio wni othf tt hheis sW linokrd (Nee.gt .re clayttiookni hypernym). 2.1 Available plug-ins for lexical resources We have implemented plug-ins for the follow- ing resources: the English lexicon WordNet (Fellbaum, 1998)(based on either JWI, JWNL or extJWNL java APIs4), CatVar (Habash and Dorr, 2003), a categorial variations database, Wikipedia-based resource (Shnarch et al., 2009), which applies several extraction methods to derive inference links from the text and structure of Wikipedia, VerbOcean (Chklovski and Pantel, 2004), a knowledge base of fine-grained semantic relations between verbs, Lin’s distributional similarity thesaurus (Lin, 1998), and DIRECT (Kotlerman et al., 2010), a directional distributional similarity thesaurus geared for lexical inference. To summarize, the lexical integrator finds all possible inference chains (of a predefined length), resulting from any combination of inference links extracted from lexical resources, which link any t, h pair of a given text-hypothesis. Developers can use this tool to save the hassle of interfacing with the different lexical knowledge resources, and spare the labor of combining their knowledge via inference chains. The lexical inference model, described next, provides a mean to decide whether a given hypothesis is inferred from a given text, based on weighing the lexical inference chains extracted by the lexical integrator. 3 Lexical inference There are many ways to implement an inference model which identifies inference relations between texts. A simple model may consider the 3This disambiguation policy was better than considering all senses of an ambiguous term in preliminary experiments. However, it is a matter of changing a variable in the configuration of PLIS to switch between these two policies. 4http://wordnet.princeton.edu/wordnet/related-projects/ number of hypothesis terms for which inference chains, originated from text terms, were found. 
In PLIS, the inference model is a plug-in, similar to the lexical knowledge resources, and can be easily replaced to change the inference logic. We provide PLIS with two implemented baseline lexical inference models which are mathematically based. These are two Probabilistic Lexical Models (PLMs), HN-PLM and M-PLM which are described in (Shnarch et al., 2011; Shnarch et al., 2012) respectively. A PLM provides probability estimations for the three parts of the inference process (as shown in Fig 1): the validity probability of each inference chain (i.e. the probability for a valid inference relation between its endpoint terms) P(ti → hj), the probability of each hypothesis term to →b e i hnferred by the entire text P(T → hj) (term-level probability), eanntdir teh tee probability o hf the entire hypothesis to be inferred by the text P(T → H) (sentencelteov eble probability). HN-PLM describes a generative process by which the hypothesis is generated from the text. Its parameters are the reliability level of each of the resources it utilizes (that is, the prior probability that applying an arbitrary inference link derived from each resource corresponds to a valid inference). For learning these parameters HN-PLM applies a schema of the EM algorithm (Dempster et al., 1977). Its performance on the recognizing textual entailment task, RTE (Bentivogli et al., 2009; Bentivogli et al., 2010), are in line with the state of the art inference systems, including complex systems which perform syntactic analysis. This model is improved by M-PLM, which deduces sentence-level probability from term-level probabilities by a Markovian process. PLIS with this model was used for a passage retrieval for a question answering task (Wang et al., 2007), and outperformed state of the art inference systems. Both PLMs model the following prominent aspects of the lexical inference phenomenon: (i) considering the different reliability levels of the input knowledge resources, (ii) reducing inference chain probability as its length increases, and (iii) increasing term-level probability as we have more inference chains which suggest that the hypothesis term is inferred by the text. Both PLMs only need sentence-level annotations from which they derive term-level inference probabilities. To summarize, the lexical inference module 99 ?(? → ?) Figure 2: PLIS interactive viewer with Example 1 demonstrates knowledge integration of multiple inference chains and resource combination (additional explanations, which are not part of the demo, are provided in orange). provides the setting for interfacing with the lexical integrator. Additionally, the module provides the framework for probabilistic inference models which estimate term-level probabilities and integrate them into a sentence-level inference decision, while implementing prominent aspects of lexical inference. The user can choose to apply another inference logic, not necessarily probabilistic, by plugging a different lexical inference model into the provided inference infrastructure. 4 The PLIS interactive system PLIS comes with an online interactive viewer5 in which the user sets the parameters of PLIS, inserts a text-hypothesis pair and gets a visualization of the entire inference process. This is a powerful tool for investigating knowledge integration and lexical inference models. Fig 2 presents a screenshot of the processing of Example 1. 
On the right side, the user configures the system by selecting knowledge resources, adjusting their configuration, setting the transitivity limit, and choosing the lexical inference model to be applied by PLIS. After inserting a text and a hypothesis to the appropriate text boxes, the user clicks on the infer button and PLIS generates all lexical inference chains, of length up to the transitivity limit, that connect text terms with hypothesis terms, as available from the combination of the selected input re5http://irsrv2.cs.biu.ac.il/nlp-net/PLIS.html sources. Each inference chain is presented in a line between the text and hypothesis. PLIS also displays the probability estimations for all inference levels; the probability of each chain is presented at the end of its line. For each hypothesis term, term-level probability, which weighs all inference chains found for it, is given below the dashed line. The overall sentence-level probability integrates the probabilities of all hypothesis terms and is displayed in the box at the bottom right corner. Next, we detail the inference process of Example 1, as presented in Fig 2. In this QA example, the probability of the candidate answer (set as the text) to be relevant for the given question (the hypothesis) is estimated. When utilizing only two knowledge resources (WordNet and Wikipedia), PLIS is able to recognize that explorer is inferred by Christopher Columbus and that New World is inferred by America. Each one of these pairs has two independent inference chains, numbered 1–4, as evidence for its inference relation. Both inference chains 1 and 3 include a single inference link, each derived from a different relation of the Wikipedia-based resource. The inference model assigns a higher probability for chain 1since the BeComp relation is much more reliable than the Link relation. This comparison illustrates the ability of the inference model to learn how to differ knowledge resources by their reliability. Comparing the probability assigned by the in100 ference model for inference chain 2 with the probabilities assigned for chains 1 and 3, reveals the sophisticated way by which the inference model integrates lexical knowledge. Inference chain 2 is longer than chain 1, therefore its probability is lower. However, the inference model assigns chain 2 a higher probability than chain 3, even though the latter is shorter, since the model is sensitive enough to consider the difference in reliability levels between the two highly reliable hypernym relations (from WordNet) of chain 2 and the less reliable Link relation (from Wikipedia) of chain 3. Another aspect of knowledge integration is exemplified in Fig 2 by the three circled probabilities. The inference model takes into consideration the multiple pieces of evidence for the inference of New World (inference chains 3 and 4, whose probabilities are circled). This results in a termlevel probability estimation for New World (the third circled probability) which is higher than the probabilities of each chain separately. The third term of the hypothesis, discover, remains uncovered by the text as no inference chain was found for it. Therefore, the sentence-level inference probability is very low, 37%. In order to identify that the hypothesis is indeed inferred from the text, the inference model should be provided with indications for the inference of discover. To that end, the user may increase the transitivity limit in hope that longer inference chains provide the needed information. 
In addition, the user can examine other knowledge resources in search of the missing inference link. In this example, it is enough to add VerbOcean to the input of PLIS to expose two inference chains which connect reveal with discover by combining an inference link from WordNet with another one from VerbOcean. With this additional information, the sentence-level probability increases to 76%. This is a typical scenario of utilizing PLIS, either via the interactive system or via the software, for analyzing the usability of the different knowledge resources and their combination.

A feature of the interactive system which is useful for the analysis of lexical resources is that each term in a chain is clickable and links to another screen which presents all the terms that are inferred from it and those from which it is inferred. Additionally, the interactive system communicates with a server which runs PLIS over a full-duplex WebSocket connection (we used the socket.io implementation). This mode of operation is publicly available and provides a way of utilizing PLIS without having to install it or the lexical resources it uses.

Finally, since PLIS is a lexical system, it can easily be adjusted to other languages. One only needs to replace the basic lexical text processing tools and plug in knowledge resources in the target language. If PLIS is provided with bilingual resources (a bilingual resource holds inference links which connect terms in different languages; e.g. an English-Spanish dictionary can provide the inference link explorer→explorador), it can also operate as a cross-lingual inference system (Negri et al., 2012). For instance, the text in Fig 3 is given in English, while the hypothesis is written in Spanish (given as a list of lemma:part-of-speech). The left side of the figure depicts a cross-lingual inference process in which the only lexical knowledge resource used is a manually built English-Spanish dictionary. As can be seen, two Spanish terms, jugador and casa, remain uncovered, since the dictionary alone cannot connect them to any of the English terms in the text. As illustrated in the right side of Fig 3, PLIS enables the combination of the bilingual dictionary with monolingual resources to produce cross-lingual inference chains, such as footballer →(hypernym)→ player →(manual)→ jugador. Such inference chains have the capability to overcome monolingual language variability (the first link in this chain) as well as to provide cross-lingual translation (the second link).

Figure 3: PLIS as a cross-lingual inference system. Left: the process with a single manual bilingual resource. Right: PLIS composes cross-lingual inference chains to increase hypothesis coverage and sentence-level inference probability.
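Continuing the same toy sketch, the effect of combining a bilingual dictionary with a monolingual resource can be illustrated as follows. The ES-dictionary resource name and its reliability value are hypothetical, and the composed chain mirrors the footballer → player → jugador example of Fig 3; PLIS itself builds such chains through its lexical integrator rather than through this simplified code.

# Hypothetical extension of the earlier sketch with a bilingual resource.
RELIABILITY["ES-dictionary:translation"] = 0.95  # made-up reliability value

# With the dictionary alone, no chain connects the English text term
# "footballer" to the Spanish hypothesis term "jugador":
print(term_probability([]))  # falls back to the uncovered-term prior

# Composing a monolingual WordNet link with a dictionary link yields the
# cross-lingual chain footballer ->(hypernym)-> player ->(translation)-> jugador:
cross_lingual_chain = ["WordNet:hypernym", "ES-dictionary:translation"]
print(chain_probability(cross_lingual_chain))   # ~0.81
print(term_probability([cross_lingual_chain]))  # coverage of jugador now well above the prior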
5 Conclusions

To utilize PLIS, one should gather lexical resources, obtain sentence-level annotations and train the inference model. Annotations are available in common data sets for tasks such as QA, Information Retrieval (queries are hypotheses and snippets are texts) and Student Response Analysis (reference answers are the hypotheses that should be inferred by the student answers).

For developers of NLP applications, PLIS offers a ready-to-use lexical knowledge integrator which can interface with many common lexical knowledge resources and constructs lexical inference chains which combine the knowledge in them. A developer who wants to overcome lexical language variability, or to incorporate background knowledge, can utilize PLIS to inject lexical knowledge into any text understanding application. PLIS can be used as a lightweight inference system or as the lexical component of larger, more complex inference systems. Additionally, PLIS provides scores for inference chains and determines the way to combine them in order to recognize sentence-level inference. PLIS comes with two probabilistic lexical inference models which achieved competitive performance levels in the tasks of recognizing textual entailment and passage retrieval for QA.

All aspects of PLIS are configurable. The user can easily switch between the built-in lexical resources, inference models and even languages, or extend the system with additional lexical resources and new inference models.

Acknowledgments

The authors thank Eden Erez for his help with the interactive viewer and Miquel Esplà Gomis for the bilingual dictionaries. This work was partially supported by the European Community's 7th Framework Programme (FP7/2007-2013) under grant agreement no. 287923 (EXCITEMENT) and the Israel Science Foundation grant 880/12.

References

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge. In Proc. of TAC.
Luisa Bentivogli, Peter Clark, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2010. The sixth PASCAL recognizing textual entailment challenge. In Proc. of TAC.
Timothy Chklovski and Patrick Pantel. 2004. VerbOcean: Mining the web for fine-grained semantic verb relations. In Proc. of EMNLP.
Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Lecture Notes in Computer Science, volume 3944, pages 177–190.
A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
Nizar Habash and Bonnie Dorr. 2003. A categorial variation database for English. In Proc. of NAACL.
Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359–389.
Dekang Lin. 1998. Automatic retrieval and clustering of similar words. In Proc. of COLING-ACL.
Matteo Negri, Alessandro Marchetti, Yashar Mehdad, Luisa Bentivogli, and Danilo Giampiccolo. 2012. SemEval-2012 Task 8: Cross-lingual textual entailment for content synchronization. In Proc. of SemEval.
Eyal Shnarch, Libby Barak, and Ido Dagan. 2009. Extracting lexical reference rules from Wikipedia. In Proc. of ACL.
Eyal Shnarch, Jacob Goldberger, and Ido Dagan. 2011. Towards a probabilistic model for lexical entailment. In Proc. of the TextInfer Workshop.
Eyal Shnarch, Ido Dagan, and Jacob Goldberger. 2012. A probabilistic lexical model for ranking textual inferences. In Proc. of *SEM.
Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proc. of EMNLP.

5 0.7104764 150 acl-2013-Extending an interoperable platform to facilitate the creation of multilingual and multimodal NLP applications

Author: Georgios Kontonatsios ; Paul Thompson ; Riza Theresa Batista-Navarro ; Claudiu Mihaila ; Ioannis Korkontzelos ; Sophia Ananiadou

Abstract: U-Compare is a UIMA-based workflow construction platform for building natural language processing (NLP) applications from heterogeneous language resources (LRs), without the need for programming skills. U-Compare has been adopted within the context of the METANET Network of Excellence, and over 40 LRs that process 15 European languages have been added to the U-Compare component library. In line with METANET’s aims of increasing communication between citizens of different European countries, U-Compare has been extended to facilitate the development of a wider range of applications, including both multilingual and multimodal workflows. The enhancements exploit the UIMA Subject of Analysis (Sofa) mechanism, that allows different facets of the input data to be represented. We demonstrate how our customised extensions to U-Compare allow the construction and testing of NLP applications that transform the input data in different ways, e.g., machine translation, automatic summarisation and text-to-speech.

6 0.70625705 104 acl-2013-DKPro Similarity: An Open Source Framework for Text Similarity

7 0.70353174 277 acl-2013-Part-of-speech tagging with antagonistic adversaries

8 0.70267946 12 acl-2013-A New Set of Norms for Semantic Relatedness Measures

9 0.70033658 116 acl-2013-Detecting Metaphor by Contextual Analogy

10 0.69644755 284 acl-2013-Probabilistic Sense Sentiment Similarity through Hidden Emotions

11 0.67434233 307 acl-2013-Scalable Decipherment for Machine Translation via Hash Sampling

12 0.64869773 51 acl-2013-AnnoMarket: An Open Cloud Platform for NLP

13 0.63591766 105 acl-2013-DKPro WSD: A Generalized UIMA-based Framework for Word Sense Disambiguation

14 0.60673034 304 acl-2013-SEMILAR: The Semantic Similarity Toolkit

15 0.603266 385 acl-2013-WebAnno: A Flexible, Web-based and Visually Supported System for Distributed Annotations

16 0.60181737 297 acl-2013-Recognizing Partial Textual Entailment

17 0.59986138 43 acl-2013-Align, Disambiguate and Walk: A Unified Approach for Measuring Semantic Similarity

18 0.59395355 157 acl-2013-Fast and Robust Compressive Summarization with Dual Decomposition and Multi-Task Learning

19 0.59304017 239 acl-2013-Meet EDGAR, a tutoring agent at MONSERRATE

20 0.58085001 83 acl-2013-Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model