
164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation


Source: pdf

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. [sent-4, score-0.376]

2 We propose a simple and practical encryption-based method addressing this barrier. [sent-5, score-0.054]

3 1 Introduction It is generally taken for granted that whoever is deploying a Statistical Machine Translation (SMT) system has unrestricted rights to access and use the parallel data required for its training. [sent-6, score-0.165]

4 Such TMs are cherished as valuable assets by their owners, who rarely agree to give away wholesale rights to their use. [sent-9, score-0.037]

5 At the same time, the prospective user of the SMT system that could be derived from such a TM might be subject to confidentiality constraints on the text stream needing translation, so that sending out text to translate to an SMT system deployed by the owner of the PT is not an option. [sent-10, score-0.46]

6 We propose an encryption-based method that addresses such conflicting constraints. [sent-11, score-0.026]

7 In this method, the owner of the TM generates a Phrase Table (PT) from it, and makes it accessible to the user following a special procedure. [sent-12, score-0.273]

8 An SMT decoder is deployed by the user, with all the required resources to operate except the PT (footnote 1). [sent-13, score-0.176]

9 The method assumes that, besides the PT Owner and the PT User, there is a Trusted Third Party. [sent-15, score-0.026]

10 This means that both the User and the PT owner trust this third party not to collude with the other one to violate their secrets (i. [sent-16, score-0.295]

11 the content of the PT, or a string requiring translation), even if they do not trust her enough to directly disclose such secrets to her. [sent-18, score-0.182]

12 While the exposition will focus on phrase tables, there is nothing in the method precluding its use with other resources, provided that they can be represented as look-up tables, a very mild constraint. [sent-19, score-0.157]

13 Provided speed-related aspects can be dealt with, this makes the method directly applicable to language models, or distortion tables for models with lexicalized distortion (Al-Onaizan and Papineni, 2006). [sent-20, score-0.189]

14 The method is also directly applicable to Translation Memories, which can be seen as “degenerate” [Footnote 1: If the decoder can operate with multiple PTs, then there could be other (possibly out-of-domain) PTs installed locally.] [sent-21, score-0.083]

15 [Page footer: © 2012 Association for Computational Linguistics, pages 23–27] phrase tables where each record contains only a translation in the target language, and no associated statistics. [sent-24, score-0.265]

16 The rest of this paper is organized as follows: Section 2 explains the proposed method; in Section 3 we make some implementation choices more precise. [sent-25, score-0.029]

17 2 Private access to phrase tables. Let Alice (footnote 2) be the owner of a PT, Bob the owner of the SMT decoder who would like to use the table, and Tina a trusted third party. [sent-29, score-0.729]

18 In broad terms, the proposed method works like this: in an initialization phase, Alice first encrypts PT entries one by one, sends the encrypted PT to Bob, and the encryption/decryption keys to Tina. [sent-30, score-0.782]

19 Alice also sends a method to map source language phrases to PT indices to Bob. [sent-31, score-0.201]

20 When translating, Bob uses the mapping method sent by Alice to check if a given source phrase is present and has a translation in the PT and, if this is the case, retrieves the index of the corresponding entry in the PT. [sent-32, score-0.365]

21 If the check is positive, then Bob sends a request to Tina for the corresponding decryption key. [sent-33, score-0.237]

22 Tina delivers the decryption key to Bob and communicates that a download has taken place to Alice, who can then increase a download counter. [sent-34, score-0.206]

23 , (sn, vn)} be a PT, where si is a source phrase and vi is the corresponding record. [sent-38, score-0.25]

24 In an actual PT there are multiple lines for the same source phrase, but it is always possible to reconstruct a single record by concatenating all such lines. [sent-39, score-0.193]

25 Encrypts vi with key ki; we denote the encrypted record as vi ⊕ ki. [sent-44, score-0.622]

26 ,n to Bob [Footnote 2: We adopt a widespread convention in cryptography and assign person names to the parties involved in the exchange.] [sent-49, score-0.063]

27 Figure 1: The initialization phase of the method (Sec. [sent-50, score-0.234]

28 Bob receives an encrypted version of the PT entries and the corresponding source phrase digests. [sent-53, score-0.499]

29 Sends the encrypted record (or ciphertext) {vi ⊕ ki}i=1,. [sent-56, score-0.275]

30 ,n to Tina. A digest, or one-way hash function (Schneider, 1996), is a particular type of hash function. [sent-63, score-0.32]

31 It takes as input a string of arbitrary length, and deterministically produces a bit string of fixed length. [sent-64, score-0.093]

32 It is such that it is virtually impossible to reconstruct a message given its digest, and that the probability of collisions, i. [sent-65, score-0.048]

33 At the end of the initialization, neither Bob nor Tina can access the content of the PT, unless they collude. [sent-68, score-0.094]
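The initialization phase described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's implementation: the function names (`digest`, `xor_bytes`, `initialize`), the toy phrase table, and the record format are all hypothetical, and the position of an entry in the returned lists stands in for its PT index.

```python
import hashlib
import secrets


def digest(phrase: str) -> str:
    """MD5 digest of a source phrase, returned as a hex string."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Bit-wise XOR of two byte strings of equal length."""
    assert len(data) == len(key)
    return bytes(a ^ b for a, b in zip(data, key))


def initialize(phrase_table: dict[str, str]):
    """Alice's side: build what goes to Bob and what goes to Tina.

    Returns (for_bob, for_tina), where for_bob is a list of
    (digest, ciphertext) pairs and for_tina is the list of keys.
    The list position plays the role of the PT index.
    """
    for_bob, for_tina = [], []
    for source, record in phrase_table.items():
        plain = record.encode("utf-8")
        key = secrets.token_bytes(len(plain))  # one fresh key per entry
        for_bob.append((digest(source), xor_bytes(plain, key)))
        for_tina.append(key)
    return for_bob, for_tina


# Hypothetical two-entry phrase table, for illustration only.
toy_pt = {"la maison": "the house ||| 0.8", "le chat": "the cat ||| 0.9"}
bob_table, tina_keys = initialize(toy_pt)
```

In this sketch Bob holds only digests and ciphertexts, and Tina holds only keys, so neither party alone can read a record.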

34 2 Retrieval. During translation, Bob has a source phrase s and would like to retrieve from the PT the corresponding entry, if it is present. [sent-70, score-0.135]

35 Bob computes the digest d of s using the same cryptographic hash function used by Alice in the initialization phase; 2. [sent-73, score-0.554]

36 Bob checks whether d ∈ {di}. If the check is negative then s does not have an entry in the PT, and the process stops. [sent-78, score-0.097]

37 If the check is positive then s has an entry in the PT: let i_s be the corresponding index; Figure 2: The retrieval phase (Sec. [sent-79, score-0.242]

38 Tina sends Bob kis and notifies Alice, who can increment a counter of PT entries downloaded by Bob; 5. [sent-84, score-0.362]

39 vis ⊕ kis using key kis , and re- At the end of the process, Bob retrieved from the PT owned by Alice an entry if and only if it matched phrase s (this is guaranteed by the virtual absence of collisions ensured by the cryptographic hash functions used for computing phrase digests). [sent-86, score-0.826]

40 Alice was notified by Tina that Bob downloaded one entry, as desired, while neither Tina nor Alice could learn s, unless they colluded. [sent-87, score-0.062]
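The retrieval exchange can be sketched as follows. This is a minimal single-process illustration under assumptions of my own: the `Tina` and `Bob` classes, their method names, and the counter that stands in for Tina's notification to Alice are all hypothetical, and a real deployment would place the three parties on separate machines.

```python
import hashlib
import secrets


def digest(phrase: str) -> str:
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def xor_bytes(data: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, key))


class Tina:
    """Trusted third party: holds the keys, never sees phrases or records."""

    def __init__(self, keys):
        self.keys = keys
        self.downloads_reported_to_alice = 0

    def request_key(self, index: int) -> bytes:
        # Stands in for "notify Alice, who increments her download counter".
        self.downloads_reported_to_alice += 1
        return self.keys[index]


class Bob:
    """PT user: holds digests and ciphertexts, asks Tina for keys."""

    def __init__(self, digests, ciphertexts, tina):
        self.index_of = {d: i for i, d in enumerate(digests)}
        self.ciphertexts = ciphertexts
        self.tina = tina

    def lookup(self, phrase: str):
        i = self.index_of.get(digest(phrase))
        if i is None:          # digest not in the table: the process stops
            return None
        key = self.tina.request_key(i)
        return xor_bytes(self.ciphertexts[i], key).decode("utf-8")


# Toy setup standing in for the initialization phase.
digests, ciphertexts, keys = [], [], []
for source, record in {"le chat": "the cat"}.items():
    plain = record.encode("utf-8")
    k = secrets.token_bytes(len(plain))
    digests.append(digest(source))
    ciphertexts.append(xor_bytes(plain, k))
    keys.append(k)

tina = Tina(keys)
bob = Bob(digests, ciphertexts, tina)
```

Note that Tina only ever receives an index, never the phrase s or its digest, which is what keeps the query private from her in this sketch.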

41 2 we presented a method for looking up PT entries involving one interaction for each phrase look-up. [sent-89, score-0.227]

42 In our implementation, we batch all requests for all source phrases up to a predefined length for all sentences in a given file. [sent-90, score-0.114]

43 This mirrors the standard practice of filtering the phrase table for a given source file to translate before starting the actual decoding. [sent-91, score-0.171]

44 Out of the large choice of cryptographic hash functions in the literature (Schneider, 1996), we chose 128-bit md5 for its widespread availability in multiple programming languages and environments. [sent-92, score-0.348]

45 For encrypting entries, we used bit-wise XOR with a string of random bits (the key) of the same length as the encrypted item. [sent-93, score-0.343]

46 This symmetric encryption is known as a one-time pad, and it is unbreakable, provided the key bits are truly random. [sent-94, score-0.107]
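A one-time pad over bytes takes only a few lines. The sketch below is illustrative rather than the paper's code: the function names and the sample record are assumptions, and Python's `secrets.token_bytes` is used here as a convenient source of key material.

```python
import secrets


def otp_encrypt(plaintext: bytes) -> tuple[bytes, bytes]:
    """Return (ciphertext, key); the key is as long as the plaintext."""
    key = secrets.token_bytes(len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, key))
    return ciphertext, key


def otp_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    """XOR with the same key undoes the encryption."""
    if len(ciphertext) != len(key):
        raise ValueError("key and ciphertext must have the same length")
    return bytes(c ^ k for c, k in zip(ciphertext, key))


# Hypothetical PT record, for illustration only.
record = "the house ||| 0.8 0.6".encode("utf-8")
ciphertext, key = otp_encrypt(record)
```

Because the key is exactly as long as the record, storing ciphertext and key together takes about twice the space of the plaintext, and each key must be used for one record only.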

47 Both keys and ciphertext are indexed and sorted by increasing md5 digest of the corresponding source phrase. [sent-95, score-0.43]

48 For retrieving all entries matching a given text file, Bob generates md5 digests for all source phrases up to a maximum length, sorts them, and performs a join with the encrypted entry file. [sent-96, score-0.693]

49 Matching digests are then sent to Tina for her to join with the keys. [sent-97, score-0.246]
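The sort-and-join step can be illustrated with a streaming merge join in Python. This is a stand-in written for illustration, not the author's pipeline: the helper names, the toy table, and the maximum phrase length of 2 are assumptions.

```python
import hashlib


def md5_hex(phrase: str) -> str:
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def phrases_up_to(tokens, max_len):
    """All contiguous token sequences of length 1..max_len."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield " ".join(tokens[start:end])


def merge_join(sorted_queries, sorted_table):
    """Yield table rows whose digest appears among the queries.

    Both inputs must be sorted by digest, with unique digests. Only one
    row of each input is held at a time, so memory use stays constant.
    """
    table_iter = iter(sorted_table)
    row = next(table_iter, None)
    for q in sorted_queries:
        while row is not None and row[0] < q:
            row = next(table_iter, None)
        if row is None:
            return
        if row[0] == q:
            yield row


# Toy encrypted-entry table: (digest, ciphertext placeholder), sorted by digest.
table = sorted(
    (md5_hex(s), payload)
    for s, payload in [("le chat", b"c1"), ("la maison", b"c2"), ("chat", b"c3")]
)
queries = sorted({md5_hex(p) for p in phrases_up_to("le chat dort".split(), 2)})
matches = list(merge_join(queries, table))
```

The same function would serve on Tina's side, joining the matching digests (or indices) against the sorted key table.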

50 Note that it is never necessary to have any massive data structure in main memory, and all process steps except the initial sorting by md5 digest are linear in the number of PT entries or in the number of tokens to look up. [sent-99, score-0.317]

51 The process results however in increased storage and bandwidth requirements, since ciphertext and key have each roughly the same size as the original PT. [sent-100, score-0.162]

52 4 Related work We are not aware of any previous work directly addressing the problem we solve, i. [sent-101, score-0.028]

53 private access to a phrase table or other resources for the purpose of performing statistical machine translation. [sent-103, score-0.252]

54 Private access to electronic information in general, however, is an active research area. [sent-104, score-0.065]

55 An interesting and relatively recent survey of the field of secure multiparty computation and privacy-preserving data mining is (Lindell and Pinkas, 2009). [sent-109, score-0.092]

56 5 Experiments. We validated our simple implementation using a phrase table of 38,488,777 lines created with the Moses toolkit (footnote 3) (Koehn et al. [sent-110, score-0.166]

57 , 2007) phrase-based SMT system, corresponding to 15,764,069 entries for distinct source phrases (footnote 4). [sent-111, score-0.156]

58 org/moses/ Footnote 4: The birthday bound for a 128-bit hash like md5 for a collision probability of 10^-18 is around 2. [sent-114, score-0.299]

59 This means Figure 3: Time required to complete the initialization as a function of the number of lines in the original PT. [sent-116, score-0.28]

60 Figure 3 shows the time required to complete the initialization phase as a function of the size of the original PT (in millions of lines). [sent-121, score-0.299]

61 The progression is largely linear, and the overall initialization time of roughly 45 minutes for the complete PT indicates that the method can be used in practice. [sent-122, score-0.196]

62 Note that the Europarl corpus originating the phrase-table is much larger than most TMs available at even large language service providers. [sent-123, score-0.024]

63 Figure 4 displays the time required to complete retrieval for subsets of increasing size of the 2,000 sentence test set, and for phrase tables uniformly sampled at 25%, 50%, 75% and 100%. [sent-124, score-0.357]

64 217,019 distinct digests are generated for all possible phrases of length up to 6 from the full test set, resulting in the retrieval of 47,072 entries (596,560 lines) from the full phrase table. [sent-125, score-0.531]

65 Our implementation of the retrieval uses the Unix join command on the ciphertext and the key tables, and performs a full scan through [Footnote 4, continued: that if the hash distributed keys perfectly uniformly, then about 26 billion entries would be required for the collision probability to exceed 10^-18.] [sent-126, score-0.816]

66 While no hash function, including md5, distributes keys perfectly evenly (Bellare and Kohno, 2004), the number of entries likely to be handled in our application is orders of magnitude smaller than the bound. [sent-127, score-0.434]
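The figure of about 26 billion entries can be checked with the usual small-probability birthday approximation, p ≈ n² / (2N) for n items hashed uniformly into N values, which gives n ≈ sqrt(2·N·p). The snippet below is a quick arithmetic check under that approximation; the function name is hypothetical.

```python
import math


def birthday_bound(bits: int, p: float) -> float:
    """Approximate number of uniformly hashed items at which the
    collision probability reaches p (valid for small p)."""
    return math.sqrt(2.0 * (2.0 ** bits) * p)


n = birthday_bound(128, 1e-18)   # roughly 2.6e10, i.e. about 26 billion
entries_in_experiment = 15_764_069
ratio = n / entries_in_experiment
```

With 15,764,069 distinct source phrases in the experiment, the bound is more than three orders of magnitude above the number of entries actually hashed.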

67 Figure 4: Time required for retrieval as a function of the number of sentences in the query, for different subsets of the original phrase table. [sent-133, score-0.223]

68 Complexity hence depends more on the size of the PT than on the length of the query. [sent-135, score-0.028]

69 An ad-hoc indexing of the encrypted entries and of the keys in e. [sent-136, score-0.428]

70 a standard database would make the dependency logarithmic in the number of entries, and linear in the number of source tokens. [sent-138, score-0.045]

71 Digests’ prefixes are perfectly suited for bucketing ciphertext and keys. [sent-139, score-0.165]

72 6 Conclusions Some SMT systems never get deployed because of legitimate and incompatible concerns of the prospective users and of the training data owners. [sent-141, score-0.141]

73 We propose a method that guarantees to the owner of a TM that only some fraction of an artifact derived from the original resource, a phrase-table, is transferred, and only in a very controlled way that allows downloads to be tracked. [sent-142, score-0.266]

74 This same method also guarantees the privacy of the user, who is not required to disclose the content of what needs translation. [sent-143, score-0.237]

75 Empirical validation on demanding conditions shows that the proposed method is practical on ordinary computing infrastructure. [sent-144, score-0.026]

76 This same method can be easily extended to other resources used by SMT systems, and indeed even beyond SMT itself, whenever similar constraints on data access exist. [sent-145, score-0.091]

77 Hash function balance and its impact on birthday attacks. [sent-153, score-0.044]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('bob', 0.442), ('pt', 0.365), ('tina', 0.266), ('alice', 0.243), ('encrypted', 0.222), ('owner', 0.206), ('digest', 0.169), ('digests', 0.167), ('hash', 0.16), ('initialization', 0.142), ('sends', 0.13), ('ciphertext', 0.121), ('entries', 0.111), ('private', 0.097), ('entry', 0.097), ('keys', 0.095), ('smt', 0.09), ('phrase', 0.09), ('kis', 0.088), ('vi', 0.085), ('confidentiality', 0.083), ('cryptographic', 0.083), ('tables', 0.077), ('decryption', 0.073), ('ki', 0.068), ('user', 0.067), ('bits', 0.066), ('disclose', 0.066), ('phase', 0.066), ('access', 0.065), ('required', 0.063), ('tm', 0.063), ('deployed', 0.056), ('bellare', 0.056), ('bellovin', 0.056), ('benny', 0.056), ('chor', 0.056), ('collision', 0.056), ('encrypts', 0.056), ('lindell', 0.056), ('pts', 0.056), ('trusted', 0.056), ('record', 0.053), ('join', 0.051), ('reconstruct', 0.048), ('collisions', 0.048), ('tms', 0.048), ('multiparty', 0.048), ('privacy', 0.048), ('prospective', 0.048), ('secrets', 0.048), ('lines', 0.047), ('translation', 0.045), ('source', 0.045), ('retrieval', 0.045), ('birthday', 0.044), ('schneider', 0.044), ('memories', 0.044), ('secure', 0.044), ('perfectly', 0.044), ('distortion', 0.043), ('requests', 0.041), ('trust', 0.041), ('vis', 0.041), ('exposition', 0.041), ('key', 0.041), ('bit', 0.039), ('widespread', 0.039), ('rights', 0.037), ('never', 0.037), ('file', 0.036), ('check', 0.034), ('download', 0.034), ('guarantees', 0.034), ('downloaded', 0.033), ('receives', 0.031), ('si', 0.03), ('di', 0.03), ('decoder', 0.029), ('uniformly', 0.029), ('unless', 0.029), ('implementation', 0.029), ('sent', 0.028), ('addressing', 0.028), ('operate', 0.028), ('complete', 0.028), ('length', 0.028), ('string', 0.027), ('alexandra', 0.026), ('europarl', 0.026), ('method', 0.026), ('nicola', 0.026), ('subsets', 0.025), ('uss', 0.024), ('distributes', 0.024), ('communicates', 0.024), ('bloom', 0.024), ('originating', 0.024), ('cryptography', 0.024), ('unbreakable', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

2 0.096371666 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

Author: Seung-Wook Lee ; Dongdong Zhang ; Mu Li ; Ming Zhou ; Hae-Chang Rim

Abstract: In this paper, we propose a novel method of reducing the size of the translation model for hierarchical phrase-based machine translation systems. Previous approaches try to prune infrequent entries or unreliable entries based on statistics, but cause a problem of reducing the translation coverage. On the contrary, the proposed method tries to prune only ineffective entries based on the estimation of the information redundancy encoded in phrase pairs and hierarchical rules, and thus preserves the search space of SMT decoders as much as possible. Experimental results on Chinese-to-English machine translation tasks show that our method is able to reduce the size of the translation model by almost half with very tiny degradation of translation performance.

3 0.082127951 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

Author: Yashar Mehdad ; Matteo Negri ; Marcello Federico

Abstract: We address a core aspect of the multilingual content synchronization task: the identification of novel, more informative or semantically equivalent pieces of information in two documents about the same topic. This can be seen as an application-oriented variant of textual entailment recognition where: i) T and H are in different languages, and ii) entailment relations between T and H have to be checked in both directions. Using a combination of lexical, syntactic, and semantic features to train a cross-lingual textual entailment system, we report promising results on different datasets.

4 0.079615496 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

Author: Xiaodong He ; Li Deng

Abstract: This paper proposes a new discriminative training method in constructing phrase and lexicon translation models. In order to reliably learn a myriad of parameters in these models, we propose an expected BLEU score-based utility function with KL regularization as the objective, and train the models on a large parallel dataset. For training, we derive growth transformations for phrase and lexicon translation probabilities to iteratively improve the objective. The proposed method, evaluated on the Europarl German-to-English dataset, leads to a 1.1 BLEU point improvement over a state-of-the-art baseline translation system. In IWSLT 2011 Benchmark, our system using the proposed method achieves the best Chinese-to-English translation result on the task of translating TED talks.

5 0.069909625 140 acl-2012-Machine Translation without Words through Substring Alignment

Author: Graham Neubig ; Taro Watanabe ; Shinsuke Mori ; Tatsuya Kawahara

Abstract: In this paper, we demonstrate that accurate machine translation is possible without the concept of “words,” treating MT as a problem of transformation between character strings. We achieve this result by applying phrasal inversion transduction grammar alignment techniques to character strings to train a character-based translation model, and using this in the phrase-based MT framework. We also propose a look-ahead parsing algorithm and substring-informed prior probabilities to achieve more effective and efficient alignment. In an evaluation, we demonstrate that character-based translation can achieve results that compare to word-based systems while effectively translating unknown and uncommon words over several language pairs.

6 0.061066568 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

7 0.056676723 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

8 0.052118 68 acl-2012-Decoding Running Key Ciphers

9 0.051222932 203 acl-2012-Translation Model Adaptation for Statistical Machine Translation with Monolingual Topic Information

10 0.049044959 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

11 0.048695359 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

12 0.046976715 148 acl-2012-Modified Distortion Matrices for Phrase-Based Statistical Machine Translation

13 0.045651022 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation

14 0.045179017 123 acl-2012-Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

15 0.04464807 147 acl-2012-Modeling the Translation of Predicate-Argument Structure for SMT

16 0.04429318 67 acl-2012-Deciphering Foreign Language by Combining Language Models and Context Vectors

17 0.044269398 97 acl-2012-Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation

18 0.041121006 3 acl-2012-A Class-Based Agreement Model for Generating Accurately Inflected Translations

19 0.040199168 25 acl-2012-An Exploration of Forest-to-String Translation: Does Translation Help or Hurt Parsing?

20 0.040144596 199 acl-2012-Topic Models for Dynamic Translation Model Adaptation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.123), (1, -0.057), (2, 0.046), (3, 0.026), (4, 0.033), (5, 0.007), (6, 0.02), (7, 0.051), (8, -0.003), (9, 0.001), (10, -0.016), (11, 0.053), (12, -0.004), (13, 0.007), (14, -0.021), (15, -0.021), (16, -0.01), (17, 0.032), (18, -0.009), (19, -0.005), (20, -0.008), (21, 0.018), (22, 0.022), (23, -0.004), (24, -0.002), (25, -0.039), (26, 0.033), (27, 0.113), (28, -0.065), (29, -0.04), (30, 0.027), (31, 0.009), (32, 0.07), (33, 0.064), (34, 0.018), (35, -0.074), (36, 0.014), (37, 0.068), (38, -0.102), (39, -0.036), (40, -0.008), (41, -0.004), (42, 0.001), (43, -0.156), (44, -0.1), (45, -0.212), (46, -0.022), (47, -0.021), (48, -0.037), (49, -0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92304218 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

2 0.6456458 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

Author: Andrejs Vasiljevs ; Raivis Skadins ; Jorg Tiedemann

Abstract: To facilitate the creation and usage of custom SMT systems we have created a cloud-based platform for do-it-yourself MT. The platform is developed in the EU collaboration project LetsMT!. This system demonstration paper presents the motivation in developing the LetsMT! platform, its main features, architecture, and an evaluation in a practical use case.

3 0.57628745 160 acl-2012-Personalized Normalization for a Multilingual Chat System

Author: Ai Ti Aw ; Lian Hau Lee

Abstract: This paper describes the personalized normalization of a multilingual chat system that supports chatting in user-defined short-forms or abbreviations. One of the major challenges for multilingual chat realized through machine translation technology is the normalization of non-standard, self-created short-forms in the chat message to standard words before translation. Due to the lack of training data and the variations of short-forms used among different social communities, it is hard to normalize and translate chat messages if users use vocabulary outside the training data and create short-forms freely. We develop a personalized chat normalizer for English and integrate it with a multilingual chat system, allowing users to create and use personalized short-forms in multilingual chat.

4 0.55215776 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

Author: Marco Lui ; Timothy Baldwin

Abstract: We present langid.py, an off-the-shelf language identification tool. We discuss the design and implementation of langid.py, and provide an empirical comparison on 5 long-document datasets, and 2 datasets from the microblog domain. We find that langid.py maintains consistently high accuracy across all domains, making it ideal for end-users that require language identification without wanting to invest in preparation of in-domain training data.

5 0.49719903 68 acl-2012-Decoding Running Key Ciphers

Author: Sravana Reddy ; Kevin Knight

Abstract: There has been recent interest in the problem of decoding letter substitution ciphers using techniques inspired by natural language processing. We consider a different type of classical encoding scheme known as the running key cipher, and propose a search solution using Gibbs sampling with a word language model. We evaluate our method on synthetic ciphertexts of different lengths, and find that it outperforms previous work that employs Viterbi decoding with character-based models.

6 0.4730975 163 acl-2012-Prediction of Learning Curves in Machine Translation

7 0.44445091 204 acl-2012-Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation

8 0.42537692 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

9 0.41348124 77 acl-2012-Ecological Evaluation of Persuasive Messages Using Google AdWords

10 0.38802904 143 acl-2012-Mixing Multiple Translation Models in Statistical Machine Translation

11 0.37640843 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

12 0.35973188 155 acl-2012-NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation

13 0.35254261 54 acl-2012-Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages

14 0.35020801 13 acl-2012-A Graphical Interface for MT Evaluation and Error Analysis

15 0.33954683 141 acl-2012-Maximum Expected BLEU Training of Phrase and Lexicon Translation Models

16 0.33641931 172 acl-2012-Selective Sharing for Multilingual Dependency Parsing

17 0.3307316 82 acl-2012-Entailment-based Text Exploration with Application to the Health-care Domain

18 0.32056534 121 acl-2012-Iterative Viterbi A* Algorithm for K-Best Sequential Decoding

19 0.31903696 125 acl-2012-Joint Learning of a Dual SMT System for Paraphrase Generation

20 0.31831038 63 acl-2012-Cross-lingual Parse Disambiguation based on Semantic Correspondence


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(25, 0.504), (26, 0.023), (28, 0.025), (37, 0.013), (39, 0.035), (57, 0.024), (74, 0.027), (84, 0.03), (85, 0.025), (90, 0.092), (92, 0.034), (94, 0.035), (99, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.88925278 49 acl-2012-Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study

Author: Nathan Schneider ; Behrang Mohit ; Kemal Oflazer ; Noah A. Smith

Abstract: “Lightweight” semantic annotation of text calls for a simple representation, ideally without requiring a semantic lexicon to achieve good coverage in the language and domain. In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains. The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.

same-paper 2 0.86929792 164 acl-2012-Private Access to Phrase Tables for Statistical Machine Translation

Author: Nicola Cancedda

Abstract: Some Statistical Machine Translation systems never see the light because the owner of the appropriate training data cannot release them, and the potential user of the system cannot disclose what should be translated. We propose a simple and practical encryption-based method addressing this barrier.

3 0.77547014 1 acl-2012-ACCURAT Toolkit for Multi-Level Alignment and Information Extraction from Comparable Corpora

Author: Marcis Pinnis ; Radu Ion ; Dan Stefanescu ; Fangzhong Su ; Inguna Skadina ; Andrejs Vasiljevs ; Bogdan Babych

Abstract: The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles for the further advancement of automated translation. A possible solution is to exploit comparable corpora (non-parallel bi- or multi-lingual text resources) which are much more widely available than parallel translation data. Our presented toolkit deals with parallel content extraction from comparable corpora. It consists of tools bundled in two workflows: (1) alignment of comparable documents and extraction of parallel sentences and (2) extraction and bilingual mapping of terms and named entities. The toolkit pairs similar bilingual comparable documents and extracts parallel sentences and bilingual terminological and named entity dictionaries from comparable corpora. This demonstration focuses on the English, Latvian, Lithuanian, and Romanian languages.

4 0.61511511 56 acl-2012-Computational Approaches to Sentence Completion

Author: Geoffrey Zweig ; John C. Platt ; Christopher Meek ; Christopher J.C. Burges ; Ainur Yessenalina ; Qiang Liu

Abstract: This paper studies the problem of sentence-level semantic coherence by answering SAT-style sentence completion questions. These questions test the ability of algorithms to distinguish sense from nonsense based on a variety of sentence-level phenomena. We tackle the problem with two approaches: methods that use local lexical information, such as the n-grams of a classical language model; and methods that evaluate global coherence, such as latent semantic analysis. We evaluate these methods on a suite of practice SAT questions, and on a recently released sentence completion task based on data taken from five Conan Doyle novels. We find that by fusing local and global information, we can exceed 50% on this task (chance baseline is 20%), and we suggest some avenues for further research.

5 0.34310877 44 acl-2012-CSNIPER - Annotation-by-query for Non-canonical Constructions in Large Corpora

Author: Richard Eckart de Castilho ; Sabine Bartsch ; Iryna Gurevych

Abstract: We present CSNIPER (Corpus Sniper), a tool that implements (i) a web-based multiuser scenario for identifying and annotating non-canonical grammatical constructions in large corpora based on linguistic queries and (ii) evaluation of annotation quality by measuring inter-rater agreement. This annotation-by-query approach efficiently harnesses expert knowledge to identify instances of linguistic phenomena that are hard to identify by means of existing automatic annotation tools.

6 0.32952741 206 acl-2012-UWN: A Large Multilingual Lexical Knowledge Base

7 0.328677 99 acl-2012-Finding Salient Dates for Building Thematic Timelines

8 0.327804 138 acl-2012-LetsMT!: Cloud-Based Platform for Do-It-Yourself Machine Translation

9 0.32381633 202 acl-2012-Transforming Standard Arabic to Colloquial Arabic

10 0.31566173 161 acl-2012-Polarity Consistency Checking for Sentiment Dictionaries

11 0.31463033 72 acl-2012-Detecting Semantic Equivalence and Information Disparity in Cross-lingual Documents

12 0.3144958 36 acl-2012-BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment

13 0.30893308 157 acl-2012-PDTB-style Discourse Annotation of Chinese Text

14 0.30876166 152 acl-2012-Multilingual WSD with Just a Few Lines of Code: the BabelNet API

15 0.30864766 219 acl-2012-langid.py: An Off-the-shelf Language Identification Tool

16 0.30851567 52 acl-2012-Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation

17 0.30806327 214 acl-2012-Verb Classification using Distributional Similarity in Syntactic and Semantic Structures

18 0.30643365 145 acl-2012-Modeling Sentences in the Latent Space

19 0.30234921 217 acl-2012-Word Sense Disambiguation Improves Information Retrieval

20 0.30002394 76 acl-2012-Distributional Semantics in Technicolor