acl acl2011 acl2011-115 knowledge-graph by maker-knowledge-mining

115 acl-2011-Engkoo: Mining the Web for Language Learning


Source: pdf

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo 1, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. [sent-5, score-0.455]

2 Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. [sent-6, score-0.148]

3 The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. [sent-8, score-0.464]

4 Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. [sent-9, score-0.312]

5 Firstly, they often depend on static contents compiled by experts, and therefore cannot cover fresh words or new usages of existing words. [sent-13, score-0.155]

6 functions are often limited, making it hard for users to effectively find information they are interested in. [sent-17, score-0.115]

7 Lastly, existing tools tend to focus exclusively on dictionary, machine translation or language learning, losing out on synergy that can reduce inefficiencies in the user experience. [sent-18, score-0.164]

8 Different from existing tools, it discovers fresh and authentic translation knowledge from billions of web pages - using the Internet to catch language in motion, and offering novel search functions that allow users efficient access to massive knowledge resources. [sent-20, score-0.72]

9 Additionally, the system unifies the scenarios of dictionary, machine translation, and language learning into a seamless and more productive user experience. [sent-21, score-0.157]

10 Engkoo derives its data from a process that continuously culls bilingual term/sentence pairs from the web, filters noise and conducts a series of NLP processes including POS tagging, dependency parsing and classification. [sent-22, score-0.406]

11 Next, the mined bilingual pairs, together with the extracted linguistic knowledge, are indexed. [sent-24, score-0.406]

12 Finally, it exposes a set of web services through which users can: 1) look up the definition of a word/phrase; 2) retrieve example sentences using keywords, POS tags or collocations; and 3) get the translation of a word/phrase/sentence. [sent-25, score-0.646]

13 While Engkoo is currently built for Chinese users who are learning English, the technology itself is language independent and can be extended to sup- port other language pairs in the future. [sent-26, score-0.223]

14 We have deployed Engkoo online to Chinese internet users and gathered log data that suggests its ... [sent-27, score-0.166]

15 0% are active users (make at least 1 query); active users make 8 queries per day on average. [sent-32, score-0.23]

16 The service receives more than one million page views per day. [sent-33, score-0.113]

17 Online dictionary lookup services can be divided into two categories. [sent-39, score-0.27]

18 , Oxford dictionaries 2 and Longman contemporary English dictionary 3. [sent-42, score-0.173]

19 Examples of these kinds of services include iCiba 4 and Lingoes 5. [sent-43, score-0.094]

20 The second depends mainly on mined bilingual term/sentence pairs, e. [sent-44, score-0.406]

21 , fuzzy POS-based search, classifier filtering), and an integrated language learning experience (e. [sent-49, score-0.109]

22 , translation with interactive word alignment, and photorealistic lip-synced video tutors). [sent-51, score-0.135]

23 (2006) uses document object model (DOM) tree mapping to extract bilingual sentence pairs from aligned bilingual web pages. [sent-54, score-0.741]

24 (2009b) exploits collective patterns to extract bilingual term/sentence pairs from one web page. [sent-56, score-0.489]

25 (2010) proposes training an SVM-based classifier with multiple linguistic features to evaluate the quality of mined corpora. [sent-58, score-0.186]
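
As a rough illustration of that line of work, a toy quality classifier can be trained with scikit-learn. The features, labels, and data below are invented for the example; they are not the linguistic features actually used by Jiang et al. (2010) or by Engkoo.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical per-pair features: length ratio, translation-table coverage,
# and target-side language-model score (not the paper's actual feature set).
X = [[1.1, 0.80, -2.1],   # well-aligned pair
     [0.3, 0.10, -6.4],   # misaligned / boilerplate pair
     [0.9, 0.70, -2.8],
     [2.5, 0.20, -5.0]]
y = [1, 0, 1, 0]          # 1 = keep, 0 = drop

quality_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
quality_clf.fit(X, y)
print(quality_clf.predict([[1.0, 0.75, -2.5]]))  # expected to print [1] on this toy data
```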

26 Following this line of work, Engkoo implements its mining pipeline with a focus on robustness and speed, and is designed to work on a very large volume of web pages. [sent-62, score-0.274]

27 The first layer consists of the crawler and the raw web page storage. [sent-77, score-0.44]

28 The crawler periodically downloads two kinds of web pages, which are put into the storage. [sent-78, score-0.315]

29 The first kind of web pages are parallel web pages (describing the same contents but in different languages, often from bilingual sites, e. [sent-79, score-0.692]

30 , government sites), and the second are those containing bilingual contents. [sent-81, score-0.252]

31 A list of seed URLs are maintained and updated after each round of the mining process. [sent-82, score-0.082]

32 The second layer consists of the extractor, the filter, the classifiers and the readability evaluator, which are applied sequentially. [sent-83, score-0.18]

33 The extractor scans the raw web page storage and identifies bilingual web page pairs using URL patterns. [sent-84, score-0.885]

34 For example, two web pages are parallel if their URLs are in the form of “…/zh/…” and “…/en/…”, respectively. [sent-85, score-0.219]
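
A minimal sketch of this URL-pattern heuristic, assuming only the /zh/ vs /en/ substitution from the example above (the full pattern set used by Engkoo is not described here):

```python
import re
from itertools import groupby

LANG = re.compile(r"/(zh|en)(/|$)")

def pair_parallel_urls(urls):
    """Pair URLs that differ only in a /zh/ vs /en/ path segment."""
    def normalized(url):
        return LANG.sub(r"/_\2", url)   # shared key for both language versions

    pairs = []
    candidates = sorted(filter(LANG.search, urls), key=normalized)
    for _, group in groupby(candidates, key=normalized):
        by_lang = {LANG.search(u).group(1): u for u in group}
        if "zh" in by_lang and "en" in by_lang:
            pairs.append((by_lang["zh"], by_lang["en"]))
    return pairs

print(pair_parallel_urls([
    "http://example.gov/zh/news/1",
    "http://example.gov/en/news/1",
    "http://example.gov/en/about",
]))
# [('http://example.gov/zh/news/1', 'http://example.gov/en/news/1')]
```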

35 (2006) the extractor then extracts bilingual term/sentence pairs from parallel web pages. [sent-87, score-0.636]

36 Meanwhile, it identifies web pages with bilingual contents, and mines bilingual term/sentence pairs from them using the method proposed by Jiang et al. [sent-88, score-0.741]

37 The readability evaluator assigns a score to each term/sentence pair according to Formula 1. [sent-96, score-0.087]

38 Firstly, a list of top sites from which a good number of high-quality pairs are obtained is compiled; these are used as seeds by the crawler. [sent-101, score-0.185]

39 Secondly, bilingual term/sentence pairs extracted from traditional dictionaries are fed into this layer as well, but with the quality checking process ignored. [sent-102, score-0.521]

40 The third layer consists of a series of NLP components, which conduct POS tagging, dependency parsing, and word alignment, respectively. [sent-103, score-0.08]

41 It also includes components that learn translation information and collocations from the parsed term/sentence pairs. [sent-104, score-0.228]

42 Based on the learned statistical information, two phrase-based statistical machine translation (SMT) systems are trained, which can then translate sentences from one language to the other and vice versa. [sent-105, score-0.14]

43 Finally, the mined bilingual term/sentence pairs, together with their parsed information, are stored and indexed with a multi-level indexing engine, a core component of this layer. [sent-106, score-0.406]

44 The indexer is called multi-level since it uses not only keywords but also POS tags and dependency triples (e. [sent-107, score-0.163]

45 The fourth layer consists of a set of services that expose the mined term/sentence pairs and the linguistic knowledge based on the built index. [sent-113, score-0.477]

46 On top of these services, we construct a web application supporting a wide range of functions, such as searching bilingual terms/sentences, translation, and so on. [sent-114, score-0.514]

47 The crawler scans the Internet to get parallel and bilingual web pages. [sent-118, score-0.678]

48 It employs a set of heuristic rules related to URLs and contents to filter unwanted pages. [sent-119, score-0.115]

49 That is, it uses these URLs as seeds, and then conducts a depth-first crawl with a maximum allowable depth of 5. [sent-121, score-0.088]
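
The sketch below illustrates such a seed-based, depth-bounded crawl. The fetch, extract_links, and is_wanted hooks are hypothetical stand-ins for the downloading code and the URL/content filtering heuristics mentioned above; this is not Engkoo's actual crawler.

```python
from urllib.parse import urljoin, urlparse

MAX_DEPTH = 5  # maximum crawl depth mentioned above

def crawl(seed_urls, fetch, extract_links, is_wanted):
    """Depth-first crawl from seed URLs, visiting each page at most once.

    fetch(url) -> html or None, extract_links(html, base_url) -> list of hrefs,
    and is_wanted(url, html) -> bool are hypothetical caller-supplied hooks;
    the last one stands in for the URL/content filtering heuristics above.
    """
    visited = set()
    pages = {}

    def dfs(url, depth):
        if depth > MAX_DEPTH or url in visited:
            return
        visited.add(url)                 # avoid re-downloading the same page
        html = fetch(url)
        if html is None or not is_wanted(url, html):
            return
        pages[url] = html
        for href in extract_links(html, url):
            target = urljoin(url, href)
            # stay on the same host; other hosts are reached via other seeds
            if urlparse(target).netloc == urlparse(url).netloc:
                dfs(target, depth + 1)

    for seed in seed_urls:
        dfs(seed, 0)
    return pages
```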

50 In this way, the crawler tries to avoid repeatedly downloading the same web page. [sent-124, score-0.315]

51 A bilingual term/sentence extractor is implemented following Shi et al. [sent-128, score-0.342]

52 It works in two modes, mining from parallel web pages and from bilingual web pages. [sent-131, score-0.715]

53 Parallel web pages are identified recursively in the following way. [sent-132, score-0.162]

54 Given a pair of parallel web pages, the URLs in two pages are extracted respectively, and are further aligned according to their positions in DOM trees, so that more parallel pages can be obtained. [sent-133, score-0.276]
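
As a rough sketch of this recursive discovery step, assuming the crudest possible notion of DOM position (the document-order index of each anchor tag); the DOM-tree alignment of Shi et al. (2006) that Engkoo follows is more sophisticated than this.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect href values in document order, i.e. by pre-order DOM position."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

def align_links(zh_html, en_html):
    """Pair the i-th link of the Chinese page with the i-th link of the English
    page; the resulting pairs become new candidate parallel pages to crawl."""
    zh, en = AnchorCollector(), AnchorCollector()
    zh.feed(zh_html)
    en.feed(en_html)
    return list(zip(zh.hrefs, en.hrefs))
```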

55 (2007) is implemented as well to mine the definition of a given term using search engines. [sent-135, score-0.132]

56 By now, we have obtained about 1,050 million bilingual term pairs and 100 million bilingual sentence pairs. [sent-136, score-0.698]

57 The filter takes three steps to drop low quality pairs. [sent-138, score-0.088]

58 In Engkoo, the language model is a 5-gram language model trained on news articles using SRILM (Stolcke, 2002), while the translation model is based on a manually compiled translation table. [sent-146, score-0.23]
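
The paper does not spell out how these models are combined into a filtering decision, so the following is only an illustrative sketch under assumed scoring: rate each candidate pair by a target-side n-gram language-model score and by how well a translation table covers the source words, and drop pairs below thresholds. The back-off scheme, thresholds, and combination are all placeholders.

```python
def lm_logprob(tokens, ngram_logprobs, n=5):
    """Sum of n-gram log-probabilities. ngram_logprobs maps token tuples (up to
    length n) to log10 probabilities, e.g. exported from an SRILM-trained model."""
    total = 0.0
    for i in range(len(tokens)):
        context = tuple(tokens[max(0, i - n + 1): i + 1])
        while len(context) > 1 and context not in ngram_logprobs:
            context = context[1:]                     # back off to a shorter history
        total += ngram_logprobs.get(context, -7.0)    # floor for unseen unigrams
    return total

def coverage(src_tokens, tgt_tokens, trans_table):
    """Fraction of source tokens with at least one listed translation present
    in the target sentence (trans_table: word -> set of translations)."""
    if not src_tokens:
        return 0.0
    hits = sum(1 for w in src_tokens if trans_table.get(w, set()) & set(tgt_tokens))
    return hits / len(src_tokens)

def keep_pair(src_tokens, tgt_tokens, ngram_logprobs, trans_table,
              lm_threshold=-3.5, cov_threshold=0.3):
    """Keep a mined pair if the target side is fluent enough and the two sides
    translate each other well enough (thresholds are arbitrary placeholders)."""
    per_token = lm_logprob(tgt_tokens, ngram_logprobs) / max(len(tgt_tokens), 1)
    return (per_token >= lm_threshold and
            coverage(src_tokens, tgt_tokens, trans_table) >= cov_threshold)
```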

59 We have obtained about 20 million bilingual term pairs and 15 million bilingual sentence pairs after filtering noise. [sent-147, score-0.773]

60 For each classifier, about 10,000 sentence pairs are manually annotated for training/development/testing. [sent-150, score-0.075]

61 Our SMT systems are phrase-based, trained on the web-mined bilingual sentence pairs using the GIZA++ (Och and Ney, 2000) alignment package, with a collaborative decoder similar to Li et al. [sent-154, score-0.74]

62 At the heart of the indexer is the inverted lists, each of which contains an entry pointing to an ordered list of the related term/sentence pairs. [sent-160, score-0.081]
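
A toy version of such a multi-level inverted index; the posting lists here are kept in insertion order as a stand-in for whatever ranking the real engine applies, and the example pair and tags are made up.

```python
from collections import defaultdict

class MultiLevelIndex:
    """Inverted index keyed by surface keywords, POS tags, or dependency
    triples; every key maps to an ordered list of term/sentence-pair ids."""

    def __init__(self):
        self.postings = defaultdict(list)
        self.pairs = []

    def add(self, pair, keywords, pos_tags, dep_triples):
        pair_id = len(self.pairs)
        self.pairs.append(pair)
        keys = list(keywords)
        keys += [("POS", t) for t in pos_tags]
        keys += [("DEP",) + t for t in dep_triples]
        for key in keys:
            self.postings[key].append(pair_id)

    def lookup(self, key):
        return [self.pairs[i] for i in self.postings.get(key, [])]

# Index one mined pair, then retrieve it by a POS tag or a dependency triple,
# roughly how a query such as "v. TV" could be served.
index = MultiLevelIndex()
index.add(("我们看电视", "We watch TV"),
          keywords=["watch", "TV"],
          pos_tags=["PRP", "VBP", "NN"],
          dep_triples=[("watch", "dobj", "TV")])
print(index.lookup(("POS", "VBP")))
print(index.lookup(("DEP", "watch", "dobj", "TV")))
```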

63 The traditional dictionary interface is extended with a blending of web-mined and ranked term definitions, sample sentences, synonyms, collocations, and phonetically similar terms. [sent-165, score-0.313]

64 The result page user experience includes an intuitive comparable tabs interface described in Jiang et al. [sent-166, score-0.289]

65 (2009a) that effectively exposes differences be- 47 tween similar terms. [sent-167, score-0.092]

66 The search experience is augmented with fuzzy auto-completion, which besides traditional prefix matching is also robust against errors and allows for alternative inputs. [sent-168, score-0.204]

67 All of these contain inline micro translations to help users narrow in on their intended search. [sent-169, score-0.115]

68 Errors are resolved by a blend of edit-distance and phonetic search algorithms tuned for Chinese user behavior patterns identified by user study. [sent-170, score-0.176]
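
A hedged sketch of error-tolerant completion: rank lexicon entries by the edit distance between the query and each entry's best-matching prefix, preferring exact prefix matches. The weights are arbitrary, and the phonetic component of the real blend is omitted here.

```python
def prefix_edit_distance(query, candidate):
    """Edit distance between the query and the closest prefix of the candidate,
    so a typo like 'recieve' can still complete to 'receive ...' entries."""
    q, c = query.lower(), candidate.lower()
    prev = list(range(len(q) + 1))        # distances against the empty prefix
    best = prev[-1]
    for j, cc in enumerate(c, 1):
        cur = [j]
        for i, qc in enumerate(q, 1):
            cur.append(min(prev[i] + 1,                 # skip a candidate character
                           cur[i - 1] + 1,              # skip a query character
                           prev[i - 1] + (qc != cc)))   # substitute / match
        best = min(best, cur[-1])
        prev = cur
    return best

def complete(query, lexicon, k=5, w_edit=1.0, w_prefix=0.5):
    """Return the k entries with the smallest blended score (weights are arbitrary)."""
    def score(entry):
        not_prefix = 0 if entry.lower().startswith(query.lower()) else 1
        return w_edit * prefix_edit_distance(query, entry) + w_prefix * not_prefix
    return sorted(lexicon, key=score)[:k]

print(complete("recieve", ["receive", "recipe", "recent", "record"], k=2))
# ['receive', 'recipe'] with these toy weights
```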

69 The definitions for the term derived from traditional dictionary sources are included in the main definition area and refer to the noise of a small bird. [sent-173, score-0.296]

70 Augmenting the definition area are “Web translations,” which include the contemporary use of the word standing for micro-blogging. [sent-174, score-0.084]

71 Web-mined bilingual sample sentences are also presented and ranked by popularity metrics; this demonstrates the modern usage of the term. [sent-175, score-0.365]

72 Engkoo exposes a novel search and interactive exploration interface for the ever-growing web-mined bilingual sample sentences in its database. [sent-177, score-0.595]

73 Emphasis is placed on sample sentences in Engkoo because of their crucial role in language learning. [sent-178, score-0.113]

74 One can search for sentences as they would in traditional search engines or concordancers. [sent-180, score-0.183]

75 Further, sentences can be filtered based on classifiers such as oral, written, and technical styles, source, and language difficulty. [sent-182, score-0.097]

76 Additionally sample sentences for terms can be filtered by their inflection and the semantics of a particular definition. [sent-183, score-0.113]

77 Interactivity can be found in the word alignment between the languages as one moves his or her mouse over the words, which can also be clicked on for deeper exploration. [sent-184, score-0.064]

78 Sample sentences between two similar words can be displayed side-by-side in a tabbed (a) A screenshot of the definition and sample sentence areas of an Engkoo result page. [sent-186, score-0.278]

79 (b) A screenshot of sample sentences for the POS-wildcard query “v. [sent-187, score-0.136]

80 (c) A screenshot of machine translation integrated into the dictionary experience, where the top pane shows results of machine translation while the bottom pane displays example sentences mined from the web. [sent-189, score-0.66]

81 user interface to easily expose the subtleties between usages. [sent-191, score-0.16]

82 In the example seen in Figure 2(b), a user has searched for the collocation verb+TV, represented by the query “v. [sent-192, score-0.133]

83 ” In the results, we find fresh and authentic sample sentences mined from the web, the first of which contains “watch TV,” the most common collocation, as the top result. [sent-194, score-0.38]

84 Additionally, the corresponding keyword in Chinese is automatically highlighted using statistical alignment techniques. [sent-195, score-0.064]

85 For many users, the difference between a machine translation (MT) system and a translation dictionary is not entirely clear. [sent-197, score-0.297]

86 For shorter MT queries, sample sentences might also be returned, as one can see in Figure 2(c), which expands the search and also raises confidence in a translation, as one can observe it used on the web. [sent-199, score-0.261]

87 Like the sample sentences, word alignment is also exposed on the machine translation. [sent-200, score-0.137]

88 As the alignment naturally serves as a word breaker, users can click the selection for a lookup which would open a new tab with the definition. [sent-201, score-0.258]

89 This is especially useful in cases where a user might want to find alternatives to a particular part of a translation. [sent-202, score-0.094]

90 Note that the seemingly single-line dictionary search box is also adapted to MT behavior, allowing users to paste in multi-line text, as it can detect and unfold itself to a larger text area as needed. [sent-203, score-0.26]

91 4 Conclusions and Future work We have presented Engkoo, a novel online translation system which uniquely unifies the scenarios of dictionary, machine translation, and language learning. [sent-204, score-0.193]

92 The features of the offering are based on an ever-expanding data set derived from state-of-the-art web mining and NLP techniques. [sent-205, score-0.291]

93 Direct user feedback and implicit log data suggest that the service is effective for both translation utility and language learning, with advantages over existing services. [sent-207, score-0.193]

94 In future work, we are examining extracting language knowledge from the real-time web for translation in news scenarios. [sent-208, score-0.262]

95 Additionally, we are actively mining other language pairs to build a multi-language learning system. [sent-209, score-0.157]

96 Combinable tabs: An interactive method of information comparison using a combinable tabbed document interface. [sent-218, score-0.157]

97 Mining bilingual data from the web with adaptively learnt patterns. [sent-222, score-0.414]

98 Collaborative decoding: Partial hypothesis re-ranking using translation consensus between decoders. [sent-230, score-0.1]

99 Evaluating the quality of web-mined bilingual sentences using multiple linguistic features. [sent-234, score-0.324]

100 A dom tree alignment model for mining parallel data from the web. [sent-246, score-0.269]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('engkoo', 0.641), ('bilingual', 0.252), ('web', 0.162), ('mined', 0.154), ('crawler', 0.153), ('ming', 0.123), ('urls', 0.12), ('users', 0.115), ('xiaohua', 0.111), ('jiang', 0.109), ('translation', 0.1), ('dictionary', 0.097), ('services', 0.094), ('tv', 0.092), ('exposes', 0.092), ('extractor', 0.09), ('collocations', 0.084), ('mining', 0.082), ('indexer', 0.081), ('layer', 0.08), ('lookup', 0.079), ('pairs', 0.075), ('sample', 0.073), ('experience', 0.071), ('liu', 0.069), ('dom', 0.066), ('fresh', 0.066), ('chinese', 0.065), ('user', 0.064), ('alignment', 0.064), ('combinable', 0.061), ('tabbed', 0.061), ('screenshot', 0.061), ('contents', 0.059), ('parallel', 0.057), ('massive', 0.057), ('classifiers', 0.057), ('zhou', 0.057), ('filter', 0.056), ('smt', 0.055), ('interface', 0.055), ('tabs', 0.054), ('pane', 0.054), ('scans', 0.054), ('cheng', 0.053), ('shi', 0.053), ('internet', 0.051), ('wild', 0.05), ('unifies', 0.05), ('search', 0.048), ('traditional', 0.047), ('authentic', 0.047), ('conducts', 0.047), ('niu', 0.047), ('offering', 0.047), ('firstly', 0.045), ('page', 0.045), ('sites', 0.045), ('evaluator', 0.044), ('dongdong', 0.044), ('secondly', 0.044), ('components', 0.044), ('keywords', 0.043), ('readability', 0.043), ('scenarios', 0.043), ('definition', 0.043), ('watch', 0.042), ('term', 0.041), ('expose', 0.041), ('billions', 0.041), ('crawling', 0.041), ('contemporary', 0.041), ('sentences', 0.04), ('triples', 0.039), ('li', 0.039), ('million', 0.039), ('fuzzy', 0.038), ('mu', 0.038), ('url', 0.038), ('catch', 0.037), ('definitions', 0.036), ('interactive', 0.035), ('query', 0.035), ('pos', 0.035), ('dictionaries', 0.035), ('additionally', 0.035), ('collocation', 0.034), ('sun', 0.034), ('collaborative', 0.033), ('seeds', 0.033), ('built', 0.033), ('meanwhile', 0.033), ('quality', 0.032), ('mt', 0.032), ('noise', 0.032), ('implements', 0.03), ('alternatives', 0.03), ('compiled', 0.03), ('experts', 0.029), ('service', 0.029)]
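
For readers unfamiliar with how such word weights and similarity lists are produced, here is a minimal sketch using scikit-learn's TfidfVectorizer and cosine similarity; the toy abstracts and paper ids are invented, and this is not necessarily the pipeline used to build this page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the paper abstracts (invented text).
abstracts = {
    "acl2011-115": "mining the web for bilingual terms and sentences",
    "acl2011-70":  "bilingual lexicon extraction from comparable corpora",
    "acl2011-271": "question formulation in web search queries",
}
ids = list(abstracts)
tfidf = TfidfVectorizer().fit_transform([abstracts[i] for i in ids])

sims = cosine_similarity(tfidf[0], tfidf).ravel()   # similarity to acl2011-115
ranking = sorted(zip(sims, ids), reverse=True)
print(ranking)   # the paper itself ranks first with similarity 1.0
```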

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo 1, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

2 0.12492705 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

3 0.12303927 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

Author: Bo Pang ; Ravi Kumar

Abstract: Web search is an information-seeking activity. Often times, this amounts to a user seeking answers to a question. However, queries, which encode user’s information need, are typically not expressed as full-length natural language sentences in particular, as questions. Rather, they consist of one or more text fragments. As humans become more searchengine-savvy, do natural-language questions still have a role to play in web search? Through a systematic, large-scale study, we find to our surprise that as time goes by, web users are more likely to use questions to express their search intent. —

4 0.10703111 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

Author: Hal Daume III ; Jagadeesh Jagarlamudi

Abstract: We show that unseen words account for a large part of the translation error when moving to new domains. Using an extension of a recent approach to mining translations from comparable corpora (Haghighi et al., 2008), we are able to find translations for otherwise OOV terms. We show several approaches to integrating such translations into a phrasebased translation system, yielding consistent improvements in translations quality (between 0.5 and 1.5 Bleu points) on four domains and two language pairs.

5 0.10444549 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

6 0.10090181 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

7 0.099159598 183 acl-2011-Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora

8 0.096801281 152 acl-2011-How Much Can We Gain from Supervised Word Alignment?

9 0.096084394 177 acl-2011-Interactive Group Suggesting for Twitter

10 0.0928936 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

11 0.089004762 81 acl-2011-Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach

12 0.085703962 266 acl-2011-Reordering with Source Language Collocations

13 0.085189618 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

14 0.0816705 19 acl-2011-A Mobile Touchable Application for Online Topic Graph Extraction and Exploration of Web Content

15 0.080264904 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

16 0.075872943 49 acl-2011-Automatic Evaluation of Chinese Translation Output: Word-Level or Character-Level?

17 0.074181624 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

18 0.073535524 146 acl-2011-Goodness: A Method for Measuring Machine Translation Confidence

19 0.073300235 245 acl-2011-Phrase-Based Translation Model for Question Retrieval in Community Question Answer Archives

20 0.072630696 155 acl-2011-Hypothesis Mixture Decoding for Statistical Machine Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.208), (1, -0.043), (2, 0.039), (3, 0.111), (4, -0.032), (5, -0.051), (6, 0.03), (7, -0.101), (8, 0.044), (9, -0.024), (10, -0.017), (11, 0.017), (12, 0.043), (13, -0.076), (14, -0.021), (15, 0.007), (16, 0.069), (17, -0.011), (18, 0.085), (19, -0.029), (20, -0.007), (21, 0.002), (22, 0.021), (23, 0.051), (24, -0.017), (25, -0.027), (26, -0.03), (27, 0.127), (28, 0.013), (29, -0.136), (30, 0.091), (31, 0.023), (32, 0.009), (33, -0.073), (34, -0.022), (35, -0.068), (36, 0.017), (37, -0.011), (38, 0.026), (39, 0.028), (40, -0.033), (41, 0.048), (42, 0.038), (43, 0.071), (44, 0.037), (45, -0.041), (46, -0.028), (47, 0.005), (48, 0.051), (49, 0.063)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94674969 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo 1, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

2 0.71005917 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

Author: Bo Li ; Eric Gaussier ; Akiko Aizawa

Abstract: We study in this paper the problem of enhancing the comparability of bilingual corpora in order to improve the quality of bilingual lexicons extracted from comparable corpora. We introduce a clustering-based approach for enhancing corpus comparability which exploits the homogeneity feature of the corpus, and finally preserves most of the vocabulary of the original corpus. Our experiments illustrate the well-foundedness of this method and show that the bilingual lexicons obtained from the homogeneous corpus are of better quality than the lexicons obtained with previous approaches.

3 0.69631505 139 acl-2011-From Bilingual Dictionaries to Interlingual Document Representations

Author: Jagadeesh Jagarlamudi ; Hal Daume III ; Raghavendra Udupa

Abstract: Mapping documents into an interlingual representation can help bridge the language barrier of a cross-lingual corpus. Previous approaches use aligned documents as training data to learn an interlingual representation, making them sensitive to the domain of the training data. In this paper, we learn an interlingual representation in an unsupervised manner using only a bilingual dictionary. We first use the bilingual dictionary to find candidate document alignments and then use them to find an interlingual representation. Since the candidate alignments are noisy, we de- velop a robust learning algorithm to learn the interlingual representation. We show that bilingual dictionaries generalize to different domains better: our approach gives better performance than either a word by word translation method or Canonical Correlation Analysis (CCA) trained on a different domain.

4 0.66789436 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

5 0.63265198 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru

Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54, 309, 334 error-correction pairs from a realworld data set that contains 2, 277, 786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors. 1

6 0.62301683 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

7 0.61987722 80 acl-2011-ConsentCanvas: Automatic Texturing for Improved Readability in End-User License Agreements

8 0.59261405 151 acl-2011-Hindi to Punjabi Machine Translation System

9 0.57593131 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

10 0.57314634 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

11 0.54757065 248 acl-2011-Predicting Clicks in a Vocabulary Learning System

12 0.54462028 240 acl-2011-ParaSense or How to Use Parallel Corpora for Word Sense Disambiguation

13 0.5396508 304 acl-2011-Together We Can: Bilingual Bootstrapping for WSD

14 0.52345085 157 acl-2011-I Thou Thee, Thou Traitor: Predicting Formal vs. Informal Address in English Literature

15 0.52323598 67 acl-2011-Clairlib: A Toolkit for Natural Language Processing, Information Retrieval, and Network Analysis

16 0.52275652 104 acl-2011-Domain Adaptation for Machine Translation by Mining Unseen Words

17 0.51963001 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

18 0.51242214 177 acl-2011-Interactive Group Suggesting for Twitter

19 0.51091808 11 acl-2011-A Fast and Accurate Method for Approximate String Search

20 0.50585592 311 acl-2011-Translationese and Its Dialects


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.032), (17, 0.035), (26, 0.436), (37, 0.054), (39, 0.047), (41, 0.051), (55, 0.026), (59, 0.022), (72, 0.038), (91, 0.041), (96, 0.129)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91194016 105 acl-2011-Dr Sentiment Knows Everything!

Author: Amitava Das ; Sivaji Bandyopadhyay

Abstract: Sentiment analysis is one of the hot demanding research areas since last few decades. Although a formidable amount of research have been done, the existing reported solutions or available systems are still far from perfect or do not meet the satisfaction level of end users’ . The main issue is the various conceptual rules that govern sentiment and there are even more clues (possibly unlimited) that can convey these concepts from realization to verbalization of a human being. Human psychology directly relates to the unrevealed clues and governs the sentiment realization of us. Human psychology relates many things like social psychology, culture, pragmatics and many more endless intelligent aspects of civilization. Proper incorporation of human psychology into computational sentiment knowledge representation may solve the problem. In the present paper we propose a template based online interactive gaming technology, called Dr Sentiment to automatically create the PsychoSentiWordNet involving internet population. The PsychoSentiWordNet is an extension of SentiWordNet that presently holds human psychological knowledge on a few aspects along with sentiment knowledge.

same-paper 2 0.87238717 115 acl-2011-Engkoo: Mining the Web for Language Learning

Author: Matthew R. Scott ; Xiaohua Liu ; Ming Zhou ; Microsoft Engkoo Team

Abstract: This paper presents Engkoo 1, a system for exploring and learning language. It is built primarily by mining translation knowledge from billions of web pages - using the Internet to catch language in motion. Currently Engkoo is built for Chinese users who are learning English; however the technology itself is language independent and can be extended in the future. At a system level, Engkoo is an application platform that supports a multitude of NLP technologies such as cross language retrieval, alignment, sentence classification, and statistical machine translation. The data set that supports this system is primarily built from mining a massive set of bilingual terms and sentences from across the web. Specifically, web pages that contain both Chinese and English are discovered and analyzed for parallelism, extracted and formulated into clear term definitions and sample sentences. This approach allows us to build perhaps the world’s largest lexicon linking both Chinese and English together - at the same time covering the most up-to-date terms as captured by the net.

3 0.82041413 253 acl-2011-PsychoSentiWordNet

Author: Amitava Das

Abstract: Sentiment analysis is one of the hot demanding research areas since last few decades. Although a formidable amount of research has been done but still the existing reported solutions or available systems are far from perfect or to meet the satisfaction level of end user's. The main issue may be there are many conceptual rules that govern sentiment, and there are even more clues (possibly unlimited) that can convey these concepts from realization to verbalization of a human being. Human psychology directly relates to the unrevealed clues; govern the sentiment realization of us. Human psychology relates many things like social psychology, culture, pragmatics and many more endless intelligent aspects of civilization. Proper incorporation of human psychology into computational sentiment knowledge representation may solve the problem. PsychoSentiWordNet is an extension over SentiWordNet that holds human psychological knowledge and sentiment knowledge simultaneously. 1

4 0.81745821 259 acl-2011-Rare Word Translation Extraction from Aligned Comparable Documents

Author: Emmanuel Prochasson ; Pascale Fung

Abstract: We present a first known result of high precision rare word bilingual extraction from comparable corpora, using aligned comparable documents and supervised classification. We incorporate two features, a context-vector similarity and a co-occurrence model between words in aligned documents in a machine learning approach. We test our hypothesis on different pairs of languages and corpora. We obtain very high F-Measure between 80% and 98% for recognizing and extracting correct translations for rare terms (from 1to 5 occurrences). Moreover, we show that our system can be trained on a pair of languages and test on a different pair of languages, obtaining a F-Measure of 77% for the classification of Chinese-English translations using a training corpus of Spanish-French. Our method is therefore even potentially applicable to low resources languages without training data.

5 0.74520886 333 acl-2011-Web-Scale Features for Full-Scale Parsing

Author: Mohit Bansal ; Dan Klein

Abstract: Counts from large corpora (like the web) can be powerful syntactic cues. Past work has used web counts to help resolve isolated ambiguities, such as binary noun-verb PP attachments and noun compound bracketings. In this work, we first present a method for generating web count features that address the full range of syntactic attachments. These features encode both surface evidence of lexical affinities as well as paraphrase-based cues to syntactic structure. We then integrate our features into full-scale dependency and constituent parsers. We show relative error reductions of7.0% over the second-order dependency parser of McDonald and Pereira (2006), 9.2% over the constituent parser of Petrov et al. (2006), and 3.4% over a non-local constituent reranker.

6 0.67032105 123 acl-2011-Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation

7 0.65367907 70 acl-2011-Clustering Comparable Corpora For Bilingual Lexicon Extraction

8 0.59060138 258 acl-2011-Ranking Class Labels Using Query Sessions

9 0.5858773 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

10 0.54218221 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

11 0.53957796 182 acl-2011-Joint Annotation of Search Queries

12 0.52657741 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

13 0.51160258 331 acl-2011-Using Large Monolingual and Bilingual Corpora to Improve Coordination Disambiguation

14 0.50666249 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

15 0.50011194 256 acl-2011-Query Weighting for Ranking Model Adaptation

16 0.49628919 338 acl-2011-Wikulu: An Extensible Architecture for Integrating Natural Language Processing Techniques with Wikis

17 0.49594107 90 acl-2011-Crowdsourcing Translation: Professional Quality from Non-Professionals

18 0.48904702 193 acl-2011-Language-independent compound splitting with morphological operations

19 0.48740754 215 acl-2011-MACAON An NLP Tool Suite for Processing Word Lattices

20 0.48613706 337 acl-2011-Wikipedia Revision Toolkit: Efficiently Accessing Wikipedias Edit History