emnlp emnlp2010 emnlp2010-56 knowledge-graph by maker-knowledge-mining

56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names


Source: pdf

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. [sent-2, score-0.969]

2 The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. [sent-3, score-0.65]

3 The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. [sent-4, score-0.705]

4 Moreover, the method that uses bilingual data for learning hash functions gives the best performance. [sent-6, score-0.408]

5 Hence, personal names are used predominantly as queries in People Search. [sent-10, score-0.495]

6 Naturally, spelling correction of misspelled personal names plays a very important role not only in reducing the time and effort needed by users to find the people they are searching for but also in ensuring a good user experience. [sent-12, score-1.11]

7 Spelling errors in personal names are of a different nature compared to those in general text. [sent-13, score-0.438]

8 Before People Search became widely popular, researchers working on the problem of personal name matching had recognized the human tendency to be inexact in recollecting names from memory and specifying them. [sent-15, score-0.868]

9 A study of personal names in hospital databases found that only 39% of the errors in the names were single typographical errors (Friedman and Sideli, 1992). [sent-16, score-0.809]

10 Standard approaches to general purpose spelling correction are not well-suited for correcting misspelled personal names. [sent-21, score-0.811]

11 Spelling correction techniques meant for general purpose web-queries require large volumes of training data in the form of query logs for learning the error models (Cucerzan and Brill, 2004), (Ahmad and Kondrak, 2005). [sent-26, score-0.497]

12 Therefore, spelling correction techniques that rely crucially on bigram and higher order language models will fail on queries with a different word order than what is observed in the query log. [sent-33, score-0.849]

13 Unlike general purpose Web Search, where it is not reasonable to assume the availability of a high-coverage trusted lexicon, People Search typically employs large authoritative name directories. [sent-34, score-0.447]

14 For instance, if one is searching for a friend on Facebook, the correct spelling of the friend’s name exists in the Facebook people directory (assuming that the friend is a registered user of Facebook at the time of the search). [sent-35, score-0.894]

15 In fact, even in Web Search, broad-coverage name directories are available in the form of Wikipedia, IMDB, etc. [sent-37, score-0.499]

16 The availability of large authoritative name directories that serve as the source of trusted spellings of names throws open the possibility of correcting misspelled personal names with the help of name matching techniques (Pfeifer et al. [sent-38, score-1.941]

17 However, the best of the name matching techniques can at best work with a few thousand names to give acceptable response time and accuracy. [sent-41, score-0.797]

18 In this work, we develop hashing-based name similarity search techniques and employ them for spelling correction of personal names. [sent-43, score-1.232]

19 The motivation for using hashing as a building block of spelling correction is the following: given a query, we want to return the global best match in the name directory. [sent-44, score-1.291]

20 Using the list of candidate tokens, we extract the list of candidate names which contain at least one approximately matching token. [sent-50, score-0.374]

21 Clearly, our success in finding the right name suggestion for the query in the NAME MATCHING stage depends crucially on our success in getting the right name suggestion in the list of candidates produced by the NAME BUCKETING stage search. [sent-52, score-0.926]

22 Therefore, we need a name similarity search technique that can ensure very high recall without producing too many candidates. [sent-53, score-0.519]

23 Hashing is best suited for this task of fast and approximate name matching. [sent-54, score-0.382]

24 We hash the query tokens as well as the directory tokens into d-bit binary codes. [sent-55, score-0.789]

25 With binary codes, finding approximate matches for a query token is as easy as finding all the database tokens that are at a Hamming distance of r or less from the query token in the binary code representation (Shakhnarovich et al. [sent-56, score-0.527]

26 When the binary codes are compact, this search can be done in a fraction of a second on directories containing millions of names on a simple processor. [sent-59, score-0.541]
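The following is a minimal illustrative sketch (Python, not taken from the paper) of this kind of Hamming-radius lookup: directory tokens are bucketed by their integer codewords, and a query codeword is matched by enumerating its Hamming ball, which is cheap for small r and d = 32. The table layout and function names are assumptions made purely for illustration.

    from itertools import combinations

    def hamming_ball(code, d, r):
        """Yield every d-bit code within Hamming distance r of `code` (including code itself)."""
        yield code
        for k in range(1, r + 1):
            for bits in combinations(range(d), k):
                flipped = code
                for b in bits:
                    flipped ^= 1 << b
                yield flipped

    def approximate_matches(query_code, hash_table, d=32, r=2):
        """Collect all directory tokens whose codeword lies within Hamming radius r of query_code.
        hash_table maps an integer codeword to the list of directory tokens hashing to it (assumed)."""
        matches = []
        for candidate in hamming_ball(query_code, d, r):
            matches.extend(hash_table.get(candidate, []))
        return matches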

27 Our contributions are: • We develop a novel data-driven technique for learning hash functions for mapping similar names to similar binary codes using a set of names in a given language/script (i. [sent-60, score-0.63]

28 We formulate the problem of learning hash functions as an optimization problem whose relaxation can be solved as a generalized Eigenvalue problem. [sent-63, score-0.417]

29 We show that hash functions can also be learnt using bilingual data in the form of name equivalents in two languages. [sent-66, score-0.823]

30 We formulate the problem of learning hash functions as an optimization problem whose relaxation can be solved using Canonical Correlation Analysis. [sent-67, score-0.417]

31 2) We develop new similarity measures for matching names (Section 3). [sent-69, score-0.42]

32 2 Learning Hash Functions In this section, we develop two techniques for learning hash functions using names as training data. [sent-73, score-0.691]

33 In the first approach, we use monolingual data consisting of names in a language whereas in the second we use bilingual name pairs. [sent-74, score-0.775]

34 In both techniques, the key idea is the same: we learn hash functions that map similar names in the training data to similar codewords. [sent-75, score-0.65]

35 We are given a set of name pairs T = {(s, s′)} as the training data. [sent-78, score-0.382]

36 We want to learn a hash function f that maps each name to a d-bit codeword, f : s → {−1, 1}^d. We also want the Hamming distance of the codeword of s to the codeword of s′ to be small when w(s, s′) is large. [sent-81, score-0.952]

37 We must avoid the trap of mapping all names to the same codeword and thereby making the Hamming error zero while satisfying the first and last constraints. [sent-102, score-0.427]
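The summary does not reproduce the paper's objective. A plausible formulation consistent with the constraints mentioned here and with the generalized eigenvalue relaxation of the earlier contribution, in the spirit of spectral hashing, would be the following hedged sketch (y_s denotes the codeword of name s; this is not necessarily the paper's exact program):

    \min_{\{y_s\}} \sum_{(s,s') \in T} w(s,s')\, \| y_s - y_{s'} \|^2
    \quad \text{s.t.} \quad y_s \in \{-1,+1\}^d, \qquad \sum_{s} y_s = 0, \qquad \frac{1}{N} \sum_{s} y_s y_s^{\top} = I

Dropping the integrality constraint while keeping the balance and decorrelation constraints turns this into an eigenvalue problem over the similarity-weighted graph Laplacian, which is one way such a relaxation can be solved.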

38 Further, the optimal solution gives codewords only for the names in the training data. [sent-105, score-0.44]

39 For a name s not in the training data, the codeword can be produced by binarizing each coordinate of fR(s): f(s) = sgn(fR(s)). [sent-126, score-0.542]
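A small Python sketch of this out-of-sample step, assuming the learnt hash directions are available as a d × n_features projection matrix (the names and shapes here are assumptions):

    import numpy as np

    def binarize_codeword(feature_vec, projection):
        """projection: (d, n_features) matrix of learnt hash directions (assumed shape).
        Returns the d-bit codeword f(s) = sgn(f_R(s)) as a +/-1 vector."""
        real_valued = projection @ feature_vec   # f_R(s): real-valued d-dimensional embedding
        return np.where(real_valued >= 0, 1, -1)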

40 In the remainder of this work, we call the system that uses the hash function learnt from monolingual data M-HASH. [sent-142, score-0.385]

41 2 B-HASH: Learning with Bilingual Names Data Let (s, t) be a pair consisting of a name s and its transliteration equivalent t in a different language/script. [sent-144, score-0.382]

42 We want to learn a pair of hash functions f, g that map names to d-bit codewords. [sent-150, score-0.749]

43 f : s → {−1, 1}^d, g : t → {−1, 1}^d. We also want the Hamming distance of the codeword of a name to the codeword of its transliteration to be small. [sent-151, score-0.542]

44 We call the system that uses the hash function learnt from bilingual data B-HASH. [sent-203, score-0.399]
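As one hedged illustration of how such a pair of functions could be learnt with Canonical Correlation Analysis, the sketch below uses scikit-learn's CCA over aligned feature matrices of transliteration pairs and sign-binarizes the projections; the feature extraction, any regularization, and the paper's exact training procedure are not shown.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def learn_bilingual_hashes(X, Y, d=32):
        """X, Y: aligned feature matrices for transliteration pairs (one row per pair).
        Returns hash functions f (first language/script) and g (second language/script)."""
        cca = CCA(n_components=d, scale=False)   # scale=False so the manual centering below matches
        cca.fit(X, Y)
        x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)

        def f(x):
            # project onto the learnt directions and binarize coordinate-wise
            return np.where((x - x_mean) @ cca.x_rotations_ >= 0, 1, -1)

        def g(y):
            return np.where((y - y_mean) @ cca.y_rotations_ >= 0, 1, -1)

        return f, g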

45 3 Similarity Score In this section, we develop new techniques for computing the similarity of names at the token level as well as at the whole-name level. [sent-204, score-0.433]

46 1 Token-level Similarity We use a logistic function over multiple distance measures to compute the similarity between name tokens s and s′. [sent-209, score-0.533]

47 While a variety of distance measures can be employed in Equation 16, two obvious choices … (Footnote: As a byproduct of bilingual learning, we can hash names in the second language using g.) [sent-212, score-0.728]
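A minimal sketch of such a token-level score, assuming the K distance values and the logistic weights are supplied by the caller (the weights would be learnt, which is not shown here):

    import math

    def token_similarity(distances, weights, bias=0.0):
        """Logistic squashing of a weighted combination of K distance measures.
        distances, weights: parallel sequences of length K (weights assumed learnt)."""
        z = bias + sum(w * dist for w, dist in zip(weights, distances))
        return 1.0 / (1.0 + math.exp(z))   # larger distances -> lower similarity in (0, 1)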

48 4 Spelling Correction using Hashing In this section, we describe our algorithm for spelling correction using hashing as a building block. [sent-240, score-0.712]

49 1 Indexing the Name Directory Given a name directory, we break each name into its constituent tokens and form a set of distinct name tokens. [sent-242, score-1.186]

50 Using the name tokens and the original names, we build an inverted index which, for each name token, lists all the names that have the token as a constituent. [sent-243, score-1.15]

51 Further, we hash each name token into a d-bit codeword as described in Equation 6 (and resp. [sent-244, score-0.941]

52 Equation 15) when using the hash function learnt on monolingual data (and resp. [sent-245, score-0.385]
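A compact sketch of this indexing step (Python; hash_token stands in for the learnt hash function, and whitespace tokenization is an assumption for illustration):

    from collections import defaultdict

    def index_directory(names, hash_token):
        """Build the inverted index (token -> names containing it) and the codeword
        hash table (integer codeword -> tokens hashing to it)."""
        inverted = defaultdict(set)
        hash_table = defaultdict(list)
        for name in names:
            for token in name.lower().split():
                inverted[token].add(name)
        for token in inverted:
            hash_table[hash_token(token)].append(token)
        return inverted, hash_table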

53 Given a query with tokens s1, . . . , sI, we hash each si into a codeword yi and retrieve all codewords in the hash table that are at a Hamming distance of r or less from yi. [sent-255, score-0.96]

54 We rank the name tokens thus retrieved using the token level similarity score of Section 3. [sent-256, score-0.547]

55 Using the top tokens, we get all names which contain any of the name tokens as a constituent, to form the pool of candidates C for the NAME MATCHING stage. [sent-258, score-0.719]
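Putting the two stages together, a hedged end-to-end sketch might look as follows; approximate_matches, the inverted index, and the hash table follow the earlier sketches, while token_score and name_score are placeholders for the similarity functions of Section 3:

    def suggest_spelling(query, hash_token, hash_table, inverted,
                         token_score, name_score, top_k=10, r=2):
        """NAME BUCKETING: Hamming-radius retrieval of candidate tokens, then pooling of
        candidate names; NAME MATCHING: rank the pool with a whole-name score."""
        candidates = set()
        for q_tok in query.lower().split():
            tokens = approximate_matches(hash_token(q_tok), hash_table, r=r)
            best_tokens = sorted(tokens, key=lambda t: token_score(q_tok, t),
                                 reverse=True)[:top_k]
            for t in best_tokens:
                candidates.update(inverted[t])
        # NAME MATCHING stage: return the globally best-scoring candidate name
        return max(candidates, key=lambda name: name_score(query, name), default=None)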

56 Apart from evaluating the systems on test sets using different name directories, we were interested in comparing our systems with several baselines, understanding the effect of some of the choices we made (e. [sent-264, score-0.382]

57 1 Experimental Setup We tested the proposed hashing-based spelling correction algorithms on two test sets: • DUMBTIONARY: 1231 misspellings of various names from Dumbtionary and a name directory consisting of about 550,000 names gleaned from the English Wikipedia. [sent-268, score-1.794]

58 Each of the misspellings had a correct spelling in the name directory. [sent-269, score-0.836]

59 INTRANET: 200 misspellings of employee names taken from the search logs of the intranet of a large organization and a name directory. [sent-270, score-0.72]

60 Each of the misspellings had a correct spelling in the name directory. [sent-273, score-0.836]

61 1 Training For M-HASH, we used 30,000 single token names in English (sampled from the list of names in the Internet Movie Database) as training data, and for B-HASH we used 14,941 parallel single token names in English-Hindi. [sent-286, score-0.989]

62 Each name was represented as a feature vector over character bigrams. [sent-287, score-0.382]

63 Thus, the name token Klein has the bigrams {·k, kl, le, ei, in, n·} as the features. [sent-288, score-0.431]
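A small sketch of this feature extraction, using "·" as the boundary marker exactly as in the Klein example:

    from collections import Counter

    def char_bigrams(token, boundary="·"):
        """Character-bigram features with boundary markers, so that 'Klein'
        yields {'·k', 'kl', 'le', 'ei', 'in', 'n·'}."""
        padded = boundary + token.lower() + boundary
        return Counter(padded[i:i + 2] for i in range(len(padded) - 1))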

64 For both M-HASH and B-HASH we used the top 32 eigenvectors to form the hash function, resulting in a 32-bit representation for every name token. [sent-293, score-0.762]

65 2 Performance Metric We measured the performance of all the systems using Precision@1, the fraction of names for which a correct spelling was suggested at Rank 1. [sent-296, score-0.66]
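For concreteness, Precision@1 can be computed as in the following sketch (the input containers are assumptions):

    def precision_at_1(suggestions, gold):
        """suggestions: dict misspelling -> ranked list of suggested names;
        gold: dict misspelling -> correct spelling. Returns Precision@1."""
        hits = sum(1 for q, ranked in suggestions.items()
                   if ranked and ranked[0] == gold[q])
        return hits / len(suggestions)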

66 To use the BM25 algorithm for spelling correction, we represented each name as a bag of bigrams and set the parameters K and b to 2 and 0. [sent-308, score-0.698]

67 B-HASH trained with just 1000 name pairs gives 95. [sent-362, score-0.382]

68 5% of the performance of B-HASH trained with 15000 name pairs. [sent-363, score-0.382]

69 5% of the performance of M-HASH trained with 30000 name pairs. [sent-365, score-0.382]

70 It turns out that in the DUMBTIONARY test set, for 81 misspelled names, both M-HASH and B-HASH failed to suggest the correct name at rank 1. [sent-399, score-0.502]

71 Similarly, in the case of INTRANET test set, both M-HASH and B-HASH failed to suggest the correct name at rank 1 for 47 queries. [sent-400, score-0.502]

72 However, BHASH was able to suggest correct names for some of the queries where M-HASH failed. [sent-402, score-0.43]

73 And interestingly, in the DUMBTIONARY test set, the average edit distance of the query and the correct name, for the cases where M-HASH failed to get the correct name in the top 10 while B-HASH got it at rank 1, was 2. [sent-404, score-1.251]

74 This could be because M-HASH attempts to map names with smaller edit distances to similar codewords. [sent-406, score-0.426]

75 For the first query, M-HASH suggested the correct name whereas B-HASH did not. [sent-408, score-0.429]

76 And for the third query, B-HASH suggested the correct name whereas M-HASH did not. [sent-410, score-0.429]

77 1 MB for the employees name directory (150,000 names) and 13. [sent-423, score-0.579]

78 8 MB for the Wikipedia name directory (550,000 names). [sent-424, score-0.549]

79 The first approach to spelling correction made use of a lexicon to correct out-of-lexicon terms by finding the closest in-lexicon word (Damerau, 1964). [sent-428, score-0.607]

80 The next class of approaches applied the noisy channel model to correct single word spelling errors (Kernighan et al. [sent-430, score-0.431]

81 A major flaw of single word spelling correction algorithms is that they do not make use of the context of the word in correcting the errors. [sent-432, score-0.617]

82 Recently, several works have leveraged the Web for improved spelling correction (Chen et al. [sent-434, score-0.56]

83 Spelling correction algorithms targeted for web-search queries have been developed making use of query logs and click-thru data (Cucerzan and Brill, 2004), (Ahmad and Kondrak, 2005), (Sun et al. [sent-437, score-0.542]

84 None of these approaches focus exclusively on correcting name misspellings. [sent-439, score-0.439]

85 Most of these techniques either give poor retrieval performance on large name directories or do not scale. [sent-447, score-0.595]

86 Although LSH guarantees that asymptotically the Hamming distance between the codewords approaches the Euclidean distance between the data objects, it is known to produce long codewords making it practically inefficient. [sent-451, score-0.416]

87 Recently, data-aware approaches that employ Machine Learning techniques to learn hash functions have been proposed and shown to be a lot more effective than LSH on both synthetic and real data. [sent-452, score-0.394]

88 7 Conclusions We developed two hashing-based techniques for spelling correction of person names in People Search applications. [sent-457, score-0.898]

89 To the best of our knowledge, these are the first techniques that focus exclusively on correcting spelling mistakes in person names. [sent-458, score-0.414]

90 Our approach has several advantages over other spelling correction techniques. [sent-459, score-0.56]

91 Further, as we suggest spellings from only authoritative name directories, the suggestions are always well formed and coherent. [sent-461, score-0.449]

92 Neither do we require pairs of misspelled names and their correct spellings for learning the error model (unlike Brill and Moore, 2000), nor a large-coverage general purpose lexicon (unlike Cucerzan and Brill, 2004), nor pronunciation dictionaries (unlike Toutanova and Moore, 2002). [sent-463, score-0.566]

93 Fourthly, we do not iteratively process the misspelled name, unlike (Cucerzan and Brill, 2004). [sent-465, score-0.5]

94 Fifthly, we handle large name directories efficiently unlike the spectrum of name matching techniques discussed in (Pfeifer et al. [sent-466, score-1.035]

95 As future work, we would like to explore the possibility of learning hash functions using 1) bilingual and monolingual data together and 2) multiple conjugate languages. [sent-469, score-0.525]

96 Learning a spelling error model from search query logs. [sent-472, score-0.569]

97 A comparison of personal name matching: techniques and practical issues. [sent-509, score-0.535]

98 A technique for computer detection and correction of spelling errors. [sent-528, score-0.56]

99 Real-word spelling correction using Google Web 1T n-gram data set. [sent-554, score-0.56]

100 A spelling correction program based on a noisy channel model. [sent-568, score-0.599]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('name', 0.382), ('spelling', 0.316), ('hash', 0.311), ('names', 0.297), ('correction', 0.244), ('cucerzan', 0.182), ('directory', 0.167), ('query', 0.162), ('hashing', 0.152), ('codewords', 0.143), ('codeword', 0.13), ('brill', 0.124), ('dumbtionary', 0.121), ('intranet', 0.121), ('directories', 0.117), ('personal', 0.112), ('hamming', 0.094), ('edit', 0.093), ('search', 0.091), ('bucketing', 0.091), ('misspellings', 0.091), ('queries', 0.086), ('misspelled', 0.082), ('kondrak', 0.078), ('matching', 0.077), ('conjugate', 0.076), ('bhash', 0.076), ('metaphone', 0.076), ('pfeifer', 0.076), ('bit', 0.069), ('distance', 0.065), ('ahmad', 0.065), ('lsh', 0.061), ('sgn', 0.061), ('people', 0.059), ('correcting', 0.057), ('bilingual', 0.055), ('retrieval', 0.055), ('weiss', 0.054), ('response', 0.053), ('india', 0.052), ('logs', 0.05), ('token', 0.049), ('phonetic', 0.048), ('moore', 0.048), ('correct', 0.047), ('eigenvalue', 0.047), ('ta', 0.046), ('similarity', 0.046), ('fr', 0.046), ('christen', 0.045), ('friend', 0.045), ('mhash', 0.045), ('shakhnarovich', 0.045), ('sideli', 0.045), ('typographical', 0.045), ('friedman', 0.043), ('bt', 0.043), ('failed', 0.043), ('functions', 0.042), ('monolingual', 0.041), ('techniques', 0.041), ('double', 0.04), ('tokens', 0.04), ('facebook', 0.039), ('channel', 0.039), ('spectral', 0.038), ('distances', 0.036), ('codes', 0.036), ('unlike', 0.036), ('authoritative', 0.035), ('cro', 0.035), ('relaxation', 0.034), ('learnt', 0.033), ('spellings', 0.032), ('rank', 0.03), ('andoni', 0.03), ('atd', 0.03), ('binarizing', 0.03), ('clijsters', 0.03), ('employees', 0.03), ('eword', 0.03), ('golub', 0.03), ('hardoon', 0.03), ('kernighan', 0.03), ('misspelling', 0.03), ('navarro', 0.03), ('optmization', 0.03), ('piotr', 0.03), ('pradeep', 0.03), ('probablity', 0.03), ('ravikumar', 0.03), ('ristad', 0.03), ('signficantly', 0.03), ('trusted', 0.03), ('tb', 0.03), ('reduced', 0.03), ('want', 0.03), ('errors', 0.029), ('baselines', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. Moreover, the method that uses bilingual data for learning hash functions gives the best performance.

2 0.18027508 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function, that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.

3 0.15045866 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie

Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliteration-based approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or comparable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.

4 0.11536229 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

Author: Alla Rozovskaya ; Dan Roth

Abstract: In this paper, we consider the problem of generating candidate corrections for the task of correcting errors in text. We focus on the task of correcting errors in preposition usage made by non-native English speakers, using discriminative classifiers. The standard approach to the problem assumes that the set of candidate corrections for a preposition consists of all preposition choices participating in the task. We determine likely preposition confusions using an annotated corpus of nonnative text and use this knowledge to produce smaller sets of candidates. We propose several methods of restricting candidate sets. These methods exclude candidate prepositions that are not observed as valid corrections in the annotated corpus and take into account the likelihood of each preposition confusion in the non-native text. We find that restricting candidates to those that are observed in the non-native data improves both the precision and the recall compared to the approach that views all prepositions as possible candidates. Furthermore, the approach that takes into account the likelihood of each preposition confusion is shown to be the most effective.

5 0.11356907 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regression model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

6 0.10902929 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

7 0.094314888 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

8 0.072850697 95 emnlp-2010-SRL-Based Verb Selection for ESL

9 0.068863101 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

10 0.068026766 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

11 0.061423454 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

12 0.051369678 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

13 0.043264262 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

14 0.038477432 16 emnlp-2010-An Approach of Generating Personalized Views from Normalized Electronic Dictionaries : A Practical Experiment on Arabic Language

15 0.036630854 84 emnlp-2010-NLP on Spoken Documents Without ASR

16 0.036476821 63 emnlp-2010-Improving Translation via Targeted Paraphrasing

17 0.034964837 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

18 0.034789804 69 emnlp-2010-Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks

19 0.034101028 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

20 0.033358116 109 emnlp-2010-Translingual Document Representations from Discriminative Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.142), (1, 0.059), (2, -0.089), (3, 0.154), (4, 0.025), (5, 0.064), (6, -0.222), (7, -0.053), (8, -0.184), (9, 0.092), (10, -0.122), (11, 0.25), (12, 0.097), (13, 0.051), (14, -0.013), (15, -0.097), (16, 0.125), (17, -0.069), (18, 0.065), (19, 0.09), (20, -0.168), (21, -0.195), (22, -0.164), (23, 0.198), (24, -0.057), (25, -0.276), (26, -0.162), (27, 0.127), (28, 0.107), (29, -0.06), (30, -0.02), (31, -0.048), (32, -0.02), (33, -0.045), (34, 0.037), (35, 0.043), (36, -0.009), (37, 0.007), (38, -0.043), (39, 0.06), (40, 0.058), (41, 0.053), (42, 0.069), (43, -0.065), (44, 0.057), (45, 0.005), (46, -0.023), (47, -0.026), (48, -0.039), (49, -0.057)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97380006 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. More over, the method that uses bilingual data for learning hash functions gives the best performance.

2 0.68903565 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function, that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current stateof-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1byte per n-gram.

3 0.49291518 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie

Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliterationbased approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or compa- rable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.

4 0.37880176 73 emnlp-2010-Learning Recurrent Event Queries for Web Search

Author: Ruiqiang Zhang ; Yuki Konda ; Anlei Dong ; Pranam Kolari ; Yi Chang ; Zhaohui Zheng

Abstract: Recurrent event queries (REQ) constitute a special class of search queries occurring at regular, predictable time intervals. The freshness of documents ranked for such queries is generally of critical importance. REQ forms a significant volume, as much as 6% of query traffic received by search engines. In this work, we develop an improved REQ classifier that could provide significant improvements in addressing this problem. We analyze REQ queries, and develop novel features from multiple sources, and evaluate them using machine learning techniques. From historical query logs, we develop features utilizing query frequency, click information, and user intent dynamics within a search session. We also develop temporal features by time series analysis from query frequency. Other generated features include word matching with recurrent event seed words and time sensitivity of search result set. We use Naive Bayes, SVM and decision tree based logistic regres- sion model to train REQ classifier. The results on test data show that our models outperformed baseline approach significantly. Experiments on a commercial Web search engine also show significant gains in overall relevance, and thus overall user experience.

5 0.37426388 54 emnlp-2010-Generating Confusion Sets for Context-Sensitive Error Correction

Author: Alla Rozovskaya ; Dan Roth

Abstract: In this paper, we consider the problem of generating candidate corrections for the task of correcting errors in text. We focus on the task of correcting errors in preposition usage made by non-native English speakers, using discriminative classifiers. The standard approach to the problem assumes that the set of candidate corrections for a preposition consists of all preposition choices participating in the task. We determine likely preposition confusions using an annotated corpus of nonnative text and use this knowledge to produce smaller sets of candidates. We propose several methods of restricting candidate sets. These methods exclude candidate prepositions that are not observed as valid corrections in the annotated corpus and take into account the likelihood of each preposition confusion in the non-native text. We find that restricting candidates to those that are ob- served in the non-native data improves both the precision and the recall compared to the approach that views all prepositions as possible candidates. Furthermore, the approach that takes into account the likelihood of each preposition confusion is shown to be the most effective.

6 0.34549758 55 emnlp-2010-Handling Noisy Queries in Cross Language FAQ Retrieval

7 0.32908374 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

8 0.26787856 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

9 0.23536783 91 emnlp-2010-Practical Linguistic Steganography Using Contextual Synonym Substitution and Vertex Colour Coding

10 0.20094678 90 emnlp-2010-Positional Language Models for Clinical Information Retrieval

11 0.19016299 82 emnlp-2010-Multi-Document Summarization Using A* Search and Discriminative Learning

12 0.18994997 95 emnlp-2010-SRL-Based Verb Selection for ESL

13 0.16572411 62 emnlp-2010-Improving Mention Detection Robustness to Noisy Input

14 0.16116242 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

15 0.14450155 17 emnlp-2010-An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL

16 0.13917814 106 emnlp-2010-Top-Down Nearly-Context-Sensitive Parsing

17 0.12936731 84 emnlp-2010-NLP on Spoken Documents Without ASR

18 0.12791392 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

19 0.11756537 114 emnlp-2010-Unsupervised Parse Selection for HPSG

20 0.11456081 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(12, 0.103), (29, 0.06), (30, 0.012), (32, 0.018), (52, 0.019), (56, 0.036), (62, 0.03), (66, 0.072), (72, 0.042), (76, 0.029), (87, 0.011), (89, 0.463)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83750033 56 emnlp-2010-Hashing-Based Approaches to Spelling Correction of Personal Names

Author: Raghavendra Udupa ; Shaishav Kumar

Abstract: We propose two hashing-based solutions to the problem of fast and effective personal names spelling correction in People Search applications. The key idea behind our methods is to learn hash functions that map similar names to similar (and compact) binary codewords. The two methods differ in the data they use for learning the hash functions - the first method uses a set of names in a given language/script whereas the second uses a set of bilingual names. We show that both methods give excellent retrieval performance in comparison to several baselines on two lists of misspelled personal names. More over, the method that uses bilingual data for learning hash functions gives the best performance.

2 0.59058809 6 emnlp-2010-A Latent Variable Model for Geographic Lexical Variation

Author: Jacob Eisenstein ; Brendan O'Connor ; Noah A. Smith ; Eric P. Xing

Abstract: The rapid growth of geotagged social media raises new computational possibilities for investigating geographic linguistic variation. In this paper, we present a multi-level generative model that reasons jointly about latent topics and geographical regions. High-level topics such as “sports” or “entertainment” are rendered differently in each geographic region, revealing topic-specific regional distinctions. Applied to a new dataset of geotagged microblogs, our model recovers coherent topics and their regional variants, while identifying geographic areas of linguistic consistency. The model also enables prediction of an author’s geographic location from raw text, outperforming both text regression and supervised topic models.

3 0.58713305 41 emnlp-2010-Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models

Author: Amarnag Subramanya ; Slav Petrov ; Fernando Pereira

Abstract: We describe a new scalable algorithm for semi-supervised training of conditional random fields (CRF) and its application to partof-speech (POS) tagging. The algorithm uses a similarity graph to encourage similar ngrams to have similar POS tags. We demonstrate the efficacy of our approach on a domain adaptation task, where we assume that we have access to large amounts of unlabeled data from the target domain, but no additional labeled data. The similarity graph is used during training to smooth the state posteriors on the target domain. Standard inference can be used at test time. Our approach is able to scale to very large problems and yields significantly improved target domain accuracy.

4 0.33665097 79 emnlp-2010-Mining Name Translations from Entity Graph Mapping

Author: Gae-won You ; Seung-won Hwang ; Young-In Song ; Long Jiang ; Zaiqing Nie

Abstract: This paper studies the problem of mining entity translation, specifically, mining English and Chinese name pairs. Existing efforts can be categorized into (a) a transliterationbased approach leveraging phonetic similarity and (b) a corpus-based approach exploiting bilingual co-occurrences, each of which suffers from inaccuracy and scarcity respectively. In clear contrast, we use unleveraged resources of monolingual entity co-occurrences, crawled from entity search engines, represented as two entity-relationship graphs extracted from two language corpora respectively. Our problem is then abstracted as finding correct mappings across two graphs. To achieve this goal, we propose a holistic approach, of exploiting both transliteration similarity and monolingual co-occurrences. This approach, building upon monolingual corpora, complements existing corpus-based work, requiring scarce resources of parallel or compa- rable corpus, while significantly boosting the accuracy of transliteration-based work. We validate our proposed system using real-life datasets.

5 0.33101794 101 emnlp-2010-Storing the Web in Memory: Space Efficient Language Models with Constant Time Retrieval

Author: David Guthrie ; Mark Hepple

Abstract: We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function, that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current stateof-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1byte per n-gram.

6 0.32187831 107 emnlp-2010-Towards Conversation Entailment: An Empirical Investigation

7 0.30913094 123 emnlp-2010-Word-Based Dialect Identification with Georeferenced Rules

8 0.30699453 95 emnlp-2010-SRL-Based Verb Selection for ESL

9 0.30287665 110 emnlp-2010-Turbo Parsers: Dependency Parsing by Approximate Variational Inference

10 0.29957646 32 emnlp-2010-Context Comparison of Bursty Events in Web Search and Online Media

11 0.29712126 66 emnlp-2010-Inducing Word Senses to Improve Web Search Result Clustering

12 0.29633078 92 emnlp-2010-Predicting the Semantic Compositionality of Prefix Verbs

13 0.29223949 84 emnlp-2010-NLP on Spoken Documents Without ASR

14 0.29078123 58 emnlp-2010-Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation

15 0.29044297 35 emnlp-2010-Discriminative Sample Selection for Statistical Machine Translation

16 0.28775683 45 emnlp-2010-Evaluating Models of Latent Document Semantics in the Presence of OCR Errors

17 0.28293124 49 emnlp-2010-Extracting Opinion Targets in a Single and Cross-Domain Setting with Conditional Random Fields

18 0.28284496 72 emnlp-2010-Learning First-Order Horn Clauses from Web Text

19 0.28129563 67 emnlp-2010-It Depends on the Translation: Unsupervised Dependency Parsing via Word Alignment

20 0.28076059 8 emnlp-2010-A Multi-Pass Sieve for Coreference Resolution