acl acl2011 acl2011-11 knowledge-graph by maker-knowledge-mining

11 acl-2011-A Fast and Accurate Method for Approximate String Search


Source: pdf

Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang

Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. [sent-5, score-0.767]

2 Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. [sent-6, score-0.588]

3 The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. [sent-9, score-0.702]

4 The learning method employs the criterion in candidate generation as loss function. [sent-10, score-0.403]

5 Given a query string, a dictionary of strings (vocabulary), and a set of operators, the system returns the top k strings in the dictionary that can be transformed from the query string by applying several operators in the operator set. [sent-14, score-0.621]

6 Here each operator is a rule that can replace a substring in the query string with another substring. [sent-15, score-0.392]
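
To make the operator formalism concrete, here is a minimal Python sketch (hypothetical names, not the authors' code) of applying a single rule α → β to a query string; ^ and $ are the start and end symbols of a word, as in the rule notation used later in this summary.

    # Illustrative sketch: a rule alpha -> beta rewrites one occurrence of a
    # substring in the query; "^" and "$" mark the start and end of the word,
    # so rules can be position-sensitive (e.g. "^ni" matches only at the start).
    def apply_rule(query, alpha, beta):
        """Return all strings obtained by replacing one occurrence of alpha with beta."""
        results = []
        start = query.find(alpha)
        while start != -1:
            results.append(query[:start] + beta + query[start + len(alpha):])
            start = query.find(alpha, start + 1)
        return results

    print(apply_rule("^nicrosoft$", "ni", "mi"))  # ['^microsoft$']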

7 Approximate string search is useful in many applications including spelling error correction, similar terminology retrieval, duplicate detection, etc. [sent-24, score-0.459]

8 Without loss of generality, in this paper we address candidate generation in spelling error correction. [sent-26, score-0.557]

9 Candidate generation aims to find the most likely corrections of a misspelled word. [sent-27, score-0.46]

10 a single word; after candidate generation, the words surrounding it in the text can be further leveraged to make the final candidate selection, e.g. [sent-30, score-0.353]

11 In spelling error correction, Brill and Moore (2000) proposed employing a generative model for candidate generation and a hierarchy of trie structures for fast candidate retrieval. [sent-34, score-0.865]

12 assumes that there is only one rule applicable each time in candidate generation. [sent-41, score-0.363]

13 There are two fundamental problems in research on approximate string search: (1) how to build a model that can achieve both high accuracy and efficiency, and (2) how to develop a data structure and algorithm that can facilitate efficient retrieval of the top k candidates. [sent-43, score-0.461]

14 It employs (a) a log-linear (discriminative) model for candidate generation, (b) an effective algorithm for model learning, and (c) an efficient algorithm for candidate retrieval. [sent-46, score-0.479]

15 The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction given the misspelled word. [sent-47, score-0.702]

16 The retrieval algorithm uses special data structures and efficiently finds the top k candidates. [sent-50, score-0.318]

17 We empirically evaluated the proposed method in spelling error correction of web search queries. [sent-52, score-0.628]

18 The experimental results have verified that the accuracy of the top candidates given by our method is significantly higher than that of the baseline methods. [sent-53, score-0.375]

19 Our method is more accurate than the baseline methods in different settings such as large rule sets and large vocabulary sizes. [sent-54, score-0.301]

20 The efficiency of our method is also very high in different experimental settings. [sent-55, score-0.269]

21 Instead of finding all the candidates in a fixed range, methods for finding the top k candidates have also been developed. [sent-64, score-0.374]

22 Spelling error correction normally consists of candidate generation and final candidate selection. [sent-69, score-0.674]

23 Note that candidate generation is only concerned with a single word. [sent-71, score-0.259]

24 Some methods generate candidates within a fixed range of edit distance or different ranges for strings with different lengths (Li et al. [sent-74, score-0.315]

25 Schaback and Li (2007) proposed a multi-level feature-based framework for spelling error correction including a modification of Brill and Moore’s model (2000). [sent-83, score-0.474]

26 (2008) utilized substring substitution rules and incorporated the rules into an L1-regularized logistic regression model. [sent-85, score-0.495]

27 Their model is a binary classification model and it is assumed that only a single rule is applied in candidate generation. [sent-88, score-0.28]

28 Since users’ misspelling and correction behavior can be frequently observed in web search log data, it has been proposed to mine spelling-error and correction pairs from such log data. [sent-89, score-0.739]

29 The mined pairs can be directly used in spelling error correction. [sent-90, score-0.347]

30 Methods of selecting spelling and correction pairs with maximum entropy model (Chen et al. [sent-91, score-0.461]

31 The mined pairs can only be used in candidate generation of high frequency typos, however. [sent-94, score-0.34]

32 In this paper, we work on candidate generation at the character level, which can be applied to spelling error correction for both high and low frequency words. [sent-95, score-0.733]

33 [Model for Candidate Generation] As an example of approximate string search, we consider candidate generation in spelling correction. [sent-96, score-0.684]

34 Suppose that there is a vocabulary V and a misspelled word; the objective of candidate generation is to select the best corrections from the vocabulary V. [sent-97, score-0.75]

35 We care about both accuracy and efficiency of the process. [sent-98, score-0.247]

36 The problem is very challenging when the size of the vocabulary is large, because there are a large number of potential candidates to be verified. [sent-99, score-0.3]

37 In our approach, it is assumed that a large number of misspelled words and their best corrections are given as training data. [sent-101, score-0.36]

38 The best candidates for correction of a misspelled word are thus defined as those candidates having the highest probabilistic scores with respect to the training data and the operators. [sent-103, score-0.812]

39 [Figure 1: Example of rule extraction from a word pair, showing an edit-distance based alignment of ^microsoft$, the derived rules, and the expanded rules with context.] [sent-105, score-0.439]

40 An operator is formally represented as a rule α → β that replaces a substring α in a misspelled word with β, where α, β ∈ {s | s = t, s = ˆt, or s = t$}, t ∈ Σ∗, and Σ∗ is the set of all possible strings over the alphabet. [sent-108, score-0.553]

41 If we can apply a set of rules to transform the misspelled word wm to a correct word wc in the vocabulary, then we call the rule set a “transformation” for the word pair wm and wc. [sent-115, score-1.498]

42 Without loss of generality, we set the maximum number of rules applicable to a word pair to be a fixed number. [sent-119, score-0.274]

43 Given word pair (wm, wc), let R(wm, wc) denote one transformation (a set of rules) that can rewrite wm to wc. [sent-122, score-0.433]

44 We consider that there is a probabilistic mapping between the misspelled word wm and correct word wc plus transformation R(wm, wc). [sent-123, score-0.973]
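
As a rough illustration of this probabilistic mapping (a sketch with toy, assumed weights, not the paper's trained model), the log-linear formulation scores a pair (wc, R) by the sum of the weights of the rules in R, normalized over the transformations considered for the same wm:

    import math

    # Toy, non-positive rule weights; placeholder values for illustration only.
    weights = {("ni", "mi"): -0.1, ("oft", "of"): -0.9}

    def log_score(rule_set, weights):
        """Unnormalized log-probability of one transformation: sum of rule weights."""
        return sum(weights[r] for r in rule_set)

    def probability(rule_set, all_rule_sets, weights):
        """Normalize over all transformations considered for the same wm (sketch)."""
        z = sum(math.exp(log_score(R, weights)) for R in all_rule_sets)
        return math.exp(log_score(rule_set, weights)) / z

    R1, R2 = (("ni", "mi"),), (("ni", "mi"), ("oft", "of"))
    print(log_score(R1, weights), probability(R1, [R1, R2], weights))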

45 To improve efficiency in retrieval, we further assume that all the weights are non-positive, i.e., ∀λ ≤ 0. [sent-128, score-0.248]

46 It introduces monotonicity in rule application and implies that applying additional rules cannot lead to generation of better candidates. [sent-131, score-0.261]

47 Our experimental results have shown that the change in accuracy by making the assumption is negligible, but the gain in efficiency is very large. [sent-135, score-0.282]

48 [Training of Model] Training data is given as a set of pairs T = {(wmi, wci)}_{i=1}^N, where wmi is a misspelled word and wci ∈ V is a correction of wmi. [sent-137, score-0.537]

49 This is not a trivial problem, however, because the “true” transformation R∗(wmi, wci) for each word pair wmi and wci is not given in the training data. [sent-139, score-0.502]

50 (It is relatively easy to automatically find the pairs wmi and wci, as explained in Section 5.) [sent-141, score-0.443]

51 In this paper, we assume that the transformation that actually generates the correction among all the possible transformations is the one that gives the maximum conditional probability; exactly the same criterion is also used for fast prediction. [sent-143, score-0.429]

52 Therefore we have the following objective function: λ∗ = argmax_λ L(λ) = argmax_λ Σ_i max_{R(wmi, wci)} log P(wci, R(wmi, wci) | wmi) (2), where λ denotes the weight parameters and the max is taken over the set of transformations that can transform wmi to wci. [sent-144, score-0.285]
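
The inner max of this objective can be sketched as follows (illustrative only; how the candidate transformations are enumerated is assumed to come from the rule-extraction step):

    # For one training pair (wmi, wci), pick the transformation with the highest
    # model score -- the inner max of the objective in Equ. 2 (sketch only).
    def best_transformation(transformations, weights):
        """transformations: candidate rule sets R(wmi, wci) for one word pair."""
        return max(transformations, key=lambda R: sum(weights[r] for r in R))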

53 [Candidate Generation] In candidate generation, given a misspelled word wm, we find the k candidates from the vocabulary that can be transformed from wm and have the largest probabilities assigned by the learned model. [sent-150, score-0.947]

54 We only need to utilize the following ranking function to rank a candidate wc given a misspelled word wm, by taking into account Equs. [sent-151, score-0.773]

55 We then choose the sum as a ranking score, which is equivalent to ranking candidates based on their largest conditional probabilities. [sent-153, score-0.337]

56 Our retrieval algorithm is guaranteed to find the optimal k candidates with some “pruning” techniques. [sent-162, score-0.295]

57 One is a trie for storing and matching words in the vocabulary, referred to as the vocabulary trie; the other is based on what we call an Aho-Corasick tree (AC tree) (Aho and Corasick, 1975), which is used for storing and applying correction rules, referred to as the rule index. [sent-166, score-0.685]

58 The vocabulary trie is the same as that used in existing work and will be traversed when searching for the top k candidates. [sent-167, score-0.35]

59 Our rule index is unique because it indexes all the rules based on an AC tree. [sent-168, score-0.332]

60 The AC tree is a trie with “failure links”, on which the Aho-Corasick string matching algorithm can be executed. [sent-169, score-0.293]

61 One may further improve the index structure by using a trie rather than a ranking list to store the βs associated with the same α. [sent-175, score-0.267]
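
The sketch below mimics the role of the rule index with a plain dictionary keyed by α (an assumed simplification: the paper stores the rules on an Aho-Corasick tree so that all applicable α's are located in one linear-time pass, which this naive substring scan does not provide):

    from collections import defaultdict

    def build_rule_index(rules):
        """Group rules by their left-hand side alpha."""
        index = defaultdict(list)
        for alpha, beta in rules:
            index[alpha].append(beta)
        return index

    def locate_applicable_rules(word, index):
        """Return every (position, alpha, beta) whose alpha occurs in word;
        a naive stand-in for Aho-Corasick matching on the rule index."""
        hits = []
        for i in range(len(word)):
            for j in range(i + 1, len(word) + 1):
                for beta in index.get(word[i:j], ()):
                    hits.append((i, word[i:j], beta))
        return hits

    index = build_rule_index([("ni", "mi"), ("o", "ou")])
    print(locate_applicable_rules("^nicrosoft$", index))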

62 Our algorithm first employs the Aho-Corasick algorithm to locate all the applicable α’s within the input word wm using the rule index. [sent-181, score-0.353]

63 Our algorithm next traverses the vocabulary trie and searches for the top k candidates with some pruning techniques. [sent-184, score-0.626]

64 1) If the current sum of weights of applied rules is smaller than the smallest weight in the top k list, the search branch is pruned. [sent-189, score-0.351]

65 2) If two search branches merge at the same node in the vocabulary trie as well as the same position on wm, the branch with the smaller sum of weights is pruned. [sent-192, score-0.493]

66 It is not difficult to prove that our algorithm is guaranteed to find the best k candidates in terms of the ranking scores, because we only prune those candidates that cannot give better scores than the ones in the current top k list. [sent-195, score-0.527]
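
The pruning idea can be sketched as follows (a simplified stand-in: the real algorithm walks the vocabulary trie, whereas here each search branch is summarized by its weight bound and the candidates it would yield; all names are hypothetical):

    import heapq

    def top_k_with_pruning(branches, k):
        """branches: iterable of (bound, completions) pairs, where bound is the
        current sum of applied-rule weights (non-positive, so no completion can
        score above it) and completions yields (word, score) pairs."""
        heap = []  # min-heap holding the current top k (score, word) pairs
        for bound, completions in branches:
            # Pruning rule 1: this branch cannot beat the current k-th best score.
            if len(heap) == k and bound <= heap[0][0]:
                continue
            for word, score in completions:
                if len(heap) < k:
                    heapq.heappush(heap, (score, word))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, word))
        return sorted(heap, reverse=True)

    branches = [(0.0, [("microsoft", -0.1)]), (-2.0, [("nicrosort", -2.5)])]
    print(top_k_with_pruning(branches, k=1))  # [(-0.1, 'microsoft')]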

67 Due to the limitation of space, we omit the proof of the theorem that if the weights of rules λ are non-positive and the ranking function is defined as in Equ. 5, [sent-196, score-0.282]

68 then the top k candidates obtained with the pruning criteria are the same as the top k candidates obtained without pruning. [sent-197, score-0.521]

69 [Experimental Results] We have experimentally evaluated our approach in spelling error correction of queries in web search. [sent-198, score-0.621]

70 (1) The vocabulary of queries in web search is extremely large due to the scale, diversity, and dynamics of the Internet. [sent-200, score-0.335]

71 (2) Efficiency is critically important, because the response time of top k candidate retrieval for web search must be kept very low. [sent-201, score-0.403]

72 Our approach for candidate generation is in fact motivated by the application. [sent-202, score-0.259]

73 It is easy to observe from search session data that there are many spelling errors and their corrections occurring in the same sessions. [sent-205, score-0.414]

74 We employed heuristics to automatically mine training pairs from search session data at a commercial search engine. [sent-206, score-0.277]

75 We used short sessions here because we found that search users usually correct their misspelled queries very quickly after they find the misspellings. [sent-209, score-0.444]

76 Then the following heuristics were employed to identify pairs of misspelled words and their corrections from two consecutive queries within a session: 1) Two queries have the same number of words. [sent-210, score-0.57]

77 3) For the two distinct words, the word in the first query is considered as misspelled and the second one as its correction. [sent-212, score-0.372]
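
A rough sketch of these session-mining heuristics (the edit-distance threshold and the exact form of the omitted second heuristic are assumptions for illustration, not stated in this summary):

    def edit_distance(a, b):
        """Standard Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def mine_pair(query1, query2, max_edit=3):
        """Return a (misspelled, correction) pair from two consecutive queries,
        or None if the heuristics reject them."""
        w1, w2 = query1.split(), query2.split()
        if len(w1) != len(w2):     # heuristic 1: same number of words
            return None
        diffs = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diffs) != 1:        # assumed: the queries differ in exactly one word
            return None
        misspelled, correction = diffs[0]
        if edit_distance(misspelled, correction) > max_edit:  # assumed threshold
            return None
        return misspelled, correction  # heuristic 3: the first query is misspelled

    print(mine_pair("nicrosoft word", "microsoft word"))  # ('nicrosoft', 'microsoft')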

78 (2008)’s model is not specifically designed for spelling error correction, but it can be employed for the task. [sent-219, score-0.299]

79 We compared our method with the two baselines in terms of top k accuracy, which is the ratio of true corrections among the top k candidates generated by a method. [sent-221, score-0.435]
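
A minimal sketch of this metric (the function names are hypothetical):

    def top_k_accuracy(test_pairs, generate_candidates, k):
        """Fraction of (wm, wc) pairs whose true correction wc appears among
        the top k candidates produced for wm."""
        hits = sum(1 for wm, wc in test_pairs if wc in generate_candidates(wm)[:k])
        return hits / len(test_pairs)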

80 All the methods shared the same settings: 973,902 words in the vocabulary, 10,597 rules for correction, and up to two rules used in one transformation. [sent-222, score-0.318]

81 Next, we conducted experiments to investigate how the top k accuracy changes with different sizes of vocabularies, maximum numbers of applicable rules, and sizes of the rule set for the three methods. [sent-232, score-0.475]

82 However, the drop in accuracy for our method is much smaller than that for the generative model, which means our method is more powerful when the vocabulary is large. [sent-240, score-0.263]

83 In Fig. 5, we changed the maximum number of rules that can be applied to a transformation from 2 to 3. [sent-244, score-0.253]

84 When there are more applicable rules, more candidates can be generated and thus ranking them becomes more challenging. [sent-246, score-0.312]

85 The performance of our method and those of the two baselines did not change much, and our method still visibly outperforms the baselines when more rules are exploited. [sent-251, score-0.331]

86 Instead of comparing with existing methods in terms of efficiency, we evaluated the efficiency of our method by examining how efficient it becomes with its data structure and pruning technique. [sent-257, score-0.36]

87 [Figure 6: Accuracy comparison between baselines and our method with different numbers of rules.] First, we tested the efficiency of using the Aho-Corasick algorithm (the rule index). [sent-272, score-0.358]

88 Since the time complexity of the Aho-Corasick algorithm is determined by the lengths of query strings and the number of matches, we examined how the number of matches on query strings with different lengths changes when the number of rules increases. [sent-278, score-0.471]

89 We can see that the number of matches is not largely affected by the number of rules in the rule index. [sent-281, score-0.28]

90 Next, since the running time of our method is proportional to the number of visited nodes on the vocabulary trie, we evaluated the efficiency of our method in terms of the number of visited nodes. [sent-283, score-0.57]

91 Specifically, we tested how the number of visited nodes changes according to three factors: maximum number of applicable rules in a transformation, vocabulary size and rule set size. [sent-285, score-0.586]

92 As shown in Fig. 8, with an increasing maximum number of applicable rules in a transformation, the number of visited nodes increases first and then stabilizes, especially when the words are long. [sent-291, score-0.32]

93 [Figure 9: Efficiency evaluation with different sizes of vocabulary.] We have seen that using up to two rules in a transformation can bring very high accuracy. [sent-299, score-0.253]

94 From Fig. 9, we can conclude that the number of visited nodes is stable and thus the efficiency of our method remains high with larger vocabulary sizes and numbers of rules. [sent-302, score-0.457]

95 [Conclusion] In this paper, we have proposed a new method for approximate string search, including spelling error correction, which is both accurate and efficient. [sent-324, score-0.508]

96 Learning a spelling error model from search query logs. [sent-331, score-0.423]

97 An improved error model for noisy channel spelling correction. [sent-349, score-0.266]

98 Exploring distributional similarity based models for query spelling correction. [sent-375, score-0.296]

99 Vgram: improving performance of approximate queries on string collections using variable-length grams. [sent-380, score-0.278]

100 Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. [sent-441, score-0.278]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('wm', 0.339), ('misspelled', 0.294), ('wc', 0.246), ('wmi', 0.227), ('spelling', 0.218), ('correction', 0.208), ('efficiency', 0.199), ('wci', 0.181), ('candidate', 0.159), ('rules', 0.159), ('candidates', 0.155), ('vocabulary', 0.145), ('trie', 0.141), ('okazaki', 0.131), ('rule', 0.121), ('string', 0.114), ('brill', 0.114), ('moore', 0.109), ('generation', 0.1), ('transformation', 0.094), ('approximate', 0.093), ('pruning', 0.083), ('applicable', 0.083), ('search', 0.079), ('query', 0.078), ('visited', 0.078), ('bounded', 0.076), ('ranking', 0.074), ('queries', 0.071), ('behm', 0.068), ('corrections', 0.066), ('top', 0.064), ('logistic', 0.063), ('edit', 0.062), ('retrieval', 0.061), ('unbounded', 0.06), ('strings', 0.059), ('transformations', 0.058), ('li', 0.058), ('aho', 0.055), ('morristown', 0.055), ('index', 0.052), ('baselines', 0.051), ('session', 0.051), ('operators', 0.051), ('weights', 0.049), ('nj', 0.049), ('accuracy', 0.048), ('error', 0.048), ('mined', 0.046), ('corasick', 0.045), ('danling', 0.045), ('jiaheng', 0.045), ('mihov', 0.045), ('vernica', 0.045), ('wic', 0.045), ('xiaochun', 0.045), ('log', 0.045), ('efficient', 0.043), ('employs', 0.042), ('substitution', 0.042), ('chen', 0.042), ('guaranteed', 0.041), ('generative', 0.04), ('generality', 0.04), ('operator', 0.04), ('web', 0.04), ('officer', 0.04), ('oncina', 0.04), ('rul', 0.04), ('schaback', 0.04), ('dictionary', 0.039), ('substring', 0.039), ('distance', 0.039), ('verified', 0.038), ('algorithm', 0.038), ('enlarged', 0.037), ('golding', 0.037), ('islam', 0.037), ('ristad', 0.037), ('sigmod', 0.037), ('wim', 0.037), ('experimentally', 0.036), ('referred', 0.035), ('criterion', 0.035), ('surrounding', 0.035), ('pairs', 0.035), ('method', 0.035), ('whitelaw', 0.035), ('experimental', 0.035), ('ming', 0.034), ('conditional', 0.034), ('yang', 0.033), ('regression', 0.033), ('employed', 0.033), ('vldb', 0.033), ('deletion', 0.033), ('loss', 0.032), ('locate', 0.031), ('peking', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 11 acl-2011-A Fast and Accurate Method for Approximate String Search

Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang

Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.

2 0.21571964 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

Author: Zhuowei Bao ; Benny Kimelfeld ; Yunyao Li

Abstract: Spelling correction for keyword-search queries is challenging in restricted domains such as personal email (or desktop) search, due to the scarcity of query logs, and due to the specialized nature of the domain. For that task, this paper presents an algorithm that is based on statistics from the corpus data (rather than the query log). This algorithm, which employs a simple graph-based approach, can incorporate different types of data sources with different levels of reliability (e.g., email subject vs. email body), and can handle complex spelling errors like splitting and merging of words. An experimental study shows the superiority of the algorithm over existing alternatives in the email domain.

3 0.17498326 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Author: Y. Albert Park ; Roger Levy

Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpus of corrected sentences. We also present a cheap and efficient way to provide automated evaluation results for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.

4 0.11930981 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru

Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54,309,334 error-correction pairs from a real-world data set that contains 2,277,786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors.

5 0.10948297 268 acl-2011-Rule Markov Models for Fast Tree-to-String Translation

Author: Ashish Vaswani ; Haitao Mi ; Liang Huang ; David Chiang

Abstract: Most statistical machine translation systems rely on composed rules (rules that can be formed out of smaller rules in the grammar). Though this practice improves translation by weakening independence assumptions in the translation model, it nevertheless results in huge, redundant grammars, making both training and decoding inefficient. Here, we take the opposite approach, where we only use minimal rules (those that cannot be formed out of other rules), and instead rely on a rule Markov model of the derivation history to capture dependencies between minimal rules. Large-scale experiments on a state-of-the-art tree-to-string translation system show that our approach leads to a slimmer model, a faster decoder, yet the same translation quality (measured using BLEU) as composed rules.

6 0.10208789 182 acl-2011-Joint Annotation of Search Queries

7 0.098695017 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

8 0.098384947 258 acl-2011-Ranking Class Labels Using Query Sessions

9 0.096057385 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

10 0.095500186 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

11 0.094963059 256 acl-2011-Query Weighting for Ranking Model Adaptation

12 0.090945832 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

13 0.088590249 52 acl-2011-Automatic Labelling of Topic Models

14 0.087667078 110 acl-2011-Effective Use of Function Words for Rule Generalization in Forest-Based Translation

15 0.08615949 197 acl-2011-Latent Class Transliteration based on Source Language Origin

16 0.084638953 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

17 0.082629666 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

18 0.081418775 206 acl-2011-Learning to Transform and Select Elementary Trees for Improved Syntax-based Machine Translations

19 0.080927536 29 acl-2011-A Word-Class Approach to Labeling PSCFG Rules for Machine Translation

20 0.075152807 61 acl-2011-Binarized Forest to String Translation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.213), (1, -0.03), (2, -0.026), (3, 0.028), (4, -0.093), (5, -0.12), (6, -0.087), (7, -0.141), (8, 0.001), (9, -0.041), (10, -0.081), (11, -0.013), (12, 0.001), (13, 0.114), (14, 0.02), (15, 0.126), (16, 0.04), (17, -0.012), (18, 0.039), (19, -0.011), (20, -0.042), (21, -0.009), (22, 0.032), (23, -0.054), (24, -0.015), (25, -0.04), (26, -0.01), (27, 0.014), (28, 0.044), (29, 0.059), (30, 0.038), (31, 0.009), (32, 0.038), (33, -0.003), (34, -0.123), (35, -0.039), (36, -0.039), (37, 0.102), (38, 0.024), (39, 0.004), (40, 0.043), (41, 0.018), (42, -0.041), (43, 0.132), (44, 0.026), (45, 0.007), (46, -0.01), (47, -0.043), (48, 0.079), (49, 0.103)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93902588 11 acl-2011-A Fast and Accurate Method for Approximate String Search

Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang

Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.

2 0.80255812 13 acl-2011-A Graph Approach to Spelling Correction in Domain-Centric Search

Author: Zhuowei Bao ; Benny Kimelfeld ; Yunyao Li

Abstract: Spelling correction for keyword-search queries is challenging in restricted domains such as personal email (or desktop) search, due to the scarcity of query logs, and due to the specialized nature of the domain. For that task, this paper presents an algorithm that is based on statistics from the corpus data (rather than the query log). This algorithm, which employs a simple graph-based approach, can incorporate different types of data sources with different levels of reliability (e.g., email subject vs. email body), and can handle complex spelling errors like splitting and merging of words. An experimental study shows the superiority of the algorithm over existing alternatives in the email domain.

3 0.64643914 108 acl-2011-EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar

Author: Chung-chi Huang ; Mei-hua Chen ; Shih-ting Huang ; Jason S. Chang

Abstract: We introduce a new method for learning to detect grammatical errors in learner’s writing and provide suggestions. The method involves parsing a reference corpus and inferring grammar patterns in the form of a sequence of content words, function words, and parts-of-speech (e.g., “play ~ role in Ving” and “look forward to Ving”). At runtime, the given passage submitted by the learner is matched using an extended Levenshtein algorithm against the set of pattern rules in order to detect errors and provide suggestions. We present a prototype implementation of the proposed method, EdIt, that can handle a broad range of errors. Promising results are illustrated with three common types of errors in nonnative writing.

4 0.61670178 46 acl-2011-Automated Whole Sentence Grammar Correction Using a Noisy Channel Model

Author: Y. Albert Park ; Roger Levy

Abstract: Automated grammar correction techniques have seen improvement over the years, but there is still much room for increased performance. Current correction techniques mainly focus on identifying and correcting a specific type of error, such as verb form misuse or preposition misuse, which restricts the corrections to a limited scope. We introduce a novel technique, based on a noisy channel model, which can utilize the whole sentence context to determine proper corrections. We show how to use the EM algorithm to learn the parameters of the noise model, using only a data set of erroneous sentences, given the proper language model. This frees us from the burden of acquiring a large corpus of corrected sentences. We also present a cheap and efficient way to provide automated evaluation results for grammar corrections by using BLEU and METEOR, in contrast to the commonly used manual evaluations.

5 0.58676642 336 acl-2011-Why Press Backspace? Understanding User Input Behaviors in Chinese Pinyin Input Method

Author: Yabin Zheng ; Lixing Xie ; Zhiyuan Liu ; Maosong Sun ; Yang Zhang ; Liyun Ru

Abstract: Chinese Pinyin input method is very important for Chinese language information processing. Users may make errors when they are typing in Chinese words. In this paper, we are concerned with the reasons that cause the errors. Inspired by the observation that pressing backspace is one of the most common user behaviors to modify the errors, we collect 54,309,334 error-correction pairs from a real-world data set that contains 2,277,786 users via backspace operations. In addition, we present a comparative analysis of the data to achieve a better understanding of users’ input behaviors. Comparisons with English typos suggest that some language-specific properties result in a part of Chinese input errors.

6 0.55544895 36 acl-2011-An Efficient Indexer for Large N-Gram Corpora

7 0.5492152 291 acl-2011-SystemT: A Declarative Information Extraction System

8 0.5353753 250 acl-2011-Prefix Probability for Probabilistic Synchronous Context-Free Grammars

9 0.50643891 32 acl-2011-Algorithm Selection and Model Adaptation for ESL Correction Tasks

10 0.5064109 172 acl-2011-Insertion, Deletion, or Substitution? Normalizing Text Messages without Pre-categorization nor Supervision

11 0.4985204 35 acl-2011-An ERP-based Brain-Computer Interface for text entry using Rapid Serial Visual Presentation and Language Modeling

12 0.49581647 26 acl-2011-A Speech-based Just-in-Time Retrieval System using Semantic Search

13 0.49366021 147 acl-2011-Grammatical Error Correction with Alternating Structure Optimization

14 0.48039982 181 acl-2011-Jigs and Lures: Associating Web Queries with Structured Entities

15 0.47878703 115 acl-2011-Engkoo: Mining the Web for Language Learning

16 0.47605073 208 acl-2011-Lexical Normalisation of Short Text Messages: Makn Sens a #twitter

17 0.46157834 301 acl-2011-The impact of language models and loss functions on repair disfluency detection

18 0.46120363 271 acl-2011-Search in the Lost Sense of "Query": Question Formulation in Web Search Queries and its Temporal Changes

19 0.45714194 135 acl-2011-Faster and Smaller N-Gram Language Models

20 0.45026329 89 acl-2011-Creative Language Retrieval: A Robust Hybrid of Information Retrieval and Linguistic Creativity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.029), (13, 0.169), (17, 0.066), (26, 0.038), (37, 0.076), (39, 0.064), (41, 0.117), (55, 0.032), (59, 0.032), (72, 0.025), (91, 0.036), (96, 0.164), (97, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95563495 42 acl-2011-An Interface for Rapid Natural Language Processing Development in UIMA

Author: Balaji Soundrarajan ; Thomas Ginter ; Scott DuVall

Abstract: This demonstration presents the Annotation Librarian, an application programming interface that supports rapid development of natural language processing (NLP) projects built in Apache Unstructured Information Management Architecture (UIMA). The flexibility of UIMA to support all types of unstructured data – images, audio, and text – increases the complexity of some of the most common NLP development tasks. The Annotation Librarian interface handles these common functions and allows the creation and management of annotations by mirroring Java methods used to manipulate Strings. The familiar syntax and NLP-centric design allows developers to adopt and rapidly develop NLP algorithms in UIMA. The general functionality of the interface is described in relation to the use cases that necessitated its creation.

same-paper 2 0.86126047 11 acl-2011-A Fast and Accurate Method for Approximate String Search

Author: Ziqi Wang ; Gu Xu ; Hang Li ; Ming Zhang

Abstract: This paper proposes a new method for approximate string search, specifically candidate generation in spelling error correction, which is a task as follows. Given a misspelled word, the system finds words in a dictionary, which are most “similar” to the misspelled word. The paper proposes a probabilistic approach to the task, which is both accurate and efficient. The approach includes the use of a log linear model, a method for training the model, and an algorithm for finding the top k candidates. The log linear model is defined as a conditional probability distribution of a corrected word and a rule set for the correction conditioned on the misspelled word. The learning method employs the criterion in candidate generation as loss function. The retrieval algorithm is efficient and is guaranteed to find the optimal k candidates. Experimental results on large scale data show that the proposed approach improves upon existing methods in terms of accuracy in different settings.

3 0.83130592 63 acl-2011-Bootstrapping coreference resolution using word associations

Author: Hamidreza Kobdani ; Hinrich Schuetze ; Michael Schiehlen ; Hans Kamp

Abstract: In this paper, we present an unsupervised framework that bootstraps a complete coreference resolution (CoRe) system from word associations mined from a large unlabeled corpus. We show that word associations are useful for CoRe – e.g., the strong association between Obama and President is an indicator of likely coreference. Association information has so far not been used in CoRe because it is sparse and difficult to learn from small labeled corpora. Since unlabeled text is readily available, our unsupervised approach addresses the sparseness problem. In a self-training framework, we train a decision tree on a corpus that is automatically labeled using word associations. We show that this unsupervised system has better CoRe performance than other learning approaches that do not use manually labeled data.

4 0.8018955 137 acl-2011-Fine-Grained Class Label Markup of Search Queries

Author: Joseph Reisinger ; Marius Pasca

Abstract: We develop a novel approach to the semantic analysis of short text segments and demonstrate its utility on a large corpus of Web search queries. Extracting meaning from short text segments is difficult as there is little semantic redundancy between terms; hence methods based on shallow semantic analysis may fail to accurately estimate meaning. Furthermore search queries lack explicit syntax often used to determine intent in question answering. In this paper we propose a hybrid model of semantic analysis combining explicit class-label extraction with a latent class PCFG. This class-label correlation (CLC) model admits a robust parallel approximation, allowing it to scale to large amounts of query data. We demonstrate its performance in terms of (1) its predicted label accuracy on polysemous queries and (2) its ability to accurately chunk queries into base constituents.

5 0.79336041 225 acl-2011-Monolingual Alignment by Edit Rate Computation on Sentential Paraphrase Pairs

Author: Houda Bouamor ; Aurelien Max ; Anne Vilnat

Abstract: In this paper, we present a novel way of tackling the monolingual alignment problem on pairs of sentential paraphrases by means of edit rate computation. In order to inform the edit rate, information in the form of subsentential paraphrases is provided by a range of techniques built for different purposes. We show that the tunable TER-PLUS metric from Machine Translation evaluation can achieve good performance on this task and that it can effectively exploit information coming from complementary sources.

6 0.79160815 196 acl-2011-Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models

7 0.78486133 14 acl-2011-A Hierarchical Model of Web Summaries

8 0.7840687 65 acl-2011-Can Document Selection Help Semi-supervised Learning? A Case Study On Event Extraction

9 0.78070736 58 acl-2011-Beam-Width Prediction for Efficient Context-Free Parsing

10 0.77845645 15 acl-2011-A Hierarchical Pitman-Yor Process HMM for Unsupervised Part of Speech Induction

11 0.7764805 324 acl-2011-Unsupervised Semantic Role Induction via Split-Merge Clustering

12 0.7762633 94 acl-2011-Deciphering Foreign Language

13 0.7746172 185 acl-2011-Joint Identification and Segmentation of Domain-Specific Dialogue Acts for Conversational Dialogue Systems

14 0.77285862 327 acl-2011-Using Bilingual Parallel Corpora for Cross-Lingual Textual Entailment

15 0.7721867 202 acl-2011-Learning Hierarchical Translation Structure with Linguistic Annotations

16 0.77201355 126 acl-2011-Exploiting Syntactico-Semantic Structures for Relation Extraction

17 0.77166921 244 acl-2011-Peeling Back the Layers: Detecting Event Role Fillers in Secondary Contexts

18 0.77156985 135 acl-2011-Faster and Smaller N-Gram Language Models

19 0.77075982 209 acl-2011-Lexically-Triggered Hidden Markov Models for Clinical Document Coding

20 0.77069294 178 acl-2011-Interactive Topic Modeling