emnlp emnlp2012 emnlp2012-96 knowledge-graph by maker-knowledge-mining

96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation


Source: pdf

Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze

Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). [sent-5, score-0.707]

2 The role of the transducer is to organize the collection. [sent-6, score-0.293]

3 Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. [sent-7, score-0.648]

4 Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. [sent-8, score-0.593]

5 The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. [sent-9, score-0.427]

6 We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance. [sent-10, score-0.369]

7 1 Introduction Systematic relationships between pairs of strings are at the core of problems such as transliteration (Knight and Graehl, 1998), morphology (Dreyer and Eisner, 2011), cross-document coreference resolution (Bagga and Baldwin, 1998), canonicalization (Culotta et al. [sent-11, score-0.283]

8 They model a conditional distribution p(y | x), and are ordinarily trained on input-output pairs of strings (Dreyer et al. [sent-14, score-0.222]

9 … makes explicit assumptions about how strings are copied with mutation. [sent-34, score-0.237]

10 A person name may have unboundedly many different variants. [sent-63, score-0.229]

11 We cannot assign the observed strings to positions in an existing structure that is shared across all lexemes, such as a given phylogenetic tree whose K nodes represent languages, or a given inflectional grid whose K cells represent grammatical inflections. [sent-65, score-0.326]

12 We instead use one global mutation model that applies to all names—but see footnote 14 on incorporating specialized transductions (Latin to Italian) within our mutation model. [sent-67, score-0.71]

13 … them into an idiosyncratic phylogenetic tree whose nodes are the string types or tokens themselves. [sent-68, score-0.412]

14 In molecular evolutionary analysis, phylogenetic techniques have often been combined with estimation of some parametric model of mutation (Tamura et al. [sent-80, score-0.527]

15 However, names mutate differently from biological sequences, and our mutation model for names (§4, §8) reflects that. [sent-82, score-0.77]

16 2 An Example A fragment of a phylogeny for person names is shown in Figure 1. [sent-84, score-0.567]

17 Our procedure learned this automatically from a collection of name tokens, without observing any input/output pairs. [sent-85, score-0.24]

18 The nodes of the phylogeny are the observed name types,2 each one associated with a count of observed tokens. [sent-86, score-0.536]

19 The descendants of this initial name (generated from ♦) are other names that subsequently evolved for that entity. [sent-90, score-0.357]

20 Together, the learned p and the phylogeny in Figure 1 form an explanation of the observed collection of names. [sent-103, score-0.406]

21 A different phylogeny would have involved a more improbable collection of events, such as replacing Chishti with Pynchon, or generating many unrelated copies of Pynchon directly from ♦. [sent-108, score-0.406]

22 In the phylogeny, the parent names tend to be used often enough that it is plausible for variants of these names to have emerged. [sent-109, score-0.409]

23 A named entity has the form y = (e, w) where w is a string being used to refer to entity e. [sent-115, score-0.32]

24 A single entity e may be referred to on different occasions by different name strings w. [sent-116, score-0.464]

25 We suppose that this is the result of copying the entity with occasional mutation of its name (as in asexual reproduction). [sent-117, score-0.635]

26 So to make this choice, she selects one of the previous tokens yi uniformly at random, each having probability 1/(k + α); or else she selects ♦, with probability α/(k + α). [sent-127, score-0.358]

27 With probability 1 − µ she copies the selected token faithfully; but with probability µ she instead draws a mutated token yk+1 = (ek+1, wk+1) from the mutation model p(· | yi). [sent-129, score-0.507]

28 This preserves the entity (ek+1 = ei with probability 1), but the new name wk+1 is a stochastic transduction of wi drawn from p(· | wi). [sent-130, score-0.485]

29 So she sets ek+1 to a newly created entity, sampling its name wk+1 from the distribution p(· | ♦). [sent-133, score-0.223]

30 Nothing prevents wk+1 from being a name that is already in use for another entity (i.e., wk+1 = wj for some j). [sent-137, score-0.28]

31 (Footnote 3: Straightforward extensions are to allow a variable mutation rate µ(yi) that depends on properties of yi, and to allow wk+1 to depend on known properties of ei.) [sent-140, score-0.355]
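
The generative story in sentences 26–31 is simple enough to sketch in code. The following is an illustrative sketch only, not the authors' implementation; sample_root_name() and mutate_name(w) are hypothetical placeholders standing in for p(· | ♦) and the learned transducer p(· | w).

```python
import random

def sample_tokens(n_tokens, alpha, mu, sample_root_name, mutate_name):
    """Sketch of the generative story.  `sample_root_name()` and
    `mutate_name(w)` are hypothetical stand-ins for p(. | root) and the
    transducer p(. | w)."""
    tokens = []          # list of (entity_id, name_string)
    next_entity = 0
    for k in range(n_tokens):
        # Pick a parent: the root with probability alpha/(k + alpha),
        # otherwise one of the k previous tokens uniformly at random.
        if random.random() < alpha / (k + alpha):
            entity, name = next_entity, sample_root_name()   # brand-new entity
            next_entity += 1
        else:
            entity, parent_name = random.choice(tokens)      # reuse an earlier token's entity
            if random.random() < mu:
                name = mutate_name(parent_name)              # mutated variant, same entity
            else:
                name = parent_name                           # faithful copy
        tokens.append((entity, name))
    return tokens
```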

32 1 Relationship to other models If we ignore the name strings, we can see that the sequence of entities e1, e2, . [sent-143, score-0.246]

33 This is because our model posits that later authors are influenced by earlier authors, copying entity names from them with mutation. [sent-161, score-0.343]

34 The mutation process is not symmetric—for example, Figure 1 reflects a tendency to shorten rather than lengthen names. [sent-163, score-0.355]

35 These too are defined using mutation of strings or other types. [sent-165, score-0.539]

36 From a transformation process, one can draw a distribution over types, from which the tokens are then sampled IID. [sent-166, score-0.252]

37 For an author to generate a name token this way, she would have to know the whole derivational history of the previous name she was adapting. [sent-173, score-0.52]

38 Our present model instead allows an author simply to select a name she previously saw and copy or mutate its surface form. [sent-174, score-0.31]

39 (3) One should presumably prefer to explain a novel name y as a mutation of a frequent name x, other things equal (§2). [sent-175, score-0.725]

40 Our phylogeny is a preferential attachment tree, a random directed graph in which each vertex selects a single previous vertex as its parent. [sent-181, score-0.511]

41 The key step is to sample the name string wy from p(wy | wx) or p(wy | ♦). [sent-186, score-0.513]

42 … distributions could easily incorporate detailed linguistic knowledge of the mutation process (see §8). [sent-188, score-0.355]

43 Like many such models, it can be regarded as a stochastic finite-state string-to-string transducer parameterized by θ. [sent-190, score-0.285]

44 We assume a stochastic mutation process which, when given an input string wx, edits it from left to right. (Footnote 4: The very fact that x has been frequently observed demonstrates that it has often chosen to stop mutating.) [sent-195, score-0.573]

45 It is common to mutate a name by editing contiguous substrings (e. [sent-206, score-0.297]

46 Contiguous regions of copying versus editing can be modeled by a low probability of transitioning between no-edit and edit regions. [sent-209, score-0.263]

47 Note that an edit region may include some copy edits (or substitute edits that replace a character with itself) without leaving the edit region. [sent-210, score-0.396]

48 Input and output strings are augmented with a trailing eos (“end-of-string”) symbol that is seen by the single-character lookahead. [sent-212, score-0.237]

49 Alternatively, if the process selects no-edit, then eos is copied to the output string and the process terminates. [sent-214, score-0.272]
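
To make this left-to-right edit process concrete, here is a toy sampler in the same spirit. It is a sketch, not the authors' parameterization: the region-switching probabilities, edit probabilities, and alphabet are made-up placeholders, and a real model would condition each decision on the learned parameters θ and on context.

```python
import random

EOS = "\x03"   # trailing end-of-string symbol seen by the one-character lookahead

def mutate(w, p_enter=0.05, p_exit=0.3, p_ins=0.2, p_del=0.2,
           alphabet="abcdefghijklmnopqrstuvwxyz "):
    """Toy left-to-right mutation: characters in no-edit regions are copied
    verbatim; inside an edit region they may be substituted, deleted, or
    preceded by an insertion.  All probabilities here are illustrative."""
    out, in_edit = [], False
    for ch in w + EOS:
        # Switch between no-edit and edit regions with low probability,
        # so copied and edited spans stay contiguous.
        if random.random() < (p_exit if in_edit else p_enter):
            in_edit = not in_edit
        if ch == EOS:
            break                    # reaching the end-of-string symbol terminates the process
        if not in_edit:
            out.append(ch)           # faithful copy
            continue
        r = random.random()
        if r < p_ins:
            out.append(random.choice(alphabet))  # insertion ...
            out.append(ch)                       # ... followed by a copy of ch
        elif r < p_ins + p_del:
            pass                                 # deletion of ch
        else:
            out.append(random.choice(alphabet))  # substitution (may equal ch)
    return "".join(out)
```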

50 5 Inference The input to inference is a collection of named entity tokens y. [sent-220, score-0.29]

51 Some of the tokens may be tagged tokens of the form y = (e, w), whose true entity is known. [sent-225, score-0.375]

52 1 An unrealistically supervised setting Suppose we were lucky enough to fully observe the sequence of named entity tokens yi = (ei, wi) produced by our generative model. [sent-228, score-0.301]

53 This phylogeny is described by a spanning tree over the tokens. [sent-231, score-0.564]

54 For each potential edge x → y between named entity tokens, define δ(y | x) to be the probability of choosing x and copying it (possibly with mutation) to obtain y. [sent-233, score-0.257]

55 So δ(yj | ♦) = α p(yj | ♦) (1), and δ(yj | yi) = µ p(yj | yi) + (1 − µ) 1(yj = yi) (2), except that if i ≥ j or if ei ≠ ej, then δ(yj | yi) = 0 (since yj can only be derived from an earlier token yi with the same entity). [sent-234, score-0.314]

56 The probability of y1, . . . , yN together with a given phylogenetic tree is easily seen to be a product over all tree edges, ∏j δ(yj | pa(yj)), where pa(yj) is the parent of yj. [sent-238, score-0.277]

57 … (a) the max-probability spanning tree; (b) the total probability of all spanning trees; (c) the marginal probability of each edge, under the posterior distribution on spanning trees. (a) is our single best guess of the phylogeny. [sent-240, score-0.707]

58 To locally maximize the model likelihood, (c) can serve as the E step of our EM algorithm (§6) for tuning our mutation model. [sent-248, score-0.355]

59 The M step then retrains the mutation model’s parameters θ on input-output pairs wi → wj, weighting each pair by its edge’s posterior marginal probability (c), since that is the expected count of a wi → wj mutation. [sent-249, score-0.6]

60 Second, the order of the tokens is not observed, so we do not know which other tokens are candidate parents. [sent-255, score-0.28]

61 The type phylogeny in Figure 1 represents a set of possible token phylogenies. [sent-257, score-0.463]

62 Each node of Figure 1 represents an untagged name type y = (?, w). [sent-258, score-0.246]

63 By grouping all ny tokens of this type into a single node, we mean that the first token of y was derived by mutation from the parent node, while each later token of y was derived by copying an (unspecified) earlier token of y. [sent-260, score-1.011]

64 A token phylogeny cannot be represented in this way if two or more tokens of y were created by mutations. [sent-261, score-0.603]

65 In that case, their name strings are equal only by coincidence. [sent-262, score-0.369]

66 They may have different parents (perhaps of different entities), whereas the y node in a type phylogeny can have only one parent. [sent-263, score-0.351]

67 The first token of y is necessarily a mutation, but later tokens are much more likely to be copies. [sent-265, score-0.252]

68 The probability of generating a later token y by copying some previous token is at least (1 − µ)/(N + α), while the probability of generating it in some other way is at most max(α p(y | ♦), max_{x∈Y} p(y | x)), where Y is the set of observed types. [sent-266, score-0.38]

69 … an author is unlikely to invent exactly the observed string y, certainly from ♦ but even by mutating a similar string x (especially when the mutation rate µ is small). [sent-268, score-0.295]

70 Consider the probability of generating untagged tokens … (Footnote 7: Working over types improves the quality of our second approximation, and also speeds up the spanning tree algorithms.) [sent-270, score-0.454]

71 To find the expectation of (5), observe that the expected number of tokens of x that precede the first token of y is nx/(ny + 1), since each of the nx tokens of x has a 1/(ny + 1) chance of falling before all ny tokens of y. [sent-282, score-0.571]

72 The (approximate) posterior probability of a given phylogeny, given our evidence, is proportional to the product of the values of its edges. [sent-298, score-0.432]

73 … the total probability of generating the data G via any spanning tree of the form we consider. [sent-301, score-0.253]

74 This distribution is determined by the parameters θ of the transducer pθ, along with the ratio α/µ. [sent-302, score-0.28]

75 6 Training the Transducer with EM Our inference algorithm assumes that we know the transducer parameters θ. [sent-308, score-0.242]

76 This marginal likelihood sums over all the other latent variables in the model—the spanning tree, the alignments between strings, and the hidden token ordering. [sent-310, score-0.304]

77 The EM procedure repeats the following until convergence: E-step: Given θ, compute the posterior marginal probabilities cxy of all possible phylogeny edges. [sent-311, score-0.565]

78 The posterior marginal probability of a directed edge from vertex x to vertex y, according to (10), is cxy = Σ_{T ∈ T♦(G) : (x→y) ∈ T} pθ(T) (11). The probability cxy is a “pseudocount” for the expected number of mutations from x to y. [sent-317, score-0.677]

79 Calculating cxy requires summing over all spanning trees of G, of which there are n^(n−2) for a fully connected graph with n vertices. [sent-319, score-0.267]
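
That sum never has to be enumerated: both the total spanning-tree weight and the edge marginals can be read off a root-deleted Laplacian via the directed Matrix-Tree theorem. The sketch below is illustrative, not the authors' code; it assumes a dense (n+1) × (n+1) weight matrix W whose row/column 0 stands for ♦ and whose entry W[i, j] plays the role of δ(yj | yi).

```python
import numpy as np

def edge_marginals(W):
    """Edge posteriors under P(tree) ∝ ∏_{(i→j) in tree} W[i, j], computed
    with the directed Matrix-Tree theorem.  Row/column 0 is the root;
    rows/columns 1..n are the name types.  Returns (marginals, logZ),
    where marginals[i, j] = P((i → j) is a tree edge)."""
    n = W.shape[0] - 1
    A = W[1:, 1:].astype(float).copy()
    np.fill_diagonal(A, 0.0)                               # no self-loops
    incoming = W[:, 1:].sum(axis=0) - np.diag(W)[1:]       # total weight into each non-root node
    L = np.diag(incoming) - A                              # Laplacian with root row/column removed
    _, logZ = np.linalg.slogdet(L)                         # log of total spanning-tree weight
    Linv = np.linalg.inv(L)
    d = np.diag(Linv)
    marg = np.zeros_like(W, dtype=float)
    marg[0, 1:] = W[0, 1:] * d                             # edges out of the root
    marg[1:, 1:] = W[1:, 1:] * (d[None, :] - Linv.T)       # edges between name types
    marg[np.arange(1, n + 1), np.arange(1, n + 1)] = 0.0
    return marg, logZ

# Sanity check: each non-root node has exactly one parent in any spanning tree,
# so marg[:, j].sum() should be (numerically) 1 for every j >= 1.
```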

80 At the M step, we retrain the mutation model parameters to maximize Σ_{x,y} cxy log p(wy | wx). [sent-328, score-0.355]
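
Putting the two steps together, the EM loop might look like the skeleton below, reusing the edge_marginals sketch above. delta_weights and train_transducer are hypothetical helpers introduced only for illustration: the first fills the δ weight matrix under the current θ, and the second fits the transducer to maximize the weighted log-likelihood Σ_{x,y} cxy log p(wy | wx) on fractionally weighted string pairs.

```python
def train_em(names, theta, delta_weights, train_transducer,
             alpha_over_mu, n_iters=10, prune_below=1e-6):
    """Illustrative EM skeleton.  `delta_weights` and `train_transducer`
    are caller-supplied placeholders; `edge_marginals` is the Matrix-Tree
    sketch defined earlier."""
    for _ in range(n_iters):
        W = delta_weights(names, theta, alpha_over_mu)    # edge weights under current theta
        c, _ = edge_marginals(W)                          # E-step: posterior edge pseudocounts
        pairs = [(names[i - 1], names[j - 1], c[i, j])    # fractionally weighted parent -> child pairs
                 for i in range(1, len(names) + 1)
                 for j in range(1, len(names) + 1)
                 if i != j and c[i, j] > prune_below]     # pruned edges contribute nothing
        theta = train_transducer(pairs)                   # M-step: maximize sum c_xy log p(w_y | w_x)
    return theta
```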

81 Furthermore, pruned edges do not appear in any spanning tree, so the E step will find that their posterior marginal probabilities are 0. [sent-339, score-0.274]

82 This means that the input-output pairs corresponding to these edges can be ignored when re-estimating the transducer parameters in the M step. [sent-340, score-0.283]

83 In this case, we would need (expensive) new algorithms to reconstruct the strings w. [sent-347, score-0.232]

84 However, this model could infer a more realistic phylogeny by positing unobserved ancestral or intermediate forms that relate the observed tokens, as in transformation models (Eisner, 2002; Andrews and Eisner, 2011). [sent-348, score-0.511]

85 We used Wikipedia to create a list of name aliases for different entities. [sent-354, score-0.256]

86 However, many redirects are for topics other than individual people, and these would be poor examples of name variation. [sent-360, score-0.256]

87 To synthesize token counts, empirical token frequencies for each type were estimated from the LDC Gigaword corpus,12 which is a corpus of newswire text spanning several years. [sent-385, score-0.367]

88 2 Experiments We begin by evaluating the generalization ability of a transducer trained using a transformation model. [sent-391, score-0.316]

89 We construct pairs of entity title (input) and alias (output) names from the Wikipedia data. [sent-398, score-0.308]

90 For different amounts of supervised data, we trained the transformation model on the training set, and plotted the log-likelihood of heldout test data for the transducer parameters at each iteration of EM. [sent-399, score-0.316]

91 For each alias a in a test set (not seen at training time), we produce a ranking of test entity titles t according to transducer probabilities pθ (a | t). [sent-403, score-0.378]

92 A good transducer should assign high probability to transformations from the correct title for the alias. [sent-404, score-0.282]
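
A minimal sketch of such a ranking evaluation is given below. Mean reciprocal rank is used here only as one natural summary of ranking quality (the paper's exact metric may differ), and score(alias, title) is a placeholder for pθ(a | t).

```python
def mean_reciprocal_rank(test_pairs, titles, score):
    """Rank every candidate title for each alias by score(alias, title),
    a stand-in for p_theta(alias | title), and average the reciprocal
    rank of the correct title (which must appear in `titles`)."""
    total = 0.0
    for alias, gold_title in test_pairs:
        ranked = sorted(titles, key=lambda t: score(alias, t), reverse=True)
        total += 1.0 / (ranked.index(gold_title) + 1)
    return total / len(test_pairs)
```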

93 Figure 3: Learning curves for different initializations of the transducer parameters. [sent-412, score-0.242]

94 Above, “sup=100” (for instance) means that 100 entities were used as training data to initialize the transducer parameters (constructing all title-alias pairs for those Wikipedia entities). [sent-413, score-0.303]

95 It learns from a collection of related strings whose relationships are unknown. [sent-417, score-0.239]

96 The key idea is that some strings are mutations of common strings that occurred earlier. [sent-418, score-0.457]

97 We compute a distribution over the unknown phylogenetic tree that relates these strings, and use it to reestimate the transducer parameters via EM. [sent-419, score-0.422]

98 Another future direction would be to incorporate the context of tokens in order to help reconstruct which tokens are coreferent. [sent-423, score-0.328]

99 (Footnote 14) These last two points suggest that the mutation model should operate not on simple (entity, string) pairs, but on richer representations in which the name has been parsed into its components (Eisenstein et al. [sent-427, score-0.54]

100 For example, if wy and ℓy denote the string and language of name y, then define p(y | x) = p(ℓy | ℓx) · p(wy | ℓy, ℓx, wx).


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mutation', 0.355), ('phylogeny', 0.351), ('transducer', 0.242), ('wy', 0.198), ('name', 0.185), ('strings', 0.184), ('names', 0.172), ('spanning', 0.143), ('tokens', 0.14), ('wx', 0.137), ('string', 0.13), ('cxy', 0.124), ('token', 0.112), ('edit', 0.106), ('entity', 0.095), ('yj', 0.095), ('mutations', 0.089), ('lexeme', 0.083), ('wk', 0.082), ('dreyer', 0.077), ('schafer', 0.076), ('copying', 0.076), ('transducers', 0.076), ('transformation', 0.074), ('barack', 0.072), ('phylogenetic', 0.072), ('aliases', 0.071), ('mutate', 0.071), ('nawaz', 0.071), ('redirects', 0.071), ('tree', 0.07), ('yi', 0.066), ('parent', 0.065), ('transliteration', 0.064), ('evolutionary', 0.064), ('wikipedia', 0.063), ('vertices', 0.062), ('vertex', 0.062), ('untagged', 0.061), ('ristad', 0.061), ('entities', 0.061), ('yk', 0.059), ('obama', 0.057), ('collection', 0.055), ('copy', 0.054), ('copied', 0.053), ('eos', 0.053), ('phylogenies', 0.053), ('pynchon', 0.053), ('unorganized', 0.053), ('unobserved', 0.051), ('andrews', 0.051), ('organize', 0.051), ('marginal', 0.049), ('conditioned', 0.048), ('reconstruct', 0.048), ('em', 0.047), ('laplacian', 0.046), ('tarjan', 0.046), ('bibliographic', 0.046), ('redirect', 0.046), ('edge', 0.046), ('edits', 0.045), ('eisner', 0.044), ('person', 0.044), ('stochastic', 0.043), ('charles', 0.043), ('levenshtein', 0.041), ('cognates', 0.041), ('plausibly', 0.041), ('alias', 0.041), ('editing', 0.041), ('posterior', 0.041), ('edges', 0.041), ('transduction', 0.041), ('ek', 0.041), ('ei', 0.041), ('wi', 0.04), ('probability', 0.04), ('character', 0.04), ('ny', 0.039), ('crp', 0.038), ('linkage', 0.038), ('derivational', 0.038), ('distribution', 0.038), ('nicholas', 0.036), ('molecular', 0.036), ('selects', 0.036), ('aeu', 0.035), ('ancestral', 0.035), ('bennett', 0.035), ('bilenko', 0.035), ('canonicalization', 0.035), ('chishti', 0.035), ('deduplication', 0.035), ('dias', 0.035), ('ghareeb', 0.035), ('inputoutput', 0.035), ('jingleheimer', 0.035), ('karim', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze

Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

2 0.13253012 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao

Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.

3 0.10716929 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

Author: Mihai Surdeanu ; Julie Tibshirani ; Ramesh Nallapati ; Christopher D. Manning

Abstract: Distant supervision for relation extraction (RE) – gathering training data by aligning a database of facts with text – is an efficient approach to scale RE to thousands of different relations. However, this introduces a challenging learning scenario where the relation expressed by a pair of entities found in a sentence is unknown. For example, a sentence containing Balzac and France may express BornIn or Died, an unknown relation, or no relation at all. Because of this, traditional supervised learning, which assumes that each example is explicitly mapped to a label, is not appropriate. We propose a novel approach to multi-instance multi-label learning for RE, which jointly models all the instances of a pair of entities in text and all their labels using a graphical model with latent variables. Our model performs competitively on two difficult domains.

4 0.096833564 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

Author: Thomas Lin ; Mausam ; Oren Etzioni

Abstract: Entity linking systems link noun-phrase mentions in text to their corresponding Wikipedia articles. However, NLP applications would gain from the ability to detect and type all entities mentioned in text, including the long tail of entities not prominent enough to have their own Wikipedia articles. In this paper we show that once the Wikipedia entities mentioned in a corpus of textual assertions are linked, this can further enable the detection and fine-grained typing of the unlinkable entities. Our proposed method for detecting unlinkable entities achieves 24% greater accuracy than a Named Entity Recognition baseline, and our method for fine-grained typing is able to propagate over 1,000 types from linked Wikipedia entities to unlinkable entities. Detection and typing of unlinkable entities can increase yield for NLP applications such as typed question answering.

5 0.089882195 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration

Author: Jagadeesh Jagarlamudi ; Hal Daume III

Abstract: In this paper, we address the problem of building a multilingual transliteration system using an interlingual representation. Our approach uses international phonetic alphabet (IPA) to learn the interlingual representation and thus allows us to use any word and its IPA representation as a training example. Thus, our approach requires only monolingual resources: a phoneme dictionary that lists words and their IPA representations.1 By adding a phoneme dictionary of a new language, we can readily build a transliteration system into any of the existing previous languages, without the expense of all-pairs data or computation. We also propose a regularization framework for learning the interlingual representation, which accounts for language specific phonemic variability, and thus it can find better mappings between languages. Experimental results on the name transliteration task in five diverse languages show a maximum improvement of 29% accuracy and an average improvement of 17% accuracy compared to a state-of-the-art baseline system.

6 0.086022355 19 emnlp-2012-An Entity-Topic Model for Entity Linking

7 0.07909096 37 emnlp-2012-Dynamic Programming for Higher Order Parsing of Gap-Minding Trees

8 0.077497222 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables

9 0.075478382 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

10 0.074354082 84 emnlp-2012-Linking Named Entities to Any Database

11 0.074121825 41 emnlp-2012-Entity based QA Retrieval

12 0.069498949 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

13 0.064076252 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence

14 0.063783742 31 emnlp-2012-Cross-Lingual Language Modeling with Syntactic Reordering for Low-Resource Speech Recognition

15 0.063577913 91 emnlp-2012-Monte Carlo MCMC: Efficient Inference by Approximate Sampling

16 0.062590703 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets

17 0.060621399 68 emnlp-2012-Iterative Annotation Transformation with Predict-Self Reestimation for Chinese Word Segmentation

18 0.060422432 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

19 0.059552509 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

20 0.059517041 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.24), (1, 0.052), (2, 0.025), (3, -0.051), (4, -0.107), (5, 0.011), (6, 0.036), (7, 0.078), (8, -0.011), (9, -0.017), (10, 0.079), (11, 0.008), (12, 0.035), (13, -0.023), (14, -0.098), (15, 0.039), (16, 0.271), (17, -0.035), (18, -0.024), (19, 0.117), (20, -0.119), (21, 0.011), (22, 0.019), (23, -0.104), (24, 0.045), (25, -0.017), (26, 0.006), (27, 0.089), (28, 0.037), (29, -0.032), (30, 0.167), (31, -0.075), (32, -0.218), (33, 0.03), (34, -0.1), (35, 0.035), (36, 0.174), (37, -0.013), (38, -0.133), (39, 0.038), (40, -0.053), (41, -0.038), (42, -0.0), (43, -0.105), (44, -0.036), (45, 0.081), (46, 0.13), (47, -0.073), (48, -0.117), (49, -0.064)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95594275 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze

Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

2 0.48649502 47 emnlp-2012-Explore Person Specific Evidence in Web Person Name Disambiguation

Author: Liwei Chen ; Yansong Feng ; Lei Zou ; Dongyan Zhao

Abstract: In this paper, we investigate different usages of feature representations in the web person name disambiguation task which has been suffering from the mismatch of vocabulary and lack of clues in web environments. In literature, the latter receives less attention and remains more challenging. We explore the feature space in this task and argue that collecting person specific evidences from a corpus level can provide a more reasonable and robust estimation for evaluating a feature’s importance in a given web page. This can alleviate the lack of clues where discriminative features can be reasonably weighted by taking their corpus level importance into account, not just relying on the current local context. We therefore propose a topic-based model to exploit the person specific global importance and embed it into the person name similarity. The experimental results show that the corpus level topic information provides more stable evidences for discriminative features and our method outperforms the state-of-the-art systems on three WePS datasets.

3 0.46944568 119 emnlp-2012-Spectral Dependency Parsing with Latent Variables

Author: Paramveer Dhillon ; Jordan Rodu ; Michael Collins ; Dean Foster ; Lyle Ungar

Abstract: Recently there has been substantial interest in using spectral methods to learn generative sequence models like HMMs. Spectral methods are attractive as they provide globally consistent estimates of the model parameters and are very fast and scalable, unlike EM methods, which can get stuck in local minima. In this paper, we present a novel extension of this class of spectral methods to learn dependency tree structures. We propose a simple yet powerful latent variable generative model for dependency parsing, and a spectral learning method to efficiently estimate it. As a pilot experimental evaluation, we use the spectral tree probabilities estimated by our model to re-rank the outputs of a near state-of-the-art parser. Our approach gives us a moderate reduction in error of up to 4.6% over the baseline re-ranker.

4 0.46348989 26 emnlp-2012-Building a Lightweight Semantic Model for Unsupervised Information Extraction on Short Listings

Author: Doo Soon Kim ; Kunal Verma ; Peter Yeh

Abstract: Short listings such as classified ads or product listings abound on the web. If a computer can reliably extract information from them, it will greatly benefit a variety of applications. Short listings are, however, challenging to process due to their informal styles. In this paper, we present an unsupervised information extraction system for short listings. Given a corpus of listings, the system builds a semantic model that represents typical objects and their attributes in the domain of the corpus, and then uses the model to extract information. Two key features in the system are a semantic parser that extracts objects and their attributes and a listing-focused clustering module that helps group together extracted tokens of same type. Our evaluation shows that the semantic model learned by these two modules is effective across multiple domains.

5 0.45259982 37 emnlp-2012-Dynamic Programming for Higher Order Parsing of Gap-Minding Trees

Author: Emily Pitler ; Sampath Kannan ; Mitchell Marcus

Abstract: We introduce gap inheritance, a new structural property on trees, which provides a way to quantify the degree to which intervals of descendants can be nested. Based on this property, two new classes of trees are derived that provide a closer approximation to the set of plausible natural language dependency trees than some alternative classes of trees: unlike projective trees, a word can have descendants in more than one interval; unlike spanning trees, these intervals cannot be nested in arbitrary ways. The 1-Inherit class of trees has exactly the same empirical coverage of natural language sentences as the class of mildly nonprojective trees, yet the optimal scoring tree can be found in an order of magnitude less time. Gap-minding trees (the second class) have the property that all edges into an interval of descendants come from the same node, and thus an algorithm which uses only single intervals can produce trees in which a node has descendants in multiple intervals.

6 0.43107352 111 emnlp-2012-Regularized Interlingual Projections: Evaluation on Multilingual Transliteration

7 0.4228355 41 emnlp-2012-Entity based QA Retrieval

8 0.41112068 22 emnlp-2012-Automatically Constructing a Normalisation Dictionary for Microblogs

9 0.40476489 84 emnlp-2012-Linking Named Entities to Any Database

10 0.36396688 98 emnlp-2012-No Noun Phrase Left Behind: Detecting and Typing Unlinkable Entities

11 0.34437826 76 emnlp-2012-Learning-based Multi-Sieve Co-reference Resolution with Knowledge

12 0.33543921 79 emnlp-2012-Learning Syntactic Categories Using Paradigmatic Representations of Word Context

13 0.32488808 132 emnlp-2012-Universal Grapheme-to-Phoneme Prediction Over Latin Alphabets

14 0.32108587 59 emnlp-2012-Generating Non-Projective Word Order in Statistical Linearization

15 0.31824791 93 emnlp-2012-Multi-instance Multi-label Learning for Relation Extraction

16 0.31574914 108 emnlp-2012-Probabilistic Finite State Machines for Regression-based MT Evaluation

17 0.30764008 89 emnlp-2012-Mixed Membership Markov Models for Unsupervised Conversation Modeling

18 0.30697355 19 emnlp-2012-An Entity-Topic Model for Entity Linking

19 0.30527836 104 emnlp-2012-Parse, Price and Cut-Delayed Column and Row Generation for Graph Based Parsers

20 0.29250664 127 emnlp-2012-Transforming Trees to Improve Syntactic Convergence


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (16, 0.029), (25, 0.023), (34, 0.066), (45, 0.012), (60, 0.108), (63, 0.054), (64, 0.021), (65, 0.029), (70, 0.028), (73, 0.012), (74, 0.041), (76, 0.054), (80, 0.019), (81, 0.37), (86, 0.022), (95, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78795367 96 emnlp-2012-Name Phylogeny: A Generative Model of String Variation

Author: Nicholas Andrews ; Jason Eisner ; Mark Dredze

Abstract: Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, “similar” strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.

2 0.77854329 59 emnlp-2012-Generating Non-Projective Word Order in Statistical Linearization

Author: Bernd Bohnet ; Anders Bjorkelund ; Jonas Kuhn ; Wolfgang Seeker ; Sina Zarriess

Abstract: We propose a technique to generate nonprojective word orders in an efficient statistical linearization system. Our approach predicts liftings of edges in an unordered syntactic tree by means of a classifier, and uses a projective algorithm for tree linearization. We obtain statistically significant improvements on six typologically different languages: English, German, Dutch, Danish, Hungarian, and Czech.

3 0.40284777 136 emnlp-2012-Weakly Supervised Training of Semantic Parsers

Author: Jayant Krishnamurthy ; Tom Mitchell

Abstract: We present a method for training a semantic parser using only a knowledge base and an unlabeled text corpus, without any individually annotated sentences. Our key observation is that multiple forms of weak supervision can be combined to train an accurate semantic parser: semantic supervision from a knowledge base, and syntactic supervision from dependency-parsed sentences. We apply our approach to train a semantic parser that uses 77 relations from Freebase in its knowledge representation. This semantic parser extracts instances of binary relations with state-of-the-art accuracy, while simultaneously recovering much richer semantic structures, such as conjunctions of multiple relations with partially shared arguments. We demonstrate recovery of this richer structure by extracting logical forms from natural language queries against Freebase. On this task, the trained semantic parser achieves 80% precision and 56% recall, despite never having seen an annotated logical form.

4 0.39721599 33 emnlp-2012-Discovering Diverse and Salient Threads in Document Collections

Author: Jennifer Gillenwater ; Alex Kulesza ; Ben Taskar

Abstract: We propose a novel probabilistic technique for modeling and extracting salient structure from large document collections. As in clustering and topic modeling, our goal is to provide an organizing perspective into otherwise overwhelming amounts of information. We are particularly interested in revealing and exploiting relationships between documents. To this end, we focus on extracting diverse sets of threads—singly-linked, coherent chains of important documents. To illustrate, we extract research threads from citation graphs and construct timelines from news articles. Our method is highly scalable, running on a corpus of over 30 million words in about four minutes, more than 75 times faster than a dynamic topic model. Finally, the results from our model more closely resemble human news summaries according to several metrics and are also preferred by human judges.

5 0.39619273 71 emnlp-2012-Joint Entity and Event Coreference Resolution across Documents

Author: Heeyoung Lee ; Marta Recasens ; Angel Chang ; Mihai Surdeanu ; Dan Jurafsky

Abstract: We introduce a novel coreference resolution system that models entities and events jointly. Our iterative method cautiously constructs clusters of entity and event mentions using linear regression to model cluster merge operations. As clusters are built, information flows between entity and event clusters through features that model semantic role dependencies. Our system handles nominal and verbal events as well as entities, and our joint formulation allows information from event coreference to help entity coreference, and vice versa. In a cross-document domain with comparable documents, joint coreference resolution performs significantly better (over 3 CoNLL F1 points) than two strong baselines that resolve entities and events separately.

6 0.39476785 92 emnlp-2012-Multi-Domain Learning: When Do Domains Matter?

7 0.39322606 123 emnlp-2012-Syntactic Transfer Using a Bilingual Lexicon

8 0.39296481 14 emnlp-2012-A Weakly Supervised Model for Sentence-Level Semantic Orientation Analysis with Multiple Experts

9 0.39282641 82 emnlp-2012-Left-to-Right Tree-to-String Decoding with Prediction

10 0.39132324 24 emnlp-2012-Biased Representation Learning for Domain Adaptation

11 0.39071739 23 emnlp-2012-Besting the Quiz Master: Crowdsourcing Incremental Classification Games

12 0.39064643 20 emnlp-2012-Answering Opinion Questions on Products by Exploiting Hierarchical Organization of Consumer Reviews

13 0.39022619 70 emnlp-2012-Joint Chinese Word Segmentation, POS Tagging and Parsing

14 0.38989803 18 emnlp-2012-An Empirical Investigation of Statistical Significance in NLP

15 0.38976461 107 emnlp-2012-Polarity Inducing Latent Semantic Analysis

16 0.38925812 3 emnlp-2012-A Coherence Model Based on Syntactic Patterns

17 0.38882104 110 emnlp-2012-Reading The Web with Learned Syntactic-Semantic Inference Rules

18 0.38828751 39 emnlp-2012-Enlarging Paraphrase Collections through Generalization and Instantiation

19 0.38652954 124 emnlp-2012-Three Dependency-and-Boundary Models for Grammar Induction

20 0.38598579 109 emnlp-2012-Re-training Monolingual Parser Bilingually for Syntactic SMT