nips nips2002 nips2002-15 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. [sent-7, score-0.561]
2 The system is composed of a generative probability model and hill-climbing and directed search algorithms. [sent-8, score-0.235]
3 By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. [sent-9, score-0.32]
4 Quantitative results are shown by measuring the accuracy of the morphological relations identified. [sent-11, score-0.221]
5 Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique. [sent-12, score-0.157]
6 1 Introduction One of the fundamental problems in computational linguistics is adaptation of language processing systems to new languages with minimal reliance on human expertise. [sent-13, score-0.131]
7 A ubiquitous component of language processing systems is the morphological analyzer, which determines the properties of morphologically complex words like watches and gladly by inferring their derivation as watch+s and glad+ly. [sent-14, score-0.459]
8 The derivation reveals much about the word, such as the fact that glad+ly shares syntactic properties with quick+ly and semantic properties with its stem glad. [sent-15, score-0.364]
9 While morphological processes can take many forms, the most common are suffixation and prefixation (collectively, concatenative morphology). [sent-16, score-0.247]
10 In this paper, we present a system for unsupervised inference of morphological derivations of written words, with no prior knowledge of the language in question. [sent-17, score-0.323]
11 Specifically, neither the stems nor the suffixes of the language are given in advance. [sent-18, score-0.277]
12 It is applicable to any language for which written word lists are available. [sent-20, score-0.215]
13 In languages that have been a focus of research in computational linguistics the practical applications are limited, but in languages like Polish, automated analysis of unannotated text corpora has potential applications for information retrieval and other language processing systems. [sent-21, score-0.205]
14 In this paper, however, we focus on the problem of unsupervised morphological inference for its inherent interest. [sent-23, score-0.231]
15 We are particularly interested in developing automated morphological analysis as a first stage of a larger grammatical inference system, and hence we favor a conservative analysis that identifies primarily productive morphological processes (those that can be applied to new words). [sent-31, score-0.434]
16 In this paper, we present a probabilistic model and search algorithm for automated analysis of suffixation, along with experiments comparing our system to that of Goldsmith [4]. [sent-32, score-0.163]
17 This system, which extends the system of Snover and Brent [5], is designed to detect the final stem and suffix break of each word given a list of words. [sent-33, score-0.547]
18 It does not distinguish between derivational and inflectional suffixation or between the notion of a stem and a root. [sent-34, score-0.429]
19 Further, it does not currently have a mechanism to deal with multiple interpretations of a word, or to deal with morphological ambiguity. [sent-35, score-0.182]
20 2 Probability Model This section introduces a prior probability distribution over the space of all hypotheses, where a hypothesis is a set of words, each with a morphological split separating the stem and suffix. [sent-37, score-0.668]
21 The hypothesis is generated by choosing the number of stems and suffixes, the spellings of those stems and suffixes and then the combination of the stems and suffixes. [sent-39, score-0.745]
22 The seven steps are presented below, along with their probability distributions and a running example of how a hypothesis could be generated by this process. [sent-40, score-0.122]
23 For each stem, choose its length in letters according to the inverse squared distribution. [sent-51, score-0.384]
24 For each stem from 1 to the number of stems, generate the stem by choosing its letters at random, according to the character probabilities. [sent-57, score-0.364]
25 The probability of any character being chosen is obtained from a maximum likelihood estimate: the count of that character among all the hypothesized stems and suffixes, divided by the total number of letters. [sent-60, score-0.288]
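As a rough illustration of steps 3 and 4, the spelling probability of a single stem or suffix could be computed as sketched below; the function names, the normalization of the inverse squared length distribution over a maximum length, the treatment of the empty suffix, and the use of log probabilities are illustrative assumptions rather than details taken from the paper.

    import math
    from collections import Counter

    def letter_probabilities(stems, suffixes):
        # Maximum likelihood estimate: count each character over all
        # hypothesized stems and suffixes and normalize by the total count.
        counts = Counter(c for m in list(stems) + list(suffixes) for c in m)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()}

    def spelling_log_prob(morpheme, letter_probs, max_len=50):
        if not morpheme:
            return 0.0  # treat the empty suffix as free; an assumption of this sketch
        # Length drawn from an inverse squared distribution (proportional
        # to 1/l^2), normalized here over 1..max_len.
        z = sum(1.0 / (l * l) for l in range(1, max_len + 1))
        log_p = math.log((1.0 / len(morpheme) ** 2) / z)
        # Letters drawn independently from the character MLE.
        for c in morpheme:
            log_p += math.log(letter_probs[c])
        return log_p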
26 The joint probability of the hypothesized stem and suffix sets, STEM and SUFF, is defined by the distribution P(STEM, SUFF) = |STEM|! |SUFF|! prod_{w in STEM or SUFF} P(|w|) prod_{c in w} P(c). (3) [sent-61, score-0.435]
27 The factorial terms reflect the fact that the stems and suffixes could be generated in any order. [sent-62, score-0.217]
28 A paradigm is a set of suffixes and the stems that attach to those suffixes and no others. [sent-67, score-0.318]
29 Each stem is in exactly one paradigm, and each paradigm has at least one stem. [sent-68, score-0.465]
30 The distribution for picking the number of suffixes for each paradigm is given by equation (4); the joint probability over all paradigms is therefore given by equation (5). Example: 2, 1, 2. [sent-73, score-0.129]
31 For each paradigm, choose the set of suffixes, PARA, that the paradigm will represent. [sent-75, score-0.222]
32 For each stem choose the paradigm that the stem will belong in, according to a distribution that favors paradigms with more stems. [sent-82, score-0.939]
33 The probability of choosing a paradigm for a stem is calculated using a maximum likelihood estimate: the number of stems in the paradigm divided by the total number of stems, where PARA is the set of stems in the paradigm. [sent-83, score-0.811]
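A minimal sketch of this maximum likelihood term for stage 7, under the assumption that the probability of the whole assignment is the product over stems of the fraction of stems in the chosen paradigm; the data layout is hypothetical.

    import math

    def paradigm_log_prob(paradigm_sizes):
        # paradigm_sizes[i] = number of stems assigned to paradigm i.
        # Each stem is taken to have joined paradigm i with probability
        # |PARA_i| / total_stems, which favors paradigms with more stems.
        total = sum(paradigm_sizes)
        return sum(n * math.log(n / total) for n in paradigm_sizes)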
34 Combining the results of stages 6 and 7, one can see that the running example would yield the hypothesis consisting of the set of words with suffix breaks: walk+, walk+s, walk+ed, look+, look+s, look+ed, far+, door+, door+s, cat+, cat+s. [sent-87, score-0.23]
35 Removing the breaks in the words results in the set of input words. [sent-88, score-0.208]
36 To find the probability of this hypothesis, just take the product of the probabilities from equations (1) to (7). [sent-89, score-0.122]
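In practice this product is most safely computed in log space to avoid numerical underflow; the sketch below, which builds on the two illustrative sketches above and omits the terms for the numbers of stems, suffixes and paradigms, is an assumption about one way to organize the computation, not the paper's implementation.

    def hypothesis_log_prob(stems, suffixes, paradigm_sizes, letter_probs):
        # Log of the product of the spelling terms (steps 3-4) and the
        # paradigm assignment term (step 7); the terms for the numbers of
        # stems, suffixes and paradigms are left out of this sketch.
        spelling = sum(spelling_log_prob(m, letter_probs)
                       for m in list(stems) + list(suffixes))
        return spelling + paradigm_log_prob(paradigm_sizes)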
37 Typically one wishes to know the probability of the hypothesis given the data; however, in our case such a distribution is not required. [sent-91, score-0.122]
38 Equation (8) shows how the probability of the hypothesis given the data could be derived from Bayes' law. [sent-92, score-0.122]
39 P(Hyp | Data) = P(Data | Hyp) P(Hyp) / P(Data) (8) Our search only considers hypotheses consistent with the data. [sent-93, score-0.159]
40 The probability of the data given the hypothesis, P(Data | Hyp), is always 1, since if you remove the breaks from any given hypothesis, the input data is produced. [sent-94, score-0.128]
41 The prior probability of the data is constant over all hypotheses, thus the probability of the hypothesis given the data reduces to P(Hyp). [sent-96, score-0.15]
42 The prior probability of the hypothesis is given by the above generative process and, among all consistent hypotheses, the one with the greatest prior probability also has the greatest posterior probability. [sent-97, score-0.211]
43 3 Search This section details a novel search algorithm which is used to find a high probability segmentation of all the words in the input lexicon. [sent-98, score-0.307]
44 The input lexicon is a list of words extracted from a corpus. [sent-99, score-0.331]
45 The output of the search is a segmentation of each of the input words into a stem and suffix. [sent-100, score-0.643]
46 The search algorithm has two phases, which we call the directed search and the hill-climbing search. [sent-101, score-0.248]
47 The directed search builds up a consistent hypothesis about the segmentation of all words in the input out of consistent hypotheses about subsets of the words. [sent-102, score-0.527]
48 The hill-climbing search further tunes the result of the directed search by trying out nearby hypotheses over all the input words. [sent-103, score-0.339]
49 1 Directed Search The directed search is accomplished in two steps. [sent-105, score-0.154]
50 The remainder of the input lexicon is added to this sub-hypothesis at which point it becomes the final hypothesis. [sent-108, score-0.195]
51 We define the set of possible suffixes to be the set of terminal substrings, including the empty string, of the words in the input lexicon. [sent-109, score-0.136]
52 For each subset of the possible suffixes, there is a maximal set of possible stems (initial substrings), such that for each stem in that set and each suffix in the subset, the stem concatenated with the suffix is a word in the input lexicon. [sent-110, score-0.298]
53 We define the corresponding sub-hypothesis to be the one in which each input word that can be analyzed as consisting of a stem in the maximal stem set and a suffix in the suffix subset is analyzed that way. [sent-111, score-0.471]
54 This sub-hypothesis consists of all pairings of those stems and suffixes, with the corresponding morphological breaks. [sent-112, score-0.399]
55 We only consider sub-hypotheses which have at least two stems and two suffixes. [sent-114, score-0.217]
56 " " " ¥ ¦¤ ¢ © ¨ ¢ ¢ £ " § ¤ ¨ $ 9 ¤ ¤ § ¢ § ¥ § $ ¨ " For each sub-hypothesis, , there is a corresponding null hypothesis, , which has the same set of words as , but in which all the words are hypothesized to consist of the ¨ word as the stem and as the suffix. [sent-115, score-0.791]
57 By beginning at the node representing no suffixes, one can apply standard graph search techniques, such as a beam search or a best-first search, to find the best scoring nodes without visiting all nodes. [sent-120, score-0.43]
58 While one cannot guarantee that such approaches perform exactly the same as examining all sub-hypotheses, initial experiments using a beam search with a beam size of 100 show that the best sub-hypotheses are found with a significant decrease in the number of nodes visited. [sent-121, score-0.22]
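One possible beam search over the lattice of suffix sets, starting from the node with no suffixes and expanding one suffix at a time, is sketched below; the scoring function, beam width, and termination rule are placeholders and not taken from the paper.

    def beam_search(possible_suffixes, lexicon, score, beam_width=100):
        # Nodes are frozensets of suffixes; start from the empty suffix set.
        # score(node, lexicon) is a placeholder for the probability gain of
        # the node's sub-hypothesis over its null hypothesis; it may return
        # float('-inf') for nodes with fewer than two stems or suffixes.
        beam = [frozenset()]
        seen = set(beam)
        visited = []
        while beam:
            children = []
            for node in beam:
                for f in possible_suffixes:
                    child = node | {f}
                    if child not in seen:
                        seen.add(child)
                        children.append(child)
            children.sort(key=lambda n: score(n, lexicon), reverse=True)
            beam = children[:beam_width]   # keep only the best children
            visited.extend(beam)
        visited.sort(key=lambda n: score(n, lexicon), reverse=True)
        return visited[:beam_width]        # best scoring nodes found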
59 The highest scoring sub-hypotheses are incrementally combined in order to create a hypothesis over the complete set of input words. [sent-123, score-0.191]
60 We iteratively remove the highest scoring hypothesis from this set. [sent-127, score-0.174]
61 The words in the removed hypothesis are added to each of the remaining sub-hypotheses in the set, and to their null hypotheses, with their morphological breaks taken from the removed hypothesis. [sent-128, score-0.395]
62 If a word being added was already present, the morphological break from the higher scoring hypothesis overrides the other. [sent-129, score-0.314]
63 All of the sub-hypotheses are now rescored, as the words in them have changed. [sent-130, score-0.136]
64 All input words that are not in the combined hypothesis are added to it with the empty suffix. [sent-134, score-0.136]
65 2 Hill Climbing Search The hill climbing search further optimizes the probability of the hypothesis by moving stems to new nodes. [sent-136, score-0.511]
66 For each possible suffix and each node, the search attempts to add the suffix to the node. [sent-137, score-0.126]
67 This means that all stems in the node that can take the suffix are moved to a new node, which represents all the suffixes of the old node together with the new suffix. [sent-138, score-0.239]
68 This is analogous to pushing stems to adjacent nodes in a directed graph. [sent-139, score-0.305]
69 A stem can only be moved into a node with a given suffix if the resulting word is an observed word in the input lexicon. [sent-140, score-0.525]
70 The hill climbing search continues to add suffixes to and remove suffixes from nodes until the probability of the hypothesis cannot be increased. [sent-143, score-0.35]
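A sketch of one hill-climbing pass in the same spirit: for each node and each possible suffix, the eligible stems are tentatively moved to the node extended with that suffix, and the move is kept only if the hypothesis probability increases. The node map and probability function are the hypothetical representations used in the earlier sketches.

    def hill_climb_pass(nodes, possible_suffixes, lexicon, log_prob):
        # nodes maps a frozenset of suffixes to the set of stems at that node;
        # log_prob(nodes) is a placeholder for the hypothesis log probability.
        improved = False
        for suffixes in list(nodes.keys()):
            for f in possible_suffixes:
                if f in suffixes or suffixes not in nodes:
                    continue
                # Only stems whose combination with f is an observed word may move.
                movable = {t for t in nodes[suffixes] if t + f in lexicon}
                if not movable:
                    continue
                candidate = {k: set(v) for k, v in nodes.items()}
                candidate[suffixes] = candidate[suffixes] - movable
                candidate.setdefault(suffixes | {f}, set()).update(movable)
                if not candidate[suffixes]:
                    del candidate[suffixes]
                # Keep the move only if it increases the hypothesis probability.
                if log_prob(candidate) > log_prob(nodes):
                    nodes, improved = candidate, True
        return nodes, improved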
71 1 Experiment We tested our unsupervised morphology learning system, which we refer to as Paramorph, and Goldsmith’s MDL system, otherwise known as Linguistica (a demo version is available on the web, http://humanities.), on various sized word lists. [sent-146, score-0.257]
72 The results were evaluated by measuring the accuracy of the stem relations identified. [sent-154, score-0.403]
73 We extracted input lexicons from each corpus, excluding words containing non-alphabetic characters. [sent-155, score-0.162]
74 The 100 most common words in each corpus were also excluded, since these words tend to be function words and are not very informative for morphology. [sent-156, score-0.48]
75 The experiments in English were also conducted on the 16,000 most common words from the Hansard corpus. [sent-158, score-0.136]
76 1 Stem Relation Ideally, we would like to be able to specify the correct morphological break for each of the words in the input; however, morphology is laced with ambiguity, and we believe this to be an inappropriate method for this task. [sent-161, score-0.182]
77 It seems that the stem “locate” is combined with the suffix “tion” in a word such as “location”, but in terms of simple concatenation it is unclear if the break should be placed before or after the “t”. [sent-163, score-0.453]
78 In an attempt to solve this problem we have developed a new measure of performance, which does not specify the exact morphological split of a word. [sent-164, score-0.182]
79 We measure the accuracy of the stems predicted by examining whether two words which are morphologically related are predicted as having the same stem. [sent-165, score-0.552]
80 The actual break point for the stems is not evaluated, only whether the words are predicted as having the same stem. [sent-166, score-0.45]
81 Two words are related if they share the same immediate stem. [sent-168, score-0.136]
82 For example the words “building”, “build”, and “builds” are related since they all have “build” as a stem, just as “building” and “buildings” are related as they both have “building” as a stem. [sent-169, score-0.136]
83 Irregular forms of words are also considered to be related even though such relations would be very difficult to detect with a simple concatenation model. [sent-171, score-0.214]
84 The stem relation precision measures how many of the relations predicted by the system were correct, while the recall measures how many of the relations present in the data were found. [sent-172, score-0.623]
85 Stem relation fscore is an unbiased combination of precision and recall that favors equal scores. [sent-173, score-0.19]
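The stem relation metric described above can be made concrete as follows; the input format (dictionaries mapping each word to its predicted or gold stem) is an illustrative assumption.

    from itertools import combinations

    def relation_pairs(word_to_stem):
        # Unordered pairs of words that are assigned the same stem.
        by_stem = {}
        for word, stem in word_to_stem.items():
            by_stem.setdefault(stem, []).append(word)
        return {frozenset(pair) for words in by_stem.values()
                for pair in combinations(sorted(words), 2)}

    def stem_relation_scores(predicted, gold):
        pred_pairs, gold_pairs = relation_pairs(predicted), relation_pairs(gold)
        correct = len(pred_pairs & gold_pairs)
        precision = correct / len(pred_pairs) if pred_pairs else 0.0
        recall = correct / len(gold_pairs) if gold_pairs else 0.0
        fscore = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
        return precision, recall, fscore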
86 Due to software difficulties we were unable to get Linguistica to run on 500, 1000, and 2000 words in English. [sent-177, score-0.176]
87 (Figure 2: Stem Relation Fscores.) Paramorph predicted a similar number of suffixes across lexicon sizes, and Linguistica found an increasingly large number of suffixes, predicting over 700 different suffixes in the 16,000 word English lexicon. [sent-190, score-0.268]
88 Figure 2 shows the fscores using the stem relation metric for various sizes of English and Polish input lexicons. [sent-191, score-0.491]
89 Paramorph maintains a very high precision across lexicon sizes in both languages, whereas the precision of Linguistica decreases considerably at larger lexicon sizes. [sent-192, score-0.434]
90 However, Linguistica shows an increasing recall as the lexicon size increases, while Paramorph's recall decreases as the lexicon size increases, though the recall of Linguistica in Polish is consistently lower than Paramorph's recall. [sent-193, score-0.428]
91 Table 1: Sample Paradigms in Polish. Suffixes: -a, -e, -ego, -ej, -ie, -o, -y (stem dziwn); -a, -ami, -y, -ę (stems chmur, siekier); -cie, -li, -m, -ć (stems gada, odda, sprzeda). Table 1 shows several of the larger paradigms found by Paramorph when run on 8000 words of Polish. [sent-195, score-0.204]
92 The first paradigm shown is for the single adjective stem meaning “strange” with numerous inflections for gender, number and case, as well as one derivational suffix, “ie” which changes it into an adverb, “strangely”. [sent-196, score-0.53]
93 The second paradigm is for the nouns, “cloud” and “ax”, with various case inflections, and the third paradigm contains the verbs, “talk”, “return”, and “sell”. [sent-197, score-0.303]
94 This is consistent with our goal of creating a conservative system for morphological analysis, where the number of false positives is minimized. [sent-202, score-0.214]
95 In addition phonology plays a much stronger role in Polish morphology, causing alterations in stems, which are difficult to detect using a concatenative framework. [sent-205, score-0.114]
96 5 Discussion Many of the stem relations predicted by Paramorph result from postulating stem and suffix breaks in words that are actually morphologically simple. [sent-206, score-1.076]
97 This occurs when the endings of these words resemble other, correct, suffixes. [sent-207, score-0.136]
98 In an attempt to deal with this problem we have investigated incorporating semantic information into the probability model since morphologically related words also tend to be semantically related. [sent-208, score-0.267]
99 For the most part, Paramorph performed better than Linguistica with respect to Fscore, but more importantly, the precision of Linguistica does not approach the precision of our algorithm, particularly on the larger corpus sizes. [sent-211, score-0.15]
100 Unsupervised learning of the morphology of a natural language. [sent-234, score-0.108]
wordName wordTfidf (topN-words)
[('xes', 0.452), ('stem', 0.364), ('paramorph', 0.293), ('linguistica', 0.261), ('suf', 0.228), ('polish', 0.228), ('stems', 0.217), ('para', 0.212), ('morphological', 0.182), ('lexicon', 0.169), ('words', 0.136), ('english', 0.111), ('morphology', 0.108), ('paradigm', 0.101), ('brent', 0.098), ('hypothesis', 0.094), ('search', 0.094), ('xation', 0.085), ('morphologically', 0.081), ('word', 0.081), ('corpus', 0.072), ('hyp', 0.071), ('paradigms', 0.068), ('concatenative', 0.065), ('derivational', 0.065), ('ectional', 0.065), ('fscore', 0.065), ('hypotheses', 0.065), ('directed', 0.06), ('language', 0.06), ('scoring', 0.052), ('break', 0.051), ('walk', 0.05), ('unsupervised', 0.049), ('climbing', 0.049), ('fscores', 0.049), ('snover', 0.049), ('suff', 0.049), ('breaks', 0.046), ('predicted', 0.046), ('door', 0.043), ('goldsmith', 0.043), ('hypothesized', 0.043), ('precision', 0.039), ('acl', 0.039), ('relations', 0.039), ('automated', 0.037), ('languages', 0.037), ('beam', 0.036), ('linguistics', 0.034), ('mdl', 0.034), ('relation', 0.034), ('cat', 0.033), ('buildings', 0.033), ('glad', 0.033), ('productive', 0.033), ('subhypotheses', 0.033), ('system', 0.032), ('node', 0.032), ('null', 0.031), ('recall', 0.03), ('causing', 0.03), ('hill', 0.029), ('look', 0.028), ('ections', 0.028), ('hansard', 0.028), ('louis', 0.028), ('suffixes', 0.028), ('remove', 0.028), ('identi', 0.028), ('probability', 0.028), ('nodes', 0.028), ('culties', 0.027), ('examining', 0.026), ('input', 0.026), ('mo', 0.024), ('build', 0.024), ('segmentation', 0.023), ('ly', 0.023), ('matthew', 0.023), ('nal', 0.023), ('michael', 0.022), ('software', 0.022), ('favors', 0.022), ('moved', 0.022), ('semantically', 0.022), ('generative', 0.021), ('discovering', 0.02), ('greatest', 0.02), ('concatenation', 0.02), ('choose', 0.02), ('detect', 0.019), ('incrementally', 0.019), ('lists', 0.019), ('substrings', 0.019), ('building', 0.019), ('washington', 0.018), ('unclear', 0.018), ('sizes', 0.018), ('unable', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
2 0.10119895 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor
Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.
3 0.095261283 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
4 0.085818179 42 nips-2002-Bias-Optimal Incremental Problem Solving
Author: Jürgen Schmidhuber
Abstract: Given is a problem sequence and a probability distribution (the bias) on programs computing solution candidates. We present an optimally fast way of incrementally solving each task in the sequence. Bias shifts are computed by program prefixes that modify the distribution on their suffixes by reusing successful code for previous tasks (stored in non-modifiable memory). No tested program gets more runtime than its probability times the total search time. In illustrative experiments, ours becomes the first general system to learn a universal solver for arbitrary disk Towers of Hanoi tasks (minimal solution size ). It demonstrates the advantages of incremental learning by profiting from previously solved, simpler tasks involving samples of a simple context free language. ¦ ¤ ¢ §¥£¡ 1 Brief Introduction to Optimal Universal Search Consider an asymptotically optimal method for tasks with quickly verifiable solutions: ¦ ¦ © £ £¨ © © © © ¦ ¦ ¦ Method 1.1 (L SEARCH ) View the -th binary string as a potential program for a universal Turing machine. Given some problem, for all do: every steps on average execute (if possible) one instruction of the -th program candidate, until one of the programs has computed a solution. ! © © © ¢
5 0.056363419 125 nips-2002-Learning Semantic Similarity
Author: Jaz Kandola, Nello Cristianini, John S. Shawe-taylor
Abstract: The standard representation of text documents as bags of words suffers from well known limitations, mostly due to its inability to exploit semantic similarity between terms. Attempts to incorporate some notion of term similarity include latent semantic indexing [8], the use of semantic networks [9], and probabilistic methods [5]. In this paper we propose two methods for inferring such similarity from a corpus. The first one defines word-similarity based on document-similarity and viceversa, giving rise to a system of equations whose equilibrium point we use to obtain a semantic similarity measure. The second method models semantic relations by means of a diffusion process on a graph defined by lexicon and co-occurrence information. Both approaches produce valid kernel functions parametrised by a real number. The paper shows how the alignment measure can be used to successfully perform model selection over this parameter. Combined with the use of support vector machines we obtain positive results. 1
6 0.054762378 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
7 0.049749337 40 nips-2002-Bayesian Models of Inductive Generalization
8 0.048716448 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
9 0.038000025 124 nips-2002-Learning Graphical Models with Mercer Kernels
10 0.037725702 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
11 0.03663091 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
12 0.035478242 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
13 0.034408949 198 nips-2002-Theory-Based Causal Inference
14 0.033948321 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
15 0.033310432 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
16 0.031834129 115 nips-2002-Informed Projections
17 0.027769525 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
18 0.027442768 85 nips-2002-Fast Kernels for String and Tree Matching
19 0.027393254 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
20 0.02704153 83 nips-2002-Extracting Relevant Structures with Side Information
topicId topicWeight
[(0, -0.098), (1, -0.017), (2, 0.003), (3, 0.002), (4, -0.073), (5, 0.05), (6, -0.045), (7, -0.083), (8, 0.009), (9, -0.083), (10, -0.091), (11, 0.001), (12, 0.037), (13, -0.029), (14, -0.065), (15, -0.002), (16, 0.042), (17, -0.005), (18, 0.023), (19, -0.115), (20, -0.046), (21, 0.005), (22, -0.071), (23, 0.015), (24, -0.022), (25, -0.06), (26, -0.017), (27, -0.062), (28, 0.098), (29, 0.059), (30, 0.046), (31, 0.057), (32, 0.018), (33, -0.025), (34, 0.039), (35, 0.14), (36, 0.125), (37, -0.042), (38, -0.045), (39, -0.086), (40, 0.028), (41, -0.051), (42, -0.09), (43, -0.004), (44, 0.077), (45, -0.02), (46, -0.022), (47, -0.018), (48, -0.017), (49, 0.116)]
simIndex simValue paperId paperTitle
same-paper 1 0.95575082 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
2 0.63555825 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures
Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman
Abstract: The distributional principle according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1] has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as systematicity. 1 Motivation Models dealing with the acquisition of syntactic knowledge are sharply divided into two classes, depending on whether they subscribe to some variant of the classical generative theory of syntax, or operate within the framework of “general-purpose” statistical or distributional learning. An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by the Government and Binding theory. An example of the latter model is Elman’s widely used Simple Recursive Network (SRN) [3]. We believe that polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common powerful framework. Indeed, on the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation, probabilistic learning, and “connectionist” modeling. Yet, generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel in just those tasks, yet are brittle and difficult to train. We present a scheme that acquires “raw” syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1]. 2 The ADIOS model The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. The PA algorithm aims to detect patterns — repetitive sequences of “significant” strings of primitives occurring in the corpus (Figure 1). In that, it is related to prior work on alignment-based learning [4] and regular expression (“local grammar”) extraction [5] from corpora. We stress, however, that our algorithm requires no pre-judging either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion. In the initial phase of the PA algorithm the text is segmented down to the smallest possible morphological constituents (e.g., ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds).1 This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph. 
A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number. (a) mh mi mk mj (b) ci{j,k}l ml mn mi ck ... cj ml cu cv . Figure 1: (a) Two sequences mi , mj , ml and mi , mk , ml form a pattern ci{j,k}l = mi , {mj , mk }, ml , which allows mj and mk to be attributed to the same equivalence class, following the principle of complementary distributions [1]. Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. 1). (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of equivalence classes associated with cu and cv . In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant P atterns (sequences of constituents) ( SP), which are then used to modify the graph (Algorithm 1). For each path pi , the algorithm constructs a list of candidate constituents, ci1 , . . . , cik . Each of these consists of a “prefix” (sequence of graph edges), an equivalence class of vertices, and a “suffix” (another sequence of edges; cf. Figure 2(b)). The criterion I for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c1 , . . . , ck must have high mutual information): I (c1 , c2 , . . . , ck ) = 2 e−(L/k) P (c1 , c2 , . . . , ck ) log P (c1 , c2 , . . . , ck ) Πk P (cj ) j=1 (1) where L is the typical context length and k is the length of the candidate pattern; the probabilities associated with a cj are estimated from frequencies that are immediately available 1 We remark that the algorithm can work in any language, with any set of tokens, including individual characters – or phonemes, if applied to speech. Algorithm 1 PA (pattern acquisition), phase 2 1: while patterns exist do 2: for all path ∈ graph do {path=sentence; graph=corpus} 3: for all source node ∈ path do 4: for all sink node ∈ path do {source and sink can be equivalence classes} 5: degree of separation = path index(sink) − path index(source); 6: pattern table ⇐ detect patterns(source, sink, degree of separation, equivalence table); 7: end for 8: end for 9: winner ⇐ get most significant pattern(pattern table); 10: equivalence table ⇐ detect equivalences(graph, winner); 11: graph ⇐ rewire graph(graph, winner); 12: end for 13: end while in the graph (e.g., the out-degree of a node is related to the marginal probability of the corresponding cj ). Equation 1 balances two opposing “forces” in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context. A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2). 
Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are untouched. This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power. During the pass over the corpus the list of equivalence sets is updated continuously; the identification of new significant patterns is done using thecurrent equivalence sets (Figure 3(d)). Thus, as the algorithm processes more and more text, it “bootstraps” itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. The recursive nature of this process enables the algorithm to form more and more complex patterns, in a hierarchical manner. The relationships among these can be visualized recursively in a tree format, with tree depth corresponding to the level of recursion (e.g., Figure 3(c)). The PA algorithm halts if it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop). Generalization. A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [6], p.63: “the grammar of a language is simply an inventory of linguistic units.” The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). Generalization can also ensue from partial activation of existing patterns by novel inputs. This function is supported by the input module, designed to process a novel sentence by forming its distributed representation in terms of activities of existing patterns (Figure 6). These are computed by propagating activation from bottom (the terminals) to top (the patterns) of the RDS. The initial activities wj of the terminals cj are calculated given the novel input s1 , . . . , sk as follows: wj = max {I(sk , cj )} m=1..k (2) 102: do you see the cat? 101: the cat is eating 103: are you sure? Sentence Number Within-Sentence Index 101_1 101_4 101_3 101_2 101_5 101_6 END her ing show eat play is cat Pam the BEGIN (a) 131_3 131_2 109_7 END ing 121_12 stay 121_10 play 121_8 101_6 109_6 cat the BEGIN 109_5 121_9 eat 109_4 (b) 109_9 101_5 109_8 101_4 101_3 101_2 is 131_1 101_1 121_13 121_11 131_1 131_3 101_1 109_4 PATTERN 230: the cat is {eat, play, stay} -ing 165_1 Equivalence Class 230: stay, eat, play 165_2 221_3 here stay play 171_3 165_3 eat 221_1 we 171_2 they BEGIN (d) END 101_2 109_5 121_9 121_8 171_1 ing stay 131_2 play eat is cat the BEGIN (c) PATTERN 231: BEGIN {they, we} {230} here 221_2 Figure 2: (a) A small portion of the RDS graph for a simple corpus, with sentence #101 (the cat is eat -ing) indicated by solid arcs. (b) This sentence joins a pattern the cat is {eat, play, stay} -ing, in which two others (#109,121) already participate. (c) The abstracted pattern, and the equivalence class associated with it (edges that belong to sequences not subsumed by this pattern, e.g., #131, are untouched). (d) The identification of new significant patterns is done using the acquired equivalence classes (e.g., #230). In this manner, the system “bootstraps” itself, recursively distilling more and more complex patterns. 
where I(sk , cj ) is the mutual information between sk and cj . For an equivalence class, the value propagated upwards is the strongest non-zero activation of its members; for a pattern, it is the average weight of the children nodes, on the condition that all the children were activated by adjacent inputs. Activity propagation continues until it reaches the top nodes of the pattern lattice. When the algorithm encounters a novel word, all the members of the terminal equivalence class contribute a value of , which is then propagated upwards as usual. This enables the model to make an educated guess as to the meaning of the unfamiliar word, by considering the patterns that become active (Figure 6(b)). 3 Results We now briefly describe the results of several studies designed to evaluate the viability of the ADIOS model, in which it was exposed to corpora of varying size and complexity. (a) propnoun:
3 0.59850925 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus
Author: Willem H. Zuidema
Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -
Author: David Fass, Jacob Feldman
Abstract: We present an account of human concept learning-that is, learning of categories from examples-based on the principle of minimum description length (MDL). In support of this theory, we tested a wide range of two-dimensional concept types, including both regular (simple) and highly irregular (complex) structures, and found the MDL theory to give a good account of subjects' performance. This suggests that the intrinsic complexity of a concept (that is, its description -length) systematically influences its leamability. 1- The Structure of Categories A number of different principles have been advanced to explain the manner in which humans learn to categorize objects. It has been variously suggested that the underlying principle might be the similarity structure of objects [1], the manipulability of decision bound~ aries [2], or Bayesian inference [3][4]. While many of these theories are mathematically well-grounded and have been successful in explaining a range of experimental findings, they have commonly only been tested on a narrow collection of concept types similar to the simple unimodal categories of Figure 1(a-e). (a) (b) (c) (d) (e) Figure 1: Categories similar to those previously studied. Lines represent contours of equal probability. All except (e) are unimodal. ~http://ruccs.rutgers.edu/~jacob/feldman.html Moreover, in the scarce research that has ventured to look beyond simple category types, the goal has largely been to investigate categorization performance for isolated irregular distributions, rather than to present a survey of performance across a range of interesting distributions. For example, Nosofsky has previously examined the
5 0.54737258 163 nips-2002-Prediction and Semantic Association
Author: Thomas L. Griffiths, Mark Steyvers
Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1
6 0.52366453 42 nips-2002-Bias-Optimal Incremental Problem Solving
7 0.50137633 40 nips-2002-Bayesian Models of Inductive Generalization
8 0.50002933 146 nips-2002-Modeling Midazolam's Effect on the Hippocampus and Recognition Memory
9 0.46392453 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
10 0.42240825 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis
11 0.40347719 176 nips-2002-Replay, Repair and Consolidation
12 0.32228845 85 nips-2002-Fast Kernels for String and Tree Matching
13 0.29442948 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
14 0.29430544 90 nips-2002-Feature Selection in Mixture-Based Clustering
15 0.28039443 188 nips-2002-Stability-Based Model Selection
16 0.27110046 89 nips-2002-Feature Selection by Maximum Marginal Diversity
17 0.26984307 125 nips-2002-Learning Semantic Similarity
18 0.26444003 185 nips-2002-Speeding up the Parti-Game Algorithm
19 0.25935382 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
20 0.25405699 115 nips-2002-Informed Projections
topicId topicWeight
[(11, 0.036), (23, 0.012), (42, 0.048), (54, 0.108), (55, 0.023), (57, 0.023), (64, 0.012), (67, 0.011), (68, 0.031), (74, 0.072), (75, 0.39), (87, 0.019), (92, 0.02), (98, 0.082)]
simIndex simValue paperId paperTitle
1 0.78711134 32 nips-2002-Approximate Inference and Protein-Folding
Author: Chen Yanover, Yair Weiss
Abstract: Side-chain prediction is an important subtask in the protein-folding problem. We show that finding a minimal energy side-chain configuration is equivalent to performing inference in an undirected graphical model. The graphical model is relatively sparse yet has many cycles. We used this equivalence to assess the performance of approximate inference algorithms in a real-world setting. Specifically we compared belief propagation (BP), generalized BP (GBP) and naive mean field (MF). In cases where exact inference was possible, max-product BP always found the global minimum of the energy (except in few cases where it failed to converge), while other approximation algorithms of similar complexity did not. In the full protein data set, maxproduct BP always found a lower energy configuration than the other algorithms, including a widely used protein-folding software (SCWRL). 1
same-paper 2 0.7713818 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology
Author: Matthew G. Snover, Michael R. Brent
Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.
3 0.55758989 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
Author: Tatyana Sharpee, Nicole C. Rust, William Bialek
Abstract: unkown-abstract
4 0.49338436 190 nips-2002-Stochastic Neighbor Embedding
Author: Geoffrey E. Hinton, Sam T. Roweis
Abstract: We describe a probabilistic approach to the task of placing objects, described by high-dimensional vectors or by pairwise dissimilarities, in a low-dimensional space in a way that preserves neighbor identities. A Gaussian is centered on each object in the high-dimensional space and the densities under this Gaussian (or the given dissimilarities) are used to define a probability distribution over all the potential neighbors of the object. The aim of the embedding is to approximate this distribution as well as possible when the same operation is performed on the low-dimensional “images” of the objects. A natural cost function is a sum of Kullback-Leibler divergences, one per object, which leads to a simple gradient for adjusting the positions of the low-dimensional images. Unlike other dimensionality reduction methods, this probabilistic framework makes it easy to represent each object by a mixture of widely separated low-dimensional images. This allows ambiguous objects, like the document count vector for the word “bank”, to have versions close to the images of both “river” and “finance” without forcing the images of outdoor concepts to be located close to those of corporate concepts.
5 0.39977455 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
Author: Elad Schneidman, William Bialek, Michael Ii
Abstract: A population of neurons typically exhibits a broad diversity of responses to sensory inputs. The intuitive notion of functional classification is that cells can be clustered so that most of the diversity is captured by the identity of the clusters rather than by individuals within clusters. We show how this intuition can be made precise using information theory, without any need to introduce a metric on the space of stimuli or responses. Applied to the retinal ganglion cells of the salamander, this approach recovers classical results, but also provides clear evidence for subclasses beyond those identified previously. Further, we find that each of the ganglion cells is functionally unique, and that even within the same subclass only a few spikes are needed to reliably distinguish between cells. 1
6 0.39576209 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
7 0.39116254 124 nips-2002-Learning Graphical Models with Mercer Kernels
8 0.38881311 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
9 0.38848874 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
10 0.38845712 53 nips-2002-Clustering with the Fisher Score
11 0.38838512 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
12 0.38623279 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories
13 0.38548413 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
14 0.38538307 169 nips-2002-Real-Time Particle Filters
15 0.38492054 10 nips-2002-A Model for Learning Variance Components of Natural Images
16 0.38467008 135 nips-2002-Learning with Multiple Labels
17 0.38416311 163 nips-2002-Prediction and Semantic Association
18 0.38416308 3 nips-2002-A Convergent Form of Approximate Policy Iteration
19 0.38292146 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits
20 0.38281307 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition