nips nips2001 nips2001-110 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alberto Paccanaro, Geoffrey E. Hinton
Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts ”Alberto has-father Pietro” and ”Pietro has-brother Giovanni”, we would like to be able to infer ”Alberto has-uncle Giovanni”. Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. 
In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is, to come up with one of its elements given the other two. Here we shall always try to complete the third element of the triplets. LRE will then represent each concept in the data as a learned vector in a Euclidean space.
Reference: text
sentIndex sentText sentNum sentScore
1 We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. [sent-5, score-0.582]
2 Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. [sent-8, score-0.278]
3 On a task involving family relationships we show that LRE can generalize better than any previously published method. [sent-9, score-0.132]
4 We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. [sent-10, score-0.339]
5 [1 Linear Relational Embedding] Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. [sent-11, score-0.145]
6 Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. [sent-12, score-0.628]
7 We want to be able to infer other instances of these relations. [sent-13, score-0.113]
8 For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts “Alberto has-father Pietro” and “Pietro has-brother Giovanni”, we would like to be able to infer “Alberto has-uncle Giovanni”. [sent-14, score-0.557]
9 Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. [sent-15, score-0.444]
10 In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. [sent-16, score-0.956]
11 Let us consider the case in which all the relations are binary, i.e. involve two concepts. [sent-17, score-0.197]
12 In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. [sent-20, score-0.707]
13 Inferring a triplet is equivalent to being able to complete it, that is to come up with one of its elements, given the other two. [sent-21, score-0.392]
14 Here we shall always try to complete the third element of the triplets (see footnote 1). [sent-22, score-0.431]
15 Footnote 1: Methods analogous to the ones presented here that can be used to complete any element of a triplet can be found in [4]. [sent-24, score-0.36]
16 LRE will then represent each concept in the data as a learned vector in a Euclidean space and each relationship between the two concepts as a learned matrix that maps the first concept into an approximation to the second concept. [sent-25, score-0.284]
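The representation described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `complete_triplet`, the toy 2-D vectors, and the rotation matrix are all hypothetical and hand-picked, whereas in LRE both the vectors and the matrices would be learned from data.

```python
import numpy as np

def complete_triplet(first_vec, relation_matrix, concept_vectors):
    """Return the index of the concept vector closest to relation_matrix @ first_vec."""
    predicted = relation_matrix @ first_vec
    sq_dists = np.sum((concept_vectors - predicted) ** 2, axis=1)
    return int(np.argmin(sq_dists))

# Hand-picked toy embedding (an assumption, not learned): three concepts in 2-D.
concepts = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [-1.0, 0.0]])

# A hypothetical relation represented as a 90-degree rotation matrix.
rot90 = np.array([[0.0, -1.0],
                  [1.0, 0.0]])

first_completion = complete_triplet(concepts[0], rot90, concepts)
second_completion = complete_triplet(concepts[1], rot90, concepts)
```

Completing a triplet here reduces to a matrix-vector product followed by a nearest-neighbour lookup among the concept vectors.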
17 Let us assume that our data consists of such triplets containing distinct concepts and binary relations. [sent-26, score-0.572]
18 We shall call this set of triplets ; will denote the set of -dimensional vectors corresponding to the concepts, and the set of matrices corresponding to the relations. [sent-27, score-0.466]
19 Often we shall need to indicate the vectors and the matrix which correspond to the concepts and the relation in a certain triplet . [sent-28, score-0.675]
20 In this case we shall denote the vector corresponding to the first concept with , the vector corresponding to the second concept with and the matrix corresponding to the relation with . [sent-29, score-0.309]
21 We shall therefore write the triplet as where and . [sent-30, score-0.379]
22 If for every triplet we think of as a noisy version of one of the concept vectors, then one way to learn an embedding is to maximize the probability that it is a noisy version of the correct completion, . [sent-32, score-0.564]
23 We imagine that a concept has an average location in the space, but that each “observation” of the concept is a noisy realization of this average location. [sent-33, score-0.184]
24 Assuming spherical Gaussian noise with a variance of on each dimension, the discriminative goodness function that corresponds to the log probability of getting the right completion, summed over all training triplets is: ! [sent-34, score-0.379]
25 However, when we learn an embedding by maximizing , we are not making use of exactly the information that we have in the triplets. [sent-37, score-0.216]
26 For each triplet, we are making the vector representing the correct completion more probable than any other concept vector given the relation matrix applied to the first concept vector, while the triplet states that this product must be equal to the correct completion. [sent-38, score-0.501]
27 The numerator of the goodness function does exactly this, but we also have the denominator, which is necessary in order to stay away from the trivial solution (see footnote 3). [sent-39, score-0.321]
28 4 For one-to-many relations we must not decrease the value of all the way to , because this would cause some concept vectors to become coincident. [sent-48, score-0.326]
29 It is worth pointing out that, in general, different initial configurations and optimization algorithms caused the system to arrive at different solutions, but these solutions were almost always very similar in terms of generalization performance. [sent-56, score-0.126]
30 In this problem, the data consists of people and relations among people belonging to two families, one Italian and one English, shown in fig. [sent-58, score-0.317]
31 The family information can be represented in simple propositions in triplet form; using the relations father, mother, husband, wife, son, daughter, uncle, aunt, brother, sister, nephew, niece, there are 112 such triplets in the two trees. [sent-61, score-0.531]
32 Vector end-points are indicated by *; the ones in the same family tree are connected to each other. [sent-71, score-0.251]
33 The vector of each person is a column, ordered according to the numbering on the tree diagram on the left. [sent-74, score-0.184]
34 To test performance, for each triplet in the test set we chose as completion the concepts according to their probability, given the first concept and the relation. [sent-75, score-0.601]
35 The system was generally able to complete correctly all triplets even when of them, picked at random, had been left out during training. [sent-76, score-0.473]
36 For most problems there exist triplets which cannot be completed. [sent-79, score-0.334]
37 Therefore, here we argue that it is not sufficient to test generalization by merely testing the completion of those complete-able triplets which have not been used for training. [sent-82, score-0.47]
38 The proper test for generalization is to see how the system completes any triplet of the kind where ranges over the concepts and R over the relations. [sent-83, score-0.597]
39 We cannot assume to have knowledge of which triplets admit a completion, and which do not. [sent-84, score-0.373]
40 To do this the system needs a way to indicate when a triplet does not admit a completion. [sent-86, score-0.396]
41 The parameters of this probabilistic model are, for each relation , the variances of the Gaussians and the relative density under the Uniform distribution, which we shall write as . [sent-91, score-0.125]
42 These parameters are learned using a validation set, which will be the union of a set of complete-able (positive) triplets and a set of (concept, relation) pairs which cannot be completed (negative), that is, pairs for which the result of applying the relation to the concept does not belong to the set of concepts. [sent-92, score-0.449]
43 This is done by maximizing a discriminative goodness function over the validation set. [sent-93, score-0.158]
44 Having learned these parameters, in order to complete any triplet we compute the probability distribution over each of the Gaussians and the Uniform distribution, given the vector obtained by applying the relation matrix to the first concept. [sent-98, score-0.36]
45 The system then chooses a vector or the “don’t know” answer according to those probabilities, as the completion to the triplet. [sent-99, score-0.124]
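A minimal sketch of this kind of decision rule follows, assuming one isotropic Gaussian per concept with a shared variance and a constant relative density for the Uniform component. The function name `completion_posterior` and the parameter values are hypothetical; in the paper the corresponding parameters are learned per relation on the validation set, and the exact parameterization is not recoverable from this text.

```python
import numpy as np

def completion_posterior(predicted, concept_vectors, variance, uniform_density):
    """Posterior over each concept's Gaussian plus a final 'don't know' (Uniform) entry."""
    sq_dists = np.sum((concept_vectors - predicted) ** 2, axis=1)
    gauss = np.exp(-sq_dists / (2.0 * variance))   # unnormalized Gaussian scores
    scores = np.append(gauss, uniform_density)     # last entry is the Uniform component
    return scores / scores.sum()

concept_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# A prediction that lands on a concept: that concept should win.
near = completion_posterior(np.array([0.0, 1.0]), concept_vectors,
                            variance=0.1, uniform_density=0.05)

# A prediction far from every concept: the "don't know" entry should win.
far = completion_posterior(np.array([10.0, 10.0]), concept_vectors,
                           variance=0.1, uniform_density=0.05)

near_choice = int(np.argmax(near))
far_choice = int(np.argmax(far))
dont_know_index = len(concept_vectors)
```

The Uniform component only dominates when the predicted vector is far from every concept vector, which is how a triplet with no valid completion can be flagged.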
46 The test set contained positive triplets chosen at random, but such that there was a triplet per relation. [sent-101, score-0.655]
47 The validation set contained a group of positive and a group of negative triplets, chosen at random and such that each group had a triplet per relation. [sent-102, score-0.369]
48 After learning a distributed representation for the entities in the data by maximizing over the training set, we learned the parameters of the probabilistic model by maximizing over the validation set. [sent-104, score-0.329]
49 The resulting system was able to correctly complete all the possible triplets . [sent-105, score-0.473]
50 Figure 2 shows the distribution of the probabilities when completing one complete-able and one uncomplete-able triplet in the test set. [sent-106, score-0.321]
51 We have used it on a much bigger version of the Family Tree Problem, where the family tree is a branch of the real family tree of one of the authors containing people over generations. [sent-108, score-0.562]
52 Using the same set of relations used in the Family Tree Problem, there is a total of positive triplets. [sent-109, score-0.197]
53 After learning using a training set of positive triplets, and a validation set constituted by positive and negative triplets, the system is able to complete correctly almost all the possible triplets. [sent-110, score-0.173]
54 [Figure 2 panels: “Charlotte uncle”, “Emma aunt”] The complete-able triplet has two correct completions but neither of the triplets had been used for training. [sent-119, score-0.655]
56 [3 Using LRE to represent recursive data structures] In this section, we shall show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. [sent-130, score-0.468]
57 Here we discuss binary trees, but the same reasoning applies to trees of any valence. [sent-131, score-0.169]
58 Center: how LRE can be used to learn a representation for binary trees in a RAAM-like fashion. [sent-139, score-0.247]
59 Right: the binary tree structure of the sentences used in the experiment. [sent-140, score-0.369]
60 To encode a tree the network must learn as many auto-associations as the total number of non-terminal nodes in the tree. [sent-143, score-0.189]
61 The decoding procedure must decide whether a decoded vector represents a terminal node or an internal node which should be further decoded. [sent-145, score-0.172]
62 This is done by using binary codes for the terminal symbols, and then fixing a threshold which is used for checking for “binary-ness” during decoding. [sent-146, score-0.114]
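The "binary-ness" test used by this style of decoding can be illustrated as follows. This is a sketch only: the helper name `looks_binary` and the tolerance value are assumptions, and the exact thresholding rule used in RAAM is not given in this text.

```python
import numpy as np

def looks_binary(vec, tol):
    """True if every component lies within tol of 0 or of 1."""
    vec = np.asarray(vec, dtype=float)
    dist_to_bit = np.minimum(np.abs(vec), np.abs(vec - 1.0))
    return bool(np.all(dist_to_bit <= tol))

# A decoded vector close to a binary code is treated as a terminal symbol;
# anything else is treated as an internal node and decoded further.
terminal_like = looks_binary([0.02, 0.97, 1.0], tol=0.1)
internal_like = looks_binary([0.5, 1.0, 0.0], tol=0.1)
```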
63 The RAAM approach can be cast as an LRE problem, in which concepts are trees, sub-trees or leaves, or pairs of trees, sub-trees or leaves, and there exist relationships: one implementing the compressor, and two which jointly implement the reconstructor (see fig.). [sent-147, score-0.249]
64 We can then learn a representation for all the trees, and the matrices by maximizing in eq. [sent-149, score-0.18]
65 First, one does not need to supply codes for the leaves of the trees, since LRE will learn an appropriate distributed representation for them. [sent-152, score-0.317]
66 In fact, the problem of recognizing whether a node needs to be further decoded, is similar to the problem of recognizing that a certain triplet does not admit a completion, that we solved in the previous section. [sent-154, score-0.391]
67 The set of appropriate values of and for relations triplets where is not a leaf of the tree, will play the role of the set which appears in eq. [sent-158, score-0.588]
68 We have applied this method to the problem of encoding binary trees which correspond to sentences of words from a small vocabulary. [sent-160, score-0.341]
69 Sentences had a fixed structure: a noun phrase, constituted of an adjective and a noun, followed by a verb phrase, made of a verb and a noun (see fig. [sent-161, score-0.376]
70 Nouns were in {girl, woman, scientist} or in {dog, doctor, lawyer}; adjectives were in {pretty, young} or in {ugly, old}; verbs were in {help, love} or in {hurt, annoy}. [sent-165, score-0.489]
71 In this way, sentences of the kind “pretty girl annoy scientist” were not allowed in the training set, and there were possible sentences that satisfied the constraints which were implicit in the training set. [sent-167, score-0.646]
72 We used HLRE to learn a distributed representation for all the nodes in the trees, maximizing using the sentences in the training set. [sent-168, score-0.426]
73 In 7D, after having built the outlier model for the non-terminal symbols, given any root or internal node the system would reconstruct its children, and if they were non-terminal symbols would further decode each of them. [sent-169, score-0.121]
74 The decoding process would always halt providing the correct reconstruction for all the sentences in the training set. [sent-170, score-0.247]
75 4 shows the distributed representations found for each word in the vocabulary. [sent-172, score-0.218]
76 If the norm of the difference between the original and the reconstructed sub-trees was within a set tolerance, then the tree could be considered to be well formed. [sent-176, score-0.183]
77 The system shows impressive generalization performance: after training using the sentences, the four-word sentences it generates are all the well formed sentences, and only those. [sent-177, score-0.256]
78 It generates neither sentences which are grammatically wrong, like “dog old girl annoy”, nor sentences which violate semantic constraints, like “pretty girl annoy scientist”. [sent-178, score-0.854]
79 Top row: The distributed representation of the words in the sentences found after learning. [sent-183, score-0.212]
80 Center row: The different contributions given to the root of the tree by the word “girl” when placed in position , , and in the tree. [sent-184, score-0.358]
81 Bottom row: The contribution of each leaf to the reconstruction of , when adjectives, nouns, verbs and nouns are applied in positions , , and respectively. [sent-185, score-0.515]
82 Pollack [6], this was almost certainly due to the fact that for the RAAMs the representations for the leaves were too similar, a problem that the HLRE formulation solves, since it learns their distributed representations. [sent-186, score-0.287]
83 Once the system has learned an embedding, finding a distributed representation for a given tree amounts to multiplying the representation of its leaves by all the matrices found on all the paths from the leaves to the root, and adding them up. [sent-190, score-0.603]
84 Luckily matrix multiplication is non-commutative, and therefore every sequence of words on its leaves can generate a different representation at the root node. [sent-191, score-0.188]
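The two statements above (a root is the sum of leaf vectors multiplied by the matrices along their paths, and word order matters) can be checked numerically. This is a sketch under assumptions: the compressor is modelled as a single rectangular matrix applied to the concatenated children, all parameters are random rather than learned, and the four words are used only as arbitrary leaf labels.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3

# Hypothetical compressor: a rectangular (dim x 2*dim) matrix applied to [left; right].
compressor = rng.normal(size=(dim, 2 * dim))
C_left, C_right = compressor[:, :dim], compressor[:, dim:]

# Random stand-ins for learned leaf representations.
leaf = {word: rng.normal(size=dim) for word in ["pretty", "girl", "help", "dog"]}

def encode(tree):
    """Recursively compress a binary tree given as nested 2-tuples of leaf labels."""
    if isinstance(tree, str):
        return leaf[tree]
    left, right = tree
    return compressor @ np.concatenate([encode(left), encode(right)])

# (noun phrase, verb phrase) structure: ((adjective, noun), (verb, noun)).
root = encode((("pretty", "girl"), ("help", "dog")))

# Same root written as a sum over leaves of the matrix products along each path.
path_sum = (C_left @ C_left @ leaf["pretty"]
            + C_left @ C_right @ leaf["girl"]
            + C_right @ C_left @ leaf["help"]
            + C_right @ C_right @ leaf["dog"])

# Swapping two leaves changes the root, because the path matrices differ.
swapped_root = encode((("girl", "pretty"), ("help", "dog")))
```

The recursive encoding and the explicit path sum agree, while reordering the leaves yields a different root vector.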
85 4 makes this point clear showing the different contributions given to the root of the tree by the word “girl” , depending on its position in the sentence. [sent-193, score-0.325]
86 A tree can be “unrolled” from the root to its leaves by multiplying its distributed representation using the matrices. [sent-194, score-0.45]
87 4 shows the contribution of each leaf to the reconstruction of , when adjectives, nouns, verbs and nouns are placed on leaves , , and respectively. [sent-200, score-0.642]
88 We can see that the contributions from the adjectives match very closely their actual distributed representations, while the contributions from the nouns in position are negligible. [sent-201, score-0.54]
89 This means that any adjective placed on will tend to be reconstructed correctly, and that its reconstruction is independent of the noun we have in position . [sent-202, score-0.284]
90 On the other hand, the contributions from nouns and verbs in positions and are non-negligible, and notice how those given by words belonging to the subsets are almost symmetric to those given by words in the subsets. [sent-203, score-0.502]
91 Finally, the reconstruction of , when adjectives, nouns, verbs and nouns are not placed on leaves , , and respectively, assigns a very low probability to any word, and thus the system does not generate sentences which are not well formed. [sent-205, score-0.793]
92 [4 Conclusions] Linear Relational Embedding is a new method for learning distributed representations of concepts and relations from data consisting of instances of relations between given concepts. [sent-206, score-0.807]
93 It finds a mapping from the concepts into a feature-space by imposing the constraint that relations in this feature-space are modeled by linear operations. [sent-207, score-0.389]
94 We began by introducing LRE for binary relations, and then we saw how these ideas can be easily extended to higher arity relations by simply concatenating concept vectors and using rectangular matrices for the relations. [sent-213, score-0.317]
95 The compressor relation for binary trees is a ternary relation; for trees of higher valence the compressor relation will have higher arity. [sent-214, score-0.614]
96 We have seen how HLRE can be used to find distributed representations for hierarchical structures, and its generalization performance is much better than the one obtained using RAAMs on similar problems. [sent-215, score-0.227]
97 It is easy to prove that, when all the relations are binary, given a sufficient number of dimensions, there always exists an LRE-type of solution that satisfies any set of triplets [4]. [sent-216, score-0.531]
98 However, due to its linearity, LRE cannot represent some relations of arity greater than . [sent-217, score-0.235]
99 This new method, called Non-Linear Relational Embedding (NLRE) [4], can represent any relation and has given good generalization results. [sent-219, score-0.115]
100 Learning distributed representations by mapping concepts and relations into a linear space. [sent-243, score-0.568]
wordName wordTfidf (topN-words)
[('triplets', 0.334), ('lre', 0.328), ('triplet', 0.321), ('nouns', 0.302), ('girl', 0.208), ('relations', 0.197), ('concepts', 0.192), ('sentences', 0.172), ('adjectives', 0.17), ('tree', 0.151), ('trees', 0.123), ('embedding', 0.113), ('alberto', 0.113), ('verbs', 0.112), ('distributed', 0.111), ('family', 0.1), ('leaves', 0.094), ('annoy', 0.094), ('compressor', 0.094), ('hlre', 0.094), ('raam', 0.094), ('concept', 0.092), ('completion', 0.088), ('relational', 0.088), ('noun', 0.083), ('scientist', 0.075), ('representations', 0.068), ('relation', 0.067), ('maximizing', 0.065), ('hinton', 0.061), ('people', 0.06), ('pretty', 0.06), ('shall', 0.058), ('leaf', 0.057), ('adjective', 0.057), ('foil', 0.057), ('pollack', 0.057), ('raams', 0.057), ('reconstructor', 0.057), ('trqp', 0.057), ('geoffrey', 0.056), ('root', 0.054), ('verb', 0.052), ('dog', 0.049), ('constituted', 0.049), ('hy', 0.049), ('generalization', 0.048), ('validation', 0.048), ('phrase', 0.048), ('binary', 0.046), ('contributions', 0.046), ('goodness', 0.045), ('pietro', 0.045), ('decoded', 0.045), ('reconstruction', 0.044), ('row', 0.042), ('instances', 0.042), ('almost', 0.042), ('representation', 0.04), ('admit', 0.039), ('james', 0.039), ('infer', 0.039), ('word', 0.039), ('complete', 0.039), ('learn', 0.038), ('arity', 0.038), ('aunt', 0.038), ('charlotte', 0.038), ('doctor', 0.038), ('emma', 0.038), ('father', 0.038), ('giovanni', 0.038), ('italian', 0.038), ('italians', 0.038), ('lawyer', 0.038), ('paccanaro', 0.038), ('shqci', 0.038), ('ugly', 0.038), ('uncle', 0.038), ('woman', 0.038), ('facts', 0.037), ('matrices', 0.037), ('vectors', 0.037), ('recursive', 0.037), ('system', 0.036), ('hg', 0.036), ('position', 0.035), ('terminal', 0.034), ('english', 0.034), ('structures', 0.034), ('codes', 0.034), ('gaussians', 0.034), ('placed', 0.033), ('numbering', 0.033), ('able', 0.032), ('reconstructed', 0.032), ('correctly', 0.032), ('generalize', 0.032), ('node', 0.031), ('decoding', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
2 0.10178249 56 nips-2001-Convolution Kernels for Natural Language
Author: Michael Collins, Nigel Duffy
Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
3 0.088550426 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
Author: Randall C. O'Reilly, R. S. Busby
Abstract: We present a model of binding of relationship information in a spatial domain (e.g., square above triangle) that uses low-order coarse-coded conjunctive representations instead of more popular temporal synchrony mechanisms. Supporters of temporal synchrony argue that conjunctive representations lack both efficiency (i.e., combinatorial numbers of units are required) and systematicity (i.e., the resulting representations are overly specific and thus do not support generalization to novel exemplars). To counter these claims, we show that our model: a) uses far fewer hidden units than the number of conjunctions represented, by using coarse-coded, distributed representations where each unit has a broad tuning curve through high-dimensional conjunction space, and b) is capable of considerable generalization to novel inputs.
4 0.085807301 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
5 0.081339717 190 nips-2001-Thin Junction Trees
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present an algorithm that induces a class of models with thin junction trees—models that are characterized by an upper bound on the size of the maximal cliques of their triangulated graph. By ensuring that the junction tree is thin, inference in our models remains tractable throughout the learning process. This allows both an efficient implementation of an iterative scaling parameter estimation algorithm and also ensures that inference can be performed efficiently with the final model. We illustrate the approach with applications in handwritten digit recognition and DNA splice site detection.
6 0.080620348 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
7 0.074885219 105 nips-2001-Kernel Machines and Boolean Functions
8 0.070423849 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
9 0.066730276 182 nips-2001-The Fidelity of Local Ordinal Encoding
10 0.06560979 118 nips-2001-Matching Free Trees with Replicator Equations
11 0.060860604 86 nips-2001-Grammatical Bigrams
12 0.057956651 21 nips-2001-A Variational Approach to Learning Curves
13 0.055533752 78 nips-2001-Fragment Completion in Humans and Machines
14 0.055306915 9 nips-2001-A Generalization of Principal Components Analysis to the Exponential Family
15 0.052871976 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
16 0.051418841 169 nips-2001-Small-World Phenomena and the Dynamics of Information
17 0.050130509 193 nips-2001-Unsupervised Learning of Human Motion Models
18 0.049408007 115 nips-2001-Linear-time inference in Hierarchical HMMs
19 0.049403992 84 nips-2001-Global Coordination of Local Linear Models
20 0.048318218 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
topicId topicWeight
[(0, -0.15), (1, -0.003), (2, -0.006), (3, -0.017), (4, -0.056), (5, -0.112), (6, -0.105), (7, -0.039), (8, -0.128), (9, -0.002), (10, -0.166), (11, -0.065), (12, 0.024), (13, -0.03), (14, 0.062), (15, -0.038), (16, -0.055), (17, -0.027), (18, -0.029), (19, 0.061), (20, 0.016), (21, -0.041), (22, 0.063), (23, 0.09), (24, -0.02), (25, 0.05), (26, 0.034), (27, 0.016), (28, -0.01), (29, 0.022), (30, 0.011), (31, -0.034), (32, -0.001), (33, 0.019), (34, 0.006), (35, -0.095), (36, 0.064), (37, 0.051), (38, 0.016), (39, 0.11), (40, -0.082), (41, 0.134), (42, -0.028), (43, -0.097), (44, -0.119), (45, -0.006), (46, -0.073), (47, 0.124), (48, 0.131), (49, 0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.93855047 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
Author: Alberto Paccanaro, Geoffrey E. Hinton
Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists. 1 Linear Relational Embedding Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other “common-sense” facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts ”Alberto has-father Pietro” and ”Pietro has-brother Giovanni”, we would like to be able to infer ”Alberto has-uncle Giovanni”. Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. 
In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is, to come up with one of its elements given the other two. Here we shall always try to complete the third element of the triplets. LRE will then represent each concept in the data as a learned vector in a …
2 0.61370581 149 nips-2001-Probabilistic Abstraction Hierarchies
Author: Eran Segal, Daphne Koller, Dirk Ormoneit
Abstract: Many domains are naturally organized in an abstraction hierarchy or taxonomy, where the instances in “nearby” classes in the taxonomy are similar. In this paper, we provide a general probabilistic framework for clustering data into a set of classes organized as a taxonomy, where each class is associated with a probabilistic model from which the data was generated. The clustering algorithm simultaneously optimizes three things: the assignment of data instances to clusters, the models associated with the clusters, and the structure of the abstraction hierarchy. A unique feature of our approach is that it utilizes global optimization algorithms for both of the last two steps, reducing the sensitivity to noise and the propensity to local maxima that are characteristic of algorithms such as hierarchical agglomerative clustering that only take local steps. We provide a theoretical analysis for our algorithm, showing that it converges to a local maximum of the joint likelihood of model and data. We present experimental results on synthetic data, and on real data in the domains of gene expression and text.
3 0.60039568 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
Author: S. Narayanan, Daniel Jurafsky
Abstract: Narayanan and Jurafsky (1998) proposed that human language comprehension can be modeled by treating human comprehenders as Bayesian reasoners, and modeling the comprehension process with Bayesian decision trees. In this paper we extend the Narayanan and Jurafsky model to make further predictions about reading time given the probability of different parses or interpretations, and test the model against reading time data from a psycholinguistic experiment.
4 0.52827525 56 nips-2001-Convolution Kernels for Natural Language
Author: Michael Collins, Nigel Duffy
Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
5 0.50160432 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
Author: Dan Klein, Christopher D. Manning
Abstract: This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset. 1 Overview To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T , but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts. It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. 
Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser’s supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4].
[Figure 1: Example parse tree with the constituents and contexts for each tree node. Panel (a): the tree for "Factory payrolls fell in September" (NN NNS VBD IN NN). Panel (b): the constituent and context of each node, plus the contexts of the empty spans 0–5.]
2 Previous Approaches One aspect of grammar induction where there has already been substantial success is the induction of parts-of-speech. Several different distributional clustering approaches have resulted in relatively high-quality clusterings, though the clusters’ resemblance to classical parts-of-speech varies substantially [9, 15]. For the present work, we take the part-of-speech induction problem as solved and work with sequences of parts-of-speech rather than words. In some ways this makes the problem easier, such as by reducing sparsity, but in other ways it complicates the task (even supervised parsers perform relatively poorly with the actual words replaced by parts-of-speech). Work attempting to induce tree structures has met with much less success.
Most grammar induction work assumes that trees are generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil down to one of two types. Some fix the structure of the grammar in advance [12], often with an aim to incorporate linguistic constraints [2] or prior knowledge [13]. These systems typically then attempt to find the grammar production parameters Θ which maximize the likelihood P(S|Θ) using the inside-outside algorithm [1], which is an efficient (dynamic programming) instance of the EM algorithm [8] for PCFGs. Other systems (which have generally been more successful) incorporate a structural search as well, typically using a heuristic to propose candidate grammar modifications which minimize the joint encoding of data and grammar using an MDL criterion, which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact [6, 16, 18, 17]. These approaches can also be seen as likelihood maximization where the objective function is the a posteriori likelihood of the grammar given the data, and the description length provides a structural prior. The “compact grammar” aspect of MDL is close to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical [10] or cognitive [5] economy. However, the primary weakness of MDL-based systems does not have to do with the objective function, but the search procedures they employ. Such systems end up growing structures greedily, in a bottom-up fashion. Therefore, their induction quality is determined by how well they are able to heuristically predict what local intermediate structures will fit into good final global solutions. A potential advantage of systems which fix the grammar and only perform parameter search is that they do compare complete grammars against each other, and are therefore able to detect which give rise to systematically compatible parses.
However, although early work showed that small, artificial CFGs could be induced with the EM algorithm [12], studies with large natural language grammars have generally suggested that completely unsupervised EM over PCFGs is ineffective for grammar acquisition. For instance, Carroll and Charniak [2] describe experiments running the EM algorithm from random starting points, which produced widely varying learned grammars, almost all of extremely poor quality (footnote 1). It is well-known that EM is only locally optimal, and one might think that the locality of the search procedure, not the objective function, is to blame. The truth is somewhere in between. There are linguistic reasons to distrust an ML objective function. It encourages the symbols and rules to align in ways which maximize the truth of the conditional independence assumptions embodied by the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made [14]. None of these have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical regularities and possibilities for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs and linguistic criteria align, but in practice they do not always seem to.
Footnote 1: We duplicated one of their experiments, which used grammars restricted to rules of the form x → x y | y x, where there is one category x for each part-of-speech (such a restricted CFG is isomorphic to a dependency grammar). We began reestimation from a grammar with uniform rewrite probabilities. Figure 4 shows that the resulting grammar (DEP-PCFG) is not as bad as conventional wisdom suggests. Carroll and Charniak are right to observe that the search space is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The need for random seeding in using EM over PCFGs is two-fold. For some grammars, such as one over a set X of non-terminals in which any x1 → x2 x3, xi ∈ X is possible, it is needed to break symmetry. This is not the case for dependency grammars, where symmetry is broken by the yields (e.g., a sentence noun verb can only be covered by a noun or verb projection). The second reason is to start the search from a random region of the space. But unless one does many random restarts, the uniform starting condition is better than most extreme points in the space, and produces superior results.
Experiments with both artificial [12] and real [13] data have shown that starting from fixed, correct (or at least linguistically reasonable) structure, EM produces a grammar which has higher log-likelihood than the linguistically determined grammar, but lower parsing accuracy. However, we additionally conjecture that EM over PCFGs fails to propagate contextual cues efficiently. The reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the product of inside and outside probabilities α_j(p, q) β_j(p, q) is the probability of generating the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. If we had a good idea of what VPs and NPs looked like, then if a novel NP appeared in an NP context, the outside probabilities should pressure the sequence to be parsed as an NP. However, what happens early in the EM procedure, when we have no real idea about the grammar parameters? With randomly-weighted, complete grammars over a symbol set X, we have observed that a frequent, short, noun phrase sequence often does get assigned to some category x early on. However, since there is not a clear overall structure learned, there is only very weak pressure for other NPs, even if they occur in the same positions, to also be assigned to x, and the reestimation process goes astray. To enable this kind of constituent-context pressure to be effective, we propose the model in the following section. 3 The Constituent-Context Model We propose an alternate parametric family of models over trees which is better suited for grammar induction. Broadly speaking, inducing trees like the one shown in figure 1(a) can be broken into two tasks.
One is deciding constituent identity: where the brackets should be placed. The second is deciding what to label the constituents. These tasks are certainly correlated and are usually solved jointly. However, the task of labeling chosen brackets is essentially the same as the part-of-speech induction problem, and the solutions cited above can be adapted to cluster constituents [6]. The task of deciding brackets is the harder task. For example, the sequence DT NN IN DT NN ([the man in the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([we [sent a man] [to the moon]]).
[Figure 2: The most frequent examples of (a) different constituent labels (NP, VP, PP) and (b) constituents and non-constituents ("Usually a Constituent" vs. "Rarely a Constituent"), in the vector space of linear contexts, projected onto the first two principal components. Clustering is effective for labeling, but not detecting constituents.]
Figure 2(a) shows the 50 most frequent constituent sequences of three types, represented as points in the vector space of their contexts (see below), projected onto their first two principal components. The three clusters are relatively coherent, and it is not difficult to believe that a clustering algorithm could detect them in the unprojected space. Figure 2(b), however, shows 150 sequences which are parsed as constituents at least 50% of the time along with 150 which are not, again projected onto the first two components. This plot at least suggests that the constituent/non-constituent classification is less amenable to direct clustering. Thus, it is important that an induction system be able to detect constituents, either implicitly or explicitly. A variety of methods of constituent detection have been proposed [11, 6], usually based on information-theoretic properties of a sequence’s distributional context. However, here we rely entirely on the following two simple assumptions: (i) constituents of a parse do not cross each other, and (ii) constituents occur in constituent contexts. The first property is self-evident from the nature of the parse trees. The second is an extremely weakened version of classic linguistic constituency tests [14]. Let σ be a terminal sequence. Every occurrence of σ will be in some linear context c(σ) = x σ y, where x and y are the adjacent terminals or sentence boundaries. Then we can view any tree t over a sentence s as a collection of sequences and contexts, one of each for every node in the tree, plus one for each inter-terminal empty span, as in figure 1(b). Good trees will include nodes whose yields frequently occur as constituents and whose contexts frequently surround constituents.
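The view of a tree as a collection of (sequence, context) pairs can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' code: the function names, the representation of a tree as a list of half-open `(start, end)` spans, and the `"#"` boundary symbol are all hypothetical choices made here.

```python
# Illustrative sketch: enumerate the (yield, linear context) pairs of a tree.
# Assumptions (not from the paper): a tree is a list of half-open (start, end)
# spans over the tag sequence, and "#" stands for a sentence boundary.

BOUNDARY = "#"  # hypothetical sentence-boundary symbol


def linear_context(tags, i, j):
    """Return the (left, right) neighbours of the span tags[i:j]."""
    left = tags[i - 1] if i > 0 else BOUNDARY
    right = tags[j] if j < len(tags) else BOUNDARY
    return (left, right)


def tree_to_pairs(tags, brackets):
    """One (yield, context) pair per tree node, plus one per empty span."""
    pairs = []
    for i, j in brackets:
        pairs.append((tuple(tags[i:j]), linear_context(tags, i, j)))
    # Inter-terminal empty spans: positions 0..n, each with an empty yield.
    for k in range(len(tags) + 1):
        pairs.append(((), linear_context(tags, k, k)))
    return pairs


# The tagged sentence "Factory payrolls fell in September".
tags = ["NN", "NNS", "VBD", "IN", "NN"]
# Spans for S, NP, VP, PP and the five terminals.
brackets = [(0, 5), (0, 2), (2, 5), (3, 5),
            (0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
pairs = tree_to_pairs(tags, brackets)
```

For a sentence of five tags this yields nine node pairs and six empty-span pairs; for instance, the NP span `NN NNS` receives the context `("#", "VBD")`.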
Formally, we use a conditional exponential model of the form:
\[ P(t \mid s, \Theta) = \frac{\exp\big(\sum_{(\sigma,c) \in t} (\lambda_\sigma f_\sigma + \lambda_c f_c)\big)}{\sum_{t' : \mathrm{yield}(t') = s} \exp\big(\sum_{(\sigma,c) \in t'} (\lambda_\sigma f_\sigma + \lambda_c f_c)\big)} \]
We have one feature fσ(t) for each sequence σ whose value on a tree t is the number of nodes in t with yield σ, and one feature fc(t) for each context c representing the number of times c is the context of the yield of some node in the tree (footnote 2). No joint features over c and σ are used, and, unlike many other systems, there is no distinction between constituent types. We model only the conditional likelihood of the trees, P(T|S, Θ), where Θ = {λσ, λc}. We then use an iterative EM-style procedure to find a local maximum P(T|S, Θ) of the completed data (trees) T, where P(T|S, Θ) = ∏_{t∈T, s=yield(t)} P(t|s, Θ). We initialize Θ such that each λ is zero and initialize T to any arbitrary set of trees. In alternating steps, we first fix the parameters Θ and find the most probable single tree structure t* for each sentence s according to P(t|s, Θ), using a simple dynamic program. For any Θ this produces the set of parses T* which maximizes P(T|S, Θ). Since T* maximizes this quantity, if T is the former set of trees, P(T*|S, Θ) ≥ P(T|S, Θ). Second, we fix the trees and estimate new parameters Θ. The task of finding the parameters Θ* which maximize P(T|S, Θ) is simply the well-studied task of fitting our exponential model to maximize the conditional likelihood of the fixed parses. Running, for example, a conjugate gradient (CG) ascent on Θ will produce the desired Θ*. If Θ is the former parameters, then we will have P(T|S, Θ*) ≥ P(T|S, Θ). Therefore, each iteration will increase P(T|S, Θ) until convergence (footnote 3). Note that our parsing model is not a generative model, and this procedure, though clearly related, is not exactly an instance of the EM algorithm.
Footnote 2: So, for the tree in figure 1(a), P(t|s) ∝ exp(λ_{NN NNS} + λ_{VBD IN NN} + λ_{IN NN} + λ_{⋄−VBD} + λ_{NNS−⋄} + λ_{VBD−⋄} + λ_{⋄−NNS} + λ_{NN−VBD} + λ_{NNS−IN} + λ_{VBD−NN} + λ_{IN−⋄}), where ⋄ marks a sentence boundary.
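The "simple dynamic program" for finding the most probable tree is not spelled out in the text. One plausible reading, sketched below, is a CKY-style search over binary bracketings that maximizes the summed weights λσ + λc of the chosen spans. Everything named here (`best_tree`, the dictionary representation of the weights, the `"#"` boundary symbol) is an assumption for illustration. Two simplifications are also made: empty spans are ignored because every tree over a sentence contains the same ones, and ties are broken deterministically by taking the first best split, whereas the text describes random tie-breaking.

```python
# Illustrative sketch: most probable binary bracketing under additive
# span weights. Names and data layout are hypothetical, not the paper's.
from functools import lru_cache


def best_tree(tags, lam_seq, lam_ctx, boundary="#"):
    """Return (score, spans) of the highest-scoring binary tree over tags.

    lam_seq maps a tuple of tags (a yield) to its weight; lam_ctx maps a
    (left, right) context pair to its weight. Missing entries count as 0.
    """
    n = len(tags)

    def span_score(i, j):
        left = tags[i - 1] if i > 0 else boundary
        right = tags[j] if j < n else boundary
        return (lam_seq.get(tuple(tags[i:j]), 0.0)
                + lam_ctx.get((left, right), 0.0))

    @lru_cache(maxsize=None)
    def best(i, j):
        own = span_score(i, j)
        if j - i == 1:  # a single terminal has no internal structure
            return own, ((i, j),)
        candidates = []
        for k in range(i + 1, j):  # try every split point
            left_score, left_spans = best(i, k)
            right_score, right_spans = best(k, j)
            candidates.append((left_score + right_score,
                               left_spans + right_spans))
        # max() returns the first maximal candidate: deterministic ties.
        sub_score, sub_spans = max(candidates, key=lambda c: c[0])
        return own + sub_score, ((i, j),) + sub_spans

    return best(0, n)


tags = ["NN", "NNS", "VBD", "IN", "NN"]
lam_seq = {("NN", "NNS"): 1.0, ("IN", "NN"): 1.0, ("VBD", "IN", "NN"): 0.5}
lam_ctx = {}  # no context weights in this toy example
score, spans = best_tree(tags, lam_seq, lam_ctx)
```

With these toy weights the search recovers the bracketing [[NN NNS] [VBD [IN NN]]]. Because the normalizer in the model is shared by all trees over the same sentence, maximizing the summed weights is equivalent to maximizing P(t|s, Θ).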
We merely guarantee that the conditional likelihood of the data completions is increasing. Furthermore, unlike in EM where each iteration increases the marginal likelihood of the fixed observed data, our procedure increases the conditional likelihood of a changing complete data set, with the completions changing at every iteration as we reparse. Several implementation details were important in making the system work well. First, tiebreaking was needed, most of all for the first round. Initially, the parameters are zero, and all parses are therefore equally likely. To prevent bias, all ties were broken randomly. Second, like so many statistical NLP tasks, smoothing was vital. There are features in our model for arbitrarily long yields and most yield types occurred only a few times. The most severe consequence of this sparsity was that initial parsing choices could easily become frozen. If a λσ for some yield σ was either ≫ 0 or ≪ 0, which was usually the case for rare yields, σ would either be locked into always occurring or never occurring, respectively. Not only did we want to push the λσ values close to zero, we also wanted to account for the fact that most spans are not constituents (footnote 4). Therefore, we expect the distribution of the λσ to be skewed towards low values (footnote 5). A greater amount of smoothing was needed for the first few iterations, while much less was required in later iterations. Finally, parameter estimation using a CG method was slow and difficult to smooth in the desired manner, and so we used the smoothed relative frequency estimates λσ = count(fσ)/(count(σ) + M) and λc = count(fc)/(count(c) + N). These estimates ensured that the λ values were between 0 and 1, and gave the desired bias towards non-constituency. These estimates were fast and surprisingly effective, but do not guarantee non-decreasing conditional likelihood (though the conditional likelihood was increasing in practice) (footnote 6).
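The smoothed relative-frequency estimate λσ = count(fσ)/(count(σ) + M) can be sketched as follows. The reading taken here is an assumption: count(fσ) is treated as the number of times σ is the yield of a node in the current parses, and count(σ) as the number of times σ occurs as any contiguous span in the corpus. The function names and the value of `M` are hypothetical; the text does not report the smoothing constants used.

```python
# Illustrative sketch of the smoothed relative-frequency estimate for
# sequence weights. Interpretation of the counts is an assumption.
from collections import Counter


def all_spans(n):
    """All (n + 1)(n + 2) / 2 half-open spans of a length-n sentence,
    including the n + 1 empty ones."""
    return [(i, j) for i in range(n + 1) for j in range(i, n + 1)]


def estimate_lambda_seq(sentences, parses, M):
    """lambda_sigma = (#times sigma is a constituent) / (#occurrences + M)."""
    occurrences = Counter()   # count(sigma): sigma appears as any span
    constituents = Counter()  # count(f_sigma): sigma is a node's yield
    for tags, brackets in zip(sentences, parses):
        for i, j in all_spans(len(tags)):
            occurrences[tuple(tags[i:j])] += 1
        for i, j in brackets:
            constituents[tuple(tags[i:j])] += 1
    return {seq: constituents[seq] / (occurrences[seq] + M)
            for seq in occurrences}


sentences = [["NN", "NNS", "VBD"]]
parses = [[(0, 3), (0, 2)]]  # spans currently treated as constituents
lam = estimate_lambda_seq(sentences, parses, M=1.0)  # M=1.0 is arbitrary
```

Since a sequence cannot be a constituent more often than it occurs, any M > 0 keeps every estimate strictly below 1, and sequences that are rarely bracketed stay near 0. The context weights λc would follow the same pattern with contexts in place of yields and N in place of M.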
4 Results In all experiments, we used hand-parsed sentences from the Penn Treebank. For training, we took the approximately 7500 sentences in the Wall Street Journal (WSJ) section which contained 10 words or fewer after the removal of punctuation. For testing, we evaluated the system by comparing the system’s parses for those same sentences against the supervised parses in the treebank. We consider each parse as a set of constituent brackets, discarding all trivial brackets (footnote 7). We calculated the precision and recall of these brackets against the treebank parses in the obvious way.
Footnote 3: In practice, we stopped the system after 10 iterations, but final behavior was apparent after 4–8.
Footnote 4: In a sentence of length n, there are (n + 1)(n + 2)/2 total (possibly size zero) spans, but only 3n constituent spans: n − 1 of size ≥ 2, n of size 1, and n + 1 empty spans.
Footnote 5: Gaussian priors for the exponential model accomplish the former goal, but not the latter.
Footnote 6: The relative frequency estimators had a somewhat subtle positive effect. Empty spans have no effect on the model when using CG fitting, as all trees include the same empty spans. However, including their counts improved performance substantially when using relative frequency estimators. This is perhaps an indication that a generative version of this model would be advantageous.
Footnote 7: We discarded both brackets of length one and brackets spanning the entire sentence, since all of these are impossible to get incorrect, and hence ignored sentences of length ≤ 2 during testing.
[Figure 3: Alternate parse trees for a sentence ("The screen was a sea of red"): (a) the Penn Treebank tree (deemed correct), (b) the one found by our system CCM, and (c) the one found by DEP-PCFG.]
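The bracket precision and recall described above can be sketched for a single sentence as follows. This is a minimal illustration, assuming half-open `(start, end)` spans and hypothetical function names; the text does not say how scores are aggregated over the corpus, so the per-sentence computation here is only one possibility. The right-branching chain is used purely as a toy prediction.

```python
# Illustrative sketch: unlabeled bracket precision, recall and F1 for one
# sentence, discarding trivial brackets (length one or the whole sentence).


def nontrivial(brackets, n):
    """Keep spans longer than one word that do not cover the sentence."""
    return {(i, j) for i, j in brackets if j - i > 1 and (i, j) != (0, n)}


def bracket_scores(predicted, gold, n):
    """Return (precision, recall, F1) over non-trivial brackets."""
    pred = nontrivial(predicted, n)
    ref = nontrivial(gold, n)
    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def right_branching(n):
    """Spans of the right-branching chain over n words (toy prediction)."""
    return [(i, n) for i in range(n - 1)]


# Gold spans for S, NP, VP, PP of "Factory payrolls fell in September".
gold = [(0, 5), (0, 2), (2, 5), (3, 5)]
p, r, f = bracket_scores(right_branching(5), gold, 5)
```

Here the right-branching chain matches the VP and PP brackets but misses the NP, so two of its three non-trivial brackets are correct and two of the three gold brackets are found.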
(a)
Method      UP     UR     F1     NP UR   PP UR   VP UR
LBRANCH     20.5   24.2   22.2   28.9    6.3     0.6
RANDOM      29.0   31.0   30.0   42.8    23.6    26.3
DEP-PCFG    39.5   42.3   40.9   69.7    44.1    22.8
RBRANCH     54.1   67.5   60.0   38.3    44.5    85.8
CCM         60.1   75.4   66.9   83.8    71.6    66.3
UBOUND      78.2   100.0  87.8   100.0   100.0   100.0

(b)
System      UP     UR     F1     CB
EMILE       51.6   16.8   25.4   0.84
ABL         43.6   35.6   39.2   2.12
CDC-40      53.4   34.6   42.0   1.46
RBRANCH     39.9   46.4   42.9   2.18
CCM         54.4   46.8   50.3   1.61

Figure 4: Comparative accuracy on WSJ sentences (a) and on the ATIS corpus (b). UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP; CB = crossing brackets. Separate recall values are shown for three major categories.
To situate the results of our system, figure 4(a) gives the values of several parsing strategies. CCM is our constituent-context model. DEP-PCFG is a dependency PCFG model [2] trained using the inside-outside algorithm. Figure 3 shows sample parses to give a feel for the parses the systems produce. We also tested several baselines. RANDOM parses randomly. This is an appropriate baseline for an unsupervised system. RBRANCH always chooses the right-branching chain, while LBRANCH always chooses the left-branching chain. RBRANCH is often used as a baseline for supervised systems, but exploits a systematic right-branching tendency of English. An unsupervised system has no a priori reason to prefer right chains to left chains, and LBRANCH is well worse than RANDOM. A system need not beat RBRANCH to claim partial success at grammar induction. Finally, we include an upper bound. All of the parsing strategies and systems mentioned here give fully binary-branching structures. Treebank trees, however, need not be fully binary-branching, and generally are not. As a result, there is an upper bound UBOUND on the precision and F1 scores achievable when structurally confined to binary trees. Clearly, CCM is parsing much better than the RANDOM baseline and the DEP-PCFG induced grammar.
Significantly, it also out-performs RBRANCH in both precision and recall, and, to our knowledge, it is the first unsupervised system to do so. To facilitate comparison with other recent systems, figure 4(b) gives results where we trained as before but used (all) the sentences from the distributionally different ATIS section of the treebank as a test set. For this experiment, precision and recall were calculated using the EVALB system of measuring precision and recall (as in [6, 17]) – EVALB is a standard for parser evaluation, but complex, and unsuited to evaluating unlabeled constituency. EMILE and ABL are lexical systems described in [17]. The results for CDC-40, from [6], reflect training on much more data (12M words). Our system is superior in terms of both precision and recall (and so F1). These figures are certainly not all that there is to say about an induced grammar; there are a number of issues in how to interpret the results of an unsupervised system when comparing with treebank parses. Errors come in several kinds. First are innocent sins of commission. Treebank trees are very flat; for example, there is no analysis of the inside of many short noun phrases ([two hard drives] rather than [two [hard drives]]). Our system gives a (usually correct) analysis of the insides of such NPs, for which it is penalized on precision (though not recall or crossing brackets). Second are systematic alternate analyses. Our system tends to form modal verb groups and often attaches verbs first to pronoun subjects rather than to objects. As a result, many VPs are systematically incorrect, boosting crossing bracket scores and impacting VP recall. Finally, the treebank’s grammar is sometimes an arbitrary, and even inconsistent standard for an unsupervised learner: alternate analyses may be just as good (footnote 8). Notwithstanding this, we believe that the treebank parses have enough truth in them that parsing scores are a useful component of evaluation. Ideally, we would like to inspect the quality of the grammar directly. Unfortunately, the grammar acquired by our system is implicit in the learned feature weights. These are not by themselves particularly interpretable, and not directly comparable to the grammars produced by other systems, except through their functional behavior.

Sequence    Example            CORRECT  FREQUENCY  ENTROPY  DEP-PCFG  CCM
DT NN       the man            1        2          2        1         1
NNP NNP     United States      2        1          –        2         2
CD CD       4 1/2              3        9          –        5         5
JJ NNS      daily yields       4        7          3        4         4
DT JJ NN    the top rank       5        –          –        7         6
DT NNS      the people         6        –          –        –         10
JJ NN       plastic furniture  7        3          7        3         3
CD NN       12 percent         8        –          –        –         9
IN NN       on Monday          9        –          9        –         –
IN DT NN    for the moment     10       –          –        –         –
NN NNS      fire trucks        11       –          6        –         8
NN NN       fire truck         22       8          10       –         7
TO VB       to go              26       –          1        6         –
DT JJ       ?the big           78       6          –        –         –
IN DT       *of the            90       4          –        10        –
PRP VBZ     ?he says           95       –          –        8         –
PRP VBP     ?they say          180      –          –        9         –
NNS VBP     ?people are        =350     –          4        –         –
NN VBZ      ?value is          =532     10         5        –         –
NN IN       *man from          =648     5          –        –         –
NNS VBD     ?people were       =648     –          8        –         –

Figure 5: Top non-trivial sequences by actual treebank constituent counts, linear frequency, scaled context entropy, and in DEP-PCFG and CCM learned models’ parses.
Any grammar which parses a corpus will have a distribution over which sequences tend to be analyzed as constituents. These distributions can give a good sense of what structures are and are not being learned. Therefore, to supplement the parsing scores above, we examine these distributions. Figure 5 shows the top scoring constituents by several orderings. These lists do not say very much about how long, complex, recursive constructions are being analyzed by a given system, but grammar induction systems are still at the level where major mistakes manifest themselves in short, frequent sequences. CORRECT ranks sequences by how often they occur as constituents in the treebank parses. DEP-PCFG and CCM are the same, but use counts from the DEP-PCFG and CCM parses. As a baseline, FREQUENCY lists sequences by how often they occur anywhere in the sentence yields. Note that the sequence IN DT (e.g., “of the”) is high on this list, and is a typical error of many early systems. Finally, ENTROPY is the heuristic proposed in [11] which ranks by context entropy. It is better in practice than FREQUENCY, but that isn’t self-evident from this list. Clearly, the lists produced by the CCM system are closer to correct than the others. They look much like a censored version of the FREQUENCY list, where sequences which do not co-exist with higher-ranked ones have been removed (e.g., IN DT often crosses DT NN). This observation may explain a good part of the success of this method. Another explanation for the surprising success of the system is that it exploits a deep fact about language. Most long constituents have some short, frequent equivalent, or proform, which occurs in similar contexts [14].
In the very common case where the proform is a single word, it is guaranteed constituency, which will be transmitted to longer sequences via shared contexts (categories like PP which have infrequent proforms are not learned well unless the empty sequence is in the model – interestingly, the empty sequence appears to act as the proform for PPs, possibly due to the highly optional nature of many PPs).
Footnote 8: For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are bracketed [[possessive noun] complement] ([The president’s execution] of the law), an arbitrary inconsistency which is unlikely to be learned automatically.
5 Conclusions We have presented an alternate probability model over trees which is based on simple assumptions about the nature of natural language structure. It is driven by the explicit transfer between sequences and their contexts, and exploits both the proform phenomenon and the fact that good constituents must tile in ways that systematically cover the corpus sentences without crossing. The model clearly has limits. Lacking recursive features, it essentially must analyze long, rare constructions using only contexts. However, despite, or perhaps due to its simplicity, our model predicts bracketings very well, producing higher quality structural analyses than previous methods which employ the PCFG model family. Acknowledgements. We thank John Lafferty, Fernando Pereira, Ben Taskar, and Sebastian Thrun for comments and discussion. This paper is based on work supported in part by the National Science Foundation under Grant No. IIS-0085896. References [1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the ASA, pages 547–550, 1979. [2] Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In C. Weir, S. Abney, R.
Grishman, and R. Weischedel, editors, Working Notes of the Workshop Statistically-Based NLP Techniques, pages 1–13. AAAI Press, 1992. [3] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132–139, 2000. [4] Noam Chomsky. Knowledge of Language. Praeger, New York, 1986. [5] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, NY, 1968. [6] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In The Fifth Conference on Natural Language Learning, 2001. [7] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, pages 16–23, 1997. [8] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1–38, 1977. [9] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 301–306, Hillsdale, NJ, 1994. Lawrence Erlbaum. [10] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951. [11] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth Conference on Natural Language Learning, 2001. [12] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990. [13] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128–135, 1992. [14] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988. [15] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141–148, 1995. [16] Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging.
In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag, 1994. [17] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. emile. Technical Report 2001.05, University of Leeds, 2001. [18] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and processes in language acquisition, pages 179–215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
6 0.41543445 93 nips-2001-Incremental A*
7 0.40560779 190 nips-2001-Thin Junction Trees
8 0.40042627 182 nips-2001-The Fidelity of Local Ordinal Encoding
9 0.40019953 53 nips-2001-Constructing Distributed Representations Using Additive Clustering
10 0.38384938 118 nips-2001-Matching Free Trees with Replicator Equations
11 0.37785476 86 nips-2001-Grammatical Bigrams
12 0.36803505 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
13 0.34242153 169 nips-2001-Small-World Phenomena and the Dynamics of Information
14 0.33990505 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering
15 0.33918983 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
16 0.30287331 84 nips-2001-Global Coordination of Local Linear Models
17 0.29704291 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
18 0.29519743 178 nips-2001-TAP Gibbs Free Energy, Belief Propagation and Sparsity
19 0.29108226 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies
20 0.27650264 193 nips-2001-Unsupervised Learning of Human Motion Models
topicId topicWeight
[(14, 0.022), (17, 0.016), (19, 0.025), (27, 0.113), (30, 0.087), (36, 0.01), (38, 0.059), (59, 0.022), (67, 0.351), (72, 0.063), (79, 0.053), (83, 0.029), (91, 0.073)]
simIndex simValue paperId paperTitle
1 0.76231444 146 nips-2001-Playing is believing: The role of beliefs in multi-agent learning
Author: Yu-Han Chang, Leslie Pack Kaelbling
Abstract: We propose a new classification for multi-agent learning algorithms, with each league of players characterized by both their possible strategies and possible beliefs. Using this classification, we review the optimality of existing algorithms, including the case of interleague play. We propose an incremental improvement to the existing algorithms that seems to achieve average payoffs that are at least the Nash equilibrium payoffs in the long run against fair opponents.
same-paper 2 0.75831568 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
Author: Alberto Paccanaro, Geoffrey E. Hinton
Abstract: We present Linear Relational Embedding (LRE), a new method of learning a distributed representation of concepts from data consisting of instances of relations between given concepts. Its final goal is to be able to generalize, i.e. infer new instances of these relations among the concepts. On a task involving family relationships we show that LRE can generalize better than any previously published method. We then show how LRE can be used effectively to find compact distributed representations for variable-sized recursive data structures, such as trees and lists.

1 Linear Relational Embedding

Our aim is to take a large set of facts about a domain expressed as tuples of arbitrary symbols in a simple and rigid syntactic format and to be able to infer other "common-sense" facts without having any prior knowledge about the domain. Let us imagine a situation in which we have a set of concepts and a set of relations among these concepts, and that our data consists of a few instances of these relations that hold among the concepts. We want to be able to infer other instances of these relations. For example, if the concepts are the people in a certain family, the relations are kinship relations, and we are given the facts "Alberto has-father Pietro" and "Pietro has-brother Giovanni", we would like to be able to infer "Alberto has-uncle Giovanni". Our approach is to learn appropriate distributed representations of the entities in the data, and then exploit the generalization properties of the distributed representations [2] to make the inferences. In this paper we present a method, which we have called Linear Relational Embedding (LRE), which learns a distributed representation for the concepts by embedding them in a space where the relations between concepts are linear transformations of their distributed representations. Let us consider the case in which all the relations are binary, i.e. involve two concepts. In this case our data consists of triplets, and the problem we are trying to solve is to infer missing triplets when we are given only a few of them. Inferring a triplet is equivalent to being able to complete it, that is, to come up with one of its elements given the other two. Here we shall always try to complete the third element of the triplets. LRE will then represent each concept in the data as a learned vector in a ...
3 0.59935611 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds
Author: B. D. Wright, Kamal Sen, William Bialek, A. J. Doupe
Abstract: In nature, animals encounter high dimensional sensory stimuli that have complex statistical and dynamical structure. Attempts to study the neural coding of these natural signals face challenges both in the selection of the signal ensemble and in the analysis of the resulting neural responses. For zebra finches, naturalistic stimuli can be defined as sounds that they encounter in a colony of conspecific birds. We assembled an ensemble of these sounds by recording groups of 10-40 zebra finches, and then analyzed the response of single neurons in the songbird central auditory area (field L) to continuous playback of long segments from this ensemble. Following methods developed in the fly visual system, we measured the information that spike trains provide about the acoustic stimulus without any assumptions about which features of the stimulus are relevant. Preliminary results indicate that large amounts of information are carried by spike timing, with roughly half of the information accessible only at time resolutions better than 10 ms; additional information is still being revealed as time resolution is improved to 2 ms. Information can be decomposed into that carried by the locking of individual spikes to the stimulus (or modulations of spike rate) vs. that carried by timing in spike patterns. Initial results show that in field L, temporal patterns give at least % extra information. Thus, single central auditory neurons can provide an informative representation of naturalistic sounds, in which spike timing may play a significant role.
4 0.46414205 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyper-parameters.
5 0.46072042 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
Author: Gregor Wenning, Klaus Obermayer
Abstract: Cortical neurons might be considered as threshold elements integrating in parallel many excitatory and inhibitory inputs. Due to the apparent variability of cortical spike trains this yields a strongly fluctuating membrane potential, such that threshold crossings are highly irregular. Here we study how a neuron could maximize its sensitivity w.r.t. a relatively small subset of excitatory input. Weak signals embedded in fluctuations is the natural realm of stochastic resonance. The neuron's response is described in a hazard-function approximation applied to an Ornstein-Uhlenbeck process. We analytically derive an optimality criterion and give a learning rule for the adjustment of the membrane fluctuations, such that the sensitivity is maximal exploiting stochastic resonance. We show that adaptation depends only on quantities that could easily be estimated locally (in space and time) by the neuron. The main results are compared with simulations of a biophysically more realistic neuron model.
6 0.4501304 46 nips-2001-Categorization by Learning and Combining Object Parts
7 0.44620025 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
8 0.44517821 60 nips-2001-Discriminative Direction for Kernel Classifiers
9 0.44509959 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
11 0.44362968 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
12 0.44360921 13 nips-2001-A Natural Policy Gradient
13 0.44235629 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
14 0.44166243 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
15 0.43923593 190 nips-2001-Thin Junction Trees
16 0.43921906 56 nips-2001-Convolution Kernels for Natural Language
17 0.4391754 185 nips-2001-The Method of Quantum Clustering
18 0.4389925 57 nips-2001-Correlation Codes in Neuronal Populations
19 0.43884915 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
20 0.43852931 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
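
The Linear Relational Embedding entry listed above describes concepts as vectors and relations as linear transformations, with inference framed as completing the third element of a triplet. The following is only a toy sketch of that idea, not the authors' method: the vectors and matrices are hand-picked rather than learned, the names `CONCEPTS`, `RELATIONS`, `matvec`, and `complete` are hypothetical, and picking the nearest concept vector as the answer is an assumption made for illustration.

```python
import math

# Hypothetical 2-D concept vectors, chosen by hand (LRE would learn these).
CONCEPTS = {
    "Alberto": [1.0, 0.0],
    "Pietro": [0.0, 1.0],
    "Giovanni": [-1.0, 0.0],
}

# Hypothetical relation matrices, also chosen by hand (LRE would learn these).
# Each relation is a linear map applied to the first concept's vector.
RELATIONS = {
    "has-father": [[0.0, -1.0], [1.0, 0.0]],
    "has-brother": [[1.0, -1.0], [0.0, 0.0]],
    "has-uncle": [[-1.0, -1.0], [0.0, 0.0]],
}


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def complete(first, relation):
    """Complete (first, relation, ?) by mapping the first concept through the
    relation matrix and returning the concept whose vector is closest.
    Nearest-neighbour selection is an assumption of this sketch."""
    target = matvec(RELATIONS[relation], CONCEPTS[first])
    return min(CONCEPTS, key=lambda name: math.dist(CONCEPTS[name], target))


if __name__ == "__main__":
    for first, relation in [
        ("Alberto", "has-father"),
        ("Pietro", "has-brother"),
        ("Alberto", "has-uncle"),
    ]:
        print(first, relation, complete(first, relation))
```

In this toy setup the three family facts from the abstract are recovered because the matrices were constructed to map each first concept exactly onto the intended second concept; a learned model would only approximate the target vector, which is why a closest-vector rule is used here.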