
Automatic Acquisition and Efficient Representation of Syntactic Structures


Source: pdf

Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman

Abstract: The distributional principle, according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1], has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively and by using mutual information to estimate category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as tests of systematicity.

1 Motivation

Models dealing with the acquisition of syntactic knowledge are sharply divided into two classes, depending on whether they subscribe to some variant of the classical generative theory of syntax or operate within the framework of “general-purpose” statistical or distributional learning. An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by Government and Binding theory. An example of the latter is Elman’s widely used Simple Recurrent Network (SRN) [3]. We believe that the polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common, powerful framework. On the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation, probabilistic learning, and “connectionist” modeling; yet generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel at just those tasks, yet are brittle and difficult to train. We present a scheme that acquires “raw” syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1].

2 The ADIOS model

The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. The PA algorithm aims to detect patterns: repetitive sequences of “significant” strings of primitives occurring in the corpus (Figure 1). In this respect it is related to prior work on alignment-based learning [4] and on regular-expression (“local grammar”) extraction [5] from corpora. We stress, however, that our algorithm requires no pre-judging, either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion.

In the initial phase of the PA algorithm the text is segmented down to the smallest possible morphological constituents (e.g., ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds).¹ This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph.

¹ We remark that the algorithm can work in any language, with any set of tokens, including individual characters, or phonemes if applied to speech.
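To make the data structure concrete, the following is a minimal Python sketch (our own illustrative code, not part of the ADIOS implementation) of an RDS-style multigraph built from a toy, pre-segmented corpus: each unique constituent becomes a vertex, and each observed transition becomes an edge labeled with the sentence number and within-sentence index, as described in the next paragraph.

    from collections import defaultdict

    def build_rds_graph(corpus):
        """Toy RDS multigraph: one vertex per unique constituent, one labeled
        edge per transition; edge labels are (sentence number, position)."""
        vertices = set()
        edges = defaultdict(list)  # (u, v) -> list of (sentence_id, position)
        for sent_id, sentence in corpus:
            tokens = ["BEGIN"] + sentence.split() + ["END"]
            vertices.update(tokens)
            for pos, (u, v) in enumerate(zip(tokens, tokens[1:]), start=1):
                edges[(u, v)].append((sent_id, pos))
        return vertices, edges

    corpus = [(101, "the cat is eat -ing"),
              (109, "the cat is play -ing"),
              (121, "the cat is stay -ing")]
    vertices, edges = build_rds_graph(corpus)
    print(edges[("is", "eat")])   # [(101, 4)], i.e., edge label 101_4

The multiple labels attached to a single (u, v) pair are what make the graph a multigraph: each corpus sentence retains its own copy of the transition, so individual sentence paths can later be rewired independently.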
A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number.

Figure 1: (a) Two sequences (m_i, m_j, m_l) and (m_i, m_k, m_l) form a pattern c_{i{j,k}l} = (m_i, {m_j, m_k}, m_l), which allows m_j and m_k to be attributed to the same equivalence class, following the principle of complementary distributions [1]. Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. 1). (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of the equivalence classes associated with c_u and c_v.

In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant Patterns (sequences of constituents; SPs), which are then used to modify the graph (Algorithm 1). For each path p_i, the algorithm constructs a list of candidate constituents, c_{i1}, \ldots, c_{ik}. Each of these consists of a “prefix” (a sequence of graph edges), an equivalence class of vertices, and a “suffix” (another sequence of edges; cf. Figure 2(b)). The criterion I for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c_1, \ldots, c_k must have high mutual information):

    I(c_1, c_2, \ldots, c_k) = e^{-(L/k)^2} \, P(c_1, c_2, \ldots, c_k) \, \log \frac{P(c_1, c_2, \ldots, c_k)}{\prod_{j=1}^{k} P(c_j)}    (1)

where L is the typical context length and k is the length of the candidate pattern; the probabilities associated with each c_j are estimated from frequencies that are immediately available in the graph (e.g., the out-degree of a node is related to the marginal probability of the corresponding c_j).

Algorithm 1 PA (pattern acquisition), phase 2
  while patterns exist do
    for all path ∈ graph do                {path = sentence; graph = corpus}
      for all source node ∈ path do
        for all sink node ∈ path do        {source and sink can be equivalence classes}
          degree of separation = path index(sink) − path index(source);
          pattern table ⇐ detect patterns(source, sink, degree of separation, equivalence table);
        end for
      end for
      winner ⇐ get most significant pattern(pattern table);
      equivalence table ⇐ detect equivalences(graph, winner);
      graph ⇐ rewire graph(graph, winner);
    end for
  end while

Equation 1 balances two opposing “forces” in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context. A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2).
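As a concrete illustration of eq. (1), the following toy computation (our own sketch; the probabilities and the value of L are illustrative stand-ins for the frequency-derived estimates described above) shows how the score favors long, cohesive candidates:

    import math

    def pattern_significance(joint_p, marginal_ps, L):
        """Toy rendering of eq. (1):
        I = exp(-(L/k)^2) * P(c_1..c_k) * log(P(c_1..c_k) / prod_j P(c_j)),
        where joint_p is the estimated probability of the whole candidate,
        marginal_ps holds the k constituent probabilities, and L is the
        typical context length."""
        k = len(marginal_ps)
        return (math.exp(-(L / k) ** 2) * joint_p
                * math.log(joint_p / math.prod(marginal_ps)))

    # A length-3 candidate whose joint probability greatly exceeds the
    # product of its constituents' marginals outscores a weakly
    # associated candidate of the same length.
    strong = pattern_significance(0.02, [0.05, 0.04, 0.05], L=3)
    weak = pattern_significance(0.0002, [0.05, 0.04, 0.05], L=3)
    print(strong > weak)   # True

For a fixed L, the exponential factor shrinks as k decreases, penalizing short candidates, while the log-ratio term rewards constituents that co-occur far more often than chance; this is the balance of “forces” discussed above.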
Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are left untouched. This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power. During the pass over the corpus the list of equivalence sets is updated continuously; the identification of new significant patterns is done using the current equivalence sets (Figure 3(d)). Thus, as the algorithm processes more and more text, it “bootstraps” itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. The recursive nature of this process enables the algorithm to form more and more complex patterns, in a hierarchical manner. The relationships among these can be visualized recursively in a tree format, with tree depth corresponding to the level of recursion (e.g., Figure 3(c)). The PA algorithm halts if it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop).

[Figure 2, panels (a)-(d); the abstracted units shown in the figure include PATTERN 230: the cat is {eat, play, stay} -ing, Equivalence Class 230: {stay, eat, play}, and PATTERN 231: BEGIN {they, we} {230} here.]

Figure 2: (a) A small portion of the RDS graph for a simple corpus, with sentence #101 (the cat is eat -ing) indicated by solid arcs. (b) This sentence joins a pattern, the cat is {eat, play, stay} -ing, in which two others (#109, #121) already participate. (c) The abstracted pattern, and the equivalence class associated with it (edges that belong to sequences not subsumed by this pattern, e.g., #131, are untouched). (d) The identification of new significant patterns is done using the acquired equivalence classes (e.g., #230). In this manner, the system “bootstraps” itself, recursively distilling more and more complex patterns.

Generalization. A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [6], p. 63: “the grammar of a language is simply an inventory of linguistic units.” The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). Generalization can also ensue from partial activation of existing patterns by novel inputs. This function is supported by the input module, which is designed to process a novel sentence by forming its distributed representation in terms of the activities of existing patterns (Figure 6). These activities are computed by propagating activation from the bottom (the terminals) to the top (the patterns) of the RDS. Given a novel input s_1, \ldots, s_k, the initial activity w_j of a terminal c_j is calculated as

    w_j = \max_{m=1..k} \{ I(s_m, c_j) \}    (2)

where I(s_m, c_j) is the mutual information between s_m and c_j. For an equivalence class, the value propagated upwards is the strongest non-zero activation of its members; for a pattern, it is the average weight of its child nodes, on the condition that all the children were activated by adjacent inputs. Activity propagation continues until it reaches the top nodes of the pattern lattice.
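A minimal sketch of this bottom-up propagation (again our own illustration, with the adjacency requirement on the children's activating inputs omitted for brevity):

    def activate(node, mi, inputs):
        """Propagate activation upwards through a toy pattern lattice.
        Terminals: max mutual information with any input token (eq. 2).
        Equivalence classes: strongest activation among the members.
        Patterns: average of the children, provided all were activated."""
        kind, content = node
        if kind == "terminal":
            return max(mi.get((s, content), 0.0) for s in inputs)
        if kind == "class":
            return max(activate(child, mi, inputs) for child in content)
        if kind == "pattern":
            acts = [activate(child, mi, inputs) for child in content]
            return sum(acts) / len(acts) if all(a > 0 for a in acts) else 0.0
        raise ValueError(kind)

    def T(word):  # terminal node
        return ("terminal", word)

    # Pattern #230 from Figure 2: the cat is {eat, play, stay} -ing
    e230 = ("class", [T("eat"), T("play"), T("stay")])
    p230 = ("pattern", [T("the"), T("cat"), T("is"), e230, T("-ing")])
    mi = {(w, w): 1.0 for w in ("the", "cat", "is", "play", "-ing")}
    print(activate(p230, mi, ["the", "cat", "is", "play", "-ing"]))  # 1.0

A novel sentence thus yields a vector of pattern activities, i.e., its distributed representation in terms of the acquired patterns.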
When the algorithm encounters a novel word, all the members of the terminal equivalence class contribute a value of , which is then propagated upwards as usual. This enables the model to make an educated guess as to the meaning of the unfamiliar word, by considering the patterns that become active (Figure 6(b)).

3 Results

We now briefly describe the results of several studies designed to evaluate the viability of the ADIOS model, in which it was exposed to corpora of varying size and complexity.


References

[1] Z. S. Harris. Distributional structure. Word, 10:140–162, 1954.

[2] R. Kazman. Simulating the child’s acquisition of the lexicon and syntax - experiences with Babel. Machine Learning, 16:87–120, 1994.

[3] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.

[4] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Report 05, School of Computing, Leeds University, 2001.

[5] M. Gross. The construction of local grammars. In E. Roche and Y. Schabès, editors, Finite-State Language Processing, pages 329–354. MIT Press, Cambridge, MA, 1997.

[6] R. W. Langacker. Foundations of cognitive grammar, volume I: theoretical prerequisites. Stanford University Press, Stanford, CA, 1987.

[7] T. J. van Gelder and L. Niklasson. On being systematically connectionist. Mind and Language, 9:288–302, 1994.

[8] B. MacWhinney and C. Snow. The child language exchange system. Journal of Computational Linguistics, 12:271–296, 1985.

[9] D. Klein and C. D. Manning. Natural language grammar induction using a constituent-context model. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.

[10] A. Clark. Unsupervised Language Acquisition: Theory and Practice. PhD thesis, COGS, University of Sussex, 2001.