nips nips2001 nips2001-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Michiro Negishi, Stephen J. Hanson
Abstract: It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language that differs in both vocabulary and syntax. Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. [sent-7, score-0.898]
2 We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language that differs in both vocabulary and syntax. [sent-8, score-0.927]
3 Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis. [sent-9, score-0.389]
4 1 Introduction In the field of artificial grammar learning, people are known to be able to transfer grammatical knowledge to a new language which consists of a new vocabulary [6]. [sent-10, score-1.155]
5 Furthermore, this effect persists even when the new strings violate the syntactic rule slightly as long as they are similar to the old strings [1]. [sent-11, score-0.112]
6 It has been shown in past studies that recurrent neural networks also have the ability to generalize previously acquired knowledge to novel inputs. [sent-12, score-0.219]
7 ([2]) showed that a neural network can generalize abstract knowledge acquired in one domain to a new domain. [sent-14, score-0.269]
8 They trained the network to predict the next input symbol in grammatical sequences in the first domain, and showed that the network was able to learn to predict grammatical sequences in the second domain more effectively than it would have learned them without the prior learning. [sent-15, score-0.807]
9 During the training in the second domain, they had to freeze the weights of a part of the network to prevent catastrophic forgetting. [sent-16, score-0.145]
10 ([5]) also showed that prior learning of a grammar facilitates the learning of a new grammar in the cases where either the syntax or the vocabulary was kept constant. [sent-19, score-1.037]
11 In this study we investigate grammar transfer by a neural network, where both the syntax and the vocabulary differ between the source grammar and the target grammar. [sent-20, score-1.513]
12 's network, all weights in the network are allowed to change during the learning of the target grammar, which allows us to investigate interference as well as transfer from the source grammar to the target grammar. [sent-22, score-1.236]
13 1 Simulation Design The Grammar Transfer Task In the following simulations, a neural network is trained with sentences that are generated by a Finite State Machine (FSM) and is then tested to see whether its learning of sentences generated by another FSM is facilitated. [sent-24, score-0.494]
14 Four pairs of FSMs used for the grammar transfer task are shown in Fig. [sent-25, score-0.823]
15 ) denote words, numbers represent states, a state number with an incoming arrow with no state numbers at the arrow foot (e. [sent-32, score-0.41]
16 2A) signifies the initial state, and numbers in circles (e. [sent-35, score-0.08]
17 In each pair of diagrams, transfer was tested in both directions: from the left FSM to the right FSM, and in the opposite direction. [sent-39, score-0.356]
18 Words in a sentence are generated by an FSM and presented to the network one word at a time. [sent-40, score-0.267]
19 At each step, the next word is selected at random, with equal probability, from the words possible at the current FSM state (or end of sentence where possible), and the FSM state is updated to the next state. [sent-41, score-0.416]
20 The sentence length is limited to 20 words, excluding START. [sent-42, score-0.136]
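The generation procedure described in sentences 18 to 20 can be sketched as follows. This is a minimal illustration, not the authors' code: the three-state machine below is hypothetical (the FSMs of Fig. 2 are not recoverable from this text), and handling the 20-word limit by discarding over-long sentences is an assumption.

```python
import random

# Hypothetical 3-state FSM for illustration; NOT one of the machines in Fig. 2.
# TRANSITIONS[state] maps each permitted word to the next state.
TRANSITIONS = {
    1: {"A": 1, "B": 2},
    2: {"C": 3},
    3: {"A": 3, "C": 1},
}
ACCEPTING = {3}    # states at which a sentence may end
INITIAL = 1
MAX_WORDS = 20     # length limit from the text, excluding START
END = None         # sentinel standing for "end of sentence"


def generate_sentence(rng):
    """Return one grammatical sentence as a list of words beginning with START."""
    while True:  # assumption: retry when the length limit is reached in a non-accepting state
        state, words = INITIAL, []
        while len(words) < MAX_WORDS:
            options = list(TRANSITIONS[state].items())
            if state in ACCEPTING:
                options.append((END, None))  # ending counts as one equally likely option
            word, next_state = rng.choice(options)
            if word is END:
                return ["START"] + words
            words.append(word)
            state = next_state
        if state in ACCEPTING:
            return ["START"] + words


def accepts(sentence):
    """Check that a sentence (including START) is accepted by the FSM."""
    state = INITIAL
    for word in sentence[1:]:
        if word not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][word]
    return state in ACCEPTING
```

Calling `generate_sentence(random.Random(0))` yields one sentence; `accepts` can be used to confirm that every generated sentence ends at an accepting state.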
21 The task for the network is to predict the correct termination of sentences. [sent-43, score-0.198]
22 If the network is to predict that the sentence ends with the current input, the activity of the output node has to be above a threshold value; otherwise, the output has to be below another threshold value. [sent-44, score-0.485]
23 Note that if an FSM is at an accepting state but can further transit to another state, the sentence may or may not end. [sent-45, score-0.305]
24 However, the network will eventually learn to yield higher values when the FSM is at an accepting state than when it is not. [sent-47, score-0.374]
25 After the network learns each training sentence, it is tested with 1000 randomly generated sentences, and the training session is completed only when the network makes correct end point judgments for all sentences. [sent-48, score-0.454]
26 Then the network is trained with sentences generated by another FSM. [sent-49, score-0.33]
27 The extent of transfer is measured by the reduction of the number of sentences required to train the network on an FSM after a prior learning of another FSM, compared to the number of sentences required to train the network on the current FSM from scratch. [sent-50, score-1.052]
28 2 The Network Architecture and the Learning Algorithm The network is a second order recurrent neural network, with an added hidden layer that receives first order connections from the input layer (Fig. [sent-52, score-0.692]
29 The network has an input layer with seven nodes (A, B, C, . [sent-54, score-0.356]
30 F, and START), an output layer with one node, an input hidden layer with four nodes, a state hidden layer with four nodes, and a feedback layer with four nodes. [sent-57, score-0.993]
31 Recurrent neural networks are often used for modeling syntactic processing [3]. [sent-58, score-0.087]
32 Second order networks are suited for processing languages generated by FSMs [4] . [sent-59, score-0.053]
33 Learning is carried out by the weight update rule for recurrent networks developed by Williams and Zipser ([7]), extended to second order connections ([4]) where necessary. [sent-60, score-0.121]
34 17 respectively and are adapted after the network has processed the test sentences as follows. [sent-66, score-0.285]
35 The high threshold is modified to the minimum value yielded for all end points in the test sentences minus a margin (0. [sent-67, score-0.214]
36 The low threshold is modified to the high threshold minus another margin (0. [sent-69, score-0.1]
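The threshold adaptation in sentences 35 and 36 amounts to two subtractions, sketched below. The margin values are cut off in this text, so they are passed in as parameters; the numbers in the usage note are placeholders, and the three-way `judge` helper is a hypothetical illustration of how the two thresholds might be read.

```python
def adapt_thresholds(end_point_outputs, high_margin, low_margin):
    """Recompute both thresholds from the outputs observed at sentence end points.

    end_point_outputs: output node activities at all end points in the test sentences.
    high_margin, low_margin: the two margins (their values are elided in the source).
    """
    high = min(end_point_outputs) - high_margin  # just below the weakest end point output
    low = high - low_margin                      # a fixed gap below the high threshold
    return high, low


def judge(output, high, low):
    """Hypothetical reading of the two thresholds as a three-way decision."""
    if output > high:
        return "end"
    if output < low:
        return "not end"
    return "undecided"
```

For example, with placeholder margins of 0.05 and 0.1 and end point outputs of 0.9, 0.8 and 0.95, the high threshold becomes 0.75 and the low threshold 0.65.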
37 Figure 1: A second order recurrent network used in simulations. (Layers labeled in the figure: Output Layer, State Hidden Layer, Feedback Layer, Input Layer.) [sent-72, score-0.237]
38 The network consists of an input layer that receives words, an output layer that predicts sentence ends, two hidden layers (an input hidden layer and a state hidden layer), and a feedback layer that receives a copy of the state hidden layer activities. [sent-73, score-1.778]
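A forward pass through a network with these layers might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the text gives only the layer sizes, so the exact wiring (second order weights combining input hidden and feedback activities), the absence of bias terms, the initial feedback value, the weight range, and the word list D and E are all guesses. Learning is omitted.

```python
import math
import random

# Seven input nodes; "D" and "E" are assumed, the source list is truncated.
WORDS = ["A", "B", "C", "D", "E", "F", "START"]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class SecondOrderRNN:
    """Forward pass only; wiring details and weight ranges are assumptions."""

    def __init__(self, n_input=7, n_input_hidden=4, n_state=4, seed=0):
        rng = random.Random(seed)

        def rand():
            return rng.uniform(-0.5, 0.5)

        # First order weights: input layer -> input hidden layer.
        self.w_in = [[rand() for _ in range(n_input)] for _ in range(n_input_hidden)]
        # Second order weights: one per (state node, input hidden node, feedback node).
        self.w_2nd = [[[rand() for _ in range(n_state)]
                       for _ in range(n_input_hidden)]
                      for _ in range(n_state)]
        # First order weights: state hidden layer -> the single output node.
        self.w_out = [rand() for _ in range(n_state)]
        self.n_state = n_state
        self.reset()

    def reset(self):
        """Reset the feedback layer at the start of a sentence (assumed initial value)."""
        self.feedback = [0.5] * self.n_state

    def step(self, word):
        """Process one word and return the output node activity."""
        x = [1.0 if w == word else 0.0 for w in WORDS]  # one-hot input
        h = [sigmoid(sum(wi * xi for wi, xi in zip(row, x))) for row in self.w_in]
        state = [
            sigmoid(sum(self.w_2nd[k][i][j] * h[i] * self.feedback[j]
                        for i in range(len(h))
                        for j in range(self.n_state)))
            for k in range(self.n_state)
        ]
        self.feedback = state[:]  # feedback layer receives a copy of the state hidden layer
        return sigmoid(sum(w * s for w, s in zip(self.w_out, state)))
```

Each call to `step` multiplies every input hidden activity by every feedback activity before weighting, which is what makes the connections second order: the current word effectively selects how the previous state is mapped to the next one.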
39 1 The Simulation Results The Transfer Effects Numbers of required trainings and changes in number of trainings averaged over 20 networks with different initial weights are shown in Fig. [sent-75, score-0.182]
40 2A shows that it required 14559 sentence presentations for the network to learn the left FSM after the network was trained on the right FSM. [sent-80, score-0.468]
41 On the other hand, it required 20995 sentence presentations for the network to learn the left FSM from scratch. [sent-81, score-0.302]
42 7% reduction in the transfer direction from right to left. [sent-83, score-0.411]
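The size of the transfer effect follows directly from the two counts quoted in sentences 40 and 41. The helper below is a small worked check; the function name is illustrative, and the sign convention (negative for a reduction) matches the one stated for Fig. 2.

```python
def percent_change(from_scratch, after_prior_learning):
    """Percentage change in required training relative to learning from scratch.

    Negative values are reductions (positive transfer);
    positive values are increases (interference).
    """
    return 100.0 * (after_prior_learning - from_scratch) / from_scratch


change = percent_change(from_scratch=20995, after_prior_learning=14559)
print(f"{change:.1f}%")  # prints -30.7%
```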
43 Note that the network was trained only once on sentences from the source grammar to criterion and then only once on the sentences from the target grammar. [sent-84, score-1.062]
44 Thus after the completion of the target grammar learning, the knowledge about the source grammar is disrupted to some extent. [sent-85, score-1.11]
45 To show that the network eventually learns both grammars, the number of required trainings was examined for more than one cycle. [sent-86, score-0.253]
46 After ten cycles, number of required trainings was reduced to 0. [sent-87, score-0.095]
47 2 Representation of Grammatical Knowledge To analyze the representation of grammatical knowledge in the network, Linear Discriminant Analysis (LDA) was applied to hidden layer activities. [sent-90, score-0.48]
48 LDA is a technique that finds sets of coefficients that define a linear combination of input variables that can be used to discriminate among sets of input data that belong to different categories. [sent-91, score-0.131]
49 Linear combinations of hidden layer node activities using these coefficients provide low-dimensional views of hidden layer activities that best separate specified categories (e. [sent-92, score-0.614]
50 In this respect, LDA is similar to Principal Component Analysis (PCA) except that PCA finds dimensions along which the data have large variances, whereas LDA finds dimensions which differentiate the specified categories. [sent-95, score-0.062]
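The contrast with PCA can be made concrete with a two-class, two-dimensional Fisher discriminant, which is the simplest case of LDA. The sketch below is only an illustration: the data points are made up, and the paper applies LDA to hidden layer activities with several categories rather than to this toy case.

```python
def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fisher_direction(class_a, class_b):
    """Return w proportional to Sw^{-1} (mean_a - mean_b) for 2-D points.

    Sw is the pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    """
    ma, mb = mean(class_a), mean(class_b)
    sxx = sxy = syy = 0.0
    for points, m in ((class_a, ma), (class_b, mb)):
        for x, y in points:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    det = sxx * syy - sxy * sxy
    dmx, dmy = ma[0] - mb[0], ma[1] - mb[1]
    # Multiply the mean difference by the inverse of the 2x2 scatter matrix.
    wx = (syy * dmx - sxy * dmy) / det
    wy = (-sxy * dmx + sxx * dmy) / det
    return wx, wy


def project(points, w):
    return [w[0] * x + w[1] * y for x, y in points]


# Made-up data: large variance along x, but the classes differ only along y.
class_a = [(0, 0.0), (4, 0.2), (8, 0.1), (12, -0.1)]
class_b = [(0, 1.0), (4, 1.2), (8, 0.9), (12, 1.1)]
w = fisher_direction(class_a, class_b)
```

In this toy data the direction of largest variance is the x axis, which PCA would favor, while the discriminant direction `w` points almost entirely along y, so the projections of the two classes do not overlap.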
51 Figure 2: Initial savings observed in various grammar transfer tasks. [sent-105, score-0.892]
52 Numbers are the required numbers of trainings averaged over 20 networks with different initial weights. [sent-106, score-0.066]
53 A negative change means reduction (positive transfer) and a positive change means increase (negative transfer, or interference). [sent-109, score-0.082]
55 states 2 and 3 in the source grammar), as well as it can be one of the discrimination boundaries between diamonds with dots and squares with dots (i . [sent-124, score-0.515]
56 The triangular shape shows the three-state FSM trajectory corresponding to the input BCCBCC. [sent-127, score-0.17]
57 Ellipses show the state space activities involved in one-state loops (at state 1 and at state 3) and a two-state loop (at states 2 and 3). [sent-131, score-1.162]
58 4 Discussion In the first grammar transfer task (Fig. [sent-132, score-0.823]
59 2A), only the initial and the accepting states in the FSMs were different, so the frequency distributions of subsequences of words were very similar except for short sentences. [sent-133, score-0.201]
60 In this case, a 31% saving was observed in one transfer direction, although there was little change in required training in the other direction. [sent-134, score-0.482]
61 In the second grammar transfer task, directions of all arcs in the FSMs were reversed. [sent-135, score-0.793]
62 Therefore the mirror images of sentences accepted in one grammar were accepted in the other grammar. [sent-136, score-0.633]
63 Although the grammars were very different, there was a significant amount of overlap in the permissible short subsequences. [sent-137, score-0.071]
64 In this case, there were 31% and 41% savings in training. [sent-138, score-0.099]
65 In the third and fourth grammar transfer tasks, the source and the target grammars shared fewer subsequences. [sent-139, score-1.084]
66 2C) for instance, the subsequences were very different because the source grammar had two one-state loops (at states 1 and 3) with the same word A, whereas two one-state loops in the target grammar consisted of different words (D and E). [sent-141, score-1.415]
67 In this case, there was little change in the number of learnings required in one transfer direction but there was a 67% increase in the other direction. [sent-142, score-0.482]
68 D), there was a 26% reduction in one direction but there was a 12% increase in the other direction in the number of learnings required. [sent-144, score-0.118]
69 From these observations we hypothesize that, as in the case of syntax transfer ([5]), if the acquired grammar allows frequent subsequences of words that appear in the target grammar (after the equivalent symbol sets are substituted), the transfer is easier and thus there are more savings. [sent-145, score-1.834]
70 What is the source of savings in grammar transfer? [sent-146, score-0.652]
71 It is tempting to say that, as in the vocabulary transfer task ([5]), the source of savings is the organization of the state hidden layer activity which directly reflects the FSM states. [sent-147, score-1.151]
72 3 shows the state space organization after the grammar transfer shown in Fig. [sent-149, score-1.011]
73 4 shows the change in the state hidden layer activities drawn over the state space organization. [sent-152, score-0.636]
74 The triangular lines are the trajectories as the network receives BCCBCC, which creates the 3-state loops (231)(231) in the FSM. [sent-153, score-0.353]
75 Regions of trajectories corresponding to the 2-state loop (23) and two 1-state loops (1) and (3) are also shown in Fig. [sent-154, score-0.147]
76 It can be seen that state space activities that belong to different FSM state loops tend to be distinct even when they belong to the same FSM state, although they show some tendency to lie near one another. [sent-156, score-0.576]
77 Unlike in the vocabulary transfer, regions belonging to different FSM loops tend to be interspersed by regions that belong to the other grammar, causing state space structure to be more fragmented. [sent-157, score-0.464]
78 Furthermore, we found that there was no significant correlation between the correct rate of the linear discrimination with respect to FSM states (which reflects the extent to which the state space organization reflects the FSM states) and savings (not shown). [sent-158, score-0.509]
79 One could reasonably argue that the saving is not due to transfer of grammatical knowledge but is due to some more low-level processing specific to neural networks. [sent-159, score-0.638]
80 For instance, the network may have to move weight values to an appropriate range at the first stage of the source grammar learning, which might become unnecessary for the learning of the target grammar. [sent-160, score-0.761]
81 We conducted a simulation to examine the effect of altering the initial random weights using the source and target grammars. [sent-161, score-0.204]
82 If neither the state space organization nor the lower-level statistics was the source of savings, what was transferred? [sent-163, score-0.334]
83 As already mentioned, the state space organization observed in the grammar transfer task is more fragmented than that observed in the vocabulary transfer task (Fig. [sent-164, score-1.552]
84 These fragmented regions have to be discriminated as far as each region (which represents a combination of the current network state and the current vocabulary) has to yield a different network state. [sent-166, score-0.496]
85 State hidden nodes provide clues for the discrimination by placing boundaries in the network state space. [sent-167, score-0.509]
86 Boundary lines collectively define regions in the state space which correspond to sets of state-vocabulary combinations that should be treated equivalently in terms of the given task. [sent-168, score-0.191]
87 These boundaries can be shared: for instance, a hypothetical boundary shown by a broken line in the Fig. [sent-169, score-0.137]
88 4 can be the discrimination boundary between white diamonds and white circles (i. [sent-170, score-0.287]
89 states 2 and 3 in the source grammar), as well as it can be one of the discrimination boundaries between diamonds with dots and squares with dots (i. [sent-172, score-0.515]
90 We speculate that shared boundaries may be the source of savings. [sent-175, score-0.222]
91 That is, boundaries created for the source grammar learning can be used, possibly with some modifications, as one of the boundaries for the target grammar. [sent-176, score-0.746]
92 In other words, the source of savings may not be as high level as FSM state space but some lower level features at the syntactic processing level. [sent-177, score-0.436]
93 5 Conclusion We investigated the ability of a recurrent neural network to transfer grammatical knowledge of a previously acquired language to another. [sent-178, score-0.909]
94 We found that the network was able to transfer the grammatical knowledge to a new grammar with a slightly different syntax defined over a new vocabulary (grammar transfer). [sent-179, score-1.345]
95 We hypothesize that the ability of the network to transfer grammatical knowledge comes from sharing discrimination boundaries of input and vocabulary combinations. [sent-181, score-1.015]
96 In sum, we hope to have demonstrated that neural networks do not simply learn associations among input symbols but they acquire structural knowledge from inputs. [sent-182, score-0.195]
97 (1999) Mapping across domains without feedback: A neural network model of transfer of implicit knowledge, Cognitive Science 23, 53-82. [sent-193, score-0.501]
98 (1991) Distributed representation, simple recurrent neural networks, and grammatical structure. [sent-196, score-0.279]
99 , (2001) The emergence of explicit knowledge (symbols & rules) in (associationist) neural networks, Submitted. [sent-213, score-0.057]
100 (1989) A learning algorithm for continually running fully recurrent neural networks, Neural Computation, 1 (2) , 270. [sent-220, score-0.092]
wordName wordTfidf (topN-words)
[('fsm', 0.558), ('grammar', 0.437), ('transfer', 0.356), ('grammatical', 0.187), ('layer', 0.16), ('network', 0.145), ('sentences', 0.14), ('state', 0.14), ('source', 0.116), ('loops', 0.114), ('savings', 0.099), ('sentence', 0.098), ('diamonds', 0.096), ('recurrent', 0.092), ('vocabulary', 0.087), ('fsms', 0.077), ('syntax', 0.076), ('hidden', 0.076), ('activities', 0.071), ('grammars', 0.071), ('accepting', 0.067), ('boundaries', 0.065), ('target', 0.063), ('discrimination', 0.06), ('syntactic', 0.058), ('dienes', 0.058), ('laye', 0.058), ('trainings', 0.058), ('lda', 0.057), ('knowledge', 0.057), ('organization', 0.055), ('dots', 0.054), ('subsequences', 0.05), ('states', 0.046), ('belong', 0.044), ('numbers', 0.041), ('acquired', 0.041), ('shared', 0.041), ('hypothetical', 0.04), ('circles', 0.039), ('bccbcc', 0.038), ('fragmented', 0.038), ('learnings', 0.038), ('negishi', 0.038), ('newark', 0.038), ('reber', 0.038), ('saving', 0.038), ('zipser', 0.038), ('words', 0.038), ('hanson', 0.038), ('required', 0.037), ('warren', 0.033), ('reg', 0.033), ('rutgers', 0.033), ('jose', 0.033), ('feedback', 0.033), ('trajectories', 0.033), ('psychology', 0.032), ('boundary', 0.032), ('reflects', 0.032), ('language', 0.031), ('receives', 0.031), ('symbols', 0.031), ('finds', 0.031), ('interference', 0.03), ('hypothesize', 0.03), ('triangular', 0.03), ('task', 0.03), ('white', 0.03), ('reduction', 0.03), ('networks', 0.029), ('accepted', 0.028), ('parentheses', 0.028), ('acquire', 0.028), ('vocabularies', 0.028), ('input', 0.028), ('regions', 0.028), ('strings', 0.027), ('domain', 0.026), ('change', 0.026), ('threshold', 0.026), ('modified', 0.025), ('simulation', 0.025), ('direction', 0.025), ('thresholds', 0.024), ('arrow', 0.024), ('chen', 0.024), ('instance', 0.024), ('squares', 0.024), ('generated', 0.024), ('minus', 0.023), ('nodes', 0.023), ('space', 0.023), ('predict', 0.023), ('smith', 0.022), ('extent', 0.022), ('ends', 0.022), ('learn', 0.022), ('trained', 
0.021), ('ion', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
Author: Michiro Negishi, Stephen J. Hanson
Abstract: It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language that differs in both vocabulary and syntax. Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis.
2 0.23535143 86 nips-2001-Grammatical Bigrams
Author: Mark A. Paskin
Abstract: Unsupervised learning algorithms have been derived for several statistical models of English grammar, but their computational complexity makes applying them to large data sets intractable. This paper presents a probabilistic model of English grammar that is much simpler than conventional models, but which admits an efficient EM training algorithm. The model is based upon grammatical bigrams, i.e., syntactic relationships between pairs of words. We present the results of experiments that quantify the representational adequacy of the grammatical bigram model, its ability to generalize from labelled data, and its ability to induce syntactic structure from large amounts of raw text.
3 0.19561191 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
Author: Dan Klein, Christopher D. Manning
Abstract: This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset. 1 Overview To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T , but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts. It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. 
Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser’s supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4]. [Figure 1: Example parse tree with the constituents and contexts for each tree node. Panels: (a) parse tree of the sentence "Factory payrolls fell in September"; (b) table listing each node with its constituent and context.] 2 Previous Approaches One aspect of grammar induction where there has already been substantial success is the induction of parts-of-speech. Several different distributional clustering approaches have resulted in relatively high-quality clusterings, though the clusters’ resemblance to classical parts-of-speech varies substantially [9, 15]. For the present work, we take the part-of-speech induction problem as solved and work with sequences of parts-of-speech rather than words. In some ways this makes the problem easier, such as by reducing sparsity, but in other ways it complicates the task (even supervised parsers perform relatively poorly with the actual words replaced by parts-of-speech). Work attempting to induce tree structures has met with much less success. 
Most grammar induction work assumes that trees are generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil down to one of two types. Some fix the structure of the grammar in advance [12], often with an aim to incorporate linguistic constraints [2] or prior knowledge [13]. These systems typically then attempt to find the grammar production parameters $\Theta$ which maximize the likelihood $P(S \mid \Theta)$ using the inside-outside algorithm [1], which is an efficient (dynamic programming) instance of the EM algorithm [8] for PCFGs. Other systems (which have generally been more successful) incorporate a structural search as well, typically using a heuristic to propose candidate grammar modifications which minimize the joint encoding of data and grammar using an MDL criterion, which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact [6, 16, 18, 17]. These approaches can also be seen as likelihood maximization where the objective function is the a posteriori likelihood of the grammar given the data, and the description length provides a structural prior. The “compact grammar” aspect of MDL is close to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical [10] or cognitive [5] economy. However, the primary weakness of MDL-based systems does not have to do with the objective function, but the search procedures they employ. Such systems end up growing structures greedily, in a bottom-up fashion. Therefore, their induction quality is determined by how well they are able to heuristically predict what local intermediate structures will fit into good final global solutions. A potential advantage of systems which fix the grammar and only perform parameter search is that they do compare complete grammars against each other, and are therefore able to detect which give rise to systematically compatible parses. 
However, although early work showed that small, artificial CFGs could be induced with the EM algorithm [12], studies with large natural language grammars have generally suggested that completely unsupervised EM over PCFG s is ineffective for grammar acquisition. For instance, Carroll and Charniak [2] describe experiments running the EM algorithm from random starting points, which produced widely varying learned grammars, almost all of extremely poor quality. 1 1 We duplicated one of their experiments, which used grammars restricted to rules of the form x → x y | y x, where there is one category x for each part-of-speech (such a restricted CFG is isomorphic to a dependency grammar). We began reestimation from a grammar with uniform rewrite It is well-known that EM is only locally optimal, and one might think that the locality of the search procedure, not the objective function, is to blame. The truth is somewhere in between. There are linguistic reasons to distrust an ML objective function. It encourages the symbols and rules to align in ways which maximize the truth of the conditional independence assumptions embodied by the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made [14]. None of these have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical regularities and possibilities for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs and linguistic criteria align, but in practice they do not always seem to. 
Experiments with both artificial [12] and real [13] data have shown that starting from fixed, correct (or at least linguistically reasonable) structure, EM produces a grammar which has higher log-likelihood than the linguistically determined grammar, but lower parsing accuracy. However, we additionally conjecture that EM over PCFGs fails to propagate contextual cues efficiently. The reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the product of inside and outside probabilities α j ( p, q)β j ( p, q) is the probability of generating the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. If we had a good idea of what VPs and NPs looked like, then if a novel NP appeared in an NP context, the outside probabilities should pressure the sequence to be parsed as an NP . However, what happens early in the EM procedure, when we have no real idea about the grammar parameters? With randomly-weighted, complete grammars over a symbol set X, we have observed that a frequent, short, noun phrase sequence often does get assigned to some category x early on. However, since there is not a clear overall structure learned, there is only very weak pressure for other NPs, even if they occur in the same positions, to also be assigned to x, and the reestimation process goes astray. To enable this kind of constituent-context pressure to be effective, we propose the model in the following section. 3 The Constituent-Context Model We propose an alternate parametric family of models over trees which is better suited for grammar induction. Broadly speaking, inducing trees like the one shown in figure 1(a) can be broken into two tasks. 
One is deciding constituent identity: where the brackets should be placed. The second is deciding what to label the constituents. These tasks are certainly correlated and are usually solved jointly. However, the task of labeling chosen brackets is essentially the same as the part-of-speech induction problem, and the solutions cited above can be adapted to cluster constituents [6]. The task of deciding brackets is the harder task. For example, the sequence DT NN IN DT NN ([the man in the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([we [sent a man] [to the moon]]). Figure 2(a) shows the 50 most frequent constituent sequences of three types, represented as points in the vector space of their contexts (see below), projected onto their first two principal components. The three clusters are relatively coherent, and it is not difficult to believe that a clustering algorithm could detect them in the unprojected space. Figure 2(b), however, shows 150 sequences which are parsed as constituents at least 50% of the time along with 150 which are not, again projected onto the first two components. This plot at least suggests that the constituent/non-constituent classification is less amenable to direct clustering.

[Figure 2 plots omitted. Panel (a) legend: NP, VP, PP. Panel (b) legend: Usually a Constituent, Rarely a Constituent.]
Figure 2: The most frequent examples of (a) different constituent labels and (b) constituents and non-constituents, in the vector space of linear contexts, projected onto the first two principal components. Clustering is effective for labeling, but not detecting constituents.

Footnote 1 (continued): ... probabilities. Figure 4 shows that the resulting grammar (DEP-PCFG) is not as bad as conventional wisdom suggests. Carroll and Charniak are right to observe that the search space is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The need for random seeding in using EM over PCFGs is two-fold. For some grammars, such as one over a set X of non-terminals in which any $x_1 \to x_2\, x_3$, $x_i \in X$ is possible, it is needed to break symmetry. This is not the case for dependency grammars, where symmetry is broken by the yields (e.g., a sentence noun verb can only be covered by a noun or verb projection). The second reason is to start the search from a random region of the space. But unless one does many random restarts, the uniform starting condition is better than most extreme points in the space, and produces superior results.

Thus, it is important that an induction system be able to detect constituents, either implicitly or explicitly. A variety of methods of constituent detection have been proposed [11, 6], usually based on information-theoretic properties of a sequence's distributional context. However, here we rely entirely on the following two simple assumptions: (i) constituents of a parse do not cross each other, and (ii) constituents occur in constituent contexts. The first property is self-evident from the nature of the parse trees. The second is an extremely weakened version of classic linguistic constituency tests [14]. Let σ be a terminal sequence. Every occurrence of σ will be in some linear context c(σ) = x σ y, where x and y are the adjacent terminals or sentence boundaries. Then we can view any tree t over a sentence s as a collection of sequences and contexts, one of each for every node in the tree, plus one for each inter-terminal empty span, as in figure 1(b). Good trees will include nodes whose yields frequently occur as constituents and whose contexts frequently surround constituents.
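The view of a tree as a collection of sequences and contexts can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the span encoding (half-open index pairs), the `BOUNDARY` symbol and both function names are assumptions introduced here.

```python
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]  # half-open span (i, j) over tag positions
BOUNDARY = "#"          # hypothetical symbol for a sentence boundary


def linear_context(tags: List[str], i: int, j: int) -> Tuple[str, str]:
    """Return (x, y): the items adjacent to tags[i:j], or BOUNDARY at an edge."""
    left = tags[i - 1] if i > 0 else BOUNDARY
    right = tags[j] if j < len(tags) else BOUNDARY
    return left, right


def sequences_and_contexts(
    tags: List[str], brackets: Iterable[Span]
) -> Dict[Span, Tuple[Tuple[str, ...], Tuple[str, str]]]:
    """Map every tree node and every empty span to its (yield, context) pair."""
    n = len(tags)
    spans = set(brackets)                        # internal nodes of the tree
    spans.update((i, i + 1) for i in range(n))   # one node per terminal
    spans.update((i, i) for i in range(n + 1))   # inter-terminal empty spans
    return {
        (i, j): (tuple(tags[i:j]), linear_context(tags, i, j))
        for (i, j) in sorted(spans)
    }
```

For the sentence of figure 1 (NN NNS VBD IN NN) with brackets S, NP, VP and PP, this yields 3n = 15 pairs, for example the NP yield NN NNS in the context (boundary, VBD).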
Formally, we use a conditional exponential model of the form:

$$P(t \mid s, \Theta) = \frac{\exp\left(\sum_{(\sigma,c)\in t} \lambda_\sigma f_\sigma + \lambda_c f_c\right)}{\sum_{t':\,\mathrm{yield}(t')=s} \exp\left(\sum_{(\sigma,c)\in t'} \lambda_\sigma f_\sigma + \lambda_c f_c\right)}$$

We have one feature $f_\sigma(t)$ for each sequence σ whose value on a tree t is the number of nodes in t with yield σ, and one feature $f_c(t)$ for each context c representing the number of times c is the context of the yield of some node in the tree (footnote 2). No joint features over c and σ are used, and, unlike many other systems, there is no distinction between constituent types. We model only the conditional likelihood of the trees, $P(T \mid S, \Theta)$, where $\Theta = \{\lambda_\sigma, \lambda_c\}$. We then use an iterative EM-style procedure to find a local maximum of $P(T \mid S, \Theta)$ for the completed data (trees) T, where $P(T \mid S, \Theta) = \prod_{t \in T,\, s = \mathrm{yield}(t)} P(t \mid s, \Theta)$. We initialize Θ such that each λ is zero and initialize T to any arbitrary set of trees. In alternating steps, we first fix the parameters Θ and find the most probable single tree structure $t^*$ for each sentence s according to $P(t \mid s, \Theta)$, using a simple dynamic program. For any Θ this produces the set of parses $T^*$ which maximizes $P(T \mid S, \Theta)$. Since $T^*$ maximizes this quantity, if $T'$ is the former set of trees, $P(T^* \mid S, \Theta) \ge P(T' \mid S, \Theta)$. Second, we fix the trees and estimate new parameters Θ. The task of finding the parameters $\Theta^*$ which maximize $P(T \mid S, \Theta)$ is simply the well-studied task of fitting our exponential model to maximize the conditional likelihood of the fixed parses. Running, for example, a conjugate gradient (CG) ascent on Θ will produce the desired $\Theta^*$. If $\Theta'$ is the former parameters, then we will have $P(T \mid S, \Theta^*) \ge P(T \mid S, \Theta')$. Therefore, each iteration will increase $P(T \mid S, \Theta)$ until convergence (footnote 3). Note that our parsing model is not a generative model, and this procedure, though clearly related, is not exactly an instance of the EM algorithm.

Footnote 2: So, for the tree in figure 1(a), $P(t \mid s) \propto \exp(\lambda_{\text{NN NNS}} + \lambda_{\text{VBD IN NN}} + \lambda_{\text{IN NN}} + \lambda_{-\text{VBD}} + \lambda_{\text{NNS}-} + \lambda_{\text{VBD}-} + \lambda_{-\text{NNS}} + \lambda_{\text{NN}-\text{VBD}} + \lambda_{\text{NNS}-\text{IN}} + \lambda_{\text{VBD}-\text{NN}} + \lambda_{\text{IN}-})$.
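The parsing step (find the most probable binary tree for fixed weights) can be sketched as a CKY-style recursion over spans. The text only says "a simple dynamic program", so the details below are assumptions: weights are held in two hypothetical dictionaries, `lam_seq` and `lam_ctx`, missing weights count as zero, empty spans are left out because every tree contains the same ones, and ties are broken deterministically rather than randomly.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

BOUNDARY = "#"  # hypothetical sentence-boundary symbol


def best_tree(
    tags: List[str],
    lam_seq: Dict[Tuple[str, ...], float],
    lam_ctx: Dict[Tuple[str, str], float],
) -> Tuple[float, Tuple[Tuple[int, int], ...]]:
    """Return (log-score, spans) of the highest-scoring binary bracketing."""
    n = len(tags)

    def span_score(i: int, j: int) -> float:
        # A node contributes lambda_sigma for its yield plus lambda_c for its context.
        left = tags[i - 1] if i > 0 else BOUNDARY
        right = tags[j] if j < n else BOUNDARY
        return lam_seq.get(tuple(tags[i:j]), 0.0) + lam_ctx.get((left, right), 0.0)

    @lru_cache(maxsize=None)
    def best(i: int, j: int):
        here = span_score(i, j)
        if j - i == 1:                       # a single terminal
            return here, ((i, j),)
        # Try every split point and keep the best pair of sub-trees.
        score, spans = max(
            (best(i, k)[0] + best(k, j)[0], best(i, k)[1] + best(k, j)[1])
            for k in range(i + 1, j)
        )
        return here + score, ((i, j),) + spans

    return best(0, n)
```

Because the normalizer in the model is the same for every tree over a sentence, maximizing the unnormalized sum of weights is enough to find the most probable tree.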
We merely guarantee that the conditional likelihood of the data completions is increasing. Furthermore, unlike in EM where each iteration increases the marginal likelihood of the fixed observed data, our procedure increases the conditional likelihood of a changing complete data set, with the completions changing at every iteration as we reparse. Several implementation details were important in making the system work well. First, tiebreaking was needed, most of all for the first round. Initially, the parameters are zero, and all parses are therefore equally likely. To prevent bias, all ties were broken randomly. Second, as in so many statistical NLP tasks, smoothing was vital. There are features in our model for arbitrarily long yields and most yield types occurred only a few times. The most severe consequence of this sparsity was that initial parsing choices could easily become frozen. If a $\lambda_\sigma$ for some yield σ was either $\gg 0$ or $\ll 0$, which was usually the case for rare yields, σ would either be locked into always occurring or never occurring, respectively. Not only did we want to push the $\lambda_\sigma$ values close to zero, we also wanted to account for the fact that most spans are not constituents (footnote 4). Therefore, we expect the distribution of the $\lambda_\sigma$ to be skewed towards low values (footnote 5). A greater amount of smoothing was needed for the first few iterations, while much less was required in later iterations. Finally, parameter estimation using a CG method was slow and difficult to smooth in the desired manner, and so we used the smoothed relative frequency estimates $\lambda_\sigma = \mathrm{count}(f_\sigma)/(\mathrm{count}(\sigma) + M)$ and $\lambda_c = \mathrm{count}(f_c)/(\mathrm{count}(c) + N)$. These estimates ensured that the λ values were between 0 and 1, and gave the desired bias towards non-constituency. These estimates were fast and surprisingly effective, but do not guarantee non-decreasing conditional likelihood (though the conditional likelihood was increasing in practice) (footnote 6).
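A minimal sketch of the smoothed relative frequency estimate for the sequence weights follows. It rests on an interpretation that the text does not spell out: count(f_σ) is taken to be the number of times σ is bracketed as a constituent in the current parses, and count(σ) the number of times σ occurs as any contiguous span. The function name and the default value of `M` are placeholders, since the paper does not report the smoothing constants.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def sequence_weights(
    corpus: List[List[str]], parses: List[List[Span]], M: float = 10.0
) -> Dict[Tuple[str, ...], float]:
    """lambda_sigma = count(f_sigma) / (count(sigma) + M) for every observed sequence."""
    as_constituent: Counter = Counter()  # assumed meaning of count(f_sigma)
    anywhere: Counter = Counter()        # assumed meaning of count(sigma)
    for tags, brackets in zip(corpus, parses):
        n = len(tags)
        for i in range(n):
            for j in range(i + 1, n + 1):
                anywhere[tuple(tags[i:j])] += 1
        for i, j in set(brackets):
            as_constituent[tuple(tags[i:j])] += 1
    return {seq: as_constituent[seq] / (anywhere[seq] + M) for seq in anywhere}
```

With M > 0 every weight stays strictly below 1, and sequences that are rarely bracketed are pulled towards 0, which matches the bias towards non-constituency described above. The context weights would be computed the same way with N in place of M.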
4 Results

In all experiments, we used hand-parsed sentences from the Penn Treebank. For training, we took the approximately 7500 sentences in the Wall Street Journal (WSJ) section which contained 10 words or fewer after the removal of punctuation. For testing, we evaluated the system by comparing the system's parses for those same sentences against the supervised parses in the treebank. We consider each parse as a set of constituent brackets, discarding all trivial brackets (footnote 7). We calculated the precision and recall of these brackets against the treebank parses in the obvious way.

Footnote 3: In practice, we stopped the system after 10 iterations, but final behavior was apparent after 4–8.
Footnote 4: In a sentence of length n, there are (n + 1)(n + 2)/2 total (possibly size zero) spans, but only 3n constituent spans: n − 1 of size ≥ 2, n of size 1, and n + 1 empty spans.
Footnote 5: Gaussian priors for the exponential model accomplish the former goal, but not the latter.
Footnote 6: The relative frequency estimators had a somewhat subtle positive effect. Empty spans have no effect on the model when using CG fitting, as all trees include the same empty spans. However, including their counts improved performance substantially when using relative frequency estimators. This is perhaps an indication that a generative version of this model would be advantageous.
Footnote 7: We discarded both brackets of length one and brackets spanning the entire sentence, since all of these are impossible to get incorrect, and hence ignored sentences of length ≤ 2 during testing.

[Figure 3 tree diagrams omitted; the example sentence is "The screen was a sea of red".]
Figure 3: Alternate parse trees for a sentence: (a) the Penn Treebank tree (deemed correct), (b) the one found by our system CCM, and (c) the one found by DEP-PCFG.
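The bracket scoring described above can be sketched as follows. The paper does not say whether scores are pooled over the corpus or averaged per sentence; this sketch assumes pooling (micro-averaging), and the function names are introduced here for illustration only.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]


def nontrivial(brackets: Iterable[Span], n: int) -> Set[Span]:
    """Drop length-one brackets and the bracket spanning the whole sentence."""
    return {(i, j) for (i, j) in brackets if j - i >= 2 and (i, j) != (0, n)}


def unlabeled_scores(
    predicted: List[List[Span]], gold: List[List[Span]], lengths: List[int]
) -> Tuple[float, float, float]:
    """Unlabeled precision, recall and F1, pooled over all sentences."""
    hits = n_pred = n_gold = 0
    for pred, ref, n in zip(predicted, gold, lengths):
        p, g = nontrivial(pred, n), nontrivial(ref, n)
        hits += len(p & g)
        n_pred += len(p)
        n_gold += len(g)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For a five-word sentence whose gold brackets are (0,2), (2,5), (3,5) and whose predicted brackets are (1,5), (2,5), (3,5), two of three brackets match, so precision and recall are both 2/3.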
(a) WSJ sentences

Method     UP     UR     F1     NP UR  PP UR  VP UR
LBRANCH    20.5   24.2   22.2   28.9    6.3    0.6
RANDOM     29.0   31.0   30.0   42.8   23.6   26.3
DEP-PCFG   39.5   42.3   40.9   69.7   44.1   22.8
RBRANCH    54.1   67.5   60.0   38.3   44.5   85.8
CCM        60.1   75.4   66.9   83.8   71.6   66.3
UBOUND     78.2  100.0   87.8  100.0  100.0  100.0

(b) ATIS corpus

System     UP     UR     F1     CB
EMILE      51.6   16.8   25.4   0.84
ABL        43.6   35.6   39.2   2.12
CDC-40     53.4   34.6   42.0   1.46
RBRANCH    39.9   46.4   42.9   2.18
CCM        54.4   46.8   50.3   1.61

Figure 4: Comparative accuracy on WSJ sentences (a) and on the ATIS corpus (b). UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP; CB = crossing brackets. Separate recall values are shown for three major categories.

To situate the results of our system, figure 4(a) gives the scores of several parsing strategies. CCM is our constituent-context model. DEP-PCFG is a dependency PCFG model [2] trained using the inside-outside algorithm. Figure 3 shows sample parses to give a feel for the parses the systems produce. We also tested several baselines. RANDOM parses randomly. This is an appropriate baseline for an unsupervised system. RBRANCH always chooses the right-branching chain, while LBRANCH always chooses the left-branching chain. RBRANCH is often used as a baseline for supervised systems, but exploits a systematic right-branching tendency of English. An unsupervised system has no a priori reason to prefer right chains to left chains, and LBRANCH is considerably worse than RANDOM. A system need not beat RBRANCH to claim partial success at grammar induction. Finally, we include an upper bound. All of the parsing strategies and systems mentioned here give fully binary-branching structures. Treebank trees, however, need not be fully binary-branching, and generally are not. As a result, there is an upper bound UBOUND on the precision and F1 scores achievable when structurally confined to binary trees. Clearly, CCM is parsing much better than the RANDOM baseline and the DEP-PCFG induced grammar.
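As a small illustration of the two chain baselines and of the F1 column, the sketch below generates right- and left-branching bracketings and recomputes the harmonic mean from a table row. The function names are made up for this example, and spans are again half-open index pairs.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def right_branching(n: int) -> List[Span]:
    """RBRANCH-style chain: every bracket ends at the last word."""
    return [(i, n) for i in range(n - 1)]


def left_branching(n: int) -> List[Span]:
    """LBRANCH-style chain: every bracket starts at the first word."""
    return [(0, j) for j in range(2, n + 1)]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

Recomputing F1 from the rounded CCM row of figure 4(a), `f1(60.1, 75.4)`, gives about 66.9. Values recomputed from rounded precision and recall may differ from the published column in the last digit, as appears to happen for RBRANCH.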
Significantly, it also outperforms RBRANCH in both precision and recall, and, to our knowledge, it is the first unsupervised system to do so. To facilitate comparison with other recent systems, figure 4(b) gives results where we trained as before but used (all) the sentences from the distributionally different ATIS section of the treebank as a test set. For this experiment, precision and recall were calculated using the EVALB system of measuring precision and recall (as in [6, 17]); EVALB is a standard for parser evaluation, but complex, and unsuited to evaluating unlabeled constituency. EMILE and ABL are lexical systems described in [17]. The results for CDC-40, from [6], reflect training on much more data (12M words). Our system is superior in terms of both precision and recall (and so F1). These figures are certainly not all that there is to say about an induced grammar; there are a number of issues in how to interpret the results of an unsupervised system when comparing with treebank parses. Errors come in several kinds. First are innocent sins of commission. Treebank trees are very flat; for example, there is no analysis of the inside of many short noun phrases ([two hard drives] rather than [two [hard drives]]).
Our system gives a (usually correct) analysis of the insides of such NPs, for which it is penalized on precision (though not recall or crossing brackets). Second are systematic alternate analyses. Our system tends to form modal verb groups and often attaches verbs first to pronoun subjects rather than to objects. As a result, many VPs are systematically incorrect, boosting crossing bracket scores and impacting VP recall. Finally, the treebank's grammar is sometimes an arbitrary, and even inconsistent standard for an unsupervised learner: alternate analyses may be just as good (footnote 8). Notwithstanding this, we believe that the treebank parses have enough truth in them that parsing scores are a useful component of evaluation. Ideally, we would like to inspect the quality of the grammar directly. Unfortunately, the grammar acquired by our system is implicit in the learned feature weights. These are not by themselves particularly interpretable, and not directly comparable to the grammars produced by other systems, except through their functional behavior.

Sequence   Example            CORRECT  FREQUENCY  ENTROPY  DEP-PCFG  CCM
DT NN      the man            1        2          2        1         1
NNP NNP    United States      2        1          –        2         2
CD CD      4 1/2              3        9          –        5         5
JJ NNS     daily yields       4        7          3        4         4
DT JJ NN   the top rank       5        –          –        7         6
DT NNS     the people         6        –          –        –         10
JJ NN      plastic furniture  7        3          7        3         3
CD NN      12 percent         8        –          –        –         9
IN NN      on Monday          9        –          9        –         –
IN DT NN   for the moment     10       –          –        –         –
NN NNS     fire trucks        11       –          6        –         8
NN NN      fire truck         22       8          10       –         7
TO VB      to go              26       –          1        6         –
DT JJ      ?the big           78       6          –        –         –
IN DT      *of the            90       4          –        10        –
PRP VBZ    ?he says           95       –          –        8         –
PRP VBP    ?they say          180      –          –        9         –
NNS VBP    ?people are        =350     –          4        –         –
NN VBZ     ?value is          =532     10         5        –         –
NN IN      *man from          =648     5          –        –         –
NNS VBD    ?people were       =648     –          8        –         –

Figure 5: Top non-trivial sequences by actual treebank constituent counts, linear frequency, scaled context entropy, and in DEP-PCFG and CCM learned models' parses.
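One way to compare systems through their functional behavior is to tabulate which tag sequences a set of parses treats as constituents, next to how often those sequences occur at all. The sketch below produces two such rankings, in the spirit of the CORRECT and FREQUENCY columns of figure 5. It is an illustration under assumptions: sequences of length one are skipped as trivial, whole-sentence spans are kept, and the function and key names are invented here.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[int, int]
Seq = Tuple[str, ...]


def rank_sequences(
    corpus: List[List[str]], parses: List[List[Span]], top: int = 10
) -> Dict[str, List[Seq]]:
    """Rank tag sequences by raw frequency and by constituent count."""
    frequency: Counter = Counter()
    constituent: Counter = Counter()
    for tags, brackets in zip(corpus, parses):
        n = len(tags)
        for i in range(n):
            for j in range(i + 2, n + 1):        # every span of length >= 2
                frequency[tuple(tags[i:j])] += 1
        for i, j in set(brackets):
            if j - i >= 2:
                constituent[tuple(tags[i:j])] += 1
    return {
        "FREQUENCY": [seq for seq, _ in frequency.most_common(top)],
        "CONSTITUENT": [seq for seq, _ in constituent.most_common(top)],
    }
```

Passing treebank brackets as `parses` gives a CORRECT-style list, while passing a system's own output gives the list for that system. A sequence such as IN DT can top the frequency list without ever being bracketed.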
Any grammar which parses a corpus will have a distribution over which sequences tend to be analyzed as constituents. These distributions can give a good sense of what structures are and are not being learned. Therefore, to supplement the parsing scores above, we examine these distributions. Figure 5 shows the top scoring constituents by several orderings. These lists do not say very much about how long, complex, recursive constructions are being analyzed by a given system, but grammar induction systems are still at the level where major mistakes manifest themselves in short, frequent sequences. CORRECT ranks sequences by how often they occur as constituents in the treebank parses. DEP-PCFG and CCM are the same, but use counts from the DEP-PCFG and CCM parses. As a baseline, FREQUENCY lists sequences by how often they occur anywhere in the sentence yields. Note that the sequence IN DT (e.g., “of the”) is high on this list, and is a typical error of many early systems. Finally, ENTROPY is the heuristic proposed in [11] which ranks by context entropy. It is better in practice than FREQUENCY, but that isn’t self-evident from this list. Clearly, the lists produced by the CCM system are closer to correct than the others. They look much like a censored version of the FREQUENCY list, where sequences which do not co-exist with higher-ranked ones have been removed (e.g., IN DT often crosses DT NN). This observation may explain a good part of the success of this method. Another explanation for the surprising success of the system is that it exploits a deep fact about language. Most long constituents have some short, frequent equivalent, or proform, which occurs in similar contexts [14].
In the very common case where the proform is a single word, it is guaranteed constituency, which will be transmitted to longer sequences via shared contexts (categories like PP which have infrequent proforms are not learned well unless the empty sequence is in the model; interestingly, the empty sequence appears to act as the proform for PPs, possibly due to the highly optional nature of many PPs).

Footnote 8: For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are bracketed [[possessive noun] complement] ([The president’s execution] of the law), an arbitrary inconsistency which is unlikely to be learned automatically.

5 Conclusions

We have presented an alternate probability model over trees which is based on simple assumptions about the nature of natural language structure. It is driven by the explicit transfer between sequences and their contexts, and exploits both the proform phenomenon and the fact that good constituents must tile in ways that systematically cover the corpus sentences without crossing. The model clearly has limits. Lacking recursive features, it essentially must analyze long, rare constructions using only contexts. However, despite, or perhaps due to its simplicity, our model predicts bracketings very well, producing higher quality structural analyses than previous methods which employ the PCFG model family.

Acknowledgements. We thank John Lafferty, Fernando Pereira, Ben Taskar, and Sebastian Thrun for comments and discussion. This paper is based on work supported in part by the National Science Foundation under Grant No. IIS-0085896.

References

[1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the ASA, pages 547–550, 1979.
[2] Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In C. Weir, S. Abney, R.
Grishman, and R. Weischedel, editors, Working Notes of the Workshop Statistically-Based NLP Techniques, pages 1–13. AAAI Press, 1992.
[3] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132–139, 2000.
[4] Noam Chomsky. Knowledge of Language. Praeger, New York, 1986.
[5] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, NY, 1968.
[6] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In The Fifth Conference on Natural Language Learning, 2001.
[7] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, pages 16–23, 1997.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1–38, 1977.
[9] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 301–306, Hillsdale, NJ, 1994. Lawrence Erlbaum.
[10] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.
[11] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth Conference on Natural Language Learning, 2001.
[12] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.
[13] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128–135, 1992.
[14] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988.
[15] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141–148, 1995.
[16] Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging.
In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag, 1994.
[17] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Technical Report 2001.05, University of Leeds, 2001.
[18] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and processes in language acquisition, pages 179–215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
4 0.099650472 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
Author: M. Schmitt
Abstract: Recurrent neural networks of analog units are computers for real-valued functions. We study the time complexity of real computation in general recurrent neural networks. These have sigmoidal, linear, and product units of unlimited order as nodes and no restrictions on the weights. For networks operating in discrete time, we exhibit a family of functions with arbitrarily high complexity, and we derive almost tight bounds on the time required to compute these functions. Thus, evidence is given of the computational limitations that time-bounded analog recurrent neural networks are subject to.
5 0.09059377 56 nips-2001-Convolution Kernels for Natural Language
Author: Michael Collins, Nigel Duffy
Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
6 0.069051005 115 nips-2001-Linear-time inference in Hierarchical HMMs
7 0.068426199 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
8 0.06554728 183 nips-2001-The Infinite Hidden Markov Model
9 0.061089922 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
10 0.058915582 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation
11 0.056242514 1 nips-2001-(Not) Bounding the True Error
12 0.056162126 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
13 0.05610349 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
14 0.053992216 123 nips-2001-Modeling Temporal Structure in Classical Conditioning
15 0.052871976 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
16 0.052665871 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations
17 0.051865574 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
18 0.048166011 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding
19 0.044881709 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
20 0.044228796 3 nips-2001-ACh, Uncertainty, and Cortical Inference
topicId topicWeight
[(0, -0.139), (1, -0.081), (2, 0.013), (3, -0.025), (4, -0.11), (5, -0.039), (6, -0.055), (7, 0.005), (8, -0.254), (9, -0.077), (10, -0.223), (11, -0.081), (12, -0.051), (13, -0.023), (14, 0.085), (15, -0.021), (16, 0.028), (17, -0.287), (18, 0.058), (19, -0.045), (20, 0.02), (21, -0.026), (22, 0.024), (23, -0.062), (24, 0.018), (25, 0.011), (26, -0.04), (27, -0.096), (28, -0.055), (29, 0.069), (30, -0.037), (31, -0.086), (32, -0.119), (33, -0.066), (34, 0.071), (35, 0.019), (36, 0.033), (37, 0.078), (38, 0.015), (39, -0.102), (40, 0.155), (41, -0.098), (42, 0.058), (43, 0.099), (44, 0.138), (45, -0.033), (46, -0.091), (47, 0.046), (48, -0.011), (49, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.94194144 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
Author: Michiro Negishi, Stephen J. Hanson
Abstract: It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language which differs in both vocabulary and syntax. Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis.
2 0.78157955 86 nips-2001-Grammatical Bigrams
Author: Mark A. Paskin
Abstract: Unsupervised learning algorithms have been derived for several statistical models of English grammar, but their computational complexity makes applying them to large data sets intractable. This paper presents a probabilistic model of English grammar that is much simpler than conventional models, but which admits an efficient EM training algorithm. The model is based upon grammatical bigrams, i.e., syntactic relationships between pairs of words. We present the results of experiments that quantify the representational adequacy of the grammatical bigram model, its ability to generalize from labelled data, and its ability to induce syntactic structure from large amounts of raw text.
3 0.67206448 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
Author: Dan Klein, Christopher D. Manning
Abstract: This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset.

1 Overview

To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T, but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts. It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise.
Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser’s supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4].

(a) [S [NP [NN1 Factory] [NNS payrolls]] [VP [VBD fell] [PP [IN in] [NN2 September]]]]

(b)
Node   Constituent        Context
S      NN NNS VBD IN NN   –
NP     NN NNS             – VBD
VP     VBD IN NN          NNS –
PP     IN NN              VBD –
NN1    NN                 – NNS
NNS    NNS                NN – VBD
VBD    VBD                NNS – IN
IN     IN                 VBD – NN
NN2    NN                 IN –

Empty  Context
0      – NN
1      NN – NNS
2      NNS – VBD
3      VBD – IN
4      IN – NN
5      NN –

Figure 1: Example parse tree with the constituents and contexts for each tree node.

2 Previous Approaches

One aspect of grammar induction where there has already been substantial success is the induction of parts-of-speech. Several different distributional clustering approaches have resulted in relatively high-quality clusterings, though the clusters’ resemblance to classical parts-of-speech varies substantially [9, 15]. For the present work, we take the part-of-speech induction problem as solved and work with sequences of parts-of-speech rather than words. In some ways this makes the problem easier, such as by reducing sparsity, but in other ways it complicates the task (even supervised parsers perform relatively poorly with the actual words replaced by parts-of-speech). Work attempting to induce tree structures has met with much less success.
Most grammar induction work assumes that trees are generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil down to one of two types. Some fix the structure of the grammar in advance [12], often with an aim to incorporate linguistic constraints [2] or prior knowledge [13]. These systems typically then attempt to find the grammar production parameters Θ which maximize the likelihood $P(S \mid \Theta)$ using the inside-outside algorithm [1], which is an efficient (dynamic programming) instance of the EM algorithm [8] for PCFGs. Other systems (which have generally been more successful) incorporate a structural search as well, typically using a heuristic to propose candidate grammar modifications which minimize the joint encoding of data and grammar using an MDL criterion, which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact [6, 16, 18, 17]. These approaches can also be seen as likelihood maximization where the objective function is the a posteriori likelihood of the grammar given the data, and the description length provides a structural prior. The “compact grammar” aspect of MDL is close to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical [10] or cognitive [5] economy. However, the primary weakness of MDL-based systems does not have to do with the objective function, but the search procedures they employ. Such systems end up growing structures greedily, in a bottom-up fashion. Therefore, their induction quality is determined by how well they are able to heuristically predict what local intermediate structures will fit into good final global solutions. A potential advantage of systems which fix the grammar and only perform parameter search is that they do compare complete grammars against each other, and are therefore able to detect which give rise to systematically compatible parses.
However, although early work showed that small, artificial CFGs could be induced with the EM algorithm [12], studies with large natural language grammars have generally suggested that completely unsupervised EM over PCFGs is ineffective for grammar acquisition. For instance, Carroll and Charniak [2] describe experiments running the EM algorithm from random starting points, which produced widely varying learned grammars, almost all of extremely poor quality (footnote 1). It is well-known that EM is only locally optimal, and one might think that the locality of the search procedure, not the objective function, is to blame. The truth is somewhere in between. There are linguistic reasons to distrust an ML objective function. It encourages the symbols and rules to align in ways which maximize the truth of the conditional independence assumptions embodied by the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made [14]. None of these have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical regularities and possibilities for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs and linguistic criteria align, but in practice they do not always seem to.

Footnote 1: We duplicated one of their experiments, which used grammars restricted to rules of the form x → x y | y x, where there is one category x for each part-of-speech (such a restricted CFG is isomorphic to a dependency grammar). We began reestimation from a grammar with uniform rewrite ... (the footnote continues further down the page)
Experiments with both artificial [12] and real [13] data have shown that starting from fixed, correct (or at least linguistically reasonable) structure, EM produces a grammar which has higher log-likelihood than the linguistically determined grammar, but lower parsing accuracy. However, we additionally conjecture that EM over PCFGs fails to propagate contextual cues efficiently. The reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the product of inside and outside probabilities $\alpha_j(p,q)\,\beta_j(p,q)$ is the probability of generating the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. If we had a good idea of what VPs and NPs looked like, then if a novel NP appeared in an NP context, the outside probabilities should pressure the sequence to be parsed as an NP. However, what happens early in the EM procedure, when we have no real idea about the grammar parameters? With randomly-weighted, complete grammars over a symbol set X, we have observed that a frequent, short, noun phrase sequence often does get assigned to some category x early on. However, since there is not a clear overall structure learned, there is only very weak pressure for other NPs, even if they occur in the same positions, to also be assigned to x, and the reestimation process goes astray. To enable this kind of constituent-context pressure to be effective, we propose the model in the following section.

3 The Constituent-Context Model

We propose an alternate parametric family of models over trees which is better suited for grammar induction. Broadly speaking, inducing trees like the one shown in figure 1(a) can be broken into two tasks.
One is deciding constituent identity: where the brackets should be placed. The second is deciding what to label the constituents. These tasks are certainly correlated and are usually solved jointly. However, the task of labeling chosen brackets is essentially the same as the part-of-speech induction problem, and the solutions cited above can be adapted to cluster constituents [6]. The task of deciding brackets, is the harder task. For example, the sequence DT NN IN DT NN ([the man in the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([we [sent a man] [to the moon]]). Figure 2(a) probabilities. Figure 4 shows that the resulting grammar (DEP - PCFG) is not as bad as conventional wisdom suggests. Carroll and Charniak are right to observe that the search spaces is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The need for random seeding in using EM over PCFGs is two-fold. For some grammars, such as one over a set X of non-terminals in which any x 1 → x2 x3 , xi ∈ X is possible, it is needed to break symmetry. This is not the case for dependency grammars, where symmetry is broken by the yields (e.g., a sentence noun verb can only be covered by a noun or verb projection). The second reason is to start the search from a random region of the space. But unless one does many random restarts, the uniform starting condition is better than most extreme points in the space, and produces superior results. 1.5 2 Usually a Constituent Rarely a Constituent 1 1 0.5 0 0 −1 −2 −3 −1.5 −1 −0.5 NP VP PP −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 1.5 (a) (b) Figure 2: The most frequent examples of (a) different constituent labels and (b) constituents and non-constituents, in the vector space of linear contexts, projected onto the first two principal components. 
Clustering is effective for labeling, but not detecting constituents. shows the 50 most frequent constituent sequences of three types, represented as points in the vector space of their contexts (see below), projected onto their first two principal components. The three clusters are relatively coherent, and it is not difficult to believe that a clustering algorithm could detect them in the unprojected space. Figure 2(a), however, shows 150 sequences which are parsed as constituents at least 50% of the time along with 150 which are not, again projected onto the first two components. This plot at least suggests that the constituent/non-constituent classification is less amenable to direct clustering. Thus, it is important that an induction system be able to detect constituents, either implicitly or explicitly. A variety of methods of constituent detection have been proposed [11, 6], usually based on information-theoretic properties of a sequence’s distributional context. However, here we rely entirely on the following two simple assumptions: (i) constituents of a parse do not cross each other, and (ii) constituents occur in constituent contexts. The first property is self-evident from the nature of the parse trees. The second is an extremely weakened version of classic linguistic constituency tests [14]. Let σ be a terminal sequence. Every occurrence of σ will be in some linear context c(σ ) = x σ y, where x and y are the adjacent terminals or sentence boundaries. Then we can view any tree t over a sentence s as a collection of sequences and contexts, one of each for every node in the tree, plus one for each inter-terminal empty span, as in figure 1(b). Good trees will include nodes whose yields frequently occur as constituents and whose contexts frequently surround constituents. 
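The view of a sentence as a collection of sequences and linear contexts can be made concrete with a short sketch. It is illustrative only: the function name `spans_with_contexts`, the boundary marker `"#"`, and the five-tag example are assumptions introduced here, not part of the paper.

```python
BOUNDARY = "#"  # hypothetical sentence-boundary marker (an assumption)


def spans_with_contexts(tags):
    """Return (span, yield, context) for every span of a tag sequence.

    A span (i, j) covers tags[i:j]; i == j gives an empty span.
    The context is the pair of adjacent terminals, with BOUNDARY
    standing in at either edge of the sentence.
    """
    n = len(tags)
    padded = [BOUNDARY] + list(tags) + [BOUNDARY]
    result = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            # padded is shifted by one, so padded[i] is the terminal left of i
            result.append(((i, j), tuple(tags[i:j]), (padded[i], padded[j + 1])))
    return result


tags = ["NN", "NNS", "VBD", "IN", "NN"]
table = {span: (seq, ctx) for span, seq, ctx in spans_with_contexts(tags)}
print(len(table))      # number of spans, empty ones included
print(table[(0, 2)])   # yield and context of the first two tags
print(table[(3, 3)])   # the empty span between VBD and IN
```

For five tags this enumerates 21 spans, which agrees with the (n + 1)(n + 2)/2 span count the paper gives in its footnote on span counts. A tree then selects a non-crossing subset of these spans as its nodes.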
Formally, we use a conditional exponential model of the form:

$$P(t \mid s, \Theta) = \frac{\exp\left(\sum_{(\sigma,c) \in t} \lambda_\sigma f_\sigma + \lambda_c f_c\right)}{\sum_{t' : \mathrm{yield}(t') = s} \exp\left(\sum_{(\sigma,c) \in t'} \lambda_\sigma f_\sigma + \lambda_c f_c\right)}$$

We have one feature $f_\sigma(t)$ for each sequence σ whose value on a tree t is the number of nodes in t with yield σ, and one feature $f_c(t)$ for each context c representing the number of times c is the context of the yield of some node in the tree (footnote 2). No joint features over c and σ are used, and, unlike many other systems, there is no distinction between constituent types. We model only the conditional likelihood of the trees, $P(T \mid S, \Theta)$, where $\Theta = \{\lambda_\sigma, \lambda_c\}$.

Footnote 2: So, for the tree in figure 1(a), $P(t \mid s) \propto \exp(\lambda_{\text{NN NNS}} + \lambda_{\text{VBD IN NN}} + \lambda_{\text{IN NN}} + \lambda_{\diamond-\text{VBD}} + \lambda_{\text{NNS}-\diamond} + \lambda_{\text{VBD}-\diamond} + \lambda_{\diamond-\text{NNS}} + \lambda_{\text{NN}-\text{VBD}} + \lambda_{\text{NNS}-\text{IN}} + \lambda_{\text{VBD}-\text{NN}} + \lambda_{\text{IN}-\diamond})$, with $\diamond$ standing for a sentence boundary.

We then use an iterative EM-style procedure to find a local maximum $P(T \mid S, \Theta)$ of the completed data (trees) T, where

$$P(T \mid S, \Theta) = \prod_{t \in T,\ s = \mathrm{yield}(t)} P(t \mid s, \Theta).$$

We initialize Θ such that each λ is zero and initialize T to any arbitrary set of trees. In alternating steps, we first fix the parameters Θ and find the most probable single tree structure $t^*$ for each sentence s according to $P(t \mid s, \Theta)$, using a simple dynamic program. For any Θ this produces the set of parses $T^*$ which maximizes $P(T \mid S, \Theta)$. Since $T^*$ maximizes this quantity, if $T'$ is the former set of trees, $P(T^* \mid S, \Theta) \ge P(T' \mid S, \Theta)$. Second, we fix the trees and estimate new parameters Θ. The task of finding the parameters $\Theta^*$ which maximize $P(T \mid S, \Theta)$ is simply the well-studied task of fitting our exponential model to maximize the conditional likelihood of the fixed parses. Running, for example, a conjugate gradient (CG) ascent on Θ will produce the desired $\Theta^*$. If $\Theta'$ is the former parameters, then we will have $P(T \mid S, \Theta^*) \ge P(T \mid S, \Theta')$. Therefore, each iteration will increase $P(T \mid S, \Theta)$ until convergence (footnote 3).

Note that our parsing model is not a generative model, and this procedure, though clearly related, is not exactly an instance of the EM algorithm.
We merely guarantee that the conditional likelihood of the data completions is increasing. Furthermore, unlike in EM where each iteration increases the marginal likelihood of the fixed observed data, our procedure increases the conditional likelihood of a changing complete data set, with the completions changing at every iteration as we reparse.

Several implementation details were important in making the system work well. First, tiebreaking was needed, most of all for the first round. Initially, the parameters are zero, and all parses are therefore equally likely. To prevent bias, all ties were broken randomly.

Second, like so many statistical NLP tasks, smoothing was vital. There are features in our model for arbitrarily long yields and most yield types occurred only a few times. The most severe consequence of this sparsity was that initial parsing choices could easily become frozen. If a $\lambda_\sigma$ for some yield σ was either $\gg 0$ or $\ll 0$, which was usually the case for rare yields, σ would either be locked into always occurring or never occurring, respectively. Not only did we want to push the $\lambda_\sigma$ values close to zero, we also wanted to account for the fact that most spans are not constituents (footnote 4). Therefore, we expect the distribution of the $\lambda_\sigma$ to be skewed towards low values (footnote 5). A greater amount of smoothing was needed for the first few iterations, while much less was required in later iterations.

Finally, parameter estimation using a CG method was slow and difficult to smooth in the desired manner, and so we used the smoothed relative frequency estimates $\lambda_\sigma = \mathrm{count}(f_\sigma)/(\mathrm{count}(\sigma) + M)$ and $\lambda_c = \mathrm{count}(f_c)/(\mathrm{count}(c) + N)$. These estimates ensured that the λ values were between 0 and 1, and gave the desired bias towards non-constituency. These estimates were fast and surprisingly effective, but do not guarantee non-decreasing conditional likelihood (though the conditional likelihood was increasing in practice).
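The alternating procedure of this section can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `features`, `best_tree`, and `reestimate`, the boundary marker, and the reading of count(σ) as the number of times σ occurs as any span in the corpus are all assumptions. Ties are also resolved deterministically here, whereas the paper breaks them randomly.

```python
from collections import Counter

BOUNDARY = "#"  # hypothetical sentence-boundary marker (an assumption)


def features(tags, i, j):
    """Yield and linear context of the span tags[i:j]."""
    padded = [BOUNDARY] + list(tags) + [BOUNDARY]
    return tuple(tags[i:j]), (padded[i], padded[j + 1])


def best_tree(tags, lam_seq, lam_ctx):
    """Highest-scoring binary bracketing, returned as a set of spans (i, j).

    A tree's score is the sum over its nodes of lambda_sigma + lambda_c.
    Empty spans are shared by all trees, so they are left out of the search.
    """
    n = len(tags)

    def node_score(i, j):
        seq, ctx = features(tags, i, j)
        return lam_seq.get(seq, 0.0) + lam_ctx.get(ctx, 0.0)

    best = {(i, i + 1): node_score(i, i + 1) for i in range(n)}
    split = {}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            # max() keeps the first best split point (deterministic tie-break)
            k = max(range(i + 1, j), key=lambda m: best[i, m] + best[m, j])
            split[i, j] = k
            best[i, j] = node_score(i, j) + best[i, k] + best[k, j]

    # Follow the stored split points down from the root span.
    brackets, stack = set(), [(0, n)]
    while stack:
        i, j = stack.pop()
        brackets.add((i, j))
        if j - i > 1:
            k = split[i, j]
            stack += [(i, k), (k, j)]
    return brackets


def reestimate(corpus, parses, m, n_ctx):
    """Smoothed relative-frequency weights computed from the current parses."""
    seq_const, ctx_const = Counter(), Counter()
    seq_all, ctx_all = Counter(), Counter()
    for tags, brackets in zip(corpus, parses):
        length = len(tags)
        for i in range(length + 1):
            for j in range(i, length + 1):
                seq, ctx = features(tags, i, j)
                seq_all[seq] += 1
                ctx_all[ctx] += 1
                # empty and one-word spans belong to every tree
                if j - i <= 1 or (i, j) in brackets:
                    seq_const[seq] += 1
                    ctx_const[ctx] += 1
    lam_seq = {s: seq_const[s] / (seq_all[s] + m) for s in seq_all}
    lam_ctx = {c: ctx_const[c] / (ctx_all[c] + n_ctx) for c in ctx_all}
    return lam_seq, lam_ctx


corpus = [["DT", "NN", "VBD"]]
parses = [{(0, 3), (0, 2), (0, 1), (1, 2), (2, 3)}]
lam_seq, lam_ctx = reestimate(corpus, parses, m=2, n_ctx=2)
print(lam_seq[("DT", "NN")], lam_seq[("NN", "VBD")])
print(sorted(best_tree(corpus[0], lam_seq, lam_ctx)))
```

In this toy run the bracketed yield DT NN receives a positive weight while the unbracketed NN VBD receives zero, so reparsing keeps the (0, 2) bracket. A fuller loop would alternate the two functions over the whole corpus and lower the smoothing constants in later iterations, as the text describes.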
(footnote 6)

4 Results

In all experiments, we used hand-parsed sentences from the Penn Treebank. For training, we took the approximately 7500 sentences in the Wall Street Journal (WSJ) section which contained 10 words or fewer after the removal of punctuation. For testing, we evaluated the system by comparing the system’s parses for those same sentences against the supervised parses in the treebank. We consider each parse as a set of constituent brackets, discarding all trivial brackets (footnote 7). We calculated the precision and recall of these brackets against the treebank parses in the obvious way.

Footnote 3: In practice, we stopped the system after 10 iterations, but final behavior was apparent after 4-8.

Footnote 4: In a sentence of length n, there are (n + 1)(n + 2)/2 total (possibly size zero) spans, but only 3n constituent spans: n - 1 of size ≥ 2, n of size 1, and n + 1 empty spans.

Footnote 5: Gaussian priors for the exponential model accomplish the former goal, but not the latter.

Footnote 6: The relative frequency estimators had a somewhat subtle positive effect. Empty spans have no effect on the model when using CG fitting, as all trees include the same empty spans. However, including their counts improved performance substantially when using relative frequency estimators. This is perhaps an indication that a generative version of this model would be advantageous.

Footnote 7: We discarded both brackets of length one and brackets spanning the entire sentence, since all of these are impossible to get incorrect, and hence ignored sentences of length ≤ 2 during testing.

[Figure 3: Alternate parse trees for a sentence ("The screen was a sea of red"): (a) the Penn Treebank tree (deemed correct), (b) the one found by our system CCM, and (c) the one found by DEP-PCFG.]
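The bracket-based evaluation described in this section can be sketched as follows. This is a hedged illustration rather than the authors' scoring code: the function names and the toy bracket sets are assumptions, and the sketch scores a single sentence, whereas the paper does not spell out how counts are aggregated over the corpus.

```python
def nontrivial(brackets, n):
    """Drop brackets shorter than two words and the one spanning the whole sentence."""
    return {(i, j) for (i, j) in brackets if j - i >= 2 and (i, j) != (0, n)}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def bracket_scores(proposed, gold, n):
    """Unlabeled precision, recall and F1 over non-trivial brackets."""
    p, g = nontrivial(proposed, n), nontrivial(gold, n)
    matched = len(p & g)
    precision = matched / len(p) if p else 0.0
    recall = matched / len(g) if g else 0.0
    return precision, recall, f1(precision, recall)


# Toy five-word sentence: the two parses disagree on one bracket.
proposed = {(0, 5), (0, 2), (2, 5), (2, 4)}
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
print(bracket_scores(proposed, gold, 5))
```

As a sanity check on the harmonic-mean definition, `f1(60.1, 75.4)` comes to about 66.9, consistent with the F1 reported for CCM in figure 4(a).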
(a)
Method      UP     UR      F1     NP UR   PP UR   VP UR
LBRANCH     20.5   24.2    22.2   28.9    6.3     0.6
RANDOM      29.0   31.0    30.0   42.8    23.6    26.3
DEP-PCFG    39.5   42.3    40.9   69.7    44.1    22.8
RBRANCH     54.1   67.5    60.0   38.3    44.5    85.8
CCM         60.1   75.4    66.9   83.8    71.6    66.3
UBOUND      78.2   100.0   87.8   100.0   100.0   100.0

(b)
System      UP     UR     F1     CB
EMILE       51.6   16.8   25.4   0.84
ABL         43.6   35.6   39.2   2.12
CDC-40      53.4   34.6   42.0   1.46
RBRANCH     39.9   46.4   42.9   2.18
CCM         54.4   46.8   50.3   1.61

Figure 4: Comparative accuracy on WSJ sentences (a) and on the ATIS corpus (b). UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP; CB = crossing brackets. Separate recall values are shown for three major categories.

To situate the results of our system, figure 4(a) gives the values of several parsing strategies. CCM is our constituent-context model. DEP-PCFG is a dependency PCFG model [2] trained using the inside-outside algorithm. Figure 3 shows sample parses to give a feel for the parses the systems produce. We also tested several baselines. RANDOM parses randomly. This is an appropriate baseline for an unsupervised system. RBRANCH always chooses the right-branching chain, while LBRANCH always chooses the left-branching chain. RBRANCH is often used as a baseline for supervised systems, but exploits a systematic right-branching tendency of English. An unsupervised system has no a priori reason to prefer right chains to left chains, and LBRANCH is considerably worse than RANDOM. A system need not beat RBRANCH to claim partial success at grammar induction. Finally, we include an upper bound. All of the parsing strategies and systems mentioned here give fully binary-branching structures. Treebank trees, however, need not be fully binary-branching, and generally are not. As a result, there is an upper bound UBOUND on the precision and F1 scores achievable when structurally confined to binary trees. Clearly, CCM is parsing much better than the RANDOM baseline and the DEP-PCFG induced grammar.
Significantly, it also out-performs RBRANCH in both precision and recall, and, to our knowledge, it is the first unsupervised system to do so. To facilitate comparison with other recent systems, figure 4(b) gives results where we trained as before but used (all) the sentences from the distributionally different ATIS section of the treebank as a test set. For this experiment, precision and recall were calculated using the EVALB system of measuring precision and recall (as in [6, 17]); EVALB is a standard for parser evaluation, but it is complex and unsuited to evaluating unlabeled constituency. EMILE and ABL are lexical systems described in [17]. The results for CDC-40, from [6], reflect training on much more data (12M words). Our system is superior in terms of both precision and recall (and so F1).

These figures are certainly not all that there is to say about an induced grammar; there are a number of issues in how to interpret the results of an unsupervised system when comparing with treebank parses. Errors come in several kinds. First are innocent sins of commission. Treebank trees are very flat; for example, there is no analysis of the inside of many short noun phrases ([two hard drives] rather than [two [hard drives]]). Our system gives a (usually correct) analysis of the insides of such NPs, for which it is penalized on precision (though not recall or crossing brackets). Second are systematic alternate analyses. Our system tends to form modal verb groups and often attaches verbs first to pronoun subjects rather than to objects. As a result, many VPs are systematically incorrect, boosting crossing bracket scores and impacting VP recall. Finally, the treebank’s grammar is sometimes an arbitrary, and even inconsistent, standard for an unsupervised learner: alternate analyses may be just as good (footnote 8). Notwithstanding this, we believe that the treebank parses have enough truth in them that parsing scores are a useful component of evaluation.

Sequence    Example             CORRECT   FREQUENCY   ENTROPY   DEP-PCFG   CCM
DT NN       the man             1         2           2         1          1
NNP NNP     United States       2         1           -         2          2
CD CD       4 1/2               3         9           -         5          5
JJ NNS      daily yields        4         7           3         4          4
DT JJ NN    the top rank        5         -           -         7          6
DT NNS      the people          6         -           -         -          10
JJ NN       plastic furniture   7         3           7         3          3
CD NN       12 percent          8         -           -         -          9
IN NN       on Monday           9         -           9         -          -
IN DT NN    for the moment      10        -           -         -          -
NN NNS      fire trucks         11        -           6         -          8
NN NN       fire truck          22        8           10        -          7
TO VB       to go               26        -           1         6          -
DT JJ       ?the big            78        6           -         -          -
IN DT       *of the             90        4           -         10         -
PRP VBZ     ?he says            95        -           -         8          -
PRP VBP     ?they say           180       -           -         9          -
NNS VBP     ?people are         =350      -           4         -          -
NN VBZ      ?value is           =532      10          5         -          -
NN IN       *man from           =648      5           -         -          -
NNS VBD     ?people were        =648      -           8         -          -

Figure 5: Top non-trivial sequences by actual treebank constituent counts, linear frequency, scaled context entropy, and in DEP-PCFG and CCM learned models’ parses.

Ideally, we would like to inspect the quality of the grammar directly. Unfortunately, the grammar acquired by our system is implicit in the learned feature weights. These are not by themselves particularly interpretable, and not directly comparable to the grammars produced by other systems, except through their functional behavior.
Any grammar which parses a corpus will have a distribution over which sequences tend to be analyzed as constituents. These distributions can give a good sense of what structures are and are not being learned. Therefore, to supplement the parsing scores above, we examine these distributions. Figure 5 shows the top scoring constituents by several orderings. These lists do not say very much about how long, complex, recursive constructions are being analyzed by a given system, but grammar induction systems are still at the level where major mistakes manifest themselves in short, frequent sequences. CORRECT ranks sequences by how often they occur as constituents in the treebank parses. DEP-PCFG and CCM are the same, but use counts from the DEP-PCFG and CCM parses. As a baseline, FREQUENCY lists sequences by how often they occur anywhere in the sentence yields. Note that the sequence IN DT (e.g., “of the”) is high on this list, and is a typical error of many early systems. Finally, ENTROPY is the heuristic proposed in [11] which ranks by context entropy. It is better in practice than FREQUENCY, but that is not self-evident from this list. Clearly, the lists produced by the CCM system are closer to correct than the others. They look much like a censored version of the FREQUENCY list, where sequences which do not co-exist with higher-ranked ones have been removed (e.g., IN DT often crosses DT NN). This observation may explain a good part of the success of this method. Another explanation for the surprising success of the system is that it exploits a deep fact about language. Most long constituents have some short, frequent equivalent, or proform, which occurs in similar contexts [14].
In the very common case where the proform is a single word, it is guaranteed constituency, which will be transmitted to longer sequences via shared contexts (categories like PP which have infrequent proforms are not learned well unless the empty sequence is in the model; interestingly, the empty sequence appears to act as the proform for PPs, possibly due to the highly optional nature of many PPs).

Footnote 8: For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are bracketed [[possessive noun] complement] ([The president’s execution] of the law), an arbitrary inconsistency which is unlikely to be learned automatically.

5 Conclusions

We have presented an alternate probability model over trees which is based on simple assumptions about the nature of natural language structure. It is driven by the explicit transfer between sequences and their contexts, and exploits both the proform phenomenon and the fact that good constituents must tile in ways that systematically cover the corpus sentences without crossing. The model clearly has limits. Lacking recursive features, it essentially must analyze long, rare constructions using only contexts. However, despite, or perhaps due to, its simplicity, our model predicts bracketings very well, producing higher quality structural analyses than previous methods which employ the PCFG model family.

Acknowledgements. We thank John Lafferty, Fernando Pereira, Ben Taskar, and Sebastian Thrun for comments and discussion. This paper is based on work supported in part by the National Science Foundation under Grant No. IIS-0085896.

References
[1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the ASA, pages 547-550, 1979.
[2] Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In C. Weir, S. Abney, R. Grishman, and R. Weischedel, editors, Working Notes of the Workshop Statistically-Based NLP Techniques, pages 1-13. AAAI Press, 1992.
[3] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132-139, 2000.
[4] Noam Chomsky. Knowledge of Language. Praeger, New York, 1986.
[5] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, NY, 1968.
[6] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In The Fifth Conference on Natural Language Learning, 2001.
[7] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, pages 16-23, 1997.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1-38, 1977.
[9] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 301-306, Hillsdale, NJ, 1994. Lawrence Erlbaum.
[10] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.
[11] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth Conference on Natural Language Learning, 2001.
[12] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35-56, 1990.
[13] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128-135, 1992.
[14] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988.
[15] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141-148, 1995.
[16] Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag, 1994.
[17] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Technical Report 2001.05, University of Leeds, 2001.
[18] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and processes in language acquisition, pages 179-215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
4 0.48150992 5 nips-2001-A Bayesian Model Predicts Human Parse Preference and Reading Times in Sentence Processing
Author: S. Narayanan, Daniel Jurafsky
Abstract: Narayanan and Jurafsky (1998) proposed that human language comprehension can be modeled by treating human comprehenders as Bayesian reasoners, and modeling the comprehension process with Bayesian decision trees. In this paper we extend the Narayanan and Jurafsky model to make further predictions about reading time given the probability of different parses or interpretations, and test the model against reading time data from a psycholinguistic experiment.
5 0.36677775 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
Author: M. Schmitt
Abstract: Recurrent neural networks of analog units are computers for real-valued functions. We study the time complexity of real computation in general recurrent neural networks. These have sigmoidal, linear, and product units of unlimited order as nodes and no restrictions on the weights. For networks operating in discrete time, we exhibit a family of functions with arbitrarily high complexity, and we derive almost tight bounds on the time required to compute these functions. Thus, evidence is given of the computational limitations that time-bounded analog recurrent neural networks are subject to.
6 0.34445074 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
7 0.33628026 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
8 0.32950878 83 nips-2001-Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons
9 0.32371876 194 nips-2001-Using Vocabulary Knowledge in Bayesian Multinomial Estimation
10 0.30949724 177 nips-2001-Switch Packet Arbitration via Queue-Learning
11 0.29936561 91 nips-2001-Improvisation and Learning
12 0.29146218 56 nips-2001-Convolution Kernels for Natural Language
13 0.25719801 78 nips-2001-Fragment Completion in Humans and Machines
14 0.25549546 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
15 0.25355935 110 nips-2001-Learning Hierarchical Structures with Linear Relational Embedding
16 0.2473408 115 nips-2001-Linear-time inference in Hierarchical HMMs
17 0.23534636 107 nips-2001-Latent Dirichlet Allocation
18 0.23491229 183 nips-2001-The Infinite Hidden Markov Model
19 0.22696948 27 nips-2001-Activity Driven Adaptive Stochastic Resonance
20 0.22499973 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation
topicId topicWeight
[(14, 0.019), (17, 0.017), (19, 0.023), (24, 0.309), (27, 0.105), (30, 0.102), (38, 0.018), (59, 0.014), (72, 0.049), (79, 0.042), (83, 0.012), (84, 0.049), (91, 0.147)]
simIndex simValue paperId paperTitle
same-paper 1 0.80888808 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
Author: Michiro Negishi, Stephen J. Hanson
Abstract: It has been known that people, after being exposed to sentences generated by an artificial grammar, acquire implicit grammatical knowledge and are able to transfer the knowledge to inputs that are generated by a modified grammar. We show that a second order recurrent neural network is able to transfer grammatical knowledge from one language (generated by a Finite State Machine) to another language which differs in both vocabulary and syntax. Representation of the grammatical knowledge in the network is analyzed using linear discriminant analysis.
2 0.56530827 56 nips-2001-Convolution Kernels for Natural Language
Author: Michael Collins, Nigel Duffy
Abstract: We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
3 0.56085694 162 nips-2001-Relative Density Nets: A New Way to Combine Backpropagation with HMM's
Author: Andrew D. Brown, Geoffrey E. Hinton
Abstract: Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's.
4 0.55886602 10 nips-2001-A Hierarchical Model of Complex Cells in Visual Cortex for the Binocular Perception of Motion-in-Depth
Author: Silvio P. Sabatini, Fabio Solari, Giulia Andreani, Chiara Bartolozzi, Giacomo M. Bisio
Abstract: A cortical model for motion-in-depth selectivity of complex cells in the visual cortex is proposed. The model is based on a time extension of the phase-based techniques for disparity estimation. We consider the computation of the total temporal derivative of the time-varying disparity through the combination of the responses of disparity energy units. To take into account the physiological plausibility, the model is based on the combinations of binocular cells characterized by different ocular dominance indices. The resulting cortical units of the model show a sharp selectivity for motion-in-depth that has been compared with that reported in the literature for real cortical cells.
5 0.55733764 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning
Author: Ran El-Yaniv, Oren Souroujon
Abstract: We present a powerful meta-clustering technique called Iterative Double Clustering (IDC). The IDC method is a natural extension of the recent Double Clustering (DC) method of Slonim and Tishby that exhibited impressive performance on text categorization tasks [12]. Using synthetically generated data we empirically find that whenever the DC procedure is successful in recovering some of the structure hidden in the data, the extended IDC procedure can incrementally compute a significantly more accurate classification. IDC is especially advantageous when the data exhibits high attribute noise. Our simulation results also show the effectiveness of IDC in text categorization problems. Surprisingly, this unsupervised procedure can be competitive with a (supervised) SVM trained with a small training set. Finally, we propose a simple and natural extension of IDC for semi-supervised and transductive learning where we are given both labeled and unlabeled examples.
6 0.55719018 149 nips-2001-Probabilistic Abstraction Hierarchies
7 0.55671656 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
8 0.55501348 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments
9 0.5538677 46 nips-2001-Categorization by Learning and Combining Object Parts
10 0.55386579 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
11 0.55352914 161 nips-2001-Reinforcement Learning with Long Short-Term Memory
12 0.55209529 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
13 0.55193543 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
14 0.55172187 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
15 0.55121922 89 nips-2001-Grouping with Bias
16 0.55117381 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes
17 0.55109787 182 nips-2001-The Fidelity of Local Ordinal Encoding
18 0.55011177 169 nips-2001-Small-World Phenomena and the Dynamics of Information
19 0.54896921 13 nips-2001-A Natural Policy Gradient
20 0.54826975 39 nips-2001-Audio-Visual Sound Separation Via Hidden Markov Models