nips nips2002 nips2002-35 knowledge-graph by maker-knowledge-mining

35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures


Source: pdf

Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman

Abstract: The distributional principle according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1] has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as systematicity.

1 Motivation

Models dealing with the acquisition of syntactic knowledge are sharply divided into two classes, depending on whether they subscribe to some variant of the classical generative theory of syntax, or operate within the framework of “general-purpose” statistical or distributional learning. An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by the Government and Binding theory. An example of the latter model is Elman’s widely used Simple Recursive Network (SRN) [3]. We believe that polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common powerful framework. Indeed, on the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation, probabilistic learning, and “connectionist” modeling. Yet, generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel in just those tasks, yet are brittle and difficult to train. We present a scheme that acquires “raw” syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1].

2 The ADIOS model

The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. The PA algorithm aims to detect patterns — repetitive sequences of “significant” strings of primitives occurring in the corpus (Figure 1). In that, it is related to prior work on alignment-based learning [4] and regular expression (“local grammar”) extraction [5] from corpora. We stress, however, that our algorithm requires no pre-judging either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion. In the initial phase of the PA algorithm the text is segmented down to the smallest possible morphological constituents (e.g., ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds).¹ This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph. A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number.
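To make the RDS construction above concrete, here is a minimal sketch (not the authors' implementation) of loading a pre-segmented corpus into such a multigraph; the plain-dict data structure, the toy token segmentation, and the starting sentence number are illustrative assumptions.

```python
from collections import defaultdict

def build_rds_graph(sentences, first_sentence_num=101):
    """Build a toy RDS (multi-)graph: vertices are unique constituents, and each
    directed edge carries a (sentence number, within-sentence index) label, so a
    corpus sentence remains recoverable as the path of edges sharing its number."""
    vertices = set()
    edges = defaultdict(list)  # (u, v) -> list of (sentence_num, within_sentence_index)
    for sent_num, tokens in enumerate(sentences, start=first_sentence_num):
        path = ["BEGIN"] + list(tokens) + ["END"]
        vertices.update(path)
        for idx in range(len(path) - 1):
            edges[(path[idx], path[idx + 1])].append((sent_num, idx + 1))
    return vertices, edges

corpus = [
    ["the", "cat", "is", "eat", "-ing"],
    ["do", "you", "see", "the", "cat", "?"],
]
V, E = build_rds_graph(corpus)
print(len(V), "vertices;", sum(len(labels) for labels in E.values()), "labeled edges")
```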
Figure 1: (a) Two sequences mi, mj, ml and mi, mk, ml form a pattern ci{j,k}l = mi, {mj, mk}, ml, which allows mj and mk to be attributed to the same equivalence class, following the principle of complementary distributions [1]. Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. 1). (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of equivalence classes associated with cu and cv.

In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant Patterns (sequences of constituents) (SPs), which are then used to modify the graph (Algorithm 1). For each path pi, the algorithm constructs a list of candidate constituents, ci1, ..., cik. Each of these consists of a “prefix” (a sequence of graph edges), an equivalence class of vertices, and a “suffix” (another sequence of edges; cf. Figure 2(b)). The criterion I for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c1, ..., ck must have high mutual information):

I(c1, c2, ..., ck) = e^(-(L/k)^2) P(c1, c2, ..., ck) log [ P(c1, c2, ..., ck) / (Π_{j=1..k} P(cj)) ]     (1)

where L is the typical context length and k is the length of the candidate pattern; the probabilities associated with a cj are estimated from frequencies that are immediately available in the graph (e.g., the out-degree of a node is related to the marginal probability of the corresponding cj). Equation 1 balances two opposing “forces” in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context. A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2).

Algorithm 1 PA (pattern acquisition), phase 2
1: while patterns exist do
2:   for all path ∈ graph do {path = sentence; graph = corpus}
3:     for all source node ∈ path do
4:       for all sink node ∈ path do {source and sink can be equivalence classes}
5:         degree of separation = path index(sink) − path index(source);
6:         pattern table ⇐ detect patterns(source, sink, degree of separation, equivalence table);
7:       end for
8:     end for
9:     winner ⇐ get most significant pattern(pattern table);
10:    equivalence table ⇐ detect equivalences(graph, winner);
11:    graph ⇐ rewire graph(graph, winner);
12:   end for
13: end while

¹ We remark that the algorithm can work in any language, with any set of tokens, including individual characters – or phonemes, if applied to speech.
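As a concreteness check on Eq. 1, the sketch below scores one candidate pattern from raw corpus counts; estimating the probabilities from n-gram frequencies over token lists (rather than reading them off the RDS graph) and the helper names are simplifying assumptions, not the authors' code.

```python
import math
from collections import Counter

def pattern_significance(candidate, paths, L=4.0):
    """Eq. 1: I = exp(-(L/k)^2) * P(c1..ck) * log(P(c1..ck) / prod_j P(cj)),
    with probabilities estimated from raw frequencies in the corpus paths."""
    k = len(candidate)
    unigrams = Counter(tok for path in paths for tok in path)
    total = sum(unigrams.values())
    # occurrences of the candidate as a contiguous subsequence
    joint = sum(1 for path in paths
                for i in range(len(path) - k + 1)
                if tuple(path[i:i + k]) == tuple(candidate))
    if joint == 0:
        return 0.0
    windows = sum(max(len(path) - k + 1, 0) for path in paths)
    p_joint = joint / windows
    p_indep = math.prod(unigrams[c] / total for c in candidate)
    return math.exp(-(L / k) ** 2) * p_joint * math.log(p_joint / p_indep)

paths = [["the", "cat", "is", "eat", "-ing"],
         ["the", "cat", "is", "play", "-ing"],
         ["the", "dog", "is", "stay", "-ing"]]
print(pattern_significance(("the", "cat", "is"), paths, L=3.0))
```

Longer candidates are penalized less by the exp(-(L/k)^2) term, while the log-ratio rewards sequences that occur far more often than their constituents' independent frequencies would predict.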
Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are untouched. This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power.

Figure 2: (a) A small portion of the RDS graph for a simple corpus (e.g., 101: the cat is eating; 102: do you see the cat?; 103: are you sure?), with sentence #101 (the cat is eat -ing) indicated by solid arcs. (b) This sentence joins a pattern the cat is {eat, play, stay} -ing, in which two others (#109, #121) already participate. (c) The abstracted pattern (PATTERN 230: the cat is {eat, play, stay} -ing) and the equivalence class associated with it (Equivalence Class 230: stay, eat, play); edges that belong to sequences not subsumed by this pattern, e.g., #131, are untouched. (d) The identification of new significant patterns (e.g., PATTERN 231: BEGIN {they, we} {230} here) is done using the acquired equivalence classes (e.g., #230). In this manner, the system “bootstraps” itself, recursively distilling more and more complex patterns.

During the pass over the corpus the list of equivalence sets is updated continuously; the identification of new significant patterns is done using the current equivalence sets (Figure 3(d)). Thus, as the algorithm processes more and more text, it “bootstraps” itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. The recursive nature of this process enables the algorithm to form more and more complex patterns, in a hierarchical manner. The relationships among these can be visualized recursively in a tree format, with tree depth corresponding to the level of recursion (e.g., Figure 3(c)). The PA algorithm halts if it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop).

Generalization. A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [6], p. 63: “the grammar of a language is simply an inventory of linguistic units.” The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). Generalization can also ensue from partial activation of existing patterns by novel inputs. This function is supported by the input module, designed to process a novel sentence by forming its distributed representation in terms of activities of existing patterns (Figure 6). These are computed by propagating activation from bottom (the terminals) to top (the patterns) of the RDS. The initial activities wj of the terminals cj are calculated given the novel input s1, ..., sk as follows:

wj = max_{m=1..k} { I(sm, cj) }     (2)

where I(sm, cj) is the mutual information between sm and cj. For an equivalence class, the value propagated upwards is the strongest non-zero activation of its members; for a pattern, it is the average weight of the children nodes, on the condition that all the children were activated by adjacent inputs. Activity propagation continues until it reaches the top nodes of the pattern lattice. When the algorithm encounters a novel word, all the members of the terminal equivalence class contribute a value of , which is then propagated upwards as usual. This enables the model to make an educated guess as to the meaning of the unfamiliar word, by considering the patterns that become active (Figure 6(b)).
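To illustrate the bottom-up activation pass just described (Eq. 2 plus the max/average combination rules), here is a minimal sketch over a toy pattern lattice; the node encoding, the mutual-information table, and the simplification of the “adjacent inputs” condition to “all children active” are assumptions for illustration, not the paper's implementation.

```python
# hypothetical mutual-information table I(input_token, terminal)
MI = {("stay", "stay"): 1.0, ("stay", "liv"): 0.8, ("Joe", "Joe"): 1.0}

def terminal_activity(terminal, input_tokens):
    """Eq. 2: a terminal's activity is its best mutual information with any input token."""
    return max((MI.get((s, terminal), 0.0) for s in input_tokens), default=0.0)

def node_activity(node, input_tokens):
    if node["kind"] == "terminal":
        return terminal_activity(node["label"], input_tokens)
    child_acts = [node_activity(c, input_tokens) for c in node["children"]]
    if node["kind"] == "equivalence":   # strongest (non-zero) member wins
        return max(child_acts)
    if node["kind"] == "pattern":       # average of children, only if all were activated
        return sum(child_acts) / len(child_acts) if all(a > 0 for a in child_acts) else 0.0
    raise ValueError(node["kind"])

eq_class = {"kind": "equivalence", "children": [
    {"kind": "terminal", "label": "liv"},
    {"kind": "terminal", "label": "stay"},
]}
pattern = {"kind": "pattern", "children": [
    {"kind": "terminal", "label": "Joe"},
    eq_class,
]}
print(node_activity(pattern, ["Joe", "stay"]))  # 1.0 (both children fully active)
```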
3 Results

We now briefly describe the results of several studies designed to evaluate the viability of the ADIOS model, in which it was exposed to corpora of varying size and complexity.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The distributional principle according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1] has been advanced as a means for extracting syntactic structures from corpus data. [sent-5, score-0.436]

2 We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. [sent-6, score-0.039]

3 The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. [sent-7, score-0.176]

4 An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by the Government and Binding theory. [sent-10, score-0.194]

5 We believe that polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common powerful framework. [sent-12, score-0.065]

6 Yet, generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. [sent-14, score-0.055]

7 We present a scheme that acquires “raw” syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. [sent-16, score-0.233]

8 Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1]. [sent-17, score-0.3]

9 2 The ADIOS model The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. [sent-18, score-0.394]

10 The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. [sent-19, score-0.034]

11 The PA algorithm aims to detect patterns — repetitive sequences of “significant” strings of primitives occurring in the corpus (Figure 1). [sent-20, score-0.335]

12 We stress, however, that our algorithm requires no pre-judging either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion. [sent-22, score-0.359]

13 In the initial phase of the PA algorithm the text is segmented down to the smallest possible morphological constituents (e. [sent-23, score-0.148]

14 , ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds). [sent-25, score-0.049]

15 1 This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph. [sent-26, score-0.148]

16 A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. [sent-27, score-0.418]

17 Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number. [sent-28, score-0.632]

18 Figure 1: (a) Two sequences mi , mj , ml and mi , mk , ml form a pattern ci{j,k}l = mi , {mj , mk }, ml , which allows mj and mk to be attributed to the same equivalence class, following the principle of complementary distributions [1]. [sent-33, score-0.957]

19 Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. [sent-34, score-0.42]

20 (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. [sent-36, score-0.675]

21 Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of equivalence classes associated with cu and cv . [sent-37, score-0.323]

22 In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant P atterns (sequences of constituents) ( SP), which are then used to modify the graph (Algorithm 1). [sent-38, score-0.166]

23 For each path pi , the algorithm constructs a list of candidate constituents, ci1 , . [sent-39, score-0.055]

24 Each of these consists of a “prefix” (sequence of graph edges), an equivalence class of vertices, and a “suffix” (another sequence of edges; cf. [sent-43, score-0.337]

25 The criterion I for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c1 , . [sent-45, score-0.363]

26 , ck must have high mutual information): I (c1 , c2 , . [sent-48, score-0.104]

27 , the out-degree of a node is related to the marginal probability of the corresponding cj ). [sent-60, score-0.088]

28 Equation 1 balances two opposing “forces” in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. [sent-61, score-0.126]

29 On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context. [sent-62, score-0.304]

30 A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2). [sent-63, score-0.283]

31 Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are untouched. [sent-64, score-0.361]

32 This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power. [sent-65, score-0.089]

33 During the pass over the corpus the list of equivalence sets is updated continuously; the identification of new significant patterns is done using the current equivalence sets (Figure 3(d)). [sent-66, score-0.843]

34 Thus, as the algorithm processes more and more text, it “bootstraps” itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. [sent-67, score-0.337]

35 The relationships among these can be visualized recursively in a tree format, with tree depth corresponding to the level of recursion (e. [sent-69, score-0.118]

36 The PA algorithm halts if it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop). [sent-72, score-0.391]

37 A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [sent-74, score-0.498]

38 63: “the grammar of a language is simply an inventory of linguistic units. [sent-76, score-0.279]

39 ” The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. [sent-77, score-0.193]

40 Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). [sent-78, score-0.655]

41 Generalization can also ensue from partial activation of existing patterns by novel inputs. [sent-79, score-0.259]

42 This function is supported by the input module, designed to process a novel sentence by forming its distributed representation in terms of activities of existing patterns (Figure 6). [sent-80, score-0.434]

43 These are computed by propagating activation from bottom (the terminals) to top (the patterns) of the RDS. [sent-81, score-0.06]

44 The initial activities wj of the terminals cj are calculated given the novel input s1 , . [sent-82, score-0.135]

45 (b) This sentence joins a pattern the cat is {eat, play, stay} -ing, in which two others (#109,121) already participate. [sent-90, score-0.428]

46 (c) The abstracted pattern, and the equivalence class associated with it (edges that belong to sequences not subsumed by this pattern, e. [sent-91, score-0.291]

47 (d) The identification of new significant patterns is done using the acquired equivalence classes (e. [sent-94, score-0.442]

48 In this manner, the system “bootstraps” itself, recursively distilling more and more complex patterns. [sent-97, score-0.05]

49 where I(sk , cj ) is the mutual information between sk and cj . [sent-98, score-0.259]

50 For an equivalence class, the value propagated upwards is the strongest non-zero activation of its members; for a pattern, it is the average weight of the children nodes, on the condition that all the children were activated by adjacent inputs. [sent-99, score-0.395]

51 Activity propagation continues until it reaches the top nodes of the pattern lattice. [sent-100, score-0.089]

52 When the algorithm encounters a novel word, all the members of the terminal equivalence class contribute a value of , which is then propagated upwards as usual. [sent-101, score-0.424]

53 This enables the model to make an educated guess as to the meaning of the unfamiliar word, by considering the patterns that become active (Figure 6(b)). [sent-102, score-0.152]

54 (a) propnoun: "Joe" | "Beth" | "Jim" | "Cindy" | "Pam" | "George"; (b) the horse is living very extremely far away. [sent-104, score-0.097]

55 BEGIN article "The" | "A" article "The" noun: "cat" | "dog" | "cow" | "bird" | "rabbit" | "horse" noun: "cats" | "dogs" | "cows" | "birds" | "rabbits" | "horses" the cow is working at least until Thursday. [sent-105, score-0.059]

56 emphasize: very | extremely | really; verb: working | living | playing. Figure 3: (a) A part of a simple grammar. [sent-115, score-0.814]

57 (c) The structure of a sample sentence (pattern #144), presented in the form of a tree that captures the hierarchical relationships among constituents. [sent-117, score-0.269]

58 Figure 3 shows an example of a sentence from a corpus produced by a simple artificial grammar and its ADIOS analysis (the use of a simple grammar, constructed with Rmutt, http://www. [sent-120, score-0.581]

59 The abstract representation of the sample sentence in Figure 3(c) looks very much like a parse tree, indicating that our method successfully identified the grammatical structure used to generate its data. [sent-123, score-0.235]

60 To illustrate the gradual emergence of our model’s ability for such concise representation of syntactic structures, we show in Figure 4, top, four trees built for the same sentence after exposing the model to progressively more data from the same corpus. [sent-124, score-0.409]

61 Note that both the number of distinct patterns and the average number of patterns per sentence asymptote for this corpus after exposure to about 500 sentences (Figure 4, bottom). [sent-125, score-0.841]

62 We have assessed the systematicity of the ADIOS model by splitting the corpus generated by the grammar of Figure 3 into training and test sets. [sent-128, score-0.421]

63 After training the model on the former, we examined the representations of unseen sentences from the test set. [sent-129, score-0.121]

64 A typical result appears in Figure 5; the general finding was of Level 3 systematicity according to the nomenclature of [7]. [sent-130, score-0.075]

65 The ADIOS system’s input module allows it to process a novel sentence by forming its distributed representation in terms of activities of existing patterns. [sent-132, score-0.347]

66 Figure 6 shows the activation of two patterns (#141 and #120) by a phrase that contains a word in a novel context (stay), as well as another word never before encountered in any context (5pm). [sent-133, score-0.411]

67 Figure 4: Top: the build-up of structured information with progressive exposure to a corpus generated by the simple grammar of Figure 3 (x-axis: number of sentences in the corpus). [sent-143, score-0.615]

68 Bottom: the total number of detected patterns and the average number of patterns in a sentence, plotted vs. the number of sentences in the corpus. [sent-148, score-0.593]

69 The identity between the structures detected in (a) and (b) is a manifestation of Level-3 systematicity of the ADIOS model (“Novel Constituent: the test set contains at least one atomic constituent that did not appear anywhere in the training set”; see [7], pp. [sent-153, score-0.228]

70 Figure 6: the input module in action (the two most relevant – highly active – patterns responding to the input Joe and Beth are staying until 5pm). [sent-170, score-0.254]

71 Leaf activation is proportional to the mutual information between inputs and various members of the equivalence classes (e. [sent-171, score-0.431]

72 8 is the mutual information between stay and liv, which is a member of equivalence class #112). [sent-174, score-0.515]

73 It is then propagated upwards by taking the average at each junction. [sent-175, score-0.081]

74 Working with real data: the CHILDES corpus. [sent-176, score-0.171]

75 The corpus we selected contained 9665 sentences (74500 words) produced 1829 by parents. [sent-178, score-0.263]

76 The results, one of which is shown in Figure 7, were encouraging: the algorithm found intuitively significant SPs and produced semantically adequate corresponding equivalence sets. [sent-179, score-0.254]

77 Altogether, 1062 patterns and 775 equivalence classes were established. [sent-180, score-0.442]

78 Representing the corpus in terms of these constituents resulted in a significant compression: the average number of constituents per sentence dropped from 6. [sent-181, score-0.714]

79 1960 CHILDES_2764 : they don ’t want ta go for a ride ? [sent-186, score-0.111]

80 1959 CHILDES_2642 : CHILDES_2504 : 1912 1629 1739 can we make a little house ? [sent-188, score-0.041]

81 1407 1656 1913 CHILDES_1038 : where ’d the what go ? [sent-192, score-0.033]

82 BEGIN where ' s Becky Brennen Eric Miffy mommy that the the big biggest blue different easy little littlest next right round square white other yellow green orange yellow chicken one room side way ? [sent-193, score-0.125]

83 Figure 7: Left: a typical pattern extracted from a subset of the CHILDES corpora collection [8]. [sent-197, score-0.089]

84 Hundreds of such patterns and equivalence classes (underscored in this figure) together constitute a concise representation of the raw data. [sent-198, score-0.474]

85 Some of the phrases that can be described/generated by pattern 1960 are: where’s the big room? [sent-199, score-0.13]

86 Right: some of the phrases generated by ADIOS (lower lines in each pair) using sentences from CHILDES (upper lines) as examples. [sent-204, score-0.121]

87 The generation module works by traversing the top-level pattern tree, stringing together lower-level patterns and selecting randomly one member from each equivalence class (a minimal sketch of this procedure appears after this list). [sent-205, score-0.56]

88 Extensive testing (currently under way) is needed to determine whether the grammaticality of the newly generated phrases (which is at present less than ideal, as can be seen here) improves with more training data. [sent-206, score-0.041]

89 Although our pattern-based representations may look like collections of finite automata, the information they contain is much richer, because of the recursive invocation of one pattern by another, and because of the context sensitivity implied by relationships among patterns. [sent-208, score-0.129]

90 The sensitivity to context of pattern abstraction (during learning) and use (during generation) contributes greatly both to the conciseness of the ADIOS representation and to the conservative nature of its generative behavior. [sent-209, score-0.184]

91 This context sensitivity — in particular, the manner whereby ADIOS balances syntagmatic and paradigmatic cues provided by the data — is mainly what distinguishes it from other current work on unsupervised probabilistic learning of syntax, such as [9, 10, 4]. [sent-210, score-0.111]

92 In summary, finding a good set of structured units leads to the emergence of a convergent representation of language, which eventually changes less and less with progressive exposure to more data. [sent-211, score-0.086]

93 The power of the constituent graph representation stems from the interacting ensembles of patterns and equivalence classes that comprise it. [sent-212, score-0.572]

94 Together, the local patterns create global complexity and impose long-range order on the linguistic structures they encode. [sent-213, score-0.251]

95 Some of the challenges implicit in this approach that we leave for future work are (1) interpreting the syntactic structures found by ADIOS in the context of contemporary theories of syntax, and (2) relating those structures to semantics. [sent-214, score-0.286]

96 Simulating the child’s acquisition of the lexicon and syntax - experiences with Babel. [sent-224, score-0.133]

97 Comparing two unsupervised grammar induction systems: Alignment-based learning vs. [sent-234, score-0.197]
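The generation procedure mentioned in sentence 87 above (traversing the top-level pattern tree and sampling one member from each equivalence class) can be sketched as follows; the tree encoding and the uniform random choice are illustrative assumptions, not the authors' code.

```python
import random

# A pattern is an ordered sequence of children; an equivalence class offers
# alternatives. Both are encoded as nested dicts; terminals are plain strings.
PATTERN_231 = {"kind": "pattern", "children": [
    {"kind": "equivalence", "members": ["they", "we"]},
    {"kind": "pattern", "children": [                      # pattern #230
        "the", "cat", "is",
        {"kind": "equivalence", "members": ["eat", "play", "stay"]},
        "-ing",
    ]},
    "here",
]}

def generate(node, rng=random):
    """Traverse a pattern tree top-down, expanding sub-patterns in order and
    picking one member at random from each equivalence class."""
    if isinstance(node, str):                 # terminal
        return [node]
    if node["kind"] == "equivalence":
        return generate(rng.choice(node["members"]), rng)
    if node["kind"] == "pattern":
        out = []
        for child in node["children"]:
            out.extend(generate(child, rng))
        return out
    raise ValueError(node["kind"])

print(" ".join(generate(PATTERN_231)))   # e.g. "they the cat is stay -ing here"
```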


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('equivalence', 0.254), ('sentence', 0.235), ('adios', 0.224), ('stay', 0.222), ('beth', 0.205), ('corpus', 0.183), ('liv', 0.168), ('grammar', 0.163), ('george', 0.162), ('patterns', 0.152), ('thursday', 0.149), ('constituents', 0.148), ('syntactic', 0.142), ('play', 0.131), ('celebrat', 0.131), ('cindy', 0.131), ('joe', 0.131), ('pam', 0.131), ('rds', 0.131), ('sunday', 0.131), ('wednesday', 0.131), ('eat', 0.13), ('jim', 0.118), ('friday', 0.112), ('monday', 0.112), ('tuesday', 0.112), ('ing', 0.105), ('cat', 0.104), ('saturday', 0.097), ('horse', 0.097), ('begin', 0.093), ('pattern', 0.089), ('cj', 0.088), ('graph', 0.083), ('sentences', 0.08), ('childes', 0.075), ('mk', 0.075), ('sink', 0.075), ('systematicity', 0.075), ('tomorrow', 0.075), ('language', 0.069), ('acquisition', 0.068), ('ck', 0.065), ('syntax', 0.065), ('module', 0.065), ('activation', 0.06), ('cow', 0.059), ('distributional', 0.059), ('end', 0.059), ('playing', 0.057), ('bird', 0.056), ('dog', 0.056), ('rabbit', 0.056), ('abstraction', 0.055), ('mj', 0.055), ('path', 0.055), ('detected', 0.054), ('structures', 0.052), ('recursively', 0.05), ('bed', 0.049), ('mi', 0.048), ('constituent', 0.047), ('linguistic', 0.047), ('pa', 0.047), ('structured', 0.047), ('novel', 0.047), ('edges', 0.046), ('ml', 0.045), ('yellow', 0.044), ('upwards', 0.044), ('sk', 0.044), ('members', 0.042), ('house', 0.041), ('phrases', 0.041), ('ta', 0.041), ('winner', 0.041), ('unseen', 0.041), ('context', 0.04), ('mutual', 0.039), ('exposure', 0.039), ('aviv', 0.037), ('becky', 0.037), ('bootstraps', 0.037), ('cohesiveness', 0.037), ('mommy', 0.037), ('ride', 0.037), ('sps', 0.037), ('staying', 0.037), ('subsumed', 0.037), ('syntagmatic', 0.037), ('propagated', 0.037), ('classes', 0.036), ('word', 0.036), ('tree', 0.034), ('cant', 0.034), ('unsupervised', 0.034), ('paths', 0.033), ('go', 0.033), ('concise', 0.032), ('distillation', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures

Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman


2 0.18017653 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

Author: Willem H. Zuidema

Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -

3 0.074506283 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

Author: Alexei Vinokourov, Nello Cristianini, John Shawe-Taylor

Abstract: The problem of learning a semantic representation of a text document from data is addressed, in the situation where a corpus of unlabeled paired documents is available, each pair being formed by a short English document and its French translation. This representation can then be used for any retrieval, categorization or clustering task, both in a standard and in a cross-lingual setting. By using kernel functions, in this case simple bag-of-words inner products, each part of the corpus is mapped to a high-dimensional space. The correlations between the two spaces are then learnt by using kernel Canonical Correlation Analysis. A set of directions is found in the first and in the second space that are maximally correlated. Since we assume the two representations are completely independent apart from the semantic content, any correlation between them should reflect some semantic similarity. Certain patterns of English words that relate to a specific meaning should correlate with certain patterns of French words corresponding to the same meaning, across the corpus. Using the semantic representation obtained in this way we first demonstrate that the correlations detected between the two versions of the corpus are significantly higher than random, and hence that a representation based on such features does capture statistical patterns that should reflect semantic information. Then we use such representation both in cross-language and in single-language retrieval tasks, observing performance that is consistently and significantly superior to LSI on the same data.

4 0.072548024 163 nips-2002-Prediction and Semantic Association

Author: Thomas L. Griffiths, Mark Steyvers

Abstract: We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language. 1

5 0.06561061 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

Author: Dan Klein, Christopher D. Manning

Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.

6 0.057704393 115 nips-2002-Informed Projections

7 0.05331726 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

8 0.050107766 176 nips-2002-Replay, Repair and Consolidation

9 0.049069371 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

10 0.048716448 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology

11 0.046487458 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces

12 0.04613604 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion

13 0.044797577 142 nips-2002-Maximum Likelihood and the Information Bottleneck

14 0.0442666 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

15 0.044190235 40 nips-2002-Bayesian Models of Inductive Generalization

16 0.043554913 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design

17 0.041203666 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

18 0.040498473 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

19 0.039442316 83 nips-2002-Extracting Relevant Structures with Side Information

20 0.039390862 10 nips-2002-A Model for Learning Variance Components of Natural Images


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.122), (1, -0.01), (2, -0.0), (3, 0.005), (4, -0.091), (5, 0.063), (6, -0.019), (7, -0.076), (8, 0.016), (9, -0.122), (10, -0.054), (11, 0.021), (12, 0.037), (13, -0.022), (14, -0.096), (15, -0.036), (16, 0.08), (17, 0.05), (18, 0.037), (19, -0.141), (20, -0.051), (21, 0.083), (22, -0.077), (23, -0.076), (24, 0.001), (25, -0.121), (26, 0.0), (27, -0.097), (28, 0.137), (29, -0.007), (30, 0.072), (31, -0.045), (32, 0.029), (33, 0.041), (34, 0.064), (35, 0.16), (36, 0.156), (37, 0.079), (38, -0.08), (39, -0.109), (40, 0.065), (41, 0.046), (42, 0.01), (43, -0.126), (44, -0.059), (45, -0.012), (46, -0.135), (47, -0.01), (48, 0.11), (49, -0.111)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95634449 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures

Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman

Abstract: The distributional principle according to which morphemes that occur in identical contexts belong, in some sense, to the same category [1] has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as systematicity. 1 Motivation Models dealing with the acquisition of syntactic knowledge are sharply divided into two classes, depending on whether they subscribe to some variant of the classical generative theory of syntax, or operate within the framework of “general-purpose” statistical or distributional learning. An example of the former is the model of [2], which attempts to learn syntactic structures such as Functional Category, as stipulated by the Government and Binding theory. An example of the latter model is Elman’s widely used Simple Recursive Network (SRN) [3]. We believe that polarization between statistical and classical (generative, rule-based) approaches to syntax is counterproductive, because it hampers the integration of the stronger aspects of each method into a common powerful framework. Indeed, on the one hand, the statistical approach is geared to take advantage of the considerable progress made to date in the areas of distributed representation, probabilistic learning, and “connectionist” modeling. Yet, generic connectionist architectures are ill-suited to the abstraction and processing of symbolic information. On the other hand, classical rule-based systems excel in just those tasks, yet are brittle and difficult to train. We present a scheme that acquires “raw” syntactic information construed in a distributional sense, yet also supports the distillation of rule-like regularities out of the accrued statistical knowledge. Our research is motivated by linguistic theories that postulate syntactic structures (and transformations) rooted in distributional data, as exemplified by the work of Zellig Harris [1]. 2 The ADIOS model The ADIOS (Automatic DIstillation Of Structure) model constructs syntactic representations of a sample of language from unlabeled corpus data. The model consists of two elements: (1) a Representational Data Structure (RDS) graph, and (2) a Pattern Acquisition (PA) algorithm that learns the RDS in an unsupervised fashion. The PA algorithm aims to detect patterns — repetitive sequences of “significant” strings of primitives occurring in the corpus (Figure 1). In that, it is related to prior work on alignment-based learning [4] and regular expression (“local grammar”) extraction [5] from corpora. We stress, however, that our algorithm requires no pre-judging either of the scope of the primitives or of their classification, say, into syntactic categories: all the information needed for its operation is extracted from the corpus in an unsupervised fashion. In the initial phase of the PA algorithm the text is segmented down to the smallest possible morphological constituents (e.g., ed is split off both walked and bed; the algorithm later discovers that bed should be left whole, on statistical grounds).1 This initial set of unique constituents is the vertex set of the newly formed RDS (multi-)graph. 
A directed edge is inserted between two vertices whenever the corresponding transition exists in the corpus (Figure 2(a)); the edge is labeled by the sentence number and by its within-sentence index. Thus, corpus sentences initially correspond to paths in the graph, a path being a sequence of edges that share the same sentence number. (a) mh mi mk mj (b) ci{j,k}l ml mn mi ck ... cj ml cu cv . Figure 1: (a) Two sequences mi , mj , ml and mi , mk , ml form a pattern ci{j,k}l = mi , {mj , mk }, ml , which allows mj and mk to be attributed to the same equivalence class, following the principle of complementary distributions [1]. Both the length of the shared context and the cohesiveness of the equivalence class need to be taken into account in estimating the goodness of the candidate pattern (see eq. 1). (b) Patterns can serve as constituents in their own right; recursively abstracting patterns from a corpus allows us to capture the syntactic regularities concisely, yet expressively. Abstraction also supports generalization: in this schematic illustration, two new paths (dashed lines) emerge from the formation of equivalence classes associated with cu and cv . In the second phase, the PA algorithm repeatedly scans the RDS graph for Significant P atterns (sequences of constituents) ( SP), which are then used to modify the graph (Algorithm 1). For each path pi , the algorithm constructs a list of candidate constituents, ci1 , . . . , cik . Each of these consists of a “prefix” (sequence of graph edges), an equivalence class of vertices, and a “suffix” (another sequence of edges; cf. Figure 2(b)). The criterion I for judging pattern significance combines a syntagmatic consideration (the pattern must be long enough) with a paradigmatic one (its constituents c1 , . . . , ck must have high mutual information): I (c1 , c2 , . . . , ck ) = 2 e−(L/k) P (c1 , c2 , . . . , ck ) log P (c1 , c2 , . . . , ck ) Πk P (cj ) j=1 (1) where L is the typical context length and k is the length of the candidate pattern; the probabilities associated with a cj are estimated from frequencies that are immediately available 1 We remark that the algorithm can work in any language, with any set of tokens, including individual characters – or phonemes, if applied to speech. Algorithm 1 PA (pattern acquisition), phase 2 1: while patterns exist do 2: for all path ∈ graph do {path=sentence; graph=corpus} 3: for all source node ∈ path do 4: for all sink node ∈ path do {source and sink can be equivalence classes} 5: degree of separation = path index(sink) − path index(source); 6: pattern table ⇐ detect patterns(source, sink, degree of separation, equivalence table); 7: end for 8: end for 9: winner ⇐ get most significant pattern(pattern table); 10: equivalence table ⇐ detect equivalences(graph, winner); 11: graph ⇐ rewire graph(graph, winner); 12: end for 13: end while in the graph (e.g., the out-degree of a node is related to the marginal probability of the corresponding cj ). Equation 1 balances two opposing “forces” in pattern formation: (1) the length of the pattern, and (2) the number and the cohesiveness of the set of examples that support it. On the one hand, shorter patterns are likely to be supported by more examples; on the other hand, they are also more likely to lead to over-generalization, because shorter patterns mean less context. A pattern tagged as significant is added as a new vertex to the RDS graph, replacing the constituents and edges it subsumes (Figure 2). 
Note that only those edges of the multigraph that belong to the detected pattern are rewired; edges that belong to sequences not subsumed by the pattern are untouched. This highly context-sensitive approach to pattern abstraction, which is unique to our model, allows ADIOS to achieve a high degree of representational parsimony without sacrificing generalization power. During the pass over the corpus the list of equivalence sets is updated continuously; the identification of new significant patterns is done using thecurrent equivalence sets (Figure 3(d)). Thus, as the algorithm processes more and more text, it “bootstraps” itself and enriches the RDS graph structure with new SPs and their accompanying equivalence sets. The recursive nature of this process enables the algorithm to form more and more complex patterns, in a hierarchical manner. The relationships among these can be visualized recursively in a tree format, with tree depth corresponding to the level of recursion (e.g., Figure 3(c)). The PA algorithm halts if it processes a given amount of text without finding a new SP or equivalence set (in real-life language acquisition this process may never stop). Generalization. A collection of patterns distilled from a corpus can be seen as an empirical grammar of sorts; cf. [6], p.63: “the grammar of a language is simply an inventory of linguistic units.” The patterns can eventually become highly abstract, thus endowing the model with an ability to generalize to unseen inputs. Generalization is possible, for example, when two equivalence classes are placed next to each other in a pattern, creating new paths among the members of the equivalence classes (dashed lines in Figure 1(b)). Generalization can also ensue from partial activation of existing patterns by novel inputs. This function is supported by the input module, designed to process a novel sentence by forming its distributed representation in terms of activities of existing patterns (Figure 6). These are computed by propagating activation from bottom (the terminals) to top (the patterns) of the RDS. The initial activities wj of the terminals cj are calculated given the novel input s1 , . . . , sk as follows: wj = max {I(sk , cj )} m=1..k (2) 102: do you see the cat? 101: the cat is eating 103: are you sure? Sentence Number Within-Sentence Index 101_1 101_4 101_3 101_2 101_5 101_6 END her ing show eat play is cat Pam the BEGIN (a) 131_3 131_2 109_7 END ing 121_12 stay 121_10 play 121_8 101_6 109_6 cat the BEGIN 109_5 121_9 eat 109_4 (b) 109_9 101_5 109_8 101_4 101_3 101_2 is 131_1 101_1 121_13 121_11 131_1 131_3 101_1 109_4 PATTERN 230: the cat is {eat, play, stay} -ing 165_1 Equivalence Class 230: stay, eat, play 165_2 221_3 here stay play 171_3 165_3 eat 221_1 we 171_2 they BEGIN (d) END 101_2 109_5 121_9 121_8 171_1 ing stay 131_2 play eat is cat the BEGIN (c) PATTERN 231: BEGIN {they, we} {230} here 221_2 Figure 2: (a) A small portion of the RDS graph for a simple corpus, with sentence #101 (the cat is eat -ing) indicated by solid arcs. (b) This sentence joins a pattern the cat is {eat, play, stay} -ing, in which two others (#109,121) already participate. (c) The abstracted pattern, and the equivalence class associated with it (edges that belong to sequences not subsumed by this pattern, e.g., #131, are untouched). (d) The identification of new significant patterns is done using the acquired equivalence classes (e.g., #230). In this manner, the system “bootstraps” itself, recursively distilling more and more complex patterns. 
3 Results

We now briefly describe the results of several studies designed to evaluate the viability of the ADIOS model, in which it was exposed to corpora of varying size and complexity.

2 0.86219352 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

Author: Willem H. Zuidema

Abstract: Language acquisition is a special kind of learning problem because the outcome of learning of one generation is the input for the next. That makes it possible for languages to adapt to the particularities of the learner. In this paper, I show that this type of language change has important consequences for models of the evolution and acquisition of syntax. 1 The Language Acquisition Problem For both artificial systems and non-human animals, learning the syntax of natural languages is a notoriously hard problem. All healthy human infants, in contrast, learn any of the approximately 6000 human languages rapidly, accurately and spontaneously. Any explanation of how they accomplish this difficult task must specify the (innate) inductive bias that human infants bring to bear, and the input data that is available to them. Traditionally, the inductive bias is termed - somewhat unfortunately -

3 0.60479587 15 nips-2002-A Probabilistic Model for Learning Concatenative Morphology

Author: Matthew G. Snover, Michael R. Brent

Abstract: This paper describes a system for the unsupervised learning of morphological suffixes and stems from word lists. The system is composed of a generative probability model and hill-climbing and directed search algorithms. By extracting and examining morphologically rich subsets of an input lexicon, the directed search identifies highly productive paradigms. The hill-climbing algorithm then further maximizes the probability of the hypothesis. Quantitative results are shown by measuring the accuracy of the morphological relations identified. Experiments in English and Polish, as well as comparisons with another recent unsupervised morphology learning algorithm demonstrate the effectiveness of this technique.

4 0.57880884 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing

Author: Dan Klein, Christopher D. Manning

Abstract: We present a novel generative model for natural language tree structures in which semantic (lexical dependency) and syntactic (PCFG) structures are scored with separate models. This factorization provides conceptual simplicity, straightforward opportunities for separately improving the component models, and a level of performance comparable to similar, non-factored models. Most importantly, unlike other modern parsing models, the factored model admits an extremely effective A* parsing algorithm, which enables efficient, exact inference.

5 0.41621563 85 nips-2002-Fast Kernels for String and Tree Matching

Author: Alex J. Smola, S.v.n. Vishwanathan

Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.

6 0.38903964 163 nips-2002-Prediction and Semantic Association

7 0.37541199 42 nips-2002-Bias-Optimal Incremental Problem Solving

8 0.33655047 112 nips-2002-Inferring a Semantic Representation of Text via Cross-Language Correlation Analysis

9 0.27042854 176 nips-2002-Replay, Repair and Consolidation

10 0.26261008 167 nips-2002-Rational Kernels

11 0.26131415 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

12 0.25685483 40 nips-2002-Bayesian Models of Inductive Generalization

13 0.25666976 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

14 0.25376043 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion

15 0.24815854 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

16 0.2396711 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond

17 0.23315293 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction

18 0.23231015 150 nips-2002-Multiple Cause Vector Quantization

19 0.23101769 11 nips-2002-A Model for Real-Time Computation in Generic Neural Microcircuits

20 0.22678658 5 nips-2002-A Digital Antennal Lobe for Pattern Equalization: Analysis and Design


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.017), (23, 0.02), (41, 0.01), (42, 0.038), (54, 0.112), (55, 0.026), (56, 0.408), (57, 0.012), (64, 0.017), (67, 0.019), (68, 0.026), (74, 0.101), (79, 0.011), (92, 0.034), (98, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83031678 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures

Author: Zach Solan, Eytan Ruppin, David Horn, Shimon Edelman


2 0.3844344 2 nips-2002-A Bilinear Model for Sparse Coding

Author: David B. Grimes, Rajesh P. Rao

Abstract: Recent algorithms for sparse coding and independent component analysis (ICA) have demonstrated how localized features can be learned from natural images. However, these approaches do not take image transformations into account. As a result, they produce image codes that are redundant because the same feature is learned at multiple locations. We describe an algorithm for sparse coding based on a bilinear generative model of images. By explicitly modeling the interaction between image features and their transformations, the bilinear approach helps reduce redundancy in the image code and provides a basis for transformationinvariant vision. We present results demonstrating bilinear sparse coding of natural images. We also explore an extension of the model that can capture spatial relationships between the independent features of an object, thereby providing a new framework for parts-based object recognition.

3 0.3843796 124 nips-2002-Learning Graphical Models with Mercer Kernels

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.

4 0.38385266 27 nips-2002-An Impossibility Theorem for Clustering

Author: Jon M. Kleinberg

Abstract: Although the study of clustering is centered around an intuitively compelling goal, it has been very difficult to develop a unified framework for reasoning about it at a technical level, and profoundly diverse approaches to clustering abound in the research community. Here we suggest a formal perspective on the difficulty in finding such a unification, in the form of an impossibility theorem: for a set of three simple properties, we show that there is no clustering function satisfying all three. Relaxations of these properties expose some of the interesting (and unavoidable) trade-offs at work in well-studied clustering techniques such as single-linkage, sum-of-pairs, k-means, and k-median.

5 0.38383883 152 nips-2002-Nash Propagation for Loopy Graphical Games

Author: Luis E. Ortiz, Michael Kearns

Abstract: We introduce NashProp, an iterative and local message-passing algorithm for computing Nash equilibria in multi-player games represented by arbitrary undirected graphs. We provide a formal analysis and experimental evidence demonstrating that NashProp performs well on large graphical games with many loops, often converging in just a dozen iterations on graphs with hundreds of nodes. NashProp generalizes the tree algorithm of (Kearns et al. 2001), and can be viewed as similar in spirit to belief propagation in probabilistic inference, and thus complements the recent work of (Vickrey and Koller 2002), who explored a junction tree approach. Thus, as for probabilistic inference, we have at least two promising general-purpose approaches to equilibria computation in graphs.

6 0.38360554 53 nips-2002-Clustering with the Fisher Score

7 0.38302448 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

8 0.38291603 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games

9 0.38285264 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

10 0.38146278 89 nips-2002-Feature Selection by Maximum Marginal Diversity

11 0.38126773 39 nips-2002-Bayesian Image Super-Resolution

12 0.38092965 74 nips-2002-Dynamic Structure Super-Resolution

13 0.38073477 10 nips-2002-A Model for Learning Variance Components of Natural Images

14 0.38061035 173 nips-2002-Recovering Intrinsic Images from a Single Image

15 0.38056189 48 nips-2002-Categorization Under Complexity: A Unified MDL Account of Human Learning of Regular and Irregular Categories

16 0.38035804 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

17 0.37974495 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

18 0.37915069 3 nips-2002-A Convergent Form of Approximate Policy Iteration

19 0.37886631 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

20 0.37881058 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions