jmlr jmlr2010 jmlr2010-115 knowledge-graph by maker-knowledge-mining

115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages


Source: pdf

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. [sent-14, score-0.374]

2 The next class of languages to consider is the class of context-free languages (CFL). [sent-28, score-0.364]

3 This may be explained by the fact that this class relies on syntactic properties instead of intrinsic properties of the language like the notion of residuals for regular languages (Denis et al. [sent-30, score-0.378]

4 If each non-terminal corresponds to a single congruence class (the NTS languages Boasson and Senizergues, 1985), then the problem may be tractable. [sent-44, score-0.275]

5 However, in general, the contexts of different non-terminals overlap enormously: for instance the contexts of adjective phrases and noun phrases in English both contain contexts of the form (“it is”, “.”). [sent-45, score-0.462]

6 The distribution of a string corresponds to all the possible contexts in which the string can appear. [sent-55, score-0.466]

7 Thus we cannot hope to learn natural languages by learning one congruence class at a time: it is vital to use a more structured representation. [sent-56, score-0.275]

8 This grammar formalism is defined using a set of contexts which play the role of features with a strict semantics attached to these features. [sent-62, score-0.419]

9 We consider then the case when the contextual features assigned to a string correspond to the contexts that the string can occur in, in the language defined by the grammar. [sent-64, score-0.711]

10 The crucial point here is that for languages that can be defined by an exact CBFG, the underlying structure of the representation relies on intrinsic properties of the language easily observable on samples by looking at context sets. [sent-66, score-0.362]

11 We investigate the links with the classical Chomsky hierarchy, some well known grammatical representations used in natural language processing. [sent-72, score-0.285]

12 We extend this to sets of strings and contexts in the natural way. [sent-91, score-0.361]

13 Definition 1 The set of contexts, or context distribution, of a string u in a language L is, CL (u) = {(l, r) ∈ Σ∗ × Σ∗ |lur ∈ L}. [sent-95, score-0.336]

14 Definition 2 Two strings u and v are syntactically congruent with respect to a language L, denoted u ≡L v, if and only if CL (u) = CL (v). [sent-97, score-0.407]
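
As a concrete illustration of Definitions 1 and 2, the sketch below (my own illustration, not code from the paper; all names are hypothetical) approximates the context set CL(u) of a string by enumerating bounded contexts against a membership predicate, and tests syntactic congruence as equality of these restricted sets. With only bounded contexts this is an approximation of the true relation, not a proof of congruence.

```python
from itertools import product

def bounded_contexts(u, member, alphabet=("a", "b"), max_side=5):
    """Approximate C_L(u) = {(l, r) : l u r in L} using all contexts
    whose left and right parts have length <= max_side."""
    sides = [""]
    for n in range(1, max_side + 1):
        sides += ["".join(p) for p in product(alphabet, repeat=n)]
    return {(l, r) for l in sides for r in sides if member(l + u + r)}

def congruent(u, v, member, **kw):
    """u and v are (approximately) syntactically congruent iff their
    restricted context sets coincide."""
    return bounded_contexts(u, member, **kw) == bounded_contexts(v, member, **kw)

# Toy membership predicate for L = { a^n b^n : n >= 1 }.
def member(w):
    n = len(w) // 2
    return len(w) >= 2 and w == "a" * n + "b" * n

print(congruent("ab", "aabb", member))  # True: both behave like a balanced block
print(congruent("a", "b", member))      # False: the context (lambda, "b") separates them
```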

15 After these basic definitions and notations, we recall the definition of a context-free grammar, a class close to the language class studied in this paper. [sent-99, score-0.340]

16 The representation by contextual binary feature grammars relies on the inclusion relation between sets of contexts of language L. [sent-114, score-0.501]

17 The basic insight behind CBFGs is that there is a relation between the contexts of a string w and the contexts of its substrings. [sent-119, score-0.464]

18 This is given by the following trivial lemma: Lemma 4 For any language L and for any strings u, u′ , v, v′ if C(u) = C(u′ ) and C(v) = C(v′ ), then C(uv) = C(u′ v′ ). [sent-120, score-0.366]

19 There is also a slightly stronger result: Lemma 5 For any language L and for any strings u, u′ , v, v′ if C(u) ⊆ C(u′ ) and C(v) ⊆ C(v′ ), then C(uv) ⊆ C(u′ v′ ). [sent-128, score-0.366]

20 Looking at Lemma 5 we can also say that, if we have some finite set of strings K where we know the contexts, then: Corollary 6 For any language L and for any set of strings K, we have C(w) ⊇ C(uv) whenever w = u′v′ for some u′, v′ and there are u, v ∈ K with C(u) ⊆ C(u′) and C(v) ⊆ C(v′). [sent-136, score-0.573]

21 Thus a CBFG, G, defines for every string u a set of contexts fG (u): this will be a representation of the context distribution. [sent-155, score-0.331]

22 Definition 8 The language defined by a CBFG G is the set of all strings that are assigned the empty context: L(G) = {u|(λ, λ) ∈ fG (u)}. [sent-158, score-0.366]

23 A rule x → yz is applied to analyse a string w if there is a split or cut of the string w into two strings u and v such that uv = w, y ⊆ fG (u) and z ⊆ fG (v). [sent-160, score-0.684]

24 One way of viewing a production x → yz is as an implication: if two strings u and v are such that they have the features y and z, then their concatenation will have the features x. [sent-163, score-0.403]
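
To make the way a production x → yz is applied concrete (sentences 21-24), here is a minimal CKY-style sketch, written by analogy with the definitions quoted above rather than taken from the paper: a rule contributes its left-hand-side contexts x to fG(w) whenever some split w = uv has y contained in the features already derived for u and z contained in those derived for v; membership holds when the empty context (λ, λ), encoded below as ("", ""), is derived.

```python
def parse_features(w, lexical, productions):
    """Bottom-up computation of fG(u) for every substring u of w.
    lexical: dict letter -> frozenset of contexts (the lexical rules)
    productions: iterable of (x, y, z), read as the CBFG rule x -> y z,
                 where x, y, z are frozensets of contexts.
    Non-empty w is assumed in this sketch."""
    n = len(w)
    f = {(i, i + 1): set(lexical.get(w[i], frozenset())) for i in range(n)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            feats = set()
            for k in range(i + 1, j):        # split w[i:j] into u = w[i:k], v = w[k:j]
                fu, fv = f[(i, k)], f[(k, j)]
                for x, y, z in productions:
                    if y <= fu and z <= fv:  # rule applies to this split ...
                        feats |= x           # ... so w[i:j] receives the contexts in x
            f[(i, j)] = feats
    return f[(0, n)]

def in_language(w, lexical, productions):
    """Definition 8: w is in L(G) iff the empty context is in fG(w)."""
    if not w:
        return False  # empty string not handled in this sketch
    return ("", "") in parse_features(w, lexical, productions)
```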

25 In the special case where all of the productions involve only singleton sets then this will reduce to a CFG—the non-terminals will correspond to the individual features, and fG (w) will correspond to the set of non-terminals that can derive the string w. [sent-179, score-0.284]

26 Indeed, the grammar assigns only (λ, λ) to the elements of the language; for all elements w of {a^n b^(n+1) : n > 1} we have fG (w) = {(a, λ)} = FL (w) and for all elements w of {a^(n+1) b^n : n > 1}, fG (w) = {(λ, b)} = FL (w); the lexical rules assign correct contexts to each letter. [sent-203, score-0.426]

27 The set of productions is defined merely by observation: we take the set of all productions that we observe as the concatenation of elements of the small set K. [sent-240, score-0.285]

28 P is the interesting set of productions: these allow us to predict the features of a string uv from the features of its parts u and v. [sent-244, score-0.308]

29 To construct P we take all triples of strings u, v, uv that are in our finite set K. [sent-245, score-0.303]

30 We observe that u has the contexts FL (u) and v has the contexts FL (v): our rule then states that we can combine any string that has all of the contexts in FL (u) with any string that has all of the contexts in FL (v), and the result will have all of the contexts in FL (uv). [sent-246, score-1.11]
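
The construction of G0 described in sentences 27-30 can be sketched directly: FL(u) is the set of features of F that the oracle confirms for u, each letter gets a lexical rule, and every observed triple u, v, uv in K yields the production FL(uv) → FL(u) FL(v). The helper names are mine; the oracle is any membership predicate.

```python
def FL(u, F, oracle):
    """Features of F that the string u actually has in L,
    checked with membership queries on l + u + r."""
    return frozenset((l, r) for (l, r) in F if oracle(l + u + r))

def build_G0(K, F, oracle, alphabet):
    """Sketch of G0(K, L, F): lexical rules FL(a) -> a for each letter, plus
    one production FL(uv) -> FL(u) FL(v) for every u, v in K with uv in K."""
    K = set(K)
    lexical = {a: FL(a, F, oracle) for a in alphabet}
    productions = set()
    for u in K:
        for v in K:
            if u + v in K:
                productions.add((FL(u + v, F, oracle),
                                 FL(u, F, oracle),
                                 FL(v, F, oracle)))
    return lexical, productions
```

On the small example discussed in the next few sentences, a string such as aa that has no context in F simply receives the empty feature set ∅, which is how the problematic production ∅ → {(λ, b)}{(λ, b)} arises.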

31 Each of these will give rise to an element of P: • The rule given by ab = a ◦ b: {(λ, λ)} → {(λ, b)}{(a, λ)}, • aa = a ◦ a gives ∅ → {(λ, b)}{(λ, b)}, • aab = aa ◦ b gives {(λ, b)} → ∅{(a, λ)}, • aab = a ◦ ab gives {(λ, b)} → {(λ, b)}{(λ, λ)}. [sent-253, score-0.81]

32 However, in this case, the language L(G0 ) is not the same as L; moreover, the resulting grammar is not exact. [sent-255, score-0.34]

33 The problem here is caused by the fact that the production {(λ, b)} → ∅{(a, λ)} allows any string to occur in the place specified by the ∅: indeed since ∅ ⊆ fG0 (aab) and {(a, λ)} ⊆ fG0 (b) the rule holds for aabb and thus {(λ, b)} ⊆ fG0 (aabb). [sent-257, score-0.316]

34 This is actually caused by the fact that there are no contexts in F that correspond to the string aa in K. [sent-258, score-0.414]

35 Fixing L for the moment, clearly the language defined depends on two factors: the set of strings K and the set of features F. [sent-259, score-0.394]

36 In the next section we will establish two important lemmas that show that the search for K and F is fundamentally tractable: first, that as K increases the language defined by G0 (K, L, F) will increase, and secondly that as F increases the language will decrease. [sent-262, score-0.318]

37 The language defined by G0 contains a^n b and also a^n since {(λ, λ)} ⊂ {(λ, λ), (λ, b)} allowing the third production to accept strings ending with an a. [sent-268, score-0.436]

38 The addition of the last rule allows the grammar to recognize ba^n, and it can be easily shown that, by a combination of the last two productions, a^n b a^m belongs to the language defined by the grammar. [sent-272, score-0.496]

39 In the following lemma we abuse notation and use G0 for the case where K and F are infinite: in this lemma we let K be the set of all non-empty strings and we let F be the set of all possible contexts (Σ∗ × Σ∗ ). [sent-276, score-0.415]

40 The base case is trivial since F′L (a) ∩ F = FL (a); if it is true for all strings up to length k, then if f ∈ fG′ (u) ∩ F, there must be a production in G′ with f in the head. [sent-302, score-0.297]

41 Proof Clearly the set of productions of G0 (K, L, F) will be a subset of the set of productions of G0 (K′, L, F), and so anything that can be derived by the first can be derived by the second, again by induction on the length of the string. [sent-310, score-0.295]

42 3 Fiducial Feature Sets and Finite Context Property We need to be able to prove that for any K if we have enough features then the language defined will be included within the target language L. [sent-327, score-0.368]

43 We formalise the idea of having enough features in the following way: Definition 17 For a language L and a string u, a set of features F is fiducial on u if for all v ∈ Σ∗ , FL (u) ⊆ FL (v) implies CL (u) ⊆ CL (v). [sent-328, score-0.371]

44 Definition 18 For a language L and a set of strings K, a set of features F is fiducial on K if for all u ∈ K, F is fiducial on u. [sent-331, score-0.394]
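
Definition 17 can be checked on a bounded sample as follows; since CL(u) may be infinite, the sketch below (hypothetical names, bounded test sets supplied by the caller) is only a necessary-condition test for fiduciality, not a proof of it.

```python
def features_of(u, F, oracle):
    """F_L(u): features of F that u has, via membership queries."""
    return {(l, r) for (l, r) in F if oracle(l + u + r)}

def fiducial_on(u, F, oracle, test_strings, test_contexts):
    """Bounded check of Definition 17: F is fiducial on u if no v with
    F_L(u) <= F_L(v) misses a context of u.  `test_strings` approximates
    Sigma* and `test_contexts` approximates C_L(u)."""
    FL_u = features_of(u, F, oracle)
    CL_u = {(l, r) for (l, r) in test_contexts if oracle(l + u + r)}
    for v in test_strings:
        if FL_u <= features_of(v, F, oracle):   # v has all features of u ...
            CL_v = {(l, r) for (l, r) in test_contexts if oracle(l + v + r)}
            if not CL_u <= CL_v:                # ... but not all contexts of u
                return False
    return True

def fiducial_on_set(K, F, oracle, test_strings, test_contexts):
    """Definition 18: F is fiducial on K if it is fiducial on every u in K."""
    return all(fiducial_on(u, F, oracle, test_strings, test_contexts) for u in K)
```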

45 Clearly, for any string w, CL (w) will be fiducial on w; but this is vacuous—we are interested in cases where there is a finite set of contexts which is fiducial for w, but where CL (w) is infinite. [sent-332, score-0.31]

46 Consider now the string aab; clearly {(λ, b)} is fiducial for aab, even though the string a, which is not congruent to aab, also occurs in this context. [sent-341, score-0.56]

47 However, looking at the string a, {(λ, b)} is not fiducial, since aab has this context but does not include all the contexts of a, such as, for example, (λ, abb). [sent-343, score-0.538]

48 Consider the finite language L = {ab, ac, db, ec, dx, ey}, and the string a. [sent-345, score-0.315]

49 We now define the finite context property, which is one of the two conditions that languages must satisfy to be learnable in this model; this condition is a purely language theoretic property. [sent-348, score-0.388]

50 Definition 19 A language L has the Finite Context Property (FCP) if every string has a finite fiducial feature set. [sent-349, score-0.315]

51 It states that the features assigned by the grammar will correspond to the language theoretic interpretation of them as contexts. [sent-357, score-0.368]

52 Lemma 20 For any language L, given a set of strings K and a set of features F, let G = G0 (K, L, F). [sent-358, score-0.394]

53 In our case, we can take the infinite set of all contexts and define productions based on the congruence classes. [sent-383, score-0.375]

54 If F is the set of all contexts then we have FL (u) = CL (u), thus the productions will be exactly of the form C(uv) → C(u)C(v). [sent-384, score-0.282]

55 Notice that all regular languages have the FKP since they have a finite number of congruence classes. [sent-399, score-0.312]

56 Accordingly, we have decided to use the model of positive data together with membership queries: an oracle can tell the learner whether a string is in the language or not (Angluin, 1988). [sent-421, score-0.415]

57 The presented algorithm runs in time polynomial in the size of the sample S: since the strings are of variable length, this size must be the sum of the lengths of the strings in S, ∑w∈S |w|. [sent-422, score-0.414]

58 It is hard to tighten the model sufficiently: the suggestion of de la Higuera (1997) for a polynomial characteristic set is inapplicable for representations, such as the ones in this paper, that are powerful enough to define languages whose shortest strings are exponentially long. [sent-424, score-0.389]

59 Informally, D is the list of all strings that have been seen so far and Gn is the current grammar obtained with the first n strings of D. [sent-440, score-0.595]

60 This requires us to search through a polynomially bounded set of strings looking for a string that is in L(Gn ) \ L. [sent-446, score-0.363]
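
The overall loop described in sentences 56-60 can be paraphrased as follows; this is my own organisation of it, guided by sentence 36 (growing K can only enlarge the language, growing F can only shrink it), and the exact bookkeeping in the paper may differ. `candidates(D)` stands in for the polynomially bounded set of test strings; `build_G0` and `in_language` are the earlier sketches.

```python
def substrings(D):
    return {w[i:j] for w in D for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def contexts_of(D):
    return {(w[:i], w[j:]) for w in D for i in range(len(w) + 1) for j in range(i, len(w) + 1)}

def learn(stream, oracle, alphabet, candidates):
    """Incremental learner sketch: D is the positive data seen so far."""
    D, K, F = [], set(), set()
    lexical, productions = {}, set()
    for w in stream:
        D.append(w)
        if not in_language(w, lexical, productions):
            # Undergeneration: rebuild K (and refresh F) from the data so that
            # every positive example seen so far becomes derivable again.
            K |= substrings(D)
            F |= contexts_of(D)
            lexical, productions = build_G0(K, F, oracle, alphabet)
        for c in candidates(D):
            if in_language(c, lexical, productions) and not oracle(c):
                # Overgeneration detected on the bounded test set:
                # adding contexts to F can only shrink the language.
                F |= contexts_of(D)
                lexical, productions = build_G0(K, F, oracle, alphabet)
                break
        yield lexical, productions
```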

61 The final grammar will have 10 productions and 2 lexical productions; |K| = 8 and |F| = 21. [sent-559, score-0.368]

62 Each grammar is then used either to generate training and test sets or as an oracle for answering membership queries. [sent-578, score-0.281]

63 Starting from these grammars without useless rules, we ensure that they generate infinite languages by checking that the start symbol is used at least once in a right-hand side of a grammar rule (on average this symbol appears in a right-hand side of a rule more than 3 times per grammar). [sent-589, score-0.582]

64 We then complete this learning set by randomly generating new strings from the grammar in order to have a total of 50 examples. [sent-600, score-0.388]

65 Since the learning phase uses a membership oracle, when the hypothesis is being constructed, some new strings may be built and submitted to the oracle by picking a substring and a context from the learning sample. [sent-603, score-0.364]

66 Thus, even if the test set does not contain any string of the learning sample, the construction G0 may consider some strings present in the test set. [sent-604, score-0.363]

67 According to this procedure, we randomly generate a test set of 1000 strings over the alphabet of terminal symbols used to define the target grammar (1000 examples is twice the size of the small test sets of the Omphalos competition). [sent-606, score-0.464]

68 The ratio between strings in the language and strings outside the language is fixed to be between 40% and 60%. [sent-609, score-0.732]

69 For each target context-free grammar, we construct a CBFG by applying the construction G0 (K, O , F) with K = Sub(S) and F = Con(S) where S is a set of strings drawn from the learning set and using the target grammar as the oracle O for the membership queries. [sent-611, score-0.532]

70 3 Results and Discussion Figure 3 shows the averaged accuracy over the different target grammars according to the number of strings in the learning sample. [sent-616, score-0.359]

71 While a very worst-case analysis of the grammar construction used by G0 would lead to a complexity in O(|S|^5), we can observe that the number of queries seems to be quadratic, at least in the case of the grammars we consider here. [sent-667, score-0.347]

72 We will show that this class • includes all regular languages • does not include all context-free languages • includes some non-context-free languages. [sent-683, score-0.422]

73 Note that the number of congruence classes of a regular language is finite. [sent-706, score-0.289]

74 Let KL be the finite set of strings formed by taking the lexicographically shortest string from each congruence class. [sent-708, score-0.477]

75 For every pair of strings u, v ∈ KL , we define a rule FL (uv) → FL (u)FL (v) and we add lexical productions of the form FL (a) → a, a ∈ Σ. [sent-710, score-0.422]
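
The construction in the last two sentences can be imitated on a small regular language given as a membership predicate by approximating congruence classes over bounded strings and contexts. This is only a bounded sketch of the proof construction, with names of my choosing; the bounded approximation may merge classes that longer contexts would separate.

```python
from itertools import product

def shortlex_strings(alphabet, max_len):
    """All strings over `alphabet` of length <= max_len, shortest first."""
    out = [""]
    for n in range(1, max_len + 1):
        out += ["".join(p) for p in product(alphabet, repeat=n)]
    return out

def regular_to_cbfg(oracle, alphabet, max_len=4):
    """Bounded sketch: K_L = one shortest representative per observed
    congruence class, F = all bounded contexts, productions
    F_L(uv) -> F_L(u) F_L(v) plus lexical rules F_L(a) -> a."""
    sides = shortlex_strings(alphabet, max_len)
    F = [(l, r) for l in sides for r in sides]
    def FL(u):
        return frozenset((l, r) for (l, r) in F if oracle(l + u + r))
    reps = {}                           # context-set signature -> representative
    for u in sides[1:]:                 # non-empty strings, shortest first
        reps.setdefault(FL(u), u)
    KL = set(reps.values())
    lexical = {a: FL(a) for a in alphabet}
    productions = {(FL(u + v), FL(u), FL(v)) for u in KL for v in KL}
    return KL, lexical, productions

# Example: the regular language (ab)+ given as a membership predicate.
def is_ab_plus(w):
    return len(w) >= 2 and w == "ab" * (len(w) // 2)

K_L, lexical, productions = regular_to_cbfg(is_ab_plus, ("a", "b"))
print(sorted(K_L))
```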

76 First, we show ∀w ∈ Σ∗ , fg (w) ⊆ FL (w) by induction on the length of w. [sent-715, score-0.386]

77 Suppose that a string w, |w| = n > 1, is parsed by the CBFG G, then there exists a cut of w in uv = w and a rule z → xy in G such that x ⊆ fG (u) and y ⊆ fG (v). [sent-717, score-0.28]

78 The key point relies on the fact that when a string w is parsed by a CBFG G, there exists a cut of w into uv = w (u, v ∈ Σ∗ ) and a rule z → xy in G such that x ⊆ fG (u) and y ⊆ fG (v). [sent-724, score-0.28]

79 For the language of Figure 5, the following set is sufficient to build an exact CBFG: {a, b, aa, ab, ba, aab, bb, bba} (this corresponds to all the substrings of aab and bba). [sent-727, score-0.433]

80 First, CBFGs can compactly represent languages like the finite language of all n! [sent-747, score-0.341]

81 In particular, we define a language closely related to the MIX language (consisting of strings with an equal number of a’s, b’s and c’s in any order) which is known to be non context-free, and indeed is conjectured to be outside the class of indexed grammars (Boullier, 2003). [sent-750, score-0.688]
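
For intuition only, membership in the MIX language mentioned here (equal numbers of a's, b's and c's, in any order) reduces to a counting check; the sketch below says nothing about CBFG derivations or about the related language actually used in the construction.

```python
from collections import Counter

def in_mix(w, symbols=("a", "b", "c")):
    """MIX: strings over {a, b, c} with the same number of each symbol."""
    if set(w) - set(symbols):
        return False
    counts = Counter(w)
    return len({counts[s] for s in symbols}) == 1

print(in_mix("abccba"))  # True: two of each symbol
print(in_mix("aabbc"))   # False: counts differ
```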

82 Indeed, this rule, with the presence of two contexts in one of the categories, means that an element of the language has to be derived so that it has a prefix u such that fG (u) ⊇ {(λ, d), (λ, e)}. [sent-764, score-0.313]

83 L is not context-free since if we intersect L with the regular language Σ∗ d, we get an instance of the non context-free MIX language (with d appended). [sent-771, score-0.388]

84 The exactness comes from the fact that we chose the contexts in order to ensure that strings belonging to a sublanguage cannot belong to another one, and that the derivation of a substring will provide all the possible correct features with the help of the union of all the possible derivations. [sent-772, score-0.425]

85 Note that the MIX language on its own is not definable by an exact CBFG: it is only when other parts of the language can distributionally define the appropriate partial structures that we can get context sensitive languages. [sent-773, score-0.339]

86 This formalism thus potentially accounts not just for the existence of non context-free natural languages but also for their rarity. [sent-775, score-0.271]

87 Definition 38 A conjunctive grammar is defined as a quadruple ⟨Σ, N, P, S⟩, in which: Σ is the alphabet; N is the set of non-terminal symbols; P is the set of rules, each of the form A → α1 & . . . & αn. [sent-818, score-0.278]

88 We claim that for every language L generated by a conjunctive grammar there is a CBFG representing L# (where the special character # is not included in the original alphabet). [sent-826, score-0.376]

89 &Pn Qn, we add distinct contexts {(lPi Qi , rPi Qi )} to F, such that for all i there exists ui with lPi Qi ui rPi Qi ∈ L and Pi Qi ⇒∗G ui ; • Let FX, j = {(lPi Qi , rPi Qi ) : ∀i} be the set of contexts corresponding to the jth rule applicable to X. [sent-832, score-0.336]

90 We have proposed a learning algorithm using only membership queries and shown that this algorithm can identify in the limit the class of context-free languages satisfying the FCP and FKP assumptions. [sent-841, score-0.28]

91 First of all, we should establish how large the class of languages with the FCP and the FKP is: it includes all finite languages and all regular languages, since the set of congruence classes is finite for finite state languages. [sent-842, score-0.494]

92 It similarly includes the context-free substitutable languages (Clark and Eyraud, 2007), since every string in a substitutable language belongs to only one syntactic congruence class. [sent-843, score-0.632]

93 However it does include languages like the Dyck languages of arbitrary order, the Lukasiewicz language and most other classic simple examples. [sent-845, score-0.523]

94 The class of CFLs where the context distribution of every string is a finite union of equivalence classes of contexts clearly has both the FKP and the FCP. [sent-847, score-0.331]

95 These strings do not appear in the list of learning examples given to the oracle but are built from all the possible contexts and substrings that can be extracted from this list. [sent-852, score-0.466]

96 We can imagine a procedure that changes the grammar by adding these new positive strings for building the CBFG; however, this could lead to having to deal with an exponential number of strings. [sent-855, score-0.388]

97 2 Linguistics The field of grammatical inference has close relations to the study of language acquisition. [sent-872, score-0.285]

98 Attempts to model natural languages with context-free grammars require additional machinery: natural language categories such as noun phrases contain many overlapping subclasses with features such as case, number, gender and similarly for verbal categories. [sent-873, score-0.499]

99 Secondly, considering features that correspond to individual contexts may be too narrow a definition for natural language given the well-known problems of data sparseness, and it will be necessary to switch to features corresponding to sets of contexts, which may overlap. [sent-878, score-0.369]

100 Inferring grammars for mildly context sensitive languages in polynomial-time. [sent-1079, score-0.333]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fl', 0.408), ('fg', 0.347), ('cbfg', 0.322), ('ducial', 0.259), ('aab', 0.207), ('strings', 0.207), ('languages', 0.182), ('grammar', 0.181), ('language', 0.159), ('string', 0.156), ('contexts', 0.154), ('abb', 0.15), ('grammars', 0.13), ('productions', 0.128), ('cbfgs', 0.114), ('grammatical', 0.107), ('aa', 0.104), ('abrard', 0.099), ('lark', 0.099), ('yraud', 0.099), ('uv', 0.096), ('congruence', 0.093), ('ontextual', 0.093), ('cl', 0.089), ('bb', 0.087), ('pl', 0.081), ('ae', 0.081), ('epresentations', 0.08), ('ab', 0.08), ('cfg', 0.072), ('production', 0.07), ('substrings', 0.067), ('ad', 0.067), ('sub', 0.062), ('aabb', 0.062), ('sing', 0.062), ('membership', 0.062), ('lexical', 0.059), ('contextual', 0.058), ('formalism', 0.056), ('bd', 0.05), ('fkp', 0.047), ('chomsky', 0.044), ('congruent', 0.041), ('fcp', 0.041), ('lac', 0.041), ('omphalos', 0.041), ('yz', 0.041), ('ce', 0.04), ('colloquium', 0.04), ('oracle', 0.038), ('regular', 0.037), ('conjunctive', 0.036), ('duciality', 0.036), ('lmin', 0.036), ('substring', 0.036), ('kn', 0.036), ('queries', 0.036), ('eyraud', 0.035), ('clark', 0.034), ('non', 0.033), ('rules', 0.032), ('labc', 0.031), ('lur', 0.031), ('kv', 0.029), ('concatenation', 0.029), ('rule', 0.028), ('terminal', 0.028), ('features', 0.028), ('ku', 0.027), ('lemma', 0.027), ('boullier', 0.026), ('lpi', 0.026), ('rpi', 0.026), ('starkie', 0.026), ('alphabet', 0.026), ('learnable', 0.026), ('lab', 0.025), ('competition', 0.023), ('inductive', 0.023), ('fn', 0.023), ('angluin', 0.022), ('icgi', 0.022), ('target', 0.022), ('kernel', 0.022), ('learnability', 0.021), ('context', 0.021), ('distributional', 0.021), ('aaab', 0.021), ('cfl', 0.021), ('lexicographically', 0.021), ('substitutable', 0.021), ('gn', 0.021), ('gt', 0.02), ('length', 0.02), ('expressiveness', 0.02), ('representations', 0.019), ('induction', 0.019), ('inference', 0.019), ('presentation', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

2 0.18840279 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.13817506 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

4 0.067257434 109 jmlr-2010-Stochastic Composite Likelihood

Author: Joshua V. Dillon, Guy Lebanon

Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation

5 0.058701336 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

Author: James Henderson, Ivan Titov

Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks

6 0.054365657 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

7 0.050837986 15 jmlr-2010-Approximate Tree Kernels

8 0.03660322 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate

9 0.034112055 44 jmlr-2010-Graph Kernels

10 0.029585538 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

11 0.028008088 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

12 0.027679376 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

13 0.026429724 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

14 0.02520344 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

15 0.022470342 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

16 0.022366315 60 jmlr-2010-Learnability, Stability and Uniform Convergence

17 0.022351092 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding

18 0.019795315 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

19 0.019697424 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

20 0.019198576 22 jmlr-2010-Classification Using Geometric Level Sets


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.123), (1, 0.027), (2, -0.176), (3, 0.078), (4, 0.07), (5, -0.316), (6, -0.035), (7, 0.028), (8, -0.008), (9, -0.005), (10, 0.022), (11, -0.079), (12, -0.156), (13, -0.073), (14, -0.014), (15, -0.089), (16, -0.055), (17, 0.014), (18, -0.085), (19, 0.009), (20, 0.02), (21, -0.081), (22, 0.12), (23, -0.016), (24, -0.017), (25, 0.012), (26, -0.168), (27, -0.116), (28, -0.036), (29, 0.112), (30, -0.074), (31, 0.016), (32, -0.035), (33, 0.115), (34, 0.163), (35, 0.089), (36, 0.014), (37, 0.074), (38, 0.054), (39, -0.094), (40, -0.116), (41, -0.025), (42, -0.171), (43, -0.03), (44, 0.1), (45, 0.057), (46, -0.008), (47, 0.129), (48, 0.06), (49, -0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96754211 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

2 0.61184919 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.57802141 53 jmlr-2010-Inducing Tree-Substitution Grammars

Author: Trevor Cohn, Phil Blunsom, Sharon Goldwater

Abstract: Inducing a grammar from text has proven to be a notoriously challenging learning task despite decades of research. The primary reason for its difficulty is that in order to induce plausible grammars, the underlying model must be capable of representing the intricacies of language while also ensuring that it can be readily learned from data. The majority of existing work on grammar induction has favoured model simplicity (and thus learnability) over representational capacity by using context free grammars and first order dependency grammars, which are not sufficiently expressive to model many common linguistic constructions. We propose a novel compromise by inferring a probabilistic tree substitution grammar, a formalism which allows for arbitrarily large tree fragments and thereby better represent complex linguistic structures. To limit the model’s complexity we employ a Bayesian non-parametric prior which biases the model towards a sparse grammar with shallow productions. We demonstrate the model’s efficacy on supervised phrase-structure parsing, where we induce a latent segmentation of the training treebank, and on unsupervised dependency grammar induction. In both cases the model uncovers interesting latent linguistic structures while producing competitive results. Keywords: grammar induction, tree substitution grammar, Bayesian non-parametrics, Pitman-Yor process, Chinese restaurant process

4 0.40297773 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate

Author: Gérard Biau, Frédéric Cérou, Arnaud Guyader

Abstract: Bagging is a simple way to combine estimates in order to improve their performance. This method, suggested by Breiman in 1996, proceeds by resampling from the original data set, constructing a predictor from each subsample, and decide by combining. By bagging an n-sample, the crude nearest neighbor regression estimate is turned into a consistent weighted nearest neighbor regression estimate, which is amenable to statistical analysis. Letting the resampling size kn grows appropriately with n, it is shown that this estimate may achieve optimal rate of convergence, independently from the fact that resampling is done with or without replacement. Since the estimate with the optimal rate of convergence depends on the unknown distribution of the observations, adaptation results by data-splitting are presented. Keywords: bagging, resampling, nearest neighbor estimate, rates of convergence

5 0.2913065 109 jmlr-2010-Stochastic Composite Likelihood

Author: Joshua V. Dillon, Guy Lebanon

Abstract: Maximum likelihood estimators are often of limited practical use due to the intensive computation they require. We propose a family of alternative estimators that maximize a stochastic variation of the composite likelihood function. Each of the estimators resolve the computation-accuracy tradeoff differently, and taken together they span a continuous spectrum of computation-accuracy tradeoff resolutions. We prove the consistency of the estimators, provide formulas for their asymptotic variance, statistical robustness, and computational complexity. We discuss experimental results in the context of Boltzmann machines and conditional random fields. The theoretical and experimental studies demonstrate the effectiveness of the estimators when the computational resources are insufficient. They also demonstrate that in some cases reduced computational complexity is associated with robustness thereby increasing statistical accuracy. Keywords: Markov random fields, composite likelihood, maximum likelihood estimation

6 0.15612221 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis

7 0.1522067 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

8 0.14987822 15 jmlr-2010-Approximate Tree Kernels

9 0.14890195 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

10 0.1471476 110 jmlr-2010-The SHOGUN Machine Learning Toolbox

11 0.14197469 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

12 0.13820885 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems

13 0.13671841 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation

14 0.13407035 60 jmlr-2010-Learnability, Stability and Uniform Convergence

15 0.1276245 66 jmlr-2010-Linear Algorithms for Online Multitask Classification

16 0.12571803 37 jmlr-2010-Evolving Static Representations for Task Transfer

17 0.12345628 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

18 0.11636034 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm

19 0.11455481 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

20 0.11272495 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.017), (21, 0.042), (22, 0.047), (24, 0.013), (32, 0.039), (33, 0.014), (36, 0.029), (37, 0.034), (45, 0.485), (75, 0.081), (85, 0.045), (96, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.68329453 115 jmlr-2010-Using Contextual Representations to Efficiently Learn Context-Free Languages

Author: Alexander Clark, Rémi Eyraud, Amaury Habrard

Abstract: We present a polynomial update time algorithm for the inductive inference of a large class of context-free languages using the paradigm of positive data and a membership oracle. We achieve this result by moving to a novel representation, called Contextual Binary Feature Grammars (CBFGs), which are capable of representing richly structured context-free languages as well as some context sensitive languages. These representations explicitly model the lattice structure of the distribution of a set of substrings and can be inferred using a generalisation of distributional learning. This formalism is an attempt to bridge the gap between simple learnable classes and the sorts of highly expressive representations necessary for linguistic representation: it allows the learnability of a large class of context-free languages, that includes all regular languages and those context-free languages that satisfy two simple constraints. The formalism and the algorithm seem well suited to natural language and in particular to the modeling of first language acquisition. Preliminary experimental results confirm the effectiveness of this approach. Keywords: grammatical inference, context-free language, positive data only, membership queries

2 0.2416041 29 jmlr-2010-Covariance in Unsupervised Learning of Probabilistic Grammars

Author: Shay B. Cohen, Noah A. Smith

Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, generalpurpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data. Keywords: dependency grammar induction, variational inference, logistic normal distribution, Bayesian inference

3 0.23595221 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

4 0.2312858 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning

Author: James Henderson, Ivan Titov

Abstract: We propose a class of Bayesian networks appropriate for structured prediction problems where the Bayesian network’s model structure is a function of the predicted output structure. These incremental sigmoid belief networks (ISBNs) make decoding possible because inference with partial output structures does not require summing over the unboundedly many compatible model structures, due to their directed edges and incrementally specified model structure. ISBNs are specifically targeted at challenging structured prediction problems such as natural language parsing, where learning the domain’s complex statistical dependencies benefits from large numbers of latent variables. While exact inference in ISBNs with large numbers of latent variables is not tractable, we propose two efficient approximations. First, we demonstrate that a previous neural network parsing model can be viewed as a coarse mean-field approximation to inference with ISBNs. We then derive a more accurate but still tractable variational approximation, which proves effective in artificial experiments. We compare the effectiveness of these models on a benchmark natural language parsing task, where they achieve accuracy competitive with the state-of-the-art. The model which is a closer approximation to an ISBN has better parsing accuracy, suggesting that ISBNs are an appropriate abstract model of natural language grammar learning. Keywords: Bayesian networks, dynamic Bayesian networks, grammar learning, natural language parsing, neural networks

5 0.22474271 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM

Author: Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, Chih-Jen Lin

Abstract: Kernel techniques have long been used in SVM to handle linearly inseparable problems by transforming data to a high dimensional space, but training and testing large data sets is often time consuming. In contrast, we can efficiently train and test much larger data sets using linear SVM without kernels. In this work, we apply fast linear-SVM methods to the explicit form of polynomially mapped data and investigate implementation issues. The approach enjoys fast training and testing, but may sometimes achieve accuracy close to that of using highly nonlinear kernels. Empirical experiments show that the proposed method is useful for certain large-scale data sets. We successfully apply the proposed method to a natural language processing (NLP) application by improving the testing accuracy under some training/testing speed requirements. Keywords: decomposition methods, low-degree polynomial mapping, kernel functions, support vector machines, dependency parsing, natural language processing

6 0.22000264 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

7 0.219624 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization

8 0.21907371 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

9 0.21834113 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

10 0.21724989 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

11 0.2150404 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

12 0.21500939 86 jmlr-2010-On the Rate of Convergence of the Bagged Nearest Neighbor Estimate

13 0.21500497 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

14 0.21442731 102 jmlr-2010-Semi-Supervised Novelty Detection

15 0.21337107 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

16 0.21322891 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

17 0.21276383 109 jmlr-2010-Stochastic Composite Likelihood

18 0.21275234 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

19 0.21221663 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

20 0.21189751 63 jmlr-2010-Learning Instance-Specific Predictive Models