jmlr jmlr2007 jmlr2007-67 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alexander Clark, Rémi Eyraud
Abstract: This paper formalises the idea of substitutability introduced by Zellig Harris in the 1950s and makes it the basis for a learning algorithm from positive data only for a subclass of context-free languages. We show that there is a polynomial characteristic set, and thus prove polynomial identification in the limit of this class. We discuss the relationship of this class of languages to other common classes discussed in grammatical inference. It transpires that it is not necessary to identify constituents in order to learn a context-free language—it is sufficient to identify the syntactic congruence, and the operations of the syntactic monoid can be converted into a context-free grammar. We also discuss modifications to the algorithm that produces a reduction system rather than a context-free grammar, that will be much more compact. We discuss the relationship to Angluin’s notion of reversibility for regular languages. We also demonstrate that an implementation of this algorithm is capable of learning a classic example of structure dependent syntax in English: this constitutes a refutation of an argument that has been used in support of nativist theories of language. Keywords: grammatical inference, context-free languages, positive data only, reduction system, natural languages
Reference: text
sentIndex sentText sentNum sentScore
1 We discuss the relationship of this class of languages to other common classes discussed in grammatical inference. [sent-9, score-0.481]
2 It transpires that it is not necessary to identify constituents in order to learn a context-free language—it is sufficient to identify the syntactic congruence, and the operations of the syntactic monoid can be converted into a context-free grammar. [sent-10, score-0.447]
3 Keywords: grammatical inference, context-free languages, positive data only, reduction system, natural languages 1. [sent-14, score-0.517]
4 In this paper we present an explicit mathematical formalization of this idea of substitutability and use it to define a subclass of context-free languages, which we call the substitutable languages, that can be learned under the polynomial identification in the limit paradigm. [sent-19, score-0.931]
5 These languages are not comparable to the very simple languages, but seem better suited to be the basis for algorithms that can learn natural languages. [sent-21, score-0.422]
6 Indeed, it is sufficient to be able to learn the syntactic congruence: the syntactic monoid can be converted into a context-free grammar in Chomsky normal form in a very natural way. [sent-26, score-0.771]
7 Secondly, we define a simple, purely language-theoretic criterion that allows the syntactic congruence to be identified very easily. [sent-27, score-0.642]
8 The key to the Harris approach for learning a language L is to look at pairs of strings u and v and to see whether they occur in the same contexts; that is to say, to look for pairs of strings of the form lur and lvr that are both in L. [sent-32, score-0.81]
9 Therefore to obtain learnability results we must define subclasses of the languages that sufficiently restrict the class so that learning can take place. [sent-40, score-0.424]
10 Our main formal result is that this simple but powerful constraint on languages—and note that it is expressed in purely language-theoretic terms—sufficiently restricts the class of context-free languages to the extent that it can be learned using a simple polynomial algorithm. [sent-43, score-0.697]
11 More generally, this work shows how one can move from the syntactic congruence of a context-free language to a grammar for that language under certain assumptions. [sent-45, score-1.195]
12 The set of all substrings of a language L is denoted Sub(L) = {u ∈ Σ+ : ∃l, r ∈ Σ∗ such that lur ∈ L} (notice that the empty word does not belong to Sub(L)). [sent-65, score-0.537]
13 Definition 1 (Grammar) A grammar is a quadruple G = ⟨Σ, V, P, S⟩ where Σ is a finite alphabet of terminal symbols, V is a finite alphabet of variables or non-terminals (distinct from Σ), P is a finite set of production rules, and S ∈ V is a start symbol. [sent-69, score-0.657]
14 If P ⊆ V × (Σ ∪ V )+ then the grammar is said to be context-free (CF), and we will write the productions as T → α. [sent-70, score-0.471]
15 Definition 2 (Syntactic congruence) Two strings u and v are syntactically congruent with respect to a language L, written u ≡L v, if and only if ∀l, r ∈ Σ∗, lur ∈ L iff lvr ∈ L. [sent-79, score-0.492]
16 We can think of this syntactic congruence as the strong notion of substitutability. [sent-80, score-0.389]
17 Note two things: first this is clearly an equivalence relation, and secondly, it is a congruence of the monoid Σ∗ , that is, u ≡L v implies ∀l, r lur ≡L lvr. [sent-81, score-0.494]
18 The syntactic monoid of the language L is just the quotient of Σ∗ by this relation. [sent-82, score-0.476]
19 We can see that u ≡L v iff |u|_a − |u|_b = |v|_a − |v|_b; the congruence classes will thus correspond to particular values of |u|_a − |u|_b, and the syntactic monoid will be isomorphic to Z. [sent-86, score-0.512]
20 Another way of looking at this relation is to define the set of contexts of a string. Definition 3 (Set of contexts) The set of contexts of a string u in a language L is written CL(u) and defined as CL(u) = {(l, r) : lur ∈ L}. [sent-87, score-0.63]
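Since everything below is computed from a finite sample rather than the full language, a minimal sketch of the two basic operations may help; the function names contexts and weakly_substitutable are illustrative choices, not the paper's.

```python
# Sketch: the context set C_S(u) of a substring u relative to a finite
# sample S, and the shared-context test that the learner uses in place
# of full syntactic congruence.

def contexts(u, sample):
    """Return {(l, r) : l + u + r is a string in the sample}."""
    result = set()
    for s in sample:
        start = s.find(u)
        while start != -1:
            result.add((s[:start], s[start + len(u):]))
            start = s.find(u, start + 1)
    return result

def weakly_substitutable(u, v, sample):
    """True iff u and v share at least one context in the sample."""
    return bool(contexts(u, sample) & contexts(v, sample))

sample = {"c", "acb", "aacbb"}                   # a subset of {a^n c b^n}
print(contexts("acb", sample))                   # {('', ''), ('a', 'b')}
print(weakly_substitutable("c", "acb", sample))  # True: both occur in ('', '')
```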
21 We can now define the class of languages that we are concerned with: Definition 5 (Substitutable language) A language L is substitutable if and only if, for every pair of strings u, v, u ≐L v implies u ≡L v, where u ≐L v (weak substitutability) means that u and v share at least one context. [sent-99, score-0.966]
22 In terms of contexts, we can say that a language is substitutable if, whenever the sets of contexts of two strings have non-empty intersection, they are identical. [sent-101, score-0.502]
23 The substitutable context-free languages are just those languages that are both substitutable and context-free. [sent-102, score-1.476]
24 Lemma 6 There are languages which are substitutable but not context-free. [sent-104, score-0.738]
25 The language L = {u ∈ Σ∗ : |u|_a = |u|_b ∧ |u|_c = |u|_d} is substitutable, as can easily be verified: the syntactic monoid here is isomorphic to Z × Z. [sent-106, score-0.476]
26 The intersection of L with the regular language a∗c∗b∗d∗ gives the language {a^n c^m b^n d^m : n, m > 0}, which is not context-free; therefore L is not context-free. [sent-107, score-0.524]
27 Example 2 A context-free grammar Gn with the one-letter alphabet {a} is defined as follows: it contains a set of non-terminals N_1, . . . , N_n. [sent-127, score-0.451]
28 Clearly, relaxing the constraint of polynomial size of the characteristic set to merely requiring polynomial cardinality is unsatisfactory, since it would allow algorithms to use exponential amounts of computation by specifying an exponentially long string in the characteristic set. [sent-143, score-0.433]
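Example 2's productions are truncated in this extract; the following is an illustrative reconstruction (assumed, not quoted from the paper) of the kind of grammar the argument needs: a chain of doubling rules over the one-letter alphabet whose unique derived string already has exponential length.

```python
# Sketch: G_n has productions N_1 -> a and N_{i+1} -> N_i N_i for
# 1 <= i < n, with start symbol N_n. That is only n rules, yet the single
# string G_n generates is a^(2^(n-1)), so a characteristic set that must
# contain this string has exponential total size.

def yield_length(n):
    length = 1                  # |yield(N_1)| = 1
    for _ in range(n - 1):      # each rule N_{i+1} -> N_i N_i doubles it
        length *= 2
    return length

for n in (1, 4, 10, 40):
    print(n, yield_length(n))   # n = 40 already gives 549755813888 symbols
```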
29 Several ideas have been formulated to tackle this problem, such as focusing on shallow languages (Adriaans, 2002) or requiring a characteristic sample that is polynomial in a parameter other than the size of the target (Wakatsuki and Tomita, 1993), but none of them represents a consensus. [sent-144, score-0.56]
30 Algorithm. We now define an algorithm, SGL (Substitution Graph Learner), that will learn a context-free grammar from a sample of positive strings of a language. [sent-147, score-0.567]
31 If the language is substitutable, then the members of the same component will be syntactically congruent, and can thus be freely swapped for one another without altering language membership. [sent-156, score-0.499]
32 First, note that since syntactic congruence is transitive, and we are interested in substitutable languages, we can compute the transitive closure of the graph by adding an edge (u, w) whenever edges (u, v) and (v, w) are present. [sent-158, score-0.808]
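A sketch of the graph construction (the helper names are ours, not the paper's): nodes are the substrings of the sample, edges join weakly substitutable pairs, and a union-find pass yields the components, which already realise the transitive closure discussed above.

```python
# Sketch: building the substitution graph over Sub(S) and merging weakly
# substitutable nodes into components with union-find.
from itertools import combinations

def substrings(sample):
    return {s[i:j] for s in sample
            for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

def contexts(u, sample):
    return {(s[:i], s[i + len(u):]) for s in sample
            for i in range(len(s) - len(u) + 1) if s[i:i + len(u)] == u}

def components(sample):
    parent = {u: u for u in substrings(sample)}
    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    ctx = {u: contexts(u, sample) for u in parent}
    for u, v in combinations(list(parent), 2):
        if ctx[u] & ctx[v]:               # an edge of the substitution graph
            parent[find(u)] = find(v)     # merge the two components
    groups = {}
    for u in parent:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())

print(components({"c", "acb"}))  # 'c' and 'acb' merge; 'a', 'b', 'ac', 'cb' stay apart
```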
33 If S is a subset of a substitutable language L, then u ≐S v implies u ≡L v. [sent-161, score-0.57]
34 Recall that we write [u]L for the congruence class of u with respect to the syntactic congruence of L. [sent-164, score-0.631]
35 Constructing the Grammar. Algorithm 1: generate a grammar from a substitution graph. [sent-166, score-0.458]
36 Data: a substitution graph SG = (V, E); Result: a context-free grammar Ĝ. Let Σ be the set of all the letters used in the nodes of SG; compute V̂, the set of components of SG; store the map V → V̂ defined by u ↦ [u]; let Ŝ be the unique element of V̂ corresponding to the context (λ, λ). [sent-167, score-0.508]
37 Set P̂ = {}; for u ∈ V do: if |u| > 1 then for all v, w such that u = vw do P̂ ← P̂ ∪ {[u] → [v][w]}; else P̂ ← P̂ ∪ {[u] → u}; output Ĝ = ⟨Σ, V̂, P̂, Ŝ⟩. Given the SG we now construct a grammar Ĝ = ⟨Σ, V̂, P̂, Ŝ⟩. [sent-168, score-0.401]
38 For every node u in the substitution graph, if |u| > 1, then for every pair of non-empty strings v, w such that u = vw, we add a production [u] → [v][w]. [sent-179, score-0.412]
39 Given some substrings such that [u] = [v][w], we can consider this as saying that any element of the congruence class [v], concatenated with any element of the congruence class [w], gives an element of the congruence class [u]. [sent-184, score-0.828]
40 Phrased like this, it is clear that this can be read directly as a production rule: to generate an element of [u], one way is to generate a string from [v] and then a string from [w]. [sent-185, score-0.378]
41 Algorithm 1 defines the process of generating a grammar from a substitution graph. [sent-187, score-0.458]
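A hedged Python rendering of Algorithm 1 (the helper name build_grammar and the integer component labels are illustrative, not the paper's notation):

```python
# Sketch: Algorithm 1. Components of the substitution graph become
# non-terminals; every split u = vw yields a binary rule [u] -> [v][w],
# and single letters get unary rules [a] -> a.

def build_grammar(comps, sample):
    comp_of = {u: idx for idx, comp in enumerate(comps) for u in comp}
    # The start symbol is the component of the full sample strings, that
    # is, the component whose members occur in the context (lambda, lambda).
    start = comp_of[next(iter(sample))]
    productions = set()
    for u, label in comp_of.items():
        if len(u) > 1:
            for k in range(1, len(u)):
                productions.add((label, (comp_of[u[:k]], comp_of[u[k:]])))
        else:
            productions.add((label, u))          # unary rule [a] -> a
    return start, productions

comps = [{"c", "acb"}, {"a"}, {"b"}, {"ac"}, {"cb"}]
start, prods = build_grammar(comps, {"c", "acb"})
print(start)                         # 0, the component of 'c' and 'acb'
for p in sorted(map(str, prods)):
    print(p)                         # e.g. (0, (1, 4)) meaning S -> [a][cb]
```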
42 G = grammar generating the empty language; while true do: read the next string sn; if sn ∉ L(G) then set SG to be the substitution graph generated from {s1, . . . , sn} [sent-195, score-0.615]
43 and set G to be the grammar generated from SG; end; output G; end. [sent-198, score-0.377]
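The outer loop of Algorithm 2 can then be sketched as below, assuming the components and build_grammar helpers from the sketches above are in scope; membership sn ∈ L(G) is decided here with CKY, which applies because the constructed grammar is in Chomsky normal form (binary non-terminal rules plus unary terminal rules).

```python
# Sketch: the SGL loop. The hypothesis is only recomputed when the next
# string is not already generated, as in Algorithm 2.

def cky_accepts(start, productions, s):
    if not s:
        return False
    binary = [(h, b) for h, b in productions if isinstance(b, tuple)]
    unary = [(h, b) for h, b in productions if not isinstance(b, tuple)]
    n = len(s)
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(s):
        chart[i][i + 1] = {h for h, b in unary if b == ch}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for h, (x, y) in binary:
                    if x in chart[i][k] and y in chart[k][j]:
                        chart[i][j].add(h)
    return start in chart[0][n]

def sgl(stream):
    seen, grammar = [], None              # None stands for the empty language
    for s in stream:
        if grammar is None or not cky_accepts(*grammar, s):
            seen.append(s)
            grammar = build_grammar(components(set(seen)), set(seen))
        yield grammar

for g in sgl(["c", "acb", "aacbb"]):      # a presentation of {a^n c b^n}
    pass
print(g[0], sorted(map(str, g[1])))       # the final hypothesis
```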
44 The second example allows a discussion of the induced grammar for a particular language. [sent-201, score-0.35]
45 The grammar will thus have the productions [aa] → [a][a], which is S → SS, and [a] → a, which is S → a. [sent-207, score-0.503]
46 Thus the learned grammar will be ⟨{a}, {S}, {S → SS, S → a}, S⟩, which generates the language a+. [sent-208, score-0.35]
47 Example 4 Consider the language L = {a^n c b^n : n ≥ 0}, which can be represented, for instance, by the set of productions {S → aSb, S → c}. [sent-209, score-0.607]
48 The grammar generated from this sample will then have rules of the form C_{j+k} → C_j B_k for all k > 0, j ∈ Z; C_{j−k} → A_k C_j for all k > 0, j ∈ Z; C_0 → c; A_{i+j} → A_i A_j for all i, j > 0; A_1 → a; B_{i+j} → B_i B_j for all i, j > 0; B_1 → b. [sent-213, score-0.371]
49 This grammar defines the language L, but as can be seen the set of non-terminals can be substantially larger than that of the original grammar. [sent-214, score-0.578]
50 Computing, for a given pair of strings u, v, all of the pairs of substrings u′, v′ such that u′ ≐S v′ can be done in time less than L²; thus, assuming a constant-time map from substrings to nodes in the graph, we can compute all the edges in the graph in time less than L²n². [sent-225, score-0.424]
51 Proof. Our main result is as follows: Theorem 9 SGL polynomially identifies in the limit the class of substitutable (context-free) languages. [sent-231, score-0.4]
52 First, we shall show that for any sample, the language defined by the grammar inferred from the sample will be a subset of the target language; another way of saying this is that the learner will never make a false positive error. [sent-233, score-0.641]
53 Proof that the Hypothesis is Not Too Large. First of all we shall show that, for any finite sample of a substitutable language, the grammar derived from that sample does not define a language that is too large. [sent-238, score-1.004]
54 If |w| = 1, then by the construction of the grammar there is a unary rule [w] → w. [sent-242, score-0.378]
55 Therefore the generated grammar will always include the training sample. [sent-246, score-0.35]
56 Since we have a production [u] → [v][w] in P̂, there must be strings v′, w′ such that v′w′ is a string in the same component as u, and v′ ≡L v and w′ ≡L w and u ≡L v′w′. [sent-258, score-0.401]
57 Theorem 12 For any positive sample of a substitutable context-free language L, if Ĝ is the resulting grammar, then L(Ĝ) ⊆ L. [sent-260, score-0.591]
58 Proof that the Hypothesis is Large Enough. To prove that the hypothesis is large enough, we first need to define a characteristic set, that is to say a subset of a target language L∗ which will ensure that the algorithm outputs a grammar G such that L(G) = L∗. [sent-265, score-0.673]
59 Convergence. We now must show that for any substitutable context-free grammar G, if the sample C contains the characteristic set CS(G), that is, CS(G) ⊆ C ⊆ L(G), and if Ĝ is the output SGL produces on the sample C, then L(Ĝ) = L(G). [sent-287, score-0.829]
60 Theorem 15 For any substitutable CFG G, when the sample contains the characteristic set, L(G) ⊆ L(Ĝ). [sent-340, score-0.458]
61 In this section we will consider some algorithms for doing this; the languages defined will be unchanged, but the algorithms will be more efficient. [sent-347, score-0.396]
62 Secondly, one can remove non-terminals, one by one, and test whether the grammar continues to accept all of the sample, and thus arrive at a minimal CFG. [sent-350, score-0.35]
63 In particular, if we have one component that contains the strings u and v, where v ≺ u, and another that contains the strings lur and lvr, we can reduce the graph by removing the string lur. [sent-355, score-0.688]
64 Definition 16 (Reduction system) A reduction system T over an alphabet Σ is a finite set of pairs of strings T ⊂ Σ∗ × Σ∗, where each pair (u, v), normally written u →T v, is called a reduction rule and satisfies v ≺ u. [sent-360, score-0.365]
65 Being confluent and Noetherian means that there is a simple algorithm to decide this congruence: each string belongs to exactly one congruence class, identified by its unique normal form. [sent-367, score-0.389]
66 Given a reduction system one can define a language as the union of finitely many congruence classes. [sent-370, score-0.528]
67 Thus, given a reduction system T and a set of irreducible strings A over an alphabet Σ, we can define a language L(T, A) = {v ∈ Σ∗ : ∃a ∈ A such that v →∗T a}. [sent-371, score-0.557]
68 The language defined by L(T, {c}) is exactly the palindrome language over a, b with center marker c. [sent-377, score-0.482]
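A minimal sketch of membership testing for L(T, A); since the extract does not list the rules of the palindrome example, the two-rule system below is an assumed natural choice for it.

```python
# Sketch: deciding membership in L(T, A) by exhaustive rewriting. For a
# confluent and Noetherian system the order of rule application does not
# matter: every string reaches its unique normal form.

def normal_form(s, rules):
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)   # one reduction step (rhs < lhs)
                changed = True
                break
    return s

def in_language(s, rules, irreducibles):
    return normal_form(s, rules) in irreducibles

T = [("aca", "c"), ("bcb", "c")]             # assumed rules for the example
print(in_language("abcba", T, {"c"}))        # True:  abcba -> aca -> c
print(in_language("abcab", T, {"c"}))        # False: already irreducible
```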
69 Assuming that we have a set of examples generated from a substitutable CFG, and that this set contains the characteristic set, it is easy to prove the following lemmas. [sent-385, score-0.437]
70 Subject to this caveat, we can define a small reduction system that represents the same language, namely the set of all strings that reduce to the least string w(S) (w.r.t. the ordering ≺) in the language, but without the large number of redundant rules that the simple grammar construction produces. [sent-408, score-0.375]
72 Substitutable Languages. This section contains some examples of substitutable CFLs, as well as some simple CFLs that are not substitutable, and a discussion of the relationship of this class of languages to other standard classes. [sent-414, score-0.738]
73 This is without doubt a restricted class of languages, but it contains some interesting examples. [sent-415, score-0.396]
74 Since we are learning under a Gold-style paradigm, we cannot hope to learn all finite languages (Gold, 1967). [sent-417, score-0.422]
75 Indeed, the more complex the languages we hope to learn, the smaller the set of finite languages we will be able to learn. [sent-418, score-0.792]
76 Examples. We will now give some examples of simple languages that are substitutable, as well as some simple languages that are not. [sent-420, score-1.134]
77 Given an alphabet Σ, the following languages are substitutable: • The set Σ∗ of all strings over the alphabet is substitutable. [sent-423, score-0.768]
78 • Any language consisting of only one string (namely a singleton language) is substitutable. [sent-425, score-0.57]
79 • The languages {a^n : n > 0}, for all a ∈ Σ, are substitutable. [sent-427, score-0.396]
80 We now turn our attention to simple languages that are not substitutable. [sent-429, score-0.396]
81 Relation to Other Language Classes. The state of the art concerning polynomial identification of context-free languages from positive examples only is quite limited. [sent-441, score-0.444]
82 The fact that substitutable context-free languages can be represented by reduction systems proves that they are properly included within the class of congruential languages (Book and Otto, 1993). [sent-443, score-1.17]
83 The examples previously presented show that they are incomparable with the classes of finite languages and regular languages. [sent-444, score-0.439]
84 The most important subclass of context-free languages that is polynomially identifiable is that of very simple grammars (Yokomori, 2003). [sent-445, score-0.659]
85 Some very simple grammars are not substitutable: an example is the regular grammar with productions S → aNP, S → bN, N → n, P → rP, P → p. [sent-447, score-0.685]
86 We conjecture that all substitutable context free languages are NTS languages. [sent-457, score-0.738]
87 This is put forward as strong evidence in favour of innately specified, language-specific knowledge: we shall refer to this view as linguistic nativism. [sent-479, score-0.378]
88 The learned grammar correctly generated the correct form and did not generate the incorrect form. [sent-508, score-0.35]
89 We can see that in the learned grammar “the man” will be congruent to “the man who is hungry”, since there is a pair of sentences that differ only in this substring. [sent-510, score-0.635]
90 One of the derivations for this sentence would be: [is the man hungry ? [sent-514, score-0.371]
91 The rule [uv] → [u][v] appears at first sight vacuous, yet it is sufficient to define correct languages for an interesting class of grammars. [sent-564, score-0.396]
92 Yokomori (2003) shows that the class of very simple languages can be polynomially identified in the limit. [sent-568, score-0.454]
93 Recall that a regular language is reversible if whenever uw and vw are in the language then ux is in the language if and only if vx is in the language. [sent-576, score-0.821]
94 Nonetheless, following the work on reversible regular languages, one can think of a definition of k-substitutable languages and may obtain positive learning results on that hierarchy of languages. [sent-580, score-0.482]
95 We can also compare substitutability to µ-distinguishability for the inference of regular languages (Ron et al.). [sent-581, score-0.581]
96 There are some technical problems to be overcome, since the number of syntactic congruence classes will be infinite for non regular languages, and thus the distinguishability will not in general be bounded from below. [sent-587, score-0.432]
97 Algorithms for learning regular languages focus on identifying the states of a deterministic automaton. [sent-589, score-0.439]
98 From a learnability point of view this is interesting because it is purely syntactic—it is not semantic as it does not depend on the representation of the language but only on the language itself. [sent-596, score-0.481]
99 One of the weaknesses in the work is the fact that we do not yet have a grammatical characterisation of substitutability, nor an algorithm for determining whether a grammar defines a substitutable language. [sent-600, score-0.777]
100 Substitutable languages have a property that allows a trivial procedure for determining when two substrings are congruent, but it is easy to think of much more robust methods of determining this that rely on more complex properties of the context distributions. [sent-603, score-0.498]
wordName wordTfidf (topN-words)
[('languages', 0.396), ('grammar', 0.35), ('substitutable', 0.342), ('congruence', 0.242), ('language', 0.228), ('hungry', 0.181), ('grammars', 0.171), ('strings', 0.17), ('lur', 0.151), ('string', 0.147), ('syntactic', 0.147), ('productions', 0.121), ('man', 0.119), ('substitutability', 0.111), ('substitution', 0.108), ('sg', 0.105), ('substrings', 0.102), ('alphabet', 0.101), ('anguages', 0.101), ('congruent', 0.101), ('lark', 0.101), ('monoid', 0.101), ('ubstitutable', 0.101), ('yraud', 0.101), ('characteristic', 0.095), ('lvr', 0.091), ('sgl', 0.091), ('grammatical', 0.085), ('production', 0.084), ('substring', 0.071), ('linguistic', 0.068), ('sentences', 0.065), ('dinner', 0.06), ('nts', 0.06), ('uent', 0.06), ('polynomially', 0.058), ('auxiliary', 0.054), ('transitive', 0.053), ('contexts', 0.052), ('vw', 0.051), ('chomsky', 0.05), ('higuera', 0.05), ('reversibility', 0.05), ('graph', 0.05), ('polynomial', 0.048), ('sentence', 0.045), ('syntactically', 0.043), ('reversible', 0.043), ('regular', 0.043), ('sub', 0.043), ('shall', 0.042), ('adriaans', 0.04), ('asb', 0.04), ('favour', 0.04), ('nativist', 0.04), ('harris', 0.039), ('reduction', 0.036), ('gold', 0.035), ('subclass', 0.034), ('vi', 0.034), ('aa', 0.032), ('english', 0.031), ('inference', 0.031), ('cbn', 0.03), ('eating', 0.03), ('fronting', 0.03), ('iil', 0.03), ('nizergues', 0.03), ('noetherian', 0.03), ('pullum', 0.03), ('construction', 0.028), ('word', 0.028), ('empty', 0.028), ('subclasses', 0.028), ('sn', 0.027), ('learn', 0.026), ('reductions', 0.026), ('clause', 0.026), ('derivations', 0.026), ('constituents', 0.026), ('marker', 0.026), ('polar', 0.026), ('bn', 0.025), ('purely', 0.025), ('identi', 0.024), ('closure', 0.024), ('la', 0.023), ('cl', 0.023), ('exposed', 0.023), ('length', 0.023), ('iff', 0.022), ('system', 0.022), ('constituent', 0.021), ('terminal', 0.021), ('secondly', 0.021), ('children', 0.021), ('ron', 0.021), ('sample', 0.021), ('derivation', 0.02), ('anp', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 67 jmlr-2007-Polynomial Identification in the Limit of Substitutable Context-free Languages
2 0.061987944 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
Author: Marta Arias, Roni Khardon, Jérôme Maloberti
Abstract: The paper introduces LOGAN-H, a system for learning first-order function-free Horn expressions from interpretations. The system is based on an algorithm that learns by asking questions and that was proved correct in previous work. The current paper shows how the algorithm can be implemented in a practical system, and introduces a new algorithm based on it that avoids interaction and learns from examples only. The LOGAN-H system implements these algorithms and adds several facilities and optimizations that allow efficient applications in a wide range of problems. As one of the important ingredients, the system includes several fast procedures for solving the subsumption problem, an NP-complete problem that needs to be solved many times during the learning process. We describe qualitative and quantitative experiments in several domains. The experiments demonstrate that the system can deal with varied problems, large amounts of data, and that it achieves good classification accuracy. Keywords: inductive logic programming, subsumption, bottom-up learning, learning with queries
3 0.040088553 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
Author: Evgeniy Gabrilovich, Shaul Markovitch
Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge
Author: Ariel Elbaz, Homin K. Lee, Rocco A. Servedio, Andrew Wan
Abstract: We consider a natural framework of learning from correlated data, in which successive examples used for learning are generated according to a random walk over the space of possible examples. A recent paper by Bshouty et al. (2003) shows that the class of polynomial-size DNF formulas is efficiently learnable in this random walk model; this result suggests that the Random Walk model is more powerful than comparable standard models of learning from independent examples, in which similarly efficient DNF learning algorithms are not known. We give strong evidence that the Random Walk model is indeed more powerful than the standard model, by showing that if any cryptographic one-way function exists (a universally held belief in cryptography), then there is a class of functions that can be learned efficiently in the Random Walk setting but not in the standard setting where all examples are independent. Keywords: random walks, uniform distribution learning, cryptographic hardness, correlated data, PAC learning
5 0.023561157 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
Author: Sofus A. Macskassy, Foster Provost
Abstract: This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well—well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes—that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection. Keywords: relational learning, network learning, collective inference, collective classification, networked data, probabilistic relational models, network analysis, network data
6 0.02208834 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.021907726 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
8 0.01995373 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
9 0.019706385 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
10 0.019532908 54 jmlr-2007-Measuring Differentiability: Unmasking Pseudonymous Authors
11 0.019202048 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
12 0.017795479 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
15 0.016273655 72 jmlr-2007-Relational Dependency Networks
16 0.015725564 90 jmlr-2007-Value Regularization and Fenchel Duality
17 0.015723191 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
18 0.015494305 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
19 0.015436969 31 jmlr-2007-Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm
20 0.015137347 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
topicId topicWeight
[(0, 0.1), (1, 0.038), (2, -0.014), (3, -0.008), (4, -0.037), (5, 0.063), (6, 0.034), (7, -0.004), (8, 0.027), (9, -0.112), (10, 0.02), (11, 0.111), (12, -0.061), (13, 0.108), (14, 0.027), (15, 0.026), (16, -0.013), (17, -0.144), (18, 0.071), (19, 0.016), (20, 0.081), (21, -0.023), (22, -0.053), (23, 0.041), (24, 0.097), (25, 0.109), (26, 0.053), (27, -0.022), (28, 0.128), (29, 0.027), (30, 0.108), (31, 0.059), (32, -0.186), (33, 0.234), (34, -0.04), (35, 0.031), (36, -0.092), (37, -0.382), (38, 0.596), (39, -0.144), (40, -0.222), (41, -0.122), (42, -0.015), (43, 0.19), (44, 0.126), (45, -0.001), (46, -0.069), (47, 0.041), (48, -0.117), (49, -0.157)]
simIndex simValue paperId paperTitle
same-paper 1 0.9800477 67 jmlr-2007-Polynomial Identification in the Limit of Substitutable Context-free Languages
2 0.25883022 47 jmlr-2007-Learning Horn Expressions with LOGAN-H
4 0.10253914 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
5 0.086100027 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh
Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling
6 0.079381816 15 jmlr-2007-Bilinear Discriminant Component Analysis
7 0.076933391 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
8 0.070589714 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
10 0.069893301 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
11 0.068663724 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
13 0.06757278 42 jmlr-2007-Infinitely Imbalanced Logistic Regression
15 0.066620409 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
16 0.065007083 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
17 0.060930274 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
18 0.05767642 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
19 0.055103943 34 jmlr-2007-From External to Internal Regret (Special Topic on the Conference on Learning Theory 2005)
20 0.054899402 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
topicId topicWeight
[(10, 0.015), (12, 0.021), (15, 0.029), (28, 0.046), (40, 0.037), (45, 0.011), (48, 0.027), (58, 0.534), (60, 0.044), (80, 0.022), (85, 0.032), (98, 0.075)]
simIndex simValue paperId paperTitle
same-paper 1 0.75197804 67 jmlr-2007-Polynomial Identification in the Limit of Substitutable Context-free Languages
2 0.20794232 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
Author: Roland Nilsson, José M. Peña, Johan Björkegren, Jesper Tegnér
Abstract: We analyze two different feature selection problems: finding a minimal feature set optimal for classification (MINIMAL - OPTIMAL) vs. finding all features relevant to the target variable (ALL RELEVANT ). The latter problem is motivated by recent applications within bioinformatics, particularly gene expression analysis. For both problems, we identify classes of data distributions for which there exist consistent, polynomial-time algorithms. We also prove that ALL - RELEVANT is much harder than MINIMAL - OPTIMAL and propose two consistent, polynomial-time algorithms. We argue that the distribution classes considered are reasonable in many practical cases, so that our results simplify feature selection in a wide range of machine learning tasks. Keywords: learning theory, relevance, classification, Markov blanket, bioinformatics
Author: Miroslav Dudík, Steven J. Phillips, Robert E. Schapire
Abstract: We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including ℓ1, ℓ2, ℓ2², and ℓ1 + ℓ2² style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models. Keywords: maximum entropy, density estimation, regularization, iterative scaling, species distribution modeling
4 0.19963783 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Many classification tasks require estimation of output class probabilities for use as confidence scores or for inference integrated with other models. Probability estimates derived from large margin classifiers such as support vector machines (SVMs) are often unreliable. We extend SVM large margin classification to GiniSVM maximum entropy multi-class probability regression. GiniSVM combines a quadratic (Gini-Simpson) entropy based agnostic model with a kernel based similarity model. A form of Huber loss in the GiniSVM primal formulation elucidates a connection to robust estimation, further corroborated by the impulsive noise filtering property of the reverse water-filling procedure to arrive at normalized classification margins. The GiniSVM normalized classification margins directly provide estimates of class conditional probabilities, approximating kernel logistic regression (KLR) but at reduced computational cost. As with other SVMs, GiniSVM produces a sparse kernel expansion and is trained by solving a quadratic program under linear constraints. GiniSVM training is efficiently implemented by sequential minimum optimization or by growth transformation on probability functions. Results on synthetic and benchmark data, including speaker verification and face detection data, show improved classification performance and increased tolerance to imprecision over soft-margin SVM and KLR. Keywords: support vector machines, large margin classifiers, kernel regression, probabilistic models, quadratic entropy, Gini index, growth transformation
5 0.19943894 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
6 0.19938219 20 jmlr-2007-Combining PAC-Bayesian and Generic Chaining Bounds
7 0.19840539 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
8 0.19730708 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
9 0.1969642 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
10 0.19651119 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
11 0.19571529 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
12 0.19559452 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
13 0.19487202 61 jmlr-2007-On the Consistency of Multiclass Classification Methods (Special Topic on the Conference on Learning Theory 2005)
14 0.19419771 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
15 0.19396493 43 jmlr-2007-Integrating Naïve Bayes and FOIL
16 0.19376817 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
17 0.19348621 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
18 0.1929356 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
19 0.19216067 77 jmlr-2007-Stagewise Lasso
20 0.19134954 39 jmlr-2007-Handling Missing Values when Applying Classification Models