jmlr jmlr2011 jmlr2011-65 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion
Reference: text
sentIndex sentText sentNum sentScore
1 Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. [sent-14, score-1.075]
2 The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. [sent-15, score-1.877]
3 They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. [sent-16, score-1.123]
4 We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. [sent-18, score-2.881]
5 The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. [sent-20, score-1.104]
6 The teaching dimension of the class C is the maximum teaching dimension over all concepts in C. [sent-57, score-1.947]
7 The teaching dimension however does not always seem to capture the intuitive idea of cooperation in teaching and learning. [sent-60, score-1.874]
8 Each singleton concept {xi } has a teaching dimension of 1, since the single positive example (xi , +) is sufficient for determining {xi }. [sent-66, score-1.095]
9 In contrast to that, the empty concept has a teaching dimension of n—every example has to be presented. [sent-68, score-1.097]
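To make the two numbers above concrete, here is a small brute-force sketch in Python (mine, not from the paper). The class C0 of all singletons plus the empty concept is reconstructed from the description in the preceding sentences; the function searches labeled samples by increasing size for the smallest one that is consistent with exactly one concept.

```python
from itertools import combinations, product

def cons(sample, concepts):
    """Concepts consistent with a labeled sample {(instance, label), ...}."""
    return [c for c in concepts
            if all((x in c) == lab for x, lab in sample)]

def teaching_dim(target, concepts, instances):
    """Size of a smallest sample whose only consistent concept is `target`."""
    for k in range(len(instances) + 1):
        for xs in combinations(instances, k):
            for labs in product([True, False], repeat=k):
                sample = set(zip(xs, labs))
                if cons(sample, concepts) == [target]:
                    return k
    raise ValueError("target not identifiable")

n = 4
X = [f"x{i}" for i in range(1, n + 1)]
# C0: all singletons plus the empty concept, as described in the text.
C0 = [frozenset([x]) for x in X] + [frozenset()]

for c in C0:
    print(set(c) or "{}", teaching_dim(c, C0, X))
# Expected: every singleton gets 1, the empty concept gets n = 4.
```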
10 If this idea is iteratively propagated by both partners, one can refine teaching sets step by step, ending up with a framework for highly efficient teaching and learning without any coding tricks. [sent-85, score-1.841]
11 We show that the resulting variant of the teaching dimension—called the subset teaching dimension (STD)—is not only a uniform lower bound of the teaching dimension but can be constant where the original teaching dimension is exponential, even in cases where only one iteration is needed. [sent-88, score-3.765]
12 For example, as illustrated above, the STD of the class of monomials over m ≥ 2 variables is 2, in contrast to its original teaching dimension of 2^m . [sent-89, score-1.027]
13 Some examples however will reveal a nonmonotonicity of the subset teaching dimension: some classes possess subclasses with a higher subset teaching dimension, which is at first glance not very intuitive. [sent-90, score-1.9]
14 We will explain below why in a cooperative model such a nonmonotonicity does not have to contradict intuition; additionally we introduce a second model of cooperative teaching and learning that results in a monotonic dimension, called the recursive teaching dimension (RTD). [sent-91, score-1.997]
15 Recursive teaching is based on the idea of letting the teacher and the learner exploit a hierarchical structure that is intrinsic in the concept class. [sent-92, score-1.251]
16 At every stage, the recursive teaching sets for the concepts that are easiest to teach are the teaching sets for these concepts with respect to the class of remaining concepts. [sent-96, score-1.971]
17 The recursive teaching dimension is the size of the largest recursive teaching set constructed this way. [sent-97, score-1.93]
18 Both our teaching protocols significantly improve sample efficiency compared to previously studied variants of the teaching dimension. [sent-103, score-1.844]
19 The teaching dimension model, independently introduced by Goldman and Kearns (1991; 1995) and by Shinohara and Miyano (1991), is concerned with the sample complexity of teaching arbitrary consistent learners. [sent-111, score-1.865]
20 Samples that will allow any consistent learner to identify the target concept are called teaching sets; the maximum size of minimal teaching sets of all concepts in the underlying concept class C is called the teaching dimension of C. [sent-112, score-3.198]
21 In the models of teaching and learning to be defined below, we will always assume that the goal in an interaction between a teacher and a learner is to make the learner identify a (finite) concept over a (finite) instance space X. [sent-133, score-1.325]
22 All teaching and learning models introduced below will involve a very simple protocol between a teacher and a learner (and an adversary). [sent-202, score-1.162]
23 The goal in sample-efficient teaching and learning is to design protocols that, for every concept class C, are successful for C while reducing the (worst-case) size of the samples the teacher presents to the learner for any target concept in C. [sent-224, score-1.447]
24 2 Protocols Using Minimal Teaching Sets and Balbach Teaching Sets The fundamental model of teaching we consider here is based on the notion of minimal teaching sets, which is due to Goldman and Kearns (1995) and Shinohara and Miyano (1991). [sent-227, score-1.836]
25 A teaching set allows a learning algorithm to uniquely identify a concept in the concept class C. [sent-231, score-1.197]
26 The teaching dimension of i in C is the size of such a minimal teaching set, that is, TD(i,C) = min{|S| | Cons(S,C) = {i}}, the worst case of which defines the teaching dimension of C, that is, TD(C) = max{TD(i,C) | 1 ≤ i ≤ z}. [sent-233, score-2.817]
27 The teaching dimension of a concept class C is then a measure of the worst case sample size required in this protocol with respect to Ad∗ when teaching/learning any concept in C. [sent-245, score-1.282]
28 The reason that, for every concept class C ∈ Cz , the protocol P(C) is successful (with respect to Ad∗ ) is simply that a teaching set eliminates all but one concept due to inconsistency. [sent-246, score-1.257]
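A minimal sketch of one round of this protocol, under the same brute-force set representation as in the earlier snippet and against the trivial adversary mentioned in the text (names such as `teacher` and `consistent_learner` are mine): the teacher sends a smallest teaching set, and any consistent learner is then left with exactly one candidate.

```python
from itertools import combinations, product

def cons(sample, concepts):
    return [c for c in concepts if all((x in c) == lab for x, lab in sample)]

def minimal_teaching_set(target, concepts, instances):
    """A smallest sample that eliminates, by inconsistency, every concept but `target`."""
    for k in range(len(instances) + 1):
        for xs in combinations(instances, k):
            for labs in product([True, False], repeat=k):
                sample = frozenset(zip(xs, labs))
                if cons(sample, concepts) == [target]:
                    return sample
    return None

def teacher(target, concepts, instances):
    # The teacher simply presents a minimal teaching set for the target.
    return minimal_teaching_set(target, concepts, instances)

def consistent_learner(sample, concepts):
    """Any consistent learner: output a concept consistent with the sample."""
    candidates = cons(sample, concepts)
    assert len(candidates) == 1, "a teaching set leaves exactly one consistent concept"
    return candidates[0]

X = ["x1", "x2", "x3"]
C = [frozenset(s) for s in (["x1"], ["x2"], ["x3"], [])]
target = frozenset()
S = teacher(target, C, X)
print(sorted(S), consistent_learner(S, C) == target)   # three negative examples, True
```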
29 For instance, assume a learner knows that all but one concept C(i) have a teaching set of size one and that the teacher will teach using teaching sets. [sent-251, score-2.173]
30 If a teacher knew that a learner eliminates by consistency and by sample size then the teacher could consequently reduce the size of some teaching sets (e. [sent-254, score-1.253]
31 The Balbach teaching dimension BTD(i,C) of i in C is defined by BTD(i,C) = BTDκ (i,C) and the Balbach teaching dimension BTD(C) of the class C is BTD(C) = max{BTD(i,C) | 1 ≤ i ≤ z}. [sent-271, score-1.902]
32 Balbach (2008) denotes this by IOTTD, called iterated optimal teacher teaching dimension; we deviate from this notation for the sake of convenience. [sent-285, score-1.063]
33 Not only the protocol based on the teaching dimension (Protocol 4), but also the protocol based on the Balbach teaching dimension (Protocol 6) yields only valid teacher/learner pairs according to this definition—a consequence of Theorem 10. [sent-380, score-1.988]
34 It is important to notice, first of all, that in parts of the existing literature, teaching sets and teaching dimension are defined via properties of sets rather than properties of representations of sets, see Balbach (2008) and Kobayashi and Shinohara (2009). [sent-409, score-1.854]
35 Each concept {xi } has the unique minimal teaching set {(xi , +)} in this class, whereas the empty concept only has a teaching set of size n, namely {(x1 , −), . . . , (xn , −)}. [sent-498, score-2.126]
36 The idea of elimination by size allows a learner to conjecture the empty concept as soon as two examples have been provided, due to the fact that all other concepts possess a teaching set of size one. [sent-502, score-1.207]
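The following toy sketch spells out one simplistic reading of the elimination-by-size idea in the sentence above (my paraphrase, not Balbach's formal definition): a learner who assumes the teacher always uses smallest possible samples may discard any still-consistent concept whose minimal teaching set is smaller than the sample received so far. On C0 this indeed settles on the empty concept after only two negative examples.

```python
from itertools import combinations, product

def cons(sample, concepts):
    return [c for c in concepts if all((x in c) == lab for x, lab in sample)]

def teaching_dim(target, concepts, instances):
    for k in range(len(instances) + 1):
        for xs in combinations(instances, k):
            for labs in product([True, False], repeat=k):
                if cons(set(zip(xs, labs)), concepts) == [target]:
                    return k

def size_aware_candidates(sample, concepts, instances):
    """Consistent concepts that a 'the teacher uses minimal samples' assumption cannot rule out."""
    return [c for c in cons(sample, concepts)
            if teaching_dim(c, concepts, instances) >= len(sample)]

n = 5
X = [f"x{i}" for i in range(1, n + 1)]
C0 = [frozenset([x]) for x in X] + [frozenset()]

sample = {("x1", False), ("x2", False)}          # two negative examples
print(cons(sample, C0))                          # still n - 2 + 1 = 4 consistent concepts
print(size_aware_candidates(sample, C0, X))      # only the empty concept survives
```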
37 In terms of teaching sets this means to reduce the teaching sets to their minimal subsets that are not contained in minimal teaching sets for other concepts in the given concept class. [sent-505, score-2.946]
38 In fact, a technicality in the definition of the Balbach teaching dimension (Definition 5) prevents the Balbach teaching dimension from being 1 unless the teaching dimension itself is already 1, as the following proposition states. [sent-506, score-2.835]
39 So, if the Balbach model improves on the worst case teaching complexity, it does so only by improving the teaching dimension to a value of at least 2. [sent-519, score-1.854]
40 1 The Model We formalize the idea of cooperative teaching and learning using subsets of teaching sets as follows. [sent-521, score-1.849]
41 The subset teaching dimension STD(c,C) of c in C is defined by STD(c,C) = STDκ (c,C) and we denote by STS(c,C) = STSκ (c,C) the set of all minimal subset teaching sets for c in C. [sent-529, score-1.914]
42 Note that the example of the concept class C0 establishes that the subset teaching dimension can be 1 even if the teaching dimension is larger, in contrast to Proposition 15. [sent-540, score-2.061]
43 The definition of STS(c,C) induces a protocol for teaching and learning: For a target concept c, a teacher presents the examples in a subset teaching set for c to the learner. [sent-541, score-2.185]
44 The learner will also be able to pre-compute all subset teaching sets for all concepts and determine the target concept from the sample provided by the teacher. [sent-542, score-1.186]
45 Note that Definition 16 does not presume any special order of the concept representations or of the instances, that is, teacher and learner do not have to agree on any such order to make use of the teaching and learning protocol. [sent-552, score-1.251]
46 That means, given a special concept class C, the computation of its subset teaching sets does not involve any special coding trick depending on C—it just follows a general rule. [sent-553, score-1.115]
47 Hence {(x1 , −), (x3 , +)} is consistent with both c2 and c3 and contains a subset teaching set for c2 as well as a subset teaching set for c3 . [sent-560, score-1.871]
48 concept           x1  x2  x3   STS0                     STS1
{x1 , x2 , x3 }   +   +   +    {(x1 , +)}               {(x1 , +)}
{x2 , x3 }        −   +   +    {(x1 , −), (x2 , +)}     {(x1 , −)}, {(x2 , +)}
{x3 }             −   −   +    {(x2 , −), (x3 , +)}     {(x2 , −)}, {(x3 , +)}
{}                −   −   −    {(x3 , −)}               {(x3 , −)}
Table 1: Iterated subset teaching sets for the class Cθ . [sent-565, score-1.08]
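The reduction described a few sentences above (keep the minimal subsets of a concept's teaching sets that are not contained in minimal teaching sets for other concepts) can be prototyped directly. The sketch below is one possible reading of the iteration; the paper's Definition 16 is the authoritative version. On the chain class of Table 1 it reproduces the STS0 and STS1 columns.

```python
from itertools import combinations, product

def cons(sample, concepts):
    return [c for c in concepts if all((x in c) == lab for x, lab in sample)]

def all_samples(instances, k):
    for xs in combinations(instances, k):
        for labs in product([True, False], repeat=k):
            yield frozenset(zip(xs, labs))

def minimal_teaching_sets(target, concepts, instances):
    """STS0: all smallest samples whose only consistent concept is `target`."""
    for k in range(len(instances) + 1):
        found = [s for s in all_samples(instances, k)
                 if cons(s, concepts) == [target]]
        if found:
            return found
    return []

def refine(sts, concepts):
    """One refinement step (my reading): keep the smallest subsets of a concept's
    current teaching sets that are not contained in any current set of another concept."""
    new_sts = {}
    for c in concepts:
        others = [s for c2 in concepts if c2 != c for s in sts[c2]]
        candidates = set()
        for s in sts[c]:
            for k in range(len(s) + 1):
                subs = [frozenset(t) for t in combinations(sorted(s), k)
                        if not any(frozenset(t) <= o for o in others)]
                if subs:
                    candidates.update(subs)
                    break
        if not candidates:                 # degenerate case: keep the previous sets
            new_sts[c] = set(sts[c])
            continue
        m = min(len(s) for s in candidates)
        new_sts[c] = {s for s in candidates if len(s) == m}
    return new_sts

def subset_teaching_sets(concepts, instances):
    sts = {c: set(minimal_teaching_sets(c, concepts, instances)) for c in concepts}
    while True:                            # iterate until the refinement stabilizes
        nxt = refine(sts, concepts)
        if nxt == sts:
            return sts
        sts = nxt

X = ["x1", "x2", "x3"]
chain = [frozenset(["x1", "x2", "x3"]), frozenset(["x2", "x3"]),
         frozenset(["x3"]), frozenset()]
sts = subset_teaching_sets(chain, X)
for c in chain:
    print(sorted(c) or "{}", [sorted(s) for s in sts[c]])
# The largest remaining set size (here 1) is the subset teaching dimension.
```

Running the same routine on C0 yields singleton sets for every concept, i.e., a subset teaching dimension of 1 as stated a few sentences above; for the plain class of all two-element concepts over a four-element instance space, the sets do not shrink at all, in line with the observation made a few sentences below.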
49 2 Comparison to the Balbach Teaching Dimension Obviously, when using the trivial adversary, Protocol 17 based on the subset teaching dimension never requires a sample larger than a teaching set; often a smaller sample is sufficient. [sent-567, score-1.875]
50 However, compared to the Balbach teaching dimension, the subset teaching dimension is superior in some cases and inferior in others. [sent-568, score-1.875]
51 The latter may seem unintuitive, but is possible because Balbach’s teaching sets are not restricted to be subsets of the original teaching sets. [sent-569, score-1.818]
52 The only minimal teaching set for a concept {xi , x j } with i ≠ j is the sample Si, j = {(xi , +), (x j , +)}. [sent-604, score-1.065]
53 On the one hand, every subset of every minimal teaching set for a concept c ∈ C is contained in some minimal teaching set for some concept c′ ∈ C with c ≠ c′ . [sent-605, score-1.08]
54 This holds because every other concept in C that is consistent with this sample is a concept containing two instances and thus has a teaching set of size smaller than 3 (= |S|). [sent-609, score-1.212]
55 The main result is that the STD of the class of all monomials is 2, independent of the number m of variables, whereas its teaching dimension is exponential in m and its BTD is linear in m, see Balbach (2008). [sent-613, score-1.027]
56 (ii) Any sufficiently small “different” subset S′ of the unique teaching set in STS0 (M,C)—that is, S′ contains at most two negative examples, but two having Hamming distance 2—is a subset of a teaching set in STS0 (M ′ ,C) for some M ′ of Type 2. [sent-683, score-1.86]
57 This is due to the following observations: (i) S is not a subset of any teaching set S′ in STS1 (M ′ ,C) for some M ′ of Type 1, of Type 3 or of Type 4, since none of these teaching sets contains one positive and one negative example. [sent-691, score-1.839]
58 For this it suffices to see that a minimal teaching set for M ′ in C must contain negative examples, while no minimal teaching set for any D ∈ C with D ≢ M ′ contains any negative examples. [sent-722, score-1.854]
59 In fact, if n was not considered fixed then every concept class C′ would have a superset C (via addition of instances) of lower subset teaching dimension. [sent-748, score-1.08]
60 In the given example, the subset teaching dimension of C is 1, while the subset teaching dimension of C−x1 is 2. [sent-873, score-1.932]
61 1), one observes that a subset teaching set for a concept cs,0 contains only the negative example (xu+1+int(s) , −) distinguishing it from cs,1 (its nearest neighbor in terms of Hamming distance). [sent-887, score-1.094]
62 This intuitive idea of subset teaching sets being used for distinguishing a concept from its nearest neighbors has to be treated with care though. [sent-891, score-1.104]
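The nearest-neighbour intuition of the preceding sentences is easy to make concrete: the symmetric difference between a concept and its Hamming-nearest neighbour yields a single labelled example separating the two. The sketch below (mine) computes that example; as the following sentences warn, such an example is not automatically a subset teaching set.

```python
def hamming(c1, c2, instances):
    """Number of instances on which two concepts (sets of instances) disagree."""
    return sum((x in c1) != (x in c2) for x in instances)

def separating_example(target, neighbour):
    """One labelled example consistent with `target` but not with `neighbour`."""
    x = next(iter(target.symmetric_difference(neighbour)))
    return (x, x in target)

X = [f"x{i}" for i in range(1, 5)]
concepts = [frozenset(["x1", "x2"]), frozenset(["x1", "x2", "x3"]), frozenset(["x4"])]

target = concepts[0]
neighbour = min((c for c in concepts if c != target),
                key=lambda c: hamming(target, c, X))
print(neighbour, hamming(target, neighbour, X))   # the concept {x1, x2, x3}, distance 1
print(separating_example(target, neighbour))      # ('x3', False)
```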
63 According to the intuition explained here, this would suggest {(x1 , −)} being a subset teaching set for ∅ although the subset teaching sets here equal the teaching sets and are all of cardinality 2. [sent-898, score-2.769]
64 To summarize, • nonmonotonicity has an intuitive reason and is not an indication of an ill-defined version of the teaching dimension, • nonmonotonicity would in fact be a consequence of implementing the idea that the existence of specific concepts (e.g. [sent-905, score-1.024]
65 , nearest neighbours) associated with a target concept is beneficial for teaching and learning. [sent-907, score-1.072]
66 So, the STD captures certain intuitions about teaching and learning that monotonic dimensions cannot capture; at the same time monotonicity might in other respects itself be an intuitive property of teaching and learning which then the STD cannot capture. [sent-908, score-1.841]
67 In other words, is there a teaching/learning framework • resulting in a monotonic variant of a teaching dimension and • achieving low teaching complexity results similar to the subset teaching dimension? [sent-914, score-2.797]
68 However, we would like to have a constant dimension for the class of all monomials, as well as, for example, a teaching set of size 1 for the empty concept in our often used concept class C0 . [sent-916, score-1.259]
69 We will now via several steps introduce a monotonic variant of the teaching dimension and show that for most of the examples studied above, it is as low as the subset teaching dimension. [sent-917, score-1.898]
70 After selecting a set of concepts each of which is “easy to teach” because of possessing a small minimal teaching set, we eliminate these concepts from our concept class and consider only the remaining concepts. [sent-925, score-1.167]
71 The recursive teaching dimension RTD(c,C) of c in C is then defined as RTD(c,C) = d j and we denote by RTS(c,C) = TS(c,C j ) the set of all recursive teaching sets for c in C. [sent-946, score-1.93]
72 The definition of teaching hierarchy induces a protocol for teaching and learning: for a target concept c, a teacher uses the teaching hierarchy H = ((C1 , d1 ), . . . ). [sent-950, score-3.063]
73 The teacher then presents the examples in a teaching set from TS(c,C j ), that is, a recursive teaching set for c in C, to the learner. [sent-954, score-2.006]
74 The learner will use the teaching hierarchy to determine the target concept from the sample provided by the teacher. [sent-955, score-1.12]
75 Note again that Definition 25 does not presume any special order of the concept representations or of the instances, that is, teacher and learner do not have to agree on any such order to make use of the teaching and learning protocol. [sent-961, score-1.251]
76 The following definition of canonical teaching plans yields an alternative definition of the recursive teaching dimension. [sent-963, score-1.856]
77 Note that every concept class has a canonical teaching plan. [sent-989, score-1.059]
78 It turns out that a canonical teaching plan has the lowest possible order over all teaching plans; this order coincides with the recursive teaching dimension, see Theorem 29. [sent-990, score-2.793]
79 Theorem 29 Let C be a concept class and p∗ a canonical teaching plan for C. [sent-991, score-1.087]
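A brute-force sketch of the canonical-plan idea (my reconstruction, not code from the paper): repeatedly compute teaching dimensions within the remaining class, peel off all concepts that are currently easiest to teach, and record the largest value ever needed. For the class C0 this gives a recursive teaching dimension of 1, against a classic teaching dimension of n.

```python
from itertools import combinations, product

def cons(sample, concepts):
    return [c for c in concepts if all((x in c) == lab for x, lab in sample)]

def teaching_dim(target, concepts, instances):
    for k in range(len(instances) + 1):
        for xs in combinations(instances, k):
            for labs in product([True, False], repeat=k):
                if cons(set(zip(xs, labs)), concepts) == [target]:
                    return k
    return None

def recursive_teaching_dim(concepts, instances):
    """Greedy canonical plan: remove the currently easiest-to-teach concepts at each stage."""
    remaining, rtd = list(concepts), 0
    while remaining:
        dims = {c: teaching_dim(c, remaining, instances) for c in remaining}
        d = min(dims.values())
        rtd = max(rtd, d)
        remaining = [c for c in remaining if dims[c] != d]
    return rtd

n = 4
X = [f"x{i}" for i in range(1, n + 1)]
C0 = [frozenset([x]) for x in X] + [frozenset()]
print(recursive_teaching_dim(C0, X))             # 1: singletons first (d=1), then {} alone (d=0)
print(max(teaching_dim(c, C0, X) for c in C0))   # classic TD(C0) = n = 4
```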
80 Interestingly, unlike for subset teaching set protocols, the teacher-learner pairs based on recursive teaching set protocols are valid in the sense of Goldman and Mathias’s definition (Goldman and Mathias, 1996). [sent-1040, score-1.903]
81 2 Comparison to the Balbach Teaching Dimension Unlike the subset teaching dimension, the recursive teaching dimension lower-bounds the Balbach dimension. [sent-1065, score-1.913]
82 To prove this, we first observe that the smallest teaching dimension of all concepts in a given concept class is a lower bound on the Balbach dimension. [sent-1066, score-1.14]
83 3 again, this time in order to determine the recursive teaching dimension of the corresponding classes of concepts represented by boolean functions. [sent-1107, score-1.048]
84 As in the case of the subset teaching dimension, see Theorem 19, we obtain that the recursive teaching dimension of the class of all monomials over m (m ≥ 2) variables is 2, independent of m. [sent-1108, score-1.995]
85 We can show that the recursive teaching dimension can be arbitrarily larger than the subset teaching dimension; it can even be larger than the maximal STD computed over all subsets of the concept class. [sent-1149, score-2.051]
86 Similarly, we can prove that the subset teaching dimension can be arbitrarily larger than the recursive teaching dimension. [sent-1166, score-1.913]
87 To this end, we define a property that is sufficient for a concept class to have a recursive teaching dimension equal to its subset teaching dimension. [sent-1176, score-2.063]
88 If S′j+1 was contained in a subset teaching set for some concept among c1 , . . . [sent-1244, score-1.068]
89 , c j , this would imply that S′j+1 equaled some subset teaching set for some concept among c1 , . . . , c j . [sent-1247, score-1.068]
90 Second, since S′j+1 is contained in the subset teaching set S for c j+1 and not contained in any subset teaching set for any c ∈ C \ {c j+1 }, S′j+1 equals S and is itself a subset teaching set for c j+1 . [sent-1251, score-2.79]
91 A recursive argument for the subsequent concepts in the teaching plan shows that Cθ has the subset teaching property. [sent-1263, score-1.95]
92 Conclusions We introduced two new models of teaching and learning of finite concept classes, based on the idea that learners can learn from a smaller number of labeled examples if they assume that the teacher chooses helpful examples. [sent-1266, score-1.21]
93 However, one of our two models, the one based on recursive teaching sets, complies with Goldman and Mathias’s original definition of valid teaching without coding tricks, see Goldman and Mathias (1996). [sent-1270, score-1.879]
94 Intuitively, their definition requires a learner to hypothesize a concept c as soon as any teaching set for c is contained in the given sample. [sent-1273, score-1.111]
95 The subset teaching protocol questions not only classic definitions of coding trick but also the intuitive idea that information-theoretic complexity measures should be monotonic with respect to the inclusion of concept classes. [sent-1277, score-1.175]
96 For many “natural” concept classes, the subset teaching dimension and the recursive teaching dimension turn out to be equal, but in general the two measures are not comparable. [sent-1279, score-2.087]
97 One could easily design a protocol that, for every concept class C, would follow the subset teaching protocol if STD(C) ≤ RTD(C) and would follow the recursive teaching protocol if RTD(C) < STD(C). [sent-1281, score-2.174]
98 Such a protocol would comply with our definition of collusion-freeness and would strictly dominate both the subset teaching protocol and the recursive teaching protocol. [sent-1282, score-1.975]
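The combined protocol described in the last two sentences reduces to a small dispatch once the two dimensions and the two teachers are available; in the sketch below (names are mine), the callables are assumed to behave like the brute-force routines sketched earlier.

```python
def combined_sample_bound(concept_class, std, rtd):
    """Worst-case sample size of the combined protocol: the smaller of the two dimensions."""
    return min(std(concept_class), rtd(concept_class))

def combined_teacher(concept_class, target, std, rtd, std_teacher, rtd_teacher):
    """Follow the subset teaching protocol when STD(C) <= RTD(C), else the recursive one.
    `std`, `rtd`, `std_teacher`, `rtd_teacher` are assumed implementations, e.g. the
    brute-force sketches given earlier in this document."""
    if std(concept_class) <= rtd(concept_class):
        return std_teacher(concept_class, target)
    return rtd_teacher(concept_class, target)
```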
99 In this paper, we did not address the question of optimality of teaching protocols; we focused on intuitiveness of the protocols and the resulting teaching sets. [sent-1283, score-1.844]
100 We furthermore gratefully acknowledge the support of Laura Zilles and Michael Geilke who developed and provided a software tool for computing subset teaching sets and recursive teaching sets. [sent-1285, score-1.877]
wordName wordTfidf (topN-words)
[('teaching', 0.909), ('teacher', 0.14), ('concept', 0.138), ('btd', 0.12), ('stsk', 0.11), ('rtd', 0.107), ('std', 0.091), ('balbach', 0.09), ('cons', 0.09), ('td', 0.077), ('cz', 0.072), ('monomials', 0.07), ('learner', 0.064), ('ord', 0.055), ('goldman', 0.051), ('protocol', 0.049), ('concepts', 0.045), ('ange', 0.045), ('illes', 0.045), ('inkevich', 0.045), ('olte', 0.045), ('eaching', 0.042), ('ooperative', 0.042), ('recursive', 0.038), ('monomial', 0.036), ('dimension', 0.036), ('mathias', 0.034), ('btdk', 0.032), ('sts', 0.032), ('cooperative', 0.031), ('conssize', 0.03), ('nonmonotonicity', 0.03), ('plan', 0.028), ('assertion', 0.028), ('bts', 0.027), ('dnf', 0.027), ('protocols', 0.026), ('coding', 0.023), ('sz', 0.023), ('adc', 0.022), ('rts', 0.022), ('shinohara', 0.022), ('teachers', 0.022), ('subset', 0.021), ('fcol', 0.02), ('boolean', 0.02), ('odels', 0.019), ('adversary', 0.018), ('minimal', 0.018), ('subclass', 0.017), ('hamming', 0.017), ('zilles', 0.017), ('nearest', 0.016), ('instances', 0.016), ('elimination', 0.016), ('ts', 0.016), ('collusion', 0.015), ('cpair', 0.015), ('frow', 0.015), ('empty', 0.014), ('iterated', 0.014), ('lange', 0.013), ('monotonic', 0.013), ('learners', 0.013), ('ful', 0.013), ('premise', 0.013), ('teach', 0.013), ('obviously', 0.013), ('kobayashi', 0.012), ('tricks', 0.012), ('singleton', 0.012), ('class', 0.012), ('evaluates', 0.012), ('trick', 0.012), ('induction', 0.012), ('angluin', 0.011), ('consistent', 0.011), ('conjecture', 0.011), ('successful', 0.011), ('adversaries', 0.011), ('earning', 0.011), ('jr', 0.01), ('instance', 0.01), ('distinguishing', 0.01), ('conssub', 0.01), ('krikis', 0.01), ('partner', 0.01), ('stdk', 0.01), ('intuitive', 0.01), ('examples', 0.01), ('cooperation', 0.01), ('lls', 0.01), ('witnessed', 0.01), ('target', 0.009), ('lr', 0.009), ('interactive', 0.009), ('vm', 0.009), ('ci', 0.009), ('ad', 0.009), ('jackson', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion
2 0.019183187 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Lo` ve expansion of a Gaussian process (GP). Thus, our sparse e modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
3 0.016208848 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
4 0.015707219 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction ¨ u u c 2011 Ozg¨ r S¨ mer, Umut A. Acar, Alexander T. Ihler and Ramgopal R. Mettu. ¨ S UMER , ACAR , I HLER AND M ETTU
5 0.011856285 59 jmlr-2011-Learning with Structured Sparsity
Author: Junzhou Huang, Tong Zhang, Dimitris Metaxas
Abstract: This paper investigates a learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea that has become popular in recent years. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. It is shown that if the coding complexity of the target signal is small, then one can achieve improved performance by using coding complexity regularization methods, which generalize the standard sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. It is shown that the greedy algorithm approximately solves the coding complexity optimization problem under appropriate conditions. Experiments are included to demonstrate the advantage of structured sparsity over standard sparsity on some real applications. Keywords: structured sparsity, standard sparsity, group sparsity, tree sparsity, graph sparsity, sparse learning, feature selection, compressive sensing
6 0.010964924 33 jmlr-2011-Exploiting Best-Match Equations for Efficient Reinforcement Learning
7 0.010185653 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
8 0.00837937 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
9 0.0083560133 75 jmlr-2011-Parallel Algorithm for Learning Optimal Bayesian Network Structure
10 0.0072067967 54 jmlr-2011-Learning Latent Tree Graphical Models
11 0.0068158261 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
12 0.0063393964 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
13 0.0063236891 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.0061745741 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
15 0.0061180051 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
16 0.0060971817 58 jmlr-2011-Learning from Partial Labels
17 0.005910405 91 jmlr-2011-The Sample Complexity of Dictionary Learning
18 0.005880475 55 jmlr-2011-Learning Multi-modal Similarity
19 0.0058465782 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
20 0.0058264737 79 jmlr-2011-Proximal Methods for Hierarchical Sparse Coding
topicId topicWeight
[(0, 0.035), (1, 0.0), (2, -0.008), (3, 0.003), (4, -0.008), (5, 0.005), (6, -0.003), (7, 0.011), (8, 0.006), (9, 0.004), (10, -0.007), (11, -0.004), (12, 0.014), (13, -0.008), (14, 0.008), (15, -0.036), (16, -0.016), (17, -0.008), (18, 0.002), (19, -0.016), (20, -0.032), (21, -0.002), (22, 0.004), (23, 0.011), (24, 0.018), (25, 0.001), (26, -0.116), (27, -0.053), (28, 0.004), (29, 0.029), (30, 0.032), (31, 0.073), (32, 0.004), (33, -0.186), (34, -0.271), (35, -0.022), (36, 0.197), (37, -0.314), (38, -0.223), (39, 0.358), (40, -0.484), (41, 0.416), (42, -0.167), (43, -0.065), (44, 0.104), (45, 0.062), (46, -0.076), (47, -0.15), (48, 0.118), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.99234694 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion
2 0.14307697 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
Author: Zhihua Zhang, Guang Dai, Michael I. Jordan
Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Lo` ve expansion of a Gaussian process (GP). Thus, our sparse e modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes
3 0.086840428 36 jmlr-2011-Generalized TD Learning
Author: Tsuyoshi Ueno, Shin-ichi Maeda, Motoaki Kawanabe, Shin Ishii
Abstract: Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for model-free policy evaluation have been proposed. Although they have brought much progress in practical applications of reinforcement learning (RL), there still remain fundamental problems concerning statistical properties of the value function estimation. To solve these problems, we introduce a new framework, semiparametric statistical inference, to model-free policy evaluation. This framework generalizes TD learning and its extensions, and allows us to investigate statistical properties of both of batch and online learning procedures for the value function estimation in a unified way in terms of estimating functions. Furthermore, based on this framework, we derive an optimal estimating function with the minimum asymptotic variance and propose batch and online learning algorithms which achieve the optimality. Keywords: reinforcement learning, model-free policy evaluation, TD learning, semiparametirc model, estimating function
4 0.082629032 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
Author: Özgür Sümer, Umut A. Acar, Alexander T. Ihler, Ramgopal R. Mettu
Abstract: Many algorithms and applications involve repeatedly solving variations of the same inference problem, for example to introduce new evidence to the model or to change conditional dependencies. As the model is updated, the goal of adaptive inference is to take advantage of previously computed quantities to perform inference more rapidly than from scratch. In this paper, we present algorithms for adaptive exact inference on general graphs that can be used to efficiently compute marginals and update MAP configurations under arbitrary changes to the input factor graph and its associated elimination tree. After a linear time preprocessing step, our approach enables updates to the model and the computation of any marginal in time that is logarithmic in the size of the input model. Moreover, in contrast to max-product our approach can also be used to update MAP configurations in time that is roughly proportional to the number of updated entries, rather than the size of the input model. To evaluate the practical effectiveness of our algorithms, we implement and test them using synthetic data as well as for two real-world computational biology applications. Our experiments show that adaptive inference can achieve substantial speedups over performing complete inference as the model undergoes small changes over time. Keywords: exact inference, factor graphs, factor elimination, marginalization, dynamic programming, MAP computation, model updates, parallel tree contraction ¨ u u c 2011 Ozg¨ r S¨ mer, Umut A. Acar, Alexander T. Ihler and Ramgopal R. Mettu. ¨ S UMER , ACAR , I HLER AND M ETTU
5 0.079590842 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
Author: Vladimir V. V'yugin
Abstract: In this paper the sequential prediction problem with expert advice is considered for the case where losses of experts suffered at each step cannot be bounded in advance. We present some modification of Kalai and Vempala algorithm of following the perturbed leader where weights depend on past losses of the experts. New notions of a volume and a scaled fluctuation of a game are introduced. We present a probabilistic algorithm protected from unrestrictedly large one-step losses. This algorithm has the optimal performance in the case when the scaled fluctuations of one-step losses of experts of the pool tend to zero. Keywords: prediction with expert advice, follow the perturbed leader, unbounded losses, adaptive learning rate, expected bounds, Hannan consistency, online sequential prediction
6 0.068375446 16 jmlr-2011-Clustering Algorithms for Chains
7 0.066640534 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
8 0.053735379 6 jmlr-2011-A Simpler Approach to Matrix Completion
9 0.052729506 59 jmlr-2011-Learning with Structured Sparsity
10 0.05211921 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
11 0.050734445 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
12 0.042316742 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
13 0.042049184 20 jmlr-2011-Convex and Network Flow Optimization for Structured Sparsity
14 0.03958125 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms
15 0.039121237 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
16 0.035022181 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
17 0.03402625 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
18 0.033422478 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
19 0.032039966 60 jmlr-2011-Locally Defined Principal Curves and Surfaces
20 0.031860348 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data
topicId topicWeight
[(4, 0.036), (9, 0.018), (10, 0.024), (24, 0.026), (31, 0.034), (32, 0.021), (41, 0.017), (73, 0.022), (78, 0.05), (90, 0.581)]
simIndex simValue paperId paperTitle
1 0.87817353 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
Author: Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn, Karsten M. Borgwardt
Abstract: In this article, we propose a family of efficient kernels for large graphs with discrete node labels. Key to our method is a rapid feature extraction scheme based on the Weisfeiler-Lehman test of isomorphism on graphs. It maps the original graph to a sequence of graphs, whose node attributes capture topological and label information. A family of kernels can be defined based on this Weisfeiler-Lehman sequence of graphs, including a highly efficient kernel comparing subtree-like patterns. Its runtime scales only linearly in the number of edges of the graphs and the length of the Weisfeiler-Lehman graph sequence. In our experimental evaluation, our kernels outperform state-of-the-art graph kernels on several graph classification benchmark data sets in terms of accuracy and runtime. Our kernels open the door to large-scale applications of graph kernels in various disciplines such as computational biology and social network analysis. Keywords: graph kernels, graph classification, similarity measures for graphs, Weisfeiler-Lehman algorithm c 2011 Nino Shervashidze, Pascal Schweitzer, Erik Jan van Leeuwen, Kurt Mehlhorn and Karsten M. Borgwardt. S HERVASHIDZE , S CHWEITZER , VAN L EEUWEN , M EHLHORN AND B ORGWARDT
same-paper 2 0.87053192 65 jmlr-2011-Models of Cooperative Teaching and Learning
Author: Sandra Zilles, Steffen Lange, Robert Holte, Martin Zinkevich
Abstract: While most supervised machine learning models assume that training examples are sampled at random or adversarially, this article is concerned with models of learning from a cooperative teacher that selects “helpful” training examples. The number of training examples a learner needs for identifying a concept in a given class C of possible target concepts (sample complexity of C) is lower in models assuming such teachers, that is, “helpful” examples can speed up the learning process. The problem of how a teacher and a learner can cooperate in order to reduce the sample complexity, yet without using “coding tricks”, has been widely addressed. Nevertheless, the resulting teaching and learning protocols do not seem to make the teacher select intuitively “helpful” examples. The two models introduced in this paper are built on what we call subset teaching sets and recursive teaching sets. They extend previous models of teaching by letting both the teacher and the learner exploit knowing that the partner is cooperative. For this purpose, we introduce a new notion of “coding trick”/“collusion”. We show how both resulting sample complexity measures (the subset teaching dimension and the recursive teaching dimension) can be arbitrarily lower than the classic teaching dimension and known variants thereof, without using coding tricks. For instance, monomials can be taught with only two examples independent of the number of variables. The subset teaching dimension turns out to be nonmonotonic with respect to subclasses of concept classes. We discuss why this nonmonotonicity might be inherent in many interesting cooperative teaching and learning scenarios. Keywords: teaching dimension, learning Boolean functions, interactive learning, collusion
3 0.84564936 32 jmlr-2011-Exploitation of Machine Learning Techniques in Modelling Phrase Movements for Machine Translation
Author: Yizhao Ni, Craig Saunders, Sandor Szedmak, Mahesan Niranjan
Abstract: We propose a distance phrase reordering model (DPR) for statistical machine translation (SMT), where the aim is to learn the grammatical rules and context dependent changes using a phrase reordering classification framework. We consider a variety of machine learning techniques, including state-of-the-art structured prediction methods. Techniques are compared and evaluated on a Chinese-English corpus, a language pair known for the high reordering characteristics which cannot be adequately captured with current models. In the reordering classification task, the method significantly outperforms the baseline against which it was tested, and further, when integrated as a component of the state-of-the-art machine translation system, MOSES, it achieves improvement in translation results. Keywords: statistical machine translation (SMT), phrase reordering, lexicalized reordering (LR), maximum entropy (ME), support vector machine (SVM), maximum margin regression (MMR) , max-margin structure learning (MMS)
4 0.83681107 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes
Author: Elias Zavitsanos, Georgios Paliouras, George A. Vouros
Abstract: This paper presents hHDP, a hierarchical algorithm for representing a document collection as a hierarchy of latent topics, based on Dirichlet process priors. The hierarchical nature of the algorithm refers to the Bayesian hierarchy that it comprises, as well as to the hierarchy of the latent topics. hHDP relies on nonparametric Bayesian priors and it is able to infer a hierarchy of topics, without making any assumption about the depth of the learned hierarchy and the branching factor at each level. We evaluate the proposed method on real-world data sets in document modeling, as well as in ontology learning, and provide qualitative and quantitative evaluation results, showing that the model is robust, it models accurately the training data set and is able to generalize on held-out data. Keywords: hierarchical Dirichlet processes, probabilistic topic models, topic distributions, ontology learning from text, topic hierarchy
5 0.29531863 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
Author: Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa
Abstract: We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements. Keywords: natural language processing, neural networks
6 0.28828835 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
7 0.27398312 46 jmlr-2011-Introduction to the Special Topic on Grammar Induction, Representation of Language and Language Learning
8 0.27022585 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
9 0.24496096 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
10 0.24287306 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
11 0.23818439 16 jmlr-2011-Clustering Algorithms for Chains
12 0.23795429 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
13 0.22987059 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
14 0.22933593 61 jmlr-2011-Logistic Stick-Breaking Process
15 0.22827309 7 jmlr-2011-Adaptive Exact Inference in Graphical Models
17 0.22671284 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
18 0.22405419 48 jmlr-2011-Kernel Analysis of Deep Networks
19 0.22254086 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms
20 0.22218743 35 jmlr-2011-Forest Density Estimation