jmlr jmlr2007 jmlr2007-88 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dima Kuzmin, Manfred K. Warmuth
Abstract: Maximum concept classes of VC dimension d over n domain points have size \binom{n}{\le d}, and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995), which compresses to labeled subsets of the sample of size equal to d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph
Reference: text
sentIndex sentText sentNum sentScore
1 More generally, a compression scheme is defined by two mappings (Littlestone and Warmuth, 1986): one mapping samples of concepts in the class to subsamples, and the other one mapping the resulting subsamples to hypotheses on the domain of the class. [sent-37, score-0.637]
2 The size of the compression scheme is the size of the largest representative, and the minimum size of a compression scheme for a class serves as an alternate measure of complexity. [sent-43, score-0.696]
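To make these two mappings concrete, here is a minimal Python sketch of the interface they define; the names (compress, reconstruct, kappa, rho) and the dict encoding of a sample as {point: label} are illustrative assumptions, not notation from the paper.

```python
# Minimal sketch of the two mappings that define a sample compression scheme
# (Littlestone and Warmuth, 1986). Encoding is illustrative: a sample is a
# dict {point: label}, a hypothesis is a function point -> label.

def compress(sample, kappa):
    """Map a labeled sample to a small subsample via the scheme's first map kappa."""
    return kappa(sample)

def reconstruct(subsample, rho):
    """Map the kept subsample to a hypothesis via the scheme's second map rho."""
    return rho(subsample)

def scheme_is_correct_on(sample, kappa, rho):
    """Defining requirement: the reconstructed hypothesis must agree with the
    entire original sample, not only with the kept subsample."""
    h = reconstruct(compress(sample, kappa), rho)
    return all(h(x) == y for x, y in sample.items())
```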
3 Our compression scheme represents concepts by subsets of size at most d and for any k ≤ d, the concepts represented by subsets of size up to k will form a maximum class of VC dimension k. [sent-100, score-0.885]
4 The reduction C^x of a concept class C wrt a dimension x ∈ dom(C) consists of all those concepts in C − x that have two possible extensions onto concepts in C. [sent-189, score-0.641]
5 The tail of concept class C on dimension x consists of all concepts that do not have an edge labeled with x. [sent-192, score-0.631]
6 We denote the tail of C on dimension x as tail_x(C). [sent-194, score-0.842]
7 The class C can therefore be partitioned as 0C^x ∪ 1C^x ∪ tail_x(C), where ∪ denotes the disjoint union and bC^x consists of all concepts in C^x extended with bit b in dimension x. [sent-195, score-0.923]
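The three objects in this partition are easy to compute for a finite class. The sketch below uses my own encoding (concepts as bit-tuples indexed by dom(C) = {0, ..., n−1}) and is meant only to illustrate the definitions of C − x, C^x and tail_x(C).

```python
# Restriction, reduction and tail of a finite concept class.
# Concepts are bit-tuples over dom(C) = {0, ..., n-1}; x is a coordinate index.

def restrict(C, x):
    """C - x: delete coordinate x from every concept."""
    return {c[:x] + c[x + 1:] for c in C}

def reduce_(C, x):
    """C^x: concepts of C - x that extend to concepts of C with both x-bits."""
    return {c[:x] + c[x + 1:] for c in C
            if c[:x] + (1 - c[x],) + c[x + 1:] in C}

def tail(C, x):
    """tail_x(C): concepts of C whose x-bit cannot be flipped inside C,
    i.e., concepts with no x-labeled edge in the one-inclusion graph."""
    return {c for c in C if c[:x] + (1 - c[x],) + c[x + 1:] not in C}

def partition_check(C, x):
    """Sanity check of C = 0C^x ∪ 1C^x ∪ tail_x(C); bC^x inserts bit b at x."""
    bCx = {r[:x] + (b,) + r[x:] for r in reduce_(C, x) for b in (0, 1)}
    return C == bCx | tail(C, x) and not (bCx & tail(C, x))
```

For instance, on the class of all bit-vectors over three points with at most one 1 (a maximum class of VC dimension 1), partition_check returns True for every coordinate x.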
8 Unlabeled Compression Scheme. Our unlabeled compression scheme for maximum classes represents the concepts as unlabeled subsets of dom(C) of size at most d. [sent-226, score-0.827]
9 Lemma 3: For any representation mapping r of a maximum concept class C and the one-inclusion graph of C, any edge (c, c′) labeled x in the graph has the property that its associated dimension x lies in exactly one of the representatives r(c) or r(c′). [sent-274, score-0.683]
10 All cells not bordering the x-plane form the subclass tail_x(C). [sent-350, score-0.761]
11 By Corollary 5, the representative of a concept in an unlabeled compression scheme is always a subset of the incident dimensions (here the bordering hyperplanes) in the one-inclusion graph. [sent-351, score-0.79]
12 Comparison with Old Scheme. In the unlabeled compression schemes introduced in this paper, each subset of up to d domain points represents a unique concept in the class, and every sample of a concept contains exactly one subset that represents a concept consistent with the sample. [sent-378, score-0.937]
13 Before we show that there always exist unlabeled compression schemes, we present the old compression scheme for maximum classes from Floyd and Warmuth (1995) in a concise way that brings out the difference between both schemes. [sent-379, score-0.772]
14 Each dimension x ∈ dom(C) can be used to split class C into three disjoint sets: C = 0C^x ∪ 1C^x ∪ tail_x(C). [sent-413, score-0.724]
15 A forbidden labeling (Floyd and Warmuth, 1995) of a class C of VC dimension d is a labeled set f of d + 1 points in dom(C) that is not consistent with any concept in C. [sent-421, score-0.73]
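As a quick illustration of this definition, the following sketch tests whether a labeled point set is forbidden for a finite class and enumerates all forbidden labelings of a given size; the encoding (a labeling as a dict from coordinate index to bit, concepts as bit-tuples, as in the sketch above) is mine.

```python
from itertools import combinations, product

def is_forbidden(labeling, C):
    """labeling: dict {coordinate index: bit}; C: set of bit-tuples.
    Forbidden means no concept of C agrees with the labeling."""
    return not any(all(c[i] == b for i, b in labeling.items()) for c in C)

def forbidden_labelings(C, points, size):
    """All forbidden labelings of the given size over the given coordinates."""
    found = []
    for S in combinations(points, size):
        for bits in product((0, 1), repeat=size):
            lab = dict(zip(S, bits))
            if is_forbidden(lab, C):
                found.append(lab)
    return found
```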
16 Our Tail Matching Algorithm assigns to every concept in tail_x(C) a forbidden labeling of the class C^x, of size (d − 1) + 1 = d. [sent-425, score-1.22]
17 Since c|r(c) is now a forbidden labeling for C^x, clashes between tail_x(C) and C^x are avoided. [sent-426, score-1.075]
18 The class tail_x(C) contains the same number of concepts, since C − x = C^x ∪ (tail_x(C) − x) and C^x and C − x are maximum classes: |tail_x(C)| = |C − x| − |C^x| = \binom{n-1}{\le d} − \binom{n-1}{\le d-1} = \binom{n-1}{d}. (1) [sent-428, score-0.679]
19 We next show that every tail concept contains some forbidden labeling of C^x and each such forbidden labeling occurs in at least one tail concept. [sent-429, score-1.189]
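The last step of (1) is just a telescoping of the partial binomial sums, written out:

```latex
\binom{n-1}{\le d} - \binom{n-1}{\le d-1}
  = \sum_{i=0}^{d}\binom{n-1}{i} - \sum_{i=0}^{d-1}\binom{n-1}{i}
  = \binom{n-1}{d}.
```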
20 Therefore all concepts in tail_x(C) contain at least one forbidden labeling of C^x. [sent-432, score-1.2]
21 We will now show that the Tail Subroutine actually constructs a matching between the forbidden labelings of size d for C^x and the tail concepts that contain them. [sent-434, score-0.755]
22 This matching is unique (Theorem 10 below) and using these matched forbidden labelings as representatives avoids clashes between tail concepts. [sent-435, score-0.819]
23 If we denote tail_x(C^y) as {c_i : i ∈ I} and tail_x(C − y) as {c_j : j ∈ J} (where I ∩ J = ∅), then there exist bit values {a_i : i ∈ I}, {a_j : j ∈ J} for the y dimension such that tail_x(C) = {a_i c_i : i ∈ I} ∪ {a_j c_j : j ∈ J}. [sent-438, score-1.498]
24 Next we will show that any concept in tail_x(C^y) and tail_x(C − y) can be mapped to a concept in tail_x(C) by extending it with a suitable y bit. [sent-440, score-2.186]
25 We also have to account for the possibility that there can be some concepts c ∈ tail_x(C^y) ∩ tail_x(C − y). [sent-441, score-1.427]
26 Concepts in the intersection will need to be mapped back to two different concepts of tail_x(C). [sent-442, score-0.799]
27 Assume 0c is an extension outside of tail_x(C). [sent-453, score-0.628]
28 This can only happen for concepts c ∈ tail_x(C^y) ∩ tail_x(C − y). [sent-457, score-1.427]
29 In this case 0c, 1c ∈ C, and by the previous paragraph, both lie in tail_x(C). [sent-458, score-0.628]
30 So we can arbitrarily map c ∈ tail_x(C^y) to 0c and c ∈ tail_x(C − y) to 1c. [sent-459, score-1.256]
31 Note that while C^y ⊆ C − y, this does not imply that tail_x(C^y) ⊆ tail_x(C − y), as the deletion of the concepts (C − y) \ C^y from C − y can remove x edges as well, and thus introduce new tail concepts. [sent-462, score-1.6]
32 Extend r to tail_x(C) via the subroutine of Figure 14. [sent-471, score-0.647]
33 Tail Subroutine. Input: a maximum concept class C, x ∈ dom(C). Output: an assignment of representatives to tail_x(C). [sent-473, score-1.021]
34 If VCdim(C) = |dom(C)|, then tail_x(C) = ∅ and r := ∅. [sent-477, score-0.628]
35 Otherwise, pick some y ∈ dom(C), y ≠ x, and recursively find representatives for tail_x(C^y) and tail_x(C − y). [sent-478, score-1.447]
36 ∀c ∈ tail_x(C − y) \ tail_x(C^y): find c′ ∈ tail_x(C) s.t. c′ − y = c, and set r(c′) := r(c). [sent-482, score-1.884]
37 ∀c ∈ tail_x(C^y) \ tail_x(C − y): find c′ ∈ tail_x(C) s.t. c′ − y = c, and set r(c′) := r(c) ∪ {y}. [sent-484, score-1.256]
38 ∀c ∈ tail_x(C^y) ∩ tail_x(C − y), consider the concepts 0c, 1c ∈ tail_x(C). [sent-488, score-2.055]
39 Let r_1 be the representative for c from tail_x(C^y) and r_2 be the one from tail_x(C − y). [sent-489, score-1.306]
40 Corollary 8: Any forbidden labeling of (C − y)^x is also a forbidden labeling of C^x. [sent-502, score-0.802]
41 The corollary now follows from the fact that the forbidden labelings of C^x − y are exactly all forbidden labelings of C^x that do not contain y. [sent-504, score-0.791]
42 Lemma 9: If we have a forbidden labeling for C^{xy} of size d − 1, then there exists a bit value for the y dimension such that extending this forbidden labeling with this bit results in a forbidden labeling of size d for C^x. [sent-505, score-1.413]
43 Proof: We will establish a bijection between forbidden labelings of C^x of size d that contain y and forbidden labelings of size d − 1 for C^{xy}. [sent-506, score-0.868]
44 Exactly \binom{n-2}{d-1} of these forbidden labelings contain y, and this is also the total number of forbidden labelings of size d − 1 for C^{xy}. [sent-508, score-0.818]
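The count can be read off from the standard fact that a maximum class of VC dimension k has exactly one forbidden labeling on each set of k + 1 domain points (its restriction to such a set has size 2^{k+1} − 1). Applied here, with C^x maximum of dimension d − 1 on n − 1 points and C^{xy} maximum of dimension d − 2 on n − 2 points:

```latex
\#\{\text{forbidden labelings of } C^{x} \text{ of size } d \text{ that contain } y\}
  = \binom{n-2}{d-1}
  = \#\{\text{forbidden labelings of } C^{xy} \text{ of size } d-1\}.
```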
45 It follows that every forbidden set is mapped to a different forbidden labeling and by the counting argument above we see that all forbidden sets are covered. [sent-512, score-1.009]
46 For any x ∈ dom(C) we can construct a bipartite graph between the \binom{n-1}{d} concepts in tail_x(C) and the \binom{n-1}{d} forbidden labelings of size d for C^x, with an edge between a concept and a forbidden labeling if this labeling is contained in the concept. [sent-515, score-1.951]
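The sketch below builds this bipartite graph by brute force and finds a maximum matching with a generic augmenting-path routine. It only illustrates the object being matched (encoding and function names are mine); it is not the paper's recursive Tail Matching Algorithm, which constructs the matching directly.

```python
# Brute-force sketch of the bipartite graph between tail_x(C) and the
# forbidden labelings of size d for C^x, plus a generic augmenting-path
# maximum matching. Concepts are frozensets of (point, bit) pairs.
from itertools import combinations, product

def tail_vs_forbidden(C, x, d):
    dom = {p for c in C for p, _ in c}
    flip = lambda c: frozenset((p, 1 - b) if p == x else (p, b) for p, b in c)
    tail_x = [c for c in C if flip(c) not in C]
    Cx = {frozenset(pb for pb in c if pb[0] != x) for c in C if flip(c) in C}
    forb = [frozenset(zip(S, bits))
            for S in combinations(sorted(dom - {x}), d)
            for bits in product((0, 1), repeat=d)
            if not any(frozenset(zip(S, bits)) <= c for c in Cx)]
    edges = {i: [j for j, f in enumerate(forb) if f <= tail_x[i]]
             for i in range(len(tail_x))}
    return tail_x, forb, edges

def matching_size(edges):
    match_r = {}                          # forbidden labeling -> tail concept
    def augment(u, seen):
        for v in edges[u]:
            if v not in seen:
                seen.add(v)
                if v not in match_r or augment(match_r[v], seen):
                    match_r[v] = u
                    return True
        return False
    return sum(augment(u, set()) for u in edges)
```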
47 By Lemma 6, we can compose tail_x(C) from tail_x(C^y) and tail_x(C − y). [sent-525, score-1.374]
48 Since VCdim(C^y) = d − 1 and |dom(C − y)| = n − 1, we have (n − 1, d), (n − 1, d − 1) < (n, d), so we can use the inductive hypothesis for these classes and assume that the desired matchings already exist for tail_x(C^y) and tail_x(C − y). [sent-526, score-1.331]
49 Concepts in tail_x(C − y) are matched to forbidden labelings of (C − y)^x of size d. [sent-529, score-1.056]
50 By Corollary 8, any forbidden labeling of (C − y)^x is also a forbidden labeling of C^x. [sent-530, score-0.802]
51 On the other hand, tail_x(C^y) is matched to labelings of size d − 1. [sent-532, score-0.752]
52 In the case where just one of two possible extensions of a concept in tail_x(C^y) is in tail_x(C), there are no problems: the single concept will be the concept of Lemma 9, since the other concept lies in C^x and thus does not contain any forbidden labelings. [sent-536, score-1.687]
53 The remaining case is that both extensions are in tail_x(C). [sent-539, score-0.645]
54 From the proof of Lemma 6 we see that this only happens to the concepts that are in tail_x(C^y) ∩ tail_x(C − y). [sent-540, score-1.444]
55 The other extension will correspond to the tail_x(C − y) matching. [sent-542, score-0.628]
56 Essentially, where before Lemma 6 told us to map the intersection tail_x(C^y) ∩ tail_x(C − y) back to tail_x(C) by assigning a bit arbitrarily, we now choose a bit in a specific way. [sent-543, score-1.94]
57 From any matching for tail_x(C) we will show how to construct matchings for tail_x(C − y) and tail_x(C^y) with the property that two different matchings for tail_x(C) will disagree with the constructed matchings for either tail_x(C − y) or tail_x(C^y). [sent-546, score-3.936]
58 Consider any concept c in tail_x(C) such that c − y ∈ tail_x(C^y) \ tail_x(C − y). [sent-548, score-2.035]
59 Then c − y lies in C − y, but not in tail_x(C − y). [sent-549, score-0.644]
60 From these two facts it follows that if a concept in tail_x(C) is matched to a forbidden set containing y, then c − y ∈ tail_x(C^y), and if it is matched to a set not containing y, then c − y ∈ tail_x(C − y). [sent-554, score-1.865]
61 We conclude that a matching for tail_x(C) splits into a matching for tail_x(C − y) and a matching for tail_x(C^y). [sent-555, score-2.052]
62 This implies that if there are two matchings for all of tail_x(C), then there are either two matchings for tail_x(C − y) or two matchings for tail_x(C^y). [sent-556, score-1.977]
63 Bijection condition: The representation mapping for C is composed of a bijection between 1C^x and all sets of size ≤ d containing x, a bijection between 0C^x and all sets of size < d that do not contain x, and finally a bijection between tail_x(C) and all sets of size exactly d that do not contain x. [sent-562, score-0.823]
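As a sanity check (writing n = |dom(C)| and using |C^x| = \binom{n-1}{\le d-1} together with (1)), the counts on the two sides of these bijections agree and sum to |C|:

```latex
|1C^x| = \binom{n-1}{\le d-1} = \#\{A : x \in A,\ |A| \le d\}, \qquad
|0C^x| = \binom{n-1}{\le d-1} = \#\{A : x \notin A,\ |A| \le d-1\}, \\
|\mathrm{tail}_x(C)| = \binom{n-1}{d} = \#\{A : x \notin A,\ |A| = d\}, \qquad
\text{total} = \binom{n-1}{\le d-1} + \binom{n-1}{\le d} = \binom{n}{\le d} = |C|.
```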
64 By Theorem 10, we know that concepts in the tail are assigned representatives that define a forbidden labeling for C^x. [sent-565, score-0.881]
65 Therefore, clashes between tail_x(C) and 0C^x, 1C^x are avoided. [sent-566, score-0.674]
66 By Theorem 10, the matching between concepts in tail_x(C) and forbidden labelings of C^x is unique. [sent-568, score-1.256]
67 Note that by Corollary 4, the unlabeled compression scheme produced by our recursive algorithm induces a d-orientation of the one-inclusion graph: orient each edge away from the concept that contains the dimension of the edge in its representative. [sent-573, score-0.832]
68 As a matter of fact, a topological order can be constructed by ordering the concepts of C as follows: 0C^x, 1C^x, tail_x(C). [sent-575, score-0.799]
69 The concepts within tail_x(C) can be ordered arbitrarily and the concepts within 0C^x and 1C^x are ordered recursively based on the topological order for C^x. [sent-576, score-0.95]
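A small sketch of the induced d-orientation described above, in the same illustrative encoding as before (concepts as frozensets of (point, bit) pairs, r a dict from concept to a frozenset of points). By Lemma 3 the dimension of every edge lies in exactly one endpoint's representative, so the orientation below is well defined and its outdegrees are at most d.

```python
# Orient each one-inclusion edge away from the endpoint whose representative
# contains the edge's dimension (exactly one endpoint, by Lemma 3).
from collections import Counter

def orient(C, r):
    oriented = []
    for c in C:
        for p, b in c:
            if b == 0:                                    # visit each edge once
                c2 = (c - {(p, 0)}) | {(p, 1)}            # flip dimension p
                if c2 in C:                               # edge labeled p
                    src, dst = (c, c2) if p in r[c] else (c2, c)
                    oriented.append((src, dst, p))
    return oriented

def max_outdegree(oriented):
    out = Counter(src for src, _, _ in oriented)
    return max(out.values(), default=0)                   # should be <= d
```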
70 Clearly none of the concepts in tail_x(C) lie in C_k because their representatives are of size d > k. [sent-591, score-1.01]
71 From the recursion of the algorithm it follows that C_k = 0(C^x)_k ∪ 1(C^x)_{k−1}, that is, it consists of all concepts in 0C^x with representatives of size ≤ k in the mapping for C^x, plus all the concepts in 1C^x with representatives of size ≤ k − 1 in the mapping for C^x. [sent-592, score-0.854]
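Writing (C^x)_j for the concepts of C^x whose representatives have size ≤ j, and assuming inductively that |(C^x)_j| = \binom{n-1}{\le j}, this recursion gives the size one expects of a maximum class of VC dimension k:

```latex
|C_k| = |(C^x)_k| + |(C^x)_{k-1}|
      = \binom{n-1}{\le k} + \binom{n-1}{\le k-1}
      = \binom{n}{\le k}.
```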
72 Lemma 13: For any maximum class C and x ∈ dom(C), restricting wrt x does not change the sets of incident dimensions of concepts in tail_x(C), that is, ∀c ∈ tail_x(C), I_C(c) = I_{C−x}(c − x). [sent-618, score-1.646]
73 Proof: Let (c, c′) be any edge leaving a concept c ∈ tail_x(C). [sent-619, score-0.832]
74 By the definition of tail_x(C), this edge cannot be an x edge, and therefore c and c′ agree on x and (c − x, c′ − x) is an edge in C − x. [sent-620, score-0.734]
75 It follows that I_C(c) ⊆ I_{C−x}(c − x) when c ∈ tail_x(C). [sent-621, score-0.628]
76 If I_C(c) is a strict subset of I_{C−x}(c − x) for some c ∈ tail_x(C), then the number of edges incident to tail_x(C) − x = (C − x) \ C^x in C − x is larger than the number of edges incident to tail_x(C) in C. [sent-622, score-2.2]
77 The class C − x is partitioned into C^x and tail_x(C) − x. [sent-635, score-0.628]
78 Thus we can assume that P begins and ends with a segment in tail_x(C) − x and has a segment of C^x concepts in the middle, where some of the three segments may be empty. [sent-637, score-0.831]
79 We partition tail_x(C) into tail_{x=0}(C) and tail_{x=1}(C). [sent-638, score-0.628]
80 It follows that any segment of P from tail_x(C) − x must be from the same part of the tail. [sent-641, score-0.644]
81 Note that from the above discussion, all concepts in the beginning and ending tail segments of P come from the part of tail_x(C) that labels x with c_1(x) = c_2(x). [sent-645, score-0.917]
82 If c_1(x) ≠ c_2(x), then P must contain a concept \tilde{c}_1 in C^x, because if all concepts in P lay in tail_x(C) − x then this would imply an edge between a concept in tail_{x=0}(C) − x and a concept in tail_{x=1}(C) − x. [sent-647, score-1.423]
83 For the general case, if I_C(c) ≠ dom(C), pick x ∉ I_C(c), for which c ∈ tail_x(C). [sent-677, score-0.628]
84 A natural idea for constructing a compression scheme for any class is to embed that class into some other class, for which a compression scheme is known. [sent-697, score-0.636]
85 The old labeled compression scheme for maximum classes cannot be extended to maximal classes due to the nature of its mapping between representatives and represented concepts. [sent-700, score-0.825]
86 A compression scheme is valid if for any sample domain, the number of concepts on the domain equals the number of concepts that have at least one representative inside the domain. [sent-733, score-0.767]
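The condition as stated here can be checked by brute force on a small class; the sketch below (encoding and names are mine, and reps may assign several representatives per concept) enumerates all sample domains and compares the two counts.

```python
# Brute-force check of the validity condition: for every sample domain A,
# the number of distinct restrictions of C to A must equal the number of
# concepts owning at least one representative inside A.
# Concepts: frozensets of (point, bit) pairs; reps: dict concept -> set of
# frozensets of points (multiple representatives per concept allowed).
from itertools import combinations

def is_valid_scheme(C, reps):
    dom = sorted({p for c in C for p, _ in c})
    for k in range(len(dom) + 1):
        for A in combinations(dom, k):
            A = set(A)
            restrictions = {frozenset((p, b) for p, b in c if p in A) for c in C}
            covered = sum(any(r <= A for r in reps[c]) for c in C)
            if len(restrictions) != covered:
                return False
    return True
```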
87 Unfortunately, Table 1 presents a maximal concept class that does not have an unlabeled compression scheme of size equal to the VC dimension, even when concepts in the class may have multiple representatives. [sent-735, score-1.121]
88 We checked that for any assignment of multiple representatives to concepts in this class, there is always some sample that cannot be compressed, that is, there is no representative of a consistent concept in the sample domain. [sent-736, score-0.645]
89 However, there is an infinite maximum (but not maximal) concept class of VC dimension one with the following property: there is no scheme when labeled points must represent concepts in the class, but there is a scheme when labeled points can represent hypotheses outside of the class (Eaton, 2005). [sent-741, score-0.711]
90 Also, there is no unlabeled compression scheme for positive halfspaces in R^2 in which the compression sets (of size at most two) represent positive halfspaces (Neylon, 2006b). [sent-742, score-0.853]
91 A compression scheme for a concept class C is essentially defined by a mapping f from representatives to hypotheses which are arbitrary subsets of dom(C). [sent-745, score-0.724]
92 Recall that our Tail Matching Algorithm used the forbidden sets of C^x to represent the concepts in tail_x(C). [sent-759, score-1.103]
93 However, in maximal classes the number of tail concepts can be larger than the number of forbidden sets of C^x. [sent-761, score-0.709]
94 Thus, we have a single forbidden set available for representing two concepts, and therefore using forbidden sets as representatives does not seem to work for maximal classes. [sent-767, score-0.871]
95 Maximal classes can still be split as C = 0C^x ∪ 1C^x ∪ tail_x(C), but now C^x and C − x are not necessarily maximal and they do not have specific sizes. [sent-778, score-0.744]
96 Our Tail Matching Algorithm relied on a further decomposition of the class tail_x(C) given in Lemma 6: any concept in tail_x(C − y) or tail_x(C^y) can be extended with a y bit to a concept in tail_x(C). [sent-779, score-2.523]
97 These extensions also exist for maximal classes, but we cannot get all concepts in tail_x(C) this way. [sent-780, score-0.888]
98 Our constructions for unlabeled compression schemes only apply to finite maximum classes whereas the original labeled compression scheme for maximum classes is applicable for infinite maximum classes as well. [sent-814, score-1.032]
99 The existence of unlabeled compression schemes for infinite maximum classes of size equal to the VC dimension does follow from the compactness property of compression schemes as shown in Ben-David and Litman (1998). [sent-815, score-0.883]
100 An unlabeled compression scheme is also known to exist via a nonconstructive proof (Ben-David and Litman, 1998), but it would be interesting to find a simple constructive unlabeled compression scheme for this class. [sent-825, score-0.84]
wordName wordTfidf (topN-words)
[('tailx', 0.628), ('forbidden', 0.304), ('dom', 0.244), ('compression', 0.239), ('vc', 0.191), ('representatives', 0.191), ('concepts', 0.171), ('concept', 0.151), ('vcdim', 0.129), ('tail', 0.118), ('cy', 0.118), ('cx', 0.118), ('incident', 0.103), ('unlabeled', 0.102), ('labeling', 0.097), ('dimension', 0.096), ('chemes', 0.087), ('halfspaces', 0.087), ('nlabeled', 0.087), ('uzmin', 0.087), ('labelings', 0.086), ('warmuth', 0.083), ('scheme', 0.079), ('lasses', 0.074), ('ic', 0.072), ('maximal', 0.072), ('ompression', 0.07), ('aximum', 0.066), ('cells', 0.065), ('matching', 0.056), ('edges', 0.055), ('floyd', 0.055), ('edge', 0.053), ('lemma', 0.052), ('maximum', 0.051), ('representative', 0.05), ('ck', 0.049), ('vertex', 0.049), ('arrangement', 0.048), ('arrangements', 0.046), ('clashes', 0.046), ('schemes', 0.046), ('mapping', 0.045), ('classes', 0.044), ('hypercubes', 0.044), ('labeled', 0.042), ('graph', 0.04), ('consistent', 0.04), ('compress', 0.039), ('bordering', 0.036), ('clash', 0.036), ('domain', 0.036), ('vertices', 0.036), ('wrt', 0.035), ('planes', 0.032), ('conjecture', 0.032), ('plane', 0.032), ('matchings', 0.031), ('neylon', 0.031), ('dimensions', 0.03), ('bijection', 0.03), ('cell', 0.029), ('bit', 0.028), ('circled', 0.026), ('compresses', 0.026), ('litman', 0.026), ('outdegree', 0.026), ('rubinstein', 0.026), ('welzl', 0.026), ('hyperplanes', 0.025), ('restriction', 0.024), ('peeling', 0.022), ('subsample', 0.022), ('subsamples', 0.022), ('recursive', 0.021), ('produced', 0.021), ('sample', 0.021), ('size', 0.02), ('degree', 0.02), ('subroutine', 0.019), ('disagree', 0.019), ('shattered', 0.019), ('rectangle', 0.019), ('subsets', 0.019), ('marks', 0.018), ('orientation', 0.018), ('old', 0.018), ('xy', 0.018), ('matched', 0.018), ('orient', 0.017), ('tight', 0.017), ('proof', 0.017), ('extensions', 0.017), ('correctness', 0.017), ('hypercube', 0.017), ('compressed', 0.017), ('segment', 0.016), ('lies', 0.016), ('compressing', 0.016), ('haussler', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
Author: Dima Kuzmin, Manfred K. Warmuth
Abstract: Maximum concept classes of VC dimension d over n domain points have size \binom{n}{\le d}, and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995), which compresses to labeled subsets of the sample of size equal to d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph
2 0.075553164 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
3 0.067707263 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
4 0.059400707 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.047265302 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
Author: Marlon Núñez, Raúl Fidalgo, Rafael Morales
Abstract: In the process of concept learning, target concepts may have portions with short-term changes, other portions may support long-term changes, and yet others may not change at all. For this reason several local windows need to be handled. We suggest facing this problem, which naturally exists in the field of concept learning, by allocating windows which can adapt their size to portions of the target concept. We propose an incremental decision tree that is updated with incoming examples. Each leaf of the decision tree holds a time window and a local performance measure as the main parameter to be controlled. When the performance of a leaf decreases, the size of its local window is reduced. This learning algorithm, called OnlineTree2, automatically adjusts its internal parameters in order to face the current dynamics of the data stream. Results show that it is comparable to other batch algorithms when facing problems with no concept change, and it is better than evaluated methods in its ability to deal with concept drift when dealing with problems in which: concept change occurs at different speeds, noise may be present and, examples may arrive from different areas of the problem domain (virtual drift). Keywords: incremental algorithms, online learning, concept drift, decision trees, robust learners
6 0.046977088 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
7 0.046508003 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
8 0.045489449 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
9 0.044171896 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
10 0.041520823 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
11 0.038114693 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
12 0.037743777 83 jmlr-2007-The On-Line Shortest Path Problem Under Partial Monitoring
13 0.036270417 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
14 0.03164757 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
17 0.028226072 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
18 0.025384445 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
20 0.02410716 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
topicId topicWeight
[(0, 0.147), (1, -0.007), (2, -0.074), (3, -0.049), (4, 0.033), (5, 0.041), (6, 0.104), (7, 0.049), (8, 0.016), (9, -0.063), (10, 0.064), (11, 0.162), (12, -0.174), (13, 0.007), (14, 0.068), (15, 0.062), (16, -0.087), (17, 0.135), (18, -0.209), (19, -0.024), (20, -0.141), (21, 0.18), (22, -0.047), (23, -0.032), (24, -0.239), (25, 0.322), (26, -0.188), (27, -0.29), (28, -0.054), (29, 0.016), (30, -0.072), (31, 0.059), (32, 0.006), (33, 0.029), (34, 0.047), (35, -0.106), (36, -0.047), (37, -0.088), (38, -0.037), (39, -0.028), (40, 0.09), (41, 0.052), (42, -0.231), (43, 0.177), (44, -0.111), (45, -0.014), (46, -0.139), (47, -0.118), (48, -0.063), (49, 0.191)]
simIndex simValue paperId paperTitle
same-paper 1 0.97762758 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
Author: Dima Kuzmin, Manfred K. Warmuth
Abstract: Maximum concept classes of VC dimension d over n domain points have size \binom{n}{\le d}, and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995), which compresses to labeled subsets of the sample of size equal to d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph
2 0.3210426 89 jmlr-2007-VC Theory of Large Margin Multi-Category Classifiers (Special Topic on Model Selection)
Author: Yann Guermeur
Abstract: In the context of discriminant analysis, Vapnik’s statistical learning theory has mainly been developed in three directions: the computation of dichotomies with binary-valued functions, the computation of dichotomies with real-valued functions, and the computation of polytomies with functions taking their values in finite sets, typically the set of categories itself. The case of classes of vectorvalued functions used to compute polytomies has seldom been considered independently, which is unsatisfactory, for three main reasons. First, this case encompasses the other ones. Second, it cannot be treated appropriately through a na¨ve extension of the results devoted to the computation of ı dichotomies. Third, most of the classification problems met in practice involve multiple categories. In this paper, a VC theory of large margin multi-category classifiers is introduced. Central in this theory are generalized VC dimensions called the γ-Ψ-dimensions. First, a uniform convergence bound on the risk of the classifiers of interest is derived. The capacity measure involved in this bound is a covering number. This covering number can be upper bounded in terms of the γ-Ψdimensions thanks to generalizations of Sauer’s lemma, as is illustrated in the specific case of the scale-sensitive Natarajan dimension. A bound on this latter dimension is then computed for the class of functions on which multi-class SVMs are based. This makes it possible to apply the structural risk minimization inductive principle to those machines. Keywords: multi-class discriminant analysis, large margin classifiers, uniform strong laws of large numbers, generalized VC dimensions, multi-class SVMs, structural risk minimization inductive principle, model selection
3 0.27365401 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
4 0.25002593 73 jmlr-2007-Revised Loss Bounds for the Set Covering Machine and Sample-Compression Loss Bounds for Imbalanced Data
Author: Zakria Hussain, François Laviolette, Mario Marchand, John Shawe-Taylor, Spencer Charles Brubaker, Matthew D. Mullin
Abstract: Marchand and Shawe-Taylor (2002) have proposed a loss bound for the set covering machine that has the property to depend on the observed fraction of positive examples and on what the classifier achieves on the positive training examples. We show that this loss bound is incorrect. We then propose a loss bound, valid for any sample-compression learning algorithm (including the set covering machine), that depends on the observed fraction of positive examples and on what the classifier achieves on them. We also compare numerically the loss bound proposed in this paper with the incorrect bound, the original SCM bound and a recently proposed loss bound of Marchand and Sokolova (2005) (which does not depend on the observed fraction of positive examples) and show that the latter loss bounds can be substantially larger than the new bound in the presence of imbalanced misclassifications. Keywords: set covering machines, sample-compression, loss bounds
5 0.20508365 14 jmlr-2007-Behavioral Shaping for Geometric Concepts
Author: Manu Chhabra, Robert A. Jacobs, Daniel Štefankovič
Abstract: In a search problem, an agent uses the membership oracle of a target concept to find a positive example of the concept. In a shaped search problem the agent is aided by a sequence of increasingly restrictive concepts leading to the target concept (analogous to behavioral shaping). The concepts are given by membership oracles, and the agent has to find a positive example of the target concept while minimizing the total number of oracle queries. We show that for the concept class of intervals on the real line an agent using a bounded number of queries per oracle exists. In contrast, for the concept class of unions of two intervals on the real line no agent with a bounded number of queries per oracle exists. We then investigate the (amortized) number of queries per oracle needed for the shaped search problem over other concept classes. We explore the following methods to obtain efficient agents. For axis-parallel rectangles we use a bootstrapping technique to obtain gradually better approximations of the target concept. We show that given rectangles R ⊆ A ⊆ R d one can obtain a rectangle A ⊇ R with vol(A )/vol(R) ≤ 2, using only O(d · vol(A)/vol(R)) random samples from A. For ellipsoids of bounded eccentricity in Rd we analyze a deterministic ray-shooting process which uses a sequence of rays to get close to the centroid. Finally, we use algorithms for generating random points in convex bodies (Dyer et al., 1991; Kannan et al., 1997) to give a randomized agent for the concept class of convex bodies. In the final section, we explore connections between our bootstrapping method and active learning. Specifically, we use the bootstrapping technique for axis-parallel rectangles to active learn axis-parallel rectangles under the uniform distribution in O(d ln(1/ε)) labeled samples. Keywords: computational learning theory, behavioral shaping, active learning
6 0.19939347 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections
7 0.19806263 29 jmlr-2007-Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts
8 0.18152978 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
9 0.17545584 48 jmlr-2007-Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners
10 0.16065964 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization
11 0.14453261 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
12 0.14103431 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study
13 0.13707918 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
14 0.13498315 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
16 0.13146558 36 jmlr-2007-Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption
18 0.12136684 65 jmlr-2007-PAC-Bayes Risk Bounds for Stochastic Averages and Majority Votes of Sample-Compressed Classifiers
19 0.1183686 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers (Special Topic on Model Selection)
20 0.11639585 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
topicId topicWeight
[(4, 0.014), (8, 0.022), (10, 0.011), (12, 0.048), (14, 0.01), (15, 0.017), (28, 0.057), (29, 0.476), (40, 0.035), (48, 0.019), (60, 0.036), (80, 0.025), (85, 0.064), (98, 0.059)]
simIndex simValue paperId paperTitle
same-paper 1 0.6737166 88 jmlr-2007-Unlabeled Compression Schemes for Maximum Classes
Author: Dima Kuzmin, Manfred K. Warmuth
Abstract: Maximum concept classes of VC dimension d over n domain points have size \binom{n}{\le d}, and this is an upper bound on the size of any concept class of VC dimension d over n points. We give a compression scheme for any maximum class that represents each concept by a subset of up to d unlabeled domain points and has the property that for any sample of a concept in the class, the representative of exactly one of the concepts consistent with the sample is a subset of the domain of the sample. This allows us to compress any sample of a concept in the class to a subset of up to d unlabeled sample points such that this subset represents a concept consistent with the entire original sample. Unlike the previously known compression scheme for maximum classes (Floyd and Warmuth, 1995), which compresses to labeled subsets of the sample of size equal to d, our new scheme is tight in the sense that the number of possible unlabeled compression sets of size at most d equals the number of concepts in the class. Keywords: compression schemes, VC dimension, maximum classes, one-inclusion graph
2 0.22943176 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
Author: Rie Johnson, Tong Zhang
Abstract: This paper investigates the effect of Laplacian normalization in graph-based semi-supervised learning. To this end, we consider multi-class transductive learning on graphs with Laplacian regularization. Generalization bounds are derived using geometric properties of the graph. Specifically, by introducing a definition of graph cut from learning theory, we obtain generalization bounds that depend on the Laplacian regularizer. We then use this analysis to better understand the role of graph Laplacian matrix normalization. Under assumptions that the cut is small, we derive near-optimal normalization factors by approximately minimizing the generalization bounds. The analysis reveals the limitations of the standard degree-based normalization method in that the resulting normalization factors can vary significantly within each connected component with the same class label, which may cause inferior generalization performance. Our theory also suggests a remedy that does not suffer from this problem. Experiments confirm the superiority of the normalization scheme motivated by learning theory on artificial and real-world data sets. Keywords: transductive learning, graph learning, Laplacian regularization, normalization of graph Laplacian
3 0.22638972 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
Author: Vitaly Feldman
Abstract: We consider the problems of attribute-efficient PAC learning of two well-studied concept classes: parity functions and DNF expressions over {0, 1}n . We show that attribute-efficient learning of parities with respect to the uniform distribution is equivalent to decoding high-rate random linear codes from low number of errors, a long-standing open problem in coding theory. This is the first evidence that attribute-efficient learning of a natural PAC learnable concept class can be computationally hard. An algorithm is said to use membership queries (MQs) non-adaptively if the points at which the algorithm asks MQs do not depend on the target concept. Using a simple non-adaptive parity learning algorithm and a modification of Levin’s algorithm for locating a weakly-correlated parity due to Bshouty et al. (1999), we give the first non-adaptive and attribute-efficient algorithm for ˜ learning DNF with respect to the uniform distribution. Our algorithm runs in time O(ns4 /ε) and ˜ uses O(s4 · log2 n/ε) non-adaptive MQs, where s is the number of terms in the shortest DNF representation of the target concept. The algorithm improves on the best previous algorithm for learning DNF (of Bshouty et al., 1999) and can also be easily modified to tolerate random persistent classification noise in MQs. Keywords: attribute-efficient, non-adaptive, membership query, DNF, parity function, random linear code
5 0.22616543 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
Author: Jaime S. Cardoso, Joaquim F. Pinto da Costa
Abstract: Classification of ordinal data is one of the most important tasks of relation learning. This paper introduces a new machine learning paradigm specifically intended for classification problems where the classes have a natural order. The technique reduces the problem of classifying ordered classes to the standard two-class problem. The introduced method is then mapped into support vector machines and neural networks. Generalization bounds of the proposed ordinal classifier are also provided. An experimental study with artificial and real data sets, including an application to gene expression analysis, verifies the usefulness of the proposed approach. Keywords: classification, ordinal data, support vector machines, neural networks
6 0.22561201 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
8 0.2246761 30 jmlr-2007-Dynamics and Generalization Ability of LVQ Algorithms
9 0.22369598 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
10 0.22347848 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
11 0.22301143 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time
13 0.22069675 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
14 0.22065517 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
15 0.22015551 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
16 0.21920085 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
17 0.21909401 77 jmlr-2007-Stagewise Lasso
18 0.21890546 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
19 0.2189002 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
20 0.21863261 47 jmlr-2007-Learning Horn Expressions with LOGAN-H