nips nips2006 nips2006-109 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. [sent-6, score-0.353]
2 We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. [sent-7, score-0.995]
3 These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. [sent-8, score-0.507]
4 We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. [sent-9, score-1.279]
5 We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension). [sent-10, score-1.064]
6 1 Introduction A set $F$ of classifiers and a probability distribution $P$ over their domain induce a metric in which the distance between classifiers is the probability that they disagree on how to classify a random object. [sent-11, score-0.410]
7 Properties of metrics like this have long been used for analyzing the generalization ability of learning algorithms [11, 32]. [sent-13, score-0.09]
8 This paper is about bounds on the number of examples required for PAC learning in terms of the doubling dimension [4] of this metric space. [sent-14, score-1.076]
9 The doubling dimension of a metric space is the least $d$ such that any ball can be covered by $2^d$ balls of half its radius. [sent-15, score-1.408]
10 The doubling dimension has been frequently used lately in the analysis of algorithms [13, 20, 21, 17, 29, 14, 7, 22, 28, 6]. [sent-16, score-0.864]
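To make the definition above concrete, here is a small illustrative sketch (not from the paper) that estimates a local doubling quantity for a finite set of points by comparing greedy covers at radius $r$ and $r/2$. The function names (`greedy_cover`, `doubling_estimate`), the use of Euclidean points instead of classifiers, and the choice of scale are assumptions made purely for illustration.

```python
import numpy as np

def greedy_cover(points, radius):
    """Greedily pick centers until every point lies within `radius` of a center."""
    centers = []
    uncovered = list(range(len(points)))
    while uncovered:
        c = uncovered[0]
        centers.append(c)
        uncovered = [i for i in uncovered
                     if np.linalg.norm(points[i] - points[c]) > radius]
    return centers

def doubling_estimate(points, radius):
    """log2 of the ratio between greedy cover sizes at radius/2 and radius:
    a crude empirical proxy for the doubling dimension at this scale."""
    coarse = len(greedy_cover(points, radius))
    fine = len(greedy_cover(points, radius / 2.0))
    return np.log2(max(fine, 1) / max(coarse, 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.standard_normal((2000, 3))       # a point cloud in R^3
    print(doubling_estimate(pts, radius=1.0))  # rough estimate, of order the ambient dimension
```

For a class of classifiers one would replace the Euclidean distance by the disagreement probability defined above.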
11 In the PAC-learning model, an algorithm is given examples $(x_1, f(x_1)), \ldots, (x_m, f(x_m))$ of the behavior of an arbitrary member $f$ of a known class $F$; the points $x_1, \ldots, x_m$ are chosen independently according to $P$. [sent-17, score-0.087]
12 The algorithm must, with probability at least $1 - \delta$ (w.r.t. [sent-19, score-0.074]
13 the random choice of $x_1, \ldots, x_m$), output a classifier whose distance from $f$ is at most $\epsilon$. [sent-22, score-0.056]
14 We show that if $(F, P)$ has doubling dimension $d$, then $F$ can be PAC-learned with respect to $P$ from $O\!\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right)$ examples (1). [sent-23, score-0.836]
15 If in addition the VC-dimension of $F$ is $D$, we show that any algorithm that outputs a classifier with zero training error whenever this is possible PAC-learns $F$ w.r.t. $P$ from a number of examples bounded in terms of $\max\{d, D\}$, $\frac{1}{\epsilon}$, and $\log\frac{1}{\delta}$ (bound (2); its exact form is garbled in this extraction). [sent-24, score-0.152]
16 We show that if $F$ consists of halfspaces through the origin, and $P$ is the uniform distribution over the unit ball in $\mathbb{R}^n$, then the doubling dimension of $(F, P)$ is $O(n)$. [sent-28, score-1.269]
17 Thus (1) generalizes the known bound of $O\!\left(\frac{n + \log\frac{1}{\delta}}{\epsilon}\right)$ for learning halfspaces with respect to the uniform distribution [25], matching a known lower bound for this problem [23] up to a constant factor. [sent-29, score-0.523]
18 Both upper bounds improve on the $O\!\left(\frac{n \log\frac{1}{\epsilon} + \log\frac{1}{\delta}}{\epsilon}\right)$ bound that follows from the traditional analysis; (2) is the first such improvement for a polynomial-time algorithm. [sent-30, score-0.219]
19 Some previous analyses of the sample complexity of learning have made use of the fact that the “metric dimension” [18] is at most the VC-dimension [11, 15]. [sent-31, score-0.101]
20 Since using the doubling dimension can sometimes lead to a better bound, a natural question is whether there is also a bound on the doubling dimension in terms of the VC-dimension. [sent-32, score-1.771]
21 We show that this is not the case: it is possible to pack $\left(\frac{1}{\alpha}\right)^{(1/2 - o(1))\,d}$ classifiers in a set of VC-dimension $d$ so that the distance between every pair is in the interval $[\alpha, 2\alpha]$. [sent-33, score-0.107]
22 Combining our upper bound analysis with established techniques (see [33, 3, 8, 31, 30]), one can perform similar analyses for the more general case in which no classifier in $F$ has zero error. [sent-35, score-0.192]
23 We have begun with the PAC model because it is a clean setting in which to illustrate the power of the doubling dimension for analyzing learning algorithms. [sent-36, score-0.907]
24 The doubling dimension appears most useful when the best achievable error rate (the Bayes error) is of the same order as the inverse of the number of training examples (or smaller). [sent-37, score-0.901]
25 Bounding the doubling dimension is useful for analyzing the sample complexity of learning because it limits the richness of a subclass of $F$ near the classifier to be learned. [sent-38, score-1.031]
26 For other analyses that exploit bounds on such local richness, please see [31, 30, 5, 25, 26, 34]. [sent-39, score-0.139]
27 In any case, it appears that the doubling dimension is an intuitive yet powerful way to bound the local complexity of a collection of classifiers. [sent-41, score-1.012]
28 2.1 Learning. For some domain $X$, an example consists of a member of $X$, and its classification in $\{0, 1\}$. [sent-43, score-0.137]
29 A training set is a finite collection of examples. [sent-45, score-0.069]
30 A learning algorithm takes as input a training set, and outputs a classifier. [sent-46, score-0.111]
31 Suppose $P$ is a probability distribution over $X$, and for classifiers $f, g$ define $\rho_P(f, g) = \Pr_{x \sim P}\big(f(x) \neq g(x)\big)$. A learning algorithm PAC-learns $F$ with respect to $P$ if, for any $f \in F$, with probability at least $1 - \delta$ it outputs a classifier at distance at most $\epsilon$ from $f$. [sent-47, score-0.044]
32 An $\alpha$-cover for a (pseudo)metric space $(\Phi, \rho)$ is a set $T \subseteq \Phi$ such that every element of $\Phi$ is within distance at most $\alpha$ of some element of $T$ (with respect to $\rho$). [sent-54, score-0.138]
33 An $\alpha$-packing is a set $T \subseteq \Phi$ such that every pair of elements of $T$ is at distance greater than $\alpha$; the $\alpha$-ball centered at $z \in \Phi$ consists of all $t \in \Phi$ for which $\rho(z, t) \le \alpha$. Denote the size of the smallest $\alpha$-cover by $\mathcal{M}(\alpha, \Phi)$. [sent-56, score-0.257]
34 Denote the size of the largest $\alpha$-packing by $\mathcal{N}(\alpha, \Phi)$; for any $\alpha > 0$, $\mathcal{M}(2\alpha, \Phi) \le \mathcal{N}(2\alpha, \Phi) \le \mathcal{M}(\alpha, \Phi)$. The doubling dimension of $\Phi$ is the least $d$ such that, for all radii $\alpha > 0$, any $\alpha$-ball in $\Phi$ can be covered by at most $2^d$ $\alpha/2$-balls. [sent-59, score-0.981]
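As a sketch of how these quantities behave, the following code (written for this summary, not taken from the paper) builds a greedy maximal $\alpha$-packing of a small finite class under the disagreement pseudometric and checks that, by maximality, it is also an $\alpha$-cover. The helper names and the toy class of random labelings are assumptions.

```python
import random

def dist(f, g, points):
    """Disagreement pseudometric: fraction of points on which f and g differ
    (the uniform distribution over `points` plays the role of the distribution)."""
    return sum(f[x] != g[x] for x in points) / len(points)

def greedy_packing(classifiers, points, alpha):
    """Greedy maximal alpha-packing: pairwise distances all exceed alpha.
    By maximality, every classifier is within alpha of some packing element,
    so the packing is simultaneously an alpha-cover."""
    packing = []
    for f in classifiers:
        if all(dist(f, g, points) > alpha for g in packing):
            packing.append(f)
    return packing

if __name__ == "__main__":
    random.seed(0)
    points = list(range(30))
    classifiers = [{x: random.randint(0, 1) for x in points} for _ in range(200)]
    alpha = 0.25
    P = greedy_packing(classifiers, points, alpha)
    # maximality => every classifier is within alpha of some element of P
    assert all(min(dist(f, g, points) for g in P) <= alpha for f in classifiers)
    print(len(P), len(greedy_packing(classifiers, points, 2 * alpha)))
```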
35 2.3 Probability. For a function $g$ and a probability distribution $P$, let $\mathbf{E}_{x \sim P}\,g(x)$ denote the expectation of $g(x)$; we will shorten this to $\mathbf{E}_P(g)$, and if $u = (u_1, \ldots, u_m)$, $\mathbf{E}_u(g)$ denotes the average of $g$ over the components of $u$. We will use $\Pr_x$, $\Pr_P$, and $\Pr_u$ similarly. [sent-62, score-0.044]
36 3 The strongest upper bound. Theorem 2: Suppose $d$ is the doubling dimension of $(F, P)$. There is an algorithm that PAC-learns $F$ from $O\!\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right)$ examples. [sent-68, score-1.007]
37 The key lemma limits the extent to which points that are separated from one another can crowd around some point in a metric space with limited doubling dimension. [sent-70, score-0.994]
38 Lemma 3 (see [13]): Suppose $(\Phi, \rho)$ is a metric space with doubling dimension $d$ and $z \in \Phi$. [sent-71, score-0.965]
39 For $\beta > 0$, let $B(z, \beta)$ consist of the elements $u \in \Phi$ such that $\rho(u, z) \le \beta$ (that is, the $\beta$-ball centered at $z$). [sent-72, score-0.15]
40 Then $\mathcal{M}(\alpha, B(z, \beta)) \le \left(\frac{4\beta}{\alpha}\right)^{d}$; in other words, any $\alpha$-packing must have at most $\left(\frac{4\beta}{\alpha}\right)^{d}$ elements within distance $\beta$ of $z$. [sent-73, score-0.12]
41 Proof: Since $\Phi$ has doubling dimension $d$, the set $B(z, \beta)$ can be covered by $2^d$ balls of radius $\beta/2$. [sent-74, score-1.207]
42 Each of these can be covered by $2^d$ balls of radius $\beta/4$, and so on. [sent-75, score-0.371]
43 Thus, $B(z, \beta)$ can be covered by $2^{d\lceil \log_2 \frac{2\beta}{\alpha}\rceil} \le \left(\frac{4\beta}{\alpha}\right)^{d}$ balls of radius $\alpha/2$, and each such ball contains at most one element of an $\alpha$-packing. [sent-76, score-0.371]
44 The proof is an application of the peeling technique [1] (see [30]). [sent-79, score-0.122]
45 Proof of Theorem 2: Construct an $\epsilon/4$-packing $G \subseteq F$ greedily, by repeatedly adding an element of $F$ to $G$ for as long as this is possible. [sent-80, score-0.253]
46 This packing is also an $\epsilon/4$-cover, since otherwise we could add another member to $G$. [sent-81, score-0.197]
47 Consider the algorithm that outputs the element of $G$ with minimum error on the training set. [sent-82, score-0.193]
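A minimal sketch of the learner just described, assuming a finite class and approximating the disagreement metric with an unlabeled sample (that approximation is my simplification; the argument above works with the true metric). The names `greedy_net` and `learn`, the toy threshold class, and the $\epsilon/4$ scale are illustrative assumptions, not the paper's.

```python
import random

def disagreement(f, g, xs):
    """Fraction of the points xs on which classifiers f and g differ."""
    return sum(f(x) != g(x) for x in xs) / len(xs)

def greedy_net(classifiers, xs, scale):
    """Greedy maximal packing at `scale`; by maximality it is also a cover."""
    net = []
    for f in classifiers:
        if all(disagreement(f, g, xs) > scale for g in net):
            net.append(f)
    return net

def learn(classifiers, train, unlabeled, eps):
    """Build a net of the class and output the net element with minimum
    error on the labeled training sample."""
    net = greedy_net(classifiers, unlabeled, eps / 4.0)
    def training_error(f):
        return sum(f(x) != y for x, y in train) / len(train)
    return min(net, key=training_error)

if __name__ == "__main__":
    random.seed(1)
    # toy class: threshold classifiers on [0, 1]
    classifiers = [(lambda x, t=t: int(x >= t)) for t in [i / 100 for i in range(101)]]
    target = classifiers[37]
    unlabeled = [random.random() for _ in range(500)]
    train = [(x, target(x)) for x in (random.random() for _ in range(200))]
    h = learn(classifiers, train, unlabeled, eps=0.1)
    print(disagreement(h, target, unlabeled))  # should be small
```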
48 Whatever the target, some element of $G$ has error at most $\epsilon/4$. [sent-83, score-0.082]
49 Applying Chernoff bounds, $O\!\left(\frac{\log(1/\delta)}{\epsilon}\right)$ examples are sufficient that, with probability at least $1 - \delta/2$, this classifier is incorrect on at most a fraction $\epsilon/2$ of the training data. [sent-84, score-0.139]
50 Thus, the training error of the hypothesis output by the algorithm is at most $\epsilon/2$ with probability at least $1 - \delta/2$. [sent-85, score-0.154]
51 Choose an arbitrary function $f \in F$, and let $S$ be the random training set resulting from drawing $m$ examples according to $P$, and classifying them using $f$. [sent-86, score-0.065]
52 Define $\rho_S(f, g)$ to be the fraction of examples in $S$ on which $f$ and $g$ disagree. [sent-87, score-0.028]
53 Each of the remaining steps is a straightforward manipulation of the resulting geometric sum and Chernoff bounds; the displayed chain of inequalities does not survive the extraction. [sent-89, score-0.088]
54 Since $m = O\!\left(\frac{d + \log\frac{1}{\delta}}{\epsilon}\right)$ is sufficient for this failure probability to be at most $\delta/2$, this completes the proof. 4 A bound for consistent hypothesis finders. In this section we analyze algorithms that work by finding hypotheses with zero training error. [sent-90, score-0.179]
55 Lemma 5 (see [12, 24]): Suppose $\mathcal{G}$ is a set of $\{0, 1\}$-valued functions with a common domain $X$. [sent-94, score-0.078]
56 Then if $m \ge \frac{c\left(D \log\frac{1}{\alpha} + \log\frac{1}{\delta}\right)}{\alpha K \log(1+K)}$, where $D$ is the VC-dimension of $\mathcal{G}$ and $c$ is an absolute constant, we have $\Pr_{u \sim P^m}\big(\exists g \in \mathcal{G}:\ \Pr_P(g=1) \ge (1+K)\alpha$ but $\Pr_u(g=1) \le \alpha\big) \le \delta$. Now we are ready for the main analysis of this section. [sent-98, score-0.123]
57 Any consistent hypothesis finder for $F$ PAC-learns $F$ with the sample bound (2). Proof: Assume without loss of generality that $\epsilon \le 1/100$, and take $\alpha$ proportional to $\epsilon$. [sent-100, score-0.078]
58 Replacing each $g \in F$ by the function $x \mapsto |g(x) - f(x)|$, the doubling dimension of the resulting class is the same as the doubling dimension of $F$, and its VC-dimension is also known to be the same as the VC-dimension of $F$ (see [32]). [sent-106, score-0.152]
59 Construct an $\alpha$-packing $G$ greedily, by repeatedly adding an element of $F$ to $G$ for as long as this is possible. [sent-107, score-0.253]
60 For each $g \in F$, let $\pi(g)$ denote its nearest neighbor in $G$. [sent-109, score-0.033]
61 Computing a geometric sum exactly as in the proof of Theorem 2, we have that a sample of the size given in (2) suffices to bound, for absolute constants $c_1, c_2 > 0$, the probability that some $g \in F$ satisfies $\rho(g, \pi(g)) \ge c_1\epsilon$ while disagreeing with $\pi(g)$ on at most a $c_2\epsilon$ fraction of the sample. [sent-111, score-0.156]
62 5 Halfspaces and the uniform distribution. Proposition 7: If $U_n$ is the uniform distribution over the unit ball in $\mathbb{R}^n$, and $H_n$ is the set of halfspaces that go through the origin, then the doubling dimension of $(H_n, U_n)$ is $O(n)$. [sent-113, score-1.321]
63 Proof: Choose $h \in H_n$ and $\alpha > 0$. We will show that the ball of radius $\alpha$ centered at $h$ can be covered by $2^{O(n)}$ balls of radius $\alpha/2$. [sent-114, score-0.371]
64 Suppose $U_{H_n}$ is the probability distribution over $H_n$ obtained by choosing a normal vector $w$ uniformly from the unit ball, and outputting $\{x : w \cdot x \ge 0\}$. [sent-116, score-0.434]
65 It is known (see Lemma 4 of [25]) that $\Pr_{g \sim U_{H_n}}\big(\rho_{U_n}(g, h) \le \alpha\big) \ge (c_1 \alpha)^{n-1}$, where $c_1 > 0$ is an absolute constant independent of $\alpha$ and $n$. [sent-118, score-0.062]
66 Furthermore, $\Pr_{g \sim U_{H_n}}\big(\rho_{U_n}(g, h) \le \alpha\big) \le (c_2 \alpha)^{n-1}$, where $c_2$ is another absolute constant. [sent-119, score-0.062]
67 Suppose we arbitrarily choose $g_1, \ldots, g_N \in H_n$ that are at distance at most $\alpha$ from $h$, but $\alpha/2$ far from one another. [sent-120, score-0.16]
68 By the triangle inequality, the $\alpha/4$-balls centered at $g_1, \ldots, g_N$ are disjoint. [sent-121, score-0.147]
69 Thus, the probability that a random element of $H_n$ is in a ball of radius $\alpha/4$ centered at one of $g_1, \ldots, g_N$ is at least $N (c_1 \alpha/4)^{n-1}$. [sent-122, score-0.474]
70 On the other hand, since each of $g_1, \ldots, g_N$ has distance at most $\alpha$ from $h$, any element of an $\alpha/4$-ball centered at one of them is at most $\alpha + \alpha/4$ far from $h$. [sent-123, score-0.361]
71 Thus, the union of the $\alpha/4$-balls centered at $g_1, \ldots, g_N$ is contained in the $2\alpha$-ball centered at $h$. [sent-124, score-0.470]
72 Thus $N (c_1 \alpha/4)^{n-1} \le (2 c_2 \alpha)^{n-1}$, which implies $N \le (8 c_2 / c_1)^{n-1} = 2^{O(n)}$, completing the proof. [sent-125, score-0.030]
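The metric on $(H_n, U_n)$ used in this proof can be probed numerically. For two halfspaces through the origin, the disagreement probability under any rotationally invariant distribution (including the uniform distribution on the unit ball) equals the angle between their normal vectors divided by $\pi$; this standard fact is not taken from the paper, and the sketch below simply checks it by Monte Carlo. Function names and parameters are assumptions.

```python
import numpy as np

def disagreement(w1, w2, n_samples=200_000, dim=3, seed=0):
    """Monte Carlo estimate of Pr_{x ~ uniform(unit ball)}[sign(w1.x) != sign(w2.x)]."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_samples, dim))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)      # uniform on the sphere
    x *= rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)  # push inside the ball uniformly
    return np.mean((x @ w1 >= 0) != (x @ w2 >= 0))

if __name__ == "__main__":
    theta = 0.3
    w1 = np.array([1.0, 0.0, 0.0])
    w2 = np.array([np.cos(theta), np.sin(theta), 0.0])  # normal at angle theta to w1
    print(disagreement(w1, w2))  # Monte Carlo estimate
    print(theta / np.pi)         # exact value: angle between normals divided by pi
```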
73 6 Separation. Theorem 8: For all $\alpha \in (0, 1/2)$ and positive integers $d$, there is a set $F$ of classifiers and a probability distribution $P$ over their common domain with the following properties: the VC-dimension of $F$ is at most $d$; $F$ contains at least $\left(\frac{1}{\alpha}\right)^{(1/2 - o(1))\,d}$ classifiers; and $\alpha \le \rho_P(f, g) \le 2\alpha$ for each pair of distinct $f, g \in F$. [sent-126, score-0.161]
74 Lemma 9: Suppose $A$ is chosen uniformly at random from among the subsets of $\{1, \ldots, s\}$ of size $k$. [sent-130, score-0.042]
75 Then, for any fixed $B \subseteq \{1, \ldots, s\}$ of size $k$ and any $\gamma \ge 1$, $\Pr\!\left(|A \cap B| \ge (1+\gamma)\frac{k^2}{s}\right) \le \left(\frac{e^{\gamma}}{(1+\gamma)^{1+\gamma}}\right)^{k^2/s}$. Proof: in Appendix A. [sent-131, score-0.038]
76 Now we’re ready for the proof of Theorem 8, which uses the deletion technique (see [2]). [sent-132, score-0.183]
77 Proof (of Theorem 8): Set the domain to be $\{1, \ldots, s\}$, where $s = \lceil d/\alpha \rceil$. [sent-133, score-0.078]
78 Suppose $f_1, \ldots, f_N$ are chosen independently, uniformly at random from among the classifiers that evaluate to $1$ on exactly $d$ elements of the domain. [sent-135, score-0.106]
79 For any distinct $i, j$, suppose $f_i^{-1}(1)$ is fixed, and we think of the members of $f_j^{-1}(1)$ as being chosen one at a time. [sent-136, score-0.098]
80 The probability that any given one of the elements of $f_j^{-1}(1)$ is also in $f_i^{-1}(1)$ is at most $d/s$. [sent-137, score-0.108]
81 Since each $f \in F$ has $|f^{-1}(1)| = d$, no function in $F$ evaluates to $1$ on each element of any set of $d+1$ elements of the domain. [sent-141, score-0.146]
82 The number of remaining elements of $F$ is at least $N/2$. Theorem 8 implies that there is no bound on the doubling dimension of $(F, P)$ in terms of the VC-dimension of $F$. [sent-143, score-0.059]
83 For any constraint on the VC-dimension, a set satisfying the constraint can have arbitrarily large doubling dimension by setting the value of $\alpha$ in Theorem 8 arbitrarily small. [sent-144, score-0.912]
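The flavor of the Theorem 8 construction can be simulated: draw random classifiers that are $1$ on exactly $d$ points of an $s$-point domain and inspect their pairwise disagreement distances under the uniform distribution. The parameter choices below, and the packaging as `random_family`/`distance`, are my own illustration rather than the paper's exact construction; the deletion step that removes close pairs is omitted.

```python
import random
from itertools import combinations

def random_family(s, d, N, seed=0):
    """N random classifiers, each the indicator of a d-element subset of {0, ..., s-1}."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(s), d)) for _ in range(N)]

def distance(a, b, s):
    """Disagreement probability under the uniform distribution on the s-point domain."""
    return len(a.symmetric_difference(b)) / s

if __name__ == "__main__":
    alpha, d = 0.05, 10
    s = int(d / alpha)                 # domain size chosen so that 2 * d / s = 2 * alpha
    family = random_family(s, d, N=60)
    dists = [distance(a, b, s) for a, b in combinations(family, 2)]
    # distances are always at most 2*alpha; with high probability they also exceed alpha
    print(min(dists), max(dists))
```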
84 Fast construction of nets in low dimensional metrics, and their applications. [sent-229, score-0.041]
85 Sphere packing numbers for subsets of the Boolean $n$-cube with bounded Vapnik-Chervonenkis dimension. [sent-233, score-0.176]
86 On the sample complexity of PAC learning halfspaces against the uniform distribution. [sent-281, score-0.341]
87 An upper bound on the sample complexity of PAC learning halfspaces with respect to the uniform distribution. [sent-291, score-0.477]
88 Distance estimation and object location via rings of neighbors. [sent-304, score-0.03]
89 Information theoretical upper and lower bounds for statistical estimation. [sent-332, score-0.12]
90 Lemma 12 ([9]): Collections $X_1, \ldots, X_n$ of negatively associated random variables satisfy Chernoff bounds: for any $\lambda > 0$, $\mathbf{E}\!\left(\exp\!\big(\lambda \sum_{i=1}^{n} X_i\big)\right) \le \prod_{i=1}^{n} \mathbf{E}\!\left(\exp(\lambda X_i)\right)$. Proof of Lemma 9: Let $Z_i \in \{0, 1\}$ indicate whether the $i$-th element of $B$ is also in $A$; these indicators are negatively associated. [sent-336, score-0.097]
91 Combining Lemma 12 with a standard Chernoff-Hoeffding bound (see Theorem 4.1 of [9]) gives the stated tail bound. [sent-339, score-0.099]
wordName wordTfidf (topN-words)
[('doubling', 0.655), ('halfspaces', 0.244), ('dimension', 0.181), ('lemma', 0.18), ('chernoff', 0.166), ('balls', 0.161), ('pac', 0.143), ('packing', 0.138), ('ball', 0.137), ('metric', 0.129), ('covered', 0.115), ('krauthgamer', 0.104), ('bound', 0.099), ('suppose', 0.098), ('negatively', 0.097), ('radius', 0.095), ('proof', 0.094), ('completes', 0.088), ('centered', 0.086), ('bounds', 0.083), ('element', 0.082), ('domain', 0.078), ('classi', 0.077), ('ers', 0.074), ('outputs', 0.074), ('dubhashi', 0.07), ('theorem', 0.066), ('elements', 0.064), ('absolute', 0.062), ('ready', 0.061), ('triangle', 0.061), ('podc', 0.061), ('soda', 0.061), ('gupta', 0.061), ('nder', 0.061), ('member', 0.059), ('er', 0.059), ('distance', 0.056), ('analyses', 0.056), ('uniform', 0.052), ('ces', 0.051), ('pair', 0.051), ('metrics', 0.049), ('richness', 0.049), ('greedily', 0.046), ('complexity', 0.045), ('focs', 0.044), ('probability', 0.044), ('hypothesis', 0.043), ('uniformly', 0.042), ('whenever', 0.041), ('analyzing', 0.041), ('nets', 0.041), ('integers', 0.039), ('empirical', 0.039), ('annals', 0.038), ('subsets', 0.038), ('arbitrarily', 0.038), ('upper', 0.037), ('training', 0.037), ('learns', 0.035), ('inequality', 0.034), ('nearest', 0.033), ('choose', 0.033), ('repeatedly', 0.033), ('applying', 0.033), ('bounding', 0.032), ('springer', 0.032), ('collection', 0.032), ('prove', 0.031), ('origin', 0.03), ('limits', 0.03), ('begun', 0.03), ('neyman', 0.03), ('outputting', 0.03), ('spencer', 0.03), ('sept', 0.03), ('plong', 0.03), ('beygelzimer', 0.03), ('navigating', 0.03), ('rings', 0.03), ('singapore', 0.03), ('disagree', 0.03), ('geometries', 0.03), ('subclass', 0.03), ('vapnikchervonenkis', 0.03), ('implies', 0.03), ('least', 0.03), ('generalizes', 0.029), ('classify', 0.029), ('embedding', 0.028), ('examples', 0.028), ('locality', 0.028), ('whatever', 0.028), ('alon', 0.028), ('peeling', 0.028), ('concentrates', 0.028), ('learnability', 0.028), ('lately', 0.028), ('deletion', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
2 0.12417588 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
3 0.097122028 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
4 0.086130396 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an $L_2$-norm sense. Using a recent result of Håstad, we extend the analysis to obtain a similar (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
5 0.084656477 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
6 0.076794922 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
7 0.076381534 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.066439301 14 nips-2006-A Small World Threshold for Economic Network Formation
9 0.066148333 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
10 0.065795287 186 nips-2006-Support Vector Machines on a Budget
11 0.065264061 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
12 0.063889444 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
13 0.062680267 150 nips-2006-On Transductive Regression
14 0.060357861 129 nips-2006-Map-Reduce for Machine Learning on Multicore
15 0.057495099 110 nips-2006-Learning Dense 3D Correspondence
16 0.056692507 21 nips-2006-AdaBoost is Consistent
17 0.054751031 50 nips-2006-Chained Boosting
18 0.054044344 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
19 0.054016639 163 nips-2006-Prediction on a Graph with a Perceptron
20 0.053671446 181 nips-2006-Stability of $K$-Means Clustering
topicId topicWeight
[(0, -0.162), (1, 0.061), (2, -0.069), (3, 0.009), (4, -0.036), (5, 0.134), (6, -0.091), (7, 0.064), (8, 0.004), (9, 0.013), (10, 0.089), (11, 0.002), (12, -0.002), (13, 0.003), (14, 0.033), (15, -0.098), (16, 0.013), (17, -0.021), (18, -0.004), (19, -0.109), (20, -0.048), (21, 0.096), (22, 0.064), (23, -0.012), (24, -0.054), (25, -0.117), (26, 0.027), (27, 0.088), (28, -0.047), (29, 0.002), (30, 0.065), (31, -0.075), (32, 0.007), (33, 0.072), (34, 0.021), (35, -0.104), (36, -0.07), (37, -0.004), (38, -0.037), (39, 0.037), (40, 0.047), (41, -0.048), (42, 0.041), (43, -0.16), (44, 0.165), (45, -0.138), (46, 0.046), (47, -0.021), (48, -0.006), (49, 0.093)]
simIndex simValue paperId paperTitle
same-paper 1 0.95651084 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
2 0.72649992 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
3 0.62194008 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
4 0.58161694 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
Author: Benjamin I. Rubinstein, Peter L. Bartlett, J. H. Rubinstein
Abstract: Under the prediction model of learning, a prediction strategy is presented with an i.i.d. sample of n − 1 points in X and corresponding labels from a concept f ∈ F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F)/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube—the one-inclusion graph; the key step is a d = VC(F) bound on one-inclusion graph density. The first main result of this paper is a density bound of $n \binom{n-1}{\le d-1} \big/ \binom{n}{\le d} < d$, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d ≈ Θ(n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k-class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout. 1
5 0.57429725 5 nips-2006-A Kernel Method for the Two-Sample-Problem
Author: Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, Alex J. Smola
Abstract: We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
6 0.57275498 193 nips-2006-Tighter PAC-Bayes Bounds
8 0.4756861 30 nips-2006-An Oracle Inequality for Clipped Regularized Risk Minimizers
9 0.46748319 155 nips-2006-Optimal Single-Class Classification Strategies
10 0.46068227 144 nips-2006-Near-Uniform Sampling of Combinatorial Spaces Using XOR Constraints
11 0.44743967 158 nips-2006-PG-means: learning the number of clusters in data
12 0.4088369 21 nips-2006-AdaBoost is Consistent
13 0.40010566 150 nips-2006-On Transductive Regression
14 0.38489339 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
15 0.38216543 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
16 0.38187447 163 nips-2006-Prediction on a Graph with a Perceptron
17 0.36761254 50 nips-2006-Chained Boosting
18 0.35899147 156 nips-2006-Ordinal Regression by Extended Binary Classification
19 0.35648853 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
20 0.35349661 146 nips-2006-No-regret Algorithms for Online Convex Programs
topicId topicWeight
[(1, 0.082), (3, 0.017), (7, 0.1), (9, 0.067), (20, 0.011), (22, 0.063), (44, 0.114), (54, 0.294), (57, 0.049), (65, 0.082), (69, 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.79715109 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of the class, and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of the class (in contrast with the metric dimension).
2 0.7937454 100 nips-2006-Information Bottleneck for Non Co-Occurrence Data
Author: Yevgeny Seldin, Noam Slonim, Naftali Tishby
Abstract: We present a general model-independent approach to the analysis of data in cases when these data do not appear in the form of co-occurrence of two variables X, Y , but rather as a sample of values of an unknown (stochastic) function Z(X, Y ). For example, in gene expression data, the expression level Z is a function of gene X and condition Y ; or in movie ratings data the rating Z is a function of viewer X and movie Y . The approach represents a consistent extension of the Information Bottleneck method that has previously relied on the availability of co-occurrence statistics. By altering the relevance variable we eliminate the need in the sample of joint distribution of all input variables. This new formulation also enables simple MDL-like model complexity control and prediction of missing values of Z. The approach is analyzed and shown to be on a par with the best known clustering algorithms for a wide range of domains. For the prediction of missing values (collaborative filtering) it improves the currently best known results. 1
3 0.57862061 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1
4 0.5607484 117 nips-2006-Learning on Graph with Laplacian Regularization
Author: Rie K. Ando, Tong Zhang
Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.
5 0.55665213 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.55655664 80 nips-2006-Fundamental Limitations of Spectral Clustering
7 0.55364192 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
8 0.5492658 20 nips-2006-Active learning for misspecified generalized linear models
9 0.54830325 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
10 0.54804426 21 nips-2006-AdaBoost is Consistent
11 0.54766673 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
12 0.54726946 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
13 0.54703027 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
14 0.54699218 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.54688543 163 nips-2006-Prediction on a Graph with a Perceptron
16 0.54593718 5 nips-2006-A Kernel Method for the Two-Sample-Problem
17 0.54509693 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
18 0.54249775 139 nips-2006-Multi-dynamic Bayesian Networks
19 0.53964293 35 nips-2006-Approximate inference using planar graph decomposition
20 0.53854972 98 nips-2006-Inferring Network Structure from Co-Occurrences