nips nips2008 nips2008-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive risk bounds for the randomized classifiers in the Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. [sent-3, score-0.243]
2 By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. [sent-5, score-0.287]
3 We further show how these compare favorably to the existing results. [sent-6, score-0.014]
4 1 Introduction The Sample compression framework [Littlestone and Warmuth, 1986, Floyd and Warmuth, 1995] has resulted in an important class of learning algorithms known as sample compression algorithms. [sent-7, score-0.665]
5 Moreover, the approach has also resulted in practically realizable bounds and has shown significant promise in using these bounds for model selection. [sent-10, score-0.122]
6 On another learning theoretic front, the PAC-Bayes approach [McAllester, 1999] has shown that stochastic classifier selection can prove to be more powerful than outputting a deterministic classifier. [sent-11, score-0.014]
7 With regard to the sample compression settings, this was further confirmed in the case of the sample compressed Gibbs classifier by Laviolette and Marchand [2007]. [sent-12, score-0.536]
8 However, the specific classifier output by the algorithm (according to a selected posterior) is generally of immediate interest since this is the classifier whose future performance is of relevance in practice. [sent-13, score-0.016]
9 Diluting such guarantees in terms of the expectation of the risk under the posterior over the classifier space, although it gives tighter risk bounds, results in averaged statements about the expected true error. [sent-14, score-0.36]
10 A significant result in obtaining such guarantees for the specific randomized classifier has appeared in the form of Occam’s Hammer [Blanchard and Fleuret, 2007]. [sent-15, score-0.078]
11 It deals with bounding the performance of algorithms that result in a set output when given training data. [sent-16, score-0.064]
12 With respect to classifiers, this results in a bound on the true risk of the randomized classifier output by the algorithm in accordance with a learned posterior over the classifier space from training data. [sent-17, score-0.346]
13 Blanchard and Fleuret [2007] also present a PAC-Bayes bound for the data-independent settings (when the classifier space is defined independently of the training data). [sent-18, score-0.119]
14 Motivated by this result, we derive risk bounds for the randomized sample compressed classifiers. [sent-19, score-0.405]
15 Note that the classifier space in the case of sample compression settings, unlike other settings, is data-dependent in the sense that it is defined upon the specification of training data. [sent-20, score-0.38]
16 (Footnote 1) Note that the classifier space depends on the amount of the training data, as we see further, and not on the training data themselves. [sent-21, score-0.051]
17 Hence, a data-independent prior over the classifier space can still be obtained in this setting, e.g., in the PAC-Bayes case, owing to the independence of the classifier space definition from the content of the training data. [sent-22, score-0.038]
18 The rest of the paper is organized as follows: Section 2 provides a background on the sample compressed classifiers and establishes the context; Section 3 then states the Occam's Hammer for the data-independent setting. [sent-25, score-0.219]
20 We then derive bounds for the randomized sample compressed classifier in Section 4 followed by showing how we can recover bounds for the sample compressed Gibbs case (classical PAC-Bayes for sample compressed classifiers) in Section 5. [sent-26, score-0.703]
21 2 Sample Compressed (SC) Classifiers We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space is Y = {−1, +1}. [sent-28, score-0.126]
22 Sample compression learning algorithms are characterized as follows: given a training set S = {z1 , . . . , zm }, the algorithm returns a classifier that is identified by a compression set zi ⊆ S and a message string σ, both defined below. [sent-30, score-0.016]
23 Given a training set S, the compression set zi is defined by a vector i of indices i = (i1 , i2 , . . . , i|i| ) with i1 < i2 < . . . < i|i| , where |i| denotes the number of indices present in i. [sent-34, score-0.989] [sent-43, score-0.018]
25 Hence, zj denotes the jth example of S (for a scalar index j), whereas zi denotes the subset of examples of S that are pointed to by the vector of indices i defined above. [sent-44, score-1.196]
26 We will use ī to denote the set of indices not present in i. [sent-45, score-0.018]
27 Hence, we have S = zi ∪ zī for any vector i ∈ I, where I denotes the set of the 2^m possible realizations of i. [sent-46, score-1.178]
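As a concrete illustration of this indexing (a minimal sketch; the function and variable names are ours, not the paper's, and 0-based indices are used instead of the 1-based indices of the text):

    def split_by_indices(S, i):
        """Return (z_i, z_ibar): the examples of S picked out by the index
        vector i, and the remaining examples, so that S = z_i U z_ibar."""
        chosen = set(i)
        z_i = [S[j] for j in sorted(chosen)]
        z_ibar = [S[j] for j in range(len(S)) if j not in chosen]
        return z_i, z_ibar

    # Example with m = 6 labelled examples and i = (0, 3, 5)
    S = [((0.1, 0.2), -1), ((0.4, 0.9), +1), ((0.3, 0.3), -1),
         ((0.8, 0.7), +1), ((0.2, 0.6), -1), ((0.9, 0.1), +1)]
    z_i, z_ibar = split_by_indices(S, (0, 3, 5))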
28 Finally, a learning algorithm A is a sample compression learning algorithm (that is, one whose output is identified solely by a compression set zi and a message string σ) iff there exists a reconstruction function R : (X × Y)^{|i|} × K −→ H associated with A. [sent-47, score-1.345]
29 Here, H is the (data-dependent) classifier space and K ⊂ I × M is the set of admissible pairs (i, σ), i.e., those with σ ∈ M(zi ). [sent-48, score-0.019]
30 That is, R outputs a classifier R(σ, zi ) when given an arbitrary compression set zi ⊆ S and message string σ chosen from the set M(zi ) of all distinct messages that can be supplied to R with the compression set zi . [sent-51, score-2.472]
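The paper leaves R arbitrary; purely as a hedged illustration (this is not the reconstruction function of any particular algorithm discussed here), one can picture R as a nearest-neighbour rule built from the compression set, with the message string σ carrying a discrete parameter:

    def reconstruct(sigma, z_i):
        """Toy reconstruction function R(sigma, z_i): a 1-nearest-neighbour rule
        over the compression set, where the (hypothetical) message string sigma
        lists which input coordinates to use, e.g. sigma = "01"."""
        coords = [int(c) for c in sigma]

        def classify(x):
            def sq_dist(xp):
                return sum((x[c] - xp[c]) ** 2 for c in coords)
            # predict the label of the closest compression-set example
            _, label = min(z_i, key=lambda ex: sq_dist(ex[0]))
            return label

        return classify

    # f = reconstruct("01", z_i)   # using z_i from the previous sketch
    # f((0.15, 0.25))              # -> -1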
31 We seek a tight risk bound for arbitrary reconstruction functions that holds uniformly for all compression sets and message strings. [sent-52, score-0.588]
32 For this, we adopt the PAC setting where each example z is drawn according to a fixed, but unknown, probability distribution D on X × Y. [sent-53, score-0.043]
33 The true risk R(f ) of any classifier f is defined as the probability that it misclassifies an example drawn according to D: R(f ) = Pr(x,y)∼D (f (x) ≠ y) = E(x,y)∼D I(f (x) ≠ y), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-54, score-0.246]
34 Given the training set S = {z1 , . . . , zm } of m examples, the empirical risk RS (f ) on S of any classifier f is defined as RS (f ) = (1/m) Σ_{i=1}^{m} I(f (xi ) ≠ yi ) = E(x,y)∼S I(f (x) ≠ y). Let Z^m denote the collection of m random variables whose instantiation gives a training sample S = z^m = {z1 , . . . , zm }. [sent-58, score-0.612]
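A direct numerical reading of these definitions (a sketch; the indicator convention follows the text above):

    def empirical_risk(f, sample):
        """R_S(f): the fraction of examples (x, y) in `sample` misclassified by f."""
        return sum(1 for x, y in sample if f(x) != y) / len(sample)

    # The true risk R(f) replaces `sample` by examples drawn from D itself; in
    # practice it can only be estimated, e.g. on a large held-out set.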
35 To obtain the tightest possible risk bound, we will fully exploit the fact that the distribution of classification errors is a binomial. [sent-62, score-0.134]
36 We now discuss the generic Occam's Hammer principle (w.r.t. the classification scenario) and then go on to show how it can be applied to the sample compression setting. [sent-63, score-0.021] [sent-66, score-0.345]
38 3 Occam's Hammer for the data-independent setting In this section, we briefly detail Occam's Hammer [Blanchard and Fleuret, 2007] for the data-independent setting. [sent-67, score-0.192]
39 Occam's Hammer works by bounding the probability of a bad event, defined as follows. [sent-69, score-0.261]
40 For every classifier h ∈ H, and a confidence parameter δ ∈ [0, 1], the bad event B(h, δ) is defined as the region where the desired property on the classifier h does not hold, with probability δ. [sent-70, score-0.064]
41 Intuitively, this means that with decreasing δ the bound on the true error of the classifier h becomes tighter. [sent-73, score-0.081]
42 With the above assumption satisfied, let P be a non-negative reference measure on the classifier space H, known as the volumic measure. [sent-74, score-0.048]
43 Let Π be a probability distribution on H that is absolutely continuous w.r.t. P, with density dΠ/dP. [sent-75, score-0.026]
44 Let Γ be a probability distribution on (0, +∞) (the inverse density prior). [sent-79, score-0.04]
45 Then for any algorithm S → θS returning a probability density θS over H with respect to P, and such that (S, h) → θS (h) is jointly measurable in its two variables, it holds that Pr_{S∼D^m, h∼Q} ( S ∈ B(h, ∆(h, θS (h)^{−1})) ) ≤ δ, where Q is the distribution on H such that dQ/dP = θS . [sent-82, score-0.058]
46 Note above that Q is the (data-dependent) posterior distribution on H after observing the data sample S while P is the data-independent prior on H. [sent-83, score-0.131]
47 Moreover, the distribution Π on the space of classifiers may or may not be data-dependent. [sent-85, score-0.032]
48 As we will see later, in the case of sample compression learning settings we will consider priors over the space of classifiers without reference to the data (as in the PAC-Bayes case). [sent-86, score-0.415]
49 To this end, we can either opt for a prior Π independent of the data or make it the same as the volumic measure P, which establishes a distribution on the classifier space without reference to the data. [sent-87, score-0.113]
50 4 Bounds for Randomized SC Classifiers We work in the sample compression settings and as mentioned before, each classifier in this setting is denoted in terms of a compression set and a message string. [sent-88, score-0.745]
51 A reconstruction function then uses these two information sources to reconstruct the classifier. [sent-89, score-0.043]
52 The hypothesis space is defined, in our case, based on the size of the data sample (and not on the actual contents of the sample). [sent-92, score-0.07]
53 Hence, we consider the priors built on the size of the possible compression sets and associated message strings. [sent-93, score-0.39]
54 The above choice of the form for PI (i) — a prior p(|i|) over compression-set sizes spread uniformly over the (m choose |i|) compression sets of that size — is appropriate since we do not have any a priori information to distinguish one compression set from another. [sent-95, score-0.045]
55 However, as we will see later, we should choose p(d) such that we give more weight to smaller compression sets. [sent-96, score-0.294]
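A minimal sketch of such a prior; the particular choice p(d) proportional to zeta**d (normalised over d = 0..m) is our illustrative assumption, the text only requiring that p(d) favour small d:

    from math import comb

    def compression_set_prior(i, m, zeta=0.5):
        """P_I(i) = p(|i|) / C(m, |i|): a prior p over compression-set sizes,
        spread uniformly over the C(m, d) sets of each size d = |i|.
        With p(d) proportional to zeta**d, smaller compression sets get more mass."""
        d = len(i)
        p_d = (zeta ** d) / sum(zeta ** k for k in range(m + 1))
        return p_d / comb(m, d)

    # compression_set_prior((0, 3, 5), m=6)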
56 Then, we are interested in algorithms that output a posterior Q ∈ PK over the space of classifiers with probability density Q(zi , σ) factorizable as QI (i)QM(zi ) (σ). [sent-98, score-0.11]
57 A sample compressed classifier is then defined by choosing a classifier (zi , σ) according to the posterior Q(zi , σ). [sent-99, score-0.242]
58 This is basically the Gibbs classifier defined in the PAC-Bayes settings where the idea is to bound the true risk of this Gibbs classifier defined as R(GQ ) = E(zi ,σ)∼Q R((zi , σ)). [sent-100, score-0.225]
59 On the other hand, we are interested in bounding the true risk of the specific classifier (zi , σ) output according to Q. [sent-101, score-0.189]
60 As shown in [Laviolette and Marchand, 2007], a rescaled posterior Q̄ of the following form can provide tighter guarantees while maintaining Occam's principle of parsimony. [sent-102, score-0.154]
61 Definition 2 Given a distribution Q ∈ PK , we denote by Q̄ the distribution Q̄(zi , σ) = Q(zi , σ) / ( |i| · E(zi ,σ)∼Q [1/|i|] ) = QI (i)QM(zi ) (σ) / ( |i| · E(zi ,σ)∼Q [1/|i|] ) for all (zi , σ) ∈ K. Hence, note that the posterior is effectively rescaled for the compression set part. [sent-103, score-0.469]
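For a discrete posterior over (i, σ) pairs, the rescaling of Definition 2 amounts to dividing each weight by |i| and renormalising; a minimal sketch, assuming every compression set is non-empty:

    def rescale_posterior(Q):
        """Q maps (i, sigma) -> probability, with i a tuple of indices.
        Returns Q_bar with Q_bar(i, sigma) = Q(i, sigma) / (|i| * E_Q[1/|i|]),
        which sums to one and shifts mass towards smaller compression sets."""
        expected_inverse_size = sum(q / len(i) for (i, sigma), q in Q.items())
        return {(i, sigma): q / (len(i) * expected_inverse_size)
                for (i, sigma), q in Q.items()}

    # Q = {((0, 3), "a"): 0.5, ((0, 3, 5), "b"): 0.5}
    # rescale_posterior(Q)   # -> roughly {((0, 3), 'a'): 0.6, ((0, 3, 5), 'b'): 0.4}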
62 4.1 A PAC-Bayes Bound for the randomized SC classifier We exploit the fact that the distribution of the errors is binomial and define the following error quantities (for a given i, and hence zī , over z′ ∼ D^{|ī|} ): Definition 3 Let S ∼ D^m with D a distribution on X × Y, and (zi , σ) ∈ K. [sent-107, score-0.768]
63 We denote by BinS (i, σ) the probability that the classifier R(zi , σ), of (true) risk R(zi , σ), makes |ī| Rzī (zi , σ) or fewer errors on z′ ∼ D^{|ī|}. [sent-108, score-0.134]
64 That is, BinS (i, σ) = Σ_{λ=0}^{ |ī| Rzī (zi ,σ) } ( |ī| choose λ ) (R(σ, zi ))^λ (1 − R(σ, zi ))^{|ī|−λ}, and we denote by BS (i, σ) the probability that this classifier makes exactly |ī| Rzī (zi , σ) errors on z′ ∼ D^{|ī|}. [sent-109, score-1.208]
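A sketch of BinS as the binomial tail it names, where r plays the role of the true risk R(σ, zi ), n the number of examples outside the compression set, and k the number of empirical errors made on them:

    from math import comb

    def binomial_tail(k, n, r):
        """Probability that a classifier of true risk r makes k or fewer errors
        on n fresh examples drawn i.i.d. from D."""
        return sum(comb(n, lam) * r ** lam * (1 - r) ** (n - lam)
                   for lam in range(k + 1))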
65 As also shown in [Laviolette and Marchand, 2007], since (m choose j) = (m choose m−j) and kl(RS (f ) ‖ R(f )) = kl(1 − RS (f ) ‖ 1 − R(f )), the above inequality holds true for each factor inside the sum on the left hand side. [sent-111, score-0.038]
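Here kl(q ‖ p) denotes the KL divergence between Bernoulli distributions with parameters q and p; a small numerical sketch of the symmetry used above:

    from math import log

    def kl_bernoulli(q, p):
        """kl(q || p) between Bernoulli(q) and Bernoulli(p), with q, p in (0, 1)."""
        return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

    # Symmetry used above: kl(q || p) == kl(1 - q || 1 - p), e.g.
    # abs(kl_bernoulli(0.1, 0.3) - kl_bernoulli(0.9, 0.7)) < 1e-12   # True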
66 Let ψzm (i, σ) be the probability density of the rescaled data-dependent posterior distribution Q̄ over the classifier space with respect to the volumic measure P. [sent-113, score-0.229]
67 However, note that we are interested in data-independent priors over the space of classifiers (see Footnote 2), and hence we consider our prior Π to be the same as the volumic measure P over the classifier space, yielding π equal to unity. [sent-115, score-0.102]
68 That is, our prior gives a distribution over the classifier space without any regard to the data. [sent-116, score-0.065]
69 Note that we do not encounter the 1/(m − d) factor in the bound but instead 1/(m − |i|), unlike the Q̄-based bound of Laviolette and Marchand [2007]. [sent-118, score-0.122]
70 This is because the PAC-Bayes bound of Laviolette and Marchand [2007] computes the expectation of the kl-divergence between the empirical and true risks of the classifiers chosen according to Q̄. [sent-119, score-0.243]
71 This, as a result of rescaling of Q in preference of smaller compression sets, is reflected in the bound. [sent-120, score-0.294]
72 On the other hand, the bound of Theorem 4 is a point-wise version bounding the true error of the specific classifier chosen according to Q and hence concerns the specific compression set utilized by this classifier. [sent-121, score-0.451]
73 4.2 A Binomial Tail Inversion Bound for the randomized SC classifier A tighter condition can be imposed on the true risk of the classifier by considering the binomial tail inversion over the distribution of errors. [sent-123, score-0.331]
74 Moreover, in the sample compression case, we are interested in the empirical risk of the classifier on the examples not in the compression set (consistent compression set assumption). [sent-126, score-1.037]
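The binomial tail inversion can be computed by simple bisection: it returns the largest true-risk value r for which observing k or fewer errors out of n still has probability at least δ. A sketch reusing binomial_tail from the earlier snippet:

    def binomial_tail_inversion(k, n, delta, tol=1e-10):
        """Largest risk value r in [0, 1] with binomial_tail(k, n, r) >= delta.
        binomial_tail(k, n, r) is non-increasing in r, so bisection applies."""
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binomial_tail(k, n, mid) >= delta:
                lo = mid      # still at least delta probability: push r higher
            else:
                hi = mid
        return lo

    # e.g. binomial_tail_inversion(k=3, n=50, delta=0.05) gives an upper bound on
    # the true risk of a classifier making 3 errors on the n = m - |i| examples
    # outside its compression set, at confidence 1 - delta (illustrative usage).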
75 Now, let δr be a δ-weighed measure on the classifier space. [sent-127, score-0.014]
76 Then, for the compression sets and associated message strings, the analogous guarantee holds. (Footnote 2: hence the missing S in the subscript of π(h) in the r.h.s.) [sent-130, score-0.017]
77 Alternatively, let P (zi , σ) and Q̄(zi , σ) denote the probability densities of the prior distribution P and the rescaled posterior distribution Q̄ over classifiers, such that dQ̄ = Q̄(zi , σ)dµ and dP = P (zi , σ)dµ w.r.t. some underlying measure µ, so that dQ̄/dP = Q̄(zi , σ)/P (zi , σ). [sent-134, score-0.135]
78 Note that the final expression is independent of the underlying measure µ. [sent-139, score-0.014]
79 We follow the general line of argument of Blanchard and Fleuret [2007] to recover the PAC-Bayes bound for the Sample Compressed Gibbs classifier. [sent-142, score-0.076]
80 However, note that we do this for the data-dependent setting here and also utilize the rescaled posterior over the space of sample compressed classifiers. [sent-143, score-0.286]
81 Theorem 8 recovers almost exactly the PAC-Bayes bound (holding with probability at least 1 − δ) for the Sample Compressed Classifiers of Laviolette and Marchand [2007]. [sent-148, score-0.061]
82 Note that the bound of Theorem 8 is derived in a relatively more straightforward manner with the Occam’s Hammer criterion. [sent-150, score-0.061]
83 6 Conclusion It has been shown, via the PAC-Bayes principle, that stochastic classifier selection is preferable to deterministic selection, resulting in tighter risk bounds on the averaged risk of classifiers under the learned posterior. [sent-151, score-0.335]
84 Further, this observation resulted in tight bounds in the case of stochastic sample compressed classifiers [Laviolette and Marchand, 2007], also showing that sparsity considerations are of importance even in this scenario. [sent-152, score-0.265]
85 However, of immediate relevance are the guarantees for the specific classifier output by such algorithms according to the learned posterior, and hence a point-wise version of this bound is indeed needed. [sent-154, score-0.185]
86 We have derived bounds for such randomized sample compressed classifiers by adapting Occam’s Hammer principle to the data-dependent sample compression settings. [sent-155, score-0.653]
87 This has resulted in bounds on the specific classifier output by a sample compression learning algorithm according to the learned data-dependent posterior, which is more relevant in practice. [sent-156, score-0.5]
88 Further, we also showed how the classical PAC-Bayes bound for the sample compressed Gibbs classifier can be recovered in a more direct manner, and showed that this compares favorably to the existing result of Laviolette and Marchand [2007]. [sent-157, score-0.265]
89 In Proceedings of the 20th Annual Conference on Learning Theory (COLT-2007), volume 4539 of Lecture Notes in Computer Science, pages 112–126, 2007. [sent-161, score-0.018]
90 PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. [sent-169, score-0.166]
91 In Proceedings of the 16th European Conference on Machine Learning, ECML 2005, volume 3720 of Lecture Notes in Artificial Intelligence, pages 206–217. [sent-173, score-0.018]
wordName wordTfidf (topN-words)
[('zi', 0.589), ('compression', 0.294), ('rzi', 0.274), ('qm', 0.236), ('er', 0.2), ('ln', 0.194), ('hammer', 0.165), ('marchand', 0.165), ('laviolette', 0.156), ('classi', 0.155), ('occam', 0.154), ('qi', 0.154), ('zm', 0.14), ('compressed', 0.126), ('kl', 0.109), ('risk', 0.104), ('gq', 0.099), ('fleuret', 0.096), ('blanchard', 0.096), ('rs', 0.091), ('dq', 0.087), ('message', 0.083), ('pr', 0.081), ('ers', 0.077), ('def', 0.072), ('dm', 0.068), ('randomized', 0.062), ('bound', 0.061), ('pm', 0.056), ('sample', 0.051), ('bin', 0.049), ('bounds', 0.048), ('posterior', 0.048), ('binomial', 0.047), ('gibbs', 0.046), ('pi', 0.045), ('sc', 0.044), ('rescaled', 0.042), ('expectancy', 0.041), ('mario', 0.041), ('mohak', 0.041), ('ei', 0.036), ('francois', 0.036), ('pk', 0.034), ('string', 0.034), ('strings', 0.033), ('bounding', 0.032), ('tail', 0.029), ('inversion', 0.029), ('bad', 0.029), ('reconstruction', 0.028), ('dataindependent', 0.027), ('prs', 0.027), ('tighter', 0.027), ('hence', 0.027), ('langford', 0.027), ('bs', 0.026), ('resulted', 0.026), ('rz', 0.024), ('floyd', 0.024), ('dp', 0.024), ('settings', 0.023), ('event', 0.022), ('principle', 0.021), ('cruz', 0.021), ('bins', 0.02), ('true', 0.02), ('littlestone', 0.019), ('prior', 0.019), ('space', 0.019), ('warmuth', 0.019), ('hs', 0.019), ('volume', 0.018), ('holds', 0.018), ('indices', 0.018), ('theorem', 0.018), ('santa', 0.018), ('errors', 0.017), ('according', 0.017), ('basically', 0.017), ('subscript', 0.017), ('output', 0.016), ('training', 0.016), ('guarantees', 0.016), ('argument', 0.015), ('reference', 0.015), ('establishes', 0.015), ('sources', 0.015), ('stochastic', 0.014), ('measure', 0.014), ('derive', 0.014), ('regard', 0.014), ('favorably', 0.014), ('density', 0.014), ('probability', 0.013), ('priors', 0.013), ('classical', 0.013), ('distribution', 0.013), ('yields', 0.013), ('john', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.20439334 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
3 0.19842245 238 nips-2008-Theory of matching pursuit
Author: Zakria Hussain, John S. Shawe-taylor
Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1
4 0.17276636 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
5 0.16689405 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
6 0.11527205 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
7 0.10533523 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
8 0.10431311 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
9 0.095524073 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
10 0.094231062 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.093663126 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
12 0.092452176 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
13 0.091744415 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
14 0.090313651 65 nips-2008-Domain Adaptation with Multiple Sources
15 0.089132927 62 nips-2008-Differentiable Sparse Coding
16 0.084136993 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
17 0.08126875 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
18 0.081154399 32 nips-2008-Bayesian Kernel Shaping for Learning Control
19 0.069085933 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
20 0.068416521 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
topicId topicWeight
[(0, -0.182), (1, -0.085), (2, -0.115), (3, -0.031), (4, -0.116), (5, 0.068), (6, -0.047), (7, -0.057), (8, -0.075), (9, 0.065), (10, 0.043), (11, 0.139), (12, 0.247), (13, 0.022), (14, -0.035), (15, 0.075), (16, 0.065), (17, 0.078), (18, 0.04), (19, -0.207), (20, 0.223), (21, -0.137), (22, 0.065), (23, -0.026), (24, -0.088), (25, -0.144), (26, -0.034), (27, 0.061), (28, 0.137), (29, 0.057), (30, 0.134), (31, 0.041), (32, -0.032), (33, -0.043), (34, 0.119), (35, -0.001), (36, -0.071), (37, -0.101), (38, 0.003), (39, 0.129), (40, 0.024), (41, 0.096), (42, -0.017), (43, -0.041), (44, -0.114), (45, 0.014), (46, -0.136), (47, -0.017), (48, -0.002), (49, -0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.97785294 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.54105216 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
Author: Massih Amini, Nicolas Usunier, François Laviolette
Abstract: We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associate Gibbs classifier. The bound is tight when so is the Gibbs’s bound and when the errors of the majority vote classifier is concentrated on a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search the decision boundary on low density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples for whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the set of unlabeled training examples that have their margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually. 1
3 0.5365032 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
4 0.51619452 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
5 0.48707184 238 nips-2008-Theory of matching pursuit
Author: Zakria Hussain, John S. Shawe-taylor
Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1
6 0.47467282 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
7 0.42051449 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
8 0.40736654 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
9 0.38514745 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
10 0.37186816 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
11 0.36977592 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
12 0.35469645 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
13 0.33880186 216 nips-2008-Sparse probabilistic projections
14 0.33791459 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
15 0.32711449 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
16 0.31350496 32 nips-2008-Bayesian Kernel Shaping for Learning Control
17 0.29717469 65 nips-2008-Domain Adaptation with Multiple Sources
18 0.29166794 15 nips-2008-Adaptive Martingale Boosting
19 0.28923452 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
20 0.28444636 178 nips-2008-Performance analysis for L\ 2 kernel classification
topicId topicWeight
[(6, 0.166), (7, 0.032), (12, 0.027), (15, 0.011), (28, 0.12), (38, 0.011), (54, 0.287), (57, 0.045), (59, 0.012), (63, 0.01), (71, 0.043), (77, 0.04), (83, 0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.77511376 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.69832933 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic
Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1
3 0.65382344 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
Author: Michael T. Todd, Yael Niv, Jonathan D. Cohen
Abstract: Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework [14] is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control [5-6]. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards. 1 I n t ro d u c t i o n Working memory is loosely defined in cognitive neuroscience as information that is (1) internally maintained on a temporary or short term basis, and (2) required for tasks in which immediate observations cannot be mapped to correct actions. It is widely assumed that prefrontal cortex (PFC) plays a role in maintaining and updating working memory. However, relatively little is known about how PFC develops useful working memory representations for a new task. Furthermore, current work focuses on describing the structure and limitations of working memory, but does not ask why, or in what general class of tasks, is it necessary. Borrowing from the theory of optimal control in partially observable Markov decision problems (POMDPs), we frame the psychological concept of working memory as an internal state representation, developed and employed to maximize future reward in partially observable environments. We combine computational insights from POMDPs and neurobiologically plausible models from cognitive neuroscience to suggest a simple reinforcement learning (RL) model of working memory function that can be implemented through dopaminergic training of the basal ganglia and PFC. The Gating framework is a series of cognitive neuroscience models developed to explain how dopaminergic RL signals can shape useful working memory representations [1-4]. Computationally this framework models working memory as a collection of past observations, each of which can occasionally be replaced with the current observation, and addresses the problem of learning when to update each memory element versus maintaining it. In the original Gating model [1-2] the PFC contained a unitary working memory representation that was updated whenever a phasic dopamine (DA) burst occurred (e.g., due to unexpected reward or novelty). That model was the first to connect working memory and RL via the temporal difference (TD) model of DA firing [7-8], and thus to suggest how working memory might serve a normative purpose. However, that model had limited computational flexibility due to the unitary nature of the working memory (i.e., a singleobservation memory controlled by a scalar DA signal). 
More recent work [3-4] has partially repositioned the Gating framework within the Actor/Critic model of mesostriatal RL [9-10], positing memory updating as but another cortical action controlled by the dorsal striatal
4 0.60477436 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
5 0.5960626 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf
Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1
6 0.59283948 178 nips-2008-Performance analysis for L\ 2 kernel classification
7 0.59252852 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons
8 0.58070326 65 nips-2008-Domain Adaptation with Multiple Sources
9 0.57928532 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
10 0.57371312 202 nips-2008-Robust Regression and Lasso
11 0.57221001 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
12 0.56515026 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
13 0.55996776 75 nips-2008-Estimating vector fields using sparse basis field expansions
14 0.55891716 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
15 0.55471838 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
16 0.5539487 62 nips-2008-Differentiable Sparse Coding
17 0.55342025 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
18 0.55227005 228 nips-2008-Support Vector Machines with a Reject Option
19 0.55138612 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
20 0.5507921 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging