nips nips2008 nips2008-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We derive risk bounds for the randomized classifiers in the Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. [sent-3, score-0.243]
2 By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. [sent-5, score-0.287]
3 We further show how these compare favorably to the existing results. [sent-6, score-0.014]
4 1 Introduction The Sample compression framework [Littlestone and Warmuth, 1986, Floyd and Warmuth, 1995] has resulted in an important class of learning algorithms known as sample compression algorithms. [sent-7, score-0.665]
5 Moreover, the approach has also resulted in practically realizable bounds and has shown significant promise in using these bounds for model selection. [sent-10, score-0.122]
6 On another learning theoretic front, the PAC-Bayes approach [McAllester, 1999] has shown that stochastic classifier selection can prove to be more powerful than outputting a deterministic classifier. [sent-11, score-0.014]
7 With regard to the sample compression settings, this was further confirmed in the case of the sample compressed Gibbs classifier by Laviolette and Marchand [2007]. [sent-12, score-0.536]
8 However, the specific classifier output by the algorithm (according to a selected posterior) is generally of immediate interest since this is the classifier whose future performance is of relevance in practice. [sent-13, score-0.016]
9 Diluting such guarantees in terms of the expectation of the risk under the posterior over the classifier space, although it gives tighter risk bounds, results in averaged statements about the expected true error. [sent-14, score-0.36]
10 A significant result in obtaining such guarantees for the specific randomized classifier has appeared in the form of Occam’s Hammer [Blanchard and Fleuret, 2007]. [sent-15, score-0.078]
11 It deals with bounding the performance of algorithms that result in a set output when given training data. [sent-16, score-0.064]
12 With respect to classifiers, this results in a bound on the true risk of the randomized classifier output by the algorithm in accordance with a learned posterior over the classifier space from training data. [sent-17, score-0.346]
13 Blanchard and Fleuret [2007] also present a PAC-Bayes bound for the data-independent settings (when the classifier space is defined independently of the training data). [sent-18, score-0.119]
14 Motivated by this result, we derive risk bounds for the randomized sample compressed classifiers. [sent-19, score-0.405]
15 Note that the classifier space in the case of sample compression settings, unlike other settings, is data-dependent in the sense that it is defined upon the specification of training data. [sent-20, score-0.38]
16 (Footnote 1) Note that the classifier space depends on the amount of the training data, as we see further, and not on the training data themselves. [sent-21, score-0.051]
17 Hence, a data-independent prior over the classifier space can still be obtained in this setting, e.g., in the PAC-Bayes case, owing to the independence of the classifier space definition from the content of the training data. [sent-22, score-0.038]
18 The rest of the paper is organized as follows: Section 2 provides a background on the sample compressed classifiers and establishes the context; Section 3 then states the Occam's Hammer for the data-independent setting. [sent-25, score-0.219]
20 We then derive bounds for the randomized sample compressed classifier in Section 4 followed by showing how we can recover bounds for the sample compressed Gibbs case (classical PAC-Bayes for sample compressed classifiers) in Section 5. [sent-26, score-0.703]
21 2 Sample Compressed (SC) Classifiers We consider binary classification problems where the input space X consists of an arbitrary subset of R^n and the output space is Y = {−1, +1}. [sent-28, score-0.126]
22 Sample compression learning algorithms are characterized as follows: given a training set S = {z1 , . . . , zm }, the algorithm returns a classifier that is identified by a compression set zi ⊆ S and a message string σ, both defined below. [sent-30, score-0.016]
23 Given a training set S, the compression set zi is defined by a vector i of indices i = (i1 , i2 , . . . , i|i| ) with i1 < i2 < . . . < i|i| , where |i| denotes the number of indices present in i. [sent-34, score-0.989] [sent-43, score-0.018]
25 Hence, zj denotes the jth example of S (for a scalar index j), whereas zi denotes the subset of examples of S that are pointed to by the vector of indices i defined above. [sent-44, score-1.196]
26 We will use ī to denote the set of indices not present in i. [sent-45, score-0.018]
27 Hence, we have S = zi ∪ zī for any vector i ∈ I, where I denotes the set of the 2^m possible realizations of i. [sent-46, score-1.178]
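As a concrete illustration of this indexing (a minimal sketch; the function and variable names are ours, not the paper's, and 0-based indices are used instead of the 1-based indices of the text):

    def split_by_indices(S, i):
        """Return (z_i, z_ibar): the examples of S picked out by the index
        vector i, and the remaining examples, so that S = z_i U z_ibar."""
        chosen = set(i)
        z_i = [S[j] for j in sorted(chosen)]
        z_ibar = [S[j] for j in range(len(S)) if j not in chosen]
        return z_i, z_ibar

    # Example with m = 6 labelled examples and i = (0, 3, 5)
    S = [((0.1, 0.2), -1), ((0.4, 0.9), +1), ((0.3, 0.3), -1),
         ((0.8, 0.7), +1), ((0.2, 0.6), -1), ((0.9, 0.1), +1)]
    z_i, z_ibar = split_by_indices(S, (0, 3, 5))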
28 Finally, a learning algorithm A is a sample compression learning algorithm (that is, one whose output is identified solely by a compression set zi and a message string σ) iff there exists a reconstruction function R : (X × Y)^{|i|} × K −→ H associated with A. [sent-47, score-1.345]
29 Here, H is the (data-dependent) classifier space and K ⊂ I × M is the set of admissible pairs (i, σ), i.e., those with σ ∈ M(zi ). [sent-48, score-0.019]
30 That is, R outputs a classifier R(σ, zi ) when given an arbitrary compression set zi ⊆ S and message string σ chosen from the set M(zi ) of all distinct messages that can be supplied to R with the compression set zi . [sent-51, score-2.472]
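The paper leaves R arbitrary; purely as a hedged illustration (this is not the reconstruction function of any particular algorithm discussed here), one can picture R as a nearest-neighbour rule built from the compression set, with the message string σ carrying a discrete parameter:

    def reconstruct(sigma, z_i):
        """Toy reconstruction function R(sigma, z_i): a 1-nearest-neighbour rule
        over the compression set, where the (hypothetical) message string sigma
        lists which input coordinates to use, e.g. sigma = "01"."""
        coords = [int(c) for c in sigma]

        def classify(x):
            def sq_dist(xp):
                return sum((x[c] - xp[c]) ** 2 for c in coords)
            # predict the label of the closest compression-set example
            _, label = min(z_i, key=lambda ex: sq_dist(ex[0]))
            return label

        return classify

    # f = reconstruct("01", z_i)   # using z_i from the previous sketch
    # f((0.15, 0.25))              # -> -1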
31 We seek a tight risk bound for arbitrary reconstruction functions that holds uniformly for all compression sets and message strings. [sent-52, score-0.588]
32 For this, we adopt the PAC setting where each example z is drawn according to a fixed, but unknown, probability distribution D on X × Y. [sent-53, score-0.043]
33 The true risk R(f ) of any classifier f is defined as the probability that it misclassifies an example drawn according to D: R(f ) = Pr(x,y)∼D (f (x) ≠ y) = E(x,y)∼D I(f (x) ≠ y), where I(a) = 1 if predicate a is true and 0 otherwise. [sent-54, score-0.246]
34 Given the training set S = {z1 , . . . , zm } of m examples, the empirical risk RS (f ) on S of any classifier f is defined as RS (f ) = (1/m) Σ_{i=1}^{m} I(f (xi ) ≠ yi ) = E(x,y)∼S I(f (x) ≠ y). Let Z^m denote the collection of m random variables whose instantiation gives a training sample S = z^m = {z1 , . . . , zm }. [sent-58, score-0.612]
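A direct numerical reading of these definitions (a sketch; the indicator convention follows the text above):

    def empirical_risk(f, sample):
        """R_S(f): the fraction of examples (x, y) in `sample` misclassified by f."""
        return sum(1 for x, y in sample if f(x) != y) / len(sample)

    # The true risk R(f) replaces `sample` by examples drawn from D itself; in
    # practice it can only be estimated, e.g. on a large held-out set.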
35 To obtain the tightest possible risk bound, we will fully exploit the fact that the distribution of classification errors is a binomial. [sent-62, score-0.134]
36 We now discuss the generic Occam's Hammer principle (w.r.t. the classification scenario) and then go on to show how it can be applied to the sample compression setting. [sent-63, score-0.021] [sent-66, score-0.345]
38 3 Occam's Hammer for the data-independent setting In this section, we briefly detail Occam's Hammer [Blanchard and Fleuret, 2007] for the data-independent setting. [sent-67, score-0.192]
39 Occam's Hammer works by bounding the probability of a bad event, defined as follows. [sent-69, score-0.261]
40 For every classifier h ∈ H, and a confidence parameter δ ∈ [0, 1], the bad event B(h, δ) is defined as the region where the desired property on the classifier h does not hold, with probability δ. [sent-70, score-0.064]
41 Intuitively, this means that with decreasing δ the bound on the true error of the classifier h becomes tighter. [sent-73, score-0.081]
42 With the above assumption satisfied, let P be a non-negative reference measure on the classifier space H, known as the volumic measure. [sent-74, score-0.048]
43 Let Π be a probability distribution on H that is absolutely continuous w.r.t. P, with density dΠ/dP. [sent-75, score-0.026]
44 Let Γ be a probability distribution on (0, +∞) (the inverse density prior). [sent-79, score-0.04]
45 Then for any algorithm S → θS returning a probability density θS over H with respect to P, and such that (S, h) → θS (h) is jointly measurable in its two variables, it holds that Pr_{S∼D^m, h∼Q} ( S ∈ B(h, ∆(h, θS (h)^{−1})) ) ≤ δ, where Q is the distribution on H such that dQ/dP = θS . [sent-82, score-0.058]
46 Note above that Q is the (data-dependent) posterior distribution on H after observing the data sample S while P is the data-independent prior on H. [sent-83, score-0.131]
47 Moreover, the distribution Π on the space of classifiers may or may not be data-dependent. [sent-85, score-0.032]
48 As we will see later, in the case of sample compression learning settings we will consider priors over the space of classifiers without reference to the data (as in the PAC-Bayes case). [sent-86, score-0.415]
49 To this end, we can either opt for a prior Π independent of the data or make it the same as the volumic measure P, which establishes a distribution on the classifier space without reference to the data. [sent-87, score-0.113]
50 4 Bounds for Randomized SC Classifiers We work in the sample compression settings and as mentioned before, each classifier in this setting is denoted in terms of a compression set and a message string. [sent-88, score-0.745]
51 A reconstruction function then uses these two information sources to reconstruct the classifier. [sent-89, score-0.043]
52 The hypothesis space is defined, in our case, based on the size of the data sample (and not on the actual contents of the sample). [sent-92, score-0.07]
53 Hence, we consider the priors built on the size of the possible compression sets and associated message strings. [sent-93, score-0.39]
54 The above choice of the form for PI (i) — a prior p(|i|) over compression-set sizes spread uniformly over the (m choose |i|) compression sets of that size — is appropriate since we do not have any a priori information to distinguish one compression set from another. [sent-95, score-0.045]
55 However, as we will see later, we should choose p(d) such that we give more weight to smaller compression sets. [sent-96, score-0.294]
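A minimal sketch of such a prior; the particular choice p(d) proportional to zeta**d (normalised over d = 0..m) is our illustrative assumption, the text only requiring that p(d) favour small d:

    from math import comb

    def compression_set_prior(i, m, zeta=0.5):
        """P_I(i) = p(|i|) / C(m, |i|): a prior p over compression-set sizes,
        spread uniformly over the C(m, d) sets of each size d = |i|.
        With p(d) proportional to zeta**d, smaller compression sets get more mass."""
        d = len(i)
        p_d = (zeta ** d) / sum(zeta ** k for k in range(m + 1))
        return p_d / comb(m, d)

    # compression_set_prior((0, 3, 5), m=6)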
56 Then, we are interested in algorithms that output a posterior Q ∈ PK over the space of classifiers with probability density Q(zi , σ) factorizable as QI (i)QM(zi ) (σ). [sent-98, score-0.11]
57 A sample compressed classifier is then defined by choosing a classifier (zi , σ) according to the posterior Q(zi , σ). [sent-99, score-0.242]
58 This is basically the Gibbs classifier defined in the PAC-Bayes settings where the idea is to bound the true risk of this Gibbs classifier defined as R(GQ ) = E(zi ,σ)∼Q R((zi , σ)). [sent-100, score-0.225]
59 On the other hand, we are interested in bounding the true risk of the specific classifier (zi , σ) output according to Q. [sent-101, score-0.189]
60 As shown in [Laviolette and Marchand, 2007], a rescaled posterior Q̄ of the following form can provide tighter guarantees while maintaining Occam's principle of parsimony. [sent-102, score-0.154]
61 Definition 2 Given a distribution Q ∈ PK , we denote by Q̄ the distribution Q̄(zi , σ) = Q(zi , σ) / ( |i| · E(zi ,σ)∼Q [1/|i|] ) = QI (i)QM(zi ) (σ) / ( |i| · E(zi ,σ)∼Q [1/|i|] ) for all (zi , σ) ∈ K. Hence, note that the posterior is effectively rescaled for the compression set part. [sent-103, score-0.469]
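For a discrete posterior over (i, σ) pairs, the rescaling of Definition 2 amounts to dividing each weight by |i| and renormalising; a minimal sketch, assuming every compression set is non-empty:

    def rescale_posterior(Q):
        """Q maps (i, sigma) -> probability, with i a tuple of indices.
        Returns Q_bar with Q_bar(i, sigma) = Q(i, sigma) / (|i| * E_Q[1/|i|]),
        which sums to one and shifts mass towards smaller compression sets."""
        expected_inverse_size = sum(q / len(i) for (i, sigma), q in Q.items())
        return {(i, sigma): q / (len(i) * expected_inverse_size)
                for (i, sigma), q in Q.items()}

    # Q = {((0, 3), "a"): 0.5, ((0, 3, 5), "b"): 0.5}
    # rescale_posterior(Q)   # -> roughly {((0, 3), 'a'): 0.6, ((0, 3, 5), 'b'): 0.4}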
62 4.1 A PAC-Bayes Bound for the randomized SC classifier We exploit the fact that the distribution of the errors is binomial and define the following error quantities (for a given i, and hence zī , over z′ ∼ D^{|ī|} ): Definition 3 Let S ∼ D^m with D a distribution on X × Y, and (zi , σ) ∈ K. [sent-107, score-0.768]
63 We denote by BinS (i, σ) the probability that the classifier R(zi , σ), of (true) risk R(zi , σ), makes |ī| Rzī (zi , σ) or fewer errors on z′ ∼ D^{|ī|}. [sent-108, score-0.134]
64 That is, BinS (i, σ) = Σ_{λ=0}^{ |ī| Rzī (zi ,σ) } ( |ī| choose λ ) (R(σ, zi ))^λ (1 − R(σ, zi ))^{|ī|−λ}, and we denote by BS (i, σ) the probability that this classifier makes exactly |ī| Rzī (zi , σ) errors on z′ ∼ D^{|ī|}. [sent-109, score-1.208]
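A sketch of BinS as the binomial tail it names, where r plays the role of the true risk R(σ, zi ), n the number of examples outside the compression set, and k the number of empirical errors made on them:

    from math import comb

    def binomial_tail(k, n, r):
        """Probability that a classifier of true risk r makes k or fewer errors
        on n fresh examples drawn i.i.d. from D."""
        return sum(comb(n, lam) * r ** lam * (1 - r) ** (n - lam)
                   for lam in range(k + 1))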
65 As also shown in [Laviolette and Marchand, 2007], since (m choose j) = (m choose m−j) and kl(RS (f ) ‖ R(f )) = kl(1 − RS (f ) ‖ 1 − R(f )), the above inequality holds true for each factor inside the sum on the left hand side. [sent-111, score-0.038]
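Here kl(q ‖ p) denotes the KL divergence between Bernoulli distributions with parameters q and p; a small numerical sketch of the symmetry used above:

    from math import log

    def kl_bernoulli(q, p):
        """kl(q || p) between Bernoulli(q) and Bernoulli(p), with q, p in (0, 1)."""
        return q * log(q / p) + (1 - q) * log((1 - q) / (1 - p))

    # Symmetry used above: kl(q || p) == kl(1 - q || 1 - p), e.g.
    # abs(kl_bernoulli(0.1, 0.3) - kl_bernoulli(0.9, 0.7)) < 1e-12   # True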
66 Let ψzm (i, σ) be the probability density of the rescaled data-dependent posterior distribution Q̄ over the classifier space with respect to the volumic measure P. [sent-113, score-0.229]
67 However, note that we are interested in data-independent priors over the space of classifiers (see Footnote 2), and hence we consider our prior Π to be the same as the volumic measure P over the classifier space, yielding π equal to unity. [sent-115, score-0.102]
68 That is, our prior gives a distribution over the classifier space without any regard to the data. [sent-116, score-0.065]
69 Note that we do not encounter the 1/(m − d) factor in the bound but instead 1/(m − |i|), unlike the Q̄-based bound of Laviolette and Marchand [2007]. [sent-118, score-0.122]
70 This is because the PAC-Bayes bound of Laviolette and Marchand [2007] computes the expectation of the kl-divergence between the empirical and true risks of the classifiers chosen according to Q̄. [sent-119, score-0.243]
71 This, as a result of rescaling of Q in preference of smaller compression sets, is reflected in the bound. [sent-120, score-0.294]
72 On the other hand, the bound of Theorem 4 is a point-wise version bounding the true error of the specific classifier chosen according to Q and hence concerns the specific compression set utilized by this classifier. [sent-121, score-0.451]
73 4.2 A Binomial Tail Inversion Bound for the randomized SC classifier A tighter condition can be imposed on the true risk of the classifier by considering the binomial tail inversion over the distribution of errors. [sent-123, score-0.331]
74 Moreover, in the sample compression case, we are interested in the empirical risk of the classifier on the examples not in the compression set (consistent compression set assumption). [sent-126, score-1.037]
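The binomial tail inversion can be computed by simple bisection: it returns the largest true-risk value r for which observing k or fewer errors out of n still has probability at least δ. A sketch reusing binomial_tail from the earlier snippet:

    def binomial_tail_inversion(k, n, delta, tol=1e-10):
        """Largest risk value r in [0, 1] with binomial_tail(k, n, r) >= delta.
        binomial_tail(k, n, r) is non-increasing in r, so bisection applies."""
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binomial_tail(k, n, mid) >= delta:
                lo = mid      # still at least delta probability: push r higher
            else:
                hi = mid
        return lo

    # e.g. binomial_tail_inversion(k=3, n=50, delta=0.05) gives an upper bound on
    # the true risk of a classifier making 3 errors on the n = m - |i| examples
    # outside its compression set, at confidence 1 - delta (illustrative usage).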
75 Now, let δr be a δ-weighed measure on the classifier space. [sent-127, score-0.014]
76 Then, for the compression sets and associated message strings, the analogous guarantee holds. (Footnote 2: hence the missing S in the subscript of π(h) in the r.h.s.) [sent-130, score-0.017]
77 Alternatively, let P (zi , σ) and Q̄(zi , σ) denote the probability densities of the prior distribution P and the rescaled posterior distribution Q̄ over classifiers, such that dQ̄ = Q̄(zi , σ)dµ and dP = P (zi , σ)dµ w.r.t. some underlying measure µ, so that dQ̄/dP = Q̄(zi , σ)/P (zi , σ). [sent-134, score-0.135]
78 Note that the final expression is independent of the underlying measure µ. [sent-139, score-0.014]
79 We follow the general line of argument of Blanchard and Fleuret [2007] to recover the PAC-Bayes bound for the Sample Compressed Gibbs classifier. [sent-142, score-0.076]
80 However, note that we do this for the data-dependent setting here and also utilize the rescaled posterior over the space of sample compressed classifiers. [sent-143, score-0.286]
81 Theorem 8 recovers almost exactly the PAC-Bayes bound (holding with probability at least 1 − δ) for the Sample Compressed Classifiers of Laviolette and Marchand [2007]. [sent-148, score-0.061]
82 Note that the bound of Theorem 8 is derived in a relatively more straightforward manner with the Occam’s Hammer criterion. [sent-150, score-0.061]
83 6 Conclusion It has been shown, via the PAC-Bayes principle, that stochastic classifier selection is preferable to deterministic selection, resulting in tighter risk bounds on the averaged risk of classifiers under the learned posterior. [sent-151, score-0.335]
84 Further, this observation resulted in tight bounds in the case of stochastic sample compressed classifiers [Laviolette and Marchand, 2007], also showing that sparsity considerations are of importance even in this scenario. [sent-152, score-0.265]
85 However, of immediate relevance are the guarantees for the specific classifier output by such algorithms according to the learned posterior, and hence a point-wise version of this bound is indeed needed. [sent-154, score-0.185]
86 We have derived bounds for such randomized sample compressed classifiers by adapting Occam’s Hammer principle to the data-dependent sample compression settings. [sent-155, score-0.653]
87 This has resulted in bounds on the specific classifier output by a sample compression learning algorithm according to the learned data-dependent posterior, which is more relevant in practice. [sent-156, score-0.5]
88 Further, we also showed how the classical PAC-Bayes bound for the sample compressed Gibbs classifier can be recovered in a more direct manner, and showed that this compares favorably to the existing result of Laviolette and Marchand [2007]. [sent-157, score-0.265]
89 In Proceedings of the 20th Annual Conference on Learning Theory (COLT-2007), volume 4539 of Lecture Notes in Computer Science, pages 112–126, 2007. [sent-161, score-0.018]
90 PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. [sent-169, score-0.166]
91 In Proceedings of the 16th European Conference on Machine Learning, ECML 2005, volume 3720 of Lecture Notes in Artificial Intelligence, pages 206–217. [sent-173, score-0.018]
wordName wordTfidf (topN-words)
[('zi', 0.589), ('compression', 0.294), ('rzi', 0.274), ('qm', 0.236), ('er', 0.2), ('ln', 0.194), ('hammer', 0.165), ('marchand', 0.165), ('laviolette', 0.156), ('classi', 0.155), ('occam', 0.154), ('qi', 0.154), ('zm', 0.14), ('compressed', 0.126), ('kl', 0.109), ('risk', 0.104), ('gq', 0.099), ('fleuret', 0.096), ('blanchard', 0.096), ('rs', 0.091), ('dq', 0.087), ('message', 0.083), ('pr', 0.081), ('ers', 0.077), ('def', 0.072), ('dm', 0.068), ('randomized', 0.062), ('bound', 0.061), ('pm', 0.056), ('sample', 0.051), ('bin', 0.049), ('bounds', 0.048), ('posterior', 0.048), ('binomial', 0.047), ('gibbs', 0.046), ('pi', 0.045), ('sc', 0.044), ('rescaled', 0.042), ('expectancy', 0.041), ('mario', 0.041), ('mohak', 0.041), ('ei', 0.036), ('francois', 0.036), ('pk', 0.034), ('string', 0.034), ('strings', 0.033), ('bounding', 0.032), ('tail', 0.029), ('inversion', 0.029), ('bad', 0.029), ('reconstruction', 0.028), ('dataindependent', 0.027), ('prs', 0.027), ('tighter', 0.027), ('hence', 0.027), ('langford', 0.027), ('bs', 0.026), ('resulted', 0.026), ('rz', 0.024), ('floyd', 0.024), ('dp', 0.024), ('settings', 0.023), ('event', 0.022), ('principle', 0.021), ('cruz', 0.021), ('bins', 0.02), ('true', 0.02), ('littlestone', 0.019), ('prior', 0.019), ('space', 0.019), ('warmuth', 0.019), ('hs', 0.019), ('volume', 0.018), ('holds', 0.018), ('indices', 0.018), ('theorem', 0.018), ('santa', 0.018), ('errors', 0.017), ('according', 0.017), ('basically', 0.017), ('subscript', 0.017), ('output', 0.016), ('training', 0.016), ('guarantees', 0.016), ('argument', 0.015), ('reference', 0.015), ('establishes', 0.015), ('sources', 0.015), ('stochastic', 0.014), ('measure', 0.014), ('derive', 0.014), ('regard', 0.014), ('favorably', 0.014), ('density', 0.014), ('probability', 0.013), ('priors', 0.013), ('classical', 0.013), ('distribution', 0.013), ('yields', 0.013), ('john', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.20439334 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
3 0.19842245 238 nips-2008-Theory of matching pursuit
Author: Zakria Hussain, John S. Shawe-taylor
Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1
4 0.17276636 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
5 0.16689405 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
6 0.11527205 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
7 0.10533523 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
8 0.10431311 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
9 0.095524073 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
10 0.094231062 214 nips-2008-Sparse Online Learning via Truncated Gradient
11 0.093663126 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
12 0.092452176 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
13 0.091744415 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
14 0.090313651 65 nips-2008-Domain Adaptation with Multiple Sources
15 0.089132927 62 nips-2008-Differentiable Sparse Coding
16 0.084136993 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
17 0.08126875 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
18 0.081154399 32 nips-2008-Bayesian Kernel Shaping for Learning Control
19 0.069085933 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
20 0.068416521 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
topicId topicWeight
[(0, -0.182), (1, -0.085), (2, -0.115), (3, -0.031), (4, -0.116), (5, 0.068), (6, -0.047), (7, -0.057), (8, -0.075), (9, 0.065), (10, 0.043), (11, 0.139), (12, 0.247), (13, 0.022), (14, -0.035), (15, 0.075), (16, 0.065), (17, 0.078), (18, 0.04), (19, -0.207), (20, 0.223), (21, -0.137), (22, 0.065), (23, -0.026), (24, -0.088), (25, -0.144), (26, -0.034), (27, 0.061), (28, 0.137), (29, 0.057), (30, 0.134), (31, 0.041), (32, -0.032), (33, -0.043), (34, 0.119), (35, -0.001), (36, -0.071), (37, -0.101), (38, 0.003), (39, 0.129), (40, 0.024), (41, 0.096), (42, -0.017), (43, -0.041), (44, -0.114), (45, 0.014), (46, -0.136), (47, -0.017), (48, -0.002), (49, -0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.97785294 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.54105216 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
Author: Massih Amini, Nicolas Usunier, François Laviolette
Abstract: We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associate Gibbs classifier. The bound is tight when so is the Gibbs’s bound and when the errors of the majority vote classifier is concentrated on a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search the decision boundary on low density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples for whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the set of unlabeled training examples that have their margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually. 1
3 0.5365032 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
4 0.51619452 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
5 0.48707184 238 nips-2008-Theory of matching pursuit
Author: Zakria Hussain, John S. Shawe-taylor
Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1
6 0.47467282 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
7 0.42051449 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
8 0.40736654 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
9 0.38514745 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
10 0.37186816 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
11 0.36977592 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
12 0.35469645 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
13 0.33880186 216 nips-2008-Sparse probabilistic projections
14 0.33791459 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
15 0.32711449 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
16 0.31350496 32 nips-2008-Bayesian Kernel Shaping for Learning Control
17 0.29717469 65 nips-2008-Domain Adaptation with Multiple Sources
18 0.29166794 15 nips-2008-Adaptive Martingale Boosting
19 0.28923452 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
20 0.28444636 178 nips-2008-Performance analysis for L\ 2 kernel classification
topicId topicWeight
[(6, 0.166), (7, 0.032), (12, 0.027), (15, 0.011), (28, 0.12), (38, 0.011), (54, 0.287), (57, 0.045), (59, 0.012), (63, 0.01), (71, 0.043), (77, 0.04), (83, 0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.77511376 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
2 0.69832933 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
Author: Pavel P. Kuksa, Pai-hsi Huang, Vladimir Pavlovic
Abstract: We present a new family of linear time algorithms for string comparison with mismatches under the string kernels framework. Based on sufficient statistics, our algorithms improve theoretical complexity bounds of existing approaches while scaling well in sequence alphabet size, the number of allowed mismatches and the size of the dataset. In particular, on large alphabets and under loose mismatch constraints our algorithms are several orders of magnitude faster than the existing algorithms for string comparison under the mismatch similarity measure. We evaluate our algorithms on synthetic data and real applications in music genre classification, protein remote homology detection and protein fold prediction. The scalability of the algorithms allows us to consider complex sequence transformations, modeled using longer string features and larger numbers of mismatches, leading to a state-of-the-art performance with significantly reduced running times. 1
3 0.65382344 121 nips-2008-Learning to Use Working Memory in Partially Observable Environments through Dopaminergic Reinforcement
Author: Michael T. Todd, Yael Niv, Jonathan D. Cohen
Abstract: Working memory is a central topic of cognitive neuroscience because it is critical for solving real-world problems in which information from multiple temporally distant sources must be combined to generate appropriate behavior. However, an often neglected fact is that learning to use working memory effectively is itself a difficult problem. The Gating framework [14] is a collection of psychological models that show how dopamine can train the basal ganglia and prefrontal cortex to form useful working memory representations in certain types of problems. We unite Gating with machine learning theory concerning the general problem of memory-based optimal control [5-6]. We present a normative model that learns, by online temporal difference methods, to use working memory to maximize discounted future reward in partially observable settings. The model successfully solves a benchmark working memory problem, and exhibits limitations similar to those observed in humans. Our purpose is to introduce a concise, normative definition of high level cognitive concepts such as working memory and cognitive control in terms of maximizing discounted future rewards. 1 I n t ro d u c t i o n Working memory is loosely defined in cognitive neuroscience as information that is (1) internally maintained on a temporary or short term basis, and (2) required for tasks in which immediate observations cannot be mapped to correct actions. It is widely assumed that prefrontal cortex (PFC) plays a role in maintaining and updating working memory. However, relatively little is known about how PFC develops useful working memory representations for a new task. Furthermore, current work focuses on describing the structure and limitations of working memory, but does not ask why, or in what general class of tasks, is it necessary. Borrowing from the theory of optimal control in partially observable Markov decision problems (POMDPs), we frame the psychological concept of working memory as an internal state representation, developed and employed to maximize future reward in partially observable environments. We combine computational insights from POMDPs and neurobiologically plausible models from cognitive neuroscience to suggest a simple reinforcement learning (RL) model of working memory function that can be implemented through dopaminergic training of the basal ganglia and PFC. The Gating framework is a series of cognitive neuroscience models developed to explain how dopaminergic RL signals can shape useful working memory representations [1-4]. Computationally this framework models working memory as a collection of past observations, each of which can occasionally be replaced with the current observation, and addresses the problem of learning when to update each memory element versus maintaining it. In the original Gating model [1-2] the PFC contained a unitary working memory representation that was updated whenever a phasic dopamine (DA) burst occurred (e.g., due to unexpected reward or novelty). That model was the first to connect working memory and RL via the temporal difference (TD) model of DA firing [7-8], and thus to suggest how working memory might serve a normative purpose. However, that model had limited computational flexibility due to the unitary nature of the working memory (i.e., a singleobservation memory controlled by a scalar DA signal). 
More recent work [3-4] has partially repositioned the Gating framework within the Actor/Critic model of mesostriatal RL [9-10], positing memory updating as but another cortical action controlled by the dorsal striatal
4 0.60477436 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
5 0.5960626 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf
Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1
6 0.59283948 178 nips-2008-Performance analysis for L\ 2 kernel classification
7 0.59252852 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons
8 0.58070326 65 nips-2008-Domain Adaptation with Multiple Sources
9 0.57928532 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
10 0.57371312 202 nips-2008-Robust Regression and Lasso
11 0.57221001 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
12 0.56515026 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
13 0.55996776 75 nips-2008-Estimating vector fields using sparse basis field expansions
14 0.55891716 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
15 0.55471838 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
16 0.5539487 62 nips-2008-Differentiable Sparse Coding
17 0.55342025 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
18 0.55227005 228 nips-2008-Support Vector Machines with a Reject Option
19 0.55138612 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
20 0.5507921 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging