nips nips2008 nips2008-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Performance analysis for L2 kernel classification Clayton D. [sent-1, score-0.116]
2 Abstract We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. [sent-4, score-0.204]
3 The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. [sent-5, score-0.03]
4 Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. [sent-6, score-0.116]
5 We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. [sent-7, score-0.425]
6 Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. [sent-8, score-0.229]
7 , (xn , yn ) of a jointly distributed pair (X, Y ), where X ∈ Rd is a pattern and Y ∈ {−1, +1} is a class label. [sent-12, score-0.034]
8 A kernel classifier (without offset) has the form $g(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i y_i k(x, x_i) \right)$, where the $\alpha_i$ are parameters and $k$ is a kernel function. [sent-17, score-0.544]
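As a concrete illustration of this form, the sketch below evaluates g(x) for a single pattern. The kernel and the coefficients here are placeholders (a linear kernel and hand-picked weights for the toy call), not the Gaussian kernel and ISE-optimized weights the paper uses; those are sketched further below.

```python
import numpy as np

def kernel_classifier(x, X, y, alpha, k):
    """g(x) = sign( sum_i alpha_i * y_i * k(x, X_i) ), a kernel classifier without offset."""
    score = sum(a * yi * k(x, xi) for a, yi, xi in zip(alpha, y, X))
    return int(np.sign(score))

# Toy usage with a placeholder linear kernel (NOT the kernel or weights from the paper):
k_lin = lambda u, v: float(np.dot(u, v))
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.5, 1.0])
print(kernel_classifier(np.array([0.8, 0.2]), X, y, alpha, k_lin))
```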
9 For example, support vector machines (SVMs) without offset have this form [2], as does the standard kernel density estimate (KDE) plug-in rule. [sent-18, score-0.238]
10 Recently Kim and Scott [3] introduced an L2 or integrated squared error (ISE) criterion to design the coefficients αi of a kernel classifier with Gaussian kernel. [sent-19, score-0.187]
11 Their L2 classifier performs comparably to existing kernel methods while possessing a number of desirable properties. [sent-20, score-0.116]
12 Like the SVM, L2 kernel classifiers are the solutions of convex quadratic programs that can be solved efficiently using standard decomposition algorithms. [sent-21, score-0.146]
13 Unlike the SVM, however, there are no free parameters to be set by the user except the kernel bandwidth parameter. [sent-23, score-0.151]
14 In this paper we develop statistical performance guarantees for the L2 kernel classifier introduced in [3]. [sent-24, score-0.154]
15 The linchpin of our analysis is a new concentration inequality bounding the deviation of a cross-validation based ISE estimate from the true ISE. [sent-25, score-0.167]
16 This bound is then applied to prove an oracle inequality and consistency in both ISE and probability of error. [sent-26, score-0.229]
17 In addition, as a special case of ∗ Both authors supported in part by NSF Grant CCF-0830490 1 our analysis, we are able to deduce performance guarantees for the method of L2 kernel density estimation described in [4, 5]. [sent-27, score-0.287]
18 The ISE criterion has a long history in the literature on bandwidth selection for kernel density estimation [6] and more recently in parametric estimation [7]. [sent-28, score-0.322]
19 The use of ISE for optimizing the weights of a KDE via quadratic programming was first described in [4] and later rediscovered in [5]. [sent-29, score-0.03]
20 In [8], an $\ell_1$-penalized ISE criterion was used to aggregate a finite number of pre-determined densities. [sent-30, score-0.021]
21 Linear and convex aggregation of densities, based on an L2 criterion, are studied in [9], where the densities are based on a finite dictionary or an independent sample. [sent-31, score-0.124]
22 In [11] and [13], a difference of densities is used to find smoothing parameters or kernel bandwidths. [sent-35, score-0.216]
23 In [12], conditional densities are chosen among a parameterized set of densities to maximize the average (bounded) density differences. [sent-36, score-0.275]
24 Section 2 reviews the L2 kernel classifier, and presents a slight modification needed for our analysis. [sent-37, score-0.116]
25 For convenience, we relabel Y so that it belongs to {1, −γ} and denote I+ = {i | Yi = +1} and I− = {i | Yi = −γ}. [sent-41, score-0.019]
26 Let f− (x) and f+ (x) denote the class-conditional densities of the pattern given the label. [sent-42, score-0.116]
27 From decision theory, the optimal classifier has the form g ∗ (x) = sign {f+ (x) − γf− (x)} , (1) where γ incorporates prior class probabilities and class-conditional error costs (in the Bayesian setting) or a desired tradeoff between false positives and false negatives [14]. [sent-43, score-0.069]
28 The class-conditional densities are modelled using the Gaussian kernel as $f_+(x;\alpha) = \sum_{i \in I_+} \alpha_i k_\sigma(x, X_i)$, $f_-(x;\alpha) = \sum_{i \in I_-} \alpha_i k_\sigma(x, X_i)$ with constraints α = (α1 , . [sent-45, score-0.216]
29 The Gaussian kernel is defined as $k_\sigma(x, X_i) = \left(2\pi\sigma^2\right)^{-d/2} \exp\left( -\frac{\|x - X_i\|^2}{2\sigma^2} \right)$. [sent-49, score-0.116]
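A minimal NumPy sketch of this kernel and of the class-conditional models f_+(x; α), f_-(x; α) from the previous sentence. The index sets I_plus/I_minus and all names are ours, and α is taken as given here since its constraint set is only partially reproduced in this summary.

```python
import numpy as np

def k_sigma(x, centers, sigma):
    """Gaussian kernel (2*pi*sigma^2)^(-d/2) * exp(-||x - X_i||^2 / (2*sigma^2)), one value per row of `centers`."""
    d = centers.shape[1]
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dist / (2.0 * sigma ** 2))

def class_conditional_models(x, X, alpha, I_plus, I_minus, sigma):
    """Return (f_plus(x; alpha), f_minus(x; alpha)) as weighted kernel sums over each class."""
    f_plus = float(alpha[I_plus] @ k_sigma(x, X[I_plus], sigma))
    f_minus = float(alpha[I_minus] @ k_sigma(x, X[I_minus], sigma))
    return f_plus, f_minus
```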
30 The ISE associated with α is $\mathrm{ISE}(\alpha) = \left\| d_\gamma(x;\alpha) - d_\gamma(x) \right\|_{L_2}^2 = \int \left( d_\gamma(x;\alpha) - d_\gamma(x) \right)^2 dx = \int d_\gamma^2(x;\alpha)\,dx - 2\int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx + \int d_\gamma^2(x)\,dx$. [sent-50, score-0.438]
31 Since we do not know the true $d_\gamma(x)$, we need to estimate the second term in the above equation, $H(\alpha) \triangleq \int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx$ (2), by $H_n(\alpha)$, which will be explained in detail in Section 2. [sent-51, score-0.192]
32 Then, the empirical ISE is $\widehat{\mathrm{ISE}}(\alpha) = \int d_\gamma^2(x;\alpha)\,dx - 2H_n(\alpha) + \int d_\gamma^2(x)\,dx$. [sent-53, score-0.146]
33 The basic idea is to view H (α) as an expectation and estimate it using a sample average. [sent-57, score-0.026]
34 In [3], the resubstitution estimator for H (α) was used. [sent-58, score-0.037]
35 However, since this estimator is biased, we use a leave-one-out cross-validation (LOOCV) estimator, which is unbiased and facilitates our theoretical analysis. [sent-59, score-0.055]
36 Note that the difference of densities can be expressed as $d_\gamma(x;\alpha) = f_+(x;\alpha) - \gamma f_-(x;\alpha) = \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i)$. [sent-60, score-0.1]
37 Then, $H(\alpha) = \int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx = \int d_\gamma(x;\alpha) f_+(x)\,dx - \gamma \int d_\gamma(x;\alpha) f_-(x)\,dx = \sum_{i=1}^{n} \alpha_i Y_i \int k_\sigma(x, X_i) f_+(x)\,dx - \gamma \sum_{i=1}^{n} \alpha_i Y_i \int k_\sigma(x, X_i) f_-(x)\,dx = \sum_{i=1}^{n} \alpha_i Y_i h(X_i)$, where $h(X_i) \triangleq \int k_\sigma(x, X_i) f_+(x)\,dx - \gamma \int k_\sigma(x, X_i) f_-(x)\,dx$. [sent-61, score-0.876]
38 (5) We estimate each h (Xi ) in (5) for i = 1, . [sent-62, score-0.026]
39 , n using leave-one-out cross-validation: $\widehat{h}_i = \frac{1}{N_+ - 1} \sum_{j \in I_+,\, j \neq i} k_\sigma(X_j, X_i) - \frac{\gamma}{N_-} \sum_{j \in I_-} k_\sigma(X_j, X_i)$ for $i \in I_+$, and $\widehat{h}_i = \frac{1}{N_+} \sum_{j \in I_+} k_\sigma(X_j, X_i) - \frac{\gamma}{N_- - 1} \sum_{j \in I_-,\, j \neq i} k_\sigma(X_j, X_i)$ for $i \in I_-$, where $N_+ = |I_+|$, $N_- = |I_-|$. [sent-65, score-0.25]
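These LOOCV estimates can be computed directly from the Gram matrix; the sketch below is our own vectorization, with hypothetical names (loocv_h, I_plus, I_minus), and assumes labels encoded with the positive class as +1.

```python
import numpy as np

def loocv_h(X, y, gamma, sigma):
    """Return h_hat, an array of the leave-one-out estimates of h(X_i) for all i."""
    n, d = X.shape
    I_plus = np.where(y == 1)[0]
    I_minus = np.where(y != 1)[0]          # the relabeled class, Y_i = -gamma
    N_plus, N_minus = len(I_plus), len(I_minus)

    # Full Gram matrix K[i, j] = k_sigma(X_j, X_i).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dist / (2.0 * sigma ** 2))

    h_hat = np.empty(n)
    for i in I_plus:
        pos = K[i, I_plus].sum() - K[i, i]          # leave X_i out of its own class
        neg = K[i, I_minus].sum()
        h_hat[i] = pos / (N_plus - 1) - gamma * neg / N_minus
    for i in I_minus:
        pos = K[i, I_plus].sum()
        neg = K[i, I_minus].sum() - K[i, i]
        h_hat[i] = pos / N_plus - gamma * neg / (N_minus - 1)
    return h_hat
```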
40 Optimization The optimization problem (4) can be formulated as a quadratic program. [sent-68, score-0.03]
41 The first term in (3) is $\int d_\gamma^2(x;\alpha)\,dx = \int \left( \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i) \right)^2 dx = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j \int k_\sigma(x, X_i)\, k_\sigma(x, X_j)\,dx = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j\, k_{\sqrt{2}\sigma}(X_i, X_j)$ by the convolution theorem for Gaussian kernels [15]. [sent-69, score-0.516]
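A quick numerical sanity check (in d = 1, with our own example values) of the convolution identity invoked here: the integral of the product of two Gaussian kernels with bandwidth σ equals a single Gaussian kernel with bandwidth √2·σ evaluated at the two centers.

```python
import numpy as np

def k_sigma_1d(x, center, sigma):
    return (2.0 * np.pi * sigma ** 2) ** (-0.5) * np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2))

sigma, Xi, Xj = 0.7, -0.3, 1.1
x = np.linspace(-20, 20, 400001)                 # fine grid, wide enough to capture the tails
lhs = np.sum(k_sigma_1d(x, Xi, sigma) * k_sigma_1d(x, Xj, sigma)) * (x[1] - x[0])
rhs = k_sigma_1d(Xi, Xj, np.sqrt(2.0) * sigma)
print(lhs, rhs)                                  # the two values agree to numerical precision
```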
42 1, the second term $H_n(\alpha)$ in (3) is linear in α and can be expressed as $\sum_{i=1}^{n} \alpha_i c_i$ where $c_i = Y_i \widehat{h}_i$. [sent-71, score-0.322]
43 Finally, since the third term does not depend on α, the optimization problem (4) becomes the following quadratic program (QP): $\widehat{\alpha} = \arg\min_{\alpha \in A}\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j\, k_{\sqrt{2}\sigma}(X_i, X_j) - \sum_{i=1}^{n} c_i \alpha_i$. [sent-72, score-0.076]
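One way to solve this QP with plain NumPy is projected gradient descent, sketched below. The constraint set A is only partially reproduced in this summary, so we assume, for illustration, that the weights within each class are nonnegative and sum to one (a product of two simplices); in practice any off-the-shelf QP or decomposition solver could be used instead. Here `h_hat` is the output of the LOOCV sketch above, and all function names are ours.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0.0)

def fit_l2_kernel_classifier(X, y, gamma, sigma, h_hat, n_iter=2000):
    """Minimize 0.5*a'Qa - c'a with Q[i,j] = Y_i Y_j k_{sqrt(2) sigma}(X_i, X_j) and c_i = Y_i*h_hat[i],
    over the assumed constraint set (a simplex per class)."""
    n, d = X.shape
    Y = np.where(y == 1, 1.0, -gamma)                      # relabeled classes {1, -gamma}
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    s2 = np.sqrt(2.0) * sigma
    K2 = (2.0 * np.pi * s2 ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * s2 ** 2))
    Q = np.outer(Y, Y) * K2
    c = Y * h_hat
    I_plus, I_minus = np.where(y == 1)[0], np.where(y != 1)[0]
    alpha = np.zeros(n)
    alpha[I_plus] = 1.0 / len(I_plus)
    alpha[I_minus] = 1.0 / len(I_minus)
    step = 1.0 / np.linalg.norm(Q, 2)                      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        alpha = alpha - step * (Q @ alpha - c)
        alpha[I_plus] = project_simplex(alpha[I_plus])     # assumed per-class simplex constraints
        alpha[I_minus] = project_simplex(alpha[I_minus])
    return alpha
```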
44 Conditioned on Xi , hi is an unbiased estimator of h (Xi ), i. [sent-80, score-0.305]
45 Furthermore, for any $\epsilon > 0$, $P\left( \sup_{\alpha \in A} |H_n(\alpha) - H(\alpha)| > \epsilon \right) \le 2n\left( e^{-c(n_+ - 1)\epsilon^2} + e^{-c(n_- - 1)\epsilon^2} \right)$, where $c = 2\left(\sqrt{2\pi}\,\sigma\right)^{2d} / (1+\gamma)^4$. [sent-82, score-0.056]
46 Lemma 1 implies that Hn (α) → H (α) almost surely for all α ∈ A simultaneously, provided that σ, n+ , and n− evolve as functions of n such that n+ σ 2d / ln n → ∞ and n− σ 2d / ln n → ∞. [sent-83, score-0.206]
47 2 Oracle Inequality Next, we establish an oracle inequality, which relates the performance of our estimator to that of the best possible kernel classifier. [sent-85, score-0.236]
48 Then, with probability at least $1 - \delta$, for all $\alpha \in A$, we have $\mathrm{ISE}(\widehat{\alpha}) \le \widehat{\mathrm{ISE}}(\widehat{\alpha}) + 2\epsilon \le \widehat{\mathrm{ISE}}(\alpha) + 2\epsilon \le \mathrm{ISE}(\alpha) + 4\epsilon$, where the second inequality holds from the definition of $\widehat{\alpha}$. [sent-91, score-0.097]
49 3 ISE consistency Next, we have a theorem stating that ISE (α) converges to zero in probability. [sent-94, score-0.137]
50 Suppose that for f = f+ and f− , the Hessian Hf (x) exists and each entry of Hf (x) is piecewise continuous and square integrable. [sent-96, score-0.04]
51 The consistency with respect to the probability of error could be easily shown if we set γ to $\gamma^* = (1-p)/p$ and apply Theorem 3 in [17]. [sent-104, score-0.065]
52 However, since p is unknown, we must estimate $\gamma^*$. [sent-105, score-0.026]
53 Note that $N_+$ and $N_-$ are binomial random variables, and we may estimate $\gamma^*$ as $\widehat{\gamma} = N_-/N_+$. [sent-106, score-0.026]
54 The next theorem says the L2 kernel classifier is consistent with respect to the probability of error. [sent-107, score-0.152]
55 If σ evolves as a function of n such that $\sigma \to 0$ and $n\sigma^{2d}/\ln n \to \infty$ as $n \to \infty$, then the L2 kernel classifier is consistent. [sent-114, score-0.206]
56 , $(X_n, Y_n))$, the classification error $L_n = P\big( \mathrm{sgn}\, d_{\widehat{\gamma}}(X; \widehat{\alpha}) \neq Y \mid D_n \big)$ converges to the Bayes error $L^*$ in probability as $n \to \infty$. [sent-118, score-0.066]
57 5 Application to density estimation By setting γ = 0, our goal becomes estimating f+ and we recover the L2 kernel density estimate of [4, 5] using leave-one-out cross-validation. [sent-122, score-0.321]
58 , $X_n$ from $f(x)$, the L2 kernel density estimate of $f(x)$ is defined as $f(x;\alpha) = \sum_{i=1}^{n} \alpha_i k_\sigma(x, X_i)$ with the $\alpha_i$'s optimized such that $\widehat{\alpha} = \arg\min_{\sum_i \alpha_i = 1,\; \alpha_i \ge 0}\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k_{\sqrt{2}\sigma}(X_i, X_j) - \sum_{i=1}^{n} \alpha_i \frac{1}{n-1} \sum_{j \neq i} k_\sigma(X_i, X_j)$. [sent-128, score-0.217]
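The same machinery specializes to this γ = 0 case; below is our sketch of the resulting L2 kernel density estimate, with the simplex projection repeated so the snippet runs on its own (all names are ours).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w >= 0, sum(w) = 1} (repeated from the classifier sketch)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0.0)

def l2_kde_weights(X, sigma, n_iter=2000):
    """Weights minimizing 0.5*a'K2 a - c'a over the simplex, with c_i = (1/(n-1)) * sum_{j != i} k_sigma(X_i, X_j)."""
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * sigma ** 2))
    s2 = np.sqrt(2.0) * sigma
    K2 = (2.0 * np.pi * s2 ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * s2 ** 2))
    c = (K.sum(axis=1) - np.diag(K)) / (n - 1)      # leave-one-out kernel averages
    alpha = np.full(n, 1.0 / n)
    step = 1.0 / np.linalg.norm(K2, 2)
    for _ in range(n_iter):
        alpha = project_simplex(alpha - step * (K2 @ alpha - c))
    return alpha

def l2_kde(x, X, alpha, sigma):
    """Evaluate f_hat(x; alpha) = sum_i alpha_i * k_sigma(x, X_i)."""
    d = X.shape[1]
    sq = np.sum((X - x) ** 2, axis=1)
    return float(alpha @ ((2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * sigma ** 2))))
```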
59 Our concentration inequality, oracle inequality, and L2 consistency result immediately extend to provide the same performance guarantees for this method. [sent-129, score-0.214]
60 Suppose that the Hessian Hf (x) of a density function f (x) exists and each entry of Hf (x) is piecewise continuous and square integrable. [sent-132, score-0.115]
61 If $\sigma \to 0$ and $n\sigma^{2d}/\ln n \to \infty$ as $n \to \infty$, then $\int \left( f(x;\widehat{\alpha}) - f(x) \right)^2 dx \to 0$ in probability. [sent-133, score-0.236]
62 4 Conclusion Through the development of a novel concentration inequality, we have established statistical performance guarantees on a recently introduced L2 kernel classifier. [sent-134, score-0.198]
63 We view the relatively clean analysis of this classifier as an attractive feature relative to other kernel methods. [sent-135, score-0.116]
64 In future work, we hope to invoke the full power of the oracle inequality to obtain adaptive rates of convergence, and consistency for σ not necessarily tending to zero. [sent-136, score-0.229]
65 Since $X_i \sim f_+(x)$ for $i \in I_+$ and $X_i \sim f_-(x)$ for $i \in I_-$, it can be easily shown that $E\big[\,\widehat{h}_i \mid X_i\,\big] = h(X_i)$. [sent-140, score-0.25]
66 The second term in (7) is $P\left( \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - n_- E[k_\sigma(W, X_i) \mid X_i] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \right) = P\left( \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - E\Big[ \sum_{j \in I_-} k_\sigma(X_j, X_i) \,\Big|\, X_i \Big] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \right) \le 2e^{-2 n_- \epsilon^2 / ((1+\gamma)^2 M^2)} \le 2e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^2 M^2)}$. [sent-142, score-0.02]
67 Therefore, $P\left( \big| \widehat{h}_i - h(X_i) \big| \ge \epsilon \right) = E\left[ P\left( \big| \widehat{h}_i - h(X_i) \big| \ge \epsilon \;\Big|\; X_i \right) \right] \le 2e^{-2(n_+ - 1)\epsilon^2/((1+\gamma)^2 M^2)} + 2e^{-2(n_- - 1)\epsilon^2/((1+\gamma)^2 M^2)}$. [sent-143, score-0.5]
68 Proof of Theorem 3 From Theorem 3 in [17], it suffices to show that $\int \left( d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\gamma^*}(x) \right)^2 dx \to 0$ in probability. [sent-147, score-0.146]
69 Since, by the triangle inequality, $\| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\gamma^*}(x) \|_{L_2} = \| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\widehat{\gamma}}(x) + (\gamma^* - \widehat{\gamma}) f_-(x) \|_{L_2} \le \| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\widehat{\gamma}}(x) \|_{L_2} + |\widehat{\gamma} - \gamma^*| \cdot \| f_-(x) \|_{L_2} = \sqrt{\mathrm{ISE}(\widehat{\alpha})} + |\widehat{\gamma} - \gamma^*| \cdot \| f_-(x) \|_{L_2}$, we need to show that $\mathrm{ISE}(\widehat{\alpha})$ and $\widehat{\gamma}$ converge in probability to $0$ and $\gamma^*$, respectively. [sent-148, score-0.131]
70 The convergence of γ to γ ∗ can be easily shown from the strong law of large numbers. [sent-149, score-0.02]
71 Define an event $D = \{\, N_+ \ge np,\; N_- \ge n(1-p),\; \widehat{\gamma} \le 2\gamma^* \,\}$. [sent-151, score-0.028]
72 The first term converges to 0 from the strong law of large numbers. [sent-153, score-0.074]
73 Define a set $S = \{\, (n_+, n_-) \mid n_+ \ge np,\; n_- \ge n(1-p),\; n_-/n_+ \le 2\gamma^* \,\}$. [sent-154, score-0.028]
74 (n+ ,n− )∈S ≤ max (n+ ,n− )∈S 7 (8) Provided that σ → 0 and nσ 2d / ln n → ∞, any pair (n+ , n− ) ∈ S satisfies σ → 0, n+ σ 2d / ln n → ∞, and n− σ 2d / ln n → ∞ as n → ∞ and thus the term in (8) converges to 0 from Theorem 2. [sent-156, score-0.324]
75 Scott, “Kernel classification via integrated squared error,” IEEE Workshop on Statistical Signal Processing, August 2007. [sent-170, score-0.034]
76 Kim, Least Squares Mixture Decomposition Estimation, unpublished doctoral dissertation, Dept. [sent-172, score-0.035]
77 [5] Mark Girolami and Chao He, “Probability density estimation from optimally condensed data samples,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-176, score-0.104]
78 Turlach, “Bandwidth selection in kernel density estimation: A review,” Technical Report 9317, C. [sent-182, score-0.191]
79 Scott, “Parametric statistical modeling by minimum integrated square error,” Technometrics 43, pp. [sent-188, score-0.052]
80 Wegkamp, “Sparse density estimation with l1 penalties,” Proceedings of 20th Annual Conference on Learning Theory, COLT 2007, Lecture Notes in Artificial Intelligence, v4539, pp. [sent-195, score-0.104]
81 Tsybakov, “Linear and convex aggregation of density estimators,” https:// hal. [sent-200, score-0.099]
82 Principe, and Torbjørn Eltoft, “Towards a unification of information theoretic learning and kernel method,” in Proc. [sent-205, score-0.116]
83 Ritter, “Discriminative densities from maximum contrast estimation,” in Advances in Neural Information Proceeding Systems 15, Vancouver, Canada, 2002, pp. [sent-215, score-0.1]
84 Taylor, “Kernel density classification and boosting: an l2 analysis,” Statistics and Computing, vol. [sent-220, score-0.075]
85 Wagner, “Asymptotically optimal discriminant functions for pattern classification,” IEEE Trans. [sent-235, score-0.016]
wordName wordTfidf (topN-words)
[('ise', 0.824), ('hi', 0.25), ('xi', 0.235), ('dx', 0.146), ('kernel', 0.116), ('hn', 0.101), ('densities', 0.1), ('inequality', 0.097), ('xj', 0.094), ('classi', 0.093), ('ln', 0.09), ('er', 0.085), ('oracle', 0.083), ('density', 0.075), ('hf', 0.071), ('scott', 0.063), ('yi', 0.06), ('kim', 0.06), ('sup', 0.056), ('consistency', 0.049), ('qp', 0.045), ('concentration', 0.044), ('arbor', 0.039), ('kde', 0.039), ('guarantees', 0.038), ('estimator', 0.037), ('ann', 0.036), ('theorem', 0.036), ('bandwidth', 0.035), ('converges', 0.034), ('integrated', 0.034), ('svms', 0.032), ('quadratic', 0.03), ('eecs', 0.03), ('estimation', 0.029), ('deduce', 0.029), ('tsybakov', 0.029), ('michigan', 0.029), ('dn', 0.028), ('np', 0.028), ('ci', 0.026), ('estimate', 0.026), ('ers', 0.026), ('svm', 0.026), ('evolve', 0.026), ('hessian', 0.025), ('yj', 0.025), ('aggregation', 0.024), ('cation', 0.024), ('lemma', 0.023), ('kernels', 0.022), ('piecewise', 0.022), ('criterion', 0.021), ('offset', 0.021), ('proves', 0.021), ('law', 0.02), ('hall', 0.02), ('term', 0.02), ('mi', 0.02), ('relabel', 0.019), ('polytechnic', 0.019), ('wand', 0.019), ('loocv', 0.019), ('wagner', 0.019), ('sao', 0.019), ('chao', 0.019), ('cortes', 0.019), ('wegkamp', 0.019), ('suppose', 0.019), ('appendix', 0.019), ('xn', 0.018), ('square', 0.018), ('false', 0.018), ('unbiased', 0.018), ('bunea', 0.018), ('parzen', 0.018), ('terry', 0.018), ('luis', 0.018), ('jose', 0.018), ('clayton', 0.018), ('https', 0.018), ('dissertation', 0.018), ('ritter', 0.018), ('unpublished', 0.018), ('stating', 0.018), ('pf', 0.018), ('turlach', 0.018), ('rigollet', 0.018), ('holdout', 0.018), ('oct', 0.018), ('yn', 0.018), ('sign', 0.017), ('girolami', 0.017), ('technometrics', 0.017), ('doctoral', 0.017), ('devroye', 0.017), ('vapnik', 0.017), ('parametric', 0.017), ('error', 0.016), ('pattern', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
2 0.11084212 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
3 0.097247161 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
4 0.092689157 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
5 0.092584074 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
6 0.09048906 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
7 0.088817671 65 nips-2008-Domain Adaptation with Multiple Sources
8 0.07786978 226 nips-2008-Supervised Dictionary Learning
9 0.076847568 196 nips-2008-Relative Margin Machines
10 0.074648641 228 nips-2008-Support Vector Machines with a Reject Option
11 0.072183661 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
12 0.071256675 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
13 0.070848502 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
14 0.069583155 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
15 0.068299323 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
16 0.065244399 72 nips-2008-Empirical performance maximization for linear rank statistics
17 0.061435927 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
18 0.060175806 113 nips-2008-Kernelized Sorting
19 0.059826806 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
20 0.058549467 138 nips-2008-Modeling human function learning with Gaussian processes
topicId topicWeight
[(0, -0.163), (1, -0.066), (2, -0.108), (3, 0.054), (4, -0.022), (5, 0.01), (6, 0.026), (7, -0.026), (8, 0.038), (9, 0.051), (10, 0.155), (11, 0.026), (12, 0.067), (13, 0.007), (14, -0.035), (15, 0.019), (16, 0.013), (17, -0.037), (18, -0.054), (19, 0.061), (20, 0.032), (21, -0.009), (22, 0.026), (23, 0.108), (24, 0.025), (25, -0.075), (26, 0.085), (27, -0.152), (28, 0.167), (29, 0.04), (30, 0.009), (31, 0.015), (32, 0.139), (33, 0.043), (34, -0.0), (35, 0.003), (36, -0.08), (37, 0.087), (38, -0.043), (39, -0.003), (40, -0.106), (41, 0.049), (42, 0.147), (43, 0.016), (44, 0.014), (45, -0.09), (46, 0.002), (47, 0.06), (48, -0.022), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.93001026 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
2 0.66063207 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
3 0.65781498 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
4 0.64171898 196 nips-2008-Relative Margin Machines
Author: Tony Jebara, Pannagadatta K. Shivaswamy
Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1
5 0.6029225 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
6 0.57045114 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
7 0.55033576 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
8 0.54399759 228 nips-2008-Support Vector Machines with a Reject Option
9 0.48850575 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
10 0.47594038 226 nips-2008-Supervised Dictionary Learning
11 0.44284621 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
12 0.43651715 72 nips-2008-Empirical performance maximization for linear rank statistics
13 0.42812079 185 nips-2008-Privacy-preserving logistic regression
14 0.42588696 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
15 0.42188221 113 nips-2008-Kernelized Sorting
16 0.40825787 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
17 0.39216882 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
18 0.3876473 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
19 0.38559416 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
20 0.38211301 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
topicId topicWeight
[(6, 0.455), (7, 0.061), (12, 0.016), (28, 0.17), (57, 0.06), (63, 0.015), (71, 0.012), (77, 0.029), (83, 0.038)]
simIndex simValue paperId paperTitle
1 0.99659044 65 nips-2008-Domain Adaptation with Multiple Sources
Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.
2 0.98313653 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons
Author: Adam Ponzi, Jeff Wickens
Abstract: Cell assemblies exhibiting episodes of recurrent coherent activity have been observed in several brain regions including the striatum[1] and hippocampus CA3[2]. Here we address the question of how coherent dynamically switching assemblies appear in large networks of biologically realistic spiking neurons interacting deterministically. We show by numerical simulations of large asymmetric inhibitory networks with fixed external excitatory drive that if the network has intermediate to sparse connectivity, the individual cells are in the vicinity of a bifurcation between a quiescent and firing state and the network inhibition varies slowly on the spiking timescale, then cells form assemblies whose members show strong positive correlation, while members of different assemblies show strong negative correlation. We show that cells and assemblies switch between firing and quiescent states with time durations consistent with a power-law. Our results are in good qualitative agreement with the experimental studies. The deterministic dynamical behaviour is related to winner-less competition[3], shown in small closed loop inhibitory networks with heteroclinic cycles connecting saddle-points. 1
3 0.97527534 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf
Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1
4 0.96357411 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
same-paper 5 0.92361325 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
6 0.83880937 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
7 0.83568966 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
8 0.79028702 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
9 0.7458204 202 nips-2008-Robust Regression and Lasso
10 0.73334521 75 nips-2008-Estimating vector fields using sparse basis field expansions
11 0.72449452 62 nips-2008-Differentiable Sparse Coding
12 0.72099692 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
13 0.71956658 85 nips-2008-Fast Rates for Regularized Objectives
14 0.71748692 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
15 0.71448761 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
16 0.71249074 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
17 0.7120775 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
18 0.71157074 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
19 0.7032907 228 nips-2008-Support Vector Machines with a Reject Option
20 0.69762373 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms