nips nips2008 nips2008-178 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Performance analysis for L2 kernel classification Clayton D. [sent-1, score-0.116]
2 Abstract We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. [sent-4, score-0.204]
3 The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. [sent-5, score-0.03]
4 Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. [sent-6, score-0.116]
5 We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. [sent-7, score-0.425]
6 Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. [sent-8, score-0.229]
7 , (xn , yn ) of a jointly distributed pair (X, Y ), where X ∈ Rd is a pattern and Y ∈ {−1, +1} is a class label. [sent-12, score-0.034]
8 A kernel classifier (without offset) has the form $g(x) = \mathrm{sign}\left( \sum_{i=1}^{n} \alpha_i y_i k(x, x_i) \right)$, where the $\alpha_i$ are parameters and $k$ is a kernel function. [sent-17, score-0.544]
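As a concrete illustration of this form, the sketch below evaluates g(x) for a single pattern. The kernel and the coefficients here are placeholders (a linear kernel and hand-picked weights for the toy call), not the Gaussian kernel and ISE-optimized weights the paper uses; those are sketched further below.

```python
import numpy as np

def kernel_classifier(x, X, y, alpha, k):
    """g(x) = sign( sum_i alpha_i * y_i * k(x, X_i) ), a kernel classifier without offset."""
    score = sum(a * yi * k(x, xi) for a, yi, xi in zip(alpha, y, X))
    return int(np.sign(score))

# Toy usage with a placeholder linear kernel (NOT the kernel or weights from the paper):
k_lin = lambda u, v: float(np.dot(u, v))
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])
alpha = np.array([0.5, 0.5, 1.0])
print(kernel_classifier(np.array([0.8, 0.2]), X, y, alpha, k_lin))
```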
9 For example, support vector machines (SVMs) without offset have this form [2], as does the standard kernel density estimate (KDE) plug-in rule. [sent-18, score-0.238]
10 Recently Kim and Scott [3] introduced an L2 or integrated squared error (ISE) criterion to design the coefficients αi of a kernel classifier with Gaussian kernel. [sent-19, score-0.187]
11 Their L2 classifier performs comparably to existing kernel methods while possessing a number of desirable properties. [sent-20, score-0.116]
12 Like the SVM, L2 kernel classifiers are the solutions of convex quadratic programs that can be solved efficiently using standard decomposition algorithms. [sent-21, score-0.146]
13 Unlike the SVM, however, there are no free parameters to be set by the user except the kernel bandwidth parameter. [sent-23, score-0.151]
14 In this paper we develop statistical performance guarantees for the L2 kernel classifier introduced in [3]. [sent-24, score-0.154]
15 The linchpin of our analysis is a new concentration inequality bounding the deviation of a cross-validation based ISE estimate from the true ISE. [sent-25, score-0.167]
16 This bound is then applied to prove an oracle inequality and consistency in both ISE and probability of error. [sent-26, score-0.229]
17 In addition, as a special case of ∗ Both authors supported in part by NSF Grant CCF-0830490 1 our analysis, we are able to deduce performance guarantees for the method of L2 kernel density estimation described in [4, 5]. [sent-27, score-0.287]
18 The ISE criterion has a long history in the literature on bandwidth selection for kernel density estimation [6] and more recently in parametric estimation [7]. [sent-28, score-0.322]
19 The use of ISE for optimizing the weights of a KDE via quadratic programming was first described in [4] and later rediscovered in [5]. [sent-29, score-0.03]
20 In [8], an $\ell_1$-penalized ISE criterion was used to aggregate a finite number of pre-determined densities. [sent-30, score-0.021]
21 Linear and convex aggregation of densities, based on an L2 criterion, are studied in [9], where the densities are based on a finite dictionary or an independent sample. [sent-31, score-0.124]
22 In [11] and [13], a difference of densities is used to find smoothing parameters or kernel bandwidths. [sent-35, score-0.216]
23 In [12], conditional densities are chosen among a parameterized set of densities to maximize the average (bounded) density differences. [sent-36, score-0.275]
24 Section 2 reviews the L2 kernel classifier, and presents a slight modification needed for our analysis. [sent-37, score-0.116]
25 For convenience, we relabel Y so that it belongs to {1, −γ} and denote I+ = {i | Yi = +1} and I− = {i | Yi = −γ}. [sent-41, score-0.019]
26 Let f− (x) and f+ (x) denote the class-conditional densities of the pattern given the label. [sent-42, score-0.116]
27 From decision theory, the optimal classifier has the form g ∗ (x) = sign {f+ (x) − γf− (x)} , (1) where γ incorporates prior class probabilities and class-conditional error costs (in the Bayesian setting) or a desired tradeoff between false positives and false negatives [14]. [sent-43, score-0.069]
28 The class-conditional densities are modelled using the Gaussian kernel as $f_+(x;\alpha) = \sum_{i \in I_+} \alpha_i k_\sigma(x, X_i)$, $f_-(x;\alpha) = \sum_{i \in I_-} \alpha_i k_\sigma(x, X_i)$ with constraints α = (α1 , . [sent-45, score-0.216]
29 The Gaussian kernel is defined as $k_\sigma(x, X_i) = \left(2\pi\sigma^2\right)^{-d/2} \exp\left( -\frac{\|x - X_i\|^2}{2\sigma^2} \right)$. [sent-49, score-0.116]
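A minimal NumPy sketch of this kernel and of the class-conditional models f_+(x; α), f_-(x; α) from the previous sentence. The index sets I_plus/I_minus and all names are ours, and α is taken as given here since its constraint set is only partially reproduced in this summary.

```python
import numpy as np

def k_sigma(x, centers, sigma):
    """Gaussian kernel (2*pi*sigma^2)^(-d/2) * exp(-||x - X_i||^2 / (2*sigma^2)), one value per row of `centers`."""
    d = centers.shape[1]
    sq_dist = np.sum((centers - x) ** 2, axis=1)
    return (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dist / (2.0 * sigma ** 2))

def class_conditional_models(x, X, alpha, I_plus, I_minus, sigma):
    """Return (f_plus(x; alpha), f_minus(x; alpha)) as weighted kernel sums over each class."""
    f_plus = float(alpha[I_plus] @ k_sigma(x, X[I_plus], sigma))
    f_minus = float(alpha[I_minus] @ k_sigma(x, X[I_minus], sigma))
    return f_plus, f_minus
```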
30 The ISE associated with α is $\mathrm{ISE}(\alpha) = \left\| d_\gamma(x;\alpha) - d_\gamma(x) \right\|_{L_2}^2 = \int \left( d_\gamma(x;\alpha) - d_\gamma(x) \right)^2 dx = \int d_\gamma^2(x;\alpha)\,dx - 2\int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx + \int d_\gamma^2(x)\,dx$. [sent-50, score-0.438]
31 Since we do not know the true $d_\gamma(x)$, we need to estimate the second term in the above equation, $H(\alpha) \triangleq \int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx$ (2), by $H_n(\alpha)$, which will be explained in detail in Section 2. [sent-51, score-0.192]
32 Then, the empirical ISE is $\widehat{\mathrm{ISE}}(\alpha) = \int d_\gamma^2(x;\alpha)\,dx - 2H_n(\alpha) + \int d_\gamma^2(x)\,dx$. [sent-53, score-0.146]
33 The basic idea is to view H (α) as an expectation and estimate it using a sample average. [sent-57, score-0.026]
34 In [3], the resubstitution estimator for H (α) was used. [sent-58, score-0.037]
35 However, since this estimator is biased, we use a leave-one-out cross-validation (LOOCV) estimator, which is unbiased and facilitates our theoretical analysis. [sent-59, score-0.055]
36 Note that the difference of densities can be expressed as $d_\gamma(x;\alpha) = f_+(x;\alpha) - \gamma f_-(x;\alpha) = \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i)$. [sent-60, score-0.1]
37 Then, $H(\alpha) = \int d_\gamma(x;\alpha)\, d_\gamma(x)\,dx = \int d_\gamma(x;\alpha) f_+(x)\,dx - \gamma \int d_\gamma(x;\alpha) f_-(x)\,dx = \sum_{i=1}^{n} \alpha_i Y_i \int k_\sigma(x, X_i) f_+(x)\,dx - \gamma \sum_{i=1}^{n} \alpha_i Y_i \int k_\sigma(x, X_i) f_-(x)\,dx = \sum_{i=1}^{n} \alpha_i Y_i h(X_i)$, where $h(X_i) \triangleq \int k_\sigma(x, X_i) f_+(x)\,dx - \gamma \int k_\sigma(x, X_i) f_-(x)\,dx$. [sent-61, score-0.876]
38 (5) We estimate each h (Xi ) in (5) for i = 1, . [sent-62, score-0.026]
39 , n using leave-one-out cross-validation: $\widehat{h}_i = \frac{1}{N_+ - 1} \sum_{j \in I_+,\, j \neq i} k_\sigma(X_j, X_i) - \frac{\gamma}{N_-} \sum_{j \in I_-} k_\sigma(X_j, X_i)$ for $i \in I_+$, and $\widehat{h}_i = \frac{1}{N_+} \sum_{j \in I_+} k_\sigma(X_j, X_i) - \frac{\gamma}{N_- - 1} \sum_{j \in I_-,\, j \neq i} k_\sigma(X_j, X_i)$ for $i \in I_-$, where $N_+ = |I_+|$, $N_- = |I_-|$. [sent-65, score-0.25]
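These LOOCV estimates can be computed directly from the Gram matrix; the sketch below is our own vectorization, with hypothetical names (loocv_h, I_plus, I_minus), and assumes labels encoded with the positive class as +1.

```python
import numpy as np

def loocv_h(X, y, gamma, sigma):
    """Return h_hat, an array of the leave-one-out estimates of h(X_i) for all i."""
    n, d = X.shape
    I_plus = np.where(y == 1)[0]
    I_minus = np.where(y != 1)[0]          # the relabeled class, Y_i = -gamma
    N_plus, N_minus = len(I_plus), len(I_minus)

    # Full Gram matrix K[i, j] = k_sigma(X_j, X_i).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq_dist / (2.0 * sigma ** 2))

    h_hat = np.empty(n)
    for i in I_plus:
        pos = K[i, I_plus].sum() - K[i, i]          # leave X_i out of its own class
        neg = K[i, I_minus].sum()
        h_hat[i] = pos / (N_plus - 1) - gamma * neg / N_minus
    for i in I_minus:
        pos = K[i, I_plus].sum()
        neg = K[i, I_minus].sum() - K[i, i]
        h_hat[i] = pos / N_plus - gamma * neg / (N_minus - 1)
    return h_hat
```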
40 Optimization The optimization problem (4) can be formulated as a quadratic program. [sent-68, score-0.03]
41 The first term in (3) is $\int d_\gamma^2(x;\alpha)\,dx = \int \left( \sum_{i=1}^{n} \alpha_i Y_i k_\sigma(x, X_i) \right)^2 dx = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j \int k_\sigma(x, X_i)\, k_\sigma(x, X_j)\,dx = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j\, k_{\sqrt{2}\sigma}(X_i, X_j)$ by the convolution theorem for Gaussian kernels [15]. [sent-69, score-0.516]
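A quick numerical sanity check (in d = 1, with our own example values) of the convolution identity invoked here: the integral of the product of two Gaussian kernels with bandwidth σ equals a single Gaussian kernel with bandwidth √2·σ evaluated at the two centers.

```python
import numpy as np

def k_sigma_1d(x, center, sigma):
    return (2.0 * np.pi * sigma ** 2) ** (-0.5) * np.exp(-(x - center) ** 2 / (2.0 * sigma ** 2))

sigma, Xi, Xj = 0.7, -0.3, 1.1
x = np.linspace(-20, 20, 400001)                 # fine grid, wide enough to capture the tails
lhs = np.sum(k_sigma_1d(x, Xi, sigma) * k_sigma_1d(x, Xj, sigma)) * (x[1] - x[0])
rhs = k_sigma_1d(Xi, Xj, np.sqrt(2.0) * sigma)
print(lhs, rhs)                                  # the two values agree to numerical precision
```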
42 1, the second term $H_n(\alpha)$ in (3) is linear in α and can be expressed as $\sum_{i=1}^{n} \alpha_i c_i$ where $c_i = Y_i \widehat{h}_i$. [sent-71, score-0.322]
43 Finally, since the third term does not depend on α, the optimization problem (4) becomes the following quadratic program (QP): $\widehat{\alpha} = \arg\min_{\alpha \in A}\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j Y_i Y_j\, k_{\sqrt{2}\sigma}(X_i, X_j) - \sum_{i=1}^{n} c_i \alpha_i$. [sent-72, score-0.076]
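One way to solve this QP with plain NumPy is projected gradient descent, sketched below. The constraint set A is only partially reproduced in this summary, so we assume, for illustration, that the weights within each class are nonnegative and sum to one (a product of two simplices); in practice any off-the-shelf QP or decomposition solver could be used instead. Here `h_hat` is the output of the LOOCV sketch above, and all function names are ours.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0.0)

def fit_l2_kernel_classifier(X, y, gamma, sigma, h_hat, n_iter=2000):
    """Minimize 0.5*a'Qa - c'a with Q[i,j] = Y_i Y_j k_{sqrt(2) sigma}(X_i, X_j) and c_i = Y_i*h_hat[i],
    over the assumed constraint set (a simplex per class)."""
    n, d = X.shape
    Y = np.where(y == 1, 1.0, -gamma)                      # relabeled classes {1, -gamma}
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    s2 = np.sqrt(2.0) * sigma
    K2 = (2.0 * np.pi * s2 ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * s2 ** 2))
    Q = np.outer(Y, Y) * K2
    c = Y * h_hat
    I_plus, I_minus = np.where(y == 1)[0], np.where(y != 1)[0]
    alpha = np.zeros(n)
    alpha[I_plus] = 1.0 / len(I_plus)
    alpha[I_minus] = 1.0 / len(I_minus)
    step = 1.0 / np.linalg.norm(Q, 2)                      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        alpha = alpha - step * (Q @ alpha - c)
        alpha[I_plus] = project_simplex(alpha[I_plus])     # assumed per-class simplex constraints
        alpha[I_minus] = project_simplex(alpha[I_minus])
    return alpha
```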
44 Conditioned on Xi , hi is an unbiased estimator of h (Xi ), i. [sent-80, score-0.305]
45 Furthermore, for any $\epsilon > 0$, $P\left( \sup_{\alpha \in A} |H_n(\alpha) - H(\alpha)| > \epsilon \right) \le 2n\left( e^{-c(n_+ - 1)\epsilon^2} + e^{-c(n_- - 1)\epsilon^2} \right)$, where $c = 2\left(\sqrt{2\pi}\,\sigma\right)^{2d} / (1+\gamma)^4$. [sent-82, score-0.056]
46 Lemma 1 implies that Hn (α) → H (α) almost surely for all α ∈ A simultaneously, provided that σ, n+ , and n− evolve as functions of n such that n+ σ 2d / ln n → ∞ and n− σ 2d / ln n → ∞. [sent-83, score-0.206]
47 2 Oracle Inequality Next, we establish an oracle inequality, which relates the performance of our estimator to that of the best possible kernel classifier. [sent-85, score-0.236]
48 Then, with probability at least $1 - \delta$, for all $\alpha \in A$, we have $\mathrm{ISE}(\widehat{\alpha}) \le \widehat{\mathrm{ISE}}(\widehat{\alpha}) + 2\epsilon \le \widehat{\mathrm{ISE}}(\alpha) + 2\epsilon \le \mathrm{ISE}(\alpha) + 4\epsilon$, where the second inequality holds from the definition of $\widehat{\alpha}$. [sent-91, score-0.097]
49 3 ISE consistency Next, we have a theorem stating that ISE (α) converges to zero in probability. [sent-94, score-0.137]
50 Suppose that for f = f+ and f− , the Hessian Hf (x) exists and each entry of Hf (x) is piecewise continuous and square integrable. [sent-96, score-0.04]
51 The consistency with respect to the probability of error could be easily shown if we set γ to $\gamma^* = (1-p)/p$ and apply Theorem 3 in [17]. [sent-104, score-0.065]
52 However, since p is unknown, we must estimate $\gamma^*$. [sent-105, score-0.026]
53 Note that $N_+$ and $N_-$ are binomial random variables, and we may estimate $\gamma^*$ as $\widehat{\gamma} = N_-/N_+$. [sent-106, score-0.026]
54 The next theorem says the L2 kernel classifier is consistent with respect to the probability of error. [sent-107, score-0.152]
55 If σ evolves as a function of n such that $\sigma \to 0$ and $n\sigma^{2d}/\ln n \to \infty$ as $n \to \infty$, then the L2 kernel classifier is consistent. [sent-114, score-0.206]
56 , $(X_n, Y_n))$, the classification error $L_n = P\big( \mathrm{sgn}\, d_{\widehat{\gamma}}(X; \widehat{\alpha}) \neq Y \mid D_n \big)$ converges to the Bayes error $L^*$ in probability as $n \to \infty$. [sent-118, score-0.066]
57 5 Application to density estimation By setting γ = 0, our goal becomes estimating f+ and we recover the L2 kernel density estimate of [4, 5] using leave-one-out cross-validation. [sent-122, score-0.321]
58 , $X_n$ from $f(x)$, the L2 kernel density estimate of $f(x)$ is defined as $f(x;\alpha) = \sum_{i=1}^{n} \alpha_i k_\sigma(x, X_i)$ with the $\alpha_i$'s optimized such that $\widehat{\alpha} = \arg\min_{\sum_i \alpha_i = 1,\; \alpha_i \ge 0}\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k_{\sqrt{2}\sigma}(X_i, X_j) - \sum_{i=1}^{n} \alpha_i \frac{1}{n-1} \sum_{j \neq i} k_\sigma(X_i, X_j)$. [sent-128, score-0.217]
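The same machinery specializes to this γ = 0 case; below is our sketch of the resulting L2 kernel density estimate, with the simplex projection repeated so the snippet runs on its own (all names are ours).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w >= 0, sum(w) = 1} (repeated from the classifier sketch)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1), 0.0)

def l2_kde_weights(X, sigma, n_iter=2000):
    """Weights minimizing 0.5*a'K2 a - c'a over the simplex, with c_i = (1/(n-1)) * sum_{j != i} k_sigma(X_i, X_j)."""
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = (2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * sigma ** 2))
    s2 = np.sqrt(2.0) * sigma
    K2 = (2.0 * np.pi * s2 ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * s2 ** 2))
    c = (K.sum(axis=1) - np.diag(K)) / (n - 1)      # leave-one-out kernel averages
    alpha = np.full(n, 1.0 / n)
    step = 1.0 / np.linalg.norm(K2, 2)
    for _ in range(n_iter):
        alpha = project_simplex(alpha - step * (K2 @ alpha - c))
    return alpha

def l2_kde(x, X, alpha, sigma):
    """Evaluate f_hat(x; alpha) = sum_i alpha_i * k_sigma(x, X_i)."""
    d = X.shape[1]
    sq = np.sum((X - x) ** 2, axis=1)
    return float(alpha @ ((2.0 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (2.0 * sigma ** 2))))
```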
59 Our concentration inequality, oracle inequality, and L2 consistency result immediately extend to provide the same performance guarantees for this method. [sent-129, score-0.214]
60 Suppose that the Hessian Hf (x) of a density function f (x) exists and each entry of Hf (x) is piecewise continuous and square integrable. [sent-132, score-0.115]
61 If $\sigma \to 0$ and $n\sigma^{2d}/\ln n \to \infty$ as $n \to \infty$, then $\int \left( f(x;\widehat{\alpha}) - f(x) \right)^2 dx \to 0$ in probability. [sent-133, score-0.236]
62 4 Conclusion Through the development of a novel concentration inequality, we have established statistical performance guarantees on a recently introduced L2 kernel classifier. [sent-134, score-0.198]
63 We view the relatively clean analysis of this classifier as an attractive feature relative to other kernel methods. [sent-135, score-0.116]
64 In future work, we hope to invoke the full power of the oracle inequality to obtain adaptive rates of convergence, and consistency for σ not necessarily tending to zero. [sent-136, score-0.229]
65 Since $X_i \sim f_+(x)$ for $i \in I_+$ and $X_i \sim f_-(x)$ for $i \in I_-$, it can be easily shown that $E\big[\,\widehat{h}_i \mid X_i\,\big] = h(X_i)$. [sent-140, score-0.25]
66 The second term in (7) is $P\left( \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - n_- E[k_\sigma(W, X_i) \mid X_i] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \right) = P\left( \Big| \sum_{j \in I_-} k_\sigma(X_j, X_i) - E\Big[ \sum_{j \in I_-} k_\sigma(X_j, X_i) \,\Big|\, X_i \Big] \Big| > \frac{n_- \epsilon}{1+\gamma} \;\Big|\; X_i = x \right) \le 2e^{-2 n_- \epsilon^2 / ((1+\gamma)^2 M^2)} \le 2e^{-2(n_- - 1)\epsilon^2 / ((1+\gamma)^2 M^2)}$. [sent-142, score-0.02]
67 Therefore, $P\left( \big| \widehat{h}_i - h(X_i) \big| \ge \epsilon \right) = E\left[ P\left( \big| \widehat{h}_i - h(X_i) \big| \ge \epsilon \;\Big|\; X_i \right) \right] \le 2e^{-2(n_+ - 1)\epsilon^2/((1+\gamma)^2 M^2)} + 2e^{-2(n_- - 1)\epsilon^2/((1+\gamma)^2 M^2)}$. [sent-143, score-0.5]
68 Proof of Theorem 3 From Theorem 3 in [17], it suffices to show that $\int \left( d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\gamma^*}(x) \right)^2 dx \to 0$ in probability. [sent-147, score-0.146]
69 Since, by the triangle inequality, $\| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\gamma^*}(x) \|_{L_2} = \| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\widehat{\gamma}}(x) + (\gamma^* - \widehat{\gamma}) f_-(x) \|_{L_2} \le \| d_{\widehat{\gamma}}(x;\widehat{\alpha}) - d_{\widehat{\gamma}}(x) \|_{L_2} + |\widehat{\gamma} - \gamma^*| \cdot \| f_-(x) \|_{L_2} = \sqrt{\mathrm{ISE}(\widehat{\alpha})} + |\widehat{\gamma} - \gamma^*| \cdot \| f_-(x) \|_{L_2}$, we need to show that $\mathrm{ISE}(\widehat{\alpha})$ and $\widehat{\gamma}$ converge in probability to $0$ and $\gamma^*$, respectively. [sent-148, score-0.131]
70 The convergence of γ to γ ∗ can be easily shown from the strong law of large numbers. [sent-149, score-0.02]
71 Define an event $D = \{\, N_+ \ge np,\; N_- \ge n(1-p),\; \widehat{\gamma} \le 2\gamma^* \,\}$. [sent-151, score-0.028]
72 The first term converges to 0 from the strong law of large numbers. [sent-153, score-0.074]
73 Define a set $S = \{\, (n_+, n_-) \mid n_+ \ge np,\; n_- \ge n(1-p),\; n_-/n_+ \le 2\gamma^* \,\}$. [sent-154, score-0.028]
74 (n+ ,n− )∈S ≤ max (n+ ,n− )∈S 7 (8) Provided that σ → 0 and nσ 2d / ln n → ∞, any pair (n+ , n− ) ∈ S satisfies σ → 0, n+ σ 2d / ln n → ∞, and n− σ 2d / ln n → ∞ as n → ∞ and thus the term in (8) converges to 0 from Theorem 2. [sent-156, score-0.324]
75 Scott, “Kernel classification via integrated squared error,” IEEE Workshop on Statistical Signal Processing, August 2007. [sent-170, score-0.034]
76 Kim, Least Squares Mixture Decomposition Estimation, unpublished doctoral dissertation, Dept. [sent-172, score-0.035]
77 [5] Mark Girolami and Chao He, “Probability density estimation from optimally condensed data samples,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-176, score-0.104]
78 Turlach, “Bandwidth selection in kernel density estimation: A review,” Technical Report 9317, C. [sent-182, score-0.191]
79 Scott, “Parametric statistical modeling by minimum integrated square error,” Technometrics 43, pp. [sent-188, score-0.052]
80 Wegkamp, “Sparse density estimation with l1 penalties,” Proceedings of 20th Annual Conference on Learning Theory, COLT 2007, Lecture Notes in Artificial Intelligence, v4539, pp. [sent-195, score-0.104]
81 Tsybakov, “Linear and convex aggregation of density estimators,” https:// hal. [sent-200, score-0.099]
82 Principe, and Torbjørn Eltoft, “Towards a unification of information theoretic learning and kernel method,” in Proc. [sent-205, score-0.116]
83 Ritter, “Discriminative densities from maximum contrast estimation,” in Advances in Neural Information Proceeding Systems 15, Vancouver, Canada, 2002, pp. [sent-215, score-0.1]
84 Taylor, “Kernel density classification and boosting: an l2 analysis,” Statistics and Computing, vol. [sent-220, score-0.075]
85 Wagner, “Asymptotically optimal discriminant functions for pattern classification,” IEEE Trans. [sent-235, score-0.016]
wordName wordTfidf (topN-words)
[('ise', 0.824), ('hi', 0.25), ('xi', 0.235), ('dx', 0.146), ('kernel', 0.116), ('hn', 0.101), ('densities', 0.1), ('inequality', 0.097), ('xj', 0.094), ('classi', 0.093), ('ln', 0.09), ('er', 0.085), ('oracle', 0.083), ('density', 0.075), ('hf', 0.071), ('scott', 0.063), ('yi', 0.06), ('kim', 0.06), ('sup', 0.056), ('consistency', 0.049), ('qp', 0.045), ('concentration', 0.044), ('arbor', 0.039), ('kde', 0.039), ('guarantees', 0.038), ('estimator', 0.037), ('ann', 0.036), ('theorem', 0.036), ('bandwidth', 0.035), ('converges', 0.034), ('integrated', 0.034), ('svms', 0.032), ('quadratic', 0.03), ('eecs', 0.03), ('estimation', 0.029), ('deduce', 0.029), ('tsybakov', 0.029), ('michigan', 0.029), ('dn', 0.028), ('np', 0.028), ('ci', 0.026), ('estimate', 0.026), ('ers', 0.026), ('svm', 0.026), ('evolve', 0.026), ('hessian', 0.025), ('yj', 0.025), ('aggregation', 0.024), ('cation', 0.024), ('lemma', 0.023), ('kernels', 0.022), ('piecewise', 0.022), ('criterion', 0.021), ('offset', 0.021), ('proves', 0.021), ('law', 0.02), ('hall', 0.02), ('term', 0.02), ('mi', 0.02), ('relabel', 0.019), ('polytechnic', 0.019), ('wand', 0.019), ('loocv', 0.019), ('wagner', 0.019), ('sao', 0.019), ('chao', 0.019), ('cortes', 0.019), ('wegkamp', 0.019), ('suppose', 0.019), ('appendix', 0.019), ('xn', 0.018), ('square', 0.018), ('false', 0.018), ('unbiased', 0.018), ('bunea', 0.018), ('parzen', 0.018), ('terry', 0.018), ('luis', 0.018), ('jose', 0.018), ('clayton', 0.018), ('https', 0.018), ('dissertation', 0.018), ('ritter', 0.018), ('unpublished', 0.018), ('stating', 0.018), ('pf', 0.018), ('turlach', 0.018), ('rigollet', 0.018), ('holdout', 0.018), ('oct', 0.018), ('yn', 0.018), ('sign', 0.017), ('girolami', 0.017), ('technometrics', 0.017), ('doctoral', 0.017), ('devroye', 0.017), ('vapnik', 0.017), ('parametric', 0.017), ('error', 0.016), ('pattern', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
2 0.11084212 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
3 0.097247161 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
4 0.092689157 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
5 0.092584074 117 nips-2008-Learning Taxonomies by Dependence Maximization
Author: Matthew Blaschko, Arthur Gretton
Abstract: We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-ofthe-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data. 1
6 0.09048906 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
7 0.088817671 65 nips-2008-Domain Adaptation with Multiple Sources
8 0.07786978 226 nips-2008-Supervised Dictionary Learning
9 0.076847568 196 nips-2008-Relative Margin Machines
10 0.074648641 228 nips-2008-Support Vector Machines with a Reject Option
11 0.072183661 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
12 0.071256675 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
13 0.070848502 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
14 0.069583155 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
15 0.068299323 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
16 0.065244399 72 nips-2008-Empirical performance maximization for linear rank statistics
17 0.061435927 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
18 0.060175806 113 nips-2008-Kernelized Sorting
19 0.059826806 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
20 0.058549467 138 nips-2008-Modeling human function learning with Gaussian processes
topicId topicWeight
[(0, -0.163), (1, -0.066), (2, -0.108), (3, 0.054), (4, -0.022), (5, 0.01), (6, 0.026), (7, -0.026), (8, 0.038), (9, 0.051), (10, 0.155), (11, 0.026), (12, 0.067), (13, 0.007), (14, -0.035), (15, 0.019), (16, 0.013), (17, -0.037), (18, -0.054), (19, 0.061), (20, 0.032), (21, -0.009), (22, 0.026), (23, 0.108), (24, 0.025), (25, -0.075), (26, 0.085), (27, -0.152), (28, 0.167), (29, 0.04), (30, 0.009), (31, 0.015), (32, 0.139), (33, 0.043), (34, -0.0), (35, 0.003), (36, -0.08), (37, 0.087), (38, -0.043), (39, -0.003), (40, -0.106), (41, 0.049), (42, 0.147), (43, 0.016), (44, 0.014), (45, -0.09), (46, 0.002), (47, 0.06), (48, -0.022), (49, 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.93001026 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
2 0.66063207 78 nips-2008-Exact Convex Confidence-Weighted Learning
Author: Koby Crammer, Mark Dredze, Fernando Pereira
Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distribution over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1
3 0.65781498 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data
Author: Zhengdong Lu, Jeffrey Kaye, Todd K. Leen
Abstract: We develop new techniques for time series classification based on hierarchical Bayesian generative models (called mixed-effect models) and the Fisher kernel derived from them. A key advantage of the new formulation is that one can compute the Fisher information matrix despite varying sequence lengths and varying sampling intervals. This avoids the commonly-used ad hoc replacement of the Fisher information matrix with the identity which destroys the geometric invariance of the kernel. Our construction retains the geometric invariance, resulting in a kernel that is properly invariant under change of coordinates in the model parameter space. Experiments on detecting cognitive decline show that classifiers based on the proposed kernel out-perform those based on generative models and other feature extraction routines, and on Fisher kernels that use the identity in place of the Fisher information.
4 0.64171898 196 nips-2008-Relative Margin Machines
Author: Tony Jebara, Pannagadatta K. Shivaswamy
Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1
5 0.6029225 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
Author: Ofer Dekel
Abstract: We present cutoff averaging, a technique for converting any conservative online learning algorithm into a batch learning algorithm. Most online-to-batch conversion techniques work well with certain types of online learning algorithms and not with others, whereas cutoff averaging explicitly tries to adapt to the characteristics of the online algorithm being converted. An attractive property of our technique is that it preserves the efficiency of the original online algorithm, making it appropriate for large-scale learning problems. We provide a statistical analysis of our technique and back our theoretical claims with experimental results. 1
6 0.57045114 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
7 0.55033576 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
8 0.54399759 228 nips-2008-Support Vector Machines with a Reject Option
9 0.48850575 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
10 0.47594038 226 nips-2008-Supervised Dictionary Learning
11 0.44284621 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
12 0.43651715 72 nips-2008-Empirical performance maximization for linear rank statistics
13 0.42812079 185 nips-2008-Privacy-preserving logistic regression
14 0.42588696 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
15 0.42188221 113 nips-2008-Kernelized Sorting
16 0.40825787 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
17 0.39216882 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
18 0.3876473 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
19 0.38559416 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
20 0.38211301 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
topicId topicWeight
[(6, 0.455), (7, 0.061), (12, 0.016), (28, 0.17), (57, 0.06), (63, 0.015), (71, 0.012), (77, 0.029), (83, 0.038)]
simIndex simValue paperId paperTitle
1 0.99659044 65 nips-2008-Domain Adaptation with Multiple Sources
Author: Yishay Mansour, Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents a theoretical analysis of the problem of domain adaptation with multiple sources. For each source domain, the distribution over the input points as well as a hypothesis with error at most ǫ are given. The problem consists of combining these hypotheses to derive a hypothesis with small error with respect to the target domain. We present several theoretical results relating to this problem. In particular, we prove that standard convex combinations of the source hypotheses may in fact perform very poorly and that, instead, combinations weighted by the source distributions benefit from favorable theoretical guarantees. Our main result shows that, remarkably, for any fixed target function, there exists a distribution weighted combining rule that has a loss of at most ǫ with respect to any target mixture of the source distributions. We further generalize the setting from a single target function to multiple consistent target functions and show the existence of a combining rule with error at most 3ǫ. Finally, we report empirical results for a multiple source adaptation problem with a real-world dataset.
2 0.98313653 43 nips-2008-Cell Assemblies in Large Sparse Inhibitory Networks of Biologically Realistic Spiking Neurons
Author: Adam Ponzi, Jeff Wickens
Abstract: Cell assemblies exhibiting episodes of recurrent coherent activity have been observed in several brain regions including the striatum[1] and hippocampus CA3[2]. Here we address the question of how coherent dynamically switching assemblies appear in large networks of biologically realistic spiking neurons interacting deterministically. We show by numerical simulations of large asymmetric inhibitory networks with fixed external excitatory drive that if the network has intermediate to sparse connectivity, the individual cells are in the vicinity of a bifurcation between a quiescent and firing state and the network inhibition varies slowly on the spiking timescale, then cells form assemblies whose members show strong positive correlation, while members of different assemblies show strong negative correlation. We show that cells and assemblies switch between firing and quiescent states with time durations consistent with a power-law. Our results are in good qualitative agreement with the experimental studies. The deterministic dynamical behaviour is related to winner-less competition[3], shown in small closed loop inhibitory networks with heteroclinic cycles connecting saddle-points. 1
3 0.97527534 74 nips-2008-Estimating the Location and Orientation of Complex, Correlated Neural Activity using MEG
Author: Julia Owen, Hagai T. Attias, Kensuke Sekihara, Srikantan S. Nagarajan, David P. Wipf
Abstract: The synchronous brain activity measured via MEG (or EEG) can be interpreted as arising from a collection (possibly large) of current dipoles or sources located throughout the cortex. Estimating the number, location, and orientation of these sources remains a challenging task, one that is significantly compounded by the effects of source correlations and the presence of interference from spontaneous brain activity, sensor noise, and other artifacts. This paper derives an empirical Bayesian method for addressing each of these issues in a principled fashion. The resulting algorithm guarantees descent of a cost function uniquely designed to handle unknown orientations and arbitrary correlations. Robust interference suppression is also easily incorporated. In a restricted setting, the proposed method is shown to have theoretically zero bias estimating both the location and orientation of multi-component dipoles even in the presence of correlations, unlike a variety of existing Bayesian localization methods or common signal processing techniques such as beamforming and sLORETA. Empirical results on both simulated and real data sets verify the efficacy of this approach. 1
4 0.96357411 214 nips-2008-Sparse Online Learning via Truncated Gradient
Author: John Langford, Lihong Li, Tong Zhang
Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of online-learning algorithms with convex loss. This method has several essential properties. First, the degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. Second, the approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. Finally, the approach works well empirically. We apply it to several datasets and find for datasets with large numbers of features, substantial sparsity is discoverable. 1
same-paper 5 0.92361325 178 nips-2008-Performance analysis for L\ 2 kernel classification
Author: Jooseuk Kim, Clayton Scott
Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier in the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1
6 0.83880937 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
7 0.83568966 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
8 0.79028702 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
9 0.7458204 202 nips-2008-Robust Regression and Lasso
10 0.73334521 75 nips-2008-Estimating vector fields using sparse basis field expansions
11 0.72449452 62 nips-2008-Differentiable Sparse Coding
12 0.72099692 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
13 0.71956658 85 nips-2008-Fast Rates for Regularized Objectives
14 0.71748692 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
15 0.71448761 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
16 0.71249074 145 nips-2008-Multi-stage Convex Relaxation for Learning with Sparse Regularization
17 0.7120775 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
18 0.71157074 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
19 0.7032907 228 nips-2008-Support Vector Machines with a Reject Option
20 0.69762373 210 nips-2008-Signal-to-Noise Ratio Analysis of Policy Gradient Algorithms