jmlr jmlr2007 jmlr2007-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
Reference: text
sentIndex sentText sentNum sentScore
1 KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. [sent-16, score-0.227]
2 We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. [sent-17, score-0.276]
3 Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. [sent-19, score-0.324]
4 Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. [sent-20, score-0.361]
5 Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning [sent-22, score-0.262]
6 The r × n matrix W whose rows are the eigenvectors of Q associated with the r ≤ n largest eigenvalues minimizes the least-squares reconstruction error ‖X − WᵀWX‖F, (1) where ‖·‖F is the Frobenius norm. [sent-25, score-0.343]
7 Plugging (5) into (3) and stochastically approximating the expectation E[y] with its instantaneous estimate yt := Wt xt, where xt ∈ Rn is the observation at time t, yields ∂Wt J(W) = yt xtᵀ − lt(yt ytᵀ)Wt. (6) [sent-44, score-0.675]
8 Gradient ascent in (6) gives the generalized Hebbian algorithm (GHA) of Sanger (1989): Wt+1 = Wt + ηt [yt xtᵀ − lt(yt ytᵀ)Wt]. (7) [sent-45, score-0.225]
9 A closely related algorithm by Oja and Karhunen (1985, Section 5) omits the lt operator: Wt+1 = Wt + ηt [yt xtᵀ − yt ytᵀ Wt]. [sent-48, score-0.524]
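To make the difference between (7) and the Oja–Karhunen variant concrete, here is a minimal NumPy sketch of one update of each; the function and variable names are illustrative, not the authors' code, and lt(·) is taken to keep the lower triangle (including the diagonal) of its argument and zero the rest.

```python
import numpy as np

def lt(M):
    # lt(.): keep the lower triangle (including the diagonal), zero the rest
    return np.tril(M)

def gha_step(W, x, eta):
    """One GHA update (7): W <- W + eta * (y x^T - lt(y y^T) W), with y = W x."""
    y = W @ x
    return W + eta * (np.outer(y, x) - lt(np.outer(y, y)) @ W)

def oja_karhunen_step(W, x, eta):
    """Oja-Karhunen (1985) variant: the same update without the lt operator."""
    y = W @ x
    return W + eta * (np.outer(y, x) - np.outer(y, y) @ W)
```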
10 Both GHA and KHA updates incorporate a scalar gain parameter ηt , which is either held fixed or annealed according to some predefined schedule. [sent-61, score-0.184]
11 τ determines the length of an initial search phase with near-constant gain (ηt ≈ η0 for t ≪ τ), before the gain decays asymptotically as τ/t (for t ≫ τ) in the annealing phase (Darken and Moody, 1992). [sent-64, score-0.331]
12 Here we propose the inclusion of a gain vector in the KHA, which provides each estimated eigenvector with its individual gain parameter. [sent-69, score-0.314]
13 In Section 3.1 we describe our KHA/et* algorithm, which sets the gain for each eigenvector inversely proportional to its estimated eigenvalue, in addition to using (9) for annealing. [sent-71, score-0.17]
14 The KHA/et algorithm of Section 3.3 additionally multiplies the gain vector by the length of the vector of estimated eigenvalues; this allows us to eliminate the τ tuning parameter. [sent-73, score-0.176]
15 We then derive and apply the stochastic meta-descent (SMD) gain vector adaptation technique (Schraudolph, 1999, 2002) to KHA/et* and KHA/et to further speed up their convergence. [sent-74, score-0.239]
16 From the work of Schölkopf et al. (1998) it is known that the principal components must lie in the span of the centered data in feature space; we can therefore express the GHA weight matrix as Wt = At Φ, where At is an r × l matrix of expansion coefficients, and r the desired number of principal components. [sent-102, score-0.161]
17 Since we have Φ(xi)ᵀ = eiᵀ Φ, where ei is the unit vector in direction i, (13) can be rewritten solely in terms of expansion coefficients as At+1 = At + ηt [yt ep(t)ᵀ − lt(yt ytᵀ)At]. (15) [sent-104, score-0.196]
18 Introducing the update coefficient matrix Γt := yt ep(t)ᵀ − lt(yt ytᵀ)At, (16) we can write the KHA update compactly as At+1 = At + ηt Γt. (17) [sent-105, score-0.467]
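A minimal sketch of one KHA step in this expansion-coefficient form, (16)–(17), under the assumption that the centered kernel matrix K fits in memory and p indexes the pattern presented at time t; all names are hypothetical, not the authors' code.

```python
import numpy as np

def kha_step(A, K, p, eta):
    """One KHA step: Gamma = y e_p^T - lt(y y^T) A;  A <- A + eta * Gamma."""
    y = A @ K[:, p]                        # y_t = A_t k_{p(t)}
    Gamma = -np.tril(np.outer(y, y)) @ A   # - lt(y_t y_t^T) A_t
    Gamma[:, p] += y                       # y_t e_{p(t)}^T contributes only to column p(t)
    return A + eta * Gamma
```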
19 Kim et al. (2005) employed the KHA update (17) with a constant scalar gain ηt = η0; they also proposed letting the gain decay as ηt = η0/t. [sent-107, score-0.433]
20 Our implementation (which we denote KHA/t) employs the more general (9) instead, from which an η0/(t + 1) decay is obtained by setting τ = 1, and a constant gain in the limit as τ → ∞. [sent-108, score-0.207]
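A one-line sketch of the schedule (9) assumed here, with t counting updates from zero (hypothetical helper name):

```python
def scheduled_gain(eta0, tau, t):
    """Schedule (9): eta_t = eta0 * tau / (t + tau).
    tau = 1 gives an eta0/(t+1) decay; tau -> infinity gives a constant gain eta0."""
    return eta0 * tau / (t + tau)
```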
21 Gain Decay with Reciprocal Eigenvalues: Consider the term yt xtᵀ = Wt xt xtᵀ appearing on the right-hand side of the GHA update (7). [sent-110, score-0.325]
22 The elements of yt thus scale with the associated eigenvalues of Q. [sent-112, score-0.282]
23 We counteract this problem by furnishing the KHA with a gain vector ηt ∈ Rr with positive entries, which provides each eigenvector estimate with its individual gain parameter; we will discuss how to set ηt below. [sent-114, score-0.343]
24 KHA/et* thus conditions the KHA update by proportionately decreasing (increasing) the gain (19) for rows of At associated with large (small) eigenvalues. [sent-121, score-0.209]
25 3.2 Calculating the Eigenvalues: The above update (19) requires the first r eigenvalues of K—but the KHA is an algorithm for estimating these eigenvalues and their associated eigenvectors in the first place. [sent-124, score-0.282]
26 3.3 The KHA/et Algorithm: The τ parameter of the KHA/et* update (19) above determines at what point in the iterative kernel PCA we gradually shift from the initial search phase (with near-constant ηt) into the asymptotic annealing phase (with ηt near-proportional to 1/t). [sent-134, score-0.194]
27 One way to achieve this is to have some measure of progress counteract the gain decay: As long as we are making rapid progress, we are in the search phase, and do not want to decrease the gains; when progress stalls it is time to start annealing them. [sent-136, score-0.216]
28 Our KHA/et algorithm fixes the gain decay schedule of KHA/et* at τ = l, but multiplies the gains by ‖λt‖: [ηt]i = (‖λt‖/[λt]i) (l/(t + l)) η0. (22) [sent-139, score-0.281]
29 The rapid early growth of λt thus serves to counteract the gain decay until the leading eigenspace has been identified. [sent-140, score-0.236]
30 Asymptotically λt approaches its (constant) maximum, and so the gain decay will ultimately dominate (22). [sent-141, score-0.207]
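The two gain vectors (19) and (22) can be sketched as follows, assuming lam holds the current eigenvalue estimates [λt]i and l is the number of training points; this is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def gain_kha_et_star(lam, eta0, tau, t):
    """KHA/et* gain (19): the scheduled decay (9) divided by each estimated eigenvalue."""
    return (eta0 * tau / (t + tau)) / lam

def gain_kha_et(lam, eta0, l, t):
    """KHA/et gain (22): tau fixed at l, and the gains multiplied by ||lambda_t||."""
    return (np.linalg.norm(lam) / lam) * (l / (t + l)) * eta0
```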
31 This achieves an effect comparable to an “adaptive search then converge” (ASTC) gain schedule (Darken and Moody, 1992) while eliminating the τ tuning parameter. [sent-142, score-0.205]
32 Since (19) and (22) can both be expressed as [ηt]i = η̂t/[λt]i, (23) for particular choices of η̂t, we can compare the gain vectors used by KHA/et* and KHA/et by monitoring how they evolve the scalar η̂t; this is shown in Figure 1 for all experiments reported in Section 5. [sent-143, score-0.184]
33 We briefly review gradient-based gain adaptation methods, then derive and implement Schraudolph’s (1999; 2002) stochastic meta-descent (SMD) algorithm for both KHA/et* and KHA/et, focusing on the scalar form of SMD that can be used in an RKHS. [sent-149, score-0.279]
34 We adapt θ via the stochastic gradient descent θt+1 = θt − e^{ρt} gt, (24) where gt = ∂θt Jt(θt), using ∂θt as a shorthand for ∂/∂θ evaluated at θ = θt. [sent-155, score-0.25]
35 Using the chain rule and (24) we find ρt+1 = ρt − µ ∂ρt Jt+1(θt+1) = ρt − µ [∂θt+1 Jt+1(θt+1)]ᵀ ∂ρt θt+1 = ρt + µ e^{ρt} gt+1ᵀ gt, (25) where the meta-gain µ ≥ 0 is a scalar tuning parameter. [sent-159, score-0.164]
36 Intuitively, the gain adaptation (25) is driven by the angle between successive gradient measurements: If it is less than 90°, then gt+1ᵀ gt > 0, and ρt will be increased. [sent-160, score-0.331]
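A minimal sketch of the single-step adaptation (24)–(25) for a scalar log-gain ρ; the gradients are supplied by the caller, and the function name is hypothetical.

```python
import numpy as np

def single_step_smd(theta, rho, g, g_prev, mu):
    """Adapt the log-gain from the correlation of successive gradients (25),
    then take a gradient step with gain e^rho (24)."""
    rho = rho + mu * np.exp(rho) * np.dot(g, g_prev)  # uses g_{t+1}^T g_t
    theta = theta - np.exp(rho) * g
    return theta, rho
```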
37 One shortcoming of (25) is that the decorrelation occurs only across a single time step, making the gain adaptation overly sensitive to spurious short-term correlations in the data. [sent-163, score-0.206]
38 To compute vt+1 efficiently, we expand θt+1 in terms of its recursive definition (24): vt+1 := ∑_{i=0}^{t} ξ^i ∂ρt−i θt+1 = ∑_{i=0}^{t} ξ^i ∂ρt−i θt − ∑_{i=0}^{t} ξ^i ∂ρt−i [e^{ρt} gt] (27) ≈ ξvt − e^{ρt} (gt + ∂θt gt ∑_{i=0}^{t} ξ^i ∂ρt−i θt). [sent-165, score-0.184]
39 Noting that ∂θt gt is the Hessian Ht of Jt(θt), we arrive at the simple iterative update vt+1 = ξvt − e^{ρt} (gt + ξHt vt). [sent-167, score-0.219]
40 Note that for ξ = 0 (29) and (26) reduce to the single-step gain adaptation (25). [sent-169, score-0.206]
41 The Hessian-vector product Ht vt can be computed efficiently by propagating a differential (i.e., directional derivative) through the gradient computation: dθt := vt ⇒ Ht vt = dgt. (30) [sent-174, score-0.155]
42 In other words, if we set the differential dθt of the parameter vector to vt, then the resulting differential of the gradient gt (a function of θt) is the Hessian-vector product Ht vt. [sent-175, score-0.303]
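Putting the pieces together, one plausible ordering of the long-term SMD updates looks as follows; the Hessian-vector product Hv = Ht vt is assumed to be supplied by the caller, for example via the differential trick (30). This is a sketch of the general recipe under these assumptions, not the exact sequencing of the paper's equations.

```python
import numpy as np

def smd_step(theta, rho, v, g, Hv, mu, xi):
    """One SMD step with decay factor xi:
       rho   <- rho + mu * g . v                 (gain adaptation)
       theta <- theta - e^rho * g                (gradient step)
       v     <- xi * v - e^rho * (g + xi * Hv)   (trace update, cf. the v_{t+1} recursion)
    """
    rho = rho + mu * np.dot(g, v)      # uses the trace v carried over from the previous step
    eta = np.exp(rho)
    theta_new = theta - eta * g
    v_new = xi * v - eta * (g + xi * Hv)
    return theta_new, rho, v_new
```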
43 We can, however, reduce this cost to O(rl) by noting that (16) implies that Γt K = yt ep(t)ᵀ K − lt(yt ytᵀ)At K = yt kp(t)ᵀ − lt(yt ytᵀ)At K, (33) where the r × l matrix At K can be stored and updated incrementally via (31): At+1 K = At K + e^{diag(ρt)} diag(ηt) Γt K. [sent-189, score-0.817]
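A sketch of the bookkeeping described above, keeping the r × l product AK = At K cached and updated alongside At; here eta is assumed to already contain the combined per-row gains (e.g., the elementwise product of e^ρt and ηt), and all names are illustrative.

```python
import numpy as np

def kha_step_cached(A, AK, K, p, eta):
    """Update A and the cached product A K together, using
    Gamma_t K = y_t k_{p(t)}^T - lt(y_t y_t^T) (A_t K), as in (33)."""
    y = AK[:, p]                              # y_t = A_t k_{p(t)}, read off the cache
    L = np.tril(np.outer(y, y))               # lt(y_t y_t^T)
    Gamma = -L @ A
    Gamma[:, p] += y                          # Gamma_t = y_t e_{p(t)}^T - lt(y_t y_t^T) A_t
    GammaK = np.outer(y, K[p, :]) - L @ AK    # Gamma_t K without recomputing A_{t+1} K from scratch
    A_new = A + eta[:, None] * Gamma          # individual gain per row of A
    AK_new = AK + eta[:, None] * GammaK
    return A_new, AK_new
```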
44 Inserting (16) and (36) into (35) finally yields the update rule Bt+1 = ξBt + e^{diag(ρt)} diag(ηt) [(At + ξBt) kp(t) ep(t)ᵀ − lt(yt ytᵀ)(At + ξBt) − ξ lt(Bt kp(t) ytᵀ + yt kp(t)ᵀ Btᵀ)At]. (37) [sent-194, score-0.63]
45 For r ≪ n (as is typical), the most expensive step in the iteration loop will be the computation of a row of the kernel matrix in 2(c), required by all algorithms. [sent-203, score-0.174]
46 In the first, we benchmark against the KHA with a conventional gain decay schedule (9), which we denote KHA/t, in a number of different settings: Performing kernel PCA and spectral clustering on the well-known USPS data set (LeCun et al., 1989), [sent-209, score-0.442]
47 replicating image denoising and face image super-resolution experiments of Kim et al. (2005). [sent-210, score-0.156]
48 A common feature of all these data sets is that the kernel matrix can be stored in main memory, and the optimal reconstruction can thus be computed with a conventional eigensolver. [sent-214, score-0.251]
49 [Table residue: parameter settings per experiment — USPS (dot-product kernel), USPS (RBF kernel), Lena image denoising, face super-resolution, USPS spectral clustering, motion capture, KPCA — with columns Section, σ, τ, η0, µ, ξ.] [sent-225, score-0.271]
50 5.1 Experiments on Small Data Sets: In these experiments the KHA and our enhanced variants are used to find the first r eigenvectors of the centered kernel matrix K. [sent-260, score-0.265]
51 To assess the quality of the solution, we reconstruct the kernel matrix using the eigenvectors found by the iterative algorithms, and measure the reconstruction error E(A) := ‖K − (AK)ᵀ AK‖F. (38) [sent-261, score-0.343]
52 Since the kernel matrix can be stored in memory, the optimal reconstruction error from r eigenvectors, Emin := minA E(A), is computed with a conventional eigensolver. [sent-262, score-0.304]
53 This allows us to report reconstruction errors as excess errors relative to the optimal reconstruction, that is, E(A)/Emin − 1. [sent-263, score-0.231]
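A sketch of these error measures, assuming the centered kernel matrix K is held in memory and Emin has been obtained from a conventional eigensolver; function names are illustrative.

```python
import numpy as np

def reconstruction_error(A, K):
    """E(A) = || K - (A K)^T (A K) ||_F, as in (38)."""
    AK = A @ K
    return np.linalg.norm(K - AK.T @ AK, ord='fro')

def excess_relative_error(A, K, E_min):
    """Excess error relative to the optimal reconstruction: E(A)/E_min - 1."""
    return reconstruction_error(A, K) / E_min - 1.0
```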
54 To compare algorithms we plot the excess reconstruction error on a logarithmic scale after each pass through the entire data set. [sent-264, score-0.276]
55 Figure 2: Excess relative reconstruction error of KHA variants for kernel PCA (16 eigenvectors) on the USPS data, using a dot-product (left) resp. RBF (right) kernel. [sent-271, score-0.272]
56 We use a subset of the USPS data set (LeCun et al., 1989)—namely, the first 100 samples of each digit—with two different kernel functions: the dot-product kernel k(x, x′) = xᵀx′ (39) and the RBF kernel k(x, x′) = exp(−(x − x′)ᵀ(x − x′)/(2σ²)) (40) with σ = 8, the value used by Mika et al. [sent-279, score-0.17]
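The two kernels (39) and (40), written out as plain functions (σ = 8 in these experiments):

```python
import numpy as np

def dot_product_kernel(x, z):
    """k(x, x') = x^T x'  (39)"""
    return np.dot(x, z)

def rbf_kernel(x, z, sigma=8.0):
    """k(x, x') = exp(-(x - x')^T (x - x') / (2 sigma^2))  (40)"""
    d = x - z
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))
```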
57 We extract the first 16 eigenvectors of the kernel matrix and plot the excess relative error in Figure 2. [sent-281, score-0.284]
58 Figure 4: Comparison of excess relative reconstruction error of KHA variants estimating eigenvalues and updating gains every iteration (’i’) vs. once after every pass through the data (’p’). [sent-285, score-0.416]
59 We show the first 10 eigenvectors obtained by KHA/et* for each kernel in Figure 3. [sent-290, score-0.153]
60 In Figure 4 we compare the performance of our algorithms, which estimate the eigenvalues and update the gains only once after every pass through the data (’p’), against variants (’i’) which do this after every iteration. [sent-291, score-0.272]
61 Figure 6: Excess relative reconstruction error of KHA variants in our replication of experiments due to Kim et al. [sent-294, score-0.187]
62 Left: multipatch image kernel PCA on a noisy Lena image; Right: super-resolution of face images. [sent-296, score-0.184]
63 Figure 7: Reconstructed Lena image after (left to right) 1, 2, and 3 passes through the data set, for KHA with constant gain ηt = 0. [sent-297, score-0.244]
64 KHA-SMD and KHA-SMD* achieved an excess reconstruction error that is over three orders of magnitude better than the conventional KHA after 50 passes through the data. [sent-319, score-0.348]
65 After Kim et al.'s (2005) 800 passes through the data with the constant-gain KHA we obtain an excess relative reconstruction error of 5. [sent-321, score-0.295]
66 The signal-to-noise ratio (SNR) of the reconstruction after 800 passes with constant gain is 13. [sent-323, score-0.341]
67 To illustrate the large difference in early performance between conventional KHA and KHA-SMD, we show the images reconstructed by either method after 1, 2, and 3 passes through the data set in Figure 7. [sent-326, score-0.18]
68 The overall gain used by KHA-SMD* comprises three factors: the scheduled gain decay over time (9), the reciprocal of the current estimated eigenvalues, and the gain adapted by SMD. [sent-329, score-0.541]
69 Figure 8 (left) shows the excess relative error as a function of the number of passes through the data. [sent-336, score-0.162]
70 On its own, SMD (s) outperforms the scheduled gain decay (t), but combining the two (ts) is better still. [sent-337, score-0.207]
71 Figure 8 (right) plots the excess relative error of the s variant (SMD alone, black) and KHA-SMD* (light red) on the Lena image denoising problem for the last three values of µ prior to divergence. [sent-342, score-0.208]
72 Figure 8: Excess relative reconstruction error for multipatch image PCA on a noisy Lena image. [sent-348, score-0.198]
73 Left: comparison of original KHA variants (black) with those using other combinations (light red) of gain decay (t), reciprocal eigenvalues (e), and SMD (s). [sent-349, score-0.393]
74 Figure 6 (right) plots the excess relative reconstruction error of the different algorithms on this task. [sent-362, score-0.231]
75 After 50 passes through the data, all our methods achieve an excess reconstruction error about three orders of magnitude better than the conventional KHA, though KHA-SMD is substantially faster than the others at reaching this level of performance. [sent-365, score-0.348]
76 Figure 9 illustrates that the reconstructed face images after one pass through the training data generally show better high-resolution detail for KHA-SMD than for the conventional KHA with constant gain. [sent-366, score-0.195]
77 Figure 9: Rows from top to bottom: Original face images (60 × 60 pixels); sub-sampled images (20 × 20 pixels); super-resolution images produced by KHA after one pass through the data set; likewise for KHA-SMD. [sent-370, score-0.19]
78 Define the normalized transition matrix P := D^{−1/2} K D^{−1/2}, where K ∈ Rl×l is the kernel matrix of the data, and D is a diagonal matrix with [D]ii = ∑j [K]ij. [sent-372, score-0.184]
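A minimal sketch of this construction, assuming K is the l × l kernel matrix:

```python
import numpy as np

def normalized_transition_matrix(K):
    """P = D^{-1/2} K D^{-1/2}, where D is diagonal with D_ii = sum_j K_ij."""
    d_inv_sqrt = 1.0 / np.sqrt(K.sum(axis=1))
    return K * np.outer(d_inv_sqrt, d_inv_sqrt)
```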
79 The excess relative reconstruction errors—for spectral clustering, of the matrix P —plotted in Figure 10 (right) confirm that our methods outperform KHA/t. [sent-394, score-0.297]
80 Figure 10: Quality of spectral clustering of the USPS data using an RBF kernel, as measured by variation of information (left) and excess relative reconstruction error (right). [sent-398, score-0.299]
81 Figure 11: Excess relative reconstruction error on human motion capture data. [sent-400, score-0.216]
82 The motion is reconstructed in R3 via the KHA with an RBF kernel (σ = 1. [sent-417, score-0.172]
83 While kernel PCA has previously been applied to subsets of this data, to the best of our knowledge nobody has attempted it on the entire data set—for obvious reasons: the full kernel matrix has 3. [sent-425, score-0.203]
84 We will perform a single pass through the MNIST data, attempting to find the first 50 eigenvalues of the centered kernel matrix. [sent-428, score-0.241]
85 Hitherto we have used the excess reconstruction error relative to the optimal reconstruction error to measure the performance of the KHA. [sent-430, score-0.364]
86 Instead we simply report the reconstruction error (38), which we can still compute— albeit with a rather high time complexity, as it requires calculating all entries of the kernel matrix. [sent-432, score-0.218]
87 We set τ = 0.05 l a priori, which corresponds to decreasing the gain by a factor of 20 during the first (and only) pass through the data. [sent-434, score-0.189]
88 Algorithm 2 exploits this to automatically tune a gain parameter (η0 resp. µ). [sent-437, score-0.19]
89 It is therefore possible to tune both η0 and µ for the SMD variants as follows: first run Algorithm 2 to tune η0 (with µ = 0) on a small fraction of the data, then run it a second time to tune µ (with the previously obtained value for η0 ) on the entire data set. [sent-452, score-0.192]
90 For a fair comparison, all our algorithms use the same initial random matrix A1, whose absolute reconstruction error is 33417. [sent-458, score-0.166]
91 The reconstruction error after one pass through the data is shown in Table 2; it is evident that all our algorithms significantly improve upon the performance of KHA/t, with the SMD variants slightly ahead of their non-SMD analogues. [sent-459, score-0.232]
92 Table 2 also reports the time spent in parameter tuning, the resulting tuned parameter values, the time needed by each KHA variant for one pass through the data, and the total runtime (comprising kernel centering, parameter tuning, KHA proper, and computing the reconstruction error). [sent-475, score-0.336]
93 We extended the KHA of Kim et al. (2005) by providing a separate gain for each eigenvector estimate, and presented two methods, KHA/et* and KHA/et, which set those gains inversely proportional to the current estimate of the eigenvalues. [sent-480, score-0.215]
94 KHA/et has a normalization term which allowed us to eliminate one of the free parameters of the gain decay scheme. [sent-481, score-0.207]
95 Both methods were then enhanced by applying stochastic meta-descent (SMD) to perform gain adaptation in RKHS. [sent-482, score-0.239]
96 We compared our algorithms against the conventional KHA with a scheduled gain decay (KHA/t), in seven different experimental settings. [sent-484, score-0.207]
97 Like other kernel methods, kernel PCA (Schölkopf et al. [sent-493, score-0.17]
98 Traditionally, kernel methods require computation and storage of the entire kernel matrix. [sent-495, score-0.17]
99 On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. [sent-599, score-0.187]
100 Human motion de-noising via greedy kernel principal component analysis filtering. [sent-648, score-0.181]
wordName wordTfidf (topN-words)
[('kha', 0.648), ('smd', 0.381), ('yt', 0.196), ('pca', 0.184), ('gain', 0.144), ('reconstruction', 0.133), ('rl', 0.125), ('chraudolph', 0.124), ('ishwanathan', 0.124), ('unter', 0.124), ('gha', 0.114), ('lt', 0.103), ('usps', 0.098), ('excess', 0.098), ('gt', 0.092), ('terative', 0.086), ('lena', 0.086), ('eigenvalues', 0.086), ('kernel', 0.085), ('bt', 0.085), ('schraudolph', 0.079), ('kim', 0.078), ('ernel', 0.074), ('eigenvectors', 0.068), ('passes', 0.064), ('decay', 0.063), ('adaptation', 0.062), ('vt', 0.061), ('motion', 0.061), ('jt', 0.059), ('mnist', 0.059), ('variants', 0.054), ('conventional', 0.053), ('diag', 0.051), ('denoising', 0.05), ('hebbian', 0.048), ('darken', 0.048), ('pcc', 0.048), ('tune', 0.046), ('reciprocal', 0.046), ('wt', 0.046), ('pass', 0.045), ('gains', 0.045), ('kpca', 0.043), ('annealing', 0.043), ('update', 0.042), ('scalar', 0.04), ('nicol', 0.04), ('denoised', 0.038), ('ediag', 0.038), ('mkm', 0.038), ('pc', 0.038), ('images', 0.037), ('moody', 0.036), ('image', 0.036), ('principal', 0.035), ('clustering', 0.035), ('rbf', 0.034), ('face', 0.034), ('gradient', 0.033), ('stochastic', 0.033), ('spectral', 0.033), ('vishwanathan', 0.033), ('matrix', 0.033), ('alex', 0.033), ('tuning', 0.032), ('ht', 0.031), ('xt', 0.029), ('lecun', 0.029), ('karhunen', 0.029), ('rr', 0.029), ('hc', 0.029), ('rkhs', 0.029), ('schedule', 0.029), ('counteract', 0.029), ('dat', 0.029), ('multipatch', 0.029), ('expensive', 0.028), ('row', 0.028), ('differential', 0.028), ('bernhard', 0.027), ('tuned', 0.027), ('eigenvector', 0.026), ('reconstructed', 0.026), ('centered', 0.025), ('centering', 0.025), ('iterative', 0.024), ('snr', 0.024), ('nicta', 0.024), ('variant', 0.024), ('calculate', 0.023), ('pixel', 0.023), ('mk', 0.023), ('rows', 0.023), ('sch', 0.022), ('capture', 0.022), ('australia', 0.022), ('runtime', 0.022), ('juha', 0.022), ('individually', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
2 0.076166466 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron
3 0.065239355 90 jmlr-2007-Value Regularization and Fenchel Duality
Author: Ryan M. Rifkin, Ross A. Lippert
Abstract: Regularization is an approach to function learning that balances fit and smoothness. In practice, we search for a function f with a finite representation f = ∑i ci φi (·). In most treatments, the ci are the primary objects of study. We consider value regularization, constructing optimization problems in which the predicted values at the training points are the primary variables, and therefore the central objects of study. Although this is a simple change, it has profound consequences. From convex conjugacy and the theory of Fenchel duality, we derive separate optimality conditions for the regularization and loss portions of the learning problem; this technique yields clean and short derivations of standard algorithms. This framework is ideally suited to studying many other phenomena at the intersection of learning theory and optimization. We obtain a value-based variant of the representer theorem, which underscores the transductive nature of regularization in reproducing kernel Hilbert spaces. We unify and extend previous results on learning kernel functions, with very simple proofs. We analyze the use of unregularized bias terms in optimization problems, and low-rank approximations to kernel matrices, obtaining new results in these areas. In summary, the combination of value regularization and Fenchel duality are valuable tools for studying the optimization problems in machine learning. Keywords: kernel machines, duality, optimization, convex analysis, kernel learning
4 0.043629114 28 jmlr-2007-Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data
Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh
Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling
5 0.040897682 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
6 0.038779344 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
7 0.038088847 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
8 0.035613563 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models
9 0.035186902 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
10 0.03514789 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.034074076 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
12 0.034063835 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
13 0.032802071 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
14 0.032696996 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
15 0.032580525 4 jmlr-2007-A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning (Special Topic on Model Selection)
16 0.03250799 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
17 0.032117121 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
18 0.030923985 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
19 0.030891001 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems
20 0.030490004 45 jmlr-2007-Learnability of Gaussians with Flexible Variances
topicId topicWeight
[(0, 0.164), (1, 0.026), (2, 0.052), (3, 0.065), (4, -0.158), (5, -0.03), (6, -0.137), (7, -0.044), (8, -0.11), (9, 0.061), (10, 0.038), (11, 0.018), (12, 0.018), (13, 0.036), (14, -0.054), (15, 0.055), (16, -0.025), (17, -0.074), (18, -0.149), (19, 0.095), (20, -0.203), (21, -0.032), (22, 0.08), (23, 0.017), (24, 0.039), (25, -0.159), (26, -0.104), (27, 0.034), (28, 0.055), (29, -0.07), (30, -0.22), (31, -0.077), (32, -0.03), (33, 0.177), (34, 0.115), (35, 0.084), (36, 0.022), (37, -0.242), (38, -0.249), (39, -0.173), (40, -0.014), (41, 0.065), (42, -0.129), (43, -0.134), (44, 0.102), (45, -0.1), (46, 0.122), (47, 0.161), (48, -0.174), (49, -0.142)]
simIndex simValue paperId paperTitle
same-paper 1 0.9567275 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
Author: Charles Sutton, Andrew McCallum, Khashayar Rohanimanesh
Abstract: In sequence modeling, we often wish to represent complex interaction between labels, such as when performing multiple, cascaded labeling tasks on the same sequence, or when long-range dependencies exist. We present dynamic conditional random fields (DCRFs), a generalization of linear-chain conditional random fields (CRFs) in which each time slice contains a set of state variables and edges—a distributed state representation as in dynamic Bayesian networks (DBNs)—and parameters are tied across slices. Since exact inference can be intractable in such models, we perform approximate inference using several schedules for belief propagation, including tree-based reparameterization (TRP). On a natural-language chunking task, we show that a DCRF performs better than a series of linear-chain CRFs, achieving comparable performance using only half the training data. In addition to maximum conditional likelihood, we present two alternative approaches for training DCRFs: marginal likelihood training, for when we are primarily interested in predicting only a subset of the variables, and cascaded training, for when we have a distinct data set for each state variable, as in transfer learning. We evaluate marginal training and cascaded training on both synthetic data and real-world text data, finding that marginal training can improve accuracy when uncertainty exists over the latent variables, and that for transfer learning, a DCRF trained in a cascaded fashion performs better than a linear-chain CRF that predicts the final task directly. Keywords: conditional random fields, graphical models, sequence labeling
3 0.29749718 78 jmlr-2007-Statistical Consistency of Kernel Canonical Correlation Analysis
Author: Kenji Fukumizu, Francis R. Bach, Arthur Gretton
Abstract: While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as n−1/3 , where n is the number of data. Keywords: canonical correlation analysis, kernel, consistency, regularization, Hilbert space
4 0.28549445 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm
Author: Roni Khardon, Gabriel Wachman
Abstract: A large number of variants of the Perceptron algorithm have been proposed and partially evaluated in recent work. One type of algorithm aims for noise tolerance by replacing the last hypothesis of the perceptron with another hypothesis or a vote among hypotheses. Another type simply adds a margin term to the perceptron in order to increase robustness and accuracy, as done in support vector machines. A third type borrows further from support vector machines and constrains the update function of the perceptron in ways that mimic soft-margin techniques. The performance of these algorithms, and the potential for combining different techniques, has not been studied in depth. This paper provides such an experimental study and reveals some interesting facts about the algorithms. In particular the perceptron with margin is an effective method for tolerating noise and stabilizing the algorithm. This is surprising since the margin in itself is not designed or used for noise tolerance, and there are no known guarantees for such performance. In most cases, similar performance is obtained by the voted-perceptron which has the advantage that it does not require parameter selection. Techniques using soft margin ideas are run-time intensive and do not give additional performance benefits. The results also highlight the difficulty with automatic parameter selection which is required with some of these variants. Keywords: perceptron algorithm, on-line learning, noise tolerance, kernel methods
5 0.28526586 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss
Author: Ofer Dekel, Philip M. Long, Yoram Singer
Abstract: We study the problem of learning multiple tasks in parallel within the online learning framework. On each online round, the algorithm receives an instance for each of the parallel tasks and responds by predicting the label of each instance. We consider the case where the predictions made on each round all contribute toward a common goal. The relationship between the various tasks is defined by a global loss function, which evaluates the overall quality of the multiple predictions made on each round. Specifically, each individual prediction is associated with its own loss value, and then these multiple loss values are combined into a single number using the global loss function. We focus on the case where the global loss function belongs to the family of absolute norms, and present several online learning algorithms for the induced problem. We prove worst-case relative loss bounds for all of our algorithms, and demonstrate the effectiveness of our approach on a largescale multiclass-multilabel text categorization problem. Keywords: online learning, multitask learning, multiclass multilabel classiifcation, perceptron
6 0.27460697 90 jmlr-2007-Value Regularization and Fenchel Duality
7 0.22633657 50 jmlr-2007-Local Discriminant Wavelet Packet Coordinates for Face Recognition
8 0.18019573 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"
9 0.17691171 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features
10 0.15911376 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
11 0.15374072 35 jmlr-2007-General Polynomial Time Decomposition Algorithms (Special Topic on the Conference on Learning Theory 2005)
12 0.1497196 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression
13 0.14839742 63 jmlr-2007-On the Representer Theorem and Equivalent Degrees of Freedom of SVR
14 0.13843748 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
15 0.13560396 18 jmlr-2007-Characterizing the Function Space for Bayesian Kernel Models
16 0.13442728 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
18 0.13426955 80 jmlr-2007-Synergistic Face Detection and Pose Estimation with Energy-Based Models
19 0.13360189 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
20 0.12966879 8 jmlr-2007-A Unified Continuous Optimization Framework for Center-Based Clustering Methods
topicId topicWeight
[(4, 0.041), (5, 0.362), (8, 0.032), (10, 0.033), (12, 0.03), (15, 0.017), (22, 0.02), (28, 0.044), (40, 0.053), (45, 0.014), (48, 0.048), (60, 0.032), (80, 0.012), (85, 0.079), (98, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.71836042 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis
Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan
Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning
2 0.38238859 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming
3 0.38110974 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification
Author: Jia Li, Surajit Ray, Bruce G. Lindsay
Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/∼jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density
4 0.37529325 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation
Author: Guy Lebanon, Yi Mao, Joshua Dillon
Abstract: The popular bag of words assumption represents a document as a histogram of word occurrences. While computationally efficient, such a representation is unable to maintain any sequential information. We present an effective sequential document representation that goes beyond the bag of words representation and its n-gram extensions. This representation uses local smoothing to embed documents as smooth curves in the multinomial simplex thereby preserving valuable sequential information. In contrast to bag of words or n-grams, the new representation is able to robustly capture medium and long range sequential trends in the document. We discuss the representation and its geometric properties and demonstrate its applicability for various text processing tasks. Keywords: text processing, local smoothing
Author: Onur C. Hamsici, Aleix M. Martinez
Abstract: Many feature representations, as in genomics, describe directional data where all feature vectors share a common norm. In other cases, as in computer vision, a norm or variance normalization step, where all feature vectors are normalized to a common length, is generally used. These representations and pre-processing step map the original data from R p to the surface of a hypersphere S p−1 . Such representations should then be modeled using spherical distributions. However, the difficulty associated with such spherical representations has prompted researchers to model their spherical data using Gaussian distributions instead—as if the data were represented in R p rather than S p−1 . This opens the question to whether the classification results calculated with the Gaussian approximation are the same as those obtained when using the original spherical distributions. In this paper, we show that in some particular cases (which we named spherical-homoscedastic) the answer to this question is positive. In the more general case however, the answer is negative. For this reason, we further investigate the additional error added by the Gaussian modeling. We conclude that the more the data deviates from spherical-homoscedastic, the less advisable it is to employ the Gaussian approximation. We then show how our derivations can be used to define optimal classifiers for spherical-homoscedastic distributions. By using a kernel which maps the original space into one where the data adapts to the spherical-homoscedastic model, we can derive non-linear classifiers with potential applications in a large number of problems. We conclude this paper by demonstrating the uses of spherical-homoscedasticity in the classification of images of objects, gene expression sequences, and text data. Keywords: directional data, spherical distributions, normal distributions, norm normalization, linear and non-linear classifiers, computer vision
6 0.37369078 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning
7 0.37217903 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models
8 0.36928624 46 jmlr-2007-Learning Equivariant Functions with Matrix Valued Kernels
9 0.3666752 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition
11 0.36413127 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection
13 0.36346817 69 jmlr-2007-Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes
14 0.36330879 26 jmlr-2007-Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis
15 0.36244479 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method
16 0.3613143 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression
17 0.36082974 77 jmlr-2007-Stagewise Lasso
18 0.36082232 3 jmlr-2007-A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
20 0.35832217 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study