nips nips2006 nips2006-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
Reference: text
sentIndex sentText sentNum sentScore
1 KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. [sent-11, score-0.292]
2 Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. [sent-12, score-0.308]
3 We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. [sent-13, score-0.236]
4 Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. [sent-14, score-0.424]
5 Given a matrix X ∈ Rn×l of l centered, n-dimensional observations, PCA performs an eigendecomposition of the covariance matrix Q := XX . [sent-16, score-0.078]
6 The r × n matrix W whose rows are the eigenvectors of Q associated with the r ≤ n largest eigenvalues minimizes the least-squares reconstruction error ||X − W W X||F , (1) where || · ||F is the Frobenius norm. [sent-17, score-0.315]
7 One such method is Sanger’s [1] Generalized Hebbian Algorithm (GHA), which updates W as Wt+1 = Wt + ηt [yt xt − lt(yt yt )Wt ]. [sent-20, score-0.424]
8 (2) n Here xt ∈ R is the observation at time t, yt := Wt xt , and lt(·) makes its argument lower triangular by zeroing all elements above the diagonal. [sent-21, score-0.422]
9 For an appropriate scalar gain ηt , Wt will generally tend to converge to the principal component solution as t → ∞; though its global convergence is not proven [2]. [sent-22, score-0.333]
10 One can do better than PCA in minimizing the reconstruction error (1) by allowing nonlinear projections of the data into r dimensions. [sent-23, score-0.119]
11 Kernel PCA [4] performs an eigendecomposition on the kernel expansion of the data, an l × l matrix. [sent-26, score-0.181]
12 To reduce the attendant O(l2 ) space and O(l3 ) time complexity, Kim et al. [sent-27, score-0.038]
13 Both GHA and KHA are examples of stochastic approximation algorithms, whose iterative updates employ individual observations in place of — but, in the limit, approximating — statistical properties of the entire data. [sent-29, score-0.119]
14 By interleaving their updates with the passage through the data, stochastic approximation algorithms can greatly outperform conventional methods on large, redundant data sets, even though their convergence is comparatively slow. [sent-30, score-0.186]
15 Both the GHA and KHA updates incorporate a scalar gain parameter ηt , which is either held fixed or annealed according to some predefined schedule. [sent-31, score-0.314]
16 Robbins and Monro [5] established conditions on the sequence of ηt that guarantee the convergence of many stochastic approximation algorithms; a widely used annealing schedule that obeys these conditions is ηt ∝ τ /(t + τ ), for any τ > 0. [sent-32, score-0.105]
17 Here we propose the inclusion of a gain vector in the KHA, which provides each estimated eigenvector with its individual gain parameter. [sent-33, score-0.382]
18 We present two methods for setting these gains: In the KHA/et algorithm, the gain of an eigenvector is reciprocal to its estimated eigenvalue as well as the iteration number t [6]. [sent-34, score-0.272]
19 Our second method, KHA-SMD, additionally employs Schraudolph’s [7] Stochastic Meta-Descent (SMD) technique for adaptively controlling a gain vector for stochastic gradient descent, derived and applied here in Reproducing Kernel Hilbert Space (RKHS), cf. [sent-35, score-0.271]
20 2 Kernel Hebbian Algorithm (KHA and KHA/t) Kim et al. [sent-41, score-0.038]
21 [2] apply Sanger’s [1] GHA to data mapped into a reproducing kernel Hilbert space (RKHS) H via the function Φ : Rn → H. [sent-42, score-0.184]
22 H and Φ are implicitly defined via the kernel k : Rn × Rn → H with the property ∀x, x ∈ Rn : k(x, x ) = Φ(x), Φ(x ) H , where ·, · H denotes the inner product in H. [sent-43, score-0.126]
23 Since the mapping into feature space performed by kernel methods does not necessarily preserve such centering, we must re-center the mapped data: Φ := Φ − M Φ, (4) where M denotes the l × l matrix with entries all equal to 1/l. [sent-52, score-0.183]
24 This is achieved by replacing the kernel matrix K := ΦΦ (i. [sent-53, score-0.152]
25 , [K]ij := k(xi , xj )) by its centered version K := Φ Φ = (Φ − M Φ)(Φ − M Φ) = K − MK − (MK) + MKM . [sent-55, score-0.045]
26 (5) Since all rows of MK are identical (as are all elements of MKM ) we can precalculate that row in O(l2 ) time and store it in O(l) space to efficiently implement operations with the centered kernel. [sent-56, score-0.07]
27 The kernel centered on the training data is also used when testing the trained system on new data. [sent-57, score-0.171]
28 From Kernel PCA [4] it is known that the principal components must lie in the span of the centered mapped data; we can therefore express the GHA weight matrix as Wt = At Φ , where A is an r × l matrix of expansion coefficients, and r the number of principal components. [sent-58, score-0.269]
29 The GHA weight update (2) thus becomes At+1 Φ = At Φ + ηt [yt Φ (xp(t) ) − lt(yt yt )At Φ ], (6) yt := Wt Φ (xp(t) ) = At Φ Φ (xp(t) ) = At kp(t) , (7) where using ki to denote the ith column of the centered kernel matrix K . [sent-59, score-0.935]
30 Since we have Φ (xi ) = ei Φ , where ei is the unit vector in direction i, (6) can be rewritten solely in terms of expansion coefficients as At+1 = At + ηt [yt ep(t) − lt(yt yt )At ]. [sent-60, score-0.369]
31 (8) Introducing the update coefficient matrix Γt := yt ep(t) − lt(yt yt )At (9) At+1 = At + ηt Γt . [sent-61, score-0.764]
32 (10) we obtain the compact update rule In their experiments, Kim et al. [sent-62, score-0.096]
33 [2] employed the KHA update (8) with a constant scalar gain, ηt = const. [sent-63, score-0.135]
34 They also proposed letting the gain decay as ηt = 1/t for stationary data. [sent-64, score-0.236]
35 3 Gain Decay with Reciprocal Eigenvalues (KHA/et) Consider the term yt xt = Wt xt xt appearing on the right-hand side of the GHA update rule (2). [sent-65, score-0.521]
36 At the desired solution, the rows of Wt contain the principal components, i. [sent-66, score-0.081]
37 The elements of yt thus scale with the associated eigenvalues of Q. [sent-69, score-0.416]
38 Wide spreads of eigenvalues can therefore lead to ill-conditioning, hence slow convergence, of the GHA; the same holds for the KHA. [sent-70, score-0.076]
39 In our KHA/et algorithm, we counteract this problem by furnishing KHA with a gain vector ηt that provides each eigenvector estimate with its individual gain parameter. [sent-71, score-0.418]
40 The update rule (10) thus becomes At+1 = At + diag(ηt ) Γt , (11) where diag(·) turns a vector into a diagonal matrix. [sent-72, score-0.058]
41 To condition KHA, we set the gain parameters proportional to the reciprocal of both the iteration number t and the current estimated eigenvalue; a similar apporach was used by Chen and Chang [6] for neural network feature selection. [sent-73, score-0.232]
42 Let λt be the vector of eigenvalues associated with the current estimate (as stored in At ) of the first r eigenvectors. [sent-74, score-0.076]
43 KHA/et sets the ith element of ηt to [ηt ]i = ||λt || l η0 , [λt ]i t + l (12) where η0 is a free scalar parameter, and l the size of the data set. [sent-75, score-0.077]
44 This conditions the KHA update (8) by proportionately decreasing (increasing) the gain for rows of At associated with large (small) eigenvalues. [sent-76, score-0.254]
45 The norm ||λt || in the numerator of (12) is maximized by the principal components; its growth serves to counteract the l/(t + l) gain decay while the leading eigenspace is idientified. [sent-77, score-0.349]
46 This achieves an effect comparable to an adaptive “search then converge” gain schedule [9] without introducing any tuning parameters. [sent-78, score-0.23]
47 As the goal of KHA is to find the eigenvectors in the first place, we don’t know the true eigenvalues while running the algorithm. [sent-79, score-0.145]
48 Instead we use the eigenvalues associated with KHA’s current eigenvector estimate, computed as [λt ]i = ||[At ]i∗ K || ||[At ]i∗ || (13) where [At ]i∗ denotes the i-th row of At . [sent-80, score-0.116]
49 Since the eigenvalues evolve gradually, it suffices to re-estimate them only occasionally; we determine λt and ηt once for each pass through the training data set, i. [sent-83, score-0.076]
50 Below we derive a way to maintain AK incrementally in an affordable O(rl) via Equations (17) and (18). [sent-86, score-0.064]
51 4 KHA with Stochastic Meta-Descent (KHA-SMD) While KHA/et makes reasonable assumptions about how the gains of a KHA update should be scaled, it is by no means clear how close the resulting gains are to being optimal. [sent-87, score-0.154]
52 SMD controls gains adaptively in response to the observed history of parameter updates so as to optimize convergence. [sent-89, score-0.118]
53 Using the KHA/et gains as a starting point, the KHA-SMD update is At+1 = At + ediag(ρt ) diag(ηt ) Γt , (15) where the log-gain vector ρt is adjusted by SMD. [sent-91, score-0.106]
54 ) In an RKHS, SMD adapts a scalar log-gain whose update is driven by the inner product between the gradient and a differential of the system parameters, all in the RKHS [8]. [sent-93, score-0.191]
55 Note that Γt Φ can be interpreted as the gradient in the RKHS of the (unknown) merit function maximized by KHA, and that (15) can be viewed as r coupled updates in RKHS, one for each row of At , each associated with a scalar gain. [sent-94, score-0.142]
56 We can, however, reduce this cost to O(rl) by noting that (9) implies Γt K = yt ep(t) K − lt(yt yt )At K = yt kp(t) − lt(yt yt )At K , (17) where the r × l matrix At K can be stored and updated incrementally via (15): At+1 K = At K + ediag(ρt ) diag(ηt ) Γt K . [sent-97, score-1.409]
57 (18) The initial computation of A1 K still costs O(rl2 ) in general but is affordable as it is performed only once. [sent-98, score-0.041]
58 Finally, we apply SMD’s standard update of the differential parameters: Bt+1 = ξBt + ediag(ρt ) diag(ηt ) (Γt + ξdΓt ), (19) where the decay factor 0 ≤ ξ ≤ 1 is another scalar tuning parameter. [sent-100, score-0.268]
59 The differential dΓt of the gradient is easily computed by routine application of the rules of calculus: dΓt = d[yt ep(t) − lt(yt yt )At ] = (dAt )kp(t) ep(t) − lt(yt yt )(dAt ) − [d lt(yt yt )]At (20) = Bt kp(t) ep(t) − lt(yt yt )Bt − lt(Bt kp(t) yt + yt kp(t) Bt )At . [sent-101, score-2.096]
60 Inserting (9) and (20) into (19) yields the update rule Bt+1 = ξBt + ediag(ρt ) diag(ηt )[(At + ξBt ) kp(t) ep(t) (21) − lt(yt yt )(At + ξBt ) − ξ lt(Bt kp(t) yt + yt kp(t) Bt )At ]. [sent-102, score-1.078]
61 Initialize: (a) calculate MK, MKM — O(l2 ) (b) ρ0 := [1 . [sent-108, score-0.044]
62 1] (c) B1 := 0 (d) A1 ∼ N (0, (rl)−1 I) (e) calculate A1 K — O(rl2 ) 2. [sent-111, score-0.044]
63 In all experiments the Kernel Hebian Algorithm (KHA) and our enhanced variants are used to find the first r eigenvectors of the centered Kernel matrix K . [sent-117, score-0.14]
64 To assess the quality of the result, we reconstruct the Kernel matrix from the found eigenvectors and measure the reconstruction error E(A) := ||K − (AK ) AK ||F , (22) where || · ||F is the Frobenius norm. [sent-118, score-0.237]
65 The minimal reconstruction error from r eigenvectors, E min := minA E(A), can be calculated by an eigendecomposition. [sent-119, score-0.119]
66 This allows us to report reconstruction errors as excess errors relative to the optimal reconstruction, i. [sent-120, score-0.233]
67 To compare algorithms we plot the excess reconstruction error on a logarithmic scale after each pass through the entire data set. [sent-123, score-0.233]
68 Thus a comparable amount of tuning effort went into each algorithm. [sent-128, score-0.034]
69 KHA with both a dot-product kernel and a Gaussian kernel with σ = 8 1 was used to extract the first 16 eigenvectors. [sent-132, score-0.252]
70 Figure 1: Excess relative reconstruction error for kernel PCA (16 eigenvectors) on USPS data, using a dot-product (left) vs. [sent-137, score-0.245]
71 2 Multipatch Image PCA For our second set of experiments we replicated the image de-noising problem used by Kim et al. [sent-140, score-0.071]
72 [2], the idea being that reconstructing image patches from their r leading eigenvectors will eliminate most of the noise. [sent-141, score-0.123]
73 The KHA with Gaussian kernel (σ = 1) was used to find the 20 best eigenvectors for each sub-image. [sent-144, score-0.195]
74 Results averaged over all four sub-images are shown in Figure 2 (left), including KHA with the constant gain of η0 = 0. [sent-145, score-0.171]
75 After 50 passes through the training data, KHA/et achieves an excess reconstruction error two orders of magnitude better than conventional KHA; KHA-SMD yields an additional order of magnitude improvement. [sent-148, score-0.313]
76 Replicating this approach we obtain a reconstruction error of 5. [sent-152, score-0.119]
77 The signal-to-noise ratio (SNR) of the reconstruction after 800 passes with constant gain is 13. [sent-154, score-0.334]
78 3 Spectral Clustering Spectral Clustering [13] is a clustering method which includes the extraction of the first kernel PCs. [sent-158, score-0.202]
79 In this section we present results of the spectral clustering of all 7291 patterns of the USPS data [10] where 10 kernel PCs were obtained by KHA. [sent-159, score-0.262]
80 We used the spectral clustering method presented in 2 Kim et al. [sent-160, score-0.174]
81 Figure 2: Excess relative reconstruction error (left) for multipatch image PCA on a noisy Lena image (center), using a Gaussian kernel with σ = 1; denoised image obtained by KHA-SMD (right). [sent-163, score-0.456]
82 Figure 3: Excess relative reconstruction error (left) and quality of clustering as measured by variation of information (right) for spectral clustering of the USPS data with a Gaussian kernel (σ = 8). [sent-164, score-0.48]
83 [13], and evaluate our results via the Variation of Information (VI) metric [14], which compares the clustering obtained by spectral clustering to that induced by the class labels. [sent-165, score-0.212]
84 54 corresponds to random performance, while clustering in perfect accordance with the class labels would give a VI of zero. [sent-167, score-0.076]
85 Again KHA-SMD dominates KHA/et in both convergence speed and quality of reconstruction (left); KHA/et in turn outperforms KHA/t. [sent-169, score-0.171]
86 The quality of the resulting clustering (right) reflects the quality of reconstruction. [sent-170, score-0.122]
87 KHA/et and KHA-SMD produce a clustering as good as that obtained from a (computationally expensive) full kernel PCA within 10 passes through the data; KHA/t after more than 30 passes. [sent-171, score-0.246]
88 4 Human motion denoising In our final set of experiments we employed KHA to denoise a human walking motion trajectory from the CMU motion capture database (http://mocap. [sent-173, score-0.331]
89 The experimental setup was similar to that of Tangkuampien and Suter [15]: Gaussian noise was added to the frames of the original motion, then KHA with 25 PCs was used to denoise them. [sent-181, score-0.041]
90 5%; it is hard to visually Figure 4: From left to right: Excess relative reconstruction error on human motion capture data with √ Gaussian kernel (σ = 1. [sent-185, score-0.361]
91 5), one frame of the original data, a superposition of this original and the noisy data, and a superposition of the original and reconstructed (denoised) data. [sent-186, score-0.06]
92 detect a difference between the denoised frames and the original ones — see Figure 4 (right) for an example. [sent-187, score-0.071]
93 We include movies of the original, noisy, and denoised walk in the supporting material. [sent-188, score-0.071]
94 ’s [2] Kernel Hebbian Algorithm (KHA) by providing a separate gain for each eigenvector estimate. [sent-190, score-0.211]
95 KHA/et sets them inversely proportional to the estimated eigenvalues and iteration number; KHA-SMD enhances that further by applying Stochastic Meta-Descent (SMD [7]) to perform gain adaptation in RKHS [8]. [sent-192, score-0.283]
96 In four different experimental settings both methods were compared to a conventional gain decay schedule. [sent-193, score-0.272]
97 As measured by relative reconstruction error, KHA-SMD clearly outperformed KHA/et, which in turn outperformed the scheduled decay, in all our experiments. [sent-194, score-0.223]
98 Iterative kernel principal component analysis for o image modeling. [sent-208, score-0.215]
99 Nonlinear component analysis as a kernel eigeno u value problem. [sent-221, score-0.126]
100 Human motion de-noising via greedy kernel principal component analysis filtering. [sent-316, score-0.252]
wordName wordTfidf (topN-words)
[('kha', 0.595), ('yt', 0.34), ('bt', 0.247), ('gha', 0.205), ('smd', 0.205), ('gain', 0.171), ('lt', 0.153), ('kim', 0.139), ('kp', 0.136), ('kernel', 0.126), ('reconstruction', 0.119), ('excess', 0.114), ('usps', 0.112), ('pca', 0.103), ('diag', 0.092), ('hebbian', 0.089), ('ediag', 0.082), ('scalar', 0.077), ('eigenvalues', 0.076), ('clustering', 0.076), ('denoised', 0.071), ('motion', 0.07), ('wt', 0.069), ('eigenvectors', 0.069), ('rkhs', 0.067), ('ep', 0.065), ('decay', 0.065), ('dat', 0.062), ('mkm', 0.062), ('reciprocal', 0.061), ('spectral', 0.06), ('update', 0.058), ('principal', 0.056), ('australia', 0.055), ('xp', 0.054), ('rl', 0.053), ('stochastic', 0.051), ('gains', 0.048), ('centered', 0.045), ('calculate', 0.044), ('ak', 0.044), ('passes', 0.044), ('mk', 0.044), ('schraudolph', 0.043), ('updates', 0.043), ('affordable', 0.041), ('denoise', 0.041), ('multipatch', 0.041), ('sanger', 0.041), ('tangkuampien', 0.041), ('xt', 0.041), ('eigenvector', 0.04), ('australian', 0.039), ('ict', 0.039), ('et', 0.038), ('adaptation', 0.036), ('conventional', 0.036), ('replicating', 0.036), ('counteract', 0.036), ('lena', 0.036), ('pcs', 0.036), ('robbins', 0.036), ('scheduled', 0.036), ('differential', 0.034), ('tuning', 0.034), ('rn', 0.034), ('outperformed', 0.034), ('denoising', 0.034), ('image', 0.033), ('snr', 0.033), ('zl', 0.033), ('mapped', 0.031), ('superposition', 0.03), ('convergence', 0.029), ('expansion', 0.029), ('expensive', 0.029), ('suitably', 0.029), ('comparatively', 0.027), ('adaptively', 0.027), ('hilbert', 0.027), ('reproducing', 0.027), ('canberra', 0.026), ('vishwanathan', 0.026), ('eigendecomposition', 0.026), ('matrix', 0.026), ('rows', 0.025), ('iterative', 0.025), ('xx', 0.025), ('schedule', 0.025), ('vi', 0.024), ('mika', 0.024), ('capture', 0.023), ('quality', 0.023), ('human', 0.023), ('held', 0.023), ('incrementally', 0.023), ('frobenius', 0.023), ('sch', 0.022), ('gradient', 0.022), ('leading', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
2 0.23474219 203 nips-2006-implicit Online Learning with Kernels
Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan
Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1
3 0.16366933 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
4 0.13114679 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
Author: Gloria Haro, Gregory Randall, Guillermo Sapiro
Abstract: The study of point cloud data sampled from a stratification, a collection of manifolds with possible different dimensions, is pursued in this paper. We present a technique for simultaneously soft clustering and estimating the mixed dimensionality and density of such structures. The framework is based on a maximum likelihood estimation of a Poisson mixture model. The presentation of the approach is completed with artificial and real examples demonstrating the importance of extending manifold learning to stratification learning. 1
5 0.12948419 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1
6 0.11409794 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
7 0.11274765 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
8 0.10479739 60 nips-2006-Convergence of Laplacian Eigenmaps
9 0.097923525 61 nips-2006-Convex Repeated Games and Fenchel Duality
10 0.085986942 7 nips-2006-A Local Learning Approach for Clustering
11 0.084433623 80 nips-2006-Fundamental Limitations of Spectral Clustering
12 0.083319336 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
13 0.082095377 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
14 0.081076764 146 nips-2006-No-regret Algorithms for Online Convex Programs
15 0.080682687 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
16 0.079667851 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
17 0.071519643 77 nips-2006-Fast Computation of Graph Kernels
18 0.069343038 149 nips-2006-Nonnegative Sparse PCA
19 0.06933362 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
20 0.06763909 141 nips-2006-Multiple timescales and uncertainty in motor adaptation
topicId topicWeight
[(0, -0.201), (1, 0.065), (2, -0.112), (3, 0.215), (4, -0.106), (5, -0.122), (6, 0.005), (7, -0.167), (8, -0.085), (9, 0.01), (10, -0.052), (11, 0.05), (12, -0.04), (13, -0.141), (14, 0.028), (15, -0.016), (16, -0.132), (17, 0.06), (18, -0.004), (19, 0.101), (20, -0.073), (21, -0.006), (22, -0.039), (23, 0.025), (24, 0.014), (25, -0.019), (26, -0.013), (27, -0.008), (28, -0.064), (29, 0.004), (30, -0.019), (31, -0.033), (32, -0.006), (33, 0.14), (34, -0.09), (35, 0.011), (36, -0.045), (37, -0.124), (38, 0.159), (39, -0.076), (40, -0.057), (41, 0.064), (42, -0.081), (43, 0.003), (44, -0.062), (45, -0.037), (46, -0.05), (47, 0.05), (48, -0.135), (49, 0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.94330686 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
2 0.63647419 203 nips-2006-implicit Online Learning with Kernels
Author: Li Cheng, Dale Schuurmans, Shaojun Wang, Terry Caelli, S.v.n. Vishwanathan
Abstract: We present two new algorithms for online learning in reproducing kernel Hilbert spaces. Our first algorithm, ILK (implicit online learning with kernels), employs a new, implicit update technique that can be applied to a wide variety of convex loss functions. We then introduce a bounded memory version, SILK (sparse ILK), that maintains a compact representation of the predictor without compromising solution quality, even in non-stationary environments. We prove loss bounds and analyze the convergence rate of both. Experimental evidence shows that our proposed algorithms outperform current methods on synthetic and real data. 1
3 0.59555221 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
Author: Yoshinobu Kawahara, Takehisa Yairi, Kazuo Machida
Abstract: In this paper, we present a subspace method for learning nonlinear dynamical systems based on stochastic realization, in which state vectors are chosen using kernel canonical correlation analysis, and then state-space systems are identified through regression with the state vectors. We construct the theoretical underpinning and derive a concrete algorithm for nonlinear identification. The obtained algorithm needs no iterative optimization procedure and can be implemented on the basis of fast and reliable numerical schemes. The simulation result shows that our algorithm can express dynamics with a high degree of accuracy. 1
4 0.5311777 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
5 0.52183795 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.50229168 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
7 0.47440153 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
8 0.45996997 149 nips-2006-Nonnegative Sparse PCA
9 0.45309541 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
10 0.39688733 60 nips-2006-Convergence of Laplacian Eigenmaps
11 0.38912663 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
12 0.38889584 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
13 0.37330034 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
14 0.35791069 82 nips-2006-Gaussian and Wishart Hyperkernels
15 0.33296683 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
16 0.33128476 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
17 0.33021659 7 nips-2006-A Local Learning Approach for Clustering
18 0.3261795 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
19 0.32570499 61 nips-2006-Convex Repeated Games and Fenchel Duality
20 0.32172573 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments
topicId topicWeight
[(1, 0.091), (3, 0.025), (7, 0.058), (9, 0.051), (13, 0.229), (20, 0.029), (22, 0.079), (44, 0.053), (52, 0.01), (57, 0.079), (65, 0.106), (69, 0.027), (71, 0.014), (83, 0.035), (90, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.82564324 79 nips-2006-Fast Iterative Kernel PCA
Author: Nicol N. Schraudolph, Simon Günter, S.v.n. Vishwanathan
Abstract: We introduce two methods to improve convergence of the Kernel Hebbian Algorithm (KHA) for iterative kernel PCA. KHA has a scalar gain parameter which is either held constant or decreased as 1/t, leading to slow convergence. Our KHA/et algorithm accelerates KHA by incorporating the reciprocal of the current estimated eigenvalues as a gain vector. We then derive and apply Stochastic MetaDescent (SMD) to KHA/et; this further speeds convergence by performing gain adaptation in RKHS. Experimental results for kernel PCA and spectral clustering of USPS digits as well as motion capture and image de-noising problems confirm that our methods converge substantially faster than conventional KHA. 1
2 0.81143099 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
Author: Lorenzo Torresani, Peggy Hackney, Christoph Bregler
Abstract: This paper presents an algorithm for synthesis of human motion in specified styles. We use a theory of movement observation (Laban Movement Analysis) to describe movement styles as points in a multi-dimensional perceptual space. We cast the task of learning to synthesize desired movement styles as a regression problem: sequences generated via space-time interpolation of motion capture data are used to learn a nonlinear mapping between animation parameters and movement styles in perceptual space. We demonstrate that the learned model can apply a variety of motion styles to pre-recorded motion sequences and it can extrapolate styles not originally included in the training data. 1
3 0.78724176 126 nips-2006-Logistic Regression for Single Trial EEG Classification
Author: Ryota Tomioka, Kazuyuki Aihara, Klaus-Robert Müller
Abstract: We propose a novel framework for the classification of single trial ElectroEncephaloGraphy (EEG), based on regularized logistic regression. Framed in this robust statistical framework no prior feature extraction or outlier removal is required. We present two variations of parameterizing the regression function: (a) with a full rank symmetric matrix coefficient and (b) as a difference of two rank=1 matrices. In the first case, the problem is convex and the logistic regression is optimal under a generative model. The latter case is shown to be related to the Common Spatial Pattern (CSP) algorithm, which is a popular technique in Brain Computer Interfacing. The regression coefficients can also be topographically mapped onto the scalp similarly to CSP projections, which allows neuro-physiological interpretation. Simulations on 162 BCI datasets demonstrate that classification accuracy and robustness compares favorably against conventional CSP based classifiers. 1
4 0.74123532 168 nips-2006-Reducing Calibration Time For Brain-Computer Interfaces: A Clustering Approach
Author: Matthias Krauledat, Michael Schröder, Benjamin Blankertz, Klaus-Robert Müller
Abstract: Up to now even subjects that are experts in the use of machine learning based BCI systems still have to undergo a calibration session of about 20-30 min. From this data their (movement) intentions are so far infered. We now propose a new paradigm that allows to completely omit such calibration and instead transfer knowledge from prior sessions. To achieve this goal we first define normalized CSP features and distances in-between. Second, we derive prototypical features across sessions: (a) by clustering or (b) by feature concatenation methods. Finally, we construct a classifier based on these individualized prototypes and show that, indeed, classifiers can be successfully transferred to a new session for a number of subjects.
5 0.62482399 61 nips-2006-Convex Repeated Games and Fenchel Duality
Author: Shai Shalev-shwartz, Yoram Singer
Abstract: We describe an algorithmic framework for an abstract game which we term a convex repeated game. We show that various online learning and boosting algorithms can be all derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several new interesting algorithms. Our algorithmic framework stems from a connection that we build between the notions of regret in game theory and weak duality in convex optimization. 1 Introduction and Problem Setting Several problems arising in machine learning can be modeled as a convex repeated game. Convex repeated games are closely related to online convex programming (see [19, 9] and the discussion in the last section). A convex repeated game is a two players game that is performed in a sequence of consecutive rounds. On round t of the repeated game, the first player chooses a vector wt from a convex set S. Next, the second player responds with a convex function gt : S → R. Finally, the first player suffers an instantaneous loss gt (wt ). We study the game from the viewpoint of the first player. The goal of the first player is to minimize its cumulative loss, t gt (wt ). To motivate this rather abstract setting let us first cast the more familiar setting of online learning as a convex repeated game. Online learning is performed in a sequence of consecutive rounds. On round t, the learner first receives a question, cast as a vector xt , and is required to provide an answer for this question. For example, xt can be an encoding of an email message and the question is whether the email is spam or not. The prediction of the learner is performed based on an hypothesis, ht : X → Y, where X is the set of questions and Y is the set of possible answers. In the aforementioned example, Y would be {+1, −1} where +1 stands for a spam email and −1 stands for a benign one. After predicting an answer, the learner receives the correct answer for the question, denoted yt , and suffers loss according to a loss function (ht , (xt , yt )). In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, H = {hw : w ∈ S}. For example, the set of linear classifiers, which is used for answering yes/no questions, is defined as H = {hw (x) = sign( w, x ) : w ∈ Rn }. Thus, rather than saying that on round t the learner chooses a hypothesis, we can say that the learner chooses a vector wt and its hypothesis is hwt . Next, we note that once the environment chooses a question-answer pair (xt , yt ), the loss function becomes a function over the hypotheses space or equivalently over the set of parameter vectors S. We can therefore redefine the online learning process as follows. On round t, the learner chooses a vector wt ∈ S, which defines a hypothesis hwt to be used for prediction. Then, the environment chooses a questionanswer pair (xt , yt ), which induces the following loss function over the set of parameter vectors, gt (w) = (hw , (xt , yt )). Finally, the learner suffers the loss gt (wt ) = (hwt , (xt , yt )). We have therefore described the process of online learning as a convex repeated game. In this paper we assess the performance of the first player using the notion of regret. Given a number of rounds T and a fixed vector u ∈ S, we define the regret of the first player as the excess loss for not consistently playing the vector u, 1 T T gt (wt ) − t=1 1 T T gt (u) . t=1 Our main result is an algorithmic framework for the first player which guarantees low regret with respect to any vector u ∈ S. Specifically, we derive regret bounds that take the following form ∀u ∈ S, 1 T T gt (wt ) − t=1 1 T T gt (u) ≤ t=1 f (u) + L √ , T (1) where f : S → R and L ∈ R+ . Informally, the function f measures the “complexity” of vectors in S and the scalar L is related to some generalized Lipschitz property of the functions g1 , . . . , gT . We defer the exact requirements we impose on f and L to later sections. Our algorithmic framework emerges from a representation of the regret bound given in Eq. (1) using an optimization problem. Specifically, we rewrite Eq. (1) as follows 1 T T gt (wt ) ≤ inf t=1 u∈S 1 T T gt (u) + t=1 f (u) + L √ . T (2) That is, the average loss of the first player should be bounded above by the minimum value of an optimization problem in which we jointly minimize the average loss of u and the “complexity” of u as measured by the function f . Note that the optimization problem on the right-hand side of Eq. (2) can only be solved in hindsight after observing the entire sequence of loss functions. Nevertheless, writing the regret bound as in Eq. (2) implies that the average loss of the first player forms a lower bound for a minimization problem. The notion of duality, commonly used in convex optimization theory, plays an important role in obtaining lower bounds for the minimal value of a minimization problem (see for example [14]). By generalizing the notion of Fenchel duality, we are able to derive a dual optimization problem, which can be optimized incrementally, as the game progresses. In order to derive explicit quantitative regret bounds we make an immediate use of the fact that dual objective lower bounds the primal objective. We therefore reduce the process of playing convex repeated games to the task of incrementally increasing the dual objective function. The amount by which the dual increases serves as a new and natural notion of progress. By doing so we are able to tie the primal objective value, the average loss of the first player, and the increase in the dual. The rest of this paper is organized as follows. In Sec. 2 we establish our notation and point to a few mathematical tools that we use throughout the paper. Our main tool for deriving algorithms for playing convex repeated games is a generalization of Fenchel duality, described in Sec. 3. Our algorithmic framework is given in Sec. 4 and analyzed in Sec. 5. The generality of our framework allows us to utilize it in different problems arising in machine learning. Specifically, in Sec. 6 we underscore the applicability of our framework for online learning and in Sec. 7 we outline and analyze boosting algorithms based on our framework. We conclude with a discussion and point to related work in Sec. 8. Due to the lack of space, some of the details are omitted from the paper and can be found in [16]. 2 Mathematical Background We denote scalars with lower case letters (e.g. x and w), and vectors with bold face letters (e.g. x and w). The inner product between vectors x and w is denoted by x, w . Sets are designated by upper case letters (e.g. S). The set of non-negative real numbers is denoted by R+ . For any k ≥ 1, the set of integers {1, . . . , k} is denoted by [k]. A norm of a vector x is denoted by x . The dual norm is defined as λ = sup{ x, λ : x ≤ 1}. For example, the Euclidean norm, x 2 = ( x, x )1/2 is dual to itself and the 1 norm, x 1 = i |xi |, is dual to the ∞ norm, x ∞ = maxi |xi |. We next recall a few definitions from convex analysis. The reader familiar with convex analysis may proceed to Lemma 1 while for a more thorough introduction see for example [1]. A set S is convex if for any two vectors w1 , w2 in S, all the line between w1 and w2 is also within S. That is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ S. A set S is open if every point in S has a neighborhood lying in S. A set S is closed if its complement is an open set. A function f : S → R is closed and convex if for any scalar α ∈ R, the level set {w : f (w) ≤ α} is closed and convex. The Fenchel conjugate of a function f : S → R is defined as f (θ) = supw∈S w, θ − f (w) . If f is closed and convex then the Fenchel conjugate of f is f itself. The Fenchel-Young inequality states that for any w and θ we have that f (w) + f (θ) ≥ w, θ . A vector λ is a sub-gradient of a function f at w if for all w ∈ S we have that f (w ) − f (w) ≥ w − w, λ . The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f at w and is denoted by f (w). Sub-gradients play an important role in the definition of Fenchel conjugate. In particular, the following lemma states that if λ ∈ ∂f (w) then Fenchel-Young inequality holds with equality. Lemma 1 Let f be a closed and convex function and let ∂f (w ) be its differential set at w . Then, for all λ ∈ ∂f (w ) we have, f (w ) + f (λ ) = λ , w . A continuous function f is σ-strongly convex over a convex set S with respect to a norm · if S is contained in the domain of f and for all v, u ∈ S and α ∈ [0, 1] we have 1 (3) f (α v + (1 − α) u) ≤ α f (v) + (1 − α) f (u) − σ α (1 − α) v − u 2 . 2 Strongly convex functions play an important role in our analysis primarily due to the following lemma. Lemma 2 Let · be a norm over Rn and let · be its dual norm. Let f be a σ-strongly convex function on S and let f be its Fenchel conjugate. Then, f is differentiable with f (θ) = arg maxx∈S θ, x − f (x). Furthermore, for any θ, λ ∈ Rn we have 1 f (θ + λ) − f (θ) ≤ f (θ), λ + λ 2 . 2σ Two notable examples of strongly convex functions which we use are as follows. 1 Example 1 The function f (w) = 2 w norm. Its conjugate function is f (θ) = 2 2 1 2 is 1-strongly convex over S = Rn with respect to the θ 2. 2 2 n 1 Example 2 The function f (w) = i=1 wi log(wi / n ) is 1-strongly convex over the probabilistic n simplex, S = {w ∈ R+ : w 1 = 1}, with respect to the 1 norm. Its conjugate function is n 1 f (θ) = log( n i=1 exp(θi )). 3 Generalized Fenchel Duality In this section we derive our main analysis tool. We start by considering the following optimization problem, T inf c f (w) + t=1 gt (w) , w∈S where c is a non-negative scalar. An equivalent problem is inf w0 ,w1 ,...,wT c f (w0 ) + T t=1 gt (wt ) s.t. w0 ∈ S and ∀t ∈ [T ], wt = w0 . Introducing T vectors λ1 , . . . , λT , each λt ∈ Rn is a vector of Lagrange multipliers for the equality constraint wt = w0 , we obtain the following Lagrangian T T L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) = c f (w0 ) + t=1 gt (wt ) + t=1 λt , w0 − wt . The dual problem is the task of maximizing the following dual objective value, D(λ1 , . . . , λT ) = inf L(w0 , w1 , . . . , wT , λ1 , . . . , λT ) w0 ∈S,w1 ,...,wT = − c sup w0 ∈S = −c f −1 c w0 , − 1 c T t=1 T t=1 λt − λt − f (w0 ) − T t=1 gt (λt ) , T t=1 sup ( wt , λt − gt (wt )) wt where, following the exposition of Sec. 2, f , g1 , . . . , gT are the Fenchel conjugate functions of f, g1 , . . . , gT . Therefore, the generalized Fenchel dual problem is sup − cf λ1 ,...,λT −1 c T t=1 λt − T t=1 gt (λt ) . (4) Note that when T = 1 and c = 1, the above duality is the so called Fenchel duality. 4 A Template Learning Algorithm for Convex Repeated Games In this section we describe a template learning algorithm for playing convex repeated games. As mentioned before, we study convex repeated games from the viewpoint of the first player which we shortly denote as P1. Recall that we would like our learning algorithm to achieve a regret bound of the form given in Eq. (2). We start by rewriting Eq. (2) as follows T m gt (wt ) − c L ≤ inf u∈S t=1 c f (u) + gt (u) , (5) t=1 √ where c = T . Thus, up to the sublinear term c L, the cumulative loss of P1 lower bounds the optimum of the minimization problem on the right-hand side of Eq. (5). In the previous section we derived the generalized Fenchel dual of the right-hand side of Eq. (5). Our construction is based on the weak duality theorem stating that any value of the dual problem is smaller than the optimum value of the primal problem. The algorithmic framework we propose is therefore derived by incrementally ascending the dual objective function. Intuitively, by ascending the dual objective we move closer to the optimal primal value and therefore our performance becomes similar to the performance of the best fixed weight vector which minimizes the right-hand side of Eq. (5). Initially, we use the elementary dual solution λ1 = 0 for all t. We assume that inf w f (w) = 0 and t for all t inf w gt (w) = 0 which imply that D(λ1 , . . . , λ1 ) = 0. We assume in addition that f is 1 T σ-strongly convex. Therefore, based on Lemma 2, the function f is differentiable. At trial t, P1 uses for prediction the vector wt = f −1 c T i=1 λt i . (6) After predicting wt , P1 receives the function gt and suffers the loss gt (wt ). Then, P1 updates the dual variables as follows. Denote by ∂t the differential set of gt at wt , that is, ∂t = {λ : ∀w ∈ S, gt (w) − gt (wt ) ≥ λ, w − wt } . (7) The new dual variables (λt+1 , . . . , λt+1 ) are set to be any set of vectors which satisfy the following 1 T two conditions: (i). ∃λ ∈ ∂t s.t. D(λt+1 , . . . , λt+1 ) ≥ D(λt , . . . , λt , λ , λt , . . . , λt ) 1 1 t−1 t+1 T T (ii). ∀i > t, λt+1 = 0 i . (8) In the next section we show that condition (i) ensures that the increase of the dual at trial t is proportional to the loss gt (wt ). The second condition ensures that we can actually calculate the dual at trial t without any knowledge on the yet to be seen loss functions gt+1 , . . . , gT . We conclude this section with two update rules that trivially satisfy the above two conditions. The first update scheme simply finds λ ∈ ∂t and set λt+1 = i λ λt i if i = t if i = t . (9) The second update defines (λt+1 , . . . , λt+1 ) = argmax D(λ1 , . . . , λT ) 1 T λ1 ,...,λT s.t. ∀i = t, λi = λt . i (10) 5 Analysis In this section we analyze the performance of the template algorithm given in the previous section. Our proof technique is based on monitoring the value of the dual objective function. The main result is the following lemma which gives upper and lower bounds for the final value of the dual objective function. Lemma 3 Let f be a σ-strongly convex function with respect to a norm · over a set S and assume that minw∈S f (w) = 0. Let g1 , . . . , gT be a sequence of convex and closed functions such that inf w gt (w) = 0 for all t ∈ [T ]. Suppose that a dual-incrementing algorithm which satisfies the conditions of Eq. (8) is run with f as a complexity function on the sequence g1 , . . . , gT . Let w1 , . . . , wT be the sequence of primal vectors that the algorithm generates and λT +1 , . . . , λT +1 1 T be its final sequence of dual variables. Then, there exists a sequence of sub-gradients λ1 , . . . , λT , where λt ∈ ∂t for all t, such that T 1 gt (wt ) − 2σc t=1 T T λt 2 ≤ D(λT +1 , . . . , λT +1 ) 1 T t=1 ≤ inf c f (w) + w∈S gt (w) . t=1 Proof The second inequality follows directly from the weak duality theorem. Turning to the left most inequality, denote ∆t = D(λt+1 , . . . , λt+1 ) − D(λt , . . . , λt ) and note that 1 1 T T T D(λ1 +1 , . . . , λT +1 ) can be rewritten as T T t=1 D(λT +1 , . . . , λT +1 ) = 1 T T t=1 ∆t − D(λ1 , . . . , λ1 ) = 1 T ∆t , (11) where the last equality follows from the fact that f (0) = g1 (0) = . . . = gT (0) = 0. The definition of the update implies that ∆t ≥ D(λt , . . . , λt , λt , 0, . . . , 0) − D(λt , . . . , λt , 0, 0, . . . , 0) for 1 t−1 1 t−1 t−1 some subgradient λt ∈ ∂t . Denoting θ t = − 1 j=1 λj , we now rewrite the lower bound on ∆t as, c ∆t ≥ −c (f (θ t − λt /c) − f (θ t )) − gt (λt ) . Using Lemma 2 and the definition of wt we get that 1 (12) ∆t ≥ wt , λt − gt (λt ) − 2 σ c λt 2 . Since λt ∈ ∂t and since we assume that gt is closed and convex, we can apply Lemma 1 to get that wt , λt − gt (λt ) = gt (wt ). Plugging this equality into Eq. (12) and summing over t we obtain that T T T 1 2 . t=1 ∆t ≥ t=1 gt (wt ) − 2 σ c t=1 λt Combining the above inequality with Eq. (11) concludes our proof. The following regret bound follows as a direct corollary of Lemma 3. T 1 Theorem 1 Under the same conditions of Lemma 3. Denote L = T t=1 λt w ∈ S we have, T T c f (w) 1 1 + 2L c . t=1 gt (wt ) − T t=1 gt (w) ≤ T T σ √ In particular, if c = T , we obtain the bound, 1 T 6 T t=1 gt (wt ) − 1 T T t=1 gt (w) ≤ f (w)+L/(2 σ) √ T 2 . Then, for all . Application to Online learning In Sec. 1 we cast the task of online learning as a convex repeated game. We now demonstrate the applicability of our algorithmic framework for the problem of instance ranking. We analyze this setting since several prediction problems, including binary classification, multiclass prediction, multilabel prediction, and label ranking, can be cast as special cases of the instance ranking problem. Recall that on each online round, the learner receives a question-answer pair. In instance ranking, the question is encoded by a matrix Xt of dimension kt × n and the answer is a vector yt ∈ Rkt . The semantic of yt is as follows. For any pair (i, j), if yt,i > yt,j then we say that yt ranks the i’th row of Xt ahead of the j’th row of Xt . We also interpret yt,i − yt,j as the confidence in which the i’th row should be ranked ahead of the j’th row. For example, each row of Xt encompasses a representation of a movie while yt,i is the movie’s rating, expressed as the number of stars this movie has received by a movie reviewer. The predictions of the learner are determined ˆ based on a weight vector wt ∈ Rn and are defined to be yt = Xt wt . Finally, let us define two loss functions for ranking, both generalize the hinge-loss used in binary classification problems. Denote by Et the set {(i, j) : yt,i > yt,j }. For all (i, j) ∈ Et we define a pair-based hinge-loss i,j (w; (Xt , yt )) = [(yt,i − yt,j ) − w, xt,i − xt,j ]+ , where [a]+ = max{a, 0} and xt,i , xt,j are respectively the i’th and j’th rows of Xt . Note that i,j is zero if w ranks xt,i higher than xt,j with a sufficient confidence. Ideally, we would like i,j (wt ; (Xt , yt )) to be zero for all (i, j) ∈ Et . If this is not the case, we are being penalized according to some combination of the pair-based losses i,j . For example, we can set (w; (Xt , yt )) to be the average over the pair losses, 1 avg (w; (Xt , yt )) = |Et | (i,j)∈Et i,j (w; (Xt , yt )) . This loss was suggested by several authors (see for example [18]). Another popular approach (see for example [5]) penalizes according to the maximal loss over the individual pairs, max (w; (Xt , yt )) = max(i,j)∈Et i,j (w; (Xt , yt )) . We can apply our algorithmic framework given in Sec. 4 for ranking, using for gt (w) either avg (w; (Xt , yt )) or max (w; (Xt , yt )). The following theorem provides us with a sufficient condition under which the regret bound from Thm. 1 holds for ranking as well. Theorem 2 Let f be a σ-strongly convex function over S with respect to a norm · . Denote by Lt the maximum over (i, j) ∈ Et of xt,i − xt,j 2 . Then, for both gt (w) = avg (w; (Xt , yt )) and ∗ gt (w) = max (w; (Xt , yt )), the following regret bound holds ∀u ∈ S, 7 1 T T t=1 gt (wt ) − 1 T T t=1 gt (u) ≤ 1 f (u)+ T PT t=1 Lt /(2 σ) √ T . The Boosting Game In this section we describe the applicability of our algorithmic framework to the analysis of boosting algorithms. A boosting algorithm uses a weak learning algorithm that generates weak-hypotheses whose performances are just slightly better than random guessing to build a strong-hypothesis which can attain an arbitrarily low error. The AdaBoost algorithm, proposed by Freund and Schapire [6], receives as input a training set of examples {(x1 , y1 ), . . . , (xm , ym )} where for all i ∈ [m], xi is taken from an instance domain X , and yi is a binary label, yi ∈ {+1, −1}. The boosting process proceeds in a sequence of consecutive trials. At trial t, the booster first defines a distribution, denoted wt , over the set of examples. Then, the booster passes the training set along with the distribution wt to the weak learner. The weak learner is assumed to return a hypothesis ht : X → {+1, −1} whose average error is slightly smaller than 1 . That is, there exists a constant γ > 0 such that, 2 def m 1−yi ht (xi ) = ≤ 1 −γ . (13) i=1 wt,i 2 2 The goal of the boosting algorithm is to invoke the weak learner several times with different distributions, and to combine the hypotheses returned by the weak learner into a final, so called strong, hypothesis whose error is small. The final hypothesis combines linearly the T hypotheses returned by the weak learner with coefficients α1 , . . . , αT , and is defined to be the sign of hf (x) where T hf (x) = t=1 αt ht (x) . The coefficients α1 , . . . , αT are determined by the booster. In Ad1 1 aBoost, the initial distribution is set to be the uniform distribution, w1 = ( m , . . . , m ). At iter1 ation t, the value of αt is set to be 2 log((1 − t )/ t ). The distribution is updated by the rule wt+1,i = wt,i exp(−αt yi ht (xi ))/Zt , where Zt is a normalization factor. Freund and Schapire [6] have shown that under the assumption given in Eq. (13), the error of the final strong hypothesis is at most exp(−2 γ 2 T ). t Several authors [15, 13, 8, 4] have proposed to view boosting as a coordinate-wise greedy optimization process. To do so, note first that hf errs on an example (x, y) iff y hf (x) ≤ 0. Therefore, the exp-loss function, defined as exp(−y hf (x)), is a smooth upper bound of the zero-one error, which equals to 1 if y hf (x) ≤ 0 and to 0 otherwise. Thus, we can restate the goal of boosting as minimizing the average exp-loss of hf over the training set with respect to the variables α1 , . . . , αT . To simplify our derivation in the sequel, we prefer to say that boosting maximizes the negation of the loss, that is, T m 1 (14) max − m i=1 exp −yi t=1 αt ht (xi ) . α1 ,...,αT In this view, boosting is an optimization procedure which iteratively maximizes Eq. (14) with respect to the variables α1 , . . . , αT . This view of boosting, enables the hypotheses returned by the weak learner to be general functions into the reals, ht : X → R (see for instance [15]). In this paper we view boosting as a convex repeated game between a booster and a weak learner. To motivate our construction, we would like to note that boosting algorithms define weights in two different domains: the vectors wt ∈ Rm which assign weights to examples and the weights {αt : t ∈ [T ]} over weak-hypotheses. In the terminology used throughout this paper, the weights wt ∈ Rm are primal vectors while (as we show in the sequel) each weight αt of the hypothesis ht is related to a dual vector λt . In particular, we show that Eq. (14) is exactly the Fenchel dual of a primal problem for a convex repeated game, thus the algorithmic framework described thus far for playing games naturally fits the problem of iteratively solving Eq. (14). To derive the primal problem whose Fenchel dual is the problem given in Eq. (14) let us first denote by vt the vector in Rm whose ith element is vt,i = yi ht (xi ). For all t, we set gt to be the function gt (w) = [ w, vt ]+ . Intuitively, gt penalizes vectors w which assign large weights to examples which are predicted accurately, that is yi ht (xi ) > 0. In particular, if ht (xi ) ∈ {+1, −1} and wt is a distribution over the m examples (as is the case in AdaBoost), gt (wt ) reduces to 1 − 2 t (see Eq. (13)). In this case, minimizing gt is equivalent to maximizing the error of the individual T hypothesis ht over the examples. Consider the problem of minimizing c f (w) + t=1 gt (w) where f (w) is the relative entropy given in Example 2 and c = 1/(2 γ) (see Eq. (13)). To derive its Fenchel dual, we note that gt (λt ) = 0 if there exists βt ∈ [0, 1] such that λt = βt vt and otherwise gt (λt ) = ∞ (see [16]). In addition, let us define αt = 2 γ βt . Since our goal is to maximize the αt dual, we can restrict λt to take the form λt = βt vt = 2 γ vt , and get that D(λ1 , . . . , λT ) = −c f − 1 c T βt vt t=1 =− 1 log 2γ 1 m m e− PT t=1 αt yi ht (xi ) . (15) i=1 Minimizing the exp-loss of the strong hypothesis is therefore the dual problem of the following primal minimization problem: find a distribution over the examples, whose relative entropy to the uniform distribution is as small as possible while the correlation of the distribution with each vt is as small as possible. Since the correlation of w with vt is inversely proportional to the error of ht with respect to w, we obtain that in the primal problem we are trying to maximize the error of each individual hypothesis, while in the dual problem we minimize the global error of the strong hypothesis. The intuition of finding distributions which in retrospect result in large error rates of individual hypotheses was also alluded in [15, 8]. We can now apply our algorithmic framework from Sec. 4 to boosting. We describe the game αt with the parameters αt , where αt ∈ [0, 2 γ], and underscore that in our case, λt = 2 γ vt . At the beginning of the game the booster sets all dual variables to be zero, ∀t αt = 0. At trial t of the boosting game, the booster first constructs a primal weight vector wt ∈ Rm , which assigns importance weights to the examples in the training set. The primal vector wt is constructed as in Eq. (6), that is, wt = f (θ t ), where θ t = − i αi vi . Then, the weak learner responds by presenting the loss function gt (w) = [ w, vt ]+ . Finally, the booster updates the dual variables so as to increase the dual objective function. It is possible to show that if the range of ht is {+1, −1} 1 then the update given in Eq. (10) is equivalent to the update αt = min{2 γ, 2 log((1 − t )/ t )}. We have thus obtained a variant of AdaBoost in which the weights αt are capped above by 2 γ. A disadvantage of this variant is that we need to know the parameter γ. We would like to note in passing that this limitation can be lifted by a different definition of the functions gt . We omit the details due to the lack of space. To analyze our game of boosting, we note that the conditions given in Lemma 3 holds T and therefore the left-hand side inequality given in Lemma 3 tells us that t=1 gt (wt ) − T T +1 T +1 1 2 , . . . , λT ) . The definition of gt and the weak learnability ast=1 λt ∞ ≤ D(λ1 2c sumption given in Eq. (13) imply that wt , vt ≥ 2 γ for all t. Thus, gt (wt ) = wt , vt ≥ 2 γ which also implies that λt = vt . Recall that vt,i = yi ht (xi ). Assuming that the range of ht is [+1, −1] we get that λt ∞ ≤ 1. Combining all the above with the left-hand side inequality T given in Lemma 3 we get that 2 T γ − 2 c ≤ D(λT +1 , . . . , λT +1 ). Using the definition of D (see 1 T Eq. (15)), the value c = 1/(2 γ), and rearranging terms we recover the original bound for AdaBoost PT 2 m 1 −yi t=1 αt ht (xi ) ≤ e−2 γ T . i=1 e m 8 Related Work and Discussion We presented a new framework for designing and analyzing algorithms for playing convex repeated games. Our framework was used for the analysis of known algorithms for both online learning and boosting settings. The framework also paves the way to new algorithms. In a previous paper [17], we suggested the use of duality for the design of online algorithms in the context of mistake bound analysis. The contribution of this paper over [17] is three fold as we now briefly discuss. First, we generalize the applicability of the framework beyond the specific setting of online learning with the hinge-loss to the general setting of convex repeated games. The setting of convex repeated games was formally termed “online convex programming” by Zinkevich [19] and was first presented by Gordon in [9]. There is voluminous amount of work on unifying approaches for deriving online learning algorithms. We refer the reader to [11, 12, 3] for work closely related to the content of this paper. By generalizing our previously studied algorithmic framework [17] beyond online learning, we can automatically utilize well known online learning algorithms, such as the EG and p-norm algorithms [12, 11], to the setting of online convex programming. We would like to note that the algorithms presented in [19] can be derived as special cases of our algorithmic framework 1 by setting f (w) = 2 w 2 . Parallel and independently to this work, Gordon [10] described another algorithmic framework for online convex programming that is closely related to the potential based algorithms described by Cesa-Bianchi and Lugosi [3]. Gordon also considered the problem of defining appropriate potential functions. Our work generalizes some of the theorems in [10] while providing a somewhat simpler analysis. Second, the usage of generalized Fenchel duality rather than the Lagrange duality given in [17] enables us to analyze boosting algorithms based on the framework. Many authors derived unifying frameworks for boosting algorithms [13, 8, 4]. Nonetheless, our general framework and the connection between game playing and Fenchel duality underscores an interesting perspective of both online learning and boosting. We believe that this viewpoint has the potential of yielding new algorithms in both domains. Last, despite the generality of the framework introduced in this paper, the resulting analysis is more distilled than the earlier analysis given in [17] for two reasons. (i) The usage of Lagrange duality in [17] is somehow restricted while the notion of generalized Fenchel duality is more appropriate to the general and broader problems we consider in this paper. (ii) The strongly convex property we employ both simplifies the analysis and enables more intuitive conditions in our theorems. There are various possible extensions of the work that we did not pursue here due to the lack of space. For instanc, our framework can naturally be used for the analysis of other settings such as repeated games (see [7, 19]). The applicability of our framework to online learning can also be extended to other prediction problems such as regression and sequence prediction. Last, we conjecture that our primal-dual view of boosting will lead to new methods for regularizing boosting algorithms, thus improving their generalization capabilities. References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2006. S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. M. Collins, R.E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 2002. K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. JMLR, 7, Mar 2006. Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, 1995. Y. Freund and R.E. Schapire. Game theory, on-line prediction and boosting. In COLT, 1996. J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 2000. G. Gordon. Regret bounds for prediction problems. In COLT, 1999. G. Gordon. No-regret algorithms for online convex programs. In NIPS, 2006. A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 2001. J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Journal of Machine Learning, 45(3),2001. L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999. Y. Nesterov. Primal-dual subgradient methods for convex problems. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2005. R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):1–40, 1999. S. Shalev-Shwartz and Y. Singer. Convex repeated games and fenchel duality. Technical report, The Hebrew University, 2006. S. Shalev-Shwartz and Y. Singer. Online learning meets optimization in the dual. In COLT, 2006. J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In ESANN, April 1999. M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
6 0.62171453 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
7 0.61992639 65 nips-2006-Denoising and Dimension Reduction in Feature Space
8 0.61882591 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
9 0.61460787 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
10 0.61300397 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
11 0.61264062 203 nips-2006-implicit Online Learning with Kernels
12 0.61131382 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
13 0.60829026 146 nips-2006-No-regret Algorithms for Online Convex Programs
14 0.60766476 170 nips-2006-Robotic Grasping of Novel Objects
15 0.60753566 115 nips-2006-Learning annotated hierarchies from relational data
16 0.60659915 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
17 0.60604995 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
18 0.60583776 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
19 0.60536212 35 nips-2006-Approximate inference using planar graph decomposition
20 0.60364324 175 nips-2006-Simplifying Mixture Models through Function Approximation