jmlr jmlr2010 jmlr2010-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
Reference: text
sentIndex sentText sentNum sentScore
1 We also uncover a general relationship between regularized discriminant analysis and ridge regression. [sent-14, score-0.227]
2 This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. [sent-15, score-0.165]
3 Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition 1. [sent-16, score-0.226]
4 Introduction. In this paper we are concerned with Fisher linear discriminant analysis (FDA), an enduring classification method in multivariate analysis and machine learning. [sent-17, score-0.09]
5 It is well known that the FDA formulation reduces to the solution of a generalized eigenproblem (Golub and Van Loan, 1996) that involves the between-class scatter matrix and total scatter matrix of the data vectors. [sent-18, score-0.328]
6 To solve the generalized eigenproblem, FDA typically requires the pooled scatter matrix to be nonsingular. [sent-19, score-0.153]
7 In applications such as information retrieval, face recognition and microarray analysis, for example, we often encounter undersampled problems in a “small n but large p” regime; that is, there are a small number of samples but a very large number of variables. [sent-21, score-0.103]
8 There are two main variants of FDA in the literature that aim to deal with this issue: the pseudoinverse method and the regularization method (Hastie et al. [sent-22, score-0.107]
9 Prior work incorporated the generalized singular value decomposition (GSVD) (Paige and Saunders, 1981) into the FDA solution by using special representations of the pooled scatter matrix and between-class scatter matrix. [sent-29, score-0.271]
10 Our algorithm is also appropriate for the pseudoinverse variant of FDA. [sent-48, score-0.107]
11 Indeed, we establish an equivalence between the pseudoinverse variant and an ordinary least squares (OLS) estimation problem. [sent-49, score-0.134]
12 A more general method, known as generalized discriminant analysis (GDA) (Baudat and Anouar, 2000), requires that the kernel matrix be nonsingular. [sent-58, score-0.191]
13 The approach to FDA that we present in the current paper not only handles the nonsingularity issue but also extends naturally to KDA, both in its regularization and pseudoinverse forms. [sent-62, score-0.107]
14 For a matrix A ∈ Rm×q with m ≥ q, we always express the (reduced) singular value decomposition (SVD) of A as A = UΓV′, where U ∈ Rm×q has orthonormal columns (that is, U′U = Iq) and V ∈ Rq×q is orthogonal (i.e., V′V = VV′ = Iq). [sent-99, score-0.087]
15 The condensed SVD of A is then A = UA ΓA VA′, where UA ∈ Rm×r and VA ∈ Rq×r are matrices with orthonormal columns (i.e., UA′UA = VA′VA = Ir). [sent-106, score-0.114]
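The condensed SVD (keeping only the strictly positive singular values) is the basic primitive used by the algorithms quoted below. A minimal NumPy sketch, with a tolerance-based rank cut chosen here purely for illustration:

```python
import numpy as np

def condensed_svd(A, tol=1e-10):
    """Condensed SVD: A ~= U_A diag(Gamma_A) V_A', keeping only singular values
    above tol (exact when the discarded singular values are zero)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s.max())) if s.size else 0
    return U[:, :r], s[:r], Vt[:r, :].T
```

The later sketches inline this step, but they all rely on the same rank truncation.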
16 …, bq] as q eigenpairs of the matrix pencil (Σ1, Σ2) if Σ1B = Σ2BΛ; namely, Σ1 bi = λi Σ2 bi for i = 1, …, q. [sent-118, score-0.222]
17 The problem of finding eigenpairs of (Σ1 , Σ2 ) is known as a generalized eigenproblem. [sent-122, score-0.182]
18 …, q, and we refer to (Λ, B) as the nonzero eigenpairs of (Σ1, Σ2). [sent-126, score-0.192]
19 If Σ2 is nonsingular, (Λ, B) is also referred to as the (nonzero) eigenpairs of Σ2^{-1}Σ1, because the generalized eigenproblem is equivalent to the eigenproblem Σ2^{-1}Σ1 B = BΛ. [sent-127, score-0.296]
20 In the case that Σ2 is singular, one typically resorts to a pseudoinverse eigenproblem: Σ2^+ Σ1 B = BΛ. [sent-128, score-0.107]
21 Fortunately, we are able to establish a connection between the solutions of the generalized eigenproblem and its corresponding pseudoinverse eigenproblem. [sent-129, score-0.252]
22 Then, if (Λ, B) are the nonzero eigenpairs of Σ2^+ Σ1, we have that (Λ, B) are the nonzero eigenpairs of the matrix pencil (Σ1, Σ2). [sent-133, score-0.455]
23 Conversely, if (Λ, B) are the nonzero eigenpairs of the matrix pencil (Σ1, Σ2), then (Λ, Σ2^+Σ2 B) are the nonzero eigenpairs of Σ2^+ Σ1. [sent-134, score-0.455]
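A quick numerical check of this correspondence, as a sketch only: Σ1 and Σ2 below are random symmetric positive semidefinite stand-ins, constructed so that the range of Σ1 lies inside the range of Σ2, which is the hypothesis the omitted part of the theorem appears to require.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 4
A2 = rng.standard_normal((p, r))
Sigma2 = A2 @ A2.T                        # symmetric PSD, rank r < p (singular)
M = rng.standard_normal((r, 3))
Sigma1 = A2 @ M @ M.T @ A2.T              # PSD with range(Sigma1) inside range(Sigma2)

# Nonzero eigenpairs of Sigma2^+ Sigma1
lam, B = np.linalg.eig(np.linalg.pinv(Sigma2) @ Sigma1)
keep = np.abs(lam) > 1e-8
lam, B = lam[keep].real, B[:, keep].real

# They also satisfy the pencil equation Sigma1 B = Sigma2 B Lambda
print(np.allclose(Sigma1 @ B, Sigma2 @ B @ np.diag(lam), atol=1e-6))   # expected: True
```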
24 We then have the pooled scatter matrix St = ∑_{i=1}^n (xi − m)(xi − m)′ and the between-class scatter matrix Sb = ∑_{j=1}^c nj (mj − m)(mj − m)′. [sent-143, score-0.196]
25 Conventional FDA solves the following generalized eigenproblem: Sb aj = λj St aj, λ1 ≥ λ2 ≥ · · · ≥ λq > λq+1 = 0 (1), where q ≤ min{p, c−1} and where we refer to aj as the jth discriminant direction. [sent-144, score-0.121]
26 Since St = Sb + Sw where Sw is the pooled within-class scatter matrix, FDA is equivalent to finding a solution to Sb a = λ/(1−λ)Sw a. [sent-146, score-0.131]
27 We see that FDA involves solving the generalized eigenproblem in (1), which can be expressed in matrix form: Sb A = St AΛ. [sent-147, score-0.171]
28 Thus, the (λ j , a j ) are the eigenpairs of St−1 Sb and the eigenvectors corresponding to the largest eigenvalues of St−1 Sb are used for the discriminant directions. [sent-156, score-0.241]
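In code, this textbook route (forming the scatter matrices and taking the leading eigenvectors of St^{-1}Sb) can be sketched as follows. This is the generic recipe rather than the paper's algorithm, and it assumes St is nonsingular; the function name is ours.

```python
import numpy as np

def fda_directions(X, y, q=None):
    """Conventional FDA: leading eigenvectors of St^{-1} Sb (assumes St nonsingular)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    m = X.mean(axis=0)
    St = (X - m).T @ (X - m)                               # total (pooled) scatter
    Sb = sum(nj * np.outer(X[y == c].mean(axis=0) - m,
                           X[y == c].mean(axis=0) - m)
             for c, nj in zip(classes, counts))            # between-class scatter
    lam, A = np.linalg.eig(np.linalg.solve(St, Sb))        # eigenpairs of St^{-1} Sb
    order = np.argsort(-lam.real)
    q = q if q is not None else len(classes) - 1
    return lam.real[order][:q], A.real[:, order][:, :q]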
29 In applications such as information retrieval, face recognition and microarray analysis, however, we often encounter a “small n but large p” problem. [sent-160, score-0.103]
30 The first variant, the pseudoinverse method, involves replacing St−1 by St+ and solving the following eigenproblem: St+ Sb A = AΛ. [sent-188, score-0.107]
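A corresponding one-step sketch of the pseudoinverse variant, reusing the scatter matrices produced by the previous sketch; again this is the generic recipe and the function name is ours.

```python
import numpy as np

def pinv_fda_directions(St, Sb, q):
    """Pseudoinverse FDA variant: leading eigenpairs of St^+ Sb."""
    lam, A = np.linalg.eig(np.linalg.pinv(St) @ Sb)
    order = np.argsort(-lam.real)[:q]
    return lam.real[order], A.real[:, order]
```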
31 The second variant is referred to as regularized discriminant analysis (RDA) (Friedman, 1989). [sent-192, score-0.126]
32 In this paper we concentrate on the solution of (6) with or without this constraint, and investigate the connection with a ridge regression problem in the multi-class setting. [sent-202, score-0.105]
33 In kernel methods, we are given a reproducing kernel K : X × X → R such that K(s, t) = ϕ(s)′ ϕ(t) for s, t ∈ X . [sent-206, score-0.088]
34 This implies that (Λ, X̃′Hϒ) are also the q eigenpairs of the matrix pencil (S̃b, S̃t). [sent-227, score-0.222]
35 In the literature, the solution of (7) is typically restricted to R(X̃′H). Premultiplying both sides of Equation (9) by HX̃, we obtain a new generalized eigenvalue problem CEΠ^{-1}E′Cϒ = CCϒΛ (10), which involves only the kernel matrix K = X̃X̃′ via its centered form C = HKH. [sent-229, score-0.136]
36 After having obtained ϒ from (10) or (11), for a new input vector x, the projection z of its feature vector x̃ onto Ã is computed by z = ϒ′H X̃(x̃ − (1/n)X̃′1n) = ϒ′H(kx − (1/n)K1n) (12), where kx = (K(x, x1), …, K(x, xn))′. [sent-241, score-0.164]
37 New Approaches to Kernel Discriminant Analysis. Recall that we are interested in the pseudoinverse and regularization forms of (7), defined respectively by S̃t^+ S̃b Ã = ÃΛ (13) and S̃b Ã = (S̃t + σ^2 Ig)ÃΛ (14). [sent-256, score-0.107]
38 To solve (13), we consider the pseudoinverse form of (10), which is C+ EΠ−1 E′ Cϒ = ϒΛ (15) due to C+ C+ C = C+ (see Lemma 11). [sent-262, score-0.107]
39 According to this theorem, for a new x, the projection z onto Ã is given by z = ϒ′H(kx − (1/n)K1n). [sent-294, score-0.082]
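A minimal sketch of that projection step, assuming a fitted coefficient matrix ϒ (here `Upsilon`), the training kernel matrix K, and the vector of kernel evaluations kx between the new point and the training set; the variable names are ours, not the paper's.

```python
import numpy as np

def kda_project(Upsilon, K, k_x):
    """Project a new point: z = Upsilon' H (k_x - K 1_n / n)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return Upsilon.T @ H @ (k_x - K @ np.ones(n) / n)
```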
40 The Algorithm for RFDA. We reformulate the eigenproblem in (6) as GΠ^{-1/2}E′HXA = AΛ, where G = (X′HX + σ^2 Ip)^{-1} X′HEΠ^{-1/2} (18), due to (3) and (4). [sent-304, score-0.114]
41 Moreover, if (Λ, V) is the nonzero eigenpair of R, then (Λ, GV) is the nonzero eigenpair of GΠ^{-1/2}E′HX. [sent-313, score-0.136]
42 Note that the first step can be implemented by computing the condensed SVD of X′ HX (or HXX′ H). [sent-319, score-0.114]
43 The second step forms the condensed SVD of R, for which the computational complexity is O(c^3). [sent-324, score-0.114]
44 Algorithm 4 (EVD-based algorithm for RFDA problem (6)): 1: procedure RFDA(X, E, Π, σ^2); 2: Calculate G by (18) or (19) and R by (20); 3: Perform the condensed SVD of R as R = VR ΓR VR′; 4: Return A = G VR ΓR^{-1/2} or B = G VR as the solution of RFDA problem (6). [sent-329, score-0.149]
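A NumPy sketch of this recipe. Equations (19) and (20) are not reproduced in this extract, so the sketch assumes the conventions that make the quoted relations consistent: E is the n×c 0/1 class-indicator matrix, Π = diag(n1, …, nc), H is the centering matrix, and R = Π^{-1/2}E′HXG, which is the form that makes the eigenpair relation quoted above hold. Treat it as an illustration, not the paper's reference implementation.

```python
import numpy as np

def rfda(X, y, sigma2, tol=1e-10):
    """Sketch of Algorithm 4 (EVD-based RFDA), under the assumptions stated above."""
    y = np.asarray(y)
    n, p = X.shape
    classes = np.unique(y)
    E = (y[:, None] == classes[None, :]).astype(float)        # n x c class indicator
    Pi_inv_sqrt = np.diag(1.0 / np.sqrt(E.sum(axis=0)))        # Pi^{-1/2}
    H = np.eye(n) - np.ones((n, n)) / n                        # centering matrix
    # G = (X' H X + sigma^2 I_p)^{-1} X' H E Pi^{-1/2}         -- Equation (18)
    G = np.linalg.solve(X.T @ H @ X + sigma2 * np.eye(p),
                        X.T @ H @ E @ Pi_inv_sqrt)
    # R = Pi^{-1/2} E' H X G  (c x c, symmetric PSD)
    R = Pi_inv_sqrt @ E.T @ H @ X @ G
    w, V = np.linalg.eigh(R)                                   # condensed EVD of R
    keep = w > tol
    V_R, Gamma_R = V[:, keep], w[keep]
    A = G @ V_R @ np.diag(Gamma_R ** -0.5)                     # or B = G @ V_R
    return A
```

Since R is symmetric positive semidefinite, its condensed SVD coincides with its eigendecomposition, which is why `eigh` is used here.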
45 This shows that we can calculate R and z directly using K and kx. [sent-342, score-0.115]
46 Algorithm 5 (EVD-based algorithm for RKDA problem (14)): 1: procedure RKDA(K, E, kx, Π, σ^2); 2: Calculate R = Π^{-1/2}E′C(C + σ^2 In)^{-1}EΠ^{-1/2}; 3: Perform the condensed SVD of R as R = VR ΓR VR′; 4: Calculate z by (22); 5: Return z as the q-dimensional representation of x. [sent-351, score-0.196]
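A sketch of steps 2-3 under the same assumed conventions for E and Π as in the RFDA sketch above; the final projection formula (22) is not reproduced in this extract, so the sketch stops at the condensed EVD of R.

```python
import numpy as np

def rkda_evd_step(K, y, sigma2, tol=1e-10):
    """Steps 2-3 of Algorithm 5: build R from the kernel matrix, take its condensed EVD."""
    y = np.asarray(y)
    n = K.shape[0]
    classes = np.unique(y)
    E = (y[:, None] == classes[None, :]).astype(float)         # n x c class indicator
    Pi_inv_sqrt = np.diag(1.0 / np.sqrt(E.sum(axis=0)))         # Pi^{-1/2}
    H = np.eye(n) - np.ones((n, n)) / n
    C = H @ K @ H                                               # centered kernel matrix
    # R = Pi^{-1/2} E' C (C + sigma^2 I_n)^{-1} E Pi^{-1/2}
    R = Pi_inv_sqrt @ E.T @ C @ np.linalg.solve(C + sigma2 * np.eye(n),
                                                E @ Pi_inv_sqrt)
    w, V = np.linalg.eigh(R)                                    # condensed EVD of R
    keep = w > tol
    return V[:, keep], w[keep]
```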
47 On the other hand, let R = VR ΓR VR′ be the condensed SVD of R. [sent-360, score-0.114]
48 We then go on to consider a similar relationship between RKDA and the corresponding ridge regression problem. [sent-374, score-0.101]
49 In summary, we have obtained a relationship between the ridge estimation problem in (24) and the RFDA problem in (6). [sent-395, score-0.101]
50 Theorem 7 Let W be the solution of the ridge estimation problem in (24) (resp. [sent-396, score-0.105]
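Problem (24) itself is not reproduced in this extract, but G in (18) already has the form of a multivariate ridge-regression solution with targets EΠ^{-1/2}; the following sketch makes that reading explicit. Reading (24) this way is our assumption, not a restatement of Theorem 7.

```python
import numpy as np

def ridge_solution(X, Y_bar, sigma2):
    """Multivariate ridge regression on centered data:
    W = (X' H X + sigma^2 I_p)^{-1} X' H Y_bar.
    With Y_bar = E Pi^{-1/2}, W coincides with G in (18)."""
    n, p = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    return np.linalg.solve(X.T @ H @ X + sigma2 * np.eye(p), X.T @ H @ Y_bar)
```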
51 As we see from Section 3, its corresponding pseudoinverse form is given by (15). [sent-427, score-0.107]
52 Let the condensed SVD of R = Π^{-1/2}E′CC^+EΠ^{-1/2} be R = VR ΓR VR′. [sent-428, score-0.114]
53 In particular, the comparison was implemented on four face data sets, two handwritten digits data sets, the “letters” data set, and the WebKB data set. [sent-433, score-0.103]
54 Setup. The four face data sets are the ORL face database, the Yale face database, the Yale face database B with extension, and the CMU PIE face database, respectively. [sent-438, score-0.56]
55 − The ORL face database contains 400 facial images of 40 subjects with 10 different images for each subject. [sent-439, score-0.341]
56 The images were taken at different times with variations in facial details (glasses/no glasses), facial expressions (open/closed eyes, smiling/nonsimling), and facial poses (tilted and rotated up to 20 degrees). [sent-442, score-0.196]
57 − The Yale face images for each subject were captured under different facial expressions or configurations (e. [sent-445, score-0.203]
58 − The Yale face database with extension includes the Yale face database B (Georghiades et al. [sent-448, score-0.296]
59 The Yale face database B contains 5760 face images of 10 subjects with 576 different images for each subject, and the extended Yale Face Database B contains 16128 face images of 28 subjects, with each subject having 576 different images. [sent-451, score-0.551]
60 The facial images for each subject were captured under 9 poses and 64 illumination conditions. [sent-452, score-0.1]
61 For the sake of simplicity, a subset called YaleB&E was collected from the two databases; it contains the 2414 face images of 38 subjects. [sent-453, score-0.155]
62 − The CMU PIE face database contains 41,368 face images of 68 subjects. [sent-454, score-0.303]
63 The facial images for each subject were captured under 13 different poses, 43 different illumination conditions, and with 4 different expressions. [sent-455, score-0.1]
64 For simplicity, we collected a subset of the PIE face database, containing the 6800 face images of 68 subjects with 100 different images for each subject. [sent-457, score-0.351]
65 Figure 1 ((a) ORL, (b) Yale, (c) YaleB&E, (d) PIE): Sample images randomly chosen from the four data sets, with four subjects from each data set and six sample images per subject. [sent-460, score-0.093]
66 − The Binary Alphadigits (BA) data set was collected from a binary 20×16 digits database of “0” through “9” and capital “A” through “Z,” and thus contains 1404 images of 36 subjects, each subject with 39 images. [sent-464, score-0.097]
67 As we have shown, when B is used, Algorithm 4 provides the solution of ridge regression. [sent-484, score-0.105]
68 Table 2 presents an overall comparison of the methods on all of the data sets and Figure 2 presents the comparative classification results on the four face data sets. [sent-490, score-0.103]
69 It is seen that the RFDA methods have better classification accuracy overall than other methods throughout a range of choices of the number of discriminant variates. [sent-491, score-0.09]
70 Particularly striking is the performance of the RFDA methods when the number of discriminant variates q is small. [sent-492, score-0.09]
71 We also compared the computational time of the different methods on the four face data sets. [sent-497, score-0.103]
72 Figure 3 plots the results as a function of the training percentage n/k on the four face data sets. [sent-498, score-0.152]
73 We see that our method has an overall favorable computational complexity in comparison with the other methods on the four face data sets. [sent-499, score-0.103]
74 Note that when the training percentage n/k on the YaleB&E and PIE data sets increases, the singularity problem of the within-class scatter matrix Sw, that is, the small sample size problem, tends to disappear. [sent-501, score-0.123]
75 [Table 2 residue: FDA/GSVD accuracy (±std) and time on ORL and the other data sets.] [sent-504, score-0.089]
76 Finally, Figure 6 presents the performance of the four regularized methods with respect to different regularization parameters σ on the four face data sets. [sent-612, score-0.139]
77 Figure 2: Comparison of the classification accuracies for the FDA methods on the four face data sets: (a) ORL; (b) Yale; (c) YaleB&E; (d) PIE. [sent-623, score-0.103]
78 This problem can be formulated as a generalized eigenproblem as follows: X′ YY′ XA = f (Q)AΛ. [sent-643, score-0.145]
79 Figure 4: Comparison of the classification accuracies for the KDA methods on the four face data sets: (a) ORL; (b) Yale; (c) YaleB&E; (d) PIE. [sent-652, score-0.103]
80 Let the condensed SVD of X be X = UX ΓX VX′ and the condensed SVD of F = (f(ΓX))^{-1/2} ΓX UX′ Y be F = UF ΓF VF′. [sent-657, score-0.228]
81 We have: (i) (Λ, T), where T = VX (f(ΓX))^{-1/2} UF and Λ = ΓF^2, are r eigenpairs of the pencil (X′YY′X, f(Q)); (ii) the matrix Tq consisting of the first q columns of T is a maximizer of the generalized Rayleigh quotient in (29). [sent-658, score-0.253]
82 Algorithm 6 (SVD-based algorithm for problem (29)): 1: procedure GEP(X, Y, σ^2); 2: Perform the condensed SVD of X as X = UX ΓX VX′. [sent-682, score-0.114]
83 4: Perform the condensed SVD of F as F = UF ΓF VF′. [sent-684, score-0.114]
84 Ax′ Sxx Ax = Iq and Ay′ Syy Ay = Iq. [Table residue: KDA/GSVD and Algorithm 2, accuracy (±std) and time per data set.] [sent-689, score-0.089]
85 where q ≤ min{p, m, n−1}, Sxx = X′HX and Syy = Y′HY are the pooled covariance matrices of x and y, respectively, and Sxy = X′HY = Syx′ is the pooled cross-covariance matrix between x and y. [sent-787, score-0.122]
86 We now address the solution to the generalized eigenproblem in (32). [sent-800, score-0.18]
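The constraints and covariance definitions above are those of canonical correlation analysis; a standard whitened-SVD solution can be sketched as follows. This is the textbook recipe, not necessarily the algorithm the paper derives for (32), and the small `reg` term is added here only for numerical stability.

```python
import numpy as np

def linear_cca(X, Y, q, reg=1e-8):
    """Whitened-SVD CCA: returns Ax, Ay with Ax' Sxx Ax = I_q and Ay' Syy Ay = I_q."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Sxx = X.T @ H @ X + reg * np.eye(X.shape[1])
    Syy = Y.T @ H @ Y + reg * np.eye(Y.shape[1])
    Sxy = X.T @ H @ Y
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T          # whitened cross-covariance
    U, s, Vt = np.linalg.svd(M)
    Ax = np.linalg.solve(Lx.T, U[:, :q])
    Ay = np.linalg.solve(Ly.T, Vt[:q, :].T)
    return Ax, Ay, s[:q]                                        # s[:q]: canonical correlations
```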
87 Given x ∈ Rp and y ∈ Rm, we can directly calculate their canonical variables by zx = Ax′(x̃ − mx) = ϒ′(kx − (1/n)Kx 1n), where mx = (1/n)∑_{i=1}^n x̃i, Kx = X̃X̃′, and kx = (Kx(x, x1), …, Kx(x, xn))′. [sent-804, score-0.197]
88 zy = Ay′(ỹ − my) = Λ^{-1/2}ϒ′Cx(Cy + σ^2 In)^{-1}(ky − (1/n)Ky 1n), where my = (1/n)∑_{i=1}^n ỹi and ky = (Ky(y, y1), …, Ky(y, yn))′. [sent-807, score-0.088]
89 Conclusion. In this paper we have provided a solution to an open problem concerning the relationship between multi-class discriminant analysis problems and multivariate regression problems, both in the linear setting and the kernel setting. [sent-814, score-0.2]
90 Our theory has yielded efficient and effective algorithms for FDA and KDA within both the regularization and pseudoinverse paradigms. [sent-815, score-0.107]
91 Proof of Theorem 1. Let Σ1 = U1Γ1V1′ and Σ2 = U2Γ2V2′ be the condensed SVDs of Σ1 and Σ2. [sent-824, score-0.114]
92 If (Λ, B) are the eigenpairs of Σ2^+ Σ1, then it is easily seen that (Λ, B) are also the eigenpairs of (Σ1, Σ2) due to Σ2Σ2^+Σ1 = Σ1. [sent-831, score-0.302]
93 Conversely, suppose (Λ, B) are the eigenpairs of (Σ1, Σ2). [sent-832, score-0.151]
94 This implies that (Λ, Σ2^+Σ2 B) are the eigenpairs of Σ2^+ Σ1, due to Σ2Σ2^+Σ1 = Σ1 and Σ2^+Σ2Σ2^+ = Σ2^+. [sent-834, score-0.151]
95 From few to many: Illumination cone models for face recognition under variable lighting and pose. [sent-936, score-0.103]
96 Generalizing discriminant analysis using the generalized singular value decomposition. [sent-980, score-0.156]
97 Acquiring linear subspaces for face recognition under variable lighting. [sent-999, score-0.103]
98 Nonlinear discriminant analysis using kernel functions and the generalized singular value decomposition. [sent-1035, score-0.2]
99 A relationship between linear discriminant analysis and the generalized minimum squared error solution. [sent-1041, score-0.152]
100 Bayesian framework for least-squares support vector machine classifiers, Gaussian processes, and kernel Fisher discriminant analysis. [sent-1103, score-0.134]
wordName wordTfidf (topN-words)
[('fda', 0.418), ('hx', 0.327), ('kda', 0.267), ('rkda', 0.258), ('sb', 0.189), ('rfda', 0.187), ('st', 0.172), ('park', 0.171), ('hxx', 0.16), ('eigenpairs', 0.151), ('uf', 0.142), ('ordan', 0.134), ('iq', 0.129), ('yale', 0.123), ('eigenproblem', 0.114), ('condensed', 0.114), ('svd', 0.11), ('orl', 0.107), ('pseudoinverse', 0.107), ('iscriminant', 0.107), ('face', 0.103), ('dai', 0.1), ('egularized', 0.096), ('discriminant', 0.09), ('acc', 0.089), ('pie', 0.089), ('cy', 0.084), ('kx', 0.082), ('cc', 0.08), ('nalysis', 0.078), ('hc', 0.077), ('ay', 0.077), ('hhx', 0.071), ('syy', 0.071), ('ridge', 0.07), ('ig', 0.069), ('vx', 0.069), ('hang', 0.066), ('ax', 0.062), ('howland', 0.062), ('sxy', 0.062), ('sxx', 0.061), ('ce', 0.058), ('yy', 0.055), ('suykens', 0.053), ('nonsingular', 0.053), ('images', 0.052), ('vr', 0.05), ('percentage', 0.049), ('scatter', 0.048), ('cx', 0.048), ('std', 0.048), ('rk', 0.048), ('pooled', 0.048), ('facial', 0.048), ('ye', 0.047), ('database', 0.045), ('classificaition', 0.045), ('pencil', 0.045), ('ky', 0.044), ('kernel', 0.044), ('subjects', 0.041), ('nonzero', 0.041), ('webkb', 0.038), ('regularized', 0.036), ('baudat', 0.036), ('kcca', 0.036), ('syx', 0.036), ('solution', 0.035), ('theorem', 0.035), ('usps', 0.035), ('ir', 0.035), ('singular', 0.035), ('gestel', 0.034), ('calculate', 0.033), ('uc', 0.032), ('relationship', 0.031), ('generalized', 0.031), ('rda', 0.03), ('zhihua', 0.03), ('ww', 0.029), ('bb', 0.028), ('hastie', 0.027), ('tr', 0.027), ('squares', 0.027), ('sw', 0.027), ('van', 0.027), ('anouar', 0.027), ('congfu', 0.027), ('eigenpair', 0.027), ('eigenproblems', 0.027), ('gda', 0.027), ('guang', 0.027), ('hyy', 0.027), ('ols', 0.027), ('paige', 0.027), ('loan', 0.027), ('mika', 0.026), ('classification', 0.026), ('matrix', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
2 0.055648439 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
3 0.050333943 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
Author: Lin Xiao
Abstract: We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1 -norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1 -regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method. Keywords: stochastic learning, online optimization, ℓ1 -regularization, structural convex optimization, dual averaging methods, accelerated gradient methods
4 0.045388158 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
5 0.043470312 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
6 0.040407561 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
7 0.038299158 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
8 0.038292315 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
9 0.036486182 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
10 0.035337083 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
11 0.034575194 84 jmlr-2010-On Spectral Learning
12 0.033485744 83 jmlr-2010-On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
13 0.033065915 82 jmlr-2010-On Learning with Integral Operators
14 0.032553647 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
15 0.029822247 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
16 0.029546937 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
17 0.027671224 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
18 0.02563238 72 jmlr-2010-Matrix Completion from Noisy Entries
19 0.025541577 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
20 0.024970403 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
topicId topicWeight
[(0, -0.121), (1, -0.055), (2, 0.054), (3, 0.009), (4, 0.01), (5, -0.004), (6, -0.035), (7, -0.026), (8, 0.039), (9, 0.027), (10, 0.037), (11, 0.054), (12, -0.015), (13, 0.066), (14, -0.05), (15, -0.037), (16, 0.116), (17, 0.066), (18, -0.006), (19, -0.009), (20, -0.034), (21, 0.094), (22, 0.144), (23, 0.105), (24, 0.109), (25, -0.023), (26, -0.139), (27, 0.035), (28, -0.011), (29, 0.115), (30, -0.012), (31, 0.021), (32, -0.018), (33, -0.024), (34, 0.073), (35, 0.26), (36, 0.251), (37, 0.0), (38, -0.095), (39, 0.054), (40, 0.169), (41, 0.191), (42, 0.218), (43, 0.169), (44, 0.111), (45, 0.192), (46, -0.244), (47, -0.037), (48, 0.026), (49, -0.219)]
simIndex simValue paperId paperTitle
same-paper 1 0.92809063 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
2 0.3758955 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
Author: Christoforos Christoforou, Robert Haralick, Paul Sajda, Lucas C. Parra
Abstract: Traditional analysis methods for single-trial classification of electro-encephalography (EEG) focus on two types of paradigms: phase-locked methods, in which the amplitude of the signal is used as the feature for classification, that is, event related potentials; and second-order methods, in which the feature of interest is the power of the signal, that is, event related (de)synchronization. The process of deciding which paradigm to use is ad hoc and is driven by assumptions regarding the underlying neural generators. Here we propose a method that provides an unified framework for the analysis of EEG, combining first and second-order spatial and temporal features based on a bilinear model. Evaluation of the proposed method on simulated data shows that the technique outperforms state-of-the art techniques for single-trial classification for a broad range of signal-to-noise ratios. Evaluations on human EEG—including one benchmark data set from the Brain Computer Interface (BCI) competition—show statistically significant gains in classification accuracy, with a reduction in overall classification error from 26%-28% to 19%. Keywords: regularization, classification, bilinear decomposition, neural signals, brain computer interface
3 0.30030677 80 jmlr-2010-On-Line Sequential Bin Packing
Author: András György, Gábor Lugosi, György Ottucsàk
Abstract: We consider a sequential version of the classical bin packing problem in which items are received one by one. Before the size of the next item is revealed, the decision maker needs to decide whether the next item is packed in the currently open bin or the bin is closed and a new bin is opened. If the new item does not fit, it is lost. If a bin is closed, the remaining free space in the bin accounts for a loss. The goal of the decision maker is to minimize the loss accumulated over n periods. We present an algorithm that has a cumulative loss not much larger than any strategy in a finite class of reference strategies for any sequence of items. Special attention is payed to reference strategies that use a fixed threshold at each step to decide whether a new bin is opened. Some positive and negative results are presented for this case. Keywords: bin packing, on-line learning, prediction with expert advice
4 0.26674712 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference
Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený
Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach
5 0.2460282 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization
Author: Jarkko Venna, Jaakko Peltonen, Kristian Nybo, Helena Aidos, Samuel Kaski
Abstract: Nonlinear dimensionality reduction methods are often used to visualize high-dimensional data, although the existing methods have been designed for other related tasks such as manifold learning. It has been difficult to assess the quality of visualizations since the task has not been well-defined. We give a rigorous definition for a specific visualization task, resulting in quantifiable goodness measures and new visualization methods. The task is information retrieval given the visualization: to find similar data based on the similarities shown on the display. The fundamental tradeoff between precision and recall of information retrieval can then be quantified in visualizations as well. The user needs to give the relative cost of missing similar points vs. retrieving dissimilar points, after which the total cost can be measured. We then introduce a new method NeRV (neighbor retrieval visualizer) which produces an optimal visualization by minimizing the cost. We further derive a variant for supervised visualization; class information is taken rigorously into account when computing the similarity relationships. We show empirically that the unsupervised version outperforms existing unsupervised dimensionality reduction methods in the visualization task, and the supervised version outperforms existing supervised methods. Keywords: information retrieval, manifold learning, multidimensional scaling, nonlinear dimensionality reduction, visualization
6 0.23289324 84 jmlr-2010-On Spectral Learning
7 0.21516134 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
8 0.20801695 12 jmlr-2010-Analysis of Multi-stage Convex Relaxation for Sparse Regularization
9 0.19076078 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
10 0.18612385 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
11 0.15953828 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
12 0.15378627 35 jmlr-2010-Error-Correcting Output Codes Library
13 0.15308458 31 jmlr-2010-Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization
14 0.15224187 96 jmlr-2010-Rate Minimaxity of the Lasso and Dantzig Selector for thelqLoss inlrBalls
15 0.14996737 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
16 0.14984247 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
17 0.14907154 77 jmlr-2010-Model-based Boosting 2.0
18 0.14855868 71 jmlr-2010-Matched Gene Selection and Committee Classifier for Molecular Classification of Heterogeneous Diseases
19 0.14794914 82 jmlr-2010-On Learning with Integral Operators
20 0.14624572 6 jmlr-2010-A Rotation Test to Verify Latent Structure
topicId topicWeight
[(3, 0.014), (21, 0.022), (32, 0.647), (36, 0.013), (37, 0.032), (75, 0.094), (81, 0.01), (85, 0.053)]
simIndex simValue paperId paperTitle
Author: Jacek P. Dmochowski, Paul Sajda, Lucas C. Parra
Abstract: The presence of asymmetry in the misclassification costs or class prevalences is a common occurrence in the pattern classification domain. While much interest has been devoted to the study of cost-sensitive learning techniques, the relationship between cost-sensitive learning and the specification of the model set in a parametric estimation framework remains somewhat unclear. To that end, we differentiate between the case of the model including the true posterior, and that in which the model is misspecified. In the former case, it is shown that thresholding the maximum likelihood (ML) estimate is an asymptotically optimal solution to the risk minimization problem. On the other hand, under model misspecification, it is demonstrated that thresholded ML is suboptimal and that the risk-minimizing solution varies with the misclassification cost ratio. Moreover, we analytically show that the negative weighted log likelihood (Elkan, 2001) is a tight, convex upper bound of the empirical loss. Coupled with empirical results on several real-world data sets, we argue that weighted ML is the preferred cost-sensitive technique. Keywords: empirical risk minimization, loss function, cost-sensitive learning, imbalanced data sets
2 0.88189012 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
Author: Ran El-Yaniv, Yair Wiener
Abstract: We consider selective classification, a term we adopt here to refer to ‘classification with a reject option.’ The essence in selective classification is to trade-off classifier coverage for higher accuracy. We term this trade-off the risk-coverage (RC) trade-off. Our main objective is to characterize this trade-off and to construct algorithms that can optimally or near optimally achieve the best possible trade-offs in a controlled manner. For noise-free models we present in this paper a thorough analysis of selective classification including characterizations of RC trade-offs in various interesting settings. Keywords: classification with a reject option, selective classification, perfect learning, high performance classification, risk-coverage trade-off
3 0.85183787 38 jmlr-2010-Expectation Truncation and the Benefits of Preselection In Training Generative Models
Author: Jörg Lücke, Julian Eggert
Abstract: We show how a preselection of hidden variables can be used to efficiently train generative models with binary hidden variables. The approach is based on Expectation Maximization (EM) and uses an efficiently computable approximation to the sufficient statistics of a given model. The computational cost to compute the sufficient statistics is strongly reduced by selecting, for each data point, the relevant hidden causes. The approximation is applicable to a wide range of generative models and provides an interpretation of the benefits of preselection in terms of a variational EM approximation. To empirically show that the method maximizes the data likelihood, it is applied to different types of generative models including: a version of non-negative matrix factorization (NMF), a model for non-linear component extraction (MCA), and a linear generative model similar to sparse coding. The derived algorithms are applied to both artificial and realistic data, and are compared to other models in the literature. We find that the training scheme can reduce computational costs by orders of magnitude and allows for a reliable extraction of hidden causes. Keywords: maximum likelihood, deterministic approximations, variational EM, generative models, component extraction, multiple-cause models
same-paper 4 0.84518456 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
Author: Zhihua Zhang, Guang Dai, Congfu Xu, Michael I. Jordan
Abstract: Fisher linear discriminant analysis (FDA) and its kernel extension—kernel discriminant analysis (KDA)—are well known methods that consider dimensionality reduction and classification jointly. While widely deployed in practical problems, there are still unresolved issues surrounding their efficient implementation and their relationship with least mean squares procedures. In this paper we address these issues within the framework of regularized estimation. Our approach leads to a flexible and efficient implementation of FDA as well as KDA. We also uncover a general relationship between regularized discriminant analysis and ridge regression. This relationship yields variations on conventional FDA based on the pseudoinverse and a direct equivalence to an ordinary least squares estimator. Keywords: Fisher discriminant analysis, reproducing kernel, generalized eigenproblems, ridge regression, singular value decomposition, eigenvalue decomposition
5 0.52800715 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization
Author: Ming Yuan, Marten Wegkamp
Abstract: In this paper, we investigate the problem of binary classification with a reject option in which one can withhold the decision of classifying an observation at a cost lower than that of misclassification. Since the natural loss function is non-convex so that empirical risk minimization easily becomes infeasible, the paper proposes minimizing convex risks based on surrogate convex loss functions. A necessary and sufficient condition for infinite sample consistency (both risks share the same minimizer) is provided. Moreover, we show that the excess risk can be bounded through the excess surrogate risk under appropriate conditions. These bounds can be tightened by a generalized margin condition. The impact of the results is illustrated on several commonly used surrogate loss functions. Keywords: classification, convex surrogate loss, empirical risk minimization, generalized margin condition, reject option
6 0.50695848 60 jmlr-2010-Learnability, Stability and Uniform Convergence
7 0.50056028 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
8 0.47349977 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
9 0.46029755 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
10 0.45089626 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
11 0.44933358 102 jmlr-2010-Semi-Supervised Novelty Detection
12 0.44686076 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
13 0.44557551 24 jmlr-2010-Collective Inference for Extraction MRFs Coupled with Symmetric Clique Potentials
14 0.43360555 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values
15 0.43352562 25 jmlr-2010-Composite Binary Losses
16 0.43107283 109 jmlr-2010-Stochastic Composite Likelihood
17 0.42929301 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.42799306 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
19 0.42781973 11 jmlr-2010-An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data
20 0.42694941 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data