
86 nips-2012-Convex Multi-view Subspace Learning


Source: pdf

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results.


reference text

[1] J. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2010.

[2] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.

[3] T. De Bie, N. Cristianini, and R. Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–170, 2005.

[4] P. Dhillon, D. Foster, and L. Ungar. Multi-view learning of word embeddings via CCA. In NIPS, 2011.

[5] C. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In ECCV, 2010.

[6] L. Sigal, R. Memisevic, and D. Fleet. Shared kernel information embedding for discriminative inference. In CVPR, 2009.

[7] F. Bach and M. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2006.

[8] C. Archambeau and F. Bach. Sparse probabilistic projections. In NIPS, 2008.

[9] J. Viinikanoja, A. Klami, and S. Kaski. Variational Bayesian mixture of robust CCA. In ECML, 2010.

[10] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(1):1–37, 2011.

[11] X. Zhang, Y. Yu, M. White, R. Huang, and D. Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extensions. In AAAI, 2011.

[12] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. arXiv:0812.1869v1, 2008.

[13] N. Quadrianto and C. Lampert. Learning multi-view neighborhood preserving projections. In ICML, 2011.

[14] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, pages 982–990, 2010.

[15] M. White and D. Schuurmans. Generalized optimal reverse prediction. In AISTATS, 2012.

[16] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[17] D. Bradley and A. Bagnell. Convex coding. In UAI, 2009.

[18] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I and II, volumes 305 and 306. Springer-Verlag, 1993.

[19] D. Petz. A survey of trace inequalities. In Functional Analysis and Operator Theory, pages 287–298. Banach Center, 2004.

[20] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128:321–353, 2011.

[21] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[22] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.

[23] A. Tewari, P. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.

[24] X. Yuan and S. Yan. Forward basis selection for sparse approximation over dictionary. In AISTATS, 2012.

[25] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[26] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, to appear.

[27] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, 23:643–660, 2001.


Supplementary Material

A  Proof of Proposition 1

To show that (1) and (2) have equivalent solutions we exploit some developments from [28]. Let $N = (XX^\top)^{-1/2}$ and $M = (YY^\top)^{-1/2}$, hence
\[
\tilde Z \tilde Z^\top = \begin{bmatrix} I & N X Y^\top M \\ M Y X^\top N & I \end{bmatrix}.
\]
First consider (1). Its solution can be characterized by the maximal solutions to the generalized eigenvalue problem [3]:
\[
\begin{bmatrix} 0 & X Y^\top \\ Y X^\top & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}
= \lambda \begin{bmatrix} X X^\top & 0 \\ 0 & Y Y^\top \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},
\]
which, under the change of variables $u = Na$ and $v = Mb$ and then shifting the eigenvalues by 1, is equivalent to
\[
\begin{bmatrix} 0 & X Y^\top M \\ Y X^\top N & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
= \lambda \begin{bmatrix} N^{-1} & 0 \\ 0 & M^{-1} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
\;\equiv\;
\begin{bmatrix} 0 & N X Y^\top M \\ M Y X^\top N & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
= \lambda \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
\;\equiv\;
\tilde Z \tilde Z^\top \begin{bmatrix} a \\ b \end{bmatrix} = (\lambda + 1) \begin{bmatrix} a \\ b \end{bmatrix}.
\]
By setting $\begin{bmatrix} A \\ B \end{bmatrix}$ to the top $k$ eigenvectors of $\tilde Z \tilde Z^\top$, one can show that $U = NA$ and $V = MB$ provides an optimal solution to (1) [3].

By comparison, for (2), an optimal $H$ is given by $H = C^\dagger \tilde Z$, where $C^\dagger$ denotes the pseudo-inverse. Hence
\[
\min_{C,H} \|\tilde Z - CH\|_F^2 = \min_{C} \|(I - CC^\dagger)\tilde Z\|_F^2
= \operatorname{tr}(\tilde Z \tilde Z^\top) - \max_{\{C : C^\top C = I\}} \operatorname{tr}(C^\top \tilde Z \tilde Z^\top C).
\]
Here again the solution is given by the top $k$ eigenvectors of $\tilde Z \tilde Z^\top$ [29].¹

¹ [30] gave a similar but not equivalent formulation to (2), due to the lack of normalization.
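To make this construction concrete, here is a minimal numpy sketch (an editorial illustration, not the authors' implementation). It assumes $X \in \mathbb{R}^{n\times t}$ and $Y \in \mathbb{R}^{m\times t}$ are the two centered views, that the covariances are (regularized to be) invertible, and that $\tilde Z$ stacks the whitened views as $[NX; MY]$; the function names and the eps safeguard are added for illustration.

import numpy as np

def inv_sqrt(A, eps=1e-10):
    """Symmetric inverse square root via eigendecomposition (eps regularizes)."""
    w, V = np.linalg.eigh(A)
    w = np.maximum(w, eps)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_via_whitened_eig(X, Y, k):
    """Recover (U, V) from the top-k eigenvectors of Z~ Z~', as in Appendix A.

    X: (n, t) view-one data, Y: (m, t) view-two data, both centered.
    """
    N = inv_sqrt(X @ X.T)            # N = (X X')^{-1/2}
    M = inv_sqrt(Y @ Y.T)            # M = (Y Y')^{-1/2}
    Zt = np.vstack([N @ X, M @ Y])   # Z~ = [N X; M Y]
    w, E = np.linalg.eigh(Zt @ Zt.T)
    top = E[:, np.argsort(w)[::-1][:k]]   # top-k eigenvectors [A; B]
    A, B = top[:X.shape[0]], top[X.shape[0]:]
    return N @ A, M @ B              # U = N A, V = M B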

B  Proof for Lemma 3

First, observe that
\[
(3) = \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{H} L(CH; Z) + \alpha \|H\|_{2,1}
= \min_{\hat Z} L(\hat Z; Z) + \alpha \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1}
= \min_{\hat Z} L(\hat Z; Z) + \alpha \,|||\hat Z|||_*,
\]
where the last step follows from Proposition 2.

It only remains to show $|||\hat Z|||_* = \max_{\rho \ge 0} \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}$, which was established in [11]. We reproduce the proof in [11] for the convenience of the reader.

We will use two diagonal matrices, $I^X = \operatorname{diag}([\mathbf{1}_n; \mathbf{0}_m])$ and $I^Y = \operatorname{diag}([\mathbf{0}_n; \mathbf{1}_m])$, such that $I^X + I^Y = I_{m+n}$. Similarly, for $c \in \mathbb{R}^{m+n}$, we use $c^X$ (respectively $c^Y$) to denote $c_{1:m}$ (respectively $c_{m+1:m+n}$). The first stage is to prove that the dual norm is characterized by
\[
|||\Gamma||| = \min_{\rho \ge 0} \|D_\rho \Gamma\|_{\mathrm{sp}}, \tag{16}
\]
where the spectral norm $\|X\|_{\mathrm{sp}} = \sigma_{\max}(X)$ is the dual of the trace norm $\|X\|_{\mathrm{tr}}$. To this end, recall that
\[
|||\Gamma||| = \max_{c \in \mathcal{C},\, \|h\|_2 \le 1} c^\top \Gamma h
= \max_{c \in \mathcal{C}} \|\Gamma^\top c\|_2
= \max_{\{c : \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} \|\Gamma^\top c\|_2,
\]
giving
\[
|||\Gamma|||^2 = \max_{\{c : \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} c^\top \Gamma \Gamma^\top c
= \max_{\{\Phi : \Phi \succeq 0,\ \operatorname{tr}(\Phi I^X) \le \beta^2,\ \operatorname{tr}(\Phi I^Y) \le \gamma^2\}} \operatorname{tr}(\Phi \Gamma \Gamma^\top), \tag{17}
\]
using the fact that when maximizing a convex function, one of the extreme points in the constraint set $\{\Phi : \Phi \succeq 0,\ \operatorname{tr}(\Phi I^X) \le \beta^2,\ \operatorname{tr}(\Phi I^Y) \le \gamma^2\}$ must be optimal. Furthermore, since the extreme points have rank at most one in this case [31], the rank constraint $\operatorname{rank}(\Phi) = 1$ can be dropped.

Next, form the Lagrangian
\[
L(\Phi; \lambda, \nu, \Lambda) = \operatorname{tr}(\Phi \Gamma \Gamma^\top) + \operatorname{tr}(\Phi \Lambda)
+ \lambda(\beta^2 - \operatorname{tr}(\Phi I^X)) + \nu(\gamma^2 - \operatorname{tr}(\Phi I^Y)),
\]
where $\lambda \ge 0$, $\nu \ge 0$ and $\Lambda \succeq 0$. Note that the primal variable $\Phi$ can be eliminated by formulating the equilibrium condition $\partial L/\partial \Phi = \Gamma\Gamma^\top + \Lambda - \lambda I^X - \nu I^Y = 0$, which implies $\Gamma\Gamma^\top - \lambda I^X - \nu I^Y \preceq 0$. Therefore, we achieve the equivalent dual formulation
\[
(17) = \min_{\{\lambda,\nu :\ \lambda \ge 0,\ \nu \ge 0,\ \lambda I^X + \nu I^Y \succeq \Gamma\Gamma^\top\}} \beta^2\lambda + \gamma^2\nu. \tag{18}
\]
Now observe that for $\lambda \ge 0$ and $\nu \ge 0$, the relation $\Gamma\Gamma^\top \preceq \lambda I^X + \nu I^Y$ holds if and only if $D_{\nu/\lambda}\Gamma\Gamma^\top D_{\nu/\lambda} \preceq D_{\nu/\lambda}(\lambda I^X + \nu I^Y)D_{\nu/\lambda} = (\beta^2\lambda + \gamma^2\nu)I_{n+m}$, hence
\[
(18) = \min_{\{\lambda,\nu :\ \lambda \ge 0,\ \nu \ge 0,\ \|D_{\nu/\lambda}\Gamma\|_{\mathrm{sp}}^2 \le \beta^2\lambda + \gamma^2\nu\}} \beta^2\lambda + \gamma^2\nu. \tag{19}
\]
The third constraint must be met with equality at the optimum due to continuity, for otherwise we would be able to further decrease the objective, a contradiction to optimality. Note that a standard compactness argument would establish the existence of minimizers. So
\[
(19) = \min_{\lambda \ge 0,\ \nu \ge 0} \|D_{\nu/\lambda}\Gamma\|_{\mathrm{sp}}^2 = \min_{\rho \ge 0} \|D_\rho\Gamma\|_{\mathrm{sp}}^2.
\]
Finally, for the second stage, we characterize the target norm by observing that
\[
|||\hat Z|||_* = \max_{\Gamma : |||\Gamma||| \le 1} \operatorname{tr}(\Gamma^\top \hat Z)
= \max_{\rho \ge 0}\ \max_{\Gamma : \|D_\rho\Gamma\|_{\mathrm{sp}} \le 1} \operatorname{tr}(\Gamma^\top \hat Z) \tag{20}
\]
\[
= \max_{\rho \ge 0}\ \max_{\tilde\Gamma : \|\tilde\Gamma\|_{\mathrm{sp}} \le 1} \operatorname{tr}(\tilde\Gamma^\top D_\rho^{-1}\hat Z)
= \max_{\rho \ge 0} \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}, \tag{21}
\]
where (20) uses (16), and (21) exploits the conjugacy of the spectral and trace norms. The lemma follows.

C  Proof for Theorem 6 and Details of Recovery

Once an optimal reconstruction $\hat Z$ is obtained, we need to recover the optimal factors $C$ and $H$ that satisfy
\[
CH = \hat Z, \qquad \|H\|_{2,1} = |||\hat Z|||_*, \qquad \text{and } C_{:,i} \in \mathcal{C} \text{ for all } i. \tag{22}
\]
Note that by Proposition 2 and Lemma 3, the recovery problem (22) can be re-expressed as
\[
\min_{\{C,H :\ C_{:,i} \in \mathcal{C}\ \forall i,\ CH = \hat Z\}} \|H\|_{2,1}
= \max_{\{\Gamma : |||\Gamma||| \le 1\}} \operatorname{tr}(\Gamma^\top \hat Z). \tag{23}
\]
Our strategy will be to first recover the optimal dual solution $\Gamma$ given $\hat Z$, then use $\Gamma$ to recover $H$ and $C$.

First, to recover $\Gamma$ one can simply trace back from (21) to (20). Let $U\Sigma V^\top$ be the SVD of $D_\rho^{-1}\hat Z$. Then $\tilde\Gamma = UV^\top$ and $\Gamma = D_\rho^{-1}UV^\top$ automatically satisfies $|||\Gamma||| = 1$ while achieving the optimal trace in (23), because $\operatorname{tr}(\tilde\Gamma^\top D_\rho^{-1}\hat Z) = \operatorname{tr}(\Sigma) = \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}$.

Given such an optimal $\Gamma$, we are then able to characterize an optimal solution $(C, H)$. Introduce the set
\[
\mathcal{C}(\Gamma) := \arg\max_{c \in \mathcal{C}} \|\Gamma^\top c\|
= \left\{ c = \begin{bmatrix} a \\ b \end{bmatrix} : \|a\| = \beta,\ \|b\| = \gamma,\ \|\Gamma^\top c\| = 1 \right\}. \tag{24}
\]

Theorem 6. For a dual optimal $\Gamma$, $(C, H)$ solves recovery problem (22) if and only if $C_{:,i} \in \mathcal{C}(\Gamma)$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$, such that $CH = \hat Z$.

Proof. By (23), if $\hat Z = CH$, then
\[
|||\hat Z|||_* = \operatorname{tr}(\Gamma^\top \hat Z) = \operatorname{tr}(\Gamma^\top CH) = \sum_i H_{i,:}\,\Gamma^\top C_{:,i}. \tag{25}
\]
Note that for all $C_{:,i} \in \mathcal{C}$, $\|\Gamma^\top C_{:,i}\|_2 \le 1$ since $|||\Gamma||| \le 1$, and
\[
H_{i,:}\,\Gamma^\top C_{:,i} \le \|H_{i,:}\|_2\,\|\Gamma^\top C_{:,i}\|_2 \le \|H_{i,:}\|_2.
\]
If $(C, H)$ is optimal, then $(25) = \sum_i \|H_{i,:}\|_2$, hence implying $\|\Gamma^\top C_{:,i}\|_2 = 1$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$. On the other hand, if $\|\Gamma^\top C_{:,i}\|_2 = 1$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$, then we have $|||\hat Z|||_* = \sum_i \|H_{i,:}\|_2$, implying the optimality of $(C, H)$.
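As a brief aside, here is a minimal sketch (illustrative only, not the reference implementation) of the dual-recovery step described before Theorem 6: given the optimal $\rho$ and the corresponding scaling matrix $D_\rho$ (whose explicit form is given in the main text, so it is taken here as an input), $\Gamma$ is read off from the SVD of $D_\rho^{-1}\hat Z$.

import numpy as np

def recover_dual_gamma(Z_hat, D_rho):
    """Recover a dual optimal Gamma from Z_hat, given the optimal D_rho.

    With U S V' = svd(D_rho^{-1} Z_hat), Gamma = D_rho^{-1} U V' satisfies
    |||Gamma||| = 1 and attains tr(Gamma' Z_hat) = ||D_rho^{-1} Z_hat||_tr.
    """
    D_inv = np.linalg.inv(D_rho)
    U, s, Vt = np.linalg.svd(D_inv @ Z_hat, full_matrices=False)
    Gamma = D_inv @ U @ Vt
    return Gamma, s.sum()   # s.sum() equals the trace norm |||Z_hat|||_*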
Therefore, given $\Gamma$, the recovery problem (22) has been reduced to finding a vector $\mu$ and matrix $C$ such that $\mu \ge 0$, $C_{:,i} \in \mathcal{C}(\Gamma)$ for all $i$, and $C\operatorname{diag}(\mu)C^\top\Gamma = \hat Z$. Next we demonstrate how to incrementally recover $\mu$ and $C$. Denote the range of $C\operatorname{diag}(\mu)C^\top$ by the set
\[
\mathcal{S} := \left\{ \textstyle\sum_i \mu_i c_i c_i^\top : c_i \in \mathcal{C}(\Gamma),\ \mu \ge 0 \right\}.
\]
Note that $\mathcal{S}$ is the conic hull of (possibly infinitely many) rank-one matrices $\{cc^\top : c \in \mathcal{C}(\Gamma)\}$. However, by Carathéodory's theorem [32, §17], any matrix $K \in \mathcal{S}$ can be written as the conic combination of finitely many rank-one matrices of the form $\{cc^\top : c \in \mathcal{C}(\Gamma)\}$. Therefore, conceptually, the recovery problem has been reduced to finding a sparse set of non-negative weights, $\mu$, over the set of feasible basis vectors, $c \in \mathcal{C}(\Gamma)$.

To find these weights, we use a totally corrective "boosting" procedure [22] that is guaranteed to converge to a feasible solution. Consider the following objective function for boosting:
\[
f(K) = \|K\Gamma - \hat Z\|_F^2, \qquad \text{where } K \in \mathcal{S}.
\]
Note that $f$ is clearly a convex function in $K$ with a Lipschitz continuous gradient. Theorem 6 implies that an optimal solution of (22) corresponds precisely to those $K \in \mathcal{S}$ such that $f(K) = 0$. The idea behind totally corrective boosting [22] is to find a minimizer of $f$ (hence an optimal solution of (22)) incrementally. After initializing $K_0 = 0$, we iterate between two steps:

1. Weak learning step: find
\[
c_t \in \arg\min_{c \in \mathcal{C}(\Gamma)} \langle \nabla f(K_{t-1}),\, cc^\top \rangle
= \arg\max_{c \in \mathcal{C}(\Gamma)} c^\top Q c, \tag{26}
\]
where $Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1}\Gamma)\Gamma^\top$.

2. "Totally corrective" step:
\[
\mu^{(t)} = \arg\min_{\mu : \mu_i \ge 0} f\!\left( \textstyle\sum_{i=1}^{t} \mu_i c_i c_i^\top \right), \qquad
K_t = \textstyle\sum_{i=1}^{t} \mu_i^{(t)} c_i c_i^\top. \tag{27}
\]

Three key facts can be established about this boosting procedure: (i) each weak learning step can be solved efficiently; (ii) each totally corrective weight update can be solved efficiently; and (iii) $f(K_t) \to 0$, hence a feasible solution can be arbitrarily well approximated. Fact (iii) has been proved in [22], while (ii) is immediate because (27) is a standard quadratic program. Only (i) deserves some explanation. We show in the next subsection that $\mathcal{C}(\Gamma)$, defined in (24), can be much simplified, and consequently we give in the last subsection an efficient algorithm for the oracle problem (26) (the idea is similar to the one inherent in the proof of Lemma 3).
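Before turning to those details, the following sketch (an editorial illustration under the notation above, not the authors' code) implements the two-step loop: the weak oracle of (26) is treated as a user-supplied function, and the totally corrective step (27) is solved as a nonnegative least-squares problem, since $f$ is quadratic in $\mu$.

import numpy as np
from scipy.optimize import nnls

def totally_corrective_recovery(Z_hat, Gamma, weak_oracle, max_iter=100, tol=1e-8):
    """Drive f(K) = ||K Gamma - Z_hat||_F^2 to zero over the conic hull of
    rank-one matrices c c' with c in C(Gamma), as in Appendix C.

    weak_oracle(Q) is assumed to return argmax_{c in C(Gamma)} c' Q c,
    i.e. the oracle of (26); it is supplied by the caller (see Appendix C.2).
    """
    d = Z_hat.shape[0]
    K = np.zeros((d, d))
    basis, mu = [], np.zeros(0)
    for _ in range(max_iter):
        Q = 2.0 * (Z_hat - K @ Gamma) @ Gamma.T          # Q = -grad f(K_{t-1})
        basis.append(weak_oracle(Q))                     # weak learning step (26)
        # Totally corrective step (27): nonnegative least squares in mu,
        # since f(sum_i mu_i c_i c_i') is a convex quadratic in mu.
        A = np.column_stack([(np.outer(c, c) @ Gamma).ravel() for c in basis])
        mu, _ = nnls(A, Z_hat.ravel())
        K = sum(m * np.outer(c, c) for m, c in zip(mu, basis))
        if np.linalg.norm(K @ Gamma - Z_hat) ** 2 < tol:
            break
    # Per Theorem 6, C = [c_1, ..., c_t] and H_{i,:} = mu_i * c_i' Gamma.
    return K, basis, mu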

C.1  Simplification of $\mathcal{C}(\Gamma)$

Since $\mathcal{C}(\Gamma)$ is the set of optimal solutions to
\[
\max_{c \in \mathcal{C}} \|\Gamma^\top c\|, \tag{28}
\]
our idea is to first obtain an optimal solution to its dual problem, and then use it to recover the optimal $c$ via the KKT conditions. In fact, its dual problem has been stated in (18). Once we obtain the optimal $\rho$ in (21) by solving (8), it is straightforward to backtrack and recover the optimal $\lambda$ and $\nu$ in (18). Then by the KKT conditions [32, §28], $c$ is an optimal solution to (28) if and only if
\[
\|c^X\| = \beta, \qquad \|c^Y\| = \gamma, \tag{29}
\]
\[
\langle R,\, cc^\top \rangle = 0, \qquad \text{where } R = \lambda I^X + \nu I^Y - \Gamma\Gamma^\top \succeq 0. \tag{30}
\]
Since (30) holds iff $c$ is in the null space of $R$, we find an orthonormal basis $\{n_1, \ldots, n_k\}$ for this null space. Assume
\[
c = N\alpha, \qquad \text{where } N = [n_1, \ldots, n_k] = \begin{bmatrix} N^X \\ N^Y \end{bmatrix}, \quad \alpha = (\alpha_1, \ldots, \alpha_k)^\top. \tag{31}
\]
By (29), we have
\[
0 = \gamma^2\|c^X\|^2 - \beta^2\|c^Y\|^2
= \alpha^\top\!\left( \gamma^2 (N^X)^\top N^X - \beta^2 (N^Y)^\top N^Y \right)\alpha. \tag{32}
\]
The idea is to go through some linear transformations for simplification. Perform the eigendecomposition
\[
U\Sigma U^\top = \gamma^2 (N^X)^\top N^X - \beta^2 (N^Y)^\top N^Y,
\]
where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)$ and $U \in \mathbb{R}^{k\times k}$ is orthonormal. Let $v = U^\top\alpha$. Then by (31),
\[
c = NUv, \tag{33}
\]
and (32) is satisfied if and only if
\[
v^\top\Sigma v = \sum_i \sigma_i v_i^2 = 0. \tag{34}
\]
Finally, (29) implies
\[
\beta^2 + \gamma^2 = \|c\|^2 = v^\top U^\top N^\top N U v = v^\top v. \tag{35}
\]
In summary, by (33) we have
\[
\mathcal{C}(\Gamma) = \{NUv : v \text{ satisfies (34) and (35)}\}
= \left\{ NUv : v^\top\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2 \right\}. \tag{36}
\]

C.2  Solving the weak oracle problem (26)

The weak oracle needs to solve
\[
\max_{c \in \mathcal{C}(\Gamma)} c^\top Q c, \qquad \text{where } Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1}\Gamma)\Gamma^\top.
\]
By (36), this optimization is equivalent to
\[
\max_{\{v :\ v^\top\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2\}} v^\top T v, \qquad \text{where } T = U^\top N^\top Q N U.
\]
Using the same technique as in the proof of Lemma 3, we have
\[
\max_{\{v :\ v^\top v = 1,\ v^\top\Sigma v = 0\}} v^\top T v
\;\overset{(\text{let } H = vv^\top)}{=}\; \max_{\{H \succeq 0,\ \operatorname{tr}(H) = 1,\ \operatorname{tr}(\Sigma H) = 0\}} \operatorname{tr}(TH)
\;\overset{(\text{Lagrange dual})}{=}\; \min_{\{\tau,\omega :\ \tau\Sigma + \omega I - T \succeq 0\}} \omega
= \min_{\tau \in \mathbb{R}} \lambda_{\max}(T - \tau\Sigma),
\]
where $\lambda_{\max}$ stands for the maximum eigenvalue. Since $\lambda_{\max}$ is a convex function over real symmetric matrices, the last line-search problem is convex in $\tau$, hence can be solved globally and efficiently. Given the optimal $\tau$ and the optimal objective value $\omega$, the optimal $v$ can be recovered using a similar trick as in Appendix C.1. Let the null space of $\omega I + \tau\Sigma - T$ be spanned by $\hat N = [\hat n_1, \ldots, \hat n_s]$. Then find any $\alpha \in \mathbb{R}^s$ such that $v := \hat N\alpha$ satisfies $\|v\|^2 = \beta^2 + \gamma^2$ and $v^\top\Sigma v = 0$.
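As an illustration of this reduction (again a sketch, not the reference implementation), the oracle can be implemented as a scalar search over $\tau$ followed by a null-space extraction. The version below assumes the null space of $\omega I + \tau\Sigma - T$ is effectively one-dimensional at the optimum and simply rescales the corresponding eigenvector; in general, any combination with $v^\top\Sigma v = 0$ can be used, as noted above. The argument `sigma` holds the eigenvalues $\sigma_1, \ldots, \sigma_k$ from (32), and `N`, `U`, `beta`, `gamma` follow the notation of Appendix C.1.

import numpy as np
from scipy.optimize import minimize_scalar

def weak_oracle(Q, N, U, sigma, beta, gamma):
    """Sketch of the oracle (26): maximize c' Q c over
    C(Gamma) = {N U v : v' Sigma v = 0, ||v||^2 = beta^2 + gamma^2}.
    """
    T = U.T @ N.T @ Q @ N @ U
    S = np.diag(sigma)

    # Convex one-dimensional search: omega(tau) = lambda_max(T - tau * Sigma).
    lam_max = lambda tau: np.linalg.eigvalsh(T - tau * S)[-1]
    res = minimize_scalar(lam_max)
    tau, omega = res.x, res.fun

    # Recover v from the (approximate) null space of omega*I + tau*Sigma - T,
    # then rescale to ||v||^2 = beta^2 + gamma^2.
    w, V = np.linalg.eigh(omega * np.eye(T.shape[0]) + tau * S - T)
    v = V[:, np.argmin(np.abs(w))]
    v = v * np.sqrt(beta**2 + gamma**2) / np.linalg.norm(v)
    return N @ U @ v    # c = N U v

When wiring this into the boosting sketch above, the extra arguments would be bound in advance, e.g. via `functools.partial(weak_oracle, N=N, U=U, sigma=sigma, beta=beta, gamma=gamma)`.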

Auxiliary References

[28] L. Sun, S. Ji, and J. Ye. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE TPAMI, 33(1):194–200, 2011.

[29] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321–357, 1993.

[30] B. Long, P. Yu, and Z. Zhang. A general model for multiple view unsupervised learning. In ICDM, 2008.

[31] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.

[32] R. Rockafellar. Convex Analysis. Princeton University Press, 1970.