
86 nips-2012-Convex Multi-view Subspace Learning


Source: pdf

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results.


reference text

[1] J. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer, 2010.

[2] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16:2639–2664, 2004.

[3] T. De Bie, N. Cristianini, and R. Rosipal. Eigenproblems in pattern recognition. In Handbook of Geometric Computing, pages 129–170, 2005.

[4] P. Dhillon, D. Foster, and L. Ungar. Multi-view learning of word embeddings via CCA. In NIPS, 2011.

[5] C. Lampert and O. Krömer. Weakly-paired maximum covariance analysis for multimodal dimensionality reduction and transfer learning. In ECCV, 2010.

[6] L. Sigal, R. Memisevic, and D. Fleet. Shared kernel information embedding for discriminative inference. In CVPR, 2009.

[7] F. Bach and M. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2006.

[8] C. Archambeau and F. Bach. Sparse probabilistic projections. In NIPS, 2008.

[9] J. Viinikanoja, A. Klami, and S. Kaski. Variational Bayesian mixture of robust CCA. In ECML, 2010.

[10] E. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(1):1–37, 2011.

[11] X. Zhang, Y. Yu, M. White, R. Huang, and D. Schuurmans. Convex sparse coding, subspace learning, and semi-supervised extensions. In AAAI, 2011.

[12] F. Bach, J. Mairal, and J. Ponce. Convex sparse matrix factorizations. arXiv:0812.1869v1, 2008.

[13] N. Quadrianto and C. Lampert. Learning multi-view neighborhood preserving projections. In ICML, 2011.

[14] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces with structured sparsity. In NIPS, pages 982–990, 2010.

[15] M. White and D. Schuurmans. Generalized optimal reverse prediction. In AISTATS, 2012.

[16] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[17] D. Bradley and A. Bagnell. Convex coding. In UAI, 2009.

[18] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I and II, volumes 305 and 306. Springer-Verlag, 1993.

[19] D. Petz. A survey of trace inequalities. In Functional Analysis and Operator Theory, pages 287–298. Banach Center, 2004.

[20] S. Ma, D. Goldfarb, and L. Chen. Fixed point and Bregman iterative methods for matrix rank minimization. Mathematical Programming, 128:321–353, 2011.

[21] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.

[22] X. Zhang, Y. Yu, and D. Schuurmans. Accelerated training for matrix-norm regularization: A boosting approach. In NIPS, 2012.

[23] A. Tewari, P. Ravikumar, and I. S. Dhillon. Greedy algorithms for structurally constrained high dimensional problems. In NIPS, 2011.

[24] X. Yuan and S. Yan. Forward basis selection for sparse approximation over dictionary. In AISTATS, 2012.

[25] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[26] D. Goldfarb, S. Ma, and K. Scheinberg. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, to appear.

[27] A. Georghiades, P. Belhumeur, and D. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE TPAMI, 23:643–660, 2001.


Supplementary Material

A  Proof of Proposition 1

To show that (1) and (2) have equivalent solutions we exploit some developments from [28]. Let $N = (XX^\top)^{-1/2}$ and $M = (YY^\top)^{-1/2}$, hence
\[
\tilde Z \tilde Z^\top = \begin{bmatrix} I & N X Y^\top M \\ M Y X^\top N & I \end{bmatrix}.
\]
First consider (1). Its solution can be characterized by the maximal solutions to the generalized eigenvalue problem [3]:
\[
\begin{bmatrix} 0 & X Y^\top \\ Y X^\top & 0 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix}
= \lambda \begin{bmatrix} X X^\top & 0 \\ 0 & Y Y^\top \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix},
\]
which, under the change of variables $u = Na$ and $v = Mb$ and then shifting the eigenvalues by 1, is equivalent to
\[
\begin{bmatrix} 0 & X Y^\top M \\ Y X^\top N & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
= \lambda \begin{bmatrix} N^{-1} & 0 \\ 0 & M^{-1} \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
\;\equiv\;
\begin{bmatrix} 0 & N X Y^\top M \\ M Y X^\top N & 0 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
= \lambda \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix}
\;\equiv\;
\tilde Z \tilde Z^\top \begin{bmatrix} a \\ b \end{bmatrix} = (\lambda + 1) \begin{bmatrix} a \\ b \end{bmatrix}.
\]
By setting $\begin{bmatrix} A \\ B \end{bmatrix}$ to the top $k$ eigenvectors of $\tilde Z \tilde Z^\top$, one can show that $U = NA$ and $V = MB$ provides an optimal solution to (1) [3].

By comparison, for (2), an optimal $H$ is given by $H = C^\dagger \tilde Z$, where $C^\dagger$ denotes the pseudo-inverse. Hence
\[
\min_{C,H} \|\tilde Z - CH\|_F^2 = \min_{C} \|(I - CC^\dagger)\tilde Z\|_F^2
= \operatorname{tr}(\tilde Z \tilde Z^\top) - \max_{\{C : C^\top C = I\}} \operatorname{tr}(C^\top \tilde Z \tilde Z^\top C).
\]
Here again the solution is given by the top $k$ eigenvectors of $\tilde Z \tilde Z^\top$ [29].¹

¹ [30] gave a similar but not equivalent formulation to (2), due to the lack of normalization.
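To make this construction concrete, here is a minimal numpy sketch (an editorial illustration, not the authors' implementation). It assumes $X \in \mathbb{R}^{n\times t}$ and $Y \in \mathbb{R}^{m\times t}$ are the two centered views, that the covariances are (regularized to be) invertible, and that $\tilde Z$ stacks the whitened views as $[NX; MY]$; the function names and the eps safeguard are added for illustration.

import numpy as np

def inv_sqrt(A, eps=1e-10):
    """Symmetric inverse square root via eigendecomposition (eps regularizes)."""
    w, V = np.linalg.eigh(A)
    w = np.maximum(w, eps)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_via_whitened_eig(X, Y, k):
    """Recover (U, V) from the top-k eigenvectors of Z~ Z~', as in Appendix A.

    X: (n, t) view-one data, Y: (m, t) view-two data, both centered.
    """
    N = inv_sqrt(X @ X.T)            # N = (X X')^{-1/2}
    M = inv_sqrt(Y @ Y.T)            # M = (Y Y')^{-1/2}
    Zt = np.vstack([N @ X, M @ Y])   # Z~ = [N X; M Y]
    w, E = np.linalg.eigh(Zt @ Zt.T)
    top = E[:, np.argsort(w)[::-1][:k]]   # top-k eigenvectors [A; B]
    A, B = top[:X.shape[0]], top[X.shape[0]:]
    return N @ A, M @ B              # U = N A, V = M B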

B  Proof for Lemma 3

First, observe that
\[
(3) = \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{H} L(CH; Z) + \alpha \|H\|_{2,1}
= \min_{\hat Z} L(\hat Z; Z) + \alpha \min_{\{C : C_{:,i} \in \mathcal{C}\}} \min_{\{H : CH = \hat Z\}} \|H\|_{2,1}
= \min_{\hat Z} L(\hat Z; Z) + \alpha \,|||\hat Z|||_*,
\]
where the last step follows from Proposition 2.

It only remains to show $|||\hat Z|||_* = \max_{\rho \ge 0} \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}$, which was established in [11]. We reproduce the proof in [11] for the convenience of the reader.

We will use two diagonal matrices, $I^X = \operatorname{diag}([\mathbf{1}_n; \mathbf{0}_m])$ and $I^Y = \operatorname{diag}([\mathbf{0}_n; \mathbf{1}_m])$, such that $I^X + I^Y = I_{m+n}$. Similarly, for $c \in \mathbb{R}^{m+n}$, we use $c^X$ (respectively $c^Y$) to denote $c_{1:m}$ (respectively $c_{m+1:m+n}$). The first stage is to prove that the dual norm is characterized by
\[
|||\Gamma||| = \min_{\rho \ge 0} \|D_\rho \Gamma\|_{\mathrm{sp}}, \tag{16}
\]
where the spectral norm $\|X\|_{\mathrm{sp}} = \sigma_{\max}(X)$ is the dual of the trace norm $\|X\|_{\mathrm{tr}}$. To this end, recall that
\[
|||\Gamma||| = \max_{c \in \mathcal{C},\, \|h\|_2 \le 1} c^\top \Gamma h
= \max_{c \in \mathcal{C}} \|\Gamma^\top c\|_2
= \max_{\{c : \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} \|\Gamma^\top c\|_2,
\]
giving
\[
|||\Gamma|||^2 = \max_{\{c : \|c^X\|_2 = \beta,\ \|c^Y\|_2 = \gamma\}} c^\top \Gamma \Gamma^\top c
= \max_{\{\Phi : \Phi \succeq 0,\ \operatorname{tr}(\Phi I^X) \le \beta^2,\ \operatorname{tr}(\Phi I^Y) \le \gamma^2\}} \operatorname{tr}(\Phi \Gamma \Gamma^\top), \tag{17}
\]
using the fact that when maximizing a convex function, one of the extreme points in the constraint set $\{\Phi : \Phi \succeq 0,\ \operatorname{tr}(\Phi I^X) \le \beta^2,\ \operatorname{tr}(\Phi I^Y) \le \gamma^2\}$ must be optimal. Furthermore, since the extreme points have rank at most one in this case [31], the rank constraint $\operatorname{rank}(\Phi) = 1$ can be dropped.

Next, form the Lagrangian
\[
L(\Phi; \lambda, \nu, \Lambda) = \operatorname{tr}(\Phi \Gamma \Gamma^\top) + \operatorname{tr}(\Phi \Lambda)
+ \lambda(\beta^2 - \operatorname{tr}(\Phi I^X)) + \nu(\gamma^2 - \operatorname{tr}(\Phi I^Y)),
\]
where $\lambda \ge 0$, $\nu \ge 0$ and $\Lambda \succeq 0$. Note that the primal variable $\Phi$ can be eliminated by formulating the equilibrium condition $\partial L/\partial \Phi = \Gamma\Gamma^\top + \Lambda - \lambda I^X - \nu I^Y = 0$, which implies $\Gamma\Gamma^\top - \lambda I^X - \nu I^Y \preceq 0$. Therefore, we achieve the equivalent dual formulation
\[
(17) = \min_{\{\lambda,\nu :\ \lambda \ge 0,\ \nu \ge 0,\ \lambda I^X + \nu I^Y \succeq \Gamma\Gamma^\top\}} \beta^2\lambda + \gamma^2\nu. \tag{18}
\]
Now observe that for $\lambda \ge 0$ and $\nu \ge 0$, the relation $\Gamma\Gamma^\top \preceq \lambda I^X + \nu I^Y$ holds if and only if $D_{\nu/\lambda}\Gamma\Gamma^\top D_{\nu/\lambda} \preceq D_{\nu/\lambda}(\lambda I^X + \nu I^Y)D_{\nu/\lambda} = (\beta^2\lambda + \gamma^2\nu)I_{n+m}$, hence
\[
(18) = \min_{\{\lambda,\nu :\ \lambda \ge 0,\ \nu \ge 0,\ \|D_{\nu/\lambda}\Gamma\|_{\mathrm{sp}}^2 \le \beta^2\lambda + \gamma^2\nu\}} \beta^2\lambda + \gamma^2\nu. \tag{19}
\]
The third constraint must be met with equality at the optimum due to continuity, for otherwise we would be able to further decrease the objective, a contradiction to optimality. Note that a standard compactness argument would establish the existence of minimizers. So
\[
(19) = \min_{\lambda \ge 0,\ \nu \ge 0} \|D_{\nu/\lambda}\Gamma\|_{\mathrm{sp}}^2 = \min_{\rho \ge 0} \|D_\rho\Gamma\|_{\mathrm{sp}}^2.
\]
Finally, for the second stage, we characterize the target norm by observing that
\[
|||\hat Z|||_* = \max_{\Gamma : |||\Gamma||| \le 1} \operatorname{tr}(\Gamma^\top \hat Z)
= \max_{\rho \ge 0}\ \max_{\Gamma : \|D_\rho\Gamma\|_{\mathrm{sp}} \le 1} \operatorname{tr}(\Gamma^\top \hat Z) \tag{20}
\]
\[
= \max_{\rho \ge 0}\ \max_{\tilde\Gamma : \|\tilde\Gamma\|_{\mathrm{sp}} \le 1} \operatorname{tr}(\tilde\Gamma^\top D_\rho^{-1}\hat Z)
= \max_{\rho \ge 0} \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}, \tag{21}
\]
where (20) uses (16), and (21) exploits the conjugacy of the spectral and trace norms. The lemma follows.

C  Proof for Theorem 6 and Details of Recovery

Once an optimal reconstruction $\hat Z$ is obtained, we need to recover the optimal factors $C$ and $H$ that satisfy
\[
CH = \hat Z, \qquad \|H\|_{2,1} = |||\hat Z|||_*, \qquad \text{and } C_{:,i} \in \mathcal{C} \text{ for all } i. \tag{22}
\]
Note that by Proposition 2 and Lemma 3, the recovery problem (22) can be re-expressed as
\[
\min_{\{C,H :\ C_{:,i} \in \mathcal{C}\ \forall i,\ CH = \hat Z\}} \|H\|_{2,1}
= \max_{\{\Gamma : |||\Gamma||| \le 1\}} \operatorname{tr}(\Gamma^\top \hat Z). \tag{23}
\]
Our strategy will be to first recover the optimal dual solution $\Gamma$ given $\hat Z$, then use $\Gamma$ to recover $H$ and $C$.

First, to recover $\Gamma$ one can simply trace back from (21) to (20). Let $U\Sigma V^\top$ be the SVD of $D_\rho^{-1}\hat Z$. Then $\tilde\Gamma = UV^\top$ and $\Gamma = D_\rho^{-1}UV^\top$ automatically satisfies $|||\Gamma||| = 1$ while achieving the optimal trace in (23), because $\operatorname{tr}(\tilde\Gamma^\top D_\rho^{-1}\hat Z) = \operatorname{tr}(\Sigma) = \|D_\rho^{-1}\hat Z\|_{\mathrm{tr}}$.

Given such an optimal $\Gamma$, we are then able to characterize an optimal solution $(C, H)$. Introduce the set
\[
\mathcal{C}(\Gamma) := \arg\max_{c \in \mathcal{C}} \|\Gamma^\top c\|
= \left\{ c = \begin{bmatrix} a \\ b \end{bmatrix} : \|a\| = \beta,\ \|b\| = \gamma,\ \|\Gamma^\top c\| = 1 \right\}. \tag{24}
\]

Theorem 6. For a dual optimal $\Gamma$, $(C, H)$ solves recovery problem (22) if and only if $C_{:,i} \in \mathcal{C}(\Gamma)$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$, such that $CH = \hat Z$.

Proof. By (23), if $\hat Z = CH$, then
\[
|||\hat Z|||_* = \operatorname{tr}(\Gamma^\top \hat Z) = \operatorname{tr}(\Gamma^\top CH) = \sum_i H_{i,:}\,\Gamma^\top C_{:,i}. \tag{25}
\]
Note that for all $C_{:,i} \in \mathcal{C}$, $\|\Gamma^\top C_{:,i}\|_2 \le 1$ since $|||\Gamma||| \le 1$, and
\[
H_{i,:}\,\Gamma^\top C_{:,i} \le \|H_{i,:}\|_2\,\|\Gamma^\top C_{:,i}\|_2 \le \|H_{i,:}\|_2.
\]
If $(C, H)$ is optimal, then $(25) = \sum_i \|H_{i,:}\|_2$, hence implying $\|\Gamma^\top C_{:,i}\|_2 = 1$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$. On the other hand, if $\|\Gamma^\top C_{:,i}\|_2 = 1$ and $H_{i,:} = \|H_{i,:}\|_2\, C_{:,i}^\top\Gamma$, then we have $|||\hat Z|||_* = \sum_i \|H_{i,:}\|_2$, implying the optimality of $(C, H)$.
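As a brief aside, here is a minimal sketch (illustrative only, not the reference implementation) of the dual-recovery step described before Theorem 6: given the optimal $\rho$ and the corresponding scaling matrix $D_\rho$ (whose explicit form is given in the main text, so it is taken here as an input), $\Gamma$ is read off from the SVD of $D_\rho^{-1}\hat Z$.

import numpy as np

def recover_dual_gamma(Z_hat, D_rho):
    """Recover a dual optimal Gamma from Z_hat, given the optimal D_rho.

    With U S V' = svd(D_rho^{-1} Z_hat), Gamma = D_rho^{-1} U V' satisfies
    |||Gamma||| = 1 and attains tr(Gamma' Z_hat) = ||D_rho^{-1} Z_hat||_tr.
    """
    D_inv = np.linalg.inv(D_rho)
    U, s, Vt = np.linalg.svd(D_inv @ Z_hat, full_matrices=False)
    Gamma = D_inv @ U @ Vt
    return Gamma, s.sum()   # s.sum() equals the trace norm |||Z_hat|||_*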
Therefore, given $\Gamma$, the recovery problem (22) has been reduced to finding a vector $\mu$ and matrix $C$ such that $\mu \ge 0$, $C_{:,i} \in \mathcal{C}(\Gamma)$ for all $i$, and $C\operatorname{diag}(\mu)C^\top\Gamma = \hat Z$. Next we demonstrate how to incrementally recover $\mu$ and $C$. Denote the range of $C\operatorname{diag}(\mu)C^\top$ by the set
\[
\mathcal{S} := \left\{ \textstyle\sum_i \mu_i c_i c_i^\top : c_i \in \mathcal{C}(\Gamma),\ \mu \ge 0 \right\}.
\]
Note that $\mathcal{S}$ is the conic hull of (possibly infinitely many) rank-one matrices $\{cc^\top : c \in \mathcal{C}(\Gamma)\}$. However, by Carathéodory's theorem [32, §17], any matrix $K \in \mathcal{S}$ can be written as the conic combination of finitely many rank-one matrices of the form $\{cc^\top : c \in \mathcal{C}(\Gamma)\}$. Therefore, conceptually, the recovery problem has been reduced to finding a sparse set of non-negative weights, $\mu$, over the set of feasible basis vectors, $c \in \mathcal{C}(\Gamma)$.

To find these weights, we use a totally corrective "boosting" procedure [22] that is guaranteed to converge to a feasible solution. Consider the following objective function for boosting:
\[
f(K) = \|K\Gamma - \hat Z\|_F^2, \qquad \text{where } K \in \mathcal{S}.
\]
Note that $f$ is clearly a convex function in $K$ with a Lipschitz continuous gradient. Theorem 6 implies that an optimal solution of (22) corresponds precisely to those $K \in \mathcal{S}$ such that $f(K) = 0$. The idea behind totally corrective boosting [22] is to find a minimizer of $f$ (hence an optimal solution of (22)) incrementally. After initializing $K_0 = 0$, we iterate between two steps:

1. Weak learning step: find
\[
c_t \in \arg\min_{c \in \mathcal{C}(\Gamma)} \langle \nabla f(K_{t-1}),\, cc^\top \rangle
= \arg\max_{c \in \mathcal{C}(\Gamma)} c^\top Q c, \tag{26}
\]
where $Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1}\Gamma)\Gamma^\top$.

2. "Totally corrective" step:
\[
\mu^{(t)} = \arg\min_{\mu : \mu_i \ge 0} f\!\left( \textstyle\sum_{i=1}^{t} \mu_i c_i c_i^\top \right), \qquad
K_t = \textstyle\sum_{i=1}^{t} \mu_i^{(t)} c_i c_i^\top. \tag{27}
\]

Three key facts can be established about this boosting procedure: (i) each weak learning step can be solved efficiently; (ii) each totally corrective weight update can be solved efficiently; and (iii) $f(K_t) \to 0$, hence a feasible solution can be arbitrarily well approximated. Fact (iii) has been proved in [22], while (ii) is immediate because (27) is a standard quadratic program. Only (i) deserves some explanation. We show in the next subsection that $\mathcal{C}(\Gamma)$, defined in (24), can be much simplified, and consequently we give in the last subsection an efficient algorithm for the oracle problem (26) (the idea is similar to the one inherent in the proof of Lemma 3).
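Before turning to those details, the following sketch (an editorial illustration under the notation above, not the authors' code) implements the two-step loop: the weak oracle of (26) is treated as a user-supplied function, and the totally corrective step (27) is solved as a nonnegative least-squares problem, since $f$ is quadratic in $\mu$.

import numpy as np
from scipy.optimize import nnls

def totally_corrective_recovery(Z_hat, Gamma, weak_oracle, max_iter=100, tol=1e-8):
    """Drive f(K) = ||K Gamma - Z_hat||_F^2 to zero over the conic hull of
    rank-one matrices c c' with c in C(Gamma), as in Appendix C.

    weak_oracle(Q) is assumed to return argmax_{c in C(Gamma)} c' Q c,
    i.e. the oracle of (26); it is supplied by the caller (see Appendix C.2).
    """
    d = Z_hat.shape[0]
    K = np.zeros((d, d))
    basis, mu = [], np.zeros(0)
    for _ in range(max_iter):
        Q = 2.0 * (Z_hat - K @ Gamma) @ Gamma.T          # Q = -grad f(K_{t-1})
        basis.append(weak_oracle(Q))                     # weak learning step (26)
        # Totally corrective step (27): nonnegative least squares in mu,
        # since f(sum_i mu_i c_i c_i') is a convex quadratic in mu.
        A = np.column_stack([(np.outer(c, c) @ Gamma).ravel() for c in basis])
        mu, _ = nnls(A, Z_hat.ravel())
        K = sum(m * np.outer(c, c) for m, c in zip(mu, basis))
        if np.linalg.norm(K @ Gamma - Z_hat) ** 2 < tol:
            break
    # Per Theorem 6, C = [c_1, ..., c_t] and H_{i,:} = mu_i * c_i' Gamma.
    return K, basis, mu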

C.1  Simplification of $\mathcal{C}(\Gamma)$

Since $\mathcal{C}(\Gamma)$ is the set of optimal solutions to
\[
\max_{c \in \mathcal{C}} \|\Gamma^\top c\|, \tag{28}
\]
our idea is to first obtain an optimal solution to its dual problem, and then use it to recover the optimal $c$ via the KKT conditions. In fact, its dual problem has been stated in (18). Once we obtain the optimal $\rho$ in (21) by solving (8), it is straightforward to backtrack and recover the optimal $\lambda$ and $\nu$ in (18). Then by the KKT conditions [32, §28], $c$ is an optimal solution to (28) if and only if
\[
\|c^X\| = \beta, \qquad \|c^Y\| = \gamma, \tag{29}
\]
\[
\langle R,\, cc^\top \rangle = 0, \qquad \text{where } R = \lambda I^X + \nu I^Y - \Gamma\Gamma^\top \succeq 0. \tag{30}
\]
Since (30) holds iff $c$ is in the null space of $R$, we find an orthonormal basis $\{n_1, \ldots, n_k\}$ for this null space. Assume
\[
c = N\alpha, \qquad \text{where } N = [n_1, \ldots, n_k] = \begin{bmatrix} N^X \\ N^Y \end{bmatrix}, \quad \alpha = (\alpha_1, \ldots, \alpha_k)^\top. \tag{31}
\]
By (29), we have
\[
0 = \gamma^2\|c^X\|^2 - \beta^2\|c^Y\|^2
= \alpha^\top\!\left( \gamma^2 (N^X)^\top N^X - \beta^2 (N^Y)^\top N^Y \right)\alpha. \tag{32}
\]
The idea is to go through some linear transformations for simplification. Perform the eigendecomposition
\[
U\Sigma U^\top = \gamma^2 (N^X)^\top N^X - \beta^2 (N^Y)^\top N^Y,
\]
where $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)$ and $U \in \mathbb{R}^{k\times k}$ is orthonormal. Let $v = U^\top\alpha$. Then by (31),
\[
c = NUv, \tag{33}
\]
and (32) is satisfied if and only if
\[
v^\top\Sigma v = \sum_i \sigma_i v_i^2 = 0. \tag{34}
\]
Finally, (29) implies
\[
\beta^2 + \gamma^2 = \|c\|^2 = v^\top U^\top N^\top N U v = v^\top v. \tag{35}
\]
In summary, by (33) we have
\[
\mathcal{C}(\Gamma) = \{NUv : v \text{ satisfies (34) and (35)}\}
= \left\{ NUv : v^\top\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2 \right\}. \tag{36}
\]

C.2  Solving the weak oracle problem (26)

The weak oracle needs to solve
\[
\max_{c \in \mathcal{C}(\Gamma)} c^\top Q c, \qquad \text{where } Q = -\nabla f(K_{t-1}) = 2(\hat Z - K_{t-1}\Gamma)\Gamma^\top.
\]
By (36), this optimization is equivalent to
\[
\max_{\{v :\ v^\top\Sigma v = 0,\ \|v\|^2 = \beta^2 + \gamma^2\}} v^\top T v, \qquad \text{where } T = U^\top N^\top Q N U.
\]
Using the same technique as in the proof of Lemma 3, we have
\[
\max_{\{v :\ v^\top v = 1,\ v^\top\Sigma v = 0\}} v^\top T v
\;\overset{(\text{let } H = vv^\top)}{=}\; \max_{\{H \succeq 0,\ \operatorname{tr}(H) = 1,\ \operatorname{tr}(\Sigma H) = 0\}} \operatorname{tr}(TH)
\;\overset{(\text{Lagrange dual})}{=}\; \min_{\{\tau,\omega :\ \tau\Sigma + \omega I - T \succeq 0\}} \omega
= \min_{\tau \in \mathbb{R}} \lambda_{\max}(T - \tau\Sigma),
\]
where $\lambda_{\max}$ stands for the maximum eigenvalue. Since $\lambda_{\max}$ is a convex function over real symmetric matrices, the last line-search problem is convex in $\tau$, hence can be solved globally and efficiently. Given the optimal $\tau$ and the optimal objective value $\omega$, the optimal $v$ can be recovered using a similar trick as in Appendix C.1. Let the null space of $\omega I + \tau\Sigma - T$ be spanned by $\hat N = [\hat n_1, \ldots, \hat n_s]$. Then find any $\alpha \in \mathbb{R}^s$ such that $v := \hat N\alpha$ satisfies $\|v\|^2 = \beta^2 + \gamma^2$ and $v^\top\Sigma v = 0$.
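As an illustration of this reduction (again a sketch, not the reference implementation), the oracle can be implemented as a scalar search over $\tau$ followed by a null-space extraction. The version below assumes the null space of $\omega I + \tau\Sigma - T$ is effectively one-dimensional at the optimum and simply rescales the corresponding eigenvector; in general, any combination with $v^\top\Sigma v = 0$ can be used, as noted above. The argument `sigma` holds the eigenvalues $\sigma_1, \ldots, \sigma_k$ from (32), and `N`, `U`, `beta`, `gamma` follow the notation of Appendix C.1.

import numpy as np
from scipy.optimize import minimize_scalar

def weak_oracle(Q, N, U, sigma, beta, gamma):
    """Sketch of the oracle (26): maximize c' Q c over
    C(Gamma) = {N U v : v' Sigma v = 0, ||v||^2 = beta^2 + gamma^2}.
    """
    T = U.T @ N.T @ Q @ N @ U
    S = np.diag(sigma)

    # Convex one-dimensional search: omega(tau) = lambda_max(T - tau * Sigma).
    lam_max = lambda tau: np.linalg.eigvalsh(T - tau * S)[-1]
    res = minimize_scalar(lam_max)
    tau, omega = res.x, res.fun

    # Recover v from the (approximate) null space of omega*I + tau*Sigma - T,
    # then rescale to ||v||^2 = beta^2 + gamma^2.
    w, V = np.linalg.eigh(omega * np.eye(T.shape[0]) + tau * S - T)
    v = V[:, np.argmin(np.abs(w))]
    v = v * np.sqrt(beta**2 + gamma**2) / np.linalg.norm(v)
    return N @ U @ v    # c = N U v

When wiring this into the boosting sketch above, the extra arguments would be bound in advance, e.g. via `functools.partial(weak_oracle, N=N, U=U, sigma=sigma, beta=beta, gamma=gamma)`.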

Auxiliary References

[28] L. Sun, S. Ji, and J. Ye. Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE TPAMI, 33(1):194–200, 2011.

[29] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321–357, 1993.

[30] B. Long, P. Yu, and Z. Zhang. A general model for multiple view unsupervised learning. In ICDM, 2008.

[31] G. Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research, 23(2):339–358, 1998.

[32] R. Rockafellar. Convex Analysis. Princeton University Press, 1970.