nips nips2008 nips2008-57 knowledge-graph by maker-knowledge-mining

57 nips-2008-Deflation Methods for Sparse PCA


Source: pdf

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. [sent-2, score-0.053]

2 In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. [sent-3, score-0.119]

3 We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. [sent-5, score-0.171]

4 The goal of PCA is to extract several principal components, linear combinations of input variables that together best account for the variance in a data set. [sent-8, score-0.127]

5 Often, PCA is formulated as an eigenvalue decomposition problem: each eigenvector of the sample covariance matrix of a data set corresponds to the loadings or coefficients of a principal component. [sent-9, score-0.308]

6 A common approach to solving this partial eigenvalue decomposition is to iteratively alternate between two subproblems: rank-one variance maximization and matrix deflation. [sent-10, score-0.156]

7 The first subproblem involves finding the maximum-variance loadings vector for a given sample covariance matrix or, equivalently, finding the leading eigenvector of the matrix. [sent-11, score-0.239]

8 The second involves modifying the covariance matrix to eliminate the influence of that eigenvector. [sent-12, score-0.079]

9 Each principal component is a linear combination of all variables, and the loadings are typically non-zero. [sent-14, score-0.161]

10 Sparse PCA [8, 3, 16, 17, 6, 18, 1, 2, 9, 10, 12] injects sparsity into the PCA process by searching for “pseudo-eigenvectors”, sparse loadings that explain a maximal amount of variance in the data. [sent-16, score-0.256]

11 In analogy to the PCA setting, many authors attempt to solve the sparse PCA problem by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. [sent-17, score-0.197]

12 In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. [sent-20, score-0.119]

13 To rectify the situation, we first develop several heuristic deflation alternatives with more desirable properties. [sent-21, score-0.04]

14 We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. [sent-22, score-0.171]

15 In Section 2 we discuss matrix deflation as it relates to PCA and sparse PCA. [sent-25, score-0.11]

16 We examine the failings of typical PCA deflation in the sparse setting and develop several alternative deflation procedures. [sent-26, score-0.104]

17 In Section 3, we present a reformulation of the standard iterative sparse PCA optimization problem and derive a generalized deflation procedure to solve the reformulation. [sent-27, score-0.124]

18 2 Deflation methods A matrix deflation modifies a matrix to eliminate the influence of a given eigenvector, typically by setting the associated eigenvalue to zero (see [14] for a more detailed discussion). [sent-32, score-0.116]

19 We will first discuss deflation in the context of PCA and then consider its extension to sparse PCA. [sent-33, score-0.084]

20 1 Hotelling’s deflation and PCA In the PCA setting, the goal is to extract the r leading eigenvectors of the sample covariance matrix, A_0 ∈ S^p_+, as its eigenvectors are equivalent to the loadings of the first r principal components. [sent-35, score-0.279]

21 On the t-th iteration of the deflation method, we first extract the leading eigenvector of A_{t-1}, x_t = argmax_{x: x^T x = 1} x^T A_{t-1} x (1), and we then use Hotelling’s deflation to annihilate x_t: A_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T (2). [sent-37, score-1.898]

22 The deflation step ensures that the t+1-st leading eigenvector of A_0 is the leading eigenvector of A_t. [sent-38, score-0.164]
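
As a concrete illustration of Eqs. (1) and (2), the following NumPy sketch (an assumed reconstruction for this summary page, not code from the paper) extracts r principal component loadings by alternating leading-eigenvector extraction with Hotelling’s deflation.

```python
import numpy as np

def hotelling_pca(A0, r):
    """Extract r leading eigenvectors of A0 by alternating Eq. (1) and Eq. (2)."""
    A = A0.copy()
    loadings = []
    for _ in range(r):
        # Eq. (1): leading eigenvector of the current (deflated) matrix.
        eigvals, eigvecs = np.linalg.eigh(A)
        x = eigvecs[:, np.argmax(eigvals)]
        # Eq. (2): Hotelling's deflation sets the extracted eigenvalue to zero.
        A = A - np.outer(x, x) @ A @ np.outer(x, x)
        loadings.append(x)
    return np.column_stack(loadings)

# Usage on a small random covariance matrix.
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))
X = hotelling_pca(np.cov(Y, rowvar=False), r=3)
```

Because each extracted x here is a true eigenvector, the deflated matrix stays positive semidefinite; the sparse setting discussed below breaks exactly this property.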

23 Â x_j = A x_j - x_j x_j^T A x_j x_j^T x_j = A x_j - x_j x_j^T A x_j = λ_j x_j - λ_j x_j = 0 x_j. [sent-62, score-0.075]

24 Thus, Hotelling’s deflation preserves all eigenvectors of a matrix and annihilates a selected eigenvalue while maintaining all others. [sent-64, score-0.142]

25 In the case of our iterative deflation method, annihilating the t-th leading eigenvector of A0 renders the t + 1-st leading eigenvector dominant in the next round. [sent-66, score-0.217]

26 2 Hotelling’s deflation and sparse PCA In the sparse PCA setting, we seek r sparse loadings which together capture the maximum amount of variance in the data. [sent-68, score-0.412]

27 Most authors [1, 9, 16, 12] adopt the additional constraint that the loadings be produced in a sequential fashion. [sent-69, score-0.113]

28 Typically, Hotelling’s deflation is utilized by substituting an extracted pseudo-eigenvector for a true eigenvector in the deflation step of Eq. (2). [sent-74, score-0.077]

29 Let x = (1, 0)^T, a sparse pseudo-eigenvector, and Ĉ = C - x x^T C x x^T, the corresponding deflated matrix. [sent-84, score-0.084]

30 That S^p_+ is not closed under pseudo-eigenvector Hotelling’s deflation is a serious failing, for most iterative sparse PCA methods assume a positive-semidefinite matrix on each iteration. [sent-89, score-0.125]
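
This failure is easy to reproduce numerically. The 2×2 matrix C below is a hypothetical positive-semidefinite example chosen for illustration (not necessarily the one used in the paper); deflating it by the sparse pseudo-eigenvector x = (1, 0)^T produces a matrix with a negative eigenvalue, and x is not annihilated either.

```python
import numpy as np

# Hypothetical PSD matrix; x is a sparse unit vector but not an eigenvector of C.
C = np.array([[2.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 0.0])

# Pseudo-eigenvector Hotelling's deflation: C_hat = C - x x^T C x x^T.
C_hat = C - np.outer(x, x) @ C @ np.outer(x, x)

print(np.linalg.eigvalsh(C))      # both eigenvalues are nonnegative
print(np.linalg.eigvalsh(C_hat))  # one eigenvalue is negative: C_hat is not PSD
print(C_hat @ x)                  # nonzero: x is not orthogonal to the deflated matrix
```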

31 A second, related shortcoming of pseudo-eigenvector Hotelling’s deflation is its failure to render a pseudoeigenvector orthogonal to a deflated matrix. [sent-90, score-0.039]

32 If A is our matrix of interest, x is our pseudo-eigenvector with variance λ = x^T A x, and Â = A - x x^T A x x^T is our deflated matrix, then Â x = A x - x x^T A x x^T x = A x - λ x, which is zero iff x is a true eigenvector. [sent-91, score-0.087]

33 3 Alternative deflation techniques In this section, we will attempt to rectify the failings of pseudo-eigenvector Hotelling’s deflation by considering several alternative deflation techniques better suited to the sparse PCA setting. [sent-98, score-0.157]

34 1 Projection deflation Given a data matrix Y ∈ R^{n×p} and an arbitrary unit vector x ∈ R^p, an intuitive way to remove the contribution of x from Y is to project Y onto the orthocomplement of the space spanned by x: Ŷ = Y(I - x x^T). [sent-104, score-0.1]

35 However, in the general case, when x_t is not a true eigenvector, projection deflation maintains the desirable properties that were lost to Hotelling’s deflation. [sent-106, score-0.527]

36 For example, positive-semidefiniteness is preserved: ∀y, y^T A_t y = y^T (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T) y = z^T A_{t-1} z ≥ 0, where z = (I - x_t x_t^T) y. [sent-107, score-1.338]

37 Moreover, A_t is rendered left and right orthogonal to x_t, as (I - x_t x_t^T) x_t = x_t - x_t = 0 and A_t is symmetric. [sent-109, score-0.931]

38 Projection deflation therefore annihilates all covariances with x_t: ∀v, v^T A_t x_t = x_t^T A_t v = 0. [sent-110, score-0.928]
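
A minimal sketch of projection deflation, assuming the pseudo-eigenvector x_t has unit norm; the update A_t = (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T) is the one implied by the quadratic form above, and the checks mirror the two properties just claimed.

```python
import numpy as np

def projection_deflation(A, x):
    """Projection deflation: A_t = (I - x x^T) A (I - x x^T), for unit-norm x."""
    P = np.eye(A.shape[0]) - np.outer(x, x)
    return P @ A @ P

# Deflate a covariance matrix by a sparse (non-eigenvector) unit vector.
rng = np.random.default_rng(1)
A = np.cov(rng.standard_normal((200, 4)), rowvar=False)
x = np.zeros(4)
x[0] = 1.0
At = projection_deflation(A, x)

print(np.linalg.eigvalsh(At).min() >= -1e-12)           # positive-semidefiniteness preserved
print(np.allclose(At @ x, 0), np.allclose(x @ At, 0))   # x annihilated on both sides
```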

39 2 Schur complement deflation Since our goal in matrix deflation is to eliminate the influence, as measured through variance and covariances, of a newly discovered pseudo-eigenvector, it is reasonable to consider the conditional variance of our data variables given a pseudo-principal component. [sent-113, score-0.289]

40 While this conditional variance is non-trivial to compute in general, it takes on a simple closed form when the variables are normally distributed. [sent-114, score-0.061]

41 If W has covariance matrix Σ, then (W, Wx) has covariance matrix V = [Σ, Σx; x^T Σ, x^T Σ x], and Var(W | Wx) = Σ - Σ x x^T Σ / (x^T Σ x) whenever x^T Σ x ≠ 0 [15]. [sent-116, score-0.116]

42 That is, the conditional variance is the Schur complement of the vector variance x^T Σ x in the full covariance matrix V. [sent-117, score-0.28]

43 By substituting sample covariance matrices for their population counterparts, we arrive at a new deflation technique, Schur complement deflation: A_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t) (5). Like projection deflation, Schur complement deflation preserves positive-semidefiniteness. [sent-118, score-1.215]
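
A sketch of the Schur complement update in Eq. (5); the only assumption is that x_t^T A_{t-1} x_t > 0 so that the division is well defined.

```python
import numpy as np

def schur_complement_deflation(A, x):
    """Schur complement deflation (Eq. 5): A_t = A - A x x^T A / (x^T A x)."""
    Ax = A @ x
    return A - np.outer(Ax, Ax) / (x @ Ax)

# The deflated matrix has no remaining variance or covariance along x.
rng = np.random.default_rng(2)
A = np.cov(rng.standard_normal((200, 4)), rowvar=False)
x = np.zeros(4)
x[1] = 1.0
At = schur_complement_deflation(A, x)
print(np.allclose(At @ x, 0))   # x is left and right orthogonal to A_t
```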

44 To see this, suppose A_{t-1} ∈ S^p_+. [sent-119, score-0.446]

45 Then, ∀v, v^T A_t v = v^T A_{t-1} v - (v^T A_{t-1} x_t x_t^T A_{t-1} v) / (x_t^T A_{t-1} x_t) ≥ 0, as v^T A_{t-1} v · x_t^T A_{t-1} x_t - (v^T A_{t-1} x_t)^2 ≥ 0 by the Cauchy-Schwarz inequality and x_t^T A_{t-1} x_t ≥ 0 as A_{t-1} ∈ S^p_+. [sent-120, score-1.784]

46 Furthermore, Schur complement deflation renders x_t left and right orthogonal to A_t, since A_t is symmetric and A_t x_t = A_{t-1} x_t - A_{t-1} x_t (x_t^T A_{t-1} x_t) / (x_t^T A_{t-1} x_t) = A_{t-1} x_t - A_{t-1} x_t = 0. [sent-121, score-2.833]

47 Additionally, Schur complement deflation reduces to Hotelling’s deflation when x_t is an eigenvector of A_{t-1} with eigenvalue λ_t ≠ 0: A_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t) = A_{t-1} - λ_t x_t x_t^T λ_t / λ_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T. [sent-122, score-2.879]

48 While we motivated Schur complement deflation with a Gaussianity assumption, the technique admits a more general interpretation as a column projection of a data matrix. [sent-123, score-0.189]

49 Suppose Y ∈ R^{n×p} is a mean-centered data matrix, x ∈ R^p has unit norm, and Ŷ = (I - Y x x^T Y^T / ||Y x||^2) Y, the projection of the columns of Y onto the orthocomplement of the space spanned by the pseudo-principal component, Y x. [sent-124, score-0.142]

50 If Y has sample covariance matrix A, then the sample covariance of Ŷ is given by (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2)^T (I - Y x x^T Y^T / ||Y x||^2) Y = (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2) Y = A - A x x^T A / (x^T A x). [sent-125, score-0.09]
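
This column-projection interpretation can be checked numerically. The sketch below adopts the 1/n sample covariance convention so that the algebra above goes through; under that assumption, the covariance of the projected data equals the Schur complement update.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
Y = rng.standard_normal((n, p))
Y = Y - Y.mean(axis=0)                 # mean-centered data matrix
A = Y.T @ Y / n                        # sample covariance, 1/n convention
x = np.zeros(p)
x[0] = 1.0                             # unit-norm pseudo-eigenvector

# Project the columns of Y onto the orthocomplement of span{Y x}.
u = Y @ x
Y_hat = Y - np.outer(u, u) @ Y / (u @ u)

cov_projected = Y_hat.T @ Y_hat / n
cov_schur = A - np.outer(A @ x, A @ x) / (x @ A @ x)
print(np.allclose(cov_projected, cov_schur))   # True: both equal Schur complement deflation
```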

51 Whenever we deal with a sequence of non-orthogonal vectors, we must take care to distinguish between the variance explained by a vector and the additional variance explained, given all previous vectors. [sent-129, score-0.167]

52 These concepts are equivalent in the PCA setting, as true eigenvectors of a matrix are orthogonal, but, in general, the vectors extracted by sparse PCA will not be orthogonal. [sent-130, score-0.157]

53 The additional variance explained by the t-th pseudo-eigenvector, x_t, is equivalent to the variance explained by the component of x_t orthogonal to the space spanned by all previous pseudo-eigenvectors, q_t = x_t - P_{t-1} x_t, where P_{t-1} is the orthogonal projection onto the space spanned by x_1, ..., x_{t-1}. [sent-131, score-2.42]

54 On each deflation step, therefore, we only want to eliminate the variance associated with qt . [sent-135, score-0.265]

55 Annihilating the full vector xt will often lead to “double counting” and could re-introduce components parallel to previously annihilated vectors. [sent-136, score-0.446]

56 (Eq. (2)) for t = 1 and can be easily expressed in terms of a running Gram-Schmidt decomposition for t > 1: Orthogonalized Hotelling’s deflation (OHD): q_t = (I - Q_{t-1} Q_{t-1}^T) x_t / ||(I - Q_{t-1} Q_{t-1}^T) x_t|| (6), A_t = A_{t-1} - q_t q_t^T A_{t-1} q_t q_t^T, where q_1 = x_1, and q_1, ..., [sent-149, score-0.915]

57 ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}. [sent-156, score-0.052]
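
A sketch of one OHD round following Eq. (6): the new pseudo-eigenvector is first orthogonalized against the running basis Q_{t-1} (a Gram-Schmidt step), and Hotelling’s deflation is then applied to the normalized residual q_t. This is an illustrative reconstruction, not code from the paper.

```python
import numpy as np

def ohd_step(A_prev, x_t, Q_prev=None):
    """One round of Orthogonalized Hotelling's deflation (Eq. 6).

    A_prev : deflated matrix A_{t-1}
    x_t    : newly extracted pseudo-eigenvector
    Q_prev : p x (t-1) matrix with columns q_1, ..., q_{t-1}, or None on round 1
    """
    if Q_prev is None:
        q = x_t / np.linalg.norm(x_t)            # q_1 = x_1 (round 1 is plain Hotelling)
    else:
        r = x_t - Q_prev @ (Q_prev.T @ x_t)      # remove components along previous q's
        q = r / np.linalg.norm(r)
    A_next = A_prev - np.outer(q, q) @ A_prev @ np.outer(q, q)
    Q_next = q[:, None] if Q_prev is None else np.column_stack([Q_prev, q])
    return A_next, Q_next

# Two rounds with hypothetical, non-orthogonal sparse vectors.
rng = np.random.default_rng(6)
A0 = np.cov(rng.standard_normal((100, 4)), rowvar=False)
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0])
A1, Q1 = ohd_step(A0, x1)
A2, Q2 = ohd_step(A1, x2, Q1)
```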

58 However, the same principles may be applied to projection deflation to generate an orthogonalized variant that inherits its desirable properties. [sent-162, score-0.206]

59 Schur complement deflation is unique in that it preserves orthogonality in all subsequent rounds. [sent-163, score-0.123]

60 That is, if a vector v is orthogonal to A_{t-1} for any t, then A_t v = A_{t-1} v - A_{t-1} x_t x_t^T A_{t-1} v / (x_t^T A_{t-1} x_t) = 0, as A_{t-1} v = 0. [sent-164, score-0.931]

61 Orthogonalized Schur complement deflation is equivalent to Schur complement deflation. [sent-168, score-0.2]

62 We may write xt = ot + pt , where pt is in the subspace spanned by all previously extracted pseudo-eigenvectors and ot is orthogonal to this subspace. [sent-171, score-1.089]

63 Then we know that A_{t-1} p_t = 0, as p_t is a linear combination of x_1, ..., x_{t-1}, each of which Schur complement deflation has rendered orthogonal to A_{t-1}. [sent-172, score-0.26]

64 Thus, x_t^T A_{t-1} x_t = p_t^T A_{t-1} p_t + o_t^T A_{t-1} p_t + p_t^T A_{t-1} o_t + o_t^T A_{t-1} o_t = o_t^T A_{t-1} o_t. [sent-176, score-1.135]

65 Further, A_{t-1} x_t x_t^T A_{t-1} = A_{t-1} p_t p_t^T A_{t-1} + A_{t-1} p_t o_t^T A_{t-1} + A_{t-1} o_t p_t^T A_{t-1} + A_{t-1} o_t o_t^T A_{t-1} = A_{t-1} o_t o_t^T A_{t-1}. [sent-177, score-1.135]

66 Hence, A_t = A_{t-1} - A_{t-1} o_t o_t^T A_{t-1} / (o_t^T A_{t-1} o_t) = A_{t-1} - A_{t-1} q_t q_t^T A_{t-1} / (q_t^T A_{t-1} q_t), as q_t = o_t / ||o_t||. [sent-178, score-1.344]
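
The equivalence can also be sanity-checked numerically. The sketch below re-implements the Eq. (5) update inline and uses two hypothetical non-orthogonal sparse vectors: after the first round, deflating by x_2 and by its orthogonalized counterpart q_2 yields the same matrix.

```python
import numpy as np

def scd(A, x):
    """Schur complement deflation, Eq. (5)."""
    Ax = A @ x
    return A - np.outer(Ax, Ax) / (x @ Ax)

rng = np.random.default_rng(4)
A0 = np.cov(rng.standard_normal((300, 5)), rowvar=False)
x1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)

A1 = scd(A0, x1)                       # round 1: A_1 x_1 = 0 afterwards

q2 = x2 - x1 * (x1 @ x2)               # component of x_2 orthogonal to x_1
q2 = q2 / np.linalg.norm(q2)

print(np.allclose(scd(A1, x2), scd(A1, q2)))   # True: orthogonalized SCD equals SCD
```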

67 In this section, we explore a more principled alternative: reformulating the sparse PCA optimization problem to explicitly reflect our maximization objective on each round. [sent-184, score-0.119]

68 Recall that the goal of sparse PCA is to find r cardinality-constrained pseudo-eigenvectors which together explain the most variance in the data. [sent-185, score-0.157]

69 If we additionally constrain the sparse loadings to be generated sequentially, as in the PCA setting and the previous section, then a greedy approach of maximizing the additional variance of each new vector naturally suggests itself. [sent-186, score-0.258]

70 On round t, the additional variance of a vector x is given by q^T A_0 q / (q^T q), where A_0 is the data covariance matrix, q = (I - P_{t-1}) x, and P_{t-1} is the projection onto the space spanned by previous pseudo-eigenvectors x_1, ..., x_{t-1}. [sent-187, score-0.247]

71 As q^T q = x^T (I - P_{t-1})(I - P_{t-1}) x = x^T (I - P_{t-1}) x, maximizing additional variance is equivalent to solving a cardinality-constrained maximum generalized eigenvalue problem: max_x x^T (I - P_{t-1}) A_0 (I - P_{t-1}) x subject to x^T (I - P_{t-1}) x = 1 and Card(x) ≤ k_t. [sent-191, score-0.159]

72 ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}. [sent-195, score-0.052]

73 Writing I - P_{t-1} = I - Σ_{s=1}^{t-1} q_s q_s^T = Π_{s=1}^{t-1} (I - q_s q_s^T) suggests a generalized deflation technique that leads to the solution of Eq. [sent-199, score-0.182]

74 We imbed the technique into the following algorithm for sparse PCA: Algorithm 1 Generalized Deflation Method for Sparse PCA. Given: A_0 ∈ S^p_+, r ∈ N, {k_1, ..., k_r} ⊂ N. [sent-201, score-0.105]

75 For t = 1, ..., r: x_t ← argmax_{x: x^T B_{t-1} x = 1, Card(x) ≤ k_t} x^T A_{t-1} x; q_t ← B_{t-1} x_t; A_t ← (I - q_t q_t^T) A_{t-1} (I - q_t q_t^T); B_t ← B_{t-1} (I - q_t q_t^T); x_t ← x_t / ||x_t||. Return: {x_1, ..., x_r}. [sent-209, score-3.079]
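
Below is a sketch of Algorithm 1, with B initialized to the identity so that q_1 = x_1. The cardinality-constrained subproblem on each round is hard in general and the paper pairs its deflation with existing sparse PCA solvers (e.g., GSLDA in the experiments); constrained_max_placeholder here is only an illustrative stand-in that truncates the leading eigenvector to its k_t largest-magnitude coordinates and rescales to satisfy the B-norm constraint.

```python
import numpy as np

def constrained_max_placeholder(A, B, k):
    """Heuristic stand-in for argmax x^T A x s.t. x^T B x = 1, Card(x) <= k.

    Not the solver from the paper: truncate the leading eigenvector of A to its
    k largest-magnitude coordinates, then rescale to meet the B-norm constraint.
    """
    eigvals, eigvecs = np.linalg.eigh(A)
    v = eigvecs[:, -1]                        # leading eigenvector of A
    support = np.argsort(np.abs(v))[-k:]      # keep the k largest |entries|
    x = np.zeros_like(v)
    x[support] = v[support]
    return x / np.sqrt(x @ B @ x)             # assumes x^T B x > 0 for this sketch

def generalized_deflation_sparse_pca(A0, r, ks):
    """Sketch of Algorithm 1: generalized deflation method for sparse PCA."""
    p = A0.shape[0]
    A, B = A0.copy(), np.eye(p)
    loadings = []
    for t in range(r):
        x = constrained_max_placeholder(A, B, ks[t])
        q = B @ x                             # unit norm: x^T B x = 1 and B is a projection
        A = (np.eye(p) - np.outer(q, q)) @ A @ (np.eye(p) - np.outer(q, q))
        B = B @ (np.eye(p) - np.outer(q, q))
        loadings.append(x / np.linalg.norm(x))
    return np.column_stack(loadings)

# Usage on random stand-in data with the pit props dimensions (13 variables, 180 rows).
rng = np.random.default_rng(5)
Y = rng.standard_normal((180, 13))
X = generalized_deflation_sparse_pca(np.cov(Y, rowvar=False), r=6, ks=[4] * 6)
```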

76 1 Pit props dataset The pit props dataset [5] with 13 variables and 180 observations has become a de facto standard for benchmarking sparse PCA methods. [sent-216, score-0.244]

77 To demonstrate the disparate behavior of differing deflation methods, we utilize each sparse PCA algorithm and deflation technique to successively extract six sparse loadings, each constrained to have cardinality less than or equal to kt = 4. [sent-217, score-0.242]

78 We report the additional variances explained by each sparse vector in Table 2 and the cumulative percentage variance explained on each iteration in Table 3. [sent-218, score-0.26]

79 For reference, the first 6 true principal components of the pit props dataset capture 87% of the variance. [sent-219, score-0.143]
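
The tables report additional and cumulative percentage variance explained. Because sparse loadings are not orthogonal, these quantities are computed on orthogonalized components, consistent with the definition in Section 3; the sketch below is an assumed implementation of that metric (QR supplies the running Gram-Schmidt basis), not evaluation code released with the paper.

```python
import numpy as np

def additional_variances(A0, X):
    """Additional variance explained by each column of X (non-orthogonal loadings).

    The t-th value is q_t^T A0 q_t, where q_t is the unit-norm component of x_t
    orthogonal to the span of x_1, ..., x_{t-1}.
    """
    Q, _ = np.linalg.qr(X)                     # columns: running Gram-Schmidt basis
    return np.array([q @ A0 @ q for q in Q.T])

def cumulative_percentage_variance(A0, X):
    """Cumulative percentage of total variance (trace of A0) explained so far."""
    return 100.0 * np.cumsum(additional_variances(A0, X)) / np.trace(A0)
```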

80 Table 2: Additional variance explained by each of the first 6 sparse loadings extracted from the Pit Props dataset. [sent-292, score-0.292]

81 4% of the variance, while the best performing methods, Schur complement deflation and generalized deflation, explain approximately 79% of the variance each. [sent-294, score-0.198]

82 Projection deflation and its orthogonalized variant also outperform Hotelling’s deflation, while orthogonalized Hotelling’s shows the worst performance with only 63. [sent-295, score-0.222]

83 Generalized deflation and the two projection deflations dominate, with GD achieving the maximum cumulative variance explained on each round. [sent-298, score-0.182]

84 In contrast, the more standard Hotelling’s and orthogonalized Hotelling’s underperform the remaining techniques. [sent-299, score-0.111]

85 Table 3: Cumulative percentage variance explained by the first 6 sparse loadings extracted from the Pit Props dataset. [sent-372, score-0.309]

86 2 Gene expression data The Berkeley Drosophila Transcription Network Project (BDTNP) 3D gene expression data [4] contains gene expression levels measured in each nucleus of developing Drosophila embryos and averaged across many embryos and developmental stages. [sent-374, score-0.078]

87 [Table 4 header fragment: GSLDA additional variance explained, columns PD, SCD, OHD, OPD] [sent-386, score-0.106]

88 [Table 4 header fragment: GSLDA cumulative percentage variance, columns PD, SCD, OHD, OPD] [sent-434, score-0.1]

89 Table 4: Additional variance and cumulative percentage variance explained by the first 8 sparse loadings of GSLDA on the BDTNP VirtualEmbryo. [sent-474, score-0.375]

90 The generalized deflation technique performs best, achieving the largest additional variance on every round and a final cumulative variance of 79. [sent-476, score-0.222]

91 Schur complement deflation, projection deflation, and orthogonalized projection deflation all perform comparably, explaining roughly 77% of the total variance after 8 rounds. [sent-478, score-0.408]

92 In last place are the standard Hotelling’s and orthogonalized Hotelling’s deflations, both of which explain less than 76% of variance after 8 rounds. [sent-479, score-0.184]

93 5 Conclusion In this work, we have exposed the theoretical and empirical shortcomings of Hotelling’s deflation in the sparse PCA setting and developed several alternative methods more suitable for non-eigenvector deflation. [sent-480, score-0.084]

94 Notably, the utility of these procedures is not limited to the sparse PCA setting. [sent-481, score-0.084]

95 Indeed, the methods presented can be applied to any of a number of constrained eigendecomposition-based problems, including sparse canonical correlation analysis [13] and linear discriminant analysis [10]. [sent-482, score-0.084]

96 Loadings and correlations in the interpretation of principal components. [sent-511, score-0.048]

97 Two case studies in the application of principal components. [sent-535, score-0.048]

98 Spectral bounds for sparse PCA: Exact and greedy algorithms. [sent-556, score-0.084]

99 Simon, Low-rank approximations with sparse factors I: Basic algorithms and error analysis. [sent-607, score-0.084]

100 Simon, Low-rank approximations with sparse factors II: Penalized methods with discrete Newton-like iterations. [sent-616, score-0.084]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ation', 0.638), ('xt', 0.446), ('hotelling', 0.324), ('pca', 0.188), ('qt', 0.183), ('ot', 0.143), ('schur', 0.138), ('pt', 0.13), ('orthogonalized', 0.111), ('ohd', 0.101), ('complement', 0.1), ('loadings', 0.099), ('xxt', 0.097), ('gslda', 0.091), ('sparse', 0.084), ('opd', 0.071), ('scd', 0.071), ('gd', 0.071), ('projection', 0.068), ('axj', 0.062), ('variance', 0.061), ('eigenvector', 0.06), ('ax', 0.057), ('hd', 0.057), ('sp', 0.052), ('dcpca', 0.051), ('props', 0.051), ('pd', 0.05), ('pc', 0.049), ('principal', 0.048), ('pit', 0.044), ('eigenvalue', 0.043), ('spanned', 0.041), ('axxt', 0.04), ('orthogonal', 0.039), ('seldom', 0.035), ('qs', 0.034), ('covariance', 0.032), ('explained', 0.031), ('eigenvectors', 0.03), ('rectify', 0.027), ('matrix', 0.026), ('generalized', 0.025), ('rp', 0.025), ('ated', 0.024), ('preserves', 0.023), ('leading', 0.022), ('cumulative', 0.022), ('bt', 0.022), ('technique', 0.021), ('eliminate', 0.021), ('gene', 0.021), ('niteness', 0.021), ('annihilates', 0.02), ('annihilating', 0.02), ('ations', 0.02), ('bdtnp', 0.02), ('failings', 0.02), ('moghaddam', 0.02), ('orthocomplement', 0.02), ('reformulating', 0.02), ('newly', 0.02), ('cardinality', 0.019), ('eigenvalues', 0.019), ('extract', 0.018), ('round', 0.018), ('renders', 0.018), ('borrowed', 0.018), ('zha', 0.018), ('aspremont', 0.018), ('embryos', 0.018), ('percentage', 0.017), ('extracted', 0.017), ('covariances', 0.016), ('jolliffe', 0.016), ('torres', 0.016), ('kt', 0.016), ('subtasks', 0.015), ('drosophila', 0.015), ('sriperumbudur', 0.015), ('xj', 0.015), ('iterative', 0.015), ('maximization', 0.015), ('argmax', 0.014), ('axi', 0.014), ('component', 0.014), ('additional', 0.014), ('card', 0.014), ('inherits', 0.014), ('simon', 0.014), ('de', 0.014), ('sequentially', 0.013), ('desirable', 0.013), ('onto', 0.013), ('techniques', 0.013), ('proposition', 0.012), ('explain', 0.012), ('reformulate', 0.012), ('iteratively', 0.011), ('orthonormal', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.23068459 206 nips-2008-Sequential effects: Superstition or rational behavior?

Author: Angela J. Yu, Jonathan D. Cohen

Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1

3 0.20395653 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu

Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1

4 0.16031653 112 nips-2008-Kernel Measures of Independence for non-iid Data

Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola

Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1

5 0.14886783 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1

6 0.11176921 195 nips-2008-Regularized Policy Iteration

7 0.092825674 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

8 0.078464977 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

9 0.078255042 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

10 0.076336749 238 nips-2008-Theory of matching pursuit

11 0.071404755 119 nips-2008-Learning a discriminative hidden part model for human action recognition

12 0.070478305 216 nips-2008-Sparse probabilistic projections

13 0.068794988 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

14 0.066628873 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

15 0.062834777 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.058528014 200 nips-2008-Robust Kernel Principal Component Analysis

17 0.056892619 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

18 0.053077329 15 nips-2008-Adaptive Martingale Boosting

19 0.049399637 31 nips-2008-Bayesian Exponential Family PCA

20 0.045514788 62 nips-2008-Differentiable Sparse Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.115), (1, 0.051), (2, -0.053), (3, -0.002), (4, -0.041), (5, 0.256), (6, -0.041), (7, 0.168), (8, 0.143), (9, -0.12), (10, -0.028), (11, -0.259), (12, 0.028), (13, -0.03), (14, -0.004), (15, 0.01), (16, 0.063), (17, -0.051), (18, 0.069), (19, -0.138), (20, 0.02), (21, -0.139), (22, -0.101), (23, -0.044), (24, -0.016), (25, -0.05), (26, -0.044), (27, -0.032), (28, -0.099), (29, 0.03), (30, 0.014), (31, -0.049), (32, -0.041), (33, 0.047), (34, 0.012), (35, -0.01), (36, 0.004), (37, -0.07), (38, -0.055), (39, -0.005), (40, -0.046), (41, -0.083), (42, 0.172), (43, 0.018), (44, -0.01), (45, -0.01), (46, 0.081), (47, -0.027), (48, -0.016), (49, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9878844 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.72219688 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu

Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1

3 0.69860059 206 nips-2008-Sequential effects: Superstition or rational behavior?

Author: Angela J. Yu, Jonathan D. Cohen

Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1

4 0.61344451 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1

5 0.60601366 112 nips-2008-Kernel Measures of Independence for non-iid Data

Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola

Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1

6 0.55181336 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

7 0.38654825 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

8 0.37210828 31 nips-2008-Bayesian Exponential Family PCA

9 0.36839724 200 nips-2008-Robust Kernel Principal Component Analysis

10 0.36647639 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

11 0.34766057 195 nips-2008-Regularized Policy Iteration

12 0.34478897 119 nips-2008-Learning a discriminative hidden part model for human action recognition

13 0.31089076 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

14 0.30607352 238 nips-2008-Theory of matching pursuit

15 0.29626772 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.28153616 15 nips-2008-Adaptive Martingale Boosting

17 0.28131559 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

18 0.26857466 61 nips-2008-Diffeomorphic Dimensionality Reduction

19 0.26173922 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

20 0.23207243 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.047), (7, 0.069), (12, 0.022), (28, 0.197), (42, 0.383), (57, 0.022), (63, 0.022), (77, 0.062), (83, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77295262 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.58057725 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.

3 0.50295126 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

Author: David Sontag, Amir Globerson, Tommi S. Jaakkola

Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1

4 0.502886 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

5 0.50115627 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

6 0.49948582 195 nips-2008-Regularized Policy Iteration

7 0.4992778 218 nips-2008-Spectral Clustering with Perturbed Data

8 0.49903333 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

9 0.49902686 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

10 0.49798471 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

11 0.49761033 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

12 0.49722868 112 nips-2008-Kernel Measures of Independence for non-iid Data

13 0.49683324 231 nips-2008-Temporal Dynamics of Cognitive Control

14 0.49561739 211 nips-2008-Simple Local Models for Complex Dynamical Systems

15 0.49546078 138 nips-2008-Modeling human function learning with Gaussian processes

16 0.49520597 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning

17 0.49514014 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

18 0.49512357 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms

19 0.49493676 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

20 0.49452335 40 nips-2008-Bounds on marginal probability distributions