nips nips2008 nips2008-57 knowledge-graph by maker-knowledge-mining

57 nips-2008-Deflation Methods for Sparse PCA


Source: pdf

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. [sent-2, score-0.053]

2 In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. [sent-3, score-0.119]

3 We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. [sent-5, score-0.171]

4 The goal of PCA is to extract several principal components, linear combinations of input variables that together best account for the variance in a data set. [sent-8, score-0.127]

5 Often, PCA is formulated as an eigenvalue decomposition problem: each eigenvector of the sample covariance matrix of a data set corresponds to the loadings or coefficients of a principal component. [sent-9, score-0.308]

6 A common approach to solving this partial eigenvalue decomposition is to iteratively alternate between two subproblems: rank-one variance maximization and matrix deflation. [sent-10, score-0.156]

7 The first subproblem involves finding the maximum-variance loadings vector for a given sample covariance matrix or, equivalently, finding the leading eigenvector of the matrix. [sent-11, score-0.239]

8 The second involves modifying the covariance matrix to eliminate the influence of that eigenvector. [sent-12, score-0.079]

9 Each principal component is a linear combination of all variables, and the loadings are typically non-zero. [sent-14, score-0.161]

10 Sparse PCA [8, 3, 16, 17, 6, 18, 1, 2, 9, 10, 12] injects sparsity into the PCA process by searching for “pseudo-eigenvectors”, sparse loadings that explain a maximal amount of variance in the data. [sent-16, score-0.256]

11 In analogy to the PCA setting, many authors attempt to solve the sparse PCA problem by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. [sent-17, score-0.197]

12 In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. [sent-20, score-0.119]

13 To rectify the situation, we first develop several heuristic deflation alternatives with more desirable properties. [sent-21, score-0.04]

14 We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. [sent-22, score-0.171]

15 In Section 2 we discuss matrix deflation as it relates to PCA and sparse PCA. [sent-25, score-0.11]

16 We examine the failings of typical PCA deflation in the sparse setting and develop several alternative deflation procedures. [sent-26, score-0.104]

17 In Section 3, we present a reformulation of the standard iterative sparse PCA optimization problem and derive a generalized deflation procedure to solve the reformulation. [sent-27, score-0.124]

18 2 Deflation methods A matrix deflation modifies a matrix to eliminate the influence of a given eigenvector, typically by setting the associated eigenvalue to zero (see [14] for a more detailed discussion). [sent-32, score-0.116]

19 We will first discuss deflation in the context of PCA and then consider its extension to sparse PCA. [sent-33, score-0.084]

20 1 Hotelling’s deflation and PCA In the PCA setting, the goal is to extract the r leading eigenvectors of the sample covariance matrix, A_0 ∈ S^p_+, as its eigenvectors are equivalent to the loadings of the first r principal components. [sent-35, score-0.279]

21 On the t-th iteration of the deflation method, we first extract the leading eigenvector of A_{t-1}, x_t = argmax_{x: x^T x = 1} x^T A_{t-1} x (1), and we then use Hotelling’s deflation to annihilate x_t: A_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T (2). [sent-37, score-1.898]

22 The deflation step ensures that the t+1-st leading eigenvector of A_0 is the leading eigenvector of A_t. [sent-38, score-0.164]
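
As a concrete illustration of Eqs. (1) and (2), the following NumPy sketch (an assumed reconstruction for this summary page, not code from the paper) extracts r principal component loadings by alternating leading-eigenvector extraction with Hotelling’s deflation.

```python
import numpy as np

def hotelling_pca(A0, r):
    """Extract r leading eigenvectors of A0 by alternating Eq. (1) and Eq. (2)."""
    A = A0.copy()
    loadings = []
    for _ in range(r):
        # Eq. (1): leading eigenvector of the current (deflated) matrix.
        eigvals, eigvecs = np.linalg.eigh(A)
        x = eigvecs[:, np.argmax(eigvals)]
        # Eq. (2): Hotelling's deflation sets the extracted eigenvalue to zero.
        A = A - np.outer(x, x) @ A @ np.outer(x, x)
        loadings.append(x)
    return np.column_stack(loadings)

# Usage on a small random covariance matrix.
rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 5))
X = hotelling_pca(np.cov(Y, rowvar=False), r=3)
```

Because each extracted x here is a true eigenvector, the deflated matrix stays positive semidefinite; the sparse setting discussed below breaks exactly this property.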

23 Â x_j = A x_j - x_j x_j^T A x_j x_j^T x_j = A x_j - x_j x_j^T A x_j = λ_j x_j - λ_j x_j = 0 x_j. [sent-62, score-0.075]

24 Thus, Hotelling’s deflation preserves all eigenvectors of a matrix and annihilates a selected eigenvalue while maintaining all others. [sent-64, score-0.142]

25 In the case of our iterative deflation method, annihilating the t-th leading eigenvector of A0 renders the t + 1-st leading eigenvector dominant in the next round. [sent-66, score-0.217]

26 2 Hotelling’s deflation and sparse PCA In the sparse PCA setting, we seek r sparse loadings which together capture the maximum amount of variance in the data. [sent-68, score-0.412]

27 Most authors [1, 9, 16, 12] adopt the additional constraint that the loadings be produced in a sequential fashion. [sent-69, score-0.113]

28 Typically, Hotelling’s deflation is utilized by substituting an extracted pseudo-eigenvector for a true eigenvector in the deflation step of Eq. (2). [sent-74, score-0.077]

29 Let x = (1, 0)^T, a sparse pseudo-eigenvector, and Ĉ = C - x x^T C x x^T, the corresponding deflated matrix. [sent-84, score-0.084]

30 That S^p_+ is not closed under pseudo-eigenvector Hotelling’s deflation is a serious failing, for most iterative sparse PCA methods assume a positive-semidefinite matrix on each iteration. [sent-89, score-0.125]
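
This failure is easy to reproduce numerically. The 2×2 matrix C below is a hypothetical positive-semidefinite example chosen for illustration (not necessarily the one used in the paper); deflating it by the sparse pseudo-eigenvector x = (1, 0)^T produces a matrix with a negative eigenvalue, and x is not annihilated either.

```python
import numpy as np

# Hypothetical PSD matrix; x is a sparse unit vector but not an eigenvector of C.
C = np.array([[2.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 0.0])

# Pseudo-eigenvector Hotelling's deflation: C_hat = C - x x^T C x x^T.
C_hat = C - np.outer(x, x) @ C @ np.outer(x, x)

print(np.linalg.eigvalsh(C))      # both eigenvalues are nonnegative
print(np.linalg.eigvalsh(C_hat))  # one eigenvalue is negative: C_hat is not PSD
print(C_hat @ x)                  # nonzero: x is not orthogonal to the deflated matrix
```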

31 A second, related shortcoming of pseudo-eigenvector Hotelling’s deflation is its failure to render a pseudoeigenvector orthogonal to a deflated matrix. [sent-90, score-0.039]

32 If A is our matrix of interest, x is our pseudo-eigenvector with variance λ = x^T A x, and Â = A - x x^T A x x^T is our deflated matrix, then Â x = A x - x x^T A x x^T x = A x - λ x, which is zero iff x is a true eigenvector. [sent-91, score-0.087]

33 3 Alternative deflation techniques In this section, we will attempt to rectify the failings of pseudo-eigenvector Hotelling’s deflation by considering several alternative deflation techniques better suited to the sparse PCA setting. [sent-98, score-0.157]

34 1 Projection deflation Given a data matrix Y ∈ R^{n×p} and an arbitrary unit vector x ∈ R^p, an intuitive way to remove the contribution of x from Y is to project Y onto the orthocomplement of the space spanned by x: Ŷ = Y(I - x x^T). [sent-104, score-0.1]

35 However, in the general case, when x_t is not a true eigenvector, projection deflation maintains the desirable properties that were lost to Hotelling’s deflation. [sent-106, score-0.527]

36 For example, positive-semidefiniteness is preserved: ∀y, y^T A_t y = y^T (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T) y = z^T A_{t-1} z ≥ 0, where z = (I - x_t x_t^T) y. [sent-107, score-1.338]

37 Moreover, A_t is rendered left and right orthogonal to x_t, as (I - x_t x_t^T) x_t = x_t - x_t = 0 and A_t is symmetric. [sent-109, score-0.931]

38 Projection deflation therefore annihilates all covariances with x_t: ∀v, v^T A_t x_t = x_t^T A_t v = 0. [sent-110, score-0.928]
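
A minimal sketch of projection deflation, assuming the pseudo-eigenvector x_t has unit norm; the update A_t = (I - x_t x_t^T) A_{t-1} (I - x_t x_t^T) is the one implied by the quadratic form above, and the checks mirror the two properties just claimed.

```python
import numpy as np

def projection_deflation(A, x):
    """Projection deflation: A_t = (I - x x^T) A (I - x x^T), for unit-norm x."""
    P = np.eye(A.shape[0]) - np.outer(x, x)
    return P @ A @ P

# Deflate a covariance matrix by a sparse (non-eigenvector) unit vector.
rng = np.random.default_rng(1)
A = np.cov(rng.standard_normal((200, 4)), rowvar=False)
x = np.zeros(4)
x[0] = 1.0
At = projection_deflation(A, x)

print(np.linalg.eigvalsh(At).min() >= -1e-12)           # positive-semidefiniteness preserved
print(np.allclose(At @ x, 0), np.allclose(x @ At, 0))   # x annihilated on both sides
```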

39 2 Schur complement deflation Since our goal in matrix deflation is to eliminate the influence, as measured through variance and covariances, of a newly discovered pseudo-eigenvector, it is reasonable to consider the conditional variance of our data variables given a pseudo-principal component. [sent-113, score-0.289]

40 While this conditional variance is non-trivial to compute in general, it takes on a simple closed form when the variables are normally distributed. [sent-114, score-0.061]

41 If W has covariance matrix Σ, then (W, Wx) has covariance matrix V = [Σ, Σx; x^T Σ, x^T Σ x], and Var(W | Wx) = Σ - Σ x x^T Σ / (x^T Σ x) whenever x^T Σ x ≠ 0 [15]. [sent-116, score-0.116]

42 That is, the conditional variance is the Schur complement of the vector variance x^T Σ x in the full covariance matrix V. [sent-117, score-0.28]

43 By substituting sample covariance matrices for their population counterparts, we arrive at a new deflation technique, Schur complement deflation: A_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t) (5). Like projection deflation, Schur complement deflation preserves positive-semidefiniteness. [sent-118, score-1.215]
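
A sketch of the Schur complement update in Eq. (5); the only assumption is that x_t^T A_{t-1} x_t > 0 so that the division is well defined.

```python
import numpy as np

def schur_complement_deflation(A, x):
    """Schur complement deflation (Eq. 5): A_t = A - A x x^T A / (x^T A x)."""
    Ax = A @ x
    return A - np.outer(Ax, Ax) / (x @ Ax)

# The deflated matrix has no remaining variance or covariance along x.
rng = np.random.default_rng(2)
A = np.cov(rng.standard_normal((200, 4)), rowvar=False)
x = np.zeros(4)
x[1] = 1.0
At = schur_complement_deflation(A, x)
print(np.allclose(At @ x, 0))   # x is left and right orthogonal to A_t
```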

44 To see this, suppose A_{t-1} ∈ S^p_+. [sent-119, score-0.446]

45 Then, ∀v, v^T A_t v = v^T A_{t-1} v - (v^T A_{t-1} x_t x_t^T A_{t-1} v) / (x_t^T A_{t-1} x_t) ≥ 0, as v^T A_{t-1} v · x_t^T A_{t-1} x_t - (v^T A_{t-1} x_t)^2 ≥ 0 by the Cauchy-Schwarz inequality and x_t^T A_{t-1} x_t ≥ 0 as A_{t-1} ∈ S^p_+. [sent-120, score-1.784]

46 Furthermore, Schur complement deflation renders x_t left and right orthogonal to A_t, since A_t is symmetric and A_t x_t = A_{t-1} x_t - A_{t-1} x_t (x_t^T A_{t-1} x_t) / (x_t^T A_{t-1} x_t) = A_{t-1} x_t - A_{t-1} x_t = 0. [sent-121, score-2.833]

47 Additionally, Schur complement deflation reduces to Hotelling’s deflation when x_t is an eigenvector of A_{t-1} with eigenvalue λ_t ≠ 0: A_t = A_{t-1} - A_{t-1} x_t x_t^T A_{t-1} / (x_t^T A_{t-1} x_t) = A_{t-1} - λ_t x_t x_t^T λ_t / λ_t = A_{t-1} - x_t x_t^T A_{t-1} x_t x_t^T. [sent-122, score-2.879]

48 While we motivated Schur complement deflation with a Gaussianity assumption, the technique admits a more general interpretation as a column projection of a data matrix. [sent-123, score-0.189]

49 Suppose Y ∈ R^{n×p} is a mean-centered data matrix, x ∈ R^p has unit norm, and Ŷ = (I - Y x x^T Y^T / ||Y x||^2) Y, the projection of the columns of Y onto the orthocomplement of the space spanned by the pseudo-principal component, Y x. [sent-124, score-0.142]

50 If Y has sample covariance matrix A, then the sample covariance of Ŷ is given by (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2)^T (I - Y x x^T Y^T / ||Y x||^2) Y = (1/n) Y^T (I - Y x x^T Y^T / ||Y x||^2) Y = A - A x x^T A / (x^T A x). [sent-125, score-0.09]
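
This column-projection interpretation can be checked numerically. The sketch below adopts the 1/n sample covariance convention so that the algebra above goes through; under that assumption, the covariance of the projected data equals the Schur complement update.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 4
Y = rng.standard_normal((n, p))
Y = Y - Y.mean(axis=0)                 # mean-centered data matrix
A = Y.T @ Y / n                        # sample covariance, 1/n convention
x = np.zeros(p)
x[0] = 1.0                             # unit-norm pseudo-eigenvector

# Project the columns of Y onto the orthocomplement of span{Y x}.
u = Y @ x
Y_hat = Y - np.outer(u, u) @ Y / (u @ u)

cov_projected = Y_hat.T @ Y_hat / n
cov_schur = A - np.outer(A @ x, A @ x) / (x @ A @ x)
print(np.allclose(cov_projected, cov_schur))   # True: both equal Schur complement deflation
```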

51 Whenever we deal with a sequence of non-orthogonal vectors, we must take care to distinguish between the variance explained by a vector and the additional variance explained, given all previous vectors. [sent-129, score-0.167]

52 These concepts are equivalent in the PCA setting, as true eigenvectors of a matrix are orthogonal, but, in general, the vectors extracted by sparse PCA will not be orthogonal. [sent-130, score-0.157]

53 The additional variance explained by the t-th pseudo-eigenvector, x_t, is equivalent to the variance explained by the component of x_t orthogonal to the space spanned by all previous pseudo-eigenvectors, q_t = x_t - P_{t-1} x_t, where P_{t-1} is the orthogonal projection onto the space spanned by x_1, ..., x_{t-1}. [sent-131, score-2.42]

54 On each deflation step, therefore, we only want to eliminate the variance associated with qt . [sent-135, score-0.265]

55 Annihilating the full vector xt will often lead to “double counting” and could re-introduce components parallel to previously annihilated vectors. [sent-136, score-0.446]

56 (Eq. (2)) for t = 1 and can be easily expressed in terms of a running Gram-Schmidt decomposition for t > 1: Orthogonalized Hotelling’s deflation (OHD): q_t = (I - Q_{t-1} Q_{t-1}^T) x_t / ||(I - Q_{t-1} Q_{t-1}^T) x_t|| (6), A_t = A_{t-1} - q_t q_t^T A_{t-1} q_t q_t^T, where q_1 = x_1, and q_1, ..., [sent-149, score-0.915]

57 ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}. [sent-156, score-0.052]
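
A sketch of one OHD round following Eq. (6): the new pseudo-eigenvector is first orthogonalized against the running basis Q_{t-1} (a Gram-Schmidt step), and Hotelling’s deflation is then applied to the normalized residual q_t. This is an illustrative reconstruction, not code from the paper.

```python
import numpy as np

def ohd_step(A_prev, x_t, Q_prev=None):
    """One round of Orthogonalized Hotelling's deflation (Eq. 6).

    A_prev : deflated matrix A_{t-1}
    x_t    : newly extracted pseudo-eigenvector
    Q_prev : p x (t-1) matrix with columns q_1, ..., q_{t-1}, or None on round 1
    """
    if Q_prev is None:
        q = x_t / np.linalg.norm(x_t)            # q_1 = x_1 (round 1 is plain Hotelling)
    else:
        r = x_t - Q_prev @ (Q_prev.T @ x_t)      # remove components along previous q's
        q = r / np.linalg.norm(r)
    A_next = A_prev - np.outer(q, q) @ A_prev @ np.outer(q, q)
    Q_next = q[:, None] if Q_prev is None else np.column_stack([Q_prev, q])
    return A_next, Q_next

# Two rounds with hypothetical, non-orthogonal sparse vectors.
rng = np.random.default_rng(6)
A0 = np.cov(rng.standard_normal((100, 4)), rowvar=False)
x1 = np.array([1.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0])
A1, Q1 = ohd_step(A0, x1)
A2, Q2 = ohd_step(A1, x2, Q1)
```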

58 However, the same principles may be applied to projection deflation to generate an orthogonalized variant that inherits its desirable properties. [sent-162, score-0.206]

59 Schur complement deflation is unique in that it preserves orthogonality in all subsequent rounds. [sent-163, score-0.123]

60 That is, if a vector v is orthogonal to A_{t-1} for any t, then A_t v = A_{t-1} v - A_{t-1} x_t x_t^T A_{t-1} v / (x_t^T A_{t-1} x_t) = 0, as A_{t-1} v = 0. [sent-164, score-0.931]

61 Orthogonalized Schur complement deflation is equivalent to Schur complement deflation. [sent-168, score-0.2]

62 We may write xt = ot + pt , where pt is in the subspace spanned by all previously extracted pseudo-eigenvectors and ot is orthogonal to this subspace. [sent-171, score-1.089]

63 Then we know that A_{t-1} p_t = 0, as p_t is a linear combination of x_1, ..., x_{t-1}, each of which Schur complement deflation has rendered orthogonal to A_{t-1}. [sent-172, score-0.26]

64 Thus, x_t^T A_{t-1} x_t = p_t^T A_{t-1} p_t + o_t^T A_{t-1} p_t + p_t^T A_{t-1} o_t + o_t^T A_{t-1} o_t = o_t^T A_{t-1} o_t. [sent-176, score-1.135]

65 Further, A_{t-1} x_t x_t^T A_{t-1} = A_{t-1} p_t p_t^T A_{t-1} + A_{t-1} p_t o_t^T A_{t-1} + A_{t-1} o_t p_t^T A_{t-1} + A_{t-1} o_t o_t^T A_{t-1} = A_{t-1} o_t o_t^T A_{t-1}. [sent-177, score-1.135]

66 Hence, A_t = A_{t-1} - A_{t-1} o_t o_t^T A_{t-1} / (o_t^T A_{t-1} o_t) = A_{t-1} - A_{t-1} q_t q_t^T A_{t-1} / (q_t^T A_{t-1} q_t), as q_t = o_t / ||o_t||. [sent-178, score-1.344]
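
The equivalence can also be sanity-checked numerically. The sketch below re-implements the Eq. (5) update inline and uses two hypothetical non-orthogonal sparse vectors: after the first round, deflating by x_2 and by its orthogonalized counterpart q_2 yields the same matrix.

```python
import numpy as np

def scd(A, x):
    """Schur complement deflation, Eq. (5)."""
    Ax = A @ x
    return A - np.outer(Ax, Ax) / (x @ Ax)

rng = np.random.default_rng(4)
A0 = np.cov(rng.standard_normal((300, 5)), rowvar=False)
x1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)

A1 = scd(A0, x1)                       # round 1: A_1 x_1 = 0 afterwards

q2 = x2 - x1 * (x1 @ x2)               # component of x_2 orthogonal to x_1
q2 = q2 / np.linalg.norm(q2)

print(np.allclose(scd(A1, x2), scd(A1, q2)))   # True: orthogonalized SCD equals SCD
```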

67 In this section, we explore a more principled alternative: reformulating the sparse PCA optimization problem to explicitly reflect our maximization objective on each round. [sent-184, score-0.119]

68 Recall that the goal of sparse PCA is to find r cardinality-constrained pseudo-eigenvectors which together explain the most variance in the data. [sent-185, score-0.157]

69 If we additionally constrain the sparse loadings to be generated sequentially, as in the PCA setting and the previous section, then a greedy approach of maximizing the additional variance of each new vector naturally suggests itself. [sent-186, score-0.258]

70 On round t, the additional variance of a vector x is given by q^T A_0 q / (q^T q), where A_0 is the data covariance matrix, q = (I - P_{t-1}) x, and P_{t-1} is the projection onto the space spanned by previous pseudo-eigenvectors x_1, ..., x_{t-1}. [sent-187, score-0.247]

71 As q^T q = x^T (I - P_{t-1})(I - P_{t-1}) x = x^T (I - P_{t-1}) x, maximizing additional variance is equivalent to solving a cardinality-constrained maximum generalized eigenvalue problem: max_x x^T (I - P_{t-1}) A_0 (I - P_{t-1}) x subject to x^T (I - P_{t-1}) x = 1 and Card(x) ≤ k_t. [sent-191, score-0.159]

72 ..., q_{t-1} form an orthonormal basis for the space spanned by x_1, ..., x_{t-1}. [sent-195, score-0.052]

73 Writing I - P_{t-1} = I - Σ_{s=1}^{t-1} q_s q_s^T = Π_{s=1}^{t-1} (I - q_s q_s^T) suggests a generalized deflation technique that leads to the solution of Eq. [sent-199, score-0.182]

74 We imbed the technique into the following algorithm for sparse PCA: Algorithm 1 Generalized Deflation Method for Sparse PCA. Given: A_0 ∈ S^p_+, r ∈ N, {k_1, ..., k_r} ⊂ N. [sent-201, score-0.105]

75 For t = 1, ..., r: x_t ← argmax_{x: x^T B_{t-1} x = 1, Card(x) ≤ k_t} x^T A_{t-1} x; q_t ← B_{t-1} x_t; A_t ← (I - q_t q_t^T) A_{t-1} (I - q_t q_t^T); B_t ← B_{t-1} (I - q_t q_t^T); x_t ← x_t / ||x_t||. Return: {x_1, ..., x_r}. [sent-209, score-3.079]
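
Below is a sketch of Algorithm 1, with B initialized to the identity so that q_1 = x_1. The cardinality-constrained subproblem on each round is hard in general and the paper pairs its deflation with existing sparse PCA solvers (e.g., GSLDA in the experiments); constrained_max_placeholder here is only an illustrative stand-in that truncates the leading eigenvector to its k_t largest-magnitude coordinates and rescales to satisfy the B-norm constraint.

```python
import numpy as np

def constrained_max_placeholder(A, B, k):
    """Heuristic stand-in for argmax x^T A x s.t. x^T B x = 1, Card(x) <= k.

    Not the solver from the paper: truncate the leading eigenvector of A to its
    k largest-magnitude coordinates, then rescale to meet the B-norm constraint.
    """
    eigvals, eigvecs = np.linalg.eigh(A)
    v = eigvecs[:, -1]                        # leading eigenvector of A
    support = np.argsort(np.abs(v))[-k:]      # keep the k largest |entries|
    x = np.zeros_like(v)
    x[support] = v[support]
    return x / np.sqrt(x @ B @ x)             # assumes x^T B x > 0 for this sketch

def generalized_deflation_sparse_pca(A0, r, ks):
    """Sketch of Algorithm 1: generalized deflation method for sparse PCA."""
    p = A0.shape[0]
    A, B = A0.copy(), np.eye(p)
    loadings = []
    for t in range(r):
        x = constrained_max_placeholder(A, B, ks[t])
        q = B @ x                             # unit norm: x^T B x = 1 and B is a projection
        A = (np.eye(p) - np.outer(q, q)) @ A @ (np.eye(p) - np.outer(q, q))
        B = B @ (np.eye(p) - np.outer(q, q))
        loadings.append(x / np.linalg.norm(x))
    return np.column_stack(loadings)

# Usage on random stand-in data with the pit props dimensions (13 variables, 180 rows).
rng = np.random.default_rng(5)
Y = rng.standard_normal((180, 13))
X = generalized_deflation_sparse_pca(np.cov(Y, rowvar=False), r=6, ks=[4] * 6)
```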

76 1 Pit props dataset The pit props dataset [5] with 13 variables and 180 observations has become a de facto standard for benchmarking sparse PCA methods. [sent-216, score-0.244]

77 To demonstrate the disparate behavior of differing deflation methods, we utilize each sparse PCA algorithm and deflation technique to successively extract six sparse loadings, each constrained to have cardinality less than or equal to kt = 4. [sent-217, score-0.242]

78 We report the additional variances explained by each sparse vector in Table 2 and the cumulative percentage variance explained on each iteration in Table 3. [sent-218, score-0.26]

79 For reference, the first 6 true principal components of the pit props dataset capture 87% of the variance. [sent-219, score-0.143]
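
The tables report additional and cumulative percentage variance explained. Because sparse loadings are not orthogonal, these quantities are computed on orthogonalized components, consistent with the definition in Section 3; the sketch below is an assumed implementation of that metric (QR supplies the running Gram-Schmidt basis), not evaluation code released with the paper.

```python
import numpy as np

def additional_variances(A0, X):
    """Additional variance explained by each column of X (non-orthogonal loadings).

    The t-th value is q_t^T A0 q_t, where q_t is the unit-norm component of x_t
    orthogonal to the span of x_1, ..., x_{t-1}.
    """
    Q, _ = np.linalg.qr(X)                     # columns: running Gram-Schmidt basis
    return np.array([q @ A0 @ q for q in Q.T])

def cumulative_percentage_variance(A0, X):
    """Cumulative percentage of total variance (trace of A0) explained so far."""
    return 100.0 * np.cumsum(additional_variances(A0, X)) / np.trace(A0)
```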

80 Table 2: Additional variance explained by each of the first 6 sparse loadings extracted from the Pit Props dataset. [sent-292, score-0.292]

81 4% of the variance, while the best performing methods, Schur complement deflation and generalized deflation, explain approximately 79% of the variance each. [sent-294, score-0.198]

82 Projection deflation and its orthogonalized variant also outperform Hotelling’s deflation, while orthogonalized Hotelling’s shows the worst performance with only 63. [sent-295, score-0.222]

83 Generalized deflation and the two projection deflations dominate, with GD achieving the maximum cumulative variance explained on each round. [sent-298, score-0.182]

84 In contrast, the more standard Hotelling’s and orthogonalized Hotelling’s underperform the remaining techniques. [sent-299, score-0.111]

85 Table 3: Cumulative percentage variance explained by the first 6 sparse loadings extracted from the Pit Props dataset. [sent-372, score-0.309]

86 2 Gene expression data The Berkeley Drosophila Transcription Network Project (BDTNP) 3D gene expression data [4] contains gene expression levels measured in each nucleus of developing Drosophila embryos and averaged across many embryos and developmental stages. [sent-374, score-0.078]

87 [Table 4 header fragment: GSLDA additional variance explained, columns PD, SCD, OHD, OPD] [sent-386, score-0.106]

88 [Table 4 header fragment: GSLDA cumulative percentage variance, columns PD, SCD, OHD, OPD] [sent-434, score-0.1]

89 Table 4: Additional variance and cumulative percentage variance explained by the first 8 sparse loadings of GSLDA on the BDTNP VirtualEmbryo. [sent-474, score-0.375]

90 The generalized deflation technique performs best, achieving the largest additional variance on every round and a final cumulative variance of 79. [sent-476, score-0.222]

91 Schur complement deflation, projection deflation, and orthogonalized projection deflation all perform comparably, explaining roughly 77% of the total variance after 8 rounds. [sent-478, score-0.408]

92 In last place are the standard Hotelling’s and orthogonalized Hotelling’s deflations, both of which explain less than 76% of variance after 8 rounds. [sent-479, score-0.184]

93 5 Conclusion In this work, we have exposed the theoretical and empirical shortcomings of Hotelling’s deflation in the sparse PCA setting and developed several alternative methods more suitable for non-eigenvector deflation. [sent-480, score-0.084]

94 Notably, the utility of these procedures is not limited to the sparse PCA setting. [sent-481, score-0.084]

95 Indeed, the methods presented can be applied to any of a number of constrained eigendecomposition-based problems, including sparse canonical correlation analysis [13] and linear discriminant analysis [10]. [sent-482, score-0.084]

96 Loadings and correlations in the interpretation of principal components. [sent-511, score-0.048]

97 Two case studies in the application of principal components. [sent-535, score-0.048]

98 Spectral bounds for sparse PCA: Exact and greedy algorithms. [sent-556, score-0.084]

99 Simon, Low-rank approximations with sparse factors I: Basic algorithms and error analysis. [sent-607, score-0.084]

100 Simon, Low-rank approximations with sparse factors II: Penalized methods with discrete Newton-like iterations. [sent-616, score-0.084]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ation', 0.638), ('xt', 0.446), ('hotelling', 0.324), ('pca', 0.188), ('qt', 0.183), ('ot', 0.143), ('schur', 0.138), ('pt', 0.13), ('orthogonalized', 0.111), ('ohd', 0.101), ('complement', 0.1), ('loadings', 0.099), ('xxt', 0.097), ('gslda', 0.091), ('sparse', 0.084), ('opd', 0.071), ('scd', 0.071), ('gd', 0.071), ('projection', 0.068), ('axj', 0.062), ('variance', 0.061), ('eigenvector', 0.06), ('ax', 0.057), ('hd', 0.057), ('sp', 0.052), ('dcpca', 0.051), ('props', 0.051), ('pd', 0.05), ('pc', 0.049), ('principal', 0.048), ('pit', 0.044), ('eigenvalue', 0.043), ('spanned', 0.041), ('axxt', 0.04), ('orthogonal', 0.039), ('seldom', 0.035), ('qs', 0.034), ('covariance', 0.032), ('explained', 0.031), ('eigenvectors', 0.03), ('rectify', 0.027), ('matrix', 0.026), ('generalized', 0.025), ('rp', 0.025), ('ated', 0.024), ('preserves', 0.023), ('leading', 0.022), ('cumulative', 0.022), ('bt', 0.022), ('technique', 0.021), ('eliminate', 0.021), ('gene', 0.021), ('niteness', 0.021), ('annihilates', 0.02), ('annihilating', 0.02), ('ations', 0.02), ('bdtnp', 0.02), ('failings', 0.02), ('moghaddam', 0.02), ('orthocomplement', 0.02), ('reformulating', 0.02), ('newly', 0.02), ('cardinality', 0.019), ('eigenvalues', 0.019), ('extract', 0.018), ('round', 0.018), ('renders', 0.018), ('borrowed', 0.018), ('zha', 0.018), ('aspremont', 0.018), ('embryos', 0.018), ('percentage', 0.017), ('extracted', 0.017), ('covariances', 0.016), ('jolliffe', 0.016), ('torres', 0.016), ('kt', 0.016), ('subtasks', 0.015), ('drosophila', 0.015), ('sriperumbudur', 0.015), ('xj', 0.015), ('iterative', 0.015), ('maximization', 0.015), ('argmax', 0.014), ('axi', 0.014), ('component', 0.014), ('additional', 0.014), ('card', 0.014), ('inherits', 0.014), ('simon', 0.014), ('de', 0.014), ('sequentially', 0.013), ('desirable', 0.013), ('onto', 0.013), ('techniques', 0.013), ('proposition', 0.012), ('explain', 0.012), ('reformulate', 0.012), ('iteratively', 0.011), ('orthonormal', 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.23068459 206 nips-2008-Sequential effects: Superstition or rational behavior?

Author: Angela J. Yu, Jonathan D. Cohen

Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1

3 0.20395653 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu

Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1

4 0.16031653 112 nips-2008-Kernel Measures of Independence for non-iid Data

Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola

Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1

5 0.14886783 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1

6 0.11176921 195 nips-2008-Regularized Policy Iteration

7 0.092825674 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

8 0.078464977 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

9 0.078255042 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms

10 0.076336749 238 nips-2008-Theory of matching pursuit

11 0.071404755 119 nips-2008-Learning a discriminative hidden part model for human action recognition

12 0.070478305 216 nips-2008-Sparse probabilistic projections

13 0.068794988 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

14 0.066628873 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

15 0.062834777 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.058528014 200 nips-2008-Robust Kernel Principal Component Analysis

17 0.056892619 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

18 0.053077329 15 nips-2008-Adaptive Martingale Boosting

19 0.049399637 31 nips-2008-Bayesian Exponential Family PCA

20 0.045514788 62 nips-2008-Differentiable Sparse Coding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.115), (1, 0.051), (2, -0.053), (3, -0.002), (4, -0.041), (5, 0.256), (6, -0.041), (7, 0.168), (8, 0.143), (9, -0.12), (10, -0.028), (11, -0.259), (12, 0.028), (13, -0.03), (14, -0.004), (15, 0.01), (16, 0.063), (17, -0.051), (18, 0.069), (19, -0.138), (20, 0.02), (21, -0.139), (22, -0.101), (23, -0.044), (24, -0.016), (25, -0.05), (26, -0.044), (27, -0.032), (28, -0.099), (29, 0.03), (30, 0.014), (31, -0.049), (32, -0.041), (33, 0.047), (34, 0.012), (35, -0.01), (36, 0.004), (37, -0.07), (38, -0.055), (39, -0.005), (40, -0.046), (41, -0.083), (42, 0.172), (43, 0.018), (44, -0.01), (45, -0.01), (46, 0.081), (47, -0.027), (48, -0.016), (49, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9878844 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.72219688 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces

Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu

Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1

3 0.69860059 206 nips-2008-Sequential effects: Superstition or rational behavior?

Author: Angela J. Yu, Jonathan D. Cohen

Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1

4 0.61344451 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1

5 0.60601366 112 nips-2008-Kernel Measures of Independence for non-iid Data

Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola

Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1

6 0.55181336 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making

7 0.38654825 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems

8 0.37210828 31 nips-2008-Bayesian Exponential Family PCA

9 0.36839724 200 nips-2008-Robust Kernel Principal Component Analysis

10 0.36647639 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

11 0.34766057 195 nips-2008-Regularized Policy Iteration

12 0.34478897 119 nips-2008-Learning a discriminative hidden part model for human action recognition

13 0.31089076 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

14 0.30607352 238 nips-2008-Theory of matching pursuit

15 0.29626772 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

16 0.28153616 15 nips-2008-Adaptive Martingale Boosting

17 0.28131559 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

18 0.26857466 61 nips-2008-Diffeomorphic Dimensionality Reduction

19 0.26173922 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule

20 0.23207243 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.047), (7, 0.069), (12, 0.022), (28, 0.197), (42, 0.383), (57, 0.022), (63, 0.022), (77, 0.062), (83, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77295262 57 nips-2008-Deflation Methods for Sparse PCA

Author: Lester W. Mackey

Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1

2 0.58057725 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.

3 0.50295126 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

Author: David Sontag, Amir Globerson, Tommi S. Jaakkola

Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1

4 0.502886 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.

5 0.50115627 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

6 0.49948582 195 nips-2008-Regularized Policy Iteration

7 0.4992778 218 nips-2008-Spectral Clustering with Perturbed Data

8 0.49903333 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor

9 0.49902686 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel

10 0.49798471 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation

11 0.49761033 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

12 0.49722868 112 nips-2008-Kernel Measures of Independence for non-iid Data

13 0.49683324 231 nips-2008-Temporal Dynamics of Cognitive Control

14 0.49561739 211 nips-2008-Simple Local Models for Complex Dynamical Systems

15 0.49546078 138 nips-2008-Modeling human function learning with Gaussian processes

16 0.49520597 125 nips-2008-Local Gaussian Process Regression for Real Time Online Model Learning

17 0.49514014 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference

18 0.49512357 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms

19 0.49493676 93 nips-2008-Global Ranking Using Continuous Conditional Random Fields

20 0.49452335 40 nips-2008-Bounds on marginal probability distributions