nips nips2008 nips2008-227 knowledge-graph by maker-knowledge-mining

227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization


Source: pdf

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 com Abstract Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. [sent-3, score-0.566]

2 In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. [sent-4, score-1.069]

3 Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. [sent-5, score-0.932]

4 A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. [sent-6, score-0.529]

5 The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. [sent-7, score-0.253]

6 It provides a closed-form solution for linear unsupervised dimensionality reduction through singular value decomposition (SVD) on the data matrix [8]. [sent-9, score-0.396]

7 Exponential family PCA is the most prominent example, where the underlying dimensionality reduction principle of PCA is extended to the general exponential family [4, 7, 13]. [sent-12, score-0.903]

8 Previous work has shown that improved quality of dimensionality reduction can be obtained by using exponential family models appropriate for the data at hand [4, 13]. [sent-13, score-0.717]

9 Given data from a non-Gaussian distribution, these techniques are better able than PCA to capture the intrinsic low dimensional structure. [sent-14, score-0.194]

10 However, most existing non-Gaussian dimensionality reduction methods rely on iterative local optimization procedures and thus suffer from local optima, with the sole exception of [7] which shows a general convex form can be obtained for dimensionality reduction with exponential family models. [sent-15, score-1.305]

11 Recently, supervised dimensionality reduction has begun to receive increased attention. [sent-16, score-0.54]

12 As the goal of dimensionality reduction is to identify the intrinsic structure of a data set in a low dimensional space, there are many reasons why supervised dimensionality reduction is a meaningful topic to study. [sent-17, score-1.074]

13 Moreover, there are many high dimensional data sets with label information available, e. [sent-20, score-0.129]

14 A few supervised dimensionality reduction methods based on exponential family models have been proposed in the literature. [sent-23, score-0.895]

15 For example, a supervised probabilistic PCA (SPPCA) model was proposed in [19]. [sent-24, score-0.211]

16 SPPCA extends probabilistic PCA by assuming that both features and labels have Gaussian distributions and are generated independently from the latent low dimensional space through linear transformations. [sent-25, score-0.266]
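To make the generative assumption above concrete, here is a minimal sketch (not the formulation of [19]) that draws latent points z from a Gaussian and generates features and labels independently through linear maps plus Gaussian noise; all dimensions, the noise level, and the argmax labelling step are illustrative assumptions.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
t, n, k, d = 100, 20, 3, 2          # samples, features, classes, latent dims
W_x = rng.normal(size=(n, d))       # linear map: latent -> features
W_y = rng.normal(size=(k, d))       # linear map: latent -> label scores

Z = rng.normal(size=(t, d))                          # latent low-dimensional points
X = Z @ W_x.T + 0.1 * rng.normal(size=(t, n))        # Gaussian features
Y_scores = Z @ W_y.T + 0.1 * rng.normal(size=(t, k)) # Gaussian label scores
y = Y_scores.argmax(axis=1)                          # discretize to class labels
```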

17 A more general supervised dimensionality reduction approach with generalized linear models (SDR GLM) was proposed in [12]. [sent-27, score-0.54]

18 SDR GLM views both features and labels as exponential family random variables and optimizes a weighted linear combination of their conditional likelihood given latent low dimensional variables using an alternating EM-style procedure with closed-form update rules. [sent-28, score-0.709]

19 SDR GLM is able to deal with different data types by using different exponential family models. [sent-29, score-0.377]

20 Similar to SDR GLM, the linear supervised dimensionality reduction method proposed in [14] also takes advantage of exponential family models to deal with different data types. [sent-30, score-0.895]

21 However, it optimizes the conditional likelihood of labels given observed features within a mixture model framework using an EM-style optimization procedure. [sent-31, score-0.215]

22 Beyond the PCA framework, many other supervised dimensionality reduction methods have been proposed in the literature. [sent-32, score-0.54]

23 Another notable nonlinear supervised dimensionality reduction approach is the colored maximum variance unfolding (MVU) approach proposed in [15], which maximizes the variance aligning with the side information (e. [sent-35, score-0.724]

24 However, colored MVU has only been evaluated on training data. [sent-38, score-0.118]

25 In this paper, we propose a novel supervised exponential family PCA model (SEPCA). [sent-39, score-0.533]

26 In the SEPCA model, observed data x and its label y are assumed to be generated from the latent variables z via conditional exponential family models; dimensionality reduction is conducted by optimizing the conditional likelihood of the observations (x, y). [sent-40, score-0.947]

27 By exploiting convex duality of the sub-problems and eigenvector properties, a solvable convex formulation of the problem can be derived that preserves solution equivalence to the original. [sent-41, score-0.231]

28 This convex formulation allows efficient global optimization algorithms to be devised. [sent-42, score-0.262]

29 Moreover, by introducing a sample-based approximation to exponential family models, SEPCA does not suffer from the limitations of implicit Gaussian assumptions and is able to be conveniently kernelized to achieve nonlinearity. [sent-43, score-0.541]

30 A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained through a coordinate descent procedure. [sent-44, score-0.529]

31 This projection can be used for other supervised dimensionality reduction approaches as well. [sent-46, score-0.614]

32 Our experimental results over both synthetic and real data suggest that a more global, principled probabilistic approach, SEPCA, is better able to capture subtle structure in the data, particularly when good label information is present. [sent-47, score-0.203]

33 First, in Section 2 we present the proposed supervised exponential family PCA model and formulate a convex nondifferentiable optimization problem. [sent-49, score-0.906]

34 Then, an efficient global optimization algorithm is presented in Section 3. [sent-50, score-0.135]

35 In Section 4, we present a simple projection method for new testing points. [sent-51, score-0.122]

36 This is typically viewed as discovering a latent low dimensional manifold in the high dimensional feature space. [sent-57, score-0.307]

37 Since the label information Y is exploited in the discovery process, this is called supervised dimensionality reduction. [sent-58, score-0.404]

38 Given the above setup, in this paper, we are attempting to address the problem of supervised dimensionality reduction using a probabilistic latent variable model. [sent-60, score-0.658]

39 In this section, we formulate the low-dimensional principal component discovering problem as a conditional likelihood maximization problem based on exponential family model representations, which can be reformulated into an equivalent nondifferentiable convex optimization problem. [sent-62, score-0.939]

40 We then exploit a sample-based approximation to unify exponential family models for different data types. [sent-63, score-0.384]

41 The first step is to derive the Fenchel conjugate dual for each log partition function, A(Z, . [sent-74, score-0.143]

42 3]; which can be used to yield max_{Z: Z⊤Z = I} min_{Ux, Uy} Σ_i (A*(Ux_{i:}) + log P0(X_{i:})) + (1/(2β)) tr((X − Ux)(X − Ux)⊤ ZZ⊤) + Σ_i A*(Uy_{i:}) + (1/(2β)) tr((Y − Uy)(Y − Uy)⊤ (ZZ⊤ + E))  (5), which is equivalent to the original problem (1). [sent-77, score-0.417]

43 The second step is based on exploiting the strong min-max property [2] and the relationships between different constraint sets {M : M = ZZ ⊤ for some Z such that Z ⊤ Z = I} ⊆ {M : I M 0, tr(M ) = d}, which allows one to further show the optimization (4) is an upper bound relaxation of (5). [sent-78, score-0.114]
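As a quick numerical sanity check of the set inclusion above (not part of the paper), one can verify that M = ZZ⊤ for an orthonormal Z satisfies 0 ⪯ M ⪯ I and tr(M) = d; the sketch below assumes numpy and a random Z obtained by QR.

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 8, 3
Q, _ = np.linalg.qr(rng.normal(size=(t, d)))   # Z with orthonormal columns, Z^T Z = I
M = Q @ Q.T

eigvals = np.linalg.eigvalsh(M)
assert np.all(eigvals >= -1e-9) and np.all(eigvals <= 1 + 1e-9)   # 0 <= M <= I
assert np.isclose(np.trace(M), d)                                 # tr(M) = d
```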

44 Thus the overall min-max optimization is convex [3], but apparently not necessarily differentiable. [sent-82, score-0.222]

45 We will address the nondifferentiable training issue in Section 3. [sent-83, score-0.176]

46 2 Sample-based Approximation In the previous section, we have formulated our supervised exponential family PCA as a convex optimization problem (4). [sent-85, score-0.729]

47 For ∗ different exponential family models, the Fenchel conjugate functions A are different; see [18, Table 2]. [sent-87, score-0.402]

48 Thus the Fenchel conjugate function A*(Uy_{i:}) is given by A*(Uy_{i:}) = A*(Θy_{i:}) = tr(Θy_{i:} log Θy_{i:}⊤), where Θy_{i:} ≥ 0, Θy_{i:} 1 = 1  (6) The specific exponential family model is determined by the data type and distribution. [sent-89, score-0.622]
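The conjugate in (6) is the negative entropy of a simplex row. A minimal sketch of evaluating it for a single row, assuming a small epsilon to guard the logarithm (an implementation choice, not from the paper):

```python
import numpy as np

def multinomial_conjugate(theta_row, eps=1e-12):
    """Negative entropy sum_j theta_j log theta_j, i.e. tr(theta log theta^T),
    for a single row theta lying on the probability simplex."""
    theta_row = np.asarray(theta_row, dtype=float)
    assert np.all(theta_row >= 0) and np.isclose(theta_row.sum(), 1.0)
    return float(np.sum(theta_row * np.log(theta_row + eps)))

print(multinomial_conjugate([0.7, 0.2, 0.1]))   # about -0.80 (negative entropy)
```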

49 However, it is tedious and sometimes hard to choose the most appropriate exponential family model to use for each specific application problem. [sent-91, score-0.355]

50 For these reasons, we propose to use a sample-based approximation to the integral (2) and achieve an empirical approximation to the true underlying exponential family model as follows. [sent-93, score-0.413]

51 3 Efficient Global Optimization The optimization (8) we derived in the previous section is a convex-concave min-max optimization problem. [sent-95, score-0.184]

52 The inner maximization of (8) is a well known problem with a closed-form solution [11]: M* = Z*Z*⊤ and Z* = Q_d^max((I − Θx)K(I − Θx)⊤ + (Y − Θy)(Y − Θy)⊤), where Q_d^max(D) denotes the matrix formed by the top d eigenvectors of D. [sent-96, score-0.173]
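A minimal sketch of this closed-form inner maximization, assuming K is a t×t kernel matrix, Y a t×k label matrix, Θx a t×t and Θy a t×k parameter matrix; the explicit symmetrization and the use of numpy's eigh are implementation choices, not from the paper.

```python
import numpy as np

def inner_max_Z(K, Y, Theta_x, Theta_y, d):
    """Return Z* (top-d eigenvectors of D) and M* = Z* Z*^T, where
    D = (I - Theta_x) K (I - Theta_x)^T + (Y - Theta_y)(Y - Theta_y)^T."""
    t = K.shape[0]
    I = np.eye(t)
    D = (I - Theta_x) @ K @ (I - Theta_x).T + (Y - Theta_y) @ (Y - Theta_y).T
    # eigh returns eigenvalues in ascending order; take the last d columns.
    vals, vecs = np.linalg.eigh((D + D.T) / 2.0)
    Z_star = vecs[:, -d:]
    return Z_star, Z_star @ Z_star.T
```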

53 However, the overall outer minimization problem is nondifferentiable with respect to Θx and Θy . [sent-97, score-0.248]

54 In this section, we deploy a bundle method to solve this nondifferentiable min-max optimization. [sent-99, score-0.439]

55 1 Bundle Method for Min-Max Optimization The bundle method is an efficient subgradient method for nondifferentiable convex optimization; it relies on the computation of subgradient terms of the objective function. [sent-101, score-0.728]

56 A vector g is a subgradient of function f at point x, if f (y) ≥ f (x) + g⊤ (y − x), ∀y. [sent-102, score-0.115]

57 To adapt standard bundle methods to our specific min-max problem, we need to first address the critical issue of subgradient computation. [sent-103, score-0.354]

58 Assume that g is a gradient of h(·, q(x0 )) at x = x0 , then g is a subgradient of f (x) at x = x0 . [sent-106, score-0.115]

59 Proof: f(x) = max_y h(x, y) ≥ h(x, q(x0)) ≥ h(x0, q(x0)) + g⊤(x − x0) (since h(·, y) is convex for all y ∈ Y) = f(x0) + g⊤(x − x0) (by the definitions of f(x) and q(x0)). Thus g is a subgradient of f(x) at x = x0 according to the definition of subgradient. [sent-107, score-0.255]
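The proof suggests a simple recipe: evaluate the inner maximizer at the current point and take the gradient of h(·, y*) there. A hedged sketch over a finite candidate set of y values (the finite set and the toy objective are illustrative assumptions, not the paper's setting):

```python
import numpy as np

def max_subgradient(h, grad_h_x, x0, ys):
    """For f(x) = max_y h(x, y) with h(., y) convex in x, a subgradient at x0
    is any gradient of h(., y*) at x0, where y* attains the inner max at x0."""
    values = [h(x0, y) for y in ys]
    y_star = ys[int(np.argmax(values))]
    return grad_h_x(x0, y_star)

# Toy usage: f(x) = max_i a_i.x, whose subgradient is the maximizing a_i.
a = [np.array([1.0, -2.0]), np.array([0.5, 3.0])]
g = max_subgradient(lambda x, y: float(y @ x), lambda x, y: y,
                    np.array([1.0, 1.0]), a)
print(g)   # [0.5, 3.0]
```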

60 Algorithm 1 illustrates the bundle method we developed to solve the infinite min-max optimization (8), where the linear constraints (9) over Θx and Θy can be conveniently incorporated into the quadratic bound optimization. [sent-109, score-0.458]

61 One important issue in this algorithm is how to manage the size of the linear lower bound constraints formed from the active set B (defined in Algorithm 1), as it incrementally increases with new points being explored. [sent-110, score-0.11]

62 To solve this problem, we noticed that the vector of Lagrangian dual parameters α for the lower bound constraints, obtained by the quadratic optimization in step 1, is sparse, indicating that many lower bound constraints can be turned off. [sent-111, score-0.285]

63 Therefore, for the bundle method we developed, whenever the size of B is larger than a given constant b, we will keep the active points of B that correspond to the first b largest α values, and drop the remaining ones. [sent-113, score-0.216]
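A minimal sketch of this pruning rule, assuming the bundle is stored as parallel lists of linearization errors, subgradients, and dual weights α (the storage layout is an assumption, not taken from Algorithm 1):

```python
import numpy as np

def prune_bundle(errors, subgradients, alphas, b):
    """Keep the b bundle elements with the largest Lagrangian dual weights alpha
    and drop the rest, whose lower-bound cuts are inactive."""
    if len(alphas) <= b:
        return errors, subgradients, alphas
    keep = np.argsort(alphas)[-b:]
    return ([errors[i] for i in keep],
            [subgradients[i] for i in keep],
            [alphas[i] for i in keep])
```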

64 The convex optimization (8) works in the dual parameter space, where the size of the parameters Θ = {Θx , Θy }, t × (t + k), depends only on the number of training samples, t, not on the feature size, n. [sent-116, score-0.233]

65 For high dimensional small data sets (n ≫ t), our dual optimization is certainly a good option. [sent-117, score-0.228]

66 It might soon become too large to handle for the quadratic optimization step of the bundle method. [sent-119, score-0.334]

67 On the other hand, the optimization problem (8) possesses a nice semi-decomposable structure: each equality constraint in (9) involves only one row of Θ; that is, Θ can be separated into rows without affecting the equality constraints. [sent-120, score-0.122]

68 Based on this observation, we develop a coordinate descent procedure to obtain scalability of the bundle method over large data sets. [sent-121, score-0.354]

69 Specifically, we put an outer loop above the bundle method. [sent-122, score-0.286]

70 Within each outer loop iteration, we randomly separate the Θ parameters into m groups, with each group containing a subset of the rows of Θ; we then use the bundle method to sequentially optimize each subproblem defined on one group of Θ parameters while keeping the remaining rows of Θ fixed. [sent-123, score-0.286]
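A minimal sketch of this outer loop, with solve_subproblem standing in for the bundle method restricted to one group of rows (the function name and signature are placeholders, not the authors' code):

```python
import numpy as np

def coordinate_descent(Theta, solve_subproblem, m, n_outer=10, seed=0):
    """Outer loop: randomly split the rows of Theta into m groups and optimize
    each group in turn while the remaining rows stay fixed."""
    rng = np.random.default_rng(seed)
    for _ in range(n_outer):
        rows = rng.permutation(Theta.shape[0])
        for group in np.array_split(rows, m):
            # solve_subproblem is a placeholder for the bundle method applied
            # to the rows in `group`; it returns the updated rows.
            Theta[group, :] = solve_subproblem(Theta, group)
    return Theta
```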

71 Although coordinate descent with a nondifferentiable convex objective is not guaranteed to converge to a minimum in general [17], we have found that this procedure performs quite well in practice, as shown in the experimental results. [sent-124, score-0.378]

72 4 Projection for Testing Data One important issue for supervised dimensionality reduction is to map new testing data into the dimensionality-reduced principal dimensions. [sent-125, score-0.68]

73 θ̂ = arg min_θ ψ_ℓ(θ) + (µ/2)‖θ − θ*‖², subject to the linear constraints in (9), where ψ_ℓ(θ) = f(θ*) + max{ −ε̂ + ĝ⊤(θ − θ*), max_{(e_i, g_i) ∈ B} {−e_i + g_i⊤(θ − θ*)} } and B is the set of lower bound linear constraints [1]. [sent-134, score-0.122]

74 If f(θ*) − f(θ_ℓ) ≥ mδ_ℓ, then take a serious step: (1) update: e_i = e_i + f(θ_ℓ) − f(θ*) + g_i⊤(θ* − θ_ℓ); (2) update the aggregation: ĝ = Σ_i α_i g_i, ε̂ = Σ_i α_i e_i; (3) update the stored solution: θ* = θ_ℓ, f(θ*) = f(θ_ℓ). [sent-142, score-0.21]
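A minimal sketch of the serious-step test, following the standard proximal bundle convention that a candidate is accepted when the actual decrease is at least a fraction m of the predicted decrease δ_ℓ; the return convention and argument names are assumptions, not the paper's pseudocode.

```python
def bundle_step(f, theta_candidate, theta_star, f_star, delta, m=0.1):
    """Accept the candidate (serious step) if the actual decrease
    f(theta*) - f(theta_l) is at least m * delta; otherwise keep the
    current prox center (null step)."""
    f_candidate = f(theta_candidate)
    if f_star - f_candidate >= m * delta:
        return theta_candidate, f_candidate, True    # serious step
    return theta_star, f_star, False                 # null step
```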

75 The projection procedure (11) is used for colored MVU as well. [sent-149, score-0.192]

76 1 Experiments on Synthetic Data Two synthetic experiments were conducted to compare the five approaches under controlled conditions. [sent-153, score-0.156]

77 The first synthetic data set is formed by first generating four Gaussian clusters in a two-dimensional space, with each corresponding to one class, and then adding the third dimension to each point by uniformly sampling from a fixed interval. [sent-154, score-0.128]
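A minimal sketch of generating data of this kind; the cluster centers, spread, sample counts, and the interval for the third dimension are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
points_per_class, spread = 50, 0.3                       # illustrative values
centers = np.array([[0, 0], [3, 0], [0, 3], [3, 3]])     # four 2-D Gaussian clusters

X2d = np.vstack([c + spread * rng.normal(size=(points_per_class, 2)) for c in centers])
labels = np.repeat(np.arange(4), points_per_class)
third = rng.uniform(-1.0, 1.0, size=(X2d.shape[0], 1))   # uninformative third dimension
X = np.hstack([X2d, third])
```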

78 Figure 1 shows the projection results for each approach in a two dimensional space for 120 testing points after being trained on a set with 80 points. [sent-156, score-0.243]

79 The second synthetic experiment is designed to test the capability of performing nonlinear dimensionality reduction. [sent-158, score-0.323]

80 The synthetic data is formed by first generating two circles in a two dimensional space (one circle is located inside the other one), with each circle corresponding to one class, and then the third dimension sampled uniformly from a fixed interval. [sent-159, score-0.281]

81 Figure 2 shows the projection results for each approach in a two dimensional space for 120 testing points. [sent-163, score-0.173]

82 2 Experiments on Real Data To better characterize the performance of dimensionality reduction in a supervised manner, we conducted some experiments on a few high dimensional multi-class real world data sets. [sent-249, score-0.723]

83 For each approach, we first learned the dimensionality reduction model on the training set. [sent-253, score-0.362]

84 Moreover, we also trained a logistic regression classifier using the projected training set in the reduced low dimensional space. [sent-254, score-0.22]

85 (Note, for SEPCA, a classifier was trained simultaneously during the process of dimensionality reduction optimization. [sent-255, score-0.384]

86 ) Then the test data were projected into the low dimensional space according to each dimensionality reduction model. [sent-256, score-0.538]

87 To better understand the quality of the classification using projected data, we also included the standard classification results, indicated as ’FULL’, using the original high dimensional data. [sent-259, score-0.147]

88 (Note, we are not able to obtain any result for SDR GLM on the newsgroup data as it is inefficient for very high dimensional data. [sent-260, score-0.159]

89 But different from the synthetic experiments, LDA does not work well on these real data sets. [sent-263, score-0.118]

90 The results on both synthetic and real data show that SEPCA outperforms the other four approaches. [sent-264, score-0.118]

91 This might be attributed to its adaptive exponential family model approximation and its global optimization, while SDR GLM and SPPCA apparently suffer from local optima. [sent-265, score-0.483]

92 6 Conclusions In this paper, we propose a supervised exponential family PCA (SEPCA) approach, which can be solved efficiently to find global solutions. [sent-266, score-0.576]

93 Moreover, SEPCA overcomes the limitation of the Gaussian assumption of PCA and SPPCA by using a data adaptive approximation for exponential family models. [sent-267, score-0.446]

94 A simple, straightforward projection method for new testing data has also been constructed. [sent-268, score-0.122]

95 Empirical study suggests that this SEPCA outperforms other supervised dimensionality reduction approaches, such as SDR GLM, SPPCA, LDA and colored MVU. [sent-269, score-0.658]

96 Table 1: Data set statistics and test accuracy results (%). Dataset (#Data, #Dim, #Class): Yale (165, 4096, 15); YaleB (2414, 1024, 38); 11 Tumor (174, 12533, 11); Usps3456 (120, 256, 4); Newsgroup (19928, 25284, 20). The per-method accuracy columns (FULL, SEPCA, SDR GLM, SPPCA, LDA, colored MVU) are truncated in the extracted text. [sent-270, score-0.118]

97 A generalization of principal component analysis to the exponential family. [sent-318, score-0.271]

98 Efficient global optimization for exponential family PCA and low-rank matrix factorization. [sent-332, score-0.524]

99 Probabilistic non-linear principal component analysis with Gaussian process latent variable models. [sent-342, score-0.114]

100 Convergence of a block coordinate descent method for nondifferentiable minimization. [sent-389, score-0.249]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sepca', 0.405), ('sppca', 0.262), ('glm', 0.25), ('sdr', 0.25), ('pca', 0.24), ('bundle', 0.216), ('dimensionality', 0.196), ('family', 0.186), ('supervised', 0.178), ('exponential', 0.169), ('mvu', 0.167), ('reduction', 0.166), ('tr', 0.161), ('nondifferentiable', 0.153), ('fenchel', 0.125), ('zz', 0.119), ('ui', 0.119), ('colored', 0.118), ('subgradient', 0.115), ('convex', 0.104), ('dimensional', 0.099), ('lda', 0.095), ('synthetic', 0.095), ('optimization', 0.092), ('zi', 0.088), ('projection', 0.074), ('kda', 0.072), ('outer', 0.07), ('principal', 0.069), ('conducted', 0.061), ('log', 0.059), ('latent', 0.057), ('kernelized', 0.057), ('guo', 0.054), ('coordinate', 0.052), ('conveniently', 0.048), ('projected', 0.048), ('conjugates', 0.048), ('deploy', 0.048), ('optima', 0.048), ('testing', 0.048), ('conjugate', 0.047), ('intrinsic', 0.044), ('descent', 0.044), ('global', 0.043), ('ei', 0.042), ('discriminant', 0.042), ('moreover', 0.042), ('sajama', 0.042), ('scalability', 0.042), ('gi', 0.038), ('qd', 0.038), ('newsgroup', 0.038), ('dual', 0.037), ('max', 0.036), ('devised', 0.036), ('maximization', 0.034), ('matrix', 0.034), ('unfolding', 0.034), ('sher', 0.034), ('maxy', 0.034), ('component', 0.033), ('probabilistic', 0.033), ('formed', 0.033), ('constraints', 0.032), ('nonlinear', 0.032), ('limitation', 0.031), ('overcomes', 0.031), ('label', 0.03), ('suffer', 0.03), ('affecting', 0.03), ('conditional', 0.03), ('approximation', 0.029), ('lagrangian', 0.029), ('low', 0.029), ('attempting', 0.028), ('circle', 0.027), ('labels', 0.026), ('quadratic', 0.026), ('apparently', 0.026), ('objective', 0.025), ('minimization', 0.025), ('classi', 0.024), ('formulate', 0.024), ('gained', 0.024), ('gaussian', 0.024), ('issue', 0.023), ('discovering', 0.023), ('alternating', 0.023), ('optimizes', 0.023), ('real', 0.023), ('formulation', 0.023), ('update', 0.023), ('logistic', 0.022), ('solve', 0.022), ('likelihood', 0.022), ('features', 0.022), ('bound', 0.022), ('able', 0.022), ('trained', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

2 0.14043881 31 nips-2008-Bayesian Exponential Family PCA

Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1

3 0.11855578 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan

Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.

4 0.11757673 61 nips-2008-Diffeomorphic Dimensionality Reduction

Author: Christian Walder, Bernhard Schölkopf

Abstract: This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets. 1

5 0.10778491 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

Author: Yen-yu Lin, Tyng-luh Liu, Chiou-shann Fuh

Abstract: In solving complex visual learning tasks, adopting multiple descriptors to more precisely characterize the data has been a feasible way for improving performance. These representations are typically high dimensional and assume diverse forms. Thus finding a way to transform them into a unified space of lower dimension generally facilitates the underlying tasks, such as object recognition or clustering. We describe an approach that incorporates multiple kernel learning with dimensionality reduction (MKL-DR). While the proposed framework is flexible in simultaneously tackling data in various feature representations, the formulation itself is general in that it is established upon graph embedding. It follows that any dimensionality reduction techniques explainable by graph embedding can be generalized by our method to consider data in multiple feature representations.

6 0.10311505 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

7 0.10286751 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

8 0.10233375 200 nips-2008-Robust Kernel Principal Component Analysis

9 0.091457911 62 nips-2008-Differentiable Sparse Coding

10 0.09047509 238 nips-2008-Theory of matching pursuit

11 0.088914536 216 nips-2008-Sparse probabilistic projections

12 0.081586994 113 nips-2008-Kernelized Sorting

13 0.075271487 143 nips-2008-Multi-label Multiple Kernel Learning

14 0.073119655 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

15 0.072929285 60 nips-2008-Designing neurophysiology experiments to optimally constrain receptive field models along parametric submanifolds

16 0.072220467 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data

17 0.071674287 78 nips-2008-Exact Convex Confidence-Weighted Learning

18 0.070854813 214 nips-2008-Sparse Online Learning via Truncated Gradient

19 0.06975814 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

20 0.067437075 245 nips-2008-Unlabeled data: Now it helps, now it doesn't


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.216), (1, -0.077), (2, -0.052), (3, 0.03), (4, 0.007), (5, 0.02), (6, -0.005), (7, 0.036), (8, -0.09), (9, -0.012), (10, 0.136), (11, -0.066), (12, 0.04), (13, 0.134), (14, 0.035), (15, -0.072), (16, 0.138), (17, 0.057), (18, 0.003), (19, -0.079), (20, -0.033), (21, -0.082), (22, -0.013), (23, -0.001), (24, -0.074), (25, -0.147), (26, -0.101), (27, 0.044), (28, -0.178), (29, 0.068), (30, -0.049), (31, -0.089), (32, -0.04), (33, -0.044), (34, 0.022), (35, -0.008), (36, -0.002), (37, 0.021), (38, -0.039), (39, -0.053), (40, 0.066), (41, -0.104), (42, 0.079), (43, -0.024), (44, 0.04), (45, -0.052), (46, 0.096), (47, -0.077), (48, -0.104), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94815087 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

2 0.74615967 31 nips-2008-Bayesian Exponential Family PCA

Author: Shakir Mohamed, Zoubin Ghahramani, Katherine A. Heller

Abstract: Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data. 1

3 0.72803003 200 nips-2008-Robust Kernel Principal Component Analysis

Author: Minh H. Nguyen, Fernando Torre

Abstract: Kernel Principal Component Analysis (KPCA) is a popular generalization of linear PCA that allows non-linear feature extraction. In KPCA, data in the input space is mapped to higher (usually) dimensional feature space where the data can be linearly modeled. The feature space is typically induced implicitly by a kernel function, and linear PCA in the feature space is performed via the kernel trick. However, due to the implicitness of the feature space, some extensions of PCA such as robust PCA cannot be directly generalized to KPCA. This paper presents a technique to overcome this problem, and extends it to a unified framework for treating noise, missing data, and outliers in KPCA. Our method is based on a novel cost function to perform inference in KPCA. Extensive experiments, in both synthetic and real data, show that our algorithm outperforms existing methods. 1

4 0.70132601 61 nips-2008-Diffeomorphic Dimensionality Reduction

Author: Christian Walder, Bernhard Schölkopf

Abstract: This paper introduces a new approach to constructing meaningful lower dimensional representations of sets of data points. We argue that constraining the mapping between the high and low dimensional spaces to be a diffeomorphism is a natural way of ensuring that pairwise distances are approximately preserved. Accordingly we develop an algorithm which diffeomorphically maps the data near to a lower dimensional subspace and then projects onto that subspace. The problem of solving for the mapping is transformed into one of solving for an Eulerian flow field which we compute using ideas from kernel methods. We demonstrate the efficacy of our approach on various real world data sets. 1

5 0.60630023 238 nips-2008-Theory of matching pursuit

Author: Zakria Hussain, John S. Shawe-taylor

Abstract: We analyse matching pursuit for kernel principal components analysis (KPCA) by proving that the sparse subspace it produces is a sample compression scheme. We show that this bound is tighter than the KPCA bound of Shawe-Taylor et al [7] and highly predictive of the size of the subspace needed to capture most of the variance in the data. We analyse a second matching pursuit algorithm called kernel matching pursuit (KMP) which does not correspond to a sample compression scheme. However, we give a novel bound that views the choice of subspace of the KMP algorithm as a compression scheme and hence provide a VC bound to upper bound its future loss. Finally we describe how the same bound can be applied to other matching pursuit related algorithms. 1

6 0.50496298 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

7 0.49674588 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

8 0.4966588 126 nips-2008-Localized Sliced Inverse Regression

9 0.48124936 216 nips-2008-Sparse probabilistic projections

10 0.46928078 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

11 0.45933282 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

12 0.44683626 90 nips-2008-Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity

13 0.43514988 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

14 0.43428323 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions

15 0.42920375 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach

16 0.41694558 143 nips-2008-Multi-label Multiple Kernel Learning

17 0.41653374 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization

18 0.40705031 196 nips-2008-Relative Margin Machines

19 0.40554571 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning

20 0.40003806 57 nips-2008-Deflation Methods for Sparse PCA


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.077), (7, 0.079), (8, 0.21), (12, 0.031), (28, 0.197), (57, 0.066), (59, 0.023), (63, 0.025), (71, 0.01), (77, 0.096), (78, 0.016), (83, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87059575 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification

Author: Simon Lacoste-julien, Fei Sha, Michael I. Jordan

Abstract: Probabilistic topic models have become popular as methods for dimensionality reduction in collections of text documents or images. These models are usually treated as generative models and trained using maximum likelihood or Bayesian methods. In this paper, we discuss an alternative: a discriminative framework in which we assume that supervised side information is present, and in which we wish to take that side information into account in finding a reduced dimensionality representation. Specifically, we present DiscLDA, a discriminative variation on Latent Dirichlet Allocation (LDA) in which a class-dependent linear transformation is introduced on the topic mixture proportions. This parameter is estimated by maximizing the conditional likelihood. By using the transformed topic mixture proportions as a new representation of documents, we obtain a supervised dimensionality reduction algorithm that uncovers the latent structure in a document collection while preserving predictive power for the task of classification. We compare the predictive power of the latent structure of DiscLDA with unsupervised LDA on the 20 Newsgroups document classification task and show how our model can identify shared topics across classes as well as class-dependent topics.

same-paper 2 0.84017599 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

Author: Yuhong Guo

Abstract: Recently, supervised dimensionality reduction has been gaining attention, owing to the realization that data labels are often available and indicate important underlying structure in the data. In this paper, we present a novel convex supervised dimensionality reduction approach based on exponential family PCA, which is able to avoid the local optima of typical EM learning. Moreover, by introducing a sample-based approximation to exponential family models, it overcomes the limitation of the prevailing Gaussian assumptions of standard PCA, and produces a kernelized formulation for nonlinear supervised dimensionality reduction. A training algorithm is then devised based on a subgradient bundle method, whose scalability can be gained using a coordinate descent procedure. The advantage of our global optimization approach is demonstrated by empirical results over both synthetic and real data. 1

3 0.77189296 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

4 0.76519048 194 nips-2008-Regularized Learning with Networks of Features

Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar

Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1

5 0.76458132 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Author: Liu Yang, Rong Jin, Rahul Sukthankar

Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

6 0.76373154 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

7 0.76323491 216 nips-2008-Sparse probabilistic projections

8 0.76239222 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

9 0.76078683 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations

10 0.76017773 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

11 0.7589041 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

12 0.75772947 202 nips-2008-Robust Regression and Lasso

13 0.75769401 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

14 0.75598145 173 nips-2008-Optimization on a Budget: A Reinforcement Learning Approach

15 0.75579005 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

16 0.75560999 195 nips-2008-Regularized Policy Iteration

17 0.7555629 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features

18 0.75502419 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

19 0.75500768 238 nips-2008-Theory of matching pursuit

20 0.75429654 62 nips-2008-Differentiable Sparse Coding