nips nips2010 nips2010-284 knowledge-graph by maker-knowledge-mining

284 nips-2010-Variational bounds for mixed-data factor analysis


Source: pdf

Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. [sent-11, score-0.487]

2 The algorithm is based on a simple quadratic bound to the log-sum-exp function. [sent-12, score-0.159]

3 In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. [sent-13, score-0.274]

4 We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. [sent-14, score-0.378]

5 A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. [sent-15, score-0.296]

6 1 Introduction Continuous latent factor models, such as factor analysis (FA) and probabilistic principal components analysis (PPCA), are very commonly used density models for continuous-valued data. [sent-17, score-0.402]

7 They have many applications including latent factor discovery, dimensionality reduction, and missing data imputation. [sent-18, score-0.405]

8 The factor analysis model asserts that a low-dimensional continuous latent factor zn ∈ RL underlies each high-dimensional observed data vector yn ∈ RD . [sent-19, score-0.813]

9 Standard factor analysis models assume the prior on the latent factor has the form p(zn ) = N (zn |0, I), while the likelihood has the form p(yn |zn ) = N (yn |Wzn + µ, Σ). [sent-20, score-0.426]

10 W is the D × L factor loading matrix, µ is an offset term, and Σ is a D × D diagonal matrix specifying the marginal noise variances. [sent-21, score-0.283]
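To make this generative process concrete, here is a minimal sketch that samples from a standard FA model; the dimensions, parameter values, and function name are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def sample_fa(N, D, L, seed=0):
    """Sample N cases from z_n ~ N(0, I_L), y_n | z_n ~ N(W z_n + mu, Sigma), Sigma diagonal."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(D, L))              # D x L factor loading matrix
    mu = rng.normal(size=D)                  # offset term
    sigma2 = rng.uniform(0.1, 1.0, size=D)   # diagonal noise variances
    Z = rng.normal(size=(N, L))              # latent factors, one row per data case
    Y = Z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(sigma2)
    return Y, Z, W, mu, sigma2

Y, Z, W, mu, sigma2 = sample_fa(N=500, D=16, L=4)
```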

11 A problem arises because the Gaussian prior on p(zn ) is not conjugate to the likelihood except when yn also has a Gaussian distribution (the standard FA model). [sent-27, score-0.166]

12 The simplest is to approximate the posterior p(zn |yn ) using a point estimate, which is equivalent to viewing the latent variables as parameters and estimating them by maximum likelihood. [sent-29, score-0.26]

13 We refer to it as the “MM” approach to fitting the general FA model since we maximize over zn in the E-step, as well as W in the M-step. [sent-32, score-0.252]

14 The main drawback of the MM approach is that it ignores posterior uncertainty in zn , which can result in over-fitting unless the model is carefully regularized [WCS08]. [sent-33, score-0.38]

15 This is a particular concern when we have missing data. [sent-34, score-0.154]

16 The opposite end of the model estimation spectrum is to integrate out both zn and W using Markov chain Monte Carlo methods. [sent-35, score-0.252]

17 We will refer to this as the “SS” approach to indicate that we are integrating out both zn and W by sampling. [sent-37, score-0.252]

18 The SS approach preserves posterior uncertainty about zn (unlike the MM approach) and is robust to missing data, but can have a significantly higher computational cost. [sent-38, score-0.534]

19 In this work, we study a variational EM model fitting approach that preserves posterior uncertainty about zn , is robust to missing data, and is more computationally efficient than SS. [sent-39, score-0.652]

20 We refer to this as the “VM” approach to indicate that we integrate over zn in the E-step after applying a variational bound, and maximize over W in the M-step. [sent-40, score-0.37]

21 We focus on the case of continuous (Gaussian) and categorical data. [sent-41, score-0.177]

22 Our main contribution is the development of variational EM algorithms for factor analysis and mixtures of factor analyzers based on a simple quadratic lower bound to the multinomial likelihood (which subsumes the Bernoulli case) [Boh92]. [sent-42, score-0.825]

23 This bound results in an EM iteration that is computationally more efficient than the bound previously proposed by Jaakkola for binary PCA when the training data is fully observed [JJ96], but is less tight. [sent-43, score-0.297]

24 2 The Generalized Mixture of Factor Analyzers Model In this section, we describe a model for mixed continuous and discrete data that we call the generalized mixture of factor analyzers model. [sent-45, score-0.667]

25 This model has two important special cases: mixture models and factor analysis, both for mixed continuous and discrete data. [sent-46, score-0.486]

26 In this work, we focus on Gaussian distributed continuous data and multinomially distributed discrete data. [sent-48, score-0.217]

27 Superscripts C and D indicate variables associated with continuous and discrete data respectively. [sent-61, score-0.209]

28 We let yn^C ∈ R^Dc denote the continuous data vector and ynd^D ∈ {1, . . . , M+1} the dth discrete variable. [sent-62, score-0.326]

29 We use a 1-of-(M+1) encoding for the discrete variables, where a variable ynd^D = m is represented by an (M+1)-dimensional vector ynd^D in which the m'th element is set to 1 and all remaining elements equal 0. [sent-66, score-0.402]

30 We denote the complete data vector by yn = [yn^C, yn1^D, . . . , ynDd^D]. [sent-67, score-0.222]
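A small sketch of this encoding, assuming each discrete variable takes values in {1, ..., M+1}; the helper names and example shapes are ours, for illustration only.

```python
import numpy as np

def one_of_m_plus_one(y, M):
    """Encode y in {1, ..., M+1} as an (M+1)-dimensional indicator vector."""
    v = np.zeros(M + 1)
    v[y - 1] = 1.0                     # the m'th element is 1, all others 0
    return v

def complete_data_vector(y_cont, y_disc, M):
    """Stack the continuous block and the encoded discrete blocks into one vector y_n."""
    blocks = [np.asarray(y_cont, dtype=float)]
    blocks += [one_of_m_plus_one(y, M) for y in y_disc]
    return np.concatenate(blocks)

# e.g. Dc = 2 continuous values, Dd = 2 discrete values with M + 1 = 3 categories each
y_n = complete_data_vector([0.3, -1.2], [2, 3], M=2)   # length Dc + Dd*(M+1) = 8
```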

31 The generative process begins by sampling a state of the mixture indicator variable qn for each data case n from a K-state multinomial distribution with parameters π. [sent-71, score-0.273]

32 Simultaneously, a length L latent factor vector zn ∈ RL is sampled from a zero-mean Gaussian distribution with precision parameter λz . [sent-72, score-0.503]

33 The natural parameters of the distribution over the data variables are obtained by passing the latent factor vector zn through a linear function defined by a factor loading matrix and an offset term, both of which depend on the setting of the mixture indicator variable qn. [sent-74, score-1.024]

34 We note that the factor loading matrices for the kth mixture component are Wk^C ∈ R^{Dc×L} and Wdk^D ∈ R^{(M+1)×L}, while the offsets are µk^C ∈ R^Dc and µdk^D ∈ R^{M+1}. [sent-82, score-0.355]

35 We define the ensemble of factor loading matrices and offsets to be Wk = [Wk^C, W1k^D, W2k^D, . . . ]. [sent-83, score-0.297]

36 For each row of each factor loading matrix Wk, we use a Gaussian prior of the form N(0, λw^{−1} I). [sent-92, score-0.265]

37 As mentioned at the start of this section, this general model has two important special cases: generalized factor analysis and mixture models for mixed continuous and discrete data. [sent-94, score-0.512]

38 The factor analysis model is obtained by using one mixture component and at least one latent factor (K = 1, L > 1). [sent-95, score-0.487]

39 The mixture model is obtained by using no latent factors and at least one mixture component (K > 1, L = 0). [sent-96, score-0.399]

40 In the mixture model case where L = 0, the distribution is modeled through the offset parameters µk only. [sent-97, score-0.16]

41 Before concluding this section, we point out one key difference between the current model and other latent factor models for discrete data like multinomial PCA [BJ04] and latent Dirichlet allocation (LDA) [BNJ03]. [sent-99, score-0.547]

42 Introduction of discrete observations, however, makes it intractable to compute the posterior as the likelihood for these observations is not conjugate to the Gaussian prior on the latent factors. [sent-110, score-0.383]

43 To overcome these problems, we propose to use a quadratic bound on the LSE function. [sent-111, score-0.159]

44 For simplicity, we describe the bound only for one discrete measurement with K = 1 and µk = 0 in order to suppress the n, k and d subscripts. [sent-115, score-0.216]

45 By substituting this bound into the log-likelihood, completing the square, and exponentiating, we obtain the Gaussian lower bound described below. [sent-122, score-0.234]

46 p(y^D | z, W) ≥ h(ψ) N(ỹψ | Wz, A^{−1}) (10), where ỹψ = A^{−1}(bψ + y^D) (11) and h(ψ) = |2πA^{−1}|^{1/2} exp( ½ ỹψ^T A ỹψ − cψ ) (12). We use this result to obtain a lower bound for each mixed data vector yn. [sent-124, score-0.3]
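To make the completing-the-square step explicit, here is a short reconstruction of the derivation behind (10)–(12), using the quadratic bound lse(η) ≤ ½ η^T A η − bψ^T η + cψ with η = Wz (our own working under that assumption, not copied from the paper):

```latex
\log p(y^D \mid z, W) = y^{D\top}\eta - \mathrm{lse}(\eta)
  \;\ge\; -\tfrac{1}{2}\eta^{\top} A \eta + (b_\psi + y^D)^{\top}\eta - c_\psi
  = -\tfrac{1}{2}(\eta - \tilde{y}_\psi)^{\top} A (\eta - \tilde{y}_\psi)
    + \tfrac{1}{2}\tilde{y}_\psi^{\top} A \tilde{y}_\psi - c_\psi ,
\qquad \tilde{y}_\psi = A^{-1}(b_\psi + y^D).
```

Exponentiating and inserting the Gaussian normalizer |2πA^{−1}|^{−1/2} then gives exactly the form p(y^D | z, W) ≥ h(ψ) N(ỹψ | Wz, A^{−1}) quoted above.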

47 Given this pseudo observation, the computation of the posterior means mn and covariances Vn is similar to the Gaussian FA model, as seen below. [sent-136, score-0.153]

48 This result can be generalized to the mixture case in a straightforward way. [sent-137, score-0.142]

49 The M-step is the same as in mixtures of Gaussian factor analyzers [GH96]. [sent-138, score-0.333]

50 Vn = (W^T Σ̃^{−1} W + λz IL)^{−1}, mn = Vn W^T Σ̃^{−1} ỹn (13). The only question remaining is how to obtain the value of ψ. [sent-139, score-0.142]
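A minimal sketch of this E-step update for one data case, assuming Σ̃ is the pseudo noise covariance assembled from the bound and ỹn is the pseudo observation; all names here are illustrative.

```python
import numpy as np

def e_step_posterior(W, Sigma_tilde, y_tilde, lambda_z):
    """V_n = (W^T Sigma_tilde^{-1} W + lambda_z I_L)^{-1},  m_n = V_n W^T Sigma_tilde^{-1} y_tilde."""
    L = W.shape[1]
    Sinv_W = np.linalg.solve(Sigma_tilde, W)              # Sigma_tilde^{-1} W
    V_n = np.linalg.inv(W.T @ Sinv_W + lambda_z * np.eye(L))
    m_n = V_n @ (Sinv_W.T @ y_tilde)                      # V_n (W^T Sigma_tilde^{-1} y_tilde)
    return m_n, V_n

# toy usage with random placeholder quantities
rng = np.random.default_rng(0)
D, L = 6, 2
W = rng.normal(size=(D, L))
Sigma_tilde = np.diag(rng.uniform(0.5, 2.0, size=D))
m_n, V_n = e_step_posterior(W, Sigma_tilde, rng.normal(size=D), lambda_z=1.0)
```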

51 If there is missing data, Vn will change across data cases, so the total cost will be O(N I(L^3 + L^2 D)). [sent-151, score-0.154]

52 Comparison with Other Bounding Methods: In the binary case, the Bohning bound reduces to the following: log(1 + e^η) ≤ ½ A η^2 − bψ η + cψ, where A = 1/4, bψ = Aψ − (1 + e^{−ψ})^{−1}, and cψ = ½ Aψ^2 − (1 + e^{−ψ})^{−1} ψ + log(1 + e^ψ). [sent-153, score-0.156]
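The sketch below evaluates this binary-case Bohning bound and checks numerically that it upper-bounds log(1 + e^η); it is a sanity check under our reading of the formula, not the authors' code.

```python
import numpy as np

def bohning_bound_binary(eta, psi):
    """Quadratic upper bound on log(1 + exp(eta)), expanded around psi (Bohning)."""
    A = 0.25
    sig = 1.0 / (1.0 + np.exp(-psi))                      # logistic sigmoid at psi
    b = A * psi - sig
    c = 0.5 * A * psi**2 - sig * psi + np.log1p(np.exp(psi))
    return 0.5 * A * eta**2 - b * eta + c

eta = np.linspace(-6.0, 6.0, 241)
# bound holds everywhere and is tight at eta = psi
assert np.all(bohning_bound_binary(eta, psi=1.0) >= np.log1p(np.exp(eta)) - 1e-12)
```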

53 It is interesting to compare this bound to Jaakkola's bound [JJ96] used in [Tip98, YT04]. [sent-154, score-0.234]

54 This bound can also be written in the quadratic form log(1 + e^η) ≤ ½ Ãξ η^2 − b̃ξ η + c̃ξ, where Ãξ = 2λξ, b̃ξ = −1/2, c̃ξ = −λξ ξ^2 − ½ ξ + log(1 + e^ξ), and λξ = (1/(2ξ)) ( (1 + e^{−ξ})^{−1} − ½ ). [sent-155, score-0.159]
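For comparison, here is the Jaakkola bound in the same quadratic form, again only as a numerical illustration of our reading of the formula; when both bounds are expanded at the same point, the Jaakkola curvature 2λξ ≤ 1/4 is what makes it the tighter of the two.

```python
import numpy as np

def jaakkola_bound_binary(eta, xi):
    """Quadratic upper bound on log(1 + exp(eta)) with variational parameter xi (Jaakkola)."""
    lam = (1.0 / (2.0 * xi)) * (1.0 / (1.0 + np.exp(-xi)) - 0.5)   # assumes xi != 0
    A, b = 2.0 * lam, -0.5
    c = -lam * xi**2 - 0.5 * xi + np.log1p(np.exp(xi))
    # with xi = psi this curve lies below the Bohning one away from the expansion point (tighter)
    return 0.5 * A * eta**2 - b * eta + c

eta = np.linspace(-6.0, 6.0, 241)
assert np.all(jaakkola_bound_binary(eta, xi=1.0) >= np.log1p(np.exp(eta)) - 1e-12)
```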

55 Consequently, the cost of an E-step is O(N I(L^3 + L^2 D)), even if there is no missing data (note the L^3 term inside the N I loop). [sent-158, score-0.154]

56 To explore the speed vs accuracy trade-off, we use the synthetic binary data described in [MHG08] with N = 600, D = 16, and 10% missing data. [sent-159, score-0.252]

57 We learn on the observed entries in the data matrix and compute the mean squared error (MSE) on the held out missing entries as in [MHG08]. [sent-161, score-0.234]

58 We see in Figure 2 (top left) that the Jaakkola bound gives a lower MSE than Bohning’s bound in less time on this data. [sent-163, score-0.234]

59 Figure 2 (bottom left) shows that the Bohning bound exhibits much better scalability per iteration than the Jaakkola bound in this regime. [sent-169, score-0.258]

60 The speed issue becomes more serious when combining binary variables with categorical variables. [sent-170, score-0.195]

61 Firstly, there is no direct extension of the Jaakkola bound to the general categorical case. [sent-171, score-0.215]

62 Hence, to combine categorical variables with binary variables, we can use the Jaakkola bound for the binary variables and the Bohning bound for the rest. [sent-172, score-0.324]

63 For computational simplicity, we use Bohning’s bound for both binary and categorical data. [sent-174, score-0.254]

64 Various other bounds and approximations to the multinomial likelihood also exist; however, they are all more computationally intensive, and do not give an efficient variational algorithm. [sent-175, score-0.213]

65 An extension of the Jaakkola bound to the multinomial case was given in [Bou07]. [sent-177, score-0.183]

66 This bound does not give closed form updates for the E and M steps so a numerical optimizer needs to be used (see [BL06] for details). [sent-180, score-0.142]

67 Figure 2: Top left: accuracy vs speed of variational EM with the Bohning bound (FA-VM), Jaakkola bound (FA-VJM) and HMC (FA-SS) on synthetic binary data (axes: MSE vs. prior strength λW). [sent-195, score-0.45]

68 Bottom left: Time per iteration of EM with Bohning bound and Jaakkola bound as we vary D. [sent-196, score-0.258]

69 We show results on the test and training sets, for 10% and 50% missing data. [sent-198, score-0.154]

70 Maximize-Maximize (MM) Method: The simplest approach to fit the FA model is to maximize log p(Y, Z, W | λw, λz) with respect to Z and W, the matrix of latent factor values and the factor loading matrix. [sent-202, score-0.49]

71 To handle missing data, we simply evaluate the gradients by only summing over the observed entries of Y. [sent-206, score-0.194]

72 At test time, consider a data vector consisting of missing and observed components, y∗ = [y∗m , y∗o ]. [sent-207, score-0.154]

73 To fill in the missing entries, we compute ẑ∗ = arg max p(z∗, y∗^o | Ŵ) and use it with θ̂ to predict y∗^m. [sent-208, score-0.154]
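For the fully Gaussian case this MAP step over z∗ has a closed form; below is a sketch under the assumptions of an N(0, I) prior on z and a diagonal Σ, using only the observed entries. All names are illustrative.

```python
import numpy as np

def mm_impute_gaussian(y_obs, obs_idx, W, mu, sigma2):
    """MAP z_* from the observed entries of one case, then predict its missing entries."""
    D, L = W.shape
    Wo, muo, s2o = W[obs_idx], mu[obs_idx], sigma2[obs_idx]
    # argmax_z N(z | 0, I) N(y_obs | Wo z + muo, diag(s2o))  ->  ridge-regression solution
    A = Wo.T @ (Wo / s2o[:, None]) + np.eye(L)
    z_star = np.linalg.solve(A, Wo.T @ ((y_obs - muo) / s2o))
    y_hat = W @ z_star + mu                                # predictions for all D entries
    miss_idx = np.setdiff1d(np.arange(D), obs_idx)
    return z_star, y_hat[miss_idx]
```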

74 We consider the case of 10% and 50% missing data. [sent-215, score-0.154]

75 We evaluate the sensitivity of the methods to the setting of the prior precision parameter λw by varying it over the range 10^−2 to 10^2. [sent-216, score-0.163]

76 We train on the observed entries in the training set, and then compute MSE on the missing entries in the training and test sets. [sent-219, score-0.234]

77 We can see that this sensitivity increases as a function of the missing data rate. [sent-222, score-0.219]

78 By contrast, the VM method takes the posterior uncertainty about Z into account, resulting in almost no sensitivity to λw over this range. [sent-227, score-0.193]

79 To handle missing data, we can simply evaluate the gradients by only summing over the observed entries of Y. [sent-234, score-0.194]

80 We do not need to impute the missing entries on the training set. [sent-235, score-0.223]

81 In Figure 2 (right), we see that SS is insensitive to λw , just like VM, since it also models posterior uncertainty in Z (note that the absolute MSE values are higher for SS than VM since for continuous data, VM corresponds to EM with an exact posterior). [sent-238, score-0.207]

82 5 Experiments on Real Data In this section, we evaluate the performance of our model on real data with mixed continuous and discrete variables. [sent-241, score-0.25]

83 We consider the following three cases of our model: (1) a model with latent factors but no mixtures (FA), (2) a model with mixtures but no latent factors (Mix), and (3) the general mixture of factor analyzers model (MixFA). [sent-242, score-0.841]

84 We use the validation set to determine the number of latent factors and the number of mixtures (ranges shown in the table) with imputation error (described below) as our performance objective. [sent-252, score-0.284]

85 One way to assess the performance of a generative model is to see how well it can impute missing data. [sent-258, score-0.183]

86 We do this by randomly introducing missing values in the test data with a missing data rate of 0. [sent-259, score-0.308]

87 For continuous variables, we compute the imputation MSE averaged over all the missing values (these variables are standardized beforehand). [sent-261, score-0.323]

88 For discrete variables, we report the cross-entropy (averaged over missing values), defined as y^T log p̂, where p̂m is the estimated probability that y = m and y uses the one-of-(M+1) encoding. [sent-262, score-0.253]
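A small sketch of these two imputation scores, assuming the continuous variables are standardized and p̂ holds per-category predicted probabilities; the names are illustrative and the cross-entropy follows the y^T log p̂ definition given above.

```python
import numpy as np

def imputation_mse(y_true, y_pred, miss_mask):
    """Mean squared error over the missing continuous entries only."""
    return np.mean((y_true[miss_mask] - y_pred[miss_mask]) ** 2)

def imputation_cross_entropy(y_onehot, p_hat):
    """Average of y^T log p_hat over the missing categorical values.
    y_onehot: (n_missing, M+1) one-of-(M+1) codes; p_hat: matching predicted probabilities."""
    return np.mean(np.sum(y_onehot * np.log(p_hat), axis=1))
```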

89 L and K are the ranges of number of latent factors and mixture components used for cross validation. [sent-286, score-0.319]

90 Right: the figure shows the imputation error for each dataset for continuous and discrete variables. [sent-288, score-0.237]

91 6 Discussion and Future Work In this work we have proposed a new variational EM algorithm for fitting factor analysis models with mixed data. [sent-290, score-0.31]

92 The algorithm is based on the Bohning bound, a simple quadratic bound to the log-sum-exp function. [sent-291, score-0.159]

93 In the special case of fully observed binary data, the Bohning bound iteration is theoretically faster than Jaakkola’s bound iteration and we have demonstrated this advantage empirically. [sent-292, score-0.321]

94 More importantly, the Bohning bound also easily extends to the categorical case. [sent-293, score-0.215]

95 This enables, for the first time, an efficient variational method for fitting a factor analysis model to mixed continuous, binary, and categorical observations. [sent-294, score-0.408]

96 In comparison to the maximize-maximize (MM) method, which forms the basis of ePCA and other matrix factorization methods, our variational EM method accounts for posterior uncertainty in the latent factors, leading to reduced sensitivity to hyper parameters. [sent-295, score-0.442]

97 We note that the quadratic bound that we study can be used in a variety of other models, such as linear-Gaussian state-space models with categorical observations [SH03]. [sent-301, score-0.257]

98 The bound might also be useful in the context of the correlated topic model [BL06, AX07], where similar variational EM methods have been applied. [sent-303, score-0.235]

99 In the Bayesian statistics literature, it is common to use latent factor models combined with a probit observation model; this allows one to perform inference for the latent states using efficient auxiliary-variable MCMC techniques (see e. [sent-304, score-0.411]

100 Factor analysis with (mixed) observed and latent variables in the exponential family. [sent-446, score-0.193]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fa', 0.417), ('bohning', 0.33), ('vm', 0.255), ('zn', 0.252), ('mm', 0.196), ('lse', 0.187), ('analyzers', 0.155), ('mixfa', 0.155), ('missing', 0.154), ('mse', 0.15), ('ynd', 0.136), ('latent', 0.131), ('jaakkola', 0.126), ('em', 0.122), ('factor', 0.12), ('ss', 0.119), ('loading', 0.119), ('variational', 0.118), ('bound', 0.117), ('mixture', 0.116), ('yn', 0.111), ('discrete', 0.099), ('mix', 0.099), ('posterior', 0.098), ('categorical', 0.098), ('ases', 0.097), ('wdk', 0.097), ('qn', 0.091), ('dd', 0.08), ('continuous', 0.079), ('wk', 0.076), ('mixed', 0.072), ('multinomial', 0.066), ('sensitivity', 0.065), ('vn', 0.064), ('pca', 0.059), ('imputation', 0.059), ('hamiltonian', 0.059), ('auto', 0.058), ('ndk', 0.058), ('wzn', 0.058), ('mixtures', 0.058), ('tting', 0.057), ('hmc', 0.053), ('md', 0.047), ('bc', 0.047), ('british', 0.046), ('dk', 0.045), ('offset', 0.044), ('sm', 0.044), ('softmax', 0.042), ('quadratic', 0.042), ('entries', 0.04), ('binary', 0.039), ('bddf', 0.039), ('emtiyaz', 0.039), ('multinomially', 0.039), ('ppca', 0.039), ('rdc', 0.039), ('vjm', 0.039), ('wdd', 0.039), ('xerox', 0.039), ('dc', 0.038), ('vancouver', 0.037), ('factors', 0.036), ('cross', 0.036), ('monte', 0.035), ('adult', 0.035), ('epca', 0.034), ('columbia', 0.034), ('vs', 0.032), ('europe', 0.031), ('ld', 0.031), ('principal', 0.031), ('exponential', 0.031), ('mn', 0.031), ('variables', 0.031), ('blei', 0.03), ('gaussian', 0.03), ('diag', 0.03), ('uncertainty', 0.03), ('covariance', 0.03), ('offsets', 0.029), ('impute', 0.029), ('probit', 0.029), ('likelihood', 0.029), ('carlo', 0.029), ('dth', 0.028), ('speed', 0.027), ('yd', 0.027), ('prior', 0.026), ('family', 0.026), ('generalized', 0.026), ('sub', 0.025), ('im', 0.025), ('optimizer', 0.025), ('canada', 0.024), ('strength', 0.024), ('pseudo', 0.024), ('iteration', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 284 nips-2010-Variational bounds for mixed-data factor analysis

Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1

2 0.17007771 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

Author: Armand Joulin, Jean Ponce, Francis R. Bach

Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1

3 0.14884514 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks

Author: Kentaro Katahira, Kazuo Okanoya, Masato Okada

Abstract: When animals repeatedly choose actions from multiple alternatives, they can allocate their choices stochastically depending on past actions and outcomes. It is commonly assumed that this ability is achieved by modifications in synaptic weights related to decision making. Choice behavior has been empirically found to follow Herrnstein’s matching law. Loewenstein & Seung (2006) demonstrated that matching behavior is a steady state of learning in neural networks if the synaptic weights change proportionally to the covariance between reward and neural activities. However, their proof did not take into account the change in entire synaptic distributions. In this study, we show that matching behavior is not necessarily a steady state of the covariance-based learning rule when the synaptic strength is sufficiently strong so that the fluctuations in input from individual sensory neurons influence the net input to output neurons. This is caused by the increasing variance in the input potential due to the diffusion of synaptic weights. This effect causes an undermatching phenomenon, which has been observed in many behavioral experiments. We suggest that the synaptic diffusion effects provide a robust neural mechanism for stochastic choice behavior.

4 0.13333285 94 nips-2010-Feature Set Embedding for Incomplete Data

Author: David Grangier, Iain Melvin

Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building onto this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets. 1

5 0.11218139 290 nips-2010-t-logistic regression

Author: Nan Ding, S.v.n. Vishwanathan

Abstract: We extend logistic regression by using t-exponential families which were introduced recently in statistical physics. This gives rise to a regularized risk minimization problem with a non-convex loss function. An efficient block coordinate descent optimization scheme can be derived for estimating the parameters. Because of the nature of the loss function, our algorithm is tolerant to label noise. Furthermore, unlike other algorithms which employ non-convex loss functions, our algorithm is fairly robust to the choice of initial values. We verify both these observations empirically on a number of synthetic and real datasets. 1

6 0.10242649 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

7 0.098201528 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

8 0.087674826 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions

9 0.085443966 228 nips-2010-Reverse Multi-Label Learning

10 0.085142501 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

11 0.083774671 194 nips-2010-Online Learning for Latent Dirichlet Allocation

12 0.08283101 101 nips-2010-Gaussian sampling by local perturbations

13 0.079128832 117 nips-2010-Identifying graph-structured activation patterns in networks

14 0.077398449 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

15 0.07527376 89 nips-2010-Factorized Latent Spaces with Structured Sparsity

16 0.071040004 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data

17 0.065987736 283 nips-2010-Variational Inference over Combinatorial Spaces

18 0.063602075 265 nips-2010-The LASSO risk: asymptotic results and real world examples

19 0.062406193 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach

20 0.062175095 235 nips-2010-Self-Paced Learning for Latent Variable Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.183), (1, 0.041), (2, 0.025), (3, 0.016), (4, -0.107), (5, 0.058), (6, 0.036), (7, 0.057), (8, -0.082), (9, 0.012), (10, -0.046), (11, -0.007), (12, 0.17), (13, -0.036), (14, -0.045), (15, 0.001), (16, 0.023), (17, 0.106), (18, 0.186), (19, 0.06), (20, -0.008), (21, 0.027), (22, -0.082), (23, -0.065), (24, 0.068), (25, 0.052), (26, -0.054), (27, 0.194), (28, -0.099), (29, -0.097), (30, 0.051), (31, 0.025), (32, -0.001), (33, 0.039), (34, -0.019), (35, -0.053), (36, 0.042), (37, -0.058), (38, -0.075), (39, 0.134), (40, -0.062), (41, 0.062), (42, 0.074), (43, 0.001), (44, 0.07), (45, 0.007), (46, -0.002), (47, 0.065), (48, 0.028), (49, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93278629 284 nips-2010-Variational bounds for mixed-data factor analysis

Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1

2 0.64398193 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions

Author: Bo Thiesson, Chong Wang

Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1

3 0.60049647 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models

Author: Armand Joulin, Jean Ponce, Francis R. Bach

Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1

4 0.5715121 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates

Author: Ken Takiyama, Masato Okada

Abstract: We propose an algorithm for simultaneously estimating state transitions among neural states, the number of neural states, and nonstationary firing rates using a switching state space model (SSSM). This algorithm enables us to detect state transitions on the basis of not only the discontinuous changes of mean firing rates but also discontinuous changes in temporal profiles of firing rates, e.g., temporal correlation. We construct a variational Bayes algorithm for a non-Gaussian SSSM whose non-Gaussian property is caused by binary spike events. Synthetic data analysis reveals that our algorithm has the high performance for estimating state transitions, the number of neural states, and nonstationary firing rates compared to previous methods. We also analyze neural data that were recorded from the medial temporal area. The statistically detected neural states probably coincide with transient and sustained states that have been detected heuristically. Estimated parameters suggest that our algorithm detects the state transition on the basis of discontinuous changes in the temporal correlation of firing rates, which transitions previous methods cannot detect. This result suggests that our algorithm is advantageous in real-data analysis.

5 0.55527639 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes

Author: Manfred Opper, Andreas Ruttor, Guido Sanguinetti

Abstract: We present a novel approach to inference in conditionally Gaussian continuous time stochastic processes, where the latent process is a Markovian jump process. We first consider the case of jump-diffusion processes, where the drift of a linear stochastic differential equation can jump at arbitrary time points. We derive partial differential equations for exact inference and present a very efficient mean field approximation. By introducing a novel lower bound on the free energy, we then generalise our approach to Gaussian processes with arbitrary covariance, such as the non-Markovian RBF covariance. We present results on both simulated and real data, showing that the approach is very accurate in capturing latent dynamics and can be useful in a number of real data modelling tasks.

6 0.55148125 262 nips-2010-Switched Latent Force Models for Movement Segmentation

7 0.53177029 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models

8 0.53056061 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks

9 0.51315755 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach

10 0.48543286 101 nips-2010-Gaussian sampling by local perturbations

11 0.45769295 290 nips-2010-t-logistic regression

12 0.4553543 35 nips-2010-Auto-Regressive HMM Inference with Incomplete Data for Short-Horizon Wind Forecasting

13 0.44982281 283 nips-2010-Variational Inference over Combinatorial Spaces

14 0.44422957 139 nips-2010-Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development

15 0.44091865 216 nips-2010-Probabilistic Inference and Differential Privacy

16 0.44009459 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics

17 0.43590724 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

18 0.42253801 228 nips-2010-Reverse Multi-Label Learning

19 0.42219192 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization

20 0.41979814 89 nips-2010-Factorized Latent Spaces with Structured Sparsity


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.498), (27, 0.05), (30, 0.064), (35, 0.011), (45, 0.161), (50, 0.047), (52, 0.02), (60, 0.021), (77, 0.016), (90, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91982955 259 nips-2010-Subgraph Detection Using Eigenvector L1 Norms

Author: Benjamin Miller, Nadya Bliss, Patrick J. Wolfe

Abstract: When working with network datasets, the theoretical framework of detection theory for Euclidean vector spaces no longer applies. Nevertheless, it is desirable to determine the detectability of small, anomalous graphs embedded into background networks with known statistical properties. Casting the problem of subgraph detection in a signal processing context, this article provides a framework and empirical results that elucidate a “detection theory” for graph-valued data. Its focus is the detection of anomalies in unweighted, undirected graphs through L1 properties of the eigenvectors of the graph’s so-called modularity matrix. This metric is observed to have relatively low variance for certain categories of randomly-generated graphs, and to reveal the presence of an anomalous subgraph with reasonable reliability when the anomaly is not well-correlated with stronger portions of the background graph. An analysis of subgraphs in real network datasets confirms the efficacy of this approach. 1

2 0.87193507 192 nips-2010-Online Classification with Specificity Constraints

Author: Andrey Bernstein, Shie Mannor, Nahum Shimkin

Abstract: We consider the online binary classification problem, where we are given m classifiers. At each stage, the classifiers map the input to the probability that the input belongs to the positive class. An online classification meta-algorithm is an algorithm that combines the outputs of the classifiers in order to attain a certain goal, without having prior knowledge on the form and statistics of the input, and without prior knowledge on the performance of the given classifiers. In this paper, we use sensitivity and specificity as the performance metrics of the meta-algorithm. In particular, our goal is to design an algorithm that satisfies the following two properties (asymptotically): (i) its average false positive rate (fp-rate) is under some given threshold; and (ii) its average true positive rate (tp-rate) is not worse than the tp-rate of the best convex combination of the m given classifiers that satisfies fprate constraint, in hindsight. We show that this problem is in fact a special case of the regret minimization problem with constraints, and therefore the above goal is not attainable. Hence, we pose a relaxed goal and propose a corresponding practical online learning meta-algorithm that attains it. In the case of two classifiers, we show that this algorithm takes a very simple form. To our best knowledge, this is the first algorithm that addresses the problem of the average tp-rate maximization under average fp-rate constraints in the online setting. 1

3 0.85097152 45 nips-2010-CUR from a Sparse Optimization Viewpoint

Author: Jacob Bien, Ya Xu, Michael W. Mahoney

Abstract: The CUR decomposition provides an approximation of a matrix X that has low reconstruction error and that is sparse in the sense that the resulting approximation lies in the span of only a few columns of X. In this regard, it appears to be similar to many sparse PCA methods. However, CUR takes a randomized algorithmic approach, whereas most sparse PCA methods are framed as convex optimization problems. In this paper, we try to understand CUR from a sparse optimization viewpoint. We show that CUR is implicitly optimizing a sparse regression objective and, furthermore, cannot be directly cast as a sparse PCA method. We also observe that the sparsity attained by CUR possesses an interesting structure, which leads us to formulate a sparse PCA method that achieves a CUR-like sparsity.

4 0.82165378 146 nips-2010-Learning Multiple Tasks using Manifold Regularization

Author: Arvind Agarwal, Samuel Gerber, Hal Daume

Abstract: We present a novel method for multitask learning (MTL) based on manifold regularization: assume that all task parameters lie on a manifold. This is the generalization of a common assumption made in the existing literature: task parameters share a common linear subspace. One proposed method uses the projection distance from the manifold to regularize the task parameters. The manifold structure and the task parameters are learned using an alternating optimization framework. When the manifold structure is fixed, our method decomposes across tasks which can be learnt independently. An approximation of the manifold regularization scheme is presented that preserves the convexity of the single task learning problem, and makes the proposed MTL framework efficient and easy to implement. We show the efficacy of our method on several datasets. 1

same-paper 5 0.80987293 284 nips-2010-Variational bounds for mixed-data factor analysis

Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1

6 0.80935097 221 nips-2010-Random Projections for $k$-means Clustering

7 0.78839505 261 nips-2010-Supervised Clustering

8 0.69775993 262 nips-2010-Switched Latent Force Models for Movement Segmentation

9 0.68748927 136 nips-2010-Large-Scale Matrix Factorization with Missing Data under Additional Constraints

10 0.62213522 89 nips-2010-Factorized Latent Spaces with Structured Sparsity

11 0.61895257 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization

12 0.60630471 226 nips-2010-Repeated Games against Budgeted Adversaries

13 0.59269774 110 nips-2010-Guaranteed Rank Minimization via Singular Value Projection

14 0.58742464 166 nips-2010-Minimum Average Cost Clustering

15 0.58355278 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA

16 0.57652998 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives

17 0.57082522 117 nips-2010-Identifying graph-structured activation patterns in networks

18 0.5705781 196 nips-2010-Online Markov Decision Processes under Bandit Feedback

19 0.57034522 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

20 0.56584889 182 nips-2010-New Adaptive Algorithms for Online Classification