nips nips2013 nips2013-127 knowledge-graph by maker-knowledge-mining

127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models


Source: pdf

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent

Abstract: Recent work has shown how denoising and contractive autoencoders implicitly capture the structure of the data-generating density, in the case where the corruption noise is Gaussian, the reconstruction error is the squared error, and the data is continuous-valued. This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. Another issue is the mathematical justification which is only valid in the limit of small corruption noise. We propose here a different attack on the problem, which deals with all these issues: arbitrary (but noisy enough) corruption, arbitrary reconstruction loss (seen as a log-likelihood), handling both discrete and continuous-valued variables, and removing the bias due to non-infinitesimal corruption noise (or non-infinitesimal contractive penalty). 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 This has led to various proposals for sampling from this implicitly learned density function, using Langevin and Metropolis-Hastings MCMC. [sent-2, score-0.177]

2 However, it remained unclear how to connect the training procedure of regularized auto-encoders to the implicit estimation of the underlying data-generating distribution when the data are discrete, or using other forms of corruption process and reconstruction errors. [sent-3, score-0.959]

3 Another issue is the mathematical justification which is only valid in the limit of small corruption noise. [sent-4, score-0.516]

4 1 Introduction Auto-encoders learn an encoder function from input to representation and a decoder function back from representation to input space, such that the reconstruction (composition of encoder and decoder) is good for training examples. [sent-6, score-0.415]

5 Regularized auto-encoders also involve some form of regularization that prevents the auto-encoder from simply learning the identity function, so that reconstruction error will be low at training examples (and hopefully at test examples) but high in general. [sent-7, score-0.341]

6 For example, clustering algorithms such as k-means only capture the modes of the distribution, while manifold learning algorithms characterize the low-dimensional regions where the density concentrates. [sent-13, score-0.226]

7 (2008): they were viewed as approximating an energy function through the reconstruction error, i.e. [sent-15, score-0.21]

8 , being trained to have low reconstruction error at the training examples and high reconstruction error elsewhere (through the regularizer, e.g. [sent-17, score-0.492]

9 This connection and its generalization to other energy functions, giving rise to the general denoising score matching training criterion, is discussed in several other papers (Kingma and LeCun, 2010; Swersky et al. [sent-22, score-0.514]

10 Another breakthrough has been the development of an empirically successful sampling algorithm for contractive auto-encoders (Rifai et al. [sent-24, score-0.227]

11 , the leading singular vectors of that matrix span the tangent plane of the manifold near which the data density concentrates. [sent-28, score-0.184]

12 What we propose here is a different probabilistic interpretation of DAEs, which is valid for any data type, any corruption process (so long as it has broad enough support), and any reconstruction loss (so long as we can view it as a log-likelihood). [sent-32, score-0.678]

13 The basic idea is that if we corrupt observed random variable X into X̃ using conditional distribution C(X̃|X), we are really training the DAE to estimate the reverse conditional P(X|X̃). [sent-33, score-0.29]

14 Combining this estimator with the known C(X̃|X), we show that we can recover a consistent estimator of P(X) through a Markov chain that alternates between sampling from P(X|X̃) and sampling from C(X̃|X), i.e. [sent-34, score-0.572]

15 , encode/decode, sample from the reconstruction distribution model P(X|X̃), apply the stochastic corruption procedure C(X̃|X), and iterate. [sent-36, score-0.651]

16 We find that we can improve the sampling behavior by using the model itself to define the corruption process, yielding a training procedure that has some surface similarity to the contrastive divergence algorithm (Hinton, 1999; Hinton et al. [sent-38, score-0.772]

17 Algorithm 1 THE GENERALIZED DENOISING AUTO-ENCODER TRAINING ALGORITHM requires a training set or training distribution D of examples X, a given corruption process C(X̃|X) from which one can sample, and with which one trains a conditional distribution Pθ(X|X̃) from which one can sample. [sent-40, score-0.995]

18 repeat: • sample training example X ∼ D • sample corrupted input X̃ ∼ C(X̃|X) • use (X, X̃) as an additional training example towards minimizing the expected value of − log Pθ(X|X̃), e.g. [sent-41, score-0.351]
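
As a concrete illustration of the training loop in Algorithm 1 above, here is a minimal NumPy sketch. It is not the authors' code: the binary toy data, the bit-flip corruption, and the one-layer factorized Bernoulli model Pθ(X|X̃) are illustrative placeholders standing in for any corruption process and any conditional model trained by (regularized) maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20  # dimensionality of the toy binary data

def corrupt(x, flip_prob=0.3):
    """An arbitrary corruption process C(X_tilde | X): flip each bit independently."""
    flip = rng.random(x.shape) < flip_prob
    return np.where(flip, 1.0 - x, x)

def sample_example():
    """A toy training distribution D: the first half of the bits copy a single coin flip."""
    bit = float(rng.random() < 0.5)
    return np.concatenate([np.full(d // 2, bit),
                           rng.integers(0, 2, d // 2).astype(float)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical model: factorized Bernoulli P_theta(X | X_tilde) with one affine layer.
W = 0.01 * rng.standard_normal((d, d))
b = np.zeros(d)

lr = 0.1
for step in range(2000):              # the "repeat" loop of Algorithm 1
    x = sample_example()              # sample training example X ~ D
    x_tilde = corrupt(x)              # sample corrupted input X_tilde ~ C(X_tilde | X)
    p = sigmoid(x_tilde @ W.T + b)    # parameters of P_theta(X | X_tilde)
    err = p - x                       # gradient of -log P_theta(x | x_tilde) w.r.t. the logits
    W -= lr * np.outer(err, x_tilde)  # SGD step towards minimizing E[-log P_theta(X | X_tilde)]
    b -= lr * err
```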

19 Let C be a given corruption process that stochastically maps an X to an X̃ through conditional distribution C(X̃|X). [sent-48, score-0.614]

20 The training data for the generalized denoising auto-encoder is a set of pairs (X, X̃) with X ∼ P(X) and X̃ ∼ C(X̃|X). [sent-49, score-0.373]

21 The DAE is trained to predict X given X̃ through a learned conditional distribution Pθ(X|X̃), by choosing this conditional distribution within some family of distributions indexed by θ, not necessarily a neural net. [sent-50, score-0.252]

22 The training procedure for the DAE can generally be formulated as learning to predict X given X̃ by possibly regularized maximum likelihood, i.e. [sent-51, score-0.204]

23 , the generalization performance that this training criterion attempts to minimize is L(θ) = −E[log Pθ(X|X̃)] (1), where the expectation is taken over the joint data-generating distribution P(X, X̃) = P(X) C(X̃|X). [sent-53, score-0.249]

24 2 Sampling We define the following pseudo-Gibbs Markov chain associated with Pθ: Xt ∼ Pθ(X|X̃t−1) (2) and X̃t ∼ C(X̃|Xt) (3), which can be initialized from an arbitrary choice X0. [sent-55, score-0.217]

25 This is the process by which we are going to generate samples Xt according to the model implicitly learned by choosing θ. [sent-56, score-0.16]
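
The pseudo-Gibbs chain of Equations (2)-(3) can then be run with the trained model. A minimal sketch, reusing the hypothetical corrupt, sigmoid, W and b from the training sketch above (all of them illustrative assumptions, not the paper's code):

```python
import numpy as np

def generate_samples(W, b, corrupt, sigmoid, n_steps=1000, seed=1):
    """Alternate X_t ~ P_theta(X | X_tilde_{t-1}) and X_tilde_t ~ C(X_tilde | X_t),
    starting from an arbitrary X_0, and return the visited X_t's."""
    rng = np.random.default_rng(seed)
    d = b.shape[0]
    x = (rng.random(d) < 0.5).astype(float)     # arbitrary choice of X_0
    x_tilde = corrupt(x)
    samples = []
    for _ in range(n_steps):
        p = sigmoid(x_tilde @ W.T + b)
        x = (rng.random(d) < p).astype(float)   # sample X_t from P_theta(X | X_tilde_{t-1})
        x_tilde = corrupt(x)                    # sample X_tilde_t from C(X_tilde | X_t)
        samples.append(x)
    return np.stack(samples)
```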

26 We define T(Xt|Xt−1) to be the transition operator that defines a conditional distribution for Xt given Xt−1, independently of t, so that the sequence of Xt's forms a homogeneous Markov chain. [sent-57, score-0.149]

27 If the asymptotic marginal distribution of the Xt ’s exists, we call this distribution π(X), and we show below that it consistently estimates P(X). [sent-58, score-0.173]

28 Note that the above chain is not a proper Gibbs chain in general because there is no guarantee that Pθ(X|X̃t−1) and C(X̃|Xt) are consistent with a unique joint distribution. [sent-59, score-0.492]

29 , 2000), in that the pairs (Xt, X̃t−1) are not guaranteed to have the same asymptotic distribution as the pairs (Xt, X̃t) as t → ∞. [sent-61, score-0.207]

30 3 Consistency Normally we only have access to a finite number n of training examples but as n → ∞, the empirical training distribution approaches the data-generating distribution. [sent-65, score-0.381]

31 We define θn to be the minimizer of Ln (θ) when given n training examples. [sent-71, score-0.15]

32 We define Tn to be the transition operator Tn(Xt|Xt−1) = ∫ Pθn(Xt|X̃) C(X̃|Xt−1) dX̃ associated with θn (the parameter obtained by minimizing the training criterion with n examples), and define πn to be the asymptotic distribution of the Markov chain generated by Tn (if it exists). [sent-72, score-0.623]

33 We also define T to be the operator of the Markov chain associated with the learned model as n → ∞. [sent-73, score-0.277]

34 If Pθn(X|X̃) is a consistent estimator of the true conditional distribution P(X|X̃) and Tn defines an ergodic Markov chain, then as the number of examples n → ∞, the asymptotic distribution πn(X) of the generated samples converges to the data-generating distribution P(X). [sent-75, score-0.627]

35 If Tn is ergodic, then the Markov chain converges to a πn . [sent-77, score-0.266]

36 This conditional, along with P(X̃|X) = C(X̃|X), can be used to define a proper Gibbs chain where one alternatively samples from P(X̃|X) and from P(X|X̃). [sent-80, score-0.268]

37 T produces P(X) as asymptotic marginal distribution over X (as we consider more samples from the chain) simply because P(X) is the marginal distribution of the joint P(X)C(X̃|X) to which the chain converges. [sent-83, score-0.465]

38 , the asymptotic distribution of the Markov chain, πn (X) converges to the true data-generating distribution, P(X), as n → ∞. [sent-94, score-0.184]

39 Hence the asymptotic sampling distribution associated with the Markov chain defined by Tn (i.e. [sent-95, score-0.405]

40 Furthermore, that estimator of P(X) is consistent so long as our (regularized) maximum likelihood estimator of the conditional Pθ(X|X̃) is also consistent. [sent-98, score-0.277]

41 We now provide sufficient conditions for the ergodicity of the chain operator (i.e. [sent-99, score-0.252]

42 If Pθ(X|X̃) is a consistent estimator of the true conditional distribution P(X|X̃), and both the data-generating distribution and denoising model are contained in and non-zero in a finite-volume region V (i.e. [sent-103, score-0.421]

43 , ∀X̃, ∀X ∉ V, P(X) = 0, Pθ(X|X̃) = 0, and ∀X̃, ∀X ∈ V, P(X) > 0, Pθ(X|X̃) > 0, C(X̃|X) > 0), and these statements remain true in the limit of n → ∞, then the asymptotic distribution πn(X) of the generated samples converges to the data-generating distribution P(X). [sent-105, score-0.325]

44 These conditions can be generalized to the continuous case, where we obtain ergodic Harris chains rather than ergodic Markov chains. [sent-108, score-0.159]

45 To obtain recurrence (preventing the chain from diverging to infinity), we rely on the assumption that the domain V is bounded. [sent-110, score-0.253]

46 We avoid the case where there is no corruption (which would yield a wrong estimation, with the DAE simply learning a Dirac probability at its input). [sent-115, score-0.488]

47 Second, we avoid the case where the chain wanders to infinity by assuming a finite volume where the model and data live, a real concern in the continuous case. [sent-116, score-0.217]

48 Let us define the notation P(·) to denote the probability of the joint, marginals or conditionals over the pairs (Xt, X̃t−1) that are produced by the model’s Markov chain T as t → ∞. [sent-122, score-0.283]

49 So P(X) = π(X) is the asymptotic distribution of the Markov chain T, and P(X̃) the marginal over the X̃'s in that chain, with P(X̃t−1|Xt) ≈ C(X̃t−1|Xt) (which is not guaranteed). [sent-123, score-0.352]

50 First note that P(X|X̃) has only been trained for pairs (X, X̃) for which X̃ is relatively close to X (assuming that the corruption is indeed changing X generally into some neighborhood). [sent-128, score-0.573]

51 One could also imagine that if X1 and XN are far apart, we could chart a path between X1 and XN with intermediate points Xk and use an estimator of the relative energies between the neighbors Xk , Xk+1 , add them up, and obtain an estimator of the relative energy between X1 and XN . [sent-133, score-0.277]

52 In that case, the estimator P(X|X̃) would directly equal P(X) since X and X̃ are actually sampled independently in its “denoising” training data. [sent-142, score-0.246]

53 We would have gained nothing over just training any probabilistic model to directly model the observed X's. [sent-143, score-0.15]

54 The gain we expect from using the denoising framework is that if X̃ is a local perturbation of X, then the true P(X|X̃) can be well approximated by a much simpler distribution than P(X). [sent-144, score-0.232]

55 , 2006a) and non-local manifold tangent learning (Bengio et al. [sent-147, score-0.164]

56 , 2006b) algorithms: the local density around a point X̃ can be approximated by a multivariate Gaussian whose covariance matrix has leading eigenvectors that span the local tangent of the manifold near which the data concentrates (if it does). [sent-148, score-0.244]

57 The idea of a locally Gaussian approximation of a density with a manifold structure is also exploited in the more recent work on the contractive auto-encoder (Rifai et al. [sent-149, score-0.287]

58 This can be seen analytically by considering the case where P(X) is a mixture of many Gaussians and the corruption… Figure 2: Walkback samples get attracted by spurious modes and contribute to removing them. [sent-154, score-0.719]

59 Segment of data manifold in violet and example walkback path in red dotted line, starting on the manifold and going towards a spurious attractor. [sent-155, score-0.721]

60 We return to this in Section 3, suggesting that in order to avoid spurious modes, it is better to have non-infinitesimal corruption, allowing faster mixing and successful burn-in not pulled by spurious modes far from the data. [sent-158, score-0.276]

61 More training iterations or increasing the amount of corruption noise helps to substantially alleviate that problem, but we discovered an even bigger boost by training the DAE Markov chain to walk back towards the training examples (see Figure 2). [sent-160, score-1.302]

62 More precisely, the modified corruption process C̃ we propose is the following, based on the original corruption process C. [sent-164, score-1.05]

63 We use it in a version of the training algorithm called walkback, where we replace the corruption process C of Algorithm 1 by the walkback process C̃ of Algorithm 2. [sent-165, score-1.134]

64 This also provides extra training examples (taking advantage of the X̃ samples generated along the walk away from X). [sent-166, score-0.317]

65 It is called walkback because it forces the DAE to learn to walk back from the random walk it generates, towards the X’s in the training set. [sent-167, score-0.693]

66 Algorithm 2: THE WALKBACK ALGORITHM is based on the walkback corruption process C̃(X̃|X), defined below in terms of a generic original corruption process C(X̃|X) and the current model’s reconstruction conditional distribution P(X|X̃). [sent-168, score-1.561]

67 For each training example X, it provides a sequence of additional training examples (X, X̃∗) for the DAE. [sent-169, score-0.15]

68 It has a hyper-parameter that is a geometric distribution parameter 0 < p < 1 controlling the length of these walks away from X, with p = 0. [sent-170, score-0.231]

69 Training by Algorithm 1 is the same, but using all X∗ in the returned list L to form the pairs (X, X̃∗) as training examples instead of just (X, X̃). [sent-172, score-0.193]

70 1: X∗ ← X, L ← [ ]; 2: Sample X̃∗ ∼ C(X̃|X∗); 3: Sample u ∼ Uniform(0, 1); 4: if u > p then 5: Append X̃∗ to L and return L; 6: If during training, append X̃∗ to L, so (X, X̃∗) will be an additional training example. [sent-173, score-0.217]
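
A minimal sketch of this walkback corruption, under the same illustrative assumptions as the earlier blocks; the reconstruct_sample argument (a draw from the current model's Pθ(X|X̃)) and the max_steps cap are hypothetical additions of this sketch, not part of the paper's pseudocode:

```python
import numpy as np

def walkback_corrupt(x, corrupt, reconstruct_sample, p=0.5, max_steps=20, seed=2):
    """Walkback corruption C_tilde(X_tilde | X): alternate the original corruption C and
    sampling from the current model's reconstruction distribution, collecting every
    corrupted point visited in a list L; each (X, X_tilde_star) becomes a training pair."""
    rng = np.random.default_rng(seed)
    x_star = np.array(x, copy=True)
    L = []
    for _ in range(max_steps):                      # safety cap on the walk length
        x_tilde_star = corrupt(x_star)              # X_tilde_star ~ C(X_tilde | X_star)
        L.append(x_tilde_star)                      # keep it as an extra training target
        if rng.random() > p:                        # stop with probability 1 - p
            return L
        x_star = reconstruct_sample(x_tilde_star)   # walk: X_star ~ P_theta(X | X_tilde_star)
    return L
```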

71 Let P(X) be the implicitly defined asymptotic distribution of the Markov chain alternating sampling from P(X|X̃) and C(X̃|X), where C is the original local corruption process. [sent-176, score-0.97]

72 Under the assumptions of Corollary 1, minimizing the training criterion in the walkback training algorithm for generalized DAEs (combining Algorithms 1 and 2) produces a P(X) that is a consistent estimator of the data-generating distribution P(X). [sent-177, score-0.95]

73 Consider that during training, we produce a sequence of estimators Pk(X|X̃) where Pk corresponds to the k-th training iteration (modifying the parameters after each iteration). [sent-179, score-0.15]

74 With the walkback algorithm, Pk−1 is used to obtain the corrupted samples X̃ from which the next model Pk is produced. [sent-180, score-0.501]

75 If training converges, Pk ≈ Pk+1 = P and we can then consider the whole corruption process C̃ fixed. [sent-181, score-0.675]

76 By Corollary 1, the Markov chain obtained by alternating samples from P(X|X̃) and samples from C(X̃|X) converges to an asymptotic distribution P(X) which estimates the underlying data-generating distribution P(X). [sent-182, score-0.541]

77 The walkback corruption C̃(X̃|X) corresponds to a few steps alternating sampling from C(X̃|X) (the fixed local corruption) and sampling from P(X|X̃). [sent-183, score-1.046]

78 Hence the overall sequence when using C̃ can be seen as a Markov chain obtained by alternatively sampling from C(X̃|X) and from P(X|X̃), just as it was when using merely C. [sent-184, score-0.27]

79 A consequence is that the walkback training algorithm estimates the same distribution as the original denoising algorithm, but may do it more efficiently (as we observe in the experiments), by exploring the space of corruptions in a way that spends more time where it most helps the model. [sent-186, score-0.774]

80 The mathematical results presented here apply to any denoising training criterion where the reconstruction loss can be interpreted as a negative log-likelihood. [sent-188, score-0.476]

81 This remains true whether or not the denoising machine P(X|X̃) is parametrized as the composition of an encoder and decoder. [sent-189, score-0.243]

82 Figure 3 shows the distribution recovered by the Markov chain for discrete data with only 10 different values. [sent-195, score-0.255]

83 The conditional P(X|X̃) was estimated by multinomial models and maximum likelihood (counting) from 5000 training examples. [sent-196, score-0.201]

84 5000 samples were generated from the chain to estimate the asymptotic distribution πn (X). [sent-197, score-0.427]
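
For the discrete case just described, the counting estimator and the sampling chain fit in a few lines. A minimal sketch with an assumed data-generating distribution and an assumed stay-or-jump corruption (both illustrative, not the ones used in the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
K = 10                                    # X takes one of 10 discrete values

p_data = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01])

def corrupt(x, stay=0.5):
    """C(X_tilde | X): keep X with probability `stay`, otherwise jump uniformly."""
    return x if rng.random() < stay else int(rng.integers(K))

# Estimate P(X | X_tilde) by counting over corrupted training pairs (maximum likelihood
# for a multinomial model); the +1 is Laplace smoothing, added here to keep the chain ergodic.
counts = np.ones((K, K))
for _ in range(5000):
    x = rng.choice(K, p=p_data)
    counts[corrupt(x), x] += 1
P_x_given_xt = counts / counts.sum(axis=1, keepdims=True)   # row = X_tilde, column = X

# Run the chain and compare its empirical distribution with p_data.
x, hist = 0, np.zeros(K)
for _ in range(5000):
    x_tilde = corrupt(x)
    x = rng.choice(K, p=P_x_given_xt[x_tilde])
    hist[x] += 1
print(np.round(hist / hist.sum(), 3))
print(p_data)
```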

85 For continuous data, Figure 3 also shows the result of 5000 generated samples and 500 original training examples with X ∈ R^10, with scatter plots of pairs of dimensions. [sent-198, score-0.304]

86 The estimator is also non-parametric (Parzen density estimator of P(X|X̃)). [sent-199, score-0.244]

87 Figure 3: Top left: histogram of a data-generating distribution (true, blue), the empirical distribution (red), and the estimated distribution using a denoising maximum likelihood estimator. [sent-200, score-0.278]

88 Other figures: pairs of variables (out of 10) showing the training samples and the model-generated samples. [sent-201, score-0.237]

89 The 784-2000-784 auto-encoder is trained for 200 epochs with the 50000 training examples and salt-and-pepper noise (probability 0. [sent-207, score-0.274]
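
The salt-and-pepper corruption mentioned here is easy to write down; the probability below is a placeholder value, since the sentence above is cut off before the actual number:

```python
import numpy as np

def salt_and_pepper(x, prob=0.3, rng=None):
    """Salt-and-pepper corruption for inputs in [0, 1]: with probability `prob`, a pixel
    is replaced by 0 or 1 with equal chance; otherwise it is left untouched."""
    rng = rng or np.random.default_rng(4)
    hit = rng.random(x.shape) < prob
    salt = (rng.random(x.shape) < 0.5).astype(x.dtype)
    return np.where(hit, salt, x)
```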

90 With walkback training, a chain of 5 steps was used to generate 5 corrupted examples for each training example. [sent-213, score-0.86]

91 The quality of the samples was also estimated quantitatively by measuring the log-likelihood of the test set under a non-parametric density estimator P̂(x) = meanX̃ P(x|X̃) constructed from 10000 consecutively generated samples (X̃ from the Markov chain). [sent-215, score-0.274]
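
A sketch of this quantitative measure, assuming a handle cond_logpdf(x, x_tilde) = log P(x|X̃) to the model's reconstruction conditional (the handle and the log-sum-exp averaging are assumptions of this sketch):

```python
import numpy as np

def sample_quality_log_likelihood(test_X, gen_X_tilde, cond_logpdf):
    """Mean over test points x of log P_hat(x), where
    P_hat(x) = mean over the generated X_tilde of P(x | X_tilde)."""
    n_gen = len(gen_X_tilde)
    scores = []
    for x in test_X:
        log_terms = np.array([cond_logpdf(x, xt) for xt in gen_X_tilde])
        m = log_terms.max()                               # log-sum-exp for numerical stability
        scores.append(m + np.log(np.mean(np.exp(log_terms - m))))
    return float(np.mean(scores))
```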

92 Optimization hyper-parameters (learning rate, momentum, and learning rate reduction schedule) were selected based on the training objective. [sent-220, score-0.15]

93 The DAE log-likelihood bound with and without walkback is respectively -116 and -142, confirming visual inspection suggesting that the walkback algorithm produces fewer spurious samples. [sent-225, score-0.965]

94 Figure 4: Successive samples generated by Markov chain associated with the trained DAEs according to the plain sampling scheme (left) and walkback sampling scheme (right). [sent-229, score-0.869]

95 There are fewer “spurious” samples with the walkback algorithm. [sent-230, score-0.473]

96 5 Conclusion and Future Work We have proven that training a model to denoise is a way to implicitly estimate the underlying data-generating process, and that a simple Markov chain that alternates sampling from the denoising model and from the corruption process converges to that estimator. [sent-231, score-1.267]

97 This provides a means for generating data from any DAE (if the corruption is not degenerate, more precisely, if the above chain converges). [sent-232, score-0.705]

98 This study has also suggested a variant of the training procedure, walkback training, which seems to converge faster to the same target distribution. [sent-234, score-0.572]

99 One of the insights arising out of the theoretical results presented here is that in order to reach the asymptotic limit of fully capturing the data distribution P(X), it may be necessary for the model’s P(X|X̃) to have the ability to represent multi-modal distributions over X (given X̃). [sent-235, score-0.163]

100 On autoencoders and score matching for energy based models. [sent-366, score-0.194]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('corruption', 0.488), ('walkback', 0.422), ('tn', 0.235), ('chain', 0.217), ('xt', 0.207), ('dae', 0.193), ('bengio', 0.167), ('denoising', 0.164), ('training', 0.15), ('alain', 0.126), ('reconstruction', 0.125), ('daes', 0.117), ('contractive', 0.113), ('asymptotic', 0.097), ('estimator', 0.096), ('spurious', 0.096), ('manifold', 0.09), ('energy', 0.085), ('modes', 0.084), ('vincent', 0.083), ('markov', 0.082), ('rifai', 0.081), ('rbm', 0.077), ('ergodic', 0.068), ('unimodal', 0.062), ('parzen', 0.059), ('pk', 0.059), ('encoder', 0.056), ('regularized', 0.054), ('hinton', 0.053), ('score', 0.053), ('sampling', 0.053), ('density', 0.052), ('conditional', 0.051), ('samples', 0.051), ('converges', 0.049), ('trained', 0.049), ('walk', 0.049), ('implicitly', 0.047), ('cho', 0.045), ('yao', 0.045), ('kingma', 0.044), ('examples', 0.043), ('generative', 0.042), ('rbms', 0.042), ('tangent', 0.042), ('nitesimal', 0.039), ('datagenerating', 0.039), ('distribution', 0.038), ('process', 0.037), ('criterion', 0.037), ('recurrence', 0.036), ('irreducibility', 0.036), ('pairs', 0.036), ('ranzato', 0.036), ('operator', 0.035), ('deep', 0.034), ('consistent', 0.034), ('ais', 0.034), ('et', 0.032), ('noise', 0.032), ('gaussian', 0.031), ('heckerman', 0.031), ('cifar', 0.031), ('swersky', 0.031), ('append', 0.031), ('lecun', 0.031), ('matching', 0.03), ('conditionals', 0.03), ('local', 0.03), ('breakthrough', 0.029), ('rinen', 0.029), ('limit', 0.028), ('corrupted', 0.028), ('interpretation', 0.028), ('decoder', 0.028), ('remained', 0.028), ('al', 0.028), ('courville', 0.027), ('positivity', 0.027), ('bergstra', 0.027), ('autoencoders', 0.026), ('contrastive', 0.026), ('hyv', 0.026), ('xk', 0.025), ('validated', 0.025), ('inspection', 0.025), ('transition', 0.025), ('kind', 0.025), ('dependency', 0.025), ('learned', 0.025), ('generated', 0.024), ('larochelle', 0.024), ('joint', 0.024), ('alternates', 0.023), ('towards', 0.023), ('prevents', 0.023), ('yielding', 0.023), ('generalized', 0.023), ('composition', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent


2 0.14737378 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

Author: Yann Dauphin, Yoshua Bengio

Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1

3 0.123083 66 nips-2013-Computing the Stationary Distribution Locally

Author: Christina E. Lee, Asuman Ozdaglar, Devavrat Shah

Abstract: Computing the stationary distribution of a large finite or countably infinite state space Markov Chain (MC) has become central in many problems such as statistical inference and network analysis. Standard methods involve large matrix multiplications as in power iteration, or simulations of long random walks, as in Markov Chain Monte Carlo (MCMC). Power iteration is costly, as it involves computation at every state. For MCMC, it is difficult to determine whether the random walks are long enough to guarantee convergence. In this paper, we provide a novel algorithm that answers whether a chosen state in a MC has stationary probability larger than some ∆ ∈ (0, 1), and outputs an estimate of the stationary probability. Our algorithm is constant time, using information from a local neighborhood of the state on the graph induced by the MC, which has constant size relative to the state space. The multiplicative error of the estimate is upper bounded by a function of the mixing properties of the MC. Simulation results show MCs for which this method gives tight estimates. 1

4 0.12229028 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

Author: Jeffrey W. Miller, Matthew T. Harrison

Abstract: For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of clusters — that is, the posterior on the number of components represented in the observed data. However, it turns out that this posterior is not consistent — it does not concentrate at the true number of components. In this note, we give an elementary proof of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a “mixture” with one standard normal component. Further, we show that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster converges (in probability) to 0. 1

5 0.11218809 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

Author: Jonas Peters, Dominik Janzing, Bernhard Schölkopf

Abstract: Causal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide general identifiability results. They cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. We show empirically that when the data are causally insufficient or the model is misspecified, the method avoids incorrect answers. We extend the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays. TiMINo is applied to artificial and real data and code is provided. 1

6 0.11124704 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

7 0.10728402 331 nips-2013-Top-Down Regularization of Deep Belief Networks

8 0.10369998 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

9 0.098144092 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

10 0.09697891 27 nips-2013-Adaptive Multi-Column Deep Neural Networks with Application to Robust Image Denoising

11 0.09025719 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

12 0.089347288 269 nips-2013-Regression-tree Tuning in a Streaming Setting

13 0.088987425 75 nips-2013-Convex Two-Layer Modeling

14 0.088960886 299 nips-2013-Solving inverse problem of Markov chain with partial observations

15 0.085423857 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

16 0.07787849 348 nips-2013-Variational Policy Search via Trajectory Optimization

17 0.077788904 89 nips-2013-Dimension-Free Exponentiated Gradient

18 0.076699771 109 nips-2013-Estimating LASSO Risk and Noise Level

19 0.076325588 36 nips-2013-Annealing between distributions by averaging moments

20 0.076036423 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.207), (1, 0.048), (2, -0.014), (3, -0.043), (4, 0.014), (5, 0.04), (6, 0.039), (7, 0.109), (8, 0.018), (9, -0.168), (10, 0.072), (11, -0.123), (12, -0.056), (13, 0.068), (14, -0.015), (15, 0.048), (16, 0.037), (17, -0.009), (18, 0.01), (19, 0.136), (20, 0.015), (21, 0.067), (22, 0.051), (23, -0.062), (24, -0.034), (25, 0.089), (26, -0.056), (27, -0.004), (28, -0.039), (29, -0.007), (30, 0.034), (31, 0.056), (32, -0.111), (33, -0.036), (34, 0.038), (35, -0.087), (36, 0.051), (37, 0.057), (38, -0.037), (39, 0.045), (40, 0.008), (41, 0.109), (42, -0.062), (43, 0.063), (44, -0.013), (45, -0.057), (46, -0.047), (47, 0.071), (48, 0.022), (49, 0.081)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94378138 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent


2 0.72246069 36 nips-2013-Annealing between distributions by averaging moments

Author: Roger B. Grosse, Chris J. Maddison, Ruslan Salakhutdinov

Abstract: Many powerful Monte Carlo techniques for estimating partition functions, such as annealed importance sampling (AIS), are based on sampling from a sequence of intermediate distributions which interpolate between a tractable initial distribution and the intractable target distribution. The near-universal practice is to use geometric averages of the initial and target distributions, but alternative paths can perform substantially better. We present a novel sequence of intermediate distributions for exponential families defined by averaging the moments of the initial and target distributions. We analyze the asymptotic performance of both the geometric and moment averages paths and derive an asymptotically optimal piecewise linear schedule. AIS with moment averaging performs well empirically at estimating partition functions of restricted Boltzmann machines (RBMs), which form the building blocks of many deep learning models. 1

3 0.67707014 315 nips-2013-Stochastic Ratio Matching of RBMs for Sparse High-Dimensional Inputs

Author: Yann Dauphin, Yoshua Bengio

Abstract: Sparse high-dimensional data vectors are common in many application domains where a very large number of rarely non-zero features can be devised. Unfortunately, this creates a computational bottleneck for unsupervised feature learning algorithms such as those based on auto-encoders and RBMs, because they involve a reconstruction step where the whole input vector is predicted from the current feature values. An algorithm was recently developed to successfully handle the case of auto-encoders, based on an importance sampling scheme stochastically selecting which input elements to actually reconstruct during training for each particular example. To generalize this idea to RBMs, we propose a stochastic ratio-matching algorithm that inherits all the computational advantages and unbiasedness of the importance sampling scheme. We show that stochastic ratio matching is a good estimator, allowing the approach to beat the state-of-the-art on two bag-of-word text classification benchmarks (20 Newsgroups and RCV1), while keeping computational cost linear in the number of non-zeros. 1

4 0.62497038 299 nips-2013-Solving inverse problem of Markov chain with partial observations

Author: Tetsuro Morimura, Takayuki Osogami, Tsuyoshi Ide

Abstract: The Markov chain is a convenient tool to represent the dynamics of complex systems such as traffic and social systems, where probabilistic transition takes place between internal states. A Markov chain is characterized by initial-state probabilities and a state-transition probability matrix. In the traditional setting, a major goal is to study properties of a Markov chain when those probabilities are known. This paper tackles an inverse version of the problem: we find those probabilities from partial observations at a limited number of states. The observations include the frequency of visiting a state and the rate of reaching a state from another. Practical examples of this task include traffic monitoring systems in cities, where we need to infer the traffic volume on single link on a road network from a limited number of observation points. We formulate this task as a regularized optimization problem, which is efficiently solved using the notion of natural gradient. Using synthetic and real-world data sets including city traffic monitoring data, we demonstrate the effectiveness of our method.

5 0.61133564 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis

Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that— particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1

6 0.60867339 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models

7 0.59798491 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations

8 0.58077371 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces

9 0.57442731 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations

10 0.56800413 200 nips-2013-Multi-Prediction Deep Boltzmann Machines

11 0.5627327 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

12 0.55116886 269 nips-2013-Regression-tree Tuning in a Streaming Setting

13 0.53853124 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC

14 0.53170097 160 nips-2013-Learning Stochastic Feedforward Neural Networks

15 0.52285022 221 nips-2013-On the Expressive Power of Restricted Boltzmann Machines

16 0.50917262 66 nips-2013-Computing the Stationary Distribution Locally

17 0.50103819 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

18 0.49598414 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models

19 0.48390859 331 nips-2013-Top-Down Regularization of Deep Belief Networks

20 0.4651204 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.011), (16, 0.039), (31, 0.188), (33, 0.199), (34, 0.109), (41, 0.047), (49, 0.037), (56, 0.143), (70, 0.026), (85, 0.029), (89, 0.046), (93, 0.051)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88373619 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models

Author: Yoshua Bengio, Li Yao, Guillaume Alain, Pascal Vincent


2 0.87325191 355 nips-2013-Which Space Partitioning Tree to Use for Search?

Author: Parikshit Ram, Alexander Gray

Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . 
(a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). 
The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. 
For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. 
We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. 
We will be building on the existing results on vector quantization error [2] to present the worst-case error guarantee for Algorithm 1. We need the following definitions to precisely state our results:

Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A_1, A_2} implies ||A_1 ∩ S| − |A_2 ∩ S|| ≤ ω |A ∩ S|.

For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω ≪ 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For a tree with ω-balanced splits, the worst-case runtime of Algorithm 1 is O(l + ((1 + ω)/2)^l n). For the 2M-tree, ω-balanced splits are not enforced; hence the actual value of ω could be high for a 2M-tree.

Definition 3.2. Let B_ℓ2(p, Δ) = {r ∈ S : ||p − r|| < Δ} denote the points in S contained in a ball of radius Δ around some p ∈ S with respect to the ℓ2 metric. The expansion constant of (S, ℓ2) is defined as the smallest c̃ ≥ 2 such that |B_ℓ2(p, 2Δ)| ≤ c̃ |B_ℓ2(p, Δ)| for all p ∈ S and all Δ > 0.

Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c̃ ~ 2^O(d) where d is the doubling dimension of the set S with respect to the ℓ2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c̃ = Θ(2^D)). Equipped with these definitions, we have the following guarantee for Algorithm 1:

Theorem 3.1. Consider a dataset S ⊂ R^D of n points with ψ = (1/(2n²)) Σ_{x,y∈S} ||x − y||², the BSP-tree T built on S, and a query q ∈ R^D, with the following conditions:
(C1) Let (A ∩ (S ∪ {q}), ℓ2) have an expansion constant at most c̃ for any convex set A ⊂ R^D.
(C2) Let T be complete till a depth L < log₂ n / (1 − log₂(1 − ω)) with ω-balanced splits.
(C3) Let β∗ correspond to the worst quantization error improvement rate over all splits in T.
(C4) For any node A in the tree T, let max_{x,y∈A∩S} ||x − y||² ≤ η V_S(A) for a fixed η ≥ 8.
For α = 1/(1 − ω), the upper bound d_u on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by

||q − p|| ≤ d_u = [2 √(ηψ) · (2α)^{l/2} · exp(−l/(2β∗))] / [(n/(2α)^l)^{1/log₂ c̃} − 2].   (4)

Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and for the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β∗). Moreover, for a particular tree with β∗ ≥ log₂ e, d_u is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate-neighbor distance: at the root level (l = 0) the candidate neighbor is the nearest neighbor, and as we descend the tree, the candidate-neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the bound d_u is inversely proportional to (n/(2α)^l)^{1/log₂ c̃}, implying deteriorating bounds d_u with increasing c̃. Since log₂ c̃ ~ O(d), larger intrinsic dimensionality implies worse guarantees, as expected from the curse of dimensionality.

To prove Theorem 3.1, we use the following result:

Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T on S, V_S(A) ≤ ψ (2/(1 − ω))^l exp(−l/β∗).

This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over the l levels of the tree (the proof is in Appendix A).

Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = max_{x,y∈A∩S} ||x − y||, let d = min_{x∈A∩S} ||q − x||, and let B_ℓ2(q, Δ) = {x ∈ A ∩ (S ∪ {q}) : ||q − x|| < Δ}. Then, by Definition 3.2 and condition C1,

|B_ℓ2(q, D + d)| ≤ c̃^{log₂((D+d)/d)} |B_ℓ2(q, d)| = c̃^{log₂((D+d)/d)} ≤ c̃^{log₂((D+2d)/d)},

where the equality follows from the fact that B_ℓ2(q, d) = {q}. Now |B_ℓ2(q, D + d)| ≥ m. Using this above gives us m^{1/log₂ c̃} ≤ (D/d) + 2. By condition C2, m^{1/log₂ c̃} > 2. Hence we have d ≤ D/(m^{1/log₂ c̃} − 2). By construction and condition C4, D ≤ √(η V_S(A)). Now m ≥ n/(2α)^l. Plugging this in and utilizing Lemma 3.1 gives us the statement of Theorem 3.1.

Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows:

Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error is bounded as ε(q) ≤ (d_u/d*_q) − 1, and the rank is bounded as τ(q) ≤ c̃^{⌈log₂(d_u/d*_q)⌉}, where d*_q = min_{r∈S} ||q − r||.

Proof. The distance error bound follows from the definition of the distance error. Let R = {r ∈ S : ||q − r|| < d_u}. By definition, τ(q) ≤ |R| + 1. Let B_ℓ2(q, Δ) = {x ∈ (S ∪ {q}) : ||q − x|| < Δ}. Since B_ℓ2(q, d_u) contains q and R, and q ∉ S, |B_ℓ2(q, d_u)| = |R| + 1 ≥ τ(q). From Definition 3.2 and condition C1, |B_ℓ2(q, d_u)| ≤ c̃^{⌈log₂(d_u/d*_q)⌉} |B_ℓ2(q, d*_q)|. Using the fact that |B_ℓ2(q, d*_q)| = |{q}| = 1 gives us the upper bound on τ(q).

The upper bounds on both forms of search error are directly proportional to d_u. Hence, the BSP-tree with better quantization performance has better search performance guarantees, and increasing the traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β∗ and c̃.
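As a rough numerical illustration of Theorem 3.1, the snippet below evaluates the bound d_u of Equation (4) as reconstructed above (the numerator constant follows our reading of the formula) for two hypothetical values of β∗; all parameter values are arbitrary and only meant to show how a smaller β∗ tightens the bound.

```python
import numpy as np

def du_bound(l, n, eta, psi, beta_star, omega, c_tilde):
    """Worst-case candidate-neighbor distance bound of Equation (4), as reconstructed here."""
    alpha = 1.0 / (1.0 - omega)
    numerator = 2.0 * np.sqrt(eta * psi) * (2 * alpha) ** (l / 2) * np.exp(-l / (2 * beta_star))
    denominator = (n / (2 * alpha) ** l) ** (1.0 / np.log2(c_tilde)) - 2.0
    return numerator / denominator

# how the bound behaves with depth for two hypothetical trees (beta* = 2 vs beta* = 6)
for beta_star in (2.0, 6.0):
    bounds = [du_bound(l, n=10000, eta=8.0, psi=1.0, beta_star=beta_star,
                       omega=0.1, c_tilde=8.0) for l in range(1, 8)]
    print(beta_star, np.round(bounds, 3))
```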
While our result bounds the worst-case performance of Algorithm 1, an average-case performance guarantee on the distance error is given by E_q ε(q) ≤ d_u E_q[1/d*_q] − 1, and on the rank by E_q τ(q) ≤ c̃^{log₂ d_u} E_q[c̃^{−log₂ d*_q}], since the expectation is over the queries q and d_u does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on d_u, since the term within the expectation over q is tree-independent.

Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d*_q of any query q, of which we have no prior knowledge. However, if we partition the data with a large-margin split, then we can say that either the candidate neighbor is the true nearest neighbor of q, or d*_q is greater than the size of the margin. We characterize the influence of the margin size with the following result:

Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T. Then the distance error is bounded as ε(q) ≤ d_u/γ − 1, and the rank is bounded as τ(q) ≤ c̃^{⌈log₂(d_u/γ)⌉}.

This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees for Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of every split in the tree.

Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ R^D, ((A ∩ (S ∪ {q})), ℓ2) has an expansion constant at most c̃. A bounded c̃ implies that no subset of (S ∪ {q}) contained in a convex set has a very high expansion constant. This condition implies that ((S ∪ {q}), ℓ2) also has an expansion constant at most c̃ (since (S ∪ {q}) is contained in its convex hull). However, if (S ∪ {q}, ℓ2) has an expansion constant c̃, this does not imply that the data lying within any convex set has an expansion constant at most c̃. Hence a bounded expansion constant assumption for (A ∩ (S ∪ {q}), ℓ2) for every convex set A ⊂ R^D is stronger than a bounded expansion constant assumption for (S ∪ {q}, ℓ2)³. Condition C2 ensures that the tree is complete, so that for every query q and every depth l ≤ L there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. Condition C4 implies that the squared data diameter of any node A (max_{x,y∈A∩S} ||x − y||²) is within a constant factor of its quantization error V_S(A). This refers to the assumption that the node A contains no outliers as described in Section 2, and that only hyperplane partitions are used, so their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers, for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained, since the alternate split is the same for all BSP-trees. For two different kinds of hyperplane splits, if the alternate split is invoked the same number of times in the tree, the difference in the worst-case guarantees for the two trees would again be governed by their worst-case quantization performance (β∗). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. We do not yet have a theoretical answer for this⁴.
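Condition C1 can also be probed empirically. The sketch below gives a crude estimate of the expansion constant c̃ of Definition 3.2 by scanning a finite grid of radii; since the definition quantifies over all radii, this only lower-bounds c̃ and is meant purely to build intuition for how c̃ grows with intrinsic dimensionality. It is not part of the paper's method.

```python
import numpy as np

def expansion_constant(S, num_radii=20):
    """Crude empirical estimate of the expansion constant of (S, l2):
    the largest observed ratio |B(p, 2*delta)| / |B(p, delta)| over all
    centers p and a finite grid of radii delta (clipped below at 2)."""
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
    radii = np.linspace(D[D > 0].min(), D.max() / 2, num_radii)
    c = 2.0
    for delta in radii:
        small = (D < delta).sum(axis=1)        # |B(p, delta)| for every center p
        big = (D < 2 * delta).sum(axis=1)      # |B(p, 2*delta)|
        c = max(c, float(np.max(big / small)))
    return c

rng = np.random.default_rng(1)
print(expansion_constant(rng.normal(size=(300, 3))))    # low-dimensional data
print(expansion_constant(rng.normal(size=(300, 30))))   # higher intrinsic dimension
```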
Empirical validation. We examine our theoretical results on 4 datasets – OPTDIGITS (D = 64, n = 3823, 1797 queries), TINY IMAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), and IMAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: the kd-tree, the random-projection (RP) tree, the principal-axis (PA) tree, the two-means (2M) tree and the max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (Δ_S(A)² > η V_S(A), i.e., whether the squared data diameter of a node exceeds η V_S(A)) can be computationally expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C.

The performance of these BSP-trees is presented in Figure 2. Trees with missing data points at higher depth levels (for example, the kd-tree in Figure 2(a) and the 2M-tree in Figures 2(b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree is significantly better than that of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has quantization performance comparable to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms the PA-tree on all datasets. This can be attributed to the large-margin partitions in the MM-tree. The comparison to the 2M-tree is not as clear-cut. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)^l/2^l) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence the 2M-tree can be heavily unbalanced. The absence of complete 2M-trees beyond depth 4 and 6 in Figures 2(b) & (c), respectively, is evidence of this lack of balance, which implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming it in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average).

Figure 2: Performance of BSP-trees with increasing traversal depth on (a) OPTDIGITS, (b) TINY IMAGES, (c) MNIST and (d) IMAGES. The top row corresponds to the quantization performance of the existing trees and the bottom row presents the nearest-neighbor error (in terms of the mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color).

³ A subset of a growth-restricted metric space (S, ℓ2) may not be growth-restricted. However, in our case we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ R^D is a convex set. So our condition does not imply that all subsets of (S, ℓ2) are growth-restricted.
⁴ We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that, for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of the different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition.
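The exact tree constructions are deferred to the paper's Appendix C (not included here); the following sketch shows generic textbook versions of the kd-, PA-, RP- and 2M-style split rules used in such comparisons. The 2M split relies on scikit-learn's KMeans, and the MM split is omitted since it requires a max-margin clustering solver; none of this is the paper's exact implementation.

```python
import numpy as np

def split_rule(points, kind, rng):
    """One hyperplane split (w, b) of a node, for a few common heuristics."""
    if kind == 'kd':        # coordinate with the largest spread, median threshold
        w = np.zeros(points.shape[1])
        w[np.argmax(points.var(axis=0))] = 1.0
    elif kind == 'pa':      # principal eigenvector of the node covariance
        w = np.linalg.eigh(np.cov(points.T))[1][:, -1]
    elif kind == 'rp':      # random direction
        w = rng.normal(size=points.shape[1])
        w /= np.linalg.norm(w)
    elif kind == '2m':      # hyperplane bisecting the two k-means centroids
        from sklearn.cluster import KMeans
        centers = KMeans(n_clusters=2, n_init=5, random_state=0).fit(points).cluster_centers_
        w = centers[1] - centers[0]
        return w, -np.dot(w, centers.mean(axis=0))
    else:
        raise ValueError(kind)
    proj = points @ w
    return w, -np.median(proj)   # median threshold gives an (almost) balanced split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
for kind in ('kd', 'pa', 'rp', '2m'):
    w, b = split_rule(X, kind, rng)
    left = (X @ w + b) <= 0
    print(kind, int(left.sum()), int((~left).sum()))   # rough balance of each split
```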
4 Large margin BSP-tree

We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition, and empirical results indicate that it has performance comparable to the 2M-tree and PA-tree in terms of quantization. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large-margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗, b∗) from the following optimization problem⁵, given a set of points S = {x_1, x_2, ..., x_m} ⊂ R^D:

min_{w, b, ξ_i}  (1/2) ||w||²₂ + C Σ_{i=1}^m ξ_i   (5)
s.t.  |⟨w, x_i⟩ + b| ≥ 1 − ξ_i,  ξ_i ≥ 0  ∀ i = 1, ..., m,   (6)
      −ωm ≤ Σ_{i=1}^m sgn(⟨w, x_i⟩ + b) ≤ ωm.   (7)

MMC finds a soft max-margin split of the data into two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree:

Theorem 4.1. Given a set of points S ⊂ R^D and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {A_l, A_r} with at most αm support vectors and a split margin of size γ = 1/||w||. Then the quantization error improvement is given by:

V_S({A_l, A_r}) ≤ [1 − ((1 − ω)/(1 + ω)) · γ²(1 − α)² / Σ_{i=1}^D λ_i] V_S(A),   (8)

where λ_1, ..., λ_D are the eigenvalues of the covariance matrix of A ∩ S.

The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) imply better quantization performance. Larger ω implies a smaller improvement, but ω is generally restricted algorithmically in MMC. If γ = O(√λ_1), then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1:

Proposition 4.1. [7, Lemma 15] Given a set S, for any partition {A_1, A_2} of a set A,

V_S(A) − V_S({A_1, A_2}) = (|A_1 ∩ S| |A_2 ∩ S| / |A ∩ S|²) ||μ(A_1) − μ(A_2)||²,   (9)

where μ(A) is the centroid of the points in the region A.

⁵ This is an equivalent formulation [16] of the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels y_i and optimizes over them. We consider this form of the problem since it makes our analysis easier to follow.

This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions of the partition.
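Proposition 4.1 is easy to sanity-check numerically; the snippet below verifies the identity of Equation (9) on random data with an arbitrary hyperplane partition (the helper vs() is our own name for V_S(·)).

```python
import numpy as np

def vs(points):
    """V_S of a region: mean squared distance of its points to their centroid."""
    mu = points.mean(axis=0)
    return np.mean(np.sum((points - mu) ** 2, axis=1))

rng = np.random.default_rng(2)
A = rng.normal(size=(500, 4))
mask = (A @ rng.normal(size=4)) > 0.3          # an arbitrary hyperplane partition
A1, A2 = A[mask], A[~mask]

lhs = vs(A) - (len(A1) * vs(A1) + len(A2) * vs(A2)) / len(A)
rhs = (len(A1) * len(A2) / len(A) ** 2) * np.sum((A1.mean(axis=0) - A2.mean(axis=0)) ** 2)
print(np.isclose(lhs, rhs))   # the identity of Proposition 4.1 / Equation (9) holds exactly
```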
Proof of Theorem 4.1. For a feasible solution (w, b, ξ_i|_{i=1,...,m}) to the MMC problem,

Σ_{i=1}^m |⟨w, x_i⟩ + b| ≥ m − Σ_{i=1}^m ξ_i.

Let x̃_i = ⟨w, x_i⟩ + b, let m_p = |{i : x̃_i > 0}| and m_n = |{i : x̃_i ≤ 0}|, and let μ̃_p = (Σ_{i : x̃_i > 0} x̃_i)/m_p and μ̃_n = (Σ_{i : x̃_i ≤ 0} x̃_i)/m_n. Then m_p μ̃_p − m_n μ̃_n ≥ m − Σ_i ξ_i.

Without loss of generality, we assume that m_p ≥ m_n. Then the balance constraint (Equation (7)) tells us that m_p ≤ m(1 + ω)/2 and m_n ≥ m(1 − ω)/2. Then μ̃_p − μ̃_n + ω(μ̃_p + μ̃_n) ≥ 2 − (2/m) Σ_i ξ_i. Since μ̃_p > 0 and μ̃_n ≤ 0, |μ̃_p + μ̃_n| ≤ (μ̃_p − μ̃_n). Hence (1 + ω)(μ̃_p − μ̃_n) ≥ 2 − (2/m) Σ_i ξ_i. For an unsupervised split, the data is always separable since there is no misclassification. This implies that ξ_i ≤ 1 for all i. Hence,

μ̃_p − μ̃_n ≥ (2 − (2/m) |{i : ξ_i > 0}|) / (1 + ω) ≥ 2 (1 − α)/(1 + ω),   (10)

since the term |{i : ξ_i > 0}| corresponds to the number of support vectors in the solution. The Cauchy–Schwartz inequality implies that ||μ(A_l) − μ(A_r)|| ≥ |⟨w, μ(A_l) − μ(A_r)⟩| / ||w|| = (μ̃_p − μ̃_n) γ, since μ̃_n = ⟨w, μ(A_l)⟩ + b and μ̃_p = ⟨w, μ(A_r)⟩ + b. From Equation (10), we can say that ||μ(A_l) − μ(A_r)||² ≥ 4γ²(1 − α)²/(1 + ω)². Also, for ω-balanced splits, |A_l||A_r| ≥ (1 − ω²) m²/4. Combining these into Equation (9) from Proposition 4.1, we have

V_S(A) − V_S({A_l, A_r}) ≥ (1 − ω²) γ² ((1 − α)/(1 + ω))² = γ²(1 − α)² (1 − ω)/(1 + ω).   (11)

Let Cov(A ∩ S) be the covariance matrix of the data contained in region A, and let λ_1, ..., λ_D be the eigenvalues of Cov(A ∩ S). Then we have

V_S(A) = (1/|A ∩ S|) Σ_{x∈A∩S} ||x − μ(A)||² = tr(Cov(A ∩ S)) = Σ_{i=1}^D λ_i.

Dividing Equation (11) by V_S(A) gives the statement of the theorem.
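The bound of Theorem 4.1 can be evaluated directly once γ, α, ω and the node eigenvalues are known. The sketch below (with arbitrary illustrative numbers) shows how the guaranteed error factor of Equation (8), as reconstructed above, tightens with larger margins and fewer support vectors.

```python
import numpy as np

def mm_split_error_factor(gamma, alpha, omega, eigenvalues):
    """Upper bound on V_S({A_l, A_r}) / V_S(A) for an omega-balanced max-margin split
    with margin gamma and a fraction alpha of support vectors (Equation (8))."""
    improvement = (gamma ** 2) * (1 - alpha) ** 2 * (1 - omega) / (1 + omega)
    return 1.0 - improvement / np.sum(eigenvalues)

# wider margins and fewer support vectors tighten the bound
lam = np.array([4.0, 2.0, 1.0, 0.5])          # hypothetical node eigenvalues
for gamma, alpha in [(0.5, 0.5), (1.0, 0.3), (2.0, 0.1)]:
    factor = mm_split_error_factor(gamma, alpha, omega=0.2, eigenvalues=lam)
    print(gamma, alpha, round(factor, 3))
```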
5 Conclusions and future directions

Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees, as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β∗ while the MM-tree explicitly maximizes γ (which also implies a smaller β∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general-dimensional large-margin partitions in the MM-tree construction can be quite expensive. The idea of large-margin partitions can instead be used to enhance any simpler space-partitioning heuristic: for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one-dimensional large-margin split of the projections of the points along the chosen direction can be obtained very efficiently, for improved search performance.

This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. Plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH-based search schemes.

References
[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions on Mathematical Software, 1977.
[2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.
[3] R. F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991.
[4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001.
[5] K. Fukunaga and P. M. Narendra. A Branch-and-Bound Algorithm for Computing k-Nearest-Neighbors. IEEE Transactions on Computers, 1975.
[6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[7] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the ACM Symposium on Theory of Computing, 2008.
[8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012.
[9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. In Advances in Neural Information Processing Systems, 2005.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the ACM Symposium on Theory of Computing, 1998.
[11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. In Advances in Neural Information Processing Systems, 2005.
[12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013.
[13] J. He, S. Kumar, and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012.
[14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. In Advances in Neural Information Processing Systems, 2007.
[15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of the ACM Symposium on Theory of Computing, 2002.
[16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008.
[17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007.
[19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In Advances in Neural Information Processing Systems, 2008.
[20] J. Wang, S. Kumar, and S. F. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011.
[22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

3 0.8250615 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables

Author: Zhuo Wang, Alan Stocker, Daniel Lee

Abstract: In many neural systems, information about stimulus variables is often represented in a distributed manner by means of a population code. It is generally assumed that the responses of the neural population are tuned to the stimulus statistics, and most prior work has investigated the optimal tuning characteristics of one or a small number of stimulus variables. In this work, we investigate the optimal tuning for diffeomorphic representations of high-dimensional stimuli. We analytically derive the solution that minimizes the L2 reconstruction loss. We compared our solution with other well-known criteria such as maximal mutual information. Our solution suggests that the optimal weights do not necessarily decorrelate the inputs, and the optimal nonlinearity differs from the conventional equalization solution. Results illustrating these optimal representations are shown for some input distributions that may be relevant for understanding the coding of perceptual pathways. 1

4 0.82340366 318 nips-2013-Structured Learning via Logistic Regression

Author: Justin Domke

Abstract: A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is “smoothed” through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (non-structured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an “oracle” exists to minimize a logistic loss.

5 0.82229877 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations

Author: Tamir Hazan, Subhransu Maji, Tommi Jaakkola

Abstract: In this paper we describe how MAP inference can be used to sample efficiently from Gibbs distributions. Specifically, we provide means for drawing either approximate or unbiased samples from Gibbs’ distributions by introducing low dimensional perturbations and solving the corresponding MAP assignments. Our approach also leads to new ways to derive lower bounds on partition functions. We demonstrate empirically that our method excels in the typical “high signal high coupling” regime. The setting results in ragged energy landscapes that are challenging for alternative approaches to sampling and/or lower bounds. 1

6 0.82045233 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent

7 0.81965286 75 nips-2013-Convex Two-Layer Modeling

8 0.81940639 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

9 0.81936693 233 nips-2013-Online Robust PCA via Stochastic Optimization

10 0.81930292 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization

11 0.81911457 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions

12 0.81816149 311 nips-2013-Stochastic Convex Optimization with Multiple Objectives

13 0.81728357 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions

14 0.81687838 344 nips-2013-Using multiple samples to learn mixture models

15 0.81681299 161 nips-2013-Learning Stochastic Inverses

16 0.8166815 301 nips-2013-Sparse Additive Text Models with Low Rank Background

17 0.81635147 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

18 0.81634229 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

19 0.81615824 11 nips-2013-A New Convex Relaxation for Tensor Completion

20 0.81593478 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.