nips nips2010 nips2010-242 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Slice sampling covariance hyperparameters of latent Gaussian models Ryan Prescott Adams Dept. [sent-1, score-0.775]
2 Integrating over these hyperparameters considers different possible explanations for the data when making predictions. [sent-4, score-0.275]
3 However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. [sent-6, score-0.331]
4 In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. [sent-7, score-0.428]
5 [1] developed a slice sampling [2] variant, elliptical slice sampling, for updating strongly coupled a-priori Gaussian variates given non-Gaussian observations. [sent-14, score-0.869]
6 Previously, Agarwal and Gelfand [3] demonstrated the utility of slice sampling for updating covariance parameters, conventionally called hyperparameters, with a Gaussian observation model, and questioned the possibility of slice sampling in more general settings. [sent-15, score-0.963]
7 In this work we develop a new slice sampler for updating covariance hyperparameters. [sent-16, score-0.513]
8 1 Latent Gaussian models We consider generative models of data that depend on a vector of latent variables f that are Gaussian distributed with covariance Σθ set by unknown hyperparameters θ. [sent-19, score-0.79]
9 [Figure 1 panels: (a) Prior draws; (b) Lengthscale given f, plotted against lengthscale ℓ.] Figure 1: (a) Shows draws from the prior over f using three different lengthscales in the squared exponential covariance (2). [sent-40, score-0.293]
10 The generic form of the generative models we consider is summarized by covariance hyperparameters θ ∼ ph, latent variables f ∼ N (0, Σθ ), and a conditional likelihood P (data |f ) = L(f ). [sent-42, score-0.932]
11 However, our experiments focus on the popular case where the covariance is associated with N input vectors {x_n}, n = 1…N, through the squared-exponential kernel, (Σθ)_ij = k(x_i, x_j) = σf^2 exp(−(1/2) Σ_{d=1}^D (x_{d,i} − x_{d,j})^2 / ℓ_d^2), (2) with hyperparameters θ = {σf^2, {ℓ_d}}. [sent-44, score-0.398]
12 Here σf^2 is the ‘signal variance’ controlling the overall scale of the latent variables f. [sent-45, score-0.359]
13 The ℓ_d give characteristic lengthscales for converting the distances between inputs into covariances between the corresponding latent values f. [sent-46, score-0.462]
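As a concrete illustration of (2), here is a minimal NumPy sketch of the squared-exponential covariance; the function name, the array layout (inputs X of shape N×D, lengthscales ell of length D, signal variance sf2), and the jitter term added for stable Cholesky factorizations are our choices rather than anything specified in the paper.

```python
import numpy as np

def se_covariance(X, sf2, ell, jitter=1e-8):
    """Squared-exponential covariance (2):
    (Sigma_theta)_ij = sf2 * exp(-0.5 * sum_d (x_{d,i} - x_{d,j})^2 / ell_d^2)."""
    Xs = X / ell                                      # scale each input dimension by its lengthscale
    sq = np.sum(Xs**2, 1)[:, None] + np.sum(Xs**2, 1)[None, :] - 2.0 * Xs @ Xs.T
    K = sf2 * np.exp(-0.5 * np.maximum(sq, 0.0))      # clip tiny negative distances from round-off
    return K + jitter * np.eye(X.shape[0])            # jitter keeps Cholesky factorizations stable
```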
14 The joint posterior over the unknowns is P(θ, f | data) ∝ L(f) N(f; 0, Σθ) p_h(θ). (3) We would like to avoid implementing new code or tuning algorithms for different covariances Σθ and conditional likelihood functions L(f). [sent-48, score-0.265]
15 2 Markov chain inference A Markov chain transition operator T(z′ ← z) defines a conditional distribution on a new position z′ given an initial position z. [sent-49, score-0.236]
16 A standard way to sample from the joint posterior (3) is to alternately simulate transition operators that leave its conditionals, P (f |data, θ) and P (θ | f ), invariant. [sent-51, score-0.282]
17 Recent work has focused on transition operators for updating the latent variables f given data and a fixed covariance Σθ [6, 1]. [sent-55, score-0.633]
18 Updates to the hyperparameters for fixed latent variables f need to leave the conditional posterior, P (θ |f ) ∝ N (f ; 0, Σθ ) ph(θ), (4) invariant. [sent-56, score-0.715]
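Evaluating the conditional posterior (4) up to a constant only needs the Gaussian prior term and the hyperprior; a minimal sketch, assuming the caller supplies a covariance constructor and a log hyperprior (all names here are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_gaussian_prior(f, Sigma):
    """log N(f; 0, Sigma), computed through a Cholesky factorization."""
    L, lower = cho_factor(Sigma, lower=True)
    alpha = cho_solve((L, lower), f)
    return -0.5 * f @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(f) * np.log(2 * np.pi)

def log_conditional_posterior(theta, f, build_Sigma, log_hyperprior):
    """log P(theta | f) + const, as in (4): log N(f; 0, Sigma_theta) + log p_h(theta)."""
    return log_gaussian_prior(f, build_Sigma(theta)) + log_hyperprior(theta)
```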
19 Other possibilities include slice sampling [2] and Hamiltonian Monte Carlo [7, 8]. [sent-58, score-0.382]
20 Figure 1a shows latent vector samples using squared-exponential covariances with different lengthscales. [sent-61, score-0.34]
21 These samples are highly informative about the lengthscale hyperparameter that was used, especially for short lengthscales. [sent-62, score-0.25]
22 The sharpness of P (θ | f ), Figure 1b, dramatically limits the amount that any Markov chain can update the hyperparameters θ for fixed latent values f . [sent-63, score-0.652]
23 Algorithm 1 (M–H transition for fixed f). Input: current f and hyperparameters θ; proposal dist. q. [sent-64, score-0.509]
24 Output: next hyperparameters. 1: Propose θ′ ∼ q(θ′; θ). 2: Draw u ∼ Uniform(0, 1). 3: if u < [N(f; 0, Σθ′) p_h(θ′) q(θ; θ′)] / [N(f; 0, Σθ) p_h(θ) q(θ′; θ)]: 4: return θ′ (accept new state); 5: else: 6: return θ (keep current state). [sent-66, score-0.721]
25 Algorithm 2 (M–H transition for fixed ν). Input: current state θ, f; proposal dist. q. Output: next θ, f. 1: Solve for the N(0, I) variate: ν = LΣθ^-1 f. 2: Propose θ′ ∼ q(θ′; θ). 3: Compute implied values: f′ = LΣθ′ ν. 4: Draw u ∼ Uniform(0, 1). 5: if u < [L(f′) p_h(θ′) q(θ; θ′)] / [L(f) p_h(θ) q(θ′; θ)]: 6: return θ′, f′ (accept new state); 7: else: 8: return θ, f (keep current state). [sent-68, score-0.308]
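A compact sketch of both updates in Python, assuming a symmetric proposal so that the q-ratio cancels, and with the Gaussian log-density and log hyperprior passed in as callables; the function names and this organisation are ours, not the paper's:

```python
import numpy as np
from numpy.linalg import cholesky, solve

def mh_fixed_f(theta, f, build_Sigma, log_ph, log_prior_f, propose, rng):
    """Algorithm 1: M-H on theta with the latent f held fixed.
    log_prior_f(f, Sigma) should return log N(f; 0, Sigma)."""
    theta_new = propose(theta, rng)
    log_ratio = (log_prior_f(f, build_Sigma(theta_new)) + log_ph(theta_new)
                 - log_prior_f(f, build_Sigma(theta)) - log_ph(theta))
    return theta_new if np.log(rng.uniform()) < log_ratio else theta

def mh_fixed_nu(theta, f, build_Sigma, log_ph, log_lik, propose, rng):
    """Algorithm 2: M-H on theta with the whitened variables nu held fixed."""
    nu = solve(cholesky(build_Sigma(theta)), f)        # nu = L_Sigma^{-1} f
    theta_new = propose(theta, rng)
    f_new = cholesky(build_Sigma(theta_new)) @ nu      # implied latent variables
    log_ratio = (log_lik(f_new) + log_ph(theta_new)
                 - log_lik(f) - log_ph(theta))
    if np.log(rng.uniform()) < log_ratio:
        return theta_new, f_new
    return theta, f
```

As in the listing above, Algorithm 2's ratio involves only the likelihood and hyperprior: keeping ν fixed means the prior is respected by construction.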
26 2.1 Whitening the prior Often the conditional likelihood is quite weak; this is why strong prior smoothing assumptions are often introduced in latent Gaussian models. [sent-69, score-0.501]
27 A vector of independent normals, ν, is drawn independently of the hyperparameters and then deterministically transformed: ν ∼ N(0, I), f = LΣθ ν, (5) where LΣθ LΣθ^T = Σθ. [sent-76, score-0.275]
28 We can choose to update the hyperparameters θ for fixed ν instead of fixed f . [sent-80, score-0.312]
29 As the original latent variables f are deterministically linked to the hyperparameters θ in (5), these updates will actually change both θ and f . [sent-81, score-0.67]
30 The posterior over hyperparameters for fixed ν is apparent by applying Bayes rule to the generative procedure in (5), or one can laboriously obtain it by changing variables in (3): P (θ | ν, data) ∝ P (θ, ν, data) = P (θ, f = LΣθ ν, data) |LΣθ | ∝ · · · ∝ L(f (θ, ν)) ph(θ). [sent-84, score-0.528]
31 The acceptance rule now depends on the latent variables through the conditional likelihood L(f ) instead of the prior N (f ; 0, Σθ ) and these variables are automatically updated to respect the prior. [sent-86, score-0.753]
32 In the no-data limit, new hyperparameters proposed from the prior are always accepted. [sent-87, score-0.323]
33 Algorithm 2 is ideal in the “weak data” limit where the latent variables f are distributed according to the prior. [sent-89, score-0.359]
34 In the example, the likelihoods are too restrictive for Algorithm 2’s proposal to be acceptable. [sent-90, score-0.244]
35 In the “strong data” limit, where the latent variables f are fixed by the likelihood L, Algorithm 1 would be ideal. [sent-91, score-0.461]
36 For regression problems with Gaussian noise the latent variables can be marginalised out analytically, allowing hyperparameters to be accepted or rejected according to their marginal posterior P (θ |data). [sent-93, score-0.813]
37 If latent variables are required they can be sampled directly from the conditional posterior P (f |θ, data). [sent-94, score-0.523]
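For Gaussian observation noise the marginalisation referred to above is the standard GP regression identity; a sketch of the log marginal likelihood log N(y; 0, Σθ + σn^2 I) that a sampler targeting P(θ | data) could evaluate (variable names are ours):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_marginal_likelihood(y, Sigma, noise_var):
    """log N(y; 0, Sigma + noise_var * I): the latent f integrated out analytically."""
    K = Sigma + noise_var * np.eye(len(y))
    L, lower = cho_factor(K, lower=True)
    alpha = cho_solve((L, lower), y)
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)
```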
38 To build a method that applies to non-Gaussian likelihoods, we create an auxiliary variable model that introduces surrogate Gaussian observations that will guide joint proposals of the hyperparameters and latent variables. [sent-95, score-1.05]
39 [Figure 2 legend: current state f; whitened prior proposal; surrogate data proposal. Vertical axis: Observations, y.] [sent-97, score-0.808]
40 The current state of the sampler has a short lengthscale hyperparameter (ℓ = 0. [sent-104, score-0.38]
41 The current latent variables do not lie on a straight enough line for the long lengthscale to be plausible. [sent-107, score-0.557]
42 A whitened prior proposal (Section 2.1) updates the latent variables to a straighter line, but ignores the observations. [sent-109, score-0.395]
43 A proposal using surrogate data (Section 3, with Sθ set to the observation noise) sets the latent variables to a draw that is plausible for the proposed lengthscale while being close to the current state. [sent-110, score-1.101]
44 We augment the latent Gaussian model with auxiliary variables, g, a noisy version of the true latent variables: P(g | f, θ) = N(g; f, Sθ). (7) [sent-111, score-0.657]
45 For now Sθ is an arbitrary free parameter that could be set by hand to either a fixed value or a value that depends on the current hyperparameters θ. [sent-112, score-0.323]
46 We will discuss how to automatically set the auxiliary noise covariance Sθ in Section 3.2. [sent-113, score-0.309]
47 Under the auxiliary model the conditional posterior over the latent variables is Gaussian, P(f | g, θ) = N(f; mθ,g, Rθ), with Rθ = (Σθ^-1 + Sθ^-1)^-1 and mθ,g = Rθ Sθ^-1 g. (9) That is, the latent variables of interest are drawn from their posterior given the surrogate data g. [sent-117, score-0.926]
48 The latent variables are again whitened, now under this surrogate posterior: η = LRθ^-1 (f − mθ,g), i.e. f = LRθ η + mθ,g. (10) We then condition on the “whitened” variables η and the surrogate data g while updating the hyperparameters θ. [sent-119, score-0.759]
49 The implied latent variables f (θ, η, g) will remain a plausible draw from the surrogate posterior for the current hyperparameters. [sent-120, score-0.973]
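Putting (7)–(10) together gives a joint update of (θ, f) in the spirit of the surrogate data samplers below. The acceptance ratio used here, built from L(f), p_h(θ) and N(g; 0, Σθ + Sθ), follows from the reparameterized Metropolis–Hastings ratio quoted in the related-work section once the Jacobian factors cancel; this is our reconstruction under a symmetric proposal, not the paper's exact pseudocode for Algorithm 3.

```python
import numpy as np
from numpy.linalg import cholesky, solve

def log_gauss(x, cov):
    """log N(x; 0, cov) via Cholesky."""
    L = cholesky(cov)
    r = solve(L, x)
    return -0.5 * r @ r - np.sum(np.log(np.diag(L))) - 0.5 * len(x) * np.log(2 * np.pi)

def conditional_moments(Sigma, S, g, jitter=1e-10):
    """Mean m and covariance R of f given surrogate data g, as in (9)."""
    C = cholesky(Sigma + S)
    solve_SpS = lambda B: solve(C.T, solve(C, B))          # (Sigma + S)^{-1} B
    m = Sigma @ solve_SpS(g)
    R = Sigma - Sigma @ solve_SpS(Sigma) + jitter * np.eye(len(g))
    return m, R

def surrogate_data_update(theta, f, build_Sigma, build_S, log_lik, log_ph, propose, rng):
    """One joint M-H update of (theta, f) using surrogate data; symmetric proposal assumed."""
    Sigma, S = build_Sigma(theta), build_S(theta)
    g = f + cholesky(S) @ rng.standard_normal(len(f))      # draw surrogate data, eq. (7)
    m, R = conditional_moments(Sigma, S, g)
    eta = solve(cholesky(R), f - m)                        # whitened variables, eq. (10)

    theta_new = propose(theta, rng)
    Sigma_new, S_new = build_Sigma(theta_new), build_S(theta_new)
    m_new, R_new = conditional_moments(Sigma_new, S_new, g)
    f_new = cholesky(R_new) @ eta + m_new                  # implied latent variables

    log_ratio = (log_lik(f_new) + log_ph(theta_new) + log_gauss(g, Sigma_new + S_new)
                 - log_lik(f) - log_ph(theta) - log_gauss(g, Sigma + S))
    return (theta_new, f_new) if np.log(rng.uniform()) < log_ratio else (theta, f)
```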
50 3.1 Slice sampling The Metropolis–Hastings algorithms discussed so far have a proposal distribution q(θ′; θ) that must be set and tuned. [sent-125, score-0.264]
51 4 Algorithm 3 Surrogate data M–H Algorithm 4 Surrogate data slice sampling Input: θ, f ; prop. [sent-128, score-0.382]
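The bodies of Algorithms 3 and 4 are not reproduced in this extract; for the slice-sampling variant, a generic stepping-out/shrinkage slice sampler in the style of Neal [2] can play the role of the hyperparameter update. The sketch below is that generic routine, not the paper's exact Algorithm 4.

```python
import numpy as np

def slice_sample_1d(x0, log_density, rng, w=1.0, max_steps=50):
    """One univariate slice-sampling update (stepping out, then shrinkage)."""
    log_y = log_density(x0) + np.log(rng.uniform())       # slice height under the current point
    left = x0 - w * rng.uniform()                         # randomly positioned initial bracket
    right = left + w
    for _ in range(max_steps):                            # step the bracket out until off the slice
        if log_density(left) < log_y:
            break
        left -= w
    for _ in range(max_steps):
        if log_density(right) < log_y:
            break
        right += w
    while True:                                           # shrink towards x0 until a point is accepted
        x1 = left + (right - left) * rng.uniform()
        if log_density(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

In the surrogate data representation, log_density applied to one hyperparameter would evaluate log L(f(θ, η, g)) + log p_h(θ) plus the surrogate-data term, with η, g and the remaining hyperparameters held fixed; that reading is ours.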
52 3.2 The auxiliary noise covariance Sθ The surrogate data g and noise covariance Sθ define a pseudo-posterior distribution that softly specifies a plausible region within which the latent variables f are updated. [sent-143, score-1.158]
53 The first two baseline algorithms of Section 2 result from limiting cases of Sθ = αI: 1) if α = 0 the surrogate data and the current latent variables are equal and the acceptance ratio reduces to that of Algorithm 1. [sent-145, score-0.827]
54 Given a Gaussian fit to the site-posterior (12) with variance v_i, we can set the auxiliary noise to a level that would result in the same posterior variance at that site alone: (Sθ)_ii = (v_i^-1 − (Σθ)_ii^-1)^-1. [sent-152, score-0.383]
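Once per-site posterior variances v_i are available from a Gaussian fit to (12) (e.g. a Laplace- or EP-style approximation), the site-based auxiliary noise is a one-line computation; the clipping below is our addition, to keep the result positive when v_i is not smaller than the prior variance:

```python
import numpy as np

def site_auxiliary_noise(Sigma, v, floor=1e-6):
    """Diagonal S_theta with (S_theta)_ii = (1/v_i - 1/(Sigma_theta)_ii)^(-1), clipped to stay positive."""
    prior_var = np.diag(Sigma)
    precision_gap = 1.0 / v - 1.0 / prior_var   # extra precision the likelihood contributes at each site
    return 1.0 / np.maximum(precision_gap, floor)
```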
55 4 Related work We have discussed samplers that jointly update strongly-coupled latent variables and hyperparameters. [sent-155, score-0.455]
56 The hyperparameters can move further in joint moves than their narrow conditional posteriors (e. [sent-156, score-0.315]
57 However, this method is cumbersome to implement and tune, and using HMC to jointly update latent variables and hyperparameters in hierarchical models does not itself seem to improve sampling [11]. [sent-160, score-0.785]
58 Christensen et al. [9] have also proposed a robust representation for sampling in latent Gaussian models. [sent-162, score-0.377]
59 They use an approximation to the target posterior distribution to construct a reparameterization where the unknown variables are close to independent. [sent-163, score-0.442]
60 This likelihood fit results in an approximate Gaussian posterior N(f; mθ,g=f̂, Rθ) as found in (9), with noise Sθ = Λ(f̂)^-1 and data g = f̂. [sent-166, score-0.281]
61 Thinking of the current latent variables as a draw from this approximate posterior, ω ∼ N(0, I), f = LRθ ω + mθ,f̂, suggests using the reparameterization ω = LRθ^-1 (f − mθ,f̂). [sent-167, score-0.711]
62 We can then fix the new variables and update the hyperparameters under P(θ | ω, data) ∝ L(f(ω, θ)) N(f(ω, θ); 0, Σθ) p_h(θ) |LRθ|. (14) [sent-168, score-0.408]
63 When the likelihood is Gaussian, the reparameterized variables ω are independent of each other and the hyperparameters. [sent-169, score-0.284]
64 When all of the likelihood terms are flat the reparameterization approach reduces to that of Section 2.1. [sent-173, score-0.324]
65 The surrogate data samplers of Section 3 can also be viewed as using reparameterizations, by treating η = LRθ^-1 (f − mθ,g) as an arbitrary random reparameterization for making proposals. [sent-176, score-0.593]
66 A proposal density q(η′, θ′; η, θ) in the reparameterized space must be multiplied by the Jacobian |LRθ′|^-1 to give a proposal density in the original parameterization. [sent-177, score-0.386]
67 The probability of proposing the reparameterization must also be included in the Metropolis–Hastings acceptance probability: min{1, [P(θ′, f′ | data) · P(g | f′, Sθ′) · q(θ; θ′) · |LRθ|^-1] / [P(θ, f | data) · P(g | f, Sθ) · q(θ′; θ) · |LRθ′|^-1]}. [sent-178, score-0.33]
68 The differences are that the new latent variables f are computed using different pseudo-posterior means and the surrogate data method has an extra term for the random, rather than fixed, choice of reparameterization. [sent-182, score-0.671]
69 The surrogate data sampler is easier to implement than the previous reparameterization work because the surrogate posterior is centred around the current latent variables. [sent-183, score-1.327]
70 2) picking the noise covariance Sθ poorly may still produce a workable method, whereas a fixed reparameterization can work badly if the true posterior distribution is in the tails of the Gaussian approximation. [sent-185, score-0.388]
71 Christensen et al. [9] pointed out that centering the approximate Gaussian likelihood in their reparameterization around the current state is tempting, but that computing the Jacobian of the transformation is then intractable. [sent-187, score-0.443]
72 By construction, the surrogate data model centers the reparameterization near to the current state. [sent-188, score-0.582]
73 The experiments are designed to test the mixing of hyperparameters θ while sampling from the joint posterior (3). [sent-194, score-0.513]
74 All of the discussed approaches except Algorithm 1 update the latent variables f as a side-effect. [sent-195, score-0.396]
75 However, further transition operators for the latent variables for fixed hyperparameters are required. [sent-196, score-0.709]
76 In Algorithm 2 the “whitened” variables ν remain fixed; the latent variables and hyperparameters are constrained to satisfy f = LΣθ ν. [sent-197, score-0.73]
77 The surrogate data samplers are ergodic: the full joint posterior distribution will eventually be explored. [sent-198, score-0.495]
78 However, each update changes the hyperparameters and requires expensive computations involving covariances. [sent-199, score-0.312]
79 After computing the covariances for one set of hyperparameters, it makes sense to apply several cheap updates to the latent variables. [sent-200, score-0.376]
80 For every method we applied ten updates of elliptical slice sampling [1] to the latent variables f between each hyperparameter update. [sent-201, score-0.956]
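Elliptical slice sampling [1] leaves the Gaussian prior N(0, Σθ) invariant and requires no tuning, which is why it is used for the intermediate f-updates; a standard sketch of one such update, together with the ten-to-one interleaving described above (the function names and the outer-loop organisation are ours):

```python
import numpy as np

def elliptical_slice(f, L_Sigma, log_lik, rng):
    """One elliptical slice sampling update of f ~ N(0, Sigma), Sigma = L_Sigma L_Sigma^T [1]."""
    nu = L_Sigma @ rng.standard_normal(len(f))              # auxiliary prior draw defining the ellipse
    log_y = log_lik(f) + np.log(rng.uniform())               # slice height
    angle = rng.uniform(0.0, 2.0 * np.pi)
    angle_min, angle_max = angle - 2.0 * np.pi, angle
    while True:
        f_new = f * np.cos(angle) + nu * np.sin(angle)
        if log_lik(f_new) > log_y:
            return f_new
        if angle < 0.0:                                      # shrink the angle bracket towards zero
            angle_min = angle
        else:
            angle_max = angle
        angle = rng.uniform(angle_min, angle_max)

def interleaved_sampler(theta, f, n_iters, build_chol, log_lik, update_theta, rng):
    """Alternate ten latent-variable updates with one hyperparameter update, as in the experiments."""
    samples = []
    for _ in range(n_iters):
        L_Sigma = build_chol(theta)                          # Cholesky factor of Sigma_theta
        for _ in range(10):
            f = elliptical_slice(f, L_Sigma, log_lik, rng)
        theta, f = update_theta(theta, f, rng)               # e.g. a surrogate-data M-H or slice update
        samples.append((np.copy(theta), np.copy(f)))
    return samples
```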
81 One could also consider applying elliptical slice sampling to a reparameterized representation; for simplicity of comparison we do not. [sent-202, score-0.547]
82 Independently of our work, Titsias [13] has used surrogate-data-like reparameterizations to update latent variables for fixed hyperparameters. [sent-203, score-0.794]
83 Each method used the same slice sampler, as in Algorithm 4, applied to the following model representations. [sent-205, score-0.268]
84 surr-site: using surrogate data with the noise level set to match the site posterior (12). [sent-208, score-0.564]
85 surr-taylor: using surrogate data with noise variance set via Taylor expansion of the log-likelihood (13). [sent-211, score-0.367]
86 post-taylor and post-site: as for the surr- methods, but using a fixed reparameterization based on a posterior approximation (14). [sent-213, score-0.346]
87 Gaussian Regression (Synthetic) When the observations have Gaussian noise the post-taylor reparameterization of Christensen et al. [sent-218, score-0.309]
88 [9] makes the hyperparameters and latent variables exactly independent. [sent-219, score-0.634]
89 The random centering of the surrogate data model will be less effective. [sent-220, score-0.347]
90 We used a Gaussian regression problem to assess how much worse the surrogate data method is compared to an ideal reparameterization. [sent-221, score-0.312]
91 For Gaussian likelihoods the -site and -taylor methods coincide: the auxiliary noise matches the observation noise (Sθ = 0. [sent-227, score-0.335]
92 We sampled the hyperparameters in (2) and a mean offset to the log-rate. [sent-230, score-0.275]
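For these point-process experiments the conditional likelihood is a product of Poisson terms on binned event counts, with the GP plus a constant offset modelling the log-rate; a sketch under that reading (the binning and the way the offset enters are assumptions on our part):

```python
import numpy as np
from scipy.special import gammaln

def poisson_log_lik(f, counts, offset):
    """log prod_i Poisson(counts_i; rate_i = exp(f_i + offset)) for binned event counts."""
    log_rate = f + offset
    return np.sum(counts * log_rate - np.exp(log_rate) - gammaln(counts + 1))
```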
93 [Figure: effective samples per likelihood evaluation, per covariance construction, and per second, for the fixed, prior-white, surr-site, post-site, surr-taylor and post-taylor methods on the ionosphere and synthetic datasets.] [sent-235, score-0.409]
94 The costs are per likelihood evaluation (left), per covariance construction (center), and per second (right). [sent-249, score-0.225]
95 Taylor expanding the likelihood gives no likelihood contribution for bins with zero counts, so it is unsurprising that post-taylor performs similarly to prior-white. [sent-262, score-0.243]
96 Our empirical investigation used slice sampling because it is easy to implement and use. [sent-265, score-0.382]
97 The new surrogate data and post-site representations offer state-of-the-art performance and are the first such advanced methods to be applicable to Gaussian process classification. [sent-267, score-0.35]
98 An important message from our results is that fixing the latent variables and updating hyperparameters according to the conditional posterior — as commonly used by GP practitioners — can work exceedingly poorly. [sent-268, score-0.874]
99 Even if site approximations are difficult and the more advanced methods presented are inapplicable, the simple whitening reparameterization should be given serious consideration when performing MCMC inference of hyperparameters. [sent-271, score-0.463]
100 Learning hyperparameters for neural network models using Hamiltonian dynamics. [sent-330, score-0.275]
wordName wordTfidf (topN-words)
[('surrogate', 0.312), ('hyperparameters', 0.275), ('slice', 0.268), ('latent', 0.263), ('reparameterization', 0.222), ('lr', 0.192), ('ph', 0.165), ('erent', 0.158), ('lengthscale', 0.15), ('proposal', 0.15), ('auxiliary', 0.131), ('posterior', 0.124), ('covariance', 0.123), ('lengthscales', 0.122), ('sampling', 0.114), ('acceptance', 0.108), ('ionosphere', 0.108), ('di', 0.105), ('hastings', 0.105), ('likelihood', 0.102), ('hyperparameter', 0.1), ('gaussian', 0.098), ('metropolis', 0.096), ('variables', 0.096), ('likelihoods', 0.094), ('christensen', 0.092), ('gp', 0.087), ('reparameterizations', 0.086), ('reparameterized', 0.086), ('ective', 0.086), ('draw', 0.082), ('elliptical', 0.079), ('whitening', 0.077), ('chain', 0.077), ('covariances', 0.077), ('updating', 0.076), ('hamiltonian', 0.074), ('redwoods', 0.073), ('site', 0.073), ('monte', 0.071), ('variates', 0.064), ('mining', 0.064), ('whitened', 0.064), ('carlo', 0.063), ('taylor', 0.06), ('samplers', 0.059), ('radford', 0.059), ('mcmc', 0.057), ('poisson', 0.057), ('noise', 0.055), ('approximations', 0.053), ('bracket', 0.053), ('chol', 0.049), ('surr', 0.049), ('prior', 0.048), ('cox', 0.048), ('return', 0.048), ('laplace', 0.048), ('implied', 0.048), ('current', 0.048), ('markov', 0.047), ('accept', 0.046), ('shrink', 0.046), ('fi', 0.046), ('sampler', 0.046), ('tuning', 0.046), ('murray', 0.045), ('else', 0.044), ('redwood', 0.043), ('inapplicable', 0.043), ('titsias', 0.043), ('alternately', 0.042), ('transition', 0.042), ('leave', 0.041), ('conditional', 0.04), ('synthetic', 0.04), ('unknowns', 0.039), ('prescott', 0.039), ('michalis', 0.039), ('chains', 0.039), ('careful', 0.039), ('bins', 0.039), ('xing', 0.039), ('bars', 0.039), ('advanced', 0.038), ('proposals', 0.037), ('update', 0.037), ('updates', 0.036), ('fixed', 0.036), ('state', 0.036), ('centering', 0.035), ('moment', 0.035), ('adams', 0.033), ('jacobian', 0.033), ('iain', 0.033), ('hmc', 0.033), ('operators', 0.033), ('generative', 0.033), ('observations', 0.032), ('uniform', 0.032)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
2 0.20562257 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
Author: Zoubin Ghahramani, Michael I. Jordan, Ryan P. Adams
Abstract: Many data are naturally modeled by an unobserved hierarchical structure. In this paper we propose a flexible nonparametric prior over unknown data hierarchies. The approach uses nested stick-breaking processes to allow for trees of unbounded width and depth, where data can live at any node and are infinitely exchangeable. One can view our model as providing infinite mixtures where the components have a dependency structure corresponding to an evolutionary diffusion down a tree. By using a stick-breaking approach, we can apply Markov chain Monte Carlo methods based on slice sampling to perform Bayesian inference and simulate from the posterior distribution on trees. We apply our method to hierarchical clustering of images and topic modeling of text data. 1
3 0.13610685 101 nips-2010-Gaussian sampling by local perturbations
Author: George Papandreou, Alan L. Yuille
Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1
4 0.12920478 118 nips-2010-Implicit Differentiation by Perturbation
Author: Justin Domke
Abstract: This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification. 1
5 0.12517157 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation. 1
6 0.10308155 103 nips-2010-Generating more realistic images using gated MRF's
7 0.10277287 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
8 0.10242649 284 nips-2010-Variational bounds for mixed-data factor analysis
9 0.086870939 54 nips-2010-Copula Processes
10 0.085390396 235 nips-2010-Self-Paced Learning for Latent Variable Models
11 0.085154399 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
12 0.084483542 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
13 0.083482482 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
14 0.083385423 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
15 0.080859125 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
16 0.079532824 100 nips-2010-Gaussian Process Preference Elicitation
17 0.077677339 218 nips-2010-Probabilistic latent variable models for distinguishing between cause and effect
18 0.076161794 115 nips-2010-Identifying Dendritic Processing
19 0.073632792 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
topicId topicWeight
[(0, 0.207), (1, 0.043), (2, -0.029), (3, 0.03), (4, -0.171), (5, 0.06), (6, 0.078), (7, 0.078), (8, -0.069), (9, -0.051), (10, -0.101), (11, -0.0), (12, 0.066), (13, -0.067), (14, -0.019), (15, 0.057), (16, 0.054), (17, 0.075), (18, 0.149), (19, 0.088), (20, -0.115), (21, 0.055), (22, -0.053), (23, -0.109), (24, -0.099), (25, 0.102), (26, -0.009), (27, 0.023), (28, 0.019), (29, 0.026), (30, 0.104), (31, -0.06), (32, 0.052), (33, -0.029), (34, -0.025), (35, 0.011), (36, -0.001), (37, -0.036), (38, -0.063), (39, -0.093), (40, 0.079), (41, 0.089), (42, -0.104), (43, -0.117), (44, -0.055), (45, 0.15), (46, -0.06), (47, 0.034), (48, -0.017), (49, -0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.96482784 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
2 0.73374671 262 nips-2010-Switched Latent Force Models for Movement Segmentation
Author: Mauricio Alvarez, Jan R. Peters, Neil D. Lawrence, Bernhard Schölkopf
Abstract: Latent force models encode the interaction between multiple related dynamical systems in the form of a kernel or covariance function. Each variable to be modeled is represented as the output of a differential equation and each differential equation is driven by a weighted sum of latent functions with uncertainty given by a Gaussian process prior. In this paper we consider employing the latent force model framework for the problem of determining robot motor primitives. To deal with discontinuities in the dynamical systems or the latent driving force we introduce an extension of the basic latent force model, that switches between different latent functions and potentially different dynamical systems. This creates a versatile representation for robot movements that can capture discrete changes and non-linearities in the dynamics. We give illustrative examples on both synthetic data and for striking movements recorded using a Barrett WAM robot as haptic input device. Our inspiration is robot motor primitives, but we expect our model to have wide application for dynamical systems including models for human motion capture data and systems biology. 1
3 0.71885335 101 nips-2010-Gaussian sampling by local perturbations
Author: George Papandreou, Alan L. Yuille
Abstract: We present a technique for exact simulation of Gaussian Markov random fields (GMRFs), which can be interpreted as locally injecting noise to each Gaussian factor independently, followed by computing the mean/mode of the perturbed GMRF. Coupled with standard iterative techniques for the solution of symmetric positive definite systems, this yields a very efficient sampling algorithm with essentially linear complexity in terms of speed and memory requirements, well suited to extremely large scale probabilistic models. Apart from synthesizing data under a Gaussian model, the proposed technique directly leads to an efficient unbiased estimator of marginal variances. Beyond Gaussian models, the proposed algorithm is also very useful for handling highly non-Gaussian continuously-valued MRFs such as those arising in statistical image modeling or in the first layer of deep belief networks describing real-valued data, where the non-quadratic potentials coupling different sites can be represented as finite or infinite mixtures of Gaussians with the help of local or distributed latent mixture assignment variables. The Bayesian treatment of such models most naturally involves a block Gibbs sampler which alternately draws samples of the conditionally independent latent mixture assignments and the conditionally multivariate Gaussian continuous vector and we show that it can directly benefit from the proposed methods. 1
4 0.70976299 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
Author: Manfred Opper, Andreas Ruttor, Guido Sanguinetti
Abstract: We present a novel approach to inference in conditionally Gaussian continuous time stochastic processes, where the latent process is a Markovian jump process. We first consider the case of jump-diffusion processes, where the drift of a linear stochastic differential equation can jump at arbitrary time points. We derive partial differential equations for exact inference and present a very efficient mean field approximation. By introducing a novel lower bound on the free energy, we then generalise our approach to Gaussian processes with arbitrary covariance, such as the non-Markovian RBF covariance. We present results on both simulated and real data, showing that the approach is very accurate in capturing latent dynamics and can be useful in a number of real data modelling tasks.
5 0.69748825 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
Author: Chang Su, Sargur Srihari
Abstract: A method for computing the rarity of latent fingerprints represented by minutiae is given. It allows determining the probability of finding a match for an evidence print in a database of n known prints. The probability of random correspondence between evidence and database is determined in three procedural steps. In the registration step the latent print is aligned by finding its core point; which is done using a procedure based on a machine learning approach based on Gaussian processes. In the evidence probability evaluation step a generative model based on Bayesian networks is used to determine the probability of the evidence; it takes into account both the dependency of each minutia on nearby minutiae and the confidence of their presence in the evidence. In the specific probability of random correspondence step the evidence probability is used to determine the probability of match among n for a given tolerance; the last evaluation is similar to the birthday correspondence probability for a specific birthday. The generative model is validated using a goodness-of-fit test evaluated with a standard database of fingerprints. The probability of random correspondence for several latent fingerprints are evaluated for varying numbers of minutiae. 1
6 0.64144295 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
7 0.59634 100 nips-2010-Gaussian Process Preference Elicitation
8 0.58683962 284 nips-2010-Variational bounds for mixed-data factor analysis
9 0.57358235 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
10 0.56325626 113 nips-2010-Heavy-Tailed Process Priors for Selective Shrinkage
11 0.55949086 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
12 0.53992623 103 nips-2010-Generating more realistic images using gated MRF's
13 0.52820897 54 nips-2010-Copula Processes
14 0.50942087 215 nips-2010-Probabilistic Deterministic Infinite Automata
15 0.50165278 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
16 0.48174632 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
17 0.47440764 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.46480703 120 nips-2010-Improvements to the Sequence Memoizer
19 0.45088115 122 nips-2010-Improving the Asymptotic Performance of Markov Chain Monte-Carlo by Inserting Vortices
20 0.44196084 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
topicId topicWeight
[(0, 0.235), (13, 0.07), (17, 0.012), (22, 0.023), (27, 0.061), (30, 0.054), (35, 0.013), (45, 0.201), (50, 0.106), (52, 0.051), (60, 0.019), (77, 0.042), (78, 0.014), (90, 0.033)]
simIndex simValue paperId paperTitle
1 0.85618097 14 nips-2010-A Reduction from Apprenticeship Learning to Classification
Author: Umar Syed, Robert E. Schapire
Abstract: We provide new theoretical results for apprenticeship learning, a variant of reinforcement learning in which the true reward function is unknown, and the goal is to perform well relative to an observed expert. We study a common approach to learning from expert demonstrations: using a classification algorithm to learn to imitate the expert’s behavior. Although this straightforward learning strategy is widely-used in practice, it has been subject to very little formal analysis. We prove that, if the learned classifier has error rate ǫ, the difference between the √ value of the apprentice’s policy and the expert’s policy is O( ǫ). Further, we prove that this difference is only O(ǫ) when the expert’s policy is close to optimal. This latter result has an important practical consequence: Not only does imitating a near-optimal expert result in a better policy, but far fewer demonstrations are required to successfully imitate such an expert. This suggests an opportunity for substantial savings whenever the expert is known to be good, but demonstrations are expensive or difficult to obtain. 1
same-paper 2 0.82632089 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
Author: Iain Murray, Ryan P. Adams
Abstract: The Gaussian process (GP) is a popular way to specify dependencies between random variables in a probabilistic model. In the Bayesian framework the covariance structure can be specified using unknown hyperparameters. Integrating over these hyperparameters considers different possible explanations for the data when making predictions. This integration is often performed using Markov chain Monte Carlo (MCMC) sampling. However, with non-Gaussian observations standard hyperparameter sampling approaches require careful tuning and may converge slowly. In this paper we present a slice sampling approach that requires little tuning while mixing well in both strong- and weak-data regimes. 1
3 0.81482428 5 nips-2010-A Dirty Model for Multi-task Learning
Author: Ali Jalali, Sujay Sanghavi, Chao Ruan, Pradeep K. Ravikumar
Abstract: We consider multi-task learning in the setting of multiple linear regression, and where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1 /ℓq norm block-regularizations with q > 1 for such blocksparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scale with the number of observations. However, these papers also caution that the performance of such block-regularized methods are very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, then block ℓ1 /ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the set of relevant features have to be exactly the same across tasks, but their values have to as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not ? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools – dirty models – to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show both theoretically and empirically, our method strictly and noticeably outperforms both ℓ1 or ℓ1 /ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method). 1
Such “simultaneous sparsity” arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely “block-sparse” – where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1 /ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such blocksparse structure. Particular examples include results using the ℓ1 /ℓ∞ norm [16, 5, 8], and the ℓ1 /ℓ2 norm [7, 10]. Dirty Models. Block-regularization is “heavy-handed” in two ways. By strictly encouraging sharedsparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1 /ℓq norms can be shown to encourage the entries in the non-sparse rows taking nearly identical values. Thus we are far away from the original goal of multitask learning: not only do the set of relevant features have to be exactly the same, but their values have to as well. Indeed recent research into such regularized methods [8, 10] caution against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that would be a worrisome caveat. We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exist, and without requiring the parameter values to be near identical? Indeed this is an instance of a more general question on whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features). As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in support and values of the parameters, so that we are always better than both the Lasso or block-sparse regularizers (at times remarkably so). The rest of the paper is organized as follows: In Sec 2. basic definitions and setup of the problem are presented. Main results of the paper is discussed in sec 3. Experimental results and simulations are demonstrated in Sec 4. Notation: For any matrix M , we denote its j th row as Mj , and its k-th column as M (k) . 
The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M ) (k) and its support by Supp(M ). Also, for any matrix M , let M 1,1 := j,k |Mj |, i.e. the sums of absolute values of the elements, and M 1,∞ := j 2 Mj ∞ where, Mj ∞ (k) := maxk |Mj |. 2 Problem Set-up and Our Method Multiple regression. We consider the following standard multiple linear regression model: ¯ y (k) = X (k) θ(k) + w(k) , k = 1, . . . , r, where y (k) ∈ Rn is the response for the k-th task, regressed on the design matrix X (k) ∈ Rn×p (possibly different across tasks), while w(k) ∈ Rn is the noise vector. We assume each w(k) is drawn independently from N (0, σ 2 ). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, ¯ we collate these quantities into matrices Y ∈ Rn×r for the responses, Θ ∈ Rp×r for the regression n×r parameters and W ∈ R for the noise. ¯ Dirty Model. In this paper we are interested in estimating the true parameter Θ from data by lever¯ aging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ would have many non-zero entries, corresponding to features shared by several tasks (“shared” rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all (“non-shared rows”), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees: Support recovery: We say an estimator Θ successfully recovers the true signed support if ¯ sign(Supp(Θ)) = sign(Supp(Θ)). We are interested in deriving sufficient conditions under which ¯ the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ, which is union of its supports for the different tasks. In particular, denoting Uk for the support of the ¯ k-th column of Θ, and U = k Uk . Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ, ¯ Θ−Θ 2.1 ∞ = max max j=1,...,p k=1,...,r (k) Θj (k) ¯ − Θj . Our Method Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding “clean” models would either just use blocksparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing to have both block-sparse and elementwise sparse component, we are ¯ able to outperform both classes of these “clean models”, for all regimes Θ. Algorithm 1 Dirty Block Sparse Solve the following convex optimization problem: (S, B) ∈ arg min S,B 1 2n r k=1 y (k) − X (k) S (k) + B (k) 2 2 + λs S 1,1 + λb B 1,∞ . (1) Then output Θ = B + S. 3 Main Results and Their Consequences We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1 /ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. 
In Theorem 1, we consider the case of deterministic design matrices X (k) , and provide sufficient conditions guaranteeing signed support recovery, and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the 3 rows of the design matrices are random from a general zero mean Gaussian distribution: this allows us to provide scaling on the number of observations required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error. Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1 /ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two-tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α. Comparisons. Negahban and Wainwright [8] show that there is actually a “phase transition” in the scaling of the probability of successful signed support-recovery with the number of observations. n Denote a particular rescaling of the sample-size θLasso (n, p, α) = s log(p−s) . Then as Wainwright [18] show, when the rescaled number of samples scales as θLasso > 2 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θLasso < 2−δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1 /ℓ∞ -reguralized multiple linear regression, define a similar rescaled sample size n θ1,∞ (n, p, α) = s log(p−(2−α)s) . Then as Negahban and Wainwright [8] show there is again a transition in probability of success from near zero to near one, at the rescaled sample size of θ1,∞ = (4 − 3α). Thus, for α < 2/3 (“less sharing”) Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 (“more sharing”) the ℓ1 /ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ1,∞ = (2 − α), which is strictly before either the Lasso or the ℓ1 /ℓ∞ regularized method except for the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and for α = 1, i.e. full sharing, where we match ℓ1 /ℓ∞ . Everywhere else, we strictly outperform both methods. Figure 3 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details in the experiments Section 4). 3.1 Sufficient Conditions for Deterministic Designs We first consider the case where the design matrices X (k) for k = 1, · · ·, r are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X (k) ’s case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10]. (k) A0 Column Normalization Xj 2 ≤ √ 2n for all j = 1, . . . , p, k = 1, . . . , r. ¯ Let Uk denote the support of the k-th column of Θ, and U = supports for each task. 
Then we require that k r A1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Xj , XUk (k) (k) XUk , XUk Uk denote the union of −1 c We will also find it useful to define γs := 1−max1≤k≤r maxj∈Uk (k) > 0. 1 k=1 (k) Xj , XUk Note that by the incoherence condition A1, we have γs > 0. A2 Eigenvalue Condition Cmin := min λmin 1≤k≤r A3 Boundedness Condition Dmax := max 1≤k≤r 1 (k) (k) XUk , XUk n 1 (k) (k) XUk , XUk n (k) (k) XUk , XUk −1 . 1 > 0. −1 ∞,1 < ∞. Further, we require the regularization penalties be set as λs > 2(2 − γs )σ log(pr) √ γs n and 4 λb > 2(2 − γb )σ log(pr) √ . γb n (2) 1 0.9 0.8 0.8 Dirty Model L1/Linf Reguralizer Probability of Success Probability of Success 1 0.9 0.7 0.6 0.5 0.4 LASSO 0.3 0.2 0 0.5 1 1.5 1.7 2 2.5 Control Parameter θ 3 3.1 3.5 0.6 0.5 0.4 L1/Linf Reguralizer 0.3 LASSO 0.2 p=128 p=256 p=512 0.1 Dirty Model 0.7 p=128 p=256 p=512 0.1 0 0.5 4 1 1.333 (a) α = 0.3 1.5 2 Control Parameter θ (b) α = 2.5 3 2 3 1 0.9 Dirty Model Probability of Success 0.8 0.7 L1/Linf Reguralizer 0.6 0.5 LASSO 0.4 0.3 0.2 p=128 p=256 p=512 0.1 0 0.5 1 1.2 1.5 1.6 2 Control Parameter θ 2.5 (c) α = 0.8 Figure 1: Probability of success in recovering the true signed support using dirty model, Lasso and ℓ1 /ℓ∞ regularizer. For a 2-task problem, the probability of success for different values of feature-overlap fraction α is plotted. As we can see in the regimes that Lasso is better than, as good as and worse than ℓ1 /ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires less number of observations for successful recovery of the true signed support compared to Lasso and ℓ1 /ℓ∞ regularizer. Here p s = ⌊ 10 ⌋ always. Theorem 1. Suppose A0-A3 hold, and that we obtain estimate Θ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c1 exp(−c2 n) → 1, we are guaranteed that the convex program (1) has a unique optimum and (a) The estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ 4σ 2 log (pr) + λs Dmax . n Cmin ≤ bmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that min ¯ (j,k)∈Supp(Θ) ¯(k) θj > bmin . Here the positive constants c1 , c2 depend only on γs , γb , λs , λb and σ, but are otherwise independent of n, p, r, the problem dimensions of interest. Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If in addition, we require that it have no false exclusions and that recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise. 3.2 General Gaussian Designs Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task (k) k = 1, . . . , r the design matrix X (k) ∈ Rn×p is such that each row Xi ∈ Rp is a zero-mean Gaussian random vector with covariance matrix Σ(k) ∈ Rp×p , and is independent of every other (k) row. Let ΣV,U ∈ R|V|×|U | be the submatrix of Σ(k) with rows corresponding to V and columns to U . We require these covariance matrices to satisfy the following conditions: r C1 Incoherence Condition γb := 1 − max c j∈U (k) (k) Σj,Uk , ΣUk ,Uk k=1 5 −1 >0 1 C2 Eigenvalue Condition Cmin := min λmin Σ(k),Uk Uk > 0 so that the minimum eigenvalue 1≤k≤r is bounded away from zero. C3 Boundedness Condition Dmax := (k) ΣUk ,Uk −1 ∞,1 < ∞. 
These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrix of the (randomly generated) rows of the design matrix. Further, defining s := maxk |Uk |, we require the regularization penalties be set as 1/2 λs > 1/2 4σ 2 Cmin log(pr) √ γs nCmin − 2s log(pr) and λb > 4σ 2 Cmin r(r log(2) + log(p)) . √ γb nCmin − 2sr(r log(2) + log(p)) (3) Theorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scale as n > max 2s log(pr) 2sr r log(2)+log(p) 2 2 Cmin γs , Cmin γb . Suppose we obtain estimate Θ from algorithm (3). Then, with probability at least 1 − c1 exp (−c2 (r log(2) + log(p))) − c3 exp(−c4 log(rs)) → 1 for some positive numbers c1 − c4 , we are guaranteed that the algorithm estimate Θ is unique and satisfies the following conditions: (a) the estimate Θ has no false inclusions, and has bounded ℓ∞ norm error so that ¯ Supp(Θ) ⊆ Supp(Θ), and ¯ Θ−Θ ∞,∞ ≤ 50σ 2 log(rs) + λs nCmin 4s √ + Dmax . Cmin n gmin ¯ (b) sign(Supp(Θ)) = sign Supp(Θ) provided that 3.3 min ¯ (j,k)∈Supp(Θ) ¯(k) θj > gmin . Sharp Transition for 2-Task Gaussian Designs This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ), so that C1 − C3 hold, with Cmin = Dmax = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1 /ℓ∞ -block-regularization over for all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso) and full support sharing (where it matches that of ℓ1 /ℓ∞ ). We now present our analytical results; the empirical comparisons are presented next in Section 4. The results will be in terms of a particular rescaling of the sample size n as θ(n, p, s, α) := n . (2 − α)s log (p − (2 − α)s) We will also require the assumptions that 4σ 2 (1 − F1 λs > F2 λb > s/n)(log(r) + log(p − (2 − α)s)) 1/2 (n)1/2 − (s)1/2 − ((2 − α) s (log(r) + log(p − (2 − α)s)))1/2 4σ 2 (1 − s/n)r(r log(2) + log(p − (2 − α)s)) , 1/2 (n)1/2 − (s)1/2 − ((1 − α/2) sr (r log(2) + log(p − (2 − α)s)))1/2 . Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N (0, In×n ). 6 Suppose maxj∈B∗ ∗(1) Θj − ∗(2) Θj = o(λs ), where B ∗ is the submatrix of Θ∗ with rows where both entries are non-zero. Then the estimate Θ of the problem (1) satisfies the following: (Success) Suppose the regularization coefficients satisfy F1 − F2. Further, assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c1 exp(−c2 n) for some positive numbers c1 and c2 , we are guaranteed that Θ satisfies the support-recovery and ℓ∞ error bound conditions (a-b) in Theorem 2. ˆ ˆ (Failure) If θ(n, p, s, α) < 1 there is no solution (B, S) for any choices of λs and λb such that ¯ sign Supp(Θ) = sign Supp(Θ) . We note that we require the gap ∗(1) Θj ∗(2) − Θj to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak. 4 Empirical Results In this section, we investigate the performance of our dirty block sparse estimator on synthetic and real-world data. 
4 Empirical Results

In this section, we investigate the performance of our dirty block-sparse estimator on synthetic and real-world data. The synthetic experiments explore the accuracy of Theorem 3 and compare our estimator with LASSO and the ℓ1/ℓ∞ regularizer; we see that Theorem 3 is very accurate indeed. Next, we apply our method to real-world data: a hand-written digit classification dataset, and a multi-task regression dataset with r = 2 tasks; again we compare against LASSO and the ℓ1/ℓ∞ regularizer. On both real-world datasets, we show that the dirty model outperforms both LASSO and ℓ1/ℓ∞ in practice. For each method, the parameters are chosen via cross-validation; see the supplemental material for more details.

4.1 Synthetic Data Simulation

We consider an r = 2-task regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). The design matrices X have i.i.d. Gaussian entries with mean 0 and variance 1. For each fixed setting of (n, s, p, α), we generate 100 instances of the problem. In each instance, given p, s, α, the locations of the non-zero entries of the true Θ̄ are chosen at random; each non-zero entry is then drawn i.i.d. Gaussian with mean 0 and variance 1, and n observations are generated from this model. We then attempt to estimate Θ̄ using three methods: our dirty model, the ℓ1/ℓ∞ regularizer and LASSO. In each case, and for each instance, the penalty coefficients are found by cross-validation. After solving the three problems, we compare the signed support of each solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe this process in more detail below.

Performance Analysis: We ran the algorithm for overlap ratios α ∈ {0.3, 2/3, 0.8} and three different numbers of features p ∈ {128, 256, 512}. For any instance of the problem (n, p, s, α), if the recovered matrix Θ̂ has the same signed support as the true Θ̄, we count it as a success, and otherwise as a failure (even if a single element has a different sign, we count it as a failure). As Theorem 3 predicts and Fig. 2 shows, the right scaling for the number of observations is n / (s log(p − (2 − α)s)), where the curves for all p stack on top of each other at 2 − α. Also, the number of observations required by the dirty model for true signed support recovery is always less than for both LASSO and the ℓ1/ℓ∞ regularizer. Fig. 1(a) shows the probability of success for the case α = 0.3 (when LASSO is better than the ℓ1/ℓ∞ regularizer): the dirty model outperforms both methods. When α = 2/3 (see Fig. 1(b)), LASSO and the ℓ1/ℓ∞ regularizer perform the same, but the dirty model requires almost 33% fewer observations for the same performance. As α grows toward 1, e.g. α = 0.8 as shown in Fig. 1(c), ℓ1/ℓ∞ performs better than LASSO; still, the dirty model performs better than both methods in this case as well.

Figure 2: Verification of the result of Theorem 3 on the behavior of the phase transition threshold as the parameter α is varied in a 2-task (n, p, s, α) problem, for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer, with p ∈ {128, 256, 512}. The y-axis is n / (s log(p − (2 − α)s)), where n is the number of samples at which the threshold was observed, and the x-axis is the shared-support parameter α. Here s = ⌊p/10⌋. Our dirty model shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated.
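To make the simulation protocol above concrete, here is a minimal sketch (ours, with hypothetical names) of the data generation and the signed-support success check. The dirty-model solver itself, i.e. the convex program (1), is not reproduced here: `estimator` stands for any routine — our method, LASSO, or the ℓ1/ℓ∞ regularizer — that maps the data (X, y) to a p × 2 coefficient estimate.

```python
import numpy as np

def make_two_task_instance(n, p, s, alpha, sigma=1.0, seed=None):
    """One synthetic (n, p, s, alpha) instance as in Section 4.1: standard Gaussian
    designs, s non-zeros per task, a fraction alpha of support positions shared."""
    rng = np.random.default_rng(seed)
    n_shared = int(round(alpha * s))
    perm = rng.permutation(p)
    shared, own1, own2 = perm[:n_shared], perm[n_shared:s], perm[s:2 * s - n_shared]
    supports = [np.concatenate([shared, own1]), np.concatenate([shared, own2])]
    Theta = np.zeros((p, 2))
    for k in range(2):
        Theta[supports[k], k] = rng.standard_normal(s)           # i.i.d. N(0,1) entries
    X = [rng.standard_normal((n, p)) for _ in range(2)]
    y = [X[k] @ Theta[:, k] + sigma * rng.standard_normal(n) for k in range(2)]
    return X, y, Theta

def signed_support_recovered(Theta_hat, Theta_true, tol=1e-8):
    """Exact signed-support recovery: every (feature, task) sign must match.
    Tiny estimated entries are truncated to zero before taking signs."""
    Theta_hat = np.where(np.abs(Theta_hat) > tol, Theta_hat, 0.0)
    return bool(np.array_equal(np.sign(Theta_hat), np.sign(Theta_true)))

def success_probability(estimator, n, p, s, alpha, trials=100):
    """Fraction of random instances on which `estimator` (any routine mapping
    (X, y) to a p x 2 coefficient matrix) recovers the true signed support."""
    hits = 0
    for t in range(trials):
        X, y, Theta = make_two_task_instance(n, p, s, alpha, seed=t)
        hits += signed_support_recovered(estimator(X, y), Theta)
    return hits / trials
```

The empirical success probability computed this way, plotted against the control parameter θ(n, p, s, α), is what produces curves like those in Fig. 1.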
Table 1: Handwriting classification results for our model, the ℓ1/ℓ∞ regularizer and LASSO.

n = 10:
  Average classification error:  Our Model 8.6%   |  ℓ1/ℓ∞ 9.9%   |  LASSO 10.8%
  Variance of error:             Our Model 0.53%  |  ℓ1/ℓ∞ 0.64%  |  LASSO 0.51%
  Average row support size:      Our Model B: 165, B+S: 171   |  ℓ1/ℓ∞ 170   |  LASSO 123
  Average support size:          Our Model S: 18,  B+S: 1651  |  ℓ1/ℓ∞ 1700  |  LASSO 539
n = 20:
  Average classification error:  Our Model 3.0%   |  ℓ1/ℓ∞ 3.5%   |  LASSO 4.1%
  Variance of error:             Our Model 0.56%  |  ℓ1/ℓ∞ 0.62%  |  LASSO 0.68%
  Average row support size:      Our Model B: 211, B+S: 226   |  ℓ1/ℓ∞ 217   |  LASSO 173
  Average support size:          Our Model S: 34,  B+S: 2118  |  ℓ1/ℓ∞ 2165  |  LASSO 821
n = 40:
  Average classification error:  Our Model 2.2%   |  ℓ1/ℓ∞ 3.2%   |  LASSO 2.8%
  Variance of error:             Our Model 0.57%  |  ℓ1/ℓ∞ 0.68%  |  LASSO 0.85%
  Average row support size:      Our Model B: 270, B+S: 299   |  ℓ1/ℓ∞ 368   |  LASSO 354
  Average support size:          Our Model S: 67,  B+S: 2761  |  ℓ1/ℓ∞ 3669  |  LASSO 2053

Scaling Verification: To verify that the phase transition threshold changes linearly with α, as predicted by Theorem 3, we plot the phase transition threshold versus α. For five different values of α ∈ {0.05, 0.3, 2/3, 0.8, 0.95} and three different values of p ∈ {128, 256, 512}, we find the phase transition threshold for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. We take the point where the probability of success in recovery of the signed support exceeds 50% as the phase transition threshold, and find this point by interpolation on the closest two points. Fig. 2 shows that the phase transition threshold for the dirty model is always lower than the phase transition threshold for LASSO and the ℓ1/ℓ∞ regularizer.

4.2 Handwritten Digits Dataset

We use the handwritten digit dataset [1], containing features of handwritten numerals (0–9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable benchmark for handwritten recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes n of the training set. We measure the classification error for each digit to get a 10-vector of errors, and then report the average and the variance of this error vector to show how the error is distributed over the tasks. We compare our method with the ℓ1/ℓ∞ regularizer and LASSO; in all methods, parameters are chosen via cross-validation. For our method we separate out the B and S matrices that it finds, so as to illustrate how many features it identifies as "shared" and how many as "non-shared". For the other methods we report the plain row-support and support numbers, since they do not make such a separation.
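As a small illustration of how the per-method numbers in Table 1 can be assembled, the sketch below (ours; the fitted B̂, Ŝ and the per-digit error vector are assumed to come from whatever solver and classifier one uses) computes the error summary and the B/S support counts:

```python
import numpy as np

def table1_summary(per_digit_errors, B_hat, S_hat, tol=1e-8):
    """Summary in the style of Table 1 from a fitted dirty model.

    per_digit_errors : length-r vector of test errors (r = 10 digits)
    B_hat, S_hat     : p x r estimated block-sparse and element-sparse parts
    """
    Theta_hat = B_hat + S_hat
    def row_support(M):   # features used by at least one task
        return int((np.abs(M).max(axis=1) > tol).sum())
    def support(M):       # non-zero (feature, task) pairs
        return int((np.abs(M) > tol).sum())
    return {
        "avg_error": float(np.mean(per_digit_errors)),
        "var_error": float(np.var(per_digit_errors)),
        "row_support": {"B": row_support(B_hat), "B+S": row_support(Theta_hat)},
        "support": {"S": support(S_hat), "B+S": support(Theta_hat)},
    }
```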
Acknowledgements

We acknowledge support from NSF grant IIS-101842, and the NSF CAREER program, Grant 0954059.

References

[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567–1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, special issue on "Sparse approximations in signal and image processing", 86:572–602, 2006.
[16] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349–363, 2005.
[17] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381–386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
4 0.80940521 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
Author: Yariv Maron, Elie Bienenstock, Michael James
Abstract: Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high embedding dimensionality. Using k-means clustering of the embedded data, our approach efficiently produces state-of-the-art results. We analyze the reasons why the sphere constraint is beneficial in this application, and conjecture that these reasons might apply quite generally to other large-scale tasks. 1 Introduction The embedding of objects in a low-dimensional Euclidean space is a form of dimensionality reduction that has been used in the past mostly to create 2D representations of data for the purpose of visualization and exploratory data analysis [10, 13]. Most methods work on objects of a single type, endowed with a measure of similarity. Other methods, such as [3], embed objects of heterogeneous types, based on their co-occurrence statistics. In this paper we demonstrate that the latter can be successfully applied to unsupervised part-of-speech (POS) induction, an extensively studied, challenging problem in natural language processing [1, 4, 5, 6, 7]. The problem we address is distributional POS tagging, in which words are to be tagged based on the statistics of their immediate left and right context in a corpus (ignoring morphology and other features). The induction task is fully unsupervised, i.e., it uses no annotations. This task has been addressed in the past using a variety of methods. Some approaches, such as [1], combine a Markovian assumption with clustering. Many recent works use HMMs, perhaps due to their excellent performance on the supervised version of the task [7, 2, 5]. Using a latent-descriptor clustering approach, [15] obtain the best results to date for distributional-only unsupervised POS tagging of the widely-used WSJ corpus. Using a heterogeneous-data embedding approach for this task, we define separate embedding functions for the objects
5 0.72152746 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
6 0.72037488 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
7 0.71537298 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
8 0.71170563 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
9 0.71162063 117 nips-2010-Identifying graph-structured activation patterns in networks
10 0.71030897 49 nips-2010-Computing Marginal Distributions over Continuous Markov Networks for Statistical Relational Learning
11 0.70788091 148 nips-2010-Learning Networks of Stochastic Differential Equations
12 0.70757496 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.70727634 217 nips-2010-Probabilistic Multi-Task Feature Selection
14 0.70719749 158 nips-2010-Learning via Gaussian Herding
15 0.70602727 147 nips-2010-Learning Multiple Tasks with a Sparse Matrix-Normal Penalty
16 0.70444816 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
17 0.70428246 63 nips-2010-Distributed Dual Averaging In Networks
18 0.70365399 265 nips-2010-The LASSO risk: asymptotic results and real world examples
19 0.70298576 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs
20 0.70273083 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA