nips nips2013 nips2013-105 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in discrete cases and competitive results in the continuous case. 1
Reference: text
sentIndex sentText sentNum sentScore
1 The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. [sent-4, score-0.694]
2 The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. [sent-5, score-0.175]
3 They construct a low-rank approximation to the GP covariance matrix over the full dataset using a small set of inducing points. [sent-9, score-0.566]
4 Some approaches select inducing points from training points [7, 8, 12, 13]. [sent-10, score-0.827]
5 But these methods select the inducing points using ad hoc criteria; i.e., [sent-11, score-0.664]
6 they use different objective functions to select inducing points and to optimize GP hyperparameters. [sent-13, score-0.734]
7 More powerful sparsification methods [14, 15, 16] use a single objective function and allow inducing points to move freely over the input domain, where they are learned via gradient descent. [sent-14, score-0.695]
8 The method optimizes a single objective for joint selection of inducing points and GP hyperparameters. [sent-18, score-0.776]
9 Notably, it optimizes either the marginal likelihood, or a variational free energy [15], exploiting the QR factorization of a partial Cholesky decomposition to efficiently approximate the covariance matrix. [sent-19, score-0.28]
10 Because it chooses inducing points from the training data, it is applicable to problems on discrete or continuous input domains. [sent-20, score-0.778]
11 To our knowledge, it is the first method for selecting discrete inducing points and hyperparameters that optimizes a single objective, with linear space and time complexity. [sent-21, score-0.783]
12 In the discrete case, combinatorial optimization is required to select the inducing points, and this is, in general, intractable. [sent-28, score-0.632]
13 Existing discrete sparsification methods therefore use other criteria to greedily select inducing points [7, 8, 12, 13]. [sent-29, score-0.756]
14 (e.g., [8, 12] take an information-theoretic perspective); these methods are greedy and do not use the same objective to select inducing points and to estimate GP hyperparameters. [sent-32, score-0.761]
15 The variational formulation of Titsias [15] treats inducing points as variational parameters, and gives a unified objective for discrete and continuous inducing point models. [sent-33, score-1.438]
16 In the continuous case, it uses gradient-based optimization to find inducing points and hyperparameters. [sent-34, score-0.697]
17 In the discrete case, our method optimizes the same variational objective of Titsias [15], but is a significant improvement over greedy forward selection using the variational objective, or some other criterion, as the selection criterion. [sent-35, score-0.494]
18 In particular, given the cost of evaluating the variational objective on all training points, Titsias [15] evaluates the objective function on a small random subset of candidates at each iteration, and then selects the best element from the subset. [sent-36, score-0.384]
19 The approach in [15] also uses greedy forward selection, which provides no way to refine the inducing set after hyperparameter optimization, except to discard all previous inducing points and restart selection. [sent-39, score-1.226]
20 By comparison, our formulation considers all candidates at each step; revisiting previous selections is efficient and guaranteed to decrease the objective or terminate. [sent-41, score-0.132]
21 Second, the CSI algorithm selects inducing points in a single greedy pass using an approximate objective. [sent-45, score-0.673]
22 We propose an iterative optimization algorithm that swaps previously selected points with new candidates that are guaranteed to lower the objective. [sent-46, score-0.199]
23 Finally, we perform inducing set selection jointly with gradient-based hyperparameter estimation instead of the grid search in CSI. [sent-47, score-0.595]
24 Our algorithm selects inducing points in a principled fashion, optimizing the variational free energy or the log likelihood. [sent-48, score-0.765]
25 The hyperparameters of a GP, denoted θ, comprise the parameters of the kernel function, and the noise variance σ². [sent-55, score-0.121]
26 The natural objective for learning θ is the negative marginal log likelihood (NMLL) of the training data, − log P(y|X, θ), given up to a constant by Efull(θ) = ( yᵀ (K + σ² I_n)⁻¹ y + log |K + σ² I_n| ) / 2 . [sent-56, score-0.211]
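For concreteness, here is a minimal NumPy sketch of evaluating this objective; the squared-exponential kernel and all function and variable names are illustrative assumptions rather than the paper's implementation, and the O(n³) Cholesky of the full kernel matrix is exactly the cost that the sparse approximations below avoid.

```python
import numpy as np

def se_kernel(X, Y, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel; an illustrative choice of covariance function.
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

def nmll_full(X, y, noise_var, **kernel_params):
    # Efull(theta) = ( y^T (K + sigma^2 I_n)^{-1} y + log|K + sigma^2 I_n| ) / 2, up to a constant.
    n = X.shape[0]
    C = se_kernel(X, X, **kernel_params) + noise_var * np.eye(n)
    L = np.linalg.cholesky(C)                            # C = L L^T, O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * (y @ alpha + log_det)
```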
27 [12] proposed the Projected Process (PP) model, wherein a set of m inducing points are used to construct a low-rank approximation of the kernel matrix. [sent-59, score-0.709]
28 In the discrete case, where the inducing points are a subset of the training data, with indices I ⊂ {1, 2, . [sent-60, score-0.731]
29 The objective in (4) can be viewed as an approximate log likelihood for the full GP model, or as the exact log likelihood for an approximate model, called the Deterministically Trained Conditional [10]. [sent-70, score-0.223]
30 The same PP model can also be obtained by a variational argument, as in [15], for which the variational free energy objective can be shown to be Eq. [sent-71, score-0.252]
31 3 Efficient optimization: We now outline our algorithm for optimizing the variational free energy (5) to select the inducing set I and the hyperparameters θ. [sent-77, score-0.759]
32 The algorithm is a form of hybrid coordinate descent that alternates between discrete optimization of inducing points, and continuous optimization of the hyperparameters. [sent-79, score-0.686]
33 We first describe the algorithm to select inducing points, and then discuss continuous hyperparameter optimization and termination criteria in Sec. [sent-80, score-0.721]
34 Finding the optimal inducing set is a combinatorial problem; global optimization is intractable. [sent-83, score-0.538]
35 Instead, the inducing set is initialized to a random subset of the training data, which is then refined by a fixed number of swap updates at each iteration. [sent-84, score-0.648]
36 In a single swap update, a randomly chosen inducing point is considered for replacement. [sent-85, score-0.597]
37 With the techniques described below, the computation time required to approximately evaluate all possible candidates and swap an inducing point is O(mn). [sent-88, score-0.659]
38 Swapping all inducing points once takes O(m²n) time. [sent-89, score-0.625]
39 1 Factored representation: To support efficient evaluation of the objective and swapping, we use a factored representation of the kernel matrix. [sent-91, score-0.15]
40 Given an inducing set I of k points, for any k ≤ m, the low-rank Nyström approximation to the kernel matrix (Eq. [sent-92, score-0.57]
41 Eq. 6 follows from Proposition 1 in [1], and the fact that, given the ordered sequence of pivots I, the partial Cholesky factorization is unique. [sent-97, score-0.325]
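As an illustration of this factored representation, the sketch below builds the partial Cholesky factor for a given ordered pivot (inducing) set and checks that L Lᵀ reproduces the Nyström approximation; the function name, the random test matrix, and the pivot choice are assumptions for illustration only.

```python
import numpy as np

def partial_cholesky(K, pivots):
    # Columns of the partial Cholesky factor for an ordered pivot (inducing) set.
    # The resulting low-rank approximation L @ L.T equals the Nystrom form
    # K[:, I] @ inv(K[I, I]) @ K[I, :]  (illustrative sketch).
    n = K.shape[0]
    L = np.zeros((n, len(pivots)))
    for j, p in enumerate(pivots):
        # Residual of column p after subtracting the contribution of earlier pivots.
        col = K[:, p] - L[:, :j] @ L[p, :j]
        L[:, j] = col / np.sqrt(col[p])
    return L

# Example: compare against the Nystrom form on a random PSD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
K = A @ A.T + 1e-6 * np.eye(50)
I = [3, 17, 41]
L = partial_cholesky(K, I)
K_nystrom = K[:, I] @ np.linalg.solve(K[np.ix_(I, I)], K[I, :])
assert np.allclose(L @ L.T, K_nystrom)
```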
42 2 Factorization update: Here we present the mechanics of the swap update algorithm; see [3] for pseudo-code. [sent-102, score-0.147]
43 Suppose we wish to swap inducing point i with candidate point j in Im , the inducing set of size m. [sent-103, score-1.153]
44 Downdating to remove inducing point i requires that we shift the corresponding columns/rows in the factorization to the right-most columns of L, Q, R and to the last row of R. [sent-108, score-0.589]
45 When permuting the order of the inducing points, the underlying GP model is invariant, but the matrices in the factored representation are not. [sent-110, score-0.536]
46 After downdating, we have factors Lm−1, Qm−1, Rm−1, and inducing set Im−1. [sent-114, score-0.513]
47 The final updates are Qm = [Qm−1 qm], where qm is given by Gram-Schmidt orthogonalization, qm = ((I − Qm−1 Qm−1ᵀ) l̃m) / ‖(I − Qm−1 Qm−1ᵀ) l̃m‖, and Rm is updated from Rm−1 so that Lm = Qm Rm. [sent-123, score-0.651]
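A minimal sketch of this Gram-Schmidt step, appending one new column to a thin QR factorization, is given below (dense NumPy, with assumed names; in practice a re-orthogonalization pass may be needed for numerical stability).

```python
import numpy as np

def qr_append_column(Q, R, new_col):
    # Given L_{m-1} = Q R and a new column l_tilde, return Q_m, R_m such that
    # [L_{m-1}, l_tilde] = Q_m R_m, via one classical Gram-Schmidt step.
    r = Q.T @ new_col                  # projection coefficients onto span(Q)
    residual = new_col - Q @ r         # (I - Q Q^T) l_tilde
    rho = np.linalg.norm(residual)
    q = residual / rho                 # the new orthonormal column q_m
    Q_new = np.hstack([Q, q[:, None]])
    R_new = np.block([[R, r[:, None]],
                      [np.zeros((1, R.shape[1])), np.array([[rho]])]])
    return Q_new, R_new
```

The downdate described in the text works in the opposite direction: the column to be removed is first permuted to the right-most position so that truncating the factors is cheap.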
48 3 Evaluating candidates: Next we show how to select candidates for inclusion in the inducing set. [sent-125, score-0.676]
49 Later we will provide an approximation to this objective change that can be computed efficiently. [sent-127, score-0.126]
50 Given an inducing set Im−1 , and matrices Lm−1 , Qm−1 , and Rm−1 , we wish to evaluate the change in Eq. [sent-128, score-0.542]
51 Thus, instead of evaluating the change in the objective exactly, we use an efficient approximation based on a small number, z, of training points which provide information about the residual between the current low-rank covariance matrix (based on inducing points) and the full covariance matrix. [sent-136, score-0.923]
52 Now consider the full Cholesky decomposition of K = L∗L∗ᵀ, where L∗ = [Lm−1, L(Jm−1)] is constructed with Im−1 as the first pivots and Jm−1 = {1, . [sent-142, score-0.249]
53 We approximate L(Jm−1) by a rank z ≪ n matrix, Lz, by taking z points from Jm−1 and performing a partial Cholesky factorization of K − K̂m−1 using these pivots. [sent-146, score-0.209]
54 The pivots used to construct Lz are called information pivots; their selection is discussed in Sec. [sent-148, score-0.291]
55 Computing Lz costs O(z²n), but can be avoided since the information pivots change by at most one when an information pivot is added to the inducing set and needs to be replaced. [sent-155, score-1.04]
56 These z information pivots are equivalent to the “look-ahead” steps of Bach and Jordan’s CSI algorithm, but as described in Sec. [sent-159, score-0.249]
57 2 Ensuring a good approximation: Selection of the information pivots determines the approximate objective, and hence the candidate proposal. [sent-165, score-0.34]
58 To ensure a good approximation, the CSI algorithm [1] greedily selects points to find an approximation of the residual K − K̂m−1 in Eq. [sent-166, score-0.18]
59 By analyzing the role of the residual matrix, we see that the information pivots provide a low-rank approximation to the orthogonal complement of the space spanned by the current inducing set. [sent-170, score-0.83]
60 Although the information pivots are changed when one is moved into the inducing set, we find empirically that this is insufficient. [sent-173, score-0.762]
61 Instead, at regular intervals we replace the entire set of information pivots by random selection. [sent-174, score-0.249]
62 We find this works better than optimizing the information pivots as in [1]. [sent-175, score-0.249]
63 Figure 1 compares the exact and approximate cost reduction for candidate inducing points (left), and their respective rankings (right). [sent-176, score-0.774]
64 It is also robust to changes in the number of information pivots and the frequency of updates. [sent-178, score-0.249]
65 When bad candidates are proposed, they are rejected after evaluating the change in the true objective. [sent-179, score-0.243]
66 Rejection rates also increase for sparser models, where each inducing point plays a more critical role and is harder to replace. [sent-183, score-0.513]
67 Hybrid optimization: The overall hybrid optimization procedure performs block coordinate descent in the inducing points and the continuous hyperparameters. [sent-199, score-0.743]
68 It alternates between discrete and continuous phases until improvement in the objective is below a threshold or the computational time budget is exhausted. [sent-200, score-0.172]
69 In the discrete phase, inducing points are considered for swapping with the hyper-parameters fixed. [sent-201, score-0.737]
70 With the factorization and efficient candidate evaluation above, swapping an inducing point i ∈ Im proceeds as follows: (I) down-date the factorization matrices as in Sec. [sent-202, score-0.765]
71 2 to remove i; (II) compute the true objective function value Fm−1 over the down-dated model with Im \{i}, using (11), (12) and (9); (III) select a replacement candidate using the fast approximate cost change from Sec. [sent-204, score-0.202]
72 1; (IV) evaluate the exact objective change, using (14), (15), and (16); (V) add the exact change to the true objective Fm−1 to get the objective value with the new candidate. [sent-207, score-0.277]
75 Figure 2: Test performance on discrete datasets. [sent-230, score-0.68]
76 Otherwise it is rejected and we revert to the factorization with i; (VI) if needed, update the information pivots as in Secs. [sent-235, score-0.367]
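For intuition, the sketch below implements the accept-if-better swap logic with a brute-force evaluation of a Titsias-style variational free energy (written in its standard form up to additive constants, which we assume corresponds to Eq. (5)); it deliberately omits the factored O(mn) candidate approximation of steps (I)-(VI), so it is only a reference implementation of the acceptance rule, with all names chosen for illustration.

```python
import numpy as np

def vfe(K, y, pivots, noise_var):
    # Variational free energy (up to additive constants), computed naively in
    # O(n^3) as a reference; the paper's factored updates achieve O(mn).
    n = len(y)
    Kii = K[np.ix_(pivots, pivots)]
    Kni = K[:, pivots]
    Q = Kni @ np.linalg.solve(Kii, Kni.T)          # Nystrom approximation
    C = Q + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    fit = y @ np.linalg.solve(C, y)
    trace_term = np.trace(K - Q) / noise_var
    return 0.5 * (fit + logdet + trace_term)

def swap_pass(K, y, pivots, candidates, noise_var):
    # One pass of accept-if-better swaps over the inducing set (brute force;
    # the fast candidate ranking of Sec. 3 is omitted here).
    pivots = list(pivots)
    best = vfe(K, y, pivots, noise_var)
    for idx in range(len(pivots)):
        for j in candidates:
            if j in pivots:
                continue
            trial = pivots[:idx] + [j] + pivots[idx + 1:]
            f = vfe(K, y, trial, noise_var)
            if f < best:                            # objective never increases
                pivots, best = trial, f
    return pivots, best
```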
77 After each discrete optimization step we fix the inducing set I and optimize the hyperparameters using non-linear conjugate gradients (CG). [sent-241, score-0.657]
78 In practice, because we alternate each phase for many training epochs, attempting to swap every inducing point in each epoch is unnecessary, just as there is no need to run hyperparameter optimization until convergence. [sent-243, score-0.748]
79 As long as all inducing set points are eventually considered we find that optimized models can achieve similar performance with shorter learning times. [sent-244, score-0.625]
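The outer loop can then be organized as in the skeleton below; `swap_phase` and `objective` are hypothetical callables standing in for the discrete swap pass and the free-energy evaluation at a fixed inducing set, and the CG iteration cap and stopping threshold mirror the budgeted alternation described above (a sketch, not the paper's code).

```python
import numpy as np
from scipy.optimize import minimize

def hybrid_optimize(theta0, swap_phase, objective, n_epochs=50, tol=1e-4, cg_maxiter=20):
    # Block coordinate descent: alternate a discrete swap phase over the inducing
    # set (at fixed theta) with a bounded nonlinear-CG step on the hyperparameters
    # (at fixed inducing set), stopping on small improvement or an epoch budget.
    theta = np.asarray(theta0, dtype=float)
    prev = np.inf
    for _ in range(n_epochs):
        swap_phase(theta)                                  # discrete phase: update the inducing set
        res = minimize(objective, theta, method='CG',      # continuous phase
                       options={'maxiter': cg_maxiter})
        theta, current = res.x, res.fun
        if prev - current < tol:                           # negligible improvement: stop
            break
        prev = current
    return theta
```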
80 4 Experiments and analysis: For the experiments that follow we jointly learn inducing points and hyperparameters, a more challenging task than learning inducing points with known hyperparameters [12, 14]. [sent-245, score-1.314]
81 For all but the 1D example, the number of inducing points swapped per epoch is min(60, m). [sent-246, score-0.66]
82 The maximum number of function evaluations per epoch in CG hyperparameter optimization is min(20, max(15, 2d)), where d is the number of continuous hyperparameters. [sent-247, score-0.147]
83 For learning continuous hyperparameters, each method optimizes the same objective using non-linear CG. [sent-253, score-0.156]
84 For our algorithm we use z = 16 information pivots with random selection (CholQR-z16). [sent-255, score-0.291]
85 We do 50-fold random splits into 3660 training points and 192 test points for repeated runs. [sent-261, score-0.295]
86 We use a compound kernel, comprising 14 different graph kernels, and 15 continuous hyperparameters. [sent-262, score-0.14]
89 Figure 3: Training time versus test performance on discrete datasets (legend: CholQR-z16, IVM, Random, Titsias-16, Titsias-512; x-axes: cumulative training time in secs, log scale). [sent-277, score-0.342]
90 Instead, we predict the vertical position of joints independently, using a histogram intersection kernel [9], having four hyperparameters: one noise variance, and three data variances corresponding to the kernel evaluated over the HoG from each of three cameras. [sent-283, score-0.142]
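A sketch of the histogram intersection kernel and the per-camera weighted combination it suggests is shown below; the combination as a weighted sum over cameras is our reading of the description above, and all names are illustrative.

```python
import numpy as np

def histogram_intersection_kernel(X, Y):
    # k(x, y) = sum_i min(x_i, y_i), for non-negative histogram features such as HoG.
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=-1)

def multi_camera_kernel(feats_X, feats_Y, data_vars):
    # Weighted sum of one intersection kernel per camera; `data_vars` plays the
    # role of the per-camera data variances, and `feats_*` are lists of HoG
    # feature matrices (one per camera).
    return sum(v * histogram_intersection_kernel(Xc, Yc)
               for v, Xc, Yc in zip(data_vars, feats_X, feats_Y))
```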
91 The poor performance of Random selection highlights the importance of selecting good inducing points, as no amount of hyperparameter optimization can correct for poor inducing points. [sent-291, score-1.133]
92 Figs. 3(c) and 3(f) show the trade-off between the test SMSE and training time for variants of CholQR, with baselines and CSI kernel regression [1]. [sent-296, score-0.174]
93 For CholQR we consider different numbers of information pivots (denoted z8, z16, z64 and z128), and different strategies for their selection including random selection, optimization as in [1] (denoted OI) and adaptively growing the information pivot set (denoted AA, see [3] for details). [sent-297, score-0.337]
94 Figure 5: Test scores on KIN40K as a function of the number of inducing points: for each number of inducing points the value plotted is averaged over 10 runs from 10 different (shared) initializations. [sent-313, score-1.138]
95 We use the 1D toy dataset of [14] to show how the PP likelihood with gradient-based optimization of inducing points is easily trapped in local minima. [sent-314, score-0.694]
96 Fig. 4(f) shows that SPGP learned with a more uniform initial distribution of inducing points avoids this local optimum and achieves a better negative log likelihood of 11. [sent-322, score-0.671]
97 5 Conclusion: We describe an algorithm for selecting inducing points for Gaussian Process sparsification. [sent-339, score-0.625]
98 It optimizes principled objective functions, and is applicable to discrete domains and non-differentiable kernels. [sent-340, score-0.188]
99 Fast forward selection to speed up sparse Gaussian process regression. [sent-437, score-0.136]
100 Variational learning of inducing variables in sparse Gaussian processes. [sent-458, score-0.567]
wordName wordTfidf (topN-words)
[('inducing', 0.513), ('cholqr', 0.376), ('lz', 0.263), ('pivots', 0.249), ('ivm', 0.221), ('qm', 0.217), ('titsias', 0.204), ('smse', 0.188), ('spgp', 0.18), ('csi', 0.157), ('gp', 0.146), ('im', 0.12), ('points', 0.112), ('snlp', 0.094), ('cholesky', 0.093), ('swap', 0.084), ('km', 0.081), ('lm', 0.081), ('factorization', 0.076), ('sparsi', 0.073), ('objective', 0.07), ('jm', 0.065), ('hyperparameters', 0.064), ('variational', 0.064), ('candidates', 0.062), ('pp', 0.061), ('swapping', 0.057), ('kernel', 0.057), ('discrete', 0.055), ('secs', 0.055), ('training', 0.051), ('qr', 0.051), ('continuous', 0.047), ('bindingdb', 0.047), ('downdating', 0.047), ('testing', 0.045), ('nystr', 0.044), ('candidate', 0.043), ('selection', 0.042), ('residual', 0.041), ('hyperparameter', 0.04), ('optimizes', 0.039), ('select', 0.039), ('predictive', 0.037), ('criteria', 0.037), ('epoch', 0.035), ('hog', 0.035), ('mn', 0.033), ('chalupka', 0.031), ('nmll', 0.031), ('sod', 0.031), ('comprising', 0.029), ('ek', 0.029), ('oi', 0.029), ('change', 0.029), ('free', 0.029), ('csat', 0.028), ('joints', 0.028), ('evaluating', 0.028), ('seeger', 0.027), ('gaussian', 0.027), ('greedy', 0.027), ('approximation', 0.027), ('sparse', 0.027), ('covariance', 0.026), ('woodbury', 0.025), ('energy', 0.025), ('optimization', 0.025), ('mle', 0.025), ('likelihood', 0.024), ('domains', 0.024), ('variants', 0.024), ('snelson', 0.024), ('tr', 0.024), ('factored', 0.023), ('reduction', 0.023), ('standardized', 0.023), ('marker', 0.023), ('log', 0.022), ('regression', 0.022), ('kernels', 0.022), ('update', 0.022), ('forward', 0.021), ('hybrid', 0.021), ('aa', 0.021), ('pivot', 0.021), ('zn', 0.021), ('approximate', 0.021), ('var', 0.02), ('test', 0.02), ('fm', 0.02), ('cg', 0.02), ('rejected', 0.02), ('trapped', 0.02), ('termination', 0.02), ('ranking', 0.02), ('exact', 0.019), ('mechanics', 0.019), ('process', 0.019), ('initialization', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in discrete cases and competitive results in the continuous case. 1
2 0.24120674 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
3 0.13110189 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
4 0.10167393 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
5 0.095366836 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
6 0.088555105 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
7 0.069916621 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
8 0.065952666 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
9 0.058455735 186 nips-2013-Matrix factorization with binary components
10 0.057066888 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
11 0.051434506 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
12 0.049803678 75 nips-2013-Convex Two-Layer Modeling
13 0.049087401 137 nips-2013-High-Dimensional Gaussian Process Bandits
14 0.046503689 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
15 0.041837625 76 nips-2013-Correlated random features for fast semi-supervised learning
16 0.041533258 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
17 0.041425366 318 nips-2013-Structured Learning via Logistic Regression
18 0.041203193 173 nips-2013-Least Informative Dimensions
19 0.041037239 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
20 0.039817914 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces
topicId topicWeight
[(0, 0.132), (1, 0.049), (2, -0.0), (3, 0.022), (4, 0.002), (5, 0.047), (6, 0.041), (7, 0.053), (8, 0.037), (9, -0.046), (10, -0.08), (11, -0.032), (12, -0.123), (13, -0.014), (14, -0.001), (15, 0.033), (16, -0.091), (17, 0.058), (18, -0.011), (19, -0.15), (20, 0.086), (21, -0.083), (22, -0.121), (23, 0.09), (24, 0.009), (25, -0.003), (26, -0.029), (27, -0.029), (28, 0.019), (29, 0.063), (30, -0.153), (31, 0.028), (32, 0.035), (33, 0.086), (34, 0.108), (35, 0.112), (36, 0.06), (37, -0.044), (38, -0.015), (39, -0.032), (40, -0.005), (41, -0.056), (42, 0.042), (43, 0.038), (44, -0.027), (45, -0.088), (46, 0.002), (47, 0.036), (48, -0.053), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.91433311 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in discrete cases and competitive results in the continuous case. 1
2 0.87812972 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
Author: Michalis Titsias, Miguel Lazaro-Gredilla
Abstract: We introduce a novel variational method that allows to approximately integrate out kernel hyperparameters, such as length-scales, in Gaussian process regression. This approach consists of a novel variant of the variational framework that has been recently developed for the Gaussian process latent variable model which additionally makes use of a standardised representation of the Gaussian process. We consider this technique for learning Mahalanobis distance metrics in a Gaussian process regression setting and provide experimental evaluations and comparisons with existing methods by considering datasets with high-dimensional inputs. 1
3 0.75066739 54 nips-2013-Bayesian optimization explains human active search
Author: Ali Borji, Laurent Itti
Abstract: Many real-world problems have complicated objective functions. To optimize such functions, humans utilize sophisticated sequential decision-making strategies. Many optimization algorithms have also been developed for this same purpose, but how do they compare to humans in terms of both performance and behavior? We try to unravel the general underlying algorithm people may be using while searching for the maximum of an invisible 1D function. Subjects click on a blank screen and are shown the ordinate of the function at each clicked abscissa location. Their task is to find the function’s maximum in as few clicks as possible. Subjects win if they get close enough to the maximum location. Analysis over 23 non-maths undergraduates, optimizing 25 functions from different families, shows that humans outperform 24 well-known optimization algorithms. Bayesian Optimization based on Gaussian Processes, which exploits all the x values tried and all the f (x) values obtained so far to pick the next x, predicts human performance and searched locations better. In 6 follow-up controlled experiments over 76 subjects, covering interpolation, extrapolation, and optimization tasks, we further confirm that Gaussian Processes provide a general and unified theoretical account to explain passive and active function learning and search in humans. 1
4 0.71218902 201 nips-2013-Multi-Task Bayesian Optimization
Author: Kevin Swersky, Jasper Snoek, Ryan P. Adams
Abstract: Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost. 1
5 0.5284515 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
Author: Roger Frigola, Fredrik Lindsten, Thomas B. Schon, Carl Rasmussen
Abstract: State-space models are successfully used in many areas of science, engineering and economics to model time series and dynamical systems. We present a fully Bayesian approach to inference and learning (i.e. state estimation and system identification) in nonlinear nonparametric state-space models. We place a Gaussian process prior over the state transition dynamics, resulting in a flexible model able to capture complex dynamical phenomena. To enable efficient inference, we marginalize over the transition dynamics function and, instead, infer directly the joint smoothing distribution using specially tailored Particle Markov Chain Monte Carlo samplers. Once a sample from the smoothing distribution is computed, the state transition predictive distribution can be formulated analytically. Our approach preserves the full nonparametric expressivity of the model and can make use of sparse Gaussian processes to greatly reduce computational complexity. 1
6 0.49598673 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
7 0.49533489 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
8 0.39197376 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series
9 0.38872653 354 nips-2013-When in Doubt, SWAP: High-Dimensional Sparse Recovery from Correlated Measurements
10 0.37335065 76 nips-2013-Correlated random features for fast semi-supervised learning
11 0.36672664 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
12 0.36414316 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models
13 0.35589787 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
14 0.35314277 137 nips-2013-High-Dimensional Gaussian Process Bandits
15 0.33459309 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
16 0.32875097 86 nips-2013-Demixing odors - fast inference in olfaction
17 0.32853043 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
18 0.31216094 133 nips-2013-Global Solver and Its Efficient Approximation for Variational Bayesian Low-rank Subspace Clustering
19 0.31163862 85 nips-2013-Deep content-based music recommendation
20 0.30910185 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
topicId topicWeight
[(6, 0.018), (16, 0.06), (33, 0.109), (34, 0.137), (37, 0.291), (41, 0.026), (49, 0.021), (56, 0.079), (70, 0.027), (85, 0.036), (89, 0.024), (93, 0.039), (95, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.74641907 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
Author: Yanshuai Cao, Marcus A. Brubaker, David Fleet, Aaron Hertzmann
Abstract: We propose an efficient optimization algorithm for selecting a subset of training data to induce sparsity for Gaussian process regression. The algorithm estimates an inducing set and the hyperparameters using a single objective, either the marginal likelihood or a variational free energy. The space and time complexity are linear in training set size, and the algorithm can be applied to large regression problems on discrete or continuous domains. Empirical evaluation shows state-of-the-art performance in discrete cases and competitive results in the continuous case. 1
2 0.68777841 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?
Author: Qiang Liu, Alex Ihler, Mark Steyvers
Abstract: We study the problem of estimating continuous quantities, such as prices, probabilities, and point spreads, using a crowdsourcing approach. A challenging aspect of combining the crowd’s answers is that workers’ reliabilities and biases are usually unknown and highly diverse. Control items with known answers can be used to evaluate workers’ performance, and hence improve the combined results on the target items with unknown answers. This raises the problem of how many control items to use when the total number of items each workers can answer is limited: more control items evaluates the workers better, but leaves fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide theoretical analysis of the accuracy of different consensus methods. 1
3 0.6548025 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
Author: Tamir Hazan, Subhransu Maji, Joseph Keshet, Tommi Jaakkola
Abstract: In this work we develop efficient methods for learning random MAP predictors for structured label problems. In particular, we construct posterior distributions over perturbations that can be adjusted via stochastic gradient methods. We show that any smooth posterior distribution would suffice to define a smooth PAC-Bayesian risk bound suitable for gradient methods. In addition, we relate the posterior distributions to computational properties of the MAP predictors. We suggest multiplicative posteriors to learn super-modular potential functions that accompany specialized MAP predictors such as graph-cuts. We also describe label-augmented posterior models that can use efficient MAP approximations, such as those arising from linear program relaxations. 1
4 0.62320751 251 nips-2013-Predicting Parameters in Deep Learning
Author: Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, Nando de Freitas
Abstract: We demonstrate that there is significant redundancy in the parameterization of several deep learning models. Given only a few weight values for each feature it is possible to accurately predict the remaining values. Moreover, we show that not only can the parameter values be predicted, but many of them need not be learned at all. We train several different architectures by learning only a small number of weights and predicting the rest. In the best case we are able to predict more than 95% of the weights of a network without any drop in accuracy. 1
5 0.57396144 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
6 0.56488347 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
7 0.56381881 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
8 0.56381243 86 nips-2013-Demixing odors - fast inference in olfaction
9 0.56324053 201 nips-2013-Multi-Task Bayesian Optimization
10 0.56268477 346 nips-2013-Variational Inference for Mahalanobis Distance Metrics in Gaussian Process Regression
11 0.56267941 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
12 0.56241798 278 nips-2013-Reward Mapping for Transfer in Long-Lived Agents
13 0.56238824 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
14 0.56216216 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
15 0.56215149 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
16 0.56118798 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
17 0.56106514 173 nips-2013-Least Informative Dimensions
18 0.56036299 5 nips-2013-A Deep Architecture for Matching Short Texts
19 0.55944026 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
20 0.5592854 312 nips-2013-Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex