nips nips2012 nips2012-76 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi
Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. [sent-4, score-0.071]
2 The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. [sent-5, score-0.166]
3 Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. [sent-7, score-0.115]
4 In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2. [sent-11, score-0.067]
5 1 Introduction Many problems in machine learning are based on a form of (regularized) empirical risk minimization. [sent-13, score-0.107]
6 In a centralized setting, there are many procedures for solving empirical risk minimization problems, including standard convex programming approaches [3] as well as various types of stochastic approximation [19, 8, 14]. [sent-15, score-0.281]
7 Accordingly, the focus of this paper is the theoretical analysis and empirical evaluation of some distributed and communication-efficient procedures for empirical risk minimization. [sent-17, score-0.217]
8 Recent years have witnessed a flurry of research on distributed approaches to solving very large-scale statistical optimization problems (e. [sent-18, score-0.088]
9 In statistical settings, however, distributed computation can lead to gains in statistical efficiency, as shown by Dekel et al. [sent-22, score-0.09]
10 Within the family of distributed algorithms, there can be significant differences in communication complexity: different computers must be synchronized, and when the dimensionality of the data is high, communication can be prohibitively expensive. [sent-24, score-0.161]
11 It is thus interesting to study distributed inference algorithms that require limited synchronization and communication while still enjoying the statistical power guaranteed by having a large dataset. [sent-25, score-0.112]
12 With this context, perhaps the simplest algorithm for distributed statistical inference is what we term the average mixture (AVGM) algorithm. [sent-26, score-0.095]
13 It is an appealingly simple method: given m different machines and a dataset of size N = nm, give each machine a (distinct) dataset of size n = N/m, have each machine i compute the empirical minimizer θi on its fraction of the data, then average all the parameters θi across the network. [sent-28, score-0.242]
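As a concrete illustration of the recipe just described, here is a minimal AVGM sketch in Python; the shard layout (feature columns plus a final label column) and the least-squares local loss are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from scipy.optimize import minimize

def avgm(data, m, local_loss, theta_dim):
    """Average mixture (AVGM): split the N samples into m shards of size
    n = N/m, solve each local empirical risk minimization, and average."""
    shards = np.array_split(data, m)
    theta0 = np.zeros(theta_dim)
    local_minimizers = []
    for shard in shards:
        # Each "machine" minimizes the empirical risk on its own shard only.
        res = minimize(lambda th: local_loss(th, shard), theta0)
        local_minimizers.append(res.x)
    # A single round of communication: ship the m parameter vectors and average.
    return np.mean(local_minimizers, axis=0)

def local_loss(theta, shard):
    # Illustrative least-squares loss: f(theta; (x, y)) = 0.5 * (y - <x, theta>)^2,
    # with the label stored in the last column of the shard.
    X, y = shard[:, :-1], shard[:, -1]
    return 0.5 * np.mean((y - X @ theta) ** 2)
```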
14 To the best of our knowledge, however, no work has shown theoretically that the AVGM procedure has greater statistical efficiency than the naive approach of using n samples on a single machine. [sent-30, score-0.072]
15 [10] prove that the AVGM approach enjoys a variance reduction relative to the single processor solution, but they only prove that the final mean-squared error of their estimator is O(1/n), since they do not show a reduction in the bias of the estimator. [sent-32, score-0.137]
16 [23] propose a parallel stochastic gradient descent (SGD) procedure, which runs SGD independently on k machines for T iterations, averaging the outputs. [sent-34, score-0.262]
17 The algorithm enjoys good practical performance, but their main result [23, Theorem 12] guarantees a convergence rate of O(log k/T ), which is no better than sequential SGD on a single machine processing T samples. [sent-35, score-0.079]
18 First, we provide a sharp analysis of the AVGM algorithm, showing that under a reasonable set of conditions on the statistical risk function, it can indeed achieve substantially better rates. [sent-37, score-0.105]
19 Whenever the number of machines m is less than the number of samples n per machine, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N = nm samples. [sent-39, score-0.273]
20 Our second contribution is to develop a novel extension of simple averaging; it is based on an appropriate form of the bootstrap [6, 7], which we refer to as the bootstrap average mixture (BAVGM) approach. [sent-41, score-0.436]
21 Thus, as long as m < n^2, the bootstrap method matches the centralized gold standard up to higher-order terms. [sent-44, score-0.282]
22 Finally, we complement our theoretical results with experiments on simulated data and a large-scale logistic regression experiment that arises from the problem of predicting whether a user of a search engine will click on an advertisement. [sent-45, score-0.188]
23 Our experiments show that the resampling and correction of the BAVGM method provide substantial performance benefits over naive solutions as well as the averaging algorithm AVGM. [sent-46, score-0.137]
24 Let P be a probability distribution over the sample space X, and define the population risk function F0 : Θ → R via F0(θ) := ∫_X f(θ; x) dP(x). [sent-48, score-0.118]
25 In practice, the population distribution P is unknown to us, but we have access to a collection S of samples from the distribution P . [sent-50, score-0.083]
26 In empirical risk minimization, one estimates the vector θ* by solving the optimization problem θ̂ ∈ argmin_{θ∈Θ} (1/|S|) Σ_{x∈S} f(θ; x). [sent-51, score-0.124]
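For reference, the two objectives just described can be written side by side in the notation above:

```latex
F_0(\theta) := \int_{\mathcal{X}} f(\theta; x)\, dP(x),
\qquad
\theta^* \in \operatorname*{argmin}_{\theta \in \Theta} F_0(\theta),
\qquad
\widehat{\theta} \in \operatorname*{argmin}_{\theta \in \Theta} \frac{1}{|S|} \sum_{x \in S} f(\theta; x).
```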
27 In addition, the risk function is required to have some amount of curvature: Assumption B (Local strong convexity). [sent-56, score-0.066]
28 Note that this local condition is milder than a global strong convexity condition and is required to hold only for the population risk F0 . [sent-59, score-0.191]
29 In the distributed setting, we are given a dataset of N = mn samples drawn i.i.d. [sent-62, score-0.234]
30 according to the initial distribution P , which we divide evenly amongst m processors or inference procedures. [sent-65, score-0.063]
31 For each j ∈ {1, . . . , m}, let Sj denote a subsampled dataset of size n, and define the (local) empirical distribution P1,j and empirical objective F1,j via P1,j := (1/n) Σ_{x∈Sj} δx and F1,j(θ) := (1/n) Σ_{x∈Sj} f(θ; x) = ∫ f(θ; x) dP1,j(x). (1) [sent-69, score-0.147]
32 The bootstrap average mixture (BAVGM) procedure is based on an additional level of random sampling. [sent-75, score-0.23]
33 In addition to computing the empirical minimizer θ1,j based on Sj, BAVGM also computes the empirical minimizer θ2,j of the function F2,j(θ) := (1/|S2,j|) Σ_{x∈S2,j} f(θ; x), constructing the bootstrap average θ2 := (1/m) Σ_{j=1}^m θ2,j and returning the estimate (θ1 − rθ2)/(1 − r). [sent-77, score-0.37]
34 The purpose of the weighted estimate (2) is to perform a form of bootstrap bias correction [6, 7]. [sent-79, score-0.288]
35 In rough terms, if b0 = θ∗ − θ1 is the bias of the first estimator, then we may approximate b0 by the bootstrap estimate of bias b1 = θ1 − θ2 . [sent-80, score-0.337]
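A sketch of how this bias correction could be wired up, assuming each shard is a numpy array and that `local_minimizer` solves the empirical risk on whatever samples it receives (for instance, the AVGM routine above); the (1 − r) rescaling matches the weighted estimate (2) referenced in the text.

```python
import numpy as np

def bavgm(shards, r, local_minimizer, rng=None):
    """Bootstrap average mixture (BAVGM): on each machine, solve the full
    local problem and the problem on a subsample of size ceil(r * n), then
    combine the two averages so the O(1/n) bias terms cancel to first order."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta1, theta2 = [], []
    for shard in shards:
        n = len(shard)
        idx = rng.choice(n, size=int(np.ceil(r * n)), replace=False)
        theta1.append(local_minimizer(shard))       # minimizer on all n samples
        theta2.append(local_minimizer(shard[idx]))  # minimizer on the subsample
    theta1_bar = np.mean(theta1, axis=0)
    theta2_bar = np.mean(theta2, axis=0)
    return (theta1_bar - r * theta2_bar) / (1.0 - r)
```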
36 1 Main results Bounds for simple averaging To guarantee good estimation properties of our algorithms, we require regularity conditions on the empirical risks F1 and F2 . [sent-83, score-0.142]
37 While Assumption C may appear strong, some smoothness of ∇2 f is necessary for averaging methods to work, as we now demonstrate by an example. [sent-89, score-0.094]
38 The associated population risk is F0(w) = (1/2)(w^2 + w^2 1_{(w≤0)}), which is strongly convex and smooth, since |F0′(w) − F0′(v)| ≤ 2|w − v|, but has a discontinuous second derivative. [sent-92, score-0.145]
39 In this example, the averaging algorithm using N = mn observations must suffer mean squared error E[(θ1 − θ*)^2] = Ω(n^{-1}). [sent-95, score-0.107]
40 Both hold for logistic and linear regression problems so long as the population data covariance matrix is not rank deficient and the data is bounded; moreover, in the linear regression case, we have L = 0. [sent-98, score-0.163]
41 Our assumptions in place, we present our first theorem on the convergence of the AVGM procedure. [sent-99, score-0.092]
42 For each i ∈ {1, . . . , m}, let Si be a dataset of n independent samples, and let θ1,i ∈ argmin_{θ∈Θ} (1/n) Σ_{xj∈Si} f(θ; xj) denote the minimizer of the empirical risk for the dataset Si. [sent-105, score-0.25]
43 Define θ1 = (1/m) Σ_{i=1}^m θ1,i and let θ* denote the population risk minimizer. [sent-106, score-0.065]
44 Then under Assumptions A–C, we have E‖θ1 − θ*‖_2^2 ≤ (2/(nm)) E‖∇²F0(θ*)^{-1} ∇f(θ*; X)‖_2^2 + (5/(λ^2 n^2)) (H^2 log d + E‖∇²F0(θ*)^{-1} ∇f(θ*; X)‖_2^2) E‖∇²F0(θ*)^{-1} ∇f(θ*; X)‖_2^2 + O(m^{-1} n^{-2}) + O(n^{-3}). (4) [sent-107, score-0.065]
45 A simple corollary of Theorem 1 makes it somewhat easier to parse, though we prefer the general form in the theorem as its dimension dependence is somewhat stronger. [sent-108, score-0.129]
46 (In its simplified form, the corollary bounds E‖θ1 − θ*‖_2^2 by a leading term of order G^2/(λ^2 nm) plus lower-order n^{-2} terms.) A comparison of Theorem 1's conclusions with classical statistical results is also informative. [sent-114, score-0.084]
47 Let N = mn denote the total number of samples available. [sent-116, score-0.138]
48 11] that for any estimator θN based on N samples, sup_{M<∞} lim inf_{N→∞} sup_{‖δ‖_2 ≤ M/√N} N · E_{θ*+δ}[‖θN − θ* − δ‖_2^2] ≥ tr(I(θ*)^{-1}). [sent-118, score-0.063]
49 The important aspect of our bound, however, is that we obtain this convergence rate without calculating an estimate on all N = mn data samples xi ; we calculate m independent estimators and average them to attain the convergence guarantee. [sent-124, score-0.229]
50 2 Bounds for bootstrap mixture averaging As shown in Theorem 1 and the immediately preceding corollary, for small m, the convergence rate of the AVGM algorithm is mainly determined by the first term in the bound (4), which is at worst G^2/(λ^2 mn). [sent-126, score-0.442]
51 In addition, when the population risk’s local strong convexity parameter λ is close to zero or the Lipschitz continuity constant H of ∇f (θ; x) is large, the n−2 term in the bound (4) and Corollary 1 may dominate the leading term. [sent-128, score-0.077]
52 This concern motivates our development of the bootstrap average mixture (BAVGM) algorithm and analysis. [sent-129, score-0.23]
53 Due to the additional randomness introduced by the bootstrap algorithm BAVGM, its analysis requires an additional smoothness condition. [sent-130, score-0.239]
54 For a ρ > 0, there exists a neighborhood U = {θ ∈ R^d : ‖θ* − θ‖_2 ≤ 2ρ} ⊆ Θ such that the smoothness conditions of Assumption C hold. [sent-133, score-0.07]
55 Note that Assumption D holds for linear regression (in fact, with M = 0); it also holds for logistic regression problems with finite M as long as the data is bounded. [sent-135, score-0.094]
56 We now state our second main theorem, which shows that the use of bootstrap samples to reduce the bias of the AVGM algorithm yields improved performance. [sent-136, score-0.279]
57 The reason for this elimination is that resampling at a rate r reduces the bias of the BAVGM algorithm to O(n−3 ); the bias of the AVGM algorithm induces terms of order n−2 in Theorem 1. [sent-141, score-0.141]
58 Roughly, when m becomes large we increase r, since the bias of the independent solutions may increase and we enjoy averaging effects from the BAVGM algorithm. [sent-143, score-0.131]
59 We leave as an intriguing open question whether computing multiple bootstrap samples at each machine can yield improved performance for the BAVGM procedure. [sent-146, score-0.237]
60 3 Time complexity In practice, the exact empirical minimizers assumed in Theorems 1 and 2 may be unavailable. [sent-148, score-0.075]
61 In this section, we sketch an argument that shows that both the AVGM algorithm and the BAVGM algorithm can use approximate empirical minimizers to achieve the same (optimal) asymptotic bounds. [sent-149, score-0.098]
62 Indeed, suppose that we employ approximate empirical minimizers in AVGM and BAVGM instead of the exact ones. [sent-150, score-0.075]
63 The bound (7) shows that solving the empirical minimization problem to an accuracy sufficient to have E[‖θ1′ − θ1‖_2^2] = O((mn)^{-2}) guarantees the same convergence rates provided by Theorem 1. [sent-153, score-0.1]
64 Figure 1: Experiments plotting the error in the estimate of θ* given by the AVGM and BAVGM algorithms for a total number of samples N = 10^5, versus the number of dataset splits (parallel machines) m. [sent-162, score-0.266]
65 of the AVGM and similar algorithms over the naive approach of processing all N = mn samples on one processor is at least of order m/ log(N ). [sent-165, score-0.193]
66 Let us argue that for such time complexity the necessary empirical convergence is achievable. [sent-166, score-0.065]
67 As we show in our proof of Theorem 1, with high probability the empirical risk F1 is strongly convex in a ball Bρ(θ1) of constant radius ρ > 0 around θ1. [sent-167, score-0.134]
68 A combination of stochastic gradient descent [14] and standard convex programming approaches [3] completes the argument. [sent-169, score-0.141]
69 Indeed, performing stochastic gradient descent for O(log2 (mn)/ρ2 ) iterations on the empirical objective F1 yields that with probability at least 1 − m−2 n−2 , the resulting parameter falls within Bρ (θ1 ) [14, Proposition 2. [sent-170, score-0.155]
70 The local strong convexity guarantees that O(log(mn)) iterations of standard gradient descent [3, Chapter 9] (each requiring O(n) units of time) beginning from this parameter are sufficient to achieve E[‖θ1′ − θ1‖_2^2] = O((mn)^{-2}), since gradient descent enjoys a locally linear convergence rate. [sent-172, score-0.244]
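A sketch of this two-phase local solver: a stochastic-gradient warm start to land inside the strongly convex ball, followed by a few full gradient steps that then converge linearly; the step sizes and iteration counts are placeholders rather than the constants from the analysis, and `grad_f` is an assumed per-sample gradient oracle.

```python
import numpy as np

def approx_local_minimizer(grad_f, shard, theta0, sgd_iters=1000, gd_iters=50,
                           sgd_step=0.1, gd_step=0.1, rng=None):
    """Phase 1: SGD on single samples (O(1) work per step) to reach a
    neighborhood where the empirical risk is strongly convex.
    Phase 2: full gradient descent (O(n) work per step) to polish."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = theta0.copy()
    n = len(shard)
    for t in range(1, sgd_iters + 1):
        x = shard[rng.integers(n)]
        theta = theta - (sgd_step / np.sqrt(t)) * grad_f(theta, x)
    for _ in range(gd_iters):
        full_grad = np.mean([grad_f(theta, x) for x in shard], axis=0)
        theta = theta - gd_step * full_grad
    return theta
```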
71 We also remark that under a slightly more global variant of Assumptions A–C, we can show that stochastic gradient descent achieves convergence rates of O((mn)−2 + n−3/2 ), which is order optimal. [sent-174, score-0.138]
72 For each experiment, we use a fixed total number N = 10^5 of samples, but we vary the number of parallel splits m of the data (and consequently, the local dataset sizes n = N/m) and the dimensionality d of the problem solved. [sent-179, score-0.145]
73 In Figure 1, we plot the error ‖θ − θ*‖_2^2 of the inferred parameter vector θ relative to the true parameter θ*, versus the number of splits (parallel machines) m we use. [sent-191, score-0.083]
74 Figure 2: (a) Synthetic data: comparison of the AVGM estimator to a linear regression estimator based on N/m data points. [sent-208, score-0.09]
75 (b) Advertising data: the log-loss on held-out data for the BAVGM method applied with m = 128 parallel splits of the data, plotted versus the sub-sampling rate r. [sent-209, score-0.151]
76 Even as the dimensionality d grows, we see that splitting the data into as many as m = 64 independent pieces and averaging the solution vectors θi estimated from each subsample i yields a vector θ whose error in estimating θ* is no worse than twice that of the solution computed from all N samples. [sent-214, score-0.084]
77 Indeed, setting n = N/m, we see that Theorem 1 implies E[‖θ − θ*‖_2^2] = O(1/(mn) + 1/n^2) = O(1/N + m^2/N^2), which matches Figure 1. [sent-217, score-0.13]
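Spelling out the substitution in the previous sentence:

```latex
n = \frac{N}{m}
\;\Longrightarrow\;
\frac{1}{mn} = \frac{1}{N},
\quad
\frac{1}{n^{2}} = \frac{m^{2}}{N^{2}},
\qquad\text{so}\qquad
\mathbb{E}\bigl[\|\theta - \theta^{*}\|_{2}^{2}\bigr]
= O\!\Bigl(\frac{1}{N} + \frac{m^{2}}{N^{2}}\Bigr).
```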
78 In addition, we see that the BAVGM algorithm enjoys somewhat more stable performance, with increasing benefit as the number of machines m increases. [sent-218, score-0.125]
79 We use a large dataset from the Tencent search engine, soso.com [20]. [sent-225, score-0.063]
80 The dataset contains 641,707 distinct advertisement items with N = 235,582,879 data samples. [sent-226, score-0.067]
81 Each sample consists of a so-called impression, which is a list containing a user-issued search, the advertisement presented to the user and a label y ∈ {+1, −1} indicating whether the user clicked on the advertisement. [sent-227, score-0.125]
82 The ads in our dataset were presented to 23,669,283 distinct users. [sent-228, score-0.072]
83 The Tencent dataset provides a standard encoding to transform an impression into a usable set of regressors x. [sent-229, score-0.075]
84 Our goal is to predict the probability of a user clicking a given advertisement as a function of the covariates x. [sent-235, score-0.096]
85 In order to do so, we use a logistic regression model to estimate the probability of a click response, P(y = 1 | x; θ) := 1/(1 + exp(−⟨θ, x⟩)), where θ ∈ R^d is the unknown regression vector. [sent-236, score-0.157]
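A sketch of this click model and the per-impression logistic loss that each machine would minimize locally; dense numpy vectors are used for readability, even though the real regressors are high-dimensional and sparse.

```python
import numpy as np

def click_probability(theta, x):
    """Logistic model: P(y = 1 | x; theta) = 1 / (1 + exp(-<theta, x>))."""
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

def logistic_loss(theta, X, y):
    """Average negative log-likelihood over impressions, with labels y in {+1, -1}:
    f(theta; (x, y)) = log(1 + exp(-y * <theta, x>))."""
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins))  # numerically stable log(1 + e^{-z})
```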
87 Figure 3: The negative log-likelihood of the output of the AVGM, BAVGM, and a stochastic gradient descent method on the held-out dataset for the click-through prediction task. [sent-255, score-0.178]
88 (a) Performance of the AVGM and BAVGM methods versus the number of splits m of the data. [sent-256, score-0.106]
89 The dataset is too large to fit in main memory on most computers: in total, four splits of the data require 55 gigabytes. [sent-266, score-0.12]
90 Instead, for each experiment, we perform 10 passes of stochastic gradient descent through the dataset to get a rough baseline of the performance attained by the empirical minimizer for the entire dataset. [sent-268, score-0.288]
91 In Figure 3(a), we show the average hold-out set log-loss (with standard errors) of the estimator θ1 provided by the AVGM method and the BAVGM method versus number of splits of the data m. [sent-270, score-0.133]
92 The plot shows that for small m, both AVGM and BAVGM enjoy good performance, comparable to or better than (our proxy for) the oracle solution using all N samples. [sent-271, score-0.075]
93 As the number of machines m grows, the de-biasing provided by the subsampled bootstrap method yields substantial improvements over the standard AVGM method. [sent-272, score-0.289]
94 In addition, even with m = 128 splits of the dataset, the BAVGM method gives better hold-out set performance than performing two passes of stochastic gradient over the entire dataset of N samples. [sent-273, score-0.231]
95 This is striking, as doing even one pass through the data with stochastic gradient descent is known to give minimax optimal convergence rates [16, 1]. [sent-274, score-0.138]
96 We choose m = 128 because more data splits provide more variable performance in r. [sent-277, score-0.076]
97 Dual averaging for distributed optimization: convergence analysis and network scaling. [sent-317, score-0.137]
98 A randomized incremental subgradient method for distributed optimization in networked systems. [sent-340, score-0.074]
99 Distributed stochastic subgradient projection algorithms for convex optimization. [sent-393, score-0.091]
100 Hogwild: a lock-free approach to parallelizing stochastic gradient descent. [sent-400, score-0.077]
wordName wordTfidf (topN-words)
[('bavgm', 0.713), ('avgm', 0.539), ('bootstrap', 0.206), ('mn', 0.107), ('splits', 0.076), ('advertisement', 0.067), ('risk', 0.066), ('nm', 0.065), ('sj', 0.062), ('machines', 0.062), ('averaging', 0.061), ('sgd', 0.055), ('centralized', 0.053), ('distributed', 0.052), ('population', 0.052), ('dataset', 0.044), ('engine', 0.042), ('stochastic', 0.042), ('theorem', 0.042), ('bias', 0.042), ('empirical', 0.041), ('communication', 0.041), ('hessian', 0.041), ('click', 0.04), ('descent', 0.037), ('resampling', 0.037), ('regression', 0.036), ('enjoys', 0.035), ('gradient', 0.035), ('tencent', 0.035), ('passes', 0.034), ('minimizers', 0.034), ('processors', 0.034), ('smoothness', 0.033), ('processor', 0.033), ('corollary', 0.031), ('minimizer', 0.031), ('samples', 0.031), ('milder', 0.031), ('impression', 0.031), ('versus', 0.03), ('rd', 0.03), ('user', 0.029), ('evenly', 0.029), ('somewhat', 0.028), ('ads', 0.028), ('nedi', 0.028), ('enjoy', 0.028), ('returning', 0.028), ('plot', 0.028), ('estimator', 0.027), ('computers', 0.027), ('distributes', 0.027), ('zinkevich', 0.027), ('convex', 0.027), ('assumptions', 0.026), ('parallel', 0.025), ('advertising', 0.025), ('mann', 0.025), ('mcdonald', 0.025), ('convexity', 0.025), ('argmin', 0.024), ('mixture', 0.024), ('rough', 0.024), ('convergence', 0.024), ('agarwal', 0.024), ('consequently', 0.024), ('asymptotic', 0.023), ('estimate', 0.023), ('matches', 0.023), ('subgradient', 0.022), ('logistic', 0.022), ('naive', 0.022), ('tr', 0.021), ('lipschitz', 0.021), ('subsampled', 0.021), ('rate', 0.02), ('regularity', 0.02), ('dekel', 0.02), ('berkeley', 0.02), ('fraction', 0.02), ('negative', 0.02), ('conditions', 0.02), ('achievable', 0.019), ('decays', 0.019), ('assumption', 0.019), ('oracle', 0.019), ('search', 0.019), ('statistical', 0.019), ('minimization', 0.018), ('sup', 0.018), ('neighborhood', 0.017), ('curvature', 0.017), ('correction', 0.017), ('duchi', 0.017), ('procedures', 0.017), ('grows', 0.017), ('hold', 0.017), ('solving', 0.017), ('requiring', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi
Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
2 0.087749779 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
3 0.080539405 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
4 0.062680617 324 nips-2012-Stochastic Gradient Descent with Only One Projection
Author: Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, Jinfeng Yi
Abstract: Although many variants of stochastic gradient descent have been proposed for large-scale convex optimization, most of them require projecting the solution at each iteration to ensure that the obtained solution stays within the feasible domain. For complex domains (e.g., positive semidefinite cone), the projection step can be computationally expensive, making stochastic gradient descent unattractive for large-scale optimization problems. We address this limitation by developing novel stochastic optimization algorithms that do not need intermediate projections. Instead, only one projection at the last iteration is needed to obtain a feasible solution in the given domain. Our theoretical analysis shows that with a high probability, √ the proposed algorithms achieve an O(1/ T ) convergence rate for general convex optimization, and an O(ln T /T ) rate for strongly convex optimization under mild conditions about the domain and the objective function. 1
5 0.055295922 16 nips-2012-A Polynomial-time Form of Robust Regression
Author: Ozlem Aslan, Dale Schuurmans, Yao-liang Yu
Abstract: Despite the variety of robust regression methods that have been developed, current regression formulations are either NP-hard, or allow unbounded response to even a single leverage point. We present a general formulation for robust regression—Variational M-estimation—that unifies a number of robust regression methods while allowing a tractable approximation strategy. We develop an estimator that requires only polynomial-time, while achieving certain robustness and consistency guarantees. An experimental evaluation demonstrates the effectiveness of the new estimation approach compared to standard methods. 1
6 0.053970683 275 nips-2012-Privacy Aware Learning
7 0.051967945 170 nips-2012-Large Scale Distributed Deep Networks
8 0.048042864 247 nips-2012-Nonparametric Reduced Rank Regression
9 0.04233285 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
10 0.042043254 254 nips-2012-On the Sample Complexity of Robust PCA
11 0.040632129 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
12 0.040066678 187 nips-2012-Learning curves for multi-task Gaussian process regression
13 0.03975803 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
14 0.039607048 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
15 0.039441105 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
16 0.038997393 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
17 0.038270526 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
18 0.03772676 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
19 0.037724387 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
20 0.036904115 32 nips-2012-Active Comparison of Prediction Models
topicId topicWeight
[(0, 0.124), (1, 0.003), (2, 0.048), (3, -0.02), (4, 0.03), (5, 0.042), (6, -0.004), (7, 0.035), (8, 0.001), (9, -0.034), (10, 0.014), (11, 0.031), (12, -0.045), (13, -0.001), (14, -0.009), (15, -0.005), (16, -0.018), (17, -0.043), (18, 0.006), (19, 0.008), (20, 0.023), (21, -0.032), (22, 0.051), (23, 0.006), (24, -0.041), (25, -0.043), (26, 0.05), (27, 0.011), (28, -0.017), (29, -0.056), (30, 0.02), (31, -0.024), (32, -0.043), (33, -0.019), (34, -0.013), (35, 0.081), (36, -0.014), (37, -0.012), (38, -0.003), (39, -0.024), (40, 0.044), (41, -0.037), (42, -0.018), (43, 0.017), (44, -0.083), (45, -0.002), (46, -0.007), (47, 0.004), (48, 0.102), (49, -0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.86129552 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi
Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
2 0.71746123 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
Author: Andre Wibisono, Martin J. Wainwright, Michael I. Jordan, John C. Duchi
Abstract: We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most √d in convergence rate over traditional stochastic gradient methods, where d is the problem dimension. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.
3 0.70628434 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
Author: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright
Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a O(d/T ) convergence rate for strongly convex objectives in d dimensions and O( s(log d)/T ) convergence rate when the optimum is s-sparse. Our algorithm is based on successively solving a series of ℓ1 -regularized optimization problems using Nesterov’s dual averaging algorithm. We establish that the error of our solution after T iterations is at most O(s(log d)/T ), with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
4 0.67489648 217 nips-2012-Mixability in Statistical Learning
Author: Tim V. Erven, Peter Grünwald, Mark D. Reid, Robert C. Williamson
Abstract: Statistical learning and sequential prediction are two different but related formalisms to study the quality of predictions. Mapping out their relations and transferring ideas is an active area of investigation. We provide another piece of the puzzle by showing that an important concept in sequential prediction, the mixability of a loss, has a natural counterpart in the statistical setting, which we call stochastic mixability. Just as ordinary mixability characterizes fast rates for the worst-case regret in sequential prediction, stochastic mixability characterizes fast rates in statistical learning. We show that, in the special case of log-loss, stochastic mixability reduces to a well-known (but usually unnamed) martingale condition, which is used in existing convergence theorems for minimum description length and Bayesian inference. In the case of 0/1-loss, it reduces to the margin condition of Mammen and Tsybakov, and in the case that the model under consideration contains all possible predictors, it is equivalent to ordinary mixability. 1
5 0.6337465 285 nips-2012-Query Complexity of Derivative-Free Optimization
Author: Ben Recht, Kevin G. Jamieson, Robert Nowak
Abstract: This paper provides lower bounds on the convergence rate of Derivative Free Optimization (DFO) with noisy function evaluations, exposing a fundamental and unavoidable gap between the performance of algorithms with access to gradients and those with access to only function evaluations. However, there are situations in which DFO is unavoidable, and for such situations we propose a new DFO algorithm that is proved to be near optimal for the class of strongly convex objective functions. A distinctive feature of the algorithm is that it uses only Boolean-valued function comparisons, rather than function evaluations. This makes the algorithm useful in an even wider range of applications, such as optimization based on paired comparisons from human subjects, for example. We also show that regardless of whether DFO is based on noisy function evaluations or Boolean-valued function comparisons, the convergence rate is the same. 1
6 0.61723405 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization
7 0.61279356 16 nips-2012-A Polynomial-time Form of Robust Regression
8 0.60897654 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
9 0.58593571 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
10 0.56395966 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs
11 0.55340976 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent
12 0.54769647 254 nips-2012-On the Sample Complexity of Robust PCA
13 0.54263896 271 nips-2012-Pointwise Tracking the Optimal Regression Function
14 0.53745764 319 nips-2012-Sparse Prediction with the $k$-Support Norm
15 0.52759951 161 nips-2012-Interpreting prediction markets: a stochastic approach
16 0.52123433 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
17 0.52101618 95 nips-2012-Density-Difference Estimation
18 0.51743889 222 nips-2012-Multi-Task Averaging
19 0.50586653 221 nips-2012-Multi-Stage Multi-Task Feature Learning
20 0.5004288 275 nips-2012-Privacy Aware Learning
topicId topicWeight
[(0, 0.021), (16, 0.202), (17, 0.014), (21, 0.031), (38, 0.135), (39, 0.012), (42, 0.073), (54, 0.017), (55, 0.035), (68, 0.015), (74, 0.044), (76, 0.153), (80, 0.083), (92, 0.055)]
simIndex simValue paperId paperTitle
1 0.83933043 329 nips-2012-Super-Bit Locality-Sensitive Hashing
Author: Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian
Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, ⇡/2]. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments. 1
2 0.82765973 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
Author: Jeremy Weiss, Sriraam Natarajan, David Page
Abstract: Learning temporal dependencies between variables over continuous time is an important and challenging task. Continuous-time Bayesian networks effectively model such processes but are limited by the number of conditional intensity matrices, which grows exponentially in the number of parents per variable. We develop a partition-based representation using regression trees and forests whose parameter spaces grow linearly in the number of node splits. Using a multiplicative assumption we show how to update the forest likelihood in closed form, producing efficient model updates. Our results show multiplicative forests can be learned from few temporal trajectories with large gains in performance and scalability.
3 0.82132804 222 nips-2012-Multi-Task Averaging
Author: Sergey Feldman, Maya Gupta, Bela Frigyik
Abstract: We present a multi-task learning approach to jointly estimate the means of multiple independent data sets. The proposed multi-task averaging (MTA) algorithm results in a convex combination of the single-task averages. We derive the optimal amount of regularization, and show that it can be effectively estimated. Simulations and real data experiments demonstrate that MTA outperforms both maximum likelihood and James-Stein estimators, and that our approach to estimating the amount of regularization rivals cross-validation in performance but is more computationally efficient. 1
same-paper 4 0.81334233 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, Martin J. Wainwright, John C. Duchi
Abstract: We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the internet search domain. In particular, we show that our methods efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which consists of N ≈ 2.4 × 10^8 samples and d ≥ 700,000 dimensions.
5 0.74960202 148 nips-2012-Hamming Distance Metric Learning
Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov
Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1
6 0.74747199 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
7 0.74621677 275 nips-2012-Privacy Aware Learning
8 0.74529785 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method
9 0.74504608 292 nips-2012-Regularized Off-Policy TD-Learning
10 0.74448514 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes
11 0.74350381 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
12 0.7434653 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
13 0.74332529 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
14 0.74256325 179 nips-2012-Learning Manifolds with K-Means and K-Flats
15 0.74216264 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions
16 0.74116683 188 nips-2012-Learning from Distributions via Support Measure Machines
17 0.74109465 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
18 0.74088925 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification
19 0.74045944 38 nips-2012-Algorithms for Learning Markov Field Policies
20 0.74040467 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration