nips nips2013 nips2013-146 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. [sent-8, score-0.457]
2 Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. [sent-10, score-0.302]
3 The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. [sent-11, score-0.328]
4 We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. [sent-12, score-0.501]
5 1 Introduction Consider a p-dimensional probability distribution with true covariance matrix Σ0 ∈ S^p_{++} and true precision (or inverse covariance) matrix Ω0 = Σ0^{-1} ∈ S^p_{++}. [sent-14, score-0.542]
6 The centered normalized sample matrix A = [a1 · · · an] ∈ ℝ^{p×n} can be obtained as ai = (1/√n)(Ri − R̄), where R̄ = (1/n) Σ_i Ri, so that the sample covariance matrix can be computed as C = AA^T. [sent-16, score-0.361]
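As a concrete illustration, here is a minimal NumPy sketch of this preprocessing; the raw observation matrix R, its dimensions, and the random seed are hypothetical, and only the centering and normalization described above are implemented.

```python
import numpy as np

# Hypothetical raw data: n observations of a p-dimensional variable, stored column-wise.
p, n = 1000, 50
rng = np.random.default_rng(0)
R = rng.standard_normal((p, n))

# Centered, normalized sample matrix A = [a_1 ... a_n], with a_i = (R_i - R_bar) / sqrt(n).
R_bar = R.mean(axis=1, keepdims=True)
A = (R - R_bar) / np.sqrt(n)

# Sample covariance as a single product of the normalized data matrix with itself.
C = A @ A.T   # p x p
```

Because A already carries the 1/√n normalization, the single product AA^T gives the sample covariance directly.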
7 In recent years, considerable effort has been invested in obtaining an accurate estimate of the precision matrix Ω̂ based on the sample covariance matrix C in the ‘low sample, high dimensions’ setting, i.e., [sent-17, score-0.542]
8 n ≪ p, especially when the true precision Ω0 is assumed to be sparse [28]. [sent-19, score-0.254]
9 Spurred by these advances in the statistical theory of precision matrix estimation, there has been considerable recent work on developing computationally efficient optimization methods for solving the corresponding statistical estimation problems: see [1, 8, 14, 21, 13], and references therein. [sent-22, score-0.339]
10 Note further that in precision matrix estimation, the number of parameters scales quadratically with the number of variables, so that with a million dimensions, p = 10^6, the total number of parameters to be estimated is a trillion, p^2 = 10^12. [sent-24, score-0.383]
11 The focus of this paper is on designing an efficient distributed algorithm for precision matrix estimation under such ultra-large-scale dimensional settings. [sent-25, score-0.402]
12 The proposed CLIME-ADMM algorithm can scale up to millions of dimensions, and can use up to thousands of cores in a shared-memory or distributed-memory architecture. [sent-35, score-0.412]
13 First, we propose an inexact ADMM [27, 12] algorithm targeted to CLIME, where each step is either elementwise parallel or involves suitable matrix multiplications. [sent-37, score-0.333]
14 Second, we solve (1) in column-blocks of the precision matrix at a time, rather than one column at a time. [sent-39, score-0.492]
15 Since (1) already decomposes columnwise, solving multiple columns together in blocks might not seem worthwhile. [sent-40, score-0.24]
16 Moreover, matrix multiplication can be further simplified as block-by-block operations, which allows choosing optimal block sizes to minimize cache misses, leading to high scalability and performance [16, 5, 15]. [sent-42, score-0.681]
17 For example, we estimate a precision matrix of one million dimensions and one trillion parameters in 11 hours by running the algorithm on 400 cores. [sent-47, score-0.423]
18 In other recent related work based on ADMM, [23] introduce graph projection block splitting (GPBS) to split data into blocks so that examples and features can be distributed among multiple cores. [sent-52, score-0.532]
19 Our framework uses a more general blocking scheme (block cyclic distribution), which provides more options in choosing the optimal block size to improve the efficiency in the use of memory hierarchies and minimize cache misses [16, 15, 5]. [sent-53, score-0.589]
20 An element of a matrix is denoted by an upper case letter with row index i and column index j. [sent-58, score-0.319]
21 A block of a matrix is denoted by a bold face lower case letter indexed by ij. [sent-61, score-0.401]
22 Aij represents a collection of blocks of matrix A on the ij-th core (see block cyclic distribution in Section 4). [sent-64, score-0.827]
23 X ∈ ℝ^{p×k} denotes k (1 ≤ k ≤ p) columns of the precision matrix Ω̂, and E ∈ ℝ^{p×k} denotes the same k columns of the identity matrix I ∈ ℝ^{p×p}. [sent-72, score-0.509]
24 Let λmax(C) be the largest eigenvalue of the covariance matrix C. [sent-73, score-0.236]
25 Assuming a column block contains k(1 ≤ k ≤ p) columns, the sparse precision matrix estimation amounts to solving p/k independent linear programs. [sent-75, score-0.81]
26 In Algorithm 1, while steps 5, 7, 8 and 10 amount to elementwise operations which cost O(pk) operations, steps 6 and 9 involve matrix multiplication, which is the most computationally intensive part and costs O(p^2 k) operations. [sent-96, score-0.341]
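For intuition, the sketch below shows a linearized (inexact) ADMM iteration for one CLIME column block, assuming the standard splitting CX − E = Z with an ℓ∞-box constraint on Z and a scaled dual variable Y; the step-size rule (η ≥ λmax(C)^2), the update order, and the fixed iteration count are illustrative assumptions and need not match the exact form of Algorithm 1.

```python
import numpy as np

def soft_threshold(M, tau):
    """Elementwise soft-thresholding operator."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def clime_admm_block(C, E, lam, beta=1.0, iters=200):
    """Sketch: min ||X||_1  s.t.  ||C X - E||_inf <= lam, for one column block.

    Splitting: C X - E = Z, with Z confined to the box [-lam, lam], scaled dual Y.
    """
    p, k = E.shape
    X = np.zeros((p, k))
    Z = np.zeros((p, k))
    Y = np.zeros((p, k))
    eta = np.linalg.eigvalsh(C)[-1] ** 2 + 1e-12   # needs eta >= lambda_max(C)^2
    U = C @ X                                      # U_t = C X_t
    for _ in range(iters):
        grad = C @ (U - E - Z + Y)                 # gradient of the linearized quadratic
        X = soft_threshold(X - grad / eta, 1.0 / (beta * eta))
        U = C @ X                                  # the two products by C dominate: O(p^2 k)
        Z = np.clip(U - E + Y, -lam, lam)          # projection onto the l_inf box
        Y = Y + U - E - Z                          # dual update
    return X
```

Only the two multiplications by C per iteration cost O(p^2 k); every other update is an elementwise soft-threshold, clip, or addition, which matches the cost breakdown above.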
27 Sparse Structure: As we detail here, there could be sparsity in the intermediate iterates or in the sample covariance matrix itself (or a perturbed version thereof), which can be exploited to make our CLIME-ADMM variant more efficient. [sent-110, score-0.348]
28 Iterate Sparsity: As the iterations progress, the soft-thresholding operation will yield a sparse X_{t+1}, which can help speed up step 6, U_{t+1} = C X_{t+1}, via sparse matrix multiplication. [sent-111, score-0.271]
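As a small illustration (assuming SciPy is available), converting a mostly-zero iterate to a compressed sparse column matrix lets the C X_{t+1} product in step 6 be carried out as a sparse-times-dense multiplication; the sizes and sparsity level below are hypothetical.

```python
import numpy as np
from scipy import sparse

p, k = 2000, 50
rng = np.random.default_rng(1)
C = rng.standard_normal((p, p))
C = C @ C.T / p                                   # dense, symmetric covariance-like matrix

X = np.zeros((p, k))
X[rng.integers(0, p, 200), rng.integers(0, k, 200)] = 1.0   # sparse iterate X_{t+1}

X_sp = sparse.csc_matrix(X)
U = (X_sp.T @ C.T).T                              # equals C @ X, but exploits sparsity of X
assert np.allclose(U, C @ X)
```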
29 The statistical guarantees for CLIME [3], including convergence in spectral, matrix ℓ1, and Frobenius norms, only require from the sample covariance matrix C a deviation bound of the form ‖C − Σ0‖∞ ≤ c√(log p / n), for some constant c. [sent-117, score-0.361]
30 Accordingly, if we perturb the matrix C with a perturbation matrix ∆ so that the perturbed matrix (C + ∆) continues to satisfy the deviation bound, the statistical guarantees for CLIME would hold even if we used the perturbed matrix (C + ∆). [sent-118, score-0.639]
31 Thus, in practice, even if the sample covariance matrix is only close to a sparse matrix [21, 13], or if it is close to being block diagonal [21, 13], the complexity of matrix multiplication in steps 6 and 9 can be significantly reduced via the above perturbations. [sent-125, score-0.934]
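A minimal sketch of such a perturbation is entrywise thresholding of C at the level of the deviation bound; the constant c below is a hypothetical choice, not a value prescribed by the paper.

```python
import numpy as np

def threshold_covariance(C, n, c=0.5):
    """Zero out entries of C smaller than c*sqrt(log p / n).

    The perturbation Delta = C_thr - C then satisfies ||Delta||_inf <= c*sqrt(log p / n),
    so the deviation bound is preserved up to the constant, while C_thr may be much
    sparser (or closer to block diagonal) and cheaper to multiply with.
    """
    p = C.shape[0]
    tau = c * np.sqrt(np.log(p) / n)
    return np.where(np.abs(C) >= tau, C, 0.0)
```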
32 Since the method will operate in a low-sample setting, we can alternatively use the low rank of the sample covariance matrix to reduce the complexity of matrix multiplication. [sent-128, score-0.361]
33 Since C = AA^T and p ≫ n, CX = A(A^T X), and thus the computational complexity of matrix multiplication reduces from O(p^2 k) to O(npk), which can achieve significant speedup for small n. [sent-129, score-0.423]
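The reordering is a one-line change in how the product is parenthesized; a sketch with hypothetical sizes (p ≫ n):

```python
import numpy as np

p, n, k = 100_000, 100, 50          # hypothetical low-sample sizes with p >> n
A = np.random.default_rng(2).standard_normal((p, n))
X = np.random.default_rng(3).standard_normal((p, k))

# Naive ordering would form C = A @ A.T (a p x p matrix) and then compute C @ X: O(p^2 k).
# The low-rank ordering never forms C explicitly and gives the same result up to rounding.
U = A @ (A.T @ X)                   # O(npk) flops
```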
34 Assume the p × p precision matrix Ω̂ is evenly divided into l = p/k (≥ q) column blocks X^1, · · · , X^q, · · · , X^l, and thus each column block contains k columns. [sent-135, score-0.515] [sent-137, score-0.398]
36 The column blocks are assigned to q cores cyclically, which means the j-th column block is assigned to the mod(j, q)-th core. [sent-138, score-1.149]
37 The q cores can solve q column blocks in parallel without communication and synchronization, which can be simply implemented via multithreading. [sent-139, score-0.768]
38 Meanwhile, another q column blocks are waiting in their respective queues. [sent-140, score-0.404]
39 Figure 1(a) gives an example of how to solve 8 column blocks on 4 cores in a shared-memory environment. [sent-141, score-0.705]
40 While the 4 cores are solving the first 4 column blocks, the next 4 column blocks are waiting in queues (red arrows). [sent-142, score-0.916]
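A minimal multithreaded sketch of this cyclic assignment and the per-core queues is given below; the block solver is a placeholder for the CLIME-ADMM column-block solve, and building the E blocks from a dense identity is only for illustration (it assumes p is small enough to materialize).

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def solve_column_block(C, E_block, lam):
    """Placeholder for the per-block CLIME-ADMM solve."""
    return np.zeros_like(E_block)

def solve_all_blocks(C, lam, k, q):
    """Cyclically assign the l = p/k column blocks to q cores and solve them."""
    p = C.shape[0]
    blocks = [np.eye(p)[:, j:j + k] for j in range(0, p, k)]    # E^1, ..., E^l
    # Column block j goes to core mod(j, q); each core works through its own queue.
    queues = {core: [j for j in range(len(blocks)) if j % q == core] for core in range(q)}
    results = [None] * len(blocks)

    def run_queue(core):
        for j in queues[core]:
            results[j] = solve_column_block(C, blocks[j], lam)

    # NumPy/BLAS releases the GIL, so threads give real parallelism for the matrix products.
    with ThreadPoolExecutor(max_workers=q) as pool:
        list(pool.map(run_queue, range(q)))
    return np.hstack(results)

C_demo = np.eye(8)
X_hat = solve_all_blocks(C_demo, lam=0.1, k=2, q=4)             # shape (8, 8)
```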
41 Although the shared-memory framework is free from communication and synchronization, the limited resources prevent it from scaling up to datasets with millions of dimensions, which cannot be loaded into the memory of a single machine or solved by tens of cores in a reasonable time. [sent-143, score-0.58]
42 Assume q processes are formed as an r × c process grid and the p × p precision matrix Ω̂ is evenly divided into l = p/k (≥ q) column blocks. [sent-145, score-0.563]
43 We solve one column block X^j at a time in the process grid. [sent-148, score-0.426]
44 Assume the data matrix A has been evenly distributed into the process grid and A_ij is the data on the ij-th core. [sent-149, score-0.287]
45 Figure 1(b) illustrates that the 2 × 2 process grid is computing the first column block X^1 while the second column block X^2 is waiting in queues (red lines), assuming X^1, X^2 are distributed into the process grid in the same way as A and X^1_ij is the block of X^1 assigned to the ij-th core. [sent-152, score-1.313]
46 A typical issue in parallel computation is load imbalance, which is mainly caused by the computational disparity among cores and leads to unsatisfactory speedups. [sent-153, score-0.491]
47 The following discussion focuses on the matrix multiplication in step 6 of Algorithm 1. [sent-155, score-0.26]
48 The matrix multiplication U = A(A^T X^1) can be decomposed into two steps, i.e., W = A^T X^1 followed by U = AW. [sent-157, score-0.26]
49 Dividing matrices A, X evenly into r × c large consecutive blocks like [23] will lead to load imbalance. [sent-160, score-0.362]
50 First, large consecutive blocks may assign dense blocks to some processes and sparse blocks to the other processes. [sent-162, score-0.711]
51 Second, some processes will hold no blocks after the multiplication when large blocks are used, since W is a small matrix compared to A and X. [sent-163, score-0.662]
52 Third, large blocks may not fit in the cache, leading to cache misses. [sent-166, score-0.302]
53 Therefore, we use block cyclic data distribution, which uses small nonconsecutive blocks and thus can largely achieve load balance and scalability. [sent-167, score-0.624]
54 A matrix is first divided into consecutive blocks of size p_b × n_b. [sent-168, score-0.497]
55 A is divided into 3 × 2 consecutive blocks, where each block is of size p_b × n_b. [sent-172, score-0.411]
56 Green blocks will be assigned to the upper left process. [sent-174, score-0.238]
57 The distribution of X^1 can be done in a similar way, except the block size should be p_b × k_b, where p_b is chosen to guarantee that the matrix multiplication A^T X^1 works. [sent-177, score-0.708]
58 In particular, we denote p_b × n_b × k_b as the block size for matrix multiplication. [sent-178, score-0.543]
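A minimal sketch of the block cyclic owner map for an r × c process grid is shown below, assuming 0-based indices; the grid and block sizes are hypothetical.

```python
def block_cyclic_owner(i, j, pb, nb, r, c):
    """Return the (row, col) coordinates of the process owning matrix entry (i, j).

    The matrix is first tiled into consecutive pb x nb blocks; block (I, J) is then
    assigned to process (I mod r, J mod c), so blocks wrap cyclically over the grid.
    """
    block_row, block_col = i // pb, j // nb
    return block_row % r, block_col % c

# Example: a 6 x 4 matrix tiled into 2 x 2 blocks on a 2 x 2 process grid.
owners = [[block_cyclic_owner(i, j, pb=2, nb=2, r=2, c=2) for j in range(4)]
          for i in range(6)]
```

Because consecutive blocks wrap around the grid, dense and sparse regions of A are spread over all processes, which is what yields the load balance discussed above.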
59 To distribute the data in a block cyclic manner, we use a parallel I/O scheme, where processes can access the data in parallel and only read/write the assigned blocks. [sent-179, score-0.511]
60 We run each method using different parameters (which control the sparsity of the solution), and plot the precision and recall for each solution in Figure 2(b). [sent-201, score-0.249]
61 Figure 2: Synthetic datasets ((a) Runtime, (b) Precision and recall). Figure 3: Shared-Memory ((a) Speedup S_k^col, (b) Speedup S_q^core). [sent-225, score-0.558]
62 A block-diagonal structure in the sample covariance matrix can be easily incorporated into the matrix multiplication in CLIME-ADMM to achieve a sharp computational gain. [sent-231, score-0.496]
63 Note Tiger can estimate the sparse precision matrix column-by-column in parallel, while CLIME-ADMM solves CLIME in column-blocks in parallel. [sent-235, score-0.335]
64 Scalability of CLIME-ADMM: We evaluate the scalability of CLIME-ADMM in a shared-memory and a distributed-memory architecture in terms of two kinds of speedups. [sent-237, score-0.303]
65 The first speedup is defined as the time on 1 core, T_1^core, over the time on q cores, T_q^core, i.e., S_q^core = T_1^core / T_q^core. [sent-238, score-1.399]
66 The speedup of solving CLIME in column blocks of size k over a single column is defined as S_k^col = T_1^col / T_k^col. [sent-243, score-1.097]
67 Shared-memory: We estimate a precision matrix with p = 10^4 dimensions on a server with 20 cores and 64GB memory. [sent-247, score-0.699]
68 We run the algorithm on different numbers of cores, q = 1, 5, 10, 20, and with different column block sizes k. [sent-249, score-0.716]
69 The speedup S_k^col is plotted in Figure 3(a), which shows the results for three different numbers of cores. [sent-250, score-0.325]
70 For k ≥ 20, the speedups are maintained on 1 core and 5 cores, but decrease on 10 and 20 cores. [sent-252, score-0.278]
71 For a fixed k, more columns are involved in the computation when more cores are used, leading to more memory consumption and competition for the shared cache. [sent-254, score-0.779]
72 The ideal linear speedups are achieved on 5 cores for all block sizes k. [sent-256, score-0.753]
73 On 10 cores, while small and medium column block sizes can maintain the ideal linear speedups, the large column block sizes fail to scale linearly. [sent-257, score-0.934]
74 The failure to achieve a linear speedup propagates to small and medium column block sizes on 20 cores, although their speedups are larger than those of the large column block sizes. [sent-258, score-1.12]
75 As more and more column blocks participate in the computation, the speedups decrease, possibly because of competition for shared resources (e.g., memory and cache). [sent-259, score-0.462]
76 Table 1 (runtime, sec): comparison of CLIME-ADMM (1 core and 8 cores) with DC-QUIC and Tiger at different sparsity levels (some entries exceed 1 day). [sent-263, score-0.068]
77 Table 2: Effect (runtime, sec) of using different numbers of cores in a node with p = 10^6. [sent-283, score-0.456]
78 Distributed-memory: We estimate a precision matrix with one million dimensions (p = 10^6), which contains one trillion parameters (p^2 = 10^12). [sent-313, score-0.125]
79 We use 1 core per node to avoid the competition for resources that we observed in the shared-memory case. [sent-315, score-0.27]
80 The block size p_b × n_b × k_b for matrix multiplication is 10 × 10 × 1 for k ≤ 10 and 10 × 10 × 10 for k > 10. [sent-317, score-0.583]
81 Since the column block CLIME problems are totally independent, we report the speedups for solving a single column block. [sent-318, score-0.807]
82 The speedup S_k^col is plotted in Figure 4(a), where the speedups are larger and more stable than those in the shared-memory environment. [sent-319, score-0.324]
83 The speedup keeps increasing as the column block size increases, before leveling off at a certain value. [sent-320, score-0.561]
84 For any column block size, the speedup S_q^core also increases as the number of cores increases. [sent-321, score-1.185]
85 A single column (k = 1) fails to achieve linear speedups when hundreds of cores are used. [sent-323, score-0.639]
86 However, when using column blocks with k > 1, the ideal linear speedups are achieved with an increasing number of cores. [sent-324, score-0.557]
87 Note that due to distributed memory, the larger column block sizes also scale linearly, unlike in the shared memory setting, where the speedups were limited due to resource sharing. [sent-325, score-0.761]
88 As we have seen, the choice of k depends on the size of the process grid, the block size in matrix multiplication, the cache size, and probably the sparsity pattern of the matrices. [sent-326, score-0.534]
89 In Table 2, we compare the performance of 1 core per node to that of using 4 cores per node, which mixes the effects of shared-memory and distributed-memory architectures. [sent-327, score-0.517]
90 For small column block sizes (k = 1, 5), the use of multiple cores in a node is almost two times slower than the use of a single core in a node. [sent-328, score-0.915]
91 For other column block sizes, it is still 30% slower. [sent-329, score-0.398]
92 Finally, we ran CLIME-ADMM on 400 cores with one core per node and block size k = 500, and the entire computation took about 11 hours. [sent-330, score-0.757]
93 6 Conclusions In this paper, we presented a large-scale distributed framework for the estimation of a sparse precision matrix using CLIME. [sent-331, score-0.537]
94 The framework is based on inexact ADMM, which decomposes the constrained optimization problem into elementary matrix multiplications and elementwise operations. [sent-333, score-0.373]
95 The proposed framework solves CLIME in column-blocks and uses block cyclic distribution to achieve load balancing. [sent-335, score-0.482]
96 A constrained ℓ1 minimization approach to sparse precision matrix estimation. [sent-378, score-0.41]
97 Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. [sent-384, score-0.282]
98 A new parallel matrix multiplication algorithm on distributed-memory concurrent computers. [sent-388, score-0.323]
99 An inexact interior point method for L1-regularized sparse covariance selection. [sent-495, score-0.281]
100 An R package flare for high dimensional linear regression and precision matrix estimation. [sent-502, score-0.306]
wordName wordTfidf (topN-words)
[('clime', 0.397), ('cores', 0.318), ('block', 0.24), ('blocks', 0.201), ('yt', 0.186), ('precision', 0.181), ('admm', 0.168), ('speedup', 0.163), ('column', 0.158), ('core', 0.153), ('zt', 0.148), ('multiplication', 0.135), ('cx', 0.134), ('col', 0.126), ('speedups', 0.125), ('matrix', 0.125), ('ut', 0.114), ('vt', 0.111), ('covariance', 0.111), ('tiger', 0.108), ('cyclic', 0.108), ('cache', 0.101), ('aij', 0.092), ('xij', 0.089), ('xt', 0.087), ('cxt', 0.085), ('pb', 0.083), ('elementwise', 0.081), ('leukemia', 0.078), ('memory', 0.078), ('eij', 0.077), ('load', 0.075), ('sparse', 0.073), ('sq', 0.07), ('sparsity', 0.068), ('cyt', 0.067), ('inexact', 0.064), ('distributed', 0.063), ('parallel', 0.063), ('hours', 0.063), ('millions', 0.062), ('loaded', 0.059), ('box', 0.058), ('trillion', 0.054), ('nb', 0.053), ('evenly', 0.051), ('perturb', 0.051), ('grid', 0.048), ('climate', 0.046), ('dhillon', 0.046), ('day', 0.046), ('node', 0.046), ('dimensions', 0.045), ('sk', 0.045), ('architectures', 0.045), ('waiting', 0.045), ('scalability', 0.044), ('climeadmm', 0.044), ('openmp', 0.044), ('scalapack', 0.044), ('perturbed', 0.044), ('alternating', 0.042), ('multiplications', 0.042), ('kb', 0.042), ('hsieh', 0.041), ('architecture', 0.04), ('synchronization', 0.039), ('scalable', 0.039), ('columns', 0.039), ('competition', 0.038), ('hundreds', 0.038), ('assigned', 0.037), ('queues', 0.036), ('minnesota', 0.036), ('sizes', 0.036), ('letter', 0.036), ('plotted', 0.036), ('ij', 0.035), ('consecutive', 0.035), ('ideal', 0.034), ('soft', 0.034), ('interior', 0.033), ('estimation', 0.033), ('resources', 0.033), ('participating', 0.032), ('misses', 0.032), ('scale', 0.032), ('scales', 0.032), ('aat', 0.031), ('ravikumar', 0.031), ('constrained', 0.031), ('framework', 0.03), ('lp', 0.03), ('argminx', 0.03), ('server', 0.03), ('solves', 0.029), ('resource', 0.029), ('splitting', 0.028), ('rates', 0.028), ('solve', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
2 0.17223902 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack
Abstract: The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIG&QUIC, which can solve 1 million dimensional ℓ1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIG&QUIC can achieve super-linear or even quadratic convergence rates. 1
3 0.15724128 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani
Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a lowrank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allow both to scale to very high-dimensional neural data. 1
4 0.11717467 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
Author: Tuo Zhao, Han Liu
Abstract: We propose a semiparametric method for estimating sparse precision matrix of high dimensional elliptical distribution. The proposed method calibrates regularizations when estimating each column of the precision matrix. Thus it not only is asymptotically tuning free, but also achieves an improved finite sample performance. Theoretically, we prove that the proposed method achieves the parametric rates of convergence in both parameter estimation and model selection. We present numerical results on both simulated and real datasets to support our theory and illustrate the effectiveness of the proposed estimator. 1
5 0.11626541 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
Author: David Pfau, Eftychios A. Pnevmatikakis, Liam Paninski
Abstract: Recordings from large populations of neurons make it possible to search for hypothesized low-dimensional dynamics. Finding these dynamics requires models that take into account biophysical constraints and can be fit efficiently and robustly. Here, we present an approach to dimensionality reduction for neural data that is convex, does not make strong assumptions about dynamics, does not require averaging over many trials and is extensible to more complex statistical models that combine local and global influences. The results can be combined with spectral methods to learn dynamical systems models. The basic method extends PCA to the exponential family using nuclear norm minimization. We evaluate the effectiveness of this method using an exact decomposition of the Bregman divergence that is analogous to variance explained for PCA. We show on model data that the parameters of latent linear dynamical systems can be recovered, and that even if the dynamics are not stationary we can still recover the true latent subspace. We also demonstrate an extension of nuclear norm minimization that can separate sparse local connections from global latent dynamics. Finally, we demonstrate improved prediction on real neural data from monkey motor cortex compared to fitting linear dynamical models without nuclear norm smoothing. 1
6 0.11592276 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
7 0.11538184 188 nips-2013-Memory Limited, Streaming PCA
8 0.10699368 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
9 0.099988282 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
10 0.091612823 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
11 0.091360763 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
12 0.086336017 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
13 0.085868984 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
14 0.085460901 91 nips-2013-Dirty Statistical Models
15 0.085458525 280 nips-2013-Robust Data-Driven Dynamic Programming
16 0.085448325 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
17 0.084829532 348 nips-2013-Variational Policy Search via Trajectory Optimization
18 0.082808539 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
19 0.081011824 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
20 0.080235332 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
topicId topicWeight
[(0, 0.213), (1, 0.058), (2, 0.068), (3, 0.044), (4, -0.046), (5, 0.023), (6, -0.043), (7, 0.073), (8, -0.038), (9, -0.103), (10, -0.012), (11, -0.119), (12, 0.057), (13, 0.021), (14, -0.145), (15, 0.023), (16, -0.014), (17, 0.039), (18, -0.022), (19, 0.064), (20, 0.03), (21, -0.116), (22, 0.041), (23, -0.084), (24, 0.031), (25, -0.067), (26, 0.016), (27, -0.092), (28, -0.029), (29, -0.036), (30, 0.122), (31, 0.051), (32, 0.044), (33, 0.08), (34, -0.023), (35, 0.045), (36, -0.089), (37, 0.076), (38, 0.045), (39, -0.019), (40, 0.052), (41, -0.09), (42, 0.064), (43, 0.11), (44, -0.123), (45, 0.111), (46, 0.005), (47, -0.113), (48, 0.026), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.9486776 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
2 0.78015703 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Author: Cho-Jui Hsieh, Matyas A. Sustik, Inderjit Dhillon, Pradeep Ravikumar, Russell Poldrack
Abstract: The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program with number of parameters scaling quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIG&QUIC, which can solve 1 million dimensional ℓ1-regularized Gaussian MLE problems (which would thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations; and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIG&QUIC can achieve super-linear or even quadratic convergence rates. 1
3 0.63881326 198 nips-2013-More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server
Author: Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, Eric Xing
Abstract: We propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model’s values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes. 1
4 0.610403 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
Author: Marius Pachitariu, Biljana Petreska, Maneesh Sahani
Abstract: Population neural recordings with long-range temporal structure are often best understood in terms of a common underlying low-dimensional dynamical process. Advances in recording technology provide access to an ever-larger fraction of the population, but the standard computational approaches available to identify the collective dynamics scale poorly with the size of the dataset. We describe a new, scalable approach to discovering low-dimensional dynamics that underlie simultaneously recorded spike trains from a neural population. We formulate the Recurrent Linear Model (RLM) by generalising the Kalman-filter-based likelihood calculation for latent linear dynamical systems to incorporate a generalised-linear observation process. We show that RLMs describe motor-cortical population data better than either directly-coupled generalised-linear models or latent linear dynamical system models with generalised-linear observations. We also introduce the cascaded generalised-linear model (CGLM) to capture low-dimensional instantaneous correlations in neural populations. The CGLM describes the cortical recordings better than either Ising or Gaussian models and, like the RLM, can be fit exactly and quickly. The CGLM can also be seen as a generalisation of a lowrank Gaussian model, in this case factor analysis. The computational tractability of the RLM and CGLM allow both to scale to very high-dimensional neural data. 1
5 0.58883572 73 nips-2013-Convex Relaxations for Permutation Problems
Author: Fajwel Fogel, Rodolphe Jenatton, Francis Bach, Alexandre D'Aspremont
Abstract: Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing for example. We prove the equivalence between the seriation and the combinatorial 2-SUM problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case and we produce a convex relaxation for the 2-SUM problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, to solve semi-supervised seriation problems. We present numerical experiments on archeological data, Markov chains and gene sequences. 1
6 0.57246333 280 nips-2013-Robust Data-Driven Dynamic Programming
7 0.55452353 352 nips-2013-What do row and column marginals reveal about your dataset?
8 0.53453362 302 nips-2013-Sparse Inverse Covariance Estimation with Calibration
9 0.52129436 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
10 0.51371235 188 nips-2013-Memory Limited, Streaming PCA
11 0.50363749 306 nips-2013-Speeding up Permutation Testing in Neuroimaging
12 0.50141221 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
13 0.50046861 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
14 0.49642125 186 nips-2013-Matrix factorization with binary components
15 0.49471837 314 nips-2013-Stochastic Optimization of PCA with Capped MSG
16 0.49444064 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals
17 0.49045876 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
18 0.48793307 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
19 0.48687479 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
20 0.47926536 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
topicId topicWeight
[(16, 0.055), (33, 0.117), (34, 0.07), (41, 0.026), (49, 0.027), (56, 0.075), (70, 0.022), (85, 0.023), (89, 0.038), (93, 0.464), (95, 0.021)]
simIndex simValue paperId paperTitle
1 0.94887328 339 nips-2013-Understanding Dropout
Author: Pierre Baldi, Peter J. Sadowski
Abstract: Dropout is a relatively new algorithm for training neural networks which relies on stochastically “dropping out” neurons during training in order to avoid the co-adaptation of feature detectors. We introduce a general formalism for studying dropout on either units or connections, with arbitrary probability values, and use it to analyze the averaging and regularizing properties of dropout in both linear and non-linear networks. For deep neural networks, the averaging properties of dropout are characterized by three recursive equations, including the approximation of expectations by normalized weighted geometric means. We provide estimates and bounds for these approximations and corroborate the results with simulations. Among other results, we also show how dropout performs stochastic gradient descent on a regularized error function. 1
2 0.89598006 65 nips-2013-Compressive Feature Learning
Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie
Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1
3 0.89505136 211 nips-2013-Non-Linear Domain Adaptation with Boosting
Author: Carlos J. Becker, Christos M. Christoudias, Pascal Fua
Abstract: A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multitask learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state of the art. 1
same-paper 4 0.8718003 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
Author: Huahua Wang, Arindam Banerjee, Cho-Jui Hsieh, Pradeep Ravikumar, Inderjit Dhillon
Abstract: We consider the problem of sparse precision matrix estimation in high dimensions using the CLIME estimator, which has several desirable theoretical properties. We present an inexact alternating direction method of multiplier (ADMM) algorithm for CLIME, and establish rates of convergence for both the objective and optimality conditions. Further, we develop a large scale distributed framework for the computations, which scales to millions of dimensions and trillions of parameters, using hundreds of cores. The proposed framework solves CLIME in columnblocks and only involves elementwise operations and parallel matrix multiplications. We evaluate our algorithm on both shared-memory and distributed-memory architectures, which can use block cyclic distribution of data and parameters to achieve load balance and improve the efficiency in the use of memory hierarchies. Experimental results show that our algorithm is substantially more scalable than state-of-the-art methods and scales almost linearly with the number of cores. 1
5 0.86348867 172 nips-2013-Learning word embeddings efficiently with noise-contrastive estimation
Author: Andriy Mnih, Koray Kavukcuoglu
Abstract: Continuous-valued word embeddings learned by neural language models have recently been shown to capture semantic and syntactic information about words very well, setting performance records on several word similarity tasks. The best results are obtained by learning high-dimensional embeddings from very large quantities of data, which makes scalability of the training method a critical factor. We propose a simple and scalable new approach to learning word embeddings based on training log-bilinear models with noise-contrastive estimation. Our approach is simpler, faster, and produces better results than the current state-of-theart method. We achieve results comparable to the best ones reported, which were obtained on a cluster, using four times less data and more than an order of magnitude less computing time. We also investigate several model types and find that the embeddings learned by the simpler models perform at least as well as those learned by the more complex ones. 1
6 0.72745073 215 nips-2013-On Decomposing the Proximal Map
7 0.69324887 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning
8 0.64576191 99 nips-2013-Dropout Training as Adaptive Regularization
9 0.63641244 82 nips-2013-Decision Jungles: Compact and Rich Models for Classification
10 0.62604856 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
11 0.61694372 96 nips-2013-Distributed Representations of Words and Phrases and their Compositionality
12 0.60386688 30 nips-2013-Adaptive dropout for training deep neural networks
13 0.60130131 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
14 0.58376759 251 nips-2013-Predicting Parameters in Deep Learning
15 0.57394785 5 nips-2013-A Deep Architecture for Matching Short Texts
16 0.56914067 69 nips-2013-Context-sensitive active sensing in humans
17 0.55314583 263 nips-2013-Reasoning With Neural Tensor Networks for Knowledge Base Completion
18 0.55197799 64 nips-2013-Compete to Compute
19 0.54792434 183 nips-2013-Mapping paradigm ontologies to and from the brain
20 0.54737401 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization