
131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent


Source: pdf

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. [sent-8, score-0.066]

2 Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. [sent-10, score-0.754]

3 We give a unified convergence analysis for the family of block-greedy algorithms. [sent-11, score-0.075]

4 The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. [sent-12, score-1.108]

5 Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. [sent-13, score-0.075]

6 We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems. [sent-14, score-0.105]

7 In recent years, coordinate descent (CD) algorithms have been shown to be efficient for this class of problems [Friedman et al. [sent-17, score-0.426]

8 Motivated by the need to solve large-scale ℓ1-regularized problems, researchers have begun to explore parallel algorithms. [sent-20, score-0.182]

9 [2012] have developed “GenCD”, a generic framework for expressing parallel coordinate descent algorithms. [sent-24, score-0.5]

10 As our first contribution, we describe a general randomized block-greedy algorithm that includes all three as special cases. [sent-30, score-0.137]

11 The block-greedy algorithm has two parameters: B, the total number of feature blocks, and P, the size of the random subset of the B blocks that is chosen at every time step. [sent-31, score-0.583]

12 Second, we present a non-asymptotic convergence rate analysis for the randomized block-greedy coordinate descent algorithms for general values of B ∈ {1, . . . [sent-33, score-0.589]

13 , p} (as the number of blocks cannot exceed the number of features) and P ∈ {1, . . . [sent-36, score-0.265]

14 This result therefore applies to stochastic CD, greedy CD, Shotgun, and thread-greedy. [sent-40, score-0.075]

15 Our general convergence result, and in particular its instantiation to thread-greedy CD, is novel. [sent-42, score-0.075]

16 Third, based on the convergence rate analysis for block-greedy, we optimize a certain “block spectral radius” associated with the design matrix. [sent-43, score-0.164]

17 We show that the block spectral radius can be upper bounded by the maximum inner product (or correlation if features are mean zero) between features in distinct blocks. [sent-45, score-0.692]

18 This motivates the use of correlation-based feature clustering to accelerate the convergence of the thread-greedy algorithm. [sent-46, score-0.239]

19 Finally, we conduct an experimental study using a simple clustering heuristic. [sent-47, score-0.111]

20 We observe dramatic acceleration due to clustering for smaller values of the regularization parameter, and we identify characteristics that require particularly close attention for heavily regularized problems and that can be improved upon in future work. [sent-48, score-0.206]

21 [2012] describe “GenCD”, a generic framework for parallel coordinate descent algorithms, in which a parallel coordinate descent algorithm can be determined by specifying a select step and an accept step. [sent-50, score-1.053]

22 At each iteration, features chosen by select are evaluated, and a proposed increment is generated for each corresponding feature weight. [sent-51, score-0.236]

23 In these terms, we consider the block-greedy algorithm that takes as part of the input a partition of the features into B blocks. [sent-53, score-0.125]

24 Given this, each iteration selects features corresponding to a set of P randomly selected blocks, and accepts a single feature from each block, based on an estimate of the resulting reduction in the objective function. [sent-54, score-0.178]

25 The pseudocode for the randomized block-greedy coordinate descent is given by Algorithm 1. [sent-55, score-0.514]

26 The algorithm can be applied to any function of the form F + R where F is smooth and convex, and R is convex and separable across coordinates. [sent-56, score-0.134]

27 The greedy step chooses a feature within a block for which the guaranteed descent in the objective function (if that feature alone were updated) is maximized. (Figure 1: The design space of block-greedy coordinate descent.) [sent-58, score-0.814]
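
As an illustration of the select/greedy-accept structure described above, here is a minimal sketch of one randomized block-greedy iteration. It assumes a squared-error smooth part F(w) = ½‖Aw − y‖² and no regularizer (the R = 0 case mentioned below), so the proposed increment is simply −g_j/β; the function name and arguments are illustrative and not the paper's Algorithm 1.

```python
import numpy as np

def block_greedy_step(w, A, y, blocks, P, beta, rng):
    """One sketch iteration of randomized block-greedy CD for
    F(w) = 0.5 * ||A w - y||^2 with no regularizer (R = 0).

    blocks : list of index arrays partitioning the p features into B blocks
    P      : number of blocks selected this iteration (1 <= P <= B)
    beta   : constant of the second-order (quadratic) upper bound on F
    """
    grad = A.T @ (A @ w - y)              # gradient of the smooth part F at w
    chosen = rng.choice(len(blocks), size=P, replace=False)
    for b in chosen:                      # conceptually done in parallel
        idx = blocks[b]
        g = grad[idx]                     # stale gradient, as in parallel updates
        eta = -g / beta                   # proposed increments within the block
        gain = g**2 / (2 * beta)          # guaranteed descent if updated alone
        best = np.argmax(gain)            # greedy accept: one feature per block
        w[idx[best]] += eta[best]
    return w
```

Here `blocks` could come from a random permutation of the features or from the clustering heuristic of Section 4.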

28 To arrive at a heuristic understanding, it is best to think of |η_j| as being proportional to the absolute value |∇_j F(w)| of the j-th entry of the gradient of the smooth part F. [sent-61, score-0.211]

29 In fact, if R is zero (no regularization) then this heuristic is exact. [sent-62, score-0.106]

30 For instance when B = p, each feature is a block on its own. [sent-71, score-0.279]

31 Then, setting P = 1 amounts to randomly choosing a single coordinate and updating it, which gives us the stochastic CD algorithm of Shalev-Shwartz and Tewari [2011]. [sent-72, score-0.219]

32 Then block-greedy CD is a deterministic algorithm and becomes the greedy CD algorithm of Li and Osher [2009]; Dhillon et al. [sent-78, score-0.124]

33 When this is the case, and we choose to update all blocks in parallel each time (P = B), we arrive at the thread-greedy algorithm of Scherrer et al. [sent-81, score-0.465]
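
To make the design space concrete, here is a small helper that names the special cases discussed above from the pair (B, P). The B = 1 condition for greedy CD is inferred from context (the condition is cut from this excerpt), and the Shotgun setting is taken as the usual B = p with P > 1, so treat the mapping as a sketch rather than the paper's exact taxonomy.

```python
def block_greedy_special_case(B, P, p):
    """Name the known corner of the (B, P) design space, as a rough guide."""
    if B == p and P == 1:
        return "stochastic CD (Shalev-Shwartz and Tewari, 2011)"
    if B == 1 and P == 1:
        return "greedy CD (Li and Osher, 2009; Dhillon et al.)"
    if B == p and P > 1:
        return "Shotgun-style parallel CD (Bradley et al.)"
    if P == B:
        return "thread-greedy CD (Scherrer et al., 2012)"
    return "general randomized block-greedy CD"
```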

34 In this section, we give a sufficient condition for convergence and derive a convergence rate assuming this condition. [sent-85, score-0.15]

35 express the convergence criterion for the Shotgun algorithm in terms of the spectral radius (maximal eigenvalue) ρ(XᵀX). [sent-87, score-0.193]

36 The intuition is that if features from different blocks are almost orthogonal, then the matrices M in 𝓜 will be close to the identity and will therefore have small ρ(M). [sent-90, score-0.39]

37 Highly correlated features within a block do not increase ρ_block. [sent-91, score-0.351]

38 As we said above, we will assume that we are minimizing a “smooth plus separable” convex function F + R, where the convex differentiable function F : ℝᵖ → ℝ [sent-92, score-0.098]

39 The function R is separable across coordinates: R(w) = ∑_{j=1}^{p} r(w_j). [sent-97, score-0.064]

40 The quantity η_j appearing in Algorithm 1 serves to quantify the guaranteed descent (based on a second-order upper bound with constant β) if feature j alone is updated, and is obtained as the solution of the one-dimensional minimization problem η_j = argmin_η ∇_j F(w) η + (β/2) η² + r(w_j + η) − r(w_j). [sent-99, score-0.34]

41 Note that if there is no regularization, then η_j is simply −∇_j F(w)/β = −g_j/β (if we denote ∇_j F(w) by g_j for brevity). [sent-100, score-0.22]

42 In the general case, by the first-order optimality conditions for the above one-dimensional convex optimization problem, we have g_j + β η_j + ν_j = 0, where ν_j is a subgradient of r at w_j + η_j. [sent-101, score-0.21]
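
For concreteness, here is the closed form of this one-dimensional solve when the separable term is an ℓ1 penalty, r(t) = λ|t| (this specific choice of r and the helper name are assumptions for illustration): the minimizer is a soft-threshold step, and the snippet checks the optimality condition g_j + β η_j + ν_j = 0 numerically.

```python
import numpy as np

def proposed_increment(g_j, w_j, beta, lam):
    """eta_j = argmin_eta g_j*eta + (beta/2)*eta**2 + lam*|w_j + eta| - lam*|w_j|
    for the ell_1 penalty r(t) = lam*|t|: a soft-threshold step on w_j - g_j/beta."""
    z = w_j - g_j / beta                            # unregularized target point
    z = np.sign(z) * max(abs(z) - lam / beta, 0.0)  # soft-threshold S_{lam/beta}
    return z - w_j                                  # eta_j

# Check the first-order optimality condition g_j + beta*eta_j + nu_j = 0,
# where nu_j must be a valid subgradient of lam*|.| at w_j + eta_j.
g_j, w_j, beta, lam = 0.7, 0.2, 2.0, 0.1
eta_j = proposed_increment(g_j, w_j, beta, lam)
nu_j = -(g_j + beta * eta_j)
assert abs(nu_j) <= lam + 1e-12
```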

43 Suppose the randomized block-greedy coordinate algorithm is run on a smooth plus separable convex function f = F + R to produce the iterates {w^k}_{k≥1}. [sent-106, score-0.522]

44 We will use w_b to denote w_{j_b} (similarly for ν_b, g_b, etc. [sent-114, score-0.214]

45 ) E[f(w′) − f(w)] ≤ P E_b[ η_b g_b + (β/2) η_b² + r(w_b + η_b) − r(w_b) ] + (β/2) P(P−1) E_{b≠b′}[ η_b · η_{b′} · A_{j_b}ᵀ A_{j_{b′}} ]. Define the B × B matrix M (that depends on the current iterate w) with entries M_{b,b′} = A_{j_b}ᵀ A_{j_{b′}}. [sent-115, score-0.092]

46 Then, using r(w_b + η_b) − r(w_b) ≤ ν_b η_b, we continue: ≤ (P/B)[ ηᵀg + (β/2) ηᵀη + νᵀη ] + (β P(P−1))/(2B(B−1)) [ ηᵀMη − ηᵀη ]. Above (with some abuse of notation), η, ν, and g are B-length vectors with components η_b, ν_b, and g_b respectively. [sent-116, score-0.092]

47 Now note that ‖η‖₂² = Σ_b |η_{j_b}|² = ‖η‖²_{∞,2}, where the “infinity-2” norm ‖·‖_{∞,2} of a p-vector is, by definition, as follows: take the ℓ∞ norm within a block and take the ℓ₂ norm of the resulting values. [sent-120, score-0.284]
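
A tiny sketch of this mixed norm on a toy vector (the function name is illustrative):

```python
import numpy as np

def norm_inf_2(v, blocks):
    """'Infinity-2' norm: ell_infinity within each block, then ell_2 across blocks."""
    return np.sqrt(sum(np.max(np.abs(v[b])) ** 2 for b in blocks))

v = np.array([0.5, -2.0, 1.0, 0.25])
blocks = [np.array([0, 1]), np.array([2, 3])]
print(norm_inf_2(v, blocks))   # sqrt(2**2 + 1**2) ~ 2.236
```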

48 Putting the last two inequalities together gives (for any upper bound R₁ on ‖w − w⋆ [sent-128, score-0.099]

49 In particular, consider the case where all blocks are updated in parallel as in the thread-greedy coordinate descent algorithm of Scherrer et al. [sent-136, score-0.849]

50 Suppose the block-greedy coordinate algorithm with B = P (thread-greedy) is run on a smooth plus separable convex function f = F + R to produce the iterates {w^k}_{k≥1}. [sent-140, score-0.385]

51 (2 − ρ_block)k. 4 Feature Clustering: The convergence analysis of Section 3 shows that we need to minimize the block spectral radius. [sent-143, score-0.36]

52 Directly finding a clustering that minimizes ρ_block is a computationally daunting task. [sent-144, score-0.111]

53 In the absence of an efficient search strategy for this enormous space, we find it convenient to work instead in terms of the inner product of features from distinct blocks. [sent-147, score-0.187]

54 Then the spectral radius of S has the upper bound ρ(S) ≤ 1 + (B − 1)ε. [sent-151, score-0.154]
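
A quick numerical check of this bound: the matrix below is a random symmetric stand-in with unit diagonal and off-diagonal entries bounded by ε in absolute value (not the actual S built from features), and the bound follows a Gershgorin-style argument.

```python
import numpy as np

rng = np.random.default_rng(0)
B, eps = 8, 0.1

# Symmetric B x B matrix: unit diagonal, off-diagonal magnitudes at most eps,
# standing in for inner products between features chosen from distinct blocks.
S = rng.uniform(-eps, eps, size=(B, B))
S = (S + S.T) / 2
np.fill_diagonal(S, 1.0)

rho = np.max(np.abs(np.linalg.eigvalsh(S)))
assert rho <= 1 + (B - 1) * eps            # rho(S) <= 1 + (B - 1) * eps
print(rho, 1 + (B - 1) * eps)
```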

55 4.1 Clustering Heuristic: Given p features and B blocks, we wish to distribute the features evenly among the blocks, attempting to minimize the absolute inner product between features across blocks. [sent-156, score-0.51]

56 Moreover, we require an approach that is efficient, since any time spent clustering could instead have been used for iterations of the main algorithm. [sent-157, score-0.111]

57 We describe a simple heuristic that builds uniform-sized clusters of features. [sent-158, score-0.106]

58 To construct a given block, we select a feature as a “seed”, and assign the nearest features to the seed (in terms of absolute inner product) to be in the same block. [sent-159, score-0.317]

59 Because inner products with very sparse features result in a large number of zeros, we choose at each step the most dense unassigned feature as the seed. [sent-160, score-0.24]
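
A rough sketch of this seed-based heuristic, assuming a dense design matrix A whose columns are the features and that B divides p evenly; the paper's implementation works with sparse data and is more careful than this, so the function is illustrative only.

```python
import numpy as np

def cluster_features(A, B):
    """Build B equal-sized blocks: seed each block with the densest unassigned
    feature, then fill it with the unassigned features having the largest
    absolute inner product with that seed."""
    p = A.shape[1]
    size = p // B                              # even block sizes; assumes B | p
    density = (A != 0).sum(axis=0)             # nonzero count per feature
    unassigned = list(range(p))
    blocks = []
    for _ in range(B):
        seed = max(unassigned, key=lambda j: density[j])
        others = [j for j in unassigned if j != seed]
        scores = np.abs(A[:, others].T @ A[:, seed])
        nearest = [others[i] for i in np.argsort(-scores)[:size - 1]]
        members = [seed] + nearest
        blocks.append(np.array(members))
        unassigned = [j for j in unassigned if j not in set(members)]
    return blocks
```

Choosing the densest column as the seed mirrors the rationale above: inner products against very sparse features are mostly zero and give little signal for grouping.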

60 5 Experimental Setup. Platform: All our experiments are conducted on a 48-core system comprising 4 sockets and 8 banks of memory. [sent-167, score-0.06]

61 Each socket holds an AMD Opteron processor codenamed Magny-Cours, a multi-chip module with two 6-core dies in a single package. [sent-168, score-0.15]

62 Each 6-core processor is equipped with a three-level memory hierarchy as follows: (i) 64 KB of L1 cache for data and 512 KB of L2 cache that are private to each core, and (ii) 12 MB of L3 cache that is shared among the 6 cores. [sent-169, score-0.278]

63 Each 6-core processor is linked to a 32 GB memory bank with independent memory controllers, leading to a total system memory of 256 GB (32 × 8) that can be globally addressed from each core. [sent-170, score-0.171]

64 The data was gathered by Ken Lang at Carnegie Mellon University circa 1995. [sent-174, score-0.143]

65 (iii) REAL-SIM consists of about 73,000 Usenet articles from four discussion groups: simulated auto racing, simulated aviation, real auto racing, and real aviation. [sent-179, score-0.098]

66 The data was gathered by Andrew McCallum while at Just Research circa 1997. [sent-180, score-0.143]

67 Implementation: For the current work, our empirical results focus on thread-greedy coordinate descent with 32 blocks. [sent-185, score-0.377]

68 At each iteration, a given thread must step through the nonzeros of each of its features to compute the proposed increment (the η_j of Section 3) and the estimated benefit of choosing that feature. [sent-186, score-0.469]

69 Once this is complete, the thread (without waiting) enters the line search phase, where it remains until the weight updates on all threads fall below the specified tolerance. [sent-187, score-0.133]

70 Testing framework: We compare the effect of clustering to randomization (i.e. [sent-190, score-0.147]

71 features are randomly assigned to blocks), for a variety of values of the regularization parameter λ. [sent-192, score-0.161]

72 To test the effect of clustering for very [sent-193, score-0.111]

73 For each dataset, we show the regularized expected loss (top) and number of nonzeros (bottom), using powers of ten as regularization parameters. [sent-205, score-0.409]

74 Results for randomized features are shown in black, and those for clustered features are shown in red. [sent-206, score-0.503]

75 [Flattened table: rows are Active blocks, Iterations per second, NNZ @ 1K sec, Objective @ 1K sec, NNZ @ 10K iter, and Objective @ 10K iter, with Randomized vs. Clustered columns at λ = 10⁻⁴; the recoverable entries show 32 vs. 6 active blocks and 153 vs. 12.… iterations per second.] [sent-208, score-0.415]

76 sparse weights, we first let λ₀ be the largest power of ten that leads to any nonzero weight estimates. [sent-224, score-0.133]

77 For each run, we measure the regularized expected loss and the number of nonzeros at one-second intervals. [sent-226, score-0.285]

78 Times required for clustering and randomization are negligible, and we do not report them here. [sent-227, score-0.147]

79 6 Results. Figure 2 shows the regularized expected loss (top) and number of nonzeros (bottom), for several values of the regularization parameter λ. [sent-228, score-0.321]

80 Black and red curves indicate randomly-permuted features and clustered features, respectively. [sent-229, score-0.241]

81 Note that a larger value of λ results in a sparser solution with greater regularized expected loss and a smaller number of nonzeros. [sent-232, score-0.097]

82 For large values of λ, the simple clustering heuristic results in slower convergence, while for smaller values of λ we see considerable benefit. [sent-235, score-0.217]

83 Of the datasets we tested, REUTERS might reasonably lead to the greatest concern. [sent-237, score-0.071]

84 As with the other datasets, clustered features lead to slow convergence for large λ and fast convergence for small λ. [sent-238, score-0.391]

85 However, REUTERS is particularly interesting because for λ = 10⁻⁵, clustered features seem to provide an initial benefit that does not last; after about 250 seconds the clustered run is overtaken by the run with randomized features. [sent-239, score-0.378]

86 The first row of this table gives the number of active blocks, by which we mean the number of blocks containing any nonzeros. [sent-242, score-0.265]

87 For an inactive block, the corresponding thread repeatedly confirms that all weights remain zero without contributing to convergence. [sent-243, score-0.098]

88 In the most regularized case, λ = 10⁻⁴, clustered data results in only six active blocks, while in the other cases every block is active. [sent-244, score-0.401]

89 Thus in this case features corresponding to nonzero weights are colocated within these few blocks, severely limiting the advantage of parallel updates. [sent-245, score-0.296]

90 In the second row, we see that for randomized features, the algorithm is able to get through over ten times as many iterations per second. [sent-246, score-0.183]

91 To see why, note that the amount of work for a given thread is a linear function of the number of nonzeros over all of the features in its block. [sent-247, score-0.411]

92 Thus, the block with the greatest number of nonzeros serves as a bottleneck. [sent-248, score-0.456]
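
This work model is easy to quantify; a small sketch (the helper name is illustrative) that computes a per-block work proxy and the resulting bottleneck ratio:

```python
import numpy as np

def block_work(A, blocks):
    """Total nonzeros per block (a proxy for one iteration's work per thread)
    and the bottleneck ratio max/mean (1.0 means perfectly load-balanced)."""
    nnz = (A != 0).sum(axis=0)                       # nonzeros per feature
    work = np.array([nnz[b].sum() for b in blocks])  # per-block (per-thread) work
    return work, work.max() / work.mean()
```

Comparing this ratio for clustered versus randomly permuted blocks gives a quick picture of why clustering can slow down each iteration even when it reduces the number of iterations needed.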

93 Note that for this test, randomized features result in faster convergence for the two largest values of λ. [sent-250, score-0.376]

94 In these terms, clustering is advantageous for all but the largest value of λ. [sent-252, score-0.15]

95 First, Figure 3a shows the number of nonzeros in all features for each of the 32 blocks. [sent-254, score-0.313]

96 For comparison, Figures 3b and 3c show convergence rates as a function of the number of iterations. [sent-256, score-0.075]

97 7 Conclusion. We have presented convergence results for a family of randomized coordinate descent algorithms that we call block-greedy coordinate descent. [sent-257, score-0.808]

98 We have shown that convergence depends on ρ_block, the maximal spectral radius over submatrices of XᵀX resulting from the choice of one feature for each block. [sent-259, score-0.324]

99 Even though a simple clustering heuristic helps for smaller values of the regularization parameter, our results also show the importance of considering issues of load-balancing and the distribution of weights for heavily-regularized problems. [sent-260, score-0.253]

100 A clear next goal in this work is the development of a clustering heuristic that is relatively well load-balanced and distributes weights for heavily-regularized problems evenly across blocks, while maintaining good computational efficiency. [sent-261, score-0.25]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('blocks', 0.265), ('shotgun', 0.24), ('nnz', 0.239), ('block', 0.226), ('coordinate', 0.219), ('cd', 0.219), ('nonzeros', 0.188), ('euters', 0.18), ('kdda', 0.18), ('scherrer', 0.171), ('descent', 0.158), ('gencd', 0.15), ('randomized', 0.137), ('jb', 0.133), ('wj', 0.125), ('features', 0.125), ('parallel', 0.123), ('rel', 0.122), ('wb', 0.122), ('clustered', 0.116), ('clustering', 0.111), ('bradley', 0.11), ('heuristic', 0.106), ('thread', 0.098), ('gb', 0.092), ('circa', 0.09), ('eal', 0.09), ('ews', 0.09), ('northwest', 0.09), ('tewari', 0.081), ('convergence', 0.075), ('processor', 0.075), ('greedy', 0.075), ('wk', 0.073), ('separable', 0.064), ('lewis', 0.063), ('kw', 0.063), ('inner', 0.062), ('atb', 0.06), ('halappanavar', 0.06), ('richland', 0.06), ('sockets', 0.06), ('usenet', 0.06), ('regularized', 0.059), ('radius', 0.059), ('spectral', 0.059), ('increment', 0.058), ('paci', 0.058), ('rj', 0.058), ('cache', 0.057), ('gathered', 0.053), ('accept', 0.053), ('feature', 0.053), ('lo', 0.053), ('racing', 0.053), ('gj', 0.052), ('et', 0.049), ('amd', 0.049), ('auto', 0.049), ('keerthi', 0.049), ('opteron', 0.049), ('submatrices', 0.049), ('wa', 0.049), ('nonzero', 0.048), ('im', 0.048), ('osher', 0.046), ('ken', 0.046), ('kb', 0.046), ('ten', 0.046), ('dhillon', 0.044), ('powers', 0.042), ('greatest', 0.042), ('iter', 0.042), ('absolute', 0.04), ('largest', 0.039), ('loss', 0.038), ('sij', 0.038), ('parallelism', 0.038), ('smooth', 0.037), ('seed', 0.037), ('cup', 0.037), ('regularization', 0.036), ('upper', 0.036), ('randomization', 0.036), ('rf', 0.036), ('updated', 0.035), ('convex', 0.033), ('evenly', 0.033), ('sec', 0.033), ('memory', 0.032), ('plus', 0.032), ('lin', 0.032), ('xj', 0.032), ('design', 0.03), ('maximal', 0.029), ('datasets', 0.029), ('wu', 0.029), ('norm', 0.029), ('arrive', 0.028), ('sensing', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.

2 0.15894613 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

3 0.15280752 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.

4 0.09483508 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar

Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost in solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.

5 0.089479782 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set

Author: Nan Li, Longin J. Latecki

Abstract: We formulate clustering aggregation as a special instance of Maximum-Weight Independent Set (MWIS) problem. For a given dataset, an attributed graph is constructed from the union of the input clusterings generated by different underlying clustering algorithms with different parameters. The vertices, which represent the distinct clusters, are weighted by an internal index measuring both cohesion and separation. The edges connect the vertices whose corresponding clusters overlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an optimal subset of non-overlapping clusters partitioning the dataset together. We formalize this intuition as the MWIS problem on the attributed graph, i.e., finding the heaviest subset of mutually non-adjacent vertices. This MWIS problem exhibits a special structure. Since the clusters of each input clustering form a partition of the dataset, the vertices corresponding to each clustering form a maximal independent set (MIS) in the attributed graph. We propose a variant of simulated annealing method that takes advantage of this special structure. Our algorithm starts from each MIS, which is close to a distinct local optimum of the MWIS problem, and utilizes a local search heuristic to explore its neighborhood in order to find the MWIS. Extensive experiments on many challenging datasets show that: 1. our approach to clustering aggregation automatically decides the optimal number of clusters; 2. it does not require any parameter tuning for the underlying clustering algorithms; 3. it can combine the advantages of different underlying clustering algorithms to achieve superior performance; 4. it is robust against moderate or even bad input clusterings. 1

6 0.082943551 69 nips-2012-Clustering Sparse Graphs

7 0.080635861 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

8 0.075386941 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

9 0.070752293 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

10 0.070526928 126 nips-2012-FastEx: Hash Clustering with Exponential Families

11 0.069765262 26 nips-2012-A nonparametric variable clustering model

12 0.068888575 282 nips-2012-Proximal Newton-type methods for convex optimization

13 0.06706436 27 nips-2012-A quasi-Newton proximal splitting method

14 0.065908588 304 nips-2012-Selecting Diverse Features via Spectral Regularization

15 0.065778852 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

16 0.065095745 324 nips-2012-Stochastic Gradient Descent with Only One Projection

17 0.062210035 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk

18 0.06153176 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

19 0.061348733 187 nips-2012-Learning curves for multi-task Gaussian process regression

20 0.058661357 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, 0.039), (2, 0.053), (3, -0.056), (4, 0.048), (5, 0.035), (6, -0.006), (7, -0.05), (8, 0.036), (9, -0.021), (10, 0.03), (11, -0.028), (12, -0.073), (13, 0.086), (14, 0.093), (15, 0.01), (16, -0.09), (17, -0.024), (18, -0.017), (19, -0.038), (20, 0.034), (21, -0.004), (22, 0.022), (23, -0.02), (24, -0.034), (25, -0.133), (26, 0.055), (27, -0.038), (28, 0.046), (29, 0.036), (30, 0.058), (31, -0.005), (32, 0.005), (33, -0.038), (34, -0.031), (35, -0.042), (36, -0.018), (37, 0.021), (38, -0.047), (39, -0.004), (40, 0.07), (41, 0.014), (42, 0.004), (43, 0.028), (44, -0.053), (45, 0.004), (46, 0.036), (47, -0.061), (48, 0.035), (49, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94022703 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.

2 0.75007457 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation

Author: Benjamin Rolfs, Bala Rajaratnam, Dominique Guillot, Ian Wong, Arian Maleki

Abstract: The ℓ1-regularized maximum likelihood estimation problem has recently become a topic of great interest within the machine learning, statistics, and optimization communities as a method for producing sparse inverse covariance estimators. In this paper, a proximal gradient method (G-ISTA) for performing ℓ1-regularized covariance matrix estimation is presented. Although numerous algorithms have been proposed for solving this problem, this simple proximal gradient method is found to have attractive theoretical and numerical properties. G-ISTA has a linear rate of convergence, resulting in an O(log ε) iteration complexity to reach a tolerance of ε. This paper gives eigenvalue bounds for the G-ISTA iterates, providing a closed-form linear convergence rate. The rate is shown to be closely related to the condition number of the optimal point. Numerical convergence results and timing comparisons for the proposed method are presented. G-ISTA is shown to perform very well, especially when the optimal point is well-conditioned.

3 0.69721228 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation

Author: Cho-jui Hsieh, Arindam Banerjee, Inderjit S. Dhillon, Pradeep K. Ravikumar

Abstract: We consider the composite log-determinant optimization problem, arising from the ℓ1-regularized Gaussian maximum likelihood estimator of a sparse inverse covariance matrix, in a high-dimensional setting with a very large number of variables. Recent work has shown this estimator to have strong statistical guarantees in recovering the true structure of the sparse inverse covariance matrix, or alternatively the underlying graph structure of the corresponding Gaussian Markov Random Field, even in very high-dimensional regimes with a limited number of samples. In this paper, we are concerned with the computational cost in solving the above optimization problem. Our proposed algorithm partitions the problem into smaller sub-problems, and uses the solutions of the sub-problems to build a good approximation for the original problem. Our key idea for the divide step to obtain a sub-problem partition is as follows: we first derive a tractable bound on the quality of the approximate solution obtained from solving the corresponding sub-divided problems. Based on this bound, we propose a clustering algorithm that attempts to minimize this bound, in order to find effective partitions of the variables. For the conquer step, we use the approximate solution, i.e., solution resulting from solving the sub-problems, as an initial point to solve the original problem, and thereby achieve a much faster computational procedure.

4 0.69377136 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms

Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola

Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1

5 0.63928574 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation

Author: Figen Oztoprak, Jorge Nocedal, Steven Rennie, Peder A. Olsen

Abstract: We propose two classes of second-order optimization methods for solving the sparse inverse covariance estimation problem. The first approach, which we call the Newton-LASSO method, minimizes a piecewise quadratic model of the objective function at every iteration to generate a step. We employ the fast iterative shrinkage thresholding algorithm (FISTA) to solve this subproblem. The second approach, which we call the Orthant-Based Newton method, is a two-phase algorithm that first identifies an orthant face and then minimizes a smooth quadratic approximation of the objective function using the conjugate gradient method. These methods exploit the structure of the Hessian to efficiently compute the search direction and to avoid explicitly storing the Hessian. We also propose a limited memory BFGS variant of the orthant-based Newton method. Numerical results, including comparisons with the method implemented in the QUIC software [1], suggest that all the techniques described in this paper constitute useful tools for the solution of the sparse inverse covariance estimation problem. 1

6 0.62692845 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

7 0.60709769 125 nips-2012-Factoring nonnegative matrices with linear programs

8 0.59223479 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

9 0.58353466 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization

10 0.57984829 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

11 0.54948986 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

12 0.54065031 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

13 0.53739548 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

14 0.52495748 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

15 0.50994724 155 nips-2012-Human memory search as a random walk in a semantic network

16 0.50898576 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

17 0.50580418 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets

18 0.50308859 204 nips-2012-MAP Inference in Chains using Column Generation

19 0.50210345 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

20 0.50005233 254 nips-2012-On the Sample Complexity of Robust PCA


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.059), (17, 0.022), (21, 0.027), (38, 0.186), (42, 0.034), (54, 0.029), (55, 0.015), (74, 0.033), (76, 0.153), (80, 0.063), (89, 0.245), (92, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83598101 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

same-paper 2 0.82628244 131 nips-2012-Feature Clustering for Accelerating Parallel Coordinate Descent

Author: Chad Scherrer, Ambuj Tewari, Mahantesh Halappanavar, David Haglin

Abstract: Large-scale ℓ1-regularized loss minimization problems arise in high-dimensional applications such as compressed sensing and high-dimensional supervised learning, including classification and regression problems. High-performance algorithms and implementations are critical to efficiently solving these problems. Building upon previous work on coordinate descent algorithms for ℓ1-regularized problems, we introduce a novel family of algorithms called block-greedy coordinate descent that includes, as special cases, several existing algorithms such as SCD, Greedy CD, Shotgun, and Thread-Greedy. We give a unified convergence analysis for the family of block-greedy algorithms. The analysis suggests that block-greedy coordinate descent can better exploit parallelism if features are clustered so that the maximum inner product between features in different blocks is small. Our theoretical convergence analysis is supported with experimental results using data from diverse real-world applications. We hope that the algorithmic approaches and convergence analysis we provide will not only advance the field, but will also encourage researchers to systematically explore the design space of algorithms for solving large-scale ℓ1-regularization problems.

3 0.78680789 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

4 0.72313726 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative filtering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difficult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative filtering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efficient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

5 0.72250962 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

Author: Nicholas Ruozzi

Abstract: Sudderth, Wainwright, and Willsky conjectured that the Bethe approximation corresponding to any fixed point of the belief propagation algorithm over an attractive, pairwise binary graphical model provides a lower bound on the true partition function. In this work, we resolve this conjecture in the affirmative by demonstrating that, for any graphical model with binary variables whose potential functions (not necessarily pairwise) are all log-supermodular, the Bethe partition function always lower bounds the true partition function. The proof of this result follows from a new variant of the “four functions” theorem that may be of independent interest. 1

6 0.72060966 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

7 0.72056293 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

8 0.72034258 27 nips-2012-A quasi-Newton proximal splitting method

9 0.71952277 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

10 0.71948367 319 nips-2012-Sparse Prediction with the $k$-Support Norm

11 0.71921843 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

12 0.71867824 125 nips-2012-Factoring nonnegative matrices with linear programs

13 0.71808541 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

14 0.71801978 69 nips-2012-Clustering Sparse Graphs

15 0.7174077 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

16 0.71688414 34 nips-2012-Active Learning of Multi-Index Function Models

17 0.71673495 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

18 0.7161141 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

19 0.71608967 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

20 0.71471733 361 nips-2012-Volume Regularization for Binary Classification