nips nips2011 nips2011-182 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. [sent-8, score-0.099]
2 Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. [sent-9, score-0.158]
3 In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. [sent-12, score-2.036]
4 We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. [sent-13, score-1.164]
5 We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. [sent-14, score-1.379]
6 We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. [sent-15, score-1.253]
7 Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. [sent-16, score-0.979]
8 Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods. [sent-17, score-0.671]
9 1 Introduction Increasingly, optimization problems in machine learning are very high-dimensional, with a very large number of variables. [sent-18, score-0.043]
10 This has led to a renewed interest in iterative algorithms that require bounded time per iteration. [sent-19, score-0.098]
11 Such iterative methods take various forms such as so-called row-action methods [6] which enforce constraints in the optimization problem sequentially, or first-order methods [4] which only compute the gradient or a coordinate of the gradient per step. [sent-20, score-0.631]
12 But the overall time complexity of these methods still has a high polynomial dependence on the number of parameters. [sent-21, score-0.032]
13 Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters [5, 15, 18]. [sent-22, score-0.093]
14 Towards this, we investigate one of the simplest classes of first-order methods: coordinate descent, which only updates a single coordinate of the iterate at every step. [sent-24, score-1.022]
15 The coordinate descent class of algorithms has seen a renewed interest after recent papers [8, 10, 19] have shown considerable empirical success in application to large problems. [sent-25, score-0.843]
16 Saha and Tewari [13] even show that under certain conditions, the convergence rate of cyclic coordinate descent is at least as fast as that of gradient descent. [sent-26, score-0.886]
17 In this paper, we focus on high-dimensional optimization problems where the solution is sparse. [sent-27, score-0.043]
18 We were motivated to investigate coordinate descent algorithms by the intuition that they could leverage the sparsity structure of the solution by judiciously choosing the coordinate to be updated. [sent-28, score-1.322]
19 In particular, we show that a greedy selection of the coordinates succeeds in weakening the costly dependence on problem size, provided that we can perform the greedy step efficiently. [sent-29, score-1.185]
20 The naive greedy updates would however take time that scales at least linearly in the problem dimension, O(p), since they have to compute the coordinate with the maximum gradient. [sent-30, score-0.95]
21 We thus come to the other main question of this paper: Can the greedy steps in a greedy coordinate scheme be performed efficiently? [sent-31, score-1.396]
22 Surprisingly, we are able to answer in the affirmative, and we show this by a reduction to nearest neighbor search. [sent-32, score-0.538]
23 This allows us to leverage the significant amount of recent research on sublinear methods for nearest neighbor search, provided it suffices to have approximate nearest neighbors. [sent-33, score-1.057]
24 The upshot of our results is a suite of methods that depend weakly on the problem size or number of parameters. [sent-34, score-0.113]
25 We also investigate several notions of approximate greedy coordinate descent for which we are able to derive similar rates. [sent-35, score-1.327]
26 For the composite objective case, where the objective is the sum of a smooth component and a separable non-smooth component, we propose and analyze a "look-ahead" variant of greedy coordinate descent. [sent-36, score-1.171]
27 The development in this paper thus raises a new line of research on connections between computational geometry and first-order optimization methods. [sent-37, score-0.07]
28 For instance, given our results, it would be of interest to develop approximate nearest neighbor methods tuned to greedy coordinate descent. [sent-38, score-1.508]
29 As an instance of such a connection, we show that if the covariates underlying the optimization objective satisfy a mutual incoherence condition, then a very simple nearest neighbor data structure suffices to yield a good approximation. [sent-39, score-0.776]
30 The results of this paper naturally lead to several open problems: can effective computational geometric data structures be tailored towards greedy coordinate descent? [sent-41, score-1.016]
31 Can these be adapted to (a) other first-order methods, perhaps based on sampling, and (b) different regularized variants that uncover structured sparsity? [sent-42, score-0.066]
32 We hope this paper fosters further research and cross-fertilization of ideas in computational geometry and optimization. [sent-43, score-0.054]
33 2 Setup and Notation We start our treatment with differentiable objective functions, and then extend this to encompass non-differentiable functions which arise as the sum of a smooth component and a separable nonsmooth component. [sent-44, score-0.202]
34 Let C : ℝ^p → ℝ be a convex differentiable function. [sent-45, score-0.042]
35 We do not assume that the function is strongly convex: indeed most optimizations arising out of high-dimensional machine learning problems are convex but typically not strongly so. [sent-46, score-0.056]
36 Our analysis requires that the function satisfies the following coordinate-wise Lipschitz condition: Assumption A1. [sent-47, score-0.151]
37 The loss function C satisfies ‖∇C(w) − ∇C(v)‖_∞ ≤ γ_1 ‖w − v‖_1, for some γ_1 > 0. [sent-48, score-0.151]
38 In particular, we say that C has a γ_2-Lipschitz continuous gradient w.r.t. [sent-50, score-0.071]
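For reference, the two smoothness conditions used in the analysis can be written out as follows. Condition (A1) matches the statement just above; the specific ℓ2-to-ℓ2 form shown for (A2) is an assumed, standard reading, since only its constant γ_2 is visible in this extract.

```latex
\text{(A1)}\quad \|\nabla C(w) - \nabla C(v)\|_{\infty} \;\le\; \gamma_1\,\|w - v\|_{1}, \qquad \gamma_1 > 0,
\\
\text{(A2, assumed form)}\quad \|\nabla C(w) - \nabla C(v)\|_{2} \;\le\; \gamma_2\,\|w - v\|_{2}, \qquad \gamma_2 > 0.
```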
39 We are interested in the general optimization problem: min_w C(w). (1) [sent-59, score-0.043]
40 1 Coordinate Descent Coordinate descent solves (1) iteratively by optimizing over a single coordinate while holding others fixed. [sent-66, score-0.75]
41 Typically, the choice of the coordinate to be updated is cyclic. [sent-67, score-0.446]
42 One caveat with this scheme however is that it could be expensive to compute the one-dimensional optimum for general functions C. [sent-68, score-0.098]
43 Moreover, when C is not smooth, such coordinate-wise descent is not guaranteed to converge to the global optimum in general, unless the non-differentiable component is separable [16]. [sent-69, score-0.383]
44 A line of recent work [16, 17, 14] has thus focused on a "gradient descent" version of coordinate descent, that iteratively uses a local quadratic upper bound of the function C. [sent-70, score-0.446]
45 For the case where the optimization function is the sum of a smooth function and the ℓ1 regularizer, this variant is also called Iterative Soft Thresholding [7]. [sent-71, score-0.227]
46 A template for such coordinate gradient descent is the set of iterates: w^t = w^{t-1} − (1/γ_1) ∇_j C(w^{t-1}) e_j. [sent-72, score-0.821]
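To make the template concrete, here is a minimal sketch of a single coordinate gradient step, together with the soft-thresholding form used when the objective is a smooth loss plus an ℓ1 penalty (the Iterative Soft Thresholding case mentioned above). The step constant `gamma`, the penalty `lam`, and the helper names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * |.|: shrink z towards zero by tau.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def coordinate_gradient_step(w, j, grad_j, gamma, lam=0.0):
    """One coordinate step of the 'gradient descent' form of coordinate descent.

    With lam == 0 this is the plain template update
        w_j <- w_j - grad_j / gamma,
    and with lam > 0 it is the ISTA-style update for a smooth loss plus
    lam * ||w||_1, i.e. soft-thresholding the tentative coordinate update.
    """
    w = w.copy()
    if lam == 0.0:
        w[j] -= grad_j / gamma
    else:
        w[j] = soft_threshold(w[j] - grad_j / gamma, lam / gamma)
    return w
```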
47 [10], Wu and Lange [19] and others have shown considerable empirical success in applying these to large problems. [sent-75, score-0.126]
48 Suppose the convex differentiable function C satisfies Assumptions A1 and A2. [sent-84, score-0.193]
49 Then the iterates of Algorithm 1 satisfy: C(w^t) − C(w^*) ≤ γ_1 ‖w^0 − w^*‖_1^2 / t. [sent-85, score-0.098]
50 Letting c(p) denote the time required to solve each greedy step max_j |∇_j C(w^t)|, the greedy version of coordinate descent achieves the rate C(w^t) − C(w^*) = O(γ_1 ‖w^0 − w^*‖_1^2 c(p) / T) at computation time T. [sent-87, score-1.741]
51 Note that the dependence on the problem size p is restricted to the greedy step: if we could solve this maximization more efficiently, then we have a powerful "active-set" method. [sent-89, score-0.507]
52 While brute force maximization for the greedy step would take O(p) time, if it can be done in O(1) time, then at time T, the iterate w satisfies C(w) − C(w^*) = O(γ_1 ‖w^0 − w^*‖_1^2 / T). [sent-90, score-0.92]
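As a concrete reference point, here is a minimal sketch of the brute-force greedy coordinate descent loop on a least-squares objective; the argmax over all p coordinates is exactly the O(p) greedy step discussed above. The data, the step constant `gamma`, and the function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def greedy_coordinate_descent(grad, w0, gamma, iters=500):
    """Brute-force greedy coordinate descent (Algorithm-1 style sketch).

    grad(w) returns the full gradient; the argmax below scans all p
    coordinates, which is the O(p) per-iteration cost discussed above.
    """
    w = w0.copy()
    for _ in range(iters):
        g = grad(w)                    # full gradient
        j = int(np.argmax(np.abs(g)))  # greedy step: argmax_j |grad_j C(w)|
        w[j] -= g[j] / gamma           # coordinate update with step size 1/gamma
    return w

# Hypothetical least-squares example: C(w) = 0.5 * ||X w - y||^2.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
w_true = np.zeros(200); w_true[:5] = 1.0            # sparse ground truth
y = X @ w_true
gamma = (X ** 2).sum(axis=0).max()                  # conservative coordinate-wise constant
w_hat = greedy_coordinate_descent(lambda w: X.T @ (X @ w - y), np.zeros(200), gamma)
```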
53 3 Nearest Neighbor and Fast Greedy In this section, we examine whether the greedy step can be performed in sublinear time. [sent-92, score-0.813]
54 We focus in particular on optimization problems arising from statistical learning problems where the optimization objective can be written as C(w) = Σ_{i=1}^n ℓ(w^T x^i, y^i), (2) for some loss function ℓ : ℝ × ℝ → ℝ, and a set of observations {(x^i, y^i)}_{i=1}^n, with x^i ∈ ℝ^p, y^i ∈ ℝ. [sent-93, score-0.214]
55 Note that such an optimization objective arises in most statistical learning problems. [sent-94, score-0.083]
56 Letting g(u, v) = ∇_u ℓ(u, v) denote the gradient of the sample loss with respect to its first argument, and r_i(w) = g(w^T x^i, y^i), the gradient of the objective (2) may be written as ∇_j C(w) = Σ_{i=1}^n x^i_j r_i(w) = ⟨x_j, r(w)⟩. [sent-97, score-0.182]
57 It then follows that the greedy coordinate descent step in Algorithm 1 reduces to the following simple problem: max_j |⟨x_j, r(w)⟩|. (3) We can now see why the greedy step (3) can be performed efficiently: it can be cast as a nearness problem. [sent-98, score-2.053]
58 Regarding the time taken to compute the vector r(w), note that after any coordinate descent update, we can update each r_i in O(1) time if we cache the values {⟨w, x^i⟩}, so that r can be updated in O(n) time. [sent-110, score-0.849]
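The two ingredients above (the greedy step as an inner-product maximization over the columns x_j, plus O(n) maintenance of r after a coordinate move) can be sketched as follows for a squared loss. The class name, the step constant `gamma`, and the squared-loss choice are illustrative assumptions; the exact scan in `step` is precisely the computation that (approximate) nearest-neighbor / maximum-inner-product search is meant to replace.

```python
import numpy as np

class ResidualGreedyCD:
    """Greedy coordinate descent for C(w) = 0.5 * ||X w - y||^2 with a cached
    residual r = X w - y, so that grad_j C(w) = <x_j, r> and a coordinate move
    only costs an O(n) residual refresh. A safe choice for gamma is max_j ||x_j||^2."""

    def __init__(self, X, y, gamma):
        self.X, self.y, self.gamma = X, y, gamma
        self.w = np.zeros(X.shape[1])
        self.r = -y.astype(float)             # residual X w - y at w = 0

    def step(self):
        scores = self.X.T @ self.r            # <x_j, r> for every j: the greedy scan
        j = int(np.argmax(np.abs(scores)))    # exact greedy choice (what NN search approximates)
        delta = -scores[j] / self.gamma
        self.w[j] += delta
        self.r += delta * self.X[:, j]        # O(n) residual update after the coordinate move
        return j
```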
59 The reduction to nearest neighbor search however comes with a caveat: nearest neighbor search variants that run in sublinear time only compute approximate nearest neighbors. [sent-111, score-1.549]
60 This in turn amounts to performing the greedy step approximately. [sent-112, score-0.516]
61 In the next few subsections, we investigate the consequences of such approximations. [sent-113, score-0.048]
62 1 Multiplicative Greedy We first consider a variant where the greedy step is performed under a multiplicative approximation, where we choose a coordinate j_t such that, for some c ∈ (0, 1], |∇_{j_t} C(w^t)| ≥ c · max_j |∇_j C(w^t)|. (5) [sent-115, score-1.085]
63 As the following lemma shows, the approximate greedy steps have little qualitative effect (proof in Supplementary Material). [sent-118, score-0.632]
64 The greedy coordinate descent iterates, with the greedy step computed as in (5), satisfy: C(w^t) − C(w^*) ≤ γ_1 ‖w^0 − w^*‖_1^2 / (c t). [sent-120, score-1.741]
65 The price for the approximate greedy updates is thus just a constant factor 1/c ≥ 1 reduction in the convergence rate. [sent-124, score-0.591]
66 Note that the equivalence of (4) need not hold under multiplicative approximations. [sent-125, score-0.056]
67 That is, approximate nearest neighbor techniques that obtain a nearest neighbor up to a multiplicative factor do not in turn guarantee a multiplicative approximation for the inner product in the greedy step. [sent-126, score-1.771]
68 As the next lemma shows this still achieves the required qualitative rate. [sent-127, score-0.103]
69 Suppose the greedy step is performed as in (5) with a multiplicative approximation factor of (1 + ε) (due to approximate nearest neighbor search for instance). [sent-129, score-1.165]
70 Then, at any iteration t, the greedy coordinate descent iterates satisfy either of the following two conditions, for any δ > 0: (a) ... [sent-130, score-1.368]
71 2 Additive Greedy Another natural variant is the following additive approximate greedy coordinate descent, where we choose the coordinate j_t such that |∇_{j_t} C(w^t)| ≥ max_j |∇_j C(w^t)| − ε_add, (6) for some ε_add > 0. [sent-138, score-1.065]
72 As the lemma below shows, the approximate greedy steps have little qualitative effect. Lemma 4. [sent-139, score-0.632]
73 The greedy coordinate descent iterates, with the greedy step computed as in (6), satisfy: . [sent-140, score-1.741]
74 Note that we need to obtain an additive approximation in the greedy step only up to the order of the final precision desired for the optimization problem. [sent-143, score-0.661]
75 In particular, for statistical estimation problems the desired optimization accuracy need not be lower than the statistical precision, which is typically of the order of s log(p)/n. [sent-144, score-0.043]
76 Indeed, given the connections elucidated above to greedy coordinate descent, it is an interesting future problem to develop approximate nearest neighbor methods with additive approximations. [sent-147, score-1.578]
77 4 Tailored Nearest Neighbor Data Structures In this section, we show that one could develop approximate nearest neighbor methods tailored to the statistical estimation setting. [sent-148, score-0.682]
78 1 Quadtree under Mutual Incoherence We will show that just a vanilla quadtree yields a good approximation when the covariates satisfy a technical statistical condition of mutual incoherence. [sent-150, score-0.26]
79 A quadtree is a tree data structure which partitions the space. [sent-151, score-0.112]
80 Each internal node u in the quadtree has a representative point, denoted by rep(u), and a list of children nodes, denoted by children(u), which partition the space under u. [sent-152, score-0.144]
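The extract does not include the quadtree construction itself, so as an illustrative stand-in (not the paper's data structure) here is a minimal random-hyperplane hashing sketch of the same idea: shortlist the columns whose hash code is close to that of r (or −r), then run the exact greedy maximization only over the shortlist. All names and parameters below are assumptions.

```python
import numpy as np

class HashedGreedyPicker:
    """Approximate greedy step via random-hyperplane (SimHash-style) codes.

    A stand-in sketch for the tailored structures discussed above (quadtree,
    locality sensitive hashing): columns x_j whose sign pattern under random
    projections matches that of r or -r are shortlisted, and the exact
    |<x_j, r>| maximization is run only over the shortlist."""

    def __init__(self, X, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        self.X = X
        self.H = rng.standard_normal((n_bits, X.shape[0]))      # random hyperplanes
        self.codes = (self.H @ X > 0).astype(np.int8).T          # (p, n_bits) sign codes

    def pick(self, r, shortlist=32):
        code_r = (self.H @ r > 0).astype(np.int8)
        d_pos = np.abs(self.codes - code_r).sum(axis=1)           # Hamming distance to code(r)
        d_neg = np.abs(self.codes - (1 - code_r)).sum(axis=1)     # ... and to code(-r)
        cand = np.argsort(np.minimum(d_pos, d_neg))[:shortlist]   # candidate coordinates
        scores = np.abs(self.X[:, cand].T @ r)                    # exact check on the shortlist
        return int(cand[np.argmax(scores)])
```

In a full implementation one would use several hash tables and retune them as r changes; the point of the sketch is only that the greedy scan over all p columns is replaced by a scan over a small candidate set.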
wordName wordTfidf (topN-words)
[('greedy', 0.475), ('coordinate', 0.446), ('descent', 0.304), ('nearest', 0.268), ('neighbor', 0.237), ('satisfies', 0.151), ('efficiently', 0.151), ('cao', 0.119), ('quadtree', 0.112), ('vjc', 0.112), ('aod', 0.098), ('caveat', 0.098), ('iterates', 0.098), ('tailored', 0.095), ('vanilla', 0.08), ('sublinear', 0.08), ('iixj', 0.075), ('ilw', 0.075), ('ivic', 0.075), ('lwo', 0.075), ('mnltiplicative', 0.075), ('suffices', 0.075), ('gradient', 0.071), ('renewed', 0.065), ('modem', 0.065), ('structore', 0.065), ('cyclic', 0.065), ('austin', 0.06), ('upto', 0.06), ('texas', 0.059), ('arising', 0.056), ('multiplicative', 0.056), ('lemma', 0.054), ('approximate', 0.054), ('weakly', 0.053), ('iterate', 0.053), ('tewari', 0.051), ('brute', 0.049), ('ambuj', 0.049), ('separable', 0.049), ('qualitative', 0.049), ('investigate', 0.048), ('variant', 0.048), ('inderjit', 0.047), ('leverage', 0.045), ('satisfy', 0.045), ('ll', 0.045), ('optimization', 0.043), ('additive', 0.042), ('differentiable', 0.042), ('step', 0.041), ('decade', 0.04), ('covariates', 0.04), ('objective', 0.04), ('smooth', 0.038), ('variants', 0.036), ('composite', 0.035), ('search', 0.034), ('costly', 0.034), ('reduction', 0.033), ('jr', 0.033), ('encompass', 0.033), ('negated', 0.033), ('nearness', 0.033), ('judiciously', 0.033), ('ofp', 0.033), ('ouly', 0.033), ('rll', 0.033), ('ry', 0.033), ('slog', 0.033), ('upshot', 0.033), ('led', 0.033), ('dependence', 0.032), ('yi', 0.032), ('force', 0.032), ('children', 0.032), ('increasingly', 0.03), ('weakening', 0.03), ('fort', 0.03), ('weakens', 0.03), ('uncover', 0.03), ('significant', 0.03), ('pradeepr', 0.03), ('coordinatewise', 0.03), ('lipschitz', 0.029), ('updates', 0.029), ('develop', 0.028), ('hot', 0.028), ('cache', 0.028), ('elucidated', 0.028), ('rep', 0.028), ('mutual', 0.028), ('considerable', 0.028), ('regarding', 0.028), ('geometry', 0.027), ('hope', 0.027), ('subsections', 0.027), ('suite', 0.027), ('significantly', 0.027), ('aw', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
2 0.21309526 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
3 0.18041016 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
4 0.13853215 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
5 0.13225037 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
6 0.12987083 276 nips-2011-Structured sparse coding via lateral inhibition
7 0.11135869 202 nips-2011-On the Universality of Online Mirror Descent
8 0.10819006 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
9 0.10411067 231 nips-2011-Randomized Algorithms for Comparison-based Search
10 0.10016108 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
11 0.082179412 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
12 0.077346131 209 nips-2011-Orthogonal Matching Pursuit with Replacement
13 0.075948425 178 nips-2011-Multiclass Boosting: Theory and Algorithms
14 0.074561507 64 nips-2011-Convergent Bounds on the Euclidean Distance
15 0.064253002 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
16 0.063461959 18 nips-2011-Action-Gap Phenomenon in Reinforcement Learning
17 0.063004725 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
18 0.062802613 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
19 0.059942123 282 nips-2011-The Fast Convergence of Boosting
20 0.05865673 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
topicId topicWeight
[(0, 0.161), (1, -0.032), (2, -0.074), (3, -0.106), (4, -0.066), (5, 0.1), (6, -0.007), (7, 0.029), (8, -0.105), (9, -0.018), (10, 0.098), (11, -0.055), (12, -0.124), (13, -0.044), (14, -0.012), (15, 0.113), (16, 0.041), (17, 0.136), (18, -0.084), (19, 0.026), (20, -0.065), (21, 0.027), (22, 0.071), (23, 0.058), (24, -0.159), (25, -0.012), (26, 0.03), (27, -0.042), (28, -0.038), (29, 0.095), (30, -0.002), (31, -0.214), (32, 0.067), (33, 0.075), (34, 0.109), (35, -0.131), (36, -0.06), (37, 0.029), (38, -0.137), (39, 0.102), (40, 0.044), (41, -0.02), (42, -0.03), (43, 0.188), (44, 0.083), (45, 0.108), (46, 0.005), (47, -0.001), (48, 0.04), (49, 0.096)]
simIndex simValue paperId paperTitle
same-paper 1 0.98708135 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
2 0.67813963 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods
Author: Ali Jalali, Christopher C. Johnson, Pradeep K. Ravikumar
Abstract: In this paper, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Our first main result studies the sparsistency, or consistency in sparsity pattern recovery, properties of a forward-backward greedy algorithm as applied to general statistical models. As a special case, we then apply this algorithm to learn the structure of a discrete graphical model via neighborhood estimation. As a corollary of our general result, we derive sufficient conditions on the number of samples n, the maximum nodedegree d and the problem size p, as well as other conditions on the model parameters, so that the algorithm recovers all the edges with high probability. Our result guarantees graph selection for samples scaling as n = Ω(d2 log(p)), in contrast to existing convex-optimization based algorithms that require a sample complexity of Ω(d3 log(p)). Further, the greedy algorithm only requires a restricted strong convexity condition which is typically milder than irrepresentability assumptions. We corroborate these results using numerical simulations at the end.
3 0.58622706 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
4 0.55213946 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
5 0.52558106 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.50757384 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
7 0.49195158 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems
8 0.45341936 276 nips-2011-Structured sparse coding via lateral inhibition
9 0.42502007 202 nips-2011-On the Universality of Online Mirror Descent
10 0.41737157 187 nips-2011-Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
11 0.41654688 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
12 0.37512699 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
13 0.37288952 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
14 0.37198889 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
15 0.37110212 64 nips-2011-Convergent Bounds on the Euclidean Distance
16 0.35224545 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
17 0.34321633 209 nips-2011-Orthogonal Matching Pursuit with Replacement
18 0.33474344 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
19 0.32443187 150 nips-2011-Learning a Distance Metric from a Network
20 0.32283443 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
topicId topicWeight
[(0, 0.017), (4, 0.079), (20, 0.025), (22, 0.019), (26, 0.017), (31, 0.076), (33, 0.015), (39, 0.221), (43, 0.059), (45, 0.204), (57, 0.037), (74, 0.033), (83, 0.069), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.87132335 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
Author: Kristen Grauman, Fei Sha, Sung J. Hwang
Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1
same-paper 2 0.84828657 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
3 0.83050883 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
Author: Fatma K. Karzan, Arkadi S. Nemirovski, Boris T. Polyak, Anatoli Juditsky
Abstract: We discuss new methods for the recovery of signals with block-sparse structure, based on 1 -minimization. Our emphasis is on the efficiently computable error bounds for the recovery routines. We optimize these bounds with respect to the method parameters to construct the estimators with improved statistical properties. We justify the proposed approach with an oracle inequality which links the properties of the recovery algorithms and the best estimation performance. 1
4 0.82577628 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements
Author: Yi-kai Liu
Abstract: We study the problem of reconstructing an unknown matrix M of rank r and dimension d using O(rd poly log d) Pauli measurements. This has applications in quantum state tomography, and is a non-commutative analogue of a well-known problem in compressed sensing: recovering a sparse vector from a few of its Fourier coefficients. We show that almost all sets of O(rd log6 d) Pauli measurements satisfy the rankr restricted isometry property (RIP). This implies that M can be recovered from a fixed (“universal”) set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. A similar result holds for any class of measurements that use an orthonormal operator basis whose elements have small operator norm. Our proof uses Dudley’s inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality. 1
5 0.81260115 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
6 0.78736728 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
7 0.7514447 209 nips-2011-Orthogonal Matching Pursuit with Replacement
8 0.73373514 220 nips-2011-Prediction strategies without loss
9 0.7323122 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
10 0.72935212 150 nips-2011-Learning a Distance Metric from a Network
11 0.72895533 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
12 0.72856736 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
13 0.72778875 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
14 0.72341704 80 nips-2011-Efficient Online Learning via Randomized Rounding
15 0.72303104 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
16 0.72252005 283 nips-2011-The Fixed Points of Off-Policy TD
17 0.72248387 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
18 0.72240609 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
19 0.72120517 271 nips-2011-Statistical Tests for Optimization Efficiency
20 0.71964145 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning