nips nips2013 nips2013-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Srikrishna Sridhar, Stephen Wright, Christopher Re, Ji Liu, Victor Bittorf, Ce Zhang
Abstract: Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). [sent-6, score-0.491]
2 This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. [sent-7, score-0.681]
3 These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. [sent-8, score-0.223]
4 We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. [sent-9, score-0.297]
5 Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. [sent-10, score-0.834]
6 1 Introduction A host of machine-learning problems can be solved effectively as approximations of such NP-hard combinatorial problems as set cover, set packing, and multiway-cuts [8, 11, 16, 22]. [sent-11, score-0.115]
7 A popular scheme for solving such problems is called LP rounding [22, chs. [sent-12, score-0.414]
8 LP rounding is known to work well on a range of hard problems, and comes with theoretical guarantees for runtime and solution quality. [sent-14, score-0.487]
9 The Achilles’ heel of LP-rounding is that it requires solutions of LPs of possibly extreme scale. [sent-15, score-0.11]
10 In this work, we propose an approximate LP solver suitable for use in the LP-rounding approach, for very large problems. [sent-17, score-0.106]
11 Our intuition is that in LP rounding, since we ultimately round the LP to obtain an approximate solution of the combinatorial problem, a crude solution of the LP may suffice. [sent-18, score-0.314]
12 Hence, an approach that can find approximate solutions of large LPs quickly may be suitable, even if it is inefficient for obtaining highly accurate solutions. [sent-19, score-0.155]
13 This paper focuses on the theoretical and algorithmic aspects of finding approximate solutions to an LP, for use in LP-rounding schemes. [sent-20, score-0.155]
14 Finally, we derive bounds on runtime as well as worst-case approximation ratio of our rounding schemes. [sent-23, score-0.441]
15 Our experiments demonstrate that our approach, called Thetis, produces solutions of comparable quality to state-of-the-art approaches on such tasks as noun-phrase chunking and entity resolution. [sent-24, score-0.317]
16 We also demonstrate, on three different classes of combinatorial problems, that Thetis can outperform Cplex (a state-of-the-art commercial LP and IP solver) by up to an order of magnitude in runtime, while achieving comparable solution quality. [sent-25, score-0.236]
17 al [16] proposed rounding schemes for iterative LP solvers to facilitate MAP inference in graphical models. [sent-29, score-0.39]
18 In contrast, we propose to use stochastic descent methods to solve a QP relaxation; this allows us to take advantage of recent results on asynchronous parallel methods of this type [12,14]. [sent-30, score-0.163]
19 al [13] propose an intriguing parallel scheme for packing and covering problems. [sent-32, score-0.219]
20 Additionally, the runtime of our algorithm is less sensitive to approximation error. [sent-34, score-0.117]
21 For an error ", the bound on runtime of the algorithm in [13] grows as " 5 , while the bound on our algorithm’s runtime grows as " 2 . [sent-35, score-0.144]
22 2 Background: Approximating NP-hard problems with LP Rounding In this section, we review the theory of LP-rounding based approximation schemes for NP-hard combinatorial problems. [sent-36, score-0.138]
23 We use the vertex cover problem as an example, as it is the simplest nontrivial setting that exposes the main ideas of this approach. [sent-37, score-0.267]
24 For a minimization problem , an algorithm ALG is an ↵-factor approximation for , for some ↵ > 1, if any solution produced by ALG has an objective value at most ↵ times the value of an optimal (lowest cost) solution. [sent-39, score-0.136]
25 For some problems, such as vertex cover, there is a constant-factor approximation scheme (↵ = 2). [sent-40, score-0.272]
26 An LP-rounding based approximation scheme for the problem first constructs an IP formulation of which we denote as “P ”. [sent-42, score-0.141]
27 The second step in LP rounding is a relax / solve step: We relax the constraints in P to obtain a linear program LP (P ), replacing the binary variables with continuous variables in [0, 1], then solve LP (P ). [sent-45, score-0.503]
28 The third step is to round the solution of LP (P ) to an integer solution which is feasible for P , thus yielding a candidate solution to the original problem . [sent-46, score-0.409]
29 The focus of this paper is on the relax / solve step, which is usually the computational bottleneck in an LP-rounding based approximation scheme. [sent-47, score-0.121]
30 Let G(V, E) denote a graph with vertex set V and undirected edges E ✓ (V ⇥ V ). [sent-49, score-0.204]
31 Let cv denote a nonnegative cost associated with each vertex v 2 V . [sent-50, score-0.206]
32 A vertex cover of a graph is a subset of V such that each edge e 2 E is incident to at least one vertex in this set. [sent-51, score-0.471]
33 The minimum-cost vertex cover is the one that minimizes the sum of terms cv , summed over the vertices v belonging to the cover. [sent-52, score-0.311]
34 Let us review the “construct,” “relax / solve,” and “round” phases of an LP-rounding based approximation scheme applied to vertex cover. [sent-53, score-0.272]
35 In the “construct” phase, we introduce binary variables xv 2 {0, 1}, 8v 2 V , where xv is set to 1 if the vertex v 2 V is selected in the vertex cover and 0 otherwise. [sent-54, score-0.621]
36 xu + xv 1 for (u, v) 2 E and xv 2 {0, 1} for v 2 V. [sent-57, score-0.192]
37 (1) x v2V Relaxation yields the following LP X min cv xv s. [sent-58, score-0.14]
38 xu + xv x v2V 1 for (u, v) 2 E and xv 2 [0, 1] for v 2 V. [sent-60, score-0.192]
39 (2) A feasible solution of the LP relaxation (2) is called a “fractional solution” of the original problem. [sent-61, score-0.143]
40 In the “round” phase, we generate a valid vertex cover by simply choosing the vertices v 2 V whose 1 fractional solution xv 2 . [sent-62, score-0.557]
41 It is easy to see that the vertex cover generated by such a rounding scheme costs no more than twice the cost of the fractional solution. [sent-63, score-0.759]
42 If the fractional solution chosen for rounding is an optimal solution of (2), then we arrive at a 2-factor approximation scheme for vertex cover. [sent-64, score-0.881]
43 We note here an important property: The rounding algorithm can generate feasible integral solutions while being oblivious of whether the fractional solution is an optimal solution of (2). [sent-65, score-0.931]
44 We formally define the notion of an oblivious rounding scheme as follows. [sent-66, score-0.474]
45 For a minimization problem with an IP formulation P whose LP relaxation is denoted by LP(P ), a -factor ‘oblivious’ rounding scheme converts any feasible point xf 2 LP(P ) to an integral solution xI 2 P with cost at most times the cost of LP(P ) at xf . [sent-68, score-0.692]
46 Given a -factor oblivious algorithm ALG to the problem , one can construct a -factor approximation algorithm for by using ALG to round an optimal fractional solution of LP(P ). [sent-76, score-0.37]
47 When we have an approximate solution for LP(P ) that is feasible for this problem, rounding can produce an ↵-factor approximation algorithm for for a factor ↵ slightly larger than , where the difference between ↵ and takes account of the inexactness in the approximate solution of LP(P ). [sent-77, score-0.729]
48 Many LP-rounding schemes (including the scheme for vertex cover discussed in Section 2) are oblivious. [sent-78, score-0.359]
49 3 Main results In this section, we describe how we can solve LP relaxations approximately, in less time than traditional LP solvers, while still preserving the formal guarantees of rounding schemes. [sent-80, score-0.397]
50 We first define a notion of approximate LP solution and discuss its consequences for oblivious rounding schemes. [sent-81, score-0.545]
51 We then describe a stochastic-coordinate-descent (SCD) algorithm for obtaining approximate solutions of this QP, and mention enhancements of this approach, specifically, asynchronous parallel implementation and the use of an augmented Lagrangian framework. [sent-83, score-0.34]
52 Our analysis yields a worst-case complexity bound for solution quality and runtime of the entire LP-rounding scheme. [sent-84, score-0.232]
53 (3) (4) Let x denote an optimal primal solution of (3). [sent-91, score-0.154]
54 An approximate LP solution x that we use for LPˆ rounding may be infeasible and have objective value different from the optimum cT x⇤ . [sent-92, score-0.46]
55 We quantify the inexactness in an approximate LP solution as follows. [sent-93, score-0.172]
56 Using Definitions 1 and 2, it is easy to see that a -factor oblivious rounding scheme can round a (0, ) approximate solution to produce a feasible integral solution whose cost is no more than (1 + ) times the optimal solution of the P . [sent-96, score-0.965]
57 The factor (1 + ) arises because the rounding algorithm does not have access to an optimal fractional solution. [sent-97, score-0.427]
58 To cope with the infeasibility, we convert an (✏, )-approximate solution to a (0, ˆ) approximate solution where ˆ is not too large. [sent-98, score-0.227]
59 For vertex cover (2), we prove the following result in Appendix C. [sent-99, score-0.267]
60 Let x be an (", ) approximate solution to the linear program (2) with " 2 [0, 1). [sent-102, score-0.163]
61 ˜ ˆ Since x is a feasible solution for (2), the oblivious rounding scheme in Section 2 results in an 2(1 + ˜ (1 ") 1 ) factor approximation algorithm. [sent-104, score-0.662]
62 In general, constructing (0, ˆ) from (✏, ) approximate solutions requires reasoning about the structure of a particular LP. [sent-105, score-0.155]
63 (In practice, u and x may be chosen as ap¯ ¯ ¯ ¯ proximations to the dual and primal solutions of (3), or simply set to zero. [sent-109, score-0.199]
64 ) The quality of the approximation (5) depends on the conditioning of underlying linear program (3), a concept that was studied by Renegar [17]. [sent-110, score-0.141]
65 ✏ 2 P D For an instance of vertex cover with n nodes and m edges, we can show that P 1 = O(n1/2 (m + p n)1/2 ) and D1 = O((m + n)1/2 ) (see Appendix D). [sent-128, score-0.267]
66 The first is an asynchronous parallel implementation of Algorithm 1 and the second is the use of an augmented Lagrangian framework rather than “one-shot” approximation by the QP in (5). [sent-173, score-0.189]
67 We omit a discussion of the convergence properties of the algorithm here, but note that the quality of solution depends on the values of x, u and at the ¯ ¯ last iteration before convergence is declared. [sent-198, score-0.16]
68 By applying Theorem 5, we note that the constant C⇤ is smaller when x and u are close to the primal and dual solution sets, thus improving the approx¯ ¯ imation and reducing the need to increase to a larger value to obtain an approximate solution of acceptable accuracy. [sent-199, score-0.316]
69 4 Experiments Our experiments address two main questions: (1) Is our approximate LP-rounding scheme useful in graph analysis tasks that arise in machine learning? [sent-200, score-0.152]
70 We demonstrate that the rounded solutions obtained using Thetis are of comparable quality to those obtained with stateof-the-art systems. [sent-206, score-0.276]
71 We then run MAP inference on the factor graph using the LP formulation in [9] and compare the quality of the solutions obtained by Thetis with a Gibbs sampling-based approach [26]. [sent-209, score-0.252]
72 Figure 2 demonstrates that our algorithm produces solutions of quality comparable with state-of-the-art approaches for these graph analysis tasks. [sent-218, score-0.263]
73 We conducted numerical experiments on three different combinatorial problems that commonly arise in graph analysis tasks in machine learning: vertex cover, independent set, and multiway cuts. [sent-221, score-0.295]
74 The two main goals of this experiment are to: (1) compare the quality of the integral solutions obtained using LP-rounding with the integral solutions from Cplex-IP and (2) compare wall-clock times required by Thetis and Cplex-LP to solve the LPs for the purpose of LP-rounding. [sent-232, score-0.479]
75 All codes were provided with a time limit of 3600 seconds excluding the time taken for preprocessing as well as the runtime of the rounding algorithms that generate integral solutions from fractional solutions. [sent-251, score-0.684]
76 We solved the vertex cover problem using the approximation algorithm described in Section 2. [sent-253, score-0.336]
77 We solved the maximum independent set problem using a variant of the es + o(s)-factor approximation in [1] where s is the maximum degree of a node in the graph (see Appendix C for details). [sent-254, score-0.111]
78 The details of the transformation from approximate infeasible solutions to feasible solutions are provided in Appendix C. [sent-256, score-0.317]
79 Since the rounding schemes for maximumindependent set and multiway-cut are randomized, we chose the best feasible integral solution from 10 repetitions. [sent-257, score-0.569]
80 82 Figure 3: Summary of wall-clock speedup (in comparison with Cplex-LP) and solution quality (in comparison with Cplex-IP) of Thetis on three graph analysis problems. [sent-303, score-0.227]
81 We discuss the results for the vertex cover problem. [sent-312, score-0.267]
82 On the Bhoslib instances, the integral solutions from Thetis were within 4% of the documented optimal solutions. [sent-313, score-0.214]
83 67⇥104 Figure 4: Wall-clock time and quality of fractional and integral solutions for three graph analysis problems using Thetis, Cplex-IP and Cplex-LP. [sent-386, score-0.424]
84 BFS is the objective value of the best integer feasible solution found by Cplex-IP. [sent-388, score-0.181]
85 The gap is defined as (BFS BB)/BFS where BB is the best known solution bound found by Cplex-IP within the time limit. [sent-389, score-0.116]
86 integral solutions that were within 1% of the documented optimal solutions, but required an hour for each of the instances. [sent-393, score-0.25]
87 Although the LP solutions obtained by Thetis were less accurate than those obtained by Cplex-LP, the rounded solutions from Thetis and Cplex-LP are almost exactly the same. [sent-394, score-0.275]
88 In summary, the LP-rounding approaches using Thetis and Cplex-LP obtain integral solutions of comparable quality with Cplex-IP — but Thetis is about three times faster than Cplex-LP. [sent-395, score-0.296]
89 We were able to recover integral solutions of comparable quality to Cplex-IP, but seven to eight times faster than using LP-rounding with Cplex-LP. [sent-397, score-0.296]
90 The difference between the optimal fractional and integral solutions for these instances is much smaller than frb59-26-1. [sent-399, score-0.319]
91 Notably, Cplex-IP was able to find the optimal solution for the Amazon and DBLP instances, but timed out on Google+, which is of comparable size. [sent-401, score-0.133]
92 5 Conclusion We described Thetis, an LP rounding scheme based on an approximate solver for LP relaxations of combinatorial problems. [sent-403, score-0.569]
93 We derived worst-case runtime and solution quality bounds for our scheme, and demonstrated that our approach was faster than an alternative based on a state-of-theart LP solver, while producing rounded solutions of comparable quality. [sent-404, score-0.439]
94 JL is generously supported in part by NSF awards DMS-0914524 and DMS-1216318 and ONR award N000141310129. [sent-406, score-0.14]
95 CR’s work on this project is generously supported by NSF CAREER award under IIS-1353606, NSF award under CCF-1356918, the ONR under awards N000141210041 and N000141310129, a Sloan Research Fellowship, and gifts from Oracle and Google. [sent-407, score-0.178]
96 SJW is generously supported in part by NSF awards DMS-0914524 and DMS-1216318, ONR award N000141310129, DOE award DE-SC0002283, and Subcontract 3F-30222 from Argonne National Laboratory. [sent-408, score-0.178]
97 Solving packing integer programs via randomized rounding with alterations. [sent-411, score-0.437]
98 Approximation algorithms for the set covering and vertex cover problems. [sent-432, score-0.309]
99 Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. [sent-464, score-0.324]
100 Improved approximation guarantees for packing and covering integer programs. [sent-478, score-0.2]
wordName wordTfidf (topN-words)
[('lp', 0.56), ('thetis', 0.394), ('rounding', 0.324), ('vertex', 0.162), ('lmax', 0.161), ('cplex', 0.149), ('scd', 0.143), ('rsol', 0.125), ('secs', 0.112), ('solutions', 0.11), ('ip', 0.107), ('cover', 0.105), ('fractional', 0.103), ('xv', 0.096), ('solution', 0.091), ('oblivious', 0.085), ('nnz', 0.082), ('qp', 0.081), ('bfs', 0.079), ('lps', 0.078), ('integral', 0.075), ('packing', 0.075), ('runtime', 0.072), ('quality', 0.069), ('dblp', 0.068), ('pv', 0.067), ('ct', 0.065), ('scheme', 0.065), ('primal', 0.063), ('commercial', 0.062), ('solver', 0.061), ('asynchronous', 0.061), ('entity', 0.06), ('generously', 0.058), ('rounded', 0.055), ('feasible', 0.052), ('alg', 0.05), ('mis', 0.047), ('presolve', 0.047), ('lagrangian', 0.047), ('mc', 0.046), ('round', 0.046), ('augmented', 0.046), ('amazon', 0.046), ('approximate', 0.045), ('approximation', 0.045), ('awards', 0.044), ('cv', 0.044), ('graph', 0.042), ('comparable', 0.042), ('covering', 0.042), ('combinatorial', 0.041), ('conll', 0.041), ('enhancements', 0.041), ('kc', 0.041), ('stephen', 0.041), ('solve', 0.04), ('xj', 0.04), ('solvers', 0.039), ('integer', 0.038), ('award', 0.038), ('parallel', 0.037), ('hour', 0.036), ('relax', 0.036), ('google', 0.036), ('aravind', 0.036), ('chunking', 0.036), ('inexactness', 0.036), ('kax', 0.036), ('kdk', 0.036), ('renegar', 0.036), ('thread', 0.035), ('victor', 0.034), ('relaxations', 0.033), ('christopher', 0.032), ('kx', 0.032), ('yuri', 0.032), ('formulation', 0.031), ('instances', 0.031), ('programming', 0.031), ('appendix', 0.031), ('na', 0.03), ('onr', 0.03), ('cores', 0.03), ('vc', 0.03), ('documented', 0.029), ('networking', 0.029), ('tolerance', 0.028), ('schemes', 0.027), ('program', 0.027), ('xf', 0.027), ('sgd', 0.027), ('dual', 0.026), ('problems', 0.025), ('gap', 0.025), ('multiway', 0.025), ('descent', 0.025), ('speedup', 0.025), ('solved', 0.024), ('poorer', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
Author: Srikrishna Sridhar, Stephen Wright, Christopher Re, Ji Liu, Victor Bittorf, Ce Zhang
Abstract: Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. 1
2 0.22599021 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
Author: Bogdan Savchynskyy, Jörg Hendrik Kappes, Paul Swoboda, Christoph Schnörr
Abstract: We consider energy minimization for undirected graphical models, also known as the MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a significant progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are often defined on sparse graphs and convex relaxation methods, such as linear programming relaxations then provide good approximations to integral solutions. We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve much larger problems. We demonstrate the efficacy of our approach on a computer vision energy minimization benchmark. 1
3 0.1402148 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
Author: Jinwoo Shin, Andrew E. Gelfand, Misha Chertkov
Abstract: Max-product ‘belief propagation’ (BP) is a popular distributed heuristic for finding the Maximum A Posteriori (MAP) assignment in a joint probability distribution represented by a Graphical Model (GM). It was recently shown that BP converges to the correct MAP assignment for a class of loopy GMs with the following common feature: the Linear Programming (LP) relaxation to the MAP problem is tight (has no integrality gap). Unfortunately, tightness of the LP relaxation does not, in general, guarantee convergence and correctness of the BP algorithm. The failure of BP in such cases motivates reverse engineering a solution – namely, given a tight LP, can we design a ‘good’ BP algorithm. In this paper, we design a BP algorithm for the Maximum Weight Matching (MWM) problem over general graphs. We prove that the algorithm converges to the correct optimum if the respective LP relaxation, which may include inequalities associated with non-intersecting odd-sized cycles, is tight. The most significant part of our approach is the introduction of a novel graph transformation designed to force convergence of BP. Our theoretical result suggests an efficient BP-based heuristic for the MWM problem, which consists of making sequential, “cutting plane”, modifications to the underlying GM. Our experiments show that this heuristic performs as well as traditional cutting-plane algorithms using LP solvers on MWM problems. 1
4 0.11281052 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
Author: Franz Kiraly, Louis Theran
Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if and entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods. 1
5 0.073346242 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
Author: Haim Avron, Vikas Sindhwani, David Woodruff
Abstract: Motivated by the desire to extend fast randomized techniques to nonlinear lp regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems, additive models and approximations to recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than “input sparsity”. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression. 1
6 0.071967758 184 nips-2013-Marginals-to-Models Reducibility
7 0.070685536 335 nips-2013-Transfer Learning in a Transductive Setting
8 0.066458941 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
9 0.064180829 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
10 0.062318735 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
11 0.061423469 218 nips-2013-On Sampling from the Gibbs Distribution with Random Maximum A-Posteriori Perturbations
12 0.058799915 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
13 0.056121156 318 nips-2013-Structured Learning via Logistic Regression
14 0.054362874 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
15 0.054322321 268 nips-2013-Reflection methods for user-friendly submodular optimization
16 0.054032125 73 nips-2013-Convex Relaxations for Permutation Problems
17 0.051381197 75 nips-2013-Convex Two-Layer Modeling
18 0.049231648 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing
19 0.049047559 249 nips-2013-Polar Operators for Structured Sparse Estimation
20 0.048777804 186 nips-2013-Matrix factorization with binary components
topicId topicWeight
[(0, 0.137), (1, 0.035), (2, 0.036), (3, 0.019), (4, 0.058), (5, 0.058), (6, -0.047), (7, -0.057), (8, 0.004), (9, -0.004), (10, 0.02), (11, -0.02), (12, 0.094), (13, -0.018), (14, -0.032), (15, 0.011), (16, -0.079), (17, 0.006), (18, -0.022), (19, 0.069), (20, -0.074), (21, -0.045), (22, 0.08), (23, -0.056), (24, -0.018), (25, -0.041), (26, 0.048), (27, 0.009), (28, 0.013), (29, 0.114), (30, -0.015), (31, -0.045), (32, 0.066), (33, 0.149), (34, 0.054), (35, 0.031), (36, -0.128), (37, -0.085), (38, -0.17), (39, -0.049), (40, 0.057), (41, -0.144), (42, -0.137), (43, 0.009), (44, -0.053), (45, 0.109), (46, 0.033), (47, 0.077), (48, -0.051), (49, -0.106)]
simIndex simValue paperId paperTitle
same-paper 1 0.9441247 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
Author: Srikrishna Sridhar, Stephen Wright, Christopher Re, Ji Liu, Victor Bittorf, Ce Zhang
Abstract: Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. 1
2 0.88386935 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
Author: Bogdan Savchynskyy, Jörg Hendrik Kappes, Paul Swoboda, Christoph Schnörr
Abstract: We consider energy minimization for undirected graphical models, also known as the MAP-inference problem for Markov random fields. Although combinatorial methods, which return a provably optimal integral solution of the problem, made a significant progress in the past decade, they are still typically unable to cope with large-scale datasets. On the other hand, large scale datasets are often defined on sparse graphs and convex relaxation methods, such as linear programming relaxations then provide good approximations to integral solutions. We propose a novel method of combining combinatorial and convex programming techniques to obtain a global solution of the initial combinatorial problem. Based on the information obtained from the solution of the convex relaxation, our method confines application of the combinatorial solver to a small fraction of the initial graphical model, which allows to optimally solve much larger problems. We demonstrate the efficacy of our approach on a computer vision energy minimization benchmark. 1
3 0.79680192 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
Author: Jinwoo Shin, Andrew E. Gelfand, Misha Chertkov
Abstract: Max-product ‘belief propagation’ (BP) is a popular distributed heuristic for finding the Maximum A Posteriori (MAP) assignment in a joint probability distribution represented by a Graphical Model (GM). It was recently shown that BP converges to the correct MAP assignment for a class of loopy GMs with the following common feature: the Linear Programming (LP) relaxation to the MAP problem is tight (has no integrality gap). Unfortunately, tightness of the LP relaxation does not, in general, guarantee convergence and correctness of the BP algorithm. The failure of BP in such cases motivates reverse engineering a solution – namely, given a tight LP, can we design a ‘good’ BP algorithm. In this paper, we design a BP algorithm for the Maximum Weight Matching (MWM) problem over general graphs. We prove that the algorithm converges to the correct optimum if the respective LP relaxation, which may include inequalities associated with non-intersecting odd-sized cycles, is tight. The most significant part of our approach is the introduction of a novel graph transformation designed to force convergence of BP. Our theoretical result suggests an efficient BP-based heuristic for the MWM problem, which consists of making sequential, “cutting plane”, modifications to the underlying GM. Our experiments show that this heuristic performs as well as traditional cutting-plane algorithms using LP solvers on MWM problems. 1
4 0.61578542 184 nips-2013-Marginals-to-Models Reducibility
Author: Tim Roughgarden, Michael Kearns
Abstract: We consider a number of classical and new computational problems regarding marginal distributions, and inference in models specifying a full joint distribution. We prove general and efficient reductions between a number of these problems, which demonstrate that algorithmic progress in inference automatically yields progress for “pure data” problems. Our main technique involves formulating the problems as linear programs, and proving that the dual separation oracle required by the ellipsoid method is provided by the target problem. This technique may be of independent interest in probabilistic inference. 1
5 0.46962112 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent
Author: Shai Shalev-Shwartz, Tong Zhang
Abstract: Stochastic dual coordinate ascent (SDCA) is an effective technique for solving regularized loss minimization problems in machine learning. This paper considers an extension of SDCA under the mini-batch setting that is often used in practice. Our main contribution is to introduce an accelerated mini-batch version of SDCA and prove a fast convergence rate for this method. We discuss an implementation of our method over a parallel computing system, and compare the results to both the vanilla stochastic dual coordinate ascent and to the accelerated deterministic gradient descent method of Nesterov [2007]. 1
6 0.46738613 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
7 0.4628264 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
8 0.45073548 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
9 0.44474357 67 nips-2013-Conditional Random Fields via Univariate Exponential Families
10 0.44382098 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
11 0.43877438 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
12 0.42820841 73 nips-2013-Convex Relaxations for Permutation Problems
13 0.42473313 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
14 0.41779706 202 nips-2013-Multiclass Total Variation Clustering
15 0.40642929 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
16 0.4046849 291 nips-2013-Sensor Selection in High-Dimensional Gaussian Trees with Nuisances
17 0.39939502 152 nips-2013-Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions
18 0.39359513 350 nips-2013-Wavelets on Graphs via Deep Learning
19 0.38525292 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
20 0.38363674 268 nips-2013-Reflection methods for user-friendly submodular optimization
topicId topicWeight
[(2, 0.018), (16, 0.061), (27, 0.227), (33, 0.121), (34, 0.09), (36, 0.011), (41, 0.032), (49, 0.019), (56, 0.138), (70, 0.023), (85, 0.043), (89, 0.031), (93, 0.046), (95, 0.055)]
simIndex simValue paperId paperTitle
1 0.93595243 85 nips-2013-Deep content-based music recommendation
Author: Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen
Abstract: Automatic music recommendation has become an increasingly relevant problem in recent years, since a lot of music is now sold and consumed digitally. Most recommender systems rely on collaborative filtering. However, this approach suffers from the cold start problem: it fails when no usage data is available, so it is not effective for recommending new and unpopular songs. In this paper, we propose to use a latent factor model for recommendation, and predict the latent factors from music audio when they cannot be obtained from usage data. We compare a traditional approach using a bag-of-words representation of the audio signals with deep convolutional neural networks, and evaluate the predictions quantitatively and qualitatively on the Million Song Dataset. We show that using predicted latent factors produces sensible recommendations, despite the fact that there is a large semantic gap between the characteristics of a song that affect user preference and the corresponding audio signal. We also show that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach. 1
2 0.8196556 69 nips-2013-Context-sensitive active sensing in humans
Author: Sheeraz Ahmad, He Huang, Angela J. Yu
Abstract: Humans and animals readily utilize active sensing, or the use of self-motion, to focus sensory and cognitive resources on the behaviorally most relevant stimuli and events in the environment. Understanding the computational basis of natural active sensing is important both for advancing brain sciences and for developing more powerful artificial systems. Recently, we proposed a goal-directed, context-sensitive, Bayesian control strategy for active sensing, C-DAC (ContextDependent Active Controller) (Ahmad & Yu, 2013). In contrast to previously proposed algorithms for human active vision, which tend to optimize abstract statistical objectives and therefore cannot adapt to changing behavioral context or task goals, C-DAC directly minimizes behavioral costs and thus, automatically adapts itself to different task conditions. However, C-DAC is limited as a model of human active sensing, given its computational/representational requirements, especially for more complex, real-world situations. Here, we propose a myopic approximation to C-DAC, which also takes behavioral costs into account, but achieves a significant reduction in complexity by looking only one step ahead. We also present data from a human active visual search experiment, and compare the performance of the various models against human behavior. We find that C-DAC and its myopic variant both achieve better fit to human data than Infomax (Butko & Movellan, 2010), which maximizes expected cumulative future information gain. In summary, this work provides novel experimental results that differentiate theoretical models for human active sensing, as well as a novel active sensing algorithm that retains the context-sensitivity of the optimal controller while achieving significant computational savings. 1
same-paper 3 0.80112135 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
Author: Srikrishna Sridhar, Stephen Wright, Christopher Re, Ji Liu, Victor Bittorf, Ce Zhang
Abstract: Many problems in machine learning can be solved by rounding the solution of an appropriate linear program (LP). This paper shows that we can recover solutions of comparable quality by rounding an approximate LP solution instead of the exact one. These approximate LP solutions can be computed efficiently by applying a parallel stochastic-coordinate-descent method to a quadratic-penalty formulation of the LP. We derive worst-case runtime and solution quality guarantees of this scheme using novel perturbation and convergence analysis. Our experiments demonstrate that on such combinatorial problems as vertex cover, independent set and multiway-cut, our approximate rounding scheme is up to an order of magnitude faster than Cplex (a commercial LP solver) while producing solutions of similar quality. 1
4 0.69338274 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
5 0.69178814 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9
6 0.68921745 252 nips-2013-Predictive PAC Learning and Process Decompositions
7 0.6865688 154 nips-2013-Learning Gaussian Graphical Models with Observed or Latent FVSs
8 0.68523943 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
9 0.68495268 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
10 0.68393308 318 nips-2013-Structured Learning via Logistic Regression
11 0.68380892 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
12 0.68356478 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
13 0.68355936 184 nips-2013-Marginals-to-Models Reducibility
14 0.68319345 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
15 0.68261582 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries
16 0.68150824 149 nips-2013-Latent Structured Active Learning
18 0.68083602 233 nips-2013-Online Robust PCA via Stochastic Optimization
19 0.6807059 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
20 0.68064153 79 nips-2013-DESPOT: Online POMDP Planning with Regularization