nips nips2012 nips2012-236 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. [sent-3, score-0.162]
2 This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. [sent-5, score-0.114]
3 Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. [sent-6, score-0.311]
4 In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. [sent-7, score-0.411]
5 Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. [sent-8, score-0.696]
6 We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. [sent-9, score-0.286]
7 Recently, probabilistic models extending determinantal point processes (DPPs) [8, 9] were proposed for several such problems [10, 5, 11]. [sent-12, score-0.229]
8 In addition, we are often interested in conditioning MAP inference on knapsack-type budget constraints, matroid constraints, or general polytope constraints. [sent-19, score-0.332]
9 In this paper we propose a new algorithm for approximating MAP inference that handles these types of constraints for non-monotone DPPs. [sent-21, score-0.106]
10 Recent work on non-monotone submodular function optimization can be broadly split into combinatorial versus continuous approaches. [sent-22, score-0.463]
11 Among combinatorial methods, modified greedy, local search and simulated annealing algorithms provide certain constant factor guarantees [16, 17, 18] and have been recently extended to optimization under knapsack and matroid constraints [19, 20]. [sent-23, score-0.227]
12 Continuous methods [21, 22] use a multilinear extension of the submodular set function to the convex hull of the feasible sets and then round fractional solutions obtained by maximizing in the interior of the polytope. [sent-24, score-0.712]
13 Our algorithm falls into the continuous category, using a novel and efficient non-linear continuous extension specifically tailored to DPPs. [sent-25, score-0.208]
14 In comparison to the constant-factor algorithms for general submodular functions, our approach is more efficient because we have explicit access to the objective function and its gradient. [sent-26, score-0.367]
15 In contrast, methods for general submodular functions assume only a simple function oracle and must employ sampling to estimate function and gradient values in the polytope interior. [sent-27, score-0.566]
16 We show that our non-linear extension enjoys some of the critical properties of the standard multilinear extension and propose an efficient algorithm that can handle solvable polytope constraints. [sent-28, score-0.677]
17 Our algorithm compares favorably to greedy and recent “symmetric” greedy [18] methods on unconstrained simulated problems, simulated problems under matching constraints, and a real-world matching task using quotes from political candidates. [sent-29, score-0.689]
18 Originally, DPPs were introduced to model fermions in quantum physics [8], but since then they have arisen in a variety of other settings including non-intersecting random paths, random spanning trees, and eigenvalues of random matrices [9, 23, 12]. [sent-31, score-0.062]
19 As a result, DPPs assign higher probability to sets that are diverse under L. [sent-42, score-0.067]
20 In fact, a variety of probabilistic inference operations can be performed efficiently, including sampling, marginalization, and conditioning [12, 24]. [sent-47, score-0.087]
21 A set function f : 2^{[N]} → R is called submodular if it satisfies f(X ∪ {i}) − f(Y ∪ {i}) ≥ f(X) − f(Y) (3) whenever X ⊆ Y and i ∉ Y. [sent-52, score-0.369]
22 Common submodular functions include the mutual information of a set of variables and the number of cut edges leaving a set of vertices of a graph. [sent-54, score-0.353]
23 A submodular function f is called nondecreasing (or monotone) when X ⊆ Y implies f(X) ≤ f(Y). [sent-55, score-0.33]
24 It is possible to show that log det(L_Y) is a submodular function: entropy is submodular, and the entropy of a Gaussian is proportional to log det(Σ_Y) (plus a linear term in |Y|), where Σ is the covariance matrix. [sent-56, score-0.402]
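As a brief sketch of that argument (using the standard Gaussian-entropy identity; the constant factor is immaterial):

```latex
% Entropy of the Gaussian restricted to index set Y, with covariance submatrix \Sigma_Y:
\[
H(X_Y) \;=\; \tfrac{1}{2}\log\!\big((2\pi e)^{|Y|}\det(\Sigma_Y)\big)
       \;=\; \tfrac{1}{2}\log\det(\Sigma_Y) \;+\; \tfrac{|Y|}{2}\log(2\pi e).
\]
% H(X_Y) is a submodular function of Y, and the second term is modular (linear in |Y|),
% so \log\det(\Sigma_Y) is submodular; identifying \Sigma with L gives submodularity of \log\det(L_Y).
```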
25 We provide a technique that addresses all three criteria for the DPP MAP problem, although approximation guarantees for the general polytope case depend on the choice of rounding algorithm and remain an open problem. [sent-58, score-0.421]
26 We use the submodular maximization algorithm of [21] as a starting point. [sent-59, score-0.371]
27 3 MAP Inference We seek an approximate solution to the generalized DPP MAP problem arg max_{Y ∈ S} log det(L_Y), where S ⊆ [0, 1]^N and Y ∈ S means that the characteristic vector I(Y) is in S. [sent-60, score-0.069]
28 We will assume that S is a down-monotone, solvable polytope; down-monotone means that for x, y ∈ [0, 1]^N, x ∈ S implies y ∈ S whenever x ≥ y (that is, whenever x_i ≥ y_i for all i), and solvable means that for any linear objective function g(x) = aᵀx, we can efficiently find x ∈ S maximizing g(x). [sent-61, score-0.321]
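As a toy illustration of the solvability requirement (a NumPy sketch of ours, not from the paper), the linear maximization oracle for the unconstrained unit cube simply activates every coordinate with positive weight; a constrained polytope such as a matching polytope would substitute its own combinatorial solver here.

```python
import numpy as np

def linmax_unit_cube(a):
    """Linear maximization oracle for S = [0, 1]^N: argmax_{x in S} a.x puts
    mass 1 on every coordinate with positive weight. Illustrative sketch only."""
    return (np.asarray(a, dtype=float) > 0).astype(float)
```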
29 One common approach for approximating discrete optimization problems is to replace the discrete variables with continuous analogs and extend the objective function to the continuous domain. [sent-62, score-0.201]
30 When the resulting continuous optimization is solved, the result may include fractional variables. [sent-63, score-0.146]
31 Typically, a rounding scheme is then used to produce a valid integral solution. [sent-64, score-0.163]
32 As we will detail below, we use a novel non-linear continuous relaxation that has a nice property: when the polytope is unconstrained, S = [0, 1]^N, our method will (essentially) always produce integral solutions. [sent-65, score-0.322]
33 For more complex polytopes, a rounding procedure is required. [sent-66, score-0.134]
34 When the objective f (Y ) is a submodular set function, as in our setting, the multilinear extension can be used to obtain certain theoretical guarantees for the relaxed optimization scheme described above [21, 25]. [sent-67, score-0.698]
35 The multilinear extension is defined on a vector x ∈ [0, 1]^N: F(x) = Σ_Y Π_{i∈Y} x_i Π_{i∉Y} (1 − x_i) f(Y). [sent-68, score-0.367]
36 Thus, to use the multilinear extension in practice requires estimating its value and derivative via Monte Carlo techniques. [sent-71, score-0.283]
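As a point of reference, here is a minimal NumPy sketch (ours, not the paper's) of such a Monte Carlo estimate for f(Y) = log det(L_Y); the sample count m is an arbitrary illustrative choice.

```python
import numpy as np

def multilinear_mc(L, x, m=1000, seed=0):
    """Monte Carlo estimate of the multilinear extension F(x) = E[f(Y)], where each
    item i is included independently with probability x[i] and f(Y) = log det(L_Y)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(m):
        idx = np.flatnonzero(rng.random(len(x)) < x)   # sample a random subset Y
        if idx.size > 0:
            _, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            total += logdet                            # log det of the empty minor is 0
    return total / m
```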
37 Instead, for the special case of DPP probabilities we propose a new continuous extension that is efficiently computable and differentiable. [sent-73, score-0.184]
38 We refer to the following function as the softmax extension: F̃(x) = log Σ_Y Π_{i∈Y} x_i Π_{i∉Y} (1 − x_i) exp(f(Y)). [sent-74, score-0.223]
39 While the softmax extension also involves a sum over exponentially many sets Y , we have the following theorem. [sent-76, score-0.223]
40 For a positive semidefinite matrix L and x ∈ [0, 1]^N, Σ_Y Π_{i∈Y} x_i Π_{i∉Y} (1 − x_i) det(L_Y) = det(diag(x)(L − I) + I). [sent-78, score-0.084]
41 For f(Y) = log det(L_Y), we have F̃(x) = log det(diag(x)(L − I) + I) and ∂F̃(x)/∂x_i = tr((diag(x)(L − I) + I)^{-1} (L − I)_i), where (L − I)_i denotes the matrix obtained by zeroing all except the ith row of L − I. [sent-81, score-0.072]
42 Corollary 2 says that the softmax extension for the DPP MAP problem is computable and differentiable in O(N^3) time. [sent-82, score-0.23]
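A minimal NumPy sketch of this closed form (function and variable names are ours, not the paper's); the gradient uses the fact that tr(A^{-1}(L − I)_i) is the i-th diagonal entry of (L − I)A^{-1}.

```python
import numpy as np

def softmax_extension(L, x):
    """Softmax extension F~(x) = log det(diag(x)(L - I) + I) and its gradient,
    following the closed form above; a single O(N^3) factorization suffices."""
    N = L.shape[0]
    B = L - np.eye(N)
    A = np.diag(x) @ B + np.eye(N)         # diag(x)(L - I) + I
    _, value = np.linalg.slogdet(A)
    Ainv = np.linalg.inv(A)                # in practice, reuse an LU/Cholesky factorization
    grad = np.einsum('ij,ji->i', B, Ainv)  # grad_i = tr(A^{-1} (L - I)_i)
    return value, grad
```

This exact, cheap oracle is what replaces the sampling-based estimates required by the multilinear extension.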
43 As we will see, this will be sufficient to efficiently find a local maximum of the softmax extension over an arbitrary solvable polytope. [sent-84, score-0.069]
44 It then remains to show that this local maximum comes with approximation guarantees. [sent-85, score-0.059]
45 Conditional gradient. When the optimization polytope S is simple—for instance, the unit cube [0, 1]^N—we can apply generic gradient-based optimization methods like L-BFGS to rapidly find a local maximum of the softmax extension. [sent-87, score-0.421]
46 In situations where we are able to efficiently project onto the polytope S, we can apply projected gradient methods. [sent-88, score-0.236]
47 In the general case, however, we assume only that the polytope is solvable. [sent-89, score-0.236]
48 Algorithm 1 describes the procedure; intuitively, at each step we move to a convex combination of the current point and the point maximizing the linear approximation of the function given by the current gradient. [sent-91, score-0.065]
49 Note that finding y requires optimizing a linear function over S; this step is efficient whenever the polytope is solvable. [sent-93, score-0.275]
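A sketch of such a conditional-gradient loop (the step schedule, iteration count, and oracle interface are illustrative assumptions, not the paper's exact Algorithm 1); `linmax` can be the unit-cube oracle sketched earlier or any solver for the polytope at hand.

```python
import numpy as np

def conditional_gradient(grad_fn, linmax, x0, steps=100):
    """Frank-Wolfe-style ascent toward a local maximum of a smooth extension over a
    solvable polytope S: linearize at x, call the oracle, move to a convex combination."""
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        y = linmax(grad_fn(x))             # best point of S for the linearized objective
        gamma = 2.0 / (t + 2.0)            # standard diminishing step size (assumption)
        x = (1.0 - gamma) * x + gamma * y  # stays inside S by convexity
    return x
```

For the unconstrained DPP case one could pass `grad_fn = lambda x: softmax_extension(L, x)[1]` and `linmax = linmax_unit_cube` from the sketches above.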
50 When u, v ≥ 0, we have ∂²F̃(x + su + tv)/∂s∂t ≤ 0 wherever 0 < x + su + tv < 1. [sent-100, score-0.084]
51 Corollary 4 tells us that a local optimum x of F̃ has certain global properties—namely, that F̃(x) ≥ F̃(y) whenever y ≤ x or y ≥ x. [sent-103, score-0.092]
52 (Note that this representation is overcomplete, since there are in general many subsets of [0, 1] with measure x_i.) [sent-109, score-0.066]
53 Lemmas 5 and 6 suffice to prove the following theorem, which appears for the multilinear extension in [21], bounding the approximation ratio of Algorithm 2. [sent-119, score-0.36]
54 Let F̃(x) be the softmax extension of a nonnegative submodular function f(Y) = log det(L_Y), let OPT = max_{x′ ∈ S} F̃(x′), and let x and y be local optima of F̃ in S and S ∩ {y′ | y′ ≤ 1 − x}, respectively. [sent-121, score-0.593]
55 Then max(F̃(x), F̃(y)) ≥ (1/4) max_{Y ∈ S} log det(L_Y). (11) Note that the softmax extension is an upper bound on the multilinear extension, thus Equation (11) is at least as tight as the corresponding result in [21]. [sent-123, score-0.386]
56 Algorithm 2 yields a 1/4-approximation to the DPP MAP problem whenever log det(L_Y) ≥ 0 for all Y. [sent-125, score-0.075]
57 In general, the objective value obtained by Algorithm 2 is bounded below by (1/4)(OPT − p0) + p0, where p0 = min_Y log det(L_Y). [sent-126, score-0.073]
58 Rounding. When the polytope S is unconstrained, it is easy to show that the results of Algorithm 1—and, in turn, Algorithm 2—are integral (or can be rounded without loss). [sent-129, score-0.265]
59 If S = [0, 1]^N, then for any local optimum x of F̃, either x is integral or at least one fractional coordinate x_i can be set to 0 or 1 without lowering the objective. [sent-131, score-0.187]
60 More generally, however, the polytope S can be complex, and the output of Algorithm 2 needs to be rounded. [sent-132, score-0.236]
61 We speculate that the contention resolution rounding schemes proposed in [21] for the multilinear extension F may be extensible to F̃, but do not attempt to prove so here. [sent-133, score-0.457]
62 Instead, in our experiments we apply pipage rounding [28] and threshold rounding (rounding all coordinates up or down using a single threshold), which are simple and seem to work well in practice. [sent-134, score-0.335]
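A sketch of the threshold variant (ours): try every threshold given by the fractional coordinates, keep the feasible thresholded set with the best objective. The `feasible` callback stands in for whatever constraint check the polytope requires and is an assumption of this sketch.

```python
import numpy as np

def threshold_round(x, L, feasible=lambda idx: True):
    """Round a fractional x by thresholding: for each tau, keep {i : x_i >= tau} and
    return the feasible set with the largest log det(L_Y). The empty set scores 0."""
    best_idx, best_val = np.array([], dtype=int), 0.0
    for tau in np.unique(x):
        idx = np.flatnonzero(x >= tau)
        if idx.size == 0 or not feasible(idx):
            continue
        sign, val = np.linalg.slogdet(L[np.ix_(idx, idx)])
        if sign > 0 and val > best_val:
            best_idx, best_val = idx, val
    return best_idx
```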
63 Since the general framework of continuous optimization is widely used in machine learning, this technique allows DPPs to be easily combined with other models. [sent-137, score-0.083]
64 For instance, if S is the local polytope for a Markov random field, then, augmenting the objective with the (linear) log-likelihood of the MRF—additive linear objective terms do not affect the lemmas proved above—we can approximately compute the MAP configuration of the DPP-MRF product model. [sent-138, score-0.34]
65 We might in this way model diverse objects placed in a sequence, or fit to an underlying signal like an image. [sent-139, score-0.067]
66 Note that, while a naive implementation of the arg max in Algorithm 3 requires evaluating the objective for each item in U , here we can exploit the fact that DPPs are closed under conditioning to compute all necessary values with only two matrix inversions [5]. [sent-142, score-0.101]
67 We report baseline runtimes using this optimized greedy algorithm, which is about 10 times faster than the naive version at N = 200. [sent-143, score-0.189]
68 [Eigenvalue density formula: two terms,] the first term of which deters the eigenvalues from being too large, and the second term of which encourages the eigenvalues to be well-separated [30]. [sent-158, score-0.062]
69 Figure 3a shows performance results on these random kernels in the unconstrained setting. [sent-162, score-0.075]
70 Our proposed algorithm outperforms greedy in general, and the performance gap tends to grow with the size of the ground set, N . [sent-163, score-0.189]
71 Figure 3: Median and quartile log probability ratios (top) and running time ratios (bottom) for 100 random trials. [sent-176, score-0.084]
72 (a) The proposed algorithm versus greedy on unconstrained problems. [sent-177, score-0.29]
73 (b) The proposed algorithm versus symmetric greedy on unconstrained problems. [sent-178, score-0.29]
74 (c) The proposed algorithm versus greedy on constrained problems. [sent-179, score-0.215]
75 Despite the fact that the symmetric greedy algorithm [18] has an improved approximation guarantee of 1/3, essentially the same analysis applies to Figure 3b. [sent-183, score-0.243]
76 The constraints require that if the x_k corresponding to the (i, j) pair is 1, no other x_{k′} can have first element i or second element j; i.e., the selected pairs must form a matching. [sent-187, score-0.096]
77 This means our constraints are of a form that allows us to apply pipage rounding to the possibly fractional result. [sent-191, score-0.321]
78 Figure 3c shows even greater gains over greedy in this setting; however, enforcing the constraints precludes using fast methods like L-BFGS, so our optimization procedure is in this case somewhat slower than greedy. [sent-192, score-0.272]
79 Matched summarization. Finally, we demonstrate our approach using real-world data. [sent-194, score-0.065]
80 Consider the following task: given a set of documents, select a set of document pairs such that the two elements within a pair are similar, but the overall set of pairs is diverse. [sent-195, score-0.165]
81 The task output is a set of statement pairs where the first statement in each pair comes from candidate a and the second from candidate b. [sent-202, score-0.186]
82 The goal of optimization is to find a set that is diverse (contains many topics, such as healthcare, foreign policy, immigration, etc. [sent-203, score-0.093]
83 ) but where both statements in each pair are topically similar. [sent-204, score-0.087]
84 We filter short statements, leaving us with an average of 179 quotes per candidate (min = 93, max = 332 quotes). [sent-206, score-0.202]
85 Algorithm 3 (Greedy MAP for DPPs). Input: kernel L, polytope S. Initialize Y ← ∅, U ← {i | i ∉ Y, I(Y ∪ {i}) ∈ S}. While U is not empty: i* ← arg max_{i ∈ U} log det(L_{Y ∪ {i}}); if log det(L_{Y ∪ {i*}}) < log det(L_Y) then break; Y ← Y ∪ {i*}; U ← {i | i ∉ Y, I(Y ∪ {i}) ∈ S}. Output: Y. [sent-207, score-0.461]
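A direct (unoptimized) NumPy transcription of this greedy baseline for the unconstrained case; the conditioning trick of [5] that replaces the repeated determinant evaluations with two matrix inversions is omitted here for clarity.

```python
import numpy as np

def greedy_map(L):
    """Greedy MAP baseline: repeatedly add the item that most increases log det(L_Y);
    stop when no remaining item improves the objective. Unconstrained sketch."""
    def logdet(idx):
        if not idx:
            return 0.0
        sign, val = np.linalg.slogdet(L[np.ix_(idx, idx)])
        return val if sign > 0 else -np.inf
    Y, current, U = [], 0.0, set(range(L.shape[0]))
    while U:
        gains = {i: logdet(Y + [i]) for i in U}
        best = max(gains, key=gains.get)
        if gains[best] < current:
            break
        Y.append(best)
        current = gains[best]
        U.remove(best)
    return Y
```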
86 Figure 4: Log ratio of the objective value achieved by our method to that achieved by greedy for ten settings of the match weight λ. [sent-215, score-0.274]
87 Then we generate a feature matrix W where Wqt is the number of times term t appears in quote q. [sent-218, score-0.128]
88 For a given pair of candidates (a, b) we compute the quality of each possible quote pair (q_i^(a), q_j^(b)) as the dot product of their rows in W. [sent-220, score-0.233]
89 For each of candidate a's quotes q_i^(a) we keep a pair with quote j = arg max_{j′} quality(q_i^(a), q_{j′}^(b)) from candidate b, and vice-versa. [sent-222, score-0.486]
90 To create a feature vector describing each pair, we simply add the corresponding pair of quote feature vectors and re-normalize, forming a new W matrix. [sent-224, score-0.167]
91 Our task is to select some high-quality representative subset of the unpruned quote pairs. [sent-225, score-0.168]
92 We formulate this as a DPP objective with kernel L = M S M, where S_ij is a measurement of similarity between quote pairs i and j, and M is a diagonal matrix with M_ii representing the match quality of pair i. [sent-226, score-0.249]
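A sketch of how such a quality-diversity kernel could be assembled from per-pair feature rows (the cosine-style similarity for S and the direct use of the quality scores on the diagonal are our illustrative choices; the paper's exact dependence on the match weight λ is not reproduced here).

```python
import numpy as np

def build_kernel(pair_features, quality):
    """Assemble a DPP kernel L = M S M from pair feature rows (n x d, assumed
    L2-normalized) and per-pair match-quality scores. Illustrative sketch only."""
    S = pair_features @ pair_features.T        # similarity between quote pairs
    M = np.diag(np.asarray(quality, dtype=float))
    return M @ S @ M
```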
93 For each candidate we cluster their quotes using k-means on the word feature vectors and impose the constraint that no more than one quote per cluster can be selected. [sent-230, score-0.307]
94 We round the final solution using the threshold rounding scheme described in Section 3. [sent-231, score-0.134]
95 In general, we observe that our algorithm shows the largest improvement over greedy when the constraints are in play. [sent-235, score-0.246]
96 On the other hand, when λ is very large the algorithms must choose as many pairs as possible in order to maximize their score; in this case the constraints play an important role. [sent-237, score-0.102]
97 5 Conclusion We presented a new approach to solving the MAP problem for DPPs based on continuous algorithms for submodular maximization. [sent-238, score-0.387]
98 Unlike the multilinear extension used in the general case, the softmax extension we propose is efficiently computable and differentiable. [sent-239, score-0.513]
99 Furthermore, it allows for general solvable polytope constraints, and yields a guaranteed 1/4-approximation in a subclass of DPPs. [sent-240, score-0.3]
100 Our method makes it easy to combine DPPs with other models like MRFs or matching models, and is faster and more reliable than standard greedy methods on synthetic and real-world problems. [sent-241, score-0.215]
wordName wordTfidf (topN-words)
[('dpp', 0.417), ('dpps', 0.337), ('submodular', 0.33), ('ly', 0.293), ('det', 0.245), ('polytope', 0.236), ('determinantal', 0.201), ('greedy', 0.189), ('multilinear', 0.189), ('rounding', 0.134), ('quotes', 0.128), ('quote', 0.128), ('softmax', 0.103), ('extension', 0.094), ('opt', 0.093), ('map', 0.092), ('unconstrained', 0.075), ('kulesza', 0.073), ('pipage', 0.067), ('diverse', 0.067), ('summarization', 0.065), ('solvable', 0.064), ('fractional', 0.063), ('constraints', 0.057), ('continuous', 0.057), ('wolsey', 0.055), ('feldman', 0.052), ('naor', 0.052), ('nemhauser', 0.052), ('candidate', 0.051), ('ratio', 0.048), ('statements', 0.048), ('sym', 0.045), ('vondr', 0.045), ('pairs', 0.045), ('xi', 0.042), ('maximization', 0.041), ('xy', 0.041), ('contention', 0.04), ('unpruned', 0.04), ('matroid', 0.04), ('pair', 0.039), ('whenever', 0.039), ('sensor', 0.039), ('tv', 0.038), ('diag', 0.037), ('objective', 0.037), ('focs', 0.037), ('bi', 0.037), ('automata', 0.037), ('gillenwater', 0.037), ('maximizing', 0.036), ('log', 0.036), ('document', 0.036), ('semide', 0.035), ('guration', 0.034), ('monotone', 0.034), ('roots', 0.034), ('corollary', 0.034), ('arg', 0.033), ('kleinberg', 0.033), ('wishart', 0.033), ('nonmonotone', 0.033), ('lij', 0.033), ('languages', 0.033), ('computable', 0.033), ('bj', 0.032), ('maxy', 0.031), ('eigenvalues', 0.031), ('variety', 0.031), ('conditioning', 0.031), ('eigenvectors', 0.03), ('local', 0.03), ('qi', 0.029), ('integral', 0.029), ('approximation', 0.029), ('placement', 0.028), ('processes', 0.028), ('simulated', 0.028), ('items', 0.027), ('qj', 0.027), ('optimization', 0.026), ('concave', 0.026), ('involves', 0.026), ('versus', 0.026), ('matching', 0.026), ('guarantee', 0.025), ('inference', 0.025), ('combinatorial', 0.024), ('ratios', 0.024), ('approximating', 0.024), ('subsets', 0.024), ('su', 0.023), ('leaving', 0.023), ('optimum', 0.023), ('minor', 0.023), ('offers', 0.023), ('plane', 0.022), ('volume', 0.022), ('guarantees', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1
2 0.37983236 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
Author: James T. Kwok, Ryan P. Adams
Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1
3 0.22361772 304 nips-2012-Selecting Diverse Features via Spectral Regularization
Author: Abhimanyu Das, Anirban Dasgupta, Ravi Kumar
Abstract: We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc. We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regularizers, when added to the objective function for linear regression, result in approximately submodular functions, which can then be maximized by efficient greedy and local search algorithms, with provable guarantees. We compare our algorithms to traditional greedy and 1 -regularization schemes and show that we obtain a more diverse set of features that result in the regression problem being stable under perturbations. 1
4 0.19716157 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications
Author: Rishabh Iyer, Jeff A. Bilmes
Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1
5 0.18348539 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
Author: Aaron Defazio, Tibério S. Caetano
Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinformatics dataset.
6 0.1167206 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover
7 0.10840738 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
8 0.096665449 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means
9 0.087974586 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
10 0.087624572 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
11 0.077298172 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
12 0.066266745 164 nips-2012-Iterative Thresholding Algorithm for Sparse Inverse Covariance Estimation
13 0.051645063 69 nips-2012-Clustering Sparse Graphs
14 0.05136982 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
15 0.051244184 12 nips-2012-A Neural Autoregressive Topic Model
16 0.04985977 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
17 0.049170498 126 nips-2012-FastEx: Hash Clustering with Exponential Families
18 0.04913754 204 nips-2012-MAP Inference in Chains using Column Generation
19 0.046710849 16 nips-2012-A Polynomial-time Form of Robust Regression
20 0.04610173 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
topicId topicWeight
[(0, 0.171), (1, 0.046), (2, 0.047), (3, -0.087), (4, -0.024), (5, 0.019), (6, -0.028), (7, -0.038), (8, 0.008), (9, -0.011), (10, 0.048), (11, 0.048), (12, 0.011), (13, 0.099), (14, 0.176), (15, 0.047), (16, -0.057), (17, -0.004), (18, -0.041), (19, 0.003), (20, 0.226), (21, 0.078), (22, 0.166), (23, 0.005), (24, 0.365), (25, 0.127), (26, -0.166), (27, -0.006), (28, 0.04), (29, -0.048), (30, 0.024), (31, 0.05), (32, 0.057), (33, 0.058), (34, -0.074), (35, 0.142), (36, -0.145), (37, 0.032), (38, -0.07), (39, 0.061), (40, 0.017), (41, -0.004), (42, -0.029), (43, -0.132), (44, 0.036), (45, -0.018), (46, 0.003), (47, -0.114), (48, 0.017), (49, -0.131)]
simIndex simValue paperId paperTitle
same-paper 1 0.90362436 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1
2 0.75660741 304 nips-2012-Selecting Diverse Features via Spectral Regularization
Author: Abhimanyu Das, Anirban Dasgupta, Ravi Kumar
Abstract: We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc. We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regularizers, when added to the objective function for linear regression, result in approximately submodular functions, which can then be maximized by efficient greedy and local search algorithms, with provable guarantees. We compare our algorithms to traditional greedy and 1 -regularization schemes and show that we obtain a more diverse set of features that result in the regression problem being stable under perturbations. 1
3 0.74371415 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications
Author: Rishabh Iyer, Jeff A. Bilmes
Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1
4 0.6009962 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover
Author: Andrew Delong, Olga Veksler, Anton Osokin, Yuri Boykov
Abstract: Inference in high-order graphical models has become important in recent years. Several approaches are based, for example, on generalized message-passing, or on transformation to a pairwise model with extra ‘auxiliary’ variables. We focus on a special case where a much more efficient transformation is possible. Instead of adding variables, we transform the original problem into a comparatively small instance of submodular vertex-cover. These vertex-cover instances can then be attacked by existing algorithms (e.g. belief propagation, QPBO), where they often run 4–15 times faster and find better solutions than when applied to the original problem. We evaluate our approach on synthetic data, then we show applications within a fast hierarchical clustering and model-fitting framework. 1
5 0.57170141 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
Author: James T. Kwok, Ryan P. Adams
Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1
6 0.48251456 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation
7 0.31058797 25 nips-2012-A new metric on the manifold of kernel matrices with application to matrix geometric means
8 0.30877185 359 nips-2012-Variational Inference for Crowdsourcing
9 0.28846192 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
10 0.27938351 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
11 0.26625079 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem
12 0.25818172 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts
13 0.25677794 137 nips-2012-From Deformations to Parts: Motion-based Segmentation of 3D Objects
14 0.25331289 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis
15 0.25073802 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization
16 0.25068575 12 nips-2012-A Neural Autoregressive Topic Model
17 0.25008941 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection
18 0.24194589 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
19 0.24172059 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
20 0.23986715 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion
topicId topicWeight
[(0, 0.065), (21, 0.019), (31, 0.182), (38, 0.151), (39, 0.02), (42, 0.034), (53, 0.049), (54, 0.026), (55, 0.017), (74, 0.113), (76, 0.122), (80, 0.087), (92, 0.041)]
simIndex simValue paperId paperTitle
1 0.84985662 119 nips-2012-Entropy Estimations Using Correlated Symmetric Stable Random Projections
Author: Ping Li, Cun-hui Zhang
Abstract: Methods for efficiently estimating Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) [11, 13] based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Interestingly, the estimator for the moment we recommend for entropy estimation barely has bounded variance itself, whereas the common geometric mean estimator (which has bounded higher-order moments) is not sufficient for entropy estimation. Our experiments confirm that this method is able to well approximate the Shannon entropy using small storage.
Author: Stefan Habenschuss, Johannes Bill, Bernhard Nessler
Abstract: Recent spiking network models of Bayesian inference and unsupervised learning frequently assume either inputs to arrive in a special format or employ complex computations in neuronal activation functions and synaptic plasticity rules. Here we show in a rigorous mathematical treatment how homeostatic processes, which have previously received little attention in this context, can overcome common theoretical limitations and facilitate the neural implementation and performance of existing models. In particular, we show that homeostatic plasticity can be understood as the enforcement of a ’balancing’ posterior constraint during probabilistic inference and learning with Expectation Maximization. We link homeostatic dynamics to the theory of variational inference, and show that nontrivial terms, which typically appear during probabilistic inference in a large class of models, drop out. We demonstrate the feasibility of our approach in a spiking WinnerTake-All architecture of Bayesian inference and learning. Finally, we sketch how the mathematical framework can be extended to richer recurrent network architectures. Altogether, our theory provides a novel perspective on the interplay of homeostatic processes and synaptic plasticity in cortical microcircuits, and points to an essential role of homeostasis during inference and learning in spiking networks. 1
same-paper 3 0.83363652 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes
Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar
Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1
4 0.79439312 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
Author: Ronald Ortner, Daniil Ryabko
Abstract: We derive sublinear regret bounds for undiscounted reinforcement learning in continuous state space. The proposed algorithm combines state aggregation with the use of upper confidence bounds for implementing optimism in the face of uncertainty. Beside the existence of an optimal policy which satisfies the Poisson equation, the only assumptions made are Hölder continuity of rewards and transition probabilities. 1
5 0.78152829 274 nips-2012-Priors for Diversity in Generative Latent Variable Models
Author: James T. Kwok, Ryan P. Adams
Abstract: Probabilistic latent variable models are one of the cornerstones of machine learning. They offer a convenient and coherent way to specify prior distributions over unobserved structure in data, so that these unknown properties can be inferred via posterior inference. Such models are useful for exploratory analysis and visualization, for building density models of data, and for providing features that can be used for later discriminative tasks. A significant limitation of these models, however, is that draws from the prior are often highly redundant due to i.i.d. assumptions on internal parameters. For example, there is no preference in the prior of a mixture model to make components non-overlapping, or in topic model to ensure that co-occurring words only appear in a small number of topics. In this work, we revisit these independence assumptions for probabilistic latent variable models, replacing the underlying i.i.d. prior with a determinantal point process (DPP). The DPP allows us to specify a preference for diversity in our latent variables using a positive definite kernel function. Using a kernel between probability distributions, we are able to define a DPP on probability measures. We show how to perform MAP inference with DPP priors in latent Dirichlet allocation and in mixture models, leading to better intuition for the latent variable representation and quantitatively improved unsupervised feature extraction, without compromising the generative aspects of the model. 1
6 0.77218437 260 nips-2012-Online Sum-Product Computation Over Trees
7 0.77043915 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification
8 0.76918572 176 nips-2012-Learning Image Descriptors with the Boosting-Trick
9 0.76797426 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration
10 0.76738369 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs
11 0.76551712 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves
12 0.76186866 313 nips-2012-Sketch-Based Linear Value Function Approximation
13 0.76064664 12 nips-2012-A Neural Autoregressive Topic Model
14 0.76051271 168 nips-2012-Kernel Latent SVM for Visual Recognition
15 0.76046389 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
16 0.76026338 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
17 0.76011652 8 nips-2012-A Generative Model for Parts-based Object Segmentation
18 0.75932711 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
19 0.75794506 180 nips-2012-Learning Mixtures of Tree Graphical Models
20 0.75724566 162 nips-2012-Inverse Reinforcement Learning through Structured Classification