nips nips2012 nips2012-328 knowledge-graph by maker-knowledge-mining

328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications


Source: pdf

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient, unlike other order-based distance measures.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. [sent-5, score-0.757]

2 We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. [sent-8, score-1.521]

3 We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. [sent-9, score-1.287]

4 We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. [sent-10, score-0.42]

5 In this paper we define a class of divergences between sets, where each divergence is parameterized by a submodular function. [sent-16, score-0.929]

6 This can alternatively and equivalently be seen as a divergence between binary vectors in the same way that submodular functions are special cases of pseudo-Boolean functions [3]. [sent-17, score-0.735]

7 We call this the class of submodular Bregman divergences (or just submodular Bregman). [sent-18, score-1.312]

8 We show an interesting mathematical property of the submodular Bregman, namely that they can be defined based on either a tight modular (linear) upper bound or alternatively a tight modular lower bound, unlike the traditional (continuous) Bregman definable only via a tight linear lower bound. [sent-19, score-0.818]

9 A set function f : 2^V → R is submodular if ∀ S, T ⊆ V, f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T). [sent-24, score-0.539]
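To make this definition concrete, here is a small illustrative Python sketch (ours, not from the paper) that brute-forces the submodularity inequality on a tiny ground set; the example function f(S) = sqrt(|S|) is an assumed concave-over-cardinality instance.

```python
import math
from itertools import chain, combinations

def powerset(V):
    """All subsets of the ground set V, as frozensets."""
    V = list(V)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))]

def is_submodular(f, V, tol=1e-9):
    """Brute-force check of f(S) + f(T) >= f(S ∪ T) + f(S ∩ T) for all S, T ⊆ V."""
    subsets = powerset(V)
    return all(f(S) + f(T) + tol >= f(S | T) + f(S & T)
               for S in subsets for T in subsets)

# Illustrative example (assumed, not from the paper): f(S) = sqrt(|S|).
V = {0, 1, 2, 3}
f = lambda S: math.sqrt(len(S))
print(is_submodular(f, V))  # True: a concave function of cardinality is submodular
```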

10 We in fact show a direct connection between a submodular Bregman and a generalized Bregman divergence defined through the Lovász extension. [sent-28, score-0.943]

11 Further background on submodular functions may be found in the text [9]. [sent-29, score-0.55]

12 We first define the different types of submodular Bregman in Section 2. [sent-31, score-0.539]

13 We also define the Lovász-Bregman divergence, and show its relation to a version of the submodular Bregman. [sent-32, score-0.764]

14 Then in Section 3, we prove a number of properties of the submodular Bregman and show how they are related to the Bregman divergence. [sent-33, score-0.539]

15 Finally in Section 4, we show how the submodular Bregman can be used in applications in machine learning. [sent-34, score-0.539]

16 In particular, we show how the proximal framework of the submodular Bregman generalizes a number of mirror-descent style approximate submodular optimization algorithms. [sent-35, score-1.153]

17 2 The Bregman and Submodular Bregman divergences. Notation: We use φ to refer to a convex function, f to refer to a submodular function, and f̂ to refer to f's Lovász extension. [sent-37, score-1.039]

18 A divergence on vectors and sets is formally defined as follows: given a domain of vectors or sets S (if sets, S is a lattice of sets L, where L is a lattice if ∀ X, Y ∈ L, X ∪ Y, X ∩ Y ∈ L), a function d : S × S → R+ is called a divergence if ∀ x, y ∈ S, d(x, y) ≥ 0 and ∀ x ∈ S, d(x, x) = 0. [sent-42, score-0.42]

19 2.2 The Submodular Bregman divergences. In a similar spirit, we define a submodular Bregman divergence parameterized by a submodular function and defined as the difference between the function and its modular (sometimes called linear) bounds. [sent-57, score-1.532]

20 Surprisingly, any submodular function has both a tight upper and lower modular bound ([15]), unlike strict convexity where only a tight first-order lower bound exists. [sent-58, score-0.745]

21 Hence, we define two distinct forms of submodular Bregman parameterized by a submodular function and in terms of either its tight upper or tight lower bounds. [sent-59, score-1.198]

22 In a manner similar to the generalized Bregman divergence ([13]), we define a discrete subgradient map H_f for a submodular function f, which for every set Y picks a subgradient H_f(Y) = h_Y ∈ ∂f(Y). [sent-64, score-0.878]

23 Then, given a submodular function f and a subgradient map H_f, the generalized lower bound submodular Bregman, which we shall henceforth call d_f^{H_f}, is defined as: d_f^{H_f}(X, Y) = f(X) − f(Y) − h_Y(X) + h_Y(Y) = f(X) − f(Y) − ⟨H_f(Y), 1_X − 1_Y⟩. (6) [sent-65, score-1.989]
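As an illustration of equation (6), the following hedged Python sketch builds one extreme-point subgradient h_Y by the standard greedy construction over a permutation that lists Y's elements first, and then evaluates the resulting lower bound submodular Bregman; the helper names are ours, not the paper's.

```python
def greedy_subgradient(f, V, Y):
    """
    One extreme-point subgradient h_Y of f at the set Y: take a permutation of V
    that lists the elements of Y first, and assign each element its marginal gain
    along that order (Edmonds' greedy construction). Returns {element: weight}.
    """
    order = sorted(V, key=lambda v: v not in Y)    # Y's elements come first
    h, prefix, prev = {}, set(), f(frozenset())
    for v in order:
        prefix.add(v)
        cur = f(frozenset(prefix))
        h[v] = cur - prev                          # gain of v given the prefix
        prev = cur
    return h

def lower_bound_bregman(f, V, X, Y):
    """d_f^{H_f}(X, Y) = f(X) - f(Y) - <h_Y, 1_X - 1_Y> with the greedy h_Y (eq. 6)."""
    h = greedy_subgradient(f, V, Y)
    inner = sum(h[v] for v in X) - sum(h[v] for v in Y)
    return f(frozenset(X)) - f(frozenset(Y)) - inner
```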

24 We remark here that, similar to the definition of the generalized Bregman divergence, this submodular Bregman is parameterized both by the submodular function f and the subgradient map H_f. [sent-66, score-1.199]

25 The subdifferential corresponding to a submodular function is an unbounded polyhedron [9], with an uncountable number of possible subgradients. [sent-67, score-0.584]

26 Thus, we define a subclass of d_f^{H_f} with H_f chosen so that it picks an extreme point of ∂f(Y); we call this the permutation-based lower bound submodular Bregman, henceforth referred to as d_f^Σ. [sent-69, score-1.115]

27 As can readily be seen, the d_f^Σ are special cases of the d_f^{H_f}. [sent-81, score-0.412]

28 Similar to the extreme generalized Bregman divergences above, we can define two "extreme" lower bound submodular Bregman divergences, one for each of the two extreme choices of subgradient. [sent-82, score-1.803]

29 For every h_Y ∈ ∂f(Y) ∩ P_f, d_f^{H_f}(X, Y) lies between these two extreme divergences. [sent-85, score-1.21]

30 Similarly, for every permutation map Σ, d_f^Σ(X, Y) lies between the two extreme divergences. [sent-86, score-0.877]

31 Similarly, the other extreme divergence can be expanded as f(X) − f(Y) − f(Y \ X) − f(V) + f(V \ X \ Y); the above theorem gives bounds for d_f^{H_f} and d_f^Σ. [sent-88, score-0.81]

32 Further, we see that one of the extreme divergences is exactly the divergence that defines the submodularity of f. [sent-89, score-0.577]

33 Similar to the approach in ([15]), we can relax the Nemhauser divergences to obtain three modular upper bound submodular Bregmans as:
d_f^1(X, Y) = f(X) − Σ_{j ∈ X\Y} f(j | X − {j}) + Σ_{j ∈ Y\X} f(j | ∅) − f(Y),   (11)
d_f^2(X, Y) = f(X) − Σ_{j ∈ X\Y} f(j | V − {j}) + Σ_{j ∈ Y\X} f(j | X) − f(Y),   (12)
d_f^3(X, Y) = f(X) − Σ_{j ∈ X\Y} f(j | V − {j}) + Σ_{j ∈ Y\X} f(j | ∅) − f(Y).   (13) [sent-94, score-1.659]

34 We call these the Nemhauser based upper-bound submodular Bregmans of, respectively, type-I, II and III. [sent-96, score-0.95]
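For concreteness, here is a hedged sketch of one such divergence, built from the standard Nemhauser-style modular upper bound of f anchored at X; the exact correspondence to the type-I/II/III numbering above is our assumption.

```python
def gain(f, j, S):
    """Marginal gain f(j | S) = f(S ∪ {j}) - f(S)."""
    S = frozenset(S)
    return f(S | {j}) - f(S)

def upper_bound_bregman(f, X, Y):
    """
    Upper bound submodular Bregman obtained from a Nemhauser-style modular upper
    bound of f anchored at X:
        f(Y) <= f(X) - sum_{j in X\\Y} f(j | X - {j}) + sum_{j in Y\\X} f(j | {})
    The divergence is the slack of this bound at Y; it is nonnegative and zero at Y = X.
    """
    X, Y = frozenset(X), frozenset(Y)
    bound = (f(X)
             - sum(gain(f, j, X - {j}) for j in X - Y)
             + sum(gain(f, j, frozenset()) for j in Y - X))
    return bound - f(Y)
```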

35 Henceforth, we shall refer to them as d_f^1, d_f^2 and d_f^3, and when referring to them collectively, we will use d_f^{1:3}. [sent-97, score-1.636]

36 The Nemhauser divergences are analogous to the extreme divergences of the generalized Bregman divergences, since they bound the Nemhauser based submodular Bregmans. [sent-98, score-1.317]

37 It is not hard to observe that for a submodular function f, d_f^3(X, Y) ≥ d_f^1(X, Y). [sent-99, score-1.739]

38 Similarly, d_f^3(X, Y) ≥ d_f^2(X, Y). Similar to the generalized lower bound submodular Bregman d_f^{H_f}, we define a generalized upper bound submodular Bregman divergence d_f^{G_f} in terms of any supergradient of f. [sent-100, score-3.403]

39 Interestingly, for a submodular function, we can define a superdifferential ∂^f(X) at X as follows: ∂^f(X) = {x ∈ R^V : ∀ Y ⊆ V, f(X) − x(X) ≥ f(Y) − x(Y)}. [sent-101, score-0.539]

40 In fact, it can be shown that all three forms of d_f^{1:3} are actually special cases of d_f^{G_f}, in that they correspond to specific supergradient maps. [sent-103, score-0.876]

41 Then it can be shown [12] that the corresponding g_X^1, g_X^2, g_X^3 ∈ ∂^f(X), and correspondingly d_f^1, d_f^2 and d_f^3 are special cases of d_f^{G_f}. [sent-106, score-1.622]

42 d_f^{G_f} also subsumes an interesting class of divergences for any submodular function representable as concave over modular. [sent-107, score-1.202]

43 Consider any decomposable submodular function [24] f, representable as f(X) = Σ_i λ_i h_i(m_i(X)), where the h_i's are (not necessarily smooth) concave functions and the m_i's are vectors in R^n. [sent-108, score-0.725]
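A minimal sketch of such a concave-over-modular (decomposable) submodular function; the particular weights and concave functions below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def concave_over_modular(lams, concaves, weights):
    """
    Decomposable submodular function f(S) = sum_i lam_i * h_i(m_i(S)), where each
    h_i is concave, each m_i is a nonnegative weight vector, and m_i(S) = sum_{j in S} m_i[j].
    """
    def f(S):
        return sum(lam * h(sum(m[j] for j in S))
                   for lam, h, m in zip(lams, concaves, weights))
    return f

# Illustrative instance (assumed): f(S) = sqrt(w1(S)) + 0.5 * log(1 + w2(S)).
w1 = np.array([1.0, 2.0, 3.0, 4.0])
w2 = np.array([2.0, 1.0, 1.0, 3.0])
f = concave_over_modular([1.0, 0.5], [np.sqrt, np.log1p], [w1, w2])
print(f({0, 2}))   # evaluates the decomposed function on the set {0, 2}
```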

44 Table 1: Instances of weighted distance measures as special cases of d_f^{H_f} and d_f^{G_f} for w ∈ R_+^n; the rows include the weighted Hamming distance (appearing as both d_f^{H_f} and d_f^{G_f}), recall, precision, AER, and conditional mutual information. [sent-112, score-2.437]

45 2.3 The Lovász-Bregman divergence. The Lovász extension ([20]) offers a natural connection between submodularity and convexity. [sent-117, score-0.641]

46 The Lovász extension is a non-smooth convex function, and hence we can define a generalized Bregman divergence ([13, 18]) which has a number of properties and applications analogous to the Bregman divergence. [sent-118, score-0.457]

47 The Lovász extension of a submodular function has a very interesting set of subgradients, which have a particularly nice structure in that there is a very simple way of obtaining them [7]. [sent-120, score-0.809]

48 These subgradients are really the extreme points of the submodular polyhedron. [sent-126, score-0.611]

49 Then define the Lovász-Bregman divergence d_{f̂} as the Bregman divergence of f̂ with the subgradient h_{y,σ_y}. [sent-127, score-0.966]

50 Hence for simplicity we just refer to the Lovász-Bregman divergence as d_{f̂}. [sent-132, score-0.788]

51 The Lovász-Bregman divergence is closely related to the lower bound submodular Bregman, as we show below. [sent-133, score-0.942]

52 The Lovász-Bregman divergences are an extension of the lower bound submodular Bregman over the interior of the hypercube. [sent-136, score-1.034]

53 Further, the Lovász-Bregman divergence can be expressed as d_{f̂}(x, y) = ⟨x, h_{x,σ_x} − h_{y,σ_y}⟩, and hence depends only on x, the permutation σ_x, and the permutation σ_y of y, but is independent of the values of y. [sent-137, score-0.889]
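This permutation structure makes the divergence easy to compute: sort each vector, accumulate greedy gains of f along that order to get the subgradient, and take the inner product above. The Python sketch below is ours (the function names are assumptions), not the paper's reference code.

```python
import numpy as np

def lovasz_subgradient(f, y):
    """
    Subgradient h_{y,σ_y} of the Lovász extension at y: sort the coordinates of y in
    decreasing order (the permutation σ_y) and take the greedy marginal gains of f
    along that order. f maps a frozenset of indices to a real; y is a 1-D array.
    """
    y = np.asarray(y, dtype=float)
    order = np.argsort(-y)                 # σ_y: indices of y in decreasing order
    h = np.zeros(len(y))
    prefix, prev = set(), f(frozenset())
    for idx in order:
        prefix.add(int(idx))
        cur = f(frozenset(prefix))
        h[idx] = cur - prev                # gain of this element given the prefix
        prev = cur
    return h

def lovasz_bregman(f, x, y):
    """d_{f̂}(x, y) = <x, h_{x,σ_x} - h_{y,σ_y}>; depends on y only through its ordering σ_y."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x, lovasz_subgradient(f, x) - lovasz_subgradient(f, y)))
```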

54 3 Properties of the submodular Bregman and Lovász-Bregman divergences. In this section, we investigate some of the properties of the submodular Bregman and Lovász-Bregman divergences which make these divergences interesting for machine learning applications. [sent-138, score-2.208]

55 All forms of the submodular Bregman divergences are non-negative, and hence they are valid divergences. [sent-140, score-0.793]

56 The lower bound submodular Bregman is submodular in X for a given Y , while the upper bound submodular Bregman is supermodular in Y for a given X. [sent-141, score-1.697]

57 In addition to these, the forms of the submodular Bregman divergence also satisfy interesting properties such as a characterization of equivalence classes, a form of set separation, a generalized triangle inequality over sets, and a form of both Fenchel and submodular duality. [sent-143, score-1.296]

58 Finally, the generalized submodular Bregman divergences have an interesting alternate characterization, which shows that they can potentially subsume a large number of discrete divergences. [sent-144, score-0.744]

59 In particular, a divergence d is of the form d_f^{H_f} iff for any sets A, B ⊆ V, the set function f_A(X) = d(X, A) is submodular in X and the set function d(X, A) − d(X, B) is modular in X. [sent-145, score-1.158]

60 Similarly, a divergence d is of the form d_f^{G_f} iff, for any sets A, B ⊆ V, the set function f_A(Y) = d(A, Y) is supermodular in Y and the set function d(A, Y) − d(B, Y) is modular in Y. [sent-146, score-0.623]

61 These facts show that the generalized Bregman divergences are potentially a very large class of divergences while Table 1 provides just a few of them. [sent-147, score-0.48]

62 [12] Given a submodular function whose polyhedron contains all possible extreme points (e. [sent-152, score-0.604]

63 Hence the Lovász-Bregman divergence can be seen as a divergence between the permutations. [sent-155, score-0.515]

64 While a number of distance measures capture the notion of a distance amongst orderings [17], the Lovász-Bregman divergences have a unique feature not present in these distance measures. [sent-156, score-0.574]

65 The Lovász-Bregman divergences not only capture the distance between σ_x and σ_y, but also weight it with the value of x, thus giving preference to the values and not just the orderings. [sent-157, score-0.471]

66 Hence it can be seen as a divergence between a score x and a permutation σ_y, and hence we shall also represent it as d_{f̂}(x, y) = d_{f̂}(x ∥ σ_y) = ⟨x, h_{x,σ_x} − h_{x,σ_y}⟩. [sent-158, score-1.029]

67 Finally, as we shall see, the Lovász-Bregman divergences are easily amenable to k-means style alternating minimization algorithms for clustering ranked data, a process that is typically difficult using other permutation-based distances. [sent-162, score-0.535]

68 4 Applications. In this section, we show the utility of the submodular Bregman and Lovász-Bregman divergences by considering some practical applications in machine learning and optimization. [sent-163, score-0.987]

69 4.1 A proximal framework for the submodular Bregman divergence. The Bregman divergence has some nice properties related to a proximal method. [sent-167, score-0.945]

70 We define a similar framework for the submodular Bregmans. [sent-170, score-0.539]

71 The proximal iteration is X^{t+1} := argmin_{X ∈ S} F(X) + λ d(X, X^t), with t ← t + 1; for example, if S is the constraint that X be a graph cut, then this optimization problem is NP-hard even if F is submodular ([15]). [sent-174, score-0.539]

72 Consider now a divergence d(X, Y ) that can be either an upper or lower bound submodular Bregman. [sent-175, score-0.73]
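A brute-force sketch of this proximal iteration with S = 2^V, purely for illustration on tiny ground sets; in practice the inner step would be handed to a submodular optimization routine as described here, and the helper names are our assumptions.

```python
from itertools import chain, combinations

def proximal_minimize(F, d, V, X0=frozenset(), lam=1.0, max_iter=50):
    """
    Sketch of the proximal update X^{t+1} := argmin_{X ⊆ V} F(X) + λ d(X, X^t),
    solved here by exhaustive search (only sensible for tiny ground sets).
    F is the objective set function; d is an upper or lower bound submodular Bregman.
    """
    subsets = [frozenset(c) for c in
               chain.from_iterable(combinations(list(V), r) for r in range(len(V) + 1))]
    X = frozenset(X0)
    for _ in range(max_iter):
        nxt = min(subsets, key=lambda S: F(S) + lam * d(S, X))
        if nxt == X:          # fixed point: no improving proximal step
            return X
        X = nxt
    return X
```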

73 We show this below: Minimizing the difference between submodular (DS) functions: Consider the case where F (X) = f (X) − g(X) is a difference between two submodular functions f and g. [sent-180, score-1.089]

74 [11] Submodular function minimization: Algorithm 1 also generalizes a number of approximate submodular minimization algorithms. [sent-189, score-0.573]

75 If F is a submodular function and the underlying constraints S represent the family of cuts, then we obtain the cooperative cut problem ([15], [14]) and one of the algorithms developed in ([15]) is a special case of Algorithm 1. [sent-190, score-0.563]

76 If S = 2^V above, we get a form of the approximate submodular minimization algorithm suggested for arbitrary (non-graph representable) submodular functions ([16]). [sent-191, score-1.112]

77 The proximal minimization algorithm also generalizes three submodular function minimization algorithms, IMA-I, II and III, described in detail in [12], again with λ = 1, S = 2^V and d = d_f^1, d_f^2 and d_f^3 respectively. [sent-192, score-1.844]

78 These algorithms are similar to the greedy algorithm for submodular maximization [23]. [sent-193, score-0.552]

79 Interestingly these algorithms provide bounds to the lattice of minimizers of the submodular functions. [sent-194, score-0.599]

80 Thus the lattice formed with A and B defined as the join and meet respectively, gives a bound on the minimizers, and we can restrict the submodular minimization algorithms to this lattice. [sent-196, score-0.615]

81 However, using d = d_f^3 as a regularizer (which is IMA-III) and starting with X^0 = ∅ and X^0 = V, we get the sets A and B [10, 12] respectively from Algorithm 1. [sent-197, score-0.41]

82 With versions of Algorithm 1 with d = d_f^1 and d = d_f^2, and starting respectively from X^0 = ∅ and X^0 = V, we get sets that provide a tighter bound on the lattice of minimizers than the one obtained with A and B. [sent-198, score-0.88]

83 Further, these algorithms also provide improved bounds in the context of monotone submodular minimization subject to combinatorial constraints. [sent-199, score-0.588]

84 In particular, these algorithms provide bounds which are better than 1/ν, where ν is a parameter related to the curvature of the submodular function. [sent-200, score-0.549]

85 For a more elaborate and detailed discussion related to this, refer to [10]. Submodular function maximization: If f is a submodular function, then using d(X, X^v) = d_f^{Σ_v}(X, X^v) forms an iterative algorithm for maximizing the modular lower bound of a submodular function. [sent-202, score-1.225]

86 This algorithm then generalizes a number of unconstrained and constrained submodular maximization algorithms, in that by an appropriate schedule of Σ_v we can obtain these algorithms. [sent-203, score-1.113]

87 be/kfEnLOmvEVc Figure 1: Results of k-means clustering using the Lovász-Bregman divergence (two plots on the left) and the Euclidean distance (two plots on the right). [sent-206, score-0.423]

88 approximation algorithm for unconstrained and cardinality constrained submodular maximization respectively. [sent-208, score-0.567]

89 4.2 Clustering framework with the Lovász-Bregman divergence. In this section we investigate a clustering framework similar to [2], using the Lovász-Bregman divergence, and show how this is natural for a number of applications. [sent-211, score-0.77]

90 Recall that the Lovász-Bregman divergence in some sense measures the distance between the ordering of the vectors and can be seen as a form of the "sorting" distance. [sent-212, score-0.445]

91 Given a set of vectors X and a Lovász-Bregman divergence d_{f̂}, a natural choice of a representative (in this case a permutation) is the point with minimum average distance, or in other words: σ* = argmin_σ Σ_{i=1}^n d_{f̂}(x_i ∥ σ). [sent-217, score-1.21]

92 Interestingly, for the Lovász-Bregman divergence this problem is easy and the representative permutation is exactly the permutation of the arithmetic mean of X (Theorem 4). [sent-218, score-0.499]

93 Then we define an objective which captures this idea of clustering vectors into subsets of similar orderings: min_{M,C} Σ_{j=1}^k Σ_{x_i ∈ C_j} d_{f̂}(x_i, μ_j). [sent-221, score-0.447]
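Putting the pieces together, here is a hedged sketch of the alternating-minimization (k-means style) clustering described here; it reuses the lovasz_bregman helper from the earlier sketch, and by the quoted result each cluster representative's ordering is simply the ordering of the cluster's arithmetic mean.

```python
import numpy as np

def lovasz_bregman_kmeans(f, X, k, n_iter=100, seed=0):
    """
    Alternating minimization of sum_j sum_{x_i in C_j} d_{f̂}(x_i, μ_j).
    Assignment: nearest representative under lovasz_bregman (defined above).
    Update: arithmetic mean of each cluster; its ordering is the representative
    permutation, so no expensive rank-aggregation step is needed.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    means = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iter):
        new_labels = np.array([
            int(np.argmin([lovasz_bregman(f, x, mu) for mu in means])) for x in X])
        if np.array_equal(new_labels, labels):
            break                        # assignments stable: converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:         # keep the old mean for empty clusters
                means[j] = members.mean(axis=0)
    return labels, means
```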

94 The Lovász-Bregman divergence naturally captures the notion of distance between orderings of vectors and yet the problem of finding the representative in this case is very easy. [sent-229, score-0.463]

95 We use the submodular function f (X) = w(X), for an arbitrary vector w ensuring unique base extreme points. [sent-233, score-0.584]

96 Acknowledgments: We thank Stefanie Jegelka, Karthik Narayanan, Andrew Guillory, Hui Lin, John Halloran and the rest of the submodular group at UW-EE for discussions. [sent-235, score-0.539]

97 A framework of mirror descent algorithms for submodular optimization. [sent-292, score-0.548]

98 Algorithms for approximate minimization of the difference between submodular functions, with applications. [sent-297, score-0.562]

99 Submodularity beyond submodular energies: coupling edges in graph cuts. [sent-318, score-0.539]

100 An analysis of approximations for maximizing submodular set functions—i. [sent-359, score-0.539]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('submodular', 0.539), ('bregman', 0.524), ('df', 0.4), ('lov', 0.25), ('sz', 0.225), ('divergences', 0.223), ('divergence', 0.145), ('gx', 0.134), ('hy', 0.121), ('hf', 0.067), ('modular', 0.064), ('permutation', 0.053), ('subgradient', 0.051), ('proximal', 0.048), ('supergradient', 0.046), ('extreme', 0.045), ('nemhauser', 0.04), ('jegelka', 0.035), ('generalized', 0.034), ('iyer', 0.033), ('lattice', 0.033), ('hi', 0.033), ('submodularity', 0.032), ('clustering', 0.03), ('orderings', 0.03), ('mi', 0.03), ('pf', 0.029), ('subgradients', 0.027), ('tight', 0.027), ('subdifferential', 0.025), ('minimization', 0.023), ('representative', 0.023), ('distance', 0.023), ('parameterized', 0.022), ('hamming', 0.021), ('polyhedron', 0.02), ('nice', 0.02), ('bound', 0.02), ('ordering', 0.02), ('representable', 0.019), ('picks', 0.019), ('cuts', 0.018), ('forms', 0.018), ('cm', 0.018), ('refer', 0.018), ('shall', 0.018), ('bregmans', 0.017), ('vectors', 0.017), ('minimizers', 0.017), ('hx', 0.016), ('style', 0.016), ('combinatorial', 0.016), ('convex', 0.016), ('yk', 0.016), ('xx', 0.016), ('cj', 0.015), ('bilmes', 0.015), ('cardinality', 0.015), ('discrete', 0.015), ('henceforth', 0.015), ('decomposable', 0.015), ('measures', 0.015), ('elaborate', 0.014), ('supermodular', 0.014), ('voters', 0.014), ('map', 0.014), ('extension', 0.014), ('np', 0.013), ('maximization', 0.013), ('bf', 0.013), ('upper', 0.013), ('lower', 0.013), ('hence', 0.013), ('cooperative', 0.012), ('ne', 0.012), ('amongst', 0.012), ('special', 0.012), ('characters', 0.011), ('interestingly', 0.011), ('representatives', 0.011), ('generalize', 0.011), ('call', 0.011), ('interesting', 0.011), ('fa', 0.011), ('schedule', 0.011), ('functions', 0.011), ('generalizes', 0.011), ('correspondingly', 0.01), ('weighted', 0.01), ('totally', 0.01), ('concave', 0.01), ('analogous', 0.01), ('bounds', 0.01), ('concavity', 0.01), ('math', 0.01), ('sets', 0.01), ('every', 0.01), ('convexity', 0.009), ('recall', 0.009), ('mirror', 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient, unlike other order-based distance measures.

2 0.30905637 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Aaron Defazio, Tibério S. Caetano

Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinfomatics dataset.

3 0.2781415 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

Author: Ke Jiang, Brian Kulis, Michael I. Jordan

Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1

4 0.19883104 304 nips-2012-Selecting Diverse Features via Spectral Regularization

Author: Abhimanyu Das, Anirban Dasgupta, Ravi Kumar

Abstract: We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc. We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regularizers, when added to the objective function for linear regression, result in approximately submodular functions, which can then be maximized by efficient greedy and local search algorithms, with provable guarantees. We compare our algorithms to traditional greedy and 1 -regularization schemes and show that we obtain a more diverse set of features that result in the regression problem being stable under perturbations. 1

5 0.19716157 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar

Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1

6 0.14154455 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

7 0.13816828 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

8 0.11561385 142 nips-2012-Generalization Bounds for Domain Adaptation

9 0.06952066 24 nips-2012-A mechanistic model of early sensory processing based on subtracting sparse representations

10 0.056283016 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

11 0.04603238 231 nips-2012-Multiple Operator-valued Kernel Learning

12 0.044701945 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

13 0.036182944 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

14 0.035774518 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

15 0.033544332 282 nips-2012-Proximal Newton-type methods for convex optimization

16 0.031164583 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

17 0.030499276 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

18 0.028171454 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

19 0.026864031 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing

20 0.026131883 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.089), (1, 0.048), (2, 0.046), (3, -0.067), (4, -0.016), (5, 0.038), (6, -0.035), (7, -0.074), (8, 0.009), (9, 0.03), (10, 0.05), (11, -0.037), (12, -0.004), (13, 0.06), (14, 0.167), (15, -0.029), (16, -0.113), (17, -0.056), (18, -0.118), (19, -0.023), (20, 0.176), (21, 0.059), (22, 0.192), (23, 0.009), (24, 0.348), (25, 0.141), (26, -0.222), (27, -0.017), (28, -0.01), (29, -0.078), (30, 0.002), (31, 0.121), (32, -0.053), (33, 0.008), (34, -0.084), (35, 0.094), (36, 0.081), (37, -0.058), (38, -0.002), (39, -0.028), (40, 0.004), (41, -0.066), (42, 0.008), (43, 0.022), (44, 0.088), (45, -0.032), (46, -0.037), (47, 0.015), (48, -0.025), (49, 0.095)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.98038584 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lovász extension of a submodular function, which we call the Lovász-Bregman divergence, is a continuous extension of a submodular Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lovász-Bregman divergence is natural in clustering scenarios where ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient, unlike other order-based distance measures.

2 0.64976716 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

Author: Jennifer Gillenwater, Alex Kulesza, Ben Taskar

Abstract: Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data. 1

3 0.62009561 304 nips-2012-Selecting Diverse Features via Spectral Regularization

Author: Abhimanyu Das, Anirban Dasgupta, Ravi Kumar

Abstract: We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc. We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regularizers, when added to the objective function for linear regression, result in approximately submodular functions, which can then be maximized by efficient greedy and local search algorithms, with provable guarantees. We compare our algorithms to traditional greedy and 1 -regularization schemes and show that we obtain a more diverse set of features that result in the regression problem being stable under perturbations. 1

4 0.61147386 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

Author: Andrew Delong, Olga Veksler, Anton Osokin, Yuri Boykov

Abstract: Inference in high-order graphical models has become important in recent years. Several approaches are based, for example, on generalized message-passing, or on transformation to a pairwise model with extra ‘auxiliary’ variables. We focus on a special case where a much more efficient transformation is possible. Instead of adding variables, we transform the original problem into a comparatively small instance of submodular vertex-cover. These vertex-cover instances can then be attacked by existing algorithms (e.g. belief propagation, QPBO), where they often run 4–15 times faster and find better solutions than when applied to the original problem. We evaluate our approach on synthetic data, then we show applications within a fast hierarchical clustering and model-fitting framework. 1

5 0.54851907 6 nips-2012-A Convex Formulation for Learning Scale-Free Networks via Submodular Relaxation

Author: Aaron Defazio, Tibério S. Caetano

Abstract: A key problem in statistics and machine learning is the determination of network structure from data. We consider the case where the structure of the graph to be reconstructed is known to be scale-free. We show that in such cases it is natural to formulate structured sparsity inducing priors using submodular functions, and we use their Lovász extension to obtain a convex relaxation. For tractable classes such as Gaussian graphical models, this leads to a convex optimization problem that can be efficiently solved. We show that our method results in an improvement in the accuracy of reconstructed networks for synthetic data. We also show how our prior encourages scale-free reconstructions on a bioinfomatics dataset.

6 0.36157238 337 nips-2012-The Lovász ϑ function, SVMs and finding large dense subgraphs

7 0.29291147 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.28054237 359 nips-2012-Variational Inference for Crowdsourcing

9 0.25434169 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

10 0.22914076 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

11 0.22706336 142 nips-2012-Generalization Bounds for Domain Adaptation

12 0.20180774 69 nips-2012-Clustering Sparse Graphs

13 0.19459572 338 nips-2012-The Perturbed Variation

14 0.1943503 132 nips-2012-Fiedler Random Fields: A Large-Scale Spectral Approach to Statistical Network Modeling

15 0.189375 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

16 0.18445988 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

17 0.1813204 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

18 0.17812905 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization

19 0.17496593 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

20 0.17196451 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.035), (38, 0.102), (42, 0.023), (53, 0.434), (54, 0.013), (55, 0.014), (68, 0.015), (74, 0.045), (76, 0.078), (80, 0.042), (92, 0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72934687 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications

Author: Rishabh Iyer, Jeff A. Bilmes

Abstract: We introduce a class of discrete divergences on sets (equivalently binary vectors) that we call the submodular-Bregman divergences. We consider two kinds, defined either from tight modular upper or tight modular lower bounds of a submodular function. We show that the properties of these divergences are analogous to the (standard continuous) Bregman divergence. We demonstrate how they generalize many useful divergences, including the weighted Hamming distance, squared weighted Hamming, weighted precision, recall, conditional mutual information, and a generalized KL-divergence on sets. We also show that the generalized Bregman divergence on the Lov´ sz extension of a submodular function, which we a call the Lov´ sz-Bregman divergence, is a continuous extension of a submodular a Bregman divergence. We point out a number of applications, and in particular show that a proximal algorithm defined through the submodular Bregman divergence provides a framework for many mirror-descent style algorithms related to submodular function optimization. We also show that a generalization of the k-means algorithm using the Lov´ sz Bregman divergence is natural in clustering scenarios where a ordering is important. A unique property of this algorithm is that computing the mean ordering is extremely efficient unlike other order based distance measures. 1

2 0.66645616 359 nips-2012-Variational Inference for Crowdsourcing

Author: Qiang Liu, Jian Peng, Alex Ihler

Abstract: Crowdsourcing has become a popular paradigm for labeling large datasets. However, it has given rise to the computational task of aggregating the crowdsourced labels provided by a collection of unreliable annotators. We approach this problem by transforming it into a standard inference problem in graphical models, and applying approximate variational methods, including belief propagation (BP) and mean field (MF). We show that our BP algorithm generalizes both majority voting and a recent algorithm by Karger et al. [1], while our MF method is closely related to a commonly used EM algorithm. In both cases, we find that the performance of the algorithms critically depends on the choice of a prior distribution on the workers’ reliability; by choosing the prior properly, both BP and MF (and EM) perform surprisingly well on both simulated and real-world datasets, competitive with state-of-the-art algorithms based on more complicated modeling assumptions. 1

3 0.66127473 10 nips-2012-A Linear Time Active Learning Algorithm for Link Classification

Author: Nicolò Cesa-bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella

Abstract: We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of a initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to whithin a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V |3/2 ) by querying O(|V |3/2 ) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V | + (|V |/k)3/2 edge labels. The running time of this algorithm is at most of order |E| + |V | log |V |. 1

4 0.59312677 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

5 0.5928067 12 nips-2012-A Neural Autoregressive Topic Model

Author: Hugo Larochelle, Stanislas Lauly

Abstract: We describe a new model for learning meaningful representations of text documents from an unlabeled collection of documents. This model is inspired by the recently proposed Replicated Softmax, an undirected graphical model of word counts that was shown to learn a better generative model and more meaningful document representations. Specifically, we take inspiration from the conditional mean-field recursive equations of the Replicated Softmax in order to define a neural network architecture that estimates the probability of observing a new word in a given document given the previously observed words. This paradigm also allows us to replace the expensive softmax distribution over words with a hierarchical distribution over paths in a binary tree of words. The end result is a model whose training complexity scales logarithmically with the vocabulary size instead of linearly as in the Replicated Softmax. Our experiments show that our model is competitive both as a generative model of documents and as a document representation learning algorithm. 1

6 0.54704708 30 nips-2012-Accuracy at the Top

7 0.49757165 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

8 0.37833133 214 nips-2012-Minimizing Sparse High-Order Energies by Submodular Vertex-Cover

9 0.37506965 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

10 0.35624135 160 nips-2012-Imitation Learning by Coaching

11 0.35520583 129 nips-2012-Fast Variational Inference in the Conjugate Exponential Family

12 0.35311103 293 nips-2012-Relax and Randomize : From Value to Algorithms

13 0.35028931 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.34963521 163 nips-2012-Isotropic Hashing

15 0.34898356 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

16 0.34735611 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

17 0.34701774 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.34699669 304 nips-2012-Selecting Diverse Features via Spectral Regularization

19 0.34567839 298 nips-2012-Scalable Inference of Overlapping Communities

20 0.34550193 71 nips-2012-Co-Regularized Hashing for Multimodal Data