nips nips2010 nips2010-258 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
Reference: text
sentIndex sentText sentNum sentScore
1 Structured sparsity-inducing norms through submodular functions Francis Bach INRIA - Willow project-team Laboratoire d’Informatique de l’Ecole Normale Supérieure Paris, France francis. [sent-1, score-0.972]
2 This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. [sent-6, score-0.443]
3 This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). [sent-8, score-0.277]
4 By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning. [sent-9, score-1.114]
5 In these two situations, reducing parsimony to finding models with low cardinality turns out to be limiting, and structured parsimony has emerged as a fruitful practical extension, with applications to image processing, text processing or bioinformatics (see, e. [sent-15, score-0.306]
6 For example, in [4], structured sparsity is used to encode prior knowledge regarding network relationship between genes, while in [6], it is used as an alternative to structured nonparametric Bayesian process based priors for topic models. [sent-18, score-0.217]
7 In this paper, we instead follow the approach of [8, 3] and consider specific penalty functions F (Supp(w)) of the support set, which go beyond the cardinality function, but are not limited or designed to only forbid certain sparsity patterns. [sent-20, score-0.355]
8 2, these may also lead to restricted sets of supports but their interpretation in terms of an explicit penalty on the support leads to additional insights into the behavior of structured sparsity-inducing norms (see, e. [sent-22, score-0.497]
9 , forward selection) to the problem are considered in [8, 3], we provide convex relaxations to the function w → F (Supp(w)), which extend the traditional link between the ℓ1 -norm and the cardinality function. [sent-28, score-0.217]
10 This is done for a particular ensemble of set-functions F , namely nondecreasing submodular functions. [sent-29, score-0.87]
11 Submodular functions may be seen as the set-function equivalent of convex functions, and exhibit many interesting properties that we review in Section 2—see [9] for a tutorial on submodular analysis and [10, 11] for other applications to machine learning. [sent-30, score-0.756]
12 This paper makes the following contributions: − We make explicit links between submodularity and sparsity by showing that the convex envelope of the function w → F (Supp(w)) on the ℓ∞ -ball may be readily obtained from the Lovász extension of the submodular function (Section 3). [sent-31, score-1.07]
13 , subgradients and proximal operators (Section 5), as well as theoretical guarantees, i. [sent-34, score-0.131]
14 , conditions for support recovery or high-dimensional inference (Section 6), that extend classical results for the ℓ1 -norm and show that many norms may be tackled by the exact same analysis and algorithms. [sent-36, score-0.388]
15 , p} denotes the support of w, defined as Supp(w) = {j ∈ V, wj ≠ 0}. [sent-43, score-0.18]
16 2 Review of submodular function theory Throughout this paper, we consider a nondecreasing submodular function F defined on the power set 2V of V = {1, . [sent-48, score-1.532]
17 Classical examples are the cardinality function (which will lead to the ℓ1 -norm) and, given a partition of V into B1 ∪ · · · ∪ Bk = V , the set-function A → F (A) which is equal to the number of groups B1 , . [sent-60, score-0.3]
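As a sanity check on these definitions, here is a brute-force Python sketch (illustrative only, exponential in p; the helper names and the representation of F as a callable on frozensets are assumptions, not from the paper) that verifies a set-function is nondecreasing and submodular:

    from itertools import chain, combinations

    def all_subsets(V):
        return [frozenset(S) for S in chain.from_iterable(combinations(V, r) for r in range(len(V) + 1))]

    def is_nondecreasing_submodular(F, V):
        # F nondecreasing: F(A) <= F(B) whenever A is a subset of B.
        # F submodular: F(A) + F(B) >= F(A union B) + F(A intersect B) for all A, B.
        S = all_subsets(V)
        nondecreasing = all(F(A) <= F(B) for A in S for B in S if A <= B)
        submodular = all(F(A) + F(B) >= F(A | B) + F(A & B) for A in S for B in S)
        return nondecreasing and submodular

Both classical examples above (the cardinality function, and the number of groups of a partition intersected by A) pass this check on small ground sets.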
18 Given any set-function F , one can define its Lovász extension f : Rp+ → R, as follows; given w ∈ Rp+ , we can order the components of w in decreasing order wj1 ≥ · · · ≥ wjp ≥ 0, the value f (w) is then defined as: f (w) = Σk=1,...,p wjk [F ({j1 , . . . , jk }) − F ({j1 , . . . , jk−1 })]. [sent-65, score-0.125]
19 (1) The Lovász extension f is always piecewise-linear, and when F is submodular, it is also convex (see, e. [sent-72, score-0.166]
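To make Eq. (1) concrete, here is a minimal Python sketch of the Lovász extension (not from the paper; the function name and the representation of F as a callable on frozensets, with F(∅) = 0, are illustrative assumptions):

    def lovasz_extension(F, w):
        # Eq. (1): sort the components of w in decreasing order and accumulate
        # w_{j_k} [F({j_1,...,j_k}) - F({j_1,...,j_{k-1}})].
        order = sorted(range(len(w)), key=lambda j: -w[j])
        value, prefix, prev = 0.0, set(), F(frozenset())
        for j in order:
            prefix.add(j)
            curr = F(frozenset(prefix))
            value += w[j] * (curr - prev)
            prev = curr
        return value

For the cardinality function F(A) = |A| every increment equals one, so this returns the ℓ1 -norm of a nonnegative w, matching the classical example above.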
20 We denote by P the submodular polyhedron [12], defined as the set of s ∈ Rp such that for all A ⊂ V , s(A) ≤ F (A), i. [sent-80, score-0.726]
21 Figure 1: Polyhedral unit ball, for 4 different submodular functions (two variables), with different stable inseparable sets leading to different sets of extreme points; changing values of F may make some of the extreme points disappear. [sent-83, score-1.249]
22 From left to right: F (A) = |A|1/2 (all possible extreme points), F (A) = |A| (leading to the ℓ1 -norm), F (A) = min{|A|, 1} (leading to the ℓ∞ -norm), F (A) = (1/2) 1{A∩{2}≠∅} + 1{A≠∅} (leading to the structured norm Ω(w) = (1/2)|w2 | + ‖w‖∞ ). [sent-84, score-0.23]
23 One important result in submodular analysis is that if F is a nondecreasing submodular function, then we have a representation of f as a maximum of linear functions [12, 9], i. [sent-85, score-1.573]
24 A set A is said stable if it cannot be augmented without increasing F , i. [sent-99, score-0.122]
25 The set of stable sets is closed by intersection [13], and will correspond to the set of allowed sparsity patterns (see Section 6. [sent-103, score-0.314]
26 As shown in [13], the submodular polytope P has full dimension p as soon as F is strictly positive on all singletons, and its faces are exactly the sets {sk = 0} for k ∈ V and {s(A) = F (A)} for stable and inseparable sets A. [sent-108, score-1.063]
27 These stable inseparable sets will play a role when describing extreme points of unit balls of our new norms (Section 3) and for deriving concentration inequalities in Section 6. [sent-111, score-0.74]
28 For the cardinality function, stable and inseparable sets are singletons. [sent-113, score-0.427]
29 3 Definition and properties of structured norms We define the function Ω(w) = f (|w|), where |w| is the vector in Rp composed of absolute values of w and f the Lovász extension of F . [sent-114, score-0.443]
30 We have the following properties (see proof in [15]), which show that we indeed define a norm and that it is the desired convex envelope: Proposition 1 (Convex envelope, dual norm) Assume that the set-function F is submodular, nondecreasing, and strictly positive for all singletons. [sent-115, score-0.181]
31 Then: (i) Ω is a norm on Rp , (ii) Ω is the convex envelope of the function g : w → F (Supp(w)) on the unit ℓ∞ -ball, (iii) the dual norm (see, e. [sent-117, score-0.352]
32 We provide examples of submodular set-functions and norms in Section 4, where we go from set-functions to norms, and vice-versa. [sent-120, score-0.984]
33 The following proposition gives the set of extreme points of the unit ball (see proof in [15] and examples in Figure 1): Proposition 2 (Extreme points of unit ball) The extreme points of the unit ball of Ω are the vectors (1/F (A)) s, with s ∈ {−1, 0, 1}p , Supp(s) = A and A a stable inseparable set. [sent-125, score-0.726]
34 This proposition shows that, depending on the number and cardinality of the inseparable stable sets, we can go from 2p (only singletons) to 3^p − 1 extreme points (all possible sign vectors). [sent-126, score-0.574]
35 We show in Figure 1 examples of balls for p = 2, as well as sets of extreme points. [sent-127, score-0.183]
36 Figure 2: Sequence and groups: (left) groups for contiguous patterns, (right) groups for penalizing the number of jumps in the indicator vector sequence. [sent-132, score-0.267]
37 From left to right: ℓ1 -norm penalization (a wrong variable is included with the correct ones), polyhedral norm for rectangles in 2D, with zoom (all variables come in together), mix of the two norms (correct behavior). [sent-136, score-0.433]
38 4 Examples of nondecreasing submodular functions We consider three main types of submodular functions with potential applications to regularization for supervised learning. [sent-137, score-1.614]
39 Some existing norms are shown to be examples of our frameworks (Section 4. [sent-138, score-0.299]
40 3), while other novel norms are designed from specific submodular functions (Section 4. [sent-140, score-0.972]
41 Other examples of submodular functions, in particular in terms of matroids and entropies, may be found in [12, 10, 11] and could also lead to interesting new norms. [sent-142, score-0.765]
42 Note that set covers, which are common examples of submodular functions are subcases of set-functions defined in Section 4. [sent-143, score-0.773]
43 1 Norms defined with non-overlapping or overlapping groups We consider grouped norms defined with potentially overlapping groups [1, 2], i. [sent-148, score-0.612]
44 It is a norm as soon as ∪G:d(G)>0 G = V and it corresponds to the nondecreasing submodular function F (A) = ΣG:G∩A≠∅ d(G). [sent-151, score-1.003]
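A small Python sketch of this pair (set-function and grouped ℓ∞ -norm) may help fix ideas; it is illustrative only, with groups assumed to be given as (set of indices, weight d(G)) pairs:

    def group_cover_F(groups, A):
        # F(A) = sum of d(G) over the groups G that intersect A.
        return sum(d for (G, d) in groups if G & set(A))

    def group_linf_norm(groups, w):
        # The corresponding norm: Omega(w) = sum of d(G) * max_{j in G} |w_j|.
        return sum(d * max(abs(w[j]) for j in G) for (G, d) in groups)

With non-overlapping groups and unit weights this recovers the usual ℓ1 /ℓ∞ grouped norm; with all singletons and unit weights it recovers the ℓ1 -norm.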
45 In the case where ℓ∞ -norms are replaced by ℓ2 -norms, [2] has shown that the set of allowed sparsity patterns are intersections of complements of groups G with strictly positive weights. [sent-152, score-0.311]
46 These sets happen to be the set of stable sets for the corresponding submodular function; thus the analysis provided in Section 6. [sent-153, score-0.856]
47 However, in our situation, we can give a reinterpretation through a submodular function that counts the number of times the support A intersects groups G with non zero weights. [sent-155, score-0.835]
48 This goes beyond restricting the set of allowed sparsity patterns to stable sets. [sent-156, score-0.278]
49 Hierarchical norms defined on directed acyclic graphs [1, 5, 6] correspond to the set-function F (A) which is the cardinality of the union of ancestors of elements in A. [sent-160, score-0.41]
50 If we assume that the p variables are organized in a 1D, 2D or 3D grid, [2] considers norms based on overlapping groups leading to stable sets equal to rectangular or convex shapes, with applications in computer vision [17]. [sent-163, score-0.642]
51 For example, for the groups defined in the left side of Figure 2 (with unit weights), we have F (A) = p − 2 + range(A) if A ≠ ∅ and F (∅) = 0 (the range of A is equal to max(A) − min(A) + 1). [sent-164, score-0.135]
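As an illustration (not from the paper; the function name and 0-based index sets are assumptions), this set-function can be evaluated directly:

    def range_F(p, A):
        # F(A) = p - 2 + range(A) for nonempty A, with range(A) = max(A) - min(A) + 1; F(empty) = 0.
        if not A:
            return 0.0
        return p - 2 + (max(A) - min(A) + 1)

Plugging such an F into the Lovász-extension sketch above yields the norm that favors contiguous supports discussed here.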
52 In order to counterbalance this effect, adding a constant times the cardinality function has the effect of making the first gap relatively smaller. [sent-167, score-0.141]
53 All patterns are then allowed, but contiguous ones are encouraged rather than forced. [sent-169, score-0.126]
54 4 Another interesting new norm may be defined from the groups in the right side of Figure 2. [sent-170, score-0.182]
55 Note that this also favors contiguous patterns but is not limited to selecting a single interval (like the norm obtained from groups in the left side of Figure 2). [sent-172, score-0.277]
56 The functions h(λ) = log(λ + t) for t ≥ 0 lead to submodular functions, as they correspond to entropies of Gaussian random variables (see, e. [sent-184, score-0.768]
57 , [20]), h(λ) = λq for q ∈ (0, 1] are positive linear combinations of functions that lead to nondecreasing submodular functions. [sent-189, score-0.949]
58 Thus, they are also nondecreasing submodular functions, and, to the best of our knowledge, provide novel examples of such functions. [sent-190, score-0.9]
59 If h is linear, then F (A) = tr XA⊤ XA = Σk∈A Xk⊤ Xk (where XA denotes the submatrix of X with columns in A) and we obtain a weighted cardinality function and hence a weighted ℓ1 -norm, which is a factorial prior, i. [sent-192, score-0.242]
60 This is a non-factorial prior but unfortunately it does not lead to a submodular function. [sent-196, score-0.7]
61 In a Bayesian context however, it is shown by [22] that penalties of the form log det(XA⊤ XA + λI) (which lead to submodular functions) correspond to marginal likelihoods associated to the set A and have good behavior when used within a non-convex framework. [sent-197, score-0.7]
62 This highlights the need for non-factorial priors which are sub-linear functions of the eigenvalues of XA⊤ XA , which is exactly what nondecreasing submodular functions of submatrices are. [sent-198, score-0.979]
63 We do not pursue the extensive evaluation of non-factorial convex priors in this paper but provide in simulations examples with F (A) = tr(XA⊤ XA )1/2 (which is equal to the trace norm of XA [16]). [sent-199, score-0.237]
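A minimal numpy sketch of this spectral set-function (illustrative, assuming X is an (n, p) array and A a collection of column indices; names are not from the paper) makes the definition concrete:

    import numpy as np

    def spectral_F(X, A):
        # F(A) = tr (X_A^T X_A)^{1/2} = sum of the singular values (trace norm) of X_A.
        A = list(A)
        if not A:
            return 0.0
        return float(np.linalg.svd(X[:, A], compute_uv=False).sum())

This prior is non-factorial: adding a column that is nearly collinear with the already selected ones increases F (A) by less than adding an orthogonal one, which is the behavior motivating this section.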
64 3 Functions of cardinality For F (A) = h(|A|) where h is nondecreasing and concave with h(0) = 0, Eq. (1) shows that Ω(w) is a weighted sum of order statistics: the k-th largest component of |w| receives weight h(k) − h(k − 1). [sent-201, score-0.141]
65 This includes the sum of the q largest elements, and might lead to interesting new norms for unstructured variable selection but this is not pursued here. [sent-205, score-0.337]
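A short Python sketch of this family of norms (illustrative; h is assumed to be a callable with h(0) = 0, nondecreasing and concave):

    def cardinality_norm(h, w):
        # Omega(w) = sum_k [h(k) - h(k-1)] * (k-th largest |w_j|), from Eq. (1) with F(A) = h(|A|).
        mags = sorted((abs(x) for x in w), reverse=True)
        return sum((h(k) - h(k - 1)) * mags[k - 1] for k in range(1, len(mags) + 1))

For instance, h(k) = min(k, q) gives the sum of the q largest |wj |, h(k) = k gives back the ℓ1 -norm, and h(k) = k**0.5 corresponds to the set-function F (A) = |A|1/2 used in the experiments of Section 7.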
66 Note that since these norms are polyhedral norms with unit balls having potentially an exponential number of vertices or faces, regular linear programming toolboxes may not be used. [sent-208, score-0.718]
67 From Ω(w) = maxs∈P s⊤ |w| and the greedy algorithm1 presented in Section 2, one can easily get in polynomial time one subgradient as one of the maximizers s. [sent-210, score-0.114]
68 This allows one to use subgradient descent, which, as shown in Figure 4, exhibits slow convergence compared to proximal methods. [sent-211, score-0.153]
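For concreteness, here is a minimal Python sketch of this subgradient computation via the greedy algorithm (not from the paper; F is assumed to be a nondecreasing submodular callable on frozensets with F(∅) = 0):

    import math

    def omega_subgradient(F, w):
        # Sort |w| in decreasing order and compute the greedy maximizer s of max_{s in P} s^T |w|;
        # sign(w) * s (componentwise) is then one subgradient of Omega(w) = f(|w|).
        p = len(w)
        order = sorted(range(p), key=lambda j: -abs(w[j]))
        s, prefix, prev = [0.0] * p, set(), F(frozenset())
        for j in order:
            prefix.add(j)
            curr = F(frozenset(prefix))
            s[j] = curr - prev
            prev = curr
        return [math.copysign(1.0, w[j]) * s[j] if w[j] != 0 else 0.0 for j in range(p)]

The value Ω(w) itself is recovered as the inner product of the maximizer s with |w|.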
69 1 The greedy algorithm to find extreme points of the submodular polyhedron should not be confused with the greedy algorithm (e. [sent-217, score-0.942]
70 Moreover, any algorithm for minimizing submodular functions directly yields the support of the unique solution of the proximal problem, and with a sequence of submodular function minimizations, the full solution may also be obtained. [sent-223, score-1.528]
71 Similar links between convex optimization and minimization of submodular functions have been considered (see, e. [sent-224, score-0.79]
72 However, these are dedicated to symmetric submodular functions (such as the ones obtained from graph cuts) and are thus not directly applicable to our situation of nondecreasing submodular functions. [sent-227, score-1.435]
73 Finally, note that using the minimum-norm-point algorithm leads to a generic algorithm that can be applied to any submodular functions F , and that it may be rather inefficient for simpler subcases (e. [sent-228, score-0.743]
74 , the ℓ1 /ℓ∞ -norm, tree-structured groups [6], or general overlapping groups [7]). [sent-230, score-0.229]
75 Like recent analysis of sparsity-inducing norms [25], the analysis provided in this section relies heavily on decomposability properties of our norm Ω. [sent-239, score-0.4]
76 These two functions are submodular and nondecreasing as soon as F is (see, e. [sent-242, score-0.953]
77 We denote by ΩJ the norm on RJ defined through the submodular function FJ , and by ΩJ the pseudo-norm on RJ c defined through F J (as shown in Proposition 4, it is a norm only when J is a stable set). [sent-245, score-1.057]
78 We show that with probability one, only stable support sets may be obtained (see proof in [15]). [sent-253, score-0.213]
79 (3) ŵ is unique and, with probability one, its support Supp(ŵ) is a stable set. [sent-258, score-0.177]
80 3 High-dimensional inference We now assume that the linear model is well-specified and extend results from [26] for sufficient support recovery conditions and from [25] for estimation consistency. [sent-260, score-0.119]
81 We denote by ρ(J) = minB⊂J c [F (B∪J) − F (J)]/F (B); by submodularity and monotonicity of F , ρ(J) is always between zero and one, and, as soon as J is stable it is strictly positive (for the ℓ1 -norm, ρ(J) = 1). [sent-262, score-0.32]
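For intuition, ρ(J) can be evaluated by brute force on small problems (an illustrative sketch only, exponential in |J c |; the function name and the callable representation of F are assumptions):

    from itertools import combinations

    def rho(F, V, J):
        # rho(J) = min over nonempty B contained in the complement of J of [F(B ∪ J) - F(J)] / F(B).
        J = frozenset(J)
        Jc = [v for v in V if v not in J]
        FJ = F(J)
        ratios = [(F(frozenset(B) | J) - FJ) / F(frozenset(B))
                  for r in range(1, len(Jc) + 1) for B in combinations(Jc, r)]
        return min(ratios) if ratios else 1.0

For the cardinality function this returns 1 for every J, consistent with the ℓ1 -norm case mentioned above.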
82 , Propositions 6 and 8 extend results based on support recovery conditions [26]; while Propositions 7 and 8 extend results based on restricted eigenvalue conditions (see, e. [sent-267, score-0.142]
83 Denote by J the smallest stable set containing the support Supp(w∗ ) of w∗ . [sent-274, score-0.13]
84 Then, if for η > 0, (ΩJ )∗ [(ΩJ (Q−1 JJ QJj ))j∈J c ] ≤ 1 − η and λ is small enough relative to 2c(J), the minimizer ŵ is unique and has support equal to J, with probability larger than 1 − 3P(Ω∗ (z) > t) for an appropriate threshold t, where z is multivariate normal with covariance matrix Q. [sent-276, score-0.13]
85 Denote by J the smallest stable set containing the support Supp(w∗ ) of w∗ . [sent-279, score-0.177]
86 Gaussian components, normalized to have unit ℓ2 -norm columns. [sent-290, score-0.135]
87 A set J of cardinality k is chosen at random and the weights w∗J are sampled from a standard multivariate Gaussian distribution and w∗J c = 0. [sent-291, score-0.168]
88 For the submodular function F (A) = |A|1/2 (a simple submodular function beyond the cardinality) we compare three optimization algorithms described in Section 5, subgradient descent and two proximal methods, ISTA and its accelerated version FISTA [23], for p = n = 1000, k = 100 and λ = 0. [sent-295, score-1.477]
89 Other settings and other set-functions would lead to similar results than the ones presented in Figure 4: FISTA is faster than ISTA, and much faster than subgradient descent. [sent-297, score-0.114]
90 We see in the right plots of Figure 4 (which reports f(w)−min(f) versus time in seconds for subgradient, ISTA and FISTA, together with residual error and penalty curves for thresholded OLS, greedy and submodular approaches) that [sent-301, score-0.731]
92 Table 1 (settings): p = 120, n ∈ {120, 20}, k ∈ {80, 40, 20, 10, 6, 4}; the rows compare the submodular norm with the other methods. [sent-307, score-0.662]
93 The performance of the submodular method is shown, then differences from all methods to this particular one are computed, and shown in bold when they are significantly greater than zero, as measured by a paired t-test with level 5% (i. [sent-410, score-0.662]
94 for hard cases (middle plot) convex optimization techniques perform better than other approaches, while for easier cases with more observations (right plot), it does as well as greedy approaches. [sent-413, score-0.122]
95 We now focus on the predictive performance and compare our new norm with F (A) = tr(XA⊤ XA )1/2 , with greedy approaches [3] and to regularization by ℓ1 or ℓ2 norms. [sent-415, score-0.16]
96 As shown in Table 1, the new norm based on non-factorial priors is more robust than the ℓ1 -norm to lower number of observations n and to larger cardinality of support k. [sent-416, score-0.325]
97 8 Conclusions We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge or structural constraints on the support of linear predictors. [sent-417, score-0.363]
98 , by considering related adapted concave penalties to enhance sparsity-inducing norms, or by extending some of the concepts for norms of matrices, with potential applications in matrix factorization or multi-task learning (see, e. [sent-421, score-0.269]
99 , [27] for application of submodular functions to dictionary learning). [sent-423, score-0.727]
100 Convex analysis and optimization with submodular functions: a tutorial. [sent-485, score-0.662]
wordName wordTfidf (topN-words)
[('submodular', 0.662), ('norms', 0.269), ('rp', 0.215), ('nondecreasing', 0.208), ('xa', 0.19), ('supp', 0.164), ('cardinality', 0.141), ('inseparable', 0.128), ('wj', 0.125), ('stable', 0.122), ('xw', 0.113), ('proximal', 0.108), ('norm', 0.091), ('groups', 0.091), ('sz', 0.085), ('lov', 0.082), ('proposition', 0.082), ('ista', 0.079), ('extreme', 0.078), ('submodularity', 0.078), ('polyhedral', 0.073), ('envelope', 0.073), ('greedy', 0.069), ('polyhedron', 0.064), ('structured', 0.061), ('jenatton', 0.061), ('fista', 0.06), ('sparsity', 0.057), ('patterns', 0.057), ('support', 0.055), ('minw', 0.054), ('convex', 0.053), ('parsimony', 0.052), ('tr', 0.051), ('overlapping', 0.047), ('obozinski', 0.045), ('subgradient', 0.045), ('unit', 0.044), ('grouped', 0.043), ('singletons', 0.042), ('allowed', 0.042), ('soon', 0.042), ('recovery', 0.041), ('jk', 0.041), ('functions', 0.041), ('monotonicity', 0.041), ('combinatorial', 0.04), ('decomposability', 0.04), ('qaa', 0.04), ('qjj', 0.04), ('subcases', 0.04), ('wjp', 0.04), ('dedicated', 0.039), ('balls', 0.039), ('priors', 0.038), ('penalty', 0.038), ('ball', 0.038), ('lead', 0.038), ('contiguous', 0.038), ('propositions', 0.038), ('bk', 0.037), ('strictly', 0.037), ('sets', 0.036), ('matroids', 0.035), ('links', 0.034), ('mairal', 0.033), ('ones', 0.031), ('examples', 0.03), ('selection', 0.03), ('negahban', 0.03), ('submatrices', 0.03), ('fj', 0.03), ('cuts', 0.029), ('rj', 0.029), ('extension', 0.028), ('multivariate', 0.027), ('complements', 0.027), ('entropies', 0.027), ('factorial', 0.027), ('non', 0.027), ('minimizer', 0.025), ('residual', 0.025), ('hal', 0.025), ('krause', 0.025), ('fused', 0.025), ('pursue', 0.025), ('jumps', 0.024), ('maxa', 0.024), ('dictionary', 0.024), ('potentially', 0.024), ('concentration', 0.024), ('moreover', 0.024), ('leading', 0.024), ('submatrix', 0.023), ('subgradients', 0.023), ('penalizing', 0.023), ('go', 0.023), ('extend', 0.023), ('normal', 0.023), ('rn', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
2 0.48474485 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
Author: Peter Stobbe, Andreas Krause
Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude. 1
3 0.3695322 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
4 0.24130778 102 nips-2010-Generalized roof duality and bisubmodular functions
Author: Vladimir Kolmogorov
Abstract: Consider a convex relaxation f̂ of a pseudo-boolean function f . We say that the relaxation is totally half-integral if f̂ (x) is a polyhedral function with half-integral extreme points x, and this property is preserved after adding an arbitrary combination of constraints of the form xi = xj , xi = 1 − xj , and xi = γ where γ ∈ {0, 1, 1/2} is a constant. A well-known example is the roof duality relaxation for quadratic pseudo-boolean functions f . We argue that total half-integrality is a natural requirement for generalizations of roof duality to arbitrary pseudo-boolean functions. Our contributions are as follows. First, we provide a complete characterization of totally half-integral relaxations f̂ by establishing a one-to-one correspondence with bisubmodular functions. Second, we give a new characterization of bisubmodular functions. Finally, we show some relationships between general totally half-integral relaxations and relaxations based on the roof duality. 1
5 0.2006522 181 nips-2010-Network Flow Algorithms for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Francis R. Bach, Guillaume R. Obozinski
Abstract: We consider a class of learning problems that involve a structured sparsityinducing norm defined as the sum of ℓ∞ -norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems. 1
6 0.14223392 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
7 0.10847566 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
8 0.10699146 217 nips-2010-Probabilistic Multi-Task Feature Selection
9 0.10196188 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
10 0.092566878 5 nips-2010-A Dirty Model for Multi-task Learning
11 0.092413746 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
12 0.086668581 148 nips-2010-Learning Networks of Stochastic Differential Equations
13 0.080284603 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
14 0.074767575 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
15 0.073091701 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
16 0.070371412 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
17 0.067182191 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
18 0.060750693 63 nips-2010-Distributed Dual Averaging In Networks
19 0.058515493 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
20 0.057093084 48 nips-2010-Collaborative Filtering in a Non-Uniform World: Learning with the Weighted Trace Norm
topicId topicWeight
[(0, 0.179), (1, 0.047), (2, 0.149), (3, 0.117), (4, 0.037), (5, -0.258), (6, -0.049), (7, 0.22), (8, 0.383), (9, 0.017), (10, 0.263), (11, 0.235), (12, 0.109), (13, -0.075), (14, -0.248), (15, 0.203), (16, -0.003), (17, 0.002), (18, -0.006), (19, 0.072), (20, -0.132), (21, -0.068), (22, 0.018), (23, -0.043), (24, 0.072), (25, 0.082), (26, 0.046), (27, -0.036), (28, 0.019), (29, 0.015), (30, 0.007), (31, -0.041), (32, 0.036), (33, -0.0), (34, -0.017), (35, -0.014), (36, 0.023), (37, -0.029), (38, -0.016), (39, 0.029), (40, -0.044), (41, 0.004), (42, 0.016), (43, 0.034), (44, -0.015), (45, 0.008), (46, 0.074), (47, 0.021), (48, -0.04), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.95229095 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
2 0.90522468 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
Author: Peter Stobbe, Andreas Krause
Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude. 1
3 0.80289769 102 nips-2010-Generalized roof duality and bisubmodular functions
Author: Vladimir Kolmogorov
Abstract: Consider a convex relaxation f̂ of a pseudo-boolean function f . We say that the relaxation is totally half-integral if f̂ (x) is a polyhedral function with half-integral extreme points x, and this property is preserved after adding an arbitrary combination of constraints of the form xi = xj , xi = 1 − xj , and xi = γ where γ ∈ {0, 1, 1/2} is a constant. A well-known example is the roof duality relaxation for quadratic pseudo-boolean functions f . We argue that total half-integrality is a natural requirement for generalizations of roof duality to arbitrary pseudo-boolean functions. Our contributions are as follows. First, we provide a complete characterization of totally half-integral relaxations f̂ by establishing a one-to-one correspondence with bisubmodular functions. Second, we give a new characterization of bisubmodular functions. Finally, we show some relationships between general totally half-integral relaxations and relaxations based on the roof duality. 1
4 0.77606082 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
5 0.38738197 181 nips-2010-Network Flow Algorithms for Structured Sparsity
Author: Julien Mairal, Rodolphe Jenatton, Francis R. Bach, Guillaume R. Obozinski
Abstract: We consider a class of learning problems that involve a structured sparsityinducing norm defined as the sum of ℓ∞ -norms over groups of variables. Whereas a lot of effort has been put in developing fast optimization methods when the groups are disjoint or embedded in a specific hierarchical structure, we address here the case of general overlapping groups. To this end, we show that the corresponding optimization problem is related to network flow optimization. More precisely, the proximal problem associated with the norm we consider is dual to a quadratic min-cost flow problem. We propose an efficient procedure which computes its solution exactly in polynomial time. Our algorithm scales up to millions of variables, and opens up a whole new range of applications for structured sparse models. We present several experiments on image and video data, demonstrating the applicability and scalability of our approach for various problems. 1
6 0.34362566 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
7 0.30805779 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
8 0.28066874 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
9 0.27554399 248 nips-2010-Sparse Inverse Covariance Selection via Alternating Linearization Methods
10 0.26411489 45 nips-2010-CUR from a Sparse Optimization Viewpoint
11 0.25569254 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
12 0.25456911 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
13 0.25375625 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
14 0.24927524 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
15 0.24497385 68 nips-2010-Effects of Synaptic Weight Diffusion on Learning in Decision Making Networks
16 0.23876552 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
17 0.23859102 148 nips-2010-Learning Networks of Stochastic Differential Equations
18 0.23639461 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
19 0.23101608 287 nips-2010-Worst-Case Linear Discriminant Analysis
20 0.22999685 108 nips-2010-Graph-Valued Regression
topicId topicWeight
[(0, 0.011), (13, 0.067), (17, 0.024), (26, 0.036), (27, 0.043), (30, 0.056), (33, 0.223), (35, 0.063), (45, 0.167), (50, 0.056), (52, 0.053), (60, 0.028), (77, 0.042), (78, 0.015), (90, 0.029)]
simIndex simValue paperId paperTitle
1 0.84080666 102 nips-2010-Generalized roof duality and bisubmodular functions
Author: Vladimir Kolmogorov
Abstract: Consider a convex relaxation f̂ of a pseudo-boolean function f . We say that the relaxation is totally half-integral if f̂ (x) is a polyhedral function with half-integral extreme points x, and this property is preserved after adding an arbitrary combination of constraints of the form xi = xj , xi = 1 − xj , and xi = γ where γ ∈ {0, 1, 1/2} is a constant. A well-known example is the roof duality relaxation for quadratic pseudo-boolean functions f . We argue that total half-integrality is a natural requirement for generalizations of roof duality to arbitrary pseudo-boolean functions. Our contributions are as follows. First, we provide a complete characterization of totally half-integral relaxations f̂ by establishing a one-to-one correspondence with bisubmodular functions. Second, we give a new characterization of bisubmodular functions. Finally, we show some relationships between general totally half-integral relaxations and relaxations based on the roof duality. 1
same-paper 2 0.77641904 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
3 0.76226717 139 nips-2010-Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development
Author: Diane Hu, Laurens Maaten, Youngmin Cho, Sorin Lerner, Lawrence K. Saul
Abstract: When software developers modify one or more files in a large code base, they must also identify and update other related files. Many file dependencies can be detected by mining the development history of the code base: in essence, groups of related files are revealed by the logs of previous workflows. From data of this form, we show how to detect dependent files by solving a problem in binary matrix completion. We explore different latent variable models (LVMs) for this problem, including Bernoulli mixture models, exponential family PCA, restricted Boltzmann machines, and fully Bayesian approaches. We evaluate these models on the development histories of three large, open-source software systems: Mozilla Firefox, Eclipse Subversive, and Gimp. In all of these applications, we find that LVMs improve the performance of related file prediction over current leading methods. 1
4 0.7115503 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation. 1
5 0.67939812 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, largescale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task specially if the network is large (pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of highdimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size. 1
6 0.67898047 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
7 0.67833495 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
8 0.67243522 265 nips-2010-The LASSO risk: asymptotic results and real world examples
9 0.6715669 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
10 0.67071241 5 nips-2010-A Dirty Model for Multi-task Learning
11 0.66799361 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
12 0.66449887 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
13 0.66408414 148 nips-2010-Learning Networks of Stochastic Differential Equations
14 0.66344529 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.6632421 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
16 0.66303837 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
17 0.65931123 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.65844005 217 nips-2010-Probabilistic Multi-Task Feature Selection
19 0.65843576 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
20 0.65781903 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives