nips nips2011 nips2011-251 knowledge-graph by maker-knowledge-mining

251 nips-2011-Shaping Level Sets with Submodular Functions


Source: pdf

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 fr Abstract We consider a class of sparsity-inducing regularization terms based on submodular functions. [sent-3, score-0.593]

2 While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. [sent-4, score-0.883]

3 We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i. [sent-5, score-0.631]

4 , the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. [sent-7, score-0.271]

5 We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). [sent-8, score-0.546]

6 A classical example is the total variation in one or two dimensions, which leads to piecewise constant solutions [4, 5] and can be applied to various image labelling problems [6, 5], or change point detection tasks [7, 8, 9]. [sent-16, score-0.345]

7 In this paper, we follow the approach of [3], who designed sparsity-inducing norms based on non-decreasing submodular functions, as a convex approximation to imposing a specific prior on the supports of the predictors. [sent-18, score-0.73]

8 Here, we show that a similar parallel holds for some other class of submodular functions, namely non-negative set-functions which are equal to zero for the full and empty set. [sent-19, score-0.593]

9 Our main instance of such functions are symmetric submodular functions. [sent-20, score-0.672]

10 We make the following contributions: − We provide in Section 3 explicit links between priors on level sets and certain submodular functions: we show that the Lov´ sz extensions (see, e. [sent-21, score-1.031]

11 , [11] and a short review in Section 2) a associated to these submodular functions are the convex envelopes (i. [sent-23, score-0.694]

12 , tightest convex lower bounds) of specific functions that depend on all level sets of the underlying vector. [sent-25, score-0.365]

13 − In Section 4, we reinterpret existing norms such as the total variation and design new norms, based on noisy cuts or order statistics. [sent-26, score-0.319]

14 − We provide unified algorithms in Section 5, such as proximal operators, which are based on a sequence of submodular function minimizations (SFMs), when such SFMs are efficient, or by adapting the generic slower approach of [3] otherwise. [sent-28, score-0.906]

15 − We derive unified theoretical guarantees for level set recovery in Section 6, showing that even in the absence of correlation between predictors, level set recovery is not always guaranteed, a situation which is to be contrasted with traditional support recovery situations [1, 3]. [sent-29, score-0.468]

16 In this paper, for a certain vector w ∈ Rp , we call level sets the sets of indices which are larger (or smaller) or equal to a certain constant α, which we denote {w α} (or {w α}), while we call constant sets the sets of indices which are equal to a constant α, which we denote {w = α}. [sent-38, score-0.806]

17 2 Review of Submodular Analysis In this section, we review relevant results from submodular analysis. [sent-39, score-0.561]

18 Throughout this paper, we consider a submodular function F defined on the power set 2V of V = {1, . [sent-46, score-0.561]

19 We give several examples in Section 4; for illustrating the concepts introduced in this section and Section 3, p−1 we will consider the cut in an undirected chain graph, i. [sent-58, score-0.266]

20 , [11] for this particular Lov´ sz extension f : Rp → R, as f (w) = R F ({w a formulation). [sent-64, score-0.267]

21 The Lov´ sz extension is convex if and only if F is submodular. [sent-65, score-0.331]

22 For the chain graph, we obtain the usual total variation f (w) = j=1 |wj − wj+1 |. [sent-68, score-0.193]

23 One important result in submodular analysis is that if F is a submodular function, then we have a representation of f as a maximum of linear functions [12, 11], i. [sent-71, score-1.191]

24 It is known that the set of tight sets is a distributive lattice, i. [sent-90, score-0.262]

25 The faces of B(F ) are thus intersections of hyperplanes {s(A) = F (A)} for A belonging to certain distributive lattices (see Prop. [sent-93, score-0.343]

26 A set is said inseparable if it is not separable. [sent-96, score-0.191]

27 For the cut in an undirected graph, inseparable sets are exactly connected sets. [sent-97, score-0.501]

28 3 Properties of the Lov´ sz Extension a In this section, we derive properties of the Lov´ sz extension for submodular functions, which go a beyond convexity and homogeneity. [sent-98, score-1.039]

29 Throughout this section, we assume that F is a non-negative submodular set-function that is equal to zero at ∅ and V . [sent-99, score-0.593]

30 The various extreme points cut the space into polygons where the ordering of the component is fixed. [sent-105, score-0.253]

31 Left: F (A) = 1|A|∈{1,2} , leading to f (w) = maxk wk − mink wk (all possible extreme points); note that the polygon need not be symmetric in general. [sent-106, score-0.334]

32 Note the difference with the result of [3]: we consider here a different set on which we compute the convex envelope ([0, 1]p + R1V instead of [−1, 1]p ), and not a function of the support of w, but of all its level sets. [sent-110, score-0.252]

33 1 Moreover, the Lov´ sz extension is a convex relaxation of a function of level sets a (of the form {w α}) and not of constant sets (of the form {w = α}). [sent-111, score-0.715]

34 This definition through level sets will generate some potentially undesired behavior (such as the well-known staircase effect for the one-dimensional total variation), as we show in Section 6. [sent-113, score-0.334]

35 The next proposition describes the set of extreme points of the “unit ball” U = {w, f (w) 1}, giving a first illustration of sparsity-inducing effects (see example in Figure 1, in particular for the one-dimensional total variation). [sent-114, score-0.201]

36 Proposition 2 (Extreme points) The extreme points of the set U ∩ {w⊤ 1V = 0} are the projections of the vectors 1A /F (A) on the plane {w⊤ 1V = 0}, for A such that A is inseparable for F and V \A is inseparable for B → F (A ∪ B) − F (A). [sent-115, score-0.48]

37 We assume in this paper that all lattices contain the empty set ∅ and the full set V , and we endow the lattice with the inclusion order. [sent-118, score-0.276]

38 The faces of U are characterized by lattices D, with their corresponding posets Π(D) = ◦ {A1 , . [sent-136, score-0.265]

39 We denote by UD (and by UD its closure) the set of w ∈ Rp such that (a) f (w) 1, (b) w is piecewise constant with respect to Π(D), with value vi on Ai , and (c) for all pairs (i, j), 1 Note that the support {w = 0} is a constant set which is the intersection of two level sets. [sent-140, score-0.294]

40 For certain lattices D, these will be exactly the relative interiors of all faces of U (see proof in [14]): Proposition 3 (Faces of U) The (non-empty) relative interiors of all faces of U are exactly of the ◦ form UD , where D is a lattice such that: (i) the restriction of F to D is modular, i. [sent-146, score-0.588]

41 , m}, the set Aj is inseparable for the function Cj → F (Bj−1 ∪ Cj ) − F (Bj−1 ), where Bj−1 is the union of all ancestors of Aj in Π(D), (iii) among all lattices corresponding to the same unordered partition, D is a maximal element of the set of lattices satisfying (i) and (ii). [sent-151, score-0.451]

42 Among the three conditions, the second one is the easiest to interpret, as it reduces to having constant sets which are inseparable for certain submodular functions, and for cuts in an undirected graph, these will exactly be connected sets. [sent-152, score-1.058]

43 Note that from the face w belongs to, we have strong constraints on the constant sets, but we may not be able to determine all level sets of w, because only partial constraints are given by the order on Π(D). [sent-156, score-0.3]

44 4 Examples of Submodular Functions In this section, we provide examples of submodular functions and of their Lov´ sz extensions. [sent-158, score-0.877]

45 Some a are well-known (such as cut functions and total variations), some are new in the context of supervised learning (regular functions), while some have interesting effects in terms of clustering or outlier detection (cardinality-based functions). [sent-159, score-0.396]

46 From any submodular function G, one may define F (A) = G(A) + G(V \A) − G(∅) − G(V ), which is symmetric. [sent-161, score-0.561]

47 Given a set of nonnegative weights d : V × V → R+ , define the cut F (A) = a k,j∈V d(k, j)(wk − wj )+ (which k∈A,j∈V \A d(k, j). [sent-164, score-0.194]

48 The Lov´ sz extension is equal to f (w) = shows submodularity because f is convex), and is often referred to as the total variation. [sent-165, score-0.349]

49 If the weight function d is symmetric, then the submodular function is also symmetric. [sent-166, score-0.561]

50 In this case, it can be shown that inseparable sets for functions A → F (A ∪ B) − F (B) are exactly connected sets. [sent-167, score-0.413]

51 3 and 6, constant sets are connected sets, which is the usual justification behind the total variation. [sent-169, score-0.243]

52 Note however that some configurations of connected sets are not allowed due to the other conditions in Prop. [sent-170, score-0.194]

53 In Figure 5 (right plot), we give an example of the usual chain graph, leading to the one-dimensional total variation [4, 5]. [sent-172, score-0.239]

54 1 0 −2 0 2 2 4 6 log(σ ) Figure 3: Three left plots: Estimation of noisy piecewise constant 1D signal with outliers (indices 5 and 15 in the chain of 20 nodes). [sent-179, score-0.201]

55 Middle: best estimation with total variation (level sets are not correctly estimated). [sent-181, score-0.269]

56 Right: best estimation with the robust total variation based on noisy cut functions (level sets are correctly estimated, with less bias and with detection of outliers). [sent-182, score-0.521]

57 This allows for robust versions of cuts, where some gaps may be tolerated; indeed, compared to having directly a small cut for A, B needs to have a small cut and be close to A, thus allowing some elements to be removed or added to A in order to lower the cut. [sent-188, score-0.258]

58 See examples in Figure 3, illustrating the behavior of the type of graph displayed in the bottom-right plot of Figure 5, where the performance of the robust total variation is significantly more stable in presence of outliers. [sent-189, score-0.307]

59 For F (A) = h(|A|) where h is such that h(0) = h(p) = 0 and h concave, we obtain a submodular function, and a Lov´ sz extension that depends on the order statisa p−1 tics of w, i. [sent-191, score-0.828]

60 While these examples do not provide significantly different behaviors for the non-decreasing submodular functions explored by [3] (i. [sent-194, score-0.666]

61 Indeed, as shown in Section 6, allowed constant sets A are such that A is inseparable for the function C → h(|B ∪ C|) − h(|B|) (where B ⊂ V is the set of components with higher values than the ones in A), which imposes that the concave function h is not linear on [|B|, |B|+|A|]. [sent-199, score-0.432]

62 All patterns of level sets are allowed as the function h is strongly concave (see left plot of Figure 4). [sent-203, score-0.359]

63 This function has been extended in [15] by considering situations where each wj is a vector, instead of a scalar, and replacing the absolute value |wi − wj | by any norm wi − wj , leading to convex formulations for clustering. [sent-204, score-0.305]

64 Two large level sets at the top and bottom, all the rest of the variables are in-between and separated (Figure 4, second plot from the left). [sent-207, score-0.27]

65 This function is piecewise affine, with only one kink, thus only one level set of cardinalty greater than one (in the middle) is possible, which is observed in Figure 4 (third plot from the left). [sent-210, score-0.252]

66 5 Optimization Algorithms In this section, we present optimization methods for minimizing convex objective functions regularized by the Lov´ sz extension of a submodular function. [sent-212, score-0.961]

67 These lead to convex optimization problems, a which we tackle using proximal methods (see, e. [sent-213, score-0.302]

68 Moreover, note that with the square loss, the regularization paths are piecewise affine, as a direct consequence of regularizing by a polyhedral function. [sent-217, score-0.195]

69 4 −10 0 1 λ 2 3 Figure 4: Left: Piecewise linear regularization paths of proximal problems (Eq. [sent-223, score-0.306]

70 From left to right: quadratic function (all level sets allowed), second example in Section 4 (two large level sets at the top and bottom), piecewise linear with two pieces (a single large level set in the middle). [sent-225, score-0.678]

71 Note that in all these particular cases the regularization paths for orthogonal designs are agglomerative (see Section 5), while for general designs, they would still be piecewise affine but not agglomerative. [sent-227, score-0.279]

72 This allows to use subgradient descent, with slow convergence compared to proximal methods (see Figure 5). [sent-230, score-0.296]

73 Proximal problems through sequences of submodular function minimizations (SFMs). [sent-231, score-0.625]

74 Given regularized problems of the form minw∈Rp L(w) + λf (w), where L is differentiable with Lipschitzcontinuous gradient, proximal methods have been shown to be particularly efficient first-order methods (see, e. [sent-232, score-0.238]

75 To apply these methods, it suffices to be able to solve efficiently: min 1 w∈Rp 2 w−z 2 2 + λf (w), (1) which we refer to as the proximal problem. [sent-236, score-0.238]

76 It is known that solving the proximal problem is related to submodular function minimization (SFM). [sent-237, score-0.832]

77 More precisely, the minimum of A → λF (A) − z(A) may be obtained by selecting negative components of the solution of a single proximal problem [12, 11]. [sent-238, score-0.238]

78 Alternatively, the solution of the proximal problem may be obtained by a sequence of at most p submodular function minimizations of the form A → λF (A) − z(A), by a decomposition algorithm adapted from [18], and described in [11]. [sent-239, score-0.863]

79 Thus, computing the proximal operator has polynomial complexity since SFM has polynomial complexity. [sent-240, score-0.238]

80 Nevertheless, this strategy is efficient for families of submodular functions for which dedicated fast algorithms exist: – Cuts: Minimizing the cut or the partially minimized cut, plus a modular function, may be done with a min-cut/max-flow algorithm [see, e. [sent-242, score-0.87]

81 For proximal methods, we need in fact to solve an instance of a parametric max-flow problem, which may be done using other efficient dedicated algorithms [21, 5] than the decomposition algorithm derived from [18]. [sent-245, score-0.307]

82 This can be tackled efficiently by the minimum-norm-point algorithm [12], which iterates between orthogonal projections on affine subspaces and the greedy algorithm for the submodular function4. [sent-251, score-0.595]

83 2 The greedy algorithm to find extreme points of the base polyhedron should not be confused with the greedy algorithm (e. [sent-253, score-0.201]

84 3 Note that even in the case of symmetric submodular functions, where more efficient algorithms in O(p3 ) for submodular function minimization (SFM) exist [20], the minimization of functions of the form λF (A) − z(A) is provably as hard as general SFM [20]. [sent-256, score-1.299]

85 4 Interestingly, when used for submodular function minimization (SFM), the minimum-norm-point algorithm has no complexity bound but is empirically faster than algorithms with such bounds [12]. [sent-257, score-0.594]

86 Right: Examples of graphs (top: chain graph, bottom: hidden chain graph, with sets W and V and examples of a set A in light red, and B in blue, see text for details). [sent-261, score-0.22]

87 4 are satisfied by (a) all submodular setfunctions that only depend on the cardinality, and (b) by the one-dimensional total variation—we thus recover and extend known results from [7, 22, 15]. [sent-269, score-0.611]

88 2 2 That is, the proximal operator for f + λ · 1 is equal to the composition of the proximal operator for f and the one for λ · 1 . [sent-273, score-0.508]

89 That is, we study the unique global minimum w of the ˆ proximal problem in Eq. [sent-275, score-0.238]

90 (1) and make some assumption regarding z (typically z = w∗ + noise), and provide guarantees related to the recovery of the level sets of w∗ . [sent-276, score-0.308]

91 We first start by characterizing the allowed level sets, showing that the partial constraints defined in Section 3 on faces of {f (w) 1} do not create by chance further groupings of variables (see proof in [14]). [sent-277, score-0.281]

92 (1) has ˆ constant sets that define a partition corresponding to a lattice D defined in Prop. [sent-280, score-0.332]

93 We now show that under certain conditions the recovered constant sets are the correct ones: 7 Theorem 1 (Level set recovery) Assume that z = w∗ + σε, where ε ∈ Rp is a standard Gaussian random vector, and w∗ is consistent with the lattice D and its associated poset Π(D) = ∗ (A1 , . [sent-282, score-0.373]

94 Interestingly, the validity of level set recovery implies the agglomerativity of proximal paths (Eq. [sent-305, score-0.47]

95 4), then, even with infinitesimal noise, one can show that in some cases, the wrong level sets may be obtained with non vanishing probability, while if ηj is strictly negative, one can show that in some cases, we never get the correct level sets. [sent-311, score-0.352]

96 , a piecewise constant vector, with a sequence of at least two consecutive increases, or two consecutive decreases, showing that in the presence of such staircases, one cannot have consistent support recovery, which is a well-known issue in signal processing (typically, more steps are created). [sent-326, score-0.196]

97 7 Conclusion We have presented a family of sparsity-inducing norms dedicated to incorporating prior knowledge or structural constraints on the level sets of linear predictors. [sent-337, score-0.378]

98 On total variation minimization and surface evolution using parametric maximum flows. [sent-378, score-0.19]

99 Convex analysis and optimization with submodular functions: a tutorial. [sent-416, score-0.561]

100 A faster strongly polynomial time algorithm for submodular function minimization. [sent-459, score-0.561]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('submodular', 0.561), ('lov', 0.247), ('proximal', 0.238), ('aj', 0.227), ('sz', 0.211), ('inseparable', 0.191), ('lattice', 0.146), ('rp', 0.14), ('lattices', 0.13), ('cut', 0.129), ('bj', 0.125), ('distributive', 0.121), ('level', 0.12), ('sets', 0.112), ('variation', 0.107), ('piecewise', 0.094), ('faces', 0.092), ('sfm', 0.087), ('cuts', 0.085), ('cj', 0.082), ('norms', 0.077), ('recovery', 0.076), ('poset', 0.075), ('extreme', 0.071), ('agglomerative', 0.071), ('dedicated', 0.069), ('functions', 0.069), ('envelope', 0.068), ('wj', 0.065), ('hasse', 0.064), ('interiors', 0.064), ('minimizations', 0.064), ('sfms', 0.064), ('convex', 0.064), ('subgradient', 0.058), ('wk', 0.057), ('shaping', 0.057), ('ista', 0.057), ('extension', 0.056), ('detection', 0.054), ('proposition', 0.053), ('staircase', 0.052), ('fista', 0.052), ('outlier', 0.051), ('total', 0.05), ('concave', 0.048), ('leading', 0.046), ('designs', 0.046), ('ud', 0.046), ('clustering', 0.043), ('generic', 0.043), ('posets', 0.043), ('wjp', 0.043), ('modular', 0.042), ('hal', 0.042), ('symmetric', 0.042), ('tv', 0.041), ('fused', 0.041), ('connected', 0.041), ('allowed', 0.041), ('constant', 0.04), ('graph', 0.039), ('plot', 0.038), ('wjk', 0.038), ('illustrating', 0.037), ('examples', 0.036), ('chain', 0.036), ('paths', 0.036), ('diagram', 0.035), ('polyhedron', 0.035), ('partition', 0.034), ('greedy', 0.034), ('minimization', 0.033), ('bach', 0.033), ('polar', 0.033), ('polyhedral', 0.033), ('replications', 0.033), ('sierra', 0.033), ('mink', 0.033), ('regularization', 0.032), ('equal', 0.032), ('outliers', 0.031), ('maxs', 0.031), ('consecutive', 0.031), ('uni', 0.03), ('card', 0.029), ('af', 0.029), ('tight', 0.029), ('vj', 0.029), ('maxk', 0.028), ('separable', 0.028), ('supports', 0.028), ('path', 0.028), ('undirected', 0.028), ('partial', 0.028), ('extensions', 0.027), ('ow', 0.027), ('indices', 0.027), ('points', 0.027), ('ordering', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

2 0.50035244 199 nips-2011-On fast approximate submodular minimization

Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes

Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1

3 0.32914203 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

4 0.229403 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

Author: Yoshinobu Kawahara, Takashi Washio

Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.

5 0.22488099 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

Author: Matthias Hein, Simon Setzer

Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1

6 0.14652537 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

7 0.1280309 78 nips-2011-Efficient Methods for Overlapping Group Lasso

8 0.11526298 213 nips-2011-Phase transition in the family of p-resistances

9 0.10551783 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

10 0.09963958 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

11 0.093408614 277 nips-2011-Submodular Multi-Label Learning

12 0.09090998 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

13 0.073070392 186 nips-2011-Noise Thresholds for Spectral Clustering

14 0.072478764 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty

15 0.072374679 276 nips-2011-Structured sparse coding via lateral inhibition

16 0.071650065 258 nips-2011-Sparse Bayesian Multi-Task Learning

17 0.067608088 227 nips-2011-Pylon Model for Semantic Segmentation

18 0.064061016 259 nips-2011-Sparse Estimation with Structured Dictionaries

19 0.062856495 195 nips-2011-On Learning Discrete Graphical Models using Greedy Methods

20 0.06129146 265 nips-2011-Sparse recovery by thresholded non-negative least squares


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, -0.019), (2, -0.106), (3, -0.096), (4, -0.05), (5, -0.057), (6, -0.526), (7, 0.308), (8, -0.107), (9, 0.076), (10, -0.018), (11, -0.105), (12, 0.153), (13, 0.143), (14, -0.022), (15, -0.023), (16, -0.033), (17, 0.053), (18, 0.005), (19, -0.037), (20, 0.015), (21, 0.005), (22, -0.095), (23, -0.025), (24, 0.009), (25, -0.042), (26, 0.022), (27, 0.087), (28, 0.05), (29, 0.055), (30, 0.056), (31, 0.027), (32, 0.005), (33, -0.01), (34, 0.026), (35, 0.021), (36, 0.001), (37, 0.022), (38, 0.017), (39, -0.03), (40, 0.002), (41, 0.034), (42, -0.013), (43, -0.011), (44, -0.003), (45, -0.051), (46, -0.04), (47, -0.003), (48, -0.022), (49, -0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95039093 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

2 0.92872518 199 nips-2011-On fast approximate submodular minimization

Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes

Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1

3 0.80167055 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

Author: Matthias Hein, Simon Setzer

Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1

4 0.73778343 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

Author: Yoshinobu Kawahara, Takashi Washio

Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.

5 0.73462397 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

6 0.66141707 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

7 0.41604131 277 nips-2011-Submodular Multi-Label Learning

8 0.33477187 213 nips-2011-Phase transition in the family of p-resistances

9 0.32510039 78 nips-2011-Efficient Methods for Overlapping Group Lasso

10 0.30417618 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization

11 0.3019751 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty

12 0.28277758 67 nips-2011-Data Skeletonization via Reeb Graphs

13 0.27494493 202 nips-2011-On the Universality of Online Mirror Descent

14 0.26926294 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection

15 0.25623244 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

16 0.240989 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs

17 0.23811786 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs

18 0.23794463 259 nips-2011-Sparse Estimation with Structured Dictionaries

19 0.23744547 73 nips-2011-Divide-and-Conquer Matrix Factorization

20 0.23731127 265 nips-2011-Sparse recovery by thresholded non-negative least squares


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (4, 0.058), (17, 0.012), (20, 0.041), (26, 0.056), (31, 0.052), (33, 0.029), (41, 0.201), (43, 0.078), (45, 0.146), (57, 0.025), (65, 0.018), (74, 0.076), (83, 0.024), (99, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.82015592 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

2 0.78821206 4 nips-2011-A Convergence Analysis of Log-Linear Training

Author: Simon Wiesler, Hermann Ney

Abstract: Log-linear models are widely used probability models for statistical pattern recognition. Typically, log-linear models are trained according to a convex criterion. In recent years, the interest in log-linear models has greatly increased. The optimization of log-linear model parameters is costly and therefore an important topic, in particular for large-scale applications. Different optimization algorithms have been evaluated empirically in many papers. In this work, we analyze the optimization problem analytically and show that the training of log-linear models can be highly ill-conditioned. We verify our findings on two handwriting tasks. By making use of our convergence analysis, we obtain good results on a large-scale continuous handwriting recognition task with a simple and generic approach. 1

3 0.72540581 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

4 0.72288269 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning

Author: George Konidaris, Scott Niekum, Philip S. Thomas

Abstract: We show that the λ-return target used in the TD(λ) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an nstep return estimate increases with n. We introduce the γ-return estimator, an alternative target based on a more accurate model of variance, which defines the TDγ family of complex-backup temporal difference learning algorithms. We derive TDγ , the γ-return equivalent of the original TD(λ) algorithm, which eliminates the λ parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TDγ (C), with a capacity parameter C. TDγ (C) requires C times more time and memory than TD(λ) and is incremental and online. We show that TDγ outperforms TD(λ) for any setting of λ on 4 out of 5 benchmark domains, and that TDγ (C) performs as well as or better than TDγ for intermediate settings of C. 1

5 0.7202391 80 nips-2011-Efficient Online Learning via Randomized Rounding

Author: Nicolò Cesa-bianchi, Ohad Shamir

Abstract: Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning. 1

6 0.71863639 22 nips-2011-Active Ranking using Pairwise Comparisons

7 0.71824104 263 nips-2011-Sparse Manifold Clustering and Embedding

8 0.71762598 29 nips-2011-Algorithms and hardness results for parallel large margin learning

9 0.71421325 265 nips-2011-Sparse recovery by thresholded non-negative least squares

10 0.71140611 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

11 0.71112818 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction

12 0.71069694 202 nips-2011-On the Universality of Online Mirror Descent

13 0.70886946 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

14 0.70805031 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

15 0.70786476 109 nips-2011-Greedy Model Averaging

16 0.70649666 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

17 0.70617634 276 nips-2011-Structured sparse coding via lateral inhibition

18 0.70597857 294 nips-2011-Unifying Framework for Fast Learning Rate of Non-Sparse Multiple Kernel Learning

19 0.70546216 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

20 0.70310718 239 nips-2011-Robust Lasso with missing and grossly corrupted observations