nips nips2011 nips2011-199 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
Reference: text
sentIndex sentText sentNum sentScore
1 On fast approximate submodular minimization Stefanie Jegelka† , Hui Lin∗ , Jeff Bilmes∗ Max Planck Institute for Intelligent Systems, Tuebingen, Germany ∗ University of Washington, Dept. [sent-1, score-0.745]
2 We therefore propose a fast approximate method to minimize arbitrary submodular functions. [sent-11, score-0.705]
3 For a large sub-class of submodular functions, the algorithm is exact. [sent-12, score-0.705]
4 Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. [sent-13, score-1.486]
5 A set function f : 2V → R defined on subsets of a finite ground set V is submodular if it satisfies the inequality f (S) + f (T ) ≥ f (S ∪ T ) + f (S ∩ T ) for all S, T ⊆ V. [sent-16, score-0.743]
6 Submodular functions include entropy, graph cuts (defined as a function of graph nodes), potentials in many Markov Random Fields [3], clustering objectives [23],covering functions (e. [sent-17, score-0.572]
7 One might consider submodular functions as being on the boundary between “efficiently”, i. [sent-20, score-0.822]
8 Indeed, the submodular function minimization (SFM) algorithms with proven polynomial running time are practical only for very small data sets. [sent-24, score-0.841]
9 3 ) were obtained with simpler graph cut functions [22]. [sent-30, score-0.387]
10 Contrary to the possibly poor performance of “exact” methods, our approximate method is fast, is exact for a large class of submodular functions, and approximates all other functions with bounded deviation. [sent-34, score-0.812]
11 Our approach combines two ingredients: 1) the representation of functions by graphs; and 2) a recent generalization of graph cuts that combines edge-costs non-linearly. [sent-35, score-0.349]
12 Representing functions as graph cuts is a popular basis for optimization, but cuts cannot efficiently represent all submodular functions. [sent-36, score-1.18]
13 Contrary to previous constructions, including 2) leads to exact representations for any submodular 1 function. [sent-37, score-0.736]
14 To optimize an arbitrary submodular function f represented in our formalism, we construct ˆ a graph-representable tractable submodular upper bound f that is tight at a given set T ⊆ V, i. [sent-38, score-1.41]
15 This problem can be solved via SFM using the following Bipartite neighborhoods class of submodular functions: Define a bipartite graph H = (V, U, E, w) with left/right nodes V/U, and a modular weight function w : U → R+ . [sent-52, score-1.233]
16 We say f is the submodular function induced by modular function w and graph H. [sent-59, score-1.007]
17 Benefit restricted to blocks can arise from non-negative non-decreasing submodular functions g : 2U → R+ restricted to blocks. [sent-73, score-0.781]
18 Unfortunately, this class of submodular functions is no longer representable by a bipartite graph, and general SFM must be used. [sent-75, score-0.984]
19 2 Background on Algorithms for submodular function minimization (SFM) The first polynomial algorithm for SFM was by Gr¨ tschel et al. [sent-79, score-0.884]
20 2 Luckily, many sub-families of submodular functions permit specialized, faster algorithms. [sent-83, score-0.804]
21 They have found numerous applications in computer vision [2, 12], begging the question as to which functions can be represented and minimized using graph ˘ y cuts [9, 6, 31]. [sent-85, score-0.349]
22 [32] show that cut representations are indeed limited: even when allowing exponentially many additional variables, not all submodular functions can be expressed as graph cuts. [sent-87, score-1.092]
23 Other specific cases of relatively efficient SFM include graphic matroids [25] and symmetric submodular functions, minimizable in cubic time [26]. [sent-91, score-0.785]
24 A further class of benign functions are those of the form f (S) = ψ( i∈S w(i)) + m(S) for nonnegative weights w : V → R+ , and certain concave functions ψ : R → R. [sent-92, score-0.358]
25 Whereas Fujishige and Iwata [8] decompose ψ as a minimum of modular functions, Stobbe and Krause [29] decompose it into a sum of truncated functions of the form f (A) = min{ i∈A w (i), γ} — this class of functions, however, is also limited. [sent-96, score-0.385]
26 Thus, if truncations could express any submodular function, then so could graph cuts, contradicting the results in [32]. [sent-98, score-0.93]
27 Moreover, the formulation itself of some representable functions in terms of concave functions can be challenging. [sent-100, score-0.287]
28 2 Representing submodular functions by generalized graph cuts We begin with the representation of a set function f : 2V → R by a graph cut, and then extend this to submodular edge weights. [sent-102, score-2.033]
29 Formally, f is graph-representable if there exists a graph G = (V ∪ U ∪ {s, t}, E) with terminal nodes s, t, one node for each element i in V, a set U of auxiliary nodes (U can be empty), and edge weights w : E → R+ such that, for any S ⊆ V: f (S) = min w(δ(s ∪ S ∪ U )) = min U ⊆U U ⊆U w(e). [sent-103, score-0.596]
30 Recall that any minimal (s, t)-cut partitions the graph nodes into the set Ts ⊆ V∪U reachable from s and the set Tt = (V∪U)\Ts disconnected from s. [sent-105, score-0.233]
31 That means, f (S) equals the weight of the minimum (s, t)-cut that assigns S to Ts and V \ S to Tt , and the auxiliary nodes to achieve the minimum. [sent-106, score-0.238]
32 The graph representation (1) leads to the equivalence between minimum cuts and the minimizers of f : Lemma 1. [sent-112, score-0.342]
33 Then the boundary δs (S ∗ ∪ U ∗ ) ⊆ E is a minimum cut in G. [sent-114, score-0.274]
34 The lemma (proven in [18]) is good news since minimum cuts can be computed efficiently. [sent-115, score-0.218]
35 To derive ∗ S ∗ from a minimum cut, recall that any minimum cut is the boundary of some set Ts ⊆ V ∪ U that ∗ ∗ ∗ ∗ ∗ is still reachable from s after cutting. [sent-116, score-0.367]
36 A large sub-family of submodular functions can be expressed exactly in the form (1), but possibly with an exponentially large U. [sent-118, score-0.781]
37 To express any submodular function with few auxiliary nodes, in this paper we extend Equation (1) as is seen below. [sent-120, score-0.806]
38 Unless the submodular function f is already a graph cut function (and directly representable), we first decompose f into a modular function and a nondecreasing submodular function, and then build up the graph part by part. [sent-121, score-2.12]
39 Dashed blue edges can have submodular weights; auxiliary nodes are white and ground set nodes are shaded. [sent-124, score-1.062]
40 The bipartite graph can have arbitrary representations between U and t, 3(e) is one example. [sent-125, score-0.281]
41 Any submodular function f can be decomposed as f (S) = m(S) + g(S) into a modular function m and a totally normalized polymatroid rank function g. [sent-133, score-0.951]
42 If m(i) ≥ 0 for any i ∈ V, then diminishing marginal costs, a property of submodular functions, imply that we can discard element i immediately [5, 18]. [sent-136, score-0.737]
43 To express such negative costs in a graph cut, we point out an equivalent formulation with positive weights: since m(V) is constant, minimizing m(S) = i∈S m(i) is equivalent to minimizing the shifted function m(S) − m(V) = −m(V \ S). [sent-137, score-0.297]
44 We implement this shifted function in the graph by adding an edge (s, i) with nonnegative weight −m(i) for each i ∈ V. [sent-139, score-0.332]
45 , j ∈ S) that / is not selected must be separated from s, and the edge (s, j) contributes −m(j) to the total cut cost. [sent-142, score-0.291]
46 Having constructed the modular part of the function f by edges (s, i) for all i ∈ V, we address its submodular part g. [sent-143, score-0.978]
47 We begin with some example functions that are explicitly graph-representable with polynomially many auxiliary nodes U. [sent-145, score-0.215]
48 The minimum way to disconnect a set S from t is to cut the single edge (uj−1 , uj ) with weight w(j) of the largest element j = argmaxi∈S w(i). [sent-154, score-0.448]
49 Truncations can model piecewise linear concave functions of w(S) [19, 29], and also represent negative terms in a pseudo-boolean polynomial [18]. [sent-159, score-0.223]
50 Furthermore, these functions include rank functions g(S) = min{|S|, k} of uniform matroids, and rank functions of partition matroids. [sent-160, score-0.316]
51 We already encountered bipartite submodular functions f (S) = u∈N (S) w(u) in Section 1. [sent-164, score-0.915]
52 The bipartite graph that defines N (S) is part of the representa4 tion shown in Figure 3(d), and its edges get infinite weight. [sent-166, score-0.399]
53 Each u ∈ U has such an edge (u, t), and the weight of that edge is the weight w(u) of u. [sent-168, score-0.314]
54 Minimizing a graph-represented function is equivalent to finding the minimum (s, t)-cut, and all edge weights in the above are nonnegative. [sent-173, score-0.285]
55 1 Submodular edge weights Next we address the generic case of a submodular function that is not (efficiently) graph-representable or whose functional form is unknown. [sent-176, score-0.921]
56 We can still decompose this function into a modular part m and a polymatroid g. [sent-177, score-0.249]
57 Instead of a sum of weights, we define the cost of a set of these edges to be a non-additive function on sets of edges, a polymatroid rank function. [sent-180, score-0.281]
58 Each edge (i, t) is associated with exactly one ground set element i ∈ V, and selecting i (i ∈ Ts ) is equivalent to cutting the edge (i, t). [sent-181, score-0.324]
59 Thus, the cost of edge (i, t) will model the cost g(i) of its element i ∈ V. [sent-182, score-0.253]
60 We define the cost of C to be the cost of its adjacent ground set elements, hg (C) g(V (C)); this implies hg (δs (S ∩ Et )) = g(S). [sent-185, score-0.553]
61 This generalization from the standard sum of edge weights to a nondecreasing submodular function permits us to express many more functions, in fact any submodular function [5]. [sent-187, score-1.742]
62 Such expressiveness comes at a price, however: in general, finding a minimum (s, t)-cut with such submodular edge weights is NP-hard, and even hard to approximate [17]. [sent-188, score-0.99]
63 The graphs here that represent submodular functions correspond to benign examples that are not NP-hard. [sent-189, score-0.844]
64 For the moment, we assume that we can handle submodular costs on edges. [sent-192, score-0.775]
65 The simple construction in Figure 3(f) itself corresponds to a general submodular function minimization. [sent-193, score-0.737]
66 If g decomposes into a sum of graph-representable functions and a (nondecreasing submodular) remainder gr , then we construct a subgraph for each graph-representable function, and combine these subgraphs with the submodular-edge construction for gr . [sent-195, score-0.287]
67 In addition, we are in no way restricted to separating graph-representable and general submodular functions. [sent-197, score-0.705]
68 The cost function in our application is a submodular function induced by a bipartite graph H = (V, U, E). [sent-198, score-1.033]
69 Given a nondecreasing submodular function gU : 2U → R+ on U, the graph H defines a function g(S) = gU (N (S)). [sent-200, score-0.919]
70 For any such function, we represent H explicitly in G, and then add submodular-cost edges from U to t with hg (δs (N (S))) = gU (N (S)), as shown in Figure 3(d). [sent-203, score-0.317]
71 Algorithm 1 applies to any submodular-weight cut; this algorithm is exact if the edge costs are modular (a sum of ˆ weights). [sent-206, score-0.408]
72 In this section, we switch from costs f, f of node sets S, T to equivalent costs w, h of edge sets A, B, C and back. [sent-208, score-0.267]
73 Recall that the representation G has two types of edges: those whose weights w are counted as the usual sum, and those charged via a submodular function hg derived from g. [sent-214, score-1.022]
74 The submodular cost hg of the remaining edges is upper bounded by referring to a fixed set B ⊆ E that we specify later. [sent-217, score-1.069]
75 For any A ⊆ Et , we define ˆ hB (A) hg (B) + ρh (e|B ∩ Et ) − e∈A\B ρh (e|Et \ e) ≥ hg (A). [sent-218, score-0.398]
76 Importantly, the edge weights νB are always nonnegative, because, by Theorem 1, g is guaranteed to be nondecreasing. [sent-225, score-0.216]
77 Assume G is any of the graphs in Figure 3, and let T ∗ ⊆ V ∪U be the maximal set defining a minimum-cost cut δs (T ∗ ) in G, so that S ∗ = T ∗ ∩ V is a minimizer of the function represented by G. [sent-232, score-0.229]
78 1 Improvement via summarizations ˆ The approximation f is loosest if the sum of edge weights νi (A) significantly overestimates the true joint cost hg (A) of sets of edges A ⊆ δs T ∗ \ δs Ti still to be cut. [sent-242, score-0.605]
79 Thus, to tighten the approximation, we summarize the joint cost of groups of edges by a construction similar to Figure 3(b). [sent-245, score-0.24]
80 For each group, we introduce an auxiliary node tk and re-connect all edges (u, t) ∈ Gk to end in tk instead of t. [sent-248, score-0.297]
81 An extra edge ek connects tk to t, and carries the joint weight νi (ek ) of all edges in Gk ; a tighter approximation. [sent-250, score-0.418]
82 Subsequent approximations νi refer to cuts δs Ti , and such a cut can contain either single edges from Gk , or the group edge ek . [sent-253, score-0.627]
83 We set the next reference set Bi to be a copy of δs Ti in which each group edge ek was replaced by all its group members Gk . [sent-254, score-0.219]
84 Formally, these weights represent the upper bound ˆ hB (A) = hg (B) + ρh (Gk \ B|B) + Gk ⊆A ˆ ρh (e|Et \ e) ≤ h(A), ρh (e|B) − e∈(Gk ∩A)\B,Gk ⊆A e∈B\A where we replace Gk by ek whenever Gk ⊆ A. [sent-256, score-0.38]
85 4 Parametric constructions for special cases For certain functions of the form f (S) = m(S) + g(N (S)), the graph representation in Figure 3(d) admits a specific algorithm. [sent-258, score-0.269]
86 3(d)), we define edge weights νk (e) = w(e) for edges e ∈ Et as in Section 3, / and νk (u, t) = αk w(u) for e ∈ Et . [sent-271, score-0.334]
87 Then Tk = Sk ∪ N (Sk ) defines a minimum cut δs Tk in G. [sent-272, score-0.233]
88 Here, νk ≥ 0 because ψ is nondecreasing, and thus we only need t-edges which already exist in the bipartite graph G. [sent-277, score-0.281]
89 The advantage of the graph cut is that it easily combines with other objectives. [sent-287, score-0.311]
90 The bipartite graph represents a corpus subset extraction problem (Section 1. [sent-305, score-0.281]
91 In contrast, the deviation of the approximate edge weights νi from the true cost is bounded [18]. [sent-314, score-0.263]
92 Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. [sent-338, score-0.314]
93 A submodular function minimization algorithm based on the minimum-norm base. [sent-367, score-0.745]
94 Minimizing a submodular function arising from a concave function. [sent-372, score-0.771]
95 Realization of set functions as cut functions of graphs and hypergraphs. [sent-378, score-0.356]
96 A combinatorial strongly polynomial algorithm for minimizing submodular functions. [sent-420, score-0.858]
97 Submodularity beyond submodular energies: coupling edges in graph cuts. [sent-426, score-0.97]
98 An application of the submodular principal partition to training data subset selection. [sent-449, score-0.739]
99 A faster strongly polynomial time algorithm for submodular function minimization. [sent-476, score-0.778]
100 A combinatorial algorithm minimizing submodular functions in strongly polynomial time. [sent-489, score-0.934]
wordName wordTfidf (topN-words)
[('submodular', 0.705), ('gk', 0.224), ('hg', 0.199), ('cut', 0.164), ('modular', 0.155), ('graph', 0.147), ('bipartite', 0.134), ('edge', 0.127), ('cuts', 0.126), ('edges', 0.118), ('sfm', 0.102), ('ti', 0.097), ('ek', 0.092), ('weights', 0.089), ('jegelka', 0.08), ('fujishige', 0.08), ('auxiliary', 0.077), ('functions', 0.076), ('ow', 0.076), ('costs', 0.07), ('minimum', 0.069), ('representable', 0.069), ('nondecreasing', 0.067), ('mn', 0.067), ('concave', 0.066), ('summarization', 0.065), ('breakpoints', 0.064), ('polymatroid', 0.064), ('nodes', 0.062), ('et', 0.061), ('parametric', 0.057), ('gu', 0.056), ('sk', 0.055), ('gir', 0.054), ('iwata', 0.054), ('matroid', 0.054), ('stobbe', 0.054), ('truncations', 0.054), ('zivn', 0.054), ('xk', 0.052), ('combinatorial', 0.052), ('tt', 0.051), ('tk', 0.051), ('ts', 0.051), ('running', 0.05), ('gis', 0.048), ('graphic', 0.048), ('cost', 0.047), ('polynomial', 0.046), ('submodularity', 0.046), ('gr', 0.046), ('constructions', 0.046), ('groups', 0.043), ('boundary', 0.041), ('gi', 0.041), ('minimization', 0.04), ('graphs', 0.04), ('ground', 0.038), ('hb', 0.036), ('gp', 0.036), ('piecewise', 0.035), ('contrary', 0.035), ('speech', 0.035), ('subgraph', 0.035), ('partition', 0.034), ('pick', 0.034), ('construction', 0.032), ('element', 0.032), ('expressible', 0.032), ('argmins', 0.032), ('matroids', 0.032), ('tschel', 0.032), ('exact', 0.031), ('truncation', 0.031), ('decompose', 0.03), ('mc', 0.03), ('weight', 0.03), ('charged', 0.029), ('combinatorica', 0.029), ('fk', 0.029), ('nonnegative', 0.028), ('minimizing', 0.028), ('motivating', 0.028), ('strongly', 0.027), ('luckily', 0.027), ('subgraphs', 0.027), ('bi', 0.027), ('rank', 0.027), ('lin', 0.026), ('boykov', 0.026), ('uj', 0.026), ('sum', 0.025), ('maximal', 0.025), ('express', 0.024), ('reachable', 0.024), ('benign', 0.023), ('permit', 0.023), ('em', 0.023), ('adjacent', 0.023), ('lemma', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
2 0.50035244 251 nips-2011-Shaping Level Sets with Submodular Functions
Author: Francis R. Bach
Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
3 0.41717255 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
Author: Andrew Guillory, Jeff A. Bilmes
Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9
4 0.30211061 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
5 0.26270333 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts
Author: Matthias Hein, Simon Setzer
Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1
6 0.17890799 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
7 0.11323053 277 nips-2011-Submodular Multi-Label Learning
8 0.10720734 227 nips-2011-Pylon Model for Semantic Segmentation
9 0.081866927 213 nips-2011-Phase transition in the family of p-resistances
10 0.071926281 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
11 0.068364747 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
12 0.064035565 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
13 0.063246697 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
14 0.061474878 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation
15 0.058896445 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
16 0.056546539 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
17 0.055522364 67 nips-2011-Data Skeletonization via Reeb Graphs
18 0.055115748 181 nips-2011-Multiple Instance Learning on Structured Data
19 0.054919157 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
20 0.054584157 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
topicId topicWeight
[(0, 0.189), (1, -0.02), (2, -0.096), (3, -0.032), (4, -0.007), (5, -0.112), (6, -0.602), (7, 0.297), (8, -0.118), (9, 0.033), (10, -0.067), (11, -0.112), (12, 0.19), (13, 0.158), (14, -0.036), (15, -0.027), (16, -0.071), (17, 0.065), (18, -0.04), (19, -0.046), (20, -0.018), (21, -0.026), (22, -0.048), (23, -0.044), (24, 0.048), (25, -0.03), (26, 0.032), (27, 0.06), (28, -0.005), (29, 0.077), (30, 0.017), (31, 0.037), (32, 0.002), (33, 0.006), (34, 0.022), (35, 0.005), (36, -0.023), (37, 0.028), (38, -0.006), (39, -0.026), (40, 0.011), (41, 0.006), (42, -0.009), (43, 0.002), (44, 0.037), (45, -0.04), (46, -0.007), (47, -0.009), (48, 0.008), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.95534039 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
2 0.89317214 251 nips-2011-Shaping Level Sets with Submodular Functions
Author: Francis R. Bach
Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
3 0.79966384 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts
Author: Matthias Hein, Simon Setzer
Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1
4 0.75503224 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
Author: Andrew Guillory, Jeff A. Bilmes
Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9
5 0.75404304 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
6 0.66715515 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
7 0.39814466 277 nips-2011-Submodular Multi-Label Learning
8 0.29904869 67 nips-2011-Data Skeletonization via Reeb Graphs
9 0.29175672 213 nips-2011-Phase transition in the family of p-resistances
10 0.25464368 146 nips-2011-Learning Higher-Order Graph Structure with Features by Structure Penalty
11 0.24833691 227 nips-2011-Pylon Model for Semantic Segmentation
12 0.24470679 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
13 0.24126354 274 nips-2011-Structure Learning for Optimization
14 0.23690671 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
15 0.22154661 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
16 0.21873994 117 nips-2011-High-Dimensional Graphical Model Selection: Tractable Graph Families and Necessary Conditions
17 0.21743166 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
18 0.21530412 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
19 0.2085308 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
20 0.20361565 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
topicId topicWeight
[(0, 0.046), (4, 0.079), (6, 0.224), (10, 0.02), (20, 0.058), (26, 0.031), (31, 0.075), (33, 0.027), (41, 0.011), (43, 0.08), (45, 0.096), (48, 0.016), (57, 0.034), (65, 0.01), (74, 0.056), (83, 0.026), (99, 0.03)]
simIndex simValue paperId paperTitle
1 0.78881699 303 nips-2011-Video Annotation and Tracking with Active Learning
Author: Carl Vondrick, Deva Ramanan
Abstract: We introduce a novel active learning framework for video annotation. By judiciously choosing which frames a user should annotate, we can obtain highly accurate tracks with minimal user effort. We cast this problem as one of active learning, and show that we can obtain excellent performance by querying frames that, if annotated, would produce a large expected change in the estimated object track. We implement a constrained tracker and compute the expected change for putative annotations with efficient dynamic programming algorithms. We demonstrate our framework on four datasets, including two benchmark datasets constructed with key frame annotations obtained by Amazon Mechanical Turk. Our results indicate that we could obtain equivalent labels for a small fraction of the original cost. 1
same-paper 2 0.76157397 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
3 0.63848084 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
Author: Alexander Rakhlin, Karthik Sridharan, Ambuj Tewari
Abstract: Learning theory has largely focused on two main learning scenarios: the classical statistical setting where instances are drawn i.i.d. from a fixed distribution, and the adversarial scenario wherein, at every time step, an adversarially chosen instance is revealed to the player. It can be argued that in the real world neither of these assumptions is reasonable. We define the minimax value of a game where the adversary is restricted in his moves, capturing stochastic and non-stochastic assumptions on data. Building on the sequential symmetrization approach, we define a notion of distribution-dependent Rademacher complexity for the spectrum of problems ranging from i.i.d. to worst-case. The bounds let us immediately deduce variation-type bounds. We study a smoothed online learning scenario and show that exponentially small amount of noise can make function classes with infinite Littlestone dimension learnable. 1
4 0.62539256 231 nips-2011-Randomized Algorithms for Comparison-based Search
Author: Dominique Tschopp, Suhas Diggavi, Payam Delgosha, Soheil Mohajer
Abstract: This paper addresses the problem of finding the nearest neighbor (or one of the R-nearest neighbors) of a query object q in a database of n objects, when we can only use a comparison oracle. The comparison oracle, given two reference objects and a query object, returns the reference object most similar to the query object. The main problem we study is how to search the database for the nearest neighbor (NN) of a query, while minimizing the questions. The difficulty of this problem depends on properties of the underlying database. We show the importance of a characterization: combinatorial disorder D which defines approximate triangle n inequalities on ranks. We present a lower bound of Ω(D log D + D2 ) average number of questions in the search phase for any randomized algorithm, which demonstrates the fundamental role of D for worst case behavior. We develop 3 a randomized scheme for NN retrieval in O(D3 log2 n + D log2 n log log nD ) 3 questions. The learning requires asking O(nD3 log2 n + D log2 n log log nD ) questions and O(n log2 n/ log(2D)) bits to store.
5 0.62053657 127 nips-2011-Image Parsing with Stochastic Scene Grammar
Author: Yibiao Zhao, Song-chun Zhu
Abstract: This paper proposes a parsing algorithm for scene understanding which includes four aspects: computing 3D scene layout, detecting 3D objects (e.g. furniture), detecting 2D faces (windows, doors etc.), and segmenting background. In contrast to previous scene labeling work that applied discriminative classifiers to pixels (or super-pixels), we use a generative Stochastic Scene Grammar (SSG). This grammar represents the compositional structures of visual entities from scene categories, 3D foreground/background, 2D faces, to 1D lines. The grammar includes three types of production rules and two types of contextual relations. Production rules: (i) AND rules represent the decomposition of an entity into sub-parts; (ii) OR rules represent the switching among sub-types of an entity; (iii) SET rules represent an ensemble of visual entities. Contextual relations: (i) Cooperative “+” relations represent positive links between binding entities, such as hinged faces of a object or aligned boxes; (ii) Competitive “-” relations represents negative links between competing entities, such as mutually exclusive boxes. We design an efficient MCMC inference algorithm, namely Hierarchical cluster sampling, to search in the large solution space of scene configurations. The algorithm has two stages: (i) Clustering: It forms all possible higher-level structures (clusters) from lower-level entities by production rules and contextual relations. (ii) Sampling: It jumps between alternative structures (clusters) in each layer of the hierarchy to find the most probable configuration (represented by a parse tree). In our experiment, we demonstrate the superiority of our algorithm over existing methods on public dataset. In addition, our approach achieves richer structures in the parse tree. 1
6 0.62038338 22 nips-2011-Active Ranking using Pairwise Comparisons
7 0.62036216 227 nips-2011-Pylon Model for Semantic Segmentation
8 0.61780691 139 nips-2011-Kernel Bayes' Rule
9 0.61705488 154 nips-2011-Learning person-object interactions for action recognition in still images
10 0.61651647 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
11 0.61246246 258 nips-2011-Sparse Bayesian Multi-Task Learning
12 0.61088306 186 nips-2011-Noise Thresholds for Spectral Clustering
13 0.61087447 180 nips-2011-Multiple Instance Filtering
14 0.61057371 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials
15 0.61029059 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
16 0.60945463 29 nips-2011-Algorithms and hardness results for parallel large margin learning
17 0.60886389 276 nips-2011-Structured sparse coding via lateral inhibition
18 0.6085096 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
19 0.60804373 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
20 0.60781676 265 nips-2011-Sparse recovery by thresholded non-negative least squares