nips nips2011 nips2011-222 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
Reference: text
sentIndex sentText sentNum sentScore
1 jp Abstract In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D. [sent-7, score-0.338]
2 The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. [sent-14, score-0.101]
3 programming problem covers a broad range of applications in machine learning. [sent-17, score-0.165]
4 We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning. [sent-19, score-0.197]
5 , the problem of minimizing the difference of two submodular functions. [sent-26, score-0.302]
6 This is a natural formulation of many machine learning problems, such as learning graph matching [3], discriminative structure learning [21], feature selection [1] and energy minimization [24]. [sent-27, score-0.215]
7 In this paper, we propose a prismatic algorithm for the D. [sent-28, score-0.289]
8 programming problem, which is a branch-and-bound-based algorithm responding to the specific structure of this problem. [sent-30, score-0.175]
9 programming problem (although there exists an approximate algorithm for this problem [21]). [sent-33, score-0.119]
10 As is well known, the branch-and-bound method is one of the most successful frameworks in mathematical programming and has been incorporated into commercial softwares such as CPLEX [13, 12]. [sent-34, score-0.119]
11 programming problem through the continuous relaxation of solution spaces and objective functions with the help of the Lov´ sz extension [17, 11, 18]. [sent-37, score-0.21]
12 The algorithm is a implemented as an iterative calculation of binary-integer linear programming (BILP). [sent-38, score-0.119]
13 programming problem in machine learning and investigate empirically the performance of our method and the difference between exact and approximate solutions through feature selection and discriminative structure-learning problems. [sent-41, score-0.289]
14 programming problem and then describe its applications in machine learning. [sent-45, score-0.119]
15 Preliminaries and Notation: A set function f is called submodular if f (A) + f (B) ≥ f (A ∪ B) + f (A ∩ B) for all A, B ⊆ N , where N = {1, · · · , n} [5, 7]. [sent-49, score-0.302]
16 Programming Problem and its Applications Let f and g are submodular functions. [sent-68, score-0.302]
17 , the problem of minimizing the difference of two submodular functions: min f (A) − g(A). [sent-73, score-0.333]
18 This combinatorial problem is known to be a submodular maximization problem with cardinality constraint for commonly used measures such as least-squared errors [4, 14]. [sent-79, score-0.395]
19 Recently, it is reported that submodular functions in place of the cardinality can give a wider family of polyhedral norms and may incorporate prior knowledge or structural constraints in sparse methods [1]. [sent-81, score-0.428]
20 Then, the objective (that is supposed to be minimized) becomes the sum of a loss function (often, supermodular) and submodular regularization terms. [sent-82, score-0.332]
21 One commonly used metric for discriminative structure learning would be EAR (explaining away residual) [2]. [sent-84, score-0.112]
22 Since the (symmetric) mutual information is a submodular function, obviously this problem leads the D. [sent-89, score-0.326]
23 Then, many tasks in computer vision can be naturally formulated in terms of energy minimization where the energy function has the ∑ ∑ form: E(x) = p∈V θp (xp ) + (p,q)∈E θ(xp , xq ), where θp and θp,q are univariate and pairwise potentials. [sent-94, score-0.124]
24 Based on this, any energy function in computer vision can be written with a submodular function E1 (x) and a supermodular function E2 (x) as E(x) = E1 (x) + E2 (x) (ex. [sent-98, score-0.442]
25 Or, in case of binarized energy, even if such explicit decomposition is not known, a non-unique decomposition to submodular and supermodular functions can be always given [25]. [sent-100, score-0.442]
26 Programming Problem By introducing an additional variable t(∈ R), Problem (1) can be converted into the equivalent problem with a supermodular objective function and a submodular feasible set, i. [sent-103, score-0.499]
27 To this end, we first define a prism T (S) ⊂ Rn+1 by T = {(x, t) ∈ Rn × R : x ∈ S}, where S is an n-simplex. [sent-110, score-0.389]
28 The prism T has n + 1 edges that are vertical lines (i. [sent-114, score-0.389]
29 T v (0,1) (1,1) D r (1,0) (0,0) S2 S1 S Figure 1: Illustration of the prismatic algorithm. [sent-117, score-0.289]
30 In branching, subproblems are constructed by dividing the feasible region of a parent problem. [sent-119, score-0.108]
31 And in bounding, we judge whether an optimal solution exists in the region of a subproblem and its descendants by calculating an upper bound of the subproblem and comparing it with an lower bound of the original problem. [sent-120, score-0.141]
32 Then, T = i=1 Ti , where Ti = {(x, t) ∈ Rn × R : x ∈ Si }, is a natural prismatic partition of T induced by the above simplical partition. [sent-125, score-0.289]
33 , Tk ) at the iteration k, we consider a polyheˆ ˜ ˜ dral convex set Pk such that Pk ⊃ D, where D = {(x, t) ∈ Rn × R : x ∈ D, f (x) ≤ t} is the region corresponding to the feasible set of Problem (2). [sent-127, score-0.136]
34 Here, t can be determined by using ˜ ˜ where t some existing submodular minimization solver [23, 8]. [sent-129, score-0.332]
35 3, a lower bound β(Tk ) of t − g(A) on the current prism Tk can be calculated through the binary-integer linear programming (BILP) (or the linear programming (LP)) using Pk , obtained as described above. [sent-133, score-0.675]
36 Construction of the first prism: A prism needs to be constructed from a hypercube at first, 2. [sent-141, score-0.423]
37 Subdivision process: A prism is divided into a finite number of sub-prisms at each iteration, 3. [sent-142, score-0.389]
38 Bound estimation: For each prism generated throughout the algorithm, a lower bound for the objective function t − g(A) over the part of the feasible set contained in this prism is computed, 4. [sent-143, score-0.93]
39 Construction of cutting planes: Throughout the algorithm, a sequence of polyhedral convex sets ˜ P0 , P1 , · · · is constructed such that P0 ⊃ P1 ⊃ · · · ⊃ D. [sent-144, score-0.156]
40 Deletion of non-optimal prisms: At each iteration, we try to delete prisms that contain no feasible solution better than the one obtained so far. [sent-146, score-0.293]
41 3 ˜ Construct a simplex S0 ⊃ D, its corresponding prism T0 and a polyhedral convex set P0 ⊃ D. [sent-147, score-0.542]
42 while Rk ̸= ∅ ∗ ∗ ∗ Select a prism Tk ∈ Rk satisfying βk = β(Tk ), (¯ k , tk ) ∈ Tk . [sent-151, score-0.869]
43 v ¯ k ¯ ˜ if (¯ , tk ) ∈ D then v Set Pk+1 = Pk . [sent-152, score-0.441]
44 else Construct lk (x, t) according to (8), and set Pk+1 = {(x, t) ∈ Pk : lk (x, t) ≤ 0}. [sent-153, score-0.238]
45 Let Mk denote the collection of remaining prisms Tk,j (j ∈ Jk ), and for each T ∈ Mk set 1 2 3 4 5 6 7 8 9 10 11 12 ∗ β(T ) = max{β(Tk ), β(T, Pk+1 , αk )}. [sent-159, score-0.115]
46 14 15 Algorithm 1: Pseudo-code of the prismatic algorithm for the D. [sent-163, score-0.289]
47 1 Construction of the first prism ˜ The initial simplex S0 ⊃ D (which yields the initial prism T0 ⊃ D) can be constructed as follows. [sent-166, score-0.865]
48 Then, the initial simplex S0 ⊃ D can be constructed by S0 = {x ∈ Rn : xi ≤ 1(i ∈ Av ), xi ≥ 0(i ∈ N \ Av ), aT x ≤ γ}, ∑ where a = i∈N \Av ei − i∈Av ei and γ = |N \ Av |. [sent-170, score-0.167]
49 The n + 1 vertices of S0 are v and the n points where the hyperplane {x ∈ Rn : aT x = γ} intersects the edges of the cone {x ∈ Rn : xi ≤ 1(i ∈ Av ), xi ≥ 0(i ∈ N \ Av )}. [sent-171, score-0.174]
50 2 Sub-division of a prism Let Sk and Tk be the simplex and prism at k-th iteration in the algorithm, respectively. [sent-174, score-0.866]
51 , we have ∪ j i i λi >0 Sk = Sk , int Sk ∩ int Sk = ∅ for i ̸= j [12]. [sent-202, score-0.126]
52 i i In a natural way, the prisms T (Sk ) generated by the simplices Sk defined in Eq. [sent-203, score-0.149]
53 , for every nested (decreasing) sequence ∩∞ of prisms {Tq } generated by this process, we have q=0 Tq = τ , where τ is a line perpendicular to Rn (a vertical line) [11]. [sent-207, score-0.115]
54 3 Lower bounds Again, let Sk and Tk be the simplex and prism at k-th iteration in the algorithm, respectively. [sent-216, score-0.477]
55 And, let α be an upper bound of t − g(A), which is the smallest value of t − g(A) attained at a feasible ˜ point known so far in the algorithm. [sent-217, score-0.1]
56 Moreover, let Pk be a polyhedral convex set which contains D and be represented as Pk = {(x, t) ∈ Rn × R : Ak x + ak t ≤ bk }, (4) m 1 where Ak is a real (m × n)-matrix and ak , bk ∈ R . [sent-218, score-0.266]
57 , n + 1, consider the point (v i , ti ) where the edge of Tk passing through v i k k k intersects the level set {(x, t) : t − g (x) = µ}, i. [sent-230, score-0.265]
58 ˆ k k Then, let us denote the uniquely defined hyperplane through the points (v i , ti ) by H = {(x, t) ∈ k k Rn ×R : pT x−t = γ}, where p ∈ Rn and γ ∈ R. [sent-236, score-0.258]
59 ˜ If Tk ∩ D ⊆ H+ , then we see from the supermodularity of g(A) (the concavity of g (x)) that ˆ ˜ min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ D)} ≥ min{t − g(A) : (A, t) ∈ (A, t)(Tk ∩ H+ )} ≥ min{t − g (x) : (x, t) ∈ Tk ∩ H+ } ˆ = ti − g (xi )(i = 1, . [sent-240, score-0.185]
60 , n + 1, let z i = (v i , ti ) be the point where the edge of T passing through v i intersects k k k ¯ ˜ ¯ H. [sent-249, score-0.265]
61 Ak x + ak t ≤ bk , x = i=1 λi v i , x ∈ Bn , k i=1 ti λi − t λ,x,t ∑n+1 i=1 λi = 1, λi ≥ 0 (i = 1, . [sent-256, score-0.268]
62 ∑n+1 ∗ ∗ ∗ ∗ ∗ ∗ (b) Otherwise, let (λ , x , t ) be an optimal solution of BILP (5) and c = i=1 ti λi − t its optimal value, respectively. [sent-263, score-0.21]
63 ∑n+1 ∗ ¯ ¯k ˆ k (b2) If c > 0, then z = ( i=1 λi v i , t∗ ), z i = (v i , ti ) = (v i , ti − c∗ ) and ti − g (v i ) = k k k k k k ∗ µ − c (i = 1, . [sent-265, score-0.555]
64 5 ¯ that determining the hyperplane H and the point z amounts to solving the binary integer linear programming problem: max pT x − t s. [sent-277, score-0.192]
65 On the other hand, since (v i , ti ) ∈ H, we have pT v i − ti = γ (i = 1, . [sent-281, score-0.37]
66 , n + 1), and hence k k k k ∑n+1 ∑n+1 i T i p x − t = i=1 λi (γ + tk ) − t = i=1 tk λi − t + γ. [sent-284, score-0.882]
67 ¯ Since H = {(x, t) ∈ Rn × R : pT x − t = γ ∗ } and H = {(x, t) ∈ Rn × R : pT x − t = γ} ¯ we see that for each intersection point (v i , ti ) (and (v i , ti )) of the edge of Tk passing through v i k k k k k ¯ ¯k with H (and H), we have pT v i − ti = γ ∗ and pT v i − ti = γ, respectively. [sent-290, score-0.762]
68 This implies that k k k ¯k ˆ k ¯k ˆ k ti = ti + γ − γ ∗ = ti − c∗ , and (using ti = g (v i ) + µ) that ti = g (v i ) + µ − c∗ . [sent-291, score-0.925]
69 Thus, Proposition 1 provides the lower bound { +∞, if BILP (5) has no feasible point, µ, if c∗ ≤ 0, (7) βk (Tk , Pk , α) = ∗ µ − c if c∗ > 0. [sent-296, score-0.122]
70 4 Outer approximation ˜ The polyhedral convex set Pk ⊃ D used in the preceding section is updated in each iteration, i. [sent-300, score-0.1]
71 That is, a certain linear inequality lk (x, t) ≤ 0 is added to the constraint set defining Pk , i. [sent-307, score-0.119]
72 , we set Pk+1 = Pk ∩ {(x, t) ∈ Rn × R : lk (x, t) ≤ 0}. [sent-309, score-0.119]
73 The function lk (x, t) is constructed as follows. [sent-310, score-0.153]
74 (7), and a point (¯ k , tk ) satisfying tk − g (¯ k ) = βk . [sent-312, score-0.921]
75 Then, we can set approximation only in the case (¯ k , t / v ˆ lk (x, t) = sT [(x, t) − z k ] + (f (x∗ ) − t∗ ), k k k (8) ˆ where sk is a subgradient of f (x) − t at z k . [sent-314, score-0.31]
76 The hyperplane {(x, t) ∈ Rn × R : lk (x, t) = 0} strictly separates z k from D, i. [sent-317, score-0.192]
77 , ∀ ˜ lk (z k ) > 0, and lk (x, t) ≤ 0 for (x, t) ∈ D. [sent-319, score-0.238]
78 Since we assume that z k ∈ D, we have lk (z k ) = (f (x∗ ) − t∗ ). [sent-321, score-0.119]
79 5 Deletion rules At each iteration of the algorithm, we try to delete certain subprisms that contain no optimal solution. [sent-324, score-0.152]
80 To this end, we adopt the following two deletion rules: (DR1) Delete Tk if BILP (5) has no feasible solution. [sent-325, score-0.115]
81 (Supermodular-submodular) λ λ λ Figure 2: Training errors, test errors and computational time versus λ for the prismatic algorithm and the supermodular-sumodular procedure. [sent-329, score-0.325]
82 87) Table 1: Normalized mean-square prediction errors of training and test data by the prismatic algorithm, the supermodular-submodular procedure, the greedy algorithm and the lasso. [sent-362, score-0.325]
83 , the prism Tk is infeasible, and (DR2) from Proposition 1 and from the definition of µ that the current best feasible solution cannot be improved in T . [sent-369, score-0.488]
84 1, and then apply the algorithm to an application of discriminative structure learning using the UCI repository data in Section 5. [sent-371, score-0.112]
85 1 Application to feature selection We compared the performance and solutions by the proposed prismatic algorithm (PRISM), the supermodular-submodular procedure (SSP) [21], the greedy method and the LASSO. [sent-378, score-0.338]
86 Since the first term is a supermodular function [4] and the second is a submodular function, this problem is the D. [sent-391, score-0.395]
87 That is, we could obtain the better solutions (in the sense of prediction error) by optimizing the objective with the submodular norm more exactly. [sent-400, score-0.355]
88 11) Table 2: Empirical accuracy of the classifiers in [%] with standard deviation by the TANs discriminatively learned with PRISM or SSP and generatively learned with a submodular minimization solver. [sent-427, score-0.363]
89 Also, Table 1 shows normalized-mean prediction errors by the prismatic algorithm, the supermodularsubmodular procedure, the greedy method and the lasso for several k. [sent-431, score-0.325]
90 This result also seems to show that optimizing the objective with the submodular norm exactly is significant in the meaning of prediction errors. [sent-433, score-0.332]
91 2 Application to discriminative structure learning Our second application is discriminative structure learning using the UCI machine learning repository. [sent-435, score-0.224]
92 To this end, we used the procedure described in [20] with a submodular minimization solver (for the generative case), and the one [21] combined with our prismatic algorithm (PRISM) or the supermodular-submodular procedure (SSP) (for the discriminative case). [sent-439, score-0.706]
93 The results seem to show that optimizing the EAR measure more exactly could improve the performance of classification (which would mean that the EAR is significant as the measure of discriminative structure learning in the sense of classification). [sent-444, score-0.112]
94 6 Conclusions In this paper, we proposed a prismatic algorithm for the D. [sent-445, score-0.289]
95 programming problem (1), which is the first exact algorithm for this problem and is a branch-and-bound method responding to the structure of this problem. [sent-447, score-0.211]
96 programming problem through the continuous relaxation of solution spaces and objective functions with the help of the Lov´ sz extension. [sent-450, score-0.21]
97 We applied the proposed algorithm to several situations of feature selection a and discriminative structure learning using artificial and real-world datasets. [sent-451, score-0.138]
98 programming problem addressed in this paper covers a broad range of applications in machine learning. [sent-454, score-0.165]
99 The minimum-norm-point algorithm applied submodular function minimization and linear programming. [sent-516, score-0.332]
100 A submodular-supermodular procedure with applications to discriminative structure learning. [sent-614, score-0.112]
wordName wordTfidf (topN-words)
[('tk', 0.441), ('prism', 0.389), ('submodular', 0.302), ('pk', 0.301), ('prismatic', 0.289), ('bilp', 0.192), ('sk', 0.191), ('ti', 0.185), ('ssp', 0.135), ('programming', 0.119), ('lk', 0.119), ('prisms', 0.115), ('av', 0.101), ('rn', 0.096), ('supermodular', 0.093), ('pt', 0.091), ('discriminative', 0.085), ('delete', 0.079), ('feasible', 0.074), ('polyhedral', 0.073), ('hyperplane', 0.073), ('int', 0.063), ('subdivision', 0.062), ('intersects', 0.058), ('ear', 0.058), ('bn', 0.056), ('lov', 0.055), ('ak', 0.054), ('simplex', 0.053), ('kawahara', 0.051), ('branching', 0.051), ('submodularity', 0.049), ('pq', 0.049), ('energy', 0.047), ('binarized', 0.047), ('pj', 0.046), ('mk', 0.045), ('vertices', 0.043), ('deletion', 0.041), ('ia', 0.041), ('ei', 0.04), ('satisfying', 0.039), ('osaka', 0.038), ('subprisms', 0.038), ('tans', 0.038), ('jk', 0.038), ('exact', 0.036), ('errors', 0.036), ('proposition', 0.036), ('sz', 0.036), ('iteration', 0.035), ('constructed', 0.034), ('hepatitis', 0.034), ('nagano', 0.034), ('simplices', 0.034), ('horst', 0.034), ('xa', 0.034), ('xw', 0.034), ('bounding', 0.032), ('chess', 0.031), ('narasimhan', 0.031), ('generatively', 0.031), ('xj', 0.031), ('min', 0.031), ('rk', 0.03), ('cardinality', 0.03), ('minimization', 0.03), ('objective', 0.03), ('responding', 0.029), ('cplex', 0.029), ('jst', 0.029), ('bk', 0.029), ('outer', 0.029), ('tq', 0.028), ('combinatorial', 0.027), ('convex', 0.027), ('structure', 0.027), ('halfspace', 0.026), ('bound', 0.026), ('selection', 0.026), ('responds', 0.025), ('solution', 0.025), ('cube', 0.025), ('covers', 0.024), ('obviously', 0.024), ('rother', 0.024), ('norms', 0.023), ('cheng', 0.023), ('solutions', 0.023), ('broad', 0.022), ('si', 0.022), ('cutting', 0.022), ('operation', 0.022), ('passing', 0.022), ('operations', 0.022), ('lower', 0.022), ('pm', 0.021), ('german', 0.021), ('subproblem', 0.021), ('pages', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
2 0.30211061 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
3 0.229403 251 nips-2011-Shaping Level Sets with Submodular Functions
Author: Francis R. Bach
Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
4 0.19664165 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
Author: Andrew Guillory, Jeff A. Bilmes
Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9
5 0.19049312 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
Author: Miles Lopes, Laurent Jacob, Martin J. Wainwright
Abstract: We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T 2 statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer. 1
6 0.1575152 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
7 0.098025389 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts
8 0.080325313 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
9 0.0739474 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
10 0.07387878 219 nips-2011-Predicting response time and error rates in visual search
11 0.068341114 168 nips-2011-Maximum Margin Multi-Instance Learning
12 0.066607244 277 nips-2011-Submodular Multi-Label Learning
13 0.064968728 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
14 0.061956782 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
15 0.061808772 268 nips-2011-Speedy Q-Learning
16 0.061487526 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
17 0.061336458 59 nips-2011-Composite Multiclass Losses
18 0.056080744 227 nips-2011-Pylon Model for Semantic Segmentation
19 0.055015385 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
20 0.049827918 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
topicId topicWeight
[(0, 0.169), (1, -0.026), (2, -0.064), (3, -0.052), (4, -0.006), (5, -0.037), (6, -0.357), (7, 0.157), (8, -0.1), (9, 0.042), (10, -0.017), (11, -0.115), (12, 0.138), (13, 0.119), (14, -0.049), (15, -0.066), (16, -0.073), (17, 0.042), (18, 0.004), (19, -0.022), (20, 0.043), (21, -0.012), (22, 0.012), (23, -0.025), (24, 0.006), (25, -0.026), (26, 0.005), (27, 0.004), (28, -0.009), (29, -0.006), (30, -0.131), (31, 0.103), (32, -0.031), (33, -0.005), (34, -0.119), (35, 0.004), (36, 0.003), (37, 0.062), (38, -0.008), (39, 0.01), (40, -0.026), (41, -0.082), (42, -0.046), (43, 0.127), (44, -0.039), (45, 0.01), (46, 0.08), (47, 0.033), (48, -0.017), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93186694 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
2 0.79154605 199 nips-2011-On fast approximate submodular minimization
Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes
Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest significant speedups over minimum norm while retaining higher accuracies. 1
3 0.74276489 251 nips-2011-Shaping Level Sets with Submodular Functions
Author: Francis R. Bach
Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide unified optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting specific submodular functions, we give a new interpretation to known norms, such as the total variation; we also define new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.
4 0.62479204 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
Author: Andrew Guillory, Jeff A. Bilmes
Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they find the complete information they need. We are specifically interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a confident diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) defined over a ground set V is called submodular if it satisfies the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, influence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is defined to be the set containing the first i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order specified by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it first in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have defined it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can define F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a fixed set of candidate diseases, then this function is submodular. Specifically, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The offline approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this offline algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single fixed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem fits in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard offline problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efficient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best fixed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an offline approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an offline algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this offline algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known finite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Offline Analysis In this work we present an algorithm for online submodular set cover which extends the offline algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this offline algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Offline Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in offline analysis the offline algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no benefit but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is defined such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it finds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Specifically, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), define a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) define a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next define truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one final lemma before proving the main result of this section. For a sequence S define Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Define Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it suffices to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it suffices to show that after selecting the first k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Define a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ definition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to fix. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modification does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the first k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the offline setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Define F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to first select the two broad actions covering all common ads then select the narrow actions in any order. However, the offline cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for offline solutions as baselines. As seen in the figure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the offline cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We define an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Define the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Define F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Specifically, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Define F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netflix’s Watch Instantly service and use 803 questions based on those we used for an offline problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modification of Normal-Hedge which fixes ct = 1 for a fixed, more aggressive learning rate than theoretically justified. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed significantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modified Normal-Hedge has no theoretical justification, so it may not generalize to other problems. For the modified Normal-Hedge the final average cover times are 7.72 adaptive and 8.22 cumulative. The offline values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a fixed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9
5 0.58314151 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts
Author: Matthias Hein, Simon Setzer
Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efficient first-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1
6 0.51982021 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval
7 0.47462729 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
8 0.40327072 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
9 0.35135344 277 nips-2011-Submodular Multi-Label Learning
10 0.34424037 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo
11 0.32977495 6 nips-2011-A Global Structural EM Algorithm for a Model of Cancer Progression
12 0.3192105 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
13 0.31292492 8 nips-2011-A Model for Temporal Dependencies in Event Streams
14 0.30675048 192 nips-2011-Nonstandard Interpretations of Probabilistic Programs for Efficient Inference
15 0.30198753 40 nips-2011-Automated Refinement of Bayes Networks' Parameters based on Test Ordering Constraints
16 0.29589379 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
17 0.28609166 81 nips-2011-Efficient anomaly detection using bipartite k-NN graphs
18 0.27337688 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
19 0.27274635 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
20 0.27272016 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
topicId topicWeight
[(0, 0.035), (4, 0.066), (20, 0.047), (26, 0.047), (31, 0.058), (33, 0.025), (43, 0.053), (45, 0.117), (57, 0.026), (70, 0.289), (74, 0.033), (83, 0.038), (99, 0.062)]
simIndex simValue paperId paperTitle
1 0.82785082 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
Author: Richard Socher, Eric H. Huang, Jeffrey Pennin, Christopher D. Manning, Andrew Y. Ng
Abstract: Paraphrase detection is the task of examining two sentences and determining whether they have the same meaning. In order to obtain high accuracy on this task, thorough syntactic and semantic analysis of the two statements is needed. We introduce a method for paraphrase detection based on recursive autoencoders (RAE). Our unsupervised RAEs are based on a novel unfolding objective and learn feature vectors for phrases in syntactic trees. These features are used to measure the word- and phrase-wise similarity between two sentences. Since sentences may be of arbitrary length, the resulting matrix of similarity measures is of variable size. We introduce a novel dynamic pooling layer which computes a fixed-sized representation from the variable-sized matrices. The pooled representation is then used as input to a classifier. Our method outperforms other state-of-the-art approaches on the challenging MSRP paraphrase corpus. 1
2 0.80624807 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
Author: Joni K. Pajarinen, Jaakko Peltonen
Abstract: Applications such as robot control and wireless communication require planning under uncertainty. Partially observable Markov decision processes (POMDPs) plan policies for single agents under uncertainty and their decentralized versions (DEC-POMDPs) find a policy for multiple agents. The policy in infinite-horizon POMDP and DEC-POMDP problems has been represented as finite state controllers (FSCs). We introduce a novel class of periodic FSCs, composed of layers connected only to the previous and next layer. Our periodic FSC method finds a deterministic finite-horizon policy and converts it to an initial periodic infinitehorizon policy. This policy is optimized by a new infinite-horizon algorithm to yield deterministic periodic policies, and by a new expectation maximization algorithm to yield stochastic periodic policies. Our method yields better results than earlier planning methods and can compute larger solutions than with regular FSCs.
same-paper 3 0.71591824 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem
Author: Yoshinobu Kawahara, Takashi Washio
Abstract: In this paper, we propose the first exact algorithm for minimizing the difference of two submodular functions (D.S.), i.e., the discrete version of the D.C. programming problem. The developed algorithm is a branch-and-bound-based algorithm which responds to the structure of this problem through the relationship between submodularity and convexity. The D.S. programming problem covers a broad range of applications in machine learning. In fact, this generalizes any set-function optimization. We empirically investigate the performance of our algorithm, and illustrate the difference between exact and approximate solutions respectively obtained by the proposed and existing algorithms in feature selection and discriminative structure learning.
4 0.53664333 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
5 0.53325492 22 nips-2011-Active Ranking using Pairwise Comparisons
Author: Kevin G. Jamieson, Robert Nowak
Abstract: This paper examines the problem of ranking a collection of objects using pairwise comparisons (rankings of two objects). In general, the ranking of n objects can be identified by standard sorting methods using n log2 n pairwise comparisons. We are interested in natural situations in which relationships among the objects may allow for ranking using far fewer pairwise comparisons. Specifically, we assume that the objects can be embedded into a d-dimensional Euclidean space and that the rankings reflect their relative distances from a common reference point in Rd . We show that under this assumption the number of possible rankings grows like n2d and demonstrate an algorithm that can identify a randomly selected ranking using just slightly more than d log n adaptively selected pairwise comparisons, on average. If instead the comparisons are chosen at random, then almost all pairwise comparisons must be made in order to identify any ranking. In addition, we propose a robust, error-tolerant algorithm that only requires that the pairwise comparisons are probably correct. Experimental studies with synthetic and real datasets support the conclusions of our theoretical analysis. 1
6 0.53122139 80 nips-2011-Efficient Online Learning via Randomized Rounding
7 0.53001755 263 nips-2011-Sparse Manifold Clustering and Embedding
8 0.52824813 186 nips-2011-Noise Thresholds for Spectral Clustering
9 0.52752227 231 nips-2011-Randomized Algorithms for Comparison-based Search
10 0.5270052 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.52700317 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.52606881 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
13 0.524818 64 nips-2011-Convergent Bounds on the Euclidean Distance
14 0.52215374 303 nips-2011-Video Annotation and Tracking with Active Learning
15 0.52212191 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
16 0.51917398 202 nips-2011-On the Universality of Online Mirror Descent
17 0.51769632 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
18 0.51759684 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
19 0.5165472 127 nips-2011-Image Parsing with Stochastic Scene Grammar
20 0.51649499 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity