nips nips2011 nips2011-277 knowledge-graph by maker-knowledge-mining

277 nips-2011-Submodular Multi-Label Learning

Source: pdf

Author: James Petterson, Tibério S. Caetano

Abstract: In this paper we present an algorithm to learn a multi-label classiﬁer which attempts at directly optimising the F -score. The key novelty of our formulation is that we explicitly allow for assortative (submodular) pairwise label interactions, i.e., we can leverage the co-ocurrence of pairs of labels in order to improve the quality of prediction. Prediction in this model consists of minimising a particular submodular set function, what can be accomplished exactly and eﬃciently via graph-cuts. Learning however is substantially more involved and requires the solution of an intractable combinatorial optimisation problem. We present an approximate algorithm for this problem and prove that it is sound in the sense that it never predicts incorrect labels. We also present a nontrivial test of a suﬃcient condition for our algorithm to have found an optimal solution. We present experiments on benchmark multi-label datasets, which attest the value of the proposed technique. We also make available source code that enables the reproduction of our experiments. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 Submodular Multi-Label Learning James Petterson NICTA/ANU Canberra, Australia Tiberio Caetano NICTA/ANU Sydney/Canberra, Australia Abstract In this paper we present an algorithm to learn a multi-label classiﬁer which attempts at directly optimising the F -score. [sent-1, score-0.063]

2 The key novelty of our formulation is that we explicitly allow for assortative (submodular) pairwise label interactions, i. [sent-2, score-0.106]

3 , we can leverage the co-ocurrence of pairs of labels in order to improve the quality of prediction. [sent-4, score-0.121]

4 Prediction in this model consists of minimising a particular submodular set function, what can be accomplished exactly and eﬃciently via graph-cuts. [sent-5, score-0.159]

5 Learning however is substantially more involved and requires the solution of an intractable combinatorial optimisation problem. [sent-6, score-0.229]

6 We present an approximate algorithm for this problem and prove that it is sound in the sense that it never predicts incorrect labels. [sent-7, score-0.094]

7 We also present a nontrivial test of a suﬃcient condition for our algorithm to have found an optimal solution. [sent-8, score-0.088]

8 We also make available source code that enables the reproduction of our experiments. [sent-10, score-0.144]

9 This is due to a number of reasons, including the increase in availability of multi-modal datasets and the emergence of crowdsourcing, which naturally create settings where multiple interpretations of a given input observation are possible (multiple labels for a single instance). [sent-14, score-0.157]

10 Also many classical problems are inherently multi-label, such as the categorisation of documents [5], gene function prediction [6] and image tagging [7]. [sent-15, score-0.074]

11 The ﬁrst is that a prediction should ideally be good both in terms of precision and recall: we care not only about predicting as many of the correct labels as possible, but also as few non-correct labels as possible. [sent-17, score-0.282]

12 The second property we wish is that, both during training and also at test time, the algorithm should ideally take into account possible dependencies between the labels. [sent-19, score-0.127]

13 For example, in automatic image tagging, if labels ocean and ship have high co-occurrence frequency in the training set, the model learned should somehow boost the chances of predicting ocean if there is strong visual evidence for the label ship [9]. [sent-20, score-0.401]

14 First, we explicitly model the dependencies between pairs of labels, albeit restricting them to be submodular (in rough terms, we model only the positive pairwise label correlations). [sent-22, score-0.28]

15 This enables exact and eﬃcient prediction at test time, since ﬁnding an optimal subset of labels reduces to the minimisation of a particular kind of submodular set function which can be done eﬃciently via graph-cuts. [sent-23, score-0.438]

16 This is because we draw on the max-margin structured prediction framework from [10], which, as we will see, enables us to optimise a convex upper bound on the loss induced by the F -score. [sent-25, score-0.16]

17 The critical technical contribution of the paper is a constraint generation algorithm for loss-augmented inference where the scoring of the pair (input-output) is a submodular set function and the loss is derived from the F -score. [sent-26, score-0.375]

18 Our constraint generation algorithm is only approximate since the problem is intractable. [sent-28, score-0.184]

19 However we give theoretical arguments supporting our empirical ﬁndings that the algorithm is not only very accurate in practice, but in the majority of our real-world experiments it actually produces a solution which is exactly optimal. [sent-29, score-0.119]

20 We also provide source code that enables the reproduction of all the experiments presented in this paper. [sent-31, score-0.144]

21 A convex relaxation for F -measure optimisation in the multi-label setting was proposed recently in [11]. [sent-33, score-0.161]

22 This can be seen as a particular case of our method when there are no explicit label dependencies. [sent-34, score-0.07]

23 In [12] the authors propose quite general tree and DAGbased dependencies among the labels and adapt decoding algorithms from signal processing to the problem of ﬁnding predictions consistent with the structures learned. [sent-35, score-0.166]

24 In [13] graphical models are used to impose structure in the label dependencies. [sent-36, score-0.07]

25 Both [12] and [13] are in a sense complementary to our method since we do not enforce any particular graph topology on the labels but instead we limit the nature of the interactions to be submodular. [sent-37, score-0.17]

26 In [14] the authors study the multi-label problem under the assumption that prior knowledge on the density of label correlations is available. [sent-38, score-0.07]

27 The RAkEL algorithm [16] uses instead an ensemble of classiﬁers, each learned on a random subset of the label set. [sent-42, score-0.097]

28 In [17] the authors propose a Bayesian CCA model and apply it to multi-label problems by enforcing group sparsity regularisation in order to capture information about label co-occurrences. [sent-43, score-0.07]

29 2 The Model Let x ∈ X be a vector of dimensionality D with the features of an instance (say, an image); let y ∈ Y be a set of labels for an instance (say, tags for an image), from a ﬁxed dictionary of V possible labels, encoded as y ∈ {0, 1}V . [sent-44, score-0.121]

30 For example, y = [1 1 0 0] denotes the ﬁrst and second labels of a set of four. [sent-45, score-0.121]

31 We assume we are given a training set {(xn , y n )}N , and n=1 our task is to estimate a map f : X → Y that has good agreement with the training set but also generalises well to new data. [sent-46, score-0.062]

32 Since our goal is to maximise the F -score a suitable choice of loss function is ∆(y, y ) = 1 − F (y, y ), which is the one we adopt in this ¯ ¯ paper. [sent-54, score-0.065]

33 The loss for a single prediction is therefore ∆(y, y ) = 1 − 2 |y ⊙ y |/(|y| + |¯|) ¯ ¯ y 2. [sent-55, score-0.069]

34 The oﬀ-diagonal elements are Aij = Cij θij , where Cij 2 is the normalised counts of co-occurrence of labels i and j in the training set, and θij the corresponding scalar parameter which is forced to be non-negative. [sent-57, score-0.209]

35 This will ensure that the oﬀ-diagonal entries of A are non-negative and therefore that problem 2 consists of the maximisation of a supermodular function (or, equivalently, the minimisation of a submodular function), which can be solved eﬃciently via graph-cuts. [sent-58, score-0.24]

36 Direct optimisation of the loss deﬁned in equation 1 is a highly intractable problem since it is a discrete quantity and our parameter space is continuous. [sent-75, score-0.217]

37 Here we will follow the program in [10] and instead construct a convex upper bound on the loss function, which can then be attacked using convex optimisation tools. [sent-76, score-0.219]

38 ∗ The constraints immediately imply that the optimal solution will be such that ξn ≥ n ∗ n ∆(argmaxy φ(x , y), θ , y ), and therefore the minimum value of the objective function upper bounds the loss, thus motivating the formulation. [sent-81, score-0.165]

39 Since there are exponentially many constraints, we follow [10] in adopting a constraint generation strategy, which starts by solving the problem with no constraints and iteratively adding the most violated constraint for the current solution of the optimisation problem. [sent-82, score-0.452]

40 This is guaranteed to ﬁnd an -close approximation of the solution of (3) after including only a polynomial (O(−2 )) number of constraints [10]. [sent-83, score-0.068]

41 At each iteration we need to maximise the violation margin ξn , which from the constraints 3b reduces to ∗ yn ∈ argmax [∆(y, y n ) + φ(xn , y), θ] (4) y∈Y Learning Algorithm. [sent-84, score-0.131]

42 Algorithm 1 describes a particular convex solver based on bundle methods (BMRM [18]), which we use here. [sent-86, score-0.083]

43 Our contribution lies not here, but in the routine of constraint generation for Algorithm 1, which is described in Algorithm 2. [sent-88, score-0.157]

44 BMRM requires the solution of constraint generation and the value of the objective function for the slack corresponding to the constraint generated, as well as its gradient. [sent-89, score-0.346]

45 The most challenging step consists of solving the constraint generation problem. [sent-95, score-0.157]

46 A na¨ strategy would be to aim for solving problem 8 V times, one for each value of |y|, ıve and constraining the optimisation to only include elements y such that |y| is ﬁxed. [sent-104, score-0.186]

47 In other words, we can partition the optimisation problem into k optimisation problems conditioned on the sets Yk := {y : |y| = k}: max y T A(y)y = max max y T A[k],n y y k y∈Yk (9) where A[k],n denotes the particular matrix An that we obtain when |y| = k. [sent-105, score-0.55]

48 , the problem of maximising a supermodular function (or minimising a submodular function) subject to a cardinality constraint, is itself NP-hard [19]. [sent-108, score-0.234]

49 We therefore do not follow this strategy, but instead seek a polynomial-time algorithm that in practice will give us an optimal solution most of the time. [sent-109, score-0.132]

50 1 The algorithm essentially searches for the largest k such that solving argmaxy y T A[k],n y returns a solution with k 1s. [sent-112, score-0.246]

51 We call the k obtained kmax , and ∗n the corresponding solution ykmax . [sent-113, score-1.143]

52 Although this algorithm is not provably optimal, Theorem 1 guarantees that it is sound in the sense that it never predicts incorrect labels. [sent-116, score-0.119]

53 4 next section we present additional evidence supporting this algorithm, in the form of a test that if positive guarantees the solution obtained is optimal. [sent-118, score-0.141]

54 We call a solution y a partially optimal solution of argmaxy y T An (y)y if the labels it predicts as being present are indeed present in an optimal solution, i. [sent-119, score-0.523]

55 , if for those i for which y (i) = 1 we also have y ∗n (i) = 1, for some y ∗n ∈ argmaxy y T An (y)y. [sent-121, score-0.151]

56 We have the following result ∗n Theorem 1 Upon completion of Algorithm 2, ykmax is a partially optimal solution of argmaxy y T An (y)y. [sent-123, score-0.811]

57 The theorem means that whenever the algorithm predicts the presence of a label, it does so correctly; however there may be labels not predicted which are in fact present in the corresponding optimal solution. [sent-125, score-0.226]

58 4 Certiﬁcate of Optimality As empirically veriﬁed in our experiments in section 5, our constraint generation algorithm (Algorithm 2) is indeed quite accurate: most of the time the solution obtained is optimal. [sent-126, score-0.252]

59 In this section we present a test that if positive guarantees that an optimal solution has been obtained (i. [sent-127, score-0.154]

60 This can be used to generate empirical lower bounds on the probability that the algorithm returns an optimal solution (we explore this possibility in the experimental section). [sent-130, score-0.132]

61 Let Z := {i : ∗n ∗n ykmax (i) = 0}, and PZ be the power set of Z (Z for ‘zeros’). [sent-132, score-0.555]

62 Term (c) ∗n is negative or zero, since each term in the sum is negative or zero (otherwise ykmax would have included it). [sent-139, score-0.555]

63 #train/#test denotes the number of observations used for training and testing respectively; V is the number of labels and D the dimensionality of the features; Avg is the average number of labels per instance. [sent-144, score-0.273]

64 dataset domain #train #test V D Avg yeast biology 1500 917 14 103 4. [sent-145, score-0.088]

65 37 ∗n We know that, regardless of A or α, βA,α ≤ 0 (otherwise ykmax ∈ argmaxy y T A[kmax ],n y, / T [kmax ],n since βA,α is the increment in the objective function y A y obtained by adding 1s in the entries of α). [sent-147, score-0.81]

66 This is because ∗n kmax +|α| vα + kmax +|α| |y n ⊙ ykmax | ≤ kmax +|α| |y n | ≤ Vmax |y n | ≤ V |y n | = 0 kmax kmax kmax k n = 2V |y n |/((V + |y n |)|y n |) ≤ 2 (15) (Note that if |y | = 0 then γα = 0 and our algorithm will always return an optimal solution since βA,α ≤ 0). [sent-149, score-2.247]

67 The algorithm eﬀectively computes the gap between the scores of the ∗n optimal solution ykmax and the highest scoring solution if one sets to 1 at least one of the ∗n zero entries in ykmax . [sent-159, score-1.368]

68 It does so by solving graph-cuts constraining the solution y to include ∗n ∗n the 1s present in ykmax but additionally ﬁxing one of the zero entries of ykmax to 1 (lines ∗n 7-8). [sent-160, score-1.203]

69 This is done for every possible zero entry of ykmax , and the maximum score is recorded ∗n (lines 7-11). [sent-161, score-0.587]

70 The gap between this and the score of the optimal solution ykmax is then returned (line 13). [sent-162, score-0.692]

71 This will involve V − kmax calls to graph-cuts, and therefore the total computational complexity is O(V 4 ). [sent-163, score-0.52]

72 Once we compute maxα βA,α , we simply test wether maxα βA,α + |V | |y n | > 0 holds (we use |V | |y n | rather than 2 as an upper bound for γα kmax kmax because, as seen from (15), it is the tightest upper bound which still does not depend on α and therefore can be computed). [sent-164, score-1.122]

73 We have the following theorem (proven in Appendix A) ∗n Theorem 2 Upon completion of Algorithm 3, if maxα βA,α + |V | |y n | ≤ 0, then ykmax is kmax T n an optimal solution of argmaxy y A (y)y. [sent-165, score-1.331]

74 For the sake of reproducibility we focused in publicly available datasets, and to ensure that the label dependencies have a reasonable impact in the results we restricted the experiments to datasets with a suﬃciently large average number of labels per instance. [sent-168, score-0.306]

75 We 6 Figure 1: F -Score results on enron (left) and yeast (right), for diﬀerent amounts of unary features. [sent-169, score-0.29]

76 chose therefore two multilabel datasets from mulan:2 yeast and enron. [sent-171, score-0.208]

77 The datasets used have very informative unary features, so to better visualise the contribution of the label dependencies to the model we trained using varying amounts (1%, 10% and 100%) of the original unary features. [sent-174, score-0.401]

78 We compared our proposed method to RML[11] without reversion3 , which is essentially our model without the quadratic term, and to other state-of-the-art methods for which source code is publicly available – BR [15], RAkEL[16] and MLKNN[20]. [sent-175, score-0.097]

79 In our experiments we used 50% of the pairs with yeast and 5% with enron (45 and 68 pairs, respectively). [sent-180, score-0.177]

80 Our implementation is in C++, based on the source code of RML[11], which uses the Bundle Methods for Risk Minimization (BMRM) of [18]. [sent-189, score-0.063]

81 In Figure 1 we plot the F -Score for varying-sized subsets of the unary features, for both enron (left) and yeast (right). [sent-195, score-0.29]

82 The goal is to assess the beneﬁts of explicitly modelling the pairwise label interactions, particularly when the unary information is deteriorated. [sent-196, score-0.219]

83 In this setting the unary features are very informative and the pairwise interactions are not helpful. [sent-198, score-0.222]

84 As we reduce the number of available unary features (from right to left in the plots), the importance of the pairwise interactions increase, and our model demonstrates improvement over RML. [sent-199, score-0.198]

85 3 7 Figure 2: Empirical analysis of Algorithms 2 and 3 during training with the yeast dataset. [sent-209, score-0.119]

86 Left: frequency with which Algorithm 2 is optimal at each iteration (blue) and frequency with which Algorithm 3 reports an optimal solution has been found by Algorithm 2 (green). [sent-210, score-0.208]

87 Right: diﬀerence, at each iteration, between the objective computed using the results from Algorithm 2 and exhaustive enumeration. [sent-211, score-0.062]

88 To evaluate how well our constraint generation algorithm performs in practice we compared its results against those of exhaustive search, which is exact but only feasible for a dataset with a small number of labels, such as yeast. [sent-213, score-0.215]

89 In Figure 2-left we plot, for the ﬁrst 100 iterations of the learning algorithm, the frequency with which Algorithm 2 returns the exact solution (blue line) as well as the frequency with which the test given in Algorithm 3 guarantees the solution is exact (green line). [sent-215, score-0.251]

90 We can see that overall in more than 50% of its executions Algorithm 2 produces an optimal solution. [sent-216, score-0.076]

91 Our test eﬀectively oﬀers a lower bound which is as expected is not tight, however it is informative in the sense that its variations reﬂect legitimate variations in the real quantity of interest (as can be seen by the obvious correlation between the two curves). [sent-217, score-0.075]

92 For the learning algorithm, however, what we are interested in is the objective oi and the gradient gi of line 7 of Algorithm 1, and both depend only on the compound result of N executions of Algorithm 2 at each iteration of the learning algorithm. [sent-218, score-0.1]

93 This is illustrated in Figure 2-right, where we plot, for each iteration, the normalised diﬀerence between the objective computed with results from Algorithm 2 and the one computed with the results of an exact exhaustive search5 . [sent-219, score-0.094]

94 6 Conclusion We presented a method for learning multi-label classiﬁers which explicitly models label dependencies in a submodular fashion. [sent-221, score-0.244]

95 As an estimator we use structured support vector machines solved with constraint generation. [sent-222, score-0.091]

96 Our key contribution is an algorithm for constraint generation which is proven to be partially optimal in the sense that all labels it predicts are included in some optimal solution. [sent-223, score-0.42]

97 We also describe an eﬃciently computable test that if positive guarantees the solution found is optimal, and can be used to generate empirical lower bounds on the probability of ﬁnding an optimal solution. [sent-224, score-0.154]

98 Shawe-Taylor, “Kernel-based learning of hierarchical multilabel classiﬁcation models,” JMLR, vol. [sent-250, score-0.084]

99 Csurka, “Learning structured prediction models for interactive image labeling,” in CVPR, 2011. [sent-271, score-0.065]

100 Vlahavas, “Random k-labelsets: An ensemble method for multilabel classiﬁcation,” in ECML, 2007. [sent-303, score-0.084]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ykmax', 0.555), ('kmax', 0.52), ('aii', 0.215), ('optimisation', 0.161), ('argmaxy', 0.151), ('submodular', 0.129), ('labels', 0.121), ('unary', 0.113), ('bmrm', 0.111), ('pz', 0.098), ('generation', 0.091), ('enron', 0.089), ('rml', 0.089), ('yeast', 0.088), ('multilabel', 0.084), ('yk', 0.077), ('max', 0.076), ('yn', 0.07), ('label', 0.07), ('solution', 0.068), ('rakel', 0.067), ('constraint', 0.066), ('xn', 0.061), ('ectively', 0.059), ('bundle', 0.057), ('cij', 0.057), ('minimisation', 0.05), ('interactions', 0.049), ('diag', 0.048), ('increment', 0.048), ('cca', 0.048), ('ay', 0.048), ('aij', 0.048), ('erence', 0.046), ('dependencies', 0.045), ('mensink', 0.044), ('mlknn', 0.044), ('reproduction', 0.044), ('vlahavas', 0.044), ('rmax', 0.041), ('predicts', 0.041), ('prediction', 0.04), ('executions', 0.039), ('maximising', 0.039), ('tsoumakas', 0.039), ('ocean', 0.039), ('classi', 0.038), ('ij', 0.037), ('optimal', 0.037), ('enables', 0.037), ('pairwise', 0.036), ('datasets', 0.036), ('maximise', 0.036), ('optimising', 0.036), ('australia', 0.036), ('cate', 0.036), ('supermodular', 0.036), ('tagging', 0.034), ('petterson', 0.034), ('ship', 0.034), ('publicly', 0.034), ('frequency', 0.033), ('scoring', 0.033), ('source', 0.033), ('score', 0.032), ('di', 0.032), ('appendix', 0.032), ('normalised', 0.032), ('vishwanathan', 0.032), ('verbeek', 0.032), ('caetano', 0.032), ('certi', 0.032), ('su', 0.032), ('library', 0.031), ('exhaustive', 0.031), ('training', 0.031), ('objective', 0.031), ('oi', 0.03), ('avg', 0.03), ('minimising', 0.03), ('code', 0.03), ('maxy', 0.029), ('teo', 0.029), ('upper', 0.029), ('loss', 0.029), ('vec', 0.027), ('quantity', 0.027), ('algorithm', 0.027), ('australian', 0.027), ('describes', 0.026), ('sound', 0.026), ('elements', 0.025), ('structured', 0.025), ('argmax', 0.025), ('entries', 0.025), ('guarantees', 0.025), ('supporting', 0.024), ('slack', 0.024), ('informative', 0.024), ('test', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 277 nips-2011-Submodular Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

2 0.11323053 199 nips-2011-On fast approximate submodular minimization

Author: Stefanie Jegelka, Hui Lin, Jeff A. Bilmes

Abstract: We are motivated by an application to extract a representative subset of machine learning training data and by the poor empirical performance we observe of the popular minimum norm algorithm. In fact, for our application, minimum norm can have a running time of about O(n7 ) (O(n5 ) oracle calls). We therefore propose a fast approximate method to minimize arbitrary submodular functions. For a large sub-class of submodular functions, the algorithm is exact. Other submodular functions are iteratively approximated by tight submodular upper bounds, and then repeatedly optimized. We show theoretical properties, and empirical results suggest signiﬁcant speedups over minimum norm while retaining higher accuracies. 1

3 0.093408614 251 nips-2011-Shaping Level Sets with Submodular Functions

Author: Francis R. Bach

Abstract: We consider a class of sparsity-inducing regularization terms based on submodular functions. While previous work has focused on non-decreasing functions, we explore symmetric submodular functions and their Lov´ sz extensions. We show that the Lov´ sz a a extension may be seen as the convex envelope of a function that depends on level sets (i.e., the set of indices whose corresponding components of the underlying predictor are greater than a given constant): this leads to a class of convex structured regularization terms that impose prior knowledge on the level sets, and not only on the supports of the underlying predictors. We provide uniﬁed optimization algorithms, such as proximal operators, and theoretical guarantees (allowed level sets and recovery conditions). By selecting speciﬁc submodular functions, we give a new interpretation to known norms, such as the total variation; we also deﬁne new norms, in particular ones that are based on order statistics with application to clustering and outlier detection, and on noisy cuts in graphs with application to change point detection in the presence of outliers.

4 0.089468382 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

Author: Christoph H. Lampert

Abstract: We study multi-label prediction for structured output sets, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multilabel classiﬁcation techniques are typically not applicable in this situation, because they require explicit enumeration of the label set, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneﬁcial properties with single-label maximum-margin approaches, in particular formulation as a convex optimization problem, efﬁcient working set training, and PAC-Bayesian generalization bounds. 1

5 0.089252539 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they ﬁnd the complete information they need. We are speciﬁcally interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a conﬁdent diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) deﬁned over a ground set V is called submodular if it satisﬁes the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, inﬂuence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is deﬁned to be the set containing the ﬁrst i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order speciﬁed by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it ﬁrst in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have deﬁned it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can deﬁne F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a ﬁxed set of candidate diseases, then this function is submodular. Speciﬁcally, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The ofﬂine approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this ofﬂine algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single ﬁxed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem ﬁts in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard ofﬂine problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efﬁcient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best ﬁxed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an ofﬂine approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an ofﬂine algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this ofﬂine algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known ﬁnite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Ofﬂine Analysis In this work we present an algorithm for online submodular set cover which extends the ofﬂine algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this ofﬂine algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Ofﬂine Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in ofﬂine analysis the ofﬂine algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no beneﬁt but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is deﬁned such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it ﬁnds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Speciﬁcally, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), deﬁne a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) deﬁne a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next deﬁne truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one ﬁnal lemma before proving the main result of this section. For a sequence S deﬁne Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Deﬁne Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it sufﬁces to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it sufﬁces to show that after selecting the ﬁrst k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Deﬁne a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ deﬁnition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to ﬁx. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modiﬁcation does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the ﬁrst k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the ofﬂine setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Deﬁne F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to ﬁrst select the two broad actions covering all common ads then select the narrow actions in any order. However, the ofﬂine cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for ofﬂine solutions as baselines. As seen in the ﬁgure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the ofﬂine cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We deﬁne an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Deﬁne the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Deﬁne F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Speciﬁcally, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Deﬁne F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netﬂix’s Watch Instantly service and use 803 questions based on those we used for an ofﬂine problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modiﬁcation of Normal-Hedge which ﬁxes ct = 1 for a ﬁxed, more aggressive learning rate than theoretically justiﬁed. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed signiﬁcantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modiﬁed Normal-Hedge has no theoretical justiﬁcation, so it may not generalize to other problems. For the modiﬁed Normal-Hedge the ﬁnal average cover times are 7.72 adaptive and 8.22 cumulative. The ofﬂine values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a ﬁxed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of inﬂuence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efﬁcient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

6 0.08111842 258 nips-2011-Sparse Bayesian Multi-Task Learning

7 0.076589748 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

8 0.066607244 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

9 0.064028807 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

10 0.062541142 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

11 0.058172632 227 nips-2011-Pylon Model for Semantic Segmentation

12 0.057705563 33 nips-2011-An Exact Algorithm for F-Measure Maximization

13 0.057651427 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

14 0.057356678 137 nips-2011-Iterative Learning for Reliable Crowdsourcing Systems

15 0.056071948 165 nips-2011-Matrix Completion for Multi-label Image Classification

16 0.050751679 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

17 0.050579902 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

18 0.047703575 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

19 0.045715012 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

20 0.044493001 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.162), (1, 0.015), (2, -0.065), (3, -0.006), (4, -0.013), (5, -0.028), (6, -0.123), (7, 0.039), (8, -0.057), (9, 0.054), (10, -0.025), (11, -0.036), (12, 0.056), (13, 0.008), (14, -0.038), (15, -0.035), (16, -0.077), (17, 0.004), (18, 0.065), (19, -0.054), (20, -0.002), (21, -0.008), (22, 0.058), (23, -0.051), (24, 0.013), (25, -0.002), (26, -0.021), (27, -0.001), (28, -0.084), (29, -0.068), (30, 0.053), (31, -0.102), (32, 0.023), (33, -0.048), (34, 0.01), (35, -0.058), (36, -0.052), (37, 0.013), (38, -0.013), (39, -0.056), (40, -0.03), (41, 0.036), (42, 0.06), (43, -0.045), (44, 0.022), (45, 0.019), (46, -0.071), (47, 0.002), (48, 0.026), (49, -0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88085061 277 nips-2011-Submodular Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

2 0.62682128 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

Author: Christoph H. Lampert

3 0.6057446 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation

Author: Nico Goernitz, Christian Widmer, Georg Zeller, Andre Kahles, Gunnar Rätsch, Sören Sonnenburg

Abstract: We present a novel regularization-based Multitask Learning (MTL) formulation for Structured Output (SO) prediction for the case of hierarchical task relations. Structured output prediction often leads to difﬁcult inference problems and hence requires large amounts of training data to obtain accurate models. We propose to use MTL to exploit additional information from related learning tasks by means of hierarchical regularization. Training SO models on the combined set of examples from multiple tasks can easily become infeasible for real world applications. To be able to solve the optimization problems underlying multitask structured output learning, we propose an efﬁcient algorithm based on bundle-methods. We demonstrate the performance of our approach in applications from the domain of computational biology addressing the key problem of gene ﬁnding. We show that 1) our proposed solver achieves much faster convergence than previous methods and 2) that the Hierarchical SO-MTL approach outperforms considered non-MTL methods. 1

4 0.59923077 33 nips-2011-An Exact Algorithm for F-Measure Maximization

Author: Krzysztof J. Dembczynski, Willem Waegeman, Weiwei Cheng, Eyke Hüllermeier

Abstract: The F-measure, originally introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classiﬁcation, multi-label classiﬁcation, and structured output prediction. Optimizing this measure remains a statistically and computationally challenging problem, since no closed-form maximizer exists. Current algorithms are approximate and typically rely on additional assumptions regarding the statistical distribution of the binary response variables. In this paper, we present an algorithm which is not only computationally efﬁcient but also exact, regardless of the underlying distribution. The algorithm requires only a quadratic number of parameters of the joint distribution (with respect to the number of binary responses). We illustrate its practical performance by means of experimental results for multi-label classiﬁcation. 1

5 0.59710872 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

6 0.5767417 181 nips-2011-Multiple Instance Learning on Structured Data

7 0.5706566 199 nips-2011-On fast approximate submodular minimization

8 0.56455249 251 nips-2011-Shaping Level Sets with Submodular Functions

9 0.55080152 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

10 0.53235948 160 nips-2011-Linear Submodular Bandits and their Application to Diversified Retrieval

11 0.51589072 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

12 0.51482671 42 nips-2011-Bayesian Bias Mitigation for Crowdsourcing

13 0.5116412 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

14 0.49754447 240 nips-2011-Robust Multi-Class Gaussian Process Classification

15 0.49595988 19 nips-2011-Active Classification based on Value of Classifier

16 0.493994 258 nips-2011-Sparse Bayesian Multi-Task Learning

17 0.4877882 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

18 0.48660028 222 nips-2011-Prismatic Algorithm for Discrete D.C. Programming Problem

19 0.48024496 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

20 0.47863796 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.047), (10, 0.032), (13, 0.261), (20, 0.023), (26, 0.037), (31, 0.068), (33, 0.042), (43, 0.056), (45, 0.111), (57, 0.036), (65, 0.014), (74, 0.056), (83, 0.038), (89, 0.01), (99, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72003877 277 nips-2011-Submodular Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

2 0.63778251 239 nips-2011-Robust Lasso with missing and grossly corrupted observations

Author: Nasser M. Nasrabadi, Trac D. Tran, Nam Nguyen

Abstract: This paper studies the problem of accurately recovering a sparse vector β from highly corrupted linear measurements y = Xβ + e + w where e is a sparse error vector whose nonzero entries may be unbounded and w is a bounded noise. We propose a so-called extended Lasso optimization which takes into consideration sparse prior information of both β and e . Our ﬁrst result shows that the extended Lasso can faithfully recover both the regression and the corruption vectors. Our analysis is relied on a notion of extended restricted eigenvalue for the design matrix X. Our second set of results applies to a general class of Gaussian design matrix X with i.i.d rows N (0, Σ), for which we provide a surprising phenomenon: the extended Lasso can recover exact signed supports of both β and e from only Ω(k log p log n) observations, even the fraction of corruption is arbitrarily close to one. Our analysis also shows that this amount of observations required to achieve exact signed support is optimal. 1

3 0.5969556 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

Author: Jun-ichiro Hirayama, Aapo Hyvärinen

Abstract: Components estimated by independent component analysis and related methods are typically not independent in real data. A very common form of nonlinear dependency between the components is correlations in their variances or energies. Here, we propose a principled probabilistic model to model the energycorrelations between the latent variables. Our two-stage model includes a linear mixing of latent signals into the observed ones like in ICA. The main new feature is a model of the energy-correlations based on the structural equation model (SEM), in particular, a Linear Non-Gaussian SEM. The SEM is closely related to divisive normalization which effectively reduces energy correlation. Our new twostage model enables estimation of both the linear mixing and the interactions related to energy-correlations, without resorting to approximations of the likelihood function or other non-principled approaches. We demonstrate the applicability of our method with synthetic dataset, natural images and brain signals. 1

4 0.55941445 186 nips-2011-Noise Thresholds for Spectral Clustering

Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh

Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1

5 0.55430061 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random ﬁeld representations of the image, it gives globally optimal results and there are efﬁcient polynomial time implementations. Often, the random ﬁeld is applied over a ﬂat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a ﬂat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding beneﬁts of global optimality and efﬁciency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a ﬂat partitioning) and demonstrate that the optimization in the pylon model is able to ﬂexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

6 0.55365509 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

7 0.55165809 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

8 0.55123305 258 nips-2011-Sparse Bayesian Multi-Task Learning

9 0.5510112 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

10 0.55078608 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning

11 0.55040932 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

12 0.54876709 180 nips-2011-Multiple Instance Filtering

13 0.54835284 263 nips-2011-Sparse Manifold Clustering and Embedding

14 0.54829037 276 nips-2011-Structured sparse coding via lateral inhibition

15 0.5481838 271 nips-2011-Statistical Tests for Optimization Efficiency

16 0.54799122 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

17 0.54792666 231 nips-2011-Randomized Algorithms for Comparison-based Search

18 0.54631913 303 nips-2011-Video Annotation and Tracking with Active Learning

19 0.54571694 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

20 0.54513186 66 nips-2011-Crowdclustering