nips nips2011 nips2011-76 knowledge-graph by maker-knowledge-mining

76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Source: pdf

Author: Philipp Krähenbühl, Vladlen Koltun

Abstract: Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random ﬁelds deﬁned over pixels or image regions. While regionlevel models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. In this paper, we consider fully connected CRF models deﬁned on the complete set of pixels in an image. The resulting graphs have billions of edges, making traditional inference algorithms impractical. Our main contribution is a highly efﬁcient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are deﬁned by a linear combination of Gaussian kernels. Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Most state-of-the-art techniques for multi-class image segmentation and labeling use conditional random ﬁelds deﬁned over pixels or image regions. [sent-5, score-0.876]

2 While regionlevel models often feature dense pairwise connectivity, pixel-level models are considerably larger and have only permitted sparse graph structures. [sent-6, score-0.303]

3 In this paper, we consider fully connected CRF models deﬁned on the complete set of pixels in an image. [sent-7, score-0.78]

4 The resulting graphs have billions of edges, making traditional inference algorithms impractical. [sent-8, score-0.204]

5 Our main contribution is a highly efﬁcient approximate inference algorithm for fully connected CRF models in which the pairwise edge potentials are deﬁned by a linear combination of Gaussian kernels. [sent-9, score-1.157]

6 Our experiments demonstrate that dense connectivity at the pixel level substantially improves segmentation and labeling accuracy. [sent-10, score-0.642]

7 1 Introduction Multi-class image segmentation and labeling is one of the most challenging and actively studied problems in computer vision. [sent-11, score-0.574]

8 The goal is to label every pixel in the image with one of several predetermined object categories, thus concurrently performing recognition and segmentation of multiple object classes. [sent-12, score-0.764]

9 A common approach is to pose this problem as maximum a posteriori (MAP) inference in a conditional random ﬁeld (CRF) deﬁned over pixels or image patches [8, 12, 18, 19, 9]. [sent-13, score-0.575]

10 The CRF potentials incorporate smoothness terms that maximize label agreement between similar pixels, and can integrate more elaborate terms that model contextual relationships between object classes. [sent-14, score-0.448]

11 Basic CRF models are composed of unary potentials on individual pixels or image patches and pairwise potentials on neighboring pixels or patches [19, 23, 7, 5]. [sent-15, score-1.323]

12 The resulting adjacency CRF structure is limited in its ability to model long-range connections within the image and generally results in excessive smoothing of object boundaries. [sent-16, score-0.313]

13 In order to improve segmentation and labeling accuracy, researchers have expanded the basic CRF framework to incorporate hierarchical connectivity and higher-order potentials deﬁned on image regions [8, 12, 9, 13]. [sent-17, score-0.952]

14 However, the accuracy of these approaches is necessarily restricted by the accuracy of unsupervised image segmentation, which is used to compute the regions on which the model operates. [sent-18, score-0.303]

15 This limits the ability of region-based approaches to produce accurate label assignments around complex object boundaries, although signiﬁcant progress has been made [9, 13, 14]. [sent-19, score-0.196]

16 In this paper, we explore a different model structure for accurate semantic segmentation and labeling. [sent-20, score-0.277]

17 We use a fully connected CRF that establishes pairwise potentials on all pairs of pixels in the image. [sent-21, score-1.108]

18 Fully connected CRFs have been used for semantic image labeling in the past [18, 22, 6, 17], but the complexity of inference in fully connected models has restricted their application to sets of hundreds of image regions or fewer. [sent-22, score-1.645]

19 The segmentation accuracy achieved by these approaches is again limited by the unsupervised segmentation that produces the regions. [sent-23, score-0.571]

20 In contrast, our model connects all 1 sky tree grass tree bench road (a) Image (b) Unary classiﬁers (c) Robust P n CRF grass (d) Fully connected CRF, (e) Fully connected CRF, MCMC inference, 36 hrs our approach, 0. [sent-24, score-1.073]

21 2 seconds Figure 1: Pixel-level classiﬁcation with a fully connected CRF. [sent-25, score-0.639]

22 (b) The response of unary classiﬁers used by our models. [sent-27, score-0.151]

23 (c) Classiﬁcation produced by the Robust P n CRF [9]. [sent-28, score-0.051]

24 (d) Classiﬁcation produced by MCMC inference [17] in a fully connected pixel-level CRF model; the algorithm was run for 36 hours and only partially converged for the bottom image. [sent-29, score-0.935]

25 (e) Classiﬁcation produced by our inference algorithm in the fully connected model in 0. [sent-30, score-0.761]

26 pairs of individual pixels in the image, enabling greatly reﬁned segmentation and labeling. [sent-32, score-0.428]

27 The main challenge is the size of the model, which has tens of thousands of nodes and billions of edges even on low-resolution images. [sent-33, score-0.17]

28 Our main contribution is a highly efﬁcient inference algorithm for fully connected CRF models in which the pairwise edge potentials are deﬁned by a linear combination of Gaussian kernels in an arbitrary feature space. [sent-34, score-1.132]

29 This approximation is iteratively optimized through a series of message passing steps, each of which updates a single variable by aggregating information from all other variables. [sent-36, score-0.132]

30 We show that a mean ﬁeld update of all variables in a fully connected CRF can be performed using Gaussian ﬁltering in feature space. [sent-37, score-0.596]

31 This allows us to reduce the computational complexity of message passing from quadratic to linear in the number of variables by employing efﬁcient approximate high-dimensional ﬁltering [16, 2, 1]. [sent-38, score-0.144]

32 The resulting approximate inference algorithm is sublinear in the number of edges in the model. [sent-39, score-0.219]

33 Figure 1 demonstrates the beneﬁts of the presented algorithm on two images from the MSRC-21 dataset for multi-class image segmentation and labeling. [sent-40, score-0.417]

34 Figure 1(d) shows the results of approximate MCMC inference in fully connected CRFs on these images [17]. [sent-41, score-0.771]

35 The MCMC procedure was run for 36 hours and only partially converged for the bottom image. [sent-42, score-0.174]

36 We have also experimented with graph cut inference in the fully connected models [11], but it did not converge within 72 hours. [sent-43, score-0.838]

37 In contrast, a single-threaded implementation of our algorithm produces a detailed pixel-level labeling in 0. [sent-44, score-0.19]

38 To the best of our knowledge, we are the ﬁrst to demonstrate efﬁcient inference in fully connected CRF models at the pixel level. [sent-47, score-0.847]

39 In our setting, I ranges over possible input images of size N and X ranges over possible pixel-level image labelings. [sent-60, score-0.29]

40 Ij is the color vector of pixel j and Xj is the label assigned to pixel j. [sent-61, score-0.274]

41 A conditional random ﬁeld (I, X) is characterized by a Gibbs distribution 1 P (X|I) = Z(I) exp(− c∈CG φc (Xc |I)), where G = (V, E) is a graph on X and each clique c 2 in a set of cliques CG in G induces a potential φc [15]. [sent-62, score-0.141]

42 The Gibbs energy of a labeling x ∈ LN is E(x|I) = c∈CG φc (xc |I). [sent-63, score-0.197]

43 The maximum a posteriori (MAP) labeling of the random ﬁeld is x∗ = arg maxx∈LN P (x|I). [sent-64, score-0.217]

44 For notational convenience we will omit the conditioning in the rest of the paper and use ψc (xc ) to denote φc (xc |I). [sent-65, score-0.107]

45 In the fully connected pairwise CRF model, G is the complete graph on X and CG is the set of all unary and pairwise cliques. [sent-66, score-1.016]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('crf', 0.568), ('connected', 0.337), ('fully', 0.259), ('segmentation', 0.235), ('potentials', 0.214), ('xc', 0.163), ('labeling', 0.161), ('pixels', 0.159), ('cg', 0.156), ('unary', 0.151), ('image', 0.146), ('inference', 0.114), ('pairwise', 0.113), ('pixel', 0.112), ('vladlen', 0.104), ('mcmc', 0.102), ('eld', 0.094), ('crfs', 0.094), ('billions', 0.09), ('connectivity', 0.085), ('grass', 0.075), ('patches', 0.071), ('object', 0.066), ('gibbs', 0.06), ('converged', 0.058), ('posteriori', 0.056), ('hours', 0.055), ('ranges', 0.054), ('bench', 0.052), ('koltun', 0.052), ('produced', 0.051), ('label', 0.05), ('message', 0.049), ('dense', 0.049), ('edges', 0.048), ('stanford', 0.048), ('permitted', 0.048), ('philipp', 0.048), ('concurrently', 0.048), ('hrs', 0.048), ('regions', 0.048), ('passing', 0.045), ('edge', 0.043), ('graph', 0.043), ('seconds', 0.043), ('semantic', 0.042), ('predetermined', 0.041), ('classi', 0.041), ('ltering', 0.038), ('sky', 0.038), ('aggregating', 0.038), ('excessive', 0.038), ('adjacency', 0.038), ('kr', 0.038), ('accuracy', 0.037), ('lk', 0.036), ('images', 0.036), ('energy', 0.036), ('ln', 0.036), ('maxx', 0.035), ('cliques', 0.035), ('voc', 0.035), ('unsupervised', 0.035), ('experimented', 0.034), ('clique', 0.034), ('enabling', 0.034), ('incorporate', 0.032), ('actively', 0.032), ('sublinear', 0.032), ('elaborate', 0.032), ('tens', 0.032), ('ers', 0.032), ('road', 0.031), ('expanded', 0.031), ('bottom', 0.031), ('partially', 0.03), ('hundreds', 0.03), ('assignments', 0.03), ('notational', 0.029), ('conditional', 0.029), ('produces', 0.029), ('contextual', 0.028), ('connects', 0.028), ('pascal', 0.028), ('xj', 0.027), ('conditioning', 0.027), ('contribution', 0.027), ('map', 0.027), ('establishes', 0.026), ('boundaries', 0.026), ('tree', 0.026), ('integrate', 0.026), ('convenience', 0.026), ('cut', 0.026), ('omit', 0.025), ('ability', 0.025), ('progress', 0.025), ('employing', 0.025), ('models', 0.025), ('approximate', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Author: Philipp Krähenbühl, Vladlen Koltun

2 0.28674757 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

Abstract: Graph cut optimization is one of the standard workhorses of image segmentation since for binary random ﬁeld representations of the image, it gives globally optimal results and there are efﬁcient polynomial time implementations. Often, the random ﬁeld is applied over a ﬂat partitioning of the image into non-intersecting elements, such as pixels or super-pixels. In the paper we show that if, instead of a ﬂat partitioning, the image is represented by a hierarchical segmentation tree, then the resulting energy combining unary and boundary terms can still be optimized using graph cut (with all the corresponding beneﬁts of global optimality and efﬁciency). As a result of such inference, the image gets partitioned into a set of segments that may come from different layers of the tree. We apply this formulation, which we call the pylon model, to the task of semantic segmentation where the goal is to separate an image into areas belonging to different semantic classes. The experiments highlight the advantage of inference on a segmentation tree (over a ﬂat partitioning) and demonstrate that the optimization in the pylon model is able to ﬂexibly choose the level of segmentation across the image. Overall, the proposed system has superior segmentation accuracy on several datasets (Graz-02, Stanford background) compared to previously suggested approaches. 1

3 0.19395354 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of ﬁgure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as ﬁrst sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect conﬁgurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.

4 0.16682515 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo

Abstract: For many of the state-of-the-art computer vision algorithms, image segmentation is an important preprocessing step. As such, several image segmentation algorithms have been proposed, however, with certain reservation due to high computational load and many hand-tuning parameters. Correlation clustering, a graphpartitioning algorithm often used in natural language processing and document clustering, has the potential to perform better than previously proposed image segmentation algorithms. We improve the basic correlation clustering formulation by taking into account higher-order cluster relationships. This improves clustering in the presence of local boundary ambiguities. We ﬁrst apply the pairwise correlation clustering to image segmentation over a pairwise superpixel graph and then develop higher-order correlation clustering over a hypergraph that considers higher-order relations among superpixels. Fast inference is possible by linear programming relaxation, and also effective parameter learning framework by structured support vector machine is possible. Experimental results on various datasets show that the proposed higher-order correlation clustering outperforms other state-of-the-art image segmentation algorithms.

5 0.11536558 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classiﬁers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by deﬁning a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

6 0.097108103 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

7 0.096331529 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

8 0.086741157 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

9 0.086739026 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning

10 0.075969502 229 nips-2011-Query-Aware MCMC

11 0.074461862 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

12 0.073073849 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

13 0.07220605 276 nips-2011-Structured sparse coding via lateral inhibition

14 0.070567355 244 nips-2011-Selecting Receptive Fields in Deep Networks

15 0.06990312 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

16 0.066833675 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

17 0.062628612 261 nips-2011-Sparse Filtering

18 0.059452094 165 nips-2011-Matrix Completion for Multi-label Image Classification

19 0.059318572 275 nips-2011-Structured Learning for Cell Tracking

20 0.056802325 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.168), (1, 0.127), (2, -0.095), (3, 0.159), (4, 0.064), (5, -0.015), (6, -0.092), (7, -0.012), (8, 0.105), (9, -0.011), (10, 0.059), (11, -0.01), (12, 0.044), (13, -0.106), (14, -0.179), (15, -0.005), (16, 0.092), (17, -0.187), (18, -0.104), (19, -0.023), (20, -0.092), (21, 0.003), (22, 0.089), (23, -0.095), (24, -0.067), (25, 0.027), (26, 0.047), (27, -0.141), (28, -0.042), (29, -0.05), (30, -0.025), (31, -0.144), (32, -0.025), (33, -0.086), (34, -0.039), (35, -0.045), (36, -0.041), (37, -0.002), (38, 0.045), (39, 0.049), (40, -0.021), (41, 0.042), (42, -0.019), (43, 0.039), (44, -0.043), (45, -0.023), (46, 0.009), (47, 0.021), (48, 0.007), (49, 0.009)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97362262 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Author: Philipp Krähenbühl, Vladlen Koltun

2 0.90967339 227 nips-2011-Pylon Model for Semantic Segmentation

Author: Victor Lempitsky, Andrea Vedaldi, Andrew Zisserman

3 0.85902882 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu

4 0.74112153 119 nips-2011-Higher-Order Correlation Clustering for Image Segmentation

Author: Sungwoong Kim, Sebastian Nowozin, Pushmeet Kohli, Chang D. Yoo

5 0.6514051 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei

Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the ﬁrst rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.

6 0.55398619 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes

7 0.52740526 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies

8 0.48208669 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance

9 0.41653574 255 nips-2011-Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC

10 0.38074675 127 nips-2011-Image Parsing with Stochastic Scene Grammar

11 0.37392467 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs

12 0.37226364 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation

13 0.37008265 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation

14 0.36843607 55 nips-2011-Collective Graphical Models

15 0.35542861 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

16 0.35258418 229 nips-2011-Query-Aware MCMC

17 0.35095948 293 nips-2011-Understanding the Intrinsic Memorability of Images

18 0.34778634 138 nips-2011-Joint 3D Estimation of Objects and Scene Layout

19 0.3295567 274 nips-2011-Structure Learning for Optimization

20 0.3292411 277 nips-2011-Submodular Multi-Label Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (4, 0.041), (10, 0.244), (20, 0.088), (26, 0.02), (31, 0.109), (33, 0.039), (43, 0.087), (45, 0.09), (57, 0.041), (74, 0.068), (83, 0.033), (99, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.84408343 76 nips-2011-Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials

Author: Philipp Krähenbühl, Vladlen Koltun

2 0.8370803 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

Author: Siwei Lyu

Abstract: When used to learn high dimensional parametric probabilistic models, the classical maximum likelihood (ML) learning often suffers from computational intractability, which motivates the active developments of non-ML learning methods. Yet, because of their divergent motivations and forms, the objective functions of many non-ML learning methods are seemingly unrelated, and there lacks a uniﬁed framework to understand them. In this work, based on an information geometric view of parametric learning, we introduce a general non-ML learning principle termed as minimum KL contraction, where we seek optimal parameters that minimizes the contraction of the KL divergence between the two distributions after they are transformed with a KL contraction operator. We then show that the objective functions of several important or recently developed non-ML learning methods, including contrastive divergence [12], noise-contrastive estimation [11], partial likelihood [7], non-local contrastive objectives [31], score matching [14], pseudo-likelihood [3], maximum conditional likelihood [17], maximum mutual information [2], maximum marginal likelihood [9], and conditional and marginal composite likelihood [24], can be uniﬁed under the minimum KL contraction framework with different choices of the KL contraction operators. 1

3 0.81124789 123 nips-2011-How biased are maximum entropy models?

Author: Jakob H. Macke, Iain Murray, Peter E. Latham

Abstract: Maximum entropy models have become popular statistical models in neuroscience and other areas in biology, and can be useful tools for obtaining estimates of mutual information in biological systems. However, maximum entropy models ﬁt to small data sets can be subject to sampling bias; i.e. the true entropy of the data can be severely underestimated. Here we study the sampling properties of estimates of the entropy obtained from maximum entropy models. We show that if the data is generated by a distribution that lies in the model class, the bias is equal to the number of parameters divided by twice the number of observations. However, in practice, the true distribution is usually outside the model class, and we show here that this misspeciﬁcation can lead to much larger bias. We provide a perturbative approximation of the maximally expected bias when the true model is out of model class, and we illustrate our results using numerical simulations of an Ising model; i.e. the second-order maximum entropy distribution on binary data. 1

4 0.79669976 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts

Author: Matthias Hein, Simon Setzer

Abstract: Spectral clustering is based on the spectral relaxation of the normalized/ratio graph cut criterion. While the spectral relaxation is known to be loose, it has been shown recently that a non-linear eigenproblem yields a tight relaxation of the Cheeger cut. In this paper, we extend this result considerably by providing a characterization of all balanced graph cuts which allow for a tight relaxation. Although the resulting optimization problems are non-convex and non-smooth, we provide an efﬁcient ﬁrst-order scheme which scales to large graphs. Moreover, our approach comes with the quality guarantee that given any partition as initialization the algorithm either outputs a better partition or it stops immediately. 1

5 0.71857601 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We propose an online prediction version of submodular set cover with connections to ranking and repeated active learning. In each round, the learning algorithm chooses a sequence of items. The algorithm then receives a monotone submodular function and suffers loss equal to the cover time of the function: the number of items needed, when items are selected in order of the chosen sequence, to achieve a coverage constraint. We develop an online learning algorithm whose loss converges to approximately that of the best sequence in hindsight. Our proposed algorithm is readily extended to a setting where multiple functions are revealed at each round and to bandit and contextual bandit settings. 1 Problem In an online ranking problem, at each round we choose an ordered list of items and then incur some loss. Problems with this structure include search result ranking, ranking news articles, and ranking advertisements. In search result ranking, each round corresponds to a search query and the items correspond to search results. We consider online ranking problems in which the loss incurred at each round is the number of items in the list needed to achieve some goal. For example, in search result ranking a reasonable loss is the number of results the user needs to view before they ﬁnd the complete information they need. We are speciﬁcally interested in problems where the list of items is a sequence of questions to ask or tests to perform in order to learn. In this case the ranking problem becomes a repeated active learning problem. For example, consider a medical diagnosis problem where at each round we choose a sequence of medical tests to perform on a patient with an unknown illness. The loss is the number of tests we need to perform in order to make a conﬁdent diagnosis. We propose an approach to these problems using a new online version of submodular set cover. A set function F (S) deﬁned over a ground set V is called submodular if it satisﬁes the following diminishing returns property: for every A ⊆ B ⊆ V \ {v}, F (A + v) − F (A) ≥ F (B + v) − F (B). Many natural objectives measuring information, inﬂuence, and coverage turn out to be submodular [1, 2, 3]. A set function is called monotone if for every A ⊆ B, F (A) ≤ F (B) and normalized if F (∅) = 0. Submodular set cover is the problem of selecting an S ⊆ V minimizing |S| under the constraint that F (S) ≥ 1 where F is submodular, monotone, and normalized (note we can always rescale F ). This problem is NP-hard, but a greedy algorithm gives a solution with cost less than 1 + ln 1/ that of the optimal solution where is the smallest non-zero gain of F [4]. We propose the following online prediction version of submodular set cover, which we simply call online submodular set cover. At each time step t = 1 . . . T we choose a sequence of elements t t t t S t = (v1 , v2 , . . . vn ) where each vi is chosen from a ground set V of size n (we use a superscript for rounds of the online problem and a subscript for other indices). After choosing S t , an adversary reveals a submodular, monotone, normalized function F t , and we suffer loss (F t , S t ) where (F t , S t ) t min {n} ∪ {i : F t (Si ) ≥ 1}i 1 (1) t t t t ∅). Note and Si j≤i {vj } is deﬁned to be the set containing the ﬁrst i elements of S (let S0 n t t t t can be equivalently written (F , S ) i=0 I(F (Si ) < 1) where I is the indicator function. Intuitively, (F t , S t ) corresponds to a bounded version of cover time: it is the number of items up to n needed to achieve F t (S) ≥ 1 when we select items in the order speciﬁed by S t . Thus, if coverage is not achieved, we suffer a loss of n. We assume that F t (V ) ≥ 1 (therefore coverage is achieved if S t does not contain duplicates) and that the sequence of functions (F t )t is chosen in advance (by an oblivious adversary). The goal of our learning algorithm is to minimize the total loss t (F t , S t ). To make the problem clear, we present it ﬁrst in its simplest, full information version. However, we will later consider more complex variations including (1) a version where we only produce a list of t t t length k ≤ n instead of n, (2) a multiple objective version where a set of objectives F1 , F2 , . . . Fm is revealed each round, (3) a bandit (partial information) version where we do not get full access to t t t F t and instead only observe F t (S1 ), F t (S2 ), . . . F t (Sn ), and (4) a contextual bandit version where there is some context associated with each round. We argue that online submodular set cover, as we have deﬁned it, is an interesting and useful model for ranking and repeated active learning problems. In a search result ranking problem, after presenting search results to a user we can obtain implicit feedback from this user (e.g., clicks, time spent viewing each result) to determine which results were actually relevant. We can then construct an objective F t (S) such that F t (S) ≥ 1 iff S covers or summarizes the relevant results. Alternatively, we can avoid explicitly constructing an objective by considering the bandit version of the problem t where we only observe the values F t (Si ). For example, if the user clicked on k total results then t we can let F (Si ) ci /k where ci ≤ k is the number of results in the subset Si which were clicked. Note that the user may click an arbitrary set of results in an arbitrary order, and the user’s decision whether or not to click a result may depend on previously viewed and clicked results. All that we assume is that there is some unknown submodular function explaining the click counts. If the user clicks on a small number of very early results, then coverage is achieved quickly and the ordering is desirable. This coverage objective makes sense if we assume that the set of results the user clicked are of roughly equal importance and together summarize the results of interest to the user. In the medical diagnosis application, we can deﬁne F t (S) to be proportional to the number of candidate diseases which are eliminated after performing the set of tests S on patient t. If we assume that a particular test result always eliminates a ﬁxed set of candidate diseases, then this function is submodular. Speciﬁcally, this objective is the reduction in the size of the version space [5, 6]. Other active learning problems can also be phrased in terms of satisfying a submodular coverage constraint including problems that allow for noise [7]. Note that, as in the search result ranking problem, F t is not initially known but can be inferred after we have chosen S t and suffered loss (F t , S t ). 2 Background and Related Work Recently, Azar and Gamzu [8] extended the O(ln 1/ ) greedy approximation algorithm for submodular set cover to the more general problem of minimizing the average cover time of a set of objectives. Here is the smallest non-zero gain of all the objectives. Azar and Gamzu [8] call this problem ranking with submodular valuations. More formally, we have a known set of functions F1 , F2 , . . . , Fm each with an associated weight wi . The goal is then to choose a permutation S of m the ground set V to minimize i=1 wi (Fi , S). The ofﬂine approximation algorithm for ranking with submodular valuations will be a crucial tool in our analysis of online submodular set cover. In particular, this ofﬂine algorithm can viewed as constructing the best single permutation S for a sequence of objectives F 1 , F 2 . . . F T in hindsight (i.e., after all the objectives are known). Recently the ranking with submodular valuations problem was extended to metric costs [9]. Online learning is a well-studied problem [10]. In one standard setting, the online learning algorithm has a collection of actions A, and at each time step t the algorithm picks an action S t ∈ A. The learning algorithm then receives a loss function t , and the algorithm incurs the loss value for the action it chose t (S t ). We assume t (S t ) ∈ [0, 1] but make no other assumptions about the form of loss. The performance of an online learning algorithm is often measured in terms of regret, the difference between the loss incurred by the algorithm and the loss of the best single ﬁxed action T T in hindsight: R = t=1 t (S t ) − minS∈A t=1 t (S). There are randomized algorithms which guarantee E[R] ≤ T ln |A| for adversarial sequences of loss functions [11]. Note that because 2 E[R] = o(T ) the per round regret approaches zero. In the bandit version of this problem the learning algorithm only observes t (S t ) [12]. Our problem ﬁts in this standard setting with A chosen to be the set of all ground set permutations (v1 , v2 , . . . vn ) and t (S t ) (F t , S t )/n. However, in this case A is very large so standard online learning algorithms which keep weight vectors of size |A| cannot be directly applied. Furthermore, our problem generalizes an NP-hard ofﬂine problem which has no polynomial time approximation scheme, so it is not likely that we will be able to derive any efﬁcient algorithm with o(T ln |A|) regret. We therefore instead consider α-regret, the loss incurred by the algorithm as compared to α T T times the best ﬁxed prediction. Rα = t=1 t (S t ) − α minS∈A t=1 t (S). α-regret is a standard notion of regret for online versions of NP-hard problems. If we can show Rα grows sub linearly with T then we have shown loss converges to that of an ofﬂine approximation with ratio α. Streeter and Golovin [13] give online algorithms for the closely related problems of submodular function maximization and min-sum submodular set cover. In online submodular function maximization, the learning algorithm selects a set S t with |S t | ≤ k before F t is revealed, and the goal is to maximize t F t (S t ). This problem differs from ours in that our problem is a loss minimization problem as opposed to an objective maximization problem. Online min-sum submodular set cover is similar to online submodular set cover except the loss is not cover time but rather n ˆ(F t , S t ) t max(1 − F t (Si ), 0). (2) i=0 t t Min-sum submodular set cover penalizes 1 − F t (Si ) where submodular set cover uses I(F t (Si ) < 1). We claim that for certain applications the hard threshold makes more sense. For example, in repeated active learning problems minimizing t (F t , S t ) naturally corresponds to minimizing the number of questions asked. Minimizing t ˆ(F t , S t ) does not have this interpretation as it charges less for questions asked when F t is closer to 1. One might hope that minimizing could be reduced to or shown equivalent to minimizing ˆ. This is not likely to be the case, as the approximation algorithm of Streeter and Golovin [13] does not carry over to online submodular set cover. Their online algorithm is based on approximating an ofﬂine algorithm which greedily maximizes t min(F t (S), 1). Azar and Gamzu [8] show that this ofﬂine algorithm, which they call the cumulative greedy algorithm, does not achieve a good approximation ratio for average cover time. Radlinski et al. [14] consider a special case of online submodular function maximization applied to search result ranking. In their problem the objective function is assumed to be a binary valued submodular function with 1 indicating the user clicked on at least one document. The goal is then to maximize the number of queries which receive at least one click. For binary valued functions ˆ and are the same, so in this setting minimizing the number of documents a user must view before clicking on a result is a min-sum submodular set cover problem. Our results generalize this problem to minimizing the number of documents a user must view before some possibly non-binary submodular objective is met. With non-binary objectives we can incorporate richer implicit feedback such as multiple clicks and time spent viewing results. Slivkins et al. [15] generalize the results of Radlinski et al. [14] to a metric space bandit setting. Our work differs from the online set cover problem of Alon et al. [16]; this problem is a single set cover problem in which the items that need to be covered are revealed one at a time. Kakade et al. [17] analyze general online optimization problems with linear loss. If we assume that the functions F t are all taken from a known ﬁnite set of functions F then we have linear loss over a |F| dimensional space. However, this approach gives poor dependence on |F|. 3 Ofﬂine Analysis In this work we present an algorithm for online submodular set cover which extends the ofﬂine algorithm of Azar and Gamzu [8] for the ranking with submodular valuations problem. Algorithm 1 shows this ofﬂine algorithm, called the adaptive residual updates algorithm. Here we use T to denote the number of objective functions and superscript t to index the set of objectives. This notation is chosen to make the connection to the proceeding online algorithm clear: our online algorithm will approximately implement Algorithm 1 in an online setting, and in this case the set of objectives in 3 Algorithm 1 Ofﬂine Adaptive Residual Input: Objectives F 1 , F 2 , . . . F T Output: Sequence S1 ⊂ S2 ⊂ . . . Sn S0 ← ∅ for i ← 1 . . . n do v ← argmax t δ(F t , Si−1 , v) v∈V Si ← Si−1 + v end for Figure 1: Histograms used in ofﬂine analysis the ofﬂine algorithm will be the sequence of objectives in the online problem. The algorithm is a greedy algorithm similar to the standard algorithm for submodular set cover. The crucial difference is that instead of a normal gain term of F t (S + v) − F t (S) it uses a relative gain term δ(F t , S, v) min( F 0 t (S+v)−F t (S) , 1) 1−F t (S) if F (S) < 1 otherwise The intuition is that (1) a small gain for F t matters more if F t is close to being covered (F t (S) close to 1) and (2) gains for F t with F t (S) ≥ 1 do not matter as these functions are already covered. The main result of Azar and Gamzu [8] is that Algorithm 1 is approximately optimal. t Theorem 1 ([8]). The loss t (F , S) of the sequence produced by Algorithm 1 is within 4(ln(1/ ) + 2) of that of any other sequence. We note Azar and Gamzu [8] allow for weights for each F t . We omit weights for simplicity. Also, Azar and Gamzu [8] do not allow the sequence S to contain duplicates while we do–selecting a ground set element twice has no beneﬁt but allowing them will be convenient for the online analysis. The proof of Theorem 1 involves representing solutions to the submodular ranking problem as histograms. Each histogram is deﬁned such that the area of the histogram is equal to the loss of the corresponding solution. The approximate optimality of Algorithm 1 is shown by proving that the histogram for the solution it ﬁnds is approximately contained within the histogram for the optimal solution. In order to convert Algorithm 1 into an online algorithm, we will need a stronger version of Theorem 1. Speciﬁcally, we will need to show that when there is some additive error in the greedy selection rule Algorithm 1 is still approximately optimal. For the optimal solution S ∗ = argminS∈V n t (F t , S) (V n is the set of all length n sequences of ground set elements), deﬁne a histogram h∗ with T columns, one for each function F t . Let the tth column have with width 1 and height equal to (F t , S ∗ ). Assume that the columns are ordered by increasing cover time so that the histogram is monotone non-decreasing. Note that the area of this histogram is exactly the loss of S ∗ . For a sequence of sets ∅ = S0 ⊆ S1 ⊆ . . . Sn (e.g., those found by Algorithm 1) deﬁne a corresponding sequence of truncated objectives ˆ Fit (S) min( F 1 t (S∪Si−1 )−F t (Si−1 ) , 1) 1−F t (Si−1 ) if F t (Si−1 ) < 1 otherwise ˆ Fit (S) is essentially F t except with (1) Si−1 given “for free”, and (2) values rescaled to range ˆ ˆ between 0 and 1. We note that Fit is submodular and that if F t (S) ≥ 1 then Fit (S) ≥ 1. In this ˆ t is an easier objective than F t . Also, for any v, F t ({v}) − F t (∅) = δ(F t , Si−1 , v). In ˆ ˆ sense Fi i i ˆ other words, the gain of Fit at ∅ is the normalized gain of F t at Si−1 . This property will be crucial. ˆ ˆ ˆ We next deﬁne truncated versions of h∗ : h1 , h2 , . . . hn which correspond to the loss of S ∗ for the ˆ ˆ t . For each j ∈ 1 . . . n, let hi have T columns of height j easier covering problems involving Fi t ∗ t ∗ ˆ ˆ with the tth such column of width Fi (Sj ) − Fi (Sj−1 ) (some of these columns may have 0 width). ˆ Assume again the columns are ordered by height. Figure 1 shows h∗ and hi . ∗ We assume without loss of generality that F t (Sn ) ≥ 1 for every t (clearly some choice of S ∗ ∗ contains no duplicates, so under our assumption that F t (V ) ≥ 1 we also have F t (Sn ) ≥ 1). Note 4 ˆ that the total width of hi is then the number of functions remaining to be covered after Si−1 is given ˆ for free (i.e., the number of F t with F t (Si−1 ) < 1). It is not hard to see that the total area of hi is ˆ(F t , S ∗ ) where ˆ is the loss function for min-sum submodular set cover (2). From this we know ˆ l i t ˆ hi has area less than h∗ . In fact, Azar and Gamzu [8] show the following. ˆ ˆ Lemma 1 ([8]). hi is completely contained within h∗ when hi and h∗ are aligned along their lower right boundaries. We need one ﬁnal lemma before proving the main result of this section. For a sequence S deﬁne Qi = t δ(F t , Si−1 , vi ) to be the total normalized gain of the ith selected element and let ∆i = n t j=i Qj be the sum of the normalized gains from i to n. Deﬁne Πi = |{t : F (Si−1 ) < 1}| to be the number of functions which are still uncovered before vi is selected (i.e., the loss incurred at step i). [8] show the following result relating ∆i to Πi . Lemma 2 ([8]). For any i, ∆i ≤ (ln 1/ + 2)Πi We now state and prove the main result of this section, that Algorithm 1 is approximately optimal even when the ith greedy selection is preformed with some additive error Ri . This theorem shows that in order to achieve low average cover time it sufﬁces to approximately implement Algorithm 1. Aside from being useful for converting Algorithm 1 into an online algorithm, this theorem may be useful for applications in which the ground set V is very large. In these situations it may be possible to approximate Algorithm 1 (e.g., through sampling). Streeter and Golovin [13] prove similar results for submodular function maximization and min-sum submodular set cover. Our result is similar, but t the proof is non trivial. The loss function is highly non linear with respect to changes in F t (Si ), so it is conceivable that small additive errors in the greedy selection could have a large effect. The analysis of Im and Nagarajan [9] involves a version of Algorithm 1 which is robust to a sort of multplicative error in each stage of the greedy selection. Theorem 2. Let S = (v1 , v2 , . . . vn ) be any sequence for which δ(F t , Si−1 , vi ) + Ri ≥ max Then t δ(F t , Si−1 , v) v∈V t (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + n t i Ri Proof. Let h be a histogram with a column for each Πi with Πi = 0. Let γ = (ln 1/ + 2). Let the ith column have width (Qi + Ri )/(2γ) and height max(Πi − j Rj , 0)/(2(Qi + Ri )). Note that Πi = 0 iff Qi + Ri = 0 as if there are functions not yet covered then there is some set element with non zero gain (and vice versa). The area of h is i:Πi =0 max(Πi − j Rj , 0) 1 1 (Qi + Ri ) ≥ 2γ 2(Qi + Ri ) 4γ (F t , S) − t n 4γ Rj j ˆ Assume h and every hi are aligned along their lower right boundaries. We show that if the ith ˆ column of h has non-zero area then it is contained within hi . Then, it follows from Lemma 1 that h ∗ is contained within h , completing the proof. Consider the ith column in h. Assume this column has non zero area so Πi ≥ j Rj . This column is at most (∆i + j≥i Rj )/(2γ) away from the right hand boundary. To show that this column is in ˆ hi it sufﬁces to show that after selecting the ﬁrst k = (Πi − j Rj )/(2(Qi + Ri )) items in S ∗ we ˆ ∗ ˆ still have t (1 − Fit (Sk )) ≥ (∆i + j≥i Rj )/(2γ) . The most that t Fit can increase through ˆ the addition of one item is Qi + Ri . Therefore, using the submodularity of Fit , ˆ ∗ Fit (Sk ) − t Therefore t (1 ˆ Fit (∅) ≤ k(Qi + Ri ) ≤ Πi /2 − t ˆ ∗ − Fit (Sk )) ≥ Πi /2 + j Rj /2 since Rj /2 ≥ ∆i /(2γ) + Πi /2 + Rj /2 j t (1 ˆ − Fit (∅)) = Πi . Using Lemma 2 Rj /2 ≥ (∆i + j j 5 Rj )/(2γ) j≥i Algorithm 2 Online Adaptive Residual Input: Integer T Initialize n online learning algorithms E1 , E2 , . . . En with A = V for t = 1 → T do t ∀i ∈ 1 . . . n predict vi with Ei t t S t ← (v1 , . . . vn ) Receive F t , pay loss l(F t , S t ) t For Ei , t (v) ← (1 − δ(F t , Si−1 , v)) end for 4 Figure 2: Ei selects the ith element in S t . Online Analysis We now show how to convert Algorithm 1 into an online algorithm. We use the same idea used by Streeter and Golovin [13] and Radlinski et al. [14] for online submodular function maximization: we run n copies of some low regret online learning algorithm, E1 , E2 , . . . En , each with action space A = V . We use the ith copy Ei to select the ith item in each predicted sequence S t . In other 1 2 T words, the predictions of Ei will be vi , vi , . . . vi . Figure 2 illustrates this. Our algorithm assigns loss values to each Ei so that, assuming Ei has low regret, Ei approximately implements the ith greedy selection in Algorithm 1. Algorithm 2 shows this approach. Note that under our assumption that F 1 , F 2 , . . . F T is chosen by an oblivious adversary, the loss values for the ith copy of the online algorithm are oblivious to the predictions of that run of the algorithm. Therefore we can use any algorithm for learning against an oblivious adversary. Theorem 3. Assume we use as a subroutine an online prediction algorithm with expected regret √ √ E[R] ≤ T ln n. Algorithm 2 has expected α-regret E[Rα ] ≤ n2 T ln n for α = 4(ln(1/ ) + 2) 1 2 T Proof. Deﬁne a meta-action vi for the sequence of actions chosen by Ei , vi = (vi , vi , . . . vi ). We ˜ ˜ t t t t ˜ can extend the domain of F to allow for meta-actions F (S ∪ {ˆi }) = F (S ∪ {vi }). Let S be v ˜ the sequence of meta actions S = (v1 , v2 , . . . vn ). Let Ri be the regret of Ei . Note that from the ˜ ˜ ˜ deﬁnition of regret and our choice of loss values we have that ˜ δ(F t , Si−1 , v) − max v∈V t ˜ δ(F t , Si−1 , vi ) = Ri ˜ t ˜ Therefore, S approximates the greedy solution in the sense required by Theorem 2. Theorem 2 did not require that S be constructed V . From Theorem 2 we then have ˜ (F t , S) ≤ α (F t , S t ) = t t The expected α-regret is then E[n i (F t , S ∗ ) + n t Ri i √ Ri ] ≤ n2 T ln n We describe several variations and extensions of this analysis, some of which mirror those for related work [13, 14, 15]. Avoiding Duplicate Items Since each run of the online prediction algorithm is independent, Algorithm 2 may select the same ground set element multiple times. This drawback is easy to ﬁx. We can simply select any arbitrary vi ∈ Si−1 if Ei selects a vi ∈ Si−i . This modiﬁcation does not affect / the regret guarantee as selecting a vi ∈ Si−1 will always result in a gain of zero (loss of 1). Truncated Loss In some applications we only care about the ﬁrst k items in the sequence S t . For these applications it makes sense to consider a truncated version of l(F t , S t ) with parameter k k (F t , S t ) t t min {k} ∪ {|Si | : F t (Si ) ≥ 1} This is cover time computed up to the kth element in S t . The analysis for Theorem 2 also shows k k (F t , S t ) ≤ 4(ln 1/ + 2) t (F t , S ∗ ) + k i=1 Ri . The corresponding regret bound is then t 6 √ k 2 T ln n. Note here we are bounding truncated loss t k (F t , S t ) in terms of untruncated loss t ∗ 2 2 t (F , S ). In this sense this bound is weaker. However, we replace n with k which may be much smaller. Algorithm 2 achieves this bound simultaneously for all k. Multiple Objectives per Round Consider a variation of online submodular set cover in which int t t stead of receiving a single objective F t each round we receive a batch of objectives F1 , F2 , . . . Fm m t t and incur loss i=1 (Fi , S ). In other words, each rounds corresponds to a ranking with submodular valuations problem. It is easy to extend Algorithm 2 to this√setting by using 1 − m t (1/m) i=1 δ(Fit , Si−1 , v) for the loss of action v in Ei . We then get O(k 2 mL∗ ln n+k 2 m ln n) T m ∗ total regret where L = t=1 i=1 (Fit , S ∗ ) (Section 2.6 of [10]). Bandit Setting Consider a setting where instead of receiving full access to F t we only observe t t t the sequence of objective function values F t (S1 ), F t (S2 ), . . . F t (Sn ) (or in the case of multiple t t objectives per round, Fi (Sj ) for every i and j). We can extend Algorithm 2 to this setting using a nonstochastic multiarmed bandits algorithm [12]. We note duplicate removal becomes more subtle in the bandit setting: should we feedback a gain of zero when a duplicate is selected or the gain of the non-duplicate replacement? We propose either is valid if replacements are chosen obliviously. Bandit Setting with Expert Advice We can further generalize the bandit setting to the contextual bandit setting [18] (e.g., the bandit setting with expert advice [12]). Say that we have access at time step t to predictions from a set of m experts. Let vj be the meta action corresponding to the sequence ˜ ˜ of predictions from the jth expert and V be the set of all vj . Assume that Ei guarantees low regret ˜ ˜ with respect to V t t δ(F t , Si−1 , vi ) + Ri ≥ max v ∈V ˜ ˜ t t δ(F t , Si−1 , v ) ˜ (3) t where we have extended the domain of each F t to include meta actions as in the proof of Theorem ˜ 3. Additionally assume that F t (V ) ≥ 1 for every t. In this case we can show t k (F t , S t ) ≤ √ k m t ∗ minS ∗ ∈V m t (F , S ) + k i=1 Ri . The Exp4 algorithm [12] has Ri = O( nT ln m) giving ˜ √ total regret O(k 2 nT ln m). Experts may use context in forming recommendations. For example, in a search ranking problem the context could be the query. 5 5.1 Experimental Results Synthetic Example We present a synthetic example for which the online cumulative greedy algorithm [13] fails, based on the example in Azar and Gamzu [8] for the ofﬂine setting. Consider an online ad placement problem where the ground set V is a set of available ad placement actions (e.g., a v ∈ V could correspond to placing an ad on a particular web page for a particular length of time). On round t, we receive an ad from an advertiser, and our goal is to acquire λ clicks for the ad using as few t advertising actions as possible. Deﬁne F t (Si ) to be min(ct , λ)/λ where ct is number of clicks i i t acquired from the ad placement actions Si . Say that we have n advertising actions of two types: 2 broad actions and n − 2 narrow actions. Say that the ads we receive are also of two types. Common type ads occur with probability (n − 1)/n and receive 1 and λ − 1 clicks respectively from the two broad actions and 0 clicks from narrow actions. Uncommon type ads occur with probability 1/n and receive λ clicks from one randomly chosen narrow action and 0 clicks from all other actions. Assume λ ≥ n2 . Intuitively broad actions could correspond to ad placements on sites for which many ads are relevant. The optimal strategy giving an average cover time O(1) is to ﬁrst select the two broad actions covering all common ads then select the narrow actions in any order. However, the ofﬂine cumulative greedy algorithm will pick all narrow actions before picking the broad action with gain 1 giving average cover time O(n). The left of Figure 3 shows average cover time for our proposed algorithm and the cumulative greedy algorithm of [13] on the same sequences of random objectives. For this example we use n = 25 and the bandit version of the problem with the Exp3 algorithm [12]. We also plot the average cover times for ofﬂine solutions as baselines. As seen in the ﬁgure, the cumulative algorithms converge to higher average cover times than the adaptive residual algorithms. Interestingly, the online cumulative algorithm does better than the ofﬂine cumulative algorithm: it seems added randomization helps. 7 Figure 3: Average cover time 5.2 Repeated Active Learning for Movie Recommendation Consider a movie recommendation website which asks users a sequence of questions before they are given recommendations. We deﬁne an online submodular set cover problem for choosing sequences of questions in order to quickly eliminate a large number of movies from consideration. This is similar conceptually to the diagnosis problem discussed in the introduction. Deﬁne the ground set V to be a set of questions (for example “Do you want to watch something released in the past 10 years?” or “Do you want to watch something from the Drama genre?”). Deﬁne F t (S) to be proportional to the number of movies eliminated from consideration after asking the tth user S. Speciﬁcally, let H be the set of all movies in our database and V t (S) be the subset of movies consistent with the tth user’s responses to S. Deﬁne F t (S) min(|H \ V t (S)|/c, 1) where c is a constant. F t (S) ≥ iff after asking the set of questions S we have eliminated at least c movies. We set H to be a set of 11634 movies available on Netﬂix’s Watch Instantly service and use 803 questions based on those we used for an ofﬂine problem [7]. To simulate user responses to questions, on round t we randomly select a movie from H and assume the tth user answers questions consistently with this movie. We set c = |H| − 500 so the goal is to eliminate about 95% of all movies. We evaluate in the full information setting: this makes sense if we assume we receive as feedback the movie the user actually selected. As our online prediction subroutine we tried Normal-Hedge [19], a second order multiplicative weights method [20], and a version of multiplicative weights for small gains using the doubling trick (Section 2.6 of [10]). We also tried a heuristic modiﬁcation of Normal-Hedge which ﬁxes ct = 1 for a ﬁxed, more aggressive learning rate than theoretically justiﬁed. The right of Figure 3 shows average cover time for 100 runs of T = 10000 iterations. Note the different scale in the bottom row–these methods performed signiﬁcantly worse than Normal-Hedge. The online cumulative greedy algorithm converges to a average cover time close to but slightly worse than that of the adaptive greedy method. The differences are more dramatic for prediction subroutines that converge slowly. The modiﬁed Normal-Hedge has no theoretical justiﬁcation, so it may not generalize to other problems. For the modiﬁed Normal-Hedge the ﬁnal average cover times are 7.72 adaptive and 8.22 cumulative. The ofﬂine values are 6.78 and 7.15. 6 Open Problems It is not yet clear what practical value our proposed approach will have for web search result ranking. A drawback to our approach is that we pick a ﬁxed order in which to ask questions. For some problems it makes more sense to consider adaptive strategies [5, 6]. Acknowledgments This material is based upon work supported in part by the National Science Foundation under grant IIS-0535100, by an Intel research award, a Microsoft research award, and a Google research award. 8 References [1] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In HLT, 2011. ´ [2] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of inﬂuence through a social network. In KDD, 2003. [3] A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efﬁcient algorithms and empirical studies. JMLR, 2008. [4] L.A. Wolsey. An analysis of the greedy algorithm for the submodular set covering problem. Combinatorica, 2(4), 1982. [5] D. Golovin and A. Krause. Adaptive submodularity: A new approach to active learning and stochastic optimization. In COLT, 2010. [6] Andrew Guillory and Jeff Bilmes. Interactive submodular set cover. In ICML, 2010. [7] Andrew Guillory and Jeff Bilmes. Simultaneous learning and covering with adversarial noise. In ICML, 2011. [8] Yossi Azar and Iftah Gamzu. Ranking with Submodular Valuations. In SODA, 2011. [9] S. Im and V. Nagarajan. Minimum Latency Submodular Cover in Metrics. ArXiv e-prints, October 2011. [10] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006. [11] Y. Freund and R. Schapire. A desicion-theoretic generalization of on-line learning and an application to boosting. In Computational learning theory, pages 23–37, 1995. [12] P. Auer, N. Cesa-Bianchi, Y. Freund, and R.E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2003. [13] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2008. [14] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In ICML, 2008. [15] A. Slivkins, F. Radlinski, and S. Gollapudi. Learning optimally diverse rankings over large document collections. In ICML, 2010. [16] N. Alon, B. Awerbuch, and Y. Azar. The online set cover problem. In STOC, 2003. [17] Sham M. Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms. In STOC, 2007. [18] J. Langford and T. Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007. [19] K. Chaudhuri, Y. Freund, and D. Hsu. A parameter-free hedging algorithm. In NIPS, 2009. [20] N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 2007. 9

6 0.68915051 227 nips-2011-Pylon Model for Semantic Segmentation

7 0.64101833 199 nips-2011-On fast approximate submodular minimization

8 0.6356141 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling

9 0.62459368 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation

10 0.62285525 180 nips-2011-Multiple Instance Filtering

11 0.62182534 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

12 0.61441231 55 nips-2011-Collective Graphical Models

13 0.61220288 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis

14 0.61140221 258 nips-2011-Sparse Bayesian Multi-Task Learning

15 0.61101758 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations

16 0.60939705 66 nips-2011-Crowdclustering

17 0.60920489 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

18 0.60803407 303 nips-2011-Video Annotation and Tracking with Active Learning

19 0.60732305 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data

20 0.60609013 156 nips-2011-Learning to Learn with Compound HD Models