nips nips2009 nips2009-236 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andrea Vedaldi, Andrew Zisserman
Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. [sent-4, score-0.574]
2 The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. [sent-5, score-0.284]
3 left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. [sent-8, score-0.677]
4 We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. [sent-9, score-0.633]
5 The training data provided for these challenges specifies if an object is truncated – when the provided axis aligned bounding box does not cover the full extent of the object. [sent-11, score-0.657]
6 The principal cause of truncation is that the object partially lies outside the image area. [sent-12, score-0.321]
7 Most participants simple disregard the truncated training instances and learn from the non-truncated ones. [sent-13, score-0.311]
8 This is a waste of training material, but more seriously many truncated instances are missed in testing, significantly reducing the recall and hence decreasing overall recognition performance. [sent-14, score-0.311]
9 The model is specified as a joint kernel and learnt using an extension of the structural SVM with latent variables framework of [13]. [sent-17, score-0.227]
10 We use this approach as it allows us to address a second deficiency of the provided supervision – that the annotation is limited to axis aligned bounding boxes, even though the objects may be in plane rotated so that the box is a loose bound. [sent-18, score-0.426]
11 The latent variables allow us to specify a pose transformation for each instances so that we achieve a spatial correspondence between all instances with the same aspect. [sent-19, score-0.714]
12 [4] who propose a latent SVM framework, where the latent variables specify sub-part locations. [sent-22, score-0.321]
13 The parts give their model some tolerance to in plane rotation and foreshortening (though an axis aligned rectangle is still used for a first 1 RIGHT LEFT LEFT LEFT LEFT (a) RIGHT LEFT LEFT LEFT (b) (c) (d) Figure 1: Model overview. [sent-23, score-0.253]
14 Detection examples on the VOC images for the bicycle class demonstrate that the model can handle severe truncations (a-b), multiple objects (c), multiple aspects (d), and pose variations (small in-plane rotations) (e). [sent-24, score-0.869]
15 Truncations caused by the image boundary, (a) & (b), are a significant problem for template based detectors, since the template can then only partially align with the image. [sent-25, score-0.66]
16 Small in-plane rotations and anisotropic rescalings of the template are approximated efficiently by rearranging sub-blocks of the HOG template (white boxes in (e)). [sent-26, score-0.897]
17 [11] used pixel wise binary latent variables to specify the occlusion and an Ising prior for spatial coherence. [sent-31, score-0.323]
18 There are numerous papers on detecting faces with various degrees of partial occlusion from glasses, or synthetic truncations [6, 7]. [sent-38, score-0.282]
19 We then learn a structured regressor, mapping an image to a list of objects with their pose (or bounding box), while at the same time handling explicitly truncation and multiple aspects. [sent-40, score-0.91]
20 Our choice of kernel is inspired by the restriction kernel of [1]; however, our kernel performs both restriction and alignment to a template, supports multiple templates to handle different aspects and truncations, and adds a bias term which significantly improves performance. [sent-41, score-0.608]
21 We refine pose beyond translation and scaling with an additional transformation selected from a finite set of possible perturbations covering aspect ratio change and small in plane rotations. [sent-42, score-0.686]
22 Instead of explicitly transforming the image with each element of this set (which would be prohibitively expensive) we introduce a novel approximation based on decomposing the HOG descriptor into small blocks and quickly rearranging those. [sent-43, score-0.406]
23 To handle occlusions we selectively switch between an object descriptor and an occlusion descriptor. [sent-44, score-0.437]
24 To identify which portions of the template are occluded we use a field of binary variables. [sent-45, score-0.358]
25 These could be treated as latent variables; however, since we consider here only occlusions caused by the image boundaries, we can infer them deterministically from the position of the object relative to the image boundaries. [sent-46, score-0.506]
26 In training we improve the ground-truth pose parameters, since the bounding boxes and aspect associations provided in PASCAL VOC are quite coarse indicators of the object pose. [sent-49, score-0.964]
27 For each instance we add a latent variable which encodes a pose adjustment. [sent-50, score-0.518]
28 The resulting optimization alternates between optimizing the model parameters given the latent variables (a convex problem solved by a cutting plane algorithm) and optimizing the latent variable given the model (akin to inference). [sent-53, score-0.337]
29 2 The overall method is computationally efficient both in training and testing, achieves very good performances, and it is able to learn and recognise truncated objects. [sent-54, score-0.23]
30 We use the structured prediction learning framework of [9, 13], which considers along with the input and output variables x and y, an auxiliary latent variable h ∈ H as well (we use h to specify a refinement to the ground-truth pose parameters). [sent-56, score-0.612]
31 , (xN , yN ), the parameters w are learned by minimizing the regularized empirical risk R(w) = 1 w 2 2 + C N N ˆ ˆ ∆(yi , yi (w), hi (w)), ˆ ˆ where yi (w) = yxi (w), ˆ ˆ hi (w) = hxi (w). [sent-62, score-0.584]
32 We first define the kernel for the case of a single unoccluded instance in an image including scale and perturbing transformations, then generalise this to include truncations and aspects; and finally to multiple instances. [sent-65, score-0.435]
33 An appropriate loss function ∆(yi , y, h) is subsequently defined which takes into account the pose of the object specified by the latent variables. [sent-66, score-0.623]
34 1 Restriction and alignment kernel Denote by R a rectangular region of the image x, and by x|R the image cropped to that rectangle. [sent-68, score-0.327]
35 A restriction kernel [1] is the kernel K((x, R), (x , R )) = Kimage (x|R , x |R ) where Kimage is an appropriate kernel between images. [sent-69, score-0.246]
36 Let R0 be a reference rectangle and denote by R(p) = gp R0 the rectangle obtained from R0 by a geometric transformation ¯ gp : R2 → R2 . [sent-72, score-0.256]
37 Let x be an image defined on the reference rectangle R0 and let H(¯ ) ∈ Rd be a descriptor (e. [sent-74, score-0.34]
38 Then a natural definition of the restriction and alignment kernel is K((x, p), (x , p )) = Kdescr (H(x; p), H(x ; p )) (3) where Kdescr is an appropriate kernel for descriptors, and H(x; p) is the descriptor computed on the −1 transformed patch x as H(x; p) = H(gp x). [sent-77, score-0.407]
39 Our choice of the HOG descriptor puts some limits on the space of poses p that can be efficiently explored. [sent-79, score-0.217]
40 The computation starts by decomposing the image x into cells of d × d pixels (here d = 8). [sent-81, score-0.27]
41 The descriptor of a cell is the nine dimensional histogram of the orientation of the image gradient inside the cell (appropriately weighed and normalized as in [2]). [sent-82, score-0.384]
42 We obtain the HOG descriptor of a rectangle of w × h cells by stacking the enclosed cell descriptors (this is a 9 × w × h vector). [sent-83, score-0.513]
43 3 To further refine pose beyond scale and translation, here we consider an additional perturbation gt , indexed by a pose parameter t, selected in a set of transformations g1 , . [sent-86, score-0.863]
44 Those could be achieved in the same manner as −1 scaling by transforming the image gt x for each one, but this would be very expensive (we would need to recompute the cell descriptors every time). [sent-90, score-0.315]
45 Instead, we approximate such transformations by rearranging the cells of the template (Fig. [sent-91, score-0.55]
46 First, we partition the w × h cells of the template into blocks of m × m cells (e. [sent-93, score-0.617]
47 move each cell of the template independently), but we prefer to use blocks as this accelerates inference (see Sect. [sent-99, score-0.427]
48 Hence, pose is for us a tuple (x, y, s, t) representing translation, scale, and additional perturbation. [sent-101, score-0.356]
49 (4) Modeling truncations If part of the object is occluded (either by clutter or by the image boundaries), some of the descriptor cells will be either unpredictable or undefined. [sent-104, score-0.776]
50 We explicitly deal with occlusion at the granularity of the HOG cells by adding a field of w × h binary indicator variables v ∈ {0, 1}wh . [sent-105, score-0.271]
51 Here vj = 1 means that the j-th cell of the HOG descriptor H(x, p) should be considered to be visible, and vj = 0 that it is occluded. [sent-106, score-0.223]
52 (5) is effectively defining a template for the object and one for the occlusions. [sent-110, score-0.39]
53 In this case, which we explore in the experiments, we can write v = v(p) as a function of the pose p, and remove the explicit dependence on v in Ψ. [sent-113, score-0.356]
54 Moreover the truncated HOG cells are undefined and can be assigned a nominal common value. [sent-114, score-0.322]
55 So here we work with a simplified kernel, in which occlusions are represented by a scalar proportional to the number of truncated cells: Ψ(x, p) = 2. [sent-115, score-0.256]
56 3 (v(p) ⊗ 19 ) H(x, p) wh − |v(p)| (6) Modeling aspects A template model works well as long as pose captures accurately enough the transformations resulting from changes in the viewing conditions. [sent-116, score-0.849]
57 In our model, the pose p, combined with the robustness of the HOG descriptor, can absorb a fair amount of viewpoint induced deformation. [sent-117, score-0.356]
58 To this end, we augment pose with an aspect indicator a (so that pose is the tuple p = (x, y, s, t, a)), which indicates which template to use. [sent-120, score-1.118]
59 Note that now the concept of pose has been generalized to include a parameter, a, which, differently from the others, does not specify a geometric transformation. [sent-121, score-0.39]
60 Nevertheless, pose specifies how the model should be aligned to the image, i. [sent-122, score-0.408]
61 by (i) choosing the template that corresponds to the aspect a, (ii) translating and scaling such template according to (x, y, s), and (iii) applying to it the additional perturbation gt . [sent-124, score-0.809]
62 In training, they are initialized from the ground truth data annotations (bounding boxes and aspect labels), and are then refined by estimating the latent variables (Sect. [sent-126, score-0.595]
63 4 Latent variables The PASCAL VOC bounding boxes yield only a coarse estimate of the ground truth pose parameters (e. [sent-145, score-0.812]
64 they do not contain any information on the object rotation) and the aspect assignments may also be suboptimal (see previous section). [sent-147, score-0.234]
65 Therefore, we introduce latent variables h = (δp) that encode an adjustment to the ground-truth pose parameters y = (p). [sent-148, score-0.562]
66 In practice, the adjustment δp is a small variation of translation x, y, scale s, and perturbation t, and can switch the aspect a all together. [sent-149, score-0.286]
67 5 Variable number of objects: loss function, bias, training So far, we have defined the feature map Ψ(x, y) = Ψ(x; p) for the case in which the label y = (p) contains exactly one object, but an image may contain no or multiple object instances (denoted respectively y = and y = (p1 , . [sent-152, score-0.416]
68 Note that the ground truth poses are defined so that B(pl ) matches the PASCAL provided bounding boxes [1] (or the manually extended ones for the truncated ones). [sent-158, score-0.656]
69 If an image contains multiple instances, the image is added to the training set multiple times, each time “activating” one of the instances, and “deactivating” the others. [sent-169, score-0.336]
70 Formally, let p0 be the pose of the active instance and p1 , . [sent-171, score-0.393]
71 A pose p is removed from the search space if, and only if, maxi overl(B(p), B(pi )) ≥ max{overl(B(p), B(p0 )), 0. [sent-175, score-0.356]
72 (2) is difficult as the loss depends on w ˆ ˆ through yi (w) and hi (w) (see Eq. [sent-178, score-0.325]
73 i (11) Here h∗ (w) = argmaxh∈H w, Ψ(xi , yi , h) completes the label (yi , h∗ (w)) of the sample xi (of i i which only the observed part yi is known from the ground truth). [sent-181, score-0.447]
74 Starting from an estimate wt of the solution, define h∗ = hi (wt ), so that, for any w, it w, Ψ(xi , yi , h∗ (w)) = max w, Ψ(xi , yi , h ) ≥ w, Ψ(xi , yi , h∗ ) i it h and the equality holds for w = wt . [sent-187, score-0.718]
75 Hence the energy (11) is bounded by 1 w 2 2 + C N N max i=1 (y,h)∈Y×H ∆(yi , y, h) [1 + w, Ψ(xi , y, h) − w, Ψ(xi , yi , h∗ ) ] it (12) and the bound is strict for w = wt . [sent-188, score-0.244]
76 We further upper i bound the loss by ˛ ˛ ˆ ˆ ∆(yi , yi (w), hi (w)) ≤ ∆(yi , y, h) [1 + w, Ψ(xi , y, h) − w, Ψ(xi , yi , h∗ (w)) ] ˛ i ≤ max (y,h)∈Y×H ˆ y=ˆ i (w),h=hi (w) y ∆(yi , y, h) [1 + w, Ψ(xi , y, h) − w, Ψ(xi , yi , h∗ (w)) ] i (14) ˆ ˆ Substituting this bound into (2) yields (11). [sent-193, score-0.679]
77 Note that yi (w) and hi (w) are defined as the maximiser of w, Ψ(xi , y, h) alone (see Eq. [sent-194, score-0.292]
78 5); detecting trunctated instances, training on truncated instances, and counting the truncated cells as a feature (Sect. [sent-221, score-0.552]
79 Right panel: (a) Original VOC specified bounding box and aspect; (b) alignment and aspect after pose inference – in addition to translation and scale, our templates are searched over a set of small perturbations. [sent-226, score-0.941]
80 This is implemented efficiently by breaking the template into blocks (dashed boxes) and rearranging those. [sent-227, score-0.399]
81 The ground truth pose parameters are approximate because they are obtained from bounding boxes (a). [sent-229, score-0.775]
82 In the example, an instance originally labeled as misc (a) is reassigned to the left aspect (b). [sent-232, score-0.241]
83 Each object instance is labeled with a bounding box and a categorical aspect variable (left, right, front, back, undefined). [sent-235, score-0.498]
84 From the bounding box we estimate translation and scale of the object, and we use aspect to select one of multiple HOG templates. [sent-236, score-0.48]
85 left and right) are mapped to the same HOG template as suggested in Sect. [sent-239, score-0.322]
86 While our model is capable of handling correctly truncations, truncated bounding boxes provide a poor estimate of the pose of the object pose which prevents using such objects for training. [sent-242, score-1.415]
87 While we could simply avoid training with truncated boxes (or generate artificially truncated examples whose pose would be known), we prefer exploiting all the available training data. [sent-243, score-0.993]
88 To do this, we manually augment all truncated PASCAL VOC annotations with an additional “physical” bounding box. [sent-244, score-0.363]
89 The purpose is to provide a better initial guess for the object pose, which is then refined by optimizing over the latent variables. [sent-245, score-0.234]
90 This means matching a HOG template O(W HT A) times, where W × H is the dimension of the image in cells, T the number of perturbations (Sect. [sent-248, score-0.419]
91 1 For a given scale and aspect, matching the template for all locations reduces to convolution. [sent-253, score-0.281]
92 To detect additional objects at test time we run inference multiple times, but excluding all detections that overlap by more than 20% with any previously detected object. [sent-259, score-0.229]
93 The dashed boxes are bicycles that were detected with or without truncation support, while the solid ones were detectable only when truncations were considered explicitly. [sent-263, score-0.559]
94 The baseline model uses only the HOG template without bias, truncations, nor pose refinement (Sect. [sent-270, score-0.637]
95 The two most significant improvements are (a) the ability of detecting truncated instances (+22% AP, Fig. [sent-273, score-0.261]
96 Training with the truncated instances, adding the number of occluded HOG cells as a feature component, and adding jitters beyond translation and scaling all yield an improvement of about +2–3% AP. [sent-275, score-0.558]
97 Initially, the pose refinimenent h is null and the alternation optimization algorithm is iterated five times to estimate the model w and refinement h. [sent-278, score-0.356]
98 Conclusions We have shown how structured output regression with latent variables provides an integrated and effective solution to many problems in object detection: truncations, pose variability, multiple objects, and multiple aspects can all be dealt in a consistent framework. [sent-287, score-0.876]
99 While we have shown that truncated examples can be used for training, we had to manually extend the PASCAL VOC annotations for these cases to include rough “physical” bounding boxes (as a hint for the initial pose parameters). [sent-288, score-0.896]
100 We plan to further extend the approach to infer pose for truncated examples in a fully automatic fashion (weak supervision). [sent-289, score-0.536]
wordName wordTfidf (topN-words)
[('pose', 0.356), ('hog', 0.34), ('template', 0.281), ('voc', 0.244), ('truncations', 0.19), ('truncated', 0.18), ('boxes', 0.177), ('yi', 0.177), ('descriptor', 0.16), ('pascal', 0.157), ('bounding', 0.147), ('cells', 0.142), ('latent', 0.125), ('aspect', 0.125), ('hi', 0.115), ('truncation', 0.114), ('object', 0.109), ('aspects', 0.099), ('image', 0.098), ('occlusion', 0.092), ('overl', 0.087), ('translation', 0.083), ('rectangle', 0.082), ('instances', 0.081), ('box', 0.08), ('occluded', 0.077), ('occlusions', 0.076), ('bicycle', 0.076), ('detection', 0.066), ('rearranging', 0.066), ('descriptors', 0.066), ('alignment', 0.066), ('kernel', 0.065), ('cell', 0.063), ('rescaling', 0.062), ('facing', 0.062), ('transformations', 0.061), ('structured', 0.06), ('unde', 0.059), ('objects', 0.058), ('poses', 0.057), ('rotations', 0.057), ('gt', 0.056), ('detections', 0.055), ('templates', 0.053), ('wh', 0.052), ('aligned', 0.052), ('blocks', 0.052), ('ap', 0.051), ('restriction', 0.051), ('plane', 0.05), ('training', 0.05), ('truth', 0.049), ('bias', 0.048), ('xi', 0.047), ('ground', 0.046), ('gp', 0.046), ('multiple', 0.045), ('adjustment', 0.044), ('deactivating', 0.044), ('jitters', 0.044), ('kdescr', 0.044), ('kimage', 0.044), ('wbias', 0.044), ('left', 0.041), ('slack', 0.041), ('detected', 0.04), ('perturbations', 0.04), ('winn', 0.04), ('testing', 0.039), ('axis', 0.039), ('front', 0.038), ('nement', 0.038), ('blaschko', 0.038), ('misc', 0.038), ('bicycles', 0.038), ('instance', 0.037), ('variables', 0.037), ('annotations', 0.036), ('wt', 0.036), ('anisotropic', 0.035), ('wise', 0.035), ('specify', 0.034), ('perturbation', 0.034), ('detector', 0.034), ('loss', 0.033), ('cccp', 0.033), ('handling', 0.032), ('scaling', 0.032), ('energy', 0.031), ('histograms', 0.031), ('inference', 0.031), ('vedaldi', 0.031), ('dalal', 0.031), ('symmetries', 0.031), ('category', 0.031), ('score', 0.03), ('rotation', 0.03), ('williams', 0.03), ('decomposing', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 236 nips-2009-Structured output regression for detection with partial truncation
Author: Andrea Vedaldi, Andrew Zisserman
Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. 1
2 0.16487153 201 nips-2009-Region-based Segmentation and Object Detection
Author: Stephen Gould, Tianshi Gao, Daphne Koller
Abstract: Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other [10, 11]. However, current state-of-the-art models use a separate representation for each task making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitate the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene—we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset [2] and show significant improvement over state-of-the-art results for object detection accuracy. 1
3 0.16108115 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Author: Jie Luo, Barbara Caputo, Vittorio Ferrari
Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images. 1
4 0.13837533 104 nips-2009-Group Sparse Coding
Author: Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow
Abstract: Bag-of-words document representations are often used in text, image and video processing. While it is relatively easy to determine a suitable word dictionary for text documents, there is no simple mapping from raw images or videos to dictionary terms. The classical approach builds a dictionary using vector quantization over a large set of useful visual descriptors extracted from a training set, and uses a nearest-neighbor algorithm to count the number of occurrences of each dictionary word in documents to be encoded. More robust approaches have been proposed recently that represent each visual descriptor as a sparse weighted combination of dictionary words. While favoring a sparse representation at the level of visual descriptors, those methods however do not ensure that images have sparse representation. In this work, we use mixed-norm regularization to achieve sparsity at the image level as well as a small overall dictionary. This approach can also be used to encourage using the same dictionary words for all the images in a class, providing a discriminative signal in the construction of image representations. Experimental results on a benchmark image classification dataset show that when compact image or dictionary representations are needed for computational efficiency, the proposed approach yields better mean average precision in classification. 1
5 0.1332171 133 nips-2009-Learning models of object structure
Author: Joseph Schlecht, Kobus Barnard
Abstract: We present an approach for learning stochastic geometric models of object categories from single view images. We focus here on models expressible as a spatially contiguous assemblage of blocks. Model topologies are learned across groups of images, and one or more such topologies is linked to an object category (e.g. chairs). Fitting learned topologies to an image can be used to identify the object class, as well as detail its geometry. The latter goes beyond labeling objects, as it provides the geometric structure of particular instances. We learn the models using joint statistical inference over category parameters, camera parameters, and instance parameters. These produce an image likelihood through a statistical imaging model. We use trans-dimensional sampling to explore topology hypotheses, and alternate between Metropolis-Hastings and stochastic dynamics to explore instance parameters. Experiments on images of furniture objects such as tables and chairs suggest that this is an effective approach for learning models that encode simple representations of category geometry and the statistics thereof, and support inferring both category and geometry on held out single view images. 1
6 0.12916394 211 nips-2009-Segmenting Scenes by Matching Image Composites
7 0.12812354 46 nips-2009-Bilinear classifiers for visual recognition
8 0.1223155 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
9 0.11256163 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis
10 0.093670174 175 nips-2009-Occlusive Components Analysis
11 0.085630894 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks
12 0.084858291 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation
13 0.082987338 89 nips-2009-FACTORIE: Probabilistic Programming via Imperatively Defined Factor Graphs
14 0.076587379 72 nips-2009-Distribution Matching for Transduction
15 0.07100293 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
16 0.067700215 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection
17 0.067272618 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
18 0.066823401 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection
19 0.066594206 151 nips-2009-Measuring Invariances in Deep Networks
20 0.066484421 47 nips-2009-Boosting with Spatial Regularization
topicId topicWeight
[(0, -0.229), (1, -0.09), (2, -0.149), (3, -0.018), (4, -0.018), (5, 0.127), (6, -0.02), (7, 0.096), (8, 0.121), (9, -0.025), (10, 0.067), (11, -0.05), (12, 0.016), (13, -0.045), (14, -0.028), (15, 0.086), (16, 0.028), (17, -0.029), (18, 0.13), (19, -0.071), (20, 0.037), (21, -0.027), (22, -0.035), (23, -0.044), (24, -0.06), (25, 0.092), (26, 0.043), (27, -0.007), (28, 0.07), (29, 0.044), (30, 0.047), (31, -0.004), (32, -0.029), (33, 0.033), (34, 0.029), (35, 0.032), (36, -0.134), (37, -0.14), (38, 0.167), (39, 0.125), (40, 0.068), (41, -0.026), (42, -0.018), (43, 0.019), (44, 0.05), (45, 0.013), (46, 0.081), (47, 0.046), (48, -0.11), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.94604987 236 nips-2009-Structured output regression for detection with partial truncation
Author: Andrea Vedaldi, Andrew Zisserman
Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. 1
2 0.68413043 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Author: Jie Luo, Barbara Caputo, Vittorio Ferrari
Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images. 1
3 0.66465938 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
Author: Mario Fritz, Gary Bradski, Sergey Karayev, Trevor Darrell, Michael J. Black
Abstract: Existing methods for visual recognition based on quantized local features can perform poorly when local features exist on transparent surfaces, such as glass or plastic objects. There are characteristic patterns to the local appearance of transparent objects, but they may not be well captured by distances to individual examples or by a local pattern codebook obtained by vector quantization. The appearance of a transparent patch is determined in part by the refraction of a background pattern through a transparent medium: the energy from the background usually dominates the patch appearance. We model transparent local patch appearance using an additive model of latent factors: background factors due to scene content, and factors which capture a local edge energy distribution characteristic of the refraction. We implement our method using a novel LDA-SIFT formulation which performs LDA prior to any vector quantization step; we discover latent topics which are characteristic of particular transparent patches and quantize the SIFT space into transparent visual words according to the latent topic dimensions. No knowledge of the background scene is required at test time; we show examples recognizing transparent glasses in a domestic environment. 1
4 0.60238212 201 nips-2009-Region-based Segmentation and Object Detection
Author: Stephen Gould, Tianshi Gao, Daphne Koller
Abstract: Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other [10, 11]. However, current state-of-the-art models use a separate representation for each task making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitate the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene—we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset [2] and show significant improvement over state-of-the-art results for object detection accuracy. 1
5 0.58720291 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation
Author: Lan Du, Lu Ren, Lawrence Carin, David B. Dunson
Abstract: A non-parametric Bayesian model is proposed for processing multiple images. The analysis employs image features and, when present, the words associated with accompanying annotations. The model clusters the images into classes, and each image is segmented into a set of objects, also allowing the opportunity to assign a word to each object (localized labeling). Each object is assumed to be represented as a heterogeneous mix of components, with this realized via mixture models linking image features to object types. The number of image classes, number of object types, and the characteristics of the object-feature mixture models are inferred nonparametrically. To constitute spatially contiguous objects, a new logistic stick-breaking process is developed. Inference is performed efficiently via variational Bayesian analysis, with example results presented on two image databases.
6 0.5796839 211 nips-2009-Segmenting Scenes by Matching Image Composites
7 0.56607926 133 nips-2009-Learning models of object structure
8 0.56586242 175 nips-2009-Occlusive Components Analysis
9 0.53747517 46 nips-2009-Bilinear classifiers for visual recognition
10 0.53586942 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection
11 0.49287838 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis
12 0.48761329 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
13 0.4806464 44 nips-2009-Beyond Categories: The Visual Memex Model for Reasoning About Object Relationships
14 0.47218671 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
15 0.44171673 172 nips-2009-Nonparametric Bayesian Texture Learning and Synthesis
16 0.42935777 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
17 0.41854137 72 nips-2009-Distribution Matching for Transduction
18 0.40506071 104 nips-2009-Group Sparse Coding
19 0.39613011 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection
20 0.39567119 96 nips-2009-Filtering Abstract Senses From Image Search Results
topicId topicWeight
[(24, 0.034), (25, 0.071), (35, 0.493), (36, 0.089), (39, 0.038), (55, 0.018), (58, 0.038), (71, 0.033), (81, 0.01), (86, 0.079), (91, 0.012)]
simIndex simValue paperId paperTitle
1 0.9745459 10 nips-2009-A Gaussian Tree Approximation for Integer Least-Squares
Author: Jacob Goldberger, Amir Leshem
Abstract: This paper proposes a new algorithm for the linear least squares problem where the unknown variables are constrained to be in a finite set. The factor graph that corresponds to this problem is very loopy; in fact, it is a complete graph. Hence, applying the Belief Propagation (BP) algorithm yields very poor results. The algorithm described here is based on an optimal tree approximation of the Gaussian density of the unconstrained linear system. It is shown that even though the approximation is not directly applied to the exact discrete distribution, applying the BP algorithm to the modified factor graph outperforms current methods in terms of both performance and complexity. The improved performance of the proposed algorithm is demonstrated on the problem of MIMO detection.
2 0.93707973 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperform a number of state of the art methods. 1
same-paper 3 0.93519896 236 nips-2009-Structured output regression for detection with partial truncation
Author: Andrea Vedaldi, Andrew Zisserman
Abstract: We develop a structured output model for object category detection that explicitly accounts for alignment, multiple aspects and partial truncation in both training and inference. The model is formulated as large margin learning with latent variables and slack rescaling, and both training and inference are computationally efficient. We make the following contributions: (i) we note that extending the Structured Output Regression formulation of Blaschko and Lampert [1] to include a bias term significantly improves performance; (ii) that alignment (to account for small rotations and anisotropic scalings) can be included as a latent variable and efficiently determined and implemented; (iii) that the latent variable extends to multiple aspects (e.g. left facing, right facing, front) with the same formulation; and (iv), most significantly for performance, that truncated and truncated instances can be included in both training and inference with an explicit truncation mask. We demonstrate the method by training and testing on the PASCAL VOC 2007 data set – training includes the truncated examples, and in testing object instances are detected at multiple scales, alignments, and with significant truncations. 1
4 0.90547013 209 nips-2009-Robust Value Function Approximation Using Bilinear Programming
Author: Marek Petrik, Shlomo Zilberstein
Abstract: Existing value function approximation methods have been successfully used in many applications, but they often lack useful a priori error bounds. We propose approximate bilinear programming, a new formulation of value function approximation that provides strong a priori guarantees. In particular, this approach provably finds an approximate value function that minimizes the Bellman residual. Solving a bilinear program optimally is NP-hard, but this is unavoidable because the Bellman-residual minimization itself is NP-hard. We therefore employ and analyze a common approximate algorithm for bilinear programs. The analysis shows that this algorithm offers a convergent generalization of approximate policy iteration. Finally, we demonstrate that the proposed approach can consistently minimize the Bellman residual on a simple benchmark problem. 1 Motivation Solving large Markov Decision Problems (MDPs) is a very useful, but computationally challenging problem addressed widely in the AI literature, particularly in the area of reinforcement learning. It is widely accepted that large MDPs can only be solved approximately. The commonly used approximation methods can be divided into three broad categories: 1) policy search, which explores a restricted space of all policies, 2) approximate dynamic programming, which searches a restricted space of value functions, and 3) approximate linear programming, which approximates the solution using a linear program. While all of these methods have achieved impressive results in many domains, they have significant limitations. Policy search methods rely on local search in a restricted policy space. The policy may be represented, for example, as a finite-state controller [22] or as a greedy policy with respect to an approximate value function [24]. Policy search methods have achieved impressive results in such domains as Tetris [24] and helicopter control [1]. However, they are notoriously hard to analyze. We are not aware of any theoretical guarantees regarding the quality of the solution. Approximate dynamic programming (ADP) methods iteratively approximate the value function [4, 20, 23]. They have been extensively analyzed and are the most commonly used methods. However, ADP methods typically do not converge and they only provide weak guarantees of approximation quality. The approximation error bounds are usually expressed in terms of the worst-case approximation of the value function over all policies [4]. In addition, most available bounds are with respect to the L∞ norm, while the algorithms often minimize the L2 norm. While there exist some L2 -based bounds [14], they require values that are difficult to obtain. Approximate linear programming (ALP) uses a linear program to compute the approximate value function in a particular vector space [7]. ALP has been previously used in a wide variety of settings [2, 9, 10]. Although ALP often does not perform as well as ADP, there have been some recent 1 efforts to close the gap [18]. ALP has better theoretical properties than ADP and policy search. It is guaranteed to converge and return the closest L1 -norm approximation v of the optimal value func˜ tion v ∗ up to a multiplicative factor. However, the L1 norm must be properly weighted to guarantee a small policy loss, and there is no reliable method for selecting appropriate weights [7]. To summarize, the existing reinforcement learning techniques often provide good solutions, but typically require significant domain knowledge [20]. The domain knowledge is needed partly because useful a priori error bounds are not available, as mentioned above. Our goal is to develop a more robust method that is guaranteed to minimize an actual bound on the policy loss. We present a new formulation of value function approximation that provably minimizes a bound on the policy loss. Unlike in some other algorithms, the bound in this case does not rely on values that are hard to obtain. The new method unifies policy search and value-function search methods to minimize the L∞ norm of the Bellman residual, which bounds the policy loss. We start with a description of the framework and notation in Section 2. Then, in Section 3, we describe the proposed Approximate Bilinear Programming (ABP) formulation. A drawback of this formulation is its computational complexity, which may be exponential. We show in Section 4 that this is unavoidable, because minimizing the approximation error bound is in fact NP-hard. Although our focus is on the formulation and its properties, we also discuss some simple algorithms for solving bilinear programs. Section 5 shows that ABP can be seen as an improvement of ALP and Approximate Policy Iteration (API). Section 6 demonstrates the applicability of ABP using a common reinforcement learning benchmark problem. A complete discussion of sampling strategies–an essential component for achieving robustness–is beyond the scope of this paper, but the issue is briefly discussed in Section 6. Complete proofs of the theorems can be found in [19]. 2 Solving MDPs using ALP In this section, we formally define MDPs, their ALP formulation, and the approximation errors involved. These notions serve as a basis for developing the ABP formulation. A Markov Decision Process is a tuple (S, A, P, r, α), where S is the finite set of states, A is the finite set of actions. P : S × S × A → [0, 1] is the transition function, where P (s , s, a) represents the probability of transiting to state s from state s, given action a. The function r : S × A → R is the reward function, and α : S → [0, 1] is the initial state distribution. The objective is to maximize the infinite-horizon discounted cumulative reward. To shorten the notation, we assume an arbitrary ordering of the states: s1 , s2 , . . . , sn . Then, Pa and ra are used to denote the probabilistic transition matrix and reward for action a. The solution of an MDP is a policy π : S × A → [0, 1] from a set of possible policies Π, such that for all s ∈ S, a∈A π(s, a) = 1. We assume that the policies may be stochastic, but stationary [21]. A policy is deterministic when π(s, a) ∈ {0, 1} for all s ∈ S and a ∈ A. The transition and reward functions for a given policy are denoted by Pπ and rπ . The value function update for a policy π is denoted by Lπ , and the Bellman operator is denoted by L. That is: Lπ v = Pπ v + rπ Lv = max Lπ v. π∈Π The optimal value function, denoted v ∗ , satisfies v ∗ = Lv ∗ . We focus on linear value function approximation for discounted infinite-horizon problems. In linear value function approximation, the value function is represented as a linear combination of nonlinear basis functions (vectors). For each state s, we define a row-vector φ(s) of features. The rows of the basis matrix M correspond to φ(s), and the approximation space is generated by the columns of the matrix. That is, the basis matrix M , and the value function v are represented as: − φ(s1 ) − M = − φ(s2 ) − v = M x. . . . Definition 1. A value function, v, is representable if v ∈ M ⊆ R|S| , where M = colspan (M ), and is transitive-feasible when v ≥ Lv. We denote the set of transitive-feasible value functions as: K = {v ∈ R|S| v ≥ Lv}. 2 Notice that the optimal value function v ∗ is transitive-feasible, and M is a linear space. Also, all the inequalities are element-wise. Because the new formulation is related to ALP, we introduce it first. It is well known that an infinite horizon discounted MDP problem may be formulated in terms of solving the following linear program: minimize v c(s)v(s) s∈S v(s) − γ s.t. P (s , s, a)v(s ) ≥ r(s, a) ∀(s, a) ∈ (S, A) (1) s ∈S We use A as a shorthand notation for the constraint matrix and b for the right-hand side. The value c represents a distribution over the states, usually a uniform one. That is, s∈S c(s) = 1. The linear program in Eq. (1) is often too large to be solved precisely, so it is approximated to get an approximate linear program by assuming that v ∈ M [8], as follows: minimize cT v x Av ≥ b s.t. (2) v∈M The constraint v ∈ M denotes the approximation. To actually solve this linear program, the value function is represented as v = M x. In the remainder of the paper, we assume that 1 ∈ M to guarantee the feasibility of the ALP, where 1 is a vector of all ones. The optimal solution of the ALP, v , satisfies that v ≥ v ∗ . Then, the objective of Eq. (2) represents the minimization of v − v ∗ 1,c , ˜ ˜ ˜ where · 1,c is a c-weighted L1 norm [7]. The ultimate goal of the optimization is not to obtain a good value function v , but a good policy. ˜ The quality of the policy, typically chosen to be greedy with respect to v , depends non-trivially on ˜ the approximate value function. The ABP formulation will minimize policy loss by minimizing L˜ − v ∞ , which bounds the policy loss as follows. v ˜ Theorem 2 (e.g. [25]). Let v be an arbitrary value function, and let v be the value of the greedy ˜ ˆ policy with respect to v . Then: ˜ 2 v∗ − v ∞ ≤ ˆ L˜ − v ∞ , v ˜ 1−γ In addition, if v ≥ L˜, the policy loss is smallest for the greedy policy. ˜ v Policies, like value functions, can be represented as vectors. Assume an arbitrary ordering of the state-action pairs, such that o(s, a) → N maps a state and an action to its position. The policies are represented as θ ∈ R|S|×|A| , and we use the shorthand notation θ(s, a) = θ(o(s, a)). Remark 3. The corresponding π and θ are denoted as π θ and θπ and satisfy: π θ (s, a) = θπ (s, a). We will also consider approximations of the policies in the policy-space, generated by columns of a matrix N . A policy is representable when π ∈ N , where N = colspan (N ). 3 Approximate Bilinear Programs This section shows how to formulate minv∈M Lv − v ∞ as a separable bilinear program. Bilinear programs are a generalization of linear programs with an additional bilinear term in the objective function. A separable bilinear program consists of two linear programs with independent constraints and are fairly easy to solve and analyze. Definition 4 (Separable Bilinear Program). A separable bilinear program in the normal form is defined as follows: T T minimize f (w, x, y, z) = sT w + r1 x + xT Cy + r2 y + sT z 1 2 w,x y,z s.t. A1 x + B1 w = b1 A2 y + B2 z = b2 w, x ≥ 0 y, z ≥ 0 3 (3) We separate the variables using a vertical line and the constraints using different columns to emphasize the separable nature of the bilinear program. In this paper, we only use separable bilinear programs and refer to them simply as bilinear programs. An approximate bilinear program can now be formulated as follows. minimize θT λ + λ θ λ,λ ,v Bθ = 1 z = Av − b s.t. θ≥0 z≥0 (4) λ+λ1≥z λ≥0 θ∈N v∈M All variables are vectors except λ , which is a scalar. The symbol z is only used to simplify the notation and does not need to represent an optimization variable. The variable v is defined for each state and represents the value function. Matrix A represents constraints that are identical to the constraints in Eq. (2). The variables λ correspond to all state-action pairs. These variables represent the Bellman residuals that are being minimized. The variables θ are defined for all state-action pairs and represent policies in Remark 3. The matrix B represents the following constraints: θ(s, a) = 1 ∀s ∈ S. a∈A As with approximate linear programs, we initially assume that all the constraints on z are used. In realistic settings, however, the constraints would be sampled or somehow reduced. We defer the discussion of this issue until Section 6. Note that the constraints in our formulation correspond to elements of z and θ. Thus when constraints are omitted, also the corresponding elements of z and θ are omitted. To simplify the notation, the value function approximation in this problem is denoted only implicitly by v ∈ M, and the policy approximation is denoted by θ ∈ N . In an actual implementation, the optimization variables would be x, y using the relationships v = M x and θ = N y. We do not assume any approximation of the policy space, unless mentioned otherwise. We also use v or θ to refer to partial solutions of Eq. (4) with the other variables chosen appropriately to achieve feasibility. The ABP formulation is closely related to approximate linear programs, and we discuss the connection in Section 5. We first analyze the properties of the optimal solutions of the bilinear program and then show and discuss the solution methods in Section 4. The following theorem states the main property of the bilinear formulation. ˜˜ ˜ ˜ Theorem 5. b Let (θ, v , λ, λ ) be an optimal solution of Eq. (4) and assume that 1 ∈ M. Then: ˜ ˜ ˜ θT λ + λ = L˜ − v v ˜ ∞ ≤ min v∈K∩M Lv − v ∞ ≤ 2 min Lv − v v∈M ∞ ≤ 2(1 + γ) min v − v ∗ v∈M ∞. ˜ In addition, π θ minimizes the Bellman residual with regard to v , and its value function v satisfies: ˜ ˆ 2 min Lv − v ∞ . v − v∗ ∞ ≤ ˆ 1 − γ v∈M The proof of the theorem can be found in [19]. It is important to note that, as Theorem 5 states, the ABP approach is equivalent to a minimization over all representable value functions, not only the transitive-feasible ones. Notice also the missing coefficient 2 (2 instead of 4) in the last equation of Theorem 5. This follows by subtracting a constant vector 1 from v to balance the lower bounds ˜ on the Bellman residual error with the upper ones. This modified approximate value function will have 1/2 of the original Bellman residual but an identical greedy policy. Finally, note that whenever v ∗ ∈ M, both ABP and ALP will return the optimal value function. The ABP solution minimizes the L∞ norm of the Bellman residual due to: 1) the correspondence between θ and the policies, and 2) the dual representation with respect to variables λ and λ . The theorem then follows using techniques similar to those used for approximate linear programs [7]. 4 Algorithm 1: Iterative algorithm for solving Eq. (3) (x0 , w0 ) ← random ; (y0 , z0 ) ← arg miny,z f (w0 , x0 , y, z) ; i←1; while yi−1 = yi or xi−1 = xi do (yi , zi ) ← arg min{y,z A2 y+B2 z=b2 y,z≥0} f (wi−1 , xi−1 , y, z) ; (xi , wi ) ← arg min{x,w A1 x+B1 w=b1 x,w≥0} f (w, x, yi , zi ) ; i←i+1 return f (wi , xi , yi , zi ) 4 Solving Bilinear Programs In this section we describe simple methods for solving ABPs. We first describe optimal methods, which have exponential complexity, and then discuss some approximation strategies. Solving a bilinear program is an NP-complete problem [3]. The membership in NP follows from the finite number of basic feasible solutions of the individual linear programs, each of which can be checked in polynomial time. The NP-hardness is shown by a reduction from the SAT problem [3]. The NP-completeness of ABP compares unfavorably with the polynomial complexity of ALP. However, most other ADP algorithms are not guaranteed to converge to a solution in finite time. The following theorem shows that the computational complexity of the ABP formulation is asymptotically the same as the complexity of the problem it solves. Theorem 6. b Determining minv∈K∩M Lv − v ∞ < is NP-complete for the full constraint representation, 0 < γ < 1, and a given > 0. In addition, the problem remains NP-complete when 1 ∈ M, and therefore minv∈M Lv − v ∞ < is also NP-complete. As the theorem states, the value function approximation does not become computationally simpler even when 1 ∈ M – a universal assumption in the paper. Notice that ALP can determine whether minv∈K∩M Lv − v ∞ = 0 in polynomial time. The proof of Theorem 6 is based on a reduction from SAT and can be found in [19]. The policy in the reduction determines the true literal in each clause, and the approximate value function corresponds to the truth value of the literals. The approximation basis forces literals that share the same variable to have consistent values. Bilinear programs are non-convex and are typically solved using global optimization techniques. The common solution methods are based on concave cuts [11] or branch-and-bound [6]. In ABP settings with a small number of features, the successive approximation algorithm [17] may be applied efficiently. We are, however, not aware of commercial solvers available for solving bilinear programs. Bilinear programs can be formulated as concave quadratic minimization problems [11], or mixed integer linear programs [11, 16], for which there are numerous commercial solvers available. Because we are interested in solving very large bilinear programs, we describe simple approximate algorithms next. Optimal scalable methods are beyond the scope of this paper. The most common approximate method for solving bilinear programs is shown in Algorithm 1. It is designed for the general formulation shown in Eq. (3), where f (w, x, y, z) represents the objective function. The minimizations in the algorithm are linear programs which can be easily solved. Interestingly, as we will show in Section 5, Algorithm 1 applied to ABP generalizes a version of API. While Algorithm 1 is not guaranteed to find an optimal solution, its empirical performance is often remarkably good [13]. Its basic properties are summarized by the following proposition. Proposition 7 (e.g. [3]). Algorithm 1 is guaranteed to converge, assuming that the linear program solutions are in a vertex of the optimality simplex. In addition, the global optimum is a fixed point of the algorithm, and the objective value monotonically improves during execution. 5 The proof is based on the finite count of the basic feasible solutions of the individual linear programs. Because the objective function does not increase in any iteration, the algorithm will eventually converge. In the context of MDPs, Algorithm 1 can be further refined. For example, the constraint v ∈ M in Eq. (4) serves mostly to simplify the bilinear program and a value function that violates it may still be acceptable. The following proposition motivates the construction of a new value function from two transitive-feasible value functions. Proposition 8. Let v1 and v2 be feasible value functions in Eq. (4). Then the value function ˜ ˜ v (s) = min{˜1 (s), v2 (s)} is also feasible in Eq. (4). Therefore v ≥ v ∗ and v ∗ − v ∞ ≤ ˜ v ˜ ˜ ˜ min { v ∗ − v1 ∞ , v ∗ − v2 ∞ }. ˜ ˜ The proof of the proposition is based on Jensen’s inequality and can be found in [19]. Proposition 8 can be used to extend Algorithm 1 when solving ABPs. One option is to take the state-wise minimum of values from multiple random executions of Algorithm 1, which preserves the transitive feasibility of the value function. However, the increasing number of value functions used to obtain v also increases the potential sampling error. ˜ 5 Relationship to ALP and API In this section, we describe the important connections between ABP and the two closely related ADP methods: ALP, and API with L∞ minimization. Both of these methods are commonly used, for example to solve factored MDPs [10]. Our analysis sheds light on some of their observed properties and leads to a new convergent form of API. ABP addresses some important issues with ALP: 1) ALP provides value function bounds with respect to L1 norm, which does not guarantee small policy loss, 2) ALP’s solution quality depends significantly on the heuristically-chosen objective function c in Eq. (2) [7], and 3) incomplete constraint samples in ALP easily lead to unbounded linear programs. The drawback of using ABP, however, is the higher computational complexity. Both the first and the second issues in ALP can be addressed by choosing the right objective function [7]. Because this objective function depends on the optimal ALP solution, it cannot be practically computed. Instead, various heuristics are usually used. The heuristic objective functions may lead to significant improvements in specific domains, but they do not provide any guarantees. ABP, on the other hand, has no such parameters that require adjustments. The third issue arises when the constraints of an ALP need to be sampled in some large domains. The ALP may become unbounded with incomplete samples because its objective value is defined using the L1 norm on the states, and the constraints are defined using the L∞ norm of the Bellman residual. In ABP, the Bellman residual is used in both the constraints and objective function. The objective function of ABP is then bounded below by 0 for an arbitrarily small number of samples. ABP can also improve on API with L∞ minimization (L∞ -API for short), which is a leading method for solving factored MDPs [10]. Minimizing the L∞ approximation error is theoretically preferable, since it is compatible with the existing bounds on policy loss [10]. In contrast, few practical bounds exist for API with the L2 norm minimization [14], such as LSPI [12]. L∞ -API is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. (I − γPπ )v + 1φ ≥ rπ −(I − γPπ )v + 1φ ≥ −rπ (5) v∈M Here I denotes the identity matrix. We are not aware of a convergence or a divergence proof of L∞ -API, and this analysis is beyond the scope of this paper. 6 Algorithm 2: Approximate policy iteration, where f (π) denotes a custom value function approximation for the policy π. π0 , k ← rand, 1 ; while πk = πk−1 do vk ← f (πk−1 ) ; ˜ πk (s) ← arg maxa∈A r(s, a) + γ s ∈S P (s , s, a)˜k (s) ∀s ∈ S ; v k ←k+1 We propose Optimistic Approximate Policy Iteration (OAPI), a modification of API. OAPI is shown in Algorithm 2, where f (π) is calculated using the following program: minimize φ φ,v s.t. Av ≥ b (≡ (I − γPπ )v ≥ rπ ∀π ∈ Π) −(I − γPπ )v + 1φ ≥ −rπ (6) v∈M In fact, OAPI corresponds to Algorithm 1 applied to ABP because Eq. (6) corresponds to Eq. (4) with fixed θ. Then, using Proposition 7, we get the following corollary. Corollary 9. Optimistic approximate policy iteration converges in finite time. In addition, the Bellman residual of the generated value functions monotonically decreases. OAPI differs from L∞ -API in two ways: 1) OAPI constrains the Bellman residuals by 0 from below and by φ from above, and then it minimizes φ. L∞ -API constrains the Bellman residuals by φ from both above and below. 2) OAPI, like API, uses only the current policy for the upper bound on the Bellman residual, but uses all the policies for the lower bound on the Bellman residual. L∞ -API cannot return an approximate value function that has a lower Bellman residual than ABP, given the optimality of ABP described in Theorem 5. However, even OAPI, an approximate ABP algorithm, performs comparably to L∞ -API, as the following theorem states. Theorem 10. b Assume that L∞ -API converges to a policy π and a value function v that both φ satisfy: φ = v − Lπ v ∞ = v − Lv ∞ . Then v = v + 1−γ 1 is feasible in Eq. (4), and it is a fixed ˜ point of OAPI. In addition, the greedy policies with respect to v and v are identical. ˜ The proof is based on two facts. First, v is feasible with respect to the constraints in Eq. (4). The ˜ Bellman residual changes for all the policies identically, since a constant vector is added. Second, because Lπ is greedy with respect to v , we have that v ≥ Lπ v ≥ L˜. The value function v is ˜ ˜ ˜ v ˜ therefore transitive-feasible. The full proof can be found in [19]. To summarize, OAPI guarantees convergence, while matching the performance of L∞ -API. The convergence of OAPI is achieved because given a non-negative Bellman residual, the greedy policy also minimizes the Bellman residual. Because OAPI ensures that the Bellman residual is always non-negative, it can progressively reduce it. In comparison, the greedy policy in L∞ -API does not minimize the Bellman residual, and therefore L∞ -API does not always reduce it. Theorem 10 also explains why API provides better solutions than ALP, as observed in [10]. From the discussion above, ALP can be seen as an L1 -norm approximation of a single iteration of OAPI. L∞ -API, on the other hand, performs many such ALP-like iterations. 6 Empirical Evaluation As we showed in Theorem 10, even OAPI, the very simple approximate algorithm for ABP, can perform as well as existing state-of-the art methods on factored MDPs. However, a deeper understanding of the formulation and potential solution methods will be necessary in order to determine the full practical impact of the proposed methods. In this section, we validate the approach by applying it to the mountain car problem, a simple reinforcement learning benchmark problem. We have so far considered that all the constraints involving z are present in the ABP in Eq. (4). Because the constraints correspond to all state-action pairs, it is often impractical to even enumerate 7 (a) L∞ error of the Bellman residual Features 100 144 OAPI 0.21 (0.23) 0.13 (0.1) ALP 13. (13.) 3.6 (4.3) LSPI 9. (14.) 3.9 (7.7) API 0.46 (0.08) 0.86 (1.18) (b) L2 error of the Bellman residual Features 100 144 OAPI 0.2 (0.3) 0.1 (1.9) ALP 9.5 (18.) 0.3 (0.4) LSPI 1.2 (1.5) 0.9 (0.1) API 0.04 (0.01) 0.08 (0.08) Table 1: Bellman residual of the final value function. The values are averages over 5 executions, with the standard deviations shown in parentheses. them. This issue can be addressed in at least two ways. First, a small randomly-selected subset of the constraints can be used in the ABP, a common approach in ALP [9, 5]. The ALP sampling bounds can be easily extended to ABP. Second, the structure of the MDP can be used to reduce the number of constraints. Such a reduction is possible, for example, in factored MDPs with L∞ -API and ALP [10], and can be easily extended to OAPI and ABP. In the mountain-car benchmark, an underpowered car needs to climb a hill [23]. To do so, it first needs to back up to an opposite hill to gain sufficient momentum. The car receives a reward of 1 when it climbs the hill. In the experiments we used a discount factor γ = 0.99. The experiments are designed to determine whether OAPI reliably minimizes the Bellman residual in comparison with API and ALP. We use a uniformly-spaced linear spline to approximate the value function. The constraints were based on 200 uniformly sampled states with all 3 actions per state. We evaluated the methods with the number of the approximation features 100 and 144, which corresponds to the number of linear segments. The results of ABP (in particular OAPI), ALP, API with L2 minimization, and LSPI are depicted in Table 1. The results are shown for both L∞ norm and uniformly-weighted L2 norm. The runtimes of all these methods are comparable, with ALP being the fastest. Since API (LSPI) is not guaranteed to converge, we ran it for at most 20 iterations, which was an upper bound on the number of iterations of OAPI. The results demonstrate that ABP minimizes the L∞ Bellman residual much more consistently than the other methods. Note, however, that all the considered algorithms would perform significantly better given a finer approximation. 7 Conclusion and Future Work We proposed and analyzed approximate bilinear programming, a new value-function approximation method, which provably minimizes the L∞ Bellman residual. ABP returns the optimal approximate value function with respect to the Bellman residual bounds, despite the formulation with regard to transitive-feasible value functions. We also showed that there is no asymptotically simpler formulation, since finding the closest value function and solving a bilinear program are both NP-complete problems. Finally, the formulation leads to the development of OAPI, a new convergent form of API which monotonically improves the objective value function. While we only discussed approximate solutions of the ABP, a deeper study of bilinear solvers may render optimal solution methods feasible. ABPs have a small number of essential variables (that determine the value function) and a large number of constraints, which can be leveraged by the solvers [15]. The L∞ error bound provides good theoretical guarantees, but it may be too conservative in practice. A similar formulation based on L2 norm minimization may be more practical. We believe that the proposed formulation will help to deepen the understanding of value function approximation and the characteristics of existing solution methods, and potentially lead to the development of more robust and widely-applicable reinforcement learning algorithms. Acknowledgements This work was supported by the Air Force Office of Scientific Research under Grant No. FA955008-1-0171. We also thank the anonymous reviewers for their useful comments. 8 References [1] Pieter Abbeel, Varun Ganapathi, and Andrew Y. Ng. Learning vehicular dynamics, with application to modeling helicopters. In Advances in Neural Information Processing Systems, pages 1–8, 2006. [2] Daniel Adelman. A price-directed approach to stochastic inventory/routing. Operations Research, 52:499–514, 2004. [3] Kristin P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Technical report, Computer Science Department, University of Wisconsin, 1992. [4] Dimitri P. Bertsekas and Sergey Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical Report LIDS-P-2349, LIDS, 1997. [5] Guiuseppe Calafiore and M.C. Campi. Uncertain convex programs: Randomized solutions and confidence levels. Mathematical Programming, Series A, 102:25–46, 2005. [6] Alberto Carpara and Michele Monaci. Bidimensional packing by bilinear programming. Mathematical Programming Series A, 118:75–108, 2009. [7] Daniela P. de Farias. The Linear Programming Approach to Approximate Dynamic Programming: Theory and Application. PhD thesis, Stanford University, 2002. [8] Daniela P. de Farias and Ben Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51:850–856, 2003. [9] Daniela Pucci de Farias and Benjamin Van Roy. On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3):462–478, 2004. [10] Carlos Guestrin, Daphne Koller, Ronald Parr, and Shobha Venkataraman. Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research, 19:399–468, 2003. [11] Reiner Horst and Hoang Tuy. Global optimization: Deterministic approaches. Springer, 1996. [12] Michail G. Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149, 2003. [13] O. L. Mangasarian. The linear complementarity problem as a separable bilinear program. Journal of Global Optimization, 12:1–7, 1995. [14] Remi Munos. Error bounds for approximate policy iteration. In International Conference on Machine Learning, pages 560–567, 2003. [15] Marek Petrik and Shlomo Zilberstein. Anytime coordination using separable bilinear programs. In Conference on Artificial Intelligence, pages 750–755, 2007. [16] Marek Petrik and Shlomo Zilberstein. Average reward decentralized Markov decision processes. In International Joint Conference on Artificial Intelligence, pages 1997–2002, 2007. [17] Marek Petrik and Shlomo Zilberstein. A bilinear programming approach for multiagent planning. Journal of Artificial Intelligence Research, 35:235–274, 2009. [18] Marek Petrik and Shlomo Zilberstein. Constraint relaxation in approximate linear programs. In International Conference on Machine Learning, pages 809–816, 2009. [19] Marek Petrik and Shlomo Zilberstein. Robust value function approximation using bilinear programming. Technical Report UM-CS-2009-052, Department of Computer Science, University of Massachusetts Amherst, 2009. [20] Warren B. Powell. Approximate Dynamic Programming. Wiley-Interscience, 2007. [21] Martin L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, Inc., 2005. [22] Kenneth O. Stanley and Risto Miikkulainen. Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21:63–100, 2004. [23] Richard S. Sutton and Andrew Barto. Reinforcement learning. MIT Press, 1998. [24] Istvan Szita and Andras Lorincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006. [25] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Yale Workshop on Adaptive and Learning Systems, 1994. 9
5 0.86767018 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
Author: Manqi Zhao, Venkatesh Saligrama
Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.
6 0.85344696 219 nips-2009-Slow, Decorrelated Features for Pretraining Complex Cell-like Networks
7 0.64182383 159 nips-2009-Multi-Step Dyna Planning for Policy Evaluation and Control
8 0.6107505 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models
9 0.59825289 187 nips-2009-Particle-based Variational Inference for Continuous Systems
10 0.59687197 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
11 0.59135586 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
12 0.58627611 113 nips-2009-Improving Existing Fault Recovery Policies
13 0.5820154 100 nips-2009-Gaussian process regression with Student-t likelihood
14 0.58133984 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition
15 0.5804413 167 nips-2009-Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations
16 0.57125497 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
17 0.56641471 3 nips-2009-AUC optimization and the two-sample problem
18 0.56349832 48 nips-2009-Bootstrapping from Game Tree Search
19 0.54893386 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
20 0.54783595 151 nips-2009-Measuring Invariances in Deep Networks