nips nips2005 nips2005-151 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We investigate the learning of the appearance of an object from a single image of it. [sent-5, score-0.272]
2 Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. [sent-6, score-1.025]
3 This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. [sent-7, score-0.356]
4 We propose a generic scheme called chopping to address this task. [sent-8, score-0.524]
5 It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. [sent-9, score-0.779]
6 Those splits are extended to the complete image space with a simple learning algorithm. [sent-10, score-0.465]
7 Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. [sent-11, score-0.404]
8 Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. [sent-12, score-0.656]
9 1 Introduction Pattern recognition has so far mainly focused on the following task: given many training examples labelled with their classes (the object they display), guess the class of a new sample which was not available during training. [sent-13, score-0.359]
10 [Footnote: ∗ Supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.] Being able to perform that type of one-sample learning corresponds to the ability, given one example, to sort out which elements of a test set are of the same class (i.e. [sent-18, score-0.158]
11 This can be done by comparing one by one all the elements of the test set with the reference example, and labelling as of the same class those which are similar enough. [sent-22, score-0.321]
12 Learning techniques can be used to choose the similarity measure, which could be adaptive and learned from a large number of examples of classes not involved in the test. [sent-23, score-0.125]
13 Thus, given a large number of training images of a large number of objects labeled with their actual classes, and provided two pictures of unknown objects (objects which do not appear in the training pictures), we want to decide if these two objects are actually the same object. [sent-24, score-1.279]
14 The first image of such a couple can be seen as a single training example, and the second image as a test example. [sent-25, score-0.267]
15 Averaging the error rate by repeating that test several times provides an estimate of a one-sample learning (OSL) error rate. [sent-26, score-0.165]
16 Finally, the precise one-sample learning setting considered here has been the object of recent research [4, 3, 5] proposing different methods (hyperfeature learning, distance learning) for finding invariant features from a set of training reference objects distinct from the test objects. [sent-29, score-0.599]
17 We propose to build a large number of binary splits of the image space, designed to assign the same binary label to all the images common to a same object. [sent-32, score-0.886]
18 The binary mapping associated to such a split is thus highly invariant across the images of a certain object while highly variant across images of different objects. [sent-33, score-0.926]
19 We can define such a split on the training images, and train a predictor to extend it to the complete image space by induction. [sent-34, score-0.554]
20 We expect the predictor to respond similarly on two images of a same object, and differently on two images of two different objects with probability 1/2. [sent-35, score-0.88]
21 The global criterion to compare two images consists roughly of counting how many such split predictors respond similarly and comparing the result to a fixed threshold. [sent-36, score-0.273]
22 The principle of transforming a multiclass learning problem into several binary ones by class grouping has a long history in Machine Learning [10]. [sent-37, score-0.263]
23 From this point of view the collected output of several binary classifiers is used as a way for coding class membership. [sent-38, score-0.17]
24 Hence, the goal is not to code multiclass membership; our focus is not on designing efficient codes – splits are chosen randomly and we take a large number of them – but rather on how to use the learned mappings for learning unknown objects. [sent-41, score-0.44]
25 2 Data and features To make the rest of the paper clearer to the reader, we now introduce the data and feature sets we are using for our proof of concept experiments. [sent-42, score-0.1]
26 However, note that while we have focused on image classification, our approach is generic and could be applied to any signals for which adaptive binary classifiers are available. [sent-43, score-0.192]
27 1 Data We use two databases of pictures for our experiments. [sent-45, score-0.3]
28 The first one is the standard COIL-100 database of pictures [9]. [sent-46, score-0.403]
29 It contains 7200 images corresponding to 100 different objects. Figure 1: Four objects from the 100 objects of the COIL-100 database (downsampled to 38 × 38 grayscale pixels) and four symbols from the 150 symbols of our LaTeX symbol database (A, Φ, ⋖ and ⋔, resolution 28 × 28). [sent-47, score-1.432]
30 Each image of the latter is generated by applying a rotation and a scaling, and by adding lines of random grayscales at random locations and orientations. [sent-48, score-0.102]
31 The relative values of the two pixels connected by the thick segment define the polarity of the edge (dark to light or light to dark). [sent-50, score-0.132]
32 We down-sample these images from their original resolution to 38 × 38 pixels, and convert them to grayscale. [sent-53, score-0.269]
33 The second database contains images of 150 LaTeX symbols. [sent-55, score-0.39]
34 We generated 1, 000 images of each symbol by applying a random rotation (angle is taken between −20 and +20 degrees) and a random scaling factor (up to 1. [sent-56, score-0.359]
35 Examples of these degraded images are given in figure 1 (right). [sent-60, score-0.247]
36 Let ξx,y,d denote a basic edge detector indexed by a location (x, y) in the image frame and an orientation d which can take eight different values, corresponding to four orientations and two polarities (see figure 2). [sent-63, score-0.162]
37 For pictures of size 32 × 32 there is a total of N = (1/4) (32 × 32)² × 8 ≃ 2. [sent-67, score-0.26]
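As a sanity check of this count (our own arithmetic; the final value is truncated in the extracted text): N = (1/4) × (32 × 32)² × 8 = (1/4) × 1,048,576 × 8 = 2,097,152 ≈ 2.1 × 10⁶ candidate edge features.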
38 [Figure 3 panels: histograms with legends "Negative class" / "Positive class"; x-axis: Response] [sent-71, score-0.256]
39 Figure 3: These two histograms are representative of the responses of two split predictors conditionally to the real arbitrary labelling P(L | S). [sent-77, score-0.611]
40 3 Chopping The main idea we propose in this paper consists of learning a large number of binary splits of the image space which would ideally assign the same binary label to all the images of any given object. [sent-78, score-0.931]
41 In this section we define these splits and describe and justify how they are combined into a global rule. [sent-79, score-0.405]
42 1 Splits A split is a binary labelling of the image space, with the property of giving the same label to all images of a given object. [sent-81, score-0.825]
43 We can trivially produce a labelling with that property on the training examples, but we need to be able to extend it to images not appearing in the training data, including images of other objects. [sent-82, score-0.845]
44 We suppose that it is possible to infer a relevant split function on the complete image space, including images of other objects by looking at the problem as a binary classification problem. [sent-83, score-0.796]
45 Inference is done by means of a simple learning scheme: a combination of a fast feature selection based on conditional mutual information (CMIM) [6] and a linear perceptron. [sent-84, score-0.108]
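For concreteness, here is a naive sketch of the greedy CMIM criterion on binary features (illustrative Python, not the authors' implementation; the function names are ours, and the selected features would then feed the linear perceptron):

```python
# Naive greedy CMIM feature selection for binary features, as a sketch
# of the first half of the per-split learning scheme.  Illustrative only.
import numpy as np

def mutual_info(a, b):
    """Empirical I(a; b) in nats for two binary (0/1) vectors."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def cond_mutual_info(y, x, z):
    """Empirical I(y; x | z) in nats for binary vectors."""
    cmi = 0.0
    for vz in (0, 1):
        mask = (z == vz)
        if mask.any():
            cmi += np.mean(mask) * mutual_info(y[mask], x[mask])
    return cmi

def cmim_select(X, y, k):
    """Greedy CMIM: pick k columns of X maximizing, at each step,
    min over already-selected features s of I(y; X_f | X_s)."""
    X, y = np.asarray(X), np.asarray(y)
    n_feat = X.shape[1]
    first = int(np.argmax([mutual_info(y, X[:, f]) for f in range(n_feat)]))
    selected = [first]
    score = np.array([cond_mutual_info(y, X[:, f], X[:, first])
                      for f in range(n_feat)])
    while len(selected) < k:
        f_new = int(np.argmax(score))
        selected.append(f_new)
        update = [cond_mutual_info(y, X[:, f], X[:, f_new])
                  for f in range(n_feat)]
        score = np.minimum(score, update)  # already-selected features drop to 0
    return selected
```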
46 Thus, we create M arbitrary splits on the training sample by randomly assigning the label 1 to half of the NT objects appearing in the training set, and 0 to the others. [sent-85, score-0.823]
47 Since there are (NT choose NT/2) such balanced arbitrary labellings, with NT of the order of a few tens, a very large number of splits is available and only a small subset of them will be actually used for learning. [sent-86, score-0.405]
48 For each one of those splits, we train a predictor using the scheme described above. [sent-87, score-0.25]
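A minimal sketch of this stage, assuming a helper `fit_split_predictor(images, labels)` that stands in for the CMIM-plus-perceptron scheme above (a placeholder of ours, not the authors' API):

```python
# Sketch of the split-generation stage: each split randomly labels half
# of the training objects 1 and the other half 0, and a predictor is
# trained to extend that labelling to the whole image space.
import numpy as np

def random_balanced_split(object_ids, rng):
    """Map each training object id to a binary split label, half/half."""
    ids = np.array(sorted(object_ids))
    rng.shuffle(ids)
    positives = set(ids[: len(ids) // 2].tolist())
    return {obj: int(obj in positives) for obj in ids.tolist()}

def train_chopping(images, image_objects, n_splits, fit_split_predictor, seed=0):
    """image_objects[n] is the object id shown in images[n]."""
    rng = np.random.default_rng(seed)
    predictors = []
    for _ in range(n_splits):
        split = random_balanced_split(set(image_objects), rng)
        y = np.array([split[obj] for obj in image_objects])
        predictors.append(fit_split_predictor(images, y))
    return predictors
```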
49 Let (S1 , . . . , SM ) denote the family of arbitrary splits and (L1 , . . . , LM ) the corresponding split predictors. [sent-91, score-0.402]
50 The continuous outputs of these predictors before thresholding will be combined in the final classification. [sent-95, score-0.191]
51 We however still need some conditional independence assumption for the drawing of test image pairs. [sent-99, score-0.201]
52 To simplify the notation we denote L^1 = (L^1_i), L^2 = (L^2_i) the collections of predictor outputs for images 1 and 2, S^1 = (S^1_i), S^2 = (S^2_i) the collections of their split labels and C1, C2 their true classes. [sent-100, score-0.697]
53 In words, for each split i, the predictor output L_i is assumed to be independent of the true class C conditionally to the split label S_i; and conditionally to the split labels (S^1, S^2) of both images, the outputs of predictors on test pair images are assumed to be independent. [sent-113, score-1.499]
54-55 Finally, we make the additional symmetry hypothesis that conditionally to C1 = C2, for all i: S^1_i = S^2_i = S_i and (S_i) are independent Bernoulli variables with parameter 0.5, while conditionally to C1 ≠ C2 all split labels (S^1_i, S^2_i) are independent Bernoulli(0.5). [sent-114, score-0.107] [sent-115, score-0.29]
56 As a quick check, note that if the predictor outputs (L^j_i) are uninformative (i.e. [sent-121, score-0.225]
57 To estimate the probabilities P(S_j | L_j), we use a simple 1D Gaussian model for the output of the predictor given the true split label. [sent-128, score-0.379]
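Since the combination formula itself is not reproduced here, the following is our reconstruction from the stated independence and symmetry assumptions; it is an illustrative sketch and may differ in detail from the authors' formula (1):

```python
# Reconstruction (ours) of the Bayesian combination rule: per-split 1D
# Gaussian models of the predictor output given the true split label,
# combined into a log-odds that the two images show the same object.
import math

def gauss_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_odds_same(l1, l2, models, prior_same=0.5, eps=1e-300):
    """l1, l2: continuous outputs of the M split predictors on the two
    images.  models[i] = ((mean0, var0), (mean1, var1)), the Gaussian
    fits of predictor i's output given split label 0 or 1."""
    logit = math.log(prior_same / (1.0 - prior_same))
    for i, ((m0, v0), (m1, v1)) in enumerate(models):
        p1_0, p1_1 = gauss_pdf(l1[i], m0, v0), gauss_pdf(l1[i], m1, v1)
        p2_0, p2_1 = gauss_pdf(l2[i], m0, v0), gauss_pdf(l2[i], m1, v1)
        # same object: the two images share one Bernoulli(0.5) split label
        p_same = 0.5 * (p1_0 * p2_0 + p1_1 * p2_1)
        # different objects: two independent Bernoulli(0.5) split labels
        p_diff = 0.25 * (p1_0 + p1_1) * (p2_0 + p2_1)
        logit += math.log(p_same + eps) - math.log(p_diff + eps)
    return logit  # decide "same object" when logit (plus a tuned bias) > 0
```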
58 4 Experiments We estimate the performance of the chopping approach by comparing it to classical learning with several examples of the positive class and to a direct learning of the similarity of two objects on different images. [sent-131, score-0.983]
59 For every experiment, we use a family of 10, 000 features sampled uniformly in the complete set of features (see section 2. [sent-132, score-0.123]
60 1 Multiple example learning In this procedure, we train a predictor with several pictures of a positive class and with a very large number of pictures of a negative class. [sent-134, score-0.927]
61 The number of positive examples depends on the experiments (from 1 to 32) and the number of negative examples is 2,000. [sent-135, score-0.21]
62-64 [Figure 4 panels: curves for "Chopping", "Smart chopping", "Multi-example learning", "Similarity learnt directly"; x-axes: number of samples for multi-example learning (1 to 32) and number of splits for chopping (1 to 1024)]
65 Figure 4: Error rates of the chopping, smart-chopping (see §4.2), multi-example learning and learnt similarity on the LaTeX symbols (left) and the COIL-100 database (right). [sent-147, score-1.602] [sent-148, score-0.384]
66 The x-axis shows either the number of splits for chopping or the number of samples of the positive class for the multi-example learning. [sent-150, score-0.925]
67 Note that to handle the unbalanced positive and negative populations, the perceptron bias is chosen to minimize a balanced error rate. [sent-152, score-0.18]
68 Each experiment consists of several cross-validation cycles so that the total number of test pictures is roughly the same as the number of pairs in one-sample techniques experiments below. [sent-154, score-0.408]
69 2 One-sample learning For each experiment, whatever the predictor is, we first select 80 training objects from the COIL-100 database (respectively 100 symbols from the LaTeX symbol database). [sent-156, score-0.826]
70 The test error is computed with 500 pairs of images of the 20 unseen objects for the COIL-100, and 1,000 pairs of images of the 50 unseen objects for the LaTeX symbols. [sent-157, score-1.177]
71 These test sets are built to have as many pairs of images of the same object as pairs of images of different objects. [sent-158, score-0.854]
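A minimal sketch of this balanced pair construction, assuming images are grouped by (unseen) object id; the sampling details are ours:

```python
# Balanced pair sampling for one-sample evaluation: half the pairs show
# the same object twice, half show two different objects, drawn only
# from objects never used for training.  Illustrative code.
import numpy as np

def pick(seq, rng):
    return seq[rng.integers(len(seq))]

def make_test_pairs(images_by_object, n_pairs, seed=0):
    """images_by_object: dict mapping unseen object id -> list of images."""
    rng = np.random.default_rng(seed)
    objects = list(images_by_object)
    pairs, labels = [], []
    for k in range(n_pairs):
        if k % 2 == 0:                                   # same-object pair
            imgs = images_by_object[pick(objects, rng)]
            i, j = rng.choice(len(imgs), size=2, replace=False)
            pairs.append((imgs[i], imgs[j]))
            labels.append(1)
        else:                                            # different-object pair
            i, j = rng.choice(len(objects), size=2, replace=False)
            pairs.append((pick(images_by_object[objects[i]], rng),
                          pick(images_by_object[objects[j]], rng)))
            labels.append(0)
    return pairs, labels
```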
72 Learnt similarity: Note that one-sample learning can also be simply cast as a standard binary classification problem of pairs of images into the classes {same, different}. [sent-159, score-0.48]
73 We therefore want to compare the Chopping method to a more standard learning method directly on pairs of images using a comparable set of features. [sent-160, score-0.396]
74 For every single feature f on single images, we consider three features of a pair of images standing for the conjunction, disjunction and equality of the feature responses on the two images. [sent-161, score-0.477]
75 From the 10, 000 features on single images, we thus create a set of 30, 000 features on pairs of images. [sent-162, score-0.166]
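A sketch of that pair-feature construction (our reading of the obvious boolean encoding of conjunction, disjunction and equality):

```python
# Turn binary features computed on the two single images of a pair into
# pair features: conjunction, disjunction and equality of the responses.
# 10,000 single-image features thus become 30,000 pair features.
import numpy as np

def pair_features(f1, f2):
    """f1, f2: binary feature vectors of the two images of a pair."""
    f1 = np.asarray(f1, dtype=bool)
    f2 = np.asarray(f2, dtype=bool)
    conj = f1 & f2            # both features respond
    disj = f1 | f2            # at least one responds
    eq = (f1 == f2)           # identical responses on the two images
    return np.concatenate([conj, disj, eq]).astype(np.uint8)
```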
76 We generate a training set of 2,000 pairs of pictures for the experiments with the COIL-100 database and 5,000 for the LaTeX symbols, half picturing the same object twice, half picturing two different objects. [sent-163, score-0.865]
77 We then train a predictor similar to those used for the splits in the chopping scheme: feature selection with CMIM, and linear combination with a perceptron (see section 3. [sent-164, score-1.088]
78 Chopping: The performance of the chopping approach is estimated for several numbers of splits (from 1 to 1024). [sent-166, score-0.824]
79 For each split we select 50 objects from the training objects, and select at random 1, 000 training images of these objects. [sent-167, score-0.784]
80 We generate an arbitrary balanced binary labelling of these 50 objects and label the training images accordingly. [sent-168, score-0.889]
81 We then build a predictor by selecting 2, 000 features with the CMIM algorithm, and combine them with a perceptron (see section 3. [sent-169, score-0.27]
82 To compensate for the limitation of our conditional independence assumptions we allow the addition of a fixed bias to the log-odds ratio (1). [sent-171, score-0.164]
83 Using the remaining training objects as validation set, we compute this bias so as to minimize the validation error. [sent-173, score-0.294]
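A possible sketch of this calibration step (ours): sweep a grid of candidate biases and keep the one that minimizes the error on validation pairs built from the held-out training objects.

```python
# Sweep candidate biases for the log-odds rule; `logits` are log-odds of
# "same object" on validation pairs, `labels` are 1 for same, 0 for
# different.  Grid and range are our own choices, for illustration.
import numpy as np

def calibrate_bias(logits, labels, candidates=None):
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if candidates is None:
        candidates = np.linspace(-50.0, 50.0, 1001)
    errors = [np.mean(((logits + b) > 0).astype(int) != labels) for b in candidates]
    return float(candidates[int(np.argmin(errors))])
```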
84 We insist that no objects of the test classes be used for training. [sent-174, score-0.278]
85 To improve the performance of the splits, we also test a “smart” version of the chopping for which each split is built in two steps. [sent-175, score-0.725]
86 From that first step, we remove the 10 objects for which the labelling prediction has the highest error rate, and re-build the split with the 40 remaining objects. [sent-177, score-0.593]
87 This gets rid of problematic objects or inconsistent labellings (for instance trying to force two similar objects to be in different halves of the split). [sent-178, score-0.576]
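As an illustration of the two-step construction, assuming the same `fit_split_predictor` placeholder as above and a `predict` method returning split labels (both ours, not the authors' API):

```python
# "Smart" split: fit once on all 50 objects, drop the 10 objects whose
# images are worst predicted, then refit on the remaining 40 objects.
import numpy as np

def smart_split(images, image_objects, split_labels, fit_split_predictor,
                n_drop=10):
    images = np.asarray(images)
    obj = np.asarray(image_objects)
    y = np.array([split_labels[o] for o in image_objects])
    first = fit_split_predictor(images, y)
    pred = first.predict(images)                 # predicted split labels
    # per-object error rate of the first-pass predictor
    err = {o: float(np.mean(pred[obj == o] != y[obj == o]))
           for o in set(image_objects)}
    worst = sorted(err, key=err.get, reverse=True)[:n_drop]
    keep = ~np.isin(obj, worst)
    return fit_split_predictor(images[keep], y[keep])
```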
88 3 Results The experiments demonstrate the good performance of chopping when only one example is available. [sent-180, score-0.454]
89 By contrast, a direct learning of the similarity (see section 4. [sent-184, score-0.101]
90 On both databases, the classical multi-sample learning scheme requires 32 samples to reach the same level of performance (10. [sent-188, score-0.127]
91 There is no overfitting when the number of splits increases, which is consistent with the absence of global learning: splits are combined with an ad-hoc Bayesian rule, without optimizing a global functional, which generally also results in better robustness. [sent-192, score-0.778]
92 There is no visible degradation of the asymptotic performance due to either a reduced independence between splits or a diminution of their separation power. [sent-195, score-0.395]
93 However the computational cost is twice as high, since every predictor has to be built twice. [sent-196, score-0.238]
94 5 Conclusion In this paper we have proposed an original approach to learning the appearance of an object from a single image. [sent-197, score-0.199]
95 Our method relies on a large number of individual splits of the image space designed to keep together the images of any of the training objects. [sent-198, score-0.769]
96 These splits are learned from a training set of examples and combined into a Bayesian framework to estimate the posterior probability for two images to show the same object. [sent-199, score-0.738]
97 It can be applied to predict the similarity of two signals as soon as a family of binary predictors exists on the space of individual signals. [sent-201, score-0.292]
98 Since the learning is decomposed into the training of several splits independently, it can be easily parallelized. [sent-202, score-0.487]
99 Also, because the combination rule is symmetric with respect to the splits, the learning can be incremental: splits can be added to the global rule progressively when they become available. [sent-203, score-0.466]
100 A Bayesian approach to unsupervised one-shot learning of object categories. [sent-248, score-0.175]
wordName wordTfidf (topN-words)
[('chopping', 0.454), ('splits', 0.347), ('pictures', 0.26), ('ltex', 0.252), ('images', 0.247), ('si', 0.233), ('objects', 0.2), ('split', 0.193), ('predictor', 0.186), ('labelling', 0.176), ('database', 0.143), ('object', 0.13), ('predictors', 0.12), ('symbols', 0.097), ('symbol', 0.083), ('binary', 0.083), ('smart', 0.08), ('pairs', 0.076), ('cmim', 0.076), ('lj', 0.076), ('image', 0.073), ('training', 0.072), ('picturing', 0.066), ('conditionally', 0.065), ('class', 0.064), ('learnt', 0.057), ('sm', 0.056), ('similarity', 0.056), ('sj', 0.053), ('label', 0.053), ('ferencz', 0.05), ('test', 0.049), ('independence', 0.048), ('multiclass', 0.048), ('features', 0.045), ('learning', 0.045), ('li', 0.044), ('disjunction', 0.044), ('equality', 0.042), ('symmetry', 0.042), ('pixels', 0.041), ('examples', 0.04), ('databases', 0.04), ('outputs', 0.039), ('perceptron', 0.039), ('generic', 0.036), ('balanced', 0.036), ('classi', 0.036), ('gure', 0.036), ('edge', 0.035), ('responses', 0.035), ('scheme', 0.034), ('positive', 0.034), ('nt', 0.033), ('columbia', 0.033), ('family', 0.033), ('ratio', 0.032), ('combined', 0.032), ('feature', 0.032), ('labels', 0.032), ('reference', 0.032), ('assumptions', 0.031), ('conditional', 0.031), ('appearing', 0.031), ('thick', 0.031), ('train', 0.03), ('relies', 0.03), ('built', 0.029), ('formula', 0.029), ('classes', 0.029), ('iccv', 0.029), ('rotation', 0.029), ('unseen', 0.029), ('want', 0.028), ('detector', 0.028), ('half', 0.026), ('eight', 0.026), ('chance', 0.026), ('dark', 0.026), ('invariant', 0.026), ('global', 0.026), ('bayesian', 0.026), ('samples', 0.026), ('negative', 0.025), ('segment', 0.025), ('appendix', 0.025), ('bernoulli', 0.025), ('error', 0.024), ('recognition', 0.024), ('rule', 0.024), ('appearance', 0.024), ('display', 0.024), ('cation', 0.023), ('rest', 0.023), ('twice', 0.023), ('several', 0.023), ('arbitrary', 0.022), ('classical', 0.022), ('bias', 0.022), ('resolution', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 151 nips-2005-Pattern Recognition from One Example by Chopping
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
2 0.13296761 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection
Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky
Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.
3 0.12204292 98 nips-2005-Infinite latent feature models and the Indian buffet process
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
4 0.11293953 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth
Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization. 1
5 0.11284705 131 nips-2005-Multiple Instance Boosting for Object Detection
Author: Cha Zhang, John C. Platt, Paul A. Viola
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier. 1
6 0.098311543 78 nips-2005-From Weighted Classification to Policy Search
7 0.094754651 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
8 0.090811402 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
9 0.089828223 79 nips-2005-Fusion of Similarity Data in Clustering
10 0.08925233 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.083068609 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
12 0.079965882 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
13 0.074608073 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
14 0.06948676 110 nips-2005-Learning Depth from Single Monocular Images
15 0.068785571 45 nips-2005-Conditional Visual Tracking in Kernel Space
16 0.068758361 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
17 0.063877694 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
18 0.062470898 192 nips-2005-The Information-Form Data Association Filter
19 0.06203175 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining
20 0.060189765 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
topicId topicWeight
[(0, 0.21), (1, 0.028), (2, -0.011), (3, 0.171), (4, -0.047), (5, 0.11), (6, 0.125), (7, 0.158), (8, -0.008), (9, -0.103), (10, -0.015), (11, -0.014), (12, 0.081), (13, 0.056), (14, 0.074), (15, 0.039), (16, -0.073), (17, -0.132), (18, -0.028), (19, 0.024), (20, 0.029), (21, 0.182), (22, -0.049), (23, 0.079), (24, 0.019), (25, -0.051), (26, -0.047), (27, -0.12), (28, -0.005), (29, -0.004), (30, -0.076), (31, -0.04), (32, 0.015), (33, 0.028), (34, 0.037), (35, -0.013), (36, -0.01), (37, -0.048), (38, -0.019), (39, -0.095), (40, 0.001), (41, 0.037), (42, 0.05), (43, -0.029), (44, 0.009), (45, -0.013), (46, 0.02), (47, -0.004), (48, -0.026), (49, 0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.95924276 151 nips-2005-Pattern Recognition from One Example by Chopping
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
2 0.81332141 131 nips-2005-Multiple Instance Boosting for Object Detection
Author: Cha Zhang, John C. Platt, Paul A. Viola
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier. 1
3 0.77053159 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth
Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization. 1
4 0.72442883 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
Author: Long Zhu, Alan L. Yuille
Abstract: We describe a hierarchical compositional system for detecting deformable objects in images. Objects are represented by graphical models. The algorithm uses a hierarchical tree where the root of the tree corresponds to the full object and lower-level elements of the tree correspond to simpler features. The algorithm proceeds by passing simple messages up and down the tree. The method works rapidly, in under a second, on 320 × 240 images. We demonstrate the approach on detecting cats, horses, and hands. The method works in the presence of background clutter and occlusions. Our approach is contrasted with more traditional methods such as dynamic programming and belief propagation. 1
5 0.68306845 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
6 0.61757702 98 nips-2005-Infinite latent feature models and the Indian buffet process
7 0.58861607 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection
8 0.52274477 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
9 0.51222384 79 nips-2005-Fusion of Similarity Data in Clustering
10 0.472875 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits
11 0.43613228 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
12 0.43412608 110 nips-2005-Learning Depth from Single Monocular Images
13 0.41734368 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
14 0.41155612 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
15 0.41063654 195 nips-2005-Transfer learning for text classification
16 0.40863159 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape
17 0.40134972 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning
18 0.38841736 192 nips-2005-The Information-Form Data Association Filter
19 0.37932056 78 nips-2005-From Weighted Classification to Policy Search
20 0.37841201 189 nips-2005-Tensor Subspace Analysis
topicId topicWeight
[(3, 0.07), (10, 0.027), (27, 0.026), (31, 0.055), (34, 0.108), (39, 0.068), (55, 0.026), (60, 0.011), (69, 0.056), (73, 0.048), (88, 0.124), (91, 0.041), (94, 0.248)]
simIndex simValue paperId paperTitle
1 0.93472087 122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours
Author: Eric Saund
Abstract: This paper presents representation and logic for labeling contrast edges and ridges in visual scenes in terms of both surface occlusion (border ownership) and thinline objects. In natural scenes, thinline objects include sticks and wires, while in human graphical communication thinlines include connectors, dividers, and other abstract devices. Our analysis is directed at both natural and graphical domains. The basic problem is to formulate the logic of the interactions among local image events, specifically contrast edges, ridges, junctions, and alignment relations, such as to encode the natural constraints among these events in visual scenes. In a sparse heterogeneous Markov Random Field framework, we define a set of interpretation nodes and energy/potential functions among them. The minimum energy configuration found by Loopy Belief Propagation is shown to correspond to preferred human interpretation across a wide range of prototypical examples including important illusory contour figures such as the Kanizsa Triangle, as well as more difficult examples. In practical terms, the approach delivers correct interpretations of inherently ambiguous hand-drawn box-and-connector diagrams at low computational cost.
same-paper 2 0.78349239 151 nips-2005-Pattern Recognition from One Example by Chopping
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 degraded LaTeX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
3 0.76227188 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
Author: Keiji Miura, Masato Okada, Shun-ichi Amari
Abstract: We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. The model parameters consist of a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes with time, observed data are generated from the time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model, which is one of the unsolved problem in statistics and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We analytically obtained an optimal estimating function for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. 1
4 0.625736 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.61939585 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model Bayesian inference is performed about the latent function f in the light of observed data D = {(yi , xi )|i = 1, . . . , m}. Let fi = f (xi ) and f = [f1 , . . . , fm ] be shorthand for the values of the latent function and y = [y1 , . . . , ym ] and X = [x1 , . . . , xm ] collect the class labels and inputs respectively. Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factories: m m p(yi |fi ) = p(y|f ) = i=1 Φ(yi fi ), i=1 and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x |θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encodes assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes: m p(f |D, θ) = N (f |0, K) Φ(yi fi ), p(D|θ) i=1 where p(D|θ) = p(y|f )p(f |X, θ)df , denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, or predictions can be computed analytically, so approximations are needed. 2 Approximate Bayesian Inference For the GPC model approximations are either based on a Gaussian approximation to the posterior p(f |D, θ) ≈ q(f |D, θ) = N (f |m, A) or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP) which are two alternative approaches to finding parameters m and A of the Gaussian q(f |D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(fi |D, θ) by the marginals of the approximation N (fi |mi , Aii ). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions. For a test input x∗ the fi 1 0.16 0.14 0.8 0.6 0.1 fj p(y|f) p(f|y) 0.12 Likelihood p(y|f) Prior p(f) Posterior p(f|y) Laplace q(f|y) EP q(f|y) 0.08 0.4 0.06 0.04 0.2 0.02 0 −4 0 4 8 0 f . (a) (b) Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior N (f |0, 52 ) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zeromean Gaussian prior as an ellipse. 
The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. approximate predictive latent and class probabilities are: 2 q(f∗ |D, θ, x∗ ) = N (µ∗ , σ∗ ), and 2 q(y∗ = 1|D, x∗ ) = Φ(µ∗ / 1 + σ∗ ), 2 where µ∗ = k∗ K−1 m and σ∗ = k(x∗ , x∗ )−k∗ (K−1 − K−1 AK−1 )k∗ , where the vector k∗ = [k(x1 , x∗ ), . . . , k(xm , x∗ )] collects covariances between x∗ and training inputs X. MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic O(m3 ) as common among non-sparse GP models due to inversions m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon. 3 Structural Properties of the Posterior and its Approximations Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N (f |0, K) centred at the origin. Each likelihood term p(yi |fi ) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell – depending on the covariance structure – away from the mean [7, ch. 29.2]. Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. 
From the properties of the posterior it can be expected that Laplace’s method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below. 4 Experiments In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form: k(x, x |θ) = σ 2 exp − 1 x − x 2 2 / 2 , (1) such that θ = [σ, ]. We refer to σ 2 as the signal variance and to as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let p∗ = p(y∗ = 1|D, θ, x∗ ) be the model’s prediction, then we average: I(p∗ , yi ) = i yi +1 2 log2 (p∗ ) + i 1−yi 2 log2 (1 − p∗ ) + H i (2) over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2. 
Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively. MCMC samples Laplace p(f|D) EP p(f|D) 0.2 0.15 0.45 0.1 0.4 0.05 0.3 −16 −14 −12 −10 −8 −6 f −4 −2 0 2 4 p(xi) 0 0.35 (a) 0.06 0.25 0.2 0.15 MCMC samples Laplace p(f|D) EP p(f|D) 0.1 0.05 0.04 0 0 2 0.02 xi 4 6 (c) 0 −40 −35 −30 −25 −20 −15 −10 −5 0 5 10 15 f (b) Figure 3: Panel (a) and (b) show two marginal distributions p(fi |D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples. For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well1 , given that the marginal likelihood comes as a 767 respectively 200 dimensional integral. 
The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than for LA. Note that for small signal variances (roughly ln(σ 2 ) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As an effect the posterior will be less skewed and both approximations will lead to similar results. For the USPS 3’s vs. 5’s we now inspect the marginal distributions p(fi |D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln( ) = 2.85 which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f |D, θ). For LA and EP the approximate marginals are q(fi |D, θ) = N (fi |mi , Aii ) where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1% but LA gives a wrong p(yi = 1|D, θ) ≈ 18%. In the experiment we saw that the marginal distributions of the posterior often agree very 1 Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale. well with a Gaussian approximation. This seems to contradict the description given in the previous section were we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767 dimensional Gaussian N (x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1, and all other eigenvalues are 1. We then truncate this distribution such that all xi ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi . The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property, that for a truncated high-dimensional Gaussian, resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian – however, the Laplace approximation centred on the origin would be completely inappropriate. In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well known benchmark problems. Each data set is randomly split into 10 folds of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. 
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results for the θ that maximises the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8].

Results are summarised in Table 1. Comparing Laplace's method to EP, the latter proves to be more accurate in terms of both error rate and information. While the error rates are relatively similar, the predictive distribution obtained by EP is clearly more informative about the test targets. Note that for GPC the error rate depends only on the sign of the mean µ∗ of the approximated posterior over latent functions, and not on the entire posterior predictive distribution. As is to be expected, the length of the mean vector m is much larger for the EP approximation. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically: for some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions has too small an amplitude, although the sign of the mean function is mostly correct. As a consequence LA gives over-conservative predictive probabilities and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2: for this particular task we have seen that the sign of the latent function tends to be correct (at least at the training locations).

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of the inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.

                             Laplace                   EP                     SVM
  Data Set        m    n    E%     I     ‖m‖     E%     I      ‖m‖       E%     I
  Ionosphere     351   34   8.84  0.591   49.96   7.99  0.661   124.94   5.69  0.681
  Wisconsin      683    9   3.21  0.804   62.62   3.21  0.805    84.95   3.21  0.795
  Pima Indians   768    8  22.77  0.252   29.05  22.63  0.253    47.49  23.01  0.232
  Crabs          200    7   2.0   0.682  112.34   2.0   0.908  2552.97   2.0   0.047
  Sonar          208   60  15.36  0.439   26.86  13.85  0.537 15678.55  11.14  0.567
  USPS 3 vs 5   1540  256   2.27  0.849  163.05   2.21  0.902 22011.70   2.01  0.918
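As a concrete reading of Table 1, the sketch below shows how the reported quantities could be computed from held-out predictions. It is only illustrative: the exact definition of the information score I in equation (2) is not reproduced in this excerpt, so the version below (average gain in bits of the predictive distribution over a base-rate model) is an assumption, as are the helper names and the simple Nelder-Mead fit of the sigmoid used to map SVM outputs to probabilities [8].

```python
# Sketch of the evaluation quantities (not the authors' code). p: predictive
# probabilities p(y=1|x) on test cases; y: test labels in {-1, +1}.
import numpy as np
from scipy.optimize import minimize

def error_rate(p, y):
    """Percentage of test cases where thresholding p at 1/2 disagrees with y."""
    return 100.0 * np.mean(np.sign(p - 0.5) != y)

def information_bits(p, y, base_rate):
    """Average log2-likelihood of the test labels minus that of a base-rate model
    (assumed stand-in for the information score I of equation (2))."""
    t = (y + 1) / 2                       # map {-1, +1} -> {0, 1}
    eps = 1e-12
    ll_model = t * np.log2(p + eps) + (1 - t) * np.log2(1 - p + eps)
    ll_base = t * np.log2(base_rate + eps) + (1 - t) * np.log2(1 - base_rate + eps)
    return np.mean(ll_model - ll_base)

def platt_sigmoid(f_train, y_train):
    """Fit p(y=1|f) = 1 / (1 + exp(a*f + b)) to unthresholded SVM outputs [8]."""
    t = (y_train + 1) / 2
    def nll(ab):
        a, b = ab
        z = a * f_train + b
        # negative log-likelihood of the logistic model, written stably
        return np.sum(np.logaddexp(0.0, z) - (1 - t) * z)
    a, b = minimize(nll, x0=np.array([-1.0, 0.0]), method="Nelder-Mead").x
    return lambda f: 1.0 / (1.0 + np.exp(a * f + b))
```

For the GPC models, p would be the approximate predictive probabilities produced by LA or EP; for the SVM, the unthresholded decision values are first passed through the fitted sigmoid. This makes clear why a confidently wrong SVM output can drive the information score down, as in the Crabs row, while leaving the error rate unchanged.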
The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar across tasks. Using AIS we were able to assess the accuracy of the marginal likelihood estimates, which to the best of our knowledge has not been done before.

In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].

Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. MIT Press, 2000.