nips nips2005 nips2005-98 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. [sent-6, score-0.658]
2 This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. [sent-7, score-0.389]
3 We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. [sent-8, score-0.451]
4 We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. [sent-9, score-0.586]
5 The simplest representation, used in mixture models, associates each object with a single latent class. [sent-11, score-0.445]
6 This approach is appropriate when objects can be partitioned into relatively homogeneous subsets. [sent-12, score-0.235]
7 However, the properties of many objects are better captured by representing each object using multiple latent features. [sent-13, score-0.627]
8 A critical question in all of these approaches is the dimensionality of the representation: how many classes or features are needed to express the latent structure expressed by a set of objects. [sent-15, score-0.498]
9 An alternative is to assume that the true dimensionality is unbounded, and that the observed objects manifest only a finite subset of classes or features [5]. [sent-18, score-0.495]
10 In a Dirichlet process mixture model, each object is assigned to a latent class, and each class is associated with a distribution over observable properties. [sent-20, score-0.586]
11 The prior distribution over assignments of objects to classes is defined in such a way that the number of classes used by the model is bounded only by the number of objects, making Dirichlet process mixture models “infinite” mixture models [10]. [sent-21, score-0.86]
12 The prior distribution assumed in a Dirichlet process mixture model can be specified in terms of a sequential process called the Chinese restaurant process (CRP) [11, 12]. [sent-22, score-0.413]
13 In the CRP, N customers enter a restaurant with infinitely many tables, each with infinite seating capacity. [sent-23, score-0.616]
14 The ith customer chooses an already-occupied table k with probability mk / (i − 1 + α), where mk is the number of current occupants, and chooses a new table with probability α / (i − 1 + α). [sent-24, score-0.549]
15 Customers are exchangeable under this process: the probability of a particular seating arrangement depends only on the number of people at each table, and not the order in which they enter the restaurant. [sent-25, score-0.262]
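To make the sequential scheme concrete, the following is a minimal Python sketch of CRP sampling (our own illustrative code; the function name and use of numpy are not from the paper):

```python
import numpy as np

def sample_crp(n_customers, alpha, seed=0):
    """Sample a CRP partition: table assignments for n_customers."""
    rng = np.random.default_rng(seed)
    counts = []        # counts[k] = number of customers already at table k
    assignments = []
    for i in range(n_customers):
        # Here i is 0-based, so it plays the role of i - 1 in the text:
        # existing table k has probability counts[k] / (i + alpha),
        # a new table has probability alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)       # open a new table
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments

print(sample_crp(10, alpha=2.0))
```

Permuting the customers leaves the probability of the resulting partition unchanged, which is the exchangeability property noted above.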
16 If we replace customers with objects and tables with classes, the CRP specifies a distribution over partitions of objects into classes. [sent-26, score-0.755]
17 A partition is a division of the set of N objects into subsets, where each object belongs to a single subset and the ordering of the subsets does not matter. [sent-27, score-0.508]
18 Two assignments of objects to classes that result in the same division of objects correspond to the same partition. [sent-28, score-0.698]
19 A partition thus defines an equivalence class of assignment vectors. [sent-30, score-0.204]
20 The distribution over partitions implied by the CRP can be derived by taking the limit of the probability of the corresponding equivalence class of assignment vectors in a model where class assignments are generated from a multinomial distribution with a Dirichlet prior [9, 10]. [sent-31, score-0.728]
21 In this paper, we derive an infinitely exchangeable distribution over infinite binary matrices by pursuing this strategy of taking the limit of a finite model. [sent-32, score-0.598]
22 We also describe a stochastic process (the Indian buffet process, akin to the CRP) which generates this distribution. [sent-33, score-0.293]
23 Finally, we demonstrate how this distribution can be used as a prior in statistical models in which each object is represented by a sparse subset of an unbounded number of features. [sent-34, score-0.393]
24 2 A distribution on infinite binary matrices In a latent feature model, each object is represented by a vector of latent feature values f i , and the observable properties of that object xi are generated from a distribution determined by its latent features. [sent-36, score-1.799]
25 Using the matrix F = [f1^T f2^T · · · fN^T]^T to indicate the latent feature values for all N objects, the model is specified by a prior over features, p(F), and a distribution over observed property matrices conditioned on those features, p(X|F), where p(·) is a probability density function. [sent-39, score-0.706]
26 These distributions can be dealt with separately: p(F) specifies the number of features and the distribution over values associated with each feature, while p(X|F) determines how these features relate to the properties of objects. [sent-40, score-0.331]
27 We can break F into two components: a binary matrix Z indicating which features are possessed by each object, with zik = 1 if object i has feature k and 0 otherwise, and a matrix V indicating the value of each feature for each object. [sent-42, score-1.336]
28 In some models (e.g., PCA), objects have non-zero values on every feature, and every entry of Z is 1. [sent-46, score-0.235]
29 We will focus on defining a prior on Z, since the effective dimensionality of a latent feature model is determined by Z. [sent-51, score-0.514]
30 Assuming that Z is sparse, we can define a prior for infinite latent feature models by defining a distribution over infinite binary matrices. [sent-52, score-0.747]
31 Our discussion of the Chinese restaurant process provides two desiderata for such a distribution: objects should be exchangeable, and inference should remain tractable. [sent-53, score-1.175]
32 Figure 1: A binary matrix Z, as shown in (a), indicates which features take non-zero values. [sent-62, score-0.399]
33 It also suggests a method by which these desiderata can be satisfied: start with a model that assumes a finite number of features, and consider the limit as the number of features approaches infinity. [sent-66, score-0.26]
34 1 A finite feature model We have N objects and K features, and the possession of feature k by object i is indicated by a binary variable zik . [sent-68, score-1.194]
35 Assume that each object possesses feature k with probability πk , and that the features are generated independently. [sent-70, score-0.519]
36 The probability of a matrix Z given π = {π1, . . . , πK} is P(Z|π) = ∏_{k=1}^K ∏_{i=1}^N P(zik | πk) = ∏_{k=1}^K πk^mk (1 − πk)^(N − mk), (1) where mk = Σ_{i=1}^N zik is the number of objects possessing feature k. [sent-74, score-1.022]
37 We can define a prior on π by assuming that each πk follows a beta distribution, to give πk | α ∼ Beta(α/K, 1) and zik | πk ∼ Bernoulli(πk). Each zik is independent of all other assignments, conditioned on πk, and the πk are generated independently. [sent-75, score-0.694]
38 We can integrate out π to obtain the probability of Z, which is P(Z) = ∏_{k=1}^K (α/K) Γ(mk + α/K) Γ(N − mk + 1) / Γ(N + 1 + α/K). (2) [sent-76, score-0.311]
39 This distribution is exchangeable, since mk is not affected by the ordering of the objects. [sent-77, score-0.402]
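A minimal sketch of sampling Z from this finite beta-Bernoulli model (illustrative code, not from the paper); for large K with α fixed, only a handful of columns are non-empty, which is the regime exploited when taking the infinite limit below:

```python
import numpy as np

def sample_finite_feature_model(N, K, alpha, seed=0):
    """Draw Z from the finite model: pi_k ~ Beta(alpha/K, 1), z_ik ~ Bernoulli(pi_k)."""
    rng = np.random.default_rng(seed)
    pi = rng.beta(alpha / K, 1.0, size=K)
    Z = (rng.random((N, K)) < pi).astype(int)
    return Z

Z = sample_finite_feature_model(N=8, K=1000, alpha=3.0)
print("non-empty columns:", int((Z.sum(axis=0) > 0).sum()))
```

In the infinite limit the number of non-empty columns is Poisson(α HN) distributed, so the count printed above stays modest even as K grows.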
40 2 Equivalence classes In order to find the limit of the distribution specified by Equation 2 as K → ∞, we need to define equivalence classes of binary matrices – the analogue of partitions for class assignments. [sent-79, score-0.776]
41 Our equivalence classes will be defined with respect to a function on binary matrices, lof (·). [sent-80, score-0.725]
42 This function maps binary matrices to left-ordered binary matrices. [sent-81, score-0.493]
43 lof (Z) is obtained by ordering the columns of the binary matrix Z from left to right by the magnitude of the binary number expressed by that column, taking the first row as the most significant bit. [sent-82, score-0.966]
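As an illustration, a small Python sketch of lof(·) (our own code): each column is read as a binary number with the first row as the most significant bit, and columns are sorted in decreasing order of that number.

```python
import numpy as np

def lof(Z):
    """Return the left-ordered form of a binary matrix Z."""
    N = Z.shape[0]
    weights = 2 ** np.arange(N - 1, -1, -1)   # first row = most significant bit
    magnitudes = weights @ Z                  # binary number encoded by each column
    order = np.argsort(-magnitudes, kind="stable")
    return Z[:, order]

Z = np.array([[0, 1, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0]])
print(lof(Z))
```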
44 The left-ordering of a binary matrix is shown in Figure 2. [sent-83, score-0.26]
45 The history of feature k at object i is defined to be (z1k , . [sent-87, score-0.376]
46 Where no object is specified, we will use history to refer to the full history of feature k, (z1k , . [sent-91, score-0.43]
47 A binary matrix is transformed into a left-ordered binary matrix by the function lof (·). [sent-96, score-0.867]
48 The entries in the left-ordered matrix were generated from the Indian buffet process with α = 10. [sent-97, score-0.395]
49 We will individuate the histories of features using the decimal equivalent of the binary numbers corresponding to the column entries. [sent-99, score-0.363]
50 Kh will denote the number of features possessing the history h, with K0 being the number of features for which mk = 0 and K+ = Σ_{h=1}^{2^N − 1} Kh being the number of features for which mk > 0, so K = K0 + K+. [sent-101, score-1.09]
51 Two binary matrices Y and Z are lof -equivalent if lof (Y) = lof (Z). [sent-102, score-1.347]
52 The lof equivalence class of a binary matrix Z, denoted [Z], is the set of binary matrices that are lof -equivalent to Z. [sent-103, score-1.402]
53 lof -equivalence classes play the role for binary matrices that partitions play for assignment vectors: they collapse together all binary matrices (assignment vectors) that differ only in column ordering (class labels). [sent-104, score-1.192]
54 lof -equivalence classes are preserved through permutation of the rows or the columns of a matrix, provided the same permutations are applied to the other members of the equivalence class. [sent-105, score-0.634]
55 Performing inference at the level of lof -equivalence classes is appropriate in models where feature order is not identifiable, with p(X|F) being unaffected by the order of the columns of F. [sent-106, score-0.726]
56 The cardinality of the lof-equivalence class [Z] is the multinomial coefficient K! / ∏_{h=0}^{2^N − 1} Kh!. [sent-108, score-0.384]
57 3 Taking the infinite limit Under the distribution defined by Equation 2, the probability of a particular lof-equivalence class of binary matrices, [Z], is P([Z]) = Σ_{Z∈[Z]} P(Z) = (K! / ∏_{h=0}^{2^N − 1} Kh!) ∏_{k=1}^K (α/K) Γ(mk + α/K) Γ(N − mk + 1) / Γ(N + 1 + α/K). (3) [sent-115, score-0.714]
58 Taking the limit of this expression as K → ∞ gives P([Z]) = (α^{K+} / ∏_{h=1}^{2^N − 1} Kh!) · exp{−α HN} · ∏_{k=1}^{K+} (N − mk)! (mk − 1)! / N!, (4) [sent-116, score-0.282]
59 where HN = Σ_{j=1}^N 1/j is the Nth harmonic number. [sent-123, score-0.282]
60 This distribution is infinitely exchangeable, since neither Kh nor mk are affected by the ordering on objects. [sent-129, score-0.402]
61 4 The Indian buffet process The probability distribution defined in Equation 4 can be derived from a simple stochastic process. [sent-132, score-0.399]
62 We will define a distribution over infinite binary matrices by specifying how customers (objects) choose dishes (features). [sent-135, score-0.685]
63 In our Indian buffet process (IBP), N customers enter a restaurant one after another. [sent-136, score-0.587]
64 Each customer encounters a buffet consisting of infinitely many dishes arranged in a line. [sent-137, score-0.564]
65 The first customer starts at the left of the buffet and takes a serving from each dish, stopping after a Poisson(α) number of dishes. [sent-138, score-0.403]
66 The ith customer moves along the buffet, sampling dishes in proportion to their popularity, taking dish k with probability mk / i, where mk is the number of previous customers who have sampled that dish. [sent-139, score-1.239]
67 Having reached the end of all previously sampled dishes, the ith customer then tries a Poisson(α/i) number of new dishes. [sent-140, score-0.247]
68 We can indicate which customers chose which dishes using a binary matrix Z with N rows and infinitely many columns, where zik = 1 if the ith customer sampled the kth dish. [sent-141, score-1.165]
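The generative description above translates directly into code. A minimal sketch of drawing Z from the IBP (our own implementation, not the authors'):

```python
import numpy as np

def sample_ibp(N, alpha, seed=0):
    """Draw a binary feature matrix Z from the Indian buffet process."""
    rng = np.random.default_rng(seed)
    dishes = []                                # dishes[k] = customers who took dish k
    for i in range(1, N + 1):
        # Take each previously sampled dish k with probability m_k / i.
        for takers in dishes:
            if rng.random() < len(takers) / i:
                takers.append(i)
        # Then try a Poisson(alpha / i) number of new dishes.
        for _ in range(rng.poisson(alpha / i)):
            dishes.append([i])
    Z = np.zeros((N, len(dishes)), dtype=int)
    for k, takers in enumerate(dishes):
        Z[np.array(takers) - 1, k] = 1
    return Z

print(sample_ibp(N=6, alpha=2.0))
```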
69 Using K1^(i) to indicate the number of new dishes sampled by the ith customer, the probability of any particular matrix being produced by the IBP is P(Z) = (α^{K+} / ∏_{i=1}^N K1^(i)!) · exp{−α HN} · ∏_{k=1}^{K+} (N − mk)! (mk − 1)! / N!. (5) [sent-142, score-0.375]
70 These matrices are also not ordered arbitrarily, because the Poisson draws always result in choices of new dishes that are to the right of the previously sampled dishes. [sent-148, score-0.318]
71 Customers are not exchangeable under this distribution, as the number of dishes counted as K1^(i) depends upon the order in which the customers make their choices. [sent-149, score-0.504]
72 However, if we only pay attention to the lof-equivalence classes of the matrices generated by this process, we obtain the infinitely exchangeable distribution P([Z]) given by Equation 4: ∏_{i=1}^N K1^(i)! / ∏_{h=1}^{2^N − 1} Kh! matrices generated by the IBP map to the same left-ordered form, and multiplying P(Z) in Equation 5 by this number yields P([Z]). [sent-150, score-0.779]
73 5 Conditional distributions To define a Gibbs sampler for models using the IBP, we need to know the conditional distribution on feature assignments, P (zik = 1|Z−(ik) ). [sent-155, score-0.341]
74 In the finite model, where P (Z) is given by Equation 2, it is straightforward to compute this conditional distribution for any zik . [sent-156, score-0.359]
75 Integrating over πk gives P(zik = 1 | z−i,k) = (m−i,k + α/K) / (N + α/K), (6) where z−i,k is the set of assignments of other objects, not including i, for feature k, and m−i,k is the number of objects possessing feature k, not including i. [sent-157, score-0.74]
76 Choosing an ordering on objects such that the ith object corresponds to the last customer to visit the buffet, we obtain P(zik = 1 | z−i,k) = m−i,k / N, (7) for any k such that m−i,k > 0. [sent-160, score-0.665]
77 The number of new features associated with object i should be drawn from a Poisson(α/N) distribution. [sent-162, score-0.293]
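Putting Equations 6–7 and the Poisson(α/N) proposal together, here is a sketch of one Gibbs sweep over Z (our own code; the helper log_likelihood(Z, X), standing in for the collapsed likelihood p(X|Z, σX, σA) of Equation 8, is a hypothetical placeholder):

```python
import numpy as np
from math import factorial

def gibbs_sweep(Z, X, alpha, log_likelihood, K_new_max=4, seed=0):
    """One Gibbs sweep over the feature matrix Z under an IBP prior.
    log_likelihood(Z, X) is assumed to return log p(X | Z) with the weights integrated out."""
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    for i in range(N):
        m_minus = Z.sum(axis=0) - Z[i]               # m_{-i,k} for every existing feature
        # Resample z_ik for features shared with other objects (Equation 7 as the prior).
        for k in np.where(m_minus > 0)[0]:
            lp = np.empty(2)
            for v in (0, 1):
                Z[i, k] = v
                prior = m_minus[k] / N if v == 1 else 1.0 - m_minus[k] / N
                lp[v] = np.log(prior) + log_likelihood(Z, X)
            Z[i, k] = int(rng.random() < 1.0 / (1.0 + np.exp(lp[0] - lp[1])))
        # Features possessed only by object i are dropped and re-proposed below.
        Z = Z[:, (Z.sum(axis=0) - Z[i]) > 0]
        # Sample the number of new features from a truncated Poisson(alpha/N) posterior.
        lp = []
        for kn in range(K_new_max + 1):
            Zk = np.hstack([Z, np.zeros((N, kn), dtype=int)])
            Zk[i, Z.shape[1]:] = 1
            lp.append(kn * np.log(alpha / N) - np.log(factorial(kn)) + log_likelihood(Zk, X))
        lp = np.array(lp)
        p = np.exp(lp - lp.max())
        p /= p.sum()
        kn = rng.choice(K_new_max + 1, p=p)
        Z = np.hstack([Z, np.zeros((N, kn), dtype=int)])
        Z[i, Z.shape[1] - kn:] = 1
    return Z
```

A full sampler would alternate this sweep with updates for any hyperparameters; the sketch above only updates Z.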
78 3 A linear-Gaussian binary latent feature model To illustrate how the IBP can be used as a prior in models for unsupervised learning, we derived and tested a linear-Gaussian latent feature model in which the features are binary. [sent-164, score-1.287]
79 In this case the feature matrix F reduces to the binary matrix Z. [sent-165, score-0.501]
80 In our finite model, the D-dimensional vector of properties of an object i, xi, is generated from a Gaussian distribution with mean zi A and covariance matrix ΣX = σX^2 I, where zi is a K-dimensional binary vector, and A is a K × D matrix of weights. [sent-167, score-0.625]
81 If Z is a feature matrix, this is a form of binary factor analysis. [sent-169, score-0.355]
82 The prior on A is also matrix Gaussian, with mean 0 and covariance matrix σA^2 I. [sent-171, score-0.219]
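For concreteness, a short sketch of the generative side of this linear-Gaussian model (illustrative code; all sizes and noise levels are arbitrary choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, D = 100, 4, 36                 # number of objects, features, and observed dimensions
sigma_X, sigma_A = 0.5, 1.0

A = sigma_A * rng.standard_normal((K, D))          # feature weights, A ~ N(0, sigma_A^2 I)
Z = (rng.random((N, K)) < 0.5).astype(int)         # binary feature ownership
X = Z @ A + sigma_X * rng.standard_normal((N, D))  # x_i ~ N(z_i A, sigma_X^2 I)
```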
83 The distribution over the number of new features for each object can be approximated by truncation, computing probabilities for a range of values of K1^(i) up to an upper bound. [sent-179, score-0.346]
84 For each value, p(X|Z, σX, σA) can be computed from Equation 8, and the prior on the number of new features is Poisson(α/N). [sent-180, score-0.212]
85 We will demonstrate this Gibbs sampler for the infinite binary linear-Gaussian model on a dataset consisting of 100 images of 240 × 320 pixels. [sent-181, score-0.255]
86 Each image contained up to four everyday objects – a $20 bill, a Klein bottle, a prehistoric handaxe, and a cellular phone. [sent-183, score-0.331]
87 Each object constituted a single latent feature responsible for the observed pixel values. [sent-184, score-0.56]
88 The images were generated by sampling a feature vector, zi, from a distribution under which each feature was present with probability 0.5. [sent-185, score-0.499]
89 A photograph containing the appropriate objects was then taken using a Logitech digital webcam. [sent-186, score-0.268]
90 The Gibbs sampler was initialized with K+ = 1, choosing the feature assignments for the first column by setting zi1 = 1 with probability 0.5. [sent-188, score-0.379]
91 (b) The posterior mean of the weights (A) for the four most frequent binary features from the 1000th sample. [sent-197, score-0.361]
92 These features perfectly indicate the presence or absence of the four objects. [sent-199, score-0.28]
93 The first feature indicates the presence of the $20 bill, the other three indicate the absence of the Klein bottle, the handaxe, and the cellphone. [sent-200, score-0.25]
94 (c) Reconstructions of the images in (a) using the binary codes inferred for those images. [sent-201, score-0.211]
95 For example, the code for the first image indicates that the $20 bill is absent, while the other three objects are not. [sent-203, score-0.319]
96 The four most common features perfectly indicated the presence and absence of the four objects (shown in Figure 3 (b)), and three less common features coded for slight differences in the locations of those objects. [sent-207, score-0.663]
97 While we derived this prior as the infinite limit of a simple distribution on finite binary matrices, we have shown that the same distribution can be specified in terms of a simple stochastic process – the Indian buffet process. [sent-209, score-0.744]
98 This distribution satisfies our two desiderata for a prior for infinite latent feature models: objects are exchangeable, and inference remains tractable. [sent-210, score-0.852]
99 Our success in transferring the strategy of taking the limit of a finite model from latent classes to latent features suggests that a similar approach could be applied with other representations, expanding the forms of latent structure that can be recovered through unsupervised learning. [sent-211, score-1.057]
100 Infinite latent feature models and the Indian buffet process. [sent-285, score-0.676]
wordName wordTfidf (topN-words)
[('lof', 0.347), ('mk', 0.282), ('zik', 0.282), ('buffet', 0.242), ('latent', 0.238), ('objects', 0.235), ('binary', 0.187), ('indian', 0.181), ('feature', 0.168), ('kh', 0.165), ('customers', 0.165), ('dishes', 0.161), ('customer', 0.161), ('object', 0.154), ('exchangeable', 0.145), ('features', 0.139), ('matrices', 0.119), ('crp', 0.116), ('assignments', 0.114), ('nite', 0.11), ('equivalence', 0.105), ('ibp', 0.093), ('dirichlet', 0.089), ('classes', 0.086), ('restaurant', 0.081), ('prior', 0.073), ('matrix', 0.073), ('columns', 0.072), ('sampler', 0.068), ('ordering', 0.067), ('nitely', 0.064), ('limit', 0.061), ('desiderata', 0.06), ('poisson', 0.058), ('bill', 0.055), ('possessing', 0.055), ('unbounded', 0.055), ('gibbs', 0.054), ('history', 0.054), ('distribution', 0.053), ('mixture', 0.053), ('hn', 0.053), ('chinese', 0.051), ('process', 0.051), ('ith', 0.048), ('enter', 0.048), ('cvq', 0.046), ('handaxe', 0.046), ('pca', 0.045), ('partitions', 0.042), ('dish', 0.04), ('bottle', 0.04), ('possessed', 0.04), ('seating', 0.04), ('sampled', 0.038), ('equation', 0.038), ('assignment', 0.038), ('class', 0.037), ('histories', 0.037), ('four', 0.035), ('dimensionality', 0.035), ('upon', 0.033), ('taking', 0.033), ('everyday', 0.032), ('reconstructions', 0.032), ('klein', 0.032), ('elementwise', 0.032), ('principal', 0.031), ('absence', 0.031), ('factorial', 0.031), ('francisco', 0.031), ('sparse', 0.03), ('image', 0.029), ('grif', 0.029), ('ths', 0.029), ('generated', 0.029), ('probability', 0.029), ('division', 0.028), ('zoubin', 0.028), ('beta', 0.028), ('zi', 0.028), ('models', 0.028), ('ik', 0.027), ('advances', 0.027), ('indicating', 0.026), ('grouped', 0.026), ('representation', 0.026), ('indicate', 0.026), ('presence', 0.025), ('tables', 0.025), ('london', 0.025), ('inference', 0.025), ('derived', 0.024), ('unsupervised', 0.024), ('perfectly', 0.024), ('images', 0.024), ('rows', 0.024), ('partition', 0.024), ('conditional', 0.024), ('gatsby', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 98 nips-2005-Infinite latent feature models and the Indian buffet process
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
2 0.16336299 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
3 0.15593319 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao
Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1
4 0.13095731 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
5 0.13088614 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection
Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky
Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.
6 0.12204292 151 nips-2005-Pattern Recognition from One Example by Chopping
7 0.1121887 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
8 0.11151282 80 nips-2005-Gaussian Process Dynamical Models
9 0.10575856 192 nips-2005-The Information-Form Data Association Filter
10 0.097310282 79 nips-2005-Fusion of Similarity Data in Clustering
11 0.085888736 48 nips-2005-Context as Filtering
12 0.082037203 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
13 0.080831915 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
14 0.079076201 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
15 0.076731816 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
16 0.068976067 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
17 0.068507247 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
18 0.066691361 30 nips-2005-Assessing Approximations for Gaussian Process Classification
19 0.064467907 2 nips-2005-A Bayes Rule for Density Matrices
20 0.06196681 52 nips-2005-Correlated Topic Models
topicId topicWeight
[(0, 0.209), (1, 0.034), (2, -0.013), (3, 0.202), (4, 0.012), (5, -0.147), (6, 0.105), (7, 0.132), (8, -0.079), (9, -0.104), (10, -0.022), (11, 0.04), (12, -0.015), (13, -0.003), (14, -0.054), (15, 0.197), (16, -0.105), (17, -0.092), (18, 0.03), (19, -0.017), (20, 0.085), (21, 0.134), (22, -0.025), (23, 0.075), (24, 0.055), (25, -0.022), (26, 0.075), (27, -0.099), (28, -0.001), (29, -0.074), (30, -0.074), (31, -0.117), (32, 0.019), (33, -0.049), (34, -0.053), (35, 0.031), (36, -0.043), (37, 0.037), (38, -0.083), (39, -0.08), (40, -0.05), (41, 0.011), (42, 0.034), (43, 0.142), (44, -0.015), (45, -0.178), (46, -0.137), (47, 0.143), (48, 0.059), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.96574771 98 nips-2005-Infinite latent feature models and the Indian buffet process
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
2 0.74732369 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman
Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1
3 0.563564 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation
Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao
Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogenous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithms’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data. 1
4 0.55306655 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth
Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization. 1
5 0.53876501 151 nips-2005-Pattern Recognition from One Example by Chopping
Author: Francois Fleuret, Gilles Blanchard
Abstract: We investigate the learning of the appearance of an object from a single image of it. Instead of using a large number of pictures of the object to recognize, we use a labeled reference database of pictures of other objects to learn invariance to noise and variations in pose and illumination. This acquired knowledge is then used to predict if two pictures of new objects, which do not appear on the training pictures, actually display the same object. We propose a generic scheme called chopping to address this task. It relies on hundreds of random binary splits of the training set chosen to keep together the images of any given object. Those splits are extended to the complete image space with a simple learning algorithm. Given two images, the responses of the split predictors are combined with a Bayesian rule into a posterior probability of similarity. Experiments with the COIL-100 database and with a database of 150 deA graded LTEX symbols compare our method to a classical learning with several examples of the positive class and to a direct learning of the similarity. 1
6 0.49274996 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection
7 0.48151651 79 nips-2005-Fusion of Similarity Data in Clustering
8 0.47227824 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
9 0.44733199 192 nips-2005-The Information-Form Data Association Filter
10 0.41256085 80 nips-2005-Gaussian Process Dynamical Models
11 0.4103542 48 nips-2005-Context as Filtering
12 0.39426088 35 nips-2005-Bayesian model learning in human visual perception
13 0.38798538 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection
14 0.34470865 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
15 0.33743641 62 nips-2005-Efficient Estimation of OOMs
16 0.32896012 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
17 0.32469374 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects
18 0.3246232 77 nips-2005-From Lasso regression to Feature vector machine
19 0.31772131 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
20 0.29281279 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning
topicId topicWeight
[(3, 0.049), (10, 0.042), (11, 0.011), (27, 0.02), (31, 0.047), (34, 0.095), (39, 0.052), (47, 0.28), (55, 0.029), (69, 0.054), (73, 0.067), (88, 0.081), (91, 0.058)]
simIndex simValue paperId paperTitle
1 0.91702276 25 nips-2005-An aVLSI Cricket Ear Model
Author: Andre V. Schaik, Richard Reeve, Craig Jin, Tara Hamilton
Abstract: Female crickets can locate males by phonotaxis to the mating song they produce. The behaviour and underlying physiology has been studied in some depth showing that the cricket auditory system solves this complex problem in a unique manner. We present an analogue very large scale integrated (aVLSI) circuit model of this process and show that results from testing the circuit agree with simulation and what is known from the behaviour and physiology of the cricket auditory system. The aVLSI circuitry is now being extended to use on a robot along with previously modelled neural circuitry to better understand the complete sensorimotor pathway. 1 In trod u ction Understanding how insects carry out complex sensorimotor tasks can help in the design of simple sensory and robotic systems. Often insect sensors have evolved into intricate filters matched to extract highly specific data from the environment which solves a particular problem directly with little or no need for further processing [1]. Examples include head stabilisation in the fly, which uses vision amongst other senses to estimate self-rotation and thus to stabilise its head in flight, and phonotaxis in the cricket. Because of the narrowness of the cricket body (only a few millimetres), the Interaural Time Difference (ITD) for sounds arriving at the two sides of the head is very small (10–20µs). Even with the tympanal membranes (eardrums) located, as they are, on the forelegs of the cricket, the ITD only reaches about 40µs, which is too low to detect directly from timings of neural spikes. Because the wavelength of the cricket calling song is significantly greater than the width of the cricket body the Interaural Intensity Difference (IID) is also very low. In the absence of ITD or IID information, the cricket uses phase to determine direction. This is possible because the male cricket produces an almost pure tone for its calling song. * School of Electrical and Information Engineering, Institute of Perception, Action and Behaviour. + Figure 1: The cricket auditory system. Four acoustic inputs channel sounds directly or through tracheal tubes onto two tympanal membranes. Sound from contralateral inputs has to pass a (double) central membrane (the medial septum), inducing a phase delay and reduction in gain. The sound transmission from the contralateral tympanum is very weak, making each eardrum effectively a 3 input system. The physics of the cricket auditory system is well understood [2]; the system (see Figure 1) uses a pair of sound receivers with four acoustic inputs, two on the forelegs, which are the external surfaces of the tympana, and two on the body, the prothoracic or acoustic spiracles [3]. The connecting tracheal tubes are such that interference occurs as sounds travel inside the cricket, producing a directional response at the tympana to frequencies near to that of the calling song. The amplitude of vibration of the tympana, and hence the firing rate of the auditory afferent neurons attached to them, vary as a sound source is moved around the cricket and the sounds from the different inputs move in and out of phase. The outputs of the two tympana match when the sound is straight ahead, and the inputs are bilaterally symmetric with respect to the sound source. However, when sound at the calling song frequency is off-centre the phase of signals on the closer side comes better into alignment, and the signal increases on that side, and conversely decreases on the other. 
It is that crossover of tympanal vibration amplitudes which allows the cricket to track a sound source (see Figure 6 for example). A simplified version of the auditory system using only two acoustic inputs was implemented in hardware [4], and a simple 8-neuron network was all that was required to then direct a robot to carry out phonotaxis towards a species-specific calling song [5]. A simple simulator was also created to model the behaviour of the auditory system of Figure 1 at different frequencies [6]. Data from Michelsen et al. [2] (Figures 5 and 6) were digitised, and used together with average and “typical” values from the paper to choose gains and delays for the simulation. Figure 2 shows the model of the internal auditory system of the cricket from sound arriving at the acoustic inputs through to transmission down auditory receptor fibres. The simulator implements this model up to the summing of the delayed inputs, as well as modelling the external sound transmission. Results from the simulator were used to check the directionality of the system at different frequencies, and to gain a better understanding of its response. It was impractical to check the effect of leg movements or of complex sounds in the simulator due to the necessity of simulating the sound production and transmission. An aVLSI chip was designed to implement the same model, both allowing more complex experiments, such as leg movements to be run, and experiments to be run in the real world. Figure 2: A model of the auditory system of the cricket, used to build the simulator and the aVLSI implementation (shown in boxes). These experiments with the simulator and the circuits are being published in [6] and the reader is referred to those papers for more details. In the present paper we present the details of the circuits used for the aVLSI implementation. 2 Circuits The chip, implementing the aVLSI box in Figure 2, comprises two all-pass delay filters, three gain circuits, a second-order narrow-band band-pass filter, a first-order wide-band band-pass filter, a first-order high-pass filter, as well as supporting circuitry (including reference voltages, currents, etc.). A single aVLSI chip (MOSIS tiny-chip) thus includes half the necessary circuitry to model the complete auditory system of a cricket. The complete model of the auditory system can be obtained by using two appropriately connected chips. Only two all-pass delay filters need to be implemented instead of three as suggested by Figure 2, because it is only the relative delay between the three pathways arriving at the one summing node that counts. The delay circuits were implemented with fully-differential gm-C filters. In order to extend the frequency range of the delay, a first-order all-pass delay circuit was cascaded with a second-order all-pass delay circuit. The resulting addition of the first-order delay and the second-order delay allowed for an approximately flat delay response for a wider bandwidth as the decreased delay around the corner frequency of the first-order filter cancelled with the increased delay of the second-order filter around its resonant frequency. Figure 3 shows the first- and second-order sections of the all-pass delay circuit. Two of these circuits were used and, based on data presented in [2], were designed with delays of 28µs and 62µs, by way of bias current manipulation. The operational transconductance amplifier (OTA) in figure 3 is a standard OTA which includes the common-mode feedback necessary for fully differential designs. 
The buffers (Figure 3) are simple, cascoded differential pairs. V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ V+ V- II+ Figure 3: The first-order all-pass delay circuit (left) and the second-order all-pass delay (right). The differential output of the delay circuits is converted into a current which is multiplied by a variable gain implemented as shown in Figure 4. The gain cell includes a differential pair with source degeneration via transistors N4 and N5. The source degeneration improves the linearity of the current. The three gain cells implemented on the aVLSI have default gains of 2, 3 and 0.91 which are set by holding the default input high and appropriately ratioing the bias currents through the value of vbiasp. To correct any on-chip mismatches and/or explore other gain configurations a current splitter cell [7] (p-splitter, figure 4) allows the gain to be programmed by digital means post fabrication. The current splitter takes an input current (Ibias, figure 4) and divides it into branches which recursively halve the current, i.e., the first branch gives ½ Ibias, the second branch ¼ Ibias, the third branch 1/8 Ibias and so on. These currents can be used together with digitally controlled switches as a Digital-to-Analogue converter. By holding default low and setting C5:C0 appropriately, any gain – from 4 to 0.125 – can be set. To save on output pins the program bits (C5:C0) for each of the three gain cells are set via a single 18-bit shift register in bit-serial fashion. Summing the output of the three gain circuits in the current domain simply involves connecting three wires together. Therefore, a natural option for the filters that follow is to use current domain filters. In our case we have chosen to implement log-domain filters using MOS transistors operating in weak inversion. Figure 5 shows the basic building blocks for the filters – the Tau Cell [8] and the multiplier cell – and block diagrams showing how these blocks were connected to create the necessary filtering blocks. The Tau Cell is a log-domain filter which has the firstorder response: I out 1 , = I in sτ + 1 where τ = nC aVT Ia and n = the slope factor, VT = thermal voltage, Ca = capacitance, and Ia = bias current. In figure 5, the input currents to the Tau Cell, Imult and A*Ia, are only used when building a second-order filter. The multiplier cell is simply a translinear loop where: I out1 ∗ I mult = I out 2 ∗ AI a or Imult = AIaIout2/Iout1. The configurations of the Tau Cell to get particular responses are covered in [8] along with the corresponding equations. The high frequency filter of Figure 2 is implemented by the high-pass filter in Figure 5 with a corner frequency of 17kHz. The low frequency filter, however, is divided into two parts since the biological filter’s response (see for example Figure 3A in [9]) separates well into a narrow second-order band-pass filter with a 10kHz resonant frequency and a wide band-pass filter made from a first-order high-pass filter with a 3kHz corner frequency followed by a first-order low-pass filter with a 12kHz corner frequency. These filters are then added together to reproduce the biological filter. The filters’ responses can be adjusted post fabrication via their bias currents. This allows for compensation due to processing and matching errors. Figure 4: The Gain Cell above is used to convert the differential voltage input from the delay cells into a single-ended current output. The gain of each cell is controllable via a programmable current cell (p_splitter). 
An on-chip bias generator [7] was used to create all the necessary current biases on the chip. All the main blocks (delays, gain cells and filters), however, can have their on-chip bias currents overridden through external pins on the chip. The chip was fabricated using the MOSIS AMI 1.6µm technology and designed using the Cadence Custom IC Design Tools (5.0.33). 3 Methods The chip was tested using sound generated on a computer and played through a soundcard to the chip. Responses from the chip were recorded by an oscilloscope, and uploaded back to the computer on completion. Given that the output from the chip and the gain circuits is a current, an external current-sense circuit built with discrete components was used to enable the output to be probed by the oscilloscope. Figure 5: The circuit diagrams for the log-domain filter building blocks – The Tau Cell and The Multiplier – along with the block diagrams for the three filters used in the aVLSI model. Initial experiments were performed to tune the delays and gains. After that, recordings were taken of the directional frequency responses. Sounds were generated by computer for each chip input to simulate moving the forelegs by delaying the sound by the appropriate amount of time; this was a much simpler solution than using microphones and moving them using motors. 4 Results The aVLSI chip was tested to measure its gains and delays, which were successfully tuned to the appropriate values. The chip was then compared with the simulation to check that it was faithfully modelling the system. A result of this test at 4kHz (approximately the cricket calling-song frequency) is shown in Figure 6. Apart from a drop in amplitude of the signal, the response of the circuit was very similar to that of the simulator. The differences were expected because the aVLSI circuit has to deal with real-world noise, whereas the simulated version has perfect signals. Examples of the gain versus frequency response of the two log-domain band-pass filters are shown in Figure 7. Note that the narrow-band filter peaks at 6kHz, which is significantly above the mating song frequency of the cricket which is around 4.5kHz. This is not a mistake, but is observed in real crickets as well. As stated in the introduction, a range of further testing results with both the circuit and the simulator are being published in [6]. 5 D i s c u s s i on The aVLSI auditory sensor in this research models the hearing of the field cricket Gryllus bimaculatus. It is a more faithful model of the cricket auditory system than was previously built in [4], reproducing all the acoustic inputs, as well as the responses to frequencies of both the co specific calling song and bat echolocation chirps. It also generates outputs corresponding to the two sets of behaviourally relevant auditory receptor fibres. Results showed that it matched the biological data well, though there were some inconsistencies due to an error in the specification that will be addressed in a future iteration of the design. A more complete implementation across all frequencies was impractical because of complexity and size issues as well as serving no clear behavioural purpose. Figure 6: Vibration amplitude of the left (dotted) and right (solid) virtual tympana measured in decibels in response to a 4kHz tone in simulation (left) and on the aVLSI chip (right). The plot shows the amplitude of the tympanal responses as the sound source is rotated around the cricket. 
Figure 7: Frequency-Gain curves for the narrow-band and wide-band bandpass filters. The long-term aim of this work is to better understand simple sensorimotor control loops in crickets and other insects. The next step is to mount this circuitry on a robot to carry out behavioural experiments, which we will compare with existing and new behavioural data (such as that in [10]). This will allow us to refine our models of the neural circuitry involved. Modelling the sensory afferent neurons in hardware is necessary in order to reduce processor load on our robot, so the next revision will include these either onboard, or on a companion chip as we have done before [11]. We will also move both sides of the auditory system onto a single chip to conserve space on the robot. It is our belief and experience that, as a result of this intelligent pre-processing carried out at the sensor level, the neural circuits necessary to accurately model the behaviour will remain simple. Acknowledgments The authors thank the Institute of Neuromorphic Engineering and the UK Biotechnology and Biological Sciences Research Council for funding the research in this paper. References [1] R. Wehner. Matched filters – neural models of the external world. J Comp Physiol A, 161: 511–531, 1987. [2] A. Michelsen, A. V. Popov, and B. Lewis. Physics of directional hearing in the cricket Gryllus bimaculatus. Journal of Comparative Physiology A, 175:153–164, 1994. [3] A. Michelsen. The tuned cricket. News Physiol. Sci., 13:32–38, 1998. [4] H. H. Lund, B. Webb, and J. Hallam. A robot attracted to the cricket species Gryllus bimaculatus. In P. Husbands and I. Harvey, editors, Proceedings of 4th European Conference on Artificial Life, pages 246–255. MIT Press/Bradford Books, MA., 1997. [5] R Reeve and B. Webb. New neural circuits for robot phonotaxis. Phil. Trans. R. Soc. Lond. A, 361:2245–2266, August 2003. [6] R. Reeve, A. van Schaik, C. Jin, T. Hamilton, B. Torben-Nielsen and B. Webb Directional hearing in a silicon cricket. Biosystems, (in revision), 2005b [7] T. Delbrück and A. van Schaik, Bias Current Generators with Wide Dynamic Range, Analog Integrated Circuits and Signal Processing 42(2), 2005 [8] A. van Schaik and C. Jin, The Tau Cell: A New Method for the Implementation of Arbitrary Differential Equations, IEEE International Symposium on Circuits and Systems (ISCAS) 2003 [9] Kazuo Imaizumi and Gerald S. Pollack. Neural coding of sound frequency by cricket auditory receptors. The Journal of Neuroscience, 19(4):1508– 1516, 1999. [10] Berthold Hedwig and James F.A. Poulet. Complex auditory behaviour emerges from simple reactive steering. Nature, 430:781–785, 2004. [11] R. Reeve, B. Webb, A. Horchler, G. Indiveri, and R. Quinn. New technologies for testing a model of cricket phonotaxis on an outdoor robot platform. Robotics and Autonomous Systems, 51(1):41-54, 2005.
same-paper 2 0.76688921 98 nips-2005-Infinite latent feature models and the Indian buffet process
Author: Zoubin Ghahramani, Thomas L. Griffiths
Abstract: We define a probability distribution over equivalence classes of binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features. We identify a simple generative process that results in the same distribution over equivalence classes, which we call the Indian buffet process. We illustrate the use of this distribution as a prior in an infinite latent feature model, deriving a Markov chain Monte Carlo algorithm for inference in this model and applying the algorithm to an image dataset. 1
3 0.73916364 54 nips-2005-Data-Driven Online to Batch Conversions
Author: Ofer Dekel, Yoram Singer
Abstract: Online learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.
4 0.52117109 100 nips-2005-Interpolating between types and tokens by estimating power-law generators
Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justifies the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.
5 0.520235 30 nips-2005-Assessing Approximations for Gaussian Process Classification
Author: Malte Kuss, Carl E. Rasmussen
Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate. In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged, as to when and why which algorithm should be preferred. Thus, from a practitioners point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most wide-spread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth. We examine two aspects of the approximation schemes: Firstly the accuracy of approximations to the marginal likelihood which is of central importance for model selection and model comparison. In any practical application of GPs in classification (usually multiple) parameters of the covariance function (hyperparameters) have to be handled. Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions have not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences, have obvious relevance in most areas of application, eg. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard. 1 The Gaussian Process Model for Binary Classification Let y ∈ {−1, 1} denote the class label of an input x. Gaussian process classification (GPC) is discriminative in modelling p(y|x) for given x by a Bernoulli distribution. The probability of success p(y = 1|x) is related to an unconstrained latent function f (x) which is mapped to the unit interval by a sigmoid transformation, eg. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model p(y = 1|x) = Φ(f (x)), where Φ denotes the cumulative density function of the standard Normal distribution. 
In the GPC model, Bayesian inference is performed about the latent function f in the light of observed data D = {(y_i, x_i) | i = 1, ..., m}. Let f_i = f(x_i) and f = [f_1, ..., f_m] be shorthand for the values of the latent function, and let y = [y_1, ..., y_m] and X = [x_1, ..., x_m] collect the class labels and inputs respectively. Given the latent function, the class labels are independent Bernoulli variables, so the joint likelihood factorises:

$$p(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{m} p(y_i|f_i) = \prod_{i=1}^{m} \Phi(y_i f_i),$$

and depends on f only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function f with a covariance function k(x, x'|θ), which may depend on hyperparameters θ [1]. The functional form and parameters of the covariance function encode assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values f at the observed X for given hyperparameters θ becomes:

$$p(\mathbf{f}|\mathcal{D},\theta) = \frac{N(\mathbf{f}|\mathbf{0},\mathbf{K})}{p(\mathcal{D}|\theta)} \prod_{i=1}^{m} \Phi(y_i f_i), \qquad \text{where} \quad p(\mathcal{D}|\theta) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X,\theta)\, d\mathbf{f}$$

denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, nor predictions can be computed analytically, so approximations are needed.

2 Approximate Bayesian Inference

For the GPC model, approximations are either based on a Gaussian approximation to the posterior, p(f|D, θ) ≈ q(f|D, θ) = N(f|m, A), or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace's method and Expectation Propagation (EP), which are two alternative approaches to finding the parameters m and A of the Gaussian q(f|D, θ). Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace's approximation (LA) is found by making a second-order Taylor approximation of the (un-normalised) log posterior [3]. The mean m is placed at the mode (MAP) and the covariance A equals the negative inverse Hessian of the log posterior density at m. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters m and A are found in an iterative scheme by matching the approximate marginal moments of p(f_i|D, θ) with the marginals of the approximation N(f_i|m_i, A_ii). Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with the probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions.

Figure 1: Panel (a) provides a one-dimensional illustration of the approximations, showing the likelihood p(y|f), the prior p(f), the posterior p(f|y), and the Laplace and EP approximations q(f|y). The prior N(f|0, 5²) combined with the probit likelihood (y = 1) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace's approximation peaks at the posterior mode, but places far too much mass over negative values of f and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace's approximation. In Panel (b) we caricature a high dimensional zero-mean Gaussian prior as an ellipse.
The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in K), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin.

For a test input x_*, the approximate predictive latent and class probabilities are:

$$q(f_*|\mathcal{D},\theta,\mathbf{x}_*) = N(f_*|\mu_*, \sigma_*^2), \qquad q(y_*=1|\mathcal{D},\mathbf{x}_*) = \Phi\!\left(\mu_*/\sqrt{1+\sigma_*^2}\right),$$

where $\mu_* = \mathbf{k}_*^{\top}\mathbf{K}^{-1}\mathbf{m}$ and $\sigma_*^2 = k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^{\top}(\mathbf{K}^{-1} - \mathbf{K}^{-1}\mathbf{A}\mathbf{K}^{-1})\mathbf{k}_*$, and the vector $\mathbf{k}_* = [k(\mathbf{x}_1,\mathbf{x}_*), \ldots, k(\mathbf{x}_m,\mathbf{x}_*)]^{\top}$ collects the covariances between x_* and the training inputs X.

MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over f and θ jointly [5], we compare to methods using ML-II optimisation over θ, and thus we use MCMC to integrate over f only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic, O(m³), as is common among non-sparse GP models, due to inversions of m × m matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data points. Making AIS work efficiently requires some fine-tuning, and a single estimate of p(D|θ) can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon.

3 Structural Properties of the Posterior and its Approximations

Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated m-dimensional Gaussian N(f|0, K) centred at the origin. Each likelihood term p(y_i|f_i) softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing y. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell away from the mean, its location depending on the covariance structure [7, ch. 29.2]. Intuitively this occurs because in high dimensions the volume grows extremely rapidly with the radius. As a consequence, the mode becomes less representative (typical) of the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative of the posterior distribution, while the mean moves towards the bulk of the posterior mass, making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore, in GPC we must be aware that we are making a Gaussian approximation to a non-Gaussian posterior.
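As an aside, the predictive equations from Section 2 above can be sketched as follows (Python/NumPy/SciPy assumed; `K`, `k_star`, `k_ss`, `m`, and `A` stand for the quantities defined in the text, with m and A supplied by whichever approximation is used; explicit matrix inverses are kept for brevity where a real implementation would use Cholesky factors):

```python
import numpy as np
from scipy.stats import norm

def predict_one(K, k_star, k_ss, m, A):
    """Approximate predictive latent mean/variance and class probability for a
    single test input, given a Gaussian approximation N(f | m, A) to the
    posterior over the training latents.

    K      : (n, n) prior covariance of the training latents
    k_star : (n,)   covariances k(x_i, x_*) between training inputs and the test input
    k_ss   : float  prior variance k(x_*, x_*)
    """
    K_inv = np.linalg.inv(K)
    mu_star = k_star @ K_inv @ m
    var_star = k_ss - k_star @ (K_inv - K_inv @ A @ K_inv) @ k_star
    p_star = norm.cdf(mu_star / np.sqrt(1.0 + var_star))
    return mu_star, var_star, p_star
```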
From the properties of the posterior it can be expected that Laplace's method places m in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As a consequence, the amplitude of the approximate latent posterior GP will be systematically underestimated, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below.

4 Experiments

In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form:

$$k(\mathbf{x},\mathbf{x}'|\theta) = \sigma^2 \exp\!\left(-\tfrac{1}{2}\,\|\mathbf{x}-\mathbf{x}'\|^2/\ell^2\right), \qquad (1)$$

such that θ = [σ, ℓ]. We refer to σ² as the signal variance and to ℓ as the characteristic length-scale. Note that for many classification tasks it may be reasonable to use an individual length-scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function, and we believe the conclusions about the accuracy of approximations to be independent of this choice, since they rely on arguments which are independent of the form of the covariance function.

As a measure of the accuracy of predictive probabilities we use the average information, in bits, of the predictions about the test targets in excess of that of random guessing. Let $p_*^i$ denote the model's prediction $p(y_*=1|\mathcal{D},\theta,\mathbf{x}_*)$ for the $i$-th test case; then we average

$$I(p_*^i, y_i) = \frac{y_i+1}{2}\log_2(p_*^i) + \frac{1-y_i}{2}\log_2(1-p_*^i) + H \qquad (2)$$

over all test cases, where H is the entropy of the training labels. The error rate E is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs.

For the first set of experiments presented here, the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3's vs. 5's (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each θ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2.
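A sketch of the covariance function (1) and the information measure (2), assuming labels in {-1, +1} (Python/NumPy; illustrative only, and without the numerical safeguards a real implementation would add for probabilities of exactly 0 or 1):

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma, ell):
    """Covariance (1): k(x, x') = sigma^2 * exp(-0.5 * ||x - x'||^2 / ell^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return sigma**2 * np.exp(-0.5 * sq_dists / ell**2)

def average_information(p_star, y_test, y_train):
    """Average information (2), in bits, in excess of random guessing."""
    q = np.mean(y_train == 1)
    H = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))   # entropy of the training labels
    I = ((y_test + 1) / 2) * np.log2(p_star) \
        + ((1 - y_test) / 2) * np.log2(1 - p_star) + H
    return I.mean()
```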
Figure 2: Comparison of marginal likelihood approximations and predictive performances of different approximation techniques for USPS 3's vs. 5's (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4), respectively; each panel is a contour plot over the log lengthscale log(ℓ) and the log magnitude log(σ_f).

Figure 3: Panels (a) and (b) show two marginal distributions p(f_i|D, θ) from a GPC posterior and its approximations (MCMC samples, Laplace q(f_i|D), EP q(f_i|D)). The true posterior is approximated by a normalised histogram of 9000 samples of f_i obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples.

For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace's method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well (the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a natural log scale), given that the marginal likelihood is a 767- and 200-dimensional integral, respectively.
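For reference, the LA estimate of log p(D|θ) that enters the grid comparison above can be sketched as follows. This is a naive version of the standard Newton scheme for the probit GPC posterior (Python/NumPy/SciPy assumed); the authors' implementation would follow the numerically stabilised formulation from the GP literature rather than the explicit inverses used here:

```python
import numpy as np
from scipy.stats import norm

def laplace_gpc(K, y, n_iter=50):
    """Laplace approximation for GPC with a probit likelihood.
    Returns the posterior mode f_hat, the Gaussian covariance A, and the
    Laplace estimate of the log marginal likelihood log p(D | theta).

    K : (n, n) prior covariance of the training latents
    y : (n,)   class labels in {-1, +1}
    """
    n = len(y)
    f = np.zeros(n)
    K_inv = np.linalg.inv(K)
    for _ in range(n_iter):
        z = y * f
        s = norm.pdf(z) / norm.cdf(z)
        grad = y * s                                  # d log p(y|f) / d f
        W = np.diag(s**2 + z * s)                     # -d^2 log p(y|f) / d f^2 (diagonal)
        f = np.linalg.solve(K_inv + W, W @ f + grad)  # Newton step towards the mode
    z = y * f
    s = norm.pdf(z) / norm.cdf(z)
    W = np.diag(s**2 + z * s)
    A = np.linalg.inv(K_inv + W)                      # covariance of q(f | D, theta)
    sqrt_w = np.sqrt(np.diag(W))
    B = np.eye(n) + sqrt_w[:, None] * K * sqrt_w[None, :]
    log_ml = (-0.5 * f @ K_inv @ f + norm.logcdf(z).sum()
              - 0.5 * np.linalg.slogdet(B)[1])
    return f, A, log_ml
```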
The EP predictions contain as much information about the test cases as the MCMC predictions, and significantly more than the LA predictions. Note that for small signal variances (roughly ln(σ²) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As a consequence the posterior will be less skewed and both approximations will lead to similar results.

For the USPS 3's vs. 5's task we now inspect the marginal distributions p(f_i|D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln(ℓ) = 2.85, which are between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f|D, θ). For LA and EP the approximate marginals are q(f_i|D, θ) = N(f_i|m_i, A_ii), where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace's approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(y_i = 1|D, θ) < 0.1%, but LA gives an incorrect p(y_i = 1|D, θ) ≈ 18%.

In the experiment we saw that the marginal distributions of the posterior often agree very well with a Gaussian approximation. This seems to contradict the description given in the previous section, where we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767-dimensional Gaussian N(x|0, C) with a covariance matrix having one eigenvalue of 100 with eigenvector 1 (the vector of all ones), and all other eigenvalues equal to 1. We then truncate this distribution such that all x_i ≥ 0. Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one x_i. The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property that, for a truncated high-dimensional Gaussian resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian; a Laplace approximation centred at the origin, however, would be completely inappropriate. A small sampling sketch of this construction is given below.

In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well-known benchmark problems. Each data set is randomly split into 10 folds, of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds.
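The synthetic truncated-Gaussian construction described above can be reproduced, at a smaller scale, with a simple coordinate-wise Gibbs sampler (Python/NumPy/SciPy assumed; the paper does not say which sampler was used, and the dimension is reduced here from 767 to keep the sketch fast):

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_truncated_gaussian(C, n_samples, burn_in=500, seed=0):
    """Gibbs sampling from N(x | 0, C) truncated to the positive orthant (all x_i >= 0)."""
    rng = np.random.default_rng(seed)
    P = np.linalg.inv(C)                        # precision matrix
    d = C.shape[0]
    x = np.abs(rng.standard_normal(d))          # start inside the orthant
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(d):
            cond_var = 1.0 / P[i, i]
            cond_mean = -cond_var * (P[i] @ x - P[i, i] * x[i])
            a = (0.0 - cond_mean) / np.sqrt(cond_var)   # truncation point in standard units
            x[i] = truncnorm.rvs(a, np.inf, loc=cond_mean,
                                 scale=np.sqrt(cond_var), random_state=rng)
        if t >= burn_in:
            samples.append(x.copy())
    return np.asarray(samples)

d = 50                                          # the text uses 767; 50 illustrates the same effect
u = np.ones(d) / np.sqrt(d)
C = np.eye(d) + 99.0 * np.outer(u, u)           # one eigenvalue of 100 along 1, all others 1
S = gibbs_truncated_gaussian(C, n_samples=2000)
# A histogram of S[:, 0] looks strikingly Gaussian despite the truncation at zero,
# while the mode of the full truncated distribution remains at the origin.
```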
For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and ℓ are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8]. Results are summarised in Table 1.

Comparing Laplace's method to EP, the latter proves to be more accurate in terms of both error rate and information. While the error rates are relatively similar, the predictive distributions obtained by EP are more informative about the test targets. Note that for GPC the error rate depends only on the sign of the mean µ_* of the approximated posterior over latent functions, and not on the entire posterior predictive distribution. As expected, the length of the mean vector m is much larger for the EP approximation. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace's method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small an amplitude, although the sign of the mean function will be mostly correct. As a consequence LA gives over-conservative predictive probabilities, and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p_* at 1/2. For this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations).

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of the inputs n. For Laplace's method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.

Data Set        m     n   | Laplace: E%    I      ‖m‖     | EP: E%    I      ‖m‖      | SVM: E%    I
Ionosphere      351   34  |   8.84   0.591   49.96  |   7.99   0.661   124.94   |   5.69   0.681
Wisconsin       683   9   |   3.21   0.804   62.62  |   3.21   0.805   84.95    |   3.21   0.795
Pima Indians    768   8   |  22.77   0.252   29.05  |  22.63   0.253   47.49    |  23.01   0.232
Crabs           200   7   |   2.0    0.682  112.34  |   2.0    0.908  2552.97   |   2.0    0.047
Sonar           208   60  |  15.36   0.439   26.86  |  13.85   0.537  15678.55  |  11.14   0.567
USPS 3 vs 5     1540  256 |   2.27   0.849  163.05  |   2.21   0.902  22011.70  |   2.01   0.918
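One way to reproduce the flavour of the SVM baseline is sketched below, assuming scikit-learn (not used in the paper); `probability=True` post-fits a sigmoid to the SVM outputs in the spirit of [8], and the synthetic data and coarse grids merely stand in for the benchmark folds and the refined grid search described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data; in the experiments each benchmark fold plays this role.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)

# RBF kernel exp(-gamma * ||x - x'||^2) matches (1) without the signal variance
# when gamma = 1 / (2 * ell^2).
grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)}
svm = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5)
svm.fit(X_tr, y_tr)

p_star = svm.predict_proba(X_te)[:, 1]          # sigmoid-mapped predictive probabilities
error_rate = np.mean((p_star > 0.5) != (y_te == 1))
# The information measure (2) can then be computed from p_star and the training labels.
```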
The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to assess the accuracy of the marginal likelihood estimates, which to the best of our knowledge has never been done before. In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].

Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. MIT Press, 2000.
6 0.51928812 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
7 0.51665848 102 nips-2005-Kernelized Infomax Clustering
8 0.51399571 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
9 0.51349533 177 nips-2005-Size Regularized Cut for Data Clustering
10 0.51228994 144 nips-2005-Off-policy Learning with Options and Recognizers
11 0.51219392 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs
12 0.51170152 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
13 0.50812542 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
14 0.50808048 184 nips-2005-Structured Prediction via the Extragradient Method
15 0.50770366 23 nips-2005-An Application of Markov Random Fields to Range Sensing
16 0.50759596 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes
17 0.50602204 151 nips-2005-Pattern Recognition from One Example by Chopping
18 0.50448567 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
19 0.50336242 74 nips-2005-Faster Rates in Regression via Active Learning
20 0.50269437 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach