nips nips2000 nips2000-107 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. [sent-9, score-0.66]
2 Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. [sent-10, score-0.493]
3 The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. [sent-12, score-0.204]
4 The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations. [sent-13, score-0.362]
5 1 Introduction Face recognition is difficult when the number of individuals is large and the test and training images of an individual differ in expression, pose, lighting or the date on which they were taken. [sent-14, score-0.985]
6 We start by describing a new unsupervised learning algorithm for a restricted form of Boltzmann machine [1]. [sent-17, score-0.039]
7 We then show how to generalize the generative model and the learning algorithm to deal with real-valued pixel intensities and rate-coded feature detectors. [sent-18, score-0.191]
8 We then apply the model to face recognition and compare it to other methods. [sent-19, score-0.307]
9 Because there is no explaining away [3], inference in an RBM is much easier than in a general Boltzmann machine or in a causal belief network with one hidden layer. [sent-22, score-0.208]
10 There is no need to perform any iteration to determine the activities of the hidden units, as the hidden states, Sj, are conditionally independent given the visible states, Si . [sent-23, score-0.504]
11 The distribution of Sj is given by the standard logistic function: $p(S_j = 1 \mid \mathbf{s}) = \frac{1}{1 + \exp(-\sum_i w_{ij} S_i)}$ (1). Conversely, the hidden states of an RBM are marginally dependent, so it is easy for an RBM to learn population codes in which units may be highly correlated. [sent-24, score-0.365]
12 It is hard to do this in causal belief networks with one hidden layer because the generative model of a causal belief net assumes marginal independence. [sent-25, score-0.374]
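The inference step described above can be written in a few lines of numpy. This is a minimal sketch, not the authors' code: it assumes a weight matrix W with one row per visible unit and one column per hidden unit, and it omits bias terms.

```python
import numpy as np

def hidden_probabilities(v, W):
    """p(S_j = 1 | v) for every hidden unit j of an RBM (equation 1).

    v : (num_visible,) vector of visible states
    W : (num_visible, num_hidden) weight matrix
    Because the hidden units are conditionally independent given the
    visible states, the whole posterior is one matrix product.
    """
    return 1.0 / (1.0 + np.exp(-v @ W))
```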
13 One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel with Gibbs sampling. [sent-27, score-1.037]
14 This learning rule does not work well because it can take a long time to approach equilibrium and the sampling noise in the estimate of $\langle S_i S_j \rangle_{Q^\infty}$ can swamp the gradient. [sent-30, score-0.042]
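The fix adopted here is to compare the data-driven correlations with those measured after a single reconstruction step rather than at equilibrium (contrastive divergence). The sketch below illustrates that one-step update for a plain binary RBM; the function name, the absence of biases, and the learning-rate value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_update(v0, W, lr=0.01):
    """One-step contrastive-divergence update for a binary RBM (sketch).

    <S_i S_j> is estimated with the data clamped and again after a single
    alternating Gibbs step (the one-step reconstruction), instead of
    waiting for the Markov chain to reach equilibrium.
    """
    h0 = 1.0 / (1.0 + np.exp(-v0 @ W))             # p(h | data)
    h0_sample = (rng.random(h0.shape) < h0) * 1.0  # binary hidden states
    v1 = 1.0 / (1.0 + np.exp(-h0_sample @ W.T))    # one-step reconstruction
    h1 = 1.0 / (1.0 + np.exp(-v1 @ W))             # p(h | reconstruction)
    # positive (data) correlations minus negative (reconstruction) correlations
    return W + lr * (np.outer(v0, h0) - np.outer(v1, h1))
```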
15 A simple way to increase the representational power without changing the inference and learning procedures is to imagine that each visible unit, i, has 10 replicas which all have identical weights to the hidden units. [sent-33, score-0.49]
16 So far as the hidden units are concerned, it makes no difference which particular replicas are turned on: it is only the number of active replicas that counts. [sent-34, score-0.579]
17 So a pixel can now have 11 different intensities. [sent-35, score-0.073]
18 During reconstruction of the image from the hidden activities, all the replicas can share the computation of the probability, $p_i$, of turning on, and then we can select n replicas to be on with probability $\binom{10}{n} p_i^n (1 - p_i)^{10 - n}$. [sent-36, score-0.524]
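A reconstruction step for the rate-coded visible units then amounts to one binomial draw per pixel. The sketch below assumes numpy and illustrative names only.

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_visible(p):
    """Sample rate-coded pixel intensities during reconstruction.

    Each pixel has 10 identical replicas, so the number of active replicas
    follows a binomial: P(n) = C(10, n) * p**n * (1 - p)**(10 - n).
    p is the per-replica probability of turning on (one value per pixel).
    """
    return rng.binomial(10, p)   # integer intensities in 0..10
```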
19 We assumed that the visible units can produce up to 10 spikes and the hidden units can produce up to 100 spikes. [sent-43, score-0.64]
20 3 by their expected values and we used the expected value of Si when computing the probability of activation of the hidden units. [sent-45, score-0.157]
21 However, we continued to use the stochastically chosen integer firing rates of the hidden units when computing the one-step reconstructions of the data, so the hidden activities cannot transmit an unbounded amount of information from the data to the reconstruction. [sent-46, score-0.606]
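Put together, one up-down pass of the rate-coded model looks roughly like the sketch below: the hidden rates are sampled as integers up to 100, while the visible reconstruction uses the expected value 10·p_i. This is a simplified illustration (no biases, illustrative names), not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbmrate_step(v, W):
    """One up-down pass of RBMrate (sketch).

    v : visible firing rates, integers in 0..10
    W : (num_visible, num_hidden) weights
    Hidden rates are sampled, so they cannot transmit unbounded information;
    the visible reconstruction uses expected values to avoid sampling noise.
    """
    p_hidden = 1.0 / (1.0 + np.exp(-v @ W))
    h = rng.binomial(100, p_hidden)             # stochastic integer firing rates
    p_visible = 1.0 / (1.0 + np.exp(-h @ W.T))
    v_recon = 10.0 * p_visible                  # expected reconstruction, 0..10
    return h, v_recon
```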
22 A simple way to use RBMrate for face recognition is to train a single model on the training set, and to identify a face by finding the gallery image that produces a hidden activity vector that is most similar to the one produced by the face. [sent-47, score-1.095]
23 This is how eigenfaces are used for recognition, but it does not work well because it does not take into account the fact that some variations across faces are important for recognition, while others are not. [sent-48, score-0.341]
24 To correct this, we instead trained an RBMrate model on pairs of different images of the same individual, and then we used this model of pairs to decide which gallery image is best paired with the test image. [sent-49, score-0.921]
25 To account for the fact that the model likes some individual face images more than others, we define the fit between two faces f1 and f2 as G(f1, f2) + G(f2, f1) - G(f1, f1) - G(f2, f2), where the goodness score G(v1, v2) is the negative free energy of the image pair v1, v2 under the model. [sent-50, score-0.929]
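The sketch below spells out that fit score. The free-energy expression used here is the standard one for binary hidden units with no biases, and it assumes the pair model sees the two images concatenated into one visible vector; both are simplifying assumptions rather than the paper's exact formulation.

```python
import numpy as np

def goodness(v1, v2, W):
    """G(v1, v2): negative free energy of the image pair under the model.

    Sketch only: binary hidden units, no bias terms, pair formed by
    concatenating the two images into one visible vector.
    """
    v = np.concatenate([v1, v2])
    return np.sum(np.log1p(np.exp(v @ W)))

def fit(f1, f2, W):
    """Fit between faces f1 and f2: G(f1,f2) + G(f2,f1) - G(f1,f1) - G(f2,f2)."""
    return (goodness(f1, f2, W) + goodness(f2, f1, W)
            - goodness(f1, f1, W) - goodness(f2, f2, W))
```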
26 However, to preserve symmetry, each pair of images of the same individual v1, v2 in the training set has a reversed pair v2, v1 in the set. [sent-52, score-0.573]
27 We trained the model with 100 hidden units on 1000 image pairs (500 distinct pairs) for 2000 iterations in batches of 100, with a learning rate of 2. [sent-53, score-0.53]
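Building the symmetric training set is a one-liner; the helper below is only an illustration of the pairing step described above.

```python
def make_symmetric_pairs(distinct_pairs):
    """Turn 500 distinct same-individual pairs (v1, v2) into 1000 training
    pairs by adding the reversed copy (v2, v1) of each one."""
    return [p for v1, v2 in distinct_pairs for p in ((v1, v2), (v2, v1))]
```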
28 One advantage of eigenfaces over correlation is that once the test image has been converted into a vector of eigenface activations, comparisons of test and gallery images can be made in the low-dimensional space of eigenface activations rather than the high-dimensional space of pixel intensities. [sent-56, score-1.278]
29 The same applies to our face-pair network, as the goodness score of an image pair is a simple function of the total input received by each hidden unit from each image. [sent-57, score-0.513]
30 The total inputs from each gallery image can be precomputed and stored, while the total inputs from a test image need only be computed once for comparisons with all gallery images. [sent-58, score-0.752]
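A sketch of that precomputation, reusing the simplified free-energy form from the fit-score sketch above; the split of the weights into a gallery half and a test half, and all names, are assumptions for illustration.

```python
import numpy as np

def total_inputs(image, W_half):
    """Total input each hidden unit receives from one image.
    W_half is the block of weights connecting that image's pixels to the
    hidden units (illustrative)."""
    return image @ W_half

def score_against_gallery(test_image, gallery_inputs, W_test_half):
    """Goodness-style scores of one test image against every gallery image.

    gallery_inputs : (num_gallery, num_hidden), precomputed once and stored.
    The test image's contribution is computed once and broadcast.
    """
    x_test = total_inputs(test_image, W_test_half)
    return np.log1p(np.exp(gallery_inputs + x_test)).sum(axis=1)
```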
31 4 The FERET database Our version of the FERET database contained 1002 frontal face images of 429 individuals taken over a period of a few years under varying lighting conditions. [sent-59, score-0.963]
32 Of these images, 818 are used as both the gallery and the training set and the remaining 184 are divided into four disjoint test sets. [sent-60, score-0.359]
33 The training set also includes a further 244 pairs of images that differ only in expression. [sent-65, score-0.465]
34 The Δdays test set contains 40 images that come from 20 individuals. [sent-66, score-0.432]
35 Each of these individuals has two images from the same session in the training set and two images taken in a session 4 days later or earlier in the test set. [sent-67, score-1.157]
36 A further 28 individuals were photographed 4 days apart and all 112 of these images are in the training set. [sent-68, score-0.591]
37 The Δmonths test set is just like the Δdays test set except that the time between sessions was at least three months and different lighting conditions were present in the two sessions. [sent-69, score-0.446]
38 A further 36 images of 9 more individuals were included in the training set. [sent-71, score-0.517]
39 The Δglasses test set contains 14 images of 7 different individuals. [sent-72, score-0.432]
40 Each of these individuals has two images in the training set that were taken in another session on the same day. [sent-73, score-0.604]
41 The training and test pairs for an individual differ in that one pair has glasses and the other does not. [sent-74, score-0.489]
42 The training set includes a further 24 images, half with glasses and half without, from 6 more individuals. [sent-75, score-0.296]
43 The images include the whole head, parts of the shoulder, and background. [sent-76, score-0.282]
44 Instead of working with whole images, which contain much irrelevant information, we worked with face images that were normalized as shown in figure 2. [sent-77, score-0.501]
45 Masking out all of the background inevitably loses the contour of the face, which contains much discriminative information. [sent-78, score-0.37]
46 The histogram equalization step removes most lighting effects, but it also removes some relevant information like the skin tone. [sent-79, score-0.335]
47 For the best performance, the contour shape and skin tone would have to be used as additional sources of discriminative information. [sent-80, score-0.296]
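A rough sketch of the normalization step, assuming a precomputed boolean oval mask and scikit-image for histogram equalization; the geometric alignment shown in figure 2 is omitted, and the rescaling to the 0..10 rate-coded range is an added assumption.

```python
import numpy as np
from skimage.exposure import equalize_hist

def normalize_face(image, mask):
    """Mask out the background and histogram-equalize the face region.

    image : 2-D grayscale array
    mask  : boolean array marking the oval face region (assumed given)
    Returns pixel intensities quantized to 0..10 for the rate-coded model.
    """
    equalized = equalize_hist(image)              # floats in [0, 1]
    return np.where(mask, np.rint(10 * equalized), 0)
```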
48 5 Comparative results We compared RBMrate with four popular face recognition methods. [sent-81, score-0.307]
49 The first and simplest is correlation, which returns the similarity score as the angle between two images represented as vectors of pixel intensities. [sent-82, score-0.484]
50 The second method is eigenfaces [5], which first projects the images onto the principal component subspaces, then returns the similarity score as the angle between the projected images. [sent-84, score-0.691]
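Both baselines reduce to an angle between vectors, as sketched below (illustrative names; the eigenface projection assumes a precomputed mean face and component matrix).

```python
import numpy as np

def angle_similarity(a, b):
    """Cosine of the angle between two vectors: correlation uses raw pixel
    vectors, eigenfaces uses the projected coefficients."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def eigenface_coefficients(images, mean_face, components):
    """Project mean-centred images onto the leading principal components.
    components : (num_components, num_pixels)"""
    return (images - mean_face) @ components.T
```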
51 Instead of projecting the images onto the subspace of the principal components, which maximizes the variance . [sent-86, score-0.38]
53 Figure 3: Error rates of all methods on all test sets. [sent-111, score-0.61]
54 The bars in each group correspond, from left to right, to the rank-1, rank-2, rank-4, rank-8 and rank-16 error rates. [sent-112, score-0.035]
55 The rank-n error rate is the percentage of test images where the n most similar gallery images are all incorrect. [sent-113, score-0.876]
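For concreteness, a small helper computing that error criterion for one test image (names are illustrative):

```python
import numpy as np

def rank_n_error(similarities, correct_index, n):
    """True if the correct gallery image is NOT among the n gallery images
    most similar to the test image.

    similarities  : scores of one test image against every gallery image
    correct_index : index of the gallery image with the same identity
    """
    top_n = np.argsort(similarities)[::-1][:n]
    return correct_index not in top_n
```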
56 among the projected images, fisherfaces projects the images onto a subspace which, at the same time, maximizes the between-individual variances and minimizes the within-individual variances in the training set. [sent-114, score-0.67]
57 This method models differences between images of the same individual as a PPCA [8, 9], and differences between images of different individuals as another PPCA. [sent-116, score-0.808]
58 Then given a difference of two images, it returns as the similarity score the likelihood ratio of the difference image under the two PPCA models. [sent-117, score-0.248]
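A sketch of that likelihood-ratio score using scikit-learn's probabilistic-PCA log-likelihood; the function names and the way the difference images are supplied are assumptions, not the reference implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_dual_ppca(within_diffs, between_diffs):
    """Fit one PPCA to same-individual difference images and another to
    different-individual differences (10 and 30 components, as in the text).
    Each argument is an array with one flattened difference image per row."""
    return (PCA(n_components=10).fit(within_diffs),
            PCA(n_components=30).fit(between_diffs))

def likelihood_ratio_score(img_a, img_b, ppca_within, ppca_between):
    """Log-likelihood ratio of the difference image under the two PPCA
    models; higher values favour 'same individual'."""
    d = (img_a - img_b).reshape(1, -1)
    return ppca_within.score_samples(d)[0] - ppca_between.score_samples(d)[0]
```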
59 It was the best performing algorithm in the September 1996 FERET test [10]. [sent-118, score-0.158]
60 For eigenfaces, we used 199 principal components, omitting the first principal component, as we determined manually that it simply encodes lighting conditions. [sent-119, score-0.264]
61 This improved the recognition performance on all the test sets except for Δexpression. [sent-120, score-0.198]
62 We used a subspace of dimension 200 for fisherfaces, while we used 10- and 30-dimensional PPCAs for the within-class and between-class models of δppca respectively. [sent-121, score-0.045]
63 These are the same numbers used by Moghaddam et al. and give the best results in our simulations. [sent-122, score-0.048]
64 The number of dimensions or hidden units used by each method was optimized for that particular method for best performance. [sent-123, score-0.379]
65 Figure 3 shows the error rates of all five methods on the test sets. [sent-124, score-0.11]
66 Correlation and eigenfaces perform poorly on Δexpression, probably because they do not attempt to ignore the within-individual variations, whereas the other methods do. [sent-126, score-0.244]
67 All the models did very poorly on the Δmonths test set, which is unfortunate, as this is the test set that is most like real applications. [sent-127, score-0.22]
68 RBMrate performed best on Δexpression, fisherfaces is best on Δdays and Δglasses, while eigenfaces is best on Δmonths. [sent-128, score-0.477]
69 Most human observers cannot find the correct match within these 8. [sent-131, score-0.039]
70 Top half: with unconstrained weights; bottom half: with non-negative weight constraints. [sent-134, score-0.039]
71 contour produced by masking out all of the background. [sent-135, score-0.141]
72 6 Receptive fields learned by RBMrate The top half of figure 5 shows the weights of a few of the hidden units after training. [sent-136, score-0.492]
73 All the units encode global features, probably because the image normalization ensures that there are strong long range correlations in pixel intensities. [sent-137, score-0.457]
74 Note, however, that the hidden unit activations range from 0 to 100. [sent-141, score-0.282]
75 On the left are 4 units exhibiting interesting features and on the right are 4 units chosen at random. [sent-142, score-0.463]
76 The top unit of the first column seems to be encoding the presence of a mustache in both faces. [sent-143, score-0.086]
77 The bottom unit seems to be coding for prominent right eyebrows in both faces. [sent-144, score-0.086]
78 Note that these are facial features which often remain constant across images of the same individual. [sent-145, score-0.397]
79 In the second column are two features which seem to encode for different facial expressions in the two faces. [sent-146, score-0.154]
80 The right side of the top unit encodes a smile while the left side is expressionless. [sent-147, score-0.082]
81 So the network has discovered some features which are fairly constant across images in the same class, and some features which can differ substantially within a class. [sent-149, score-0.498]
82 Inspired by [11], we tried to enforce local features by restricting the weights to be non-negative. [sent-150, score-0.154]
83 This is achieved by resetting negative weights to zero after each weight update. [sent-151, score-0.074]
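In code, the constraint is a single projection applied after every update (a sketch):

```python
import numpy as np

def project_nonnegative(W):
    """Reset any negative weights to zero, enforcing the non-negative
    weight constraint after each weight update."""
    return np.maximum(W, 0.0)
```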
84 The bottom half of figure 5 shows some of the hidden receptive fields learned. [sent-152, score-0.283]
85 Except for the 4 features on the left, all other features are local and code for features like mouth shape changes (third column) and eyes and cheeks (fourth column). [sent-153, score-0.282]
86 The 4 features on the left are much more global and clearly capture the fact that the direction of the lighting can differ for two images of the same person. [sent-154, score-0.555]
87 Unfortunately, constraining the weights to be non-negative strongly limits the representational power of RBMrate and makes it worse than all the other methods on all the test sets. [sent-155, score-0.184]
88 7 Conclusions We have introduced a new method for face recognition based on a non-linear generative model. [sent-156, score-0.386]
89 The generative model can be very complex, yet retains the efficiency required for applications. [sent-157, score-0.079]
90 Performance on the FERET database is comparable to popular methods. [sent-158, score-0.058]
91 However, unlike other methods based on linear models, there is plenty of room for further development using prior knowledge to constrain the weights or additional layers of hidden units to model the correlations of feature detector activities. [sent-159, score-0.474]
92 Eigenfaces versus fisherfaces: recognition using class specific linear projection. [sent-206, score-0.088]
wordName wordTfidf (topN-words)
[('rbmrate', 0.347), ('images', 0.282), ('face', 0.219), ('gallery', 0.202), ('eigenfaces', 0.188), ('individuals', 0.188), ('units', 0.174), ('feret', 0.173), ('lighting', 0.158), ('hidden', 0.157), ('rbm', 0.149), ('fisherfaces', 0.145), ('visible', 0.135), ('replicas', 0.124), ('image', 0.119), ('test', 0.11), ('boltzmann', 0.1), ('corr', 0.1), ('sisj', 0.1), ('score', 0.088), ('recognition', 0.088), ('half', 0.087), ('ppca', 0.087), ('session', 0.087), ('skin', 0.087), ('moghaddam', 0.084), ('pairs', 0.08), ('features', 0.08), ('generative', 0.079), ('activations', 0.078), ('sj', 0.078), ('glasses', 0.075), ('weights', 0.074), ('days', 0.074), ('pixel', 0.073), ('months', 0.068), ('pair', 0.065), ('contrastive', 0.063), ('faces', 0.063), ('reconstructions', 0.063), ('contour', 0.06), ('database', 0.058), ('eigen', 0.058), ('eigenface', 0.058), ('expre', 0.058), ('nths', 0.058), ('oppca', 0.058), ('oval', 0.058), ('phillips', 0.058), ('reversed', 0.058), ('differ', 0.056), ('individual', 0.056), ('probably', 0.056), ('activities', 0.055), ('gibbs', 0.055), ('principal', 0.053), ('causal', 0.051), ('cd', 0.051), ('discriminative', 0.051), ('hinton', 0.05), ('tone', 0.05), ('explorations', 0.05), ('mcclelland', 0.05), ('microstructure', 0.05), ('best', 0.048), ('unit', 0.047), ('training', 0.047), ('subspace', 0.045), ('variations', 0.045), ('rumelhart', 0.045), ('removes', 0.045), ('vi', 0.044), ('produced', 0.044), ('parallel', 0.044), ('si', 0.044), ('equilibrium', 0.042), ('eyes', 0.042), ('september', 0.042), ('gatsby', 0.041), ('returns', 0.041), ('contains', 0.04), ('pixels', 0.04), ('bottom', 0.039), ('column', 0.039), ('intensities', 0.039), ('observers', 0.039), ('alternating', 0.039), ('projects', 0.039), ('divergence', 0.039), ('restricted', 0.039), ('goodness', 0.037), ('masking', 0.037), ('qo', 0.037), ('layer', 0.036), ('facial', 0.035), ('left', 0.035), ('correlations', 0.035), ('states', 0.034), ('detector', 0.034), ('fisher', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
2 0.3904123 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
Author: Guy Mayraz, Geoffrey E. Hinton
Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this
3 0.18023469 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
4 0.12958838 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
Author: Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller
Abstract: A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for
5 0.12733179 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
Author: Rafal Bogacz, Malcolm W. Brown, Christophe G. Giraud-Carrier
Abstract: Olshausen & Field demonstrated that a learning algorithm that attempts to generate a sparse code for natural scenes develops a complete family of localised, oriented, bandpass receptive fields, similar to those of 'simple cells' in VI. This paper describes an algorithm which finds a sparse code for sequences of images that preserves information about the input. This algorithm when trained on natural video sequences develops bases representing the movement in particular directions with particular speeds, similar to the receptive fields of the movement-sensitive cells observed in cortical visual areas. Furthermore, in contrast to previous approaches to learning direction selectivity, the timing of neuronal activity encodes the phase of the movement, so the precise timing of spikes is crucially important to the information encoding.
6 0.11239491 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
7 0.088298485 117 nips-2000-Shape Context: A New Descriptor for Shape Matching and Object Recognition
8 0.08708398 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
9 0.084671088 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
11 0.079208985 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
12 0.079077736 102 nips-2000-Position Variance, Recurrence and Perceptual Learning
13 0.078890413 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
14 0.076961145 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
15 0.076074488 13 nips-2000-A Tighter Bound for Graphical Models
16 0.076064594 96 nips-2000-One Microphone Source Separation
17 0.07601691 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation
18 0.073909797 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
19 0.071529284 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models
20 0.069051191 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
topicId topicWeight
[(0, 0.267), (1, -0.117), (2, 0.136), (3, 0.033), (4, 0.056), (5, 0.03), (6, 0.185), (7, -0.079), (8, 0.194), (9, 0.062), (10, 0.069), (11, -0.481), (12, 0.227), (13, -0.067), (14, 0.035), (15, 0.09), (16, -0.011), (17, 0.006), (18, -0.131), (19, -0.185), (20, 0.012), (21, 0.019), (22, -0.037), (23, 0.088), (24, 0.011), (25, 0.075), (26, -0.041), (27, -0.01), (28, -0.015), (29, 0.016), (30, -0.021), (31, -0.039), (32, -0.018), (33, 0.004), (34, 0.099), (35, -0.045), (36, -0.009), (37, -0.057), (38, -0.039), (39, 0.037), (40, 0.036), (41, 0.009), (42, 0.057), (43, 0.038), (44, 0.033), (45, 0.068), (46, -0.026), (47, -0.014), (48, 0.002), (49, 0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.96889031 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
2 0.88358897 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
Author: Guy Mayraz, Geoffrey E. Hinton
Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this
3 0.47961259 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
4 0.45029268 142 nips-2000-Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task
Author: Brian Sallans, Geoffrey E. Hinton
Abstract: The problem of reinforcement learning in large factored Markov decision processes is explored. The Q-value of a state-action pair is approximated by the free energy of a product of experts network. Network parameters are learned on-line using a modified SARSA algorithm which minimizes the inconsistency of the Q-values of consecutive state-action pairs. Actions are chosen based on the current value estimates by fixing the current state and sampling actions from the network using Gibbs sampling. The algorithm is tested on a co-operative multi-agent task. The product of experts model is found to perform comparably to table-based Q-Iearning for small instances of the task, and continues to perform well when the problem becomes too large for a table-based representation.
5 0.43269467 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach
Author: Gal Elidan, Noam Lotner, Nir Friedman, Daphne Koller
Abstract: A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. As such, they induce seemingly complex dependencies among the latter. In recent years, much attention has been devoted to the development of algorithms for learning parameters, and in some cases structure, in the presence of hidden variables. In this paper, we address the related problem of detecting hidden variables that interact with the observed variables. This problem is of interest both for improving our understanding of the domain and as a preliminary step that guides the learning procedure towards promising models. A very natural approach is to search for
6 0.40987983 97 nips-2000-Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
8 0.39090511 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
10 0.33337739 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
11 0.32808325 127 nips-2000-Structure Learning in Human Causal Induction
12 0.30138758 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
13 0.28713673 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation
14 0.27797145 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
16 0.27044624 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
17 0.26867324 117 nips-2000-Shape Context: A New Descriptor for Shape Matching and Object Recognition
18 0.26412404 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
19 0.25950685 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
20 0.25851306 96 nips-2000-One Microphone Source Separation
topicId topicWeight
[(10, 0.024), (17, 0.175), (32, 0.025), (33, 0.05), (36, 0.024), (38, 0.014), (46, 0.222), (52, 0.072), (55, 0.041), (62, 0.033), (65, 0.026), (67, 0.057), (75, 0.019), (76, 0.031), (79, 0.01), (81, 0.036), (90, 0.034), (91, 0.022), (97, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.8390817 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
Author: Yee Whye Teh, Geoffrey E. Hinton
Abstract: We describe a neurally-inspired, unsupervised learning algorithm that builds a non-linear generative model for pairs of face images from the same individual. Individuals are then recognized by finding the highest relative probability pair among all pairs that consist of a test image and an image whose identity is known. Our method compares favorably with other methods in the literature. The generative model consists of a single layer of rate-coded, non-linear feature detectors and it has the property that, given a data vector, the true posterior probability distribution over the feature detector activities can be inferred rapidly without iteration or approximation. The weights of the feature detectors are learned by comparing the correlations of pixel intensities and feature activations in two phases: When the network is observing real data and when it is observing reconstructions of real data generated from the feature activations.
2 0.64358497 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts
Author: Guy Mayraz, Geoffrey E. Hinton
Abstract: The product of experts learning procedure [1] can discover a set of stochastic binary features that constitute a non-linear generative model of handwritten images of digits. The quality of generative models learned in this way can be assessed by learning a separate model for each class of digit and then comparing the unnormalized probabilities of test images under the 10 different class-specific models. To improve discriminative performance, it is helpful to learn a hierarchy of separate models for each digit class. Each model in the hierarchy has one layer of hidden units and the nth level model is trained on data that consists of the activities of the hidden units in the already trained (n - l)th level model. After training, each level produces a separate, unnormalized log probabilty score. With a three-level hierarchy for each of the 10 digit classes, a test image produces 30 scores which can be used as inputs to a supervised, logistic classification network that is trained on separate data. On the MNIST database, our system is comparable with current state-of-the-art discriminative methods, demonstrating that the product of experts learning procedure can produce effective generative models of high-dimensional data. 1 Learning products of stochastic binary experts Hinton [1] describes a learning algorithm for probabilistic generative models that are composed of a number of experts. Each expert specifies a probability distribution over the visible variables and the experts are combined by multiplying these distributions together and renormalizing. (1) where d is a data vector in a discrete space, Om is all the parameters of individual model m, Pm(dIOm) is the probability of d under model m, and c is an index over all possible vectors in the data space. A Restricted Boltzmann machine [2, 3] is a special case of a product of experts in which each expert is a single, binary stochastic hidden unit that has symmetrical connections to a set of visible units, and connections between the hidden units are forbidden. Inference in an RBM is much easier than in a general Boltzmann machine and it is also much easier than in a causal belief net because there is no explaining away. There is therefore no need to perform any iteration to determine the activities of the hidden units. The hidden states, Sj , are conditionally independent given the visible states, Si, and the distribution of Sj is given by the standard logistic function : 1 p(Sj = 1) = (2) 1 + exp( - Li WijSi) Conversely, the hidden states of an RBM are marginally dependent so it is easy for an RBM to learn population codes in which units may be highly correlated. It is hard to do this in causal belief nets with one hidden layer because the generative model of a causal belief net assumes marginal independence. An RBM can be trained using the standard Boltzmann machine learning algorithm which follows a noisy but unbiased estimate of the gradient of the log likelihood of the data. One way to implement this algorithm is to start the network with a data vector on the visible units and then to alternate between updating all of the hidden units in parallel and updating all of the visible units in parallel. Each update picks a binary state for a unit from its posterior distribution given the current states of all the units in the other set. 
If this alternating Gibbs sampling is run to equilibrium, there is a very simple way to update the weights so as to minimize the Kullback-Leibler divergence, QOIIQoo, between the data distribution, QO, and the equilibrium distribution of fantasies over the visible units, Qoo, produced by the RBM [4]: flWij oc QO - Q~ (3) where < SiSj >Qo is the expected value of SiSj when data is clamped on the visible units and the hidden states are sampled from their conditional distribution given the data, and Q ~ is the expected value of SiSj after prolonged Gibbs sampling. This learning rule does not work well because it can take a long time to approach thermal equilibrium and the sampling noise in the estimate of Q ~ can swamp the gradient. [1] shows that it is far more effective to minimize the difference between QOllQoo and Q111Qoo where Q1 is the distribution of the one-step reconstructions of the data that are produced by first picking binary hidden states from their conditional distribution given the data and then picking binary visible states from their conditional distribution given the hidden states. The exact gradient of this
3 0.63364726 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
4 0.62451118 130 nips-2000-Text Classification using String Kernels
Author: Huma Lodhi, John Shawe-Taylor, Nello Cristianini, Christopher J. C. H. Watkins
Abstract: We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel [6] is made showing encouraging results. 1
5 0.62171578 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
6 0.62066346 79 nips-2000-Learning Segmentation by Random Walks
7 0.61728311 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
8 0.61709011 4 nips-2000-A Linear Programming Approach to Novelty Detection
9 0.61481774 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing
10 0.61407089 122 nips-2000-Sparse Representation for Gaussian Process Models
11 0.60994011 133 nips-2000-The Kernel Gibbs Sampler
12 0.60978723 37 nips-2000-Convergence of Large Margin Separable Linear Classification
13 0.60848689 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
14 0.60685444 60 nips-2000-Gaussianization
15 0.60071635 36 nips-2000-Constrained Independent Component Analysis
16 0.60028982 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
17 0.59963411 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
18 0.59906906 111 nips-2000-Regularized Winnow Methods
19 0.59850007 10 nips-2000-A Productive, Systematic Framework for the Representation of Visual Structure
20 0.59746158 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling