nips nips2005 nips2005-97 knowledge-graph by maker-knowledge-mining

97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Source: pdf

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classiﬁcation, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overﬁtting when training a feed-forward classiﬁer. 1 Overview The idea that patterns can be recognized by ﬁguring out how they were generated has been around for at least half a century [1, 2] and one of the ﬁrst proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in ﬁgure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are speciﬁed by a motor program. If the sequence of stiffnesses is speciﬁed correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see ﬁgure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to ﬁnd 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classiﬁer. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difﬁcult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our ﬁnal system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efﬁcient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has signiﬁcant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position deﬁned by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-speciﬁc recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three ﬁne-grain trajectory points between every pair of coarse-grain points. These ﬁne-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To ﬂesh-out the skeleton, we use two “ink parameters”, a a a a a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ 12 , a , 12 ; a , 1 − a, a ; 12 , a , 12 ] which 6 6 6 6 is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1 . 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement afﬁnelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. Some random noise is then added to this motor program and it is used to generate a training image. So initially, all of the generated training images are very similar to the one produced by the prototype. The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classiﬁcation performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classiﬁcation accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in ﬁgure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classiﬁcation accuracy signiﬁcantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by ﬁrst drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory. the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. However, some of the ﬁner details of the reconstruction may be slightly wrong and need to be ﬁxed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We ﬁrst approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. Figure 4: To model 4’s we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net ﬁnds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5’s we leave the top until last and turn off the ink for timesteps 13 and 14. Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image. Initial squared pixel error = 33.8 10 iterations, error = 15.2 20 iterations, error = 10.5 30 iterations, error = 9.3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. After 30 iterations, the squared pixel error is less than a third of its initial value. Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the ﬁts of the wrong models, so it eliminates about 20% of the classiﬁcation errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. Instead, all twenty scores are used as inputs to the classiﬁer, which decides how to combine both types of scores to achieve high classiﬁcation accuracy. PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their Squared error = 24.9, Shape prior score = 31.5 Squared error = 15.0, Shape prior score = 104.2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classiﬁer to correctly label the image. own class, they often do reconstruct images from other classes poorly, but occasionally they ﬁt an image from another class well. For example, ﬁgure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. This problem becomes even more pronounced with local search which sometimes contorts the wrong model to ﬁt the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 “prior” scores are then given to the classiﬁer along with the 20 “likelihood” scores described above. The prior scores eliminate many classiﬁcation mistakes such as the one in ﬁgure 7. 5 Classiﬁcation results To classify a test image, we apply multinomial logistic regression to the 30 scores – i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classiﬁcation accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes. Enhancements None 1 1, 2 1, 2, 3 1, 2, 3, 4 Validation set % error 4.43 3.84 3.01 2.67 2.28 Test set % error 1.82 Table 1: The gain in classiﬁcation accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the ofﬁcial test set was only measured for the ﬁnal system. 6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory deﬁned by the inferred motor program. Each PCA model was ﬁtted to the trajectories extracted from the training images for a given class. This gave 1.80% errors which is as good as the 1.82% we got using the 10 separate recognition networks and local search. This is quite surprising because the motor programs produced by the single network were simpliﬁed to make them all have the same dimensionality and they produced signiﬁcantly poorer reconstructions. By only using the 10 digit-speciﬁc recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion ﬁelds to create extra training data. They argue that these ﬁelds simulate “uncontrolled oscillations of the hand muscles dampened by inertia”. Motor noise may be better modelled by adding noise to an actual motor program as shown in ﬁgure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References [1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951. [2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8 (2):155–159, 1962. [3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8 (2):160–166, 1962. [4] J.M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981. [5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001. [6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002. [7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003. [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [9] A. Jacobs R. Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report: UM-CS-1987-117. University of Massachusetts, Amherst, MA, 1987. [10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwith of the jsru parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. [sent-3, score-1.173]

2 We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. [sent-4, score-0.988]

3 The inferred motor programs can be used directly for digit classiﬁcation, but they can also be used in other ways. [sent-5, score-1.127]

4 By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. [sent-6, score-1.196]

5 We can also use the motor programs as additional, highly informative outputs which reduce overﬁtting when training a feed-forward classiﬁer. [sent-7, score-0.962]

6 1 Overview The idea that patterns can be recognized by ﬁguring out how they were generated has been around for at least half a century [1, 2] and one of the ﬁrst proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. [sent-8, score-0.558]

7 The handwritten 2’s in ﬁgure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. [sent-10, score-0.641]

8 Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. [sent-11, score-0.267]

9 For example, the only source of information about how the MNIST digits were drawn is the images themselves. [sent-12, score-0.272]

10 We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are speciﬁed by a motor program. [sent-13, score-1.162]

11 If the sequence of stiffnesses is speciﬁed correctly, the model can produce images which look very like the MNIST digits. [sent-14, score-0.318]

12 Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. [sent-15, score-1.439]

13 An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. [sent-16, score-1.002]

14 Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see ﬁgure 2). [sent-17, score-1.835]

15 Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. [sent-18, score-1.054]

16 Fairly good digit recognition can be achieved by using the 10 recognition networks to ﬁnd 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. [sent-20, score-2.514]

17 Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. [sent-22, score-1.401]

18 Each recognition network is solving a difﬁcult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. [sent-23, score-1.02]

19 More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. [sent-24, score-1.06]

20 This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. [sent-25, score-0.325]

21 We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. [sent-26, score-0.23]

22 After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. [sent-28, score-0.509]

23 4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. [sent-35, score-0.236]

24 Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. [sent-36, score-0.472]

25 There is, however, a much more efﬁcient way to make use of our ability to extract motor programs. [sent-37, score-0.597]

26 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. [sent-43, score-0.625]

27 The pen starts at the equilibrium position deﬁned by the initial stiffnesses of the four springs. [sent-47, score-0.283]

28 It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. [sent-48, score-0.828]

29 A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. [sent-52, score-0.72]

30 Figure 2: The training data for each class-speciﬁc recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. [sent-57, score-1.697]

31 To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. [sent-58, score-0.792]

32 For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. [sent-61, score-0.267]

33 These ﬁne-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. [sent-63, score-0.489]

34 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. [sent-73, score-1.595]

35 If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. [sent-74, score-1.07]

36 Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. [sent-75, score-1.331]

37 We can even implement afﬁnelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. [sent-77, score-0.302]

38 These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. [sent-78, score-0.291]

39 the recognition network without knowing this distribution in advance. [sent-79, score-0.247]

40 Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. [sent-80, score-1.117]

41 Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. [sent-81, score-0.894]

42 The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. [sent-82, score-0.521]

43 If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. [sent-83, score-1.057]

44 Some random noise is then added to this motor program and it is used to generate a training image. [sent-84, score-0.817]

45 So initially, all of the generated training images are very similar to the one produced by the prototype. [sent-85, score-0.311]

46 The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. [sent-86, score-0.966]

47 Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. [sent-87, score-0.7]

48 This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. [sent-88, score-1.12]

49 As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. [sent-89, score-0.774]

50 Figures 3 and 4 show some examples of how well the recognition nets perform after training. [sent-94, score-0.231]

51 Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). [sent-95, score-0.446]

52 The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. [sent-96, score-0.981]

53 Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in ﬁgure 5). [sent-104, score-0.275]

54 2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classiﬁcation accuracy signiﬁcantly. [sent-106, score-0.382]

55 We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. [sent-107, score-1.888]

56 For example, a dashed seven is generated by ﬁrst drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. [sent-108, score-1.904]

57 Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. [sent-110, score-0.315]

58 (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model. [sent-111, score-0.516]

59 ) The net takes the entire image of a digit as input, and computes the motor program for just the dash. [sent-112, score-1.113]

60 When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. [sent-113, score-0.44]

61 In each case the original image is on the left and the reconstruction is on the right. [sent-114, score-0.229]

62 Superimposed on the original image is the pen trajectory. [sent-115, score-0.254]

63 the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. [sent-116, score-0.496]

64 Figure 5 shows examples of images reconstructed using the extra stroke. [sent-117, score-0.268]

65 When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. [sent-119, score-0.381]

66 However, some of the ﬁner details of the reconstruction may be slightly wrong and need to be ﬁxed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. [sent-120, score-1.054]

67 We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). [sent-122, score-0.618]

68 The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. [sent-123, score-2.212]

69 The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. [sent-124, score-1.287]

70 For images in which the pen does not need to leave the paper, the recognition net ﬁnds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. [sent-127, score-0.716]

71 Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. [sent-129, score-0.302]

72 The pen trajectory for the dash is shown in blue, superimposed on the original image. [sent-130, score-0.387]

73 3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. [sent-135, score-0.224]

74 After 30 iterations, the squared pixel error is less than a third of its initial value. [sent-136, score-0.232]

75 Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. [sent-137, score-1.175]

76 The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. [sent-143, score-0.579]

77 By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. [sent-144, score-0.766]

78 For each digit class, we construct a PCA model of the image residual vectors for the training images. [sent-145, score-0.504]

79 Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. [sent-146, score-1.266]

80 This gives ten scores for the image that measure the quality of its reconstructions by the digit models. [sent-147, score-0.476]

81 We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. [sent-148, score-0.427]

82 Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. [sent-151, score-0.623]

83 Since the models have only been trained on images in their Squared error = 24. [sent-152, score-0.232]

84 2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. [sent-156, score-0.498]

85 The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. [sent-157, score-0.36]

86 own class, they often do reconstruct images from other classes poorly, but occasionally they ﬁt an image from another class well. [sent-159, score-0.362]

87 This problem becomes even more pronounced with local search which sometimes contorts the wrong model to ﬁt the image really well. [sent-161, score-0.261]

88 The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. [sent-162, score-0.439]

89 Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. [sent-163, score-0.488]

90 The net is trained by gradient descent using the scores for the validation set images. [sent-169, score-0.254]

91 82 Table 1: The gain in classiﬁcation accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. [sent-178, score-0.685]

92 Each PCA model was ﬁtted to the trajectories extracted from the training images for a given class. [sent-181, score-0.306]

93 82% we got using the 10 separate recognition networks and local search. [sent-184, score-0.249]

94 This is quite surprising because the motor programs produced by the single network were simpliﬁed to make them all have the same dimensionality and they produced signiﬁcantly poorer reconstructions. [sent-185, score-1.074]

95 By only using the 10 digit-speciﬁc recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. [sent-186, score-1.67]

96 It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. [sent-187, score-0.55]

97 The ability to extract motor programs could also be used to enhance the training set. [sent-188, score-0.994]

98 Motor noise may be better modelled by adding noise to an actual motor program as shown in ﬁgure 1. [sent-191, score-0.765]

99 The techniques we have used for extracting motor programs from digit images may be applicable to speech. [sent-193, score-1.235]

100 Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. [sent-195, score-0.583]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('motor', 0.565), ('programs', 0.312), ('digit', 0.193), ('mnist', 0.184), ('recognition', 0.172), ('program', 0.167), ('images', 0.165), ('trajectory', 0.155), ('sevens', 0.149), ('ink', 0.135), ('pen', 0.13), ('springs', 0.13), ('stiffnesses', 0.128), ('image', 0.124), ('generative', 0.121), ('digits', 0.107), ('pixel', 0.105), ('reconstruction', 0.105), ('scores', 0.1), ('pca', 0.099), ('enhancements', 0.095), ('squared', 0.086), ('training', 0.085), ('residual', 0.077), ('network', 0.075), ('spring', 0.074), ('net', 0.064), ('formant', 0.064), ('validation', 0.064), ('opposing', 0.063), ('dash', 0.063), ('produced', 0.061), ('units', 0.06), ('seven', 0.06), ('nets', 0.059), ('inferred', 0.057), ('dashed', 0.056), ('extra', 0.055), ('backpropagation', 0.054), ('prototype', 0.05), ('networks', 0.049), ('ones', 0.049), ('reconstructed', 0.048), ('handwriting', 0.047), ('classi', 0.046), ('wrong', 0.043), ('backpropagate', 0.043), ('dashes', 0.043), ('dashless', 0.043), ('pixels', 0.042), ('search', 0.041), ('error', 0.041), ('errors', 0.041), ('convolutional', 0.041), ('superimposed', 0.039), ('reconstructing', 0.039), ('graphics', 0.038), ('logistic', 0.037), ('ire', 0.037), ('class', 0.037), ('reconstruct', 0.036), ('handwritten', 0.034), ('stiffness', 0.034), ('strokes', 0.034), ('stroke', 0.034), ('adding', 0.033), ('drawing', 0.033), ('train', 0.032), ('extract', 0.032), ('momentum', 0.032), ('skeleton', 0.032), ('trajectories', 0.031), ('gure', 0.03), ('classifying', 0.03), ('reconstructions', 0.03), ('bilinear', 0.03), ('leave', 0.03), ('holmes', 0.03), ('timesteps', 0.03), ('ten', 0.029), ('test', 0.029), ('hidden', 0.029), ('document', 0.029), ('attached', 0.028), ('softmax', 0.028), ('speech', 0.028), ('local', 0.028), ('creates', 0.027), ('trained', 0.026), ('fairly', 0.026), ('intensities', 0.026), ('score', 0.025), ('equilibrium', 0.025), ('prior', 0.025), ('iterations', 0.025), ('model', 0.025), ('cation', 0.025), ('tried', 0.023), ('hyperplane', 0.023), ('add', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Author: Vinod Nair, Geoffrey E. Hinton

2 0.19520798 136 nips-2005-Noise and the two-thirds power Law

Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss

Abstract: The two-thirds power law, an empirical law stating an inverse non-linear relationship between the tangential hand speed and the curvature of its trajectory during curved motion, is widely acknowledged to be an invariant of upper-limb movement. It has also been shown to exist in eyemotion, locomotion and was even demonstrated in motion perception and prediction. This ubiquity has fostered various attempts to uncover the origins of this empirical relationship. In these it was generally attributed either to smoothness in hand- or joint-space or to the result of mechanisms that damp noise inherent in the motor system to produce the smooth trajectories evident in healthy human motion. We show here that white Gaussian noise also obeys this power-law. Analysis of signal and noise combinations shows that trajectories that were synthetically created not to comply with the power-law are transformed to power-law compliant ones after combination with low levels of noise. Furthermore, there exist colored noise types that drive non-power-law trajectories to power-law compliance and are not affected by smoothing. These results suggest caution when running experiments aimed at verifying the power-law or assuming its underlying existence without proper analysis of the noise. Our results could also suggest that the power-law might be derived not from smoothness or smoothness-inducing mechanisms operating on the noise inherent in our motor system but rather from the correlated noise which is inherent in this motor system. 1

3 0.17432828 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

Abstract: Spiking activity from neurophysiological experiments often exhibits dynamics beyond that driven by external stimulation, presumably reﬂecting the extensive recurrence of neural circuitry. Characterizing these dynamics may reveal important features of neural computation, particularly during internally-driven cognitive operations. For example, the activity of premotor cortex (PMd) neurons during an instructed delay period separating movement-target speciﬁcation and a movementinitiation cue is believed to be involved in motor planning. We show that the dynamics underlying this activity can be captured by a lowdimensional non-linear dynamical systems model, with underlying recurrent structure and stochastic point-process output. We present and validate latent variable methods that simultaneously estimate the system parameters and the trial-by-trial dynamical trajectories. These methods are applied to characterize the dynamics in PMd data recorded from a chronically-implanted 96-electrode array while monkeys perform delayed-reach tasks. 1

4 0.11260203 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul

Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classiﬁcation by semideﬁnite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difﬁculty, we ﬁnd that metrics trained in this way lead to signiﬁcant improvements in kNN classiﬁcation—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modiﬁcation or extension for problems in multiway (as opposed to binary) classiﬁcation. 1

5 0.1079964 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

Author: Le Song, Evian Gordon, Elly Gysels

Abstract: Motor imagery attenuates EEG µ and β rhythms over sensorimotor cortices. These amplitude changes are most successfully captured by the method of Common Spatial Patterns (CSP) and widely used in braincomputer interfaces (BCI). BCI methods based on amplitude information, however, have not incoporated the rich phase dynamics in the EEG rhythm. This study reports on a BCI method based on phase synchrony rate (SR). SR, computed from binarized phase locking value, describes the number of discrete synchronization events within a window. Statistical nonparametric tests show that SRs contain signiﬁcant differences between 2 types of motor imageries. Classiﬁers trained on SRs consistently demonstrate satisfactory results for all 5 subjects. It is further observed that, for 3 subjects, phase is more discriminative than amplitude in the ﬁrst 1.5-2.0 s, which suggests that phase has the potential to boost the information transfer rate in BCIs. 1

6 0.094754651 151 nips-2005-Pattern Recognition from One Example by Chopping

7 0.086080804 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing

8 0.085870788 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning

9 0.083530158 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

10 0.078307696 50 nips-2005-Convex Neural Networks

11 0.075123355 131 nips-2005-Multiple Instance Boosting for Object Detection

12 0.075077362 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

13 0.074478082 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

14 0.074311905 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

15 0.073647387 23 nips-2005-An Application of Markov Random Fields to Range Sensing

16 0.072739623 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

17 0.072366878 138 nips-2005-Non-Local Manifold Parzen Windows

18 0.069571711 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

19 0.067613445 110 nips-2005-Learning Depth from Single Monocular Images

20 0.064011469 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.204), (1, -0.023), (2, -0.009), (3, 0.139), (4, -0.004), (5, 0.063), (6, 0.138), (7, -0.044), (8, 0.147), (9, -0.017), (10, 0.021), (11, -0.013), (12, 0.026), (13, -0.147), (14, 0.258), (15, -0.072), (16, -0.053), (17, 0.023), (18, -0.007), (19, 0.073), (20, 0.108), (21, -0.054), (22, -0.165), (23, 0.158), (24, 0.053), (25, 0.104), (26, -0.088), (27, 0.112), (28, 0.035), (29, -0.085), (30, 0.011), (31, -0.017), (32, -0.063), (33, 0.132), (34, 0.067), (35, -0.068), (36, -0.122), (37, 0.02), (38, 0.045), (39, 0.057), (40, -0.135), (41, 0.039), (42, 0.094), (43, 0.002), (44, -0.021), (45, 0.116), (46, -0.006), (47, -0.052), (48, -0.023), (49, 0.178)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95906347 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Author: Vinod Nair, Geoffrey E. Hinton

2 0.69059169 136 nips-2005-Noise and the two-thirds power Law

Author: Uri Maoz, Elon Portugaly, Tamar Flash, Yair Weiss

3 0.60581207 128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning

Author: Naoki Masuda, Shun-ichi Amari

Abstract: There is a long-standing controversy on the site of the cerebellar motor learning. Different theories and experimental results suggest that either the cerebellar ﬂocculus or the brainstem learns the task and stores the memory. With a dynamical system approach, we clarify the mechanism of transferring the memory generated in the ﬂocculus to the brainstem and that of so-called savings phenomena. The brainstem learning must comply with a sort of Hebbian rule depending on Purkinje-cell activities. In contrast to earlier numerical models, our model is simple but it accommodates explanations and predictions of experimental situations as qualitative features of trajectories in the phase space of synaptic weights, without ﬁne parameter tuning. 1

4 0.4834713 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

Author: Afsheen Afshar, Gopal Santhanam, Stephen I. Ryu, Maneesh Sahani, Byron M. Yu, Krishna V. Shenoy

5 0.45603931 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

Author: Urs Muller, Jan Ben, Eric Cosatto, Beat Flepp, Yann L. Cun

Abstract: We describe a vision-based obstacle avoidance system for off-road mobile robots. The system is trained from end to end to map raw input images to steering angles. It is trained in supervised mode to predict the steering angles provided by a human driver during training runs collected in a wide variety of terrains, weather conditions, lighting conditions, and obstacle types. The robot is a 50cm off-road truck, with two forwardpointing wireless color cameras. A remote computer processes the video and controls the robot via radio. The learning system is a large 6-layer convolutional network whose input is a single left/right pair of unprocessed low-resolution images. The robot exhibits an excellent ability to detect obstacles and navigate around them in real time at speeds of 2 m/s.

6 0.43645129 152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

7 0.40982842 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing

8 0.38886309 151 nips-2005-Pattern Recognition from One Example by Chopping

9 0.38180873 171 nips-2005-Searching for Character Models

10 0.37511197 119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

11 0.37433204 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

12 0.36985219 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

13 0.36838332 131 nips-2005-Multiple Instance Boosting for Object Detection

14 0.33732587 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

15 0.33178622 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

16 0.31334928 195 nips-2005-Transfer learning for text classification

17 0.29034987 108 nips-2005-Layered Dynamic Textures

18 0.27848747 158 nips-2005-Products of ``Edge-perts

19 0.27807572 189 nips-2005-Tensor Subspace Analysis

20 0.27490863 62 nips-2005-Efficient Estimation of OOMs

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.034), (10, 0.025), (27, 0.015), (31, 0.036), (34, 0.062), (39, 0.013), (55, 0.012), (57, 0.012), (60, 0.013), (69, 0.057), (73, 0.021), (88, 0.596), (91, 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99527282 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly speciﬁed, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulﬁlled. In this paper, we ﬁrst show that the existing active learning method can be theoretically justiﬁed under slightly weaker condition: the model does not have to be correctly speciﬁed, but slightly misspeciﬁed models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justiﬁed for a wider class of misspeciﬁed models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspeciﬁcation of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ deﬁned on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are ﬁxed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly speciﬁed [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we ﬁrst show that the existing active learning method can still be theoretically justiﬁed when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justiﬁed for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-deﬁned distribution with density ÔÜ Ü , and discuss the problem of ﬁnding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error deﬁned by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. « ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ Ý¾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly speciﬁed, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly speciﬁed model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed Â ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspeciﬁcation of Models In this section, we investigate the validity of the existing active learning method for misspeciﬁed models. ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · ÆÖ´Üµ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual Ö´Üµ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ Ö´Üµ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ Ö´Ü¾µ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing Â is still justiﬁed if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ Â ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly speciﬁed, «ÇÄË is an unbiased estimator of «£ . However, for misspeciﬁed models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly speciﬁed models, inﬂuence of the covariate shift can be ignored, as the existing active learning method does. However, for misspeciﬁed models, we should explicitly cope with the covariate shift. ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be conﬁrmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾ Â · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justiﬁed when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We ﬁrst illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . Let us consider the following three cases. Æ , where each case corresponds to “correctly speciﬁed”, “approximately correct”, and “misspeciﬁed” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ speciﬁed”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspeciﬁed”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspeciﬁcation of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. Learning target function f(x) 8 δ=0 δ=0.04 δ=0.5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the are described with boldface. signiﬁcance level The value of method (B) for Æ is extremely large but it is not a typo. 4 ± 2 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Input density functions 1.5 ¼ pt(x) Æ ¼ ½ ¦¼ ¼ px(x) 1 0.5 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Figure 1: Learning target function and input density functions. ¼ Æ (A) (B) (C) ¼¼ Æ −3 −3 −3 G−WLS 12 4 3 G−WLS 5 4 ¼ x 10 6 5 ½¼¿. “misspeciﬁed” x 10 G−WLS ¼ ¦¼ ¼ ¿¼¿ ¦ ½ ¦½ ½ ¿ ¾ ¦ ½ ¾¿ ¾ ¾¦¼ ¿ “approximately correct” x 10 6 Æ All values in the table are multiplied by Æ “correctly speciﬁed” ¦¼ ¼ ¾ ¼¦¼ ½¿ ¼¼ Æ ¾ ¼¾ ¦ ¼ ¼ 3 10 8 6 0.8 1.2 1.6 2 0.07 2.4 J−WLS 0.06 0.8 1.2 1.6 2 0.07 2.4 0.8 1.2 1.6 2 0.07 J−WLS 0.06 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 2.4 J−WLS 0.06 0.8 −3 x 10 1.2 1.6 2 2.4 G−OLS 5 0.03 0.8 −3 x 10 1.2 1.6 2 3 1.2 1.6 2 1.6 2.4 2 G−OLS 0.4 4 3 0.8 0.5 G−OLS 5 4 2.4 0.3 0.2 0.1 2 2 0.8 1.2 1.6 2 0.06 2.4 J−OLS 0.8 1.2 1.6 2 0.06 2.4 0.8 1.2 0.06 J−OLS 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 2.4 J−OLS 0.8 1.2 1.6 c 2 2.4 0.03 0.02 0.8 Figure 2: The means and error bars of functions of . 1.2 1.6 c Ï ÄË , 2 Â Ï ÄË 2.4 , 0.8 ÇÄË 1.2 1.6 c , and ÂÇÄË over 2 2.4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When Æ , the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if Æ , OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When Æ and Æ the proposed method (A) gives signiﬁcantly smaller errors than other methods. ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly speciﬁed case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspeciﬁcation of models and thus it is reliable. Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by ¿. Bank-8fm Bank-8fh Bank-8nm Bank-8nh (A) ¼ ¿½ ¦ ¼ ¼ ¾ ½¼ ¦ ¼ ¼ ¾ ¦ ½ ¾¼ ¿ ¦ ½ ½½ (B) ¦ ¦ ¦ ¦ (C) ¦ ¦ ¦ ¦ ½¼ ¼ ¼¼ ¼¿ ¼¼ ¾ ¾½ ¼ ¼ ¾ ¾¼ ¼ ¼ Kin-8fm Kin-8fh ½ ¦¼ ¼ ½ ¦¼ ¼ ½ ¼¦¼ ¼ (A) (B) (C) ¾ ½ ¼ ¿ ½ ½¿ ¾ ¿ ½¿ ¿ ½¿ Kin-8nm ¼¦¼ ½ ¿ ¦ ¼ ½¿ ¾ ¦¼ ¾ Kin-8nh ¿ ¦¼ ¼ ¿ ¼¦ ¼ ¼ ¿ ¦¼ ½ ¼ ¾¦¼ ¼ ¼ ¦¼ ¼ ¼ ½¦¼ ¼ (A)/(C) (B)/(C) (C)/(C) 1.2 1.1 1 0.9 Bank−8fm Bank−8fh Bank−8nm Bank−8nh Kin−8fm Kin−8fh Kin−8nm Kin−8nh Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. Note that the error bars were reasonably small so they were omitted. ½¼¼ Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes samples, consisting of -dimensional input and -dimensional output values. For convenience, every attribute is normalized into . ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. The ½ ½ task is to predict the output values of all unlabeled samples. ½¼¼¼ In this experiment, the test input density independent Gaussian density. Ô ´Üµ and Ø ´¾ ¾ ÅÄ Ô ´Üµ is unknown. Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. Therefore, we ﬁrst create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for ﬁve data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B). Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for ﬁve data sets. Therefore, the proposed method (A) is overall shown to work better than other schemes. 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. When the model used for learning is correctly speciﬁed, the covariate shift does not matter. However, for misspeciﬁed models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on the weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if model is correctly speciﬁed. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspeciﬁcation of models and thus it is reliable. The proposed method can be theoretically justiﬁed if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspeciﬁed models. A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspeciﬁed models. It is also important to notice that when the model is totally misspeciﬁed, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial ﬁnancial support. References [1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artiﬁcial Intelligence Research, 4:129–145, 1996. [2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. [3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000. [4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. [5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. [6] M. Sugiyama. Active learning for misspeciﬁed models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.

same-paper 2 0.98732841 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Author: Vinod Nair, Geoffrey E. Hinton

3 0.98002225 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efﬁcient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classiﬁcation data sets show that the proposed approach is promising.

4 0.97903121 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1

5 0.84145057 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an inﬁnite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian speciﬁcation of the effective ‘gating network’ for the different experts. 1

6 0.80430424 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

7 0.80109715 136 nips-2005-Noise and the two-thirds power Law

8 0.79177582 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

9 0.7911976 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

10 0.78288525 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

11 0.77514732 161 nips-2005-Radial Basis Function Network for Multi-task Learning

12 0.77501136 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

13 0.76914275 35 nips-2005-Bayesian model learning in human visual perception

14 0.7590124 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

15 0.75846583 48 nips-2005-Context as Filtering

16 0.75839555 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

17 0.75485379 45 nips-2005-Conditional Visual Tracking in Kernel Space

18 0.7471993 171 nips-2005-Searching for Character Models

19 0.74108756 195 nips-2005-Transfer learning for text classification

20 0.73842794 30 nips-2005-Assessing Approximations for Gaussian Process Classification