nips nips2005 nips2005-19 knowledge-graph by maker-knowledge-mining

19 nips-2005-Active Learning for Misspecified Models

Source: pdf

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly speciﬁed, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulﬁlled. In this paper, we ﬁrst show that the existing active learning method can be theoretically justiﬁed under slightly weaker condition: the model does not have to be correctly speciﬁed, but slightly misspeciﬁed models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justiﬁed for a wider class of misspeciﬁed models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspeciﬁcation of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ deﬁned on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are ﬁxed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly speciﬁed [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we ﬁrst show that the existing active learning method can still be theoretically justiﬁed when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justiﬁed for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-deﬁned distribution with density ÔÜ Ü , and discuss the problem of ﬁnding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error deﬁned by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. « ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ Ý¾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly speciﬁed, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly speciﬁed model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed Â ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspeciﬁcation of Models In this section, we investigate the validity of the existing active learning method for misspeciﬁed models. ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · ÆÖ´Üµ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual Ö´Üµ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ Ö´Üµ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ Ö´Ü¾µ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing Â is still justiﬁed if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ Â ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly speciﬁed, «ÇÄË is an unbiased estimator of «£ . However, for misspeciﬁed models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly speciﬁed models, inﬂuence of the covariate shift can be ignored, as the existing active learning method does. However, for misspeciﬁed models, we should explicitly cope with the covariate shift. ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be conﬁrmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾ Â · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justiﬁed when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We ﬁrst illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . Let us consider the following three cases. Æ , where each case corresponds to “correctly speciﬁed”, “approximately correct”, and “misspeciﬁed” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ speciﬁed”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspeciﬁed”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspeciﬁcation of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. Learning target function f(x) 8 δ=0 δ=0.04 δ=0.5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the are described with boldface. signiﬁcance level The value of method (B) for Æ is extremely large but it is not a typo. 4 ± 2 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Input density functions 1.5 ¼ pt(x) Æ ¼ ½ ¦¼ ¼ px(x) 1 0.5 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Figure 1: Learning target function and input density functions. ¼ Æ (A) (B) (C) ¼¼ Æ −3 −3 −3 G−WLS 12 4 3 G−WLS 5 4 ¼ x 10 6 5 ½¼¿. “misspeciﬁed” x 10 G−WLS ¼ ¦¼ ¼ ¿¼¿ ¦ ½ ¦½ ½ ¿ ¾ ¦ ½ ¾¿ ¾ ¾¦¼ ¿ “approximately correct” x 10 6 Æ All values in the table are multiplied by Æ “correctly speciﬁed” ¦¼ ¼ ¾ ¼¦¼ ½¿ ¼¼ Æ ¾ ¼¾ ¦ ¼ ¼ 3 10 8 6 0.8 1.2 1.6 2 0.07 2.4 J−WLS 0.06 0.8 1.2 1.6 2 0.07 2.4 0.8 1.2 1.6 2 0.07 J−WLS 0.06 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 2.4 J−WLS 0.06 0.8 −3 x 10 1.2 1.6 2 2.4 G−OLS 5 0.03 0.8 −3 x 10 1.2 1.6 2 3 1.2 1.6 2 1.6 2.4 2 G−OLS 0.4 4 3 0.8 0.5 G−OLS 5 4 2.4 0.3 0.2 0.1 2 2 0.8 1.2 1.6 2 0.06 2.4 J−OLS 0.8 1.2 1.6 2 0.06 2.4 0.8 1.2 0.06 J−OLS 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 2.4 J−OLS 0.8 1.2 1.6 c 2 2.4 0.03 0.02 0.8 Figure 2: The means and error bars of functions of . 1.2 1.6 c Ï ÄË , 2 Â Ï ÄË 2.4 , 0.8 ÇÄË 1.2 1.6 c , and ÂÇÄË over 2 2.4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When Æ , the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if Æ , OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When Æ and Æ the proposed method (A) gives signiﬁcantly smaller errors than other methods. ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly speciﬁed case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspeciﬁcation of models and thus it is reliable. Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by ¿. Bank-8fm Bank-8fh Bank-8nm Bank-8nh (A) ¼ ¿½ ¦ ¼ ¼ ¾ ½¼ ¦ ¼ ¼ ¾ ¦ ½ ¾¼ ¿ ¦ ½ ½½ (B) ¦ ¦ ¦ ¦ (C) ¦ ¦ ¦ ¦ ½¼ ¼ ¼¼ ¼¿ ¼¼ ¾ ¾½ ¼ ¼ ¾ ¾¼ ¼ ¼ Kin-8fm Kin-8fh ½ ¦¼ ¼ ½ ¦¼ ¼ ½ ¼¦¼ ¼ (A) (B) (C) ¾ ½ ¼ ¿ ½ ½¿ ¾ ¿ ½¿ ¿ ½¿ Kin-8nm ¼¦¼ ½ ¿ ¦ ¼ ½¿ ¾ ¦¼ ¾ Kin-8nh ¿ ¦¼ ¼ ¿ ¼¦ ¼ ¼ ¿ ¦¼ ½ ¼ ¾¦¼ ¼ ¼ ¦¼ ¼ ¼ ½¦¼ ¼ (A)/(C) (B)/(C) (C)/(C) 1.2 1.1 1 0.9 Bank−8fm Bank−8fh Bank−8nm Bank−8nh Kin−8fm Kin−8fh Kin−8nm Kin−8nh Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. Note that the error bars were reasonably small so they were omitted. ½¼¼ Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes samples, consisting of -dimensional input and -dimensional output values. For convenience, every attribute is normalized into . ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. The ½ ½ task is to predict the output values of all unlabeled samples. ½¼¼¼ In this experiment, the test input density independent Gaussian density. Ô ´Üµ and Ø ´¾ ¾ ÅÄ Ô ´Üµ is unknown. Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. Therefore, we ﬁrst create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for ﬁve data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B). Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for ﬁve data sets. Therefore, the proposed method (A) is overall shown to work better than other schemes. 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. When the model used for learning is correctly speciﬁed, the covariate shift does not matter. However, for misspeciﬁed models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on the weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if model is correctly speciﬁed. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspeciﬁcation of models and thus it is reliable. The proposed method can be theoretically justiﬁed if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspeciﬁed models. A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspeciﬁed models. It is also important to notice that when the model is totally misspeciﬁed, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial ﬁnancial support. References [1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artiﬁcial Intelligence Research, 4:129–145, 1996. [2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. [3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000. [4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. [5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. [6] M. Sugiyama. Active learning for misspeciﬁed models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 jp Abstract Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. [sent-4, score-0.623]

2 Existing active learning methods often assume that the model used for learning is correctly speciﬁed, i. [sent-5, score-0.566]

3 , the learning target function can be expressed by the model at hand. [sent-7, score-0.24]

4 In this paper, we ﬁrst show that the existing active learning method can be theoretically justiﬁed under slightly weaker condition: the model does not have to be correctly speciﬁed, but slightly misspeciﬁed models are also allowed. [sent-9, score-0.975]

5 To cope with this problem, we propose an alternative active learning method which can be theoretically justiﬁed for a wider class of misspeciﬁed models. [sent-11, score-0.656]

6 Thus, the proposed method has a broader range of applications than the existing method. [sent-12, score-0.362]

7 Numerical studies show that the proposed active learning method is robust against the misspeciﬁcation of models and is thus reliable. [sent-13, score-0.572]

8 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ deﬁned on ½ where ¯ Ò ½ are i. [sent-14, score-0.235]

9 ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i. [sent-20, score-0.382]

10 When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. [sent-23, score-0.666]

11 In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i. [sent-25, score-0.418]

12 , Ü Ò ½ independently follow the distribution with density ÔØ Ü . [sent-27, score-0.206]

13 On the other hand, in some cases, the training input points can be designed by users. [sent-28, score-0.344]

14 In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e. [sent-29, score-0.427]

15 , by densely locating training input points in the regions of high uncertainty. [sent-31, score-0.394]

16 ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. [sent-32, score-0.502]

17 In active learning research, it is often assumed that the regression model is correctly speciﬁed [2, 1, 3], i. [sent-33, score-0.522]

18 , the learning target function Ü can be expressed by the model. [sent-35, score-0.216]

19 ´ µ In this paper, we ﬁrst show that the existing active learning method can still be theoretically justiﬁed when the model is approximately correct in a strong sense. [sent-37, score-0.897]

20 Then we propose an alternative active learning method which can also be theoretically justiﬁed for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. [sent-38, score-1.022]

21 Thus, the proposed method has a wider range of applications. [sent-39, score-0.212]

22 In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-deﬁned distribution with density ÔÜ Ü , and discuss the problem of ﬁnding the optimal density function. [sent-40, score-0.759]

23 ´µ 2 Existing Active Learning Method The generalization error deﬁned by Eq. [sent-41, score-0.113]

24 (1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i. [sent-42, score-0.238]

25 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. [sent-47, score-0.424]

26 In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. [sent-48, score-0.273]

27 Proposition 1 ([2, 1, 3]) Suppose that the model is correctly speciﬁed, i. [sent-50, score-0.138]

28 ´ µ 3 Analysis of Existing Method under Misspeciﬁcation of Models In this section, we investigate the validity of the existing active learning method for misspeciﬁed models. [sent-53, score-0.628]

29 ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i. [sent-54, score-0.269]

30 Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ Ö´Ü¾µ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. [sent-58, score-0.174]

31 The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing Â is still justiﬁed if Æ ½ ¾ µ. [sent-61, score-0.673]

32 However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ Â ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. [sent-62, score-0.33]

33 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. [sent-63, score-0.565]

34 1 Weighted Least-Squares Learning When the model is correctly speciﬁed, «ÇÄË is an unbiased estimator of «£ . [sent-65, score-0.177]

35 ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . [sent-67, score-1.109]

36 For correctly speciﬁed models, inﬂuence of the covariate shift can be ignored, as the existing active learning method does. [sent-68, score-1.007]

37 However, for misspeciﬁed models, we should explicitly cope with the covariate shift. [sent-69, score-0.248]

38 ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. [sent-70, score-0.317]

39 Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justiﬁed when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . [sent-76, score-0.44]

40 Therefore, the proposed method has a wider range of applications. [sent-77, score-0.212]

41 ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. [sent-79, score-0.536]

42 We ﬁrst illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. [sent-81, score-0.373]

43 Gaussian noise with mean zero and standard deviation and ¯ . [sent-85, score-0.126]

44 Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. [sent-86, score-0.295]

45 Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. [sent-91, score-0.541]

46 Then WLS learning is used for estimating the parameters. [sent-92, score-0.108]

47 (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. [sent-93, score-0.913]

48 (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. [sent-95, score-0.683]

49 The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. [sent-98, score-0.155]

50 When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . [sent-100, score-0.16]

51 This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. [sent-101, score-0.14]

52 ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspeciﬁcation of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. [sent-105, score-0.128]

53 5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. [sent-108, score-0.237]

54 The best method and comparable ones by the t-test at the are described with boldface. [sent-109, score-0.087]

55 signiﬁcance level The value of method (B) for Æ is extremely large but it is not a typo. [sent-110, score-0.087]

56 5 2 Figure 1: Learning target function and input density functions. [sent-121, score-0.365]

57 8 Figure 2: The means and error bars of functions of . [sent-199, score-0.094]

58 4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. [sent-207, score-0.357]

59 When Æ , the existing method (B) works better than the proposed method (A). [sent-208, score-0.516]

60 Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . [sent-209, score-0.333]

61 Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). [sent-212, score-0.792]

62 When Æ and Æ the proposed method (A) gives signiﬁcantly smaller errors than other methods. [sent-213, score-0.166]

63 ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). [sent-214, score-0.521]

64 On the other hand, the existing method (B) works excellently in the correctly speciﬁed case, although it tends to perform poorly once the correctness of the model is violated. [sent-215, score-0.603]

65 Therefore, the proposed method (A) is found to be robust against the misspeciﬁcation of models and thus it is reliable. [sent-216, score-0.227]

66 Table 2: The means and standard deviations of the test error for DELVE data sets. [sent-217, score-0.21]

67 For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. [sent-223, score-0.126]

68 Note that the error bars were reasonably small so they were omitted. [sent-224, score-0.141]

69 Each data set includes samples, consisting of -dimensional input and -dimensional output values. [sent-226, score-0.192]

70 ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i. [sent-228, score-0.231]

71 From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. [sent-232, score-0.542]

72 The ½ ½ task is to predict the output values of all unlabeled samples. [sent-233, score-0.126]

73 ½¼¼¼ In this experiment, the test input density independent Gaussian density. [sent-234, score-0.348]

74 Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. [sent-236, score-0.223]

75 Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. [sent-237, score-0.3]

76 ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. [sent-238, score-1.105]

77 Therefore, we ﬁrst create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. [sent-239, score-1.186]

78 For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. [sent-240, score-0.16]

79 ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. [sent-241, score-0.241]

80 The proposed method (A) outperforms the existing method (B) for ﬁve data sets, while it is outperformed by (B) for the other three data sets. [sent-242, score-0.571]

81 We conjecture that the model used for learning is almost correct in these three data sets. [sent-243, score-0.176]

82 This result implies that the proposed method (A) is slightly better than the existing method (B). [sent-244, score-0.474]

83 Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). [sent-245, score-0.635]

84 This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for ﬁve data sets. [sent-246, score-0.155]

85 Therefore, the proposed method (A) is overall shown to work better than other schemes. [sent-247, score-0.19]

86 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. [sent-248, score-1.136]

87 When the model used for learning is correctly speciﬁed, the covariate shift does not matter. [sent-249, score-0.486]

88 However, for misspeciﬁed models, we have to explicitly cope with the covariate shift. [sent-250, score-0.248]

89 In this paper, we proposed a new active learning method based on the weighted least-squares learning. [sent-251, score-0.554]

90 The numerical study showed that the existing method works better than the proposed method if model is correctly speciﬁed. [sent-252, score-0.682]

91 However, the existing method tends to perform poorly once the correctness of the model is violated. [sent-253, score-0.422]

92 On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. [sent-254, score-0.51]

93 Therefore, the proposed method would be robust against the misspeciﬁcation of models and thus it is reliable. [sent-255, score-0.227]

94 The proposed method can be theoretically justiﬁed if the model is approximately correct in a weak sense. [sent-256, score-0.435]

95 A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspeciﬁed models. [sent-258, score-0.515]

96 It is also important to notice that when the model is totally misspeciﬁed, even learning with optimal training input points would not be successful anyway. [sent-259, score-0.51]

97 In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. [sent-261, score-0.758]

98 That is, the model should have been chosen before performing active learning. [sent-262, score-0.286]

99 Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. [sent-263, score-0.503]

100 Improving predictive inference under covariate shift by weighting the loglikelihood function. [sent-297, score-0.265]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('misspeci', 0.486), ('ols', 0.364), ('wls', 0.355), ('active', 0.262), ('existing', 0.196), ('covariate', 0.191), ('density', 0.169), ('input', 0.139), ('correctly', 0.114), ('training', 0.113), ('kin', 0.112), ('bank', 0.102), ('delve', 0.097), ('unlabeled', 0.097), ('theoretically', 0.095), ('points', 0.092), ('justi', 0.091), ('outperformed', 0.09), ('method', 0.087), ('learning', 0.083), ('approximately', 0.081), ('proposed', 0.079), ('passive', 0.077), ('expressed', 0.076), ('shift', 0.074), ('pool', 0.072), ('reasonably', 0.07), ('correct', 0.069), ('deviation', 0.067), ('works', 0.067), ('generalization', 0.067), ('deviations', 0.067), ('ed', 0.065), ('leastsquares', 0.064), ('correctness', 0.059), ('totally', 0.059), ('tokyo', 0.059), ('target', 0.057), ('cope', 0.057), ('temporary', 0.051), ('bias', 0.047), ('error', 0.046), ('wider', 0.046), ('lemma', 0.045), ('location', 0.045), ('weighted', 0.043), ('test', 0.04), ('suppose', 0.04), ('regression', 0.039), ('template', 0.039), ('unbiased', 0.039), ('independently', 0.037), ('weaker', 0.037), ('criterion', 0.035), ('multiplied', 0.034), ('table', 0.034), ('robust', 0.034), ('standard', 0.034), ('eight', 0.033), ('asymptotically', 0.033), ('tends', 0.033), ('samples', 0.033), ('outperforms', 0.032), ('biased', 0.032), ('runs', 0.031), ('ordinary', 0.03), ('toy', 0.029), ('simulation', 0.029), ('determined', 0.029), ('output', 0.029), ('speci', 0.028), ('devising', 0.028), ('negatively', 0.028), ('sugiyama', 0.028), ('numerical', 0.028), ('caused', 0.028), ('models', 0.027), ('proposition', 0.026), ('scheme', 0.026), ('asymptotic', 0.026), ('propose', 0.026), ('camp', 0.026), ('weakened', 0.026), ('locating', 0.026), ('bars', 0.025), ('slightly', 0.025), ('estimating', 0.025), ('evaluate', 0.025), ('mean', 0.025), ('accurate', 0.025), ('model', 0.024), ('includes', 0.024), ('overall', 0.024), ('densely', 0.024), ('multilayer', 0.024), ('scientists', 0.024), ('therefore', 0.024), ('means', 0.023), ('poorly', 0.023), ('let', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

2 0.15676209 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning signiﬁcantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra ﬂexibility leads to signiﬁcantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra ﬂexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including ﬁeld estimation using wireless sensor networks and fault line detection. 1

3 0.14639543 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the speciﬁc target hypothesis, and the desired accuracy.

4 0.11059596 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

Author: Jo-anne Ting, Aaron D'souza, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato

Abstract: An increasing number of projects in neuroscience requires the statistical analysis of high dimensional data sets, as, for instance, in predicting behavior from neural ﬁring or in operating artiﬁcial devices from brain recordings in brain-machine interfaces. Linear analysis techniques remain prevalent in such cases, but classical linear regression approaches are often numerically too fragile in high dimensions. In this paper, we address the question of whether EMG data collected from arm movements of monkeys can be faithfully reconstructed with linear approaches from neural activity in primary motor cortex (M1). To achieve robust data analysis, we develop a full Bayesian approach to linear regression that automatically detects and excludes irrelevant features in the data, regularizing against overﬁtting. In comparison with ordinary least squares, stepwise regression, partial least squares, LASSO regression and a brute force combinatorial search for the most predictive input features in the data, we demonstrate that the new Bayesian method oﬀers a superior mixture of characteristics in terms of regularization against overﬁtting, computational eﬃciency and ease of use, demonstrating its potential as a drop-in replacement for other linear regression techniques. As neuroscientiﬁc results, our analyses demonstrate that EMG data can be well predicted from M1 neurons, further opening the path for possible real-time interfaces between brains and machines. 1

5 0.067289963 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efﬁcient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassiﬁcation rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α conﬁdence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

6 0.066877626 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

7 0.063916087 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

8 0.062934801 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

9 0.062605523 47 nips-2005-Consistency of one-class SVM and related algorithms

10 0.062457718 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

11 0.061077908 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

12 0.057636965 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

13 0.055194467 160 nips-2005-Query by Committee Made Real

14 0.054597862 106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression

15 0.0513741 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

16 0.050966531 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

17 0.050565928 138 nips-2005-Non-Local Manifold Parzen Windows

18 0.05012463 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

19 0.049309343 50 nips-2005-Convex Neural Networks

20 0.045529667 117 nips-2005-Learning from Data of Variable Quality

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, 0.019), (2, -0.035), (3, -0.015), (4, 0.062), (5, 0.076), (6, 0.008), (7, -0.019), (8, 0.018), (9, 0.118), (10, -0.043), (11, 0.126), (12, 0.203), (13, 0.031), (14, -0.046), (15, -0.071), (16, -0.069), (17, 0.049), (18, -0.049), (19, -0.049), (20, 0.114), (21, 0.022), (22, -0.025), (23, -0.027), (24, -0.152), (25, -0.076), (26, 0.087), (27, -0.086), (28, 0.027), (29, -0.152), (30, 0.068), (31, 0.16), (32, 0.021), (33, 0.03), (34, 0.072), (35, 0.031), (36, 0.027), (37, 0.022), (38, 0.074), (39, 0.021), (40, 0.006), (41, -0.014), (42, 0.001), (43, -0.04), (44, 0.076), (45, -0.138), (46, 0.045), (47, -0.123), (48, -0.136), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94485003 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

2 0.80924541 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

3 0.68850261 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

4 0.67080724 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

5 0.4605194 155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

Author: Jo-anne Ting, Aaron D'souza, Kenji Yamamoto, Toshinori Yoshioka, Donna Hoffman, Shinji Kakei, Lauren Sergio, John Kalaska, Mitsuo Kawato

6 0.45677528 160 nips-2005-Query by Committee Made Real

7 0.45170161 17 nips-2005-Active Bidirectional Coupling in a Cochlear Chip

8 0.41858646 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

9 0.40930176 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

10 0.40682098 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

11 0.38880447 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

12 0.38874328 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

13 0.3776516 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

14 0.37233016 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

15 0.3632341 112 nips-2005-Learning Minimum Volume Sets

16 0.340161 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

17 0.31746116 114 nips-2005-Learning Rankings via Convex Hull Separation

18 0.311609 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

19 0.30692616 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition

20 0.30492315 172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.053), (10, 0.018), (27, 0.017), (31, 0.017), (34, 0.035), (55, 0.019), (69, 0.019), (73, 0.025), (88, 0.683), (91, 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99658942 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

2 0.97469103 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We describe a generative model for handwritten digits that uses two pairs of opposing springs whose stiffnesses are controlled by a motor program. We show how neural networks can be trained to infer the motor programs required to accurately reconstruct the MNIST digits. The inferred motor programs can be used directly for digit classiﬁcation, but they can also be used in other ways. By adding noise to the motor program inferred from an MNIST image we can generate a large set of very different images of the same class, thus enlarging the training set available to other methods. We can also use the motor programs as additional, highly informative outputs which reduce overﬁtting when training a feed-forward classiﬁer. 1 Overview The idea that patterns can be recognized by ﬁguring out how they were generated has been around for at least half a century [1, 2] and one of the ﬁrst proposed applications was the recognition of handwriting using a generative model that involved pairs of opposing springs [3, 4]. The “analysis-by-synthesis” approach is attractive because the true generative model should provide the most natural way to characterize a class of patterns. The handwritten 2’s in ﬁgure 1, for example, are very variable when viewed as pixels but they have very similar motor programs. Despite its obvious merits, analysis-by-synthesis has had few successes, partly because it is computationally expensive to invert non-linear generative models and partly because the underlying parameters of the generative model are unknown for most large data sets. For example, the only source of information about how the MNIST digits were drawn is the images themselves. We describe a simple generative model in which a pen is controlled by two pairs of opposing springs whose stiffnesses are speciﬁed by a motor program. If the sequence of stiffnesses is speciﬁed correctly, the model can produce images which look very like the MNIST digits. Using a separate network for each digit class, we show that backpropagation can be used to learn a “recognition” network that maps images to the motor programs required to produce them. An interesting aspect of this learning is that the network creates its own training data, so it does not require the training images to be labelled with motor programs. Each recognition network starts with a single example of a motor program and grows an “island of competence” around this example, progressively extending the region over which it can map small changes in the image to the corresponding small changes in the motor program (see ﬁgure 2). Figure 1: An MNIST image of a 2 and the additional images that can be generated by inferring the motor program and then adding random noise to it. The pixels are very different, but they are all clearly twos. Fairly good digit recognition can be achieved by using the 10 recognition networks to ﬁnd 10 motor programs for a test image and then scoring each motor program by its squared error in reconstructing the image. The 10 scores are then fed into a softmax classiﬁer. Recognition can be improved by using PCA to model the distribution of motor trajectories for each class and using the distance of a motor trajectory from the relevant PCA hyperplane as an additional score. Each recognition network is solving a difﬁcult global search problem in which the correct motor program must be found by a single, “open-loop” pass through the network. More accurate recognition can be achieved by using this open-loop global search to initialize an iterative, closed-loop local search which uses the error in the reconstructed image to revise the motor program. This requires reconstruction errors in pixel space to be mapped to corrections in the space of spring stiffnesses. We cannot backpropagate errors through the generative model because it is just a hand-coded computer program. So we learn “generative” networks, one per digit class, that emulate the generator. After learning, backpropagation through these generative networks is used to convert pixel reconstruction errors into stiffness corrections. Our ﬁnal system gives 1.82% error on the MNIST test set which is similar to the 1.7% achieved by a very different generative approach [5] but worse than the 1.53% produced by the best backpropagation networks or the 1.4% produced by support vector machines [6]. It is much worse than the 0.4% produced by convolutional neural networks that use cleverly enhanced training sets [7]. Recognition of test images is quite slow because it uses ten different recognition networks followed by iterative local search. There is, however, a much more efﬁcient way to make use of our ability to extract motor programs. They can be treated as additional output labels when using backpropagation to train a single, multilayer, discriminative neural network. These additional labels act as a very informative regularizer that reduces the error rate from 1.53% to 1.27% in a network with two hidden layers of 500 units each. This is a new method of improving performance that can be used in conjunction with other tricks such as preprocessing the images, enhancing the training set or using convolutional neural nets [8, 7]. 2 A simple generative model for drawing digits The generative model uses two pairs of opposing springs at right angles. One end of each spring is attached to a frictionless horizontal or vertical rail that is 39 pixels from the center of the image. The other end is attached to a “pen” that has signiﬁcant mass. The springs themselves are weightless and have zero rest length. The pen starts at the equilibrium position deﬁned by the initial stiffnesses of the four springs. It then follows a trajectory that is determined by the stiffness of each spring at each of the 16 subsequent time steps in the motor program. The mass is large compared with the rate at which the stiffnesses change, so the system is typically far from equilibrium as it follows the smooth trajectory. On each time step, the momentum is multiplied by 0.9 to simulate viscosity. A coarse-grain trajectory is computed by using one step of forward integration for each time step in the motor program, so it contains 17 points. The code is at www.cs.toronto.edu/∼ hinton/code. Figure 2: The training data for each class-speciﬁc recognition network is produced by adding noise to motor programs that are inferred from MNIST images using the current parameters of the recognition network. To initiate this process, the biases of the output units are set by hand so that they represent a prototypical motor program for the class. Given a coarse-grain trajectory, we need a way of assigning an intensity to each pixel. We tried various methods until we hand-evolved one that was able to reproduce the MNIST images fairly accurately, but we suspect that many other methods would be just as good. For each point on the coarse trajectory, we share two units of ink between the the four closest pixels using bilinear interpolation. We also use linear interpolation to add three ﬁne-grain trajectory points between every pair of coarse-grain points. These ﬁne-grain points also contribute ink to the pixels using bilinear interpolation, but the amount of ink they contribute is zero if they are less than one pixel apart and rises linearly to the same amount as the coarse-grain points if they are more than two pixels apart. This generates a thin skeleton with a fairly uniform ink density. To ﬂesh-out the skeleton, we use two “ink parameters”, a a a a a, b, to specify a 3 × 3 kernel of the form b(1 + a)[ 12 , a , 12 ; a , 1 − a, a ; 12 , a , 12 ] which 6 6 6 6 is convolved with the image four times. Finally, the pixel intensities are clipped to lie in the interval [0,1]. The matlab code is at www.cs.toronto.edu/∼ hinton/code. The values of 2a and b/1.5 are additional, logistic outputs of the recognition networks1 . 3 Training the recognition networks The obvious way to learn a recognition network is to use a training set in which the inputs are images and the target outputs are the motor programs that were used to generate those images. If we knew the distribution over motor programs for a given digit class, we could easily produce such a set by running the generator. Unfortunately, the distribution over motor programs is exactly what we want to learn from the data, so we need a way to train 1 We can add all sorts of parameters to the hand-coded generative model and then get the recognition networks to learn to extract the appropriate values for each image. The global mass and viscosity as well as the spacing of the rails that hold the springs can be learned. We can even implement afﬁnelike transformations by attaching the four springs to endpoints whose eight coordinates are given by the recognition networks. These extra parameters make the learning slower and, for the normalized digits, they do not improve discrimination, probably because they help the wrong digit models as much as the right one. the recognition network without knowing this distribution in advance. Generating scribbles from random motor programs will not work because the capacity of the network will be wasted on irrelevant images that are far from the real data. Figure 2 shows how a single, prototype motor program can be used to initialize a learning process that creates its own training data. The prototype consists of a sequence of 4 × 17 spring stiffnesses that are used to set the biases on 68 of the 70 logistic output units of the recognition net. If the weights coming from the 400 hidden units are initially very small, the recognition net will then output a motor program that is a close approximation to the prototype, whatever the input image. Some random noise is then added to this motor program and it is used to generate a training image. So initially, all of the generated training images are very similar to the one produced by the prototype. The recognition net will therefore devote its capacity to modeling the way in which small changes in these images map to small changes in the motor program. Images in the MNIST training set that are close to the prototype will then be given their correct motor programs. This will tend to stretch the distribution of motor programs produced by the network along the directions that correspond to the manifold on which the digits lie. As time goes by, the generated training set will expand along the manifold for that digit class until all of the MNIST training images of that class are well modelled by the recognition network. It takes about 10 hours in matlab on a 3 GHz Xeon to train each recognition network. We use minibatches of size 100, momentum of 0.9, and adaptive learning rates on each connection that increase additively when the sign of the gradient agrees with the sign of the previous weight change and decrease multiplicatively when the signs disagree [9]. The net is generating its own training data, so the objective function is always changing which makes it inadvisable to use optimization methods that go as far as they can in a carefully chosen direction. Figures 3 and 4 show some examples of how well the recognition nets perform after training. Nearly all models achieve an average squared pixel error of less than 15 per image on their validation set (pixel intensities are between 0 and 1 with a preponderance of extreme values). The inferred motor programs are clearly good enough to capture the diverse handwriting styles in the data. They are not good enough, however, to give classiﬁcation performance comparable to the state-of-the-art on the MNIST database. So we added a series of enhancements to the basic system to improve the classiﬁcation accuracy. 4 Enhancements to the basic system Extra strokes in ones and sevens. One limitation of the basic system is that it draws digits using only a single stroke (i.e. the trajectory is a single, unbroken curve). But when people draw digits, they often add extra strokes to them. Two of the most common examples are the dash at the bottom of ones, and the dash through the middle of sevens (see examples in ﬁgure 5). About 2.2% of ones and 13% of sevens in the MNIST training set are dashed and not modelling the dashes reduces classiﬁcation accuracy signiﬁcantly. We model dashed ones and sevens by augmenting their basic motor programs with another motor program to draw the dash. For example, a dashed seven is generated by ﬁrst drawing an ordinary seven using the motor program computed by the seven model, and then drawing the dash with a motor program computed by a separate neural network that models only dashes. Dashes in ones and sevens are modeled with two different networks. Their training proceeds the same way as with the other models, except now there are only 50 hidden units and the training set contains only the dashed cases of the digit. (Separating these cases from the rest of the MNIST training set is easy because they can be quickly spotted by looking at the difference between the images and their reconstructions by the dashless digit model.) The net takes the entire image of a digit as input, and computes the motor program for just the dash. When reconstructing an unlabelled image as say, a seven, we compute both Figure 3: Examples of validation set images reconstructed by their corresponding model. In each case the original image is on the left and the reconstruction is on the right. Superimposed on the original image is the pen trajectory. the dashed and dashless versions of seven and pick the one with the lower squared pixel error to be that image’s reconstruction as a seven. Figure 5 shows examples of images reconstructed using the extra stroke. Local search. When reconstructing an image in its own class, a digit model often produces a sensible, overall approximation of the image. However, some of the ﬁner details of the reconstruction may be slightly wrong and need to be ﬁxed up by an iterative local search that adjusts the motor program to reduce the reconstruction error. We ﬁrst approximate the graphics model with a neural network that contains a single hidden layer of 500 logistic units. We train one such generative network for each of the ten digits and for the dashed version of ones and sevens (for a total of 12 nets). The motor programs used for training are obtained by adding noise to the motor programs inferred from the training data by the relevant, fully trained recognition network. The images produced from these motor programs by the graphics model are used as the targets for the supervised learning of each generative network. Given these targets, the weight updates are computed in the same way as for the recognition networks. Figure 4: To model 4’s we use a single smooth trajectory, but turn off the ink for timesteps 9 and 10. For images in which the pen does not need to leave the paper, the recognition net ﬁnds a trajectory in which points 8 and 11 are close together so that points 9 and 10 are not needed. For 5’s we leave the top until last and turn off the ink for timesteps 13 and 14. Figure 5: Examples of dashed ones and sevens reconstructed using a second stroke. The pen trajectory for the dash is shown in blue, superimposed on the original image. Initial squared pixel error = 33.8 10 iterations, error = 15.2 20 iterations, error = 10.5 30 iterations, error = 9.3 Figure 6: An example of how local search improves the detailed registration of the trajectory found by the correct model. After 30 iterations, the squared pixel error is less than a third of its initial value. Once the generative network is trained, we can use it to iteratively improve the initial motor program computed by the recognition network for an image. The main steps in one iteration are: 1) compute the error between the image and the reconstruction generated from the current motor program by the graphics model; 2) backpropagate the reconstruction error through the generative network to calculate its gradient with respect to the motor program; 3) compute a new motor program by taking a step along the direction of steepest descent plus 0.5 times the previous step. Figure 6 shows an example of how local search improves the reconstruction by the correct model. Local search is usually less effective at improving the ﬁts of the wrong models, so it eliminates about 20% of the classiﬁcation errors on the validation set. PCA model of the image residuals. The sum of squared pixel errors is not the best way of comparing an image with its reconstruction, because it treats the residual pixel errors as independent and zero-mean Gaussian distributed, which they are not. By modelling the structure in the residual vectors, we can get a better estimate of the conditional probability of the image given the motor program. For each digit class, we construct a PCA model of the image residual vectors for the training images. Then, given a test image, we project the image residual vector produced by each inferred motor program onto the relevant PCA hyperplane and compute the squared distance between the residual and its projection. This gives ten scores for the image that measure the quality of its reconstructions by the digit models. We don’t discard the old sum of squared pixel errors as they are still useful for classifying most images correctly. Instead, all twenty scores are used as inputs to the classiﬁer, which decides how to combine both types of scores to achieve high classiﬁcation accuracy. PCA model of trajectories. Classifying an image by comparing its reconstruction errors for the different digit models tacitly relies on the assumption that the incorrect models will reconstruct the image poorly. Since the models have only been trained on images in their Squared error = 24.9, Shape prior score = 31.5 Squared error = 15.0, Shape prior score = 104.2 Figure 7: Reconstruction of a two image by the two model (left box) and by the three model (right box), with the pen trajectory superimposed on the original image. The three model sharply bends the bottom of its trajectory to better explain the ink, but the trajectory prior for three penalizes it with a high score. The two model has a higher squared error, but a much lower prior score, which allows the classiﬁer to correctly label the image. own class, they often do reconstruct images from other classes poorly, but occasionally they ﬁt an image from another class well. For example, ﬁgure 7 shows how the three model reconstructs a two image better than the two model by generating a highly contorted three. This problem becomes even more pronounced with local search which sometimes contorts the wrong model to ﬁt the image really well. The solution is to learn a PCA model of the trajectories that a digit model infers from images in its own class. Given a test image, the trajectory computed by each digit model is scored by its squared distance from the relevant PCA hyperplane. These 10 “prior” scores are then given to the classiﬁer along with the 20 “likelihood” scores described above. The prior scores eliminate many classiﬁcation mistakes such as the one in ﬁgure 7. 5 Classiﬁcation results To classify a test image, we apply multinomial logistic regression to the 30 scores – i.e. we use a neural network with no hidden units, 10 softmax output units and a cross-entropy error. The net is trained by gradient descent using the scores for the validation set images. To illustrate the gain in classiﬁcation accuracy achieved by the enhancements explained above, table 1 gives the percent error on the validation set as each enhancement is added to the system. Together, the enhancements almost halve the number of mistakes. Enhancements None 1 1, 2 1, 2, 3 1, 2, 3, 4 Validation set % error 4.43 3.84 3.01 2.67 2.28 Test set % error 1.82 Table 1: The gain in classiﬁcation accuracy on the validation set as the following enhancements are added: 1) extra stroke for dashed ones and sevens, 2) local search, 3) PCA model of image residual, and 4) PCA trajectory prior. To avoid using the test set for model selection, the performance on the ofﬁcial test set was only measured for the ﬁnal system. 6 Discussion After training a single neural network to output both the class label and the motor program for all classes (as described in section 1) we tried ignoring the label output and classifying the test images by using the cost, under 10 different PCA models, of the trajectory deﬁned by the inferred motor program. Each PCA model was ﬁtted to the trajectories extracted from the training images for a given class. This gave 1.80% errors which is as good as the 1.82% we got using the 10 separate recognition networks and local search. This is quite surprising because the motor programs produced by the single network were simpliﬁed to make them all have the same dimensionality and they produced signiﬁcantly poorer reconstructions. By only using the 10 digit-speciﬁc recognition nets to create the motor programs for the training data, we get much faster recognition of test data because at test time we can use a single recognition network for all classes. It also means we do not need to trade-off prior scores against image residual scores because there is only one image residual. The ability to extract motor programs could also be used to enhance the training set. [7] shows that error rates can be halved by using smooth vector distortion ﬁelds to create extra training data. They argue that these ﬁelds simulate “uncontrolled oscillations of the hand muscles dampened by inertia”. Motor noise may be better modelled by adding noise to an actual motor program as shown in ﬁgure 1. Notice that this produces a wide variety of non-blurry images and it can also change the topology. The techniques we have used for extracting motor programs from digit images may be applicable to speech. There are excellent generative models that can produce almost perfect speech if they are given the right formant parameters [10]. Using one of these generative models we may be able to train a large number of specialized recognition networks to extract formant parameters from speech without requiring labeled training data. Once this has been done, labeled data would be available for training a single feed-forward network that could recover accurate formant parameters which could be used for real-time recognition. Acknowledgements We thank Steve Isard, David MacKay and Allan Jepson for helpful discussions. This research was funded by NSERC, CFI and OIT. GEH is a fellow of the Canadian Institute for Advanced Research and holds a Canada Research Chair in machine learning. References [1] D. M. MacKay. Mindlike behaviour in artefacts. British Journal for Philosophy of Science, 2:105–121, 1951. [2] M. Halle and K. Stevens. Speech recognition: A model and a program for research. IRE Transactions on Information Theory, IT-8 (2):155–159, 1962. [3] Murray Eden. Handwriting and pattern recognition. IRE Transactions on Information Theory, IT-8 (2):160–166, 1962. [4] J.M. Hollerbach. An oscillation theory of handwriting. Biological Cybernetics, 39:139–156, 1981. [5] G. Mayraz and G. E. Hinton. Recognizing hand-written digits using hierarchical products of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:189–197, 2001. [6] D. Decoste and B. Schoelkopf. Training invariant support vector machines. Machine Learning, 46:161–190, 2002. [7] Patrice Y. Simard, Dave Steinkraus, and John Platt. Best practice for convolutional neural networks applied to visual document analysis. In International Conference on Document Analysis and Recogntion (ICDAR), IEEE Computer Society, Los Alamitos, pages 958–962, 2003. [8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. [9] A. Jacobs R. Increased Rates of Convergence Through Learning Rate Adaptation. Technical Report: UM-CS-1987-117. University of Massachusetts, Amherst, MA, 1987. [10] W. Holmes, J. Holmes, and M. Judd. Extension of the bandwith of the jsru parallel-formant synthesizer for high quality synthesis of male and female speech. In Proceedings of ICASSP 90 (1), pages 313–316, 1990.

3 0.97002661 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

Author: Jian Zhang, Zoubin Ghahramani, Yiming Yang

Abstract: We propose a probabilistic model based on Independent Component Analysis for learning multiple related tasks. In our model the task parameters are assumed to be generated from independent sources which account for the relatedness of the tasks. We use Laplace distributions to model hidden sources which makes it possible to identify the hidden, independent components instead of just modeling correlations. Furthermore, our model enjoys a sparsity property which makes it both parsimonious and robust. We also propose efﬁcient algorithms for both empirical Bayes method and point estimation. Our experimental results on two multi-label text classiﬁcation data sets show that the proposed approach is promising.

4 0.96324694 80 nips-2005-Gaussian Process Dynamical Models

Author: Jack Wang, Aaron Hertzmann, David M. Blei

Abstract: This paper introduces Gaussian Process Dynamical Models (GPDM) for nonlinear time series analysis. A GPDM comprises a low-dimensional latent space with associated dynamics, and a map from the latent space to an observation space. We marginalize out the model parameters in closed-form, using Gaussian Process (GP) priors for both the dynamics and the observation mappings. This results in a nonparametric model for dynamical systems that accounts for uncertainty in the model. We demonstrate the approach on human motion capture data in which each pose is 62-dimensional. Despite the use of small data sets, the GPDM learns an effective representation of the nonlinear dynamics in these spaces. Webpage: http://www.dgp.toronto.edu/∼ jmwang/gpdm/ 1

5 0.80607241 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

Author: Edward Meeds, Simon Osindero

Abstract: We present an inﬁnite mixture model in which each component comprises a multivariate Gaussian distribution over an input space, and a Gaussian Process model over an output space. Our model is neatly able to deal with non-stationary covariance functions, discontinuities, multimodality and overlapping output signals. The work is similar to that by Rasmussen and Ghahramani [1]; however, we use a full generative model over input and output space rather than just a conditional model. This allows us to deal with incomplete data, to perform inference over inverse functional mappings as well as for regression, and also leads to a more powerful and consistent Bayesian speciﬁcation of the effective ‘gating network’ for the different experts. 1

6 0.76974785 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

7 0.76473749 136 nips-2005-Noise and the two-thirds power Law

8 0.76313835 60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

9 0.75620586 129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

10 0.75351191 37 nips-2005-Benchmarking Non-Parametric Statistical Tests

11 0.7502262 161 nips-2005-Radial Basis Function Network for Multi-task Learning

12 0.73759001 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

13 0.73535514 35 nips-2005-Bayesian model learning in human visual perception

14 0.72822261 48 nips-2005-Context as Filtering

15 0.72738791 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

16 0.71884537 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

17 0.71029598 45 nips-2005-Conditional Visual Tracking in Kernel Space

18 0.70388895 171 nips-2005-Searching for Character Models

19 0.70353884 74 nips-2005-Faster Rates in Regression via Active Learning

20 0.70348763 120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters