nips nips2005 nips2005-160 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
Reference: text
sentIndex sentText sentNum sentScore
1 ♦ Intel Research Abstract Training a learning algorithm is a costly task. [sent-2, score-0.081]
2 A major goal of active learning is to reduce this cost. [sent-3, score-0.094]
3 In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. [sent-4, score-0.084]
4 The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. [sent-5, score-0.266]
5 Sampling the low dimension space is done using the hit and run random walk. [sent-7, score-0.249]
6 The selective sampling framework [1] suggests permitting the learner some control over the learning process. [sent-13, score-0.275]
7 In this way, the learner can collect a short and informative training sequence. [sent-14, score-0.074]
8 This is done by generating a large set of unlabeled instances and allowing the learner to select the instances to be labeled. [sent-15, score-0.247]
9 The Query By Committee algorithm (QBC) [2] was the inspiration behind many algorithms in the selective sampling framework [3, 4, 5]. [sent-16, score-0.222]
10 During learning it maintains a version space, the space of all the classifiers which are consistent with all the previous labeled instances. [sent-18, score-0.147]
11 Whenever an unlabeled instance is available, QBC selects two random hypotheses from the version space and only queries for the label of the new instance if the two hypotheses disagree. [sent-19, score-0.692]
12 [6] proved that when certain conditions apply, QBC will reach a generalization error of ǫ when using only O (log 1/ǫ) labels. [sent-21, score-0.139]
13 QBC works in an online fashion where each instance is considered only once to decide whether to query for its label or not. [sent-22, score-0.378]
14 However, QBC was never implemented as is, since it requires the ability to sample hypotheses from the version space, a task that all known method do in an unreasonable amount of time [8]. [sent-27, score-0.222]
15 The algorithm we present in this paper uses the same skeleton as QBC, but replaces sampling from the high dimensional version space by sampling from a low dimensional projection of it. [sent-28, score-0.448]
16 Although the algorithm uses linear classifiers at its core, the use of kernels makes it much broader in scope. [sent-30, score-0.075]
17 This new sampling method is presented in section 2. [sent-31, score-0.164]
18 The last building block is a method for sampling from convex bodies. [sent-33, score-0.176]
19 We suggest the hit and run [9] random walk for this purpose in section 4. [sent-34, score-0.262]
20 The second involves differentiating male and female facial images. [sent-43, score-0.075]
21 Related work: Many algorithms for selective sampling have been suggested in the literature. [sent-47, score-0.202]
22 Two other notable algorithms are the greedy active learning algorithm [11] and the perceptron based active learning algorithm [12]. [sent-50, score-0.272]
23 The greedy active learning algorithm has the remarkable property of being close to optimal in all settings. [sent-51, score-0.137]
24 However, it operates in a batch setting, where selecting the next query point requires reevaluation of the whole set of unlabeled instances. [sent-52, score-0.298]
25 The perceptron based active learning algorithm, on the other hand, is extremely efficient in its computational requirements, but is restricted to linear classifiers since it requires the explicit use of the input dimension. [sent-54, score-0.115]
26 [13] presented a billiard walk in the version space as a part of the Bayes Point Machine. [sent-56, score-0.21]
27 Similar to the method presented here, the billiard walk is capable of sampling hypotheses from the version space when kernels are used. [sent-57, score-0.526]
28 Whenever a new instance is presented, QBC generates two independent predictions for its label by sampling two hypotheses from the version space1 . [sent-60, score-0.455]
29 If the two predictions differ, QBC queries for the label of the instance at hand (see algorithm 1). [sent-61, score-0.273]
30 The main obstacle in implementing QBC is the need to sample from the version space (step 2b). [sent-62, score-0.154]
31 In the linear case, the dimension of the version space is the input dimension which is typically large for real world problems. [sent-65, score-0.154]
32 We overcome this obstacle by projecting the version space onto a low dimensional subspace. [sent-67, score-0.167]
33 k Assume that the learner has seen the labeled sample S = {(xi , yi )}i=1 , where xi ∈ IRd and yi ∈ {±1}. [sent-68, score-0.185]
34 The version space is defined to be the set of all classifiers which correctly classify all the instances seen so far: V = {w : w ≤ 1 and ∀i yi (w · xi ) > 0} 1 The version space is the collection of hypotheses that are consistent with previous labels. [sent-69, score-0.427]
35 (b) Let h1 , h2 be two random hypotheses selected from ν restricted to V . [sent-78, score-0.174]
36 Thus, the probability that QBC will query for the label of an instance x is exactly 2 Pr [w · x > 0] Pr [w · x < 0] w∼ν|V (2) w∼ν|V where ν|V is the restriction of ν to V . [sent-87, score-0.359]
37 Instead, we can use any stochastic approach that will query for the label with the same probability as in (2). [sent-89, score-0.321]
38 Furthermore, if we can sample y ∈ {±1} such that ˆ Pr [ˆ = 1] = Prw∼ν|V [w · x > 0] y and Pr [ˆ = −1] = Prw∼ν|V [w · x < 0] y (3) (4) we can use it instead, by querying the label of x with a probability of 2 Pr [ˆ = 1] Pr [ˆ = −1]. [sent-90, score-0.123]
39 This procedure can replace ˆ the sampling step in the QBC algorithm. [sent-92, score-0.14]
40 Let x be an instance for which we need to decide whether to query for its label or not. [sent-94, score-0.378]
41 We denote by V the version space as defined in (1) and denote by T the space spanned by x1 , . [sent-95, score-0.125]
42 QBC asks for two random hypotheses from V and queries for the label of x only if these two hypotheses predict different labels for x. [sent-99, score-0.516]
43 Our procedure does the same thing, but instead of sampling the hypotheses from V we sample them from V ∩ T . [sent-100, score-0.295]
44 This is true since T is a space of dimension k + 1 at most, where k is the number of queries for label QBC made so far. [sent-102, score-0.273]
45 Hence, the body V ∩ T is a low-dimensional convex body2 and thus sampling from it can be done efficiently. [sent-103, score-0.2]
46 The input dimension plays a minor role in the sampling algorithm. [sent-104, score-0.169]
47 The following theorem proves that indeed sampling from V ∩ T produces the desired results. [sent-107, score-0.162]
48 It shows that if the prior ν (see algorithm 1) is uniform, then sampling hypotheses uniformly from V or from V ∩ T generates the same results. [sent-108, score-0.281]
49 2 From the definition of the version space V it follows that it is a convex body. [sent-109, score-0.132]
50 k Theorem 1 Let S = {(xi , yi )}i=1 be a labeled sample and x an instance. [sent-110, score-0.087]
51 Let the version space be V = {w : w ≤ 1 and ∀i yi (w · xi ) > 0} and T = span (x, x1 , . [sent-111, score-0.176]
52 3 Sampling with Kernels In this section we show how the new sampling method presented in section 2 can be used together with kernel. [sent-116, score-0.164]
53 QBC uses the random hypotheses for one purpose alone: to check the labels they predict for instances. [sent-117, score-0.201]
54 In our new sampling method the hypotheses are sampled from V ∩ T , where T = span (x, x1 , . [sent-118, score-0.294]
55 Using these observations, we can sample a hypothesis by sampling α0 , . [sent-123, score-0.196]
56 However, since the xi ’s do not form an orthonormal basis of T , sampling the α’s uniformly is not equivalent to sampling the w’s uniformly. [sent-127, score-0.395]
57 We overcome this problem by using an orthonormal basis of T . [sent-128, score-0.092]
58 The following lemma shows a possible way in which the orthonormal basis for T can be computed when only inner products are used. [sent-129, score-0.17]
59 , xk ) and let G = (gi,j ) be the Grahm matrix such that gi,j = xi · xj . [sent-137, score-0.112]
60 , tr such that k ti = l=0 γi (l) √ xl λi form an orthonormal basis of the space T . [sent-148, score-0.202]
61 The proof of lemma 1 is given in the supplementary material [14]. [sent-149, score-0.093]
62 Since the ti ’s form an orthonormal i=1 basis, w = α . [sent-155, score-0.094]
63 Furthermore, we can check the label w assigns to xj by w · xj = i α (i) ti · xj = i,l γi (l) α (i) √ xl · xj γi which is a function of the Grahm matrix. [sent-156, score-0.317]
64 Therefore, sampling from V ∩ T boils down to the problem of sampling from convex bodies, where instead of sampling a vector directly we sample the coefficients of the orthonormal basis t1 , . [sent-157, score-0.582]
65 There are several methods for generating the final hypothesis to be used in the generalization phase. [sent-161, score-0.095]
66 4 Hit and Run Hit and run [9] is a method of sampling from a convex body K using a random walk. [sent-163, score-0.259]
67 Let z ∈ K, a single step of the hit and run begins by choosing a random point u from the unit sphere. [sent-164, score-0.191]
68 Afterwards the algorithm moves to a random point selected uniformly from l ∩ K, where l is the line passing through z and z + u. [sent-165, score-0.073]
69 Hit and run has several advantages over other random walks for sampling from convex bodies. [sent-166, score-0.235]
70 In practice hit and run mixes much faster than that. [sent-172, score-0.201]
71 5 Empirical Study In this section we present the results of applying our new kernelized version of the query by committee (KQBC), to two learning tasks. [sent-176, score-0.524]
72 , 0) thus the label of an instance x ∈ IRd is the sign of its first coordinate. [sent-183, score-0.127]
73 In each trial we used 10000 unlabeled instances and let KQBC select the instances to query for the labels. [sent-185, score-0.428]
74 We also applied Support Vector Machine (SVM) to the same data in order to demonstrate the benefit of using active learning. [sent-186, score-0.072]
75 As expected, the generalization error of KQBC decreases exponentially fast as the number of queries is increased, whereas the generalization error of SVM decreases only at an inverse-polynomial rate (the rate is O∗ (1/k) where k is the number of labels). [sent-195, score-0.344]
76 67k/d over the generalization error that was proved in [6] was replicated in our experiments (figure 1c). [sent-199, score-0.139]
77 % generalization error % generalization error % generalization error 100 10 Kernel Query By Committee Support Vector Machine −0. [sent-200, score-0.327]
78 The generalization error (y-axis) in percents (in logarithmic scale) versus the number of queries (x-axis). [sent-206, score-0.235]
79 The generalization error of KQBC is compared to the generalization error of SVM. [sent-208, score-0.218]
80 Recall that [6] proved a bound on the generalization error of 50 · 2−0. [sent-211, score-0.139]
81 67k/d where k is the number of queries and d is the dimension. [sent-212, score-0.126]
82 In the second task we used the AR face images dataset [16]. [sent-215, score-0.111]
83 The people in these images are wearing different accessories, have different facial expressions and the faces are lit from different directions. [sent-216, score-0.139]
84 We selected a subset of 1456 images from this dataset. [sent-217, score-0.083]
85 For this purpose we split the data into a training sequence of 1000 images and a test sequence of 456 images. [sent-223, score-0.095]
86 When the budget allows for 100 − 140 labels, KQBC has an error rate of 2 − 3 percent less than the error rate of SVM. [sent-229, score-0.072]
87 This difference is significant as in 90% of the trials % generalization error Figure 2: Examples of face images used for the face recognition task. [sent-232, score-0.24]
88 number of queries (x-axis) for KQBC (solid) and SVM (dashed) are compared. [sent-235, score-0.126]
89 When SVM was applied solely to the instances selected by KQBC (dotted line) the results are better than SVM but worse than KQBC. [sent-236, score-0.119]
90 We trained SVM over the instances selected by KQBC. [sent-240, score-0.099]
91 The generalization error obtained by this combined scheme was better than the passive SVM but worse than KQBC. [sent-241, score-0.109]
92 In figure 4 we see the last images for which KQBC queried for labels. [sent-242, score-0.098]
93 All the images are either highly saturated or partly covered by scarves and sunglasses. [sent-244, score-0.077]
94 6 Summary and Further Study In this paper we present a novel version of the QBC algorithm. [sent-246, score-0.088]
95 The time-complexity of our algorithm depends solely on the number of queries made and not on the input dimension or the VC-dimension of the class. [sent-248, score-0.195]
96 Furthermore, our technique only requires inner products of the labeled data points - thus it can be implemented with kernels as well. [sent-249, score-0.134]
97 We showed a practical implementation of QBC using kernels and the hit and run random walk which is very close to the “provable” version. [sent-250, score-0.296]
98 In the future, we would like to compare our algorithm with other active learning algorithms, over a variety of datasets. [sent-254, score-0.114]
99 The last six faces for which KQBC queried for a label. [sent-256, score-0.075]
100 Note that three of the images are saturated and that two of these are wearing a scarf that covers half of their faces. [sent-257, score-0.111]
wordName wordTfidf (topN-words)
[('kqbc', 0.674), ('qbc', 0.436), ('query', 0.232), ('committee', 0.18), ('svm', 0.161), ('sampling', 0.14), ('hit', 0.132), ('queries', 0.126), ('hypotheses', 0.121), ('prw', 0.119), ('label', 0.089), ('generalization', 0.073), ('active', 0.072), ('version', 0.067), ('instances', 0.067), ('selective', 0.062), ('orthonormal', 0.06), ('kernels', 0.055), ('pr', 0.052), ('learner', 0.051), ('images', 0.051), ('walk', 0.05), ('queried', 0.047), ('xk', 0.046), ('xj', 0.043), ('unlabeled', 0.042), ('classi', 0.042), ('ers', 0.042), ('outperformed', 0.041), ('supplementary', 0.041), ('face', 0.04), ('synthetic', 0.04), ('billiard', 0.04), ('grahm', 0.04), ('lov', 0.04), ('costly', 0.039), ('run', 0.038), ('instance', 0.038), ('labels', 0.038), ('error', 0.036), ('convex', 0.036), ('freund', 0.035), ('wearing', 0.034), ('ird', 0.034), ('ti', 0.034), ('sample', 0.034), ('couple', 0.034), ('gure', 0.033), ('span', 0.033), ('selected', 0.032), ('basis', 0.032), ('sz', 0.031), ('mixes', 0.031), ('graepel', 0.031), ('navot', 0.031), ('proved', 0.03), ('fifth', 0.029), ('space', 0.029), ('dimension', 0.029), ('labeled', 0.029), ('faces', 0.028), ('lemma', 0.028), ('products', 0.027), ('kernel', 0.027), ('dimensional', 0.026), ('saturated', 0.026), ('facial', 0.026), ('male', 0.025), ('collecting', 0.025), ('tr', 0.025), ('yi', 0.024), ('ar', 0.024), ('female', 0.024), ('material', 0.024), ('obstacle', 0.024), ('batch', 0.024), ('presented', 0.024), ('body', 0.024), ('kernelized', 0.023), ('seung', 0.023), ('greedy', 0.023), ('training', 0.023), ('inner', 0.023), ('xi', 0.023), ('proves', 0.022), ('xl', 0.022), ('support', 0.022), ('learning', 0.022), ('hypothesis', 0.022), ('random', 0.021), ('projecting', 0.021), ('novel', 0.021), ('purpose', 0.021), ('perceptron', 0.021), ('select', 0.02), ('solely', 0.02), ('algorithm', 0.02), ('dataset', 0.02), ('decide', 0.019), ('machine', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
2 0.17903027 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
3 0.11819521 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
4 0.088618465 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio
Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures. 1
5 0.085327476 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
6 0.085008636 47 nips-2005-Consistency of one-class SVM and related algorithms
7 0.071730562 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
8 0.069076657 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
9 0.061581515 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
10 0.061141904 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
11 0.061062176 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
12 0.059409052 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
13 0.058030047 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
14 0.057866815 54 nips-2005-Data-Driven Online to Batch Conversions
15 0.056502704 70 nips-2005-Fast Information Value for Graphical Models
16 0.056238305 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
17 0.055194467 19 nips-2005-Active Learning for Misspecified Models
18 0.052648552 85 nips-2005-Generalization to Unseen Cases
19 0.051718406 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
20 0.051570121 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
topicId topicWeight
[(0, 0.164), (1, 0.084), (2, -0.034), (3, -0.052), (4, 0.077), (5, 0.136), (6, 0.076), (7, 0.057), (8, 0.01), (9, 0.117), (10, -0.045), (11, 0.066), (12, 0.165), (13, -0.019), (14, -0.037), (15, -0.096), (16, 0.009), (17, -0.063), (18, -0.035), (19, -0.008), (20, 0.045), (21, -0.026), (22, 0.11), (23, -0.014), (24, -0.027), (25, -0.002), (26, 0.018), (27, -0.059), (28, -0.002), (29, -0.106), (30, -0.109), (31, 0.08), (32, 0.015), (33, 0.019), (34, -0.139), (35, -0.111), (36, -0.025), (37, -0.109), (38, 0.082), (39, -0.019), (40, -0.061), (41, -0.107), (42, -0.046), (43, 0.111), (44, -0.167), (45, -0.018), (46, -0.163), (47, 0.034), (48, 0.024), (49, -0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.91657817 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
2 0.67392099 41 nips-2005-Coarse sample complexity bounds for active learning
Author: Sanjoy Dasgupta
Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.
3 0.62831229 33 nips-2005-Bayesian Sets
Author: Zoubin Ghahramani, Katherine A. Heller
Abstract: Inspired by “Google™ Sets”, we consider the problem of retrieving items from a concept or cluster, given a query consisting of a few items from that cluster. We formulate this as a Bayesian inference problem and describe a very simple algorithm for solving it. Our algorithm uses a modelbased concept of a cluster and ranks items using a score which evaluates the marginal probability that each item belongs to a cluster containing the query items. For exponential family models with conjugate priors this marginal probability is a simple function of sufficient statistics. We focus on sparse binary data and show that our score can be evaluated exactly using a single sparse matrix multiplication, making it possible to apply our algorithm to very large datasets. We evaluate our algorithm on three datasets: retrieving movies from EachMovie, finding completions of author sets from the NIPS dataset, and finding completions of sets of words appearing in the Grolier encyclopedia. We compare to Google™ Sets and show that Bayesian Sets gives very reasonable set completions. 1
4 0.53670996 74 nips-2005-Faster Rates in Regression via Active Learning
Author: Rebecca Willett, Robert Nowak, Rui M. Castro
Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1
5 0.40723607 19 nips-2005-Active Learning for Misspecified Models
Author: Masashi Sugiyama
Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly specified, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulfilled. In this paper, we first show that the existing active learning method can be theoretically justified under slightly weaker condition: the model does not have to be correctly specified, but slightly misspecified models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justified for a wider class of misspecified models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspecification of models and is thus reliable. 1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function Ê from training examples ´Ü Ý µ ´Ü µ · ¯ Ý Ò ´Üµ defined on ½ where ¯ Ò ½ are i.i.d. noise with mean zero and unknown variance ¾. We use the following linear regression model for learning. ´Ü µ ´µ Ô ½ « ³ ´Ü µ where ³ Ü Ô ½ are fixed linearly independent functions and are parameters to be learned. ´ µ « ´«½ «¾ « Ô µ We evaluate the goodness of the learned function Ü by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density ÔØ Ü , the generalization error is expressed as ´ µ ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø where ¯ denotes the expectation over the noise ¯ Ò Ô ´Üµ is known1. ½. In the following, we suppose that Ø In a standard setting of regression, the training input points are provided from the environment, i.e., Ü Ò ½ independently follow the distribution with density ÔØ Ü . On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. ´ µ Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly specified [2, 1, 3], i.e., the learning target function Ü can be expressed by the model. In practice, however, this assumption is often violated. ´ µ In this paper, we first show that the existing active learning method can still be theoretically justified when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justified for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points Ü Ò ½ are independently drawn from a user-defined distribution with density ÔÜ Ü , and discuss the problem of finding the optimal density function. ´µ 2 Existing Active Learning Method The generalization error defined by Eq.(1) can be decomposed as ·Î is the (squared) bias term and Î is the variance term given by where ¯ ´Üµ ´Üµ ¾ Ô ´Üµ Ü Ø Î and ¯ ´Üµ ¯ ´Üµ ¾ Ô ´Üµ Ü Ø A standard way to learn the parameters in the regression model (1) is the ordinary leastsquares learning, i.e., parameter vector « is determined as follows. « ÇÄË It is known that «ÇÄË is given by Ö« Ò Ñ « ÇÄË where Ä ÇÄË ´ µ ½ Ò ´Ü µ Ý ½ Ä ÇÄË ³ ´Ü µ ¾ Ý and Ý ´Ý½ ݾ Ý Ò µ Let ÇÄË , ÇÄË and ÎÇÄË be , and Î for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds. 1 In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values independently drawn from the distribution with density ÔØ ´Üµ—are easily gathered. In such cases, a reasonably good estimate of ÔØ ´Üµ may be obtained by some standard density estimation method. Therefore, the assumption that ÔØ ´Üµ is known may not be so restrictive. Proposition 1 ([2, 1, 3]) Suppose that the model is correctly specified, i.e., the learning target function Ü is expressed as ´µ Ô ´Ü µ Then ½ «£ ³ ´Üµ and ÎÇÄË are expressed as ÇÄË ¼ ÇÄË and Î ¾ ÇÄË Â ÇÄË where ØÖ´ÍÄ Â ÇÄË ÇÄË Ä ÇÄË µ ³ ´Üµ³ ´ÜµÔ ´Üµ Ü Í and Ø Therefore, for the correctly specified model (1), the generalization error as ÇÄË ¾ ÇÄË is expressed  ÇÄË Based on this expression, the existing active learning method determines the location of training input points Ü Ò ½ (or the training input density ÔÜ Ü ) so that ÂÇÄË is minimized [2, 1, 3]. ´ µ 3 Analysis of Existing Method under Misspecification of Models In this section, we investigate the validity of the existing active learning method for misspecified models. ´ µ Suppose the model does not exactly include the learning target function Ü , but it approximately includes it, i.e., for a scalar Æ such that Æ is small, Ü is expressed as ´ µ ´Ü µ ´Üµ · Æִܵ where ´Üµ is the orthogonal projection of ´Üµ onto the span of residual ִܵ is orthogonal to ³ ´Üµ ½ : Ô Ô ´Üµ ½ «£ ³ ´Üµ ִܵ³ ´ÜµÔ ´Üµ Ü and In this case, the bias term Ø ¼ for ³ ´Üµ ½¾ Ô and the ½ Ô is expressed as ¾ ´ ´Üµ ´Üµµ¾ Ô ´Üµ Ü is constant which does not depend on the training input density Ô ´Üµ, we subtract ¯ ´Üµ ´Üµ Ô ´Üµ Ü · where Ø Ø Since in the following discussion. Ü Then we have the following lemma2 . Lemma 2 For the approximately correct model (3), we have ÇÄË ÇÄË Î ÇÄË where 2 Þ Æ ¾ ÍÄ ¾Â Ö ÇÄË Þ Ä Þ Ç ´Ò ½ µ ´Ö´Ü½µ ִܾµ Ö ÇÄË Ö Ô Ö ´Ü Proofs of lemmas are provided in an extended version [6]. Ò µµ Ç ´Æ ¾ µ Note that the asymptotic order in Eq.(1) is in probability since ÎÇÄË is a random variable that includes Ü Ò ½ . The above lemma implies that ½ Ó ´Ò ¾ µ Therefore, the existing active learning method of minimizing  is still justified if Æ ½ ¾ µ. However, when Æ Ó ´Ò ½ µ, the existing method may not work well because ¾ Ó ´Ò the bias term is not smaller than the variance term Î , so it can not be ÇÄË ¾ · Ó ´Ò ½µ  ÇÄË if Æ Ô Ô ÇÄË Ô Ô ÇÄË ÇÄË neglected. 4 New Active Learning Method In this section, we propose a new active learning method based on the weighted leastsquares learning. 4.1 Weighted Least-Squares Learning When the model is correctly specified, «ÇÄË is an unbiased estimator of «£ . However, for misspecified models, «ÇÄË is generally biased even asymptotically if Æ ÇÔ . ´½µ The bias of «ÇÄË is actually caused by the covariate shift [5]—the training input density ÔÜ Ü is different from the test input density ÔØ Ü . For correctly specified models, influence of the covariate shift can be ignored, as the existing active learning method does. However, for misspecified models, we should explicitly cope with the covariate shift. ´µ ´ µ Under the covariate shift, it is known that the following weighted least-squares learning is [5]. asymptotically unbiased even if Æ ÇÔ ´½µ Ô ´Ü µ Ô ´Ü µ ½ Ò Ö« Ò Ñ « Ï ÄË ¾ ´Ü µ Ý Ø Ü Asymptotic unbiasedness of «Ï ÄË would be intuitively understood by the following identity, which is similar in spirit to importance sampling: ´Üµ ´Üµ ¾ Ô ´Ü µ Ü ´Üµ ´Üµ Ø ´µ ¾ Ô ´Üµ Ô ´Ü µ Ü Ô ´Üµ Ø Ü Ü In the following, we assume that ÔÜ Ü is strictly positive for all Ü. Let matrix with the -th diagonal element be the diagonal Ô ´Ü µ Ô ´Ü µ Ø Ü Then it can be confirmed that «Ï ÄË is given by « Ä Ï ÄË Ï ÄË Ý where Ä ´ Ï ÄË µ ½ 4.2 Active Learning Based on Weighted Least-Squares Learning Let Ï ÄË , Ï ÄË and ÎÏ ÄË be , and Î for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma. Lemma 3 For the approximately correct model (3), we have Ï ÄË Î Æ ¾ ÍÄ ¾Â Ï ÄË where Ï ÄË Ï ÄË Â Ï ÄË Þ Ä Þ Ç ´Ò ½ µ Ö Ï ÄË Ö Ô Ô ØÖ´ÍÄ Ï ÄË Ä Ï ÄË Ç ´Æ ¾ Ò ½ µ µ This lemma implies that ¾  · Ó ´Ò ½µ ´½µ if Æ ÓÔ Based on this expression, we propose determining the training input density ÔÜ ÂÏ ÄË is minimized. Ï ÄË Ï ÄË Ô ´Üµ so that ´½µ The use of the proposed criterion ÂÏ ÄË can be theoretically justified when Æ ÓÔ , ½ while the existing criterion ÂÇÄË requires Æ ÓÔ Ò ¾ . Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section. ´ 5 µ Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments. Toy Data Set: setting. We first illustrate how the proposed method works under a controlled ½ ´µ ´µ ½ · · ½¼¼ ´µ Let and the learning target function Ü be Ü Ü Ü¾ ÆÜ¿. Let Ò ½¼¼ be i.i.d. Gaussian noise with mean zero and standard deviation and ¯ . Let ÔØ Ü ½ be the Gaussian density with mean and standard deviation , which is assumed to be known here. Let Ô and the basis functions be ³ Ü Ü ½ for . Let us consider the following three cases. Æ , where each case corresponds to “correctly specified”, “approximately correct”, and “misspecified” (see Figure 1). We choose the training input density ÔÜ Ü from the Gaussian density with mean and standard , where deviation ¼¾ ¿ ´µ ¼ ¼ ¼¼ ¼ ¼ ¼ ½¼ ´µ ¼ ¼¿ ½¾¿ ¼¾ ¾ We compare the accuracy of the following three methods: (A) Proposed active learning criterion + WLS learning : The training input density is determined so that ÂÏ ÄË is minimized. Following the determined input density, training input points Ü ½¼¼ are created and corresponding output values Ý ½¼¼ ½ ½ are observed. Then WLS learning is used for estimating the parameters. (B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that ÂÇÄË is minimized. OLS learning is used for estimating the parameters. (C) Passive learning + OLS learning: The test input density ÔØ Ü is used as the training input density. OLS learning is used for estimating the parameters. ´ µ First, we evaluate the accuracy of ÂÏ ÄË and ÂÇÄË as approximations of Ï ÄË and ÇÄË . The means and standard deviations of Ï ÄË , ÂÏ ÄË , ÇÄË , and ÂÇÄË over runs are (“correctly depicted as functions of in Figure 2. These graphs show that when Æ specified”), both ÂÏ ÄË and ÂÇÄË give accurate estimates of Ï ÄË and ÇÄË . When Æ (“approximately correct”), ÂÏ ÄË again works well, while ÂÇÄË tends to be negatively biased for large . This result is surprising since as illustrated in Figure 1, the learning target functions with Æ and Æ are visually quite similar. Therefore, it intuitively seems that the result of Æ is not much different from that of Æ . However, the simulation result shows that this slight difference makes ÂÇÄË unreliable. (“misspecified”), ÂÏ ÄË is still reasonably accurate, while ÂÇÄË is heavily When Æ biased. ½¼¼ ¼ ¼¼ ¼ ¼ ¼¼ ¼¼ ¼ These results show that as an approximation of the generalization error, ÂÏ ÄË is more robust against the misspecification of models than ÂÇÄË , which is in good agreement with the theoretical analyses given in Section 3 and Section 4. Learning target function f(x) 8 δ=0 δ=0.04 δ=0.5 6 Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the are described with boldface. significance level The value of method (B) for Æ is extremely large but it is not a typo. 4 ± 2 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Input density functions 1.5 ¼ pt(x) Æ ¼ ½ ¦¼ ¼ px(x) 1 0.5 0 −1.5 −1 −0.5 0 0.5 1 1.5 2 Figure 1: Learning target function and input density functions. ¼ Æ (A) (B) (C) ¼¼ Æ −3 −3 −3 G−WLS 12 4 3 G−WLS 5 4 ¼ x 10 6 5 ½¼¿. “misspecified” x 10 G−WLS ¼ ¦¼ ¼ ¿¼¿ ¦ ½ ¦½ ½ ¿ ¾ ¦ ½ ¾¿ ¾ ¾¦¼ ¿ “approximately correct” x 10 6 Æ All values in the table are multiplied by Æ “correctly specified” ¦¼ ¼ ¾ ¼¦¼ ½¿ ¼¼ Æ ¾ ¼¾ ¦ ¼ ¼ 3 10 8 6 0.8 1.2 1.6 2 0.07 2.4 J−WLS 0.06 0.8 1.2 1.6 2 0.07 2.4 0.8 1.2 1.6 2 0.07 J−WLS 0.06 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 2.4 J−WLS 0.06 0.8 −3 x 10 1.2 1.6 2 2.4 G−OLS 5 0.03 0.8 −3 x 10 1.2 1.6 2 3 1.2 1.6 2 1.6 2.4 2 G−OLS 0.4 4 3 0.8 0.5 G−OLS 5 4 2.4 0.3 0.2 0.1 2 2 0.8 1.2 1.6 2 0.06 2.4 J−OLS 0.8 1.2 1.6 2 0.06 2.4 0.8 1.2 0.06 J−OLS 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02 0.02 2.4 J−OLS 0.8 1.2 1.6 c 2 2.4 0.03 0.02 0.8 Figure 2: The means and error bars of functions of . 1.2 1.6 c Ï ÄË , 2 Â Ï ÄË 2.4 , 0.8 ÇÄË 1.2 1.6 c , and ÂÇÄË over 2 2.4 ½¼¼ runs as In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When Æ , the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize Ï ÄË and ÇÄË were found by ÂÏ ÄË and ÂÇÄË . Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if Æ , OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When Æ and Æ the proposed method (A) gives significantly smaller errors than other methods. ¼ ¼ ¼¼ ¼ Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly specified case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspecification of models and thus it is reliable. Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by ¿. Bank-8fm Bank-8fh Bank-8nm Bank-8nh (A) ¼ ¿½ ¦ ¼ ¼ ¾ ½¼ ¦ ¼ ¼ ¾ ¦ ½ ¾¼ ¿ ¦ ½ ½½ (B) ¦ ¦ ¦ ¦ (C) ¦ ¦ ¦ ¦ ½¼ ¼ ¼¼ ¼¿ ¼¼ ¾ ¾½ ¼ ¼ ¾ ¾¼ ¼ ¼ Kin-8fm Kin-8fh ½ ¦¼ ¼ ½ ¦¼ ¼ ½ ¼¦¼ ¼ (A) (B) (C) ¾ ½ ¼ ¿ ½ ½¿ ¾ ¿ ½¿ ¿ ½¿ Kin-8nm ¼¦¼ ½ ¿ ¦ ¼ ½¿ ¾ ¦¼ ¾ Kin-8nh ¿ ¦¼ ¼ ¿ ¼¦ ¼ ¼ ¿ ¦¼ ½ ¼ ¾¦¼ ¼ ¼ ¦¼ ¼ ¼ ½¦¼ ¼ (A)/(C) (B)/(C) (C)/(C) 1.2 1.1 1 0.9 Bank−8fm Bank−8fh Bank−8nm Bank−8nh Kin−8fm Kin−8fh Kin−8nm Kin−8nh Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over runs. Note that the error bars were reasonably small so they were omitted. ½¼¼ Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes samples, consisting of -dimensional input and -dimensional output values. For convenience, every attribute is normalized into . ½¾ ¼ ½℄ ½¾ ½ Suppose we are given all input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose Ò input points Ü ½¼¼¼ for training and observe the corresponding output values Ý ½¼¼¼. The ½ ½ task is to predict the output values of all unlabeled samples. ½¼¼¼ In this experiment, the test input density independent Gaussian density. Ô ´Üµ and Ø ´¾ ¾ ÅÄ Ô ´Üµ is unknown. Ø µ ÜÔ Ü ¾ ÅÄ So we estimate it using the ¾ ´¾¾ µ¡ ÅÄ where Å Ä are the maximum likelihood estimates of the mean and standard ÅÄ and the basis functions be deviation obtained from all unlabeled samples. Let Ô where Ø ³ ´Üµ ¼ ½ ÜÔ Ü Ø ¾ ¡ ¾ ¼ for ½¾ ¼ are template points randomly chosen from the pool of unlabeled samples. ´µ We select the training input density ÔÜ Ü from the independent Gaussian density with mean Å Ä and standard deviation Å Ä , where ¼ ¼ ¼ ¾ In this simulation, we can not create the training input points in an arbitrary location because we only have samples. Therefore, we first create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation times, by changing the template points Ø ¼ ½ in each run. ½¾ ½¼¼ ½¼¼ The means and standard deviations of the test error over runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B). Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable or is outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than other schemes. 6 Conclusions We argued that active learning is essentially the situation under the covariate shift—the training input density is different from the test input density. When the model used for learning is correctly specified, the covariate shift does not matter. However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on the weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if model is correctly specified. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspecification of models and thus it is reliable. The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction would be therefore to devise an active learning method which has theoretical guarantee with totally misspecified models. It is also important to notice that when the model is totally misspecified, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research—including the present paper, however, the location of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the location of training input points would be a more important and promising future direction. Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support. References [1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996. [2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972. [3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000. [4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996. [5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the loglikelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000. [6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.
6 0.4057048 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
7 0.38493478 54 nips-2005-Data-Driven Online to Batch Conversions
8 0.38093019 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
9 0.37974441 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
10 0.37463212 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
11 0.35334149 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
12 0.3497175 133 nips-2005-Nested sampling for Potts models
13 0.34777245 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
14 0.34362149 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
15 0.34293973 47 nips-2005-Consistency of one-class SVM and related algorithms
16 0.34136367 84 nips-2005-Generalization in Clustering with Unobserved Features
17 0.33401352 76 nips-2005-From Batch to Transductive Online Learning
18 0.33065629 70 nips-2005-Fast Information Value for Graphical Models
19 0.32833692 69 nips-2005-Fast Gaussian Process Regression using KD-Trees
20 0.31091955 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
topicId topicWeight
[(3, 0.084), (10, 0.048), (22, 0.251), (27, 0.023), (31, 0.03), (34, 0.12), (39, 0.013), (41, 0.011), (50, 0.013), (55, 0.064), (69, 0.059), (73, 0.042), (88, 0.088), (91, 0.04)]
simIndex simValue paperId paperTitle
1 0.90054083 81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers
Author: Juan J. Murillo-fuentes, Sebastian Caro, Fernando Pérez-Cruz
Abstract: In this paper we propose a new receiver for digital communications. We focus on the application of Gaussian Processes (GPs) to the multiuser detection (MUD) in code division multiple access (CDMA) systems to solve the near-far problem. Hence, we aim to reduce the interference from other users sharing the same frequency band. While usual approaches minimize the mean square error (MMSE) to linearly retrieve the user of interest, we exploit the same criteria but in the design of a nonlinear MUD. Since the optimal solution is known to be nonlinear, the performance of this novel method clearly improves that of the MMSE detectors. Furthermore, the GP based MUD achieves excellent interference suppression even for short training sequences. We also include some experiments to illustrate that other nonlinear detectors such as those based on Support Vector Machines (SVMs) exhibit a worse performance. 1
2 0.88163406 95 nips-2005-Improved risk tail bounds for on-line algorithms
Author: Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.
same-paper 3 0.78428161 160 nips-2005-Query by Committee Made Real
Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby
Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimension space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and a real world problems.
4 0.64196873 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
Author: Rory Sayres, David Ress, Kalanit Grill-spector
Abstract: The category of visual stimuli has been reliably decoded from patterns of neural activity in extrastriate visual cortex [1]. It has yet to be seen whether object identity can be inferred from this activity. We present fMRI data measuring responses in human extrastriate cortex to a set of 12 distinct object images. We use a simple winner-take-all classifier, using half the data from each recording session as a training set, to evaluate encoding of object identity across fMRI voxels. Since this approach is sensitive to the inclusion of noisy voxels, we describe two methods for identifying subsets of voxels in the data which optimally distinguish object identity. One method characterizes the reliability of each voxel within subsets of the data, while another estimates the mutual information of each voxel with the stimulus set. We find that both metrics can identify subsets of the data which reliably encode object identity, even when noisy measurements are artificially added to the data. The mutual information metric is less efficient at this task, likely due to constraints in fMRI data. 1
5 0.62022328 177 nips-2005-Size Regularized Cut for Data Clustering
Author: Yixin Chen, Ya Zhang, Xiang Ji
Abstract: We present a novel spectral clustering method that enables users to incorporate prior knowledge of the size of clusters into the clustering process. The cost function, which is named size regularized cut (SRcut), is defined as the sum of the inter-cluster similarity and a regularization term measuring the relative size of two clusters. Finding a partition of the data set to minimize SRcut is proved to be NP-complete. An approximation algorithm is proposed to solve a relaxed version of the optimization problem as an eigenvalue problem. Evaluations over different data sets demonstrate that the method is not sensitive to outliers and performs better than normalized cut. 1
6 0.61705756 50 nips-2005-Convex Neural Networks
7 0.6083858 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
8 0.60726541 58 nips-2005-Divergences, surrogate loss functions and experimental design
9 0.60709238 114 nips-2005-Learning Rankings via Convex Hull Separation
10 0.60678512 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
11 0.60631305 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
12 0.60541552 184 nips-2005-Structured Prediction via the Extragradient Method
13 0.60345078 47 nips-2005-Consistency of one-class SVM and related algorithms
14 0.60061038 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
15 0.60003889 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
16 0.59764075 74 nips-2005-Faster Rates in Regression via Active Learning
17 0.59497499 112 nips-2005-Learning Minimum Volume Sets
18 0.59428 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
19 0.59225243 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
20 0.59155202 105 nips-2005-Large-Scale Multiclass Transduction