nips nips2005 nips2005-74 knowledge-graph by maker-knowledge-mining

74 nips-2005-Faster Rates in Regression via Active Learning

Source: pdf

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. [sent-6, score-0.861]

2 Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. [sent-7, score-0.197]

3 In some regimes, this extra ﬂexibility leads to signiﬁcantly faster rates of error decay than those possible in classical passive learning settings. [sent-8, score-0.649]

4 The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. [sent-9, score-0.924]

5 In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra ﬂexibility of the active setting and provably improving upon the classical passive techniques. [sent-10, score-1.238]

6 Our active learning theory and methods show promise in a number of applications, including ﬁeld estimation using wireless sensor networks and fault line detection. [sent-11, score-0.662]

7 1 Introduction In this paper we address the theoretical capabilities of active learning for estimating functions in noise. [sent-12, score-0.463]

8 Several empirical and theoretical studies have shown that selecting samples or making strategic queries in order to learn a target function/classiﬁer can outperform commonly used passive methods based on random or deterministic sampling [1–5]. [sent-13, score-0.495]

9 Most previous analytical work in active learning regimes deals with very stringent conditions, like the ability to make perfect or nearly perfect decisions at every stage in the sampling procedure. [sent-16, score-0.622]

10 In the classical (passive) setting the sampling locations are chosen a priori, meaning that the selection of the sample locations precedes the gathering of the function observations. [sent-19, score-0.336]

11 The extra degree of ﬂexibility garnered through active learning can lead to signiﬁcantly better function estimates than those possible using classical (passive) methods. [sent-21, score-0.541]

12 To address this critical issue, in this paper we answer several pertinent questions regarding the fundamental performance limits of active learning in the context of regression under noisy conditions. [sent-23, score-0.559]

13 Significantly faster rates of convergence are generally achievable in cases involving functions whose complexity (in the Kolmogorov sense) is highly concentrated in small regions of space (e. [sent-24, score-0.301]

14 First, when the complexity of the function is spatially homogeneous, passive learning algorithms are near-minimax optimal over all estimation methods and all (active or passive) learning schemes, indicating that active learning methods cannot provide faster rates of convergence in this regime. [sent-29, score-1.139]

15 Second, for piecewise constant functions, active learning methods can capitalize on the highly localized nature of the boundary by focusing the sampling process in the estimated vicinity of the boundary. [sent-30, score-1.217]

16 We present an algorithm that provably improves on the best possible passive learning algorithm and achieves faster rates of error convergence. [sent-31, score-0.572]

17 Furthermore, we show that this performance cannot be signiﬁcantly improved on by any other active learning method (in a minimax sense). [sent-32, score-0.545]

18 Unfortunately these techniques cannot be extended to more general piecewise constant/smooth models, and to the best of our knowledge our work is the ﬁrst addressing active learning in this class of models. [sent-34, score-0.611]

19 Our active learning theory and methods show promise for a number of problems. [sent-35, score-0.424]

20 Using active learning in this context can signiﬁcantly reduce image acquisition times. [sent-37, score-0.424]

21 Incorporating active learning strategies into such systems can dramatically lengthen the lifetime of the system. [sent-40, score-0.532]

22 In fact, active learning problems like the one we pose in Section 4 have already found application in fault line detection [7] and boundary estimation in wireless sensor networking [9]. [sent-41, score-1.096]

23 Let $\hat{f}_n : [0,1]^d \to \mathbb{R}$ denote an estimator based on the training samples $\{X_i, Y_i\}_{i=1}^n$. [sent-60, score-0.278]

24 When constructing an estimator under the active learning paradigm there is another degree of freedom: we are allowed to choose our sampling strategy, that is, we can specify Xi |X1 . [sent-61, score-0.654]
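The difference between the two sampling paradigms can be sketched in a few lines. This is only an illustration: `run_passive`, `run_active`, `oracle`, `choose_next`, the noiseless step function, and the bisection rule are hypothetical names and choices that do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # one (X_i, Y_i) pair


def run_passive(oracle: Callable[[float], float],
                locations: Sequence[float]) -> List[Sample]:
    """Passive sampling: every location is fixed before any observation is made."""
    return [(x, oracle(x)) for x in locations]


def run_active(oracle: Callable[[float], float],
               choose_next: Callable[[List[Sample]], float],
               n: int) -> List[Sample]:
    """Active sampling: the next location may depend on all earlier (X, Y) pairs."""
    history: List[Sample] = []
    for _ in range(n):
        x = choose_next(history)
        history.append((x, oracle(x)))
    return history


# Hypothetical noiseless target: a step located at 0.3.
def step(x: float) -> float:
    return 1.0 if x >= 0.3 else 0.0


# Hypothetical sampling rule: bisect the interval still known to contain the step.
def bisect_rule(history: List[Sample]) -> float:
    lo, hi = 0.0, 1.0
    for x, y in history:
        if y == 0.0:
            lo = max(lo, x)
        else:
            hi = min(hi, x)
    return (lo + hi) / 2


passive = run_passive(step, [i / 10 for i in range(10)])
active = run_active(step, bisect_rule, 10)
```

With the same budget of ten observations, the passive grid can only localize the step to within its spacing, while the adaptive rule halves the uncertainty at every query. Noise, which is the actual difficulty addressed in the paper, is deliberately left out of this sketch.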

25 Our goal is to construct estimation strategies which minimize the expected squared error, $\mathbb{E}_{f,S_n}[\|\hat{f}_n - f\|^2]$, where $\mathbb{E}_{f,S_n}$ is the expectation with respect to the probability measure of $\{X_i, Y_i\}_{i=1}^n$ induced by model $f$ and sampling strategy $S_n$, and $\|\cdot\|$ is the usual $L_2$ norm. [sent-70, score-0.443]

26 3 Learning in Classical Smoothness Spaces In this section we consider classes of functions whose complexity is homogeneous over the entire domain, so that there are no localized features, as in Figure 1(a). [sent-71, score-0.17]

27 In this case we do not expect the extra ﬂexibility of the active learning strategies to provide any substantial beneﬁt over passive sampling strategies, since a simple uniform sampling scheme is naturally matched to the homogeneous “distribution” of the target function’s complexity. [sent-72, score-1.169]

28 The ﬁrst of our two main results is a minimax lower bound on the performance of all active estimation strategies for this class of functions. [sent-76, score-0.742]

29 Under the requirements of the active learning model we have the minimax bound $\inf_{(\hat{f}_n, S_n) \in \Theta_{\text{active}}} \sup_{f \in \Sigma(L,\alpha)} \mathbb{E}_{f,S_n}[\|\hat{f}_n - f\|^2] \ge c\, n^{-\frac{2\alpha}{2\alpha+d}}$, (1) where $c \equiv c(L, \alpha, \sigma^2) > 0$ and $\Theta_{\text{active}}$ is the set of all active estimation strategies (which includes also passive strategies). [sent-78, score-1.114]

30 Note that the rate in Theorem 1 is the same as the classical passive learning rate [10, 11] but the class of estimation strategies allowed is now much bigger. [sent-79, score-1.216]
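To get a feel for the rate $n^{-2\alpha/(2\alpha+d)}$ in (1), the exponent can be evaluated for a few smoothness levels and dimensions. The sketch below is illustrative only: the function names are hypothetical, and the constant `c = 1.0` is an arbitrary placeholder, since the excerpt only states that $c > 0$.

```python
def holder_rate_exponent(alpha: float, d: int) -> float:
    """Exponent r such that the bound in (1) decays like n**(-r)."""
    if alpha <= 0 or d < 1:
        raise ValueError("need alpha > 0 and d >= 1")
    return 2 * alpha / (2 * alpha + d)


def samples_for_error(target: float, alpha: float, d: int, c: float = 1.0) -> float:
    """Real-valued n solving c * n**(-r) = target (c is a placeholder constant)."""
    r = holder_rate_exponent(alpha, d)
    return (c / target) ** (1 / r)


for d in (1, 2, 4, 8):
    print(d, holder_rate_exponent(1.0, d), samples_for_error(0.01, 1.0, d))
```

The exponent shrinks as `d` grows, so the sample size needed for a fixed error level grows quickly with dimension. Since the same rate holds for active and passive strategies in this class, adaptivity does not change this picture here.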

31 The key aspects of the proof for the passive setting [13] apply to the active scenario due to the fact that we can choose an adequate set of hypotheses without knowledge of the sampling strategy, although some modiﬁcations are required due to the extra ﬂexibility of the sampling strategy. [sent-82, score-0.981]

32 4 The Active Advantage In this section we address two key questions: (i) when does active learning provably yield better results, and (ii) what are the fundamental limitations of active learning? [sent-85, score-0.861]

33 We expect that, for functions whose complexity is spatially non-uniform and highly concentrated in small subsets of the domain, the extra spatial adaptivity of the active learning paradigm can lead into signiﬁcant performance gains. [sent-87, score-0.692]

34 We study a class of functions which highlights this notion of “spatially concentrated complexity”. [sent-88, score-0.161]

35 A function $f : [0,1]^d \to \mathbb{R}$ is called piecewise constant if it is locally constant at any point $x \in [0,1]^d \setminus B(f)$, where $B(f) \subset [0,1]^d$, the boundary set, has upper box-counting dimension at most $d - 1$. [sent-90, score-0.611]

36 The set of all piecewise constant functions f satisfying the above conditions is denoted by PC(β, M ). [sent-92, score-0.216]

37 The conditions above mean that (a) the functions are constant except along (d − 1)-dimensional “boundaries” where they are discontinuous and (b) the boundaries between the various constant regions are (d − 1)-dimensional non-fractal sets. [sent-93, score-0.219]

38 The class PC(β, M ) has the main ingredients that make active learning appealing: a function f is “well-behaved” everywhere on the unit square, except on a small subset B(f ). [sent-98, score-0.481]

39 We will see that the critical task for any good estimator is to accurately ﬁnd the location of the boundary B(f ). [sent-99, score-0.574]

40 1 Passive Learning Framework To obtain minimax lower bounds for PC(β, M ) we consider a smaller class of functions, namely the boundary fragment class studied in [11]. [sent-101, score-0.801]

41 The class of all the functions of this form is called the boundary fragment class (usually M = 1), denoted by BF(M ). [sent-105, score-0.719]

42 Note that there are only two regions, and the boundary separating those is a function of the ﬁrst d − 1 variables. [sent-106, score-0.434]

43 It is straightforward to show that BF(M ) ⊆ PC(β, M ) for a suitable constant β; therefore a minimax lower bound for the boundary fragment class is trivially a lower bound for the piecewise constant class. [sent-107, score-1.036]

44 There exist practical passive learning strategies that are near-minimax optimal. [sent-109, score-0.515]

45 Figure 1: Examples of functions in the classes considered: (a) Hölder smooth function. [sent-111, score-0.168]

46 Since the location of the boundary is a priori unknown it is natural to distribute the sample points uniformly over the unit cube. [sent-122, score-0.521]

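47 Define the complexity regularized estimator as $\hat{f}_n \equiv \arg\min_{\tilde{f}^{(\pi)} : \pi \in \Pi} \left\{ \frac{1}{n} \sum_{i=1}^n \left( \tilde{f}^{(\pi)}(X_i) - Y_i \right)^2 + \lambda \frac{\log n}{n} |\pi| \right\}$, (3) where $|\pi|$ denotes the number of elements of $\pi$ and $\lambda > 0$. [sent-129, score-0.277]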
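The penalized criterion in (3) is additive over the cells of a partition, so it can be minimized exactly by bottom-up pruning of a dyadic tree. The sketch below does this in one dimension only, which is a simplification (the paper works on $[0,1]^d$). The name `fit_rdp`, the value `lam=0.1`, the depth limit, and the synthetic step data are all assumptions made for illustration.

```python
import math
import random


def fit_rdp(xs, ys, lam, max_depth):
    """Minimize  sum of squared errors + lam * log(n) * (number of leaves)
    over 1-D recursive dyadic partitions of [0, 1) with depth <= max_depth.
    This is n times the objective in (3), so the minimizer is the same.
    Returns (cost, leaves); each leaf is (lo, hi, depth, fitted_value)."""
    n = len(xs)
    pen = lam * math.log(n)  # price paid for every leaf

    def best(lo, hi, depth, pts):
        vals = [y for _, y in pts]
        mean = sum(vals) / len(vals) if vals else 0.0
        sse = sum((y - mean) ** 2 for y in vals)
        leaf = (sse + pen, [(lo, hi, depth, mean)])
        # Splitting a cell with at most one point can only add penalty.
        if depth == max_depth or len(pts) <= 1:
            return leaf
        mid = (lo + hi) / 2
        left_cost, left_leaves = best(lo, mid, depth + 1, [p for p in pts if p[0] < mid])
        right_cost, right_leaves = best(mid, hi, depth + 1, [p for p in pts if p[0] >= mid])
        if left_cost + right_cost < leaf[0]:
            return left_cost + right_cost, left_leaves + right_leaves
        return leaf

    return best(0.0, 1.0, 0, list(zip(xs, ys)))


# Synthetic example: a noisy step at 0.3 observed on a regular grid.
rng = random.Random(0)
n = 512
xs = [(i + 0.5) / n for i in range(n)]
ys = [(1.0 if x >= 0.3 else 0.0) + rng.gauss(0.0, 0.1) for x in xs]
cost, leaves = fit_rdp(xs, ys, lam=0.1, max_depth=6)
for lo, hi, depth, value in leaves:
    print(f"[{lo:.4f}, {hi:.4f})  depth={depth}  value={value:.2f}")
```

In this toy run the partition stays coarse where the function is constant and is refined only around the jump, which is the behavior the preview step of the active scheme relies on.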

48 From that analysis we conclude that $\sup_{f \in PC(\beta, M)} \mathbb{E}_f[\|\hat{f}_n - f\|^2] \le C (n/\log n)^{-\frac{1}{d}}$, (4) where $C \equiv C(\beta, M, \sigma^2) > 0$. [sent-132, score-0.182]

49 This shows that, up to a logarithmic factor, the rate in (2) is the optimal rate of convergence for passive strategies. [sent-133, score-0.318]

50 2 Active Learning Framework We now turn our attention to the active learning scenario. [sent-136, score-0.424]

51 In [8] this was studied for the boundary fragment class. [sent-137, score-0.566]

52 From that work and noting again that BF(M) ⊆ PC(β, M) we have, for d ≥ 2, $\inf_{(\hat{f}_n, S_n) \in \Theta_{\text{active}}} \sup_{f \in PC(\beta, M)} \mathbb{E}_{f,S_n}[\|\hat{f}_n - f\|^2] \ge c\, n^{-\frac{1}{d-1}}$, (5) where $c \equiv c(M, \sigma^2) > 0$. [sent-138, score-0.217]

53 In contrast with (2), we observe that with active learning we have a potential performance gain over passive strategies, effectively equivalent to a dimensionality reduction. [sent-139, score-0.742]

54 Essentially the exponent in (5) depends now on the dimension of the boundary set, d − 1, instead of the dimension of the entire domain, d. [sent-140, score-0.434]

55 In [11] an algorithm capable of achieving the above rate for the boundary fragment class is presented, but this algorithm takes advantage of the very special functional form of the boundary fragment functions. [sent-141, score-1.189]

56 This change-point detection can be performed extremely accurately using active learning, as shown in the pioneering work of Burnashev and Zigangirov [6]. [sent-143, score-0.366]
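To illustrate why adaptive queries help with change-point detection, here is a simplified probabilistic-bisection sketch. It is written in the spirit of noisy binary search and is not a reproduction of the Burnashev and Zigangirov procedure. The noise model (each yes/no answer is flipped with a known probability), the grid size, and all names are assumptions of this sketch.

```python
import random


def probabilistic_bisection(query, n_queries, grid_size=1024, flip_prob=0.2):
    """Estimate a threshold theta in [0, 1] from answers query(x) -> bool,
    where the intended answer is 'x > theta' and answers may be wrong with
    probability flip_prob. A posterior over grid cells is kept, and each
    query is placed at the right edge of the posterior-median cell."""
    post = [1.0 / grid_size] * grid_size
    for _ in range(n_queries):
        # Locate the cell where the cumulative posterior first reaches 1/2.
        acc, m = 0.0, grid_size - 1
        for k, p in enumerate(post):
            acc += p
            if acc >= 0.5:
                m = k
                break
        x = (m + 1) / grid_size
        says_right = query(x)
        # Cells 0..m predict the answer True (theta < x); the others predict False.
        post = [p * ((1.0 - flip_prob) if ((k <= m) == says_right) else flip_prob)
                for k, p in enumerate(post)]
        total = sum(post)
        post = [p / total for p in post]
    best = max(range(grid_size), key=post.__getitem__)
    return (best + 0.5) / grid_size


theta = 0.3141          # hypothetical change point
rng = random.Random(1)


def noisy_answer(x):
    truth = x > theta
    return truth if rng.random() >= 0.2 else not truth


estimate_noisy = probabilistic_bisection(noisy_answer, 200)
estimate_clean = probabilistic_bisection(lambda x: x > theta, 200)
print(estimate_noisy, estimate_clean)
```

The design choice worth noting is that queries concentrate around the current best guess of the change point instead of being spread over the whole interval, which is what a passive design has to do.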

57 Unfortunately, the boundary fragment class is very restrictive and impractical for most applications. [sent-144, score-0.663]

58 Recall that boundary fragments consist of only two regions, separated by a boundary that is a function of the ﬁrst d − 1 coordinates. [sent-145, score-0.906]

59 The class PC(β, M ) is much larger and more general and the algorithmic ideas that work for boundary fragments can no longer be used. [sent-146, score-0.529]

60 We now propose an active learning scheme for the piecewise constant class. [sent-148, score-0.601]

61 The proposed scheme is a two-step approach based in part on the tree-structured estimators described above for passive learning. [sent-149, score-0.426]

62 In the ﬁrst step, called the preview step, a rough estimator of f is constructed using n/2 samples (assume for simplicity that n is even), distributed uniformly over [0, 1]d . [sent-150, score-0.447]

63 In the second step, called the reﬁnement step, we select n/2 samples near the perceived locations of the boundaries (estimated in the preview step) separating constant regions. [sent-151, score-0.522]

64 At the end of this process we will have half the samples concentrated in the vicinity of the boundary set B(f ). [sent-152, score-0.586]
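The sample allocation of the two steps can be sketched as follows. This covers only the sampling side: how cells get flagged in the preview step is left out, and spreading the refinement budget round-robin over the flagged cells is an assumption of this sketch, as are the function names and the example cells.

```python
import random


def uniform_samples(n, d, rng):
    """Preview step: n locations drawn uniformly from the unit cube [0, 1]^d."""
    return [tuple(rng.random() for _ in range(d)) for _ in range(n)]


def refinement_samples(cells, n, rng):
    """Refinement step: n locations drawn inside the flagged cells.
    Each cell is (corner, side): the lower corner and the side length of a
    hypercube. The budget is spread round-robin over the cells."""
    points = []
    for i in range(n):
        corner, side = cells[i % len(cells)]
        points.append(tuple(c + side * rng.random() for c in corner))
    return points


rng = random.Random(0)
n = 20
preview = uniform_samples(n // 2, 2, rng)
# Hypothetical cells that a preview estimate flagged as containing the boundary.
flagged = [((0.25, 0.5), 0.125), ((0.375, 0.5), 0.125)]
refine = refinement_samples(flagged, n - n // 2, rng)
```

After both steps half of the budget sits in the small flagged region, so the local sampling density there is much higher than a uniform design with the same `n` would give.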

65 However, it is critical that the preview step is able to detect the boundary with very high probability. [sent-154, score-0.787]

66 If part of the boundary is missed, then the error incurred is going to propagate into the ﬁnal estimate, ultimately degrading the performance. [sent-155, score-0.552]

67 Therefore extreme care must be taken to detect the boundary in the preview step, as described below. [sent-156, score-0.743]

68 Next proceed by using the passive learning algorithm described before, but restrict the estimator to RDPs with leafs at a maximum depth of $J = \frac{d-1}{(d-1)^2+d} \log(n'/\log(n'))$. [sent-159, score-0.634]
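Reading the depth restriction as $J = \frac{d-1}{(d-1)^2+d}\log(n'/\log n')$, with $n'$ the number of preview samples, the value can be computed directly. Two things are assumptions here: the base of the logarithm is not stated in the excerpt (natural log is the default below and can be swapped), and the function name is hypothetical.

```python
import math


def preview_max_depth(n_preview: int, d: int, log=math.log) -> float:
    """Real-valued maximum depth J for the preview RDPs.
    `log` is the logarithm to use; its base is an assumption of this sketch."""
    if d < 2 or n_preview < 3:
        raise ValueError("need d >= 2 and at least 3 preview samples")
    ratio = (d - 1) / ((d - 1) ** 2 + d)
    return ratio * log(n_preview / log(n_preview))


for d in (2, 3, 4):
    print(d, preview_max_depth(5000, d), preview_max_depth(5000, d, log=math.log2))
```

Since $(d-1)^2 + d = d\,(d - 1 + 1/d)$, the coefficient equals $\frac{d-1}{d}$ times the exponent $\frac{1}{d-1+1/d}$ that appears in the active rate (6). This is a useful consistency check on the formula.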

69 This ensures that, on average, every element of the RDP contains many sample points; therefore we obtain a low variance estimate, although the estimator bias is going to be large. [sent-160, score-0.168]

70 The above strategy ensures that most of the time, leafs that intersect the boundary are at the maximum allowed depth (because otherwise the estimator would incur too much empirical error) and leafs away from the boundary are at shallower depths. [sent-162, score-1.314]

71 Therefore we can “detect” the rough location of the boundary just by looking at the deepest leafs. [sent-163, score-0.48]

72 Unfortunately, if the set B(f ) is somewhat aligned with the dyadic splits of the RDP, leafs intersecting the boundary can be pruned without incurring a large error. [sent-164, score-0.731]

73 This is illustrated in Figure 2(b); the cell with the arrow was pruned and contains a piece of the boundary, but the error incurred by pruning is small since that region is mostly a constant region. [sent-165, score-0.305]

74 We use d + 1 estimators in the preview step: one on the initial uniform partition, and d over partitions whose dyadic splits have been translated by 2−J in each one of the d coordinates. [sent-168, score-0.492]
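The effect of the translated partitions can be seen on a boundary that coincides with a dyadic split. The sketch below lists the d + 1 translations and computes which dyadic cell a point falls into under each of them. Wrapping around the unit cube and translating points instead of splits are assumptions of this sketch, and the names are hypothetical.

```python
from typing import List, Sequence, Tuple


def preview_shifts(d: int, J: int) -> List[Tuple[float, ...]]:
    """The d + 1 translations: no shift, plus 2**-J along each coordinate."""
    step = 2.0 ** -J
    shifts = [tuple(0.0 for _ in range(d))]
    for k in range(d):
        shifts.append(tuple(step if i == k else 0.0 for i in range(d)))
    return shifts


def cell_index(x: Sequence[float], depth: int, shift: Sequence[float]) -> Tuple[int, ...]:
    """Index of the depth-`depth` dyadic cell containing x after translation.
    Wrap-around at the edge of the unit cube is an assumption of this sketch."""
    cells_per_side = 2 ** depth
    return tuple(int(((xi + si) % 1.0) * cells_per_side) for xi, si in zip(x, shift))


J = 4
shifts = preview_shifts(2, J)
# Two points just left and right of a boundary at x1 = 0.5 (a dyadic split).
a = (0.5 - 2.0 ** -5, 0.3)
b = (0.5 + 2.0 ** -5, 0.3)
for depth in range(1, J + 1):
    print(depth,
          cell_index(a, depth, shifts[0]) == cell_index(b, depth, shifts[0]),
          cell_index(a, depth, shifts[1]) == cell_index(b, depth, shifts[1]))
```

In the unshifted partition the two points are separated at every depth, so the cells next to the boundary can be merged into coarse constant cells at little cost. After the shift along the first coordinate, the two points share a cell at depths 1 to 3, so a coarse cell there straddles the jump and pruning it becomes expensive.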

75 Any leaf that is at the maximum depth of any of the d + 1 RDPs pruned in the preview step indicates the highly probable presence of a boundary, and will be reﬁned in the next stage. [sent-169, score-0.498]

76 Reﬁnement: With high probability, the boundary is contained in the leafs at the maximum depth. [sent-170, score-0.553]

77 In the refinement step we collect additional n/2 samples in the corresponding partition cells, using these to obtain a refined estimate of the function f by again applying [Figure 2: The two-step procedure for d = 2: (a) Initial unpruned RDP and n/2 samples.] [sent-171, score-0.208]

78 This produces a higher resolution estimate in the vicinity of the boundary set B(f ), yielding better performance than the passive learning technique. [sent-177, score-0.853]

79 To formally show that this algorithm attains the faster rates we desire we have to consider a further technical assumption, namely that the boundary set is “cusp-free” (footnote 3). [sent-178, score-0.556]

80 This condition is rather technical, but it is not very restrictive, and encompasses many interesting situations, including of course boundary fragments. [sent-179, score-0.434]

81 Under the active learning scenario we have, for d ≥ 2 and functions f whose boundary is cusp-free, $\mathbb{E}[\|\hat{f}_n - f\|^2] \le C \left( \frac{n}{\log n} \right)^{-\frac{1}{d-1+1/d}}$, (6) where $C > 0$. [sent-182, score-1.037]

82 This bound improves on (4), demonstrating that this technique performs better than the best possible passive learning estimator. [sent-183, score-0.41]
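The three exponents involved can be compared exactly: the passive rate (4), the active lower bound (5), and the rate (6) attained by the two-step scheme. The sketch below ignores logarithmic factors and uses hypothetical function names.

```python
from fractions import Fraction


def passive_exponent(d: int) -> Fraction:
    """Exponent r in the passive rate n**(-r) for piecewise constant functions, see (4)."""
    return Fraction(1, d)


def active_lower_bound_exponent(d: int) -> Fraction:
    """Exponent in the active minimax lower bound (5)."""
    if d < 2:
        raise ValueError("the bound in (5) is stated for d >= 2")
    return Fraction(1, d - 1)


def two_step_exponent(d: int) -> Fraction:
    """Exponent 1 / (d - 1 + 1/d) attained by the two-step scheme, see (6)."""
    if d < 2:
        raise ValueError("the bound in (6) is stated for d >= 2")
    return 1 / (Fraction(d - 1) + Fraction(1, d))


for d in (2, 3, 4):
    print(d, passive_exponent(d), two_step_exponent(d), active_lower_bound_exponent(d))
```

For d = 2 the exponents are 1/2 (passive), 2/3 (two-step), and 1 (lower bound). The two-step scheme therefore sits strictly between the passive rate and the active lower bound.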

83 By restricting the maximum depth of the trees in the preview stage we can control the type-(i) error, ensuring that it does not exceed the error rate in (6). [sent-186, score-0.42]

84 Type-(ii) error corresponds to the situations when a part of the boundary was not detected in the preview step. [sent-187, score-0.777]

85 This can happen because of the inherent randomness of the noise and sampling distribution, or because the boundary is somewhat aligned with the dyadic splits. [sent-188, score-0.645]

86 The latter can be a problem and this is why one needs to perform d + 1 preview estimates over shifted partitions. [sent-189, score-0.309]

87 If the boundary is cusp-free then it is guaranteed that one of those preview estimators is going to “feel” the boundary since it is not aligned with the corresponding partition. [sent-190, score-1.351]

88 Finally, the type-(iii) error is very easy to analyze, using the same techniques we used for the passive estimator. [sent-191, score-0.352]

89 Therefore we can get rates arbitrarily close to the lower bound rates in (5). [sent-198, score-0.182]

90 Footnote 3: A cusp-free boundary cannot have the behavior observed in the graph of $|x|^{1/2}$ at the origin. [sent-199, score-0.434]

91 5 Final Remarks The results presented in this paper show that in certain scenarios active learning attains provable gains over the classical passive approaches. [sent-201, score-0.837]

92 Despite these draws, the analysis of such active methods is quite challenging due to the loss of statistical independence in the observations (recall that now the sample locations are coupled with all the observations made in the past). [sent-203, score-0.535]

93 The two function classes presented are non-trivial canonical examples illustrating under what conditions one might expect active learning to improve rates of convergence. [sent-204, score-0.498]

94 The algorithm presented here for actively learning members of the piecewise constant class demonstrates the possibilities of active learning. [sent-205, score-0.658]

95 In fact, this algorithm has already been applied in the context of ﬁeld estimation using wireless sensor networks [9]. [sent-206, score-0.197]

96 Future work includes the further development of the ideas presented here to the context of binary classiﬁcation and active learning of the Bayes decision boundary. [sent-207, score-0.424]

97 Mackay, “Information-based objective functions for active data selection,” Neural Computation, vol. [sent-216, score-0.405]

98 [8] Alexander Korostelev, “On minimax rates of convergence in image models under sequential design,” Statistics & Probability Letters, vol. [sent-247, score-0.195]

99 Nowak, “Backcasting: Adaptive sampling for sensor networks,” in Proc. [sent-253, score-0.174]

100 Nowak, “Fast rates in regression via active learning,” Tech. [sent-268, score-0.44]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('boundary', 0.434), ('active', 0.366), ('passive', 0.318), ('preview', 0.309), ('fn', 0.14), ('fragment', 0.132), ('piecewise', 0.13), ('rdp', 0.124), ('minimax', 0.121), ('leafs', 0.119), ('nement', 0.116), ('strategies', 0.108), ('estimators', 0.108), ('pc', 0.105), ('willett', 0.103), ('sampling', 0.103), ('lder', 0.095), ('rdps', 0.095), ('estimator', 0.094), ('nowak', 0.094), ('dyadic', 0.075), ('rates', 0.074), ('bf', 0.071), ('sensor', 0.071), ('wireless', 0.07), ('pruned', 0.07), ('locations', 0.068), ('concentrated', 0.065), ('questions', 0.065), ('regimes', 0.063), ('extra', 0.061), ('spatially', 0.06), ('learning', 0.058), ('class', 0.057), ('classical', 0.056), ('estimation', 0.056), ('boundaries', 0.054), ('madison', 0.053), ('homogeneous', 0.052), ('exibility', 0.051), ('incurred', 0.051), ('wisconsin', 0.05), ('nonparametric', 0.049), ('faster', 0.048), ('burnashev', 0.047), ('castro', 0.047), ('korostelev', 0.047), ('zigangirov', 0.047), ('constant', 0.047), ('location', 0.046), ('depth', 0.045), ('samples', 0.044), ('step', 0.044), ('vicinity', 0.043), ('complexity', 0.043), ('sup', 0.042), ('fault', 0.041), ('tsybakov', 0.041), ('partition', 0.041), ('sample', 0.041), ('provably', 0.04), ('restrictive', 0.04), ('functions', 0.039), ('pruning', 0.039), ('scenarios', 0.039), ('limits', 0.039), ('ii', 0.038), ('fragments', 0.038), ('hypercube', 0.038), ('cn', 0.037), ('strategy', 0.036), ('localized', 0.036), ('methodologies', 0.035), ('inf', 0.035), ('collect', 0.035), ('smooth', 0.034), ('error', 0.034), ('sn', 0.034), ('yi', 0.034), ('bound', 0.034), ('re', 0.034), ('xi', 0.034), ('going', 0.033), ('allowed', 0.033), ('wi', 0.033), ('piece', 0.033), ('aligned', 0.033), ('regions', 0.032), ('stage', 0.032), ('tree', 0.032), ('arrow', 0.031), ('answers', 0.031), ('fundamental', 0.031), ('practical', 0.031), ('observations', 0.03), ('remarks', 0.03), ('twenty', 0.03), ('leaf', 0.03), ('queries', 0.03), ('proof', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

2 0.17065531 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the speciﬁc target hypothesis, and the desired accuracy.

3 0.15676209 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly specified, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulfilled. In this paper, we first show that the existing active learning method can be theoretically justified under slightly weaker condition: the model does not have to be correctly specified, but slightly misspecified models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justified for a wider class of misspecified models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspecification of models and is thus reliable.
1 Introduction and Problem Formulation Let us discuss the regression problem of learning a real-valued function $f(x)$ defined on $\mathbb{R}^d$ from training examples $\{(x_i, y_i) \mid y_i = f(x_i) + \epsilon_i\}_{i=1}^n$, where $\{\epsilon_i\}_{i=1}^n$ are i.i.d. noise with mean zero and unknown variance $\sigma^2$. We use the following linear regression model for learning: $\hat{f}(x) = \sum_{i=1}^p \alpha_i \varphi_i(x)$, where $\{\varphi_i(x)\}_{i=1}^p$ are fixed linearly independent functions and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_p)^\top$ are parameters to be learned. We evaluate the goodness of the learned function $\hat{f}(x)$ by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density $p_t(x)$, the generalization error is expressed as $G = E_\epsilon \int (\hat{f}(x) - f(x))^2 p_t(x)\,dx$, where $E_\epsilon$ denotes the expectation over the noise $\{\epsilon_i\}_{i=1}^n$. In the following, we suppose that $p_t(x)$ is known (footnote 1).
In a standard setting of regression, the training input points are provided from the environment, i.e., $\{x_i\}_{i=1}^n$ independently follow the distribution with density $p_t(x)$. On the other hand, in some cases, the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty. Active learning (also referred to as experimental design) is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly specified [2, 1, 3], i.e., the learning target function $f(x)$ can be expressed by the model. In practice, however, this assumption is often violated. In this paper, we first show that the existing active learning method can still be theoretically justified when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justified for approximately correct models, but the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points $\{x_i\}_{i=1}^n$ are independently drawn from a user-defined distribution with density $p_x(x)$, and discuss the problem of finding the optimal density function.
2 Existing Active Learning Method The generalization error defined by Eq.(1) can be decomposed as $G = B + V$, where $B$ is the (squared) bias term and $V$ is the variance term given by $B = \int (E_\epsilon \hat{f}(x) - f(x))^2 p_t(x)\,dx$ and $V = E_\epsilon \int (\hat{f}(x) - E_\epsilon \hat{f}(x))^2 p_t(x)\,dx$. A standard way to learn the parameters in the regression model (1) is the ordinary least-squares learning, i.e., parameter vector $\alpha$ is determined as follows: $\hat{\alpha}_{OLS} = \arg\min_\alpha \sum_{i=1}^n (\hat{f}(x_i) - y_i)^2$. It is known that $\hat{\alpha}_{OLS}$ is given by $\hat{\alpha}_{OLS} = L_{OLS}\, y$, where $L_{OLS} = (X^\top X)^{-1} X^\top$, $X_{i,j} = \varphi_j(x_i)$, and $y = (y_1, y_2, \ldots, y_n)^\top$. Let $G_{OLS}$, $B_{OLS}$ and $V_{OLS}$ be $G$, $B$ and $V$ for the learned function obtained by the ordinary least-squares learning, respectively. Then the following proposition holds.
Footnote 1: In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples (input points without output values independently drawn from the distribution with density $p_t(x)$) are easily gathered. In such cases, a reasonably good estimate of $p_t(x)$ may be obtained by some standard density estimation method. Therefore, the assumption that $p_t(x)$ is known may not be so restrictive.
Proposition 1 ([2, 1, 3]) Suppose that the model is correctly specified, i.e., the learning target function $f(x)$ is expressed as $f(x) = \sum_{i=1}^p \alpha^*_i \varphi_i(x)$. Then $B_{OLS}$ and $V_{OLS}$ are expressed as $B_{OLS} = 0$ and $V_{OLS} = \sigma^2 J_{OLS}$, where $J_{OLS} = \mathrm{tr}(U L_{OLS} L_{OLS}^\top)$ and $U_{i,j} = \int \varphi_i(x) \varphi_j(x) p_t(x)\,dx$. Therefore, for the correctly specified model (1), the generalization error $G_{OLS}$ is expressed as $G_{OLS} = \sigma^2 J_{OLS}$. Based on this expression, the existing active learning method determines the location of training input points $\{x_i\}_{i=1}^n$ (or the training input density $p_x(x)$) so that $J_{OLS}$ is minimized [2, 1, 3].
3 Analysis of Existing Method under Misspecification of Models In this section, we investigate the validity of the existing active learning method for misspecified models.
Suppose the model does not exactly include the learning target function $f(x)$, but it approximately includes it, i.e., for a scalar $\delta$ such that $|\delta|$ is small, $f(x)$ is expressed as $f(x) = g(x) + \delta r(x)$, where $g(x)$ is the orthogonal projection of $f(x)$ onto the span of $\{\varphi_i(x)\}_{i=1}^p$ and the residual $r(x)$ is orthogonal to $\{\varphi_i(x)\}_{i=1}^p$: $g(x) = \sum_{i=1}^p \alpha^*_i \varphi_i(x)$ and $\int r(x) \varphi_i(x) p_t(x)\,dx = 0$ for $i = 1, 2, \ldots, p$. In this case, the bias term $B$ is expressed as $B = \int (E_\epsilon \hat{f}(x) - g(x))^2 p_t(x)\,dx + C$, where $C = \int (g(x) - f(x))^2 p_t(x)\,dx$. Since $C$ is constant which does not depend on the training input density $p_x(x)$, we subtract $C$ in the following discussion. Then we have the following lemma (footnote 2).
Lemma 2 For the approximately correct model (3), we have $B_{OLS} - C = \delta^2 \langle U L_{OLS} z_r, L_{OLS} z_r \rangle = O(\delta^2)$ and $V_{OLS} = \sigma^2 J_{OLS} = O_p(n^{-1})$, where $z_r = (r(x_1), r(x_2), \ldots, r(x_n))^\top$.
Footnote 2: Proofs of lemmas are provided in an extended version [6].
Note that the asymptotic order in Eq.(1) is in probability since $V_{OLS}$ is a random variable that includes $\{x_i\}_{i=1}^n$. The above lemma implies that $G_{OLS} - C = \sigma^2 J_{OLS} + o_p(n^{-1})$ if $\delta = o_p(n^{-1/2})$. Therefore, the existing active learning method of minimizing $J_{OLS}$ is still justified if $\delta = o_p(n^{-1/2})$. However, when $\delta \neq o_p(n^{-1/2})$, the existing method may not work well because the bias term $B_{OLS} - C$ is not smaller than the variance term $V_{OLS}$, so it can not be neglected.
4 New Active Learning Method In this section, we propose a new active learning method based on the weighted least-squares learning.
4.1 Weighted Least-Squares Learning When the model is correctly specified, $\hat{\alpha}_{OLS}$ is an unbiased estimator of $\alpha^*$. However, for misspecified models, $\hat{\alpha}_{OLS}$ is generally biased even asymptotically if $\delta = O_p(1)$. The bias of $\hat{\alpha}_{OLS}$ is actually caused by the covariate shift [5]: the training input density $p_x(x)$ is different from the test input density $p_t(x)$. For correctly specified models, influence of the covariate shift can be ignored, as the existing active learning method does. However, for misspecified models, we should explicitly cope with the covariate shift.
Under the covariate shift, it is known that the following weighted least-squares learning is asymptotically unbiased even if $\delta = O_p(1)$ [5]: $\hat{\alpha}_{WLS} = \arg\min_\alpha \sum_{i=1}^n \frac{p_t(x_i)}{p_x(x_i)} (\hat{f}(x_i) - y_i)^2$. Asymptotic unbiasedness of $\hat{\alpha}_{WLS}$ would be intuitively understood by the following identity, which is similar in spirit to importance sampling: $\int (\hat{f}(x) - f(x))^2 p_t(x)\,dx = \int (\hat{f}(x) - f(x))^2 \frac{p_t(x)}{p_x(x)} p_x(x)\,dx$. In the following, we assume that $p_x(x)$ is strictly positive for all $x$. Let $D$ be the diagonal matrix with the $i$-th diagonal element $D_{i,i} = p_t(x_i)/p_x(x_i)$. Then it can be confirmed that $\hat{\alpha}_{WLS}$ is given by $\hat{\alpha}_{WLS} = L_{WLS}\, y$, where $L_{WLS} = (X^\top D X)^{-1} X^\top D$.
4.2 Active Learning Based on Weighted Least-Squares Learning Let $G_{WLS}$, $B_{WLS}$ and $V_{WLS}$ be $G$, $B$ and $V$ for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma.
Lemma 3 For the approximately correct model (3), we have $B_{WLS} - C = \delta^2 \langle U L_{WLS} z_r, L_{WLS} z_r \rangle = O_p(\delta^2 n^{-1})$ and $V_{WLS} = \sigma^2 J_{WLS} = O_p(n^{-1})$, where $J_{WLS} = \mathrm{tr}(U L_{WLS} L_{WLS}^\top)$.
This lemma implies that $G_{WLS} - C = \sigma^2 J_{WLS} + o_p(n^{-1})$ if $\delta = o_p(1)$. Based on this expression, we propose determining the training input density $p_x(x)$ so that $J_{WLS}$ is minimized. The use of the proposed criterion $J_{WLS}$ can be theoretically justified when $\delta = o_p(1)$, while the existing criterion $J_{OLS}$ requires $\delta = o_p(n^{-1/2})$. Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section.
5 Numerical Examples We evaluate the usefulness of the proposed active learning method through experiments.
Toy Data Set: We first illustrate how the proposed method works under a controlled setting. Let $d = 1$ and the learning target function $f(x)$ be $f(x) = 1 - x + x^2 + \delta x^3$. Let $n = 100$ and $\{\epsilon_i\}_{i=1}^{100}$ be i.i.d. Gaussian noise with mean zero and standard deviation 0.3. Let $p_t(x)$ be the Gaussian density with mean 0.2 and standard deviation 0.4, which is assumed to be known here. Let $p = 3$ and the basis functions be $\varphi_i(x) = x^{i-1}$ for $i = 1, 2, 3$.
Let us consider the following three cases: $\delta = 0, 0.04, 0.5$, where each case corresponds to “correctly specified”, “approximately correct”, and “misspecified” (see Figure 1). We choose the training input density $p_x(x)$ from the Gaussian density with mean 0.2 and standard deviation $0.4c$, where $c = 0.8, 0.9, 1.0, \ldots, 2.5$. We compare the accuracy of the following three methods:
(A) Proposed active learning criterion + WLS learning: The training input density is determined so that $J_{WLS}$ is minimized. Following the determined input density, training input points $\{x_i\}_{i=1}^{100}$ are created and corresponding output values $\{y_i\}_{i=1}^{100}$ are observed. Then WLS learning is used for estimating the parameters.
(B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that $J_{OLS}$ is minimized. OLS learning is used for estimating the parameters.
(C) Passive learning + OLS learning: The test input density $p_t(x)$ is used as the training input density. OLS learning is used for estimating the parameters.
First, we evaluate the accuracy of $J_{WLS}$ and $J_{OLS}$ as approximations of $G_{WLS}$ and $G_{OLS}$. The means and standard deviations of $G_{WLS}$, $J_{WLS}$, $G_{OLS}$, and $J_{OLS}$ over 100 runs are depicted as functions of $c$ in Figure 2. These graphs show that when $\delta = 0$ (“correctly specified”), both $J_{WLS}$ and $J_{OLS}$ give accurate estimates of $G_{WLS}$ and $G_{OLS}$. When $\delta = 0.04$ (“approximately correct”), $J_{WLS}$ again works well, while $J_{OLS}$ tends to be negatively biased for large $c$. This result is surprising since as illustrated in Figure 1, the learning target functions with $\delta = 0$ and $\delta = 0.04$ are visually quite similar. Therefore, it intuitively seems that the result of $\delta = 0.04$ is not much different from that of $\delta = 0$. However, the simulation result shows that this slight difference makes $J_{OLS}$ unreliable. When $\delta = 0.5$ (“misspecified”), $J_{WLS}$ is still reasonably accurate, while $J_{OLS}$ is heavily biased.
These results show that as an approximation of the generalization error, $J_{WLS}$ is more robust against the misspecification of models than $J_{OLS}$, which is in good agreement with the theoretical analyses given in Section 3 and Section 4.

Figure 1: Learning target function and input density functions. (Top panel: learning target function $f(x)$ for $\delta = 0$, $\delta = 0.04$, and $\delta = 0.5$. Bottom panel: input density functions $p_t(x)$ and $p_x(x)$.)

Table 1: The means and standard deviations of the generalization error for Toy data set. The best method and comparable ones by the t-test at the significance level […] are described with boldface. The value of method (B) for $\delta = 0.5$ is extremely large but it is not a typo. All values in the table are multiplied by $10^3$. (Columns: $\delta = 0$ “correctly specified”, $\delta = 0.04$ “approximately correct”, $\delta = 0.5$ “misspecified”; rows: methods (A), (B), (C); table entries […].)

Figure 2: The means and error bars of $G_{WLS}$, $J_{WLS}$, $G_{OLS}$, and $J_{OLS}$ over 100 runs as functions of $c$. (Columns: “correctly specified”, “approximately correct”, “misspecified”; rows: G-WLS, J-WLS, G-OLS, J-OLS; horizontal axis: $c$.)

In Table 1, the mean and standard deviation of the generalization error obtained by each method is described. When $\delta = 0$, the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize $G_{WLS}$ and $G_{OLS}$ were found by $J_{WLS}$ and $J_{OLS}$.
Therefore, the difference of the errors is caused by the difference of WLS and OLS: WLS generally has larger variance than OLS. Since bias is zero for both WLS and OLS if $\delta = 0$, OLS would be more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B), it still works better than the passive learning scheme (C). When $\delta = 0.04$ and $\delta = 0.5$, the proposed method (A) gives significantly smaller errors than other methods.

Overall, we found that for all three cases, the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly specified case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspecification of models and thus it is reliable.

Table 2: The means and standard deviations of the test error for DELVE data sets. All values in the table are multiplied by $10^3$. (Columns: Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, Kin-8nh; rows: methods (A), (B), (C); table entries […].)

Figure 3: Mean relative performance of (A) and (B) compared with (C). For each run, the test errors of (A) and (B) are normalized by the test error of (C), and then the values are averaged over 100 runs. Note that the error bars were reasonably small so they were omitted.

Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes 8192 samples, consisting of 8-dimensional input and 1-dimensional output values. For convenience, every attribute is normalized into $[0, 1]$.

Suppose we are given all 8192 input points (i.e., unlabeled samples). Note that output values are unknown. From the pool of unlabeled samples, we choose $n = 1000$ input points $\{x_i\}_{i=1}^{1000}$ for training and observe the corresponding output values $\{y_i\}_{i=1}^{1000}$. The task is to predict the output values of all unlabeled samples.

In this experiment, the test input density $p_t(x)$ is unknown. So we estimate it using the independent Gaussian density

$$\hat{p}_t(x) = (2\pi \hat{\gamma}_{ML}^2)^{-8/2} \exp\left( -\frac{\|x - \hat{\mu}_{ML}\|^2}{2 \hat{\gamma}_{ML}^2} \right),$$

where $\hat{\mu}_{ML}$ and $\hat{\gamma}_{ML}$ are the maximum likelihood estimates of the mean and standard deviation obtained from all unlabeled samples. Let $p$ = […] and the basis functions be

$$\varphi_i(x) = \exp\left( -\frac{\|x - t_i\|^2}{2} \right) \quad \text{for } i = 1, \ldots, p,$$

where $\{t_i\}_{i=1}^{p}$ are template points randomly chosen from the pool of unlabeled samples.

We select the training input density $p_x(x)$ from the independent Gaussian density with mean $\hat{\mu}_{ML}$ and standard deviation $c\,\hat{\gamma}_{ML}$, where $c$ = […].

In this simulation, we can not create the training input points in an arbitrary location because we only have 8192 samples. Therefore, we first create temporary input points following the determined training input density, and then choose the input points from the pool of unlabeled samples that are closest to the temporary input points. For each data set, we repeat this simulation 100 times, by changing the template points $\{t_i\}_{i=1}^{p}$ in each run.

The means and standard deviations of the test error over 100 runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in these three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B).

Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C).
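The pool-based selection step described above (draw temporary points from the chosen training density, then take the closest pool samples) might look like the following sketch. The paper does not say whether a pool point may be selected more than once; the version here assumes selection without replacement, and the function names and the greedy matching order are illustrative assumptions rather than the author's procedure.

```python
import math
import random


def draw_temporary_points(mean, std, count, rng):
    """Draw points from an independent Gaussian with per-dimension mean and a shared std."""
    return [tuple(rng.gauss(m, std) for m in mean) for _ in range(count)]


def nearest_pool_indices(pool, targets):
    """Greedily match each temporary point to the closest unused pool point.

    Assumption: sampling without replacement, so every chosen index is distinct.
    """
    available = set(range(len(pool)))
    chosen = []
    for target in targets:
        best = min(available, key=lambda i: math.dist(pool[i], target))
        chosen.append(best)
        available.remove(best)
    return chosen


# Hypothetical usage: a tiny two-dimensional pool and three temporary points.
example_pool = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
example_targets = [(0.9, 1.2), (1.1, 0.8), (4.0, 4.0)]
example_choice = nearest_pool_indices(example_pool, example_targets)
```

A greedy match is order dependent; an optimal assignment would treat all temporary points symmetrically, at higher cost. For a pool much larger than the number of requested points the two tend to differ little.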
This shows that (A) outperforms (C) for all eight data sets, while (B) is comparable to or outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than the other schemes.

6 Conclusions

We argued that active learning is essentially a situation under the covariate shift: the training input density is different from the test input density. When the model used for learning is correctly specified, the covariate shift does not matter. However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on weighted least-squares learning.

The numerical study showed that the existing method works better than the proposed method if the model is correctly specified. However, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well and it consistently outperformed the passive learning scheme. Therefore, the proposed method would be robust against the misspecification of models and thus it is reliable.

The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction would therefore be to devise an active learning method which has a theoretical guarantee with totally misspecified models. It is also important to notice that when the model is totally misspecified, even learning with optimal training input points would not be successful anyway. In such cases, it is of course important to carry out model selection. In active learning research, including the present paper, however, the locations of training input points are designed for a single model at hand. That is, the model should have been chosen before performing active learning. Devising a method for simultaneously optimizing models and the locations of training input points would be a more important and promising future direction.

Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support.

References

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129–145, 1996.
[2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17–26, 2000.
[4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996.
[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.

4 0.148037 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efﬁcient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassiﬁcation rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α conﬁdence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

5 0.096195348 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer

Abstract: The Perceptron algorithm, despite its simplicity, often performs well on online classiﬁcation tasks. The Perceptron becomes especially effective when it is used in conjunction with kernels. However, a common difﬁculty encountered when implementing kernel-based online algorithms is the amount of memory required to store the online hypothesis, which may grow unboundedly. In this paper we present and analyze the Forgetron algorithm for kernel-based online learning on a ﬁxed memory budget. To our knowledge, this is the ﬁrst online learning algorithm which, on one hand, maintains a strict limit on the number of examples it stores while, on the other hand, entertains a relative mistake bound. In addition to the formal results, we also present experiments with real datasets which underscore the merits of our approach.

6 0.085327476 160 nips-2005-Query by Committee Made Real

7 0.078357965 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

8 0.073751174 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

9 0.073519096 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

10 0.072851777 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

11 0.072590694 23 nips-2005-An Application of Markov Random Fields to Range Sensing

12 0.069535844 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

13 0.069237672 112 nips-2005-Learning Minimum Volume Sets

14 0.068019092 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

15 0.063200407 47 nips-2005-Consistency of one-class SVM and related algorithms

16 0.060012549 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

17 0.05939078 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

18 0.059329424 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

19 0.058224056 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

20 0.057696678 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.218), (1, 0.041), (2, -0.015), (3, -0.021), (4, 0.078), (5, 0.122), (6, -0.096), (7, 0.036), (8, 0.033), (9, 0.118), (10, -0.036), (11, 0.188), (12, 0.222), (13, -0.002), (14, -0.034), (15, -0.07), (16, -0.016), (17, 0.021), (18, -0.004), (19, -0.054), (20, 0.123), (21, 0.016), (22, -0.013), (23, 0.031), (24, -0.218), (25, -0.004), (26, 0.143), (27, -0.201), (28, -0.047), (29, -0.143), (30, 0.027), (31, 0.1), (32, 0.098), (33, 0.104), (34, -0.016), (35, 0.008), (36, 0.056), (37, 0.052), (38, 0.02), (39, -0.019), (40, 0.027), (41, -0.055), (42, -0.001), (43, 0.025), (44, 0.025), (45, 0.01), (46, 0.018), (47, -0.09), (48, -0.007), (49, -0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96822965 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

2 0.78385973 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

3 0.71504974 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

4 0.70606416 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

5 0.57768506 160 nips-2005-Query by Committee Made Real

Author: Ran Gilad-bachrach, Amir Navot, Naftali Tishby

Abstract: Training a learning algorithm is a costly task. A major goal of active learning is to reduce this cost. In this paper we introduce a new algorithm, KQBC, which is capable of actively learning large scale problems by using selective sampling. The algorithm overcomes the costly sampling step of the well known Query By Committee (QBC) algorithm by projecting onto a low dimensional space. KQBC also enables the use of kernels, providing a simple way of extending QBC to the non-linear scenario. Sampling the low dimensional space is done using the hit and run random walk. We demonstrate the success of this novel algorithm by applying it to both artificial and real-world problems.

6 0.47678727 112 nips-2005-Learning Minimum Volume Sets

7 0.45041263 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

8 0.43324614 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

9 0.42571971 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

10 0.42449978 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

11 0.38607961 182 nips-2005-Statistical Convergence of Kernel CCA

12 0.38377893 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

13 0.36097074 133 nips-2005-Nested sampling for Potts models

14 0.35351467 205 nips-2005-Worst-Case Bounds for Gaussian Process Models

15 0.34435928 17 nips-2005-Active Bidirectional Coupling in a Cochlear Chip

16 0.33479118 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

17 0.3302682 126 nips-2005-Metric Learning by Collapsing Classes

18 0.32434478 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

19 0.31040671 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

20 0.31021908 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.081), (10, 0.058), (11, 0.015), (27, 0.034), (31, 0.049), (34, 0.076), (37, 0.207), (39, 0.02), (41, 0.011), (50, 0.01), (55, 0.057), (69, 0.051), (73, 0.039), (88, 0.124), (91, 0.07)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.83165324 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

2 0.69199926 112 nips-2005-Learning Minimum Volume Sets

Author: Clayton Scott, Robert Nowak

Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.

3 0.68503809 30 nips-2005-Assessing Approximations for Gaussian Process Classification

Author: Malte Kuss, Carl E. Rasmussen

Abstract: Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace’s method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

In recent years models based on Gaussian process (GP) priors have attracted much attention in the machine learning community. Whereas inference in the GP regression model with Gaussian noise can be done analytically, probabilistic classification using GPs is analytically intractable. Several approaches to approximate Bayesian inference have been suggested, including Laplace’s approximation, Expectation Propagation (EP), variational approximations and Markov chain Monte Carlo (MCMC) sampling, some of these in conjunction with generalisation bounds, online learning schemes and sparse approximations. Despite the abundance of recent work on probabilistic GP classifiers, most experimental studies provide only anecdotal evidence, and no clear picture has yet emerged as to when and why which algorithm should be preferred. Thus, from a practitioner’s point of view probabilistic GP classification remains a jungle. In this paper, we set out to understand and compare two of the most widespread approximations: Laplace’s method and Expectation Propagation (EP). We also compare to a sophisticated, but computationally demanding MCMC scheme to examine how close the approximations are to ground truth.

We examine two aspects of the approximation schemes. Firstly, we examine the accuracy of approximations to the marginal likelihood, which is of central importance for model selection and model comparison. In any practical application of GPs in classification, (usually multiple) parameters of the covariance function (hyperparameters) have to be handled.
Bayesian model selection provides a consistent framework for setting such parameters. Therefore, it is essential to evaluate the accuracy of the marginal likelihood approximations as a function of the hyperparameters, in order to assess the practical usefulness of the approach. Secondly, we need to assess the quality of the approximate probabilistic predictions. In the past, the probabilistic nature of the GP predictions has not received much attention, the focus being mostly on classification error rates. This unfortunate state of affairs is caused primarily by typical benchmarking problems being considered outside of a realistic context. The ability of a classifier to produce class probabilities or confidences has obvious relevance in most areas of application, e.g. medical diagnosis. We evaluate the predictive distributions of the approximate methods, and compare to the MCMC gold standard.

1 The Gaussian Process Model for Binary Classification

Let $y \in \{-1, 1\}$ denote the class label of an input $x$. Gaussian process classification (GPC) is discriminative in modelling $p(y|x)$ for given $x$ by a Bernoulli distribution. The probability of success $p(y = 1|x)$ is related to an unconstrained latent function $f(x)$ which is mapped to the unit interval by a sigmoid transformation, e.g. the logit or the probit. For reasons of analytic convenience we exclusively use the probit model $p(y = 1|x) = \Phi(f(x))$, where $\Phi$ denotes the cumulative distribution function of the standard Normal distribution. In the GPC model Bayesian inference is performed about the latent function $f$ in the light of observed data $D = \{(y_i, x_i) \mid i = 1, \ldots, m\}$. Let $f_i = f(x_i)$ and $\mathbf{f} = [f_1, \ldots, f_m]$ be shorthand for the values of the latent function and $\mathbf{y} = [y_1, \ldots, y_m]$ and $X = [x_1, \ldots, x_m]$ collect the class labels and inputs respectively.
Given the latent function the class labels are independent Bernoulli variables, so the joint likelihood factorises:

$$p(\mathbf{y}|\mathbf{f}) = \prod_{i=1}^{m} p(y_i|f_i) = \prod_{i=1}^{m} \Phi(y_i f_i),$$

and depends on $f$ only through its value at the observed inputs. We use a zero-mean Gaussian process prior over the latent function $f$ with a covariance function $k(x, x'|\theta)$, which may depend on hyperparameters $\theta$ [1]. The functional form and parameters of the covariance function encode assumptions about the latent function, and adaptation of these is part of the inference. The posterior distribution over latent function values $\mathbf{f}$ at the observed $X$ for given hyperparameters $\theta$ becomes:

$$p(\mathbf{f}|D, \theta) = \frac{N(\mathbf{f}|0, K)}{p(D|\theta)} \prod_{i=1}^{m} \Phi(y_i f_i),$$

where $p(D|\theta) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X, \theta)\, d\mathbf{f}$ denotes the marginal likelihood. Unfortunately neither the marginal likelihood, nor the posterior itself, nor predictions can be computed analytically, so approximations are needed.

2 Approximate Bayesian Inference

For the GPC model approximations are either based on a Gaussian approximation to the posterior $p(\mathbf{f}|D, \theta) \approx q(\mathbf{f}|D, \theta) = N(\mathbf{f}|\mathbf{m}, A)$ or involve Markov chain Monte Carlo (MCMC) sampling [2]. We compare Laplace’s method and Expectation Propagation (EP), which are two alternative approaches to finding parameters $\mathbf{m}$ and $A$ of the Gaussian $q(\mathbf{f}|D, \theta)$. Both methods also allow approximate evaluation of the marginal likelihood, which is useful for ML-II hyperparameter optimisation. Laplace’s approximation (LA) is found by making a second order Taylor approximation of the (un-normalised) log posterior [3]. The mean $\mathbf{m}$ is placed at the mode (MAP) and the covariance $A$ equals the negative inverse Hessian of the log posterior density at $\mathbf{m}$. The EP approximation [4] also gives a Gaussian approximation to the posterior. The parameters $\mathbf{m}$ and $A$ are found in an iterative scheme by matching the approximate marginal moments of $p(f_i|D, \theta)$ by the marginals of the approximation $N(f_i|m_i, A_{ii})$.
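The factorised probit likelihood at the start of this passage is simple to evaluate directly. The sketch below is only an illustration of that formula with labels in {-1, +1}; the function names are hypothetical, and writing the Normal CDF through `math.erfc` is a choice made here for tail accuracy, not something the paper prescribes.

```python
import math


def probit(z):
    """Standard Normal CDF, Phi(z), expressed with erfc for accuracy when z is negative."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def log_likelihood(labels, latents):
    """log p(y | f) = sum_i log Phi(y_i * f_i), with each y_i in {-1, +1}.

    Note: for extremely negative y_i * f_i, Phi underflows to zero and
    math.log raises an error; a production version would need a log-CDF routine.
    """
    return sum(math.log(probit(y * f)) for y, f in zip(labels, latents))
```

For example, with all latent values at zero every label has probability one half, so three observations give a log likelihood of 3 log(0.5), regardless of the labels.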
Although we cannot prove the convergence of EP, we conjecture that it always converges for GPC with probit likelihood, and have never encountered an exception. A key insight is that a Gaussian approximation to the GPC posterior is equivalent to a GP approximation to the posterior distribution over latent functions.

Figure 1: Panel (a) provides a one-dimensional illustration of the approximations. The prior $N(f|0, 5^2)$ combined with the probit likelihood ($y = 1$) results in a skewed posterior. The likelihood uses the right axis, all other curves use the left axis. Laplace’s approximation peaks at the posterior mode, but places far too much mass over negative values of $f$ and too little at large positive values. The EP approximation matches the first two posterior moments, which results in a larger mean and a more accurate placement of probability mass compared to Laplace’s approximation. In Panel (b) we caricature a high dimensional zero-mean Gaussian prior as an ellipse. The gray shadow indicates that for a high dimensional Gaussian most of the mass lies in a thin shell. For large latent signals (large entries in $K$), the likelihood essentially cuts off regions which are incompatible with the training labels (hatched area), leaving the upper right orthant as the posterior. The dot represents the mode of the posterior, which remains close to the origin. (Panel (a) curves: likelihood $p(y|f)$, prior $p(f)$, posterior $p(f|y)$, Laplace $q(f|y)$, EP $q(f|y)$; horizontal axis: $f$. Panel (b) axes: $f_i$, $f_j$.)

For a test input $x_*$ the approximate predictive latent and class probabilities are:

$$q(f_*|D, \theta, x_*) = N(\mu_*, \sigma_*^2), \quad \text{and} \quad q(y_* = 1|D, x_*) = \Phi\left( \mu_* / \sqrt{1 + \sigma_*^2} \right),$$

where $\mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{m}$ and $\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top (K^{-1} - K^{-1} A K^{-1}) \mathbf{k}_*$, where the vector $\mathbf{k}_* = [k(x_1, x_*), \ldots, k(x_m, x_*)]$ collects covariances between $x_*$ and training inputs $X$.
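The class probability formula above, which averages the probit over a Gaussian latent prediction, can be checked numerically with a few lines. This sketch assumes the predictive mean and variance have already been computed by some approximation; `predictive_class_probability` is a hypothetical helper name.

```python
import math


def probit(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def predictive_class_probability(mean, variance):
    """q(y* = 1) = Phi(mean / sqrt(1 + variance)) for a Gaussian latent prediction."""
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    return probit(mean / math.sqrt(1.0 + variance))
```

The formula makes one effect easy to see: for a fixed positive mean, a larger latent variance pulls the class probability towards one half, while a smaller mean with the same variance does the same. Either kind of error in the approximate posterior therefore shows up as a change in predictive confidence.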
MCMC sampling has the advantage that it becomes exact in the limit of long runs and so provides a gold standard by which to measure the two analytic methods described above. Although MCMC methods can in principle be used to do inference over $\mathbf{f}$ and $\theta$ jointly [5], we compare to methods using ML-II optimisation over $\theta$, thus we use MCMC to integrate over $\mathbf{f}$ only. Good marginal likelihood estimates are notoriously difficult to obtain; in our experiments we use Annealed Importance Sampling (AIS) [6], combining several Thermodynamic Integration runs into a single (unbiased) estimate of the marginal likelihood. Both analytic approximations have a computational complexity which is cubic, $O(m^3)$, as is common among non-sparse GP models due to inversions of $m \times m$ matrices. In our implementations LA and EP need similar running times, on the order of a few minutes for several hundred data-points. Making AIS work efficiently requires some fine-tuning and a single estimate of $p(D|\theta)$ can take several hours for data sets of a few hundred examples, but this could conceivably be improved upon.

3 Structural Properties of the Posterior and its Approximations

Structural properties of the posterior can best be understood by examining its construction. The prior is a correlated $m$-dimensional Gaussian $N(\mathbf{f}|0, K)$ centred at the origin. Each likelihood term $p(y_i|f_i)$ softly truncates the half-space from the prior that is incompatible with the observed label, see Figure 1. The resulting posterior is unimodal and skewed, similar to a multivariate Gaussian truncated to the orthant containing $\mathbf{y}$. The mode of the posterior remains close to the origin, while the mass is placed in accordance with the observed class labels. Additionally, high dimensional Gaussian distributions exhibit the property that most probability mass is contained in a thin ellipsoidal shell, depending on the covariance structure, away from the mean [7, ch. 29.2].
Intuitively this occurs since in high dimensions the volume grows extremely rapidly with the radius. As an effect the mode becomes less representative (typical) for the prior distribution as the dimension increases. For the GPC posterior this property persists: the mode of the posterior distribution stays relatively close to the origin, still being unrepresentative for the posterior distribution, while the mean moves to the mass of the posterior making mean and mode differ significantly. We cannot generally assume the posterior to be close to Gaussian, as in the often studied limit of low-dimensional parametric models with large amounts of data. Therefore in GPC we must be aware of making a Gaussian approximation to a non-Gaussian posterior. From the properties of the posterior it can be expected that Laplace’s method places $\mathbf{m}$ in the right orthant but too close to the origin, such that the approximation will overlap with regions having practically zero posterior mass. As an effect the amplitude of the approximate latent posterior GP will be underestimated systematically, leading to overly cautious predictive distributions. The EP approximation does not rely on a local expansion, but assumes that the marginal distributions can be well approximated by Gaussians. This assumption will be examined empirically below.

4 Experiments

In this section we compare and inspect approximations for GPC using various benchmark data sets. The primary focus is not to optimise the absolute performance of GPC models but to compare the relative accuracy of approximations and to validate the arguments given in the previous section. In all experiments we use a covariance function of the form:

$$k(x, x'|\theta) = \sigma^2 \exp\left( -\tfrac{1}{2} \|x - x'\|^2 / \ell^2 \right), \qquad (1)$$

such that $\theta = [\sigma, \ell]$. We refer to $\sigma^2$ as the signal variance and to $\ell$ as the characteristic length-scale.
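As a concrete reading of the covariance function (1), the sketch below assumes the form sigma^2 * exp(-0.5 * ||x - x'||^2 / l^2), with a single length-scale shared by all input dimensions. The function names are hypothetical and the code only builds the matrix K for a small list of inputs.

```python
import math


def sq_exp_covariance(x, x_prime, sigma, length_scale):
    """k(x, x') = sigma^2 * exp(-0.5 * ||x - x'||^2 / length_scale^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return sigma ** 2 * math.exp(-0.5 * sq_dist / length_scale ** 2)


def covariance_matrix(inputs, sigma, length_scale):
    """Matrix K of pairwise covariances between all inputs."""
    return [[sq_exp_covariance(a, b, sigma, length_scale) for b in inputs]
            for a in inputs]


# Hypothetical usage: two one-dimensional inputs one unit apart.
example_k = covariance_matrix([(0.0,), (1.0,)], sigma=2.0, length_scale=1.0)
```

The diagonal of K equals the signal variance, so large sigma corresponds to the "large latent signal" regime in which the likelihood truncates the prior most sharply.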
Note that for many classification tasks it may be reasonable to use an individual length scale parameter for every input dimension (ARD) or a different kind of covariance function. Nevertheless, for the sake of presentability we use the above covariance function and we believe the conclusions about the accuracy of approximations to be independent of this choice, since it relies on arguments which are independent of the form of the covariance function. As measure of the accuracy of predictive probabilities we use the average information in bits of the predictions about the test targets in excess of that of random guessing. Let $p_* = p(y_* = 1|D, \theta, x_*)$ be the model’s prediction, then we average:

$$I(p_*^i, y_i) = \frac{y_i + 1}{2} \log_2(p_*^i) + \frac{1 - y_i}{2} \log_2(1 - p_*^i) + H \qquad (2)$$

over all test cases, where $H$ is the entropy of the training labels. The error rate $E$ is equal to the percentage of erroneous class assignments if prediction is understood as a decision problem with symmetric costs. For the first set of experiments presented here the well-known USPS digits and the Ionosphere data set were used. A binary sub-problem from the USPS digits is defined by only considering 3’s vs. 5’s (which is probably the hardest of the binary sub-problems) and dividing the data into 767 cases for training and 773 for testing. The Ionosphere data is split into 200 training and 151 test cases. We do an exhaustive investigation on a fine regular grid of values for the log hyperparameters. For each $\theta$ on the grid we compute the approximated log marginal likelihood by LA, EP and AIS. Additionally we compute the respective predictive performance (2) on the test set. Results are shown in Figure 2.
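The information measure (2) can be computed as in the following sketch. It assumes labels in {-1, +1}, takes H to be the binary entropy (in bits) of the training label frequencies, and averages over test cases; the function names are hypothetical. The two terms of (2) are handled by branching on the label so that a confident and correct prediction of exactly 0 or 1 does not trigger a logarithm of zero.

```python
import math


def label_entropy_bits(train_labels):
    """Binary entropy, in bits, of the empirical class frequencies."""
    frac_pos = sum(1 for y in train_labels if y == 1) / len(train_labels)
    if frac_pos in (0.0, 1.0):
        return 0.0
    return -(frac_pos * math.log2(frac_pos)
             + (1.0 - frac_pos) * math.log2(1.0 - frac_pos))


def mean_information_bits(probs, test_labels, train_labels):
    """Average of log2-probability of the true label plus H over the test cases.

    probs[i] is the predicted probability that test_labels[i] equals +1.
    A confident wrong prediction (probability zero for the true label)
    raises ValueError, reflecting an unbounded penalty.
    """
    entropy = label_entropy_bits(train_labels)
    total = 0.0
    for p, y in zip(probs, test_labels):
        prob_of_truth = p if y == 1 else 1.0 - p
        total += math.log2(prob_of_truth) + entropy
    return total / len(test_labels)
```

With balanced training labels, always predicting one half scores zero bits and perfect predictions score one bit, which matches the description of the measure as information in excess of random guessing.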
Log marginal likelihood −150 −130 −200 Log marginal likelihood 5 −115 −105 −95 4 −115 −105 3 −130 −100 −150 2 1 log magnitude, log(σf) log magnitude, log(σf) 4 Log marginal likelihood 5 −160 4 −100 3 −130 −92 −160 2 −105 −160 −105 −200 −115 1 log magnitude, log(σf) 5 −92 −95 3 −100 −105 2−200 −115 −160 −130 −200 1 −200 0 0 0 −200 3 4 log lengthscale, log(l) 5 2 3 4 log lengthscale, log(l) (1a) 4 0.84 4 0.8 0.8 0.25 3 0.8 0.84 2 0.7 0.7 1 0.5 log magnitude, log(σf) 0.86 5 0.86 0.8 0.89 0.88 0.7 1 0.5 3 4 log lengthscale, log(l) 2 3 4 log lengthscale, log(l) (2a) Log marginal likelihood −90 −70 −100 −120 −120 0 −70 −75 −120 1 −100 1 2 3 log lengthscale, log(l) 4 0 −70 −90 −65 2 −100 −100 1 −120 −80 1 2 3 log lengthscale, log(l) 4 −1 −1 5 5 f 0.1 0.2 0.55 0 1 0.4 1 2 3 log lengthscale, log(l) 5 0.5 0.1 0 0.3 0.4 0.6 0.55 0.3 0.2 0.2 0.1 1 0 0.2 4 5 −1 −1 0.4 0.2 0.6 2 0.3 10 0 0.1 0.2 0.1 0 0 0.5 1 2 3 log lengthscale, log(l) 0.5 0.5 0.55 3 0 0.1 0 1 2 3 log lengthscale, log(l) 0.5 0.3 0.5 4 2 5 (3c) 0.5 3 4 Information about test targets in bits 4 log magnitude, log(σf) 4 2 0 (3b) Information about test targets in bits 0.3 log magnitude, log(σ ) −75 0 −1 −1 5 5 0 −120 3 −120 (3a) −1 −1 −90 −80 −65 −100 2 Information about test targets in bits 0 −75 4 0 3 5 Log marginal likelihood −90 3 −100 0 0.25 3 4 log lengthscale, log(l) 5 log magnitude, log(σf) log magnitude, log(σf) f log magnitude, log(σ ) −80 3 0.5 (2c) −75 −90 0.7 0.8 2 4 −75 −1 −1 0.86 0.84 Log marginal likelihood 4 1 0.7 1 5 5 −150 2 (2b) 5 2 0.88 3 0 5 0.84 0.89 0.25 0 0.7 0.25 0 0.86 4 0.84 3 2 5 Information about test targets in bits log magnitude, log(σf) log magnitude, log(σf) 5 −200 3 4 log lengthscale, log(l) (1c) Information about test targets in bits 5 2 2 (1b) Information about test targets in bits 0.5 5 log magnitude, log(σf) 2 4 5 −1 −1 0 1 2 3 log lengthscale, log(l) 4 5 (4a) (4b) (4c) Figure 2: Comparison of marginal likelihood approximations and predictive performances of different 
approximation techniques for USPS 3s vs. 5s (upper half) and the Ionosphere data (lower half). The columns correspond to LA (a), EP (b), and MCMC (c). The rows show estimates of the log marginal likelihood (rows 1 & 3) and the corresponding predictive performance (2) on the test set (rows 2 & 4) respectively.

[Figure 3: panels (a) and (b) plot histograms of MCMC samples of f together with the Laplace and EP approximations of p(f|D); panel (c) plots a histogram of samples of xi against p(xi).]
Figure 3: Panels (a) and (b) show two marginal distributions p(fi|D, θ) from a GPC posterior and its approximations. The true posterior is approximated by a normalised histogram of 9000 samples of fi obtained by MCMC sampling. Panel (c) shows a histogram of samples of a marginal distribution of a truncated high-dimensional Gaussian. The line describes a Gaussian with mean and variance estimated from the samples.

For all three approximation techniques we see an agreement between marginal likelihood estimates and test performance, which justifies the use of ML-II parameter estimation. But the shape of the contours and the values differ between the methods. The contours for Laplace’s method appear to be slanted compared to EP. The marginal likelihood estimates of EP and AIS agree surprisingly well (see footnote 1), given that the marginal likelihood comes as a 767-dimensional (USPS) or 200-dimensional (Ionosphere) integral. The EP predictions contain as much information about the test cases as the MCMC predictions and significantly more than those of LA. Note that for small signal variances (roughly ln(σ²) < 1) LA and EP give very similar results. A possible explanation is that for small signal variances the likelihood does not truncate the prior but only down-weights the tail that disagrees with the observation. As a result the posterior will be less skewed and both approximations will lead to similar results.

For the USPS 3’s vs. 
5’s we now inspect the marginal distributions p(fi|D, θ) of single latent function values under the posterior approximations for a given value of θ. We have chosen the values ln(σ) = 3.35 and ln(l) = 2.85, which lie between the ML-II estimates of EP and LA. Hybrid MCMC was used to generate 9000 samples from the posterior p(f|D, θ). For LA and EP the approximate marginals are q(fi|D, θ) = N(fi|mi, Aii), where m and A are found by the respective approximation techniques. In general we observe that the marginal distributions of MCMC samples agree very well with the respective marginal distributions of the EP approximation. For Laplace’s approximation we find the mean to be underestimated and the marginal distributions to overlap with zero far more than the EP approximations. Figure (3a) displays the marginal distribution and its approximations for which the MCMC samples show maximal skewness. Figure (3b) shows a typical example where the EP approximation agrees very well with the MCMC samples. We show this particular example because under the EP approximation p(yi = 1|D, θ) < 0.1%, whereas LA gives an incorrect p(yi = 1|D, θ) ≈ 18%.

In the experiment we saw that the marginal distributions of the posterior often agree very well with a Gaussian approximation. This seems to contradict the description given in the previous section, where we argued that the posterior is skewed by construction. In order to inspect the marginals of a truncated high-dimensional multivariate Gaussian distribution we made an additional synthetic experiment. We constructed a 767-dimensional Gaussian N(x|0, C) whose covariance matrix has one eigenvalue of 100 with eigenvector 1 and all other eigenvalues equal to 1. We then truncate this distribution such that all xi ≥ 0.

Footnote 1: Note that the agreement between the two seems to be limited by the accuracy of the MCMC runs, as judged by the regularity of the contour lines; the tolerance is less than one unit on a (natural) log scale.
Note that the mode of the truncated Gaussian is still at zero, whereas the mean moves towards the remaining mass. Figure (3c) shows a normalised histogram of samples from a marginal distribution of one xi. The samples agree very well with a Gaussian approximation. In the previous section we described the somewhat surprising property that, for a truncated high-dimensional Gaussian resembling the posterior, the mode (used by LA) may not be particularly representative of the distribution. Although the marginal is also truncated, it is still exceptionally well modelled by a Gaussian; however, the Laplace approximation centred on the origin would be completely inappropriate.

In a second set of experiments we compare the predictive performance of LA and EP for GPC on several well-known benchmark problems. Each data set is randomly split into 10 folds, of which one at a time is left out as a test set to measure the predictive performance of a model trained (or selected) on the remaining nine folds. All performance measures are averages over the 10 folds. For GPC we implement model selection by ML-II hyperparameter estimation, reporting results given the θ that maximised the respective approximate marginal likelihood p(D|θ). In order to get a better picture of the absolute performance we also compare to results obtained by C-SVM classification. The kernel we used is equivalent to the covariance function (1) without the signal variance parameter. For each fold the parameters C and l (the kernel length scale) are found in an inner loop of 5-fold cross-validation, in which the parameter grids are refined until the performance stabilises. Predictive probabilities for test cases are obtained by mapping the unthresholded output of the SVM to [0, 1] using a sigmoid function [8].

Results are summarised in Table 1. Comparing Laplace’s method to EP, the latter proves more accurate both in terms of error rate and information.
While the error rates are relatively similar, the predictive distribution obtained by EP proves more informative about the test targets. Note that for GPC the error rate only depends on the sign of the mean µ∗ of the approximated posterior over latent functions and not on the entire posterior predictive distribution. As is to be expected, the length of the mean vector m shows much larger values for the EP approximations. Comparing EP and SVMs, the results are mixed. For the Crabs data set all methods show the same error rate, but the information content of the predictive distributions differs dramatically. For some test cases the SVM predicts the wrong class with large certainty.

5 Summary & Conclusions

Our experiments reveal serious differences between Laplace’s method and EP when used in GPC models. From the structural properties of the posterior we described why LA systematically underestimates the mean m. The resulting posterior GP over latent functions will have too small an amplitude, although the sign of the mean function will be mostly correct. As a result LA gives over-conservative predictive probabilities and diminished information about the test labels. This effect has been shown empirically on several real-world examples. Large resulting discrepancies in the actual posterior probabilities were found, even at the training locations, which renders the predictive class probabilities produced under this approximation grossly inaccurate. Note that the difference becomes less dramatic if we only consider the classification error rates obtained by thresholding p∗ at 1/2. For this particular task, we have seen that the sign of the latent function tends to be correct (at least at the training locations).
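The observation that thresholding at 1/2 hides the effect of an underestimated mean can be illustrated with a small sketch. It assumes a cumulative Gaussian (probit) likelihood, for which a Gaussian latent prediction N(µ∗, σ∗²) gives the closed-form class probability Φ(µ∗ / sqrt(1 + σ∗²)). The likelihood choice, the function name and all numbers below are assumptions made for illustration; they are not taken from the experiments.

```python
import math
from statistics import NormalDist


def probit_predictive(mu, var):
    """p(y* = 1) for a latent Gaussian N(mu, var) under a probit likelihood."""
    return NormalDist().cdf(mu / math.sqrt(1.0 + var))


# Hypothetical latent predictions at one test input: the same variance,
# but the second mean is shrunk towards zero, mimicking an approximation
# that underestimates the posterior mean.
latent_var = 4.0
mu_reference = 3.0
mu_shrunk = 0.5 * mu_reference

p_reference = probit_predictive(mu_reference, latent_var)
p_shrunk = probit_predictive(mu_shrunk, latent_var)

# Both probabilities lie on the same side of 1/2, so the class decision
# (and hence the error rate) is identical ...
same_decision = (p_reference > 0.5) == (p_shrunk > 0.5)

# ... but the shrunk mean yields a more conservative probability and
# therefore fewer bits about a positive test label (taking H = 1 bit).
bits_reference = math.log2(p_reference) + 1.0
bits_shrunk = math.log2(p_shrunk) + 1.0
```

Only the magnitude of the latent mean changes between the two cases, which is why the error rate alone cannot distinguish them while the information measure can.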
Data Set        m    n  |  Laplace               |  EP                      |  SVM
                        |  E%     I      ‖m‖     |  E%     I      ‖m‖       |  E%     I
Ionosphere    351   34  |   8.84  0.591   49.96  |   7.99  0.661    124.94  |   5.69  0.681
Wisconsin     683    9  |   3.21  0.804   62.62  |   3.21  0.805     84.95  |   3.21  0.795
Pima Indians  768    8  |  22.77  0.252   29.05  |  22.63  0.253     47.49  |  23.01  0.232
Crabs         200    7  |   2.0   0.682  112.34  |   2.0   0.908   2552.97  |   2.0   0.047
Sonar         208   60  |  15.36  0.439   26.86  |  13.85  0.537  15678.55  |  11.14  0.567
USPS 3 vs 5  1540  256  |   2.27  0.849  163.05  |   2.21  0.902  22011.70  |   2.01  0.918

Table 1: Results for benchmark data sets. The first three columns give the name of the data set, the number of observations m and the dimension of inputs n. For Laplace’s method and EP the table reports the average error rate E%, the average information I (2) and the average length ‖m‖ of the mean vector of the Gaussian approximation. For SVMs the error rate and the average information about the test targets are reported. Note that for the Crabs data set we use the sex (not the colour) of the crabs as class label.

The EP approximation has been shown to give results very close to MCMC, both in terms of predictive distributions and marginal likelihood estimates. We have shown and explained why the marginal distributions of the posterior can be well approximated by Gaussians. Further, the marginal likelihood values obtained by LA and EP differ systematically, which will lead to different results of ML-II hyperparameter estimation. The discrepancies are similar for different tasks. Using AIS we were able to show the accuracy of marginal likelihood estimates, which to the best of our knowledge has never been done before.

In summary, we found that EP is the method of choice for approximate inference in binary GPC models when the computational cost of MCMC is prohibitive. In contrast, the Laplace approximation is so inaccurate that we advise against its use, especially when predictive probabilities are to be taken seriously. Further experiments and a detailed description of the approximation schemes can be found in [2].
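The synthetic truncated-Gaussian experiment behind Figure (3c) can be approximated with a simple Gibbs sampler. This is a sketch under stated assumptions: the text does not say how the samples were drawn, so the sampler, the function names, the starting point and the sweep counts are illustrative choices, and "eigenvector 1" is read as the all-ones direction. With that reading the precision matrix is I − a·11ᵀ with a = (1 − 1/100)/n, which makes every conditional a one-dimensional truncated Gaussian.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

_STD_NORMAL = NormalDist()


def sample_truncated_marginal(n=767, big_eig=100.0, sweeps=300, burn_in=100, seed=0):
    """Gibbs-sample N(x|0, C) restricted to x_i >= 0 and return draws of x_0.

    C has eigenvalue `big_eig` along the all-ones direction and 1 elsewhere,
    so its inverse is I - a * ones * ones^T with a = (1 - 1/big_eig) / n.
    """
    rng = random.Random(seed)
    a = (1.0 - 1.0 / big_eig) / n
    cond_var = 1.0 / (1.0 - a)      # variance of x_i given all other coordinates
    cond_std = math.sqrt(cond_var)
    x = [1.0] * n                   # arbitrary feasible starting point
    draws = []
    for sweep in range(sweeps):
        total = sum(x)              # recomputed each sweep to limit rounding drift
        for i in range(n):
            others = max(0.0, total - x[i])
            cond_mean = a * others * cond_var
            # Inverse-CDF draw from N(cond_mean, cond_var) truncated to [0, inf).
            lower = _STD_NORMAL.cdf(-cond_mean / cond_std)
            u = lower + (1.0 - lower) * rng.random()
            u = min(max(u, 1e-12), 1.0 - 1e-12)   # inv_cdf needs 0 < u < 1
            new_xi = max(0.0, cond_mean + cond_std * _STD_NORMAL.inv_cdf(u))
            total += new_xi - x[i]
            x[i] = new_xi
        if sweep >= burn_in:
            draws.append(x[0])
    return draws


def gaussian_fit(draws):
    """Mean and standard deviation of the Gaussian matched to the draws."""
    return fmean(draws), pstdev(draws)
```

Calling `gaussian_fit(sample_truncated_marginal())` gives the moments of a Gaussian that can be overlaid on a histogram of the draws, as in panel (c). Successive Gibbs sweeps are correlated along the high-variance direction, so a faithful reproduction would likely need far more sweeps than the defaults used here.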
Acknowledgements

Both authors acknowledge support by the German Research Foundation (DFG) through grant RA 1030/1. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST2002-506778. This publication only reflects the authors’ views.

References

[1] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, NIPS 8, pages 514–520. MIT Press, 1996.
[2] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
[3] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.
[4] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Department of Electrical Engineering and Computer Science, MIT, 2001.
[5] R. M. Neal. Regression and classification using Gaussian process priors. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.
[6] R. M. Neal. Annealed importance sampling. Statistics and Computing, 11:125–139, 2001.
[7] D. J. C. MacKay. Information Theory, Inference and Learning Algorithms. CUP, 2003.
[8] J. C. Platt. Probabilities for SV machines. In Advances in Large Margin Classifiers, pages 61–73. The MIT Press, 2000.

4 0.68304461 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

Author: Kazuho Watanabe, Sumio Watanabe

Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning.

5 0.68034738 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

Author: Gilles Blanchard, Masashi Sugiyama, Motoaki Kawanabe, Vladimir Spokoiny, Klaus-Robert Müller

Abstract: We propose a new linear method for dimension reduction to identify non-Gaussian components in high dimensional data. Our method, NGCA (non-Gaussian component analysis), uses a very general semi-parametric framework. In contrast to existing projection methods we define what is uninteresting (Gaussian): by projecting out uninterestingness, we can estimate the relevant non-Gaussian subspace. We show that the estimation error of finding the non-Gaussian components tends to zero at a parametric rate. Once NGCA components are identified and extracted, various tasks can be applied in the data analysis process, like data visualization, clustering, denoising or classification. A numerical study demonstrates the usefulness of our method.

6 0.67699504 41 nips-2005-Coarse sample complexity bounds for active learning

7 0.67383039 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

8 0.67158973 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

9 0.67107636 32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

10 0.67103499 144 nips-2005-Off-policy Learning with Options and Recognizers

11 0.67007607 24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

12 0.66752887 66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

13 0.66747105 23 nips-2005-An Application of Markov Random Fields to Range Sensing

14 0.66737247 50 nips-2005-Convex Neural Networks

15 0.66703635 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

16 0.66680938 177 nips-2005-Size Regularized Cut for Data Clustering

17 0.66540873 136 nips-2005-Noise and the two-thirds power Law

18 0.6647383 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments

19 0.66262805 151 nips-2005-Pattern Recognition from One Example by Chopping

20 0.66215426 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations