nips nips2005 nips2005-18 knowledge-graph by maker-knowledge-mining

18 nips-2005-Active Learning For Identifying Function Threshold Boundaries


Source: pdf

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. [sent-18, score-0.371]

2 Rather, one is curious about determining the set of points for which the function exceeds some particular value. [sent-21, score-0.12]

3 In this paper, we use this idea to compute confidence intervals for a set of cosmological parameters that affect the shape of the temperature power spectrum of the Cosmic Microwave Background (CMB). [sent-23, score-0.556]

4 In one dimension, the threshold discovery problem is a root-finding problem where no hints as to the location or number of solutions are given; several methods exist which can be used to solve this problem (e. [sent-24, score-0.099]

5 For instance, [1] presents a method for picking experiments to determine the localities of local extrema when the input space is discrete. [sent-30, score-0.198]

6 We are interested in locating the subset of the input space wherein the function is above a given threshold. [sent-36, score-0.116]

7 Intuitively, points on the function that are located far from the boundary are less interesting, regardless of their variance. [sent-39, score-0.31]

8 In this paper, we make the following contributions to the literature: • We present a method for choosing experiments that is more efficient than global variance minimization, as well as other heuristics, when one is solely interested in localizing a function contour. [sent-40, score-0.215]

9 • We show that this heuristic can be used in continuous valued input spaces, without defining a priori a set of possible experiments (e. [sent-41, score-0.152]

10 • We use our function threshold detection method to determine 1−α simultaneously valid confidence intervals of CMB parameters, making no assumptions about the model being fit and few assumptions about the data in general. [sent-44, score-0.167]

11 Assume that we are given a bounded sample space S ⊂ Rn and a scoring function: f : S → R, but possibly no data points ({s, f (s)}, s ∈ S). [sent-46, score-0.195]

12 Given a threshold t, we want to find the set of points S′ where f is equal to or above the threshold: S′ = {s ∈ S : f(s) ≥ t}. [sent-47, score-0.219]

13 Thus, care should be taken when choosing the next experiment, as picking optimum points may reduce the runtime of the algorithm by orders of magnitude. [sent-53, score-0.236]

14 Therefore, it is preferable to analyze current knowledge about the underlying function and select experiments which quickly refine the estimate of the function around the threshold of interest. [sent-54, score-0.145]

15 However, we chose to approximate the unknown boundary as a Gaussian Process (GP), as many forms of regression (e. [sent-56, score-0.19]

16 However, for many GPs — and ordinary kriging in particular — the correlation between two points decreases as a function of distance. [sent-65, score-0.193]
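
As a concrete illustration of this distance decay, here is a minimal sketch of a stationary covariance function of the kind used in ordinary kriging; the exponential form and the hyperparameter names sigma_f and length_scale are illustrative assumptions, not values from the paper.

    import numpy as np

    def kriging_covariance(x1, x2, sigma_f=1.0, length_scale=0.2):
        # Correlation between two points decays with their Euclidean distance.
        d = np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
        return sigma_f ** 2 * np.exp(-d / length_scale)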

17 Since we have assumed that experimentation is expensive, it would be ideal to iteratively analyze the entire input space and pick the next experiment in such a manner that minimized the total number of experiments necessary. [sent-67, score-0.159]

18 However, if |S| is large or infinite, testing all points may be impractical. [sent-69, score-0.12]

19 Instead of imposing some arbitrary structure on the possible experimental points (such as using a grid), our algorithm chooses candidate points uniformly at random from the input space, and then selects the candidate point with the highest score according to the metrics given in §2. [sent-70, score-0.551]

20 This allows the input space to be fully explored (in expectation), and ensures that interesting regions of space that would have fallen between successive grid points are not missed; in §4 we show how imposing a grid upon the input space results in just such a situation. [sent-72, score-0.623]

21 While the algorithm is unable to consider the entire space for each sampling iteration, over multiple iterations it does consider most of the space, resulting in the function boundaries being quickly localized, as can be seen in §3. [sent-73, score-0.114]

22 Choosing experiments from among candidates: Given a set of random input points, the algorithm evaluates each one and chooses the point with the highest score as the location for the next experiment. [sent-75, score-0.132]
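
A minimal sketch of this candidate-selection step is given below; gp_predict (returning a posterior mean and standard deviation at a point) and score (one of the heuristics described below) are caller-supplied placeholders, and bounds describes the box-shaped input space — none of these names come from the paper.

    import numpy as np

    def choose_next_experiment(gp_predict, score, bounds, n_candidates=1000, rng=None):
        # Draw candidate points uniformly at random from the input space and
        # return the one with the highest heuristic score.
        rng = np.random.default_rng() if rng is None else rng
        bounds = np.asarray(bounds, dtype=float)          # shape (d, 2)
        candidates = rng.uniform(bounds[:, 0], bounds[:, 1],
                                 size=(n_candidates, bounds.shape[0]))
        scores = np.array([score(*gp_predict(c)) for c in candidates])
        return candidates[int(np.argmax(scores))]

In use, the chosen point would be evaluated with the expensive scoring function and the GP refit before the next draw.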

23 Random: One of the candidate points is chosen uniformly at random. [sent-77, score-0.19]

24 This method serves as a baseline for comparison. Probability of incorrect classification: Since we are trying to map the boundary between points above and below a threshold, we consider choosing the point from our random sample which has the largest probability of being misclassified by our model. [sent-78, score-0.466]

25 Using the distribution defined by Equations 1 and 2, the probability, p, that the point is above the given threshold can be computed. [sent-79, score-0.144]

26 The point is predicted to be above the threshold if p > 0.5. [sent-80, score-0.144]

27 Both entropy and misclassification will choose points near the boundary. [sent-85, score-0.293]
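
A minimal sketch of these two scores, assuming the predictive distribution at a query point is Gaussian with mean mu and standard deviation sigma (the paper's Equations 1 and 2 are not reproduced in this extract):

    import math

    def prob_above_threshold(mu, sigma, t):
        # P(f(s) >= t) under a Gaussian predictive distribution N(mu, sigma^2).
        z = (mu - t) / max(sigma, 1e-12)
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def misclassification_score(mu, sigma, t):
        # Probability that the thresholded prediction (p > 0.5) is wrong.
        p = prob_above_threshold(mu, sigma, t)
        return min(p, 1.0 - p)

    def entropy_score(mu, sigma, t):
        # Binary entropy of the above/below-threshold probability.
        p = prob_above_threshold(mu, sigma, t)
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))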

28 Unfortunately, they have the drawback that once they find a point near the boundary they continue to choose points near that location and will not explore the rest of the parameter space. [sent-86, score-0.461]

29 Variance: Both entropy and probability of incorrect classification suffer from a lack of incentive to explore the space. [sent-87, score-0.19]

30 To rectify this problem, we consider the variance of each query point (given by Equation 2) as an evaluation metric. [sent-88, score-0.25]

31 This metric is common in active learning methods whose goal is to map out an entire function. [sent-89, score-0.091]

32 Since variance is related to the distance to nearest neighbors, this strategy chooses points that are far from areas currently searched, and hence will not get stuck at one boundary point. [sent-90, score-0.401]

33 Information gain: Information gain is a common myopic metric used in active learning. [sent-92, score-0.1]

34 Information gain at the query point is the same as entropy in our case because all run experiments are assumed to have the same variance. [sent-93, score-0.321]

35 We believe the good performance of the evaluation metrics proposed below stems from their being heuristic proxies for global information gain or reduction in misclassification error. [sent-96, score-0.148]

36 Products of metrics: One way to rectify the problems of point policies that focus solely on points near the boundary, or solely on points with large variance regardless of their relevance to refining the predictive model, is to combine the two measures. [sent-97, score-0.705]

37 Intuitively, doing this can mimic the idea of information gain; the entropy of a query point measures the classification uncertainty, while the variance is a good estimator of how much impact a new observation would have in this region, and thus by what fraction the uncertainty would be reduced. [sent-98, score-0.362]

38 [1] proposed scoring points based upon the product of their entropy and variance to identify the presence of local maxima and minima, a problem closely related to boundary detection. [sent-99, score-0.56]

39 We shall also consider scoring points based upon the product of their probability of incorrect classification and variance. [sent-100, score-0.229]

40 Note that while entropy and probability of incorrect classification are monotonically related, entropy times variance and probability of incorrect classification times variance are not. [sent-101, score-0.562]
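
A sketch of these product scores, reusing the entropy and misclassification helpers from the sketch above (assumed to be in scope); whether variance or standard deviation is used as the second factor is taken from the text and may differ from the paper's exact choice.

    def entropy_times_variance(mu, sigma, t):
        # Product heuristic of [1]: classification entropy times predictive variance.
        return entropy_score(mu, sigma, t) * sigma ** 2

    def misclassification_times_sd(mu, sigma, t):
        # Probability of incorrect classification times predictive standard deviation.
        return misclassification_score(mu, sigma, t) * sigma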

41 Straddle: Using the same intuition as for products of heuristics, we define the straddle heuristic as straddle(sq) = 1.96 σ̂q − |f̂(sq) − t|. [sent-102, score-0.367]

42 The straddle algorithm scores highest those points that are both unknown and near the boundary. [sent-103, score-0.54]

43 As such, the straddle algorithm prefers points near the threshold, but far from previous examples. [sent-104, score-0.54]

44 The straddle score for a point may be negative, which indicates that the model currently estimates that the probability the point is on a boundary is less than five percent. [sent-105, score-0.647]

45 Since the straddle heuristic relies on the variance estimate, it is also subject to oversampling edge positions. [sent-106, score-0.566]
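
A minimal sketch of the straddle score as reconstructed above, with mu and sigma the GP predictive mean and standard deviation at the candidate point:

    def straddle_score(mu, sigma, t):
        # straddle(s_q) = 1.96 * sigma_q - |mu_q - t|: positive when the 95%
        # predictive interval still straddles the threshold t, negative when the
        # point is estimated to lie off the boundary with >95% confidence.
        return 1.96 * sigma - abs(mu - t)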

46 This is done by computing the fraction of test points for which the predictive model agrees with the true function about which side of the threshold they fall on, after some fixed number of experiments. [sent-108, score-0.38]
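
A sketch of this evaluation, assuming the true and predicted function values at a fixed set of test points are already available:

    import numpy as np

    def boundary_classification_accuracy(f_true, f_pred, t):
        # Fraction of test points where model and true function agree on the
        # side of the threshold.
        f_true = np.asarray(f_true, dtype=float)
        f_pred = np.asarray(f_pred, dtype=float)
        return float(np.mean((f_true >= t) == (f_pred >= t)))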

47 The first model we consider is a 2D sinusoidal function given by f (x, y) = sin(10x) + cos(4y) − cos(3xy) x ∈ [0, 1], y ∈ [0, 2], with a boundary threshold of t = 0. [sent-110, score-0.34]

48 This function and threshold were examined for the following reasons: 1) the target threshold winds through the plot, giving ample length to … [sent-111, score-0.198]

49 Figure 1: Predicted function boundary (solid), true function boundary (dashed), and experiments (dots) for the 2D sinusoid function after A) 50 experiments and B) 100 experiments using the straddle heuristic, and C) 100 experiments using the variance heuristic. [sent-114, score-1.13]

50 Table 1: Number of experiments required to obtain 99% classification accuracy for the 2D models and 95% classification accuracy for the 4D model for various heuristics. [sent-115, score-0.116]

51 Heuristics requiring more than 10,000 experiments to converge are labeled “did not converge”. [sent-116, score-0.096]

52 …(0.9, 1), where the true function is approximately equal to the threshold and the gradient is small, and 4) there are areas in the domain where the function is far from the threshold, and hence we can ensure that the algorithm is not oversampling in these regions. [sent-124, score-0.142]

53 Note that picking points solely on entropy does not converge in many cases, while both the straddle algorithm and the probability-of-incorrect-classification-times-standard-deviation heuristic result in approximations that are significantly better than the random and variance heuristics. [sent-126, score-0.995]

54 Figures 1A-C confirm that the straddle heuristic is aiding in boundary prediction. [sent-127, score-0.622]

55 Note that most of the 50 experiments sampled between Figures 1A and 1B are chosen near the boundary. [sent-128, score-0.146]

56 The 100 experiments chosen to minimize the variance result in an even distribution over the input space and a worse boundary approximation, as seen in Figure 1C. [sent-129, score-0.404]

57 To ensure that heuristics are not exploiting some feature of the 2D input space, we consider the 4D sinusoidal function f (x) = sin(10x1 ) + cos(4x2 ) − cos(3x1 x2 ) + cos(2x3 ) + cos(3x4 ) − sin(5x3 x4 ) where x ∈ [(0, 0, 1, 0), (1, 2, 2, 2)] and t = 0. [sent-132, score-0.164]
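
For reference, the two synthetic test functions quoted above, written out directly from the formulas in the text (the threshold values appear truncated to "t = 0." in this extract, so t = 0 here is an assumption):

    import numpy as np

    def sinusoid_2d(x, y):
        # 2D synthetic test function: x in [0, 1], y in [0, 2].
        return np.sin(10 * x) + np.cos(4 * y) - np.cos(3 * x * y)

    def sinusoid_4d(x):
        # 4D synthetic test function: x = (x1, x2, x3, x4),
        # with x in [(0, 0, 1, 0), (1, 2, 2, 2)].
        x1, x2, x3, x4 = x
        return (np.sin(10 * x1) + np.cos(4 * x2) - np.cos(3 * x1 * x2)
                + np.cos(2 * x3) + np.cos(3 * x4) - np.sin(5 * x3 * x4))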

58 Comparison of the 2D and 4D results in Table 1 reveals that the relative performance of the heuristics remains unchanged, indicating that the best heuristic for picking experiments is independent of the problem dimension. [sent-133, score-0.258]

59 To show that the decrease in the number of candidate points relative to the input parameter space that occurs with higher dimensional problems is not an issue, we reconsider the 2D sinusoidal problem. [sent-134, score-0.318]

60 Now, we use only 31 candidate points instead of 1000 to simulate the point density difference between 4D and 2D. [sent-135, score-0.235]

61 Results shown in Table 1 indicate that reducing the number of candidate points does not drastically alter the realized performance. [sent-136, score-0.19]

62 Statistical analysis of cosmological parameters (§4): Let us now look at a concrete application of this work, a statistical analysis of the cosmological parameters that affect the formation and evolution of our universe. [sent-138, score-0.552]

63 One key prediction of the Big Bang model for the origin of our universe is the presence of a 2. [sent-139, score-0.128]

64 Recently, the Wilkinson Microwave Anisotropy Project (WMAP) has completed a detailed survey of this radiation, exhibiting small CMB temperature fluctuations over the sky [8]. [sent-141, score-0.14]

65 It is conjectured that this radiation permeated through the universe unchanged since its formation 15 billion years ago. [sent-143, score-0.266]

66 Therefore, the sizes and angular separations of these CMB fluctuations give a unique picture of the universe immediately after the Big Bang and have large implications for our understanding of primordial cosmology. [sent-144, score-0.128]

67 An important summary of the temperature fluctuations is the CMB power spectrum shown in Figure 2, which gives the temperature variance of the CMB as a function of spatial frequency (or multi-pole moment). [sent-145, score-0.399]

68 It is well known that the shape of this curve is affected by at least seven cosmological parameters: optical depth (τ ), dark energy mass fraction (ΩΛ ), total mass fraction (Ωm ), baryon density (ωb ), dark matter density (ωdm), neutrino fraction (fn ), and spectral index (ns ). [sent-146, score-0.577]

69 For instance, the height of the first peak is determined by the total energy density of the universe, while the third peak is related to the amount of dark matter. [sent-147, score-0.305]

70 Thus, by fitting models of the CMB power spectrum for given values of the seven parameters, we can determine how the parameters influence the shape of the model spectrum. [sent-148, score-0.275]

71 Previous work characterizing confidence intervals for cosmological parameters either used marginalization over the other parameters, or made assumptions about the values of the parameters and/or the shape of the CMB power spectrum. [sent-150, score-0.405]

72 However, [9] notes that “CMB data have now become so sensitive that the key issue in cosmological parameter determination is not always the accuracy with which the CMB power spectrum features can be measured, but often what prior information is used or assumed.” [sent-151, score-0.411]

73 In this analysis, we make no assumptions about the ranges or values of the parameters, and assume only that the data are normally distributed around the unknown CMB spectrum with covariance known up to a constant multiple. [sent-152, score-0.149]

74 Using the method of [10], we create a non-parametric confidence ball (under a weighted squared-error loss) for the unknown spectrum that is centered on a nonparametric estimate, with a radius for each specified confidence level derived from the asymptotic distribution of a pivot statistic. [sent-153, score-0.202]

75 For any candidate spectrum, membership in the confidence ball can be determined by comparing the ball’s radius to the variance weighted sum of squares deviation between the candidate function and the center of the ball. [sent-154, score-0.322]
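
A minimal sketch of this membership test; the weights, ball center, and radius are taken as given, since their construction (following [10]) is not spelled out in this extract.

    import numpy as np

    def in_confidence_ball(candidate_spectrum, center_spectrum, weights, radius):
        # Accept the candidate if its variance-weighted sum of squared deviations
        # from the ball center does not exceed the ball's radius.
        dev = np.asarray(candidate_spectrum) - np.asarray(center_spectrum)
        weighted_ss = float(np.sum(np.asarray(weights) * dev ** 2))
        return weighted_ss <= radius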

76 One advantage of this method is that it gives us simultaneously valid confidence intervals on all seven of our input parameters; this is not true for 1 − α confidence intervals derived from a collection of χ2 distributions where the confidence intervals often have substantially lower coverage [11]. [sent-155, score-0.328]

77 Figure 2: WMAP data, overlaid with the regressed model (solid) and an example of a model CMB spectrum that barely fits at the 95% confidence level (dashed; parameter values are ωDM = 0. …). [sent-159, score-0.16]

78 Gray dots denote models which are rejected at a 95% confidence level, while the black dots denote those that are not. [sent-167, score-0.12]

79 in §2 to map out the confidence surface as a function of the input parameters; that is, we use the algorithm to pick a location in the seven-dimensional parameter space to perform an experiment, and then run CMBFast [12] to create a simulated power spectrum given this set of input parameters. [sent-168, score-0.357]

80 We can then compute the sum of squares of error for this spectrum (relative to the regressed model) and easily tell if the 7D input point is inside the confidence ball. [sent-169, score-0.246]
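
Putting the last two sentences together, a hedged sketch of how one 7D parameter point would be labeled; simulate_spectrum stands in for an external CMBFast run (its real interface is not described in this extract) and in_confidence_ball is the helper sketched above.

    def label_parameter_point(theta, simulate_spectrum, center_spectrum, weights, radius):
        # Run the (expensive) simulation for this 7D parameter vector, then test
        # whether the resulting spectrum lies inside the confidence ball.
        spectrum = simulate_spectrum(theta)   # e.g. a wrapper around CMBFast
        return in_confidence_ball(spectrum, center_spectrum, weights, radius)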

81 Due to previous efforts on this project, we were able to estimate the semivariogram of the GP from several hundred thousand random points already run through CMBFast. [sent-172, score-0.254]

82 These points also gave a starting point for our algorithm. [sent-174, score-0.165]

83 CMBFast then takes roughly 3 minutes to compute the CMB spectrum given our chosen point in parameter space. [sent-177, score-0.156]

84 In Figure 3, we show a plot of baryon density (ωB ) versus the dark matter density (ωDM ) of the universe over all values of the other five parameters (τ, ΩDE , ΩM , fn , ns ). [sent-178, score-0.264]

85 Note how there are areas that remain unsampled, while the boundary regions (transitions between gray and black points) are heavily sampled, indicating that our algorithm is choosing reasonable points. [sent-180, score-0.271]

86 While hard to distinguish in Figure 3, the bottom left group of points above the 95% confidence boundary splits into two separate peaks in parameter space. [sent-182, score-0.31]

87 The one to the left is the concordance model, while the second peak (the one to the right) is not believed to represent the correct values of the parameters (due to constraints from other data). [sent-183, score-0.261]

88 The existence of high probability points in this region of the parameter space has been suggested before, but computational limitations have prevented much characterization of it. [sent-184, score-0.156]

89 Moreover, the third peak, near the top right corner of Figure 3, was basically ignored by previous grid based approaches. [sent-185, score-0.166]

90 Comparison of the number of experiments performed by our straddle algorithm with the grid based approach used by [9] is shown in Table 2. (While initial values are not required, as we have seen in §3, it is possible to incorporate this background knowledge into the model to help the algorithm converge more quickly.) [sent-186, score-0.499]

91 Table 2: Number of points found in the three peaks for the grid based approach of [9] and our straddle algorithm. [sent-187, score-0.6]

92 (Table 2 data, # points in effective radius, Grid / Straddle: 2118 / 16055, 2825 / 9634, 0 / 5488, 5613300 / 603384.) [sent-194, score-0.113]

93 Even with only 10% of the experiments used in the grid approach, we sampled the concordance peak 8 times more frequently, and the second peak 3. [sent-195, score-0.558]

94 Moreover, it appears that the grid completely missed the third peak, while our method sampled it over 5000 times. [sent-197, score-0.16]

95 We described several different methods for picking the next experimental point from a group of candidates and showed how they perform on synthetic test functions. [sent-201, score-0.12]

96 Our experiments indicate that the straddle algorithm outperforms previously published methods, and even handles functions with large discontinuities. [sent-202, score-0.413]

97 Moreover, the algorithm is shown to work on multi-dimensional data, correctly classifying the boundary at a 99% level with half the points required for variance minimizing methods. [sent-203, score-0.401]

98 We have then applied this algorithm to a seven dimensional statistical analysis of cosmological parameters affecting the Cosmic Microwave Background. [sent-204, score-0.339]

99 With only a few hundred thousand simulations we are able to accurately describe the interdependence of the cosmological parameters, leading to a better understanding of fundamental physical properties. [sent-205, score-0.305]

100 Towards a refined cosmic concordance model: Joint 11-parameter constraints from the cosmic microwave background and large-scale structure. [sent-268, score-0.672]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cmb', 0.416), ('straddle', 0.367), ('cosmological', 0.22), ('microwave', 0.196), ('boundary', 0.19), ('dence', 0.174), ('cosmic', 0.171), ('sq', 0.147), ('universe', 0.128), ('peak', 0.127), ('points', 0.12), ('entropy', 0.12), ('grid', 0.113), ('spectrum', 0.111), ('threshold', 0.099), ('concordance', 0.098), ('misclassi', 0.094), ('variance', 0.091), ('dm', 0.09), ('con', 0.087), ('genovese', 0.085), ('seven', 0.083), ('cos', 0.081), ('temperature', 0.076), ('picking', 0.075), ('cmbfast', 0.073), ('kriging', 0.073), ('nichol', 0.073), ('wmap', 0.073), ('aq', 0.072), ('heuristics', 0.072), ('incorrect', 0.07), ('candidate', 0.07), ('aa', 0.068), ('intervals', 0.068), ('query', 0.065), ('heuristic', 0.065), ('radiation', 0.064), ('gp', 0.059), ('uctuations', 0.058), ('active', 0.055), ('near', 0.053), ('si', 0.052), ('sinusoidal', 0.051), ('ya', 0.051), ('ball', 0.051), ('pittsburgh', 0.051), ('dark', 0.051), ('converge', 0.05), ('bang', 0.049), ('baryon', 0.049), ('deboor', 0.049), ('observatorio', 0.049), ('portsmouth', 0.049), ('rectify', 0.049), ('regressed', 0.049), ('semivariogram', 0.049), ('wilkinson', 0.049), ('sampled', 0.047), ('imposing', 0.047), ('experiments', 0.046), ('power', 0.045), ('gain', 0.045), ('thousand', 0.045), ('carnegie', 0.045), ('point', 0.045), ('mellon', 0.044), ('anisotropy', 0.043), ('chile', 0.043), ('oversampling', 0.043), ('sinusoid', 0.043), ('boundaries', 0.042), ('dots', 0.042), ('gps', 0.042), ('input', 0.041), ('sj', 0.041), ('fraction', 0.041), ('choosing', 0.041), ('formation', 0.04), ('hundred', 0.04), ('radius', 0.04), ('regions', 0.04), ('cation', 0.039), ('scoring', 0.039), ('miller', 0.039), ('christopher', 0.039), ('discontinuous', 0.039), ('larry', 0.039), ('locating', 0.039), ('ranges', 0.038), ('metrics', 0.038), ('solely', 0.037), ('pa', 0.037), ('entire', 0.036), ('rejected', 0.036), ('space', 0.036), ('background', 0.036), ('parameters', 0.036), ('accuracy', 0.035), ('unchanged', 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

2 0.148037 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1

3 0.085387066 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore

Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.

4 0.084936425 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.

5 0.075962633 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

Author: Edward Snelson, Zoubin Ghahramani

Abstract: We present a new Gaussian process (GP) regression model whose covariance is parameterized by the locations of M pseudo-input points, which we learn by a gradient based optimization. We take M ≪ N, where N is the number of real data points, and hence obtain a sparse regression method which has O(M²N) training cost and O(M²) prediction cost per test case. We also find hyperparameters of the covariance function in the same joint optimization. The method can be viewed as a Bayesian regression model with particular input dependent noise. The method turns out to be closely related to several other sparse GP approaches, and we discuss the relation in detail. We finally demonstrate its performance on some large data sets, and make a direct comparison to other sparse GP methods. We show that our method can match full GP performance with small M, i.e. very sparse solutions, and it significantly outperforms other approaches in this regime. 1

6 0.075562991 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

7 0.067289963 19 nips-2005-Active Learning for Misspecified Models

8 0.064021774 69 nips-2005-Fast Gaussian Process Regression using KD-Trees

9 0.062594362 126 nips-2005-Metric Learning by Collapsing Classes

10 0.061497077 33 nips-2005-Bayesian Sets

11 0.061092924 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

12 0.060171023 78 nips-2005-From Weighted Classification to Policy Search

13 0.057128035 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

14 0.054788925 4 nips-2005-A Bayesian Spatial Scan Statistic

15 0.053863514 171 nips-2005-Searching for Character Models

16 0.05384206 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

17 0.052278783 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

18 0.051838223 167 nips-2005-Robust design of biological experiments

19 0.051194102 169 nips-2005-Saliency Based on Information Maximization

20 0.050949667 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.204), (1, 0.018), (2, -0.007), (3, 0.037), (4, 0.014), (5, 0.022), (6, -0.018), (7, 0.011), (8, 0.005), (9, 0.12), (10, 0.011), (11, 0.022), (12, 0.201), (13, 0.026), (14, -0.032), (15, -0.056), (16, -0.003), (17, -0.022), (18, -0.061), (19, -0.057), (20, -0.006), (21, -0.046), (22, 0.002), (23, 0.006), (24, -0.174), (25, 0.028), (26, 0.088), (27, -0.154), (28, -0.026), (29, 0.033), (30, 0.02), (31, 0.085), (32, 0.07), (33, 0.105), (34, 0.003), (35, 0.007), (36, -0.031), (37, -0.087), (38, 0.055), (39, -0.037), (40, 0.025), (41, 0.002), (42, 0.012), (43, 0.034), (44, 0.047), (45, -0.109), (46, 0.108), (47, -0.119), (48, -0.034), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93498224 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.

2 0.75395256 19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem in supervised learning to design the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly specified, i.e., the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulfilled. In this paper, we first show that the existing active learning method can be theoretically justified under a slightly weaker condition: the model does not have to be correctly specified, but slightly misspecified models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justified for a wider class of misspecified models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspecification of models and is thus reliable.

3 0.74049097 74 nips-2005-Faster Rates in Regression via Active Learning

Author: Rebecca Willett, Robert Nowak, Rui M. Castro

Abstract: This paper presents a rigorous statistical analysis characterizing regimes in which active learning significantly outperforms classical passive learning. Active learning algorithms are able to make queries or select sample locations in an online fashion, depending on the results of the previous queries. In some regimes, this extra flexibility leads to significantly faster rates of error decay than those possible in classical passive learning settings. The nature of these regimes is explored by studying fundamental performance limits of active and passive learning in two illustrative nonparametric function classes. In addition to examining the theoretical potential of active learning, this paper describes a practical algorithm capable of exploiting the extra flexibility of the active setting and provably improving upon the classical passive techniques. Our active learning theory and methods show promise in a number of applications, including field estimation using wireless sensor networks and fault line detection. 1

4 0.54240358 41 nips-2005-Coarse sample complexity bounds for active learning

Author: Sanjoy Dasgupta

Abstract: We characterize the sample complexity of active learning problems in terms of a parameter which takes into account the distribution over the input space, the specific target hypothesis, and the desired accuracy.

5 0.53876638 4 nips-2005-A Bayesian Spatial Scan Statistic

Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper

Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low. 1

6 0.50581044 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

7 0.50274706 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

8 0.49559271 171 nips-2005-Searching for Character Models

9 0.4917486 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

10 0.46707997 160 nips-2005-Query by Committee Made Real

11 0.43891922 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

12 0.43570808 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

13 0.4306846 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

14 0.43004182 126 nips-2005-Metric Learning by Collapsing Classes

15 0.42090833 33 nips-2005-Bayesian Sets

16 0.41629526 35 nips-2005-Bayesian model learning in human visual perception

17 0.40870237 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

18 0.39231884 112 nips-2005-Learning Minimum Volume Sets

19 0.37984478 103 nips-2005-Kernels for gene regulatory regions

20 0.37221584 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.031), (10, 0.023), (27, 0.026), (31, 0.035), (34, 0.058), (55, 0.026), (69, 0.567), (73, 0.037), (88, 0.071), (91, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98240912 40 nips-2005-CMOL CrossNets: Possible Neuromorphic Nanoelectronic Circuits

Author: Jung Hoon Lee, Xiaolong Ma, Konstantin K. Likharev

Abstract: Hybrid “CMOL” integrated circuits, combining CMOS subsystem with nanowire crossbars and simple two-terminal nanodevices, promise to extend the exponential Moore-Law development of microelectronics into the sub-10-nm range. We are developing neuromorphic network (“CrossNet”) architectures for this future technology, in which neural cell bodies are implemented in CMOS, nanowires are used as axons and dendrites, while nanodevices (bistable latching switches) are used as elementary synapses. We have shown how CrossNets may be trained to perform pattern recovery and classification despite the limitations imposed by the CMOL hardware. Preliminary estimates have shown that CMOL CrossNets may be extremely dense (~10 7 cells per cm2) and operate approximately a million times faster than biological neural networks, at manageable power consumption. In Conclusion, we discuss in brief possible short-term and long-term applications of the emerging technology. 1 Introduction: CMOL Circuits Recent results [1, 2] indicate that the current VLSI paradigm based on CMOS technology can be hardly extended beyond the 10-nm frontier: in this range the sensitivity of parameters (most importantly, the gate voltage threshold) of silicon field-effect transistors to inevitable fabrication spreads grows exponentially. This sensitivity will probably send the fabrication facilities costs skyrocketing, and may lead to the end of Moore’s Law some time during the next decade. There is a growing consensus that the impending Moore’s Law crisis may be preempted by a radical paradigm shift from the purely CMOS technology to hybrid CMOS/nanodevice circuits, e.g., those of “CMOL” variety (Fig. 1). Such circuits (see, e.g., Ref. 3 for their recent review) would combine a level of advanced CMOS devices fabricated by the lithographic patterning, and two-layer nanowire crossbar formed, e.g., by nanoimprint, with nanowires connected by simple, similar, two-terminal nanodevices at each crosspoint. For such devices, molecular single-electron latching switches [4] are presently the leading candidates, in particular because they may be fabricated using the self-assembled monolayer (SAM) technique which already gave reproducible results for simpler molecular devices [5]. (a) nanodevices nanowiring and nanodevices interface pins upper wiring level of CMOS stack (b) βFCMOS Fnano α Fig. 1. CMOL circuit: (a) schematic side view, and (b) top-view zoom-in on several adjacent interface pins. (For clarity, only two adjacent nanodevices are shown.) In order to overcome the CMOS/nanodevice interface problems pertinent to earlier proposals of hybrid circuits [6], in CMOL the interface is provided by pins that are distributed all over the circuit area, on the top of the CMOS stack. This allows to use advanced techniques of nanowire patterning (like nanoimprint) which do not have nanoscale accuracy of layer alignment [3]. The vital feature of this interface is the tilt, by angle α = arcsin(Fnano/βFCMOS), of the nanowire crossbar relative to the square arrays of interface pins (Fig. 1b). Here Fnano is the nanowiring half-pitch, FCMOS is the half-pitch of the CMOS subsystem, and β is a dimensionless factor larger than 1 that depends on the CMOS cell complexity. Figure 1b shows that this tilt allows the CMOS subsystem to address each nanodevice even if Fnano << βFCMOS. 

2 0.95381182 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning

Author: Artur Garcez, Luis C. Lamb, Dov M. Gabbay

Abstract: We present a new connectionist model for constructive, intuitionistic modal reasoning. We use ensembles of neural networks to represent intuitionistic modal theories, and show that for each intuitionistic modal program there exists a corresponding neural network ensemble that computes the program. This provides a massively parallel model for intuitionistic modal reasoning, and sets the scene for integrated reasoning, knowledge representation, and learning of intuitionistic theories in neural networks, since the networks in the ensemble can be trained by examples using standard neural learning algorithms.
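A toy illustration of the underlying idea — one simple threshold network per possible world, wired along the accessibility relation so that box and diamond conditions become threshold tests over neighbouring worlds — is sketched below. It is written as plain threshold checks rather than trained networks and is not the authors' construction; the worlds, accessibility relation, facts, and rules are invented for the example.

# Toy illustration (not the authors' construction): modal rules computed by an
# ensemble of simple threshold units, one group per possible world, connected
# along the accessibility relation.
access = {"w1": ["w2", "w3"], "w2": [], "w3": []}

# Atomic facts holding at each world.
facts = {"w1": set(), "w2": {"p", "r"}, "w3": {"p"}}

def box_neuron(world, atom):
    # BOX(atom) at `world`: a threshold unit that fires only if `atom` holds in
    # every accessible world (threshold = number of accessible worlds).
    succ = access[world]
    active_inputs = sum(1 for w in succ if atom in facts[w])
    return len(succ) > 0 and active_inputs >= len(succ)

def diamond_neuron(world, atom):
    # DIAMOND(atom) at `world`: fires if `atom` holds in at least one accessible world.
    return any(atom in facts[w] for w in access[world])

# Two modal rules evaluated at w1:  BOX p -> q   and   DIAMOND r -> s
if box_neuron("w1", "p"):
    facts["w1"].add("q")
if diamond_neuron("w1", "r"):
    facts["w1"].add("s")

print(facts["w1"])   # expected: {'q', 's'}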

same-paper 3 0.94712496 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.
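To make the selection idea in the entry above concrete, here is a minimal pool-based sketch (an illustration, not the paper's full procedure): a Gaussian process surrogate is fit to the experiments run so far, and the next query is the candidate scoring highest on 1.96*sigma(x) - |mu(x) - t|, which favors points that are both uncertain and close to the threshold t. The objective f, the threshold value, the candidate grid, and the use of scikit-learn's GaussianProcessRegressor are all choices made for this sketch.

# Minimal sketch of pool-based active learning for a threshold boundary:
# fit a GP to the points queried so far, then query the candidate that is both
# uncertain and close to the threshold.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):                                    # stand-in scoring function
    return np.sin(3 * x) + 0.5 * x

t = 0.5                                      # threshold of interest
candidates = np.linspace(-2, 2, 400).reshape(-1, 1)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(3, 1))          # a few seed experiments
y = f(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
for _ in range(20):
    gp.fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    score = 1.96 * sigma - np.abs(mu - t)    # high near the boundary and where uncertain
    x_next = candidates[np.argmax(score)]
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

gp.fit(X, y)
boundary = candidates[np.abs(gp.predict(candidates) - t) < 0.02]
print("estimated boundary locations:", boundary.ravel())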

4 0.9315868 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms

Author: Baback Moghaddam, Yair Weiss, Shai Avidan

Abstract: Sparse PCA seeks approximate sparse “eigenvectors” whose projections capture the maximal variance of data. As a cardinality-constrained and non-convex optimization problem, it is NP-hard and is encountered in a wide range of applied fields, from bio-informatics to finance. Recent progress has focused mainly on continuous approximation and convex relaxation of the hard cardinality constraint. In contrast, we consider an alternative discrete spectral formulation based on variational eigenvalue bounds and provide an effective greedy strategy as well as provably optimal solutions using branch-and-bound search. Moreover, the exact methodology used reveals a simple renormalization step that improves approximate solutions obtained by any continuous method. The resulting performance gain of discrete algorithms is demonstrated on real-world benchmark data and in extensive Monte Carlo evaluation trials.
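The greedy part of this strategy is easy to sketch: grow the support of the sparse “eigenvector” one variable at a time, each time keeping the variable whose inclusion most increases the largest eigenvalue of the covariance restricted to the support. The sketch below (synthetic data, NumPy only) shows only that greedy step; the paper's variational bounds, branch-and-bound search, and renormalization step are not reproduced.

# Sketch of greedy forward selection for cardinality-constrained ("sparse") PCA.
import numpy as np

def greedy_sparse_pca(C, k):
    """Return an index set of size k and the top eigenvalue of C restricted to it."""
    support, remaining = [], list(range(C.shape[0]))
    best_val = -np.inf
    for _ in range(k):
        best_var, best_val = None, -np.inf
        for j in remaining:
            idx = support + [j]
            val = np.linalg.eigvalsh(C[np.ix_(idx, idx)])[-1]  # largest eigenvalue
            if val > best_val:
                best_var, best_val = j, val
        support.append(best_var)
        remaining.remove(best_var)
    return sorted(support), best_val

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 2] += 3 * rng.normal(size=200)        # give a few variables extra variance
X[:, 7] += 2 * rng.normal(size=200)
C = np.cov(X, rowvar=False)

support, value = greedy_sparse_pca(C, k=3)
print("selected variables:", support, " captured variance:", round(value, 3))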

5 0.87559265 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

Author: Aaron Shon, Keith Grochow, Aaron Hertzmann, Rajesh P. Rao

Abstract: We propose an algorithm that uses Gaussian process regression to learn common hidden structure shared between corresponding sets of heterogeneous observations. The observation spaces are linked via a single, reduced-dimensionality latent variable space. We present results from two datasets demonstrating the algorithm’s ability to synthesize novel data from learned correspondences. We first show that the method can learn the nonlinear mapping between corresponding views of objects, filling in missing data as needed to synthesize novel views. We then show that the method can learn a mapping between human degrees of freedom and robotic degrees of freedom for a humanoid robot, allowing robotic imitation of human poses from motion capture data.
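As a rough, heavily simplified stand-in for this idea (the paper uses a shared Gaussian process latent variable model, which is not reproduced here), the sketch below builds a common low-dimensional code from paired observations with PCA and uses two GP regressions — one from the first observation space into the code, one from the code into the second space — to synthesize one view from the other. All data and model choices are placeholders.

# Much-simplified stand-in for a shared latent space linking two observation spaces.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
z_true = rng.uniform(-1, 1, size=(300, 2))            # hidden cause of both views
X = np.column_stack([z_true[:, 0], np.sin(2 * z_true[:, 1]), z_true.prod(axis=1)])
Y = np.column_stack([np.cos(z_true[:, 0]), z_true[:, 1] ** 2])

# Shared 2-D latent code from the paired (X, Y) observations.
latent = PCA(n_components=2).fit_transform(np.hstack([X, Y]))

encode_X = GaussianProcessRegressor(kernel=RBF(), alpha=1e-3).fit(X, latent)
decode_Y = GaussianProcessRegressor(kernel=RBF(), alpha=1e-3).fit(latent, Y)

# Synthesize the second view for held-out first-view observations.
z_new = rng.uniform(-1, 1, size=(5, 2))
X_new = np.column_stack([z_new[:, 0], np.sin(2 * z_new[:, 1]), z_new.prod(axis=1)])
Y_pred = decode_Y.predict(encode_X.predict(X_new))
Y_true = np.column_stack([np.cos(z_new[:, 0]), z_new[:, 1] ** 2])
print("mean absolute synthesis error:", np.abs(Y_pred - Y_true).mean())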

6 0.65726614 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

7 0.59299898 181 nips-2005-Spiking Inputs to a Winner-take-all Network

8 0.56821537 149 nips-2005-Optimal cue selection strategy

9 0.5433135 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

10 0.53343242 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

11 0.53305769 99 nips-2005-Integrate-and-Fire models with adaptation are good enough

12 0.53131455 169 nips-2005-Saliency Based on Information Maximization

13 0.52152026 20 nips-2005-Affine Structure From Sound

14 0.5138703 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

15 0.5128895 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

16 0.51117498 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

17 0.50375682 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

18 0.4993408 45 nips-2005-Conditional Visual Tracking in Kernel Space

19 0.49924377 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

20 0.49732816 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account