nips nips2005 knowledge-graph by maker-knowledge-mining

NIPS 2005 knowledge graph


Similar papers computed by TF-IDF model


Similar papers computed by LSI model


Similar papers computed by LDA model


Papers list:

1 nips-2005-AER Building Blocks for Multi-Layer Multi-Chip Neuromorphic Vision Systems

Author: R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco, R. Paz-Vicente, F. Gomez-Rodriguez, H. Kolle Riis, T. Delbruck, S. C. Liu, S. Zahnd, A. M. Whatley, R. Douglas, P. Hafliger, G. Jimenez-Moreno, A. Civit, T. Serrano-Gotarredona, A. Acosta-Jimenez, B. Linares-Barranco

Abstract: A 5-layer neuromorphic vision processor whose components communicate spike events asynchronously using the address-event representation (AER) is demonstrated. The system includes a retina chip, two convolution chips, a 2D winner-take-all chip, a delay line chip, a learning classifier chip, and a set of PCBs for computer interfacing and address space remappings. The components use a mixture of analog and digital computation and will learn to classify trajectories of a moving object. A complete experimental setup and measurement results are shown.

2 nips-2005-A Bayes Rule for Density Matrices

Author: Manfred K. Warmuth

Abstract: The classical Bayes rule computes the posterior model probability from the prior probability and the data likelihood. We generalize this rule to the case when the prior is a density matrix (symmetric positive definite and trace one) and the data likelihood a covariance matrix. The classical Bayes rule is retained as the special case when the matrices are diagonal. In the classical setting, the calculation of the probability of the data is an expected likelihood, where the expectation is over the prior distribution. In the generalized setting, this is replaced by an expected variance calculation, where the variance is computed along the eigenvectors of the prior density matrix and the expectation is over the eigenvalues of the density matrix (which form a probability vector). The variance along any direction is determined by the covariance matrix. Curiously enough, this expected variance calculation is a quantum measurement, where the covariance matrix specifies the instrument and the prior density matrix the mixture state of the particle. We motivate both the classical and the generalized Bayes rule with a minimum relative entropy principle, where the Kullback-Leibler version gives the classical Bayes rule and Umegaki's quantum relative entropy the new Bayes rule for density matrices.
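A minimal numerical sketch of the expected-variance calculation the abstract describes (the variable names and example matrices are ours): the generalized data probability is the variance of the covariance matrix W along each eigenvector of the prior density matrix D, averaged under D's eigenvalues, which collapses to tr(DW) and reduces to the classical expected likelihood when both matrices are diagonal.

```python
import numpy as np

# Prior as a density matrix D (symmetric positive definite, trace one)
# and data likelihood as a covariance matrix W; both are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
D = A @ A.T
D /= np.trace(D)                     # normalize to trace one
W = np.diag([0.5, 1.0, 2.0])         # covariance matrix: the "instrument"

# Expected variance: variance of W along each eigenvector of D,
# averaged under D's eigenvalues (which form a probability vector).
lam, U = np.linalg.eigh(D)
expected_var = sum(l * (u @ W @ u) for l, u in zip(lam, U.T))

# The same quantity in quantum-measurement (Born-rule) form: tr(D W).
assert np.isclose(expected_var, np.trace(D @ W))

# Diagonal special case: eigenvalues act as prior probabilities and the
# diagonal of W as likelihoods, recovering the classical expected likelihood.
p, lik = np.array([0.2, 0.3, 0.5]), np.array([0.5, 1.0, 2.0])
assert np.isclose(np.trace(np.diag(p) @ np.diag(lik)), p @ lik)
```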

3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

Author: Odelia Schwartz, Peter Dayan, Terrence J. Sejnowski

Abstract: The misjudgement of tilt in images lies at the heart of entertaining visual illusions and rigorous perceptual psychophysics. A wealth of findings has attracted many mechanistic models, but few clear computational principles. We adopt a Bayesian approach to perceptual tilt estimation, showing how a smoothness prior offers a powerful way of addressing much confusing data. In particular, we faithfully model recent results showing that confidence in estimation can be systematically affected by the same aspects of images that affect bias. Confidence is central to Bayesian modeling approaches, and is applicable in many other perceptual domains.

Perceptual anomalies and illusions, such as the misjudgements of motion and tilt evident in so many psychophysical experiments, have intrigued researchers for decades.1–3 A Bayesian view4–8 has been particularly influential in models of motion processing, treating such anomalies as the normative product of prior information (often statistically codifying Gestalt laws) with likelihood information from the actual scenes presented. Here, we expand the range of statistically normative accounts to tilt estimation, for which there are classes of results (on estimation confidence) that are so far not available for motion. The tilt illusion arises when the perceived tilt of a center target is misjudged (ie bias) in the presence of flankers. Another phenomenon, called Crowding, refers to a loss in the confidence (ie sensitivity) of perceived target tilt in the presence of flankers. Attempts have been made to formalize these phenomena quantitatively. Crowding has been modeled as compulsory feature pooling (ie averaging of orientations), ignoring spatial positions.9, 10 The tilt illusion has been explained by lateral interactions11, 12 in populations of orientation-tuned units; and by calibration.13 However, most models of this form cannot explain a number of crucial aspects of the data. First, the geometry of the positional arrangement of the stimuli affects attraction versus repulsion in bias, as emphasized by Kapadia et al14 (figure 1A), and others.15, 16 Second, Solomon et al. recently measured bias and sensitivity simultaneously.11 The rich and surprising range of sensitivities, far from flat as a function of flanker angles (figure 1B), is outside the reach of standard models. Moreover, current explanations do not offer a computational account of tilt perception as the outcome of a normative inference process. Here, we demonstrate that a Bayesian framework for orientation estimation, with a prior favoring smoothness, can naturally explain a range of seemingly puzzling tilt data. We explicitly consider both the geometry of the stimuli, and the issue of confidence in the estimation.

Figure 1 (plots omitted): Tilt biases and sensitivities in visual perception. (A) Kapadia et al demonstrated the importance of geometry on tilt bias, with bar stimuli in the fovea (and similar results in the periphery). When 5 degrees clockwise flankers are arranged colinearly, the center target appears attracted in the direction of the flankers; when flankers are lateral, the target appears repulsed. Data are an average of 5 subjects.14 (B) Solomon et al measured both biases and sensitivities for gratings in the visual periphery.11 On the top are example stimuli, with flankers tilted 22.5 degrees clockwise.
This constitutes the classic tilt illusion, with a repulsive bias percept. In addition, sensitivities vary as a function of flanker angles, in a systematic way (even in cases when there are no biases at all). Sensitivities are given in units of the inverse of standard deviation of the tilt estimate. More detailed data for both experiments are shown in the results section.

Bayesian analyses have most frequently been applied to bias. Much less attention has been paid to the equally important phenomenon of sensitivity. This aspect of our model should be applicable to other perceptual domains. In section 1 we formulate the Bayesian model. The prior is determined by the principle of creating a smooth contour between the target and flankers. We describe how to extract the bias and sensitivity. In section 2 we show experimental data of Kapadia et al and Solomon et al, alongside the model simulations, and demonstrate that the model can account for both geometry, and bias and sensitivity measurements in the data. Our results suggest a more unified, rational, approach to understanding tilt perception.

1 Bayesian model

Under our Bayesian model, inference is controlled by the posterior distribution over the tilt of the target element. This comes from the combination of a prior favoring smooth configurations of the flankers and target, and the likelihood associated with the actual scene. A complete distribution would consider all possible angles and relative spatial positions of the bars, and marginalize the posterior over all but the tilt of the central element. For simplicity, we make two benign approximations: conditionalizing over (ie clamping) the angles of the flankers, and exploring only a small neighborhood of their positions. We now describe the steps of inference.

Smoothness prior: Under these approximations, we consider a given actual configuration (see fig 2A) of flankers f1 = (φ1, x1), f2 = (φ2, x2) and center target c = (φc, xc), arranged from top to bottom. We have to generate a prior over φc and δ1 = x1 − xc and δ2 = x2 − xc based on the principle of smoothness. As a less benign approximation, we do this in two stages: articulating a principle that determines a single optimal configuration; and generating a prior as a mixture of a Gaussian about this optimum and a uniform distribution, with the mixing proportion of the latter being determined by the smoothness of the optimum. Smoothness has been extensively studied in the computer vision literature.17–20

Figure 2 (plots omitted): Geometry and smoothness for flankers, f1 and f2, and center target, c. (A) Example actual configuration of flankers and target, aligned along the y axis from top to bottom. (B) The elastica procedure can rotate the target angle (to Φc) and shift the relative flanker and target positions on the x axis (to δ1 and δ2) in its search for the maximally smooth solution. Small spatial shifts (up to 1/15 the size of R) of positions are allowed, but positional shift is overemphasized in the figure for visibility. (C) Top: center tilt that results in maximal smoothness, as a function of flanker tilt. Boxed cartoons show examples, for given flanker tilts, of the optimally smooth configuration.
Note attraction of target towards flankers for small flanker angles; here flankers and target are positioned in a nearly colinear arrangement. Note also repulsion of target away from flankers for intermediate flanker angles. Bottom: P[c, f1, f2] for the center tilt that yields maximal smoothness. The y axis is normalized between 0 and 1.

One widely used principle, elastica, known even to Euler, has been applied to contour completion21 and other computer vision applications.17 The basic idea is to find the curve with minimum energy (ie, square of curvature). Sharon et al19 showed that the elastica function can be well approximated by a number of simpler forms. We adopt a version that Leung and Malik18 adopted from Sharon et al.19 We assume that the probability for completing a smooth curve can be factorized into two terms:

P[c, f1, f2] = G(c, f1) G(c, f2)    (1)

with the term G(c, f1) (and similarly, G(c, f2)) written as:

G(c, f1) = exp(−R/σR − Dβ/σβ),  where  Dβ = β1² + βc² − β1βc    (2)

and β1 (and similarly, βc) is the angle between the orientation at f1 and the line joining f1 and c. The distance between the centers of f1 and c is given by R. The two constants, σβ and σR, control the relative contribution to smoothness of the angle versus the spatial distance. Here, we set σβ = 1, and σR = 1.5. Figure 2B illustrates an example geometry, in which φc, δ1, and δ2 have been shifted from the actual scene (of figure 2A).

We now estimate the smoothest solution for given configurations. Figure 2C shows, for given flanker tilts, the center tilt that yields maximal smoothness, and the corresponding probability of smoothness. For near vertical flankers, the spatial lability leads to very weak attraction and high probability of smoothness. As the flanker angle deviates farther from vertical, there is a large repulsion, but also lower probability of smoothness. These observations are key to our model: the maximally smooth center tilt will influence attractive and repulsive interactions of tilt estimation; the probability of smoothness will influence the relative weighting of the prior versus the likelihood.

From the smoothness principle, we construct a two dimensional prior (figure 3A). One dimension represents tilt, the other dimension, the overall positional shift between target and flankers (called 'position').
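The smoothness terms above translate directly into a few lines of code. This is a sketch under the paper's stated settings (σβ = 1, σR = 1.5); the function names and example configurations are ours:

```python
import numpy as np

def G(beta_f, beta_c, R, sigma_beta=1.0, sigma_R=1.5):
    """Pairwise smoothness term of equation 2. beta_f, beta_c: angles
    (radians) between each element's orientation and the line joining
    flanker and center; R: distance between their centers."""
    D_beta = beta_f**2 + beta_c**2 - beta_f * beta_c
    return np.exp(-R / sigma_R - D_beta / sigma_beta)

def smoothness(b1, bc1, R1, b2, bc2, R2):
    """P[c, f1, f2] = G(c, f1) G(c, f2) of equation 1."""
    return G(b1, bc1, R1) * G(b2, bc2, R2)

# Nearly colinear flankers: small angles, high probability of smoothness.
print(smoothness(0.05, 0.05, 1.0, 0.05, 0.05, 1.0))
# Strongly tilted flankers: large angles, low probability of smoothness.
print(smoothness(1.0, 0.8, 1.0, 1.0, 0.8, 1.0))
```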
Figure 3 (plots omitted): Bayes model for example flankers and target. (A) Prior 2D distribution for flankers set at 22.5 degrees (note repulsive preference for -5.5 degrees). (B) Likelihood 2D distribution for a target tilt of 3 degrees; (C) Posterior 2D distribution. All 2D distributions are drawn on the same grayscale range, and the presence of a larger baseline in the prior causes it to appear more dimmed. (D) Marginalized posterior, resulting in 1D distribution over tilt. Dashed line represents the mean, with slight preference for negative angle. (E) For this target tilt, we calculate probability clockwise, and obtain one point on the psychometric curve.

The prior is a 2D Gaussian distribution, sat upon a constant baseline.22 The Gaussian is centered at the estimated smoothest target angle and relative position, and the baseline is determined by the probability of smoothness. The baseline, and its dependence on the flanker orientation, is a key difference from Weiss et al's Gaussian prior for smooth, slow motion. It can be seen as a mechanism to allow segmentation (see Posterior description below). The standard deviation of the Gaussian is a free parameter.

Likelihood: The likelihood over tilt and position (figure 3B) is determined by a 2D Gaussian distribution with an added baseline.22 The Gaussian is centered at the actual target tilt; and at a position taken as zero, since this is the actual position, to which the prior is compared. The standard deviation and baseline constant are free parameters.

Posterior and marginalization: The posterior comes from multiplying likelihood and prior (figure 3C) and then marginalizing over position to obtain a 1D distribution over tilt. Figure 3D shows an example in which this distribution is bimodal. Other likelihoods, with closer agreement between target and smooth prior, give unimodal distributions. Note that the bimodality is a direct consequence of having an added baseline to the prior and likelihood (if these were Gaussian without a baseline, the posterior would always be Gaussian). The viewer is effectively assessing whether the target is associated with the same object as the flankers, and this is reflected in the baseline, and consequently, in the bimodality, and confidence estimate. We define α as the mean angle of the 1D posterior distribution (eg, value of dashed line on the x axis), and β as the height of the probability distribution at that mean angle (eg, height of dashed line). The term β is an indication of confidence in the angle estimate, where for larger values we are more certain of the estimate.

Decision of probability clockwise: The probability of a clockwise tilt is estimated from the marginalized posterior:

P = 1 / (1 + exp(−αk − log(β + η)))    (3)

where α and β are defined as above, k is a free parameter and η a small constant. Free parameters are set to a single constant value for all flanker and center configurations. Weiss et al use a similar compressive nonlinearity, but without the term β. We also tried a decision function that integrates the posterior, but the resulting curves were far from the sigmoidal nature of the data.

Bias and sensitivity: For one target tilt, we generate a single probability and therefore a single point on the psychometric function relating tilt to the probability of choosing clockwise. We generate the full psychometric curve from all target tilts and fit to it a cumulative Gaussian distribution N(µ, σ) (figure 3E).
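To make the decision and fitting stages concrete, here is a sketch that applies equation 3 to a toy marginalized posterior and then extracts bias and sensitivity with the cumulative-Gaussian fit just described (the toy posterior and helper names are ours; k and η follow the text's free parameters):

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def prob_clockwise(grid, posterior, k=2.5, eta=1e-3):
    """Equation 3: alpha is the posterior mean angle, beta its height."""
    posterior = posterior / np.trapz(posterior, grid)
    alpha = np.trapz(grid * posterior, grid)      # mean angle
    beta = np.interp(alpha, grid, posterior)      # height at the mean
    return 1.0 / (1.0 + np.exp(-alpha * k - np.log(beta + eta)))

# One point on the psychometric curve per target tilt.
grid = np.linspace(-30.0, 30.0, 601)
tilts = np.arange(-10.0, 11.0, 2.0)
p_cw = [prob_clockwise(grid, np.exp(-0.5 * ((grid - t) / 4.0) ** 2))
        for t in tilts]

# Fit a cumulative Gaussian N(mu, sigma): mu is the bias and 1/sigma
# the sensitivity, mimicking the psychophysical procedure.
(mu, sigma), _ = curve_fit(lambda x, m, s: norm.cdf(x, m, s),
                           tilts, p_cw, p0=(0.0, 3.0))
print(f"bias = {mu:.2f} deg, sensitivity = {1/sigma:.2f} 1/deg")
```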
Figure 4 (plots omitted): Kapadia et al data,14 versus Bayesian model. Solid lines are fits to a cumulative Gaussian distribution. (A) Flankers are tilted 5 degrees clockwise (black curve) or anti-clockwise (gray) of vertical, and positioned spatially in a colinear arrangement. The center bar appears tilted in the direction of the flankers (attraction), as can be seen by the attractive shift of the psychometric curve. The boxed stimuli cartoon illustrates a vertical target amidst the flankers. (B) Model for colinear bars also produces attraction. (C) Data and (D) model for lateral flankers result in repulsion. All data are collected in the fovea for bars.

The mean µ of the fit corresponds to the bias, and 1/σ to the sensitivity, or confidence in the bias. The fit to a cumulative Gaussian and extraction of these parameters exactly mimic psychophysical procedures.11

2 Results: data versus model

We first consider the geometry of the center and flanker configurations, modeling the full psychometric curve for colinear and parallel flanks (recall that figure 1A showed summary biases). Figure 4A,B demonstrates attraction in the data and model; that is, the psychometric curve is shifted towards the flanker, because of the nature of smooth completions for colinear flankers. Figure 4C,D shows repulsion in the data and model. In this case, the flankers are arranged laterally instead of colinearly. The smoothest solution in the model arises by shifting the target estimate away from the flankers. This shift is rather minor, because the configuration has a low probability of smoothness (similar to figure 2C), and thus the prior exerts only a weak effect.

The above results show examples of changes in the psychometric curve, but do not address both bias and, particularly, sensitivity, across a whole range of flanker configurations. Figure 5 depicts biases and sensitivities from Solomon et al, versus the Bayes model. The data are shown for a representative subject, but the qualitative behavior is consistent across all subjects tested. In figure 5A, bias is shown, for the condition that both flankers are tilted at the same angle. The data exhibit small attraction at near vertical flanker angles (this arrangement is close to colinear); large repulsion at intermediate flanker angles of 22.5 and 45 degrees from vertical; and minimal repulsion at large angles from vertical. This behavior is also exhibited in the Bayes model (Figure 5B). For intermediate flanker angles, the smoothest solution in the model is repulsive, and the effect of the prior is strong enough to induce a significant repulsion. For large angles, the prior exerts almost no effect.

Interestingly, sensitivity is far from flat in both data and model. In the data (Figure 5C), there is most loss in sensitivity at intermediate flanker angles of 22.5 and 45 degrees (ie, the subject is less certain); and sensitivity is higher for near vertical or near horizontal flankers. The model shows the same qualitative behavior (Figure 5D). In the model, there are two factors driving sensitivity: one is the probability of completing a smooth curvature for a given flanker configuration, as in Figure 2B; this determines the strength of the prior.
The other factor is certainty in a particular center estimation; this is determined by β, derived from the posterior distribution, and incorporated into the decision stage of the model (equation 3).

Figure 5 (plots omitted): Solomon et al data11 (subject FF), versus Bayesian model. (A) Data and (B) model biases with same-tilted flankers; (C) data and (D) model sensitivities with same-tilted flankers; (E;G) data and (F;H) model as above, but for opposite-tilted flankers (note that opposite-tilted data was collected for fewer flanker angles). Each point in the figure is derived by fitting a cumulative Gaussian distribution N(µ, σ) to the corresponding psychometric curve, and setting bias equal to µ and sensitivity to 1/σ. In all experiments, flanker and target gratings are presented in the visual periphery. Both data and model stimuli are averages of two configurations, on the left hand side (9 o'clock position) and right hand side (3 o'clock position). The configurations are similar to Figure 1(B), but slightly shifted according to an iso-eccentric circle, so that all stimuli are similarly visible in the periphery.

For flankers that are far from vertical, the prior has minimal effect because one cannot find a smooth solution (eg, the likelihood dominates), and thus sensitivity is higher. The low sensitivity at intermediate angles arises because the prior has considerable effect; and there is conflict between the prior (tilt, position), and likelihood (tilt, position). This leads to uncertainty in the target angle estimation. For flankers near vertical, the prior exerts a strong effect; but there is less conflict between the likelihood and prior estimates (tilt, position) for a vertical target. This leads to more confidence in the posterior estimate, and therefore, higher sensitivity. The only aspect that our model does not reproduce is the (more subtle) sensitivity difference between 0 and +/- 5 degree flankers.

Figure 5E-H depict data and model for opposite-tilted flankers. The bias is now close to zero in the data (Figure 5E) and model (Figure 5F), as would be expected (since the maximally smooth angle is now always roughly vertical). Perhaps more surprisingly, the sensitivities continue to be non-flat in the data (Figure 5G) and model (Figure 5H). This behavior arises in the model due to the strength of the prior, and positional uncertainty. As before, there is most loss in sensitivity at intermediate angles.

Note that to fit Kapadia et al, simulations used a constant parameter of k = 9 in equation 3, whereas for the Solomon et al simulations, k = 2.5. This indicates that, in our model, there was higher confidence in the foveal experiments than in the peripheral ones.
3 Discussion

We applied a Bayesian framework to the widely studied tilt illusion, and demonstrated the model on examples from two different data sets involving foveal and peripheral estimation. Our results support the appealing hypothesis that perceptual misjudgements are not a consequence of poor system design, but rather can be described as optimal inference.4–8 Our model accounts correctly for both attraction and repulsion, determined by the smoothness prior and the geometry of the scene.

We emphasized the issue of estimation confidence. The dataset showing how confidence is affected by the same issues that affect bias11 was exactly appropriate for a Bayesian formulation; other models in the literature typically do not incorporate confidence in a thoroughly probabilistic manner. In fact, our model fits the confidence (and bias) data more proficiently than an account based on lateral interactions among a population of orientation-tuned cells.11 Other Bayesian work, by Stocker et al,6 utilized the full slope of the psychometric curve in fitting a prior and likelihood to motion data, but did not examine the issue of confidence. Estimation confidence plays a central role in Bayesian formulations as a whole. Understanding how priors affect confidence should have direct bearing on many other Bayesian calculations such as multimodal integration.23

Our model is obviously over-simplified in a number of ways. First, we described it in terms of tilts and spatial positions; a more complete version should work in the pixel/filtering domain.18, 19 We have also only considered two flanking elements; the model is extendible to a full-field surround, whereby smoothness operates along a range of geometric directions, and some directions are more (smoothly) dominant than others. Second, the prior is constructed by summarizing the maximal smoothness information; a more probabilistically correct version should capture the full probability of smoothness in its prior. Third, our model does not incorporate a formal noise representation; however, sensitivities could be influenced both by stimulus-driven noise and confidence. Fourth, our model does not address attraction in the so-called indirect tilt illusion, thought to be mediated by a different mechanism. Finally, we have yet to account for neurophysiological data within this framework, and incorporate constraints at the neural implementation level. However, versions of our computations are often suggested for intra-areal and feedback cortical circuits; and smoothness principles form a key part of the association field connection scheme in Li's24 dynamical model of contour integration in V1.

Our model is connected to a wealth of literature in computer vision and perception. Notably, occlusion and contour completion might be seen as the extreme example in which there is no likelihood information at all for the center target; a host of papers have shown that under these circumstances, smoothness principles such as elastica and variants explain many aspects of perception. The model is also associated with many studies on contour integration motivated by Gestalt principles;25, 26 and exploration of natural scene statistics and Gestalt,27, 28 including the relation to contour grouping within a Bayesian framework.29, 30 Indeed, our model could be modified to include a prior from natural scenes. There are various directions for the experimental test and refinement of our model.
Most pressing is to determine bias and sensitivity for different center and flanker contrasts. As in the case of motion, our model predicts that when there is more uncertainty in the center element, prior information is more dominant. Another interesting test would be to design a task such that the center element is actually part of a different figure and unrelated to the flankers; our framework predicts that there would be minimal bias, because of segmentation. Our model should also be applied to other tilt-based illusions such as the Fraser spiral and the Zöllner illusion. Finally, our model can be applied to other perceptual domains;31 and given the apparent similarities between the tilt illusion and the tilt after-effect, we plan to extend the model to adaptation, by considering smoothness in time as well as space.

Acknowledgements
This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Serge Belongie, Leanne Chukoskie, Philip Meier and Joshua Solomon for helpful discussions.

References
[1] J J Gibson. Adaptation, after-effect, and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20:553–569, 1937.
[2] C Blakemore, R H S Carpenter, and M A Georgeson. Lateral inhibition between orientation detectors in the human visual system. Nature, 228:37–39, 1970.
[3] J A Stuart and H M Burian. A study of separation difficulty: Its relationship to visual acuity in normal and amblyopic eyes. American Journal of Ophthalmology, 53:471–477, 1962.
[4] A Yuille and H H Bulthoff. Perception as bayesian inference. In Knill and Whitman, editors, Bayesian decision theory and psychophysics, pages 123–161. Cambridge University Press, 1996.
[5] Y Weiss, E P Simoncelli, and E H Adelson. Motion illusions as optimal percepts. Nature Neuroscience, 5:598–604, 2002.
[6] A Stocker and E P Simoncelli. Constraining a bayesian model of human visual speed perception. Adv in Neural Info Processing Systems, 17, 2004.
[7] D Kersten, P Mamassian, and A Yuille. Object perception as bayesian inference. Annual Review of Psychology, 55:271–304, 2004.
[8] K Kording and D Wolpert. Bayesian integration in sensorimotor learning. Nature, 427:244–247, 2004.
[9] L Parkes, J Lund, A Angelucci, J Solomon, and M Morgan. Compulsory averaging of crowded orientation signals in human vision. Nature Neuroscience, 4:739–744, 2001.
[10] D G Pelli, M Palomares, and N J Majaj. Crowding is unlike ordinary masking: Distinguishing feature integration from detection. Journal of Vision, 4:1136–1169, 2002.
[11] J Solomon, F M Felisberti, and M Morgan. Crowding and the tilt illusion: Toward a unified account. Journal of Vision, 4:500–508, 2004.
[12] J A Bednar and R Miikkulainen. Tilt aftereffects in a self-organizing model of the primary visual cortex. Neural Computation, 12:1721–1740, 2000.
[13] C W Clifford, P Wenderoth, and B Spehar. A functional angle on some after-effects in cortical vision. Proc Biol Sci, 1454:1705–1710, 2000.
[14] M K Kapadia, G Westheimer, and C D Gilbert. Spatial distribution of contextual interactions in primary visual cortex and in visual perception. J Neurophysiology, 84:2048–2062, 2000.
[15] C C Chen and C W Tyler. Lateral modulation of contrast discrimination: Flanker orientation effects. Journal of Vision, 2:520–530, 2002.
[16] I Mareschal, M P Sceniak, and R M Shapley. Contextual influences on orientation discrimination: binding local and global cues. Vision Research, 41:1915–1930, 2001.
[17] D Mumford. Elastica and computer vision. In Chandrajit Bajaj, editor, Algebraic Geometry and its Applications. Springer Verlag, 1994.
[18] T K Leung and J Malik. Contour continuity in region based image segmentation. In Proc. ECCV, pages 544–559, 1998.
[19] E Sharon, A Brandt, and R Basri. Completion energies and scale. IEEE Pat. Anal. Mach. Intell., 22(10), 1997.
[20] S W Zucker, C David, A Dobbins, and L Iverson. The organization of curve detection: coarse tangent fields. Computer Graphics and Image Processing, 9(3):213–234, 1988.
[21] S Ullman. Filling in the gaps: the shape of subjective contours and a model for their generation. Biological Cybernetics, 25:1–6, 1976.
[22] G E Hinton and A D Brown. Spiking Boltzmann machines. Adv in Neural Info Processing Systems, 12, 1998.
[23] R A Jacobs. What determines visual cue reliability? Trends in Cognitive Sciences, 6:345–350, 2002.
[24] Z Li. A saliency map in primary visual cortex. Trends in Cognitive Sciences, 6:9–16, 2002.
[25] D J Field, A Hayes, and R F Hess. Contour integration by the human visual system: evidence for a local "association field". Vision Research, 33:173–193, 1993.
[26] J Beck, A Rosenfeld, and R Ivry. Line segregation. Spatial Vision, 4:75–101, 1989.
[27] M Sigman, G A Cecchi, C D Gilbert, and M O Magnasco. On a common circle: Natural scenes and gestalt rules. PNAS, 98(4):1935–1940, 2001.
[28] S Mahamud, L R Williams, K K Thornber, and K Xu. Segmentation of multiple salient closed contours from real images. IEEE Pat. Anal. Mach. Intell., 25(4):433–444, 1997.
[29] W S Geisler, J S Perry, B J Super, and D P Gallogly. Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 6:711–724, 2001.
[30] J H Elder and R M Goldberg. Ecological statistics of gestalt laws for the perceptual organization of contours. Journal of Vision, 4:324–353, 2002.
[31] S R Lehky and T J Sejnowski. Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10:2281–2299, 1990.

4 nips-2005-A Bayesian Spatial Scan Statistic

Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper

Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping the number of false positives low.
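As a rough illustration of what a spatial scan does, here is a sketch of the generic frequentist scan with a Poisson likelihood-ratio score; this is not the paper's Bayesian statistic (which replaces the score with a marginal likelihood and prior), and the data and rectangular region shapes are toy assumptions:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
baseline = np.full((8, 8), 5.0)                    # expected counts per cell
counts = rng.poisson(baseline)
counts[2:4, 3:6] += rng.poisson(4.0, size=(2, 3))  # injected "outbreak"

def poisson_llr(c, b):
    # Log likelihood ratio for an elevated Poisson rate inside a region.
    return c * np.log(c / b) + b - c if c > b else 0.0

best, region = -np.inf, None
for x1, x2, y1, y2 in product(range(8), repeat=4):  # all rectangles
    if x2 < x1 or y2 < y1:
        continue
    c = counts[x1:x2 + 1, y1:y2 + 1].sum()
    b = baseline[x1:x2 + 1, y1:y2 + 1].sum()
    if (score := poisson_llr(c, b)) > best:
        best, region = score, (x1, x2, y1, y2)
print(region, best)  # frequentist scan still needs randomization testing
```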

5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

6 nips-2005-A Connectionist Model for Constructive Modal Reasoning

Author: Artur Garcez, Luis C. Lamb, Dov M. Gabbay

Abstract: We present a new connectionist model for constructive, intuitionistic modal reasoning. We use ensembles of neural networks to represent intuitionistic modal theories, and show that for each intuitionistic modal program there exists a corresponding neural network ensemble that computes the program. This provides a massively parallel model for intuitionistic modal reasoning, and sets the scene for integrated reasoning, knowledge representation, and learning of intuitionistic theories in neural networks, since the networks in the ensemble can be trained by examples using standard neural learning algorithms.

7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

Author: David Arathorn

Abstract: Recent neurophysiological evidence suggests the ability to interpret biological motion is facilitated by a neuronal

8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity

Author: Robert A. Legenstein, Wolfgang Maass

Abstract: We investigate under what conditions a neuron can learn by experimentally supported rules for spike timing dependent plasticity (STDP) to predict the arrival times of strong “teacher inputs” to the same neuron. It turns out that in contrast to the famous Perceptron Convergence Theorem, which predicts convergence of the perceptron learning rule for a simplified neuron model whenever a stable solution exists, no equally strong convergence guarantee can be given for spiking neurons with STDP. But we derive a criterion on the statistical dependency structure of input spike trains which characterizes exactly when learning with STDP will converge on average for a simple model of a spiking neuron. This criterion is reminiscent of the linear separability criterion of the Perceptron Convergence Theorem, but it applies here to the rows of a correlation matrix related to the spike inputs. In addition we show through computer simulations for more realistic neuron models that the resulting analytically predicted positive learning results not only hold for the common interpretation of STDP where STDP changes the weights of synapses, but also for a more realistic interpretation suggested by experimental data where STDP modulates the initial release probability of dynamic synapses.

9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

Author: Zhenyue Zhang, Hongyuan Zha

Abstract: We propose a fast manifold learning algorithm based on the methodology of domain decomposition. Starting with the set of sample points partitioned into two subdomains, we develop the solution of the interface problem that can glue the embeddings on the two subdomains into an embedding on the whole domain. We provide a detailed analysis to assess the errors produced by the gluing process using matrix perturbation theory. Numerical examples are given to illustrate the efficiency and effectiveness of the proposed methods.

10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

Author: Sören Sonnenburg, Gunnar Rätsch, Christin Schäfer

Abstract: While classical kernel-based learning algorithms are based on a single kernel, in practice it is often desirable to use multiple kernels. Lanckriet et al. (2004) considered conic combinations of kernel matrices for classification, leading to a convex quadratically constrained quadratic program. We show that it can be rewritten as a semi-infinite linear program that can be efficiently solved by recycling standard SVM implementations. Moreover, we generalize the formulation and our method to a larger class of problems, including regression and one-class classification. Experimental results show that the proposed algorithm helps automatic model selection, improves the interpretability of the learning result, and scales to hundreds of thousands of examples or hundreds of kernels to be combined.
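A small sketch of the object being optimized, a conic (nonnegative) combination of kernel matrices; the kernels, weights, and data here are illustrative, and the paper's semi-infinite linear program for learning the weights is not reproduced:

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Pairwise squared distances -> Gaussian (RBF) kernel matrix.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
kernels = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]
beta = np.array([0.2, 0.5, 0.3])        # nonnegative mixing weights

K = sum(b * Kj for b, Kj in zip(beta, kernels))

# The conic combination stays positive semidefinite, so any standard
# SVM solver can consume K as a precomputed kernel.
assert np.linalg.eigvalsh(K).min() > -1e-8
print(K.shape)
```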

11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

Author: Long Zhu, Alan L. Yuille

Abstract: We describe a hierarchical compositional system for detecting deformable objects in images. Objects are represented by graphical models. The algorithm uses a hierarchical tree where the root of the tree corresponds to the full object and lower-level elements of the tree correspond to simpler features. The algorithm proceeds by passing simple messages up and down the tree. The method works rapidly, in under a second, on 320 × 240 images. We demonstrate the approach on detecting cats, horses, and hands. The method works in the presence of background clutter and occlusions. Our approach is contrasted with more traditional methods such as dynamic programming and belief propagation.

12 nips-2005-A PAC-Bayes approach to the Set Covering Machine

Author: François Laviolette, Mario Marchand, Mohak Shah

Abstract: We design a new learning algorithm for the Set Covering Machine from a PAC-Bayes perspective and propose a PAC-Bayes risk bound which is minimized for classifiers achieving a non-trivial margin-sparsity trade-off.

13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

Author: Rong Jin, Feng Kang, Chris H. Ding

Abstract: Spectral clustering enjoys success in both data clustering and semi-supervised learning. However, most spectral clustering algorithms cannot handle multi-class clustering problems directly. Additional strategies are needed to extend spectral clustering algorithms to multi-class clustering problems. Furthermore, most spectral clustering algorithms employ hard cluster membership, which is likely to be trapped in local optima. In this paper, we present a new spectral clustering algorithm, named “Soft Cut”. It improves the normalized cut algorithm by introducing soft membership, and can be efficiently computed using a bound optimization algorithm. Our experiments with a variety of datasets have shown the promising performance of the proposed clustering algorithm.

14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification

Author: Yves Grandvalet, Johnny Mariethoz, Samy Bengio

Abstract: In this paper, we show that the hinge loss can be interpreted as the neg-log-likelihood of a semi-parametric model of posterior probabilities. From this point of view, SVMs represent the parametric component of a semi-parametric model fitted by a maximum a posteriori estimation procedure. This connection enables us to derive a mapping from SVM scores to estimated posterior probabilities. Unlike previous proposals, the suggested mapping is interval-valued, providing a set of posterior probabilities compatible with each SVM score. This framework offers a new way to adapt the SVM optimization problem to unbalanced classification, when decisions result in unequal (asymmetric) losses. Experiments show improvements over state-of-the-art procedures.

15 nips-2005-A Theoretical Analysis of Robust Coding over Noisy Overcomplete Channels

Author: Eizaburo Doi, Doru C. Balcan, Michael S. Lewicki

Abstract: Biological sensory systems are faced with the problem of encoding a high-fidelity sensory signal with a population of noisy, low-fidelity neurons. This problem can be expressed in information theoretic terms as coding and transmitting a multi-dimensional, analog signal over a set of noisy channels. Previously, we have shown that robust, overcomplete codes can be learned by minimizing the reconstruction error with a constraint on the channel capacity. Here, we present a theoretical analysis that characterizes the optimal linear coder and decoder for one- and two-dimensional data. The analysis allows for an arbitrary number of coding units, thus including both under- and over-complete representations, and provides a number of important insights into optimal coding strategies. In particular, we show how the form of the code adapts to the number of coding units and to different data and noise conditions to achieve robustness. We also report numerical solutions for robust coding of high-dimensional image data and show that these codes are substantially more robust than other image codes such as ICA and wavelets.

16 nips-2005-A matching pursuit approach to sparse Gaussian process regression

Author: Sathiya Keerthi, Wei Chu

Abstract: In this paper we propose a new basis selection criterion for building sparse GP regression models that provides promising gains in accuracy as well as efficiency over previous methods. Our algorithm is much faster than that of Smola and Bartlett, while in generalization it greatly outperforms the information gain approach proposed by Seeger et al., especially on the quality of predictive distributions.

17 nips-2005-Active Bidirectional Coupling in a Cochlear Chip

Author: Bo Wen, Kwabena A. Boahen

Abstract: We present a novel cochlear model implemented in analog very large scale integration (VLSI) technology that emulates nonlinear active cochlear behavior. This silicon cochlea includes outer hair cell (OHC) electromotility through active bidirectional coupling (ABC), a mechanism we proposed in which OHC motile forces, through the microanatomical organization of the organ of Corti, realize the cochlear amplifier. Our chip measurements demonstrate that frequency responses become larger and more sharply tuned when ABC is turned on; the degree of the enhancement decreases with input intensity as ABC includes saturation of OHC forces.

1 Silicon Cochleae

Cochlear models, mathematical and physical, with the shared goal of emulating nonlinear active cochlear behavior, shed light on how the cochlea works if based on cochlear micromechanics. Among the modeling efforts, silicon cochleae have promise in meeting the need for real-time performance and low power consumption. Lyon and Mead developed the first analog electronic cochlea [1], which employed a cascade of second-order filters with exponentially decreasing resonant frequencies. However, the cascade structure suffers from delay and noise accumulation and lacks fault-tolerance. Modeling the cochlea more faithfully, Watts built a two-dimensional (2D) passive cochlea that addressed these shortcomings by incorporating the cochlear fluid using a resistive network [2]. This parallel structure, however, has its own problem: response gain is diminished by interference among the second-order sections' outputs due to the large phase change at resonance [3].

Listening more to biology, our silicon cochlea aims to overcome the shortcomings of existing architectures by mimicking the cochlear micromechanics while including outer hair cell (OHC) electromotility. Although how exactly OHC motile forces boost the basilar membrane's (BM) vibration remains a mystery, cochlear microanatomy provides clues. Based on these clues, we previously proposed a novel mechanism, active bidirectional coupling (ABC), for the cochlear amplifier [4]. Here, we report an analog VLSI chip that implements this mechanism. In essence, our implementation is the first silicon cochlea that employs stimulus enhancement (i.e., active behavior) instead of undamping (i.e., high filter Q [5]).

The paper is organized as follows. In Section 2, we present the hypothesized mechanism (ABC), first described in [4]. In Section 3, we provide a mathematical formulation of the model as the basis of cochlear circuit design. Then we proceed in Section 4 to synthesize the circuit for the cochlear chip. Last, we present chip measurements in Section 5 that demonstrate nonlinear active cochlear behavior.

Figure 1 (diagrams omitted): The inner ear. A Cutaway showing cochlear ducts (adapted from [6]). B Longitudinal view of cochlear partition (CP) (modified from [7]-[8]). Each outer hair cell (OHC) tilts toward the base while the Deiter's cell (DC) on which it sits extends a phalangeal process (PhP) toward the apex. The OHCs' stereocilia and the PhPs' apical ends form the reticular lamina (RL). d is the tilt distance, and the segment size. IHC: inner hair cell.

2 Active Bidirectional Coupling

The cochlea actively amplifies acoustic signals as it performs spectral analysis.
The movement of the stapes sets the cochlear fluid into motion, which passes the stimulus energy onto a certain region of the BM, the main vibrating organ in the cochlea (Figure 1A). From the base to the apex, BM fibers increase in width and decrease in thickness, resulting in an exponential decrease in stiffness which, in turn, gives rise to the passive frequency tuning of the cochlea.

The OHCs' electromotility is widely thought to account for the cochlea's exquisite sensitivity and discriminability. The exact way that OHC motile forces enhance the BM's motion, however, remains unresolved. We propose that the triangular mechanical unit formed by an OHC, a phalangeal process (PhP) extended from the Deiter's cell (DC) on which the OHC sits, and a portion of the reticular lamina (RL), between the OHC's stereocilia end and the PhP's apical tip, plays an active role in enhancing the BM's responses (Figure 1B). The cochlear partition (CP) is divided into a number of segments longitudinally. Each segment includes one DC, one PhP's apical tip and one OHC's stereocilia end, both attached to the RL. Approximating the anatomy, we assume that when an OHC's stereocilia end lies in segment i − 1, its basolateral end lies in the immediately apical segment i. Furthermore, the DC in segment i extends a PhP that angles toward the apex of the cochlea, with its apical end inserted just behind the stereocilia end of the OHC in segment i + 1.

Our hypothesis (ABC) includes both feedforward and feedbackward interactions. On one hand, the feedforward mechanism, proposed in [9], hypothesized that the force resulting from OHC contraction or elongation is exerted onto an adjacent downstream BM segment due to the OHC's basal tilt. On the other hand, the novel insight of the feedbackward mechanism is that the OHC force is delivered onto an adjacent upstream BM segment due to the apical tilt of the PhP extending from the DC's main trunk. In a nutshell, the OHC motile forces, through the microanatomy of the CP, feed forward and backward, in harmony with each other, resulting in bidirectional coupling between BM segments in the longitudinal direction. Specifically, due to the opposite action of OHC forces on the BM and the RL, the motion of BM segment i − 1 reinforces that of segment i while the motion of segment i + 1 opposes that of segment i, as described in detail in [4].

Figure 2 (plots omitted): Wave propagation (WP) and basilar membrane (BM) impedance in the active cochlear model with a 2kHz pure tone (α = 0.15, γ = 0.3). A WP in fluid and BM. B BM impedance Zm (i.e., pressure divided by velocity), normalized by S(x)M(x). Only the resistive component is shown; dot marks peak location.

3 The 2D Nonlinear Active Model

To provide a blueprint for the cochlear circuit design, we formulate a 2D model of the cochlea that includes ABC. Both the cochlea's length (BM) and height (cochlear ducts) are discretized into a number of segments, with the original aspect ratio of the cochlea maintained. In the following expressions, x represents the distance from the stapes along the CP, with x = 0 at the base (or the stapes) and x = L (uncoiled cochlear duct length) at the apex; y represents the vertical distance from the BM, with y = 0 at the BM and y = ±h (cochlear duct radius) at the bottom/top wall.
Provided that the assumption of fluid incompressibility holds, the velocity potential φ of the fluids is required to satisfy ∇²φ(x, y, t) = 0, where ∇² denotes the Laplacian operator. By definition, this potential is related to fluid velocities in the x and y directions: Vx = −∂φ/∂x and Vy = −∂φ/∂y.

The BM is driven by the fluid pressure difference across it. Hence, the BM's vertical motion (with downward displacement being positive) can be described as follows:

Pd(x) + FOHC(x) = S(x)δ(x) + β(x)δ̇(x) + M(x)δ̈(x),    (1)

where S(x) is the stiffness, β(x) is the damping, and M(x) is the mass, per unit area, of the BM; δ is the BM's downward displacement. Pd = ρ ∂(φSV(x, y, t) − φST(x, y, t))/∂t is the pressure difference between the two fluid ducts (the scala vestibuli (SV) and the scala tympani (ST)), evaluated at the BM (y = 0); ρ is the fluid density. The FOHC(x) term combines feedforward and feedbackward OHC forces, described by

FOHC(x) = s0 (tanh(αγS(x)δ(x − d)/s0) − tanh(αS(x)δ(x + d)/s0)),    (2)

where α denotes the OHC motility, expressed as a fraction of the BM stiffness, and γ is the ratio of feedforward to feedbackward coupling, representing relative strengths of the OHC forces exerted on the BM segment through the DC, directly and via the tilted PhP. d denotes the tilt distance, which is the horizontal displacement between the source and the recipient of the OHC force, assumed to be equal for the forward and backward cases. We use the hyperbolic tangent function to model saturation of the OHC forces, the nonlinearity that is evident in physiological measurements [8]; s0 determines the saturation level.

We observed wave propagation in the model and computed the BM's impedance (i.e., the ratio of driving pressure to velocity). Following the semi-analytical approach in [2], we simulated a linear version of the model (without saturation). The traveling wave transitions from long-wave to short-wave before the BM vibration peaks; the wavelength around the characteristic place is comparable to the tilt distance (Figure 2A). The BM impedance's real part (i.e., the resistive component) becomes negative before the peak (Figure 2B). On the whole, inclusion of OHC motility through ABC boosts the traveling wave by pumping energy onto the BM when the wavelength matches the tilt of the OHC and PhP.
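Before moving to circuits, the discretized OHC force of equation 2 is easy to state in code. This is a sketch under the figure's parameter values (α = 0.15, γ = 0.3); the discretization and the example stiffness profile are our assumptions:

```python
import numpy as np

def f_ohc(delta, S, d, alpha=0.15, gamma=0.3, s0=1.0):
    """Equation 2 on a discretized BM: feedforward couples segment i-d
    into i, feedbackward couples i+d into i; tanh saturates the force."""
    fwd = np.zeros_like(delta)
    bwd = np.zeros_like(delta)
    fwd[d:] = np.tanh(alpha * gamma * S[d:] * delta[:-d] / s0)
    bwd[:-d] = np.tanh(alpha * S[:-d] * delta[d:] / s0)
    return s0 * (fwd - bwd)

n = 100
x = np.linspace(0.0, 1.0, n)
S = np.exp(-4.0 * x)                      # exponentially decreasing stiffness
delta = 1e-2 * np.sin(2 * np.pi * 5 * x)  # toy displacement pattern
print(f_ohc(delta, S, d=3)[:5])
```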
4 Analog VLSI Design and Implementation

Based on our mathematical model, which produces realistic responses, we implemented a 2D nonlinear active cochlear circuit in analog VLSI, taking advantage of the 2D nature of silicon chips. We first synthesize a circuit analog of the mathematical model, and then we implement the circuit in the log-domain. We start by synthesizing a passive model, and then extend it to a nonlinear active one by including ABC with saturation.

4.1 Synthesizing the BM Circuit

The model consists of two fundamental parts: the cochlear fluid and the BM. First, we design the fluid element and thus the fluid network. In discrete form, the fluids can be viewed as a grid of elements with a specific resistance that corresponds to the fluid density or mass. Since charge is conserved for a small sheet of resistance and so are particles for a small volume of fluid, we use current to simulate fluid velocity. At the transistor level, the current flowing through the channel of a MOS transistor, operating subthreshold as a diffusive element, can be used for this purpose. Therefore, following the approach in [10], we implement the cochlear fluid network using a diffusor network formed by a 2D grid of nMOS transistors.

Second, we design the BM element and thus the BM. As current represents velocity, we rewrite the BM boundary condition (Equation 1, without the FOHC term):

Iin = S(x) ∫Imem dt + β(x)Imem + M(x)İmem,    (3)

where Iin, obtained by applying the voltage from the diffusor network to the gate of a pMOS transistor, represents the velocity potential scaled by the fluid density. In turn, Imem drives the diffusor network to match the fluid velocity with the BM velocity, δ̇. The FOHC term is dealt with in Section 4.2. Implementing this second-order system requires two state-space variables, which we name Is and Io. And with s = jω, our synthesized BM design (passive) is

τ1 Is s + Is = −Iin + Io,    (4)
τ2 Io s + Io = Iin − bIs,    (5)
Imem = Iin + Is − Io,    (6)

where the two first-order systems are both low-pass filters (LPFs), with time constants τ1 and τ2, respectively; b is a gain factor. Thus, Iin can be expressed in terms of Imem as:

Iin = ((b + 1)/τ1τ2 + ((τ1 + τ2)/τ1τ2)s + s²) Imem / s².

Comparing this expression with the design target (Equation 3) yields the circuit analogs: S(x) = (b + 1)/τ1τ2, β(x) = (τ1 + τ2)/τ1τ2, and M(x) = 1. Note that the mass M(x) is a constant (i.e., 1), which was also the case in our mathematical model simulation. These analogies require that τ1 and τ2 increase exponentially to simulate the exponentially decreasing BM stiffness (and damping); b allows us to achieve a reasonable stiffness for a practical choice of τ1 and τ2 (capacitor size is limited by silicon area).

Figure 3 (schematics omitted): Low-pass filter (LPF) and second-order section circuit design. A Half-LPF circuit. B Complete LPF circuit formed by two half-LPF circuits. C Basilar membrane (BM) circuit. It consists of two LPFs and connects to its neighbors through Is and IT.

4.2 Adding Active Bidirectional Coupling

To include ABC in the BM boundary condition, we replace δ in Equation 2 with ∫Imem dt to obtain

FOHC = rff S(x) T(∫Imem(x − d) dt) − rfb S(x) T(∫Imem(x + d) dt),

where rff = αγ and rfb = α denote the feedforward and feedbackward OHC motility factors, and T denotes saturation. The saturation is applied to the displacement, instead of the force, as this simplifies the implementation. We obtain the integrals by observing that, in the passive design, the state variable Is = −Imem/sτ1. Thus, ∫Imem(x − d) dt = −τ1f Isf and ∫Imem(x + d) dt = −τ1b Isb. Here, Isf and Isb represent the outputs of the first LPF in the upstream and downstream BM segments, respectively; τ1f and τ1b represent their respective time constants. To reduce complexity in implementation, we use τ1 to approximate both τ1f and τ1b as the longitudinal span is small. We obtain the active BM design by replacing Equation 5 with the synthesis result:

τ2 Io s + Io = Iin − bIs + rfb(b + 1)T(−Isb) − rff(b + 1)T(−Isf).

Note that, to implement ABC, we only need to add two currents to the second LPF in the passive system. These currents, Isf and Isb, come from the upstream and downstream neighbors of each segment.

Figure 4 (schematics omitted): Cochlear chip. A Architecture: Two diffusive grids with embedded BM circuits model the cochlea. B Detail: BM circuits exchange currents with their neighbors.
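A quick numerical check of the synthesized passive section (a sketch; the parameter values are arbitrary): eliminating Is and Io from equations 4-6 should reproduce the stated transfer relation and circuit analogs S = (b+1)/τ1τ2 and β = (τ1+τ2)/τ1τ2 with M = 1.

```python
import numpy as np

tau1, tau2, b = 1e-3, 2e-3, 4.0
S = (b + 1) / (tau1 * tau2)
beta = (tau1 + tau2) / (tau1 * tau2)

for w in (1e2, 1e3, 1e4):          # test frequencies, s = jw
    s = 1j * w
    # Equations 4 and 5 as a linear system in (Is, Io), with Iin = 1:
    #   (tau1*s + 1) Is -             Io = -1
    #    b           Is + (tau2*s + 1) Io =  1
    A = np.array([[tau1 * s + 1, -1.0], [b, tau2 * s + 1]])
    Is, Io = np.linalg.solve(A, np.array([-1.0, 1.0]))
    Imem = 1.0 + Is - Io           # equation 6
    # Design target: Iin = (S + beta*s + s^2) Imem / s^2.
    assert np.isclose(1.0 / Imem, (S + beta * s + s**2) / s**2)
print("second-order section matches the design target")
```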
4.3 Class AB Log-domain Implementation

We employ the log-domain filtering technique [11] to realize current-mode operation. In addition, following the approach proposed in [12], we implement the circuit in Class AB to increase dynamic range, reduce the effect of mismatch and lower power consumption. This differential signaling is inspired by the way the biological cochlea works—the vibration of the BM is driven by the pressure difference across it.

Taking a bottom-up strategy, we start by designing a Class AB LPF, a building block for the BM circuit. It is described by

τ(Iout+ − Iout−)s + (Iout+ − Iout−) = Iin+ − Iin−  and  τ(Iout+ Iout−)s + Iout+ Iout− = Iq²,

where Iq sets the geometric mean of the positive and negative components of the output current, and τ sets the time constant. Combining the common-mode constraint with the differential design equation yields the nodal equation for the positive path (the negative path has superscripts + and − swapped):

C V̇out+ = Iτ ((Iin+ − Iin−) + (Iq²/Iout+ − Iout−)) / (Iout+ + Iout−).

This nodal equation suggests the half-LPF circuit shown in Figure 3A. Vout+, the voltage on the positive capacitor (C+), gates a pMOS transistor to produce the corresponding current signal, Iout+ (Vout− and Iout− are similarly related). The bias Vq sets the quiescent current Iq while Vτ determines the current Iτ, which is related to the time constant by τ = CuT/κIτ (κ is the subthreshold slope coefficient and uT is the thermal voltage). Two of these subcircuits, connected in push–pull, form a complete LPF (Figure 3B). The BM circuit is implemented using two LPFs interacting in accordance with the synthesized design equations (Figure 3C). Imem is the combination of three currents, Iin, Is, and Io. Each BM sends out Is and receives IT, a saturated version of its neighbor's Is. The saturation is accomplished by a current-limiting transistor (see Figure 4B), which yields IT = T(Is) = Is Isat/(Is + Isat), where Isat is set by a bias voltage Vsat.

4.4 Chip Architecture

We fabricated a version of our cochlear chip architecture (Figure 4) with 360 BM circuits and two 4680-element fluid grids (360 × 13). This chip occupies 10.9mm² of silicon area in 0.25µm CMOS technology. Differential input signals are applied at the base while the two fluid grids are connected at the apex through a fluid element that represents the helicotrema.

5 Chip Measurements

We carried out two measurements that demonstrate the desired amplification by ABC, and the compressive growth of BM responses due to saturation. To obtain sinusoidal current as the input to the BM subcircuits, we set the voltages applied at the base to be the logarithm of a half-wave rectified sinusoid. We first investigated BM-velocity frequency responses at six linearly spaced cochlear positions (Figure 5). The frequency that maximally excites the first position (Stage 30), defined as its characteristic frequency (CF), is 12.1kHz. The remaining five CFs, from early to later stages, are 8.2k, 1.7k, 905, 366, and 218Hz, respectively. Phase accumulation at the CFs ranges from 0.56 to 2.67π radians, comparable to 1.67π radians in the mammalian cochlea [13]. The Q10 factor (the ratio of the CF to the bandwidth 10dB below the peak) ranges from 1.25 to 2.73, comparable to 2.55 at mid-sound intensity in biology (computed from [13]).
The cutoff slope ranges from −20 to −54 dB/octave, as compared to −85 dB/octave in biology (computed from [13]).

Figure 5: Measured BM-velocity frequency responses at six locations. A Amplitude. B Phase. Dashed lines: biological data (adapted from [13]); dots mark peaks.

We then explored the longitudinal pattern of BM-velocity responses and the effect of ABC. Stimulating the chip with four different pure tones, we obtained responses in which a 4 kHz input elicits a peak around Stage 85, while a 500 Hz tone travels all the way to Stage 178 and peaks there (Figure 6A). We also varied the input voltage level and obtained frequency responses at Stage 100 (Figure 6B). The input voltage level increases linearly, so the current increases exponentially; the input current level (in dB) was estimated from the measured κ for this chip. As expected, we observed linearly increasing responses at low frequencies in the logarithmic plot. In contrast, the responses around the CF increase less and become broader with increasing input level as saturation takes effect in that region (resembling a passive cochlea). We observed 24 dB of compression, as compared to 27 to 47 dB in biology [13]. At the highest intensities, compression also occurs at low frequencies.

Figure 6: Measured BM-velocity responses (cont'd). A Longitudinal responses (20-stage moving average); the peak shifts to earlier (basal) stages as the input frequency increases from 500 Hz to 4 kHz. B Effects of increasing input intensity; responses become broader and show compressive growth.

These chip measurements demonstrate that the inclusion of ABC, simply through coupling of neighboring BM elements, transforms a passive cochlea into an active one. This active cochlear model's nonlinear responses are qualitatively comparable to physiological data.

6 Conclusions

We presented an analog VLSI implementation of a 2D nonlinear cochlear model that utilizes a novel active mechanism, ABC, which we proposed to account for the cochlear amplifier. ABC was shown to pump energy into the traveling wave. Rather than detecting the wave's amplitude and implementing an automatic-gain-control loop, our biomorphic model accomplishes this simply through nonlinear interactions between adjacent neighbors. Implemented in the log-domain, with Class AB operation, our silicon cochlea shows enhanced frequency responses, with compressive behavior around the CF, when ABC is turned on. These features are desirable in prosthetic applications and automatic speech recognition systems, as they capture the properties of the biological cochlea.

References

[1] Lyon, R.F. & Mead, C.A. (1988) An analog electronic cochlea. IEEE Trans. Acoust., Speech, and Signal Proc., 36: 1119-1134.
[2] Watts, L. (1993) Cochlear Mechanics: Analysis and Analog VLSI. Ph.D. thesis, Pasadena, CA: California Institute of Technology.
[3] Fragnière, E. (2005) A 100-channel analog CMOS auditory filter bank for speech recognition. IEEE International Solid-State Circuits Conference (ISSCC 2005), pp. 140-141.
[4] Wen, B. & Boahen, K. (2003) A linear cochlear model with active bi-directional coupling.
The 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2003), pp. 2013-2016.
[5] Sarpeshkar, R., Lyon, R.F., & Mead, C.A. (1996) An analog VLSI cochlear model with new transconductance amplifier and nonlinear gain control. Proceedings of the IEEE Symposium on Circuits and Systems (ISCAS 1996), 3: 292-295.
[6] Mead, C.A. (1989) Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley.
[7] Russell, I.J. & Nilsen, K.E. (1997) The location of the cochlear amplifier: Spatial representation of a single tone on the guinea pig basilar membrane. Proc. Natl. Acad. Sci. USA, 94: 2660-2664.
[8] Geisler, C.D. (1998) From Sound to Synapse: Physiology of the Mammalian Ear. Oxford University Press.
[9] Geisler, C.D. & Sang, C. (1995) A cochlear model using feed-forward outer-hair-cell forces. Hearing Research, 86: 132-146.
[10] Boahen, K.A. & Andreou, A.G. (1992) A contrast sensitive silicon retina with reciprocal synapses. In Moody, J.E. and Lippmann, R.P. (eds.), Advances in Neural Information Processing Systems 4 (NIPS 1992), pp. 764-772, Morgan Kaufmann, San Mateo, CA.
[11] Frey, D.R. (1993) Log-domain filtering: an approach to current-mode filtering. IEE Proc. G, Circuits, Devices and Systems, 140(6): 406-416.
[12] Zaghloul, K. & Boahen, K.A. (2005) An On-Off log-domain circuit that recreates adaptive filtering in the retina. IEEE Transactions on Circuits and Systems I: Regular Papers, 52(1): 99-107.
[13] Ruggero, M.A., Rich, N.C., Narayan, S.S., & Robles, L. (1997) Basilar membrane responses to tones at the base of the chinchilla cochlea. J. Acoust. Soc. Am., 101(4): 2151-2163.

18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

Author: Brent Bryan, Robert C. Nichol, Christopher R. Genovese, Jeff Schneider, Christopher J. Miller, Larry Wasserman

Abstract: We present an efficient algorithm to actively select queries for learning the boundaries separating a function domain into regions where the function is above and below a given threshold. We develop experiment selection methods based on entropy, misclassification rate, variance, and their combinations, and show how they perform on a number of data sets. We then show how these algorithms are used to determine simultaneously valid 1 − α confidence intervals for seven cosmological parameters. Experimentation shows that the algorithm reduces the computation necessary for the parameter estimation problem by an order of magnitude.
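As a rough illustration of the kind of selection rule this abstract describes, the Python sketch below picks the next query as the candidate whose predicted value is most ambiguous with respect to the threshold, which is closely related to the entropy and misclassification criteria listed above. The Gaussian predictive model, the candidate pool, and the threshold t are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np
from scipy.stats import norm

def next_query(candidates, mu, sigma, t):
    """Pick the candidate whose predictive distribution N(mu, sigma^2)
    straddles the threshold t most ambiguously, i.e. where the probability
    of being above t is closest to 1/2 (maximum misclassification risk)."""
    p_above = norm.sf(t, loc=mu, scale=sigma)   # P(f(x) > t) per candidate
    ambiguity = -np.abs(p_above - 0.5)          # 0.5 is most uncertain
    return candidates[int(np.argmax(ambiguity))]
```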

19 nips-2005-Active Learning for Misspecified Models

Author: Masashi Sugiyama

Abstract: Active learning is the problem, in supervised learning, of designing the locations of training input points so that the generalization error is minimized. Existing active learning methods often assume that the model used for learning is correctly specified, i.e., that the learning target function can be expressed by the model at hand. In many practical situations, however, this assumption may not be fulfilled. In this paper, we first show that the existing active learning method can be theoretically justified under a slightly weaker condition: the model does not have to be correctly specified; slightly misspecified models are also allowed. However, it turns out that the weakened condition is still restrictive in practice. To cope with this problem, we propose an alternative active learning method which can be theoretically justified for a wider class of misspecified models. Thus, the proposed method has a broader range of applications than the existing method. Numerical studies show that the proposed active learning method is robust against the misspecification of models and is thus reliable.

1 Introduction and Problem Formulation

Let us discuss the regression problem of learning a real-valued function f(x) from training examples {(x_i, y_i) | y_i = f(x_i) + ε_i}_{i=1}^n, where {ε_i}_{i=1}^n are i.i.d. noise with mean zero and unknown variance σ². We use the following linear regression model for learning:

f̂(x) = Σ_{i=1}^p α_i φ_i(x),   (1)

where {φ_i(x)}_{i=1}^p are fixed, linearly independent functions and α = (α_1, α_2, …, α_p)^T are parameters to be learned.

We evaluate the goodness of the learned function f̂(x) by the expected squared test error over test input points and noise (i.e., the generalization error). When the test input points are drawn independently from a distribution with density p_t(x), the generalization error is expressed as

G = E_ε ∫ ( f̂(x) − f(x) )² p_t(x) dx,   (2)

where E_ε denotes the expectation over the noise {ε_i}_{i=1}^n. In the following, we suppose that p_t(x) is known¹.

¹ In some application domains such as web page analysis or bioinformatics, a large number of unlabeled samples—input points without output values, independently drawn from the distribution with density p_t(x)—are easily gathered. In such cases, a reasonably good estimate of p_t(x) may be obtained by some standard density estimation method. Therefore, the assumption that p_t(x) is known may not be so restrictive.

In a standard setting of regression, the training input points are provided from the environment, i.e., {x_i}_{i=1}^n independently follow the distribution with density p_t(x). On the other hand, in some cases the training input points can be designed by users. In such cases, it is expected that the accuracy of the learning result can be improved if the training input points are chosen appropriately, e.g., by densely locating training input points in the regions of high uncertainty.

Active learning—also referred to as experimental design—is the problem of optimizing the location of training input points so that the generalization error is minimized. In active learning research, it is often assumed that the regression model is correctly specified [2, 1, 3], i.e., that the learning target function f(x) can be expressed by the model. In practice, however, this assumption is often violated.

In this paper, we first show that the existing active learning method can still be theoretically justified when the model is approximately correct in a strong sense. Then we propose an alternative active learning method which can also be theoretically justified for approximately correct models, but where the condition on the approximate correctness of the models is weaker than that for the existing method. Thus, the proposed method has a wider range of applications. In the following, we suppose that the training input points {x_i}_{i=1}^n are independently drawn from a user-defined distribution with density p_x(x), and discuss the problem of finding the optimal density function.
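To fix ideas, here is a small self-contained Python sketch of this formulation. The cubic target, the polynomial basis, the noise level, and the densities are illustrative assumptions (chosen in the spirit of the toy example in Section 5), not part of the formulation itself; the sketch fits the linear model by ordinary least squares and estimates G for one training draw by Monte Carlo over p_t(x).

```python
import numpy as np

rng = np.random.default_rng(0)

f = lambda x: 1.0 - x + x**2 + 0.05 * x**3                    # target function
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=1)  # model basis (p=3)

# training data: inputs from a user-chosen density p_x, noisy outputs
x_tr = rng.normal(0.2, 0.6, size=100)
y_tr = f(x_tr) + rng.normal(0.0, 0.3, size=x_tr.shape)

alpha = np.linalg.lstsq(phi(x_tr), y_tr, rcond=None)[0]       # OLS fit

# Monte Carlo estimate of the squared test error under p_t = N(0.2, 0.4^2)
# (for this one noise realization; averaging over draws approximates G)
x_te = rng.normal(0.2, 0.4, size=100_000)
gen_err = np.mean((phi(x_te) @ alpha - f(x_te))**2)
print(f"estimated generalization error: {gen_err:.4f}")
```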
2 Existing Active Learning Method

The generalization error defined by Eq.(2) can be decomposed as G = B + V, where B is the (squared) bias term and V is the variance term, given by

B = ∫ ( E_ε f̂(x) − f(x) )² p_t(x) dx   and   V = E_ε ∫ ( f̂(x) − E_ε f̂(x) )² p_t(x) dx.

A standard way to learn the parameters in the regression model (1) is ordinary least-squares (OLS) learning, i.e., the parameter vector α is determined as

α_OLS = argmin_α [ (1/n) Σ_{i=1}^n ( f̂(x_i) − y_i )² ].

It is known that α_OLS is given by α_OLS = L_OLS y, where L_OLS = (X^T X)^{−1} X^T, the design matrix X has entries X_{ij} = φ_j(x_i), and y = (y_1, y_2, …, y_n)^T. Let G_OLS, B_OLS and V_OLS be G, B and V for the learned function obtained by ordinary least-squares learning, respectively. Then the following proposition holds.

Proposition 1 ([2, 1, 3]) Suppose that the model is correctly specified, i.e., the learning target function f(x) is expressed as

f(x) = Σ_{i=1}^p α*_i φ_i(x).

Then B_OLS and V_OLS are expressed as B_OLS = 0 and V_OLS = σ² J_OLS, where

J_OLS = tr( U L_OLS L_OLS^T )   and   U = ∫ φ(x) φ(x)^T p_t(x) dx.

Therefore, for the correctly specified model (1), the generalization error is expressed as G_OLS = σ² J_OLS. Based on this expression, the existing active learning method determines the location of training input points {x_i}_{i=1}^n (or the training input density p_x(x)) so that J_OLS is minimized [2, 1, 3].

3 Analysis of Existing Method under Misspecification of Models

In this section, we investigate the validity of the existing active learning method for misspecified models. Suppose the model does not exactly include the learning target function f(x), but it approximately includes it, i.e., for a scalar δ such that δ is small, f(x) is expressed as

f(x) = g(x) + δ r(x),   (3)

where g(x) = Σ_{i=1}^p α*_i φ_i(x) is the orthogonal projection of f(x) onto the span of {φ_i(x)}_{i=1}^p and the residual r(x) is orthogonal to {φ_i(x)}_{i=1}^p:

∫ r(x) φ_i(x) p_t(x) dx = 0   for i = 1, 2, …, p.

In this case, the bias term is expressed as

B = ∫ ( E_ε f̂(x) − g(x) )² p_t(x) dx + C,   where C = ∫ ( g(x) − f(x) )² p_t(x) dx.

Since C is a constant which does not depend on the training input density p_x(x), we subtract C in the following discussion. Then we have the following lemma².

Lemma 2 For the approximately correct model (3), we have

B_OLS − C = δ² ⟨ U L_OLS z_r, L_OLS z_r ⟩ = O_p(δ²),
V_OLS = σ² J_OLS + o_p(n^{−1}),

where z_r = ( r(x_1), r(x_2), …, r(x_n) )^T.

² Proofs of lemmas are provided in an extended version [6].

Note that the asymptotic order in the lemma is in probability, since V_OLS is a random variable that includes {x_i}_{i=1}^n. The above lemma implies that

G_OLS − C = σ² J_OLS + o_p(n^{−1})   if δ = o_p(n^{−1/2}).

Therefore, the existing active learning method of minimizing J_OLS is still justified if δ = o_p(n^{−1/2}). However, when δ is not o_p(n^{−1/2}), the existing method may not work well, because the bias term B_OLS − C is then not smaller than the variance term V_OLS, so it cannot be neglected.
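In practice the criterion J_OLS is easy to evaluate for a candidate design. The sketch below (illustrative Python, reusing a basis-expansion function phi like the one in the earlier sketch) approximates the matrix U from unlabeled samples drawn from p_t(x), which footnote 1 suggests are often available.

```python
import numpy as np

def j_ols(x_train, x_unlabeled, phi):
    """Variance criterion J_OLS = tr(U L L^T) for OLS, where
    L = (X^T X)^{-1} X^T and U = E_{p_t}[phi(x) phi(x)^T] is approximated
    by an average over unlabeled samples drawn from p_t(x)."""
    X = phi(x_train)                           # n x p design matrix
    Phi_u = phi(x_unlabeled)                   # m x p basis on unlabeled pool
    U = Phi_u.T @ Phi_u / len(x_unlabeled)     # Monte Carlo estimate of U
    L = np.linalg.solve(X.T @ X, X.T)          # L_OLS
    return np.trace(U @ L @ L.T)
```

Minimizing this score over a family of candidate training densities (e.g., over the scale factor c used in the experiments below) implements the existing method.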
4 New Active Learning Method

In this section, we propose a new active learning method based on weighted least-squares learning.

4.1 Weighted Least-Squares Learning

When the model is correctly specified, α_OLS is an unbiased estimator of α*. However, for misspecified models, α_OLS is generally biased even asymptotically if δ = O_p(1).

The bias of α_OLS is actually caused by the covariate shift [5]—the training input density p_x(x) is different from the test input density p_t(x). For correctly specified models, the influence of the covariate shift can be ignored, as the existing active learning method does. However, for misspecified models, we should explicitly cope with the covariate shift.

Under the covariate shift, it is known that the following weighted least-squares (WLS) learning is asymptotically unbiased even if δ = O_p(1) [5]:

α_WLS = argmin_α [ (1/n) Σ_{i=1}^n ( p_t(x_i)/p_x(x_i) ) ( f̂(x_i) − y_i )² ].

The asymptotic unbiasedness of α_WLS can be intuitively understood from the following identity, which is similar in spirit to importance sampling:

∫ ( f̂(x) − f(x) )² p_t(x) dx = ∫ ( f̂(x) − f(x) )² ( p_t(x)/p_x(x) ) p_x(x) dx.

In the following, we assume that p_x(x) is strictly positive for all x. Let D be the diagonal matrix with the i-th diagonal element D_ii = p_t(x_i)/p_x(x_i). Then it can be confirmed that α_WLS is given by α_WLS = L_WLS y, where L_WLS = (X^T D X)^{−1} X^T D.

4.2 Active Learning Based on Weighted Least-Squares Learning

Let G_WLS, B_WLS and V_WLS be G, B and V for the learned function obtained by the above weighted least-squares learning, respectively. Then we have the following lemma.

Lemma 3 For the approximately correct model (3), we have

B_WLS − C = δ² ⟨ U L_WLS z_r, L_WLS z_r ⟩ = O_p(δ² n^{−1}),
V_WLS = σ² J_WLS + o_p(n^{−1}),   where J_WLS = tr( U L_WLS L_WLS^T ).

This lemma implies that

G_WLS − C = σ² J_WLS + o_p(n^{−1})   if δ = o_p(1).

Based on this expression, we propose determining the training input density p_x(x) so that J_WLS is minimized. The use of the proposed criterion J_WLS can be theoretically justified when δ = o_p(1), while the existing criterion J_OLS requires δ = o_p(n^{−1/2}). Therefore, the proposed method has a wider range of applications. The effect of this extension is experimentally investigated in the next section.
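A minimal sketch of the importance-weighted fit follows (illustrative Python; the densities and basis are placeholders). The weights p_t(x_i)/p_x(x_i) correct for the covariate shift introduced by the designed training density.

```python
import numpy as np
from scipy.stats import norm

def wls_fit(x_tr, y_tr, phi, p_t, p_x):
    """Importance-weighted least squares: minimize
    (1/n) * sum_i [p_t(x_i)/p_x(x_i)] * (f_hat(x_i) - y_i)^2."""
    w = p_t(x_tr) / p_x(x_tr)          # importance weights (diagonal of D)
    X = phi(x_tr)
    XtD = X.T * w                      # X^T D, applied column-wise
    # solve the weighted normal equations (X^T D X) alpha = X^T D y
    return np.linalg.solve(XtD @ X, XtD @ y_tr)

# example densities: test inputs ~ N(0.2, 0.4^2), training ~ N(0.2, 0.6^2)
p_t = lambda x: norm.pdf(x, 0.2, 0.4)
p_x = lambda x: norm.pdf(x, 0.2, 0.6)
```

With p_x = p_t the weights are all one and the fit reduces to OLS, which matches the observation above that covariate shift only matters under misspecification or a designed training density.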
5 Numerical Examples

We evaluate the usefulness of the proposed active learning method through experiments.

Toy Data Set: We first illustrate how the proposed method works under a controlled setting. Let the input dimension be 1 and the learning target function be f(x) = 1 − x + x² + δx³, with δ ∈ {0, 0.04, 0.5}; the three cases correspond to "correctly specified", "approximately correct", and "misspecified", respectively (see Figure 1). Let n = 100, and let {ε_i}_{i=1}^n be i.i.d. Gaussian noise with mean zero. Let the test input density p_t(x) be a Gaussian density, which is assumed to be known here, and let the basis functions be φ_i(x) = x^{i−1} for i = 1, 2, 3. We choose the training input density p_x(x) from among Gaussian densities with the same mean as p_t(x) and with the standard deviation scaled by a factor c.

Figure 1: Learning target function (for δ = 0, 0.04, 0.5) and input density functions p_t(x) and p_x(x).

We compare the accuracy of the following three methods:

(A) Proposed active learning criterion + WLS learning: The training input density is determined so that J_WLS is minimized. Following the determined input density, training input points {x_i}_{i=1}^{100} are created and the corresponding output values {y_i}_{i=1}^{100} are observed. Then WLS learning is used for estimating the parameters.

(B) Existing active learning criterion + OLS learning [2, 1, 3]: The training input density is determined so that J_OLS is minimized. OLS learning is used for estimating the parameters.

(C) Passive learning + OLS learning: The test input density p_t(x) is used as the training input density. OLS learning is used for estimating the parameters.

First, we evaluate the accuracy of J_WLS and J_OLS as approximations of G_WLS and G_OLS. The means and standard deviations of G_WLS, J_WLS, G_OLS, and J_OLS over 100 runs are depicted as functions of c in Figure 2. These graphs show that when δ = 0 ("correctly specified"), both J_WLS and J_OLS give accurate estimates of G_WLS and G_OLS. When δ = 0.04 ("approximately correct"), J_WLS again works well, while J_OLS tends to be negatively biased for large c. This result is surprising since, as illustrated in Figure 1, the learning target functions with δ = 0 and δ = 0.04 are visually quite similar; it intuitively seems that the result for δ = 0.04 should not differ much from that for δ = 0. The simulation result shows, however, that this slight difference makes J_OLS unreliable. When δ = 0.5 ("misspecified"), J_WLS is still reasonably accurate, while J_OLS is heavily biased. These results show that, as an approximation of the generalization error, J_WLS is more robust against the misspecification of models than J_OLS, which is in good agreement with the theoretical analyses given in Section 3 and Section 4.

Figure 2: The means and error bars of G_WLS, J_WLS, G_OLS, and J_OLS over 100 runs as functions of c.

Table 1: The means and standard deviations of the generalization error for the toy data set. The best method and the ones comparable to it by the t-test are described with boldface. The value of method (B) for δ = 0.5 is extremely large, but it is not a typo.

In Table 1, the mean and standard deviation of the generalization error obtained by each method are described. When δ = 0, the existing method (B) works better than the proposed method (A). Actually, in this case, training input densities that approximately minimize G_WLS and G_OLS were found by J_WLS and J_OLS; the difference in the errors is therefore caused by the difference between WLS and OLS. WLS generally has larger variance than OLS, and since the bias is zero for both WLS and OLS when δ = 0, OLS is more accurate than WLS. Although the proposed method (A) is outperformed by the existing method (B) in this case, it still works better than the passive learning scheme (C). When δ = 0.04 and δ = 0.5, the proposed method (A) gives significantly smaller errors than the other methods.

Overall, we found that for all three cases the proposed method (A) works reasonably well and outperforms the passive learning scheme (C). On the other hand, the existing method (B) works excellently in the correctly specified case, although it tends to perform poorly once the correctness of the model is violated. Therefore, the proposed method (A) is found to be robust against the misspecification of models and is thus reliable.
Table 2: The means and standard deviations of the test error for the DELVE data sets (Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, Kin-8nh). All values in the table are multiplied by 10³.

Figure 3: Mean relative performance of (A) and (B) compared with (C) on the eight DELVE data sets. For each run, the test errors of (A) and (B) are normalized by the test error of (C), and the values are then averaged over the 100 runs. The error bars were reasonably small, so they are omitted.

Realistic Data Set: Here we use eight practical data sets provided by DELVE [4]: Bank-8fm, Bank-8fh, Bank-8nm, Bank-8nh, Kin-8fm, Kin-8fh, Kin-8nm, and Kin-8nh. Each data set includes 8192 samples, consisting of 8-dimensional input values and 1-dimensional output values. For convenience, every attribute is normalized into [0, 1].

Suppose we are given all input points (i.e., unlabeled samples), while the output values are unknown. From the pool of unlabeled samples, we choose n = 1000 input points {x_i}_{i=1}^{1000} for training and observe the corresponding output values {y_i}_{i=1}^{1000}. The task is to predict the output values of all the unlabeled samples.

In this experiment, the test input density p_t(x) is unknown, so we estimate it using the independent Gaussian density

p_t(x) ≈ (2π γ²_ML)^{−d/2} exp( −‖x − μ_ML‖² / (2 γ²_ML) ),

where μ_ML and γ_ML are the maximum likelihood estimates of the mean and standard deviation obtained from all unlabeled samples. As the basis functions we use Gaussian kernels

φ_k(x) = exp( −‖x − t_k‖² / 2 )   for k = 1, 2, …, 50,

where the template points {t_k}_{k=1}^{50} are randomly chosen from the pool of unlabeled samples.

We select the training input density p_x(x) from among independent Gaussian densities with mean μ_ML and standard deviation c γ_ML, where c is varied. In this simulation, we cannot create training input points at arbitrary locations because we only have 8192 samples. Therefore, we first create temporary input points following the determined training input density, and then choose from the pool of unlabeled samples the input points closest to the temporary input points. For each data set, we repeat this simulation 100 times, changing the template points {t_k} in each run.

The means and standard deviations of the test error over the 100 runs are described in Table 2. The proposed method (A) outperforms the existing method (B) for five data sets, while it is outperformed by (B) for the other three data sets. We conjecture that the model used for learning is almost correct in those three data sets. This result implies that the proposed method (A) is slightly better than the existing method (B).

Figure 3 depicts the relative performance of the proposed method (A) and the existing method (B) compared with the passive learning scheme (C). It shows that (A) outperforms (C) for all eight data sets, while (B) is comparable to or outperformed by (C) for five data sets. Therefore, the proposed method (A) is overall shown to work better than the other schemes.
However, for misspecified models, we have to explicitly cope with the covariate shift. In this paper, we proposed a new active learning method based on weighted least-squares learning. The numerical study showed that the existing method works better than the proposed method if the model is correctly specified; however, the existing method tends to perform poorly once the correctness of the model is violated. On the other hand, the proposed method overall worked reasonably well, and it consistently outperformed the passive learning scheme. Therefore, the proposed method is robust against the misspecification of models and is thus reliable.

The proposed method can be theoretically justified if the model is approximately correct in a weak sense. However, it is no longer valid for totally misspecified models. A natural future direction is therefore to devise an active learning method which has a theoretical guarantee for totally misspecified models. It is also important to notice that when the model is totally misspecified, even learning with optimal training input points would not be successful anyway; in such cases, it is of course important to carry out model selection. In active learning research, including the present paper, the locations of the training input points are designed for a single model at hand; that is, the model must be chosen before performing active learning. Devising a method for simultaneously optimizing the model and the locations of the training input points is a more important and promising future direction.

Acknowledgments: The author would like to thank MEXT (Grant-in-Aid for Young Scientists 17700142) for partial financial support.

References

[1] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. Journal of Artificial Intelligence Research, 4:129-145, 1996.
[2] V. V. Fedorov. Theory of Optimal Experiments. Academic Press, New York, 1972.
[3] K. Fukumizu. Statistical active learning in multilayer perceptrons. IEEE Transactions on Neural Networks, 11(1):17-26, 2000.
[4] C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani, R. Kustra, and R. Tibshirani. The DELVE manual, 1996.
[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227-244, 2000.
[6] M. Sugiyama. Active learning for misspecified models. Technical report, Department of Computer Science, Tokyo Institute of Technology, 2005.

20 nips-2005-Affine Structure From Sound

Author: Sebastian Thrun

Abstract: We consider the problem of localizing a set of microphones together with a set of external acoustic events (e.g., hand claps), emitted at unknown times and unknown locations. We propose a solution that approximates this problem under a far field approximation defined in the calculus of affine geometry, and that relies on singular value decomposition (SVD) to recover the affine structure of the problem. We then define low-dimensional optimization techniques for embedding the solution into Euclidean geometry, and further techniques for recovering the locations and emission times of the acoustic events. The approach is useful for the calibration of ad-hoc microphone arrays and sensor networks. 1
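The core linear-algebra step this abstract alludes to, recovering an affine (low-rank) structure by SVD, can be illustrated generically. The Python sketch below is a plain best-rank-r factorization of a centered measurement matrix; it stands in for, rather than reproduces, the paper's specific measurement model and factor interpretation.

```python
import numpy as np

def affine_rank_factorization(M, r=2):
    """Generic affine-structure recovery step: center a measurement matrix
    and split its best rank-r approximation into two factors via SVD.
    (Illustrative only; the paper's actual construction differs in detail.)"""
    M0 = M - M.mean(axis=1, keepdims=True)   # remove per-row offsets (affine part)
    U, s, Vt = np.linalg.svd(M0, full_matrices=False)
    A = U[:, :r] * s[:r]                     # e.g., a receiver-related factor
    B = Vt[:r, :]                            # e.g., an event-related factor
    return A, B
```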

21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

22 nips-2005-An Analog Visual Pre-Processing Processor Employing Cyclic Line Access in Only-Nearest-Neighbor-Interconnects Architecture

23 nips-2005-An Application of Markov Random Fields to Range Sensing

24 nips-2005-An Approximate Inference Approach for the PCA Reconstruction Error

25 nips-2005-An aVLSI Cricket Ear Model

26 nips-2005-An exploration-exploitation model based on norepinepherine and dopamine activity

27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions

29 nips-2005-Analyzing Coupled Brain Sources: Distinguishing True from Spurious Interaction

30 nips-2005-Assessing Approximations for Gaussian Process Classification

31 nips-2005-Asymptotics of Gaussian Regularized Least Squares

32 nips-2005-Augmented Rescorla-Wagner and Maximum Likelihood Estimation

33 nips-2005-Bayesian Sets

34 nips-2005-Bayesian Surprise Attracts Human Attention

35 nips-2005-Bayesian model learning in human visual perception

36 nips-2005-Bayesian models of human action understanding

37 nips-2005-Benchmarking Non-Parametric Statistical Tests

38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

39 nips-2005-Beyond Pair-Based STDP: a Phenomenological Rule for Spike Triplet and Frequency Effects

40 nips-2005-CMOL CrossNets: Possible Neuromorphic Nanoelectronic Circuits

41 nips-2005-Coarse sample complexity bounds for active learning

42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

45 nips-2005-Conditional Visual Tracking in Kernel Space

46 nips-2005-Consensus Propagation

47 nips-2005-Consistency of one-class SVM and related algorithms

48 nips-2005-Context as Filtering

49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations

50 nips-2005-Convex Neural Networks

51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

52 nips-2005-Correlated Topic Models

53 nips-2005-Cyclic Equilibria in Markov Games

54 nips-2005-Data-Driven Online to Batch Conversions

55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification

58 nips-2005-Divergences, surrogate loss functions and experimental design

59 nips-2005-Dual-Tree Fast Gauss Transforms

60 nips-2005-Dynamic Social Network Analysis using Latent Space Models

61 nips-2005-Dynamical Synapses Give Rise to a Power-Law Distribution of Neuronal Avalanches

62 nips-2005-Efficient Estimation of OOMs

63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

64 nips-2005-Efficient estimation of hidden state dynamics from spike trains

65 nips-2005-Estimating the wrong Markov random field: Benefits in the computation-limited setting

66 nips-2005-Estimation of Intrinsic Dimensionality Using High-Rate Vector Quantization

67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

69 nips-2005-Fast Gaussian Process Regression using KD-Trees

70 nips-2005-Fast Information Value for Graphical Models

71 nips-2005-Fast Krylov Methods for N-Body Learning

72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

73 nips-2005-Fast biped walking with a reflexive controller and real-time policy searching

74 nips-2005-Faster Rates in Regression via Active Learning

75 nips-2005-Fixing two weaknesses of the Spectral Method

76 nips-2005-From Batch to Transductive Online Learning

77 nips-2005-From Lasso regression to Feature vector machine

78 nips-2005-From Weighted Classification to Policy Search

79 nips-2005-Fusion of Similarity Data in Clustering

80 nips-2005-Gaussian Process Dynamical Models

81 nips-2005-Gaussian Processes for Multiuser Detection in CDMA receivers

82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging

83 nips-2005-Generalization error bounds for classifiers trained with interdependent data

84 nips-2005-Generalization in Clustering with Unobserved Features

85 nips-2005-Generalization to Unseen Cases

86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences

87 nips-2005-Goal-Based Imitation as Probabilistic Inference over Graphical Models

88 nips-2005-Gradient Flow Independent Component Analysis in Micropower VLSI

89 nips-2005-Group and Topic Discovery from Relations and Their Attributes

90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

91 nips-2005-How fast to work: Response vigor, motivation and tonic dopamine

92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex

95 nips-2005-Improved risk tail bounds for on-line algorithms

96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

98 nips-2005-Infinite latent feature models and the Indian buffet process

99 nips-2005-Integrate-and-Fire models with adaptation are good enough

100 nips-2005-Interpolating between types and tokens by estimating power-law generators

101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

102 nips-2005-Kernelized Infomax Clustering

103 nips-2005-Kernels for gene regulatory regions

104 nips-2005-Laplacian Score for Feature Selection

105 nips-2005-Large-Scale Multiclass Transduction

106 nips-2005-Large-scale biophysical parameter estimation in single neurons via constrained linear regression

107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition

108 nips-2005-Layered Dynamic Textures

109 nips-2005-Learning Cue-Invariant Visual Responses

110 nips-2005-Learning Depth from Single Monocular Images

111 nips-2005-Learning Influence among Interacting Markov Chains

112 nips-2005-Learning Minimum Volume Sets

113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

114 nips-2005-Learning Rankings via Convex Hull Separation

115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

117 nips-2005-Learning from Data of Variable Quality

118 nips-2005-Learning in Silicon: Timing is Everything

119 nips-2005-Learning to Control an Octopus Arm with Gaussian Process Temporal Difference Methods

120 nips-2005-Learning vehicular dynamics, with application to modeling helicopters

121 nips-2005-Location-based activity recognition

122 nips-2005-Logic and MRF Circuitry for Labeling Occluding and Thinline Visual Contours

123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

125 nips-2005-Message passing for task redistribution on sparse graphs

126 nips-2005-Metric Learning by Collapsing Classes

127 nips-2005-Mixture Modeling by Affinity Propagation

128 nips-2005-Modeling Memory Transfer and Saving in Cerebellar Motor Learning

129 nips-2005-Modeling Neural Population Spiking Activity with Gibbs Distributions

130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks

131 nips-2005-Multiple Instance Boosting for Object Detection

132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

133 nips-2005-Nested sampling for Potts models

134 nips-2005-Neural mechanisms of contrast dependent receptive field size in V1

135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI

136 nips-2005-Noise and the two-thirds power Law

137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

138 nips-2005-Non-Local Manifold Parzen Windows

139 nips-2005-Non-iterative Estimation with Perturbed Gaussian Markov Processes

140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

141 nips-2005-Norepinephrine and Neural Interrupts

142 nips-2005-Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games

143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

144 nips-2005-Off-policy Learning with Options and Recognizers

145 nips-2005-On Local Rewards and Scaling Distributed Reinforcement Learning

146 nips-2005-On the Accuracy of Bounded Rationality: How Far from Optimal Is Fast and Frugal?

147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

148 nips-2005-Online Discovery and Learning of Predictive State Representations

149 nips-2005-Optimal cue selection strategy

150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing

151 nips-2005-Pattern Recognition from One Example by Chopping

152 nips-2005-Phase Synchrony Rate for the Recognition of Motor Imagery in Brain-Computer Interface

153 nips-2005-Policy-Gradient Methods for Planning

154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

155 nips-2005-Predicting EMG Data from M1 Neurons with Variational Bayesian Least Squares

156 nips-2005-Prediction and Change Detection

157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models

158 nips-2005-Products of ``Edge-perts''

159 nips-2005-Q-Clustering

160 nips-2005-Query by Committee Made Real

161 nips-2005-Radial Basis Function Network for Multi-task Learning

162 nips-2005-Rate Distortion Codes in Sensor Networks: A System-level Analysis

163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections

164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

165 nips-2005-Response Analysis of Neuronal Population with Synaptic Depression

166 nips-2005-Robust Fisher Discriminant Analysis

167 nips-2005-Robust design of biological experiments

168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

169 nips-2005-Saliency Based on Information Maximization

170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

171 nips-2005-Searching for Character Models

172 nips-2005-Selecting Landmark Points for Sparse Manifold Learning

173 nips-2005-Sensory Adaptation within a Bayesian Framework for Perception

174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling

175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

176 nips-2005-Silicon growth cones map silicon retina

177 nips-2005-Size Regularized Cut for Data Clustering

178 nips-2005-Soft Clustering on Graphs

179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs

180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms

181 nips-2005-Spiking Inputs to a Winner-take-all Network

182 nips-2005-Statistical Convergence of Kernel CCA

183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity

184 nips-2005-Structured Prediction via the Extragradient Method

185 nips-2005-Subsequence Kernels for Relation Extraction

186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration

187 nips-2005-Temporal Abstraction in Temporal-difference Networks

188 nips-2005-Temporally changing synaptic plasticity

189 nips-2005-Tensor Subspace Analysis

190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

192 nips-2005-The Information-Form Data Association Filter

193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

195 nips-2005-Transfer learning for text classification

196 nips-2005-Two view learning: SVM-2K, Theory and Practice

197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments

198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

199 nips-2005-Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

203 nips-2005-Visual Encoding with Jittering Eyes

204 nips-2005-Walk-Sum Interpretation and Analysis of Gaussian Belief Propagation

205 nips-2005-Worst-Case Bounds for Gaussian Process Models