nips nips2003 knowledge-graph by maker-knowledge-mining

nips 2003 knowledge graph

papers list:

1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in twoclass classi£cation. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an ef£cient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1

2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality

Author: Maxim Likhachev, Geoffrey J. Gordon, Sebastian Thrun

Abstract: In real world planning problems, time for deliberation is often limited. Anytime planners are well suited for these problems: they ﬁnd a feasible solution quickly and then continually work on improving it until time runs out. In this paper we propose an anytime heuristic search, ARA*, which tunes its performance bound based on available search time. It starts by ﬁnding a suboptimal solution quickly using a loose bound, then tightens the bound progressively as time allows. Given enough time it ﬁnds a provably optimal solution. While improving its bound, ARA* reuses previous search efforts and, as a result, is signiﬁcantly more efﬁcient than other anytime search methods. In addition to our theoretical analysis, we demonstrate the practical utility of ARA* with experiments on a simulated robot kinematic arm and a dynamic path planning problem for an outdoor rover. 1

3 nips-2003-AUC Optimization vs. Error Rate Minimization

Author: Corinna Cortes, Mehryar Mohri

Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classiﬁcation algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the ﬁrst exact expression of the expected value and the variance of the AUC for a ﬁxed error rate. Our results show that the average AUC is monotonically increasing as a function of the classiﬁcation accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the beneﬁts of an algorithm speciﬁcally designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1 Motivation In many applications, the overall classiﬁcation error rate is not the most pertinent performance measure, criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a speciﬁc query. That list may contain several thousand documents, but, in practice, only the top ﬁfty or so are examined by the user. Thus, a search engine’s ranking of the documents is more critical than the accuracy of its classiﬁcation of all documents as relevant or not. More generally, for a binary classiﬁer assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classiﬁcation is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classiﬁer is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classiﬁcation algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values signiﬁcantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate. ∗ This author’s new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com. 1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modiﬁed under the name of Linear Ranking by [13, 14]. True positive rate ROC Curve. AUC=0.718 (1,1) True positive rate = (0,0) False positive rate = False positive rate correctly classiﬁed positive total positive incorrectly classiﬁed negative total negative Figure 1: An example of ROC curve. The line connecting (0, 0) and (1, 1), corresponding to random classiﬁcation, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. speciﬁcity) in this context. In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the ﬁrst exact expression of the expected value and the variance of the AUC for a ﬁxed error rate.2 We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets and demonstrate the beneﬁts of an algorithm speciﬁcally designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 2 Deﬁnition and properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classiﬁcation problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classiﬁcation threshold from the most positive classiﬁcation value to the most negative. For a fully random classiﬁcation, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classiﬁcation results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve. The AUC is deﬁned as the area under the ROC curve and is closely related to the ranking quality of the classiﬁcation as shown more formally by Lemma 1 below. Consider a binary classiﬁcation task with m positive examples and n negative examples. We will assume that a classiﬁer outputs a strictly ordered list for these examples and will denote by 1X the indicator function of a set X. Lemma 1 ([8]) Let c be a ﬁxed classiﬁer. Let x1 , . . . , xm be the output of c on the positive examples and y1 , . . . , yn its output on the negative examples. Then, the AUC, A, associated to c is given by: m n i=1 j=1 1xi >yj (1) A= mn that is the value of the Wilcoxon-Mann-Whitney statistic [8]. Proof. The proof is based on the observation that the AUC value is exactly the probability P (X > Y ) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8]. Thus, the AUC can be viewed as a measure based on pairwise comparisons between classiﬁcations of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC. 2 An attempt in that direction was made by [15], but, unfortunately, the authors’ analysis and the result are both wrong. Threshold θ k − x Positive examples x Negative examples n − x Negative examples m − (k − x) Positive examples Figure 2: For a ﬁxed number of errors k, there may be x, 0 ≤ x ≤ k, false negative examples. 3 The Expected Value of the AUC In this section, we compute exactly the expected value of the AUC over all classiﬁcations with a ﬁxed number of errors and compare that to the error rate. Different classiﬁers may have the same error rate but different AUC values. Indeed, for a given classiﬁcation threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate. Assume that the number of errors k is ﬁxed. We wish to compute the average value of the AUC over all classiﬁcations with k errors. Our model is based on the simple assumption that all classiﬁcations or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general. For a given classiﬁcation, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is ﬁxed, there are k − x false negative examples. Figure 3 shows the corresponding conﬁguration. The two regions of examples with classiﬁcation outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A, as given by Eq. (1) can be divided into the following three parts: A1 + A2 + A3 A= , with (2) mn A1 = the sum over all pairs (xi , yj ) with xi and yj in distinct regions; A2 = the sum over all pairs (xi , yj ) with xi and yj in the region above the threshold; A3 = the sum over all pairs (xi , yj ) with xi and yj in the region below the threshold. The ﬁrst term, A1 , is easy to compute. Since there are (m − (k − x)) positive examples above the threshold and n − x negative examples below the threshold, A1 is given by: A1 = (m − (k − x))(n − x) (3) To compute A2 , we can assign to each negative example above the threshold a position based on its classiﬁcation rank. Let position one be the ﬁrst position above the threshold and let α1 < . . . < αx denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classiﬁed as positive is N = m − (k − x) + x. Thus, by deﬁnition of A2 , x A2 = (N − αi ) − (x − i) (4) i=1 where the ﬁrst term N − αi represents the number of examples ranked higher than the ith example and the second term x − i discounts the number of negative examples incorrectly ranked higher than the ith example. Similarly, let α1 < . . . < αk−x denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A3 is given by: x A3 = (N − αj ) − (x − j) (5) j=1 with N = n − x + (k − x) and x = k − x. Combining the expressions of A1 , A2 , and A3 leads to: A= A1 + A2 + A3 (k − 2x)2 + k ( =1+ − mn 2mn x i=1 αi + mn x j=1 αj ) (6) Lemma 2 For a ﬁxed x, the average value of the AUC A is given by: < A >x = 1 − x n + k−x m 2 (7) x Proof. The proof is based on the computation of the average values of i=1 αi and x j=1 αj for a given x. We start by computing the average value < αi >x for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α1 . . . αi−1 and αi+1 . . . αx , when the value of αi is ﬁxed at say αi = l. We have i ≤ l ≤ N − (x − i) since there need to be at least i − 1 positions before αi and N − (x − i) above. There are l − 1 possible positions for α1 . . . αi−1 and N − l possible positions for αi+1 . . . αx . Since the total number of ways of choosing the x positions for α1 . . . αx out of N is N , the average value < αi >x is: x N −(x−i) l=i < αi >x = l l−1 i−1 N −l x−i (8) N x Thus, x < αi >x = x i=1 i=1 Using the classical identity: x < αi >x = N −(x−i) l−1 l i−1 l=i N x u p1 +p2 =p p1 N l=1 l N −1 x−1 N x i=1 N −l x−i v p2 = = N l=1 = u+v p N (N + 1) 2 x l−1 i=1 i−1 N x l N −l x−i (9) , we can write: N −1 x−1 N x = x(N + 1) 2 (10) Similarly, we have: x < αj >x = j=1 x Replacing < i=1 αi >x and < Eq. (10) and Eq. (11) leads to: x j=1 x (N + 1) 2 (11) αj >x in Eq. (6) by the expressions given by (k − 2x)2 + k − x(N + 1) − x (N + 1) =1− 2mn which ends the proof of the lemma. < A >x = 1 + x n + k−x m 2 (12) Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the accuracy rates for the positive and negative classes. Proposition 1 Assume that a binary classiﬁcation task with m positive examples and n negative examples is given. Then, the expected value of the AUC A over all classiﬁcations with k errors is given by: < A >= 1 − k (n − m)2 (m + n + 1) − m+n 4mn k−1 m+n x=0 x k m+n+1 x=0 x k − m+n (13) Proof. Lemma 2 gives the average value of the AUC for a ﬁxed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classiﬁcations for a given x. There are N possible ways of x choosing the positions of the x misclassiﬁed negative examples, and similarly N possible x ways of choosing the positions of the x = k − x misclassiﬁed positive examples. Thus, in view of Lemma 2, the average AUC is given by: < A >= k N x=0 x N x (1 − k N x=0 x N x k−x x n+ m 2 ) (14) r=0.05 r=0.01 r=0.1 r=0.25 0.0 0.1 0.2 r=0.5 0.3 Error rate 0.4 0.5 .00 .05 .10 .15 .20 .25 0.5 0.6 0.7 0.8 0.9 1.0 Mean value of the AUC Relative standard deviation r=0.01 r=0.05 r=0.1 0.0 0.1 r=0.25 0.2 0.3 Error rate r=0.5 0.4 0.5 Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate. Each curve corresponds to a ﬁxed ratio of r = n/(n + m). The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m. This expression can be simpliﬁed into Eq. (13)3 using the following novel identities: k X N x x=0 k X N x x x=0 ! N x ! ! N x ! = = ! k X n+m+1 x x=0 (15) ! k X (k − x)(m − n) + k n + m + 1 2 x x=0 (16) that we obtained by using Zeilberger’s algorithm4 and numerous combinatorial ’tricks’. From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classiﬁer only for even distributions (n = m). For n = m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a ﬁxed ratio of n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems nothing to be gained in designing speciﬁc learning algorithms for maximizing the AUC: a classiﬁcation algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k = 0. 4 The Variance of the AUC 2 Let D = mn + (k−2x) +k , a = i=1 αi , a = j=1 αj , and α = a + a . Then, by 2 Eq. (6), mnA = D − α. Thus, the variance of the AUC, σ 2 (A), is given by: (mn)2 σ 2 (A) x x = < (D − α)2 − (< D > − < α >)2 > = < D2 > − < D >2 + < α2 > − < α >2 −2(< αD > − < α >< D >) (17) As before, to compute the average of a term X over all classiﬁcations, we can ﬁrst determine its average < X >x for a ﬁxed x, and then use the function F deﬁned by: F (Y ) = k N N x=0 x x k N N x=0 x x Y (18) and < X >= F (< X >x ). A crucial step in computing the exact value of the variance of x the AUC is to determine the value of the terms of the type < a2 >x =< ( i=1 αi )2 >x . 3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of conﬁgurations. The authors’ analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false. 4 We thank Neil Sloane for having pointed us to Zeilberger’s algorithm and Maple package. x Lemma 3 For a ﬁxed x, the average of ( i=1 αi )2 is given by: x(N + 1) < a2 > x = (3N x + 2x + N ) 12 (19) Proof. By deﬁnition of a, < a2 >x = b + 2c with: x x α2 >x i b =< c =< αi αj >x (20) 1≤i

4 nips-2003-A Biologically Plausible Algorithm for Reinforcement-shaped Representational Learning

Author: Maneesh Sahani

Abstract: Signiﬁcant plasticity in sensory cortical representations can be driven in mature animals either by behavioural tasks that pair sensory stimuli with reinforcement, or by electrophysiological experiments that pair sensory input with direct stimulation of neuromodulatory nuclei, but usually not by sensory stimuli presented alone. Biologically motivated theories of representational learning, however, have tended to focus on unsupervised mechanisms, which may play a signiﬁcant role on evolutionary or developmental timescales, but which neglect this essential role of reinforcement in adult plasticity. By contrast, theoretical reinforcement learning has generally dealt with the acquisition of optimal policies for action in an uncertain world, rather than with the concurrent shaping of sensory representations. This paper develops a framework for representational learning which builds on the relative success of unsupervised generativemodelling accounts of cortical encodings to incorporate the effects of reinforcement in a biologically plausible way. 1

5 nips-2003-A Classification-based Cocktail-party Processor

Author: Nicoleta Roman, Deliang Wang, Guy J. Brown

Abstract: At a cocktail party, a listener can selectively attend to a single voice and filter out other acoustical interferences. How to simulate this perceptual ability remains a great challenge. This paper describes a novel supervised learning approach to speech segregation, in which a target speech signal is separated from interfering sounds using spatial location cues: interaural time differences (ITD) and interaural intensity differences (IID). Motivated by the auditory masking effect, we employ the notion of an ideal time-frequency binary mask, which selects the target if it is stronger than the interference in a local time-frequency unit. Within a narrow frequency band, modifications to the relative strength of the target source with respect to the interference trigger systematic changes for estimated ITD and IID. For a given spatial configuration, this interaction produces characteristic clustering in the binaural feature space. Consequently, we perform pattern classification in order to estimate ideal binary masks. A systematic evaluation in terms of signal-to-noise ratio as well as automatic speech recognition performance shows that the resulting system produces masks very close to ideal binary ones. A quantitative comparison shows that our model yields significant improvement in performance over an existing approach. Furthermore, under certain conditions the model produces large speech intelligibility improvements with normal listeners. 1 In t ro d u c t i o n The perceptual ability to detect, discriminate and recognize one utterance in a background of acoustic interference has been studied extensively under both monaural and binaural conditions [1, 2, 3]. The human auditory system is able to segregate a speech signal from an acoustic mixture using various cues, including fundamental frequency (F0), onset time and location, in a process that is known as auditory scene analysis (ASA) [1]. F0 is widely used in computational ASA systems that operate upon monaural input – however, systems that employ only this cue are limited to voiced speech [4, 5, 6]. Increased speech intelligibility in binaural listening compared to the monaural case has prompted research in designing cocktail-party processors based on spatial cues [7, 8, 9]. Such a system can be applied to, among other things, enhancing speech recognition in noisy environments and improving binaural hearing aid design. In this study, we propose a sound segregation model using binaural cues extracted from the responses of a KEMAR dummy head that realistically simulates the filtering process of the head, torso and external ear. A typical approach for signal reconstruction uses a time-frequency (T-F) mask: T-F units are weighted selectively in order to enhance the target signal. Here, we employ an ideal binary mask [6], which selects the T-F units where the signal energy is greater than the noise energy. The ideal mask notion is motivated by the human auditory masking phenomenon, in which a stronger signal masks a weaker one in the same critical band. In addition, from a theoretical ASA perspective, an ideal binary mask gives a performance ceiling for all binary masks. Moreover, such masks have been recently shown to provide a highly effective front-end for robust speech recognition [10]. We show for mixtures of multiple sound sources that there exists a strong correlation between the relative strength of target and interference and estimated ITD/IID, resulting in a characteristic clustering across frequency bands. Consequently, we employ a nonparametric classification method to determine decision regions in the joint ITDIID feature space that correspond to an optimal estimate for an ideal mask. Related models for estimating target masks through clustering have been proposed previously [11, 12]. Notably, the experimental results by Jourjine et al. [12] suggest that speech signals in a multiple-speaker condition obey to a large extent disjoint orthogonality in time and frequency. That is, at most one source has a nonzero energy at a specific time and frequency. Such models, however, assume input directly from microphone recordings and head-related filtering is not considered. Simulation of human binaural hearing introduces different constraints as well as clues to the problem. First, both ITD and IID should be utilized since IID is more reliable at higher frequencies than ITD. Second, frequency-dependent combinations of ITD and IID arise naturally for a fixed spatial configuration. Consequently, channel-dependent training should be performed for each frequency band. The rest of the paper is organized as follows. The next section contains the architecture of the model and describes our method for azimuth localization. Section 3 is devoted to ideal binary mask estimation, which constitutes the core of the model. Section 4 presents the performance of the system and a quantitative comparison with the Bodden [7] model. Section 5 concludes our paper. 2 M od el a rch i t ect u re a n d a zi mu t h locali zat i o n Our model consists of the following stages: 1) a model of the auditory periphery; 2) frequency-dependent ITD/IID extraction and azimuth localization; 3) estimation of an ideal binary mask. The input to our model is a mixture of two or more signals presented at different, but fixed, locations. Signals are sampled at 44.1 kHz. We follow a standard procedure for simulating free-field acoustic signals from monaural signals (no reverberations are modeled). Binaural signals are obtained by filtering the monaural signals with measured head-related transfer functions (HRTF) from a KEMAR dummy head [13]. HRTFs introduce a natural combination of ITD and IID into the signals that is extracted in the subsequent stages of the model. To simulate the auditory periphery we use a bank of 128 gammatone filters in the range of 80 Hz to 5 kHz as described in [4]. In addition, the gains of the gammatone filters are adjusted in order to simulate the middle ear transfer function. In the final step of the peripheral model, the output of each gammatone filter is half-wave rectified in order to simulate firing rates of the auditory nerve. Saturation effects are modeled by taking the square root of the signal. Current models of azimuth localization almost invariably start with Jeffress’s crosscorrelation mechanism. For all frequency channels, we use the normalized crosscorrelation computed at lags equally distributed in the plausible range from –1 ms to 1 ms using an integration window of 20 ms. Frequency-dependent nonlinear transformations are used to map the time-delay axis onto the azimuth axis resulting in a cross-correlogram structure. In addition, a ‘skeleton’ cross-correlogram is formed by replacing the peaks in the cross-correlogram with Gaussians of narrower widths that are inversely proportional to the channel center frequency. This results in a sharpening effect, similar in principle to lateral inhibition. Assuming fixed sources, multiple locations are determined as peaks after summating the skeleton cross-correlogram across frequency and time. The number of sources and their locations computed here, as well as the target source location, feed to the next stage. 3 B i n a ry ma s k est i mat i on The objective of this stage of the model is to develop an efficient mechanism for estimating an ideal binary mask based on observed patterns of extracted ITD and IID features. Our theoretical analysis for two-source interactions in the case of pure tones shows relatively smooth changes for ITD and IID with the relative strength R between the two sources in narrow frequency bands [14]. More specifically, when the frequencies vary uniformly in a narrow band the derived mean values of ITD/IID estimates vary monotonically with respect to R. To capture this relationship in the context of real signals, statistics are collected for individual spatial configurations during training. We employ a training corpus consisting of 10 speech utterances from the TIMIT database (see [14] for details). In the two-source case, we divide the corpus in two equal sets: target and interference. In the three-source case, we select 4 signals for the target set and 2 interfering sets of 3 signals each. For all frequency channels, local estimates of ITD, IID and R are based on 20-ms time frames with 10 ms overlap between consecutive time frames. In order to eliminate the multi-peak ambiguity in the cross-correlation function for mid- and high-frequency channels, we use the following strategy. We compute ITDi as the peak location of the cross-correlation in the range 2π / ω i centered at the target ITD, where ω i indicates the center frequency of the ith channel. On the other hand, IID and R are computed as follows: ∑ t s i2 (t )     Ri = ∑ ∑ t li2 (t ) , t s i2 (t ) + ∑ ∑ t ri2 (t ) t ni2 (t )     IIDi = 20 log10 where l i and ri refer to the left and right peripheral output of the ith channel, respectively, s i refers to the output for the target signal, and ni that for the acoustic interference. In computing IIDi , we use 20 instead of 10 in order to compensate for the square root operation in the peripheral model. Fig. 1 shows empirical results obtained for a two-source configuration on the training corpus. The data exhibits a systematic shift for both ITD and IID with respect to the relative strength R. Moreover, the theoretical mean values obtained in the case of pure tones [14] match the empirical ones very well. This observation extends to multiple-source scenarios. As an example, Fig. 2 displays histograms that show the relationship between R and both ITD (Fig. 2A) and IID (Fig. 2B) for a three-source situation. Note that the interfering sources introduce systematic deviations for the binaural cues. Consider a worst case: the target is silent and two interferences have equal energy in a given T-F unit. This results in binaural cues indicating an auditory event at half of the distance between the two interference locations; for Fig. 2, it is 0° - the target location. However, the data in Fig. 2 has a low probability for this case and shows instead a clustering phenomenon, suggesting that in most cases only one source dominates a T-F unit. B 1 1 R R A theoretical empirical 0 -1 theoretical empirical 0 -15 1 ITD (ms) 15 IID (dB) Figure 1. Relationship between ITD/IID and relative strength R for a two-source configuration: target in the median plane and interference on the right side at 30°. The solid curve shows the theoretical mean and the dash curve shows the data mean. A: The scatter plot of ITD and R estimates for a filter channel with center frequency 500 Hz. B: Results for IID for a filter channel with center frequency 2.5 kHz. A B 1 C 10 1 IID s) 0.5 0 -10 IID (d B) 10 ) (dB R R 0 -0.5 m ITD ( -10 -0.5 m ITD ( s) 0.5 Figure 2. Relationship between ITD/IID and relative strength R for a three-source configuration: target in the median plane and interference at -30° and 30°. Statistics are obtained for a channel with center frequency 1.5 kHz. A: Histogram of ITD and R samples. B: Histogram of IID and R samples. C: Clustering in the ITD-IID space. By displaying the information in the joint ITD-IID space (Fig. 2C), we observe location-based clustering of the binaural cues, which is clearly marked by strong peaks that correspond to distinct active sources. There exists a tradeoff between ITD and IID across frequencies, where ITD is most salient at low frequencies and IID at high frequencies [2]. But a fixed cutoff frequency that separates the effective use of ITD and IID does not exist for different spatial configurations. This motivates our choice of a joint ITD-IID feature space that optimizes the system performance across different configurations. Differential training seems necessary for different channels given that there exist variations of ITD and, especially, IID values for different center frequencies. Since the goal is to estimate an ideal binary mask, we focus on detecting decision regions in the 2-dimensional ITD-IID space for individual frequency channels. Consequently, supervised learning techniques can be applied. For the ith channel, we test the following two hypotheses. The first one is H 1 : target is dominant or Ri > 0.5 , and the second one is H 2 : interference is dominant or Ri < 0.5 . Based on the estimates of the bivariate densities p( x | H 1 ) and p( x | H 2 ) the classification is done by the maximum a posteriori decision rule: p( H 1 ) p( x | H 1 ) > p( H 2 ) p( x | H 2 ) . There exist a plethora of techniques for probability density estimation ranging from parametric techniques (e.g. mixture of Gaussians) to nonparametric ones (e.g. kernel density estimators). In order to completely characterize the distribution of the data we use the kernel density estimation method independently for each frequency channel. One approach for finding smoothing parameters is the least-squares crossvalidation method, which is utilized in our estimation. One cue not employed in our model is the interaural time difference between signal envelopes (IED). Auditory models generally employ IED in the high-frequency range where the auditory system becomes gradually insensitive to ITD. We have compared the performance of the three binaural cues: ITD, IID and IED and have found no benefit for using IED in our system after incorporating ITD and IID [14]. 4 Pe rfo rmanc e an d c omp arison The performance of a segregation system can be assessed in different ways, depending on intended applications. To extensively evaluate our model, we use the following three criteria: 1) a signal-to-noise (SNR) measure using the original target as signal; 2) ASR rates using our model as a front-end; and 3) human speech intelligibility tests. To conduct the SNR evaluation a segregated signal is reconstructed from a binary mask using a resynthesis method described in [5]. To quantitatively assess system performance, we measure the SNR using the original target speech as signal: ∑ t 2 s o (t ) ∑ SNR = 10 log 10 (s o (t ) − s e (t ))2 t where s o (t ) represents the resynthesized original speech and s e (t ) the reconstructed speech from an estimated mask. One can measure the initial SNR by replacing the denominator with s N (t ) , the resynthesized original interference. Fig. 3 shows the systematic results for two-source scenarios using the Cooke corpus [4], which is commonly used in sound separation studies. The corpus has 100 mixtures obtained from 10 speech utterances mixed with 10 types of intrusion. We compare the SNR gain obtained by our model against that obtained using the ideal binary mask across different noise types. Excellent results are obtained when the target is close to the median plane for an azimuth separation as small as 5°. Performance degrades when the target source is moved to the side of the head, from an average gain of 13.7 dB for the target in the median plane (Fig. 3A) to 1.7 dB when target is at 80° (Fig. 3B). When spatial separation increases the performance improves even for side targets, to an average gain of 14.5 dB in Fig. 3C. This performance profile is in qualitative agreement with experimental data [2]. Fig. 4 illustrates the performance in a three-source scenario with target in the median plane and two interfering sources at –30° and 30°. Here 5 speech signals from the Cooke corpus form the target set and the other 5 form one interference set. The second interference set contains the 10 intrusions. The performance degrades compared to the two-source situation, from an average SNR of about 12 dB to 4.1 dB. However, the average SNR gain obtained is approximately 11.3 dB. This ability of our model to segregate mixtures of more than two sources differs from blind source separation with independent component analysis. In order to draw a quantitative comparison, we have implemented Bodden’s cocktail-party processor using the same 128-channel gammatone filterbank [7]. The localization stage of this model uses an extended cross-correlation mechanism based on contralateral inhibition and it adapts to HRTFs. The separation stage of the model is based on estimation of the weights for a Wiener filter as the ratio between a desired excitation and an actual one. Although the Bodden model is more flexible by incorporating aspects of the precedence effect into the localization stage, the estimation of Wiener filter weights is less robust than our binary estimation of ideal masks. Shown in Fig. 5, our model shows a considerable improvement over the Bodden system, producing a 3.5 dB average improvement. A B C 20 20 10 10 10 0 0 0 -10 SNR (dB) 20 -10 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 3. Systematic results for two-source configuration. Black bars correspond to the SNR of the initial mixture, white bars indicate the SNR obtained using ideal binary mask, and gray bars show the SNR from our model. Results are obtained for speech mixed with ten intrusion types (N0: pure tone; N1: white noise; N2: noise burst; N3: ‘cocktail party’; N4: rock music; N5: siren; N6: trill telephone; N7: female speech; N8: male speech; N9: female speech). A: Target at 0°, interference at 5°. B: Target at 80°, interference at 85°. C: Target at 60°, interference at 90°. 20 0 SNR (dB) SNR (dB) 5 -5 -10 -15 -20 10 0 -10 N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 4. Evaluation for a three-source configuration: target at 0° and two interfering sources at –30° and 30°. Black bars correspond to the SNR of the initial mixture, white bars to the SNR obtained using the ideal binary mask, and gray bars to the SNR from our model. N0 N1 N2 N3 N4 N5 N6 N7 N8 N9 Figure 5. SNR comparison between the Bodden model (white bars) and our model (gray bars) for a two-source configuration: target at 0° and interference at 30°. Black bars correspond to the SNR of the initial mixture. For the ASR evaluation, we use the missing-data technique as described in [10]. In this approach, a continuous density hidden Markov model recognizer is modified such that only acoustic features indicated as reliable in a binary mask are used during decoding. Hence, it works seamlessly with the output from our speech segregation system. We have implemented the missing data algorithm with the same 128-channel gammatone filterbank. Feature vectors are obtained using the Hilbert envelope at the output of the gammatone filter. More specifically, each feature vector is extracted by smoothing the envelope using an 8-ms first-order filter, sampling at a frame-rate of 10 ms and finally log-compressing. We use the bounded marginalization method for classification [10]. The task domain is recognition of connected digits, and both training and testing are performed on acoustic features from the left ear signal using the male speaker dataset in the TIDigits database. A 100 B 100 Correctness (%) Correctness (%) Fig. 6A shows the correctness scores for a two-source condition, where the male target speaker is located at 0° and the interference is another male speaker at 30°. The performance of our model is systematically compared against the ideal masks for four SNR levels: 5 dB, 0 dB, -5 dB and –10 dB. Similarly, Fig. 6B shows the results for the three-source case with an added female speaker at -30°. The ideal mask exhibits only slight and gradual degradation in recognition performance with decreasing SNR and increasing number of sources. Observe that large improvements over baseline performance are obtained across all conditions. This shows the strong potential of applying our model to robust speech recognition. 80 60 40 20 5 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB 80 60 40 20 5 dB −10 dB Baseline Ideal Mask Estimated Mask 0 dB −5 dB −10 dB Figure 6. Recognition performance at different SNR values for original mixture (dotted line), ideal binary mask (dashed line) and estimated mask (solid line). A. Correctness score for a two-source case. B. Correctness score for a three-source case. Finally we evaluate our model on speech intelligibility with listeners with normal hearing. We use the Bamford-Kowal-Bench sentence database that contains short semantically predictable sentences [15]. The score is evaluated as the percentage of keywords correctly identified, ignoring minor errors such as tense and plurality. To eliminate potential location-based priming effects we randomly swap the locations for target and interference for different trials. In the unprocessed condition, binaural signals are produced by convolving original signals with the corresponding HRTFs and the signals are presented to a listener dichotically. In the processed condition, our algorithm is used to reconstruct the target signal at the better ear and results are presented diotically. 80 80 Keyword score (%) B100 Keyword score (%) A 100 60 40 20 0 0 dB −5 dB −10 dB 60 40 20 0 Figure 7. Keyword intelligibility score for twelve native English speakers (median values and interquartile ranges) before (white bars) and after processing (black bars). A. Two-source condition (0° and 5°). B. Three-source condition (0°, 30° and -30°). Fig. 7A gives the keyword intelligibility score for a two-source configuration. Three SNR levels are tested: 0 dB, -5 dB and –10 dB, where the SNR is computed at the better ear. Here the target is a male speaker and the interference is babble noise. Our algorithm improves the intelligibility score for the tested conditions and the improvement becomes larger as the SNR decreases (61% at –10 dB). Our informal observations suggest, as expected, that the intelligibility score improves for unprocessed mixtures when two sources are more widely separated than 5°. Fig. 7B shows the results for a three-source configuration, where our model yields a 40% improvement. Here the interfering sources are one female speaker and another male speaker, resulting in an initial SNR of –10 dB at the better ear. 5 C onclu si on We have observed systematic deviations of the ITD and IID cues with respect to the relative strength between target and acoustic interference, and configuration-specific clustering in the joint ITD-IID feature space. Consequently, supervised learning of binaural patterns is employed for individual frequency channels and different spatial configurations to estimate an ideal binary mask that cancels acoustic energy in T-F units where interference is stronger. Evaluation using both SNR and ASR measures shows that the system estimates ideal binary masks very well. A comparison shows a significant improvement in performance over the Bodden model. Moreover, our model produces substantial speech intelligibility improvements for two and three source conditions. A c k n ow l e d g me n t s This research was supported in part by an NSF grant (IIS-0081058) and an AFOSR grant (F49620-01-1-0027). A preliminary version of this work was presented in 2002 ICASSP. References [1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT press, 1990. [2] J. Blauert, Spatial Hearing - The Psychophysics of Human Sound Localization, Cambridge, MA: MIT press, 1997. [3] A. Bronkhorst, “The cocktail party phenomenon: a review of research on speech intelligibility in multiple-talker conditions,” Acustica, vol. 86, pp. 117-128, 2000. [4] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993. [5] G. J. Brown and M. P. Cooke, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297-336, 1994. [6] G. Hu and D. L. Wang, “Monaural speech separation,” Proc. NIPS, 2002. [7] M. Bodden, “Modeling human sound-source localization and the cocktail-party-effect,” Acta Acoustica, vol. 1, pp. 43-55, 1993. [8] C. Liu et al., “A two-microphone dual delay-line approach for extraction of a speech sound in the presence of multiple interferers,” J. Acoust. Soc. Am., vol. 110, pp. 32183230, 2001. [9] T. Whittkop and V. Hohmann, “Strategy-selective noise reduction for binaural digital hearing aids,” Speech Comm., vol. 39, pp. 111-138, 2003. [10] M. P. Cooke, P. Green, L. Josifovski and A. Vizinho, “Robust automatic speech recognition with missing and unreliable acoustic data,” Speech Comm., vol. 34, pp. 267285, 2001. [11] H. Glotin, F. Berthommier and E. Tessier, “A CASA-labelling model using the localisation cue for robust cocktail-party speech recognition,” Proc. EUROSPEECH, pp. 2351-2354, 1999. [12] A. Jourjine, S. Rickard and O. Yilmaz, “Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures,” Proc. ICASSP, 2000. [13] W. G. Gardner and K. D. Martin, “HRTF measurements of a KEMAR dummy-head microphone,” MIT Media Lab Technical Report #280, 1994. [14] N. Roman, D. L. Wang and G. J. Brown, “Speech segregation based on sound localization,” J. Acoust. Soc. Am., vol. 114, pp. 2236-2252, 2003. [15] J. Bench and J. Bamford, Speech Hearing Tests and the Spoken Language of HearingImpaired Children, London: Academic press, 1979.

6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters

Author: Daniel B. Neill, Andrew W. Moore

Abstract: Given an N ×N grid of squares, where each square has a count and an underlying population, our goal is to ﬁnd the square region with the highest density, and to calculate its signiﬁcance by randomization. Any density measure D, dependent on the total count and total population of a region, can be used. For example, if each count represents the number of disease cases occurring in that square, we can use Kulldorff’s spatial scan statistic DK to ﬁnd the most signiﬁcant spatial disease cluster. A naive approach to ﬁnding the maximum density region requires O(N 3 ) time, and is generally computationally infeasible. We present a novel algorithm which partitions the grid into overlapping regions, bounds the maximum score of subregions contained in each region, and prunes regions which cannot contain the maximum density region. For sufﬁciently dense regions, this method ﬁnds the maximum density region in optimal O(N 2 ) time, in practice resulting in signiﬁcant (10-200x) speedups. 1

7 nips-2003-A Functional Architecture for Motion Pattern Processing in MSTd

Author: Scott A. Beardsley, Lucia M. Vaina

Abstract: Psychophysical studies suggest the existence of specialized detectors for component motion patterns (radial, circular, and spiral), that are consistent with the visual motion properties of cells in the dorsal medial superior temporal area (MSTd) of non-human primates. Here we use a biologically constrained model of visual motion processing in MSTd, in conjunction with psychophysical performance on two motion pattern tasks, to elucidate the computational mechanisms associated with the processing of widefield motion patterns encountered during self-motion. In both tasks discrimination thresholds varied significantly with the type of motion pattern presented, suggesting perceptual correlates to the preferred motion bias reported in MSTd. Through the model we demonstrate that while independently responding motion pattern units are capable of encoding information relevant to the visual motion tasks, equivalent psychophysical performance can only be achieved using interconnected neural populations that systematically inhibit non-responsive units. These results suggest the cyclic trends in psychophysical performance may be mediated, in part, by recurrent connections within motion pattern responsive areas whose structure is a function of the similarity in preferred motion patterns and receptive field locations between units. 1 In trod u ction A major challenge in computational neuroscience is to elucidate the architecture of the cortical circuits for sensory processing and their effective role in mediating behavior. In the visual motion system, biologically constrained models are playing an increasingly important role in this endeavor by providing an explanatory substrate linking perceptual performance and the visual properties of single cells. Single cell studies indicate the presence of complex interconnected structures in middle temporal and primary visual cortex whose most basic horizontal connections can impart considerable computational power to the underlying neural population [1, 2]. Combined psychophysical and computational studies support these findings Figure 1: a) Schematic of the graded motion pattern (GMP) task. Discrimination pairs of stimuli were created by perturbing the flow angle (φ) of each 'test' motion (with average dot speed, vav), by ±φp in the stimulus space spanned by radial and circular motions. b) Schematic of the shifted center-of-motion (COM) task. Discrimination pairs of stimuli were created by shifting the COM of the ‘test’ motion to the left and right of a central fixation point. For each motion pattern the COM was shifted within the illusory inner aperture and was never explicitly visible. and suggest that recurrent connections may play a significant role in encoding the visual motion properties associated with various psychophysical tasks [3, 4]. Using this methodology our goal is to elucidate the computational mechanisms associated with the processing of wide-field motion patterns encountered during self-motion. In the human visual motion system, psychophysical studies suggest the existence of specialized detectors for the motion pattern components (i.e., radial, circular and spiral motions) associated with self-motion [5, 6]. Neurophysiological studies reporting neurons sensitive to motion patterns in the dorsal medial superior temporal area (MSTd) support the existence of such mechanisms [7-10], and in conjunction with psychophysical studies suggest a strong link between the patterns of neural activity and motion-based perceptual performance [11, 12]. Through the combination of human psychophysical performance and biologically constrained modeling we investigate the computational role of simple recurrent connections within a population of MSTd-like units. Based on the known visual motion properties within MSTd we ask what neural structures are computationally sufficient to encode psychophysical performance on a series of motion pattern tasks. 2 M o t i o n pa t t e r n d i sc r i m i n a t i o n Using motion pattern stimuli consistent with previous studies [5, 6], we have developed a set of novel psychophysical tasks designed to facilitate a more direct comparison between human perceptual performance and the visual motion properties of cells in MSTd that have been found to underlie the discrimination of motion patterns [11, 12]. The psychophysical tasks, referred to as the graded motion pattern (GMP) and shifted center-of-motion (COM) tasks, are outlined in Fig. 1. Using a temporal two-alternative-forced-choice task we measured discrimination thresholds to global changes in the patterns of complex motion (GMP task), [13], and shifts in the center-of-motion (COM task). Stimuli were presented with central fixation using a constant stimulus paradigm and consisted of dynamic random dot displays presented in a 24o annular region (central 4o removed). In each task, the stimulus duration was randomly perturbed across presentations (440±40 msec) to control for timing-based cues, and dots moved coherently through a radial speed Figure 2: a) GMP thresholds across 8 'test' motions at two mean dot speeds for two observers. Performance varied continuously with thresholds for radial motions (φ=0, 180o) significantly lower than those for circular motions (φ=90,270o), (p<0.001; t(37)=3.39). b) COM thresholds at three mean dot speeds for two observers. As with the GMP task, performance varied continuously with thresholds for radial motions significantly lower than those for circular motions, (p<0.001; t(37)=4.47). gradient in directions consistent with the global motion pattern presented. Discrimination thresholds were obtained across eight ‘test’ motions corresponding to expansion, contraction, CW and CCW rotation, and the four intermediate spiral motions. To minimize adaptation to specific motion patterns, opposing motions (e.g., expansion/ contraction) were interleaved across paired presentations. 2.1 Results Discrimination thresholds are reported here from a subset of the observer population consisting of three experienced psychophysical observers, one of which was naïve to the purpose of the psychophysical tasks. For each condition, performance is reported as the mean and standard error averaged across 8-12 thresholds. Across observers and dot speeds GMP thresholds followed a distinct trend in the stimulus space [13], with radial motions (expansion/contraction) significantly lower than circular motions (CW/CCW rotation), (p<0.001; t(37)=3.39), (Fig. 2a). While thresholds for the intermediate spiral motions were not significantly different from circular motions (p=0.223, t(60)=0.74), the trends across 'test' motions were well fit within the stimulus space (SB: r>0.82, SC: r>0.77) by sinusoids whose period and phase were 196 ± 10o and -72 ± 20o respectively (Fig. 1a). When the radial speed gradient was removed by randomizing the spatial distribution of dot speeds, threshold performance increased significantly across observers (p<0.05; t(17)=1.91), particularly for circular motions (p<0.005; t(25)=3.31), (data not shown). Such performance suggests a perceptual contribution associated with the presence of the speed gradient and is particularly interesting given the fact that the speed gradient did not contribute computationally relevant information to the task. However, the speed gradient did convey information regarding the integrative structure of the global motion field and as such suggests a preference of the underlying motion mechanisms for spatially structured speed information. Similar trends in performance were observed in the COM task across observers and dot speeds. Discrimination thresholds varied continuously as a function of the 'test' motion with thresholds for radial motions significantly lower than those for circular motions, (p<0.001; t(37)=4.47) and could be well fit by a sinusoidal trend line (e.g. SB at 3 deg/s: r>0.91, period = 178 ± 10 o and phase = -70 ± 25o), (Fig. 2b). 2.2 A local or global task? The consistency of the cyclic threshold profile in stimuli that restricted the temporal integration of individual dot motions [13], and simultaneously contained all directions of motion, generally argues against a primary role for local motion mechanisms in the psychophysical tasks. While the psychophysical literature has reported a wide variety of “local” motion direction anisotropies whose properties are reminiscent of the results observed here, e.g. [14], all would predict equivalent thresholds for radial and circular motions for a set of uniformly distributed and/or spatially restricted motion direction mechanisms. Together with the computational impact of the speed gradient and psychophysical studies supporting the existence of wide-field motion pattern mechanisms [5, 6], these results suggest that the threshold differences across the GMP and COM tasks may be associated with variations in the computational properties across a series of specialized motion pattern mechanisms. 3 A computational model The similarities between the motion pattern stimuli used to quantify human perception and the visual motion properties of cells in MSTd suggests that MSTd may play a computational role in the psychophysical tasks. To examine this hypothesis, we constructed a population of MSTd-like units whose visual motion properties were consistent with the reported neurophysiology (see [13] for details). Across the population, the distribution of receptive field centers was uniform across polar angle and followed a gamma distribution Γ(5,6) across eccenticity [7]. For each unit, visual motion responses followed a gaussian tuning profile as a function of the stimulus flow angle G( φ), (σi=60±30o; [10]), and the distance of the stimulus COM from the unit’s receptive field center Gsat(xi, yi, σs=19o), Eq. 1, such that its preferred motion response was position invariant to small shifts in the COM [10] and degraded continuously for large shifts [9]. Within the model, simulations were categorized according to the distribution of preferred motions represented across the population (one reported in MSTd and a uniform control). The first distribution simulated an expansion bias in which the density of preferred motions decreased symmetrically from expansions to contraction [10]. The second distribution simulated a uniform preference for all motions and was used as a control to quantify the effects of an expansion bias on psychophysical performance. Throughout the paper we refer to simulations containing these distributions as ‘Expansion-biased’ and ‘Uniform’ respectively. 3.1 Extracting perceptual estimates from the neural code For each stimulus presentation, the ith unit’s response was calculated as the average firing rate, Ri, from the product of its motion pattern and spatial tuning profiles, ( ) Ri = Rmax G min[φ − φi ] ,σ ti G sati (x− xi , y − y i ,σ s ) + P (λ = 12 ) (1) where Rmax is the maximum preferred stimulus response (spikes/s), min[ ] refers to the minimum angular distance between the stimulus flow angle φ and the unit’s preferred motion φi, Gsat is the unit’s spatial tuning profile saturated within the central 5±3o, σti and σs are the standard deviations of the unit’s motion pattern and Figure 3: Model vs. psychophysical performance for independently responding units. Model thresholds are reported as the average (±1 S.E.) across five simulated populations. a) GMP thresholds were highest for contracting motions and lowest for expanding motions across all Expansion-biased populations. b) Comparable trends in performance were observed for COM thresholds. Comparison with the Uniform control simulations in both tasks (2000 units shown here) indicates that thresholds closely followed the distribution of preferred motions simulated within the model. spatial tuning profiles respectively, (xi,yi) is the spatial location of the unit’s receptive field center, (x,y) is the spatial location of the stimulus COM, and P(λ=12) is the background activity simulated as an uncorrelated Poisson process. The psychophysical tasks were simulated using a modified center-of-gravity ^ approach to decode estimates of the stimulus properties, i.e. flow angle (φ ) and ˆ ˆ COM location in the visual field (x, y ) , from the neural population   ∑ xi Ri ∑ y i Ri v  , i , ∑ φ i Ri  ∑ Ri i   i i   i (xˆ, yˆ , φˆ) =  i∑ R  (2) v where φi is the unit vector in the stimulus space (Fig. 1a) corresponding to the unit’s preferred motion. For each set of paired stimuli, psychophysical judgments were made by comparing the estimated stimulus properties according to the discrimination criteria, specified in the psychophysical tasks. As with the psychophysical experiments, discrimination thresholds were computed using a leastsquares fit to percent correct performance across constant stimulus levels. 3.2 Simulation 1: Independent neural responses In the first series of simulations, GMP and COM thresholds were quantified across three populations (500, 1000, and 2000 units) of independently responding units for each simulated distribution (Expansion-biased and Uniform). Across simulations, both the range in thresholds and their trends across ‘test’ motions were compared with human psychophysical performance to quantify the effects of population size and an expansion biased preferred motion distribution on model performance. Over the psychophysical range of interest (φp ± 7o), GMP thresholds for contracting motions were at chance across all Expansion-biased populations, (Fig. 3a). While thresholds for expanding motions were generally consistent with those for human observers, those for circular motions remained significantly higher for all but the largest populations. Similar trends in performance were observed for the COM task, (Fig. 3b). Here the range of COM thresholds was well matched with human performance for simulations containing 1000 units, however, the trends across motion patterns remained inconsistent even for the largest populations. Figure 4: Proposed recurrent connection profile between motion pattern units. a) Across the motion pattern space connection strength followed an inverse gaussian profile such that the ith unit (with preferred motion φi) systematically inhibited units with anti-preferred motions centered at 180+φi. b) Across the visual field connection strength followed a difference-of-gaussians profile as a function of the relative distance between receptive field centers such that spatially local units are mutually excitatory (σRe=10o) and more distant units were mutually inhibitory (σRi=80o). For simulations containing a uniform distribution of preferred motions, the threshold range was consistent with human performance on both tasks, however, the trend across motion patterns was generally flat. What variability did occur was due primarily to the discrete sampling of preferred motions across the population. Comparison of the discrimination thresholds for the Expansion-biased and Uniform populations indicates that the trend across thresholds was closely matched to the underlying distributions of preferred motions. This result in due in part to the nearequal weighting of independently responding units and can be explained to a first approximation by the proportional increase in the signal-to-noise ratio across the population as a function of the density of units responsive to a given 'test' motion. 3.3 Simulation 2: An interconnected neural structure In a second series of simulations, we examined the computational effect of adding recurrent connections between units. If the distribution of preferred motions in MSTd is in fact biased towards expansions, as the neurophysiology suggests, it seems unlikely that independent estimates of the visual motion information would be sufficient to yield the threshold profiles observed in the psychophysical tasks. We hypothesize that a simple fixed architecture of excitatory and/or inhibitory connections is sufficient to account for the cyclic trends in discrimination thresholds. Specifically, we propose that a recurrent connection profile whose strength varies as a function of (a) the similarity between preferred motion patterns and (b) the distance between receptive field centers, is computationally sufficient to recover the trends in GMP/COM performance (Fig. 4), wij = S R e − ( xi − x j )2 + ( yi − y j )2 2 2σ R e − SR e 2 − −(min[ φi − φ j ])2 ( xi − x j )2 + ( yi − y j )2 2 2 σ Ri − Sφ e 2σ I2 (3) Figure 5: Model vs. psychophysical performance for populations containing recurrent connections (σI=80o). As the number of units increased for Expansionbiased populations, discrimination thresholds decreased to psychophysical levels and the sinusoidal trend in thresholds emerged for both the (a) GMP and (b) COM tasks. Sinusoidal trends were established for as few as 1000 units and were well fit (r>0.9) by sinusoids whose periods and phases were (193.8 ± 11.7o, -70.0 ± 22.6o) and (168.2 ± 13.7o, -118.8 ± 31.8o) for the GMP and COM tasks respectively. where wij is the strength of the recurrent connection between ith and jth units, (xi,yi) and (xj,yj) denote the spatial locations of their receptive field centers, σRe (=10o) and σRi (=80o) together define the spatial extent of a difference-of-gaussians interaction between receptive field centers, and SR and Sφ scale the connection strength. To examine the effects of the spread of motion pattern-specific inhibition and connection strength in the model, σI, Sφ, and SR were considered free parameters. Within the parameter space used to define recurrent connections (i.e., σI, Sφ and SR), Monte Carlo simulations of Expansion-biased model performance (1000 units) yielded regions of high correlation on both tasks (with respect to the psychophysical thresholds, r>0.7) that were consistent across independently simulated populations. Typically these regions were well defined over a broad range such that there was significant overlap between tasks (e.g., for the GMP task (SR=0.03), σI=[45,120o], Sφ=[0.03,0.3] and for the COM task (σI=80o), Sφ = [0.03,0.08], SR = [0.005, 0.04]). Fig. 5 shows averaged threshold performance for simulations of interconnected units drawn from the highly correlated regions of the (σI, Sφ, SR) parameter space. For populations not explicitly examined in the Monte Carlo simulations connection strengths (Sφ, SR) were scaled inversely with population size to maintain an equivalent level of recurrent activity. With the incorporation of recurrent connections, the sinusoidal trend in GMP and COM thresholds emerged for Expansion-biased populations as the number of units increased. In both tasks the cyclic threshold profiles were established for 1000 units and were well fit (r>0.9) by sinusoids whose periods and phases were consistent with human performance. Unlike the Expansion-biased populations, Uniform populations were not significantly affected by the presence of recurrent connections (Fig. 5). Both the range in thresholds and the flat trend across motion patterns were well matched to those in Section 3.2. Together these results suggest that the sinusoidal trends in GMP and COM performance may be mediated by the combined contribution of the recurrent interconnections and the bias in preferred motions across the population. 4 D i s c u s s i on Using a biologically constrained computational model in conjunction with human psychophysical performance on two motion pattern tasks we have shown that the visual motion information encoded across an interconnected population of cells responsive to motion patterns, such as those in MSTd, is computationally sufficient to extract perceptual estimates consistent with human performance. Specifically, we have shown that the cyclic trend in psychophysical performance observed across tasks, (a) cannot be reproduced using populations of independently responding units and (b) is dependent, in part, on the presence of an expanding motion bias in the distribution of preferred motions across the neural population. The model’s performance suggests the presence of specific recurrent structures within motion pattern responsive areas, such as MSTd, whose strength varies as a function of the similarity between preferred motion patterns and the distance between receptive field centers. While such structures have not been explicitly examined in MSTd and other higher visual motion areas there is anecdotal support for the presence of inhibitory connections [8]. Together, these results suggest that robust processing of the motion patterns associated with self-motion and optic flow may be mediated, in part, by recurrent structures in extrastriate visual motion areas whose distributions of preferred motions are biased strongly in favor of expanding motions. Acknowledgments This work was supported by National Institutes of Health grant EY-2R01-07861-13 to L.M.V. References [1] Malach, R., Schirman, T., Harel, M., Tootell, R., & Malonek, D., (1997), Cerebral Cortex, 7(4): 386-393. [2] Gilbert, C. D., (1992), Neuron, 9: 1-13. [3] Koechlin, E., Anton, J., & Burnod, Y., (1999), Biological Cybernetics, 80: 2544. [4] Stemmler, M., Usher, M., & Niebur, E., (1995), Science, 269: 1877-1880. [5] Burr, D. C., Morrone, M. C., & Vaina, L. M., (1998), Vision Research, 38(12): 1731-1743. [6] Meese, T. S. & Harris, S. J., (2002), Vision Research, 42: 1073-1080. [7] Tanaka, K. & Saito, H. A., (1989), Journal of Neurophysiology, 62(3): 626-641. [8] Duffy, C. J. & Wurtz, R. H., (1991), Journal of Neurophysiology, 65(6): 13461359. [9] Duffy, C. J. & Wurtz, R. H., (1995), Journal of Neuroscience, 15(7): 5192-5208. [10] Graziano, M. S., Anderson, R. A., & Snowden, R., (1994), Journal of Neuroscience, 14(1): 54-67. [11] Celebrini, S. & Newsome, W., (1994), Journal of Neuroscience, 14(7): 41094124. [12] Celebrini, S. & Newsome, W. T., (1995), Journal of Neurophysiology, 73(2): 437-448. [13] Beardsley, S. A. & Vaina, L. M., (2001), Journal of Computational Neuroscience, 10: 255-280. [14] Matthews, N. & Qian, N., (1999), Vision Research, 39: 2205-2211.

8 nips-2003-A Holistic Approach to Compositional Semantics: a connectionist model and robot experiments

Author: Yuuya Sugita, Jun Tani

Abstract: We present a novel connectionist model for acquiring the semantics of a simple language through the behavioral experiences of a real robot. We focus on the “compositionality” of semantics, a fundamental characteristic of human language, which is the ability to understand the meaning of a sentence as a combination of the meanings of words. We also pay much attention to the “embodiment” of a robot, which means that the robot should acquire semantics which matches its body, or sensory-motor system. The essential claim is that an embodied compositional semantic representation can be self-organized from generalized correspondences between sentences and behavioral patterns. This claim is examined and conﬁrmed through simple experiments in which a robot generates corresponding behaviors from unlearned sentences by analogy with the correspondences between learned sentences and behaviors. 1

9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos

Abstract: Over the last years signiﬁcant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classiﬁers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically ﬁnding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identiﬁcation/veriﬁcation and image classiﬁcation tasks and show that these new kernels have the best performance in speaker veriﬁcation and mostly outperform the Fisher kernel based SVM’s and the generative classiﬁers in speaker identiﬁcation and image classiﬁcation. 1

10 nips-2003-A Low-Power Analog VLSI Visual Collision Detector

Author: Reid R. Harrison

Abstract: We have designed and tested a single-chip analog VLSI sensor that detects imminent collisions by measuring radially expansive optic flow. The design of the chip is based on a model proposed to explain leg-extension behavior in flies during landing approaches. A new elementary motion detector (EMD) circuit was developed to measure optic flow. This EMD circuit models the bandpass nature of large monopolar cells (LMCs) immediately postsynaptic to photoreceptors in the fly visual system. A 16 × 16 array of 2-D motion detectors was fabricated on a 2.24 mm × 2.24 mm die in a standard 0.5-µm CMOS process. The chip consumes 140 µW of power from a 5 V supply. With the addition of wide-angle optics, the sensor is able to detect collisions around 500 ms before impact in complex, real-world scenes. 1

11 nips-2003-A Mixed-Signal VLSI for Real-Time Generation of Edge-Based Image Vectors

Author: Masakazu Yagi, Hideo Yamasaki, Tadashi Shibata

Abstract: A mixed-signal image filtering VLSI has been developed aiming at real-time generation of edge-based image vectors for robust image recognition. A four-stage asynchronous median detection architecture based on analog digital mixed-signal circuits has been introduced to determine the threshold value of edge detection, the key processing parameter in vector generation. As a result, a fully seamless pipeline processing from threshold detection to edge feature map generation has been established. A prototype chip was designed in a 0.35-µm double-polysilicon three-metal-layer CMOS technology and the concept was verified by the fabricated chip. The chip generates a 64-dimension feature vector from a 64x64-pixel gray scale image every 80µsec. This is about 104 times faster than the software computation, making a real-time image recognition system feasible. 1 In tro du c ti o n The development of human-like image recognition systems is a key issue in information technology. However, a number of algorithms developed for robust image recognition so far [1]-[3] are mostly implemented as software systems running on general-purpose computers. Since the algorithms are generally complex and include a lot of floating point operations, they are computationally too expensive to build real-time systems. Development of hardware-friendly algorithms and their direct VLSI implementation would be a promising solution for real-time response systems. Being inspired by the biological principle that edge information is firstly detected in the visual cortex, we have developed an edge-based image representation algorithm compatible to hardware processing. In this algorithm, multiple-direction edges extracted from an original gray scale image is utilized to form a feature vector. Since the spatial distribution of principal edges is represented by a vector, it was named Projected Principal-Edge Distribution (PPED) [4],[5], or formerly called Principal Axis Projection (PAP) [6],[7]. (The algorithm is explained later.) Since the PPED vectors very well represent the human perception of similarity among images, robust image recognition systems have been developed using PPED vectors in conjunction with the analog soft pattern classifier [4],[8], the digital VQ (Vector Quantization) processor [9], and support vector machines [10] . The robust nature of PPED representation is demonstrated in Fig. 1, where the system was applied to cephalometric landmark identification (identifying specific anatomical landmarks on medical radiographs) as an example, one of the most important clinical practices of expert dentists in orthodontics [6],[7]. Typical X-ray images to be experienced by apprentice doctors were converted to PPED vectors and utilized as templates for vector matching. The system performance has been proven for 250 head film samples regarding the fundamental 26 landmarks [11]. Important to note is the successful detection of the landmark on the soft tissue boundary (the tip of the lower lip) shown in Fig. 1(c). Landmarks on soft tissues are very difficult to detect as compared to landmarks on hard tissues (solid bones) because only faint images are captured on radiographs. The successful detection is due to the median algorithm that determines the threshold value for edge detection. Sella Nasion Orbitale By our system (a) By expert dentists Landmark on soft tissue (b) (c) Fig. 1: Image recognition using PPED vectors: (a,b) cephalometric landmark identification; (c) successful landmark detection on soft tissue. We have adopted the median value of spatial variance of luminance within the filtering kernel (5x5 pixels), which allows us to extract all essential features in a delicate gray scale image. However, the problem is the high computational cost in determining the median value. It takes about 0.6 sec to generate one PPED vector from a 64x64-pixel image (a standard image size for recognition in our system) on a SUN workstation, making real time processing unrealistic. About 90% of the computation time is for edge detection from an input image, in which most of the time is spent for median detection. Then the purpose of this work is to develop a new architecture median-filter VLSI subsystem for real-time PPED-vector generation. Special attention has been paid to realize a fully seamless pipeline processing from threshold detection to edge feature map generation by employing the four-stage asynchronous median detection architecture. 2 P r o je c t e d P r i n c i pa l E dg e Dis tribution (PPED ) Projected Principal Edge Distribution (PPED) algorithm [5],[6] is briefly explained using Fig. 2(a). A 5x5-pixel block taken from a 64x64-pixel target image is subjected to edge detection filtering in four principal directions, i.e. horizontal, vertical, and ±45-degree directions. In the figure, horizontal edge filtering is shown as an example. (The filtering kernels used for edge detection are given in Fig. 2(b).) In order to determine the threshold value for edge detection, all the absolute-value differences between two neighboring pixels are calculated in both vertical and horizontal directions and the median value is taken as the threshold. By scanning the 5x5-pixel filtering kernels in the target image, four 64x64 edge-flag maps are generated, which are called feature maps. In the horizontal feature map, for example, edge flags in every four rows are accumulated and spatial distribution of edge flags are represented by a histogram having 16 elements. Similar procedures are applied to other three directions to form respective histograms each having 16 elements. Finally, a 64-dimension vector is formed by series-connecting the four histograms in the order of horizontal, +45-degree, vertical, and –45-degree. Edge Detection 64x64 Feature Map (64x64) (Horizontal) 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 -1-1-1-1-1 0 0 0 0 0 (Horizontal) Threshold || Median Scan (16 elements) Edge Filter PPED Vector (Horizontal Section) 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 -1 -1 -1 -1 -1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 -1 0 1 0 -1 0 1 0 -1 -1 0 0 -1 0 0 0 Horizontal +45-degree 0 0 0 0 0 Threshold Detection Absolute value difference between neiboring pels. 1 1 1 1 1 0 -1 0 -1 0 -1 0 -1 0 -1 0 0 0 0 0 0 -1 0 0 0 1 0 -1 -1 0 0 1 0 -1 0 0 1 1 0 -1 0 0 0 1 0 Vertical (a) -45-degree (b) Fig. 2: PPED algorithm (a) and filtering kernels for edge detection (b). 3 Sy stem Orga ni za ti o n The system organization of the feature map generation VLSI is illustrated in Fig. 3. The system receives one column of data (8-b x 5 pixels) at each clock and stores the data in the last column of the 5x6 image buffer. The image buffer shifts all the stored data to the right at every clock. Before the edge filtering circuit (EFC) starts detecting four direction edges with respect to the center pixel in the 5x5 block, the threshold value calculated from all the pixel data in the 5x5 block must be ready in time for the processing. In order to keep the coherence of the threshold detection and the edge filtering processing, the two last-in data locating at column 5 and 6 are given to median filter circuit (MFC) in advance via absolute value circuit (AVC). AVC calculates all luminance differences between two neighboring pixels in columns 5 and 6. In this manner, a fully seamless pipeline processing from threshold detection to edge feature map generation has been established. The key requirement here is that MFC must determine the median value of the 40 luminance difference data from the 5x5-pixel block fast enough to carry out the seamless pipeline processing. For this purpose, a four-stage asynchronous median detection architecture has been developed which is explained in the following. Edge Filtering Circuit (EFC) 6 5 4 3 2 1 Edge flags H +45 V Image buffer 8-b x 5 pixels (One column) Absolute Value Circuit (AVC) Threshold value Median Filter Circuit (MFC) -45 Feature maps Fig. 3: System organization of feature map generation VLSI. The well-known binary search algorithm was adopted for fast execution of median detection. The median search processing for five 4-b data is illustrated in Fig. 4 for the purpose of explanation. In the beginning, majority voting is carried out for the MSB’s of all data. Namely, the number of 1’s is compared with the number of 0’s and the majority group wins. The majority group flag (“0” in this example) is stored as the MSB of the median value. In addition, the loser group is withdrawn in the following voting by changing all remaining bits to the loser MSB (“1” in this example). By repeating the processing, the median value is finally stored in the median value register. Elapse of time Median Register : 0 1 X X 0 1 1 0 0 0 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 MVC0 MVC1 MVC2 MVC3 MVC0 MVC1 MVC2 MVC3 MVC0 MVC1 MVC2 MVC3 MVC0 MVC1 MVC2 MVC3 Majority Flag : 0 0 X X X Majority Voting Circuit (MVC) Fig. 4: Hardware algorithm for median detection by binary search. How the median value is detected from all the 40 8-b data (20 horizontal luminance difference data and 20 vertical luminance difference data) is illustrated in Fig. 5. All the data are stored in the array of median detection units (MDU’s). At each clock, the array receives four vertical luminance difference data and five horizontal luminance difference data calculated from the data in column 5 and 6 in Fig. 3. The entire data are shifted downward at each clock. The median search is carried out for the upper four bits and the lower four bits separately in order to enhance the throughput by pipelining. For this purpose, the chip is equipped with eight majority voting circuits (MVC 0~7). The upper four bits from all the data are processed by MVC 4~7 in a single clock cycle to yield the median value. In the next clock cycle, the loser information is transferred to the lower four bits within each MDU and MVC0~3 carry out the median search for the lower four bits from all the data in the array. Vertical Luminance Difference AVC AVC AVC AVC Horizontal Luminance Difference AVC AVC AVC AVC AVC Shift Shift Median Detection Unit (MDU) x (40 Units) Lower 4bit Upper 4bit MVC0 MVC2 MVC1 MVC3 MVC4 MVC5 MVC6 MVC7 MVCs for upper 4bit MVCs for lower 4bit Fig. 5: Median detection architecture for all 40 luminance difference data. The majority voting circuit (MVC) is shown in Fig. 6. Output connected CMOS inverters are employed as preamplifiers for majority detection which was first proposed in Ref. [12]. In the present implementation, however, two preamps receiving input data and inverted input data are connected to a 2-stage differential amplifier. Although this doubles the area penalty, the instability in the threshold for majority detection due to process and temperature variations has been remarkably improved as compared to the single inverter thresholding in Ref. [12]. The MVC in Fig. 6 has 41 input terminals although 40 bits of data are inputted to the circuit at one time. Bit “0” is always given to the terminal IN40 to yield “0” as the majority when there is a tie in the majority voting. PREAMP IN0 PREAMP 2W/L IN0 2W/L OUT W/L ENBL W/L W/L IN1 IN1 2W/L 2W/L W/L ENBL IN40 W/L W/L IN40 Fig. 6: Majority voting circuit (MVC). The edge filtering circuit (EFC) in Fig. 3 is composed as a four-stage pipeline of regular CMOS digital logic. In the first two stages, four-direction edge gradients are computed, and in the succeeding two stages, the detection of the largest gradient and the thresholding is carried out to generate four edge flags. 4 E x p e r i m e n t a l R es u l t s The feature map generation VLSI was fabricated in a 0.35-µm double-poly three-metal-layer CMOS technology. A photomicrograph of the proof-of-concept chip is shown in Fig. 7. The measured waveforms of the MVC at operating frequencies of 10MHz and 90MHz are demonstrated in Fig. 8. The input condition is in the worst case. Namely, 21 “1” bits and 20 “0” bits were fed to the inputs. The observed computation time is about 12 nsec which is larger than the simulation result of 2.5 nsec. This was caused by the capacitance loading due to the probing of the test circuit. In the real circuit without external probing, we confirmed the average computation time of 4~5 nsec. Edge-detection Filtering Circuit Processing Technology 0.35µm CMOS 2-Poly 3-Metal Median Filter Control Unit Chip Size 4.5mm x 4.5mm MVC Majority Voting Circuit X8 Supply Voltage 3.3 V Operation Frequengy 50MHz Vector Generator Fig. 7: Photomicrograph and specification of the fabricated proof-of-concept chip. 1V/div 5ns/div MVC_Output 1V/div 8ns/div MVC_OUT IN IN 1 Majority Voting operation (a) Majority Voting operation (b) Fig. 8: Measured waveforms of majority voting circuit (MVC) at operation frequencies of 10MHz (a) and 90 MHz (b) for the worst-case input data. The feature maps generated by the chip at the operation frequency of 25 MHz are demonstrated in Fig. 9. The power dissipation was 224 mW. The difference between the flag bits detected by the chip and those obtained by computer simulation are also shown in the figure. The number of error flags was from 80 to 120 out of 16,384 flags, only a 0.6% of the total. The occurrence of such error bits is anticipated since we employed analog circuits for median detection. However, such error does not cause any serious problems in the PPED algorithm as demonstrated in Figs. 10 and 11. The template matching results with the top five PPED vector candidates in Sella identification are demonstrated in Fig. 11, where Manhattan distance was adopted as the dissimilarity measure. The error in the feature map generation processing yields a constant bias to the dissimilarity and does not affect the result of the maximum likelihood search. Generated Feature maps Difference as compared to computer simulation Sella Horizontal Plus 45-degrees Vertical Minus 45-degrees Fig. 9: Feature maps for Sella pattern generated by the chip. Generated PPED vector by the chip Sella Difference as compared to computer simulation Dissimilarity (by Manhattan Distance) Fig. 10: PPED vector for Sella pattern generated by the chip. The difference in the vector components between the PPED vector generated by the chip and that obtained by computer simulation is also shown. 1200 Measured Data 1000 800 Computer Simulation 600 400 200 0 1st (Correct) 2nd 3rd 4th 5th Candidates in Sella recognition Fig. 11: Comparison of template matching results. 5 Conclusion A mixed-signal median filter VLSI circuit for PPED vector generation is presented. A four-stage asynchronous median detection architecture based on analog digital mixed-signal circuits has been introduced. As a result, a fully seamless pipeline processing from threshold detection to edge feature map generation has been established. A prototype chip was designed in a 0.35-µm CMOS technology and the fab- ricated chip generates an edge based image vector every 80 µsec, which is about 10 4 times faster than the software computation. Acknowledgments The VLSI chip in this study was fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo with the collaboration by Rohm Corporation and Toppan Printing Corporation. The work is partially supported by the Ministry of Education, Science, Sports, and Culture under Grant-in-Aid for Scientific Research (No. 14205043) and by JST in the program of CREST. References [1] C. Liu and Harry Wechsler, “Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition”, IEEE Transactions on Image Processing, Vol. 11, No.4, Apr. 2002. [2] C. Yen-ting, C. Kuo-sheng, and L. Ja-kuang, “Improving cephalogram analysis through feature subimage extraction”, IEEE Engineering in Medicine and Biology Magazine, Vol. 18, No. 1, 1999, pp. 25-31. [3] H. Potlapalli and R. C. Luo, “Fractal-based classification of natural textures”, IEEE Transactions on Industrial Electronics, Vol. 45, No. 1, Feb. 1998. [4] T. Yamasaki and T. Shibata, “Analog Soft-Pattern-Matching Classifier Using Floating-Gate MOS Technology,” Advances in Neural Information Processing Systems 14, Vol. II, pp. 1131-1138. [5] Masakazu Yagi, Tadashi Shibata, “An Image Representation Algorithm Compatible to Neural-Associative-Processor-Based Hardware Recognition Systems,” IEEE Trans. Neural Networks, Vol. 14, No. 5, pp. 1144-1161, September (2003). [6] M. Yagi, M. Adachi, and T. Shibata,

12 nips-2003-A Model for Learning the Semantics of Pictures

Author: Victor Lavrenko, R. Manmatha, Jiwoon Jeon

Abstract: We propose an approach to learning the semantics of images which allows us to automatically annotate an image with keywords and to retrieve images based on text queries. We do this using a formalism that models the generation of annotated images. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words which allow us to predict the probability of generating a word given the image regions. This may be used to automatically annotate and retrieve images given a word as a query. Experiments show that our model signiﬁcantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval. 1

13 nips-2003-A Neuromorphic Multi-chip Model of a Disparity Selective Complex Cell

Author: Bertram E. Shi, Eric K. Tsang

Abstract: The relative depth of objects causes small shifts in the left and right retinal positions of these objects, called binocular disparity. Here, we describe a neuromorphic implementation of a disparity selective complex cell using the binocular energy model, which has been proposed to model the response of disparity selective cells in the visual cortex. Our system consists of two silicon chips containing spiking neurons with monocular Gabor-type spatial receptive fields (RF) and circuits that combine the spike outputs to compute a disparity selective complex cell response. The disparity selectivity of the cell can be adjusted by both position and phase shifts between the monocular RF profiles, which are both used in biology. Our neuromorphic system performs better with phase encoding, because the relative responses of neurons tuned to different disparities by phase shifts are better matched than the responses of neurons tuned by position shifts.

14 nips-2003-A Nonlinear Predictive State Representation

Author: Matthew R. Rudary, Satinder P. Singh

Abstract: Predictive state representations (PSRs) use predictions of a set of tests to represent the state of controlled dynamical systems. One reason why this representation is exciting as an alternative to partially observable Markov decision processes (POMDPs) is that PSR models of dynamical systems may be much more compact than POMDP models. Empirical work on PSRs to date has focused on linear PSRs, which have not allowed for compression relative to POMDPs. We introduce a new notion of tests which allows us to deﬁne a new type of PSR that is nonlinear in general and allows for exponential compression in some deterministic dynamical systems. These new tests, called e-tests, are related to the tests used by Rivest and Schapire [1] in their work with the diversity representation, but our PSR avoids some of the pitfalls of their representation—in particular, its potential to be exponentially larger than the equivalent POMDP. 1

15 nips-2003-A Probabilistic Model of Auditory Space Representation in the Barn Owl

Author: Brian J. Fischer, Charles H. Anderson

Abstract: The barn owl is a nocturnal hunter, capable of capturing prey using auditory information alone [1]. The neural basis for this localization behavior is the existence of auditory neurons with spatial receptive ﬁelds [2]. We provide a mathematical description of the operations performed on auditory input signals by the barn owl that facilitate the creation of a representation of auditory space. To develop our model, we ﬁrst formulate the sound localization problem solved by the barn owl as a statistical estimation problem. The implementation of the solution is constrained by the known neurobiology.

16 nips-2003-A Recurrent Model of Orientation Maps with Simple and Complex Cells

Author: Paul Merolla, Kwabena A. Boahen

Abstract: We describe a neuromorphic chip that utilizes transistor heterogeneity, introduced by the fabrication process, to generate orientation maps similar to those imaged in vivo. Our model consists of a recurrent network of excitatory and inhibitory cells in parallel with a push-pull stage. Similar to a previous model the recurrent network displays hotspots of activity that give rise to visual feature maps. Unlike previous work, however, the map for orientation does not depend on the sign of contrast. Instead, signindependent cells driven by both ON and OFF channels anchor the map, while push-pull interactions give rise to sign-preserving cells. These two groups of orientation-selective cells are similar to complex and simple cells observed in V1. 1 Orientation Maps Neurons in visual areas 1 and 2 (V1 and V2) are selectively tuned for a number of visual features, the most pronounced feature being orientation. Orientation preference of individual cells varies across the two-dimensional surface of the cortex in a stereotyped manner, as revealed by electrophysiology [1] and optical imaging studies [2]. The origin of these preferred orientation (PO) maps is debated, but experiments demonstrate that they exist in the absence of visual experience [3]. To the dismay of advocates of Hebbian learning, these results suggest that the initial appearance of PO maps rely on neural mechanisms oblivious to input correlations. Here, we propose a model that accounts for observed PO maps based on innate noise in neuron thresholds and synaptic currents. The network is implemented in silicon where heterogeneity is as ubiquitous as it is in biology. 2 Patterned Activity Model Ernst et al. have previously described a 2D rate model that can account for the origin of visual maps [4]. Individual units in their network receive isotropic feedforward input from the geniculate and recurrent connections from neighboring units in a Mexican hat profile, described by short-range excitation and long-range inhibition. If the recurrent connections are sufficiently strong, hotspots of activity (or ‘bumps’) form periodically across space. In a homogeneous network, these bumps of activity are equally stable at any position in the network and are free to wander. Introducing random jitter to the Mexican hat connectivity profiles breaks the symmetry and reduces the number of stable states for the bumps. Subsequently, the bumps are pinned down at the locations that maximize their net local recurrent feedback. In this regime, moving gratings are able to shift the bumps away from their stability points such that the responses of the network resemble PO maps. Therefore, the recurrent network, given an ample amount of noise, can innately generate its own orientation specificity without the need for specific hardwired connections or visually driven learning rules. 2.1 Criticisms of the Bump model We might posit that the brain uses a similar opportunistic model to derive and organize its feature maps – but the parallels between the primary visual cortex and the Ernst et al. bump model are unconvincing. For instance, the units in their model represent the collective activity of a column, reducing the network dynamics to a firing-rate approximation. But this simplification ignores the rich temporal dynamics of spiking networks, which are known to affect bump stability. More fundamentally, there is no role for functionally distinct neuron types. The primary criticism of the Ernst et al.’s bump model is that its input only consists of a luminance channel, and it is not obvious how to replace this channel with ON and OFF rectified channels to account for simple and complex cells. One possibility would be to segregate ON-driven and OFF-driven cells (referred to as simple cells in this paper) into two distinct recurrent networks. Because each network would have its own innate noise profile, bumps would form independently. Consequently, there is no guarantee that ON-driven maps would line up with OFF-driven maps, which would result in conflicting orientation signals when these simple cells converge onto sign-independent (complex) cells. 2.2 Simple Cells Solve a Complex Problem To ensure that both ON-driven and OFF-driven simple cells have the same orientation maps, both ON and OFF bumps must be computed in the same recurrent network so that they are subjected to the same noise profile. We achieve this by building our recurrent network out of cells that are sign-independent; that is both ON and OFF channels drive the network. These cells exhibit complex cell-like behavior (and are referred to as complex cells in this paper) because they are modulated at double the spatial frequency of a sinusoidal grating input. The simple cells subsequently derive their responses from two separate signals: an orientation selective feedback signal from the complex cells indicating the presence of either an ON or an OFF bump, and an ON–OFF selection signal that chooses the appropriate response flavor. Figure 1 left illustrates the formation of bumps (highlighted cells) by a recurrent network with a Mexican hat connectivity profile. Extending the Ernst et al. model, these complex bumps seed simple bumps when driven by a grating. Simple bumps that match the sign of the input survive, whereas out-of-phase bumps are extinguished (faded cells) by push-pull inhibition. Figure 1 right shows the local connections within a microcircuit. An EXC (excitatory) cell receives excitatory input from both ON and OFF channels, and projects to other EXC (not shown) and INH (inhibitory) cells. The INH cell projects back in a reciprocal configuration to EXC cells. The divergence is indicated in left. ON-driven and OFF-driven simple cells receive input in a push-pull configuration (i.e., ON cells are excited by ON inputs and inhibited by OFF inputs, and vise-versa), while additionally receiving input from the EXC–INH recurrent network. In this model, we implement our push-pull circuit using monosynaptic inhibitory connections, despite the fact that geniculate input is strictly excitatory. This simplification, while anatomically incorrect, yields a more efficient implementation that is functionally equivalent. ON Input Luminance OFF Input left right EXC EXC Divergence INH INH Simple Cells Complex Cells ON & OFF Input ON OFF OFF Space Figure 1: left, Complex and simple cell responses to a sinusoidal grating input. Luminance is transformed into ON (green) and OFF (red) pathways by retinal processing. Complex cells form a recurrent network through excitatory and inhibitory projections (yellow and blue lines, respectively), and clusters of activity occur at twice the spatial frequency of the grating. ON input activates ON-driven simple cells (bright green) and suppresses OFF-driven simple cells (faded red), and vise-versa. right, The bump model’s local microcircuit: circles represent neurons, curved lines represent axon arbors that end in excitatory synapses (v shape) or inhibitory synapses (open circles). For simplicity, inhibitory interneurons were omitted in our push-pull circuit. 2.3 Mathematical Description • The neurons in our network follow the equation CV = −∑ ∂(t − tn) + I syn − I KCa − I leak , • n where C is membrane capacitance, V is the temporal derivative of the membrane voltage, δ(·) is the Dirac delta function, which resets the membrane at the times tn when it crosses threshold, Isyn is synaptic current from the network, and Ileak is a constant leak current. Neurons receive synaptic current of the form: ON I syn = w+ I ON − w− I OFF + wEE I EXC − wEI I INH , EXC I syn = w+ ( I ON + I OFF ) + wEE I EXC − wEI I INH + I back , OFF INH I syn = w+ I OFF − w− I ON + wEE I EXC − wEI I INH , I syn = wIE I EXC where w+ is the excitatory synaptic strength for ON and OFF input synapses, w- is the strength of the push-pull inhibition, wEE is the synaptic strength for EXC cell projections to other EXC cells, wEI is the strength of INH cell projections to EXC cells, wIE is the strength of EXC cell projections to INH cells, Iback is a constant input current, and I{ON,OFF,EXC,INH} account for all impinging synapses from each of the four cell types. These terms are calculated for cell i using an arbor function that consists of a spatial weighting J(r) and a post-synaptic current waveform α(t): k ∑ J (i − k ) ⋅ α (t − t n ) , where k spans all cells of a given type and n indexes their spike k ,n times. The spatial weighting function is described by J (i − k ) = exp( − i − k σ ) , with σ as the space constant. The current waveform, which is non-zero for t>0, convolves a 1 t function with a decaying exponential: α (t ) = (t τ c + α 0 ) −1 ∗ exp(− t τ e ) , where τc is the decay-rate, and τe is the time constant of the exponential. Finally, we model spike-rate adaptation with a calcium-dependent potassium-channel (KCa), which integrates Ca triggered by spikes at times tn with a gain K and a time constant τk, as described by I KCa = ∑ K exp(tn − t τ k ) . n 3 Silicon Implementation We implemented our model in silicon using the TSMC (Taiwan Semiconductor Manufacturing Company) 0.25µm 5-metal layer CMOS process. The final chip consists of a 2-D core of 48x48 pixels, surrounded by asynchronous digital circuitry that transmits and receives spikes in real-time. Neurons that reach threshold within the array are encoded as address-events and sent off-chip, and concurrently, incoming address-events are sent to their appropriate synapse locations. This interface is compatible with other spike-based chips that use address-events [5]. The fabricated bump chip has close to 460,000 transistors packed in 10 mm2 of silicon area for a total of 9,216 neurons. 3.1 Circuit Design Our neural circuit was morphed into hardware using four building blocks. Figure 2 shows the transistor implementation for synapses, axonal arbors (diffuser), KCa analogs, and neurons. The circuits are designed to operate in the subthreshold region (except for the spiking mechanism of the neuron). Noise is not purposely designed into the circuits. Instead, random variations from the fabrication process introduce significant deviations in I-V curves of theoretically identical MOS transistors. The function of the synapse circuit is to convert a brief voltage pulse (neuron spike) into a postsynaptic current with biologically realistic temporal dynamics. Our synapse achieves this by cascading a current-mirror integrator with a log-domain low-pass filter. The current-mirror integrator has a current impulse response that decays as 1 t (with a decay rate set by the voltage τc and an amplitude set by A). This time-extended current pulse is fed into a log-domain low-pass filter (equivalent to a current-domain RC circuit) that imposes a rise-time on the post-synaptic current set by τe. ON and OFF input synapses receive presynaptic spikes from the off-chip link, whereas EXC and INH synapses receive presynaptic spikes from local on-chip neurons. Synapse Je Diffuser Ir A Ig Jc KCa Analog Neuron Jk Vmem Vspk K Figure 2: Transistor implementations are shown for a synapse, diffuser, KCa analog, and neuron (simplified), with circuit insignias in the top-left of each box. The circuits they interact with are indicated (e.g. the neuron receives synaptic current from the diffuser as well as adaptation current from the KCa analog; the neuron in turn drives the KCa analog). The far right shows layout for one pixel of the bump chip (vertical dimension is 83µm, horizontal is 30 µm). The diffuser circuit models axonal arbors that project to a local region of space with an exponential weighting. Analogous to resistive divider networks, diffusers [6] efficiently distribute synaptic currents to multiple targets. We use four diffusers to implement axonal projections for: the ON pathway, which excites ON and EXC cells and inhibits OFF cells; the OFF pathway, which excites OFF and EXC cells and inhibits ON cells; the EXC cells, which excite all cell types; and the INH cells, which inhibits EXC, ON, and OFF cells. Each diffuser node connects to its six neighbors through transistors that have a pseudo-conductance set by σr, and to its target site through a pseudo-conductance set by σg; the space-constant of the exponential synaptic decay is set by σr and σg’s relative levels. The neuron circuit integrates diffuser currents on its membrane capacitance. Diffusers either directly inject current (excitatory), or siphon off current (inhibitory) through a current-mirror. Spikes are generated by an inverter with positive feedback (modified from [7]), and the membrane is subsequently reset by the spike signal. We model a calcium concentration in the cell with a KCa analog. K controls the amount of calcium that enters the cell per spike; the concentration decays exponentially with a time constant set by τk. Elevated charge levels activate a KCa-like current that throttles the spike-rate of the neuron. 3.2 Experimental Setup Our setup uses either a silicon retina [8] or a National Instruments DIO (digital input–output) card as input to the bump chip. This allows us to test our V1 model with real-time visual stimuli, similar to the experimental paradigm of electrophysiologists. More specifically, the setup uses an address-event link [5] to establish virtual point-to-point connectivity between ON or OFF ganglion cells from the retina chip (or DIO card) with ON or OFF synapses on the bump chip. Both the input activity and the output activity of the bump chip is displayed in real-time using receiver chips, which integrate incoming spikes and displays their rates as pixel intensities on a monitor. A logic analyzer is used to capture spike output from the bump chip so it can be further analyzed. We investigated responses of the bump chip to gratings moving in sixteen different directions, both qualitatively and quantitatively. For the qualitative aspect, we created a PO map by taking each cell’s average activity for each stimulus direction and computing the vector sum. To obtain a quantitative measure, we looked at the normalized vector magnitude (NVM), which reveals the sharpness of a cell’s tuning. The NVM is calculated by dividing the vector sum by the magnitude sum for each cell. The NVM is 0 if a cell responds equally to all orientations, and 1 if a cell’s orientation selectivity is perfect such that it only responds at a single orientation. 4 Results We presented sixteen moving gratings to the network, with directions ranging from 0 to 360 degrees. The spatial frequency of the grating is tuned to match the size of the average bump, and the temporal frequency is 1 Hz. Figure 3a shows a resulting PO map for directions from 180 to 360 degrees, looking at the inhibitory cell population (the data looks similar for other cell types). Black contours represent stable bump regions, or equivalently, the regions that exceed a prescribed threshold (90 spikes) for all directions. The PO map from the bump chip reveals structure that resembles data from real cortex. Nearby cells tend to prefer similar orientations except at fractures. There are even regions that are similar to pinwheels (delimited by a white rectangle). A PO is a useful tool to describe a network’s selectivity, but it only paints part of the picture. So we have additionally computed a NVM map and a NVM histogram, shown in Figure 3b and 3c respectively. The NVM map shows that cells with sharp selectivity tend to cluster, particularly around the edge of the bumps. The histogram also reveals that the distribution of cell selectivity across the network varies considerably, skewed towards broadly tuned cells. We also looked at spike rasters from different cell-types to gain insight into their phase relationship with the stimulus. In particular, we present recordings for the site indicated by the arrow (see Figure 3a) for gratings moving in eight directions ranging from 0 to 360 degrees in 45-degree increments (this location was chosen because it is in the vicinity of a pinwheel, is reasonably selective, and shows considerable modulation in its firing rate). Figure 4 shows the luminance of the stimulus (bottom sinusoids), ON- (cyan) and OFF-input (magenta) spike trains, and the resulting spike trains from EXC (yellow), INH (blue), ON- (green), and OFFdriven (red) cell types for each of the eight directions. The center polar plot summarizes the orientation selectivity for each cell-type by showing the normalized number of spikes for each stimulus. Data is shown for one period. Even though all cells-types are selective for the same orientation (regardless of grating direction), complex cell responses tend to be phase-insensitive while the simple cell responses are modulated at the fundamental frequency. It is worth noting that the simple cells have sharper orientation selectivity compared to the complex cells. This trend is characteristic of our data. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 300 250 200 150 100 50 20 40 60 80 100 120 140 160 180 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Figure 3: (a) PO map for the inhibitory cell population stimulated with eight different directions from 180 to 360 degrees (black represents no activity, contours delineate regions that exceed 90 spikes for all stimuli). Normalized vector magnitude (NVM) data is presented as (b) a map and (c) a histogram. Figure 4: Spike rasters and polar plot for 8 directions ranging from 0 to 360 degrees. Each set of spike rasters represent from bottom to top, ON- (cyan) and OFF-input (magenta), INH (yellow), EXC (blue), and ON- (green) and OFF-driven (red). The stimulus period is 1 sec. 5 Discussion We have implemented a large-scale network of spiking neurons in a silicon chip that is based on layer 4 of the visual cortex. The initial testing of the network reveals a PO map, inherited from innate chip heterogeneities, resembling cortical maps. Our microcircuit proposes a novel function for complex-like cells; that is they create a sign-independent orientation selective signal, which through a push-pull circuit creates sharply tuned simple cells with the same orientation preference. Recently, Ringach et al. surveyed orientation selectivity in the macaque [9]. They observed that, in a population of V1 neurons (N=308) the distribution of orientation selectivity is quite broad, having a median NVM of 0.39. We have measured median NVM’s ranging from 0.25 to 0.32. Additionally, Ringach et al. found a negative correlation between spontaneous firing rate and NVM. This is consistent with our model because cells closer to the center of the bump have higher firing rates and broader tuning. While the results from the bump chip are promising, our maps are less consistent and noisier than the maps Ernst et al. have reported. We believe this is because our network is tuned to operate in a fluid state where bumps come on, travel a short distance and disappear (motivated by cortical imaging studies). But excessive fluidity can cause non-dominant bumps to briefly appear and adversely shift the PO maps. We are currently investigating the role of lateral connections between bumps as a means to suppress these spontaneous shifts. The neural mechanisms that underlie the orientation selectivity of V1 neurons are still highly debated. This may be because neuron responses are not only shaped by feedforward inputs, but are also influenced at the network level. If modeling is going to be a useful guide for electrophysiologists, we must model at the network level while retaining cell level detail. Our results demonstrate that a spike-based neuromorphic system is well suited to model layer 4 of the visual cortex. The same approach may be used to build large-scale models of other cortical regions. References 1. Hubel, D. and T. Wiesel, Receptive firelds, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol, 1962. 160: p. 106-154. 2. Blasdel, G.G., Orientation selectivity, preference, and continuity in monkey striate cortex. J Neurosci, 1992. 12(8): p. 3139-61. 3. Crair, M.C., D.C. Gillespie, and M.P. Stryker, The role of visual experience in the development of columns in cat visual cortex. Science, 1998. 279(5350): p. 566-70. 4. Ernst, U.A., et al., Intracortical origin of visual maps. Nat Neurosci, 2001. 4(4): p. 431-6. 5. Boahen, K., Point-to-Point Connectivity. IEEE Transactions on Circuits & Systems II, 2000. vol 47 no 5: p. 416-434. 6. Boahen, K. and Andreou. A contrast sensitive silicon retina with reciprocal synapses. in NIPS91. 1992: IEEE. 7. Culurciello, E., R. Etienne-Cummings, and K. Boahen, A Biomorphic Digital Image Sensor. IEEE Journal of Solid State Circuits, 2003. vol 38 no 2: p. 281-294. 8. Zaghloul, K., A silicon implementation of a novel model for retinal processing, in Neuroscience. 2002, UPENN: Philadelphia. 9. Ringach, D.L., R.M. Shapley, and M.J. Hawken, Orientation selectivity in macaque V1: diversity and laminar dependence. J Neurosci, 2002. 22(13): p. 5639-51.

17 nips-2003-A Sampled Texture Prior for Image Super-Resolution

Author: Lyndsey C. Pickup, Stephen J. Roberts, Andrew Zisserman

Abstract: Super-resolution aims to produce a high-resolution image from a set of one or more low-resolution images by recovering or inventing plausible high-frequency image content. Typical approaches try to reconstruct a high-resolution image using the sub-pixel displacements of several lowresolution images, usually regularized by a generic smoothness prior over the high-resolution image space. Other methods use training data to learn low-to-high-resolution matches, and have been highly successful even in the single-input-image case. Here we present a domain-speciﬁc image prior in the form of a p.d.f. based upon sampled images, and show that for certain types of super-resolution problems, this sample-based prior gives a signiﬁcant improvement over other common multiple-image super-resolution techniques. 1

18 nips-2003-A Summating, Exponentially-Decaying CMOS Synapse for Spiking Neural Systems

Author: Rock Z. Shi, Timothy K. Horiuchi

Abstract: Synapses are a critical element of biologically-realistic, spike-based neural computation, serving the role of communication, computation, and modiﬁcation. Many different circuit implementations of synapse function exist with different computational goals in mind. In this paper we describe a new CMOS synapse design that separately controls quiescent leak current, synaptic gain, and time-constant of decay. This circuit implements part of a commonly-used kinetic model of synaptic conductance. We show a theoretical analysis and experimental data for prototypes fabricated in a commercially-available 1.5µm CMOS process. 1

19 nips-2003-Algorithms for Interdependent Security Games

Author: Michael Kearns, Luis E. Ortiz

Abstract: unkown-abstract

20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games

Author: Yu-han Chang, Tracey Ho, Leslie P. Kaelbling

Abstract: In large multiagent games, partial observability, coordination, and credit assignment persistently plague attempts to design good learning algorithms. We provide a simple and efﬁcient algorithm that in part uses a linear system to model the world from a single agent’s limited perspective, and takes advantage of Kalman ﬁltering to allow an agent to construct a good training signal and learn an effective policy. 1

21 nips-2003-An Autonomous Robotic System for Mapping Abandoned Mines

22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays

23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering

25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science

26 nips-2003-An MDP-Based Approach to Online Mechanism Design

27 nips-2003-Analytical Solution of Spike-timing Dependent Plasticity Based on Synaptic Biophysics

28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs

30 nips-2003-Approximability of Probability Distributions

31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

32 nips-2003-Approximate Expectation Maximization