nips nips2001 nips2001-48 knowledge-graph by maker-knowledge-mining

48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance


Source: pdf

Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli

Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities.

White noise analysis has emerged as a powerful technique for characterizing response properties of spiking neurons. A sequence of stimuli are drawn randomly from an ensemble and presented in rapid succession, and one examines the subset that elicit action potentials. This “spike-triggered” stimulus ensemble can provide information about the neuron’s response characteristics. In the most widely used form of this analysis, one estimates an excitatory linear kernel by computing the spike-triggered average (STA); that is, the mean stimulus that elicited a spike [e.g., 1, 2]. Under the assumption that spikes are generated by a Poisson process with instantaneous rate determined by linear projection onto a kernel followed by a static nonlinearity, the STA provides an unbiased estimate of this kernel [3]. Recently, a number of authors have developed interesting extensions of white noise analysis. Some have examined spike-triggered averages in a reduced linear subspace of input stimuli [e.g., 4]. Others have recovered excitatory subspaces by computing the spike-triggered covariance (STC), followed by an eigenvector analysis to determine the subspace axes [e.g., 5, 6]. Sensory neurons exhibit striking nonlinear behaviors that are not explained by fundamentally linear mechanisms. For example, the response of a neuron typically saturates for large amplitude stimuli; the response to the optimal stimulus is often suppressed by the presence of a non-optimal mask [e.g., 7]; and the kernel recovered from STA analysis may change shape as a function of stimulus amplitude [e.g., 8, 9]. A variety of these nonlinear behaviors can be attributed to gain control [e.g., 8, 10, 11, 12, 13, 14], in which neural responses are suppressively modulated by a gain signal derived from the stimulus. Although the underlying mechanisms and time scales associated with such gain control are current topics of research, the basic functional properties appear to be ubiquitous, occurring throughout the nervous system.

Figure 1: Geometric depiction of spike-triggered analyses. a, Spike-triggered averaging with two-dimensional stimuli. Black points indicate raw stimuli. White points indicate stimuli eliciting a spike, and the STA (black vector), which provides an estimate of the excitatory kernel, corresponds to their center of mass. b, Spike-triggered covariance analysis of suppressive axes. Shown are a set of stimuli lying on a plane perpendicular to the excitatory kernel. Within the plane, stimuli eliciting a spike are concentrated in an elliptical region. The minor axis of the ellipse corresponds to a suppressive stimulus direction: stimuli with a significant component along this axis are less likely to elicit spikes. The stimulus component along the major axis of the ellipse has no influence on spiking.

Here we develop a white noise methodology for characterizing a neuron with gain control.
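
Before turning to the gain-control extension, the baseline STA computation described above can be sketched in a few lines. This is a minimal numpy sketch under assumed conventions (stimuli stored as rows of a matrix, one spike count per stimulus frame); the names are illustrative, not the authors' code.

```python
import numpy as np

def spike_triggered_average(stimuli, spike_counts):
    """Spike-triggered average: the mean stimulus, weighted by spike count.

    stimuli      : (T, D) array, one stimulus vector per frame
    spike_counts : (T,) array, number of spikes elicited by each frame
    """
    stimuli = np.asarray(stimuli, dtype=float)
    spike_counts = np.asarray(spike_counts, dtype=float)
    # Average of the spike-triggered ensemble, each frame weighted by its spike count.
    return spike_counts @ stimuli / spike_counts.sum()
```

For spherically symmetric white-noise stimuli, this average points along the excitatory kernel, as argued in the next section.
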
We show that a set of suppressive kernels may be recovered by finding the eigenvectors of the spike-triggered covariance matrix associated with smallest variance. We apply the technique to electrophysiological data obtained from ganglion cells in salamander and macaque retina, and recover a set of axes that are shown to reduce responses in the neuron. Moreover, when we fit a gain control model to the data using a maximum likelihood procedure within this subspace, the model accounts for changes in the STA as a function of contrast.

1 Characterizing suppressive axes

As in all white noise approaches, we assume that stimuli correspond to vectors in some finite-dimensional space (e.g., a neighborhood of pixels or an interval of time samples). We assume a gain control model in which the probability of a stimulus eliciting a spike grows monotonically with the halfwave-rectified projection onto an excitatory linear kernel, and is suppressively modulated by the fullwave-rectified projections onto a set of suppressive linear kernels.

First, we recover the excitatory kernel. This is achieved by presenting spherically symmetric input stimuli (e.g., Gaussian white noise) to the neuron and computing the STA (Fig. 1a). The STA correctly recovers the excitatory kernel, under the assumption that each of the gain control kernels is orthogonal (or equal) to the excitatory kernel. The proof is essentially the same as that given for recovering the kernel of a linear model followed by a monotonic nonlinearity [3]. In particular, any stimulus can be decomposed into a component in the direction of the excitatory kernel, and a component in a perpendicular direction. This can be paired with another stimulus that is identical, except that its component in the perpendicular direction is negated. The two stimuli are equally likely to occur in a spherically Gaussian stimulus set (since they are equidistant from the origin), and they are equally likely to elicit a spike (since their excitatory components are equal, and their rectified perpendicular components are equal). Their vector average lies in the direction of the excitatory kernel. Thus, the STA (which is an average over all such stimuli, or all such stimulus pairs) must also lie in that direction. In a subsequent section we explain how to recover the excitatory kernel when it is not orthogonal to the suppressive kernels.

Figure 2: Estimation of kernels from a simulated model (equation 2). Left: Model kernels. Right: Sorted eigenvalues of covariance matrix of stimuli eliciting spikes (STC). Five eigenvalues fall significantly below the others. Middle: STA (excitatory kernel) and eigenvectors (suppressive kernels) associated with the lowest eigenvalues.

Next, we recover the suppressive subspace, assuming the excitatory kernel is known. Consider the stimuli lying on a plane perpendicular to this kernel. These stimuli all elicit the same response in the excitatory kernel, but they may produce different amounts of suppression. Figure 1b illustrates the behavior in a three-dimensional stimulus space, in which one axis is assumed to be suppressive. The distribution of raw stimuli on the plane is spherically symmetric about the origin.
But the distribution of stimuli eliciting a spike is narrower along the suppressive direction: these stimuli have a component along the suppressive axis and are therefore less likely to elicit a spike. This behavior is easily generalized from this plane to the entire stimulus space. If we assume that the suppressive axes are fixed, then we expect to see reductions in variance in the same directions for any level of numerator excitation. Given this behavior of the spike-triggered stimulus ensemble, we can recover the suppressive subspace using principal component analysis. We construct the sample covariance matrix of the stimuli eliciting a spike,

$$C = \frac{1}{N-1} \sum_{n=1}^{N} \vec{s}_n \, \vec{s}_n^{\,T},$$

where $\vec{s}_1, \ldots, \vec{s}_N$ are the stimuli that elicited a spike.
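
Putting the pieces of this section together, the following is a minimal numpy sketch of the suppressive-subspace recovery: project the spike-triggered stimuli into the subspace perpendicular to the estimated excitatory kernel, form their covariance, and keep the lowest-variance eigenvectors. The conventions (whitened stimuli as rows, spike counts per frame) and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def suppressive_axes(stimuli, spike_counts, sta, n_axes):
    """Estimate candidate suppressive axes from the spike-triggered covariance.

    stimuli      : (T, D) stimulus matrix (assumed white / already whitened)
    spike_counts : (T,) spike counts per stimulus frame
    sta          : (D,) estimated excitatory kernel (e.g. the STA)
    Returns the eigenvalues (ascending) and the n_axes lowest-variance axes,
    expressed back in the original D-dimensional stimulus space.
    """
    k = sta / np.linalg.norm(sta)
    d = k.size
    # Orthonormal basis for the (D-1)-dimensional subspace perpendicular to k.
    basis = np.linalg.svd(np.eye(d) - np.outer(k, k))[0][:, : d - 1]
    # Spike-triggered ensemble, expressed in that perpendicular subspace.
    spiking = np.repeat(stimuli, spike_counts.astype(int), axis=0) @ basis
    # Sample covariance of the spike-triggered stimuli (the STC).
    stc = spiking.T @ spiking / (spiking.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(stc)   # eigenvalues in ascending order
    # The smallest-eigenvalue eigenvectors span the candidate suppressive subspace.
    return eigvals, basis @ eigvecs[:, :n_axes]
```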

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. [sent-12, score-0.918]

2 We demonstrate the method in simulation and on retinal ganglion cell data. [sent-13, score-0.387]

3 Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. [sent-14, score-0.793]

4 A sequence of stimuli are drawn randomly from an ensemble and presented in rapid succession, and one examines the subset that elicit action potentials. [sent-17, score-0.338]

5 In the most widely used form of this analysis, one estimates an excitatory linear kernel by computing the spike-triggered average (STA); that is, the mean stimulus that elicited a spike [e. [sent-19, score-0.691]

6 Under the assumption that spikes are generated by a Poisson process with instantaneous rate determined by linear projection onto a kernel followed by a static nonlinearity, the STA provides an unbiased estimate of this kernel [3]. [sent-22, score-0.449]

7 Some have examined spike-triggered averages in a reduced linear subspace of input stimuli [e. [sent-24, score-0.324]

8 Others have recovered excitatory subspaces by computing the spike-triggered covariance (STC), followed by an eigenvector analysis to determine the subspace axes [e. [sent-27, score-0.725]

9 For example, the response of a neuron typically saturates for large amplitude stimuli; the response to the optimal stimulus is often suppressed by the presence of a non-optimal mask [e. [sent-31, score-0.291]

10 , 7]; and the kernel recovered from STA analysis may change shape as a function of stimulus amplitude [e. [sent-33, score-0.408]

11 White points indicate stimuli eliciting a spike, and the STA (black vector), which provides an estimate of , corresponds to their center of mass. [sent-43, score-0.361]

12 Shown are a set of stimuli lying on a plane perpendicular to the excitatory kernel. [sent-45, score-0.634]

13 Within the plane, stimuli eliciting a spike are concentrated in an elliptical region. [sent-46, score-0.469]

14 The minor axis of the ellipse corresponds to a suppressive stimulus direction: stimuli with a significant component along this axis are less likely to elicit spikes. [sent-47, score-1.347]

15 The stimulus component along the major axis of the ellipse has no influence on spiking. [sent-48, score-0.336]

16 Here we develop a white noise methodology for characterizing a neuron with gain control. [sent-49, score-0.335]

17 We show that a set of suppressive kernels may be recovered by finding the eigenvectors of the spike-triggered covariance matrix associated with smallest variance. [sent-50, score-0.945]

18 We apply the technique to electrophysiological data obtained from ganglion cells in salamander and macaque retina, and recover a set of axes that are shown to reduce responses in the neuron. [sent-51, score-0.607]

19 1 Characterizing suppressive axes As in all white noise approaches, we assume that stimuli correspond to vectors in some finite-dimensional space (e. [sent-53, score-1.123]

20 We assume a gain control model in which the probability of a stimulus eliciting a spike grows monotonically with the halfwave-rectified projection onto an excitatory linear kernel, and is suppressively modulated by the fullwave-rectified projection onto a set of suppressive linear kernels. [sent-56, score-1.268]
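
One form consistent with this verbal description (not necessarily the paper's exact equation 2) is a half-wave-rectified excitatory projection divided by a weighted sum of full-wave-rectified suppressive projections plus a constant; all symbols here are reconstructions for illustration:

$$
P(\mathrm{spike} \mid \vec{s}) \;=\; g\!\left( \frac{\big\lfloor \vec{k} \cdot \vec{s} \big\rfloor}{\sigma + \sum_{i} w_{i} \,\big| \vec{k}_{i} \cdot \vec{s} \big|} \right),
$$

where $\vec{k}$ is the excitatory kernel, the $\vec{k}_i$ are the suppressive kernels, $\lfloor\cdot\rfloor$ denotes half-wave rectification, $|\cdot|$ full-wave rectification, the $w_i$ are non-negative weights, $\sigma$ is a constant, and $g$ is a monotonically increasing function.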

21 First, we recover the excitatory kernel. [sent-57, score-0.356]

22 The STA correctly recovers the excitatory kernel, under the assumption that each of the gain control kernels is orthogonal (or equal) to the excitatory kernel. [sent-62, score-1.026]

23 In particular, any stimulus can be decomposed into a component in the direction of the excitatory kernel, and a component in a perpendicular direction. [sent-64, score-0.609]

24 The two stimuli are equally likely to occur in a spherically Gaussian stimulus set (since they are equidistant from the origin), and they are equally likely to elicit a spike (since their excitatory components are equal, and their rectified perpendicular components are equal). [sent-66, score-1.04]

25 Their vector average lies in the direction of the excitatory kernel. [sent-67, score-0.332]

26 Right: Sorted eigenvalues of covariance matrix of stimuli eliciting spikes (STC). [sent-73, score-0.581]

27 recover the excitatory kernel when it is not orthogonal to the suppressive kernels. [sent-76, score-1.152]

28 Next, we recover the suppressive subspace, assuming the excitatory kernel is known. [sent-77, score-1.096]

29 Consider the stimuli lying on a plane perpendicular to this kernel. [sent-78, score-0.331]

30 These stimuli all elicit the same response in the excitatory kernel, but they may produce different amounts of suppression. [sent-79, score-0.605]

31 The distribution of raw stimuli on the plane is spherically symmetric about the origin. [sent-81, score-0.343]

32 But the distribution of stimuli eliciting a spike is narrower along the suppressive direction: these stimuli have a component along the suppressive axis and are therefore less likely to elicit a spike. [sent-82, score-2.152]

33 If we assume that the suppressive axes are fixed, then we expect to see reductions in variance in the same directions for any level of numerator excitation. [sent-84, score-0.833]

34 Given this behavior of the spike-triggered stimulus ensemble, we can recover the suppressive subspace using principal component analysis. [sent-85, score-0.971]

35 We construct the sample covariance matrix of the stimuli eliciting a spike. [sent-86, score-0.381]

36 To ensure the estimated suppressive subspace is orthogonal to the estimated excitatory kernel (as in Figure 1b), the stimuli are first projected onto the subspace perpendicular to the estimated excitatory kernel. [sent-88, score-1.347]

37 The principal axes (eigenvectors) of this covariance matrix that are associated with small variance (eigenvalues) correspond to directions in which the response of the neuron is modulated suppressively. [sent-89, score-0.308]

38 The goal of the simulation is to recover the excitatory kernel, the weights, and the constant. [sent-97, score-0.506]

39 Low eigenvalues correspond to suppressive directions, while other eigenvalues correspond to arbitrary (ignored) directions. [sent-100, score-1.018]

40 Raw stimulus ensemble was sphered (whitened) prior to analysis and low-variance axes underrepresented in stimulus set were discarded. [sent-101, score-0.525]
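
A minimal sketch of that preprocessing step (centering, eigendecomposition of the stimulus covariance, discarding of low-variance axes, and rescaling). The variance threshold and function names are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

def sphere_stimuli(stimuli, min_rel_variance=1e-3):
    """Whiten the stimulus ensemble and drop underrepresented (low-variance) axes.

    Returns the whitened stimuli and the basis needed to map recovered kernels
    back into the original stimulus space (illustrative conventions).
    """
    X = stimuli - stimuli.mean(axis=0)
    cov = X.T @ X / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    keep = eigvals > min_rel_variance * eigvals.max()      # discard low-variance axes
    basis = eigvecs[:, keep] / np.sqrt(eigvals[keep])      # whitening transform
    return X @ basis, basis
```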

41 First, we note that STA recovers an accurate estimate of the excitatory kernel. [sent-104, score-0.35]

42 The last five eigenvalues are significantly below the values one would obtain with randomly selected stimulus subsets. [sent-107, score-0.313]

43 The eigenvectors associated with these lowest eigenvalues span approximately the same subspace as the suppressive kernels. [sent-108, score-0.97]

44 Note that some eigenvectors correspond to mixtures of the original suppressive kernels, due to non-uniqueness of the eigenvector decomposition. [sent-109, score-0.719]

45 In contrast, eigenvectors corresponding to eigenvalues in the gradually-descending region appear arbitrary in their structure. [sent-110, score-0.286]

46 2 Suppressive Axes in Retinal Ganglion Cells Retinal ganglion cells exhibit rapid [8, 15] as well as slow [9, 16, 17] gain control. [sent-113, score-0.372]

47 We now demonstrate that we can recover a rapid gain control signal by applying the method to data from salamander retina [9]. [sent-114, score-0.507]

48 To correct for this, we discard low-variance axes and whiten the stimuli within the remaining axes. [sent-119, score-0.39]

49 ¤¥ Figure 3 depicts the kernels estimated from the 623 stimulus vectors eliciting spikes. [sent-120, score-0.419]

50 Similar to the model simulation, the eigenvalues gradually fall off, but four of the eigenvalues appear to drop significantly below the rest. [sent-121, score-0.344]

51 To make this more concrete, we test the hypothesis that the majority of the eigenvalues are consistent with those of randomly selected stimulus vectors, but that the last eigenvalues fall significantly below this range. [sent-122, score-0.486]
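
One simple way to build such a null range, sketched under assumptions: repeatedly draw random stimulus subsets of the same size as the spike-triggered set and record the sorted eigenvalues of their covariance matrices. The number of resamples is illustrative, and the projection of randomly chosen orthogonal axes described in the following sentences is omitted here for brevity.

```python
import numpy as np

def eigenvalue_null_range(stimuli, n_spiking, n_resamples=1000, seed=0):
    """Eigenvalue ranges for covariance matrices of randomly chosen stimulus
    subsets, used to judge whether the lowest STC eigenvalues are significant."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_resamples):
        idx = rng.choice(stimuli.shape[0], size=n_spiking, replace=False)
        sub = stimuli[idx]
        cov = sub.T @ sub / (n_spiking - 1)
        samples.append(np.linalg.eigvalsh(cov))            # ascending eigenvalues
    samples = np.array(samples)                            # (n_resamples, D)
    return samples.min(axis=0), samples.max(axis=0)
```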

52 We also randomly select (orthogonal) axes. [sent-124

53 Figure 4: Scatter plots from salamander ganglion cell data (cell 1999-11-12-B6A). [sent-132

54 a, Projection of stimuli onto estimated excitatory kernel vs. [sent-135, score-0.744]

55 b, Projection of stimuli onto an estimated suppressive kernel vs. [sent-137, score-1.063]

56 axes, representing a suppressive subspace, and project this subspace out of the set of randomly chosen stimuli. [sent-139, score-0.727]

57 These low eigenvalues correspond to eigenvectors that are concentrated in recent time (as is the estimated excitatory kernel). [sent-144, score-0.615]

58 We emphasize that these kernels should not be interpreted to correspond to receptive fields of individual neurons underlying the suppressive signal, but merely provide an orthogonal basis for a suppressive subspace. [sent-146, score-1.48]

59 We can now verify that the recovered STA axis is in fact excitatory, and the kernels corresponding to the lowest eigenvalues are suppressive. [sent-147, score-0.472]

60 Figure 4a shows a scatter plot of the stimuli projected onto the excitatory axis vs. [sent-148, score-0.751]

61 Spikes are seen to occur only when the component along the excitatory axis is high, as expected. [sent-150, score-0.439]

62 Figure 4b is a scatter plot of the stimuli projected onto one of the suppressive axes vs. [sent-151, score-1.154]

63 The spiking stimuli lie within an ellipse, with the minor axis corresponding to the suppressive kernel. [sent-153, score-0.947]

64 This is exactly what we would expect in a suppressive gain control system (see Figure 1b). [sent-154, score-0.852]

65 Figure 5 illustrates recovery of a two-dimensional suppressive subspace for a macaque retinal ganglion cell. [sent-155, score-1.03]

66 43K stimulus vectors eliciting spikes out of a total of 284. [sent-157, score-0.336]

67 3 Correcting for Bias in Kernel Estimates The kernels in the previous section were all recovered from stimuli of a single contrast. [sent-163, score-0.432]

68 However, when the STA is computed in a ganglion cell for low and high contrast stimuli, the low-contrast kernel shows a slower time course [9] (figure 7,a). [sent-164, score-0.415]

69 This would appear inconsistent with the method we describe, in which the STA is meant to provide an estimate of a single excitatory kernel. [sent-165, score-0.343]

70 This behavior can be explained by assuming a model of the form given in equation 2, and in addition dropping the constraint that the gain control kernels are orthogonal (or identical) to the excitatory kernel. [sent-166, score-0.695]

71 (Figure axis labels: projection onto suppressive kernel; projection onto excitatory kernel.) [sent-167

72 Figure 5: a, Sorted eigenvalues of stimuli eliciting spikes from a macaque retina (cell 200109-29-E6A). [sent-176

73 b-c, Scatter plots of stimuli projected onto recovered axes. [sent-177, score-0.427]

74 When a gain control kernel is not orthogonal to the excitatory kernel, the responses to one side of the excitatory kernel are suppressed more than those on the other side. [sent-179, score-1.175]

75 The resulting STA estimate is thus biased away from the true excitatory kernel. [sent-180, score-0.357]

76 First we show that when the orthogonality constraint is dropped, the STA estimate of the excitatory kernel is biased by the gain control signal. [sent-181, score-0.705]

77 Consider a situation in which a suppressive kernel contains a component in the direction of the excitatory kernel. [sent-182, score-1.1]

78 We write the suppressive kernel as the sum of a component along the excitatory kernel and a component that is perpendicular to the excitatory kernel. [sent-183, score-0.361]

79 Then, for example, a stimulus and the corresponding paired stimulus vector (identical except for a negated perpendicular component) produce different suppressive components along this kernel. [sent-184, score-1.647]

80 Thus, the two stimuli are equally likely to occur but not equally likely to elicit a spike. [sent-185, score-0.36]

81 Figure 6 illustrates an example in which a non-orthogonal suppressive axis biases the estimate of the STA. [sent-187, score-0.728]

82 Note that the bias is stronger for larger amplitude stimuli because the constant term dominates the gain control signal for weak stimuli. [sent-189, score-0.496]

83 Even when the STA estimate is biased by the gain control signal, we can still obtain an (asymptotically) unbiased estimate of the excitatory kernel. [sent-191, score-0.606]

84 Specifically, the true excitatory kernel lies within the subspace spanned by the estimated (biased) excitatory and suppressive kernels. [sent-192, score-1.506]

85 So, assuming a particular gain control model, we can again maximize the likelihood of the data, but now allowing both the excitatory and suppressive kernels to move within the subspace spanned by the initial estimated kernels. [sent-193, score-1.44]
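
A sketch of that refitting step, under stated assumptions: a Poisson likelihood, the ratio-form rate hypothesized earlier, and every kernel constrained to the column space of the initial STA/STC basis. The parameterization, optimizer, and names are illustrative assumptions rather than the authors' procedure.

```python
import numpy as np
from scipy.optimize import minimize

def refit_kernels(stimuli, spike_counts, basis, n_sup, dt=0.01):
    """Maximum-likelihood refit of excitatory and suppressive kernels, with all
    kernels constrained to the column space of `basis` (e.g. the STA plus the
    low-eigenvalue STC eigenvectors). Illustrative sketch, not the authors' code."""
    S = stimuli @ basis                      # stimuli expressed in the subspace
    n_basis = basis.shape[1]

    def negloglik(theta):
        c_exc = theta[:n_basis]
        c_sup = theta[n_basis:n_basis * (1 + n_sup)].reshape(n_sup, n_basis)
        w = np.exp(theta[-n_sup - 1:-1])     # positive suppressive weights
        sigma = np.exp(theta[-1])            # positive constant
        drive = np.maximum(S @ c_exc, 1e-12)                 # half-wave rectified
        rate = drive / (sigma + np.abs(S @ c_sup.T) @ w) * dt
        return -(spike_counts * np.log(rate) - rate).sum()   # Poisson log-likelihood

    theta0 = np.zeros(n_basis * (1 + n_sup) + n_sup + 1)
    theta0[:n_basis] = 1.0 / n_basis
    res = minimize(negloglik, theta0, method="Nelder-Mead")
    c_exc = res.x[:n_basis]
    c_sup = res.x[n_basis:n_basis * (1 + n_sup)].reshape(n_sup, n_basis)
    # Map the fitted kernels back into stimulus space; they need not be orthogonal.
    return basis @ c_exc, (basis @ c_sup.T).T
```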

86 5%) and high (34%) contrast salamander retinal ganglion cell data (cell 1999-11-12-B6A). [sent-206, score-0.531]

87 suppressive kernels need not be orthogonal to the excitatory kernel. [sent-211, score-0.465]

88 The excitatory axis is initially set to the STA and the suppressive axes are set to the low-eigenvalue eigenvectors of the STC, along with the STA (e. [sent-214, score-1.275]

89 Whereas the axes recovered from the STA/STC analysis are orthogonal, the axes determined during the maximum likelihood stage need not be (and in the data example are not) orthogonal. [sent-218, score-0.468]

90 Specifically, we simulate responses of the model (equation (3) with Poisson spike generation) on each of the two contrast stimulus sets, and then compute the STA based on these simulated spike trains. [sent-220, score-0.476]
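
A minimal sketch of that simulation: generate Poisson spikes from a hypothesized gain-control rate and recompute the STA from the simulated train. The paper's equation (3) is not reproduced here, so the ratio-form rate below, the time step, and all names are illustrative assumptions.

```python
import numpy as np

def simulated_sta(stimuli, k_exc, k_sup, weights, sigma, dt=0.01, seed=0):
    """Simulate Poisson spikes from a hypothesized gain-control rate and
    recompute the STA from the simulated spike train (illustrative only)."""
    rng = np.random.default_rng(seed)
    drive = np.maximum(stimuli @ k_exc, 0.0)                 # half-wave rectified drive
    suppress = sigma + np.abs(stimuli @ k_sup.T) @ weights   # full-wave rectified pool
    rate = drive / suppress
    spikes = rng.poisson(rate * dt)
    # STA of the simulated spike train; can be compared across contrast conditions.
    return spikes @ stimuli / spikes.sum()
```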

91 Although it is based on a single fixed excitatory kernel, the model exhibits a change in STA shape as a function of contrast very much like the salamander neuron. [sent-221, score-0.479]

92 4 Discussion We have described a spike-triggered covariance method for characterizing a neuron with gain control, and demonstrated the plausibility of the technique through simulation and analysis of neural data. [sent-222, score-0.344]

93 Models of retinal processing often incorporate gain control [e. [sent-224, score-0.355]

94 We have shown for the first time how one can use white noise analysis to recover a gain control subspace. [sent-227, score-0.368]

95 Thus, it is interesting to compare the recovered subspace to models of rapid gain control. [sent-229, score-0.4]

96 In particular, Victor [15] proposed a retinal gain model in which the gain signal consists of time-delayed copies of the excitatory kernel. [sent-230, score-0.797]

97 In fact, for the cell shown in Figure 3, the recovered suppressive subspace lies within the space spanned by shifted copies of the excitatory kernel. [sent-231, score-1.281]

98 The fact that we do not see evidence for slow gain control in the analysis might indicate that these signals do not lie within a low-dimensional stimulus subspace. [sent-232, score-0.412]

99 The effect of contrast on the transfer properties of cat retinal ganglion cells. [sent-270, score-0.341]

100 Temporal contrast adaptation in the input and output signals of salamander retinal ganglion cells. [sent-305, score-0.461]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('suppressive', 0.622), ('sta', 0.315), ('excitatory', 0.303), ('stimuli', 0.219), ('axes', 0.171), ('stimulus', 0.163), ('gain', 0.16), ('eigenvalues', 0.15), ('ganglion', 0.136), ('salamander', 0.127), ('retinal', 0.125), ('eliciting', 0.123), ('kernel', 0.118), ('spike', 0.107), ('recovered', 0.107), ('kernels', 0.106), ('subspace', 0.105), ('cell', 0.094), ('axis', 0.087), ('onto', 0.077), ('eigenvectors', 0.071), ('control', 0.07), ('projection', 0.067), ('elicit', 0.063), ('white', 0.061), ('perpendicular', 0.058), ('orthogonal', 0.056), ('recover', 0.053), ('spikes', 0.05), ('stc', 0.049), ('spherically', 0.049), ('contrast', 0.049), ('neuron', 0.047), ('arbitrary', 0.044), ('characterizing', 0.043), ('raw', 0.042), ('macaque', 0.042), ('retina', 0.042), ('scatter', 0.041), ('covariance', 0.039), ('ellipse', 0.037), ('biased', 0.035), ('plane', 0.033), ('simulation', 0.032), ('behaviors', 0.031), ('cat', 0.031), ('sorted', 0.031), ('retrieved', 0.029), ('receptive', 0.029), ('direction', 0.029), ('cells', 0.029), ('spanned', 0.028), ('chander', 0.028), ('odelia', 0.028), ('shapley', 0.028), ('suppressively', 0.028), ('recovers', 0.028), ('component', 0.028), ('rapid', 0.028), ('ensemble', 0.028), ('cantly', 0.027), ('estimated', 0.027), ('signal', 0.027), ('correspond', 0.026), ('responses', 0.026), ('signi', 0.026), ('modulated', 0.026), ('poisson', 0.025), ('simulated', 0.024), ('adaptation', 0.024), ('projected', 0.024), ('noise', 0.024), ('fall', 0.023), ('technique', 0.023), ('preceding', 0.022), ('copies', 0.022), ('reductions', 0.022), ('lowest', 0.022), ('along', 0.021), ('suppressed', 0.021), ('schwartz', 0.021), ('lying', 0.021), ('appear', 0.021), ('response', 0.02), ('sensory', 0.02), ('concentrated', 0.02), ('amplitude', 0.02), ('likely', 0.02), ('equally', 0.019), ('lie', 0.019), ('estimate', 0.019), ('likelihood', 0.019), ('neurons', 0.019), ('exhibit', 0.019), ('suppression', 0.019), ('striate', 0.019), ('directions', 0.018), ('nonlinearity', 0.018), ('low', 0.018), ('eigenvalue', 0.017)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli

Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities.

2 0.14354241 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

Author: B. D. Wright, Kamal Sen, William Bialek, A. J. Doupe

Abstract: In nature, animals encounter high dimensional sensory stimuli that have complex statistical and dynamical structure. Attempts to study the neural coding of these natural signals face challenges both in the selection of the signal ensemble and in the analysis of the resulting neural responses. For zebra finches, naturalistic stimuli can be defined as sounds that they encounter in a colony of conspecific birds. We assembled an ensemble of these sounds by recording groups of 10-40 zebra finches, and then analyzed the response of single neurons in the songbird central auditory area (field L) to continuous playback of long segments from this ensemble. Following methods developed in the fly visual system, we measured the information that spike trains provide about the acoustic stimulus without any assumptions about which features of the stimulus are relevant. Preliminary results indicate that large amounts of information are carried by spike timing, with roughly half of the information accessible only at time resolutions better than 10 ms; additional information is still being revealed as time resolution is improved to 2 ms. Information can be decomposed into that carried by the locking of individual spikes to the stimulus (or modulations of spike rate) vs. that carried by timing in spike patterns. Initial results show that in field L, temporal patterns give at least % extra information. Thus, single central auditory neurons can provide an informative representation of naturalistic sounds, in which spike timing may play a significant role.   

3 0.14089341 141 nips-2001-Orientation-Selective aVLSI Spiking Neurons

Author: Shih-Chii Liu, Jörg Kramer, Giacomo Indiveri, Tobi Delbrück, Rodney J. Douglas

Abstract: We describe a programmable multi-chip VLSI neuronal system that can be used for exploring spike-based information processing models. The system consists of a silicon retina, a PIC microcontroller, and a transceiver chip whose integrate-and-fire neurons are connected in a soft winner-take-all architecture. The circuit on this multi-neuron chip approximates a cortical microcircuit. The neurons can be configured for different computational properties by the virtual connections of a selected set of pixels on the silicon retina. The virtual wiring between the different chips is effected by an event-driven communication protocol that uses asynchronous digital pulses, similar to spikes in a neuronal system. We used the multi-chip spike-based system to synthesize orientation-tuned neurons using both a feedforward model and a feedback model. The performance of our analog hardware spiking model matched the experimental observations and digital simulations of continuous-valued neurons. The multi-chip VLSI system has advantages over computer neuronal models in that it is real-time, and the computational time does not scale with the size of the neuronal network.

4 0.12387697 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

Author: Aaron C. Courville, David S. Touretzky

Abstract: The Temporal Coding Hypothesis of Miller and colleagues [7] suggests that animals integrate related temporal patterns of stimuli into single memory representations. We formalize this concept using quasi-Bayes estimation to update the parameters of a constrained hidden Markov model. This approach allows us to account for some surprising temporal effects in the second order conditioning experiments of Miller et al. [1 , 2, 3], which other models are unable to explain. 1

5 0.12020198 136 nips-2001-On the Concentration of Spectral Properties

Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola

Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1

6 0.11223411 164 nips-2001-Sampling Techniques for Kernel Methods

7 0.093570441 37 nips-2001-Associative memory in realistic neuronal networks

8 0.092438661 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

9 0.086293429 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

10 0.081964344 74 nips-2001-Face Recognition Using Kernel Methods

11 0.080927998 58 nips-2001-Covariance Kernels from Bayesian Generative Models

12 0.076310724 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

13 0.075040601 145 nips-2001-Perceptual Metamers in Stereoscopic Vision

14 0.07365448 124 nips-2001-Modeling the Modulatory Effect of Attention on Human Spatial Vision

15 0.070312522 72 nips-2001-Exact differential equation population dynamics for integrate-and-fire neurons

16 0.070236228 27 nips-2001-Activity Driven Adaptive Stochastic Resonance

17 0.068213269 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity

18 0.067843989 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines

19 0.066438898 170 nips-2001-Spectral Kernel Methods for Clustering

20 0.06549263 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.164), (1, -0.129), (2, -0.135), (3, -0.082), (4, 0.136), (5, 0.112), (6, 0.021), (7, 0.054), (8, 0.008), (9, 0.074), (10, -0.053), (11, 0.081), (12, -0.095), (13, 0.017), (14, -0.037), (15, -0.146), (16, 0.004), (17, 0.084), (18, 0.028), (19, 0.101), (20, -0.088), (21, -0.158), (22, 0.088), (23, -0.055), (24, -0.063), (25, 0.085), (26, -0.007), (27, 0.007), (28, -0.007), (29, -0.086), (30, -0.056), (31, 0.014), (32, 0.019), (33, -0.034), (34, -0.026), (35, -0.024), (36, -0.041), (37, -0.007), (38, -0.026), (39, 0.016), (40, 0.044), (41, 0.028), (42, -0.011), (43, 0.062), (44, 0.024), (45, 0.049), (46, -0.058), (47, 0.132), (48, -0.142), (49, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95874619 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli

Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities.

2 0.61212796 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

Author: B. D. Wright, Kamal Sen, William Bialek, A. J. Doupe

Abstract: In nature, animals encounter high dimensional sensory stimuli that have complex statistical and dynamical structure. Attempts to study the neural coding of these natural signals face challenges both in the selection of the signal ensemble and in the analysis of the resulting neural responses. For zebra finches, naturalistic stimuli can be defined as sounds that they encounter in a colony of conspecific birds. We assembled an ensemble of these sounds by recording groups of 10-40 zebra finches, and then analyzed the response of single neurons in the songbird central auditory area (field L) to continuous playback of long segments from this ensemble. Following methods developed in the fly visual system, we measured the information that spike trains provide about the acoustic stimulus without any assumptions about which features of the stimulus are relevant. Preliminary results indicate that large amounts of information are carried by spike timing, with roughly half of the information accessible only at time resolutions better than 10 ms; additional information is still being revealed as time resolution is improved to 2 ms. Information can be decomposed into that carried by the locking of individual spikes to the stimulus (or modulations of spike rate) vs. that carried by timing in spike patterns. Initial results show that in field L, temporal patterns give at least % extra information. Thus, single central auditory neurons can provide an informative representation of naturalistic sounds, in which spike timing may play a significant role.   

3 0.54159737 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

Author: Gal Chechik, Amir Globerson, M. J. Anderson, E. D. Young, Israel Nelken, Naftali Tishby

Abstract: The way groups of auditory neurons interact to code acoustic information is investigated using an information theoretic approach. We develop measures of redundancy among groups of neurons, and apply them to the study of collaborative coding efficiency in two processing stations in the auditory pathway: the inferior colliculus (IC) and the primary auditory cortex (AI). Under two schemes for the coding of the acoustic content, acoustic segments coding and stimulus identity coding, we show differences both in information content and group redundancies between IC and AI neurons. These results provide for the first time a direct evidence for redundancy reduction along the ascending auditory pathway, as has been hypothesized for theoretical considerations [Barlow 1959,2001]. The redundancy effects under the single-spikes coding scheme are significant only for groups larger than ten cells, and cannot be revealed with the redundancy measures that use only pairs of cells. The results suggest that the auditory system transforms low level representations that contain redundancies due to the statistical structure of natural stimuli, into a representation in which cortical neurons extract rare and independent component of complex acoustic signals, that are useful for auditory scene analysis. 1

4 0.4953585 141 nips-2001-Orientation-Selective aVLSI Spiking Neurons

Author: Shih-Chii Liu, Jörg Kramer, Giacomo Indiveri, Tobi Delbrück, Rodney J. Douglas

Abstract: We describe a programmable multi-chip VLSI neuronal system that can be used for exploring spike-based information processing models. The system consists of a silicon retina, a PIC microcontroller, and a transceiver chip whose integrate-and-fire neurons are connected in a soft winner-take-all architecture. The circuit on this multi-neuron chip approximates a cortical microcircuit. The neurons can be configured for different computational properties by the virtual connections of a selected set of pixels on the silicon retina. The virtual wiring between the different chips is effected by an event-driven communication protocol that uses asynchronous digital pulses, similar to spikes in a neuronal system. We used the multi-chip spike-based system to synthesize orientation-tuned neurons using both a feedforward model and a feedback model. The performance of our analog hardware spiking model matched the experimental observations and digital simulations of continuous-valued neurons. The multi-chip VLSI system has advantages over computer neuronal models in that it is real-time, and the computational time does not scale with the size of the neuronal network.

5 0.48615089 136 nips-2001-On the Concentration of Spectral Properties

Author: John Shawe-Taylor, Nello Cristianini, Jaz S. Kandola

Abstract: We consider the problem of measuring the eigenvalues of a randomly drawn sample of points. We show that these values can be reliably estimated as can the sum of the tail of eigenvalues. Furthermore, the residuals when data is projected into a subspace is shown to be reliably estimated on a random sample. Experiments are presented that confirm the theoretical results. 1

6 0.47974718 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

7 0.47497326 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

8 0.43537492 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

9 0.42429471 145 nips-2001-Perceptual Metamers in Stereoscopic Vision

10 0.42263278 164 nips-2001-Sampling Techniques for Kernel Methods

11 0.42003688 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

12 0.41148421 74 nips-2001-Face Recognition Using Kernel Methods

13 0.40645623 124 nips-2001-Modeling the Modulatory Effect of Attention on Human Spatial Vision

14 0.4029744 14 nips-2001-A Neural Oscillator Model of Auditory Selective Attention

15 0.38769731 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

16 0.34562451 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

17 0.33449686 72 nips-2001-Exact differential equation population dynamics for integrate-and-fire neurons

18 0.33351594 57 nips-2001-Correlation Codes in Neuronal Populations

19 0.3283298 155 nips-2001-Quantizing Density Estimators

20 0.3148852 58 nips-2001-Covariance Kernels from Bayesian Generative Models


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.243), (14, 0.058), (17, 0.062), (19, 0.041), (27, 0.111), (30, 0.072), (38, 0.026), (59, 0.038), (72, 0.032), (79, 0.022), (91, 0.166)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.88016105 48 nips-2001-Characterizing Neural Gain Control using Spike-triggered Covariance

Author: Odelia Schwartz, E. J. Chichilnisky, Eero P. Simoncelli

Abstract: Spike-triggered averaging techniques are effective for linear characterization of neural responses. But neurons exhibit important nonlinear behaviors, such as gain control, that are not captured by such analyses. We describe a spike-triggered covariance method for retrieving suppressive components of the gain control signal in a neuron. We demonstrate the method in simulation and on retinal ganglion cell data. Analysis of physiological data reveals significant suppressive axes and explains neural nonlinearities. This method should be applicable to other sensory areas and modalities. White noise analysis has emerged as a powerful technique for characterizing response properties of spiking neurons. A sequence of stimuli are drawn randomly from an ensemble and presented in rapid succession, and one examines the subset that elicit action potentials. This “spike-triggered” stimulus ensemble can provide information about the neuron’s response characteristics. In the most widely used form of this analysis, one estimates an excitatory linear kernel by computing the spike-triggered average (STA); that is, the mean stimulus that elicited a spike [e.g., 1, 2]. Under the assumption that spikes are generated by a Poisson process with instantaneous rate determined by linear projection onto a kernel followed by a static nonlinearity, the STA provides an unbiased estimate of this kernel [3]. Recently, a number of authors have developed interesting extensions of white noise analysis. Some have examined spike-triggered averages in a reduced linear subspace of input stimuli [e.g., 4]. Others have recovered excitatory subspaces, by computing the spiketriggered covariance (STC), followed by an eigenvector analysis to determine the subspace axes [e.g., 5, 6]. Sensory neurons exhibit striking nonlinear behaviors that are not explained by fundamentally linear mechanisms. For example, the response of a neuron typically saturates for large amplitude stimuli; the response to the optimal stimulus is often suppressed by the presence of a non-optimal mask [e.g., 7]; and the kernel recovered from STA analysis may change shape as a function of stimulus amplitude [e.g., 8, 9]. A variety of these nonlinear behaviors can be attributed to gain control [e.g., 8, 10, 11, 12, 13, 14], in which neural responses are suppressively modulated by a gain signal derived from the stimulus. Although the underlying mechanisms and time scales associated with such gain control are current topics of research, the basic functional properties appear to be ubiquitous, occurring throughout the nervous system. a b 0 k0 0 Figure 1: Geometric depiction of spike-triggered analyses. a, Spike-triggered averaging with two-dimensional stimuli. Black points indicate raw stimuli. White points indicate stimuli eliciting a spike, and the STA (black vector), which provides an estimate of , corresponds to their center of mass. b, Spike-triggered covariance analysis of suppressive axes. Shown are a set of stimuli lying on a plane perpendicular to the excitatory kernel, . Within the plane, stimuli eliciting a spike are concentrated in an elliptical region. The minor axis of the ellipse corresponds to a suppressive stimulus direction: stimuli with a significant component along this axis are less likely to elicit spikes. The stimulus component along the major axis of the ellipse has no influence on spiking. ¢ £  ¡ ¢  ¡ Here we develop a white noise methodology for characterizing a neuron with gain control. 
We show that a set of suppressive kernels may be recovered by finding the eigenvectors of the spike-triggered covariance matrix associated with the smallest variance. We apply the technique to electrophysiological data obtained from ganglion cells in salamander and macaque retina, and recover a set of axes that are shown to reduce responses in the neuron. Moreover, when we fit a gain control model to the data using a maximum likelihood procedure within this subspace, the model accounts for changes in the STA as a function of contrast.

1 Characterizing suppressive axes

As in all white noise approaches, we assume that stimuli correspond to vectors, s, in some finite-dimensional space (e.g., a neighborhood of pixels or an interval of time samples). We assume a gain control model in which the probability of a stimulus eliciting a spike grows monotonically with the halfwave-rectified projection onto an excitatory linear kernel, k, and is suppressively modulated by the fullwave-rectified projections onto a set of suppressive linear kernels, k_1, ..., k_N.

First, we recover the excitatory kernel, k. This is achieved by presenting spherically symmetric input stimuli (e.g., Gaussian white noise) to the neuron and computing the STA (Fig. 1a). The STA correctly recovers the excitatory kernel under the assumption that each of the gain control kernels is orthogonal (or equal) to the excitatory kernel. The proof is essentially the same as that given for recovering the kernel of a linear model followed by a monotonic nonlinearity [3]. In particular, any stimulus can be decomposed into a component in the direction of the excitatory kernel and a component in a perpendicular direction. This stimulus can be paired with another that is identical, except that its component in the perpendicular direction is negated. The two stimuli are equally likely to occur in a spherically Gaussian stimulus set (since they are equidistant from the origin), and they are equally likely to elicit a spike (since their excitatory components are equal, and their rectified perpendicular components are equal). Their vector average lies in the direction of the excitatory kernel. Thus, the STA (which is an average over all such stimuli, or all such stimulus pairs) must also lie in that direction. In a subsequent section we explain how to recover the excitatory kernel when it is not orthogonal to the suppressive kernels.

Figure 2: Estimation of kernels from a simulated model (equation 2). Left: model kernels. Middle: the retrieved STA (excitatory kernel) and the eigenvectors (suppressive kernels) associated with the lowest eigenvalues. Right: sorted eigenvalues (variances) of the covariance matrix of stimuli eliciting spikes (STC); five eigenvalues fall significantly below the others.

Next, we recover the suppressive subspace, assuming the excitatory kernel is known. Consider the stimuli lying on a plane perpendicular to this kernel. These stimuli all elicit the same response from the excitatory kernel, but they may produce different amounts of suppression. Figure 1b illustrates the behavior in a three-dimensional stimulus space, in which one axis is assumed to be suppressive. The distribution of raw stimuli on the plane is spherically symmetric about the origin.
But the distribution of stimuli eliciting a spike is narrower along the suppressive direction: stimuli with a significant component along the suppressive axis are less likely to elicit a spike. This behavior is easily generalized from this plane to the entire stimulus space. If we assume that the suppressive axes are fixed, then we expect to see reductions in variance in the same directions for any level of excitation in the numerator. Given this behavior of the spike-triggered stimulus ensemble, we can recover the suppressive subspace using principal component analysis. We construct the sample covariance matrix of the stimuli eliciting a spike:

$$C = \frac{1}{N_s - 1} \sum_{n} \big[\,s(t_n) - \mathrm{STA}\,\big]\big[\,s(t_n) - \mathrm{STA}\,\big]^{\mathsf T},$$

where the sum runs over the stimuli s(t_n) that elicited a spike and N_s is their number.
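As an illustration of the recovery procedure described above, here is a minimal NumPy sketch of the spike-triggered average and covariance analysis. It is not the authors' code; the function name, the choice to subtract the STA before forming the covariance, and the use of a fixed number of low-variance eigenvectors are assumptions made for the example.

```python
import numpy as np

def sta_stc_suppressive_axes(stimuli, spike_counts, n_suppressive=5):
    """Estimate an excitatory kernel (STA) and candidate suppressive axes
    (low-variance eigenvectors of the spike-triggered covariance).

    stimuli      : (T, D) array, one stimulus vector per time bin
    spike_counts : (T,) integer array of spikes per bin
    """
    # Collect the spike-triggered ensemble (repeat a stimulus once per spike).
    spiking = np.repeat(stimuli, spike_counts.astype(int), axis=0)

    sta = spiking.mean(axis=0)                   # excitatory kernel estimate
    resid = spiking - sta
    stc = resid.T @ resid / (len(spiking) - 1)   # sample covariance of spike-triggered stimuli

    eigvals, eigvecs = np.linalg.eigh(stc)       # eigenvalues in ascending order
    # Directions with the smallest variance are candidate suppressive axes.
    return sta, eigvals, eigvecs[:, :n_suppressive]
```

In practice one might compare the low-lying eigenvalues against a null distribution (for example, obtained by randomly shifting spike times) before declaring an axis significantly suppressive.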

2 0.85025632 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

Author: Michael C. Mozer, Robert Dodier, Michael D. Colagrosso, Cesar Guerra-Salcedo, Richard Wolniewicz

Abstract: When designing a two-alternative classifier, one ordinarily aims to maximize the classifier's ability to discriminate between members of the two classes. We describe a situation in a real-world business application of machine-learning prediction in which an additional constraint is placed on the nature of the solution: that the classifier achieve a specified correct acceptance or correct rejection rate (i.e., that it achieve a fixed accuracy on members of one class or the other). Our domain is predicting churn in the telecommunications industry. Churn refers to customers who switch from one service provider to another. We propose four algorithms for training a classifier subject to this domain constraint, and present results showing that each algorithm yields a reliable improvement in performance. Although the improvement is modest in magnitude, it is nonetheless impressive given the difficulty of the problem and the financial return that it provides to the service provider.

When designing a classifier, one must specify an objective measure by which the classifier's performance is to be evaluated. One simple objective measure is to minimize the number of misclassifications. If the cost of a classification error depends on the target and/or response class, one might utilize a risk-minimization framework to reduce the expected loss. A more general approach is to maximize the classifier's ability to discriminate one class from another class (e.g., Chang & Lippmann, 1994). An ROC curve (Green & Swets, 1966) can be used to visualize the discriminative performance of a two-alternative classifier that outputs class posteriors. To explain the ROC curve, a classifier can be thought of as making a positive/negative judgement as to whether an input is a member of some class. Two different accuracy measures can be obtained from the classifier: the accuracy of correctly identifying an input as a member of the class (a correct acceptance or CA), and the accuracy of correctly identifying an input as a nonmember of the class (a correct rejection or CR). To evaluate the CA and CR rates, it is necessary to pick a threshold above which the classifier's probability estimate is interpreted as an "accept," and below which it is interpreted as a "reject"—call this the criterion. The ROC curve plots CA against CR rates for various criteria (Figure 1a). Note that as the threshold is lowered, the CA rate increases and the CR rate decreases. For a criterion of 1, the CA rate approaches 0 and the CR rate 1; for a criterion of 0, the CA rate approaches 1 and the CR rate 0. Thus, the ROC curve is anchored at (0,1) and (1,0), and is monotonically nonincreasing.

FIGURE 1. (a) Two ROC curves reflecting discrimination performance; the dashed curve indicates better performance. (b) Two plausible ROC curves, neither of which is clearly superior to the other.

The degree to which the curve is bowed reflects the discriminative ability of the classifier. The dashed curve in Figure 1a is therefore a better classifier than the solid curve. The degree to which the curve is bowed can be quantified by various measures such as the area under the ROC curve or d’, the distance between the positive and negative distributions.
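To make the CA/CR trade-off concrete, here is a small sketch of how one might trace out such a curve from a classifier's posterior scores by sweeping the criterion. The scores and labels below are synthetic placeholders, not the paper's data.

```python
import numpy as np

def ca_cr_curve(posteriors, labels, n_thresholds=101):
    """Compute (CA, CR) pairs as the decision criterion is swept.

    posteriors : (N,) class-membership probabilities from the classifier
    labels     : (N,) 1 for class members (accept targets), 0 for nonmembers
    """
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    pos, neg = labels == 1, labels == 0
    ca = np.array([(posteriors[pos] >= t).mean() for t in thresholds])  # correct acceptance rate
    cr = np.array([(posteriors[neg] < t).mean() for t in thresholds])   # correct rejection rate
    return thresholds, ca, cr

# Example with synthetic scores: members tend to score higher than nonmembers.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
posteriors = np.clip(0.5 * labels + rng.normal(0.25, 0.2, size=1000), 0, 1)
t, ca, cr = ca_cr_curve(posteriors, labels)
```

Lowering the criterion moves along the curve toward higher CA and lower CR, exactly as described above.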
However, training a classifier to maximize either the ROC area or d’ often yields the same result as training a classifier to estimate posterior class probabilities, or equivalently, to minimize the mean squared error (e.g., Frederick & Floyd, 1998). The ROC area and d’ scores are useful, however, because they reflect a classifier’s intrinsic ability to discriminate between two classes, regardless of how the decision criterion is set. That is, each point on an ROC curve indicates one possible CA/CR trade off the classifier can achieve, and that trade off is determined by the criterion. But changing the criterion does not change the classifier’s intrinsic ability to discriminate. Generally, one seeks to optimize the discrimination performance of a classifier. However, we are working in a domain where overall discrimination performance is not as critical as performance at a particular point on the ROC curve, and we are not interested in the remainder of the ROC curve. To gain an intuition as to why this goal should be feasible, consider Figure 1b. Both the solid and dashed curves are valid ROC curves, because they satisfy the monotonicity constraint: as the criterion is lowered, the CA rate does not decrease and the CR rate does not increase. Although the bow shape of the solid curve is typical, it is not mandatory; the precise shape of the curve depends on the nature of the classifier and the nature of the domain. Thus, it is conceivable that a classifier could produce a curve like the dashed one. The dashed curve indicates better performance when the CA rate is around 50%, but worse performance when the CA rate is much lower or higher than 50%. Consequently, if our goal is to maximize the CR rate subject to the constraint that the CA rate is around 50%, or to maximize the CA rate subject to the constraint that the CR rate is around 90%, the dashed curve is superior to the solid curve. One can imagine that better performance can be obtained along some stretches of the curve by sacrificing performance along other stretches of the curve. Note that obtaining a result such as the dashed curve requires a nonstandard training algorithm, as the discrimination performance as measured by the ROC area is worse for the dashed curve than for the solid curve. In this paper, we propose and evaluate four algorithms for optimizing performance in a certain region of the ROC curve. To begin, we explain the domain we are concerned with and why focusing on a certain region of the ROC curve is important in this domain. 1 OUR DOMAIN Athene Software focuses on predicting and managing subscriber churn in the telecommunications industry (Mozer, Wolniewicz, Grimes, Johnson, & Kaushansky, 2000). “Churn” refers to the loss of subscribers who switch from one company to the other. Churn is a significant problem for wireless, long distance, and internet service providers. For example, in the wireless industry, domestic monthly churn rates are 2–3% of the customer base. Consequently, service providers are highly motivated to identify subscribers who are dissatisfied with their service and offer them incentives to prevent churn. We use techniques from statistical machine learning—primarily neural networks and ensemble methods—to estimate the probability that an individual subscriber will churn in the near future. 
The prediction of churn is based on various sources of information about a subscriber, including: call detail records (date, time, duration, and location of each call, and whether call was dropped due to lack of coverage or available bandwidth), financial information appearing on a subscriber’s bill (monthly base fee, additional charges for roaming and usage beyond monthly prepaid limit), complaints to the customer service department and their resolution, information from the initial application for service (contract details, rate plan, handset type, credit report), market information (e.g., rate plans offered by the service provider and its competitors), and demographic data. Churn prediction is an extremely difficult problem for several reasons. First, the business environment is highly nonstationary; models trained on data from a certain time period perform far better with hold-out examples from that same time period than examples drawn from successive time periods. Second, features available for prediction are only weakly related to churn; when computing mutual information between individual features and churn, the greatest value we typically encounter is .01 bits. Third, information critical to predicting subscriber behavior, such as quality of service, is often unavailable. Obtaining accurate churn predictions is only part of the challenge of subscriber retention. Subscribers who are likely to churn must be contacted by a call center and offered some incentive to remain with the service provider. In a mathematically principled business scenario, one would frame the challenge as maximizing profitability to a service provider, and making the decision about whether to contact a subscriber and what incentive to offer would be based on the expected utility of offering versus not offering an incentive. However, business practices complicate the scenario and place some unique constraints on predictive models. First, call centers are operated by a staff of customer service representatives who can contact subscribers at a fixed rate; consequently, our models cannot advise contacting 50,000 subscribers one week, and 50 the next. Second, internal business strategies at the service providers constrain the minimum acceptable CA or CR rates (above and beyond the goal of maximizing profitability). Third, contracts that Athene makes with service providers will occasionally call for achieving a specific target CA and CR rate. These three practical issues pose formal problems which, to the best of our knowledge, have not been addressed by the machine learning community. The formal problems can be stated in various ways, including: (1) maximize the CA rate, subject to the constraint that a fixed percentage of the subscriber base is identified as potential churners, (2) optimize the CR rate, subject to the constraint that the CA rate should be αCA, (3) optimize the CA rate, subject to the constraint that the CR rate should be αCR, and finally—what marketing executives really want—(4) design a classifier that has a CA rate of αCA and a CR rate of αCR. Problem (1) sounds somewhat different than problems (2) or (3), but it can be expressed in terms of a lift curve, which plots the CA rate as a function of the total fraction of subscribers identified by the model. Problem (1) thus imposes the constraint that the solution lies at one coordinate of the lift curve, just as problems (2) and (3) place the constraint that the solution lies at one coordinate of the ROC curve. 
Thus, a solution to problems (2) or (3) will also serve as a solution to (1). Although addressing problem (4) seems most fanciful, it encompasses problems (2) and (3), and thus we focus on it. Our goal is not altogether unreasonable, because a solution to problem (4) has the property we characterized in Figure 1b: the ROC curve can suffer everywhere except in the region near CA = α_CA and CR = α_CR. Hence, the approaches we consider will trade off performance in some regions of the ROC curve against performance in other regions. We call this prodding the ROC curve.

2 FOUR ALGORITHMS TO PROD THE ROC CURVE

In this section, we describe four algorithms for prodding the ROC curve toward a target CA rate of α_CA and a target CR rate of α_CR.

2.1 EMPHASIZING CRITICAL TRAINING EXAMPLES

Suppose we train a classifier on a set of positive and negative examples from a class—churners and nonchurners in our domain. Following training, the classifier will assign a posterior probability of class membership to each example. The examples can be sorted by the posterior and arranged on a continuum anchored by probabilities 0 and 1 (Figure 2). We can identify the thresholds, θ_CA and θ_CR, which yield CA and CR rates of α_CA and α_CR, respectively. If the classifier's discrimination performance fails to achieve the target CA and CR rates, then θ_CA will be lower than θ_CR, as depicted in the figure. If we can bring these two thresholds together, we will achieve the target CA and CR rates. Thus, the first algorithm we propose involves training a series of classifiers, attempting to make classifier n+1 achieve better CA and CR rates by focusing its effort on examples from classifier n that lie between θ_CA and θ_CR; the positive examples must be pushed above θ_CR and the negative examples must be pushed below θ_CA. (Of course, the thresholds are specific to a classifier, and hence should be indexed by n.) We call this the emphasis algorithm, because it involves placing greater weight on the examples that lie between the two thresholds. In the figure, the emphasis for classifier n+1 would be on examples e5 through e8. This retraining procedure can be iterated until the classifier's training set performance reaches asymptote. In our implementation, we define a weighting of each example i for training classifier n, λ_i^n. For classifier 1, λ_i^1 = 1. For subsequent classifiers, λ_i^{n+1} = λ_i^n if example i is not in the region of emphasis, or λ_i^{n+1} = κ_e λ_i^n otherwise, where κ_e is a constant, κ_e > 1.

FIGURE 2. A schematic depiction of all training examples arranged by the classifier's posterior. Each solid bar corresponds to a positive example (e.g., a churner) and each grey bar corresponds to a negative example (e.g., a nonchurner). (Examples e1 through e13 are arranged along the churn-probability axis, with the thresholds θ_CA and θ_CR marked.)

2.2 DEEMPHASIZING IRRELEVANT TRAINING EXAMPLES

The second algorithm we propose is related to the first, but takes a slightly different perspective on the continuum depicted in Figure 2. Positive examples below θ_CA—such as e2—are clearly the most difficult positive examples to classify correctly. Not only are they the most difficult positive examples, but they do not in fact need to be classified correctly to achieve the target CA and CR rates. Threshold θ_CR does not depend on examples such as e2, and threshold θ_CA allows a fraction (1 − α_CA) of the positive examples to be classified incorrectly. Likewise, one can argue that negative examples above θ_CR—such as e10 and e11—need not be of concern. Essentially, the second algorithm, which we term the deemphasis algorithm, is like the emphasis algorithm in that a series of classifiers are trained, but when training classifier n+1, less weight is placed on the examples whose correct classification is unnecessary to achieve the target CA and CR rates for classifier n.
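A minimal sketch of the emphasis reweighting step is given below, assuming a classifier that accepts per-example weights. The helper names and the way the thresholds are located are illustrative; the deemphasis algorithm applies the analogous update with κ_d < 1 to the positives below θ_CA and the negatives above θ_CR instead.

```python
import numpy as np

def find_thresholds(posteriors, labels, alpha_ca=0.5, alpha_cr=0.9):
    """Locate theta_CA and theta_CR on the posterior continuum (illustrative)."""
    pos = posteriors[labels == 1]
    neg = posteriors[labels == 0]
    theta_ca = np.quantile(pos, 1.0 - alpha_ca)   # accept the top alpha_ca fraction of positives
    theta_cr = np.quantile(neg, alpha_cr)         # reject the bottom alpha_cr fraction of negatives
    return theta_ca, theta_cr

def emphasis_update(weights, posteriors, theta_ca, theta_cr, kappa_e=1.3):
    """One emphasis step: up-weight examples lying between the two thresholds."""
    between = (posteriors >= theta_ca) & (posteriors <= theta_cr)
    return np.where(between, kappa_e * weights, weights)
```

Classifier n+1 would then be retrained with these cumulative example weights (the paper's classifiers minimize weighted squared error) and the procedure iterated.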
As with the emphasis algorithm, the retraining procedure can be iterated until no further performance improvements are obtained on the training set. Note that the set of examples given emphasis by the previous algorithm is not the complement of the set of examples deemphasized by the current algorithm; the algorithms are not identical. In our implementation, we assign a weight to each example i for training classifier n, λ_i^n. For classifier 1, λ_i^1 = 1. For subsequent classifiers, λ_i^{n+1} = λ_i^n if example i is not in the region of deemphasis, or λ_i^{n+1} = κ_d λ_i^n otherwise, where κ_d is a constant, κ_d < 1.

2.3 CONSTRAINED OPTIMIZATION

The third algorithm we propose is formulated as maximizing the CR rate while maintaining the CA rate equal to α_CA. (We do not attempt to simultaneously maximize the CA rate while maintaining the CR rate equal to α_CR.) Gradient methods cannot be applied directly because the CA and CR rates are nondifferentiable, but we can approximate the CA and CR rates with smooth differentiable functions:

$$CA(w,t) = \frac{1}{|P|}\sum_{i \in P} \sigma_\beta\big(f(x_i,w) - t\big), \qquad CR(w,t) = \frac{1}{|N|}\sum_{i \in N} \sigma_\beta\big(t - f(x_i,w)\big),$$

where P and N are the sets of positive and negative examples, respectively, f(x,w) is the model posterior for input x, w is the parameterization of the model, t is a threshold, and σ_β is a sigmoid function with scaling parameter β: σ_β(y) = (1 + exp(−βy))^{-1}. The larger β is, the more nearly step-like the sigmoid is and the more nearly equal the approximations are to the model CR and CA rates. We consider the problem formulation in which CA is a constraint and CR is a figure of merit. We convert the constrained optimization problem into an unconstrained problem by the augmented Lagrangian method (Bertsekas, 1982), which involves iteratively maximizing the objective function

$$A(w,t) = CR(w,t) + \nu\big[CA(w,t) - \alpha_{CA}\big] + \frac{\mu}{2}\big[CA(w,t) - \alpha_{CA}\big]^2$$

with a fixed Lagrange multiplier, ν, and then updating ν following the optimization step: ν ← ν + µ[CA(w*, t*) − α_CA], where w* and t* are the values found by the optimization step. We initialize ν = 1, fix µ = 1 and β = 10, and iterate until ν converges. (A code sketch of this procedure appears below.)

2.4 GENETIC ALGORITHM

The fourth algorithm we explore is a steady-state genetic search over a space defined by the continuous parameters of a classifier (Whitley, 1989). The fitness of a classifier is the reciprocal of the number of training examples falling between the θ_CA and θ_CR thresholds. Much like the emphasis algorithm, this fitness function encourages the two thresholds to come together. The genetic search permits direct optimization over a nondifferentiable criterion, and therefore seems sensible for the present task.

3 METHODOLOGY

For our tests, we studied two large databases made available to Athene by two telecommunications providers. Data set 1 had 50,000 subscribers described by 35 input features and a churn rate of 4.86%. Data set 2 had 169,727 subscribers described by 51 input features and a churn rate of 6.42%. For each database, the features input to the classifier were obtained by proprietary transformations of the raw data (see Mozer et al., 2000).
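Returning to the constrained-optimization algorithm of Section 2.3, a rough sketch of the smooth CA/CR surrogates and the augmented-Lagrangian update is given below, assuming a simple logistic model for f(x, w). The optimizer, parameter names, and stopping rule are illustrative rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(y, beta=10.0):
    return 1.0 / (1.0 + np.exp(-beta * y))

def smooth_ca_cr(params, X, labels, beta=10.0):
    """Differentiable surrogates for the CA and CR rates (Section 2.3)."""
    w, t = params[:-1], params[-1]
    f = 1.0 / (1.0 + np.exp(-X @ w))              # logistic model posterior f(x, w)
    ca = sigmoid(f[labels == 1] - t, beta).mean()
    cr = sigmoid(t - f[labels == 0], beta).mean()
    return ca, cr

def augmented_lagrangian(X, labels, alpha_ca=0.5, mu=1.0, n_outer=20):
    """Maximize CR subject to CA = alpha_ca via the augmented Lagrangian."""
    params = np.zeros(X.shape[1] + 1)             # model weights plus threshold t
    nu = 1.0
    for _ in range(n_outer):
        def neg_objective(p):
            ca, cr = smooth_ca_cr(p, X, labels)
            return -(cr + nu * (ca - alpha_ca) + 0.5 * mu * (ca - alpha_ca) ** 2)
        params = minimize(neg_objective, params, method="BFGS").x
        ca, _ = smooth_ca_cr(params, X, labels)
        nu += mu * (ca - alpha_ca)                # multiplier update after each inner solve
    return params, nu
```

The default values alpha_ca = 0.5, mu = 1, nu = 1, and beta = 10 follow the settings stated in the text.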
We chose these two large, real-world data sets because achieving gains with these data sets should be more difficult than with smaller, less noisy data sets. Plus, with our real-world data, we can evaluate the cost savings achieved by an improvement in prediction accuracy. We performed 10-fold cross-validation on each data set, preserving the overall churn/nonchurn ratio in each split. In all tests, we chose α_CR = 0.90 and α_CA = 0.50, values which, based on our past experience in this domain, are ambitious yet realizable targets for data sets such as these. We used a logistic regression model (i.e., a neural network with no hidden units) for our studies, believing that it would be more difficult to obtain improvements with such a model than with a more flexible multilayer perceptron. For the emphasis and deemphasis algorithms, models were trained to minimize mean-squared error on the training set. We chose κ_e = 1.3 and κ_d = .75 by quick exploration. Because the weightings are cumulative over training restarts, the choice of κ is not critical for either algorithm; rather, the magnitude of κ controls how many restarts are necessary to reach asymptotic performance, but the results we obtained were robust to the choice of κ. The emphasis and deemphasis algorithms were run for 100 iterations, which was the number of iterations required to reach asymptotic performance on the training set.

4 RESULTS

Figure 3 illustrates training set performance for the emphasis algorithm on data set 1. The graph on the left shows the CA rate when the CR rate is .9, and the graph on the right shows the CR rate when the CA rate is .5. Clearly, the algorithm appears to be stable, and the ROC curve is improving in the region around (α_CA, α_CR). Figure 4 shows cross-validation performance on the two data sets for the four prodding algorithms as well as for a traditional least-squares training procedure. The emphasis and deemphasis algorithms yield reliable improvements in performance in the critical region of the ROC curve over the traditional training procedure. The constrained-optimization and genetic algorithms perform well on achieving a high CR rate for a fixed CA rate, but neither does as well on achieving a high CA rate for a fixed CR rate. For the constrained-optimization algorithm, this result is not surprising as it was trained asymmetrically, with the CA rate as the constraint. However, for the genetic algorithm, we have little explanation for its poor performance, other than the difficulty faced in searching a continuous space without gradient information.

5 DISCUSSION

In this paper, we have identified an interesting, novel problem in classifier design which is motivated by our domain of churn prediction and real-world business considerations. Rather than seeking a classifier that maximizes discriminability between two classes, as measured by area under the ROC curve, we are concerned with optimizing performance at certain points along the ROC curve. We presented four alternative approaches to prodding the ROC curve, and found that all four have promise, depending on the specific goal. Although the magnitude of the gain is small—an increase of about .01 in the CR rate given a target CA rate of .50—the improvement results in significant dollar savings.
FIGURE 3. Training set performance for the emphasis algorithm on data set 1. (a) CA rate as a function of iteration for a CR rate of .9; (b) CR rate as a function of iteration for a CA rate of .5. Error bars indicate +/−1 standard error of the mean.

FIGURE 4. Cross-validation performance on the two data sets for the standard training procedure (STD), as well as the emphasis (EMPH), deemphasis (DEEMPH), constrained optimization (CONSTR), and genetic (GEN) algorithms. The left column shows the CA rate for a CR rate of .9; the right column shows the CR rate for a CA rate of .5. The error bar indicates one standard error of the mean over the 10 data splits.

Using a framework for evaluating dollar savings to a service provider, based on estimates of subscriber retention and costs of intervention obtained in real-world data collection (Mozer et al., 2000), we obtain a savings of $11 per churnable subscriber when the (CA, CR) rates go from (.50, .80) to (.50, .81), which amounts to an 8% increase in profitability of the subscriber intervention effort. These figures are clearly promising. However, based on the data sets we have studied, it is difficult to know whether another algorithm might exist that achieves even greater gains. Interestingly, all algorithms we proposed yielded roughly the same gains when successful, suggesting that we may have milked the data for whatever gain could be had, given the model class evaluated. Our work clearly illustrates the difficulty of the problem, and we hope that others in the NIPS community will be motivated by the problem to suggest even more powerful, theoretically grounded approaches.

6 ACKNOWLEDGEMENTS

No white males were angered in the course of conducting this research. We thank Lian Yan and David Grimes for comments and assistance on this research. This research was supported in part by McDonnell-Pew grant 97-18, NSF award IBN-9873492, and NIH/IFOPAL R01 MH61549–01A1.

7 REFERENCES

Bertsekas, D. P. (1982). Constrained optimization and Lagrange multiplier methods. NY: Academic.
Chang, E. I., & Lippmann, R. P. (1994). Figure of merit training for detection and spotting. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (pp. 1019–1026). San Mateo, CA: Morgan Kaufmann.
Frederick, E. D., & Floyd, C. E. (1998). Analysis of mammographic findings and patient history data with genetic algorithms for the prediction of breast cancer biopsy outcome. Proceedings of the SPIE, 3338, 241–245.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Mozer, M. C., Wolniewicz, R., Grimes, D., Johnson, E., & Kaushansky, H. (2000). Maximizing revenue by predicting and addressing customer dissatisfaction. IEEE Transactions on Neural Networks, 11, 690–696.
Whitley, D. (1989). The GENITOR algorithm and selective pressure: Why rank-based allocation of reproductive trials is best. In D.
Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (pp. 116–121). San Mateo, CA: Morgan Kaufmann.

3 0.74271345 122 nips-2001-Model Based Population Tracking and Automatic Detection of Distribution Changes

Author: Igor V. Cadez, P. S. Bradley

Abstract: Probabilistic mixture models are used for a broad range of data analysis tasks such as clustering, classification, predictive modeling, etc. Due to their inherent probabilistic nature, mixture models can easily be combined with other probabilistic or non-probabilistic techniques, thus forming more complex data analysis systems. In the case of online data (where there is a stream of data available), models can be constantly updated to reflect the most current distribution of the incoming data. However, in many business applications the models themselves represent a parsimonious summary of the data and therefore it is not desirable to change models frequently, much less with every new data point. In such a framework it becomes crucial to track the applicability of the mixture model and detect the point in time when the model fails to adequately represent the data. In this paper we formulate the problem of change detection and propose a principled solution. Empirical results over both synthetic and real-life data sets are presented.

(∗ Work was done while the author was at digiMine, Inc., Bellevue, WA.)

1 Introduction and Notation

Consider a data set D = {x_1, x_2, ..., x_n} consisting of n independent, identically distributed (iid) data points. In the context of this paper the data points could be vectors, sequences, etc. Further, consider a probabilistic mixture model that maps each data set to a real number, the probability of observing the data set:

$$P(D\,|\,\Theta) = \prod_{i=1}^{n} P(x_i\,|\,\Theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k P(x_i\,|\,\theta_k), \qquad (1)$$

where the model is parameterized by Θ = {π_1, ..., π_K, θ_1, ..., θ_K}. Each P(·|θ_k) represents a mixture component, while π_k represents its mixture weight. It is often more convenient to operate with the log of the probability and define the log-likelihood function as:

$$l(\Theta\,|\,D) = \log P(D\,|\,\Theta) = \sum_{i=1}^{n} \log P(x_i\,|\,\Theta) = \sum_{i=1}^{n} \mathrm{LogP}_i,$$

which is additive over data points rather than multiplicative. The LogP_i terms we introduce in the notation represent each data point's contribution to the overall log-likelihood and therefore describe how well a data point fits under the model. For example, Figure 3 shows a distribution of LogP scores using a mixture of conditionally independent (CI) models. Maximizing the probability of the data with respect to the parameters Θ can be accomplished by the Expectation-Maximization (EM) algorithm [6] in time linear in both data complexity (e.g., number of dimensions) and data set size (e.g., number of data points). (This approach is called maximum-likelihood estimation; if we included parameter priors, we could equally well apply the results in this paper to maximum a posteriori estimation.) Although EM guarantees only local optimality, it is a preferred method for finding good solutions in linear time. We consider an arbitrary but fixed parametric form of the model, therefore we sometimes refer to a specific set of parameters Θ as the model. Note that since the logarithm is a monotonic function, the optimal set of parameters is the same whether we use likelihood or log-likelihood.

Consider an online data source where there are data sets D_t available at certain time intervals t (not necessarily equal time periods or numbers of data points). For example, there could be a data set generated on a daily basis, or it could represent a constant stream of data from a monitoring device. In addition, we assume that we have an initial model Θ_0 that was built (optimized, fitted) on some in-sample data D_0 = {D_1, D_2, ..., D_{t_0}}. We would like to be able to detect a change in the underlying distribution of data points within data sets D_t that would be sufficient to require the building of a new model Θ_1.
The criterion for building a new model is loosely defined as "the model does not adequately fit the data anymore".

2 Model Based Population Similarity

In this section we formulate the problem of model-based population similarity and tracking. In the case of mixture models we start with the following observations:

• The mixture model defines the probability density function (PDF) that is used to score each data point (LogP scores), leading to the score for the overall population (log-likelihood or sum of LogP scores).
• The optimal mixture model puts more PDF mass over dense regions in the data space. Different components allow the mixture model to distribute its PDF over disconnected dense regions in the data space. More PDF mass in a portion of the data space implies higher LogP scores for the data points lying in that region of the space.
• If the model is to generalize well (e.g., there is no significant overfitting), it cannot put significant PDF mass over regions of data space that are populated by data points solely due to the details of a specific data sample used to build the model.
• Dense regions in the data space discovered by a non-overfitting model are an intrinsic property of the true data-generating distribution even if the functional form of the model is not well matched with the true data-generating distribution. In the latter case, the model might not be able to discover all dense regions or might not model the correct shape of the regions, but the regions that are discovered (if any) are intrinsic to the data.
• If there is confidence that the model is not overfitting and that it generalizes well (e.g., cross-validation was used to determine the optimal number of mixture components), the new data from the same distribution as the in-sample data should be dense in the same regions that are predicted by the model.

Given these observations, we seek to define a measure of data-distribution similarity based on how well the dense regions of the data space are preserved when new data is introduced. In model-based clustering, dense regions are equivalent to higher LogP scores, hence we cast the problem of determining data distribution similarity into one of determining LogP distribution similarity (relative to the model). For example, Figure 3 (left) shows a histogram of one such distribution. It is important to note several properties of Figure 3: 1) there are several distinct peaks from which the distribution tails off toward smaller LogP values, therefore simple summary scores fail to efficiently summarize the LogP distribution. For example, the log-likelihood is proportional to the mean of the LogP distribution in Figure 3, and the mean is not a very useful statistic when describing such a multimodal distribution (also confirmed experimentally); 2) the histogram itself is not a truly non-parametric representation of the underlying distribution, given that the results are dependent on bin width. In passing we also note that the shape of the histogram in Figure 3 is a consequence of the CI model we use: different peaks come from different discrete attributes, while the tails come from continuous Gaussians. It is a simple exercise to show that LogP scores for a 1-dimensional data set generated by a single Gaussian have an exponential distribution with a sharp cutoff on the right and a tail toward the left.
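As an illustration of the per-point LogP scores discussed above, here is a small sketch using scikit-learn's Gaussian mixture as a stand-in for the conditionally independent mixture used in the paper; the data, dimensionality, and number of components are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
in_sample = rng.normal(size=(5000, 4))    # placeholder for the in-sample data D0
new_batch = rng.normal(size=(1000, 4))    # placeholder for a later data set Dt

# Fit the mixture on the in-sample data (Theta_0) and score individual points.
model = GaussianMixture(n_components=3, random_state=0).fit(in_sample)
logp_in = model.score_samples(in_sample)  # LogP_i for each in-sample point
logp_new = model.score_samples(new_batch) # LogP_i for each new point

# The sum of the LogP scores is the log-likelihood of a data set under the model.
log_likelihood_new = logp_new.sum()
```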
To define the similarity of the data distributions based on LogP scores in a purely non-parametric way, we have at our disposal the powerful formalism of Kolmogorov-Smirnov (KS) statistics [7]. KS statistics make use of empirical cumulative distribution functions (CDF) to estimate the distance between two empirical 1-dimensional distributions, in our case distributions of LogP scores. In principle, we could compare the LogP distribution of the new data set D_t to that of the training set D_0 and obtain the probability that the two came from the same distribution. In practice, however, this approach is not feasible since we do not assume that the estimated model and the true data generating process share the same functional form (see Section 3). Consequently, we need to consider the specific KS score in relation to the natural variability of the true data generating distribution. In the situation with streaming data, the model is estimated over the in-sample data D_0. Then the individual in-sample data sets D_1, D_2, ..., D_{t_0} are used to estimate the natural variability of the KS statistics. This variability needs to be quantified due to the fact that the model may not truly match the data distribution. When the natural variance of the KS statistics over the in-sample data has been determined, the LogP scores for a new data set D_t, t > t_0, are computed. Using principled heuristics, one can then determine whether or not the LogP signature for D_t is significantly different than the LogP signatures for the in-sample data. To clarify the various steps, we provide an algorithmic description of the change detection process.

Algorithm 1 (Quantifying Natural Variance of KS Statistics): Given an "in-sample" data set D_0 = {D_1, D_2, ..., D_{t_0}}, proceed as follows:
1. Estimate the parameters Θ_0 of the mixture model P(D|Θ) over D_0 (see equation (1)).
2. Compute LogP(D_i) = Σ_{x ∈ D_i} log P(x | Θ_0), with n_i = |D_i|, for i = 1, ..., t_0. (2)
3. For 1 ≤ i, j ≤ t_0, compute L_KS(i, j) = log[P_KS(D_i, D_j)]. See [7] for details on the P_KS computation.
4. For 1 ≤ i ≤ t_0, compute the KS measure M_KS(i) as M_KS(i) = (1/t_0) Σ_{j=1}^{t_0} L_KS(i, j). (3)
5. Compute µ_M = Mean[M_KS(i)] and σ_M = STD[M_KS(i)] to quantify the natural variability of M_KS over the "in-sample" data.

Algorithm 2 (Evaluating New Data): Given a new data set D_t, t > t_0, and µ_M, σ_M, proceed as follows:
1. Compute LogP(D_t) as in (2).
2. For 1 ≤ i ≤ t_0, compute L_KS(i, t).
3. Compute M_KS(t) as in (3).
4. Apply decision criteria using M_KS(t), µ_M, and σ_M to determine whether or not Θ_0 is a good fit for the new data. For example, if |M_KS(t) − µ_M| / σ_M > 3, (4) then Θ_0 is not a good fit any more.

Note, however, that the 3-σ interval can be interpreted as a confidence interval only in the limit when the number of data sets goes to infinity. In the applications presented in this paper we certainly do not have that condition satisfied, and we consider this approach an "educated heuristic" (gaining firm statistical grounds in the limit).
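A rough sketch of Algorithms 1 and 2 is given below, using SciPy's two-sample KS test for P_KS. The model is assumed to be any object exposing a per-point score_samples method (such as the Gaussian-mixture stand-in above); the function names and the 3-sigma rule follow the description in the text, but the implementation details are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_measure(logp_target, logp_reference_sets):
    """M_KS: average log KS probability of one LogP sample against reference samples."""
    lks = [np.log(ks_2samp(logp_target, ref).pvalue) for ref in logp_reference_sets]
    return np.mean(lks)

def calibrate(model, in_sample_sets):
    """Algorithm 1: natural variability of the KS measure over the in-sample data sets."""
    logp_sets = [model.score_samples(d) for d in in_sample_sets]
    m = np.array([ks_measure(lp, logp_sets) for lp in logp_sets])
    return logp_sets, m.mean(), m.std()

def is_change(model, new_set, logp_sets, mu_m, sigma_m, n_sigma=3.0):
    """Algorithm 2: flag Dt when its KS measure deviates by more than n_sigma."""
    m_t = ks_measure(model.score_samples(new_set), logp_sets)
    return abs(m_t - mu_m) > n_sigma * sigma_m
```

Here mu_m and sigma_m play the roles of µ_M and σ_M in equation (4).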
2.1 Space and Time Complexity of the Methodology

The proposed methodology was motivated by a business application with large data sets, hence it must have time complexity that is close to linear in order to scale well. In order to assess the time complexity, we use the following notation: n_t = |D_t| is the number of data points in the data set D_t; t_0 is the index of the last in-sample data set, but is also the number of in-sample data sets; n_0 = |D_0| = Σ_{t=1}^{t_0} n_t is the total number of in-sample data points (in all the in-sample data sets); n = n_0 / t_0 is the average number of data points in the in-sample data sets. For simplicity of argument, we assume that all the data sets are approximately of the same size, that is n_t ≈ n. The analysis presented here does not take into account the time and space complexity needed to estimate the parameters Θ of the mixture model (1).

In the first phase of the methodology, we must score each of the in-sample data points under the model (to obtain the LogP distributions), which has time complexity of O(n_0). Calculation of KS statistics for two data sets is done in one pass over the LogP distributions, but it requires that the LogP scores be sorted, hence it has time complexity of 2n + 2O(n log n) = O(n log n). Since we must calculate all the pairwise KS measures, this step has time complexity of t_0(t_0 − 1)/2 · O(n log n) = O(t_0^2 n log n). The in-sample mean and variance of the KS measure are obtained in time which is linear in t_0, hence the asymptotic time complexity does not change. In order to evaluate out-of-sample data sets we must keep the LogP distributions for each of the in-sample data sets as well as several scalars (e.g., mean and variance of the in-sample KS measure), which requires O(n_0) memory. To score an out-of-sample data set D_t, t > t_0, we must first obtain the LogP distribution of D_t, which has time complexity of O(n), and then calculate the KS measure relative to each of the in-sample data sets, which has time complexity O(n log n) per in-sample data set, or t_0 · O(n log n) = O(t_0 n log n) for the full in-sample period. The LogP distribution for D_t can be discarded once the pairwise KS measures are obtained.

Figure 1: Histograms of LogP scores for two data sets generated from the first model (top row) and two data sets generated from the second model (bottom row). Each data set contains 50,000 data points. All histograms are obtained from the model fitted on the in-sample period.

Overall, the proposed methodology requires O(n_0) memory, O(t_0^2 n log n) time for preprocessing, and O(t_0 n log n) time for out-of-sample evaluation. Further, since t_0 is typically a small constant (e.g., t_0 = 7 or t_0 = 30), the out-of-sample evaluation practically has time complexity of O(n log n).

3 Experimental Setup

The experiments presented consist of two parts: experiments on synthetic data and experiments on aggregations over real web-log data.

3.1 Experiments on Synthetic Data

Synthetic data is a valuable tool when determining both the applicability and the limitations of the proposed approach. Synthetic data was generated by sampling from a two-component CI model (the true model is not used in the evaluations). The data consist of a two-state discrete dimension and a continuous dimension.
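A minimal sketch of such a two-component conditionally independent sampler is shown below, using the parameter values listed in the next paragraph; the function name, random seed, and per-set size are illustrative.

```python
import numpy as np

def sample_ci_mixture(n_points, rng, pi=(0.6, 0.4),
                      discrete_p=((0.8, 0.2), (0.4, 0.6)),
                      means=(10.0, 0.0), variances=(5.0, 7.0)):
    """Draw one data set from a two-component CI model: one binary dimension and
    one Gaussian dimension, conditionally independent given the component."""
    components = rng.choice(2, size=n_points, p=pi)
    discrete = np.array([rng.choice(2, p=discrete_p[k]) for k in components])
    continuous = rng.normal(np.take(means, components),
                            np.sqrt(np.take(variances, components)))
    return np.column_stack([discrete, continuous])

rng = np.random.default_rng(0)
first_100 = [sample_ci_mixture(1000, rng) for _ in range(100)]  # data sets before the change
# After the change, the second component's discrete probabilities become (0.5, 0.5).
```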
The first 100 data sets were generated by sampling from a mixture model with parameters: [π_1, π_2] = [0.6, 0.4] as weights, θ_1 = [0.8, 0.2] and θ_2 = [0.4, 0.6] as discrete state probabilities, and [µ_1, σ_1^2] = [10, 5] and [µ_2, σ_2^2] = [0, 7] as mean and variance (Gaussian) for the continuous variable. Then the discrete dimension probability of the second cluster was changed from θ_2 = [0.4, 0.6] to θ_2 = [0.5, 0.5], keeping the remaining parameters fixed, and an additional 100 data sets were generated by sampling from this altered model. This is a fairly small change in the distribution, and the underlying LogP scores appear to be very similar, as can be seen in Figure 1. The figure shows LogP distributions for the first two data sets generated from the first model (top row) and the first two data sets generated from the second model (bottom row). Plots within each row should be more similar than plots from different rows, but this is difficult to discern by visual inspection.

Figure 2: Average log(KS probability) over the in-sample period for four experiments on synthetic data, varying the number of data points per data set: a) 1,000; b) 5,000; c) 10,000; d) 50,000. The dotted vertical line separates in-sample and out-of-sample periods. Note that the y-axes have different scales in order to show the full variability of the data.

Algorithms 1 and 2 were evaluated by using the first 10 data sets to estimate a two-component model. Then pairwise KS measures were calculated between all possible data set pairs relative to the estimated model. Figure 2 shows average KS measures over the in-sample data sets (the first 10) for four experiments varying the number of data points in each experiment. Note that the vertical axes are different in each of the plots to better show the range of values. As the number of data points in the data set increases, the change that occurs at t = 101 becomes more apparent. At 50,000 data points (bottom right plot of Figure 2) the change in the distribution becomes easily detectable. Since this number of data points is typically considered to be small compared to the number of data points in our real-life applications, we expect to be able to detect such slight distribution changes.

3.2 Experiments on Real Life Data

Figure 3 shows a distribution for a typical day from a content web-site. There are almost 50,000 data points in the data set, with over 100 dimensions each. The LogP score distribution is similar to that of the synthetic data in Figure 1, which is a consequence of the CI model used. Note, however, that in this data set the true generating distribution is not known and is unlikely to be purely a CI model. Therefore, the average log KS measure over in-sample data has much lower values (see Figure 3 right, and the plots in Figure 2). Another way to phrase this observation is to note that since the true generating data distribution is most likely not CI, the observed similarity of LogP distributions (the KS measure) is much lower, since there are two factors of dissimilarity: 1) different data sets; 2) the inability of the CI model to capture all the aspects of the true data distribution.
Nonetheless, the first 31 data sets (one month of data) that were used to build the initial model Θ_0 can be used to define the natural variability of the KS measures against which additional data sets can be compared. The result is that in Figure 3 we clearly see a problem with the distribution on day 27 (a glitch in the data) and a permanent change in the distribution on day 43. Both of the detected changes correspond to real changes in the data, as verified by the commercial website operators. Automatic description of changes in the distribution and criteria for automatic rebuilding of the model are beyond the scope of this paper.

Figure 3: Left: distribution of 42655 LogP scores from a mixture of conditional independence models. The data is a single day of click-stream data from a commercial web site. Right: average log(KS probability) over the 31-day in-sample period for a content website, showing a glitch on day 27 and a permanent change on day 43, both detected by the proposed methodology.

4 Related Work

Automatic detection of various types of data changes appears in the literature in several different flavors. For example, novelty detection ([4], [8]) is the task of determining unusual or novel data points relative to some model. This is closely related to the outlier detection problem ([1], [5]), where the goal is not only to find unusual data points, but the ones that appear not to have been generated by the data generating distribution. A related problem has been addressed by [2] in the context of time series modeling, where outliers and trends can contaminate the model estimation. More recently, mixture models have been applied more directly to outlier detection [3]. The method proposed in this paper addresses a different problem. We are not interested in new and unusual data points; on the contrary, the method is quite robust with respect to outliers. An outlier or two do not necessarily mean that the underlying data distribution has changed. Also, some of the distribution changes we are interested in detecting might be considered uninteresting and/or not novel; for example, a slight shift of the population as a whole is something that we certainly detect as a change, but it is rarely considered novel unless the shift is drastic. There is also a set of online learning algorithms that update model parameters as the new data becomes available (for variants and additional references, see e.g. [6]). In that framework there is no such concept as a data distribution change, since the models are constantly updated to reflect the most current distribution. For example, instead of detecting a slight shift of the population as a whole, online learning algorithms update the model to reflect the shift.

5 Conclusions

In this paper we introduced a model-based method for automatic distribution change detection in an online data environment. Given the LogP distribution data signature, we further showed how to compare different data sets relative to the model using KS statistics and how to obtain a single measure of similarity between the new data and the model. Finally, we discussed heuristics for change detection that become principled in the limit as the number of possible data sets increases.
Experimental results over synthetic and real online data indicate that the proposed methodology is able to alert the analyst to slight distributional changes. This methodology may be used as the basis of a system to automatically re-estimate parameters of a mixture model on an "as-needed" basis – when the model fails to adequately represent the data after a certain point in time.

References

[1] V. Barnett and T. Lewis. Outliers in statistical data. Wiley, 1984.
[2] A. G. Bruce, J. T. Conor, and R. D. Martin. Prediction with robustness towards outliers, trends, and level shifts. In Proceedings of the Third International Conference on Neural Networks in Financial Engineering, pages 564–577, 1996.
[3] I. V. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transaction data with applications to profiling, visualization, and prediction. In F. Provost and R. Srikant, editors, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 37–46. ACM, 2001.
[4] C. Campbell and K. P. Bennett. A linear programming approach to novelty detection. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 395–401. MIT Press, 2001.
[5] T. Fawcett and F. J. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999.
[6] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355–368. Kluwer Academic Publishers, 1998.
[7] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press, Cambridge, UK, 1992.
[8] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.

4 0.66939908 13 nips-2001-A Natural Policy Gradient

Author: Sham M. Kakade

Abstract: We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. [9]. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.

5 0.6658479 89 nips-2001-Grouping with Bias

Author: Stella X. Yu, Jianbo Shi

Abstract: With the optimization of pattern discrimination as a goal, graph partitioning approaches often lack the capability to integrate prior knowledge to guide grouping. In this paper, we consider priors from unitary generative models, partially labeled data and spatial attention. These priors are modelled as constraints in the solution space. By imposing a uniformity condition on the constraints, we restrict the feasible space to one of smooth solutions. A subspace projection method is developed to solve this constrained eigenproblem. We demonstrate that simple priors can greatly improve image segmentation results.

6 0.66319847 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

7 0.65789837 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

8 0.65643466 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

9 0.65629178 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

10 0.65397996 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks

11 0.65161991 131 nips-2001-Neural Implementation of Bayesian Inference in Population Codes

12 0.65069288 157 nips-2001-Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

13 0.65057385 121 nips-2001-Model-Free Least-Squares Policy Iteration

14 0.6502769 132 nips-2001-Novel iteration schemes for the Cluster Variation Method

15 0.64900339 36 nips-2001-Approximate Dynamic Programming via Linear Programming

16 0.64877713 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding

17 0.64790678 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

18 0.64786708 161 nips-2001-Reinforcement Learning with Long Short-Term Memory

19 0.64707559 57 nips-2001-Correlation Codes in Neuronal Populations

20 0.6462307 74 nips-2001-Face Recognition Using Kernel Methods