nips nips2005 nips2005-169 knowledge-graph by maker-knowledge-mining

169 nips-2005-Saliency Based on Information Maximization

Source: pdf

Author: Neil Bruce, John Tsotsos

Abstract: A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. Results on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract A model of bottom-up overt attention is proposed based on the principle of maximizing information sampled from a scene. [sent-6, score-0.226]

2 The proposed operation is based on Shannon's self-information measure and is achieved in a neural circuit, which is demonstrated as having close ties with the circuitry existent in the primate visual cortex. [sent-7, score-0.464]

3 It is further shown that the proposed saliency measure may be extended to address issues that currently elude explanation in the domain of saliency based models. [sent-8, score-1.204]

4 Results on natural images are compared with experimental eye tracking data revealing the efficacy of the model in predicting the deployment of overt attention as compared with existing efforts. [sent-9, score-0.632]

5 1 Introduction There has long been interest in the nature of eye movements and fixation behavior following early studies by Buswell [1] and Yarbus [2]. [sent-10, score-0.509]

6 However, a complete description of the mechanisms underlying these peculiar fixation patterns remains elusive. [sent-11, score-0.368]

7 This is further complicated by the fact that task demands and contextual knowledge factor heavily in how sampling of visual content proceeds. [sent-12, score-0.179]

8 Current bottom-up models of attention posit that saliency is the impetus for selection of fixation points. [sent-13, score-1.0]

9 In perhaps the most popular model of bottom-up attention, saliency is based on centre-surround contrast of units modeled on known properties of primary visual cortical cells [3]. [sent-15, score-0.66]

10 In other efforts, saliency is defined by more ad hoc quantities having less connection to biology [4]. [sent-16, score-0.7]

11 There exist several previous efforts that define saliency based on Shannon entropy of image content defined on a local neighborhood [5, 6, 7, 8]. [sent-19, score-1.102]

12 2 we discuss differences between entropy and self-information in this context, including why self-information may present a more appropriate metric than entropy in this domain. [sent-22, score-0.32]

13 A bottom-up model of overt attention with selection based on the self-information of local image content. [sent-24, score-0.354]

14 A qualitative and quantitative comparison of predictions of the model with human eye tracking data, contrasted against the model of Itti and Koch [3]. [sent-26, score-0.339]

15 Demonstration that the model is neurally plausible via implementation based on a neural circuit resembling circuitry involved in early visual processing in primates. [sent-28, score-0.415]

16 Discussion of how the proposal generalizes to address issues that deny explanation by existing saliency based attention models. [sent-30, score-0.677]

17 2 The Proposed Saliency Measure There exists much evidence indicating that the primate visual system is built on the principle of establishing a sparse representation of image statistics. [sent-31, score-0.453]

18 In the most prominent of such studies, it was demonstrated that learning a sparse code for natural image statistics results in the emergence of simple-cell receptive fields similar to those appearing in the primary visual cortex of primates [10, 11]. [sent-32, score-0.346]

19 The apparent benefit of such a representation comes from the fact that a sparse representation allows certain independence assumptions with regard to neural firing. [sent-33, score-0.224]

20 This issue becomes important in evaluating the likelihood of a set of local image statistics and is elaborated on later in this section. [sent-34, score-0.266]

21 In this paper, saliency is determined by quantifying the self-information of each local image patch. [sent-35, score-0.713]

22 For a given image, an estimate of the distribution of each basis coefficient is learned across the entire image through non-parametric density estimation. [sent-40, score-0.378]

23 The probability of observing the RGB values corresponding to a patch centred at any image location may then be evaluated by independently considering the likelihood of each corresponding basis coefficient. [sent-41, score-0.451]

24 The product of such likelihoods yields the joint likelihood of the entire set of basis coefficients. [sent-42, score-0.172]

25 Given the basis determined by ICA, the preceding computation may be realized entirely in the context of a biologically plausible neural circuit. [sent-43, score-0.252]

26 Details of each of the aforesaid model components including the details of the neural circuit are as follows: Projection into independent component space provides, for each local neighborhood of the image, a vector w consisting of N variables w_i with values v_i. [sent-45, score-0.245]

27 Each w_i specifies the contribution of a particular basis function to the representation of the local neighborhood. [sent-46, score-0.228]

28 As mentioned, these basis functions, learned from statistical regularities observed in a large set of natural images, show remarkable similarity to V1 cells [10, 11]. [sent-47, score-0.15]

29 For further details on the ICA projection of local image statistics see [12]. [sent-49, score-0.177]

30 In this paper, we propose that salience may be defined based on a strategy for maximum information sampling. [sent-50, score-0.172]

31 In particular, Shannon's self-information measure [9], -log(p(x)), applied to the joint likelihood of statistics in a local neighborhood described by w, provides an appropriate transformation between probability and the degree of information inherent in the local statistics. [sent-51, score-0.303]

32 It is in computing the observation likelihood that a sparse representation is instrumental: Consider the probability density function p(w_1 = v_1, w_2 = v_2, . [sent-52, score-0.273]

33 , w_n = v_n) which quantifies the likelihood of observing the local statistics with values v_1, . [sent-55, score-0.208]

34 An appropriate context may include a larger area encompassing the local neighbourhood described by w, or the entire scene in question. [sent-59, score-0.198]

35 Thus, a sparse representation allows the estimation of the n-dimensional space described by W to be derived from n one dimensional probability density functions. [sent-64, score-0.262]

36 In practice, this might be derived on the basis of a nonparametric or histogram density estimate. [sent-69, score-0.246]
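The factorization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `histogram_density` and `self_information`, the bin count, the add-one smoothing, and the toy coefficient values are all hypothetical. It estimates one histogram density per coefficient and sums the per-coefficient terms -log p(w_i = v_i), which equals -log of the product of likelihoods.

```python
import math

def histogram_density(samples, n_bins=8):
    """Return a function estimating p(value) from a 1-D histogram of samples."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant coefficient
    counts = [0] * n_bins
    for s in samples:
        idx = min(int((s - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(samples)

    def prob(value):
        idx = min(max(int((value - lo) / width), 0), n_bins - 1)
        # Add-one smoothing (an assumption) keeps the probability positive.
        return (counts[idx] + 1) / (total + n_bins)

    return prob

def self_information(patch_coeffs, densities):
    """Sum of -log p(w_i = v_i), i.e. -log of the product of likelihoods."""
    return sum(-math.log(p(v)) for v, p in zip(patch_coeffs, densities))

# Toy data: two coefficients observed at six image locations.
coeff_maps = [
    [0.0, 0.1, 0.0, 0.1, 0.0, 2.0],  # coefficient 1; last location is unusual
    [1.0, 1.1, 0.9, 1.0, 1.1, 1.0],  # coefficient 2; nothing unusual
]
densities = [histogram_density(c) for c in coeff_maps]
scores = [
    self_information([c[k] for c in coeff_maps], densities)
    for k in range(6)
]
most_salient = scores.index(max(scores))
```

In this toy input the location whose first coefficient is rare across the image receives the highest score, which is the behavior the independence assumption is meant to make cheap to compute.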

37 In the section that follows, we demonstrate that an operation equivalent to a non-parametric density estimate may be achieved using a suitable neural circuit. [sent-70, score-0.23]

38 Let w_{i,j,k} denote the set of independent coefficients based on the neighborhood centered at j, k. [sent-74, score-0.226]

39 is the context on which the probability estimate of the coefficients of w is based. [sent-76, score-0.2]

40 On the basis of the form given in equation 1 it is evident that this operation may equivalently be implemented by the neural circuit depicted in figure 2. [sent-78, score-0.357]

41 Figure 2 demonstrates only coefficients derived from a horizontal cross-section. [sent-79, score-0.201]

42 Coefficients at the input layer correspond to coefficients of v. [sent-85, score-0.161]

43 The similarity between independent components and V1 cells, in conjunction with the aforementioned consideration, lends credibility to the proposal that information may contribute to driving overt attentional selection. [sent-88, score-0.299]

44 One aspect lacking from the preceding description is that the saliency map fails to take into account the dropoff in visual acuity moving peripherally from the fovea. [sent-89, score-0.955]

45 In some instances the maximum information accommodating for visual acuity may correspond to the center of a cluster of salient items, rather than centered on one such item. [sent-90, score-0.294]

46 For this reason, the resulting saliency map is convolved with a Gaussian with parameters chosen to correspond approximately to the drop off in visual acuity observed in the human visual system. [sent-91, score-0.986]
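A rough sketch of this smoothing step follows. The kernel width (`sigma`), the kernel radius, the border clamping, and the toy saliency grid are assumptions chosen for illustration; the source does not state the parameters used to approximate the acuity drop-off.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def smooth_rows(grid, kernel):
    """Filter every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [
        [
            sum(k * row[min(max(x + i - radius, 0), width - 1)]
                for i, k in enumerate(kernel))
            for x in range(width)
        ]
        for row in grid
    ]

def transpose(grid):
    return [list(col) for col in zip(*grid)]

def gaussian_blur(grid, sigma=1.0, radius=2):
    """Separable 2-D Gaussian blur: filter rows, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(smooth_rows(transpose(smooth_rows(grid, kernel)), kernel))

# Two salient items flanking an empty centre pixel.
saliency = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
blurred = gaussian_blur(saliency)
```

With these toy values the smoothed maximum lands on the pixel between the two items, illustrating how the peak can fall at the center of a cluster of salient items rather than on any single one.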

47 The first consideration lies in the expected behavior in popout paradigms and the second in the neural circuitry involved. [sent-95, score-0.218]

48 , x_n] denote a vector of RGB values corresponding to image patch X, and D a probability density function describing the distribution of some feature set over X. [sent-99, score-0.265]

49 For example, D might correspond to a histogram estimate of intensity values within X or the relative contribution of different orientations within a local neighborhood situated on the boundary of an object silhouette [6]. [sent-100, score-0.151]

50 The connections shown facilitate computation of the information measure corresponding to the pixel centered in the purple window. [sent-119, score-0.179]

51 The network architecture produces this measure on the basis of evaluating the probability of these coefficients with consideration to the values of such coefficients in neighbouring regions. [sent-120, score-0.62]

52 In this example, entropy characterizes the extent to which the feature(s) characterized by D are uniformly distributed on X . [sent-122, score-0.195]

53 Self-information in the proposed saliency measure is given by -log(p(X)). [sent-123, score-0.631]

54 Thus, p(X) characterizes the raw likelihood of observing X based on its surround and -log(p(X)) becomes closer to a measure of local contrast whereas entropy as defined in the usual manner is closer to a measure of local activity. [sent-126, score-0.621]
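The distinction between local activity (entropy) and local contrast (self-information) can be made concrete with a toy example. The sketch below is illustrative only: the patch feature (number of distinct intensities), the add-one smoothing, and the patch values are hypothetical choices, not taken from the paper.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (nats) of the empirical distribution within a patch."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def self_information(feature, context_features):
    """-log of the add-one smoothed relative frequency of feature in context."""
    counts = Counter(context_features)
    p = (counts[feature] + 1) / (len(context_features) + len(counts))
    return -math.log(p)

# Five textured patches and one blank patch; each patch lists intensities.
patches = [[0, 1, 2, 3]] * 5 + [[0, 0, 0, 0]]
# Hypothetical patch-level feature: number of distinct intensities.
features = [len(set(p)) for p in patches]

patch_entropy = [entropy(p) for p in patches]
patch_info = [self_information(f, features) for f in features]
```

Here the blank patch has the lowest entropy of all patches yet the highest self-information, because its content is unlikely given what surrounds it.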

55 In classic popout experiments, a vertical line among horizontal lines presents a highly salient target. [sent-131, score-0.155]

56 With regard to the neural circuitry involved, we have demonstrated that self-information may be computed using a neural circuit in the absence of a representation of the entire probability distribution. [sent-133, score-0.355]

57 Whether an equivalent operation may be achieved in a biologically plausible manner for the computation of entropy remains to be established. [sent-134, score-0.32]

58 The operation is equivalent to a kernel density estimate. [sent-149, score-0.16]

59 The small black circles indicate an inhibitory relationship and the small white circles an excitatory relationship. Figure 3: An image that highlights the difference between entropy and self-information. [sent-151, score-0.248]
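As a hedged illustration of what a kernel density estimate of a coefficient's likelihood might look like, the sketch below uses a Gaussian kernel with uniform weights over the context. The function name `kde_likelihood`, the bandwidth `sigma`, the uniform weighting, and the context values are assumptions made for this example.

```python
import math

def kde_likelihood(value, context_values, sigma=0.5):
    """Gaussian kernel density estimate of p(value) from context coefficients.

    Each context coefficient contributes a Gaussian bump centred on itself;
    uniform weights 1 / len(context_values) are an assumption of this sketch.
    """
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    weight = 1.0 / len(context_values)
    return sum(
        weight * norm * math.exp(-((value - c) ** 2) / (2.0 * sigma ** 2))
        for c in context_values
    )

# Coefficient values of one basis function at five context locations.
context = [0.0, 0.1, -0.1, 0.05, 3.0]
common = kde_likelihood(0.0, context)   # value shared by most of the context
rare = kde_likelihood(3.0, context)     # value seen at a single location
info_common = -math.log(common)
info_rare = -math.log(rare)
```

The rare value receives the lower likelihood and therefore the higher self-information. Note that only the likelihood at the observed value is evaluated; no full distribution is ever stored.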

60 Fixation invariably falls on the empty patch, the locus of minimum entropy in orientation and color but maximum in self-information when the surrounding context is considered. [sent-152, score-0.201]

61 The model of Itti and Koch is perhaps the most popular model of saliency based attention and currently appears to be the yardstick against which other models are measured. [sent-154, score-0.632]

62 1 Experimental eye tracking data The data that forms the basis for performance evaluation is derived from eye tracking experiments performed while subjects observed 120 different color images. [sent-156, score-0.652]

63 The eye tracking apparatus consisted of a standard non head-mounted device. [sent-161, score-0.236]

64 The issue of comparing between the output of a particular algorithm, and the eye tracking data is non-trivial. [sent-164, score-0.236]

65 Previous efforts have selected a number of fixation points based on the saliency map, and compared these with the experimental fixation points derived from a small number of subjects and images (7 subjects and 15 images in a recent effort [4]). [sent-165, score-1.626]

66 The most important such consideration is that the representation of perceptual importance is typically based on a saliency map. [sent-167, score-0.691]

67 Observing the output of an algorithm that selects fixation points based on the underlying saliency map obscures observation of the degree to which the saliency maps predict important and unimportant content and in particular, ignores confidence away from highly salient regions. [sent-168, score-1.698]

68 Secondly, it is not clear how many fixation points should be selected. [sent-169, score-0.368]

69 Choosing this value based on the experimental data will bias output based on information pertaining to the content of the image and may produce artificially good results. [sent-170, score-0.216]

70 The preceding discussion is intended to motivate the fact that selecting discrete fixation coordinates based on the saliency map for comparison may not present the most appropriate representation to use for performance evaluation. [sent-171, score-1.191]

71 Qualitative comparison is based on the representation proposed in [16]. [sent-173, score-0.137]

72 In this representation, a fixation density map is produced for each image based on all fixation points, and subjects. [sent-174, score-1.036]

73 Given a fixation point, one might consider how the image under consideration is sampled by the human visual system as photoreceptor density drops steeply moving peripherally from the centre of the fovea. [sent-175, score-0.941]

74 This dropoff may be modeled based on a 2D Gaussian distribution with appropriately chosen parameters, and centred on the measured fixation point. [sent-176, score-0.506]

75 A continuous fixation density map may be derived for a particular image based on the sum of all 2D Gaussians corresponding to each fixation point, from each subject. [sent-177, score-1.113]
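A minimal sketch of such a fixation density map is given below. The grid size, the Gaussian width `sigma`, and the fixation coordinates are hypothetical; the source says only that the Gaussian parameters are "appropriately chosen".

```python
import math

def fixation_density_map(fixations, height, width, sigma=1.0):
    """Sum one 2-D Gaussian per fixation (row, col) over an image grid."""
    density = [[0.0] * width for _ in range(height)]
    for fy, fx in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (y - fy) ** 2 + (x - fx) ** 2
                density[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return density

# Hypothetical fixations pooled over all subjects, as (row, col) pairs.
fixations = [(1, 1), (1, 2), (3, 3)]
density = fixation_density_map(fixations, height=5, width=5)
```

Pixels near several fixations accumulate the most density, while pixels far from every fixation stay close to zero.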

76 The density map then comprises a measure of the extent to which each pixel of the image is sampled on average by a human observer based on observed fixations. [sent-178, score-0.466]

77 This affords a representation for which similarity to a saliency map may be considered at a glance. [sent-179, score-0.794]

78 The saliency maps produced by each algorithm are treated as binary classifiers for fixation versus non-fixation points. [sent-181, score-0.904]

79 The choice of several different thresholds and assessment of performance in predicting fixated versus not fixated pixel locations allows an ROC curve to be produced for each algorithm. [sent-182, score-0.17]
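The thresholding procedure can be sketched as follows. This is an illustration under assumptions: the function name `roc_curve`, the flat-list data layout, the threshold values, and the rule for labeling a pixel as fixated are hypothetical and not specified in the source.

```python
def roc_curve(saliency, fixated, thresholds):
    """Return (false_alarm_rate, hit_rate) pairs, one per threshold.

    saliency: flat list of saliency values, one per pixel.
    fixated:  flat list of booleans marking pixels treated as fixated.
    """
    positives = sum(fixated)
    negatives = len(fixated) - positives
    points = []
    for t in thresholds:
        hits = sum(1 for s, f in zip(saliency, fixated) if s >= t and f)
        false_alarms = sum(
            1 for s, f in zip(saliency, fixated) if s >= t and not f
        )
        points.append((false_alarms / negatives, hits / positives))
    return points

# Toy saliency values for six pixels, two of which were fixated.
saliency = [0.9, 0.8, 0.3, 0.2, 0.1, 0.7]
fixated = [True, True, False, False, False, False]
points = roc_curve(saliency, fixated, thresholds=[0.0, 0.25, 0.5, 0.75, 1.01])
```

Each threshold turns the saliency map into a binary classifier; sweeping the threshold traces the curve from (1, 1), where every pixel is labeled fixated, down to (0, 0), where none is.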

80 2 Experimental Results Figure 4 affords a qualitative comparison of the output of the proposed model with the experimental eye tracking data for a variety of images. [sent-184, score-0.379]

81 In the implementation results shown, the ICA basis set was learned from a set of 360,000 7x7x3 image patches from 3600 natural images using the Lee et al. [sent-186, score-0.329]

82 W consists of the entire extent of the image and w(s, t) = 1/p for all s, t, with p the number of pixels in the image. [sent-189, score-0.192]

83 One might make a variety of selections for these variables based on arguments related to the human visual system, or based on performance. [sent-190, score-0.209]

84 The ROC curves appearing in figure 5 give some sense of the efficacy of the model in predicting which regions of a scene human observers tend to fixate. [sent-194, score-0.215]

85 Encouraging is the fact that similar performance is achieved using a method derived from first principles, and with no parameter tuning or ad hoc design choices. [sent-196, score-0.142]

86 Within each boxed region defined by solid lines: (Top Left) Original Image (Top Right) Saliency map produced by Itti + Koch algorithm. [sent-198, score-0.185]

87 (Bottom Right) Fixation density map based on experimental human eye tracking data. [sent-200, score-0.464]

88 The proposed approach may accommodate such a configuration with the single necessary condition being a sparse representation at each layer. [sent-202, score-0.222]

89 As we have described in section 2, there is evidence that suggests the possibility that the primate visual system may consist of a multi-layer sparse coding architecture [10, 11]. [sent-203, score-0.338]

90 The proposed algorithm quantifies information on the basis of a neural circuit, on units with response properties corresponding to neurons appearing in the primary visual cortex. [sent-204, score-0.364]

91 However, given an analogous representation corresponding to higher visual areas that encode form, depth, convexity etc. [sent-205, score-0.212]

92 Since the popout of features can occur on the basis of more complex properties such as a convex surface among concave surfaces [19], this is perhaps the next stage in a system that encodes saliency in the same manner as primates. [sent-207, score-0.705]

93 In the model of Itti and Koch, a multi-layer winner-take-all network acts directly on the saliency map and there is no hierarchical representation of image content. [sent-209, score-0.838]

94 There are however attention models that subscribe to a distributed representation of saliency (e. [sent-210, score-0.72]

95 [20]), that may implement attentional selection with the proposed neural circuit encoding saliency at each layer. [sent-212, score-0.818]

96 Figure 5: ROC curves for Self-information (blue) and Itti and Koch (red) saliency maps; horizontal axis: False Alarm Rate. [sent-213, score-0.536]

97 5 Conclusion We have described a strategy that predicts human attentional deployment on the principle of maximizing information sampled from a scene. [sent-217, score-0.164]

98 Although no computational machinery is included strictly on the basis of biological plausibility, nevertheless the formulation results in an implementation based on a neurally plausible circuit acting on units that resemble those that facilitate early visual processing in primates. [sent-218, score-0.478]

99 Comparison with an existing attention model reveals the efficacy of the proposed model in predicting salient image content. [sent-219, score-0.455]

100 Finally, we demonstrate that the proposal might be generalized to facilitate selection based on high-level features provided an appropriate sparse representation is available. [sent-220, score-0.294]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('saliency', 0.536), ('fixation', 0.368), ('koch', 0.19), ('coefficients', 0.161), ('eye', 0.141), ('itti', 0.133), ('circuit', 0.127), ('visual', 0.124), ('image', 0.124), ('entropy', 0.124), ('attention', 0.096), ('tracking', 0.095), ('defined', 0.095), ('map', 0.09), ('representation', 0.088), ('basis', 0.087), ('density', 0.086), ('popout', 0.082), ('overt', 0.081), ('shannon', 0.076), ('ica', 0.075), ('operation', 0.074), ('salient', 0.073), ('rgb', 0.071), ('efficacy', 0.071), ('circuitry', 0.069), ('attentional', 0.069), ('primate', 0.069), ('consideration', 0.067), ('neighborhood', 0.065), ('images', 0.063), ('acuity', 0.06), ('architecture', 0.06), ('vi', 0.058), ('patch', 0.055), ('content', 0.055), ('patches', 0.055), ('beii', 0.054), ('dropoff', 0.054), ('lea', 0.054), ('lui', 0.054), ('peripherally', 0.054), ('purple', 0.054), ('quantifies', 0.054), ('wz', 0.054), ('yarbus', 0.054), ('subjects', 0.053), ('local', 0.053), ('human', 0.052), ('plausible', 0.052), ('qualitative', 0.051), ('likelihood', 0.051), ('observing', 0.05), ('appearing', 0.05), ('efforts', 0.05), ('proposed', 0.049), ('sparse', 0.048), ('wi', 0.048), ('tsotsos', 0.047), ('centred', 0.047), ('fixated', 0.047), ('coefficient', 0.047), ('plausibility', 0.047), ('roc', 0.046), ('measure', 0.046), ('proposal', 0.045), ('wn', 0.045), ('facilitate', 0.045), ('lee', 0.045), ('neurally', 0.043), ('deployment', 0.043), ('affords', 0.043), ('davis', 0.043), ('braun', 0.043), ('predicting', 0.042), ('confidence', 0.04), ('bruce', 0.04), ('salience', 0.04), ('derived', 0.04), ('context', 0.039), ('evaluating', 0.038), ('locus', 0.038), ('vl', 0.038), ('characterizes', 0.037), ('preceding', 0.037), ('may', 0.037), ('hoc', 0.036), ('appropriate', 0.035), ('chicago', 0.035), ('entire', 0.034), ('pixel', 0.034), ('extent', 0.034), ('might', 0.033), ('achieved', 0.033), ('centre', 0.033), ('ad', 0.033), ('items', 0.033), ('closer', 0.033), ('depicted', 0.032), ('effort', 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

2 0.2009116 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on feature contrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model's behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.

3 0.16289821 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

Abstract: We present a computational model of human eye movements in an object class detection task. The model combines state-of-the-art computer vision object class detection methods (SIFT features trained using AdaBoost) with a biologically plausible model of human eye movement to produce a sequence of simulated fixations, culminating with the acquisition of a target. We validated the model by comparing its behavior to the behavior of human observers performing the identical object class detection task (looking for a teddy bear among visually complex nontarget objects). We found considerable agreement between the model and human data in multiple eye movement measures, including number of fixations, cumulative probability of fixating the target, and scanpath distance.

4 0.16245767 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

Author: Michael Shettel, Shaun Vecera, Michael C. Mozer

Abstract: Theories of visual attention commonly posit that early parallel processes extract conspicuous features such as color contrast and motion from the visual field. These features are then combined into a saliency map, and attention is directed to the most salient regions first. Top-down attentional control is achieved by modulating the contribution of different feature types to the saliency map. A key source of data concerning attentional control comes from behavioral studies in which the effect of recent experience is examined as individuals repeatedly perform a perceptual discrimination task (e.g., “what shape is the odd-colored object?”). The robust finding is that repetition of features of recent trials (e.g., target color) facilitates performance. We view this facilitation as an adaptation to the statistical structure of the environment. We propose a probabilistic model of the environment that is updated after each trial. Under the assumption that attentional control operates so as to make performance more efficient for more likely environmental states, we obtain parsimonious explanations for data from four different experiments. Further, our model provides a rational explanation for why the influence of past experience on attentional control is short lived. 1 INTRODUCTION The brain does not have the computational capacity to fully process the massive quantity of information provided by the eyes. Selective attention operates to filter the spatiotemporal stream to a manageable quantity. Key to understanding the nature of attention is discovering the algorithm governing selection, i.e., understanding what information will be selected and what will be suppressed. Selection is influenced by attributes of the spatiotemporal stream, often referred to as bottom-up contributions to attention. For example, attention is drawn to abrupt onsets, motion, and regions of high contrast in brightness and color. 
Most theories of attention posit that some visual information processing is performed preattentively and in parallel across the visual field. This processing extracts primitive visual features such as color and motion, which provide the bottom-up cues for attentional guidance. However, attention is not driven willy nilly by these cues. The deployment of attention can be modulated by task instructions, current goals, and domain knowledge, collectively referred to as top-down contributions to attention. How do bottom-up and top-down contributions to attention interact? Most psychologically and neurobiologically motivated models propose a very similar architecture in which information from bottom-up and top-down sources combines in a saliency (or activation) map (e.g., Itti et al., 1998; Koch & Ullman, 1985; Mozer, 1991; Wolfe, 1994). The saliency map indicates, for each location in the visual field, the relative importance of that location. Attention is drawn to the most salient locations first. Figure 1 sketches the basic architecture that incorporates bottom-up and top-down contributions to the saliency map. The visual image is analyzed to extract maps of primitive features such as color and orientation. Associated with each location in a map is a scalar response or activation indicating the presence of a particular feature. [FIGURE 1. An attentional saliency map constructed from bottom-up and top-down information.] [FIGURE 2. Sample display from Experiment 1 of Maljkovic and Nakayama (1994).] Most models assume that responses are stronger at locations with high local feature contrast, consistent with neurophysiological data, e.g., the response of a red feature detector to a red object is stronger if the object is surrounded by green objects. The saliency map is obtained by taking a sum of bottom-up activations from the feature maps. 
The bottom-up activations are modulated by a top-down gain that specifies the contribution of a particular map to saliency in the current task and environment. Wolfe (1994) describes a heuristic algorithm for determining appropriate gains in a visual search task, where the goal is to detect a target object among distractor objects. Wolfe proposes that maps encoding features that discriminate between target and distractors have higher gains, and to be consistent with the data, he proposes limits on the magnitude of gain modulation and the number of gains that can be modulated. More recently, Wolfe et al. (2003) have been explicit in proposing optimization as a principle for setting gains given the task definition and stimulus environment. One aspect of optimizing attentional control involves configuring the attentional system to perform a given task; for example, in a visual search task for a red vertical target among green vertical and red horizontal distractors, the task definition should result in a higher gain for red and vertical feature maps than for other feature maps. However, there is a more subtle form of gain modulation, which depends on the statistics of display environments. For example, if green vertical distractors predominate, then red is a better discriminative cue than vertical; and if red horizontal distractors predominate, then vertical is a better discriminative cue than red. In this paper, we propose a model that encodes statistics of the environment in order to allow for optimization of attentional control to the structure of the environment. Our model is designed to address a key set of behavioral data, which we describe next. 1.1 Attentional priming phenomena Psychological studies involve a sequence of experimental trials that begin with a stimulus presentation and end with a response from the human participant. Typically, trial order is randomized, and the context preceding a trial is ignored. 
However, in sequential studies, performance is examined on one trial contingent on the past history of trials. These sequential studies explore how experience influences future performance. Consider the sequential attentional task of Maljkovic and Nakayama (1994). On each trial, the stimulus display (Figure 2) consists of three notched diamonds, one a singleton in color, either green among red or red among green. The task is to report whether the singleton diamond, referred to as the target, is notched on the left or the right. The task is easy because the singleton pops out, i.e., the time to locate the singleton does not depend on the number of diamonds in the display. Nonetheless, the response time significantly depends on the sequence of trials leading up to the current trial: If the target is the same color on the current trial as on the previous trial, response time is roughly 100 ms faster than if the target is a different color on the current trial. Considering that response times are on the order of 700 ms, this effect, which we term attentional priming, is gigantic in the scheme of psychological phenomena. 2 ATTENTIONAL CONTROL AS ADAPTATION TO THE STATISTICS OF THE ENVIRONMENT We interpret the phenomenon of attentional priming via a particular perspective on attentional control, which can be summarized in two bullets. • The perceptual system dynamically constructs a probabilistic model of the environment based on its past experience. • Control parameters of the attentional system are tuned so as to optimize performance under the current environmental model. The primary focus of this paper is the environmental model, but we first discuss the nature of performance optimization. The role of attention is to make processing of some stimuli more efficient, and consequently, the processing of other stimuli less efficient. 
For example, if the gain on the red feature map is turned up, processing will be efficient for red items, but competition from red items will reduce the efficiency for green items. Thus, optimal control should tune the system for the most likely states of the world by minimizing an objective function such as: $J(g) = \sum_e P(e)\,\mathrm{RT}_g(e)$ (1) where g is a vector of top-down gains, e is an index over environmental states, P(.) is the probability of an environmental state, and RT_g(.) is the expected response time, assuming a constant error rate, to the environmental state under gains g. Determining the optimal gains is a challenge because every gain setting will result in facilitation of responses to some environmental states but hindrance of responses to other states. The optimal control problem could be solved via direct reinforcement learning, but the rapidity of human learning makes this possibility unlikely: In a variety of experimental tasks, evidence suggests that adaptation to a new task or environment can occur in just one or two trials (e.g., Rogers & Monsell, 1996). Model-based reinforcement learning is an attractive alternative, because given a model, optimization can occur without further experience in the real world. Although the number of real-world trials necessary to achieve a given level of performance is comparable for direct and model-based reinforcement learning in stationary environments (Kearns & Singh, 1999), naturalistic environments can be viewed as highly nonstationary. In such a situation, the framework we suggest is well motivated: After each experience, the environment model is updated. The updated environmental model is then used to retune the attentional system. In this paper, we propose a particular model of the environment suitable for visual search tasks. Rather than explicitly modeling the optimization of attentional control by setting gains, we assume that the optimization process will serve to minimize Equation 1. 
Because any gain adjustment will facilitate performance in some environmental states and hinder performance in others, an optimized control system should obtain faster reaction times for more probable environmental states. This assumption allows us to explain experimental results in a minimal, parsimonious framework.

3 MODELING THE ENVIRONMENT

Focusing on the domain of visual search, we characterize the environment in terms of a probability distribution over configurations of target and distractor features. We distinguish three classes of features: defining, reported, and irrelevant. To explain these terms, consider searching a display of size-varying, colored, notched diamonds (Figure 2), where the task is to detect the singleton in color and judge the notch location. Color is the defining feature, notch location is the reported feature, and size is an irrelevant feature. To simplify the exposition, we treat all features as having discrete values, an assumption which is true of the experimental tasks we model. We begin by considering displays containing a single target and a single distractor, and shortly generalize to multidistractor displays. We use the framework of Bayesian networks to characterize the environment. Each feature of the target and distractor is a discrete random variable, e.g., $T_{\mathrm{color}}$ for target color and $D_{\mathrm{notch}}$ for the location of the notch on the distractor. The Bayes net encodes the probability distribution over environmental states; in our working example, this distribution is $P(T_{\mathrm{color}}, T_{\mathrm{size}}, T_{\mathrm{notch}}, D_{\mathrm{color}}, D_{\mathrm{size}}, D_{\mathrm{notch}})$. The structure of the Bayes net specifies the relationships among the features. The simplest model one could consider would be to treat the features as independent, illustrated in Figure 3a for the singleton-color search task. The opposite extreme would be the full joint distribution, which could be represented by a lookup table indexed by the six features, or by the cascading Bayes net architecture in Figure 3b.
The architecture we propose, which we'll refer to as the dominance model (Figure 3c), has an intermediate dependency structure, and expresses the joint distribution as:

$$P(T_{\mathrm{color}})\,P(D_{\mathrm{color}} \mid T_{\mathrm{color}})\,P(T_{\mathrm{size}} \mid T_{\mathrm{color}})\,P(T_{\mathrm{notch}} \mid T_{\mathrm{color}})\,P(D_{\mathrm{size}} \mid D_{\mathrm{color}})\,P(D_{\mathrm{notch}} \mid D_{\mathrm{color}}).$$

The structured model is constructed based on three rules.
1. The defining feature of the target is at the root of the tree.
2. The defining feature of the distractor is conditionally dependent on the defining feature of the target. We refer to this rule as dominance of the target over the distractor.
3. The reported and irrelevant features of target (distractor) are conditionally dependent on the defining feature of the target (distractor). We refer to this rule as dominance of the defining feature over nondefining features.
As we will demonstrate, the dominance model produces a parsimonious account of a wide range of experimental data.

FIGURE 3. Three models of a visual-search environment with colored, notched, size-varying diamonds. (a) feature-independence model; (b) full-joint model; (c) dominance model.

3.1 Updating the environment model

The model's parameters are the conditional distributions embodied in the links. In the example of Figure 3c with binary random variables, the model has 11 parameters. However, these parameters are determined by the environment: To be adaptive in nonstationary environments, the model must be updated following each experienced state. We propose a simple exponentially weighted averaging approach. For two variables V and W with observed values v and w on trial t, a conditional distribution, $P_t(V = u \mid W = w) = \delta_{uv}$, is defined, where $\delta$ is the Kronecker delta.
The distribution representing the environment E following trial t, denoted $P^E_t$, is then updated as follows:

$$P^E_t(V = u \mid W = w) = \alpha\,P^E_{t-1}(V = u \mid W = w) + (1 - \alpha)\,P_t(V = u \mid W = w) \qquad (2)$$

for all u, where $\alpha$ is a memory constant. Note that no update is performed for values of W other than w. An analogous update is performed for unconditional distributions. How the model is initialized (i.e., the choice of $P^E_0$) is irrelevant, because in all experimental tasks that we model, participants begin the experiment with many dozens of practice trials. Data is not collected during practice trials. Consequently, any transient effects of $P^E_0$ do not impact the results. In our simulations, we begin with a uniform distribution for $P^E_0$, and include practice trials as in the human studies. Thus far, we've assumed a single target and a single distractor. The experiments that we model involve multiple distractors. The simple extension we require to handle multiple distractors is to define a frequentist probability for each distractor feature V, $P_t(V = v \mid W = w) = C_{vw} / C_w$, where $C_{vw}$ is the count of co-occurrences of feature values v and w among the distractors, and $C_w$ is the count of w. Our model is extremely simple. Given a description of the visual search task and environment, the model has only a single degree of freedom, $\alpha$. In all simulations, we fix $\alpha = 0.75$; however, the choice of $\alpha$ does not qualitatively impact any result.

4 SIMULATIONS

In this section, we show that the model can explain a range of data from four different experiments examining attentional priming. All experiments measure response times of participants. On each trial, the model can be used to obtain a probability of the display configuration (the environmental state) on that trial, given the history of trials to that point.
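The update rule and the dominance factorization can be sketched together in a few lines. The code below is an illustrative reading of the text, not the authors' implementation: the class names `CondTable` and `DominanceModel` are hypothetical, only the single-target, single-distractor case is handled (the multi-distractor frequency estimate is omitted), and distributions start out uniform as described for the simulations.

```python
import math
from collections import defaultdict


class CondTable:
    """Conditional distribution P(V | W) over discrete values, updated as in Equation 2."""

    def __init__(self, values, alpha=0.75):
        self.values = list(values)
        self.alpha = alpha
        uniform = 1.0 / len(self.values)
        # One row per parent value w, created lazily and initialized to uniform.
        # An unconditional distribution simply uses the single row w=None.
        self.rows = defaultdict(lambda: {u: uniform for u in self.values})

    def prob(self, v, w=None):
        return self.rows[w][v]

    def update(self, v, w=None):
        """Move row w toward the observed value v; rows for other w are untouched."""
        row = self.rows[w]
        for u in self.values:
            observed = 1.0 if u == v else 0.0  # Kronecker delta
            row[u] = self.alpha * row[u] + (1.0 - self.alpha) * observed


class DominanceModel:
    """Sketch of the dominance model for one target and one distractor."""

    def __init__(self, feature_values, defining, alpha=0.75):
        self.defining = defining
        self.others = [f for f in feature_values if f != defining]
        d_vals = feature_values[defining]
        self.t_def = CondTable(d_vals, alpha)  # P(T_def)
        self.d_def = CondTable(d_vals, alpha)  # P(D_def | T_def)
        self.t_other = {f: CondTable(feature_values[f], alpha) for f in self.others}
        self.d_other = {f: CondTable(feature_values[f], alpha) for f in self.others}

    def log_prob(self, target, distractor):
        """Log probability of a display under the current environment model."""
        t, d = target[self.defining], distractor[self.defining]
        lp = math.log(self.t_def.prob(t)) + math.log(self.d_def.prob(d, t))
        for f in self.others:
            lp += math.log(self.t_other[f].prob(target[f], t))      # P(T_f | T_def)
            lp += math.log(self.d_other[f].prob(distractor[f], d))  # P(D_f | D_def)
        return lp

    def update(self, target, distractor):
        """Apply the exponentially weighted update to every link of the network."""
        t, d = target[self.defining], distractor[self.defining]
        self.t_def.update(t)
        self.d_def.update(d, t)
        for f in self.others:
            self.t_other[f].update(target[f], t)
            self.d_other[f].update(distractor[f], d)


values = {"color": ["red", "green"], "size": ["small", "large"], "notch": ["left", "right"]}
model = DominanceModel(values, defining="color")

red_trial = ({"color": "red", "size": "small", "notch": "left"},
             {"color": "green", "size": "small", "notch": "right"})
switch_trial = ({"color": "green", "size": "small", "notch": "left"},
                {"color": "red", "size": "small", "notch": "right"})

before = model.log_prob(*red_trial)   # uniform model: every factor is 0.5
model.update(*red_trial)
after = model.log_prob(*red_trial)    # repeated display is now more probable
switched = model.log_prob(*switch_trial)
```

After one update, the repeated display has a higher log probability than before, and a display with swapped target and distractor colors is less probable than the repeat. Under the assumption that response time decreases with probability, this corresponds to the priming effect described in the text.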
Our critical assumption—as motivated earlier—is that response times monotonically decrease with increasing probability, indicating that visual information processing is better configured for more likely environmental states. The particular relationship we assume is that response times are linear in log probability. This assumption yields long response time tails, as are observed in all human studies. 4.1 Maljkovic and Nakayama (1994, Experiment 5) In this experiment, participants were asked to search for a singleton in color in a display of three red or green diamonds. Each diamond was notched on either the left or right side, and the task was to report the side of the notch on the color singleton. The well-practiced participants made very few errors. Reaction time (RT) was examined as a function of whether the target on a given trial is the same or different color as the target on trial n steps back or ahead. Figure 4 shows the results, with the human RTs in the left panel and the simulation log probabilities in the right panel. The horizontal axis represents n. Both graphs show the same outcome: repetition of target color facilitates performance. This influence lasts only for a half dozen trials, with an exponentially decreasing influence further into the past. In the model, this decreasing influence is due to the exponential decay of recent history (Equation 2). Figure 4 also shows that—as expected—the future has no influence on the current trial. 4.2 Maljkovic and Nakayama (1994, Experiment 8) In the previous experiment, it is impossible to determine whether facilitation is due to repetition of the target’s color or the distractor’s color, because the display contains only two colors, and therefore repetition of target color implies repetition of distractor color. 
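The exponential decay of influence noted for Experiment 5 can be made explicit by unrolling Equation 2. The following worked derivation is a sketch for an unconditional distribution such as the one over target color, which is updated on every trial; the shorthand $P^E_t(u)$ for the environment model and $P_t(u)$ for the single-trial delta distribution is introduced here for illustration only.

```latex
\begin{align*}
P^E_t(u) &= \alpha P^E_{t-1}(u) + (1-\alpha) P_t(u) \\
         &= \alpha^2 P^E_{t-2}(u) + (1-\alpha)\bigl[P_t(u) + \alpha P_{t-1}(u)\bigr] \\
         &= \alpha^{t} P^E_0(u) + (1-\alpha) \sum_{k=0}^{t-1} \alpha^{k} P_{t-k}(u).
\end{align*}
% The trial k steps before the most recent one is weighted by \alpha^{k}
% relative to the most recent trial. With \alpha = 0.75:
%   \alpha^{5} = 0.75^{5} \approx 0.237, \qquad \alpha^{10} \approx 0.056.
```

Under this reading, a trial six steps back carries under a quarter of the weight of the most recent trial, which is in line with priming that fades over roughly half a dozen trials.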
To unconfound these two potential factors, an experiment like the previous one was conducted using four distinct colors, allowing one to examine the effect of repeating the target color while varying the distractor color, and vice versa. The sequence of trials was composed of subsequences of up to six consecutive trials with either the target or distractor color held constant while the other color was varied trial to trial. Following each subsequence, both target and distractors were changed. Figure 5 shows that for both humans and the simulation, performance improves toward an asymptote as the number of target and distractor repetitions increases; in the model, the asymptote is due to the probability of the repeated color in the environment model approaching 1.0. The performance improvement is greater for target than distractor repetition; in the model, this difference is due to the dominance of the defining feature of the target over the defining feature of the distractor.

4.3 Huang, Holcombe, and Pashler (2004, Experiment 1)

Huang et al. (2004) and Hillstrom (2000) conducted studies to determine whether repetitions of one feature facilitate performance independently of repetitions of another feature. In the Huang et al. study, participants searched for a singleton in size in a display consisting of lines that were short and long, slanted left or right, and colored white or black. The reported feature was target slant. Slant, size, and color were uncorrelated. Huang et al. discovered that repeating an irrelevant feature (color or orientation) facilitated performance, but only when the defining feature (size) was repeated. As shown in Figure 6, the model replicates human performance, due to the dominance of the defining feature over the reported and irrelevant features.

4.4 Wolfe, Butcher, Lee, and Hyde (2003, Experiment 1)

In an empirical tour-de-force, Wolfe et al. (2003) explored singleton search over a range of environments.
FIGURE 4. Experiment 5 of Maljkovic and Nakayama (1994): performance on a given trial conditional on the color of the target on a previous or subsequent trial. Human data is from subject KN. (Left panel: human data, reaction time in msec; right panel: simulation, log(P(trial)). Horizontal axis: relative trial number, past to future. Curves: same color, different color.)

FIGURE 5. Experiment 8 of Maljkovic and Nakayama (1994). (Left panel) human data, average of subjects KN and SS, reaction time in msec; (right panel) simulation, log(P(trial)). Horizontal axis: order in sequence. Curves: target same, distractors same.

FIGURE 6. Experiment 1 of Huang, Holcombe, & Pashler (2004). (Left panel) human data, reaction time in msec; (right panel) simulation, log(P(trial)). Horizontal axis: color repeat vs. color alternate. Curves: size repeat, size alternate.

The task is to detect the presence or absence of a singleton in displays consisting of colored (red or green), oriented (horizontal or vertical) lines. Target-absent trials were used primarily to ensure participants were searching the display. The experiment examined seven experimental conditions, which varied in the amount of uncertainty as to the target identity. The essential conditions, from least to most uncertainty, are: blocked (e.g., target always red vertical among green horizontals), mixed feature (e.g., target always a color singleton), mixed dimension (e.g., target either red or vertical), and fully mixed (target could be red, green, vertical, or horizontal). With this design, one can ascertain how uncertainty in the environment and in the target definition influence task difficulty.
Because the defining feature in this experiment could be either color or orientation, we modeled the environment with two Bayes nets (one color dominant and one orientation dominant) and performed model averaging. A comparison of Figures 7a and 7b shows a correspondence between human RTs and model predictions. Less uncertainty in the environment leads to more efficient performance. One interesting result from the model is its prediction that the mixed-feature condition is easier than the fully-mixed condition; that is, search is more efficient when the dimension (i.e., color vs. orientation) of the singleton is known, even though the model has no abstract representation of feature dimensions, only feature values.

4.5 Optimal adaptation constant

In all simulations so far, we fixed the memory constant. From the human data, it is clear that memory for recent experience is relatively short lived, on the order of a half dozen trials (e.g., left panel of Figure 4). In this section we provide a rational argument for the short duration of memory in attentional control. Figure 7c shows mean negative log probability in each condition of the Wolfe et al. (2003) experiment, as a function of $\alpha$. To assess these probabilities, for each experimental condition, the model was initialized so that all of the conditional distributions were uniform, and then a block of trials was run. Log probability for all trials in the block was averaged. The negative log probability (y-axis of the figure) is a measure of the model's misprediction of the next trial in the sequence. For complex environments, such as the fully-mixed condition, a small memory constant is detrimental: With rapid memory decay, the effective history of trials is a high-variance sample of the distribution of environmental states.
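The trade-off examined in this section can be illustrated with a much simpler stand-in than the full model. The sketch below tracks a single four-valued feature with the exponentially weighted update and reports the mean negative log probability of each trial (predicted before the model is updated) for a blocked and a fully mixed sequence. The single-variable setup, the block length of 200, and the random seed are assumptions made for illustration and do not reproduce the Wolfe et al. conditions.

```python
import math
import random


def mean_neg_log_prob(trials, values, alpha):
    """Average -log P(observed value), predicting each trial before updating."""
    p = {v: 1.0 / len(values) for v in values}  # uniform initial model
    total = 0.0
    for x in trials:
        total += -math.log(p[x])
        # Exponentially weighted update toward the observed value.
        for v in values:
            p[v] = alpha * p[v] + (1.0 - alpha) * (1.0 if v == x else 0.0)
    return total / len(trials)


values = ["red", "green", "vertical", "horizontal"]
rng = random.Random(0)  # fixed seed so the mixed sequence is reproducible

blocked = ["red"] * 200                            # simple environment
mixed = [rng.choice(values) for _ in range(200)]   # complex environment

for alpha in (0.5, 0.8, 0.9, 0.95, 0.98):
    print(alpha,
          round(mean_neg_log_prob(blocked, values, alpha), 3),
          round(mean_neg_log_prob(mixed, values, alpha), 3))
```

In the blocked sequence, a small memory constant lets the model converge on the repeated value within a few trials, so misprediction is low. In the mixed sequence, the same small constant makes the estimate chase the last few trials, whereas a large constant keeps it close to the uniform distribution that actually generates the data.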
For simple environments, a large memory constant is detrimental: With slow memory decay, the model does not transition quickly from the initial environmental model to one that reflects the statistics of a new environment. Thus, the memory constant is constrained by being large enough that the environment model can hold on to sufficient history to represent complex environments, and by being small enough that the model adapts quickly to novel environments. If the conditions in Wolfe et al. give some indication of the range of naturalistic environments an agent encounters, we have a rational account of why attentional priming is so short lived. Whether priming lasts 2 trials or 20, the surprising empirical result is that it does not last 200 or 2000 trials. Our rational argument provides a rough insight into this finding.

FIGURE 7. (a) Human data for Wolfe et al. (2003), Experiment 1; (b) simulation; (c) misprediction of model (i.e., lower y value = better) as a function of $\alpha$ for five experimental conditions. (Panels a and b: reaction time in msec and log(P(trial)) by target type, red or vert vs. red and vert, for the fully mixed, mixed feature, mixed dimension, and blocked conditions. Panel c: log(P(trial)) against memory constant for the blocked red or vertical, blocked red and vertical, mixed feature, mixed dimension, and fully mixed conditions.)

5 DISCUSSION

The psychological literature contains two opposing accounts of attentional priming and its relation to attentional control. Huang et al. (2004) and Hillstrom (2000) propose an episodic account in which a distinct memory trace, representing the complete configuration of features in the display, is laid down for each trial, and priming depends on configural similarity of the current trial to previous trials. Alternatively, Maljkovic and Nakayama (1994) and Wolfe et al.
(2003) propose a feature-strengthening account in which detection of a feature on one trial increases its ability to attract attention on subsequent trials, and priming is proportional to the number of overlapping features from one trial to the next. The episodic account corresponds roughly to the full joint model (Figure 3b), and the feature-strengthening account corresponds roughly to the independence model (Figure 3a). Neither account is adequate to explain the range of data we presented. However, an intermediate account, the dominance model (Figure 3c), is not only sufficient, but it offers a parsimonious, rational explanation. Beyond the model’s basic assumptions, it has only one free parameter, and can explain results from diverse experimental paradigms. The model makes a further theoretical contribution. Wolfe et al. distinguish the environments in their experiment in terms of the amount of top-down control available, implying that different mechanisms might be operating in different environments. However, in our account, top-down control is not some substance distributed in different amounts depending on the nature of the environment. Our account treats all environments uniformly, relying on attentional control to adapt to the environment at hand. We conclude with two limitations of the present work. First, our account presumes a particular network architecture, instead of a more elegant Bayesian approach that specifies priors over architectures, and performs automatic model selection via the sequence of trials. We did explore such a Bayesian approach, but it was unable to explain the data. Second, at least one finding in the literature is problematic for the model. Hillstrom (2000) occasionally finds that RTs slow when an irrelevant target feature is repeated but the defining target feature is not. However, because this effect is observed only in some experiments, it is likely that any model would require elaboration to explain the variability. 
ACKNOWLEDGEMENTS

We thank Jeremy Wolfe for providing the raw data from his experiment for reanalysis. This research was funded by NSF BCS Award 0339103.

REFERENCES

Huang, L., Holcombe, A. O., & Pashler, H. (2004). Repetition priming in visual search: Episodic retrieval, not feature priming. Memory & Cognition, 32, 12–20.
Hillstrom, A. P. (2000). Repetition effects in visual search. Perception & Psychophysics, 62, 800–817.
Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Analysis & Machine Intelligence, 20, 1254–1259.
Kearns, M., & Singh, S. (1999). Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems 11 (pp. 996–1002). Cambridge, MA: MIT Press.
Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: towards the underlying neural circuitry. Human Neurobiology, 4, 219–227.
Maljkovic, V., & Nakayama, K. (1994). Priming of pop-out: I. Role of features. Memory & Cognition, 22, 657–672.
Mozer, M. C. (1991). The perception of multiple objects: A connectionist approach. Cambridge, MA: MIT Press.
Rogers, R. D., & Monsell, S. (1995). The cost of a predictable switch between simple cognitive tasks. Journal of Experimental Psychology: General, 124, 207–231.
Wolfe, J. M. (1994). Guided Search 2.0: A Revised Model of Visual Search. Psych. Bull. & Rev., 1, 202–238.
Wolfe, J. M., Butcher, S. J., Lee, C., & Hyde, M. (2003). Changing your mind: on the contributions of top-down and bottom-up guidance in visual search for feature singletons. Journal of Exptl. Psychology: Human Perception & Performance, 29, 483–502.

5 0.12955244 149 nips-2005-Optimal cue selection strategy

Author: Vidhya Navalpakkam, Laurent Itti

Abstract: Survival in the natural world demands the selection of relevant visual cues to rapidly and reliably guide attention towards prey and predators in cluttered environments. We investigate whether our visual system selects cues that guide search in an optimal manner. We formally obtain the optimal cue selection strategy by maximizing the signal to noise ratio (SN R) between a search target and surrounding distractors. This optimal strategy successfully accounts for several phenomena in visual search behavior, including the effect of target-distractor discriminability, uncertainty in target’s features, distractor heterogeneity, and linear separability. Furthermore, the theory generates a new prediction, which we verify through psychophysical experiments with human subjects. Our results provide direct experimental evidence that humans select visual cues so as to maximize SN R between the targets and surrounding clutter.

6 0.12135963 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

7 0.11862484 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

8 0.11136951 34 nips-2005-Bayesian Surprise Attracts Human Attention

9 0.081822827 110 nips-2005-Learning Depth from Single Monocular Images

10 0.070539616 109 nips-2005-Learning Cue-Invariant Visual Responses

11 0.070269898 23 nips-2005-An Application of Markov Random Fields to Range Sensing

12 0.069835149 157 nips-2005-Principles of real-time computing with feedback applied to cortical microcircuit models

13 0.067253456 79 nips-2005-Fusion of Similarity Data in Clustering

14 0.063008718 88 nips-2005-Gradient Flow Independent Component Analysis in Micropower VLSI

15 0.061898485 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

16 0.061437476 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models

17 0.060958024 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior

18 0.060834408 203 nips-2005-Visual Encoding with Jittering Eyes

19 0.058408685 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

20 0.055678371 45 nips-2005-Conditional Visual Tracking in Kernel Space

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.209), (1, -0.072), (2, -0.014), (3, 0.245), (4, -0.047), (5, 0.135), (6, -0.052), (7, -0.075), (8, -0.128), (9, -0.053), (10, 0.081), (11, 0.062), (12, -0.075), (13, 0.139), (14, -0.078), (15, -0.076), (16, -0.015), (17, -0.006), (18, 0.088), (19, -0.023), (20, -0.075), (21, -0.134), (22, -0.069), (23, -0.05), (24, 0.035), (25, -0.019), (26, 0.014), (27, 0.041), (28, 0.051), (29, 0.044), (30, 0.036), (31, 0.1), (32, 0.078), (33, -0.012), (34, 0.014), (35, -0.076), (36, -0.033), (37, -0.013), (38, 0.055), (39, 0.036), (40, 0.044), (41, -0.117), (42, -0.027), (43, -0.054), (44, -0.055), (45, -0.013), (46, -0.047), (47, 0.029), (48, -0.033), (49, -0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94311154 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

2 0.85984856 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

3 0.67703176 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

Author: Michael Shettel, Shaun Vecera, Michael C. Mozer

4 0.65435791 5 nips-2005-A Computational Model of Eye Movements during Object Class Detection

Author: Wei Zhang, Hyejin Yang, Dimitris Samaras, Gregory J. Zelinsky

5 0.58666146 149 nips-2005-Optimal cue selection strategy

Author: Vidhya Navalpakkam, Laurent Itti

6 0.55131465 203 nips-2005-Visual Encoding with Jittering Eyes

7 0.53806573 7 nips-2005-A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects

8 0.51360136 34 nips-2005-Bayesian Surprise Attracts Human Attention

9 0.50351328 176 nips-2005-Silicon growth cones map silicon retina

10 0.44715643 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?

11 0.39290559 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

12 0.38420975 93 nips-2005-Ideal Observers for Detecting Motion: Correspondence Noise

13 0.375826 110 nips-2005-Learning Depth from Single Monocular Images

14 0.36842555 141 nips-2005-Norepinephrine and Neural Interrupts

15 0.36380568 143 nips-2005-Off-Road Obstacle Avoidance through End-to-End Learning

16 0.33629724 158 nips-2005-Products of ``Edge-perts

17 0.33210754 35 nips-2005-Bayesian model learning in human visual perception

18 0.32528877 109 nips-2005-Learning Cue-Invariant Visual Responses

19 0.31914252 170 nips-2005-Scaling Laws in Natural Scenes and the Inference of 3D Shape

20 0.31799752 23 nips-2005-An Application of Markov Random Fields to Range Sensing

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.024), (10, 0.04), (11, 0.012), (25, 0.286), (27, 0.054), (31, 0.042), (34, 0.079), (39, 0.052), (41, 0.029), (50, 0.013), (55, 0.03), (60, 0.014), (62, 0.01), (65, 0.011), (69, 0.069), (73, 0.03), (88, 0.073), (91, 0.053)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87151033 4 nips-2005-A Bayesian Spatial Scan Statistic

Author: Daniel B. Neill, Andrew W. Moore, Gregory F. Cooper

Abstract: We propose a new Bayesian method for spatial cluster detection, the “Bayesian spatial scan statistic,” and compare this method to the standard (frequentist) scan statistic approach. We demonstrate that the Bayesian statistic has several advantages over the frequentist approach, including increased power to detect clusters and (since randomization testing is unnecessary) much faster runtime. We evaluate the Bayesian and frequentist methods on the task of prospective disease surveillance: detecting spatial clusters of disease cases resulting from emerging disease outbreaks. We demonstrate that our Bayesian methods are successful in rapidly detecting outbreaks while keeping number of false positives low.

same-paper 2 0.80867863 169 nips-2005-Saliency Based on Information Maximization

Author: Neil Bruce, John Tsotsos

3 0.60109419 195 nips-2005-Transfer learning for text classification

Author: Chuong B. Do, Andrew Y. Ng

Abstract: Linear text classification algorithms work by computing an inner product between a test document vector and a parameter vector. In many such algorithms, including naive Bayes and most TFIDF variants, the parameters are determined by some simple, closed-form function of training set statistics; we call this mapping from statistics to parameters the parameter function. Much research in text classification over the last few decades has consisted of manual efforts to identify better parameter functions. In this paper, we propose an algorithm for automatically learning this function from related classification problems. The parameter function found by our algorithm then defines a new learning algorithm for text classification, which we can apply to novel classification tasks. We find that our learned classifier outperforms existing methods on a variety of multiclass text classification tasks.

4 0.51687968 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

5 0.47797298 34 nips-2005-Bayesian Surprise Attracts Human Attention

Author: Laurent Itti, Pierre F. Baldi

Abstract: The concept of surprise is central to sensory processing, adaptation, learning, and attention. Yet, no widely-accepted mathematical theory currently exists to quantitatively characterize surprise elicited by a stimulus or event, for observers that range from single neurons to complex natural or engineered systems. We describe a formal Bayesian deﬁnition of surprise that is the only consistent formulation under minimal axiomatic assumptions. Surprise quantiﬁes how data affects a natural or artiﬁcial observer, by measuring the difference between posterior and prior beliefs of the observer. Using this framework we measure the extent to which humans direct their gaze towards surprising items while watching television and video games. We ﬁnd that subjects are strongly attracted towards surprising locations, with 72% of all human gaze shifts directed towards locations more surprising than the average, a ﬁgure which rises to 84% when considering only gaze targets simultaneously selected by all subjects. The resulting theory of surprise is applicable across different spatio-temporal scales, modalities, and levels of abstraction. Life is full of surprises, ranging from a great christmas gift or a new magic trick, to wardrobe malfunctions, reckless drivers, terrorist attacks, and tsunami waves. Key to survival is our ability to rapidly attend to, identify, and learn from surprising events, to decide on present and future courses of action [1]. Yet, little theoretical and computational understanding exists of the very essence of surprise, as evidenced by the absence from our everyday vocabulary of a quantitative unit of surprise: Qualities such as the “wow factor” have remained vague and elusive to mathematical analysis. Informal correlates of surprise exist at nearly all stages of neural processing. In sensory neuroscience, it has been suggested that only the unexpected at one stage is transmitted to the next stage [2]. 
Hence, sensory cortex may have evolved to adapt to, to predict, and to quiet down the expected statistical regularities of the world [3, 4, 5, 6], focusing instead on events that are unpredictable or surprising. Electrophysiological evidence for this early sensory emphasis onto surprising stimuli exists from studies of adaptation in visual [7, 8, 4, 9], olfactory [10, 11], and auditory cortices [12], subcortical structures like the LGN [13], and even retinal ganglion cells [14, 15] and cochlear hair cells [16]: neural response greatly attenuates with repeated or prolonged exposure to an initially novel stimulus. Surprise and novelty are also central to learning and memory formation [1], to the point that surprise is believed to be a necessary trigger for associative learning [17, 18], as supported by mounting evidence for a role of the hippocampus as a novelty detector [19, 20, 21]. Finally, seeking novelty is a well-identiﬁed human character trait, with possible association with the dopamine D4 receptor gene [22, 23, 24]. In the Bayesian framework, we develop the only consistent theory of surprise, in terms of the difference between the posterior and prior distributions of beliefs of an observer over the available class of models or hypotheses about the world. We show that this deﬁnition derived from ﬁrst principles presents key advantages over more ad-hoc formulations, typically relying on detecting outlier stimuli. Armed with this new framework, we provide direct experimental evidence that surprise best characterizes what attracts human gaze in large amounts of natural video stimuli. We here extend a recent pilot study [25], adding more comprehensive theory, large-scale human data collection, and additional analysis. 1 Theory Bayesian Deﬁnition of Surprise. We propose that surprise is a general concept, which can be derived from ﬁrst principles and formalized across spatio-temporal scales, sensory modalities, and, more generally, data types and data sources. 
Two elements are essential for a principled definition of surprise. First, surprise can exist only in the presence of uncertainty, which can arise from intrinsic stochasticity, missing information, or limited computing resources. A world that is purely deterministic and predictable in real-time for a given observer contains no surprises. Second, surprise can only be defined in a relative, subjective manner and is related to the expectations of the observer, be it a single synapse, neuronal circuit, organism, or computer device. The same data may carry different amounts of surprise for different observers, or even for the same observer taken at different times.

In probability and decision theory it can be shown that the only consistent and optimal way for modeling and reasoning about uncertainty is provided by the Bayesian theory of probability [26, 27, 28]. Furthermore, in the Bayesian framework, probabilities correspond to subjective degrees of belief in hypotheses or models, which are updated, as data is acquired, using Bayes' theorem as the fundamental tool for transforming prior belief distributions into posterior belief distributions. Therefore, within the same optimal framework, the only consistent definition of surprise must involve: (1) probabilistic concepts to cope with uncertainty; and (2) prior and posterior distributions to capture subjective expectations.

Consistent with this Bayesian approach, the background information of an observer is captured by his/her/its prior probability distribution $\{P(M)\}_{M \in \mathcal{M}}$ over the hypotheses or models $M$ in a model space $\mathcal{M}$. Given this prior distribution of beliefs, the fundamental effect of a new data observation $D$ on the observer is to change the prior distribution $\{P(M)\}_{M \in \mathcal{M}}$ into the posterior distribution $\{P(M|D)\}_{M \in \mathcal{M}}$ via Bayes' theorem, whereby

$$\forall M \in \mathcal{M}, \quad P(M|D) = \frac{P(D|M)}{P(D)} \, P(M). \qquad (1)$$

In this framework, the new data observation $D$ carries no surprise if it leaves the observer's beliefs unaffected, that is, if the posterior is identical to the prior; conversely, $D$ is surprising if the posterior distribution resulting from observing $D$ significantly differs from the prior distribution. Therefore we formally measure the surprise elicited by data as some distance measure between the posterior and prior distributions. This is best done using the relative entropy or Kullback-Leibler (KL) divergence [29]. Thus, surprise is defined by the average of the log-odd ratio

$$S(D, \mathcal{M}) = KL\big(P(M|D), P(M)\big) = \int_{\mathcal{M}} P(M|D) \log \frac{P(M|D)}{P(M)} \, dM \qquad (2)$$

taken with respect to the posterior distribution over the model class $\mathcal{M}$. Note that KL is not symmetric but has well-known theoretical advantages, including invariance with respect to reparameterizations. A unit of surprise, a "wow", may then be defined for a single model $M$ as the amount of surprise corresponding to a two-fold variation between $P(M|D)$ and $P(M)$, i.e., as $\log P(M|D)/P(M)$ (with log taken in base 2), with the total number of wows experienced for all models obtained through the integration in eq. 2.

Figure 1: Computing surprise in early sensory neurons. (a) Prior data observations, tuning preferences, and top-down influences contribute to shaping a set of "prior beliefs" a neuron may have over a class of internal models or hypotheses about the world. For instance, $\mathcal{M}$ may be a set of Poisson processes parameterized by the rate $\lambda$, with $\{P(M)\}_{M \in \mathcal{M}} = \{P(\lambda)\}_{\lambda \in \mathbb{R}^{+*}}$ the prior distribution of beliefs about which Poisson models well describe the world as sensed by the neuron. New data $D$ updates the prior into the posterior using Bayes' theorem. Surprise quantifies the difference between the posterior and prior distributions over the model class $\mathcal{M}$. The remaining panels detail how surprise differs from conventional model fitting and outlier-based novelty. (b) In standard iterative Bayesian model fitting, at every iteration $N$, incoming data $D_N$ is used to update the prior $\{P(M|D_1, D_2, ..., D_{N-1})\}_{M \in \mathcal{M}}$ into the posterior $\{P(M|D_1, D_2, ..., D_N)\}_{M \in \mathcal{M}}$. Freezing this learning at a given iteration, one then picks the currently best model, usually using either a maximum likelihood criterion or a maximum a posteriori one (yielding $M_{MAP}$ shown). (c) This best model is used for a number of tasks at the current iteration, including outlier-based novelty detection. New data is then considered novel at that instant if it has low likelihood for the best model (e.g., $D_N^b$ is more novel than $D_N^a$). This focus on the single best model presents obvious limitations, especially in situations where other models are nearly as good (e.g., $M^*$ in panel (b) is entirely ignored during standard novelty computation). One palliative solution is to consider mixture models, or simply $P(D)$, but this just amounts to shifting the problem into a different model class. (d) Surprise directly addresses this problem by simultaneously considering all models and by measuring how data changes the observer's distribution of beliefs from $\{P(M|D_1, D_2, ..., D_{N-1})\}_{M \in \mathcal{M}}$ to $\{P(M|D_1, D_2, ..., D_N)\}_{M \in \mathcal{M}}$ over the entire model class $\mathcal{M}$ (orange shaded area).

Surprise and outlier detection. Outlier detection based on the likelihood $P(D|M_{best})$ of $D$ given a single best model $M_{best}$ is at best an approximation to surprise and, in some cases, is misleading. Consider, for instance, a case where $D$ has very small probability both for a model or hypothesis $M$ and for a single alternative hypothesis $\bar{M}$. Although $D$ is a strong outlier, it carries very little information regarding whether $M$ or $\bar{M}$ is the better model, and therefore very little surprise. Thus an outlier detection method would strongly focus attentional resources onto $D$, although $D$ is a false positive, in the sense that it carries no useful information for discriminating between the two alternative hypotheses $M$ and $\bar{M}$.
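The contrast between outlier detection and surprise can be checked numerically on a finite model class, where the integral in eq. 2 becomes a sum. The sketch below is only illustrative: the function names `bayes_update` and `surprise_wows`, the two-model class, and all likelihood values are assumptions chosen for the example, not quantities taken from the experiments.

```python
import math

def bayes_update(prior, likelihoods):
    """Return P(M|D) for each model from P(M) and P(D|M), as in eq. 1."""
    evidence = sum(p * l for p, l in zip(prior, likelihoods))  # P(D)
    return [p * l / evidence for p, l in zip(prior, likelihoods)]

def surprise_wows(prior, post):
    """KL(posterior, prior) in base 2: eq. 2 with a sum replacing the integral."""
    return sum(q * math.log2(q / p) for q, p in zip(post, prior) if q > 0)

# Hypothetical two-model class: a model M and a single alternative.
prior = [0.5, 0.5]

# Case 1: D is a strong outlier (tiny likelihood) under both models alike,
# so the posterior equals the prior and the surprise is zero.
post_outlier = bayes_update(prior, [1e-6, 1e-6])
s_outlier = surprise_wows(prior, post_outlier)

# Case 2: D discriminates between the two models, so beliefs shift.
post_inform = bayes_update(prior, [0.2, 0.01])
s_inform = surprise_wows(prior, post_inform)

# Iterative updating: each posterior serves as the prior for the next
# observation. Repeating the same observation yields shrinking surprise.
beliefs = prior
trace = []
for _ in range(5):
    new_beliefs = bayes_update(beliefs, [0.2, 0.01])
    trace.append(surprise_wows(beliefs, new_beliefs))
    beliefs = new_beliefs

print(s_outlier, s_inform, trace)
```

Under these assumed numbers the outlier case produces no surprise even though its likelihood is tiny, while the discriminating observation produces a clearly positive value that decays when the same observation is repeated.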
Figure 1 further illustrates this disconnect between outlier detection and surprise.

2 Human experiments

To test the surprise hypothesis (that surprise attracts human attention and gaze in natural scenes) we recorded eye movements from eight naïve observers (three females and five males, ages 23-32, normal or corrected-to-normal vision). Each watched a subset from 50 videoclips totaling over 25 minutes of playtime (46,489 video frames, 640 × 480, 60.27 Hz, mean screen luminance 30 cd/m², room 4 cd/m², viewing distance 80 cm, field of view 28° × 21°). Clips comprised outdoors daytime and nighttime scenes of crowded environments, video games, and television broadcast including news, sports, and commercials. Right-eye position was tracked with a 240 Hz video-based device (ISCAN RK-464), with methods as previously described [30]. Two hundred calibrated eye movement traces (10,192 saccades) were analyzed, corresponding to four distinct observers for each of the 50 clips. Figure 2 shows sample scanpaths for one videoclip.

To characterize image regions selected by participants, we process videoclips through computational metrics that output a topographic dynamic master response map, assigning in real-time a response value to every input location. A good master map would highlight, more than expected by chance, locations gazed to by observers. To score each metric we hence sample, at the onset of every human saccade, master map activity around the saccade's future endpoint, and around a uniformly random endpoint (random sampling was repeated 100 times to evaluate variability). We quantify differences between histograms of master map samples collected from human and random saccades using again the Kullback-Leibler (KL) distance: metrics which better predict human scanpaths exhibit higher distances from random as, typically, observers non-uniformly gaze towards a minority of regions with highest metric responses while avoiding a majority of regions with low metric responses. This approach presents several advantages over simpler scoring schemes [31, 32], including agnosticity to putative mechanisms for generating saccades and the fact that applying any continuous nonlinearity to master map values would not affect scoring.

Figure 2: (a) Sample eye movement traces from four observers (squares denote saccade endpoints). (b) Our data exhibits high inter-individual overlap, shown here with the locations where one human saccade endpoint was nearby (≈ 5°) one (white squares), two (cyan squares), or all three (black squares) other humans. (c) A metric where the master map was created from the three eye movement traces other than that being tested yields an upper-bound KL score, computed by comparing the histograms of metric values at human (narrow blue bars) and random (wider green bars) saccade targets. Indeed, this metric's map was very sparse (many random saccades landing on locations with near-zero response), yet humans preferentially saccaded towards the three active hotspots corresponding to the eye positions of three other humans (many human saccades landing on locations with near-unity responses).

Experimental results. We test six computational metrics, encompassing and extending the state-of-the-art found in previous studies. The first three quantify static image properties (local intensity variance in 16 × 16 image patches [31]; local oriented edge density as measured with Gabor filters [33]; and local Shannon entropy in 16 × 16 image patches [34]). The remaining three metrics are more sensitive to dynamic events (local motion [33]; outlier-based saliency [33]; and surprise [25]). For all metrics, we find that humans are significantly attracted by image regions with higher metric responses. However, the static metrics typically respond vigorously at numerous visual locations (Figure 3), hence they are poorly specific and yield relatively low KL scores between humans and random.
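The scoring procedure described above, which compares histograms of master map values sampled at human and at random saccade endpoints, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `normalized_histogram` and `kl_score`, the normalization of map values to [0, 1], the choice of 10 bins, the direction of the KL computation (human relative to random), the smoothing constant `eps`, and the toy samples are all hypothetical.

```python
import math

def normalized_histogram(values, n_bins=10):
    """Fraction of samples per bin, for values assumed to lie in [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        k = min(int(v * n_bins), n_bins - 1)  # clamp v == 1.0 into the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def kl_score(human_samples, random_samples, n_bins=10, eps=1e-9):
    """KL divergence (base 2) of the human histogram from the random one.

    eps is an assumed smoothing constant that avoids division by zero in
    bins that no random saccade happened to hit.
    """
    h = normalized_histogram(human_samples, n_bins)
    r = normalized_histogram(random_samples, n_bins)
    return sum(hk * math.log2((hk + eps) / (rk + eps))
               for hk, rk in zip(h, r) if hk > 0)

# Toy data: random endpoints sample the map evenly, while human endpoints
# concentrate on the highest master map values.
random_samples = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
human_samples = [0.85, 0.9, 0.92, 0.95]

score = kl_score(human_samples, random_samples)        # clearly above zero
baseline = kl_score(random_samples, random_samples)    # identical histograms

print(score, baseline)
```

A score near zero means the human and random histograms coincide, so the map predicts human saccades no better than chance; larger scores mean humans land disproportionately on particular ranges of map values.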
The metrics sensitive to motion, outliers, and surprising events, in comparison, yield sparser maps and higher KL scores. The surprise metric of interest here quantifies low-level surprise in image patches over space and time, and at this point does not account for high-level or cognitive beliefs of our human observers. Rather, it assumes a family of simple models for image patches, each processed through 72 early feature detectors sensitive to color, orientation, motion, etc., and computes surprise from shifts in the distribution of beliefs about which models better describe the patches (see [25] and [35] for details). We find that the surprise metric significantly outperforms all other computational metrics ($p < 10^{-100}$ or better on t-tests for equality of KL scores), scoring nearly 20% better than the second-best metric (saliency) and 60% better than the best static metric (entropy). Surprising stimuli often substantially differ from simple feature outliers; for example, a continually blinking light on a static background elicits sustained flicker due to its locally outlier temporal dynamics but is only surprising for a moment. Similarly, a shower of randomly-colored pixels continually excites all low-level feature detectors but rapidly becomes unsurprising.

Strongest attractors of human attention. Clearly, in our and previous eye-tracking experiments, in some situations potentially interesting targets were more numerous than in others. With many possible targets, different observers may orient towards different locations, making it more difficult for a single metric to accurately predict all observers. Hence we consider (Figure 4) subsets of human saccades where at least two, three, or all four observers simultaneously agreed on a gaze target. Observers could have agreed based on bottom-up factors (e.g., only one location had interesting visual appearance at that time), top-down factors (e.g., only one object was of current cognitive interest), or both (e.g., a single cognitively interesting object was present which also had distinctive appearance). Irrespective of the cause for agreement, it indicates consolidated belief that a location was attractive. While the KL scores of all metrics improved when progressively focusing onto only those locations, dynamic metrics improved more steeply, indicating that stimuli which more reliably attracted all observers carried more motion, saliency, and surprise. Surprise remained significantly the best metric to characterize these agreed-upon attractors of human gaze ($p < 10^{-100}$ or better on t-tests for equality of KL scores).

Overall, surprise explained the greatest fraction of human saccades, indicating that humans are significantly attracted towards surprising locations in video displays. Over 72% of all human saccades were targeted to locations predicted to be more surprising than on average. When only considering saccades where two, three, or four observers agreed on a common gaze target, this figure rose to 76%, 80%, and 84%, respectively.

Figure 3: (a) Sample video frames, with corresponding human saccades and predictions from the entropy, surprise, and human-derived metrics. Entropy maps, like intensity variance and orientation maps, exhibited many locations with high responses, hence had low specificity and were poorly discriminative. In contrast, motion, saliency, and surprise maps were much sparser and more specific, with surprise significantly more often on target. For three example frames (first column), saccades from one subject are shown (arrows) with corresponding apertures over which master map activity at the saccade endpoint was sampled (circles). (b) KL scores for these metrics indicate significantly different performance levels, and a strict ranking of variance < orientation < entropy < motion < saliency < surprise < human-derived. KL scores were computed by comparing the number of human saccades landing onto each given range of master map values (narrow blue bars) to the number of random saccades hitting the same range (wider green bars). A score of zero would indicate equality between the human and random histograms, i.e., humans did not tend to hit various master map values any differently from expected by chance, or, the master map could not predict human saccades better than random saccades. Among the six computational metrics tested in total, surprise performed best, in that surprising locations were relatively few yet reliably gazed to by humans.

Figure 4: KL scores when considering only saccades where at least one (all 10,192 saccades), two (7,948 saccades), three (5,565 saccades), or all four (2,951 saccades) humans agreed on a common gaze location, for the static (a) and dynamic metrics (b). Static metrics improved substantially when progressively focusing onto saccades with stronger inter-observer agreement (average slope 0.56 ± 0.37 percent KL score units per 1,000 pruned saccades). Hence, when humans agreed on a location, they also tended to be more reliably predicted by the metrics. Furthermore, dynamic metrics improved 4.5 times more steeply (slope 2.44 ± 0.37), suggesting a stronger role of dynamic events in attracting human attention. Surprising events were significantly the strongest (t-tests for equality of KL scores between surprise and other metrics, $p < 10^{-100}$).

3 Discussion

While previous research has shown with either static scenes or dynamic synthetic stimuli that humans preferentially fixate regions of high entropy [34], contrast [31], saliency [32], flicker [36], or motion [37], our data provides direct experimental evidence that humans fixate surprising locations even more reliably.
These conclusions were made possible by developing new tools to quantify what attracts human gaze over space and time in dynamic natural scenes. Surprise explained best where humans look when considering all saccades, and even more so when restricting the analysis to only those saccades for which human observers tended to agree. Surprise hence represents an inexpensive, easily computable approximation to human attentional allocation. In the absence of quantitative tools to measure surprise, most experimental and modeling work to date has adopted the approximation that novel events are surprising, and has focused on experimental scenarios which are simple enough to ensure an overlap between informal notions of novelty and surprise: for example, a stimulus is novel during testing if it has not been seen during training [9]. Our deﬁnition opens new avenues for more sophisticated experiments, where surprise elicited by different stimuli can be precisely compared and calibrated, yielding predictions at the single-unit as well as behavioral levels. The deﬁnition of surprise — as the distance between the posterior and prior distributions of beliefs over models — is entirely general and readily applicable to the analysis of auditory, olfactory, gustatory, or somatosensory data. While here we have focused on behavior rather than detailed biophysical implementation, it is worth noting that detecting surprise in neural spike trains does not require semantic understanding of the data carried by the spike trains, and thus could provide guiding signals during self-organization and development of sensory areas. At higher processing levels, top-down cues and task demands are known to combine with stimulus novelty in capturing attention and triggering learning [1, 38], ideas which may now be formalized and quantiﬁed in terms of priors, posteriors, and surprise. Surprise, indeed, inherently depends on uncertainty and on prior beliefs. 
Hence surprise theory can further be tested and utilized in experiments where the prior is biased, for example by top-down instructions or prior exposures to stimuli [38]. In addition, simple surprise-based behavioral measures such as the eye-tracking one used here may prove useful for early diagnosis of human conditions including autism and attention-deficit hyperactivity disorder, as well as for quantitative comparison between humans and animals which may have lower or different priors, including monkeys, frogs, and flies. Beyond sensory biology, computable surprise could guide the development of data mining and compression systems (giving more bits to surprising regions of interest), to find surprising agents in crowds, surprising sentences in books or speeches, surprising sequences in genomes, surprising medical symptoms, surprising odors in airport luggage racks, surprising documents on the world-wide-web, or to design surprising advertisements.

Acknowledgments: Supported by HFSP, NSF and NGA (L.I.), NIH and NSF (P.B.). We thank UCI's Institute for Genomics and Bioinformatics and USC's Center for High Performance Computing and Communications (www.usc.edu/hpcc) for access to their computing clusters.

References

[1] Ranganath, C. & Rainer, G. Nat Rev Neurosci 4, 193–202 (2003).
[2] Rao, R. P. & Ballard, D. H. Nat Neurosci 2, 79–87 (1999).
[3] Olshausen, B. A. & Field, D. J. Nature 381, 607–609 (1996).
[4] Müller, J. R., Metha, A. B., Krauskopf, J. & Lennie, P. Science 285, 1405–1408 (1999).
[5] Dragoi, V., Sharma, J., Miller, E. K. & Sur, M. Nat Neurosci 5, 883–891 (2002).
[6] David, S. V., Vinje, W. E. & Gallant, J. L. J Neurosci 24, 6991–7006 (2004).
[7] Maffei, L., Fiorentini, A. & Bisti, S. Science 182, 1036–1038 (1973).
[8] Movshon, J. A. & Lennie, P. Nature 278, 850–852 (1979).
[9] Fecteau, J. H. & Munoz, D. P. Nat Rev Neurosci 4, 435–443 (2003).
[10] Kurahashi, T. & Menini, A. Nature 385, 725–729 (1997).
[11] Bradley, J., Bonigk, W., Yau, K. W. & Frings, S. Nat Neurosci 7, 705–710 (2004).
[12] Ulanovsky, N., Las, L. & Nelken, I. Nat Neurosci 6, 391–398 (2003).
[13] Solomon, S. G., Peirce, J. W., Dhruv, N. T. & Lennie, P. Neuron 42, 155–162 (2004).
[14] Smirnakis, S. M., Berry, M. J. et al. Nature 386, 69–73 (1997).
[15] Brown, S. P. & Masland, R. H. Nat Neurosci 4, 44–51 (2001).
[16] Kennedy, H. J., Evans, M. G. et al. Nat Neurosci 6, 832–836 (2003).
[17] Schultz, W. & Dickinson, A. Annu Rev Neurosci 23, 473–500 (2000).
[18] Fletcher, P. C., Anderson, J. M., Shanks, D. R. et al. Nat Neurosci 4, 1043–1048 (2001).
[19] Knight, R. Nature 383, 256–259 (1996).
[20] Stern, C. E., Corkin, S., Gonzalez, R. G. et al. Proc Natl Acad Sci U S A 93, 8660–8665 (1996).
[21] Li, S., Cullen, W. K., Anwyl, R. & Rowan, M. J. Nat Neurosci 6, 526–531 (2003).
[22] Ebstein, R. P., Novick, O., Umansky, R. et al. Nat Genet 12, 78–80 (1996).
[23] Benjamin, J., Li, L. et al. Nat Genet 12, 81–84 (1996).
[24] Lusher, J. M., Chandler, C. & Ball, D. Mol Psychiatry 6, 497–499 (2001).
[25] Itti, L. & Baldi, P. In Proc. IEEE CVPR. San Diego, CA (2005 in press).
[26] Cox, R. T. Am. J. Phys. 14, 1–13 (1964).
[27] Savage, L. J. The foundations of statistics (Dover, New York, 1972). (First Edition in 1954).
[28] Jaynes, E. T. Probability Theory. The Logic of Science (Cambridge University Press, 2003).
[29] Kullback, S. Information Theory and Statistics (Wiley, New York, 1959).
[30] Itti, L. Visual Cognition (2005 in press).
[31] Reinagel, P. & Zador, A. M. Network 10, 341–350 (1999).
[32] Parkhurst, D., Law, K. & Niebur, E. Vision Res 42, 107–123 (2002).
[33] Itti, L. & Koch, C. Nat Rev Neurosci 2, 194–203 (2001).
[34] Privitera, C. M. & Stark, L. W. IEEE Trans Patt Anal Mach Intell 22, 970–982 (2000).
[35] All source code for all metrics is freely available at http://iLab.usc.edu/toolkit/.
[36] Theeuwes, J. Percept Psychophys 57, 637–644 (1995).
[37] Abrams, R. A. & Christ, S. E. Psychol Sci 14, 427–432 (2003).
[38] Wolfe, J. M. & Horowitz, T. S. Nat Rev Neurosci 5, 495–501 (2004).

6 0.46899319 30 nips-2005-Assessing Approximations for Gaussian Process Classification

7 0.46617708 8 nips-2005-A Criterion for the Convergence of Learning with Spike Timing Dependent Plasticity

8 0.46476009 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach

9 0.46373686 72 nips-2005-Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation

10 0.46342748 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

11 0.46322224 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

12 0.46304601 90 nips-2005-Hot Coupling: A Particle Approach to Inference and Normalization on Pairwise Undirected Graphs

13 0.45901579 194 nips-2005-Top-Down Control of Visual Attention: A Rational Account

14 0.45791024 144 nips-2005-Off-policy Learning with Options and Recognizers

15 0.45684654 45 nips-2005-Conditional Visual Tracking in Kernel Space

16 0.45672935 23 nips-2005-An Application of Markov Random Fields to Range Sensing

17 0.45541373 136 nips-2005-Noise and the two-thirds power Law

18 0.45420229 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

19 0.45356834 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations

20 0.45260066 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification