nips nips2004 nips2004-21 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements. 1 Introduction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. 
Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. 1.1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization. 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one. 2.1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded. 
During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and were given auditory feedback.
[Figure 1 panel labels: fixate, initiate trial; maintain fixation (300ms); release fixation, view object freely (1200ms); Which shape is a match?]
Figure 1. Temporal layout of a trial during the learning phase (left). Discrimination of learned shape from a highly similar one (right).
2.2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10]. 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. 3.1 Representing information The information in a silhouette clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted $e_1, e_2, \ldots,$
$e_N$, where $N$ is the total number of edgelets along the boundary. Each edgelet $e_i$ is defined as a triple $e_i = (x_i, y_i, z_i)$ where $(x_i, y_i)$ is the 2D location of the edgelet and $z_i$ is the orientation of the tangent to the boundary contour at that point. $z_i$ can assume any of $Q$ possible values $1, 2, \ldots, Q$, representing a discretization of $Q$ possible orientations ranging from 0 to $\pi$, and we have chosen $Q = 8$ in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. 3.2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values $e_1, e_2, \ldots, e_N$. Although our model assumes complete knowledge of the number $N$ and locations $(x_i, y_i)$ of the edgelets, it does not have direct access to the orientations $z_i$ (footnote 1). Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the periphery.
Footnote 1: Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next.
Distance to the fovea is measured as eccentricity $E$, the visual angle between any point and the fovea. If $\vec{x} = (x, y)$ is the location of a point in an image and $\vec{f} = (f_x, f_y)$ is the fixation (i.e. foveal) location in the image then the eccentricity is $E = \|\vec{x} - \vec{f}\|$, measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as
$$r(E) = F_{PH}\,(E + E_2)$$
where $r(E)$ is an effective radius over which the visual system spatially pools information and $F_{PH} = 0.1$ and $E_2 = 0.8$ [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet $e_i$ we define the histogram of all edgelet orientations $e_j$ within radius $r_i = r(E)$ of $e_i$, where $E$ is the eccentricity of $\vec{x}_i = (x_i, y_i)$ relative to the current fixation $\vec{f}$, i.e. $E = \|\vec{x}_i - \vec{f}\|$. To define the histogram more precisely we will introduce the neighborhood set $N_i$ of all indices $j$ corresponding to edgelets within radius $r_i$ of $e_i$:
$$N_i = \{\text{all } j \text{ s.t. } \|\vec{x}_i - \vec{x}_j\| \le r_i\},$$
with number of neighborhood edgelets $|N_i|$. The (normalized) histogram centered at edgelet $e_i$ is then defined as
$$h_{iz} = \frac{1}{|N_i|} \sum_{j \in N_i} \delta_{z, z_j},$$
which is the proportion of edgelet orientations that assume value $z$ in the (eccentricity-dependent) neighborhood of edgelet $e_i$ (footnote 2).
Figure 2. Relation between eccentricity $E$ and radius $r(E)$ of the neighborhood (disk) which defines the local orientation histogram ($h_{iz}$). Left and right panels show two fixations for the same object.
Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them by $k = 1, 2, \ldots, K$ (for $K$ total fixations). The $k$-th fixation location is denoted by $\vec{f}^{(k)} = (f_x^k, f_y^k)$. The quantities $r_i$, $N_i$ and $h_{iz}$ depend on fixation location and so to make this dependence explicit we will augment them with superscripts as $r_i^{(k)}$, $N_i^{(k)}$, and $h_{iz}^{(k)}$.
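The pooling radius and the local orientation histogram for a single fixation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pooling_radius` and `orientation_histograms` are hypothetical, edgelet locations are assumed to be given in degrees of visual angle, and orientations are labeled 0..Q-1 rather than 1..Q for array indexing.

```python
import numpy as np

F_PH = 0.1  # scaling constant quoted in the text
E_2 = 0.8   # eccentricity constant quoted in the text
Q = 8       # number of discrete orientations used in the text


def pooling_radius(eccentricity):
    """Effective pooling radius r(E) = F_PH * (E + E_2)."""
    return F_PH * (eccentricity + E_2)


def orientation_histograms(xy, z, fixation, q=Q):
    """Return an (N, q) array whose row i is the normalized histogram h_iz.

    xy: (N, 2) edgelet locations (assumed to be in degrees of visual angle).
    z: (N,) integer orientation labels in 0..q-1 (zero-based by assumption).
    fixation: (2,) fixation location in the same units as xy.
    """
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=int)
    # Eccentricity of each edgelet relative to the fixation, then its radius r_i.
    ecc = np.linalg.norm(xy - np.asarray(fixation, dtype=float), axis=1)
    radii = pooling_radius(ecc)
    # Pairwise distances: dist[i, j] = ||x_i - x_j||.
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # Row i marks the neighborhood set N_i (each edgelet is its own neighbor).
    neighbors = dist <= radii[:, None]
    one_hot = np.eye(q)[z]                       # (N, q) orientation indicators
    counts = neighbors.astype(float) @ one_hot   # orientation counts within N_i
    return counts / counts.sum(axis=1, keepdims=True)
```

Computing all pairwise distances costs memory quadratic in the number of edgelets, which is acceptable for a single silhouette contour; a spatial index would be a natural substitute for larger inputs.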
Footnote 2: $\delta_{x,y}$ is the Kronecker delta function, defined to equal 1 if $x = y$ and 0 if $x \neq y$.
Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data:
$$P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}),$$
where $\{h_{iz}^{(k)}\}$ represents all histogram components $z$ at every edgelet $e_i$ for fixation $\vec{f}^{(k)}$. This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets:
$$P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}) = \prod_{i=1}^{N} g_i(z_i)$$
where $g_i(z_i)$ is the marginal distribution of orientation $z_i$. Determining these marginal distributions is still difficult even with the factorization assumption, so we will make an additional approximation:
$$g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{i z_i}^{(k)},$$
where $Z_i$ is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over $z$, with independent likelihoods for each fixation $k$. While the approximation has some undesirable properties (such as making the marginal distribution $g_i(z_i)$ more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations. 3.3 Selecting the next fixation Given the past $K$ fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize
$$H(\vec{f}^{(K+1)}) = \mathrm{entropy}\left[P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\})\right],$$
where the entropy of a distribution $P(x)$ is defined as $-\sum_x P(x) \log P(x)$.
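The product-of-histograms approximation and the entropy-minimizing selection rule can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `combine_fixations`, `model_entropy`, `choose_next_fixation` and the callback `histogram_fn` are hypothetical, and the small constant `eps` is an addition made here so that the product does not vanish when two fixations assign zero weight to different orientations. Because the model factors over edgelets, its entropy is the sum of the per-edgelet marginal entropies.

```python
import numpy as np


def combine_fixations(histograms, eps=1e-12):
    """Marginals g_i(z), proportional to the product over fixations of h_iz^(k).

    histograms: array-like of shape (K, N, Q), one (N, Q) histogram per fixation.
    eps: smoothing constant (an assumption, not part of the original model).
    """
    h = np.asarray(histograms, dtype=float)
    log_g = np.log(h + eps).sum(axis=0)          # sum of log-likelihoods, (N, Q)
    log_g -= log_g.max(axis=1, keepdims=True)    # shift for numerical stability
    g = np.exp(log_g)
    return g / g.sum(axis=1, keepdims=True)      # division by Z_i


def model_entropy(marginals):
    """Entropy of the factored model: the sum of per-edgelet entropies."""
    g = np.asarray(marginals, dtype=float)
    safe = np.where(g > 0, g, 1.0)               # makes 0 * log(0) contribute 0
    return float(-(g * np.log(safe)).sum())


def choose_next_fixation(past_histograms, candidates, histogram_fn):
    """Return the candidate location with the lowest resulting model entropy.

    histogram_fn(candidate) -> (N, Q) array is a hypothetical callback that
    returns the histograms the model would observe when fixating `candidate`.
    """
    best, best_entropy = None, np.inf
    for candidate in candidates:
        stacked = list(past_histograms) + [histogram_fn(candidate)]
        entropy = model_entropy(combine_fixations(stacked))
        if entropy < best_entropy:
            best, best_entropy = candidate, entropy
    return best, best_entropy
```

The candidate list is passed in explicitly so that any sampling of the image can be used. Summing logarithms rather than multiplying probabilities is a numerical convenience and gives the same normalized marginals.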
In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which forms a regularly sampled grid across the image (footnote 3). We note that this selection rule makes decisions that depend, in general, on the full history of previous K fixations.
Footnote 3: This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy.
4 Results Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random.
Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row).
One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). 
If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right).
[Figure 4 plot: panels "Location Error" and "Sequence Error"; vertical axis: Visual Angle (deg); bars R, S, I for each of three observers.]
Figure 4. Prediction error of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel shows the error in predicting fixation locations, ignoring sequence. The right panel shows the error when sequence is retained before mapping. Error bars are 95% confidence intervals.
The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2-4º and rarely exceed 15º [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5).
Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation.
Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation. This would create a discrepancy in the prediction of the eye movement sequence (Figure 6).
Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model. 
By the fourth fixation, the subject finally moves to the next most informative point.
5 Discussion Our model and the saliency model are using the same image information to determine fixation locations, thus it is not surprising that they are roughly similar in their performance of predicting human fixation locations. The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus. We have an implicit “inhibition-of-return” because there is little to be gained by revisiting a point. Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information. Acknowledgments Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461. References [1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press. [2] Yarbus (1967). Eye movements and vision. New York: Plenum Press. [3] Itti & Koch (2000). 
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506. [4] Kadir & Brady (2001). Scale, saliency and image description. International Journal of Computer Vision, 45(2), 83-105. [5] Parkhurst, Law & Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123. [6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306. [7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain. [8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311. [9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advances in Neural Information Processing Systems, 12, 834-840. [10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10 (4), 433-436. [11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234. [12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intel, 18(1), 1-14. [13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a. [14] Levi, Klein & Aitsebaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977. [15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469. [16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513. [17] Caspi, Beutter & Eckstein (2004). 
The time course of visual information accrual guiding eye movement decisions. Proceedings of the Nat’l Academy of Science, 101(35), 13086-90. [18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose a sequential information maximization model as a general strategy for programming eye movements. [sent-4, score-0.498]
2 The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. [sent-5, score-0.311]
3 From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. [sent-6, score-0.934]
4 By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. [sent-7, score-1.355]
5 Modeling additional biological constraints will improve the prediction of fixation sequences. [sent-8, score-0.752]
6 Our results suggest that information maximization is a useful principle for programming eye movements. [sent-9, score-0.439]
7 1 Introduction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. [sent-10, score-0.389]
8 To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. [sent-11, score-0.785]
9 Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. [sent-12, score-0.518]
10 One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. [sent-13, score-0.872]
11 Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. [sent-14, score-0.111]
12 Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. [sent-15, score-1.224]
13 Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. [sent-16, score-0.821]
14 Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. [sent-17, score-0.402]
15 1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. [sent-19, score-0.794]
16 The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. [sent-21, score-0.104]
17 The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. [sent-22, score-0.958]
18 A sequence of eye movements may then be understood within a framework of sequential information maximization. [sent-24, score-0.52]
19 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. [sent-25, score-0.636]
20 Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. [sent-26, score-0.435]
21 1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. [sent-29, score-0.552]
22 During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. [sent-32, score-0.146]
23 Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. [sent-33, score-0.674]
24 When the fixation marker disappeared, subjects were allowed to study the object for 1. [sent-34, score-0.801]
25 During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). [sent-36, score-0.244]
26 release fixation, view object freely (1200ms) maintain fixation (300ms) Which shape is a match? [sent-39, score-0.773]
27 2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. [sent-44, score-0.689]
28 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. [sent-49, score-0.374]
29 The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. [sent-50, score-0.205]
30 We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. [sent-51, score-0.958]
31 Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. [sent-52, score-1.055]
32 These points and their associated orientations are called edgelets, denoted e1, e2, . [sent-55, score-0.134]
33 eN, where N is the total number of edgelets along the boundary. [sent-58, score-0.115]
34 Each edgelet ei is defined as a triple ei=(xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. [sent-59, score-0.901]
35 zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π , and we have chosen Q=8 in our experiments. [sent-60, score-0.21]
36 The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. [sent-61, score-0.232]
37 2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values e1, e2, . [sent-63, score-0.47]
38 Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi. [sent-67, score-0.265]
39 1 Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). [sent-68, score-0.56]
40 These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. [sent-69, score-0.359]
41 We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. [sent-70, score-0.103]
42 It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next. [sent-72, score-0.397]
43 Distance to the fovea is measured as eccentricity E, the visual angle between any point and the fovea. [sent-74, score-0.23]
44 If x = ( x, y ) is the location of a point in an image r and f = ( f x , f y ) is the fixation (i. [sent-75, score-0.777]
45 foveal) location in the image then the r r eccentricity is E = x − f , measured in units of visual degrees. [sent-77, score-0.29]
46 The effective resolution of orientation discrimination falls with increasing eccentricity as r (E ) = FPH ( E + E 2 ) where r(E) is an effective radius over which the visual system spatially pools information and FPH =0. [sent-78, score-0.43]
47 Our model represents pooled information as a histogram of edge orientations within the effective radius. [sent-81, score-0.248]
48 For each edgelet ei we define the histogram of all edgelet r orientations ej within radius ri = r(E) of ei , where E is the eccentricity of xi = ( xi , yi ) r r r relative to the current fixation f , i. [sent-82, score-1.733]
49 To define the histogram more precisely we will introduce the neighborhood set Ni of all indices j corresponding to r r edgelets within radius ri of ei : N i = all j s. [sent-85, score-0.391]
50 xi − x j ≤ ri , with number of { } neighborhood edgelets |Ni|. [sent-87, score-0.203]
51 The (normalized) histogram centered at edgelet ei is then defined as hiz = 1 Ni ∑δ j∈N i z,z j , which is the proportion of edgelet orientations that assume value z in the (eccentricity-dependent) neighborhood of edgelet ei. [sent-88, score-1.294]
52 Relation between eccentricity E and radius r(E) of the neighborhood (disk) which defines the local orientation histogram (hiz ). [sent-90, score-0.297]
53 Left and right panels show two fixations for the same object. [sent-91, score-0.231]
54 To designate a sequence of multiple fixations we will index them by k=1, 2, …, K (for K total fixations). [sent-93, score-0.263]
55 The k th fixation location is denoted by f ( k ) = ( f xk , f yk ) . [sent-94, score-0.742]
56 The quantities ri , Ni and hiz depend on fixation location and so to make this dependence (k explicit we will augment them with superscripts as ri(k ) , N i(k ) , and hiz ) . [sent-95, score-1.127]
57 Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. [sent-97, score-0.426]
58 Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: (1) ( 2) (K ) ( , where {hizk ) } represents all histogram P(zi , z 2 , . [sent-98, score-0.302]
59 z N | {hiz }, {hiz },K, {hiz }) r components z at every edgelet ei for fixation f (k ) . [sent-101, score-0.992]
60 This approximation corresponds to treating hiz as a likelihood function over z, with independent likelihoods for each fixation k. [sent-108, score-0.847]
61 While the approximation has some undesirable properties (such as making the marginal distribution gi(zi) more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations. [sent-109, score-0.79]
62 3 Selecting the next fixation Given the past K fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. [sent-111, score-1.651]
63 In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which forms a regularly sampled grid across the image. [sent-116, score-0.116]
64 4 Results Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). [sent-118, score-0.931]
65 The information maximization model updates its prediction after each fixation. [sent-119, score-0.201]
66 An ideal sequence of fixations can be generated by both models. [sent-120, score-0.263]
67 The saliency model selects fixations in order of decreasing salience. [sent-121, score-0.452]
68 The information maximization model selects the maximally informative point after incorporating information from the previous fixations. [sent-122, score-0.264]
69 To provide an additional benchmark, we also implemented a 3 This rule evaluates the entropy resulting from every possible next fixation before making a decision. [sent-123, score-0.727]
70 Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. [sent-124, score-0.123]
71 Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row). [sent-127, score-0.72]
72 One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. [sent-129, score-0.961]
73 In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). [sent-130, score-0.46]
74 The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). [sent-132, score-0.473]
75 If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right). [sent-133, score-0.761]
76 Prediction error of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). [sent-135, score-0.402]
77 The left panel shows the error in predicting fixation locations, ignoring sequence. [sent-136, score-0.729]
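The two error measures described above (ignoring sequence versus retaining it) can be illustrated with a short sketch. The source does not spell out the exact matching procedure, so the definitions below are assumptions made for illustration: `location_error` averages the distance from each human fixation to the nearest model fixation, and `sequence_error` averages the distance between the k-th human fixation and the k-th model fixation. Both function names are hypothetical.

```python
import numpy as np

def location_error(human, model):
    """Mean distance from each human fixation to the nearest model fixation.

    Order is ignored (assumed reading of "ignoring sequence").
    human, model: sequences of (x, y) positions in degrees of visual angle.
    """
    human = np.asarray(human, dtype=float)
    model = np.asarray(model, dtype=float)
    # d[a, b] = distance between human fixation a and model fixation b
    d = np.linalg.norm(human[:, None, :] - model[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def sequence_error(human, model):
    """Mean distance between the k-th human and k-th model fixation.

    Order is retained (assumed reading of "retaining the sequence");
    the longer sequence is truncated to the length of the shorter one.
    """
    n = min(len(human), len(model))
    human = np.asarray(human, dtype=float)[:n]
    model = np.asarray(model, dtype=float)[:n]
    return float(np.linalg.norm(human - model, axis=1).mean())
```

Under these assumed definitions, a model can visit exactly the right places in the wrong order and still score a location error of zero while having a large sequence error, which is the kind of gap between the two panels of Figure 4 that the text describes.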
78 The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. [sent-139, score-0.725]
79 When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. [sent-141, score-0.165]
80 This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5). [sent-143, score-1.448]
81 Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation. [sent-146, score-0.386]
82 Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. [sent-147, score-0.112]
83 Several fixations may therefore be made before sampled information begins to influence target selection. [sent-148, score-0.277]
84 The information maximization model currently updates after each fixation. [sent-149, score-0.168]
85 This would create a discrepancy in the prediction of the eye movement sequence (Figure 6). [sent-150, score-0.427]
86 Three fixations are made to a location that is initially highly informative according to the information maximization model. [sent-152, score-0.496]
87 5 Discussion Our model and the saliency model are using the same image information to determine fixation locations, thus it is not surprising that they are roughly similar in their performance of predicting human fixation locations. [sent-154, score-1.72]
88 The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. [sent-155, score-0.467]
89 The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. [sent-156, score-0.301]
90 We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. [sent-157, score-0.16]
91 Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. [sent-160, score-0.166]
92 By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. [sent-161, score-0.525]
93 We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. [sent-162, score-0.506]
94 A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information. [sent-164, score-0.125]
95 A saliency-based search mechanism for overt and covert shifts of visual attention. [sent-174, score-0.188]
96 Modeling the role of salience in the allocation of overt visual attention. [sent-180, score-0.142]
97 Sequential information maximization can explain eye movements in an object learning task. [sent-206, score-0.617]
98 Most naturally occurring human saccades have magnitudes of 15 degrees or less. [sent-212, score-0.112]
99 Selective suppression of the magnocellular visual pathway during saccadic eye movements. [sent-215, score-0.441]
100 The time course of visual information accrual guiding eye movement decisions. [sent-218, score-0.491]
wordName wordTfidf (topN-words)
[('fixation', 0.674), ('eye', 0.297), ('edgelet', 0.25), ('fixations', 0.231), ('hiz', 0.173), ('saliency', 0.171), ('movements', 0.138), ('orientations', 0.134), ('edgelets', 0.115), ('visual', 0.111), ('observers', 0.107), ('maximization', 0.106), ('locations', 0.087), ('resolution', 0.083), ('orientation', 0.079), ('fixate', 0.077), ('eccentricity', 0.076), ('zi', 0.076), ('histogram', 0.072), ('ei', 0.068), ('location', 0.068), ('saccades', 0.067), ('defined', 0.066), ('movement', 0.065), ('salient', 0.061), ('subjects', 0.06), ('shape', 0.059), ('hizk', 0.058), ('predicting', 0.055), ('vision', 0.054), ('observer', 0.051), ('informative', 0.049), ('biological', 0.045), ('human', 0.045), ('fovea', 0.043), ('object', 0.04), ('radius', 0.039), ('ri', 0.039), ('indirect', 0.038), ('laura', 0.038), ('preeti', 0.038), ('renninger', 0.038), ('saccade', 0.038), ('volitional', 0.038), ('eyes', 0.037), ('ni', 0.036), ('uncertainty', 0.036), ('sequential', 0.035), ('stimulus', 0.035), ('image', 0.035), ('fph', 0.033), ('kr', 0.033), ('saccadic', 0.033), ('measurements', 0.033), ('prediction', 0.033), ('malik', 0.033), ('sequence', 0.032), ('row', 0.032), ('neighborhood', 0.031), ('overt', 0.031), ('silhouette', 0.031), ('silhouettes', 0.031), ('stark', 0.031), ('entropy', 0.029), ('coughlan', 0.028), ('influence', 0.028), ('superimposed', 0.028), ('greatly', 0.027), ('define', 0.027), ('lm', 0.027), ('marker', 0.027), ('selects', 0.026), ('mechanism', 0.025), ('chicago', 0.024), ('discrimination', 0.024), ('highly', 0.024), ('model', 0.024), ('rule', 0.024), ('maximally', 0.023), ('significantly', 0.023), ('yu', 0.023), ('tangent', 0.023), ('distant', 0.022), ('contour', 0.021), ('shifts', 0.021), ('knowledge', 0.02), ('shift', 0.02), ('gained', 0.02), ('coarse', 0.02), ('updates', 0.02), ('marginal', 0.019), ('move', 0.019), ('decisions', 0.018), ('gi', 0.018), ('xi', 0.018), ('information', 0.018), ('explain', 0.018), ('programming', 0.018), ('attention', 0.018), 
('asked', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 21 nips-2004-An Information Maximization Model of Eye Movements
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements. 1 Introduction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. 
Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. 1.1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization. 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one. 2.1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded. 
During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and were given auditory feedback. Figure 1. Temporal layout of a trial during the learning phase (left): fixate and initiate trial; maintain fixation (300ms); release fixation and view object freely (1200ms). Discrimination of learned shape from a highly similar one (right): "Which shape is a match?" 2.2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10]. 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. 3.1 Representing information The information in a silhouette clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted e1, e2, ... 
eN, where N is the total number of edgelets along the boundary. Each edgelet ei is defined as a triple ei=(xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π, and we have chosen Q=8 in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. 3.2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values e1, e2, ... eN. Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi (see footnote 1). Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the periphery. [Footnote 1: Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next.]
Distance to the fovea is measured as eccentricity E, the visual angle between any point and the fovea. If $\vec{x} = (x, y)$ is the location of a point in an image and $\vec{f} = (f_x, f_y)$ is the fixation (i.e. foveal) location in the image then the eccentricity is $E = \|\vec{x} - \vec{f}\|$, measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as $r(E) = FPH\,(E + E_2)$ where r(E) is an effective radius over which the visual system spatially pools information and FPH = 0.1 and $E_2$ = 0.8 [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet ei we define the histogram of all edgelet orientations ej within radius $r_i = r(E)$ of ei, where E is the eccentricity of $\vec{x}_i = (x_i, y_i)$ relative to the current fixation $\vec{f}$, i.e. $E = \|\vec{x}_i - \vec{f}\|$. To define the histogram more precisely we will introduce the neighborhood set $N_i$ of all indices j corresponding to edgelets within radius $r_i$ of ei: $N_i = \{\text{all } j \text{ s.t. } \|\vec{x}_i - \vec{x}_j\| \le r_i\}$, with number of neighborhood edgelets $|N_i|$. The (normalized) histogram centered at edgelet ei is then defined as $h_{iz} = \frac{1}{|N_i|} \sum_{j \in N_i} \delta_{z, z_j}$, which is the proportion of edgelet orientations that assume value z in the (eccentricity-dependent) neighborhood of edgelet ei (see footnote 2). Figure 2. Relation between eccentricity E and radius r(E) of the neighborhood (disk) which defines the local orientation histogram ($h_{iz}$). Left and right panels show two fixations for the same object. Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them by k = 1, 2, …, K (for K total fixations). The k-th fixation location is denoted by $\vec{f}^{(k)} = (f_x^{k}, f_y^{k})$. The quantities $r_i$, $N_i$ and $h_{iz}$ depend on fixation location and so to make this dependence explicit we will augment them with superscripts as $r_i^{(k)}$, $N_i^{(k)}$, and $h_{iz}^{(k)}$.
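The pooling radius and local orientation histogram defined above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `pooling_radius` and `local_histograms` are hypothetical, orientation labels are assumed to run from 0 to Q-1 (the text uses 1 to Q), and edgelet positions are assumed to be given in degrees of visual angle.

```python
import numpy as np

FPH = 0.1  # slope of the pooling radius (value quoted in the text)
E2 = 0.8   # eccentricity offset in degrees (value quoted in the text)
Q = 8      # number of orientation bins (value quoted in the text)

def pooling_radius(ecc):
    """Effective pooling radius r(E) = FPH * (E + E2), in degrees."""
    return FPH * (np.asarray(ecc, dtype=float) + E2)

def local_histograms(xy, z, fixation, q=Q):
    """Return an (N, q) array whose row i is the normalized histogram h_iz.

    xy:       (N, 2) edgelet locations in degrees
    z:        (N,) integer orientation labels in 0..q-1 (assumption of this sketch)
    fixation: (2,) current fixation location in degrees
    """
    xy = np.asarray(xy, dtype=float)
    z = np.asarray(z, dtype=int)
    # Eccentricity of each edgelet relative to the fixation
    ecc = np.linalg.norm(xy - np.asarray(fixation, dtype=float), axis=1)
    radius = pooling_radius(ecc)
    # Pairwise distances between edgelets
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # inside[i, j] is True when edgelet j lies in the neighborhood N_i
    inside = dist <= radius[:, None]
    onehot = np.eye(q)[z]                    # (N, q) indicator of each label
    counts = inside.astype(float) @ onehot   # orientation counts within N_i
    # Each edgelet belongs to its own neighborhood, so row sums are >= 1
    return counts / counts.sum(axis=1, keepdims=True)
```

Because the radius grows with eccentricity, edgelets far from the fixation receive histograms pooled over more neighbors, which is how the sketch mimics the loss of resolution in the periphery.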
[Footnote 2: $\delta_{x,y}$ is the Kronecker delta function, defined to equal 1 if x = y and 0 if x ≠ y.] Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: $P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\})$, where $\{h_{iz}^{(k)}\}$ represents all histogram components z at every edgelet ei for fixation $\vec{f}^{(k)}$. This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets: $P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}) = \prod_{i=1}^{N} g_i(z_i)$ where gi(zi) is the marginal distribution of orientation zi. Determining these marginal distributions is still difficult even with the factorization assumption, so we will make an additional approximation: $g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{i z_i}^{(k)}$, where Zi is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over z, with independent likelihoods for each fixation k. While the approximation has some undesirable properties (such as making the marginal distribution gi(zi) more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations. 3.3 Selecting the next fixation Given the past K fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize $H(\vec{f}^{(K+1)}) = \mathrm{entropy}[P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\})]$, where the entropy of a distribution P(x) is defined as $-\sum_x P(x) \log P(x)$. 
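As a rough illustration of the product approximation and the entropy criterion described above, the sketch below combines per-fixation histograms into marginals and picks the candidate fixation with the lowest resulting entropy. It is a sketch under stated assumptions rather than the authors' code: the names `combine`, `model_entropy` and `select_next_fixation` are hypothetical, `hist_fn` is an assumed callable that returns the (N, Q) histogram array for a given fixation, and the small constant `eps` is added here only to avoid taking the log of zero. Because the distribution is assumed to factor over edgelets, its entropy is the sum of the marginal entropies.

```python
import numpy as np

def combine(hists, eps=1e-12):
    """Marginals g_i(z) proportional to the product over fixations of h_iz^(k).

    hists: sequence of K arrays, each of shape (N, Q).
    eps is an assumption of this sketch; it guards against log(0).
    """
    log_g = np.sum(np.log(np.asarray(hists, dtype=float) + eps), axis=0)
    log_g -= log_g.max(axis=1, keepdims=True)   # numerical stability
    g = np.exp(log_g)
    return g / g.sum(axis=1, keepdims=True)     # division by Z_i

def model_entropy(g, eps=1e-12):
    """Entropy of the factored model: the sum of the marginal entropies."""
    return float(-np.sum(g * np.log(g + eps)))

def select_next_fixation(past_hists, candidates, hist_fn):
    """Return (best_candidate, entropy) over a list of candidate fixations.

    hist_fn(f) is a hypothetical callable giving the (N, Q) histograms
    that would be measured when fixating at f.
    """
    best, best_h = None, np.inf
    for f in candidates:
        g = combine(list(past_hists) + [hist_fn(f)])
        h = model_entropy(g)
        if h < best_h:
            best, best_h = f, h
    return best, best_h
```

In use, `candidates` would be the points of a regular grid over the image. Note that the search measures the actual histograms at every candidate, so it assumes access to the true orientations, which a biological or machine vision system would not have.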
In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which forms a regularly sampled grid across the image (see footnote 3). We note that this selection rule makes decisions that depend, in general, on the full history of previous K fixations. [Footnote 3: This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy.] 4 Results Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random. Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row). One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). 
If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right). [Figure 4 panels: Location Error (left) and Sequence Error (right); vertical axis: Visual Angle (deg); bars R, S, I for each observer.] Figure 4. Prediction error of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel shows the error in predicting fixation locations, ignoring sequence. The right panel shows the error when sequence is retained before mapping. Error bars are 95% confidence intervals. The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2-4º and rarely exceed 15º [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5). Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation. Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation. This would create a discrepancy in the prediction of the eye movement sequence (Figure 6). Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model. 
By the fourth fixation, the subject finally moves to the next most informative point. 5 Discussion Our model and the saliency model are using the same image information to determine fixation locations, thus it is not surprising that they are roughly similar in their performance of predicting human fixation locations. The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus. We have an implicit “inhibition-of-return” because there is little to be gained by revisiting a point. Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information. Acknowledgments Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461. References [1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press. [2] Yarbus (1967). Eye movements and vision. New York: Plenum Press. [3] Itti & Koch (2000). 
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506. [4] Kadir & Brady (2001). Scale, saliency and image description. International Journal of Computer Vision, 45(2), 83-105. [5] Parkhurst, Law & Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123. [6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306. [7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain. [8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311. [9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advances in Neural Information Processing Systems, 12, 834-840. [10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10(4), 433-436. [11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr. Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234. [12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intel, 18(1), 1-14. [13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a. [14] Levi, Klein & Aitsebaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977. [15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469. [16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513. [17] Caspi, Beutter & Eckstein (2004). 
The time course of visual information accrual guiding eye movement decisions. Proceedings of the Nat’l Academy of Science, 101(35), 13086-90. [18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
2 0.19601142 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms that are of low complexity, are scalable to large recognition problems, and are compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented.
3 0.079200521 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
Author: N. J. Hill, Thomas N. Lal, Karin Bierig, Niels Birbaumer, Bernhard Schölkopf
Abstract: Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a brain-computer interface that uses auditory stimuli. We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged event-related potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI.
4 0.073309094 88 nips-2004-Intrinsically Motivated Reinforcement Learning
Author: Nuttapong Chentanez, Andrew G. Barto, Satinder P. Singh
Abstract: Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy.
5 0.063298434 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function. Humans do not perceive visual motion veridically. Various psychophysical experiments have shown that the perceived speed of visual stimuli is affected by stimulus contrast, with low contrast stimuli being perceived to move slower than high contrast ones [1, 2]. Computational models have been suggested that can qualitatively explain these perceptual effects. Commonly, they assume the perception of visual motion to be optimal either within a deterministic framework with a regularization constraint that biases the solution toward zero motion [3, 4], or within a probabilistic framework of Bayesian estimation with a prior that favors slow velocities [5, 6]. The solutions resulting from these two frameworks are similar (and in some cases identical), but the probabilistic framework provides a more principled formulation of the problem in terms of meaningful probabilistic components. 
Specifically, Bayesian approaches rely on a likelihood function that expresses the relationship between the noisy measurements and the quantity to be estimated, and a prior distribution that expresses the probability of encountering any particular value of that quantity. A probabilistic model can also provide a richer description, by defining a full probability density over the set of possible “percepts”, rather than just a single value. Numerous analyses of psychophysical experiments have made use of such distributions within the framework of signal detection theory in order to model perceptual behavior [7]. Previous work has shown that an ideal Bayesian observer model based on Gaussian forms for both likelihood and prior is sufficient to capture the basic qualitative features of global translational motion perception [5, 6]. [Figure 1 panels a (high contrast) and b (low contrast): probability density versus visual speed, showing likelihood, prior and posterior, with shift µ and posterior mean $\hat{v}$.] Figure 1: Bayesian model of visual speed perception. a) For a high contrast stimulus, the likelihood has a narrow width (a high signal-to-noise ratio) and the prior induces only a small shift µ of the mean $\hat{v}$ of the posterior. b) For a low contrast stimulus, the measurement is noisy, leading to a wider likelihood. The shift µ is much larger and the perceived speed lower than under condition (a). But the behavior of the resulting model deviates systematically from human perceptual data, most importantly with regard to trial-to-trial variability and the precise form of interaction between contrast and perceived speed. A recent article achieved better fits for the model under the assumption that human contrast perception saturates [8]. In order to advance the theory of Bayesian perception and provide significant constraints on models of neural implementation, it seems essential to constrain quantitatively both the likelihood function and the prior probability distribution. 
In previous work, the proposed likelihood functions were derived from the brightness constancy constraint [5, 6] or other generative principles [9]. Also, previous approaches defined the prior distribution based on general assumptions and computational convenience, typically choosing a Gaussian with zero mean, although a Laplacian prior has also been suggested [4]. In this paper, we develop a more general form of Bayesian model for speed perception that can account for trial-to-trial variability. We use psychophysical speed discrimination data in order to constrain both the likelihood and the prior function.

1 Probabilistic Model of Visual Speed Perception

1.1 Ideal Bayesian Observer

Assume that an observer wants to obtain an estimate for a variable v based on a measurement m that she/he performs. A Bayesian observer “knows” that the measurement device is not ideal and that the measurement m is therefore affected by noise. Hence, this observer combines the information gained by the measurement m with a priori knowledge about v. Doing so (and assuming that the prior knowledge is valid), the observer will, on average, perform better in estimating v than by just trusting the measurement m. According to Bayes’ rule

$$p(v|m) = \frac{1}{\alpha}\, p(m|v)\, p(v) \qquad (1)$$

the probability of perceiving v given m (posterior) is the product of the likelihood of v for a particular measurement m and the a priori knowledge about the estimated variable v (prior). α is a normalization constant independent of v that ensures that the posterior is a proper probability distribution.

Figure 2: 2AFC speed discrimination experiment (panel b plots $P(\hat{v}_2 > \hat{v}_1)$ against $v_2$, with $v_{match}$ marked at $P_{cum} = 0.5$ and $v_{thresh}$ at $P_{cum} = 0.875$). a) Two patches of drifting gratings were displayed simultaneously (motion without movement). The subject was asked to fixate the center cross and decide after the presentation which of the two gratings was moving faster. b) A typical psychometric curve obtained under such paradigm.
The dots represent the empirical probability that the subject perceived stimulus2 moving faster than stimulus1. The speed of stimulus1 was fixed while v2 was varied. The point of subjective equality, $v_{match}$, is the value of v2 for which Pcum = 0.5. The threshold velocity $v_{thresh}$ is the velocity for which Pcum = 0.875.

It is important to note that the measurement m is an internal variable of the observer and is not necessarily represented in the same space as v. The likelihood embodies both the mapping from v to m and the noise in this mapping. For now, we assume that there is a monotonic function $f(v): v \to v_m$ that maps v into the same space as m (m-space). Doing so allows us to treat m and $v_m$ analytically in the same space. We will later propose a suitable form of the mapping function f(v). An ideal Bayesian observer selects the estimate that minimizes the expected loss, given the posterior and a loss function. We assume a least-squares loss function. Then, the optimal estimate $\hat{v}$ is the mean of the posterior in Equation (1). It is easy to see why this model of a Bayesian observer is consistent with the fact that perceived speed decreases with contrast. The width of the likelihood varies inversely with the accuracy of the measurements performed by the observer, which presumably decreases with decreasing contrast due to a decreasing signal-to-noise ratio. As illustrated in Figure 1, the shift in perceived speed towards slow velocities grows with the width of the likelihood, and thus a Bayesian model can qualitatively explain the psychophysical results [1].

1.2 Two Alternative Forced Choice Experiment

We would like to examine perceived speeds under a wide range of conditions in order to constrain a Bayesian model. Unfortunately, perceived speed is an internal variable, and it is not obvious how to design an experiment that would allow subjects to express it directly (footnote 1).
Perceived speed can only be accessed indirectly by asking the subject to compare the speed of two stimuli. For a given trial, an ideal Bayesian observer in such a two-alternative forced choice (2AFC) experimental paradigm simply decides on the basis of the two trial estimates $\hat{v}_1$ (stimulus1) and $\hat{v}_2$ (stimulus2) which stimulus moves faster. Each estimate $\hat{v}$ is based on a particular measurement m. For a given stimulus with speed v, an ideal Bayesian observer will produce a distribution of estimates $p(\hat{v}|v)$ because m is noisy. Over trials, the observer's behavior can be described by classical signal detection theory based on the distributions of the estimates; hence, for example, the probability of perceiving stimulus2 moving faster than stimulus1 is given as the cumulative probability

$$P_{cum}(\hat{v}_2 > \hat{v}_1) = \int_0^{\infty} p(\hat{v}_2|v_2) \int_0^{\hat{v}_2} p(\hat{v}_1|v_1)\, d\hat{v}_1\, d\hat{v}_2 \qquad (2)$$

Pcum describes the full psychometric curve. Figure 2b illustrates the measured psychometric curve and its fit from such an experimental situation.

Footnote 1: Although see [10] for an example of determining and even changing the prior of a Bayesian model for a sensorimotor task, where the estimates are more directly accessible.

2 Experimental Methods

We measured matching speeds (Pcum = 0.5) and thresholds (Pcum = 0.875) in a 2AFC speed discrimination task. Subjects were presented simultaneously with two circular patches of horizontally drifting sine-wave gratings for the duration of one second (Figure 2a). Patches were 3 deg in diameter, and were displayed at 6 deg eccentricity to either side of a fixation cross. The stimuli had an identical spatial frequency of 1.5 cycles/deg.
One stimulus was considered to be the reference stimulus, having one of two different contrast values (c1 = [0.075 0.5]) and one of five different speed values (u1 = [1 2 4 8 12] deg/sec), while the second stimulus (test) had one of five different contrast values (c2 = [0.05 0.1 0.2 0.4 0.8]) and a varying speed that was determined by an interleaved staircase procedure. For each condition there were 96 trials. Conditions were randomly interleaved, including a random choice of stimulus identity (test vs. reference) and motion direction (right vs. left). Subjects were asked to fixate during stimulus presentation and select the faster moving stimulus. The threshold experiment differed only in that auditory feedback was given to indicate the correctness of the subject's decision. This did not change the outcome of the experiment but significantly increased the quality of the data and thus reduced the number of trials needed.

3 Analysis

With the data from the speed discrimination experiments we could in principle apply a parametric fit using Equation (2) to derive the prior and the likelihood, but the optimization is difficult, and the fit might not be well constrained given the amount of data we have obtained. The problem becomes much more tractable given the following weak assumptions:
• We consider the prior to be relatively smooth.
• We assume that the measurement m is corrupted by additive Gaussian noise with a variance whose dependence on stimulus speed and contrast is separable.
• We assume that there is a mapping function $f(v): v \to v_m$ that maps v into the space of m (m-space). In that space, the likelihood is convolutional, i.e., the noise in the measurement directly defines the width of the likelihood.
These assumptions allow us to relate the psychophysical data to our probabilistic model in a simple way. The following analysis is in the m-space. The point of subjective equality (Pcum = 0.5) is defined as where the expected values of the speed estimates are equal.
We write

$$E\langle \hat{v}_{m,1} \rangle = E\langle \hat{v}_{m,2} \rangle, \qquad v_{m,1} - E\langle \mu_1 \rangle = v_{m,2} - E\langle \mu_2 \rangle \qquad (3)$$

where $E\langle \mu \rangle$ is the expected shift of the perceived speed compared to the veridical speed. For the discrimination threshold experiment, the above assumptions imply that the variance $\mathrm{var}\langle \hat{v}_m \rangle$ of the speed estimates $\hat{v}_m$ is equal for both stimuli. Then, (2) predicts that the discrimination threshold is proportional to the standard deviation, thus

$$v_{m,2} - v_{m,1} = \gamma \sqrt{\mathrm{var}\langle \hat{v}_m \rangle} \qquad (4)$$

where γ is a constant that depends on the threshold criterion Pcum and the exact shape of $p(\hat{v}_m|v_m)$.

Figure 3: Piece-wise approximation. We perform a parametric fit by assuming the prior to be piece-wise linear and the likelihood to be LogNormal (Gaussian in the m-space).

3.1 Estimating the prior and likelihood

In order to extract the prior and the likelihood of our model from the data, we have to find a generic local form of the prior and the likelihood and relate them to the mean and the variance of the speed estimates. As illustrated in Figure 3, we assume that the likelihood is Gaussian with a standard deviation $\sigma(c, v_m)$. Furthermore, the prior is assumed to be well approximated by a first-order Taylor series expansion over the velocity ranges covered by the likelihood. We parameterize this linear expansion of the prior as $p(v_m) = a v_m + b$. We can now derive a posterior for this local approximation of likelihood and prior and then define the perceived speed shift µ(m). The posterior can be written as

$$p(v_m|m) = \frac{1}{\alpha}\, p(m|v_m)\, p(v_m) = \frac{1}{\alpha} \left[ \exp\!\left( -\frac{v_m^2}{2\sigma(c, v_m)^2} \right) (a v_m + b) \right] \qquad (5)$$

where α is the normalization constant

$$\alpha = \int_{-\infty}^{\infty} p(m|v_m)\, p(v_m)\, dv_m = b \sqrt{2\pi\, \sigma(c, v_m)^2} \qquad (6)$$

We can compute µ(m) as the first order moment of the posterior for a given m.
Exploiting the symmetries around the origin, we find

$$\mu(m) = \int_{-\infty}^{\infty} v_m\, p(v_m|m)\, dv_m \equiv \frac{a(m)}{b(m)}\, \sigma(c, v_m)^2 \qquad (7)$$

The expected value of µ(m) is equal to the value of µ at the expected value of the measurement m (which is the stimulus velocity $v_m$), thus

$$E\langle \mu \rangle = \mu(m)\big|_{m=v_m} = \frac{a(v_m)}{b(v_m)}\, \sigma(c, v_m)^2 \qquad (8)$$

Similarly, we derive $\mathrm{var}\langle \hat{v}_m \rangle$. Because the estimator is deterministic, the variance of the estimate only depends on the variance of the measurement m. For a given stimulus, the variance of the estimate can be well approximated by

$$\mathrm{var}\langle \hat{v}_m \rangle = \mathrm{var}\langle m \rangle \left( \frac{\partial \hat{v}_m(m)}{\partial m}\bigg|_{m=v_m} \right)^2 = \mathrm{var}\langle m \rangle \left( 1 - \frac{\partial \mu(m)}{\partial m}\bigg|_{m=v_m} \right)^2 \approx \mathrm{var}\langle m \rangle \qquad (9)$$

Under the assumption of a locally smooth prior, the perceived velocity shift remains locally constant. The variance of the perceived speed $\hat{v}_m$ becomes equal to the variance of the measurement m, which is the variance of the likelihood (in the m-space), thus

$$\mathrm{var}\langle \hat{v}_m \rangle = \sigma(c, v_m)^2 \qquad (10)$$

With (3) and (4), the above derivations provide a simple relation between the psychophysical data and the local parameters of the likelihood and the prior.

3.2 Choosing a Logarithmic speed representation

We now want to choose the appropriate mapping function f(v) that maps v to the m-space. We define the m-space as the space in which the likelihood is Gaussian with a speed-independent width. We have shown that the discrimination threshold is proportional to the width of the likelihood (4), (10). Also, we know from the psychophysics literature that visual speed discrimination approximately follows a Weber-Fechner law [11, 12], that is, the discrimination threshold increases roughly in proportion to speed, and so would the width of the likelihood. A logarithmic speed representation would be compatible with the data and our choice of the likelihood. Hence, we transform the linear speed-domain v into a normalized logarithmic domain according to

$$v_m = f(v) = \ln\!\left( \frac{v + v_0}{v_0} \right) \qquad (11)$$

where $v_0$ is a small normalization constant.
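To make the logarithmic mapping of Equation (11) concrete, the following minimal sketch implements it together with its inverse and shows how a fixed step in the m-space translates into a speed increment in the linear domain. The value v0 = 0.3 deg/sec and the function names are illustrative assumptions; the text only states that v0 is a small normalization constant.

```python
import math

V0 = 0.3  # assumed value in deg/sec, for illustration only


def to_m_space(v, v0=V0):
    """Equation (11): map linear speed v to the normalized log domain."""
    return math.log((v + v0) / v0)


def from_m_space(vm, v0=V0):
    """Inverse of Equation (11): map an m-space value back to linear speed."""
    return v0 * (math.exp(vm) - 1.0)


def linear_threshold(v, step_m, v0=V0):
    """Speed increment in deg/sec that corresponds to a fixed step in m-space."""
    return from_m_space(to_m_space(v, v0) + step_m, v0) - v


if __name__ == "__main__":
    for v in (1.0, 2.0, 4.0, 8.0, 12.0):  # reference speeds used in the experiment
        rel = linear_threshold(v, 0.2) / v
        print(f"v = {v:5.1f} deg/sec  relative threshold = {rel:.3f}")
```

Under this mapping a constant step s in the m-space corresponds to a relative threshold of (1 + v0/v)(exp(s) - 1): nearly constant at high speeds, as a Weber-Fechner law would require, and larger at speeds comparable to v0.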
The normalization is chosen to account for the expected deviation from equal-variance behavior at the low end. Surprisingly, it has been found that neurons in the Middle Temporal area (Area MT) of macaque monkeys have speed-tuning curves that are very well approximated by Gaussians of constant width in the above normalized logarithmic space [13]. These neurons are known to play a central role in the representation of motion. It seems natural to assume that they are strongly involved in tasks such as the psychophysical experiments we performed.

4 Results

Figure 4 shows the contrast dependent shift of speed perception and the speed discrimination threshold data for two subjects. Data points connected with a dashed line represent the relative matching speed (v2/v1) for a particular contrast value c2 of the test stimulus as a function of the speed of the reference stimulus. Error bars are the empirical standard deviation of fits to bootstrapped samples of the data. Clearly, low contrast stimuli are perceived to move slower. The effect, however, varies across the tested speed range and tends to become smaller for higher speeds. The relative discrimination thresholds for two different contrasts as a function of speed show that the Weber-Fechner law holds only approximately. The data are in good agreement with other data from the psychophysics literature [1, 11, 8]. For each subject, data from both experiments were used to compute a parametric least-squares fit according to (3), (4), (7), and (10). In order to test the assumption of a LogNormal likelihood we allowed the standard deviation to be dependent on contrast and speed, thus $\sigma(c, v_m) = g(c)\, h(v_m)$. We split the speed range into six bins (subject2: five) and parameterized $h(v_m)$ and the ratio a/b accordingly. Similarly, we parameterized g(c) for the seven contrast values. The resulting fits are superimposed as bold lines in Figure 4.
Figure 5 shows the fitted parametric values for g(c) and h(v) (plotted in the linear domain), and the reconstructed prior distribution p(v) transformed back to the linear domain. The approximately constant values for h(v) provide evidence that a LogNormal distribution is an appropriate functional description of the likelihood. The resulting values for g(c) suggest that the likelihood width decays roughly exponentially with contrast, with strong saturation at higher contrasts.

Figure 4: Speed discrimination data for two subjects (rows: subject 1, subject 2; panel a plots normalized matching speed against the speed of the reference stimulus [deg/sec], panel b plots the relative discrimination threshold against stimulus speed [deg/sec]). a) The relative matching speed of a test stimulus with different contrast levels (c2 = [0.05 0.1 0.2 0.4 0.8]) to achieve subjective equality with a reference stimulus (two different contrast values c1). b) The relative discrimination threshold for two stimuli with equal contrast (c1,2 = [0.075 0.5]).

Figure 5: Reconstructed prior distribution and parameters of the likelihood function (rows: subject 1, subject 2; panels: reconstructed prior p(v) [unnormalized] against speed [deg/sec] with Gaussian and power-law fits, fitted exponents n = −1.41 and n = −1.35; g(c) against contrast; h(v) against speed [deg/sec]). The reconstructed priors for both subjects show much heavier tails than a Gaussian (dashed fit), approximately following a power-law function with exponent n ≈ −1.4 (bold line).

5 Conclusions

We have proposed a probabilistic framework based on a Bayesian ideal observer and standard signal detection theory.
We have derived a likelihood function and prior distribution for the estimator, with a fairly conservative set of assumptions, constrained by psychophysical measurements of speed discrimination and matching. The width of the resulting likelihood is nearly constant in the logarithmic speed domain, and decreases approximately exponentially with contrast. The prior expresses a preference for slower speeds, and approximately follows a power-law distribution, thus has much heavier tails than a Gaussian. It would be interesting to compare the prior distributions derived here with measured true distributions of local image velocities that impinge on the retina. Although a number of authors have measured the spatio-temporal structure of natural images [e.g., 14], it is clearly difficult to extract from these measurements the true prior distribution because of the feedback loop formed through movements of the body, head and eyes.

Acknowledgments

The authors thank all subjects for their participation in the psychophysical experiments.

References
[1] P. Thompson. Perceived rate of movement depends on contrast. Vision Research, 22:377–380, 1982.
[2] L.S. Stone and P. Thompson. Human speed perception is contrast dependent. Vision Research, 32(8):1535–1549, 1992.
[3] A. Yuille and N. Grzywacz. A computational theory for the perception of coherent visual motion. Nature, 333(5):71–74, May 1988.
[4] Alan Stocker. Constraint Optimization Networks for Visual Motion Perception - Analysis and Synthesis. PhD thesis, Dept. of Physics, Swiss Federal Institute of Technology, Zürich, Switzerland, March 2002.
[5] Eero Simoncelli. Distributed analysis and representation of visual motion. PhD thesis, MIT, Dept. of Electrical Engineering, Cambridge, MA, 1993.
[6] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002.
[7] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.
[8] F. Hürlimann, D. Kiper, and M. Carandini. Testing the Bayesian model of perceived speed. Vision Research, 2002.
[9] Y. Weiss and D.J. Fleet. Probabilistic Models of the Brain, chapter Velocity Likelihoods in Biological and Machine Vision, pages 77–96. Bradford, 2002.
[10] K. Koerding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[11] Leslie Welch. The perception of moving plaids reveals two motion-processing stages. Nature, 337:734–736, 1989.
[12] S. McKee, G. Silvermann, and K. Nakayama. Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Research, 26(4):609–619, 1986.
[13] C.H. Anderson, H. Nover, and G.C. DeAngelis. Modeling the velocity tuning of macaque MT neurons. Journal of Vision/VSS abstract, 2003.
[14] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.
6 0.062439408 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
7 0.056334719 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity
8 0.056225702 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
9 0.056073401 99 nips-2004-Learning Hyper-Features for Visual Identification
10 0.053697318 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
11 0.050901923 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
12 0.048690289 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
13 0.048200332 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
14 0.045790177 155 nips-2004-Responding to Modalities with Different Latencies
15 0.042340439 83 nips-2004-Incremental Learning for Visual Tracking
16 0.041515365 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
17 0.036999419 44 nips-2004-Conditional Random Fields for Object Recognition
18 0.036543839 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
19 0.035079505 179 nips-2004-Surface Reconstruction using Learned Shape Models
20 0.034366254 40 nips-2004-Common-Frame Model for Object Recognition
topicId topicWeight
[(0, -0.115), (1, -0.023), (2, -0.006), (3, -0.096), (4, 0.054), (5, 0.062), (6, 0.104), (7, -0.077), (8, -0.02), (9, 0.016), (10, -0.013), (11, 0.06), (12, 0.015), (13, -0.074), (14, 0.131), (15, -0.004), (16, -0.051), (17, -0.055), (18, -0.055), (19, 0.048), (20, 0.07), (21, -0.009), (22, 0.027), (23, -0.11), (24, 0.068), (25, -0.042), (26, 0.109), (27, 0.011), (28, -0.082), (29, 0.09), (30, -0.126), (31, -0.075), (32, -0.004), (33, -0.054), (34, 0.181), (35, -0.035), (36, 0.027), (37, -0.092), (38, 0.147), (39, 0.138), (40, 0.267), (41, -0.158), (42, -0.131), (43, -0.282), (44, -0.162), (45, -0.06), (46, -0.016), (47, 0.035), (48, -0.005), (49, 0.072)]
simIndex simValue paperId paperTitle
same-paper 1 0.95377189 21 nips-2004-An Information Maximization Model of Eye Movements
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements.

1 Introduction

Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7].
Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8].

1.1 Our Approach

We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization.

2 Human eye movements

We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one.

2.1 Stimuli and Design

We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape, which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded.
During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and were given auditory feedback.

Figure 1. Temporal layout of a trial during the learning phase (left): fixate and initiate trial; maintain fixation (300ms); release fixation and view object freely (1200ms). Discrimination of learned shape from a highly similar one (right): “Which shape is a match?”

2.2 Apparatus

Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10].

3 Model

We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13].

3.1 Representing information

The information in a silhouette clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted e1, e2, ...
eN, where N is the total number of edgelets along the boundary. Each edgelet ei is defined as a triple ei = (xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π, and we have chosen Q = 8 in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations.

3.2 Updating knowledge

The visual information is based on indirect measurements of the true edgelet values e1, e2, ... eN. Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi (footnote 1). Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the periphery.

Footnote 1: Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next.
Distance to the fovea is measured as eccentricity E, the visual angle between any point and the fovea. If $\vec{x} = (x, y)$ is the location of a point in an image and $\vec{f} = (f_x, f_y)$ is the fixation (i.e. foveal) location in the image, then the eccentricity is $E = \|\vec{x} - \vec{f}\|$, measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as

$$r(E) = F_{PH}\,(E + E_2)$$

where r(E) is an effective radius over which the visual system spatially pools information, and $F_{PH} = 0.1$ and $E_2 = 0.8$ [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet $e_i$ we define the histogram of the orientations of all edgelets $e_j$ within radius $r_i = r(E)$ of $e_i$, where E is the eccentricity of $\vec{x}_i = (x_i, y_i)$ relative to the current fixation $\vec{f}$, i.e. $E = \|\vec{x}_i - \vec{f}\|$. To define the histogram more precisely we will introduce the neighborhood set $N_i$ of all indices j corresponding to edgelets within radius $r_i$ of $e_i$:

$$N_i = \{\, \text{all } j \ \text{s.t.}\ \|\vec{x}_i - \vec{x}_j\| \le r_i \,\}$$

with number of neighborhood edgelets $|N_i|$. The (normalized) histogram centered at edgelet $e_i$ is then defined as

$$h_{iz} = \frac{1}{|N_i|} \sum_{j \in N_i} \delta_{z, z_j}$$

which is the proportion of edgelet orientations that assume value z in the (eccentricity-dependent) neighborhood of edgelet $e_i$ (footnote 2).

Figure 2. Relation between eccentricity E and radius r(E) of the neighborhood (disk) which defines the local orientation histogram ($h_{iz}$). Left and right panels show two fixations for the same object.

Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them by k = 1, 2, …, K (for K total fixations). The k-th fixation location is denoted by $\vec{f}^{(k)} = (f_x^{k}, f_y^{k})$. The quantities $r_i$, $N_i$ and $h_{iz}$ depend on fixation location, and so to make this dependence explicit we will augment them with superscripts as $r_i^{(k)}$, $N_i^{(k)}$, and $h_{iz}^{(k)}$.
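The pooling radius and the orientation histogram for a single fixation can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes edgelets are given as (x, y, z) tuples with coordinates in degrees of visual angle and orientation labels z in 1..Q, and the function names are hypothetical.

```python
import math

F_PH = 0.1  # pooling constant quoted in the text [14]
E_2 = 0.8   # eccentricity constant quoted in the text [14]


def pooling_radius(eccentricity):
    """r(E) = F_PH * (E + E_2): radius over which orientations are pooled."""
    return F_PH * (eccentricity + E_2)


def orientation_histogram(edgelets, i, fixation, q=8):
    """Normalized histogram h_iz for edgelet i, given one fixation location."""
    xi, yi, _ = edgelets[i]
    ecc = math.hypot(xi - fixation[0], yi - fixation[1])
    r_i = pooling_radius(ecc)
    counts = [0] * q
    for xj, yj, zj in edgelets:
        # Edgelet i lies in its own neighborhood (distance 0), so the
        # neighborhood is never empty.
        if math.hypot(xi - xj, yi - yj) <= r_i:
            counts[zj - 1] += 1  # orientation labels run from 1 to Q
    n_i = sum(counts)
    return [c / n_i for c in counts]


if __name__ == "__main__":
    shape = [(0.0, 0.0, 1), (0.05, 0.0, 2), (5.0, 0.0, 3)]
    print(orientation_histogram(shape, 0, fixation=(0.0, 0.0)))
    print(orientation_histogram(shape, 0, fixation=(60.0, 0.0)))
```

Fixating far from an edgelet enlarges its neighborhood, so the histogram mixes in more distant orientations and becomes less informative about that edgelet's own orientation.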
Footnote 2: $\delta_{x,y}$ is the Kronecker delta function, defined to equal 1 if x = y and 0 if x ≠ y.

Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data:

$$P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\})$$

where $\{h_{iz}^{(k)}\}$ represents all histogram components z at every edgelet $e_i$ for fixation $\vec{f}^{(k)}$. This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets:

$$P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}) = \prod_{i=1}^{N} g_i(z_i)$$

where $g_i(z_i)$ is the marginal distribution of orientation $z_i$. Determining these marginal distributions is still difficult even with the factorization assumption, so we will make an additional approximation:

$$g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{i z_i}^{(k)}$$

where $Z_i$ is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over z, with independent likelihoods for each fixation k. While the approximation has some undesirable properties (such as making the marginal distribution $g_i(z_i)$ more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations.

3.3 Selecting the next fixation

Given the past K fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize

$$H(\vec{f}^{(K+1)}) = \mathrm{entropy}\!\left[ P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\}) \right]$$

where the entropy of a distribution P(x) is defined as $-\sum_x P(x) \log P(x)$.
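The product-of-histograms approximation and the entropy criterion can be sketched in a few lines. Because the approximate distribution factorizes over edgelets, its joint entropy is the sum of the per-edgelet entropies, which is what the sketch computes. The function names are hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def combine_fixations(histograms):
    """Marginal g_i(z) for one edgelet: normalized product over fixations
    of that edgelet's histograms h_iz^(k)."""
    product = [1.0] * len(histograms[0])
    for h in histograms:
        product = [p * hz for p, hz in zip(product, h)]
    z_i = sum(product)  # normalization factor Z_i
    if z_i == 0.0:
        raise ValueError("histograms share no orientation with nonzero mass")
    return [p / z_i for p in product]


def entropy(dist):
    """-sum_x P(x) log P(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def model_entropy(histograms_per_edgelet):
    """Entropy of the factorized model: sum of the marginal entropies.
    histograms_per_edgelet[i] lists edgelet i's histograms, one per fixation."""
    return sum(entropy(combine_fixations(hs)) for hs in histograms_per_edgelet)


if __name__ == "__main__":
    one_fixation = [[0.5, 0.5, 0.0, 0.0]]
    two_fixations = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.25, 0.25, 0.0]]
    print(entropy(combine_fixations(one_fixation)))
    print(entropy(combine_fixations(two_fixations)))
```

A candidate next fixation would be scored by appending the histograms it would produce to each edgelet's list and evaluating `model_entropy`; the candidate with the lowest score is selected.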
In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which form a regularly sampled grid across the image (footnote 3). We note that this selection rule makes decisions that depend, in general, on the full history of previous K fixations.

Footnote 3: This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy.

4 Results

Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random.

Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row).

One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr).
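The order-free comparison described above can be sketched as a nearest-prediction distance. This is only one plausible reading of the analysis: the text does not say whether the mapping is one-to-one, and this sketch lets several human fixations map onto the same predicted location. The function name and the use of mean distance are assumptions.

```python
import math


def location_error(human_fixations, predicted_fixations):
    """Mean distance (in the units of the coordinates, e.g. degrees of visual
    angle) from each human fixation to the closest predicted fixation,
    ignoring the order in which fixations were made."""
    distances = [
        min(math.dist(h, p) for p in predicted_fixations)
        for h in human_fixations
    ]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    human = [(0.0, 0.0), (3.0, 4.0)]
    model = [(0.0, 1.0), (3.0, 0.0)]
    print(location_error(human, model))
```

`math.dist` requires Python 3.8 or later.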
If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right).
Figure 4. Prediction error (visual angle, in degrees) of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel (Location Error) shows the error in predicting fixation locations, ignoring sequence. The right panel (Sequence Error) shows the error when sequence is retained before mapping. Error bars are 95% confidence intervals.
The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2-4° and rarely exceed 15° [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5).
Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation.
Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation. This would create a discrepancy in the prediction of the eye movement sequence (Figure 6).
Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model.
By the fourth fixation, the subject finally moves to the next most informative point.
5 Discussion
Our model and the saliency model use the same image information to determine fixation locations, so it is not surprising that they perform roughly similarly in predicting human fixation locations. The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus. We have an implicit “inhibition-of-return” because there is little to be gained by revisiting a point. In addition, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information.
Acknowledgments
Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461.
References
[1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press.
[2] Yarbus (1967). Eye movements and vision. New York: Plenum Press.
[3] Itti & Koch (2000).
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506.
[4] Kadir & Brady (2001). Scale, saliency and image description. International Journal of Computer Vision, 45(2), 83-105.
[5] Parkhurst, Law & Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123.
[6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306.
[7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain.
[8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311.
[9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advances in Neural Information Processing Systems, 12, 834-840.
[10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10(4), 433-436.
[11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr.Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234.
[12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intel, 18(1), 1-14.
[13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a.
[14] Levi, Klein & Aitsebaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977.
[15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469.
[16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513.
[17] Caspi, Beutter & Eckstein (2004).
The time course of visual information accrual guiding eye movement decisions. Proceedings of the National Academy of Sciences, 101(35), 13086-90.
[18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
2 0.83319795 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms that have low complexity, scale to large recognition problems, and are compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented.
3 0.35846683 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
Author: Tanzeem Choudhury, Sumit Basu
Abstract: In this work, we quantitatively investigate the ways in which a given person influences the joint turn-taking behavior in a conversation. After collecting an auditory database of social interactions among a group of twenty-three people via wearable sensors (66 hours of data each over two weeks), we apply speech and conversation detection methods to the auditory streams. These methods automatically locate the conversations, determine their participants, and mark which participant was speaking when. We then model the joint turn-taking behavior as a Mixed-Memory Markov Model [1] that combines the statistics of the individual subjects' self-transitions and the partners' cross-transitions. The mixture parameters in this model describe how much each person's individual behavior contributes to the joint turn-taking behavior of the pair. By estimating these parameters, we thus estimate how much influence each participant has in determining the joint turn-taking behavior. We show how this measure correlates significantly with betweenness centrality [2], an independent measure of an individual's importance in a social network. This result suggests that our estimate of conversational influence is predictive of social influence.
4 0.30953345 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function. Humans do not perceive visual motion veridically. Various psychophysical experiments have shown that the perceived speed of visual stimuli is affected by stimulus contrast, with low contrast stimuli being perceived to move slower than high contrast ones [1, 2]. Computational models have been suggested that can qualitatively explain these perceptual effects. Commonly, they assume the perception of visual motion to be optimal either within a deterministic framework with a regularization constraint that biases the solution toward zero motion [3, 4], or within a probabilistic framework of Bayesian estimation with a prior that favors slow velocities [5, 6]. The solutions resulting from these two frameworks are similar (and in some cases identical), but the probabilistic framework provides a more principled formulation of the problem in terms of meaningful probabilistic components. 
Specifically, Bayesian approaches rely on a likelihood function that expresses the relationship between the noisy measurements and the quantity to be estimated, and a prior distribution that expresses the probability of encountering any particular value of that quantity. A probabilistic model can also provide a richer description, by defining a full probability density over the set of possible “percepts”, rather than just a single value. Numerous analyses of psychophysical experiments have made use of such distributions within the framework of signal detection theory in order to model perceptual behavior [7].
Figure 1: Bayesian model of visual speed perception (prior, likelihood and posterior probability densities over visual speed). a) For a high-contrast stimulus, the likelihood has a narrow width (a high signal-to-noise ratio) and the prior induces only a small shift µ of the mean $\hat{v}$ of the posterior. b) For a low-contrast stimulus, the measurement is noisy, leading to a wider likelihood. The shift µ is much larger and the perceived speed lower than under condition (a).
Previous work has shown that an ideal Bayesian observer model based on Gaussian forms for both likelihood and prior is sufficient to capture the basic qualitative features of global translational motion perception [5, 6]. But the behavior of the resulting model deviates systematically from human perceptual data, most importantly with regard to trial-to-trial variability and the precise form of interaction between contrast and perceived speed. A recent article achieved better fits for the model under the assumption that human contrast perception saturates [8]. In order to advance the theory of Bayesian perception and provide significant constraints on models of neural implementation, it seems essential to constrain quantitatively both the likelihood function and the prior probability distribution.
In previous work, the proposed likelihood functions were derived from the brightness constancy constraint [5, 6] or other generative principles [9]. Also, previous approaches defined the prior distribution based on general assumptions and computational convenience, typically choosing a Gaussian with zero mean, although a Laplacian prior has also been suggested [4]. In this paper, we develop a more general form of Bayesian model for speed perception that can account for trial-to-trial variability. We use psychophysical speed discrimination data in order to constrain both the likelihood and the prior function.
1 Probabilistic Model of Visual Speed Perception
1.1 Ideal Bayesian Observer
Assume that an observer wants to obtain an estimate for a variable v based on a measurement m that she/he performs. A Bayesian observer “knows” that the measurement device is not ideal and therefore, the measurement m is affected by noise. Hence, this observer combines the information gained by the measurement m with a priori knowledge about v. Doing so (and assuming that the prior knowledge is valid), the observer will, on average, perform better in estimating v than by just trusting the measurements m. According to Bayes’ rule
$p(v \mid m) = \frac{1}{\alpha}\, p(m \mid v)\, p(v)$   (1)
the probability of perceiving v given m (posterior) is the product of the likelihood of v for a particular measurement m and the a priori knowledge about the estimated variable v (prior). α is a normalization constant independent of v that ensures that the posterior is a proper probability distribution.
Figure 2: 2AFC speed discrimination experiment. a) Two patches of drifting gratings were displayed simultaneously (motion without movement). The subject was asked to fixate the center cross and decide after the presentation which of the two gratings was moving faster. b) A typical psychometric curve obtained under such a paradigm.
The dots represent the empirical probability that the subject perceived stimulus2 moving faster than stimulus1. The speed of stimulus1 was fixed while $v_2$ was varied. The point of subjective equality, $v_{match}$, is the value of $v_2$ for which $P_{cum} = 0.5$. The threshold velocity $v_{thresh}$ is the velocity for which $P_{cum} = 0.875$.
It is important to note that the measurement m is an internal variable of the observer and is not necessarily represented in the same space as v. The likelihood embodies both the mapping from v to m and the noise in this mapping. So far, we assume that there is a monotonic function $f(v): v \rightarrow v_m$ that maps v into the same space as m (m-space). Doing so allows us to analytically treat m and $v_m$ in the same space. We will later propose a suitable form of the mapping function f(v). An ideal Bayesian observer selects the estimate that minimizes the expected loss, given the posterior and a loss function. We assume a least-squares loss function. Then, the optimal estimate $\hat{v}$ is the mean of the posterior in Equation (1). It is easy to see why this model of a Bayesian observer is consistent with the fact that perceived speed decreases with decreasing contrast. The width of the likelihood varies inversely with the accuracy of the measurements performed by the observer, which presumably decreases with decreasing contrast due to a decreasing signal-to-noise ratio. As illustrated in Figure 1, the shift in perceived speed towards slow velocities grows with the width of the likelihood, and thus a Bayesian model can qualitatively explain the psychophysical results [1].
1.2 Two Alternative Forced Choice Experiment
We would like to examine perceived speeds under a wide range of conditions in order to constrain a Bayesian model. Unfortunately, perceived speed is an internal variable, and it is not obvious how to design an experiment that would allow subjects to express it directly [footnote 1].
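The ideal-observer argument of Section 1.1 can be illustrated numerically with a small sketch. It evaluates the posterior of Equation (1) on a grid and takes its mean as the estimate. The exponential slow-speed prior, the grid range, and the two likelihood widths are assumptions chosen only for illustration; they are not the prior or likelihood fitted in the paper.

```python
import numpy as np

def posterior_mean(m, sigma, prior, v):
    """Posterior-mean speed estimate on a uniform grid v.

    m: measurement, sigma: width of the Gaussian likelihood,
    prior: prior density evaluated on v (need not be normalized).
    """
    likelihood = np.exp(-0.5 * ((m - v) / sigma) ** 2)
    posterior = likelihood * prior
    posterior /= posterior.sum()          # plays the role of 1/alpha in Eq. (1)
    return float((v * posterior).sum())   # least-squares optimal estimate

# Hypothetical setup: speeds from 0 to 20 deg/sec and a prior favoring slow speeds.
v = np.linspace(0.0, 20.0, 4001)
prior = np.exp(-v / 4.0)

# Same measurement, narrow likelihood (high contrast) vs. wide likelihood (low contrast).
v_hat_high_contrast = posterior_mean(8.0, 0.5, prior, v)
v_hat_low_contrast = posterior_mean(8.0, 2.0, prior, v)
```

With these assumed values the wide-likelihood estimate falls further below the measured speed than the narrow-likelihood estimate, which is the qualitative effect shown in Figure 1.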
Perceived speed can only be accessed indirectly by asking the subject to compare the speed of two stimuli. For a given trial, an ideal Bayesian observer in such a two-alternative forced choice (2AFC) experimental paradigm simply decides on the basis of the two trial estimates $\hat{v}_1$ (stimulus1) and $\hat{v}_2$ (stimulus2) which stimulus moves faster. Each estimate $\hat{v}$ is based on a particular measurement m. For a given stimulus with speed v, an ideal Bayesian observer will produce a distribution of estimates $p(\hat{v} \mid v)$ because m is noisy. Over trials, the observer's behavior can be described by classical signal detection theory based on the distributions of the estimates; hence, for example, the probability of perceiving stimulus2 moving faster than stimulus1 is given as the cumulative probability
$P_{cum}(\hat{v}_2 > \hat{v}_1) = \int_0^{\infty} p(\hat{v}_2 \mid v_2) \int_0^{\hat{v}_2} p(\hat{v}_1 \mid v_1)\, d\hat{v}_1\, d\hat{v}_2$   (2)
$P_{cum}$ describes the full psychometric curve. Figure 2b illustrates the measured psychometric curve and its fit from such an experimental situation.
Footnote 1: Although see [10] for an example of determining and even changing the prior of a Bayesian model for a sensorimotor task, where the estimates are more directly accessible.
2 Experimental Methods
We measured matching speeds ($P_{cum} = 0.5$) and thresholds ($P_{cum} = 0.875$) in a 2AFC speed discrimination task. Subjects were presented simultaneously with two circular patches of horizontally drifting sine-wave gratings for the duration of one second (Figure 2a). Patches were 3 deg in diameter, and were displayed at 6 deg eccentricity to either side of a fixation cross. The stimuli had an identical spatial frequency of 1.5 cycle/deg.
One stimulus was considered to be the reference stimulus having one of two different contrast values (c1 = [0.075 0.5]) and one of five different speed values (u1 = [1 2 4 8 12] deg/sec) while the second stimulus (test) had one of five different contrast values (c2 = [0.05 0.1 0.2 0.4 0.8]) and a varying speed that was determined by an interleaved staircase procedure. For each condition there were 96 trials. Conditions were randomly interleaved, including a random choice of stimulus identity (test vs. reference) and motion direction (right vs. left). Subjects were asked to fixate during stimulus presentation and select the faster moving stimulus. The threshold experiment differed only in that auditory feedback was given to indicate the correctness of their decision. This did not change the outcome of the experiment but significantly increased the quality of the data and thus reduced the number of trials needed.
3 Analysis
With the data from the speed discrimination experiments we could in principle apply a parametric fit using Equation (2) to derive the prior and the likelihood, but the optimization is difficult, and the fit might not be well constrained given the amount of data we have obtained. The problem becomes much more tractable given the following weak assumptions:
• We consider the prior to be relatively smooth.
• We assume that the measurement m is corrupted by additive Gaussian noise with a variance whose dependence on stimulus speed and contrast is separable.
• We assume that there is a mapping function $f(v): v \rightarrow v_m$ that maps v into the space of m (m-space). In that space, the likelihood is convolutional, i.e. the noise in the measurement directly defines the width of the likelihood.
These assumptions allow us to relate the psychophysical data to our probabilistic model in a simple way. The following analysis is in the m-space. The point of subjective equality ($P_{cum} = 0.5$) is defined as the point where the expected values of the speed estimates are equal.
We write
$E\langle \hat{v}_{m,1} \rangle = E\langle \hat{v}_{m,2} \rangle$
$v_{m,1} - E\langle \mu_1 \rangle = v_{m,2} - E\langle \mu_2 \rangle$   (3)
where $E\langle \mu \rangle$ is the expected shift of the perceived speed compared to the veridical speed. For the discrimination threshold experiment, the above assumptions imply that the variance $\mathrm{var}\langle \hat{v}_m \rangle$ of the speed estimates $\hat{v}_m$ is equal for both stimuli. Then, (2) predicts that the discrimination threshold is proportional to the standard deviation, thus
$v_{m,2} - v_{m,1} = \gamma \sqrt{\mathrm{var}\langle \hat{v}_m \rangle}$   (4)
where γ is a constant that depends on the threshold criterion $P_{cum}$ and the exact shape of $p(\hat{v}_m \mid v_m)$.
Figure 3: Piece-wise approximation. We perform a parametric fit by assuming the prior to be piece-wise linear and the likelihood to be LogNormal (Gaussian in the m-space).
3.1 Estimating the prior and likelihood
In order to extract the prior and the likelihood of our model from the data, we have to find a generic local form of the prior and the likelihood and relate them to the mean and the variance of the speed estimates. As illustrated in Figure 3, we assume that the likelihood is Gaussian with a standard deviation $\sigma(c, v_m)$. Furthermore, the prior is assumed to be well approximated by a first-order Taylor series expansion over the velocity ranges covered by the likelihood. We parameterize this linear expansion of the prior as $p(v_m) = a v_m + b$. We now can derive a posterior for this local approximation of likelihood and prior and then define the perceived speed shift µ(m). The posterior can be written as
$p(v_m \mid m) = \frac{1}{\alpha}\, p(m \mid v_m)\, p(v_m) = \frac{1}{\alpha} \left[ \exp\left( -\frac{v_m^2}{2\sigma(c, v_m)^2} \right) (a v_m + b) \right]$   (5)
where α is the normalization constant
$\alpha = \int_{-\infty}^{\infty} p(m \mid v_m)\, p(v_m)\, dv_m = b \sqrt{2\pi\, \sigma(c, v_m)^2}$   (6)
We can compute µ(m) as the first order moment of the posterior for a given m.
Exploiting the symmetries around the origin, we find
$\mu(m) = \int_{-\infty}^{\infty} v_m\, p(v_m \mid m)\, dv_m \equiv \frac{a(m)}{b(m)}\, \sigma(c, v_m)^2$   (7)
The expected value of µ(m) is equal to the value of µ at the expected value of the measurement m (which is the stimulus velocity $v_m$), thus
$E\langle \mu \rangle = \mu(m)\big|_{m = v_m} = \frac{a(v_m)}{b(v_m)}\, \sigma(c, v_m)^2$   (8)
Similarly, we derive $\mathrm{var}\langle \hat{v}_m \rangle$. Because the estimator is deterministic, the variance of the estimate only depends on the variance of the measurement m. For a given stimulus, the variance of the estimate can be well approximated by
$\mathrm{var}\langle \hat{v}_m \rangle = \mathrm{var}\langle m \rangle \left( \frac{\partial \hat{v}_m(m)}{\partial m}\Big|_{m = v_m} \right)^2 = \mathrm{var}\langle m \rangle \left( 1 - \frac{\partial \mu(m)}{\partial m}\Big|_{m = v_m} \right)^2 \approx \mathrm{var}\langle m \rangle$   (9)
Under the assumption of a locally smooth prior, the perceived velocity shift remains locally constant. The variance of the perceived speed $\hat{v}_m$ becomes equal to the variance of the measurement m, which is the variance of the likelihood (in the m-space), thus
$\mathrm{var}\langle \hat{v}_m \rangle = \sigma(c, v_m)^2$   (10)
With (3) and (4), the above derivations relate the psychophysical data to the local parameters of the likelihood and the prior in a simple way.
3.2 Choosing a logarithmic speed representation
We now want to choose the appropriate mapping function f(v) that maps v to the m-space. We define the m-space as the space in which the likelihood is Gaussian with a speed-independent width. We have shown that the discrimination threshold is proportional to the width of the likelihood (4), (10). Also, we know from the psychophysics literature that visual speed discrimination approximately follows a Weber-Fechner law [11, 12], so that the discrimination threshold increases roughly in proportion to speed, and so would the width of the likelihood. A logarithmic speed representation would be compatible with the data and our choice of the likelihood. Hence, we transform the linear speed-domain v into a normalized logarithmic domain according to
$v_m = f(v) = \ln\left( \frac{v + v_0}{v_0} \right)$   (11)
where $v_0$ is a small normalization constant.
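Two steps of the analysis above lend themselves to a quick numerical check, sketched here under stated assumptions. The first function integrates the local posterior of Equation (5) on a grid and returns its first moment, which should agree with the closed form $(a/b)\,\sigma^2$ of Equation (7). The second pair implements the mapping of Equation (11) and its inverse. The function names, the integration range of ten standard deviations, and the value `v0 = 0.3` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def shift_numeric(a, b, sigma, half_width=10.0, n=200001):
    """First moment of the posterior exp(-v^2 / 2 sigma^2) * (a v + b).

    Uses centered coordinates (likelihood peak at the origin), as in Eq. (5).
    The linear prior a*v + b must stay positive over the integration range.
    """
    v = np.linspace(-half_width * sigma, half_width * sigma, n)
    posterior = np.exp(-v**2 / (2.0 * sigma**2)) * (a * v + b)
    return float((v * posterior).sum() / posterior.sum())

def to_log_speed(v, v0=0.3):
    """Eq. (11): map linear speed v to the normalized logarithmic m-space."""
    return np.log((np.asarray(v, dtype=float) + v0) / v0)

def from_log_speed(vm, v0=0.3):
    """Inverse of Eq. (11): map m-space values back to linear speed."""
    return v0 * (np.exp(np.asarray(vm, dtype=float)) - 1.0)

# Compare the numeric first moment with the closed form (a / b) * sigma^2.
a, b, sigma = -0.1, 1.0, 0.5
mu_numeric = shift_numeric(a, b, sigma)
mu_closed_form = (a / b) * sigma**2
```

The normalization by `v0` makes zero speed map to zero in m-space, while speeds much larger than `v0` are represented approximately logarithmically.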
The normalization is chosen to account for the expected deviation from equal-variance behavior at the low end. Surprisingly, it has been found that neurons in the Medial Temporal area (Area MT) of macaque monkeys have speed-tuning curves that are very well approximated by Gaussians of constant width in the above normalized logarithmic space [13]. These neurons are known to play a central role in the representation of motion. It seems natural to assume that they are strongly involved in tasks such as our psychophysical experiments.
4 Results
Figure 4 shows the contrast-dependent shift of speed perception and the speed discrimination threshold data for two subjects. Data points connected with a dashed line represent the relative matching speed ($v_2 / v_1$) for a particular contrast value c2 of the test stimulus as a function of the speed of the reference stimulus. Error bars are the empirical standard deviation of fits to bootstrapped samples of the data. Clearly, low contrast stimuli are perceived to move slower. The effect, however, varies across the tested speed range and tends to become smaller for higher speeds. The relative discrimination thresholds for two different contrasts as a function of speed show that the Weber-Fechner law holds only approximately. The data are in good agreement with other data from the psychophysics literature [1, 11, 8]. For each subject, data from both experiments were used to compute a parametric least-squares fit according to (3), (4), (7), and (10). In order to test the assumption of a LogNormal likelihood we allowed the standard deviation to be dependent on contrast and speed, thus $\sigma(c, v_m) = g(c)\,h(v_m)$. We split the speed range into six bins (subject2: five) and parameterized $h(v_m)$ and the ratio a/b accordingly. Similarly, we parameterized g(c) for the seven contrast values. The resulting fits are superimposed as bold lines in Figure 4.
Figure 5 shows the fitted parametric values for g(c) and h(v) (plotted in the linear domain), and the reconstructed prior distribution p(v) transformed back to the linear domain. The approximately constant values for h(v) provide evidence that a LogNormal distribution is an appropriate functional description of the likelihood. The resulting values for g(c) suggest that the likelihood width decays roughly exponentially with contrast, with strong saturation at higher contrasts.
Figure 4: Speed discrimination data for two subjects. a) The relative matching speed of a test stimulus with different contrast levels (c2 = [0.05 0.1 0.2 0.4 0.8]) to achieve subjective equality with a reference stimulus (two different contrast values c1), plotted against the speed of the reference stimulus [deg/sec]. b) The relative discrimination threshold for two stimuli with equal contrast (c1,2 = [0.075 0.5]), plotted against stimulus speed [deg/sec].
Figure 5: Reconstructed prior distribution and parameters of the likelihood function. The reconstructed prior for both subjects shows much heavier tails than a Gaussian (dashed fit), approximately following a power-law function with exponent n ≈ −1.4 (bold line; panel annotations: n = −1.41 for subject 1, n = −1.35 for subject 2). Panels show p(v) [unnormalized] against speed [deg/sec], g(c) against contrast, and h(v) against speed [deg/sec].
5 Conclusions
We have proposed a probabilistic framework based on a Bayesian ideal observer and standard signal detection theory.
We have derived a likelihood function and prior distribution for the estimator, with a fairly conservative set of assumptions, constrained by psychophysical measurements of speed discrimination and matching. The width of the resulting likelihood is nearly constant in the logarithmic speed domain, and decreases approximately exponentially with contrast. The prior expresses a preference for slower speeds, and approximately follows a power-law distribution, and thus has much heavier tails than a Gaussian. It would be interesting to compare the prior distributions derived here with measured true distributions of local image velocities that impinge on the retina. Although a number of authors have measured the spatio-temporal structure of natural images [e.g., 14], it is clearly difficult to extract the true prior distribution from such measurements because of the feedback loop formed through movements of the body, head and eyes.
Acknowledgments
The authors thank all subjects for their participation in the psychophysical experiments.
References
[1] P. Thompson. Perceived rate of movement depends on contrast. Vision Research, 22:377–380, 1982.
[2] L.S. Stone and P. Thompson. Human speed perception is contrast dependent. Vision Research, 32(8):1535–1549, 1992.
[3] A. Yuille and N. Grzywacz. A computational theory for the perception of coherent visual motion. Nature, 333(5):71–74, May 1988.
[4] Alan Stocker. Constraint Optimization Networks for Visual Motion Perception - Analysis and Synthesis. PhD thesis, Dept. of Physics, Swiss Federal Institute of Technology, Zürich, Switzerland, March 2002.
[5] Eero Simoncelli. Distributed analysis and representation of visual motion. PhD thesis, MIT, Dept. of Electrical Engineering, Cambridge, MA, 1993.
[6] Y. Weiss, E. Simoncelli, and E. Adelson. Motion illusions as optimal percept. Nature Neuroscience, 5(6):598–604, June 2002.
[7] D.M. Green and J.A. Swets. Signal Detection Theory and Psychophysics. Wiley, New York, 1966.
[8] F. Hürlimann, D.
Kiper, and M. Carandini. Testing the Bayesian model of perceived speed. Vision Research, 2002.
[9] Y. Weiss and D.J. Fleet. Probabilistic Models of the Brain, chapter Velocity Likelihoods in Biological and Machine Vision, pages 77–96. Bradford, 2002.
[10] K. Koerding and D. Wolpert. Bayesian integration in sensorimotor learning. Nature, 427(15):244–247, January 2004.
[11] Leslie Welch. The perception of moving plaids reveals two motion-processing stages. Nature, 337:734–736, 1989.
[12] S. McKee, G. Silvermann, and K. Nakayama. Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Research, 26(4):609–619, 1986.
[13] C.H. Anderson, H. Nover, and G.C. DeAngelis. Modeling the velocity tuning of macaque MT neurons. Journal of Vision/VSS abstract, 2003.
[14] D.W. Dong and J.J. Atick. Statistics of natural time-varying images. Network: Computation in Neural Systems, 6:345–358, 1995.
5 0.28233156 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose outputs are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency.
6 0.2736758 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
7 0.2725957 88 nips-2004-Intrinsically Motivated Reinforcement Learning
8 0.26285103 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
9 0.26170135 193 nips-2004-Theories of Access Consciousness
10 0.25550538 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
11 0.24386883 146 nips-2004-Pictorial Structures for Molecular Modeling: Interpreting Density Maps
12 0.24135518 99 nips-2004-Learning Hyper-Features for Visual Identification
13 0.23833017 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
14 0.23572677 109 nips-2004-Mass Meta-analysis in Talairach Space
15 0.22392172 106 nips-2004-Machine Learning Applied to Perception: Decision Images for Gender Classification
16 0.21191972 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
17 0.20324984 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
18 0.20089559 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes
19 0.19547197 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
20 0.19466184 140 nips-2004-Optimal Information Decoding from Neuronal Populations with Specific Stimulus Selectivity
topicId topicWeight
[(13, 0.05), (15, 0.097), (26, 0.032), (31, 0.027), (33, 0.182), (35, 0.057), (36, 0.033), (39, 0.013), (50, 0.016), (52, 0.019), (70, 0.325), (71, 0.015), (87, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.78294277 21 nips-2004-An Information Maximization Model of Eye Movements
Author: Laura W. Renninger, James M. Coughlan, Preeti Verghese, Jitendra Malik
Abstract: We propose a sequential information maximization model as a general strategy for programming eye movements. The model reconstructs high-resolution visual information from a sequence of fixations, taking into account the fall-off in resolution from the fovea to the periphery. From this framework we get a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that minimizes uncertainty (maximizes information) about the stimulus. By comparing our model performance to human eye movement data and to predictions from a saliency and random model, we demonstrate that our model is best at predicting fixation locations. Modeling additional biological constraints will improve the prediction of fixation sequences. Our results suggest that information maximization is a useful principle for programming eye movements. 1 Introduction Since the earliest recordings [1, 2], vision researchers have sought to understand the non-random yet idiosyncratic behavior of volitional eye movements. To do so, we must not only unravel the bottom-up visual processing involved in selecting a fixation location, but we must also disentangle the effects of top-down cognitive factors such as task and prior knowledge. Our ability to predict volitional eye movements provides a clear measure of our understanding of biological vision. One approach to predicting fixation locations is to propose that the eyes move to points that are “salient”. Salient regions can be found by looking for center-surround contrast in visual channels such as color, contrast and orientation, among others [3, 4]. Saliency has been shown to correlate with human fixation locations when observers “look around” an image [5, 6] but it is not clear if saliency alone can explain why some locations are chosen over others and in what order. Task as well as scene or object knowledge will play a role in constraining the fixation locations chosen [7]. 
Observations such as this led to the scanpath theory, which proposed that eye movement sequences are tightly linked to both the encoding and retrieval of specific object memories [8]. 1.1 Our Approach We propose that during natural, active vision, we center our fixation on the most informative points in an image in order to reduce our overall uncertainty about what we are looking at. This approach is intuitive and may be biologically plausible, as outlined by Lee & Yu [9]. The most informative point will depend on both the observer’s current knowledge of the stimulus and the task. The quality of the information gathered with each fixation will depend greatly on human visual resolution limits. This is the reason we must move our eyes in the first place, yet it is often ignored. A sequence of eye movements may then be understood within a framework of sequential information maximization. 2 Human eye movements We investigated how observers examine a novel shape when they must rely heavily on bottom-up stimulus information. Because eye movements will be affected by the task of the observer, we constructed a learn-discriminate paradigm. Observers are asked to carefully study a shape and then discriminate it from a highly similar one. 2.1 Stimuli and Design We use novel silhouettes to reduce the influence of object familiarity on the pattern of eye movements and to facilitate our computations of information in the model. Each silhouette subtends 12.5º to ensure that its entire shape cannot be characterized with a single fixation. During the learning phase, subjects first fixated a marker and then pressed a button to cue the appearance of the shape which appeared 10º to the left or right of fixation. Subjects maintained fixation for 300ms, allowing for a peripheral preview of the object. When the fixation marker disappeared, subjects were allowed to study the object for 1.2 seconds while their eye movements were recorded. 
During the discrimination phase, subjects were asked to select the shape they had just studied from a highly similar shape pair (Figure 1). Performance was near 75% correct, indicating that the task was challenging yet feasible. Subjects saw 140 shapes and were given auditory feedback. Figure 1. Temporal layout of a trial during the learning phase (left): fixate and initiate trial; maintain fixation (300ms); release fixation and view object freely (1200ms). Discrimination of learned shape from a highly similar one (right): which shape is a match? 2.2 Apparatus Right eye position was measured with an SRI Dual Purkinje Image eye tracker while subjects viewed the stimulus binocularly. Head position was fixed with a bitebar. A 25 dot grid that covered the extent of the presentation field was used for calibration. The points were measured one at a time with each dot being displayed for 500ms. The stimuli were presented using the Psychtoolbox software [10]. 3 Model We wish to create a model that builds a representation of a shape silhouette given imperfect visual information, and which updates its representation as new visual information is acquired. The model will be defined statistically so as to explicitly encode uncertainty about the current knowledge of the shape silhouette. We will use this model to generate a simple rule for predicting fixation sequences: after each fixation, fixate next at the location that will decrease the model’s uncertainty as much as possible. Similar approaches have been described in an ideal observer model for reading [11], an information maximization algorithm for tracking contours in cluttered images [12] and predicting fixation locations during object learning [13]. 3.1 Representing information The information in a silhouette clearly resides at its contour, which we represent with a collection of points and associated tangent orientations. These points and their associated orientations are called edgelets, denoted e1, e2, ... 
eN, where N is the total number of edgelets along the boundary. Each edgelet ei is defined as a triple ei=(xi, yi, zi) where (xi, yi) is the 2D location of the edgelet and zi is the orientation of the tangent to the boundary contour at that point. zi can assume any of Q possible values 1, 2, …, Q, representing a discretization of Q possible orientations ranging from 0 to π, and we have chosen Q=8 in our experiments. The goal of the model is to infer the most likely orientation values given the visual information provided by one or more fixations. 3.2 Updating knowledge The visual information is based on indirect measurements of the true edgelet values e1, e2, ... eN. Although our model assumes complete knowledge of the number N and locations (xi, yi) of the edgelets, it does not have direct access to the orientations zi. [Footnote 1: Although the visual system does not have precise knowledge of location coordinates, the model is greatly simplified by assuming this knowledge. It is reasonable to expect that location uncertainty will be highly correlated with orientation uncertainty, so that the inclusion of location should not greatly affect the model's decisions of where to fixate next.] Orientation information is instead derived from measurements that summarize the local frequency of occurrence of edgelet orientations, averaged locally over a coarse scale (corresponding to the spatial scale at which resolution is limited by the human visual system). These coarse measurements provide indirect information about individual edgelet orientations, which may not uniquely determine the orientations. We will use a simple statistical model to estimate the distribution of individual orientation values conditioned on this information. Our measurements are defined to model the resolution limitations of the human visual system, with highest resolution at the fovea and lower resolution in the periphery. 
Distance to the fovea is measured as eccentricity $E$, the visual angle between any point and the fovea. If $\vec{x} = (x, y)$ is the location of a point in an image and $\vec{f} = (f_x, f_y)$ is the fixation (i.e. foveal) location in the image, then the eccentricity is $E = \|\vec{x} - \vec{f}\|$, measured in units of visual degrees. The effective resolution of orientation discrimination falls with increasing eccentricity as $r(E) = FPH\,(E + E_2)$, where $r(E)$ is an effective radius over which the visual system spatially pools information, with $FPH = 0.1$ and $E_2 = 0.8$ [14]. Our model represents pooled information as a histogram of edge orientations within the effective radius. For each edgelet $e_i$ we define the histogram of all edgelet orientations $e_j$ within radius $r_i = r(E)$ of $e_i$, where $E$ is the eccentricity of $\vec{x}_i = (x_i, y_i)$ relative to the current fixation $\vec{f}$, i.e. $E = \|\vec{x}_i - \vec{f}\|$. To define the histogram more precisely we will introduce the neighborhood set $N_i$ of all indices $j$ corresponding to edgelets within radius $r_i$ of $e_i$: $N_i = \{\text{all } j \text{ s.t. } \|\vec{x}_i - \vec{x}_j\| \le r_i\}$, with number of neighborhood edgelets $|N_i|$. The (normalized) histogram centered at edgelet $e_i$ is then defined as $h_{iz} = \frac{1}{|N_i|} \sum_{j \in N_i} \delta_{z, z_j}$, which is the proportion of edgelet orientations that assume value $z$ in the (eccentricity-dependent) neighborhood of edgelet $e_i$ (footnote 2). Figure 2. Relation between eccentricity $E$ and radius $r(E)$ of the neighborhood (disk) which defines the local orientation histogram ($h_{iz}$). Left and right panels show two fixations for the same object. Up to this point we have restricted ourselves to the case of a single fixation. To designate a sequence of multiple fixations we will index them by $k = 1, 2, \ldots, K$ (for $K$ total fixations). The $k$th fixation location is denoted by $\vec{f}^{(k)} = (f_x^{k}, f_y^{k})$. The quantities $r_i$, $N_i$ and $h_{iz}$ depend on fixation location, and so to make this dependence explicit we will augment them with superscripts as $r_i^{(k)}$, $N_i^{(k)}$, and $h_{iz}^{(k)}$. 
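The pooling radius and the local orientation histogram for a single fixation can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the (x, y, z) tuple layout and the brute-force neighbor search are assumptions made here, while FPH = 0.1, E2 = 0.8 and Q = 8 are the values quoted in the text.

```python
import math

FPH = 0.1  # value quoted in the text
E2 = 0.8   # value quoted in the text


def pooling_radius(eccentricity):
    """Effective pooling radius r(E) = FPH * (E + E2), in visual degrees."""
    return FPH * (eccentricity + E2)


def local_histograms(edgelets, fixation, q=8):
    """Return one normalized orientation histogram per edgelet.

    edgelets: list of (x, y, z) with z in 1..q (hypothetical layout).
    fixation: (fx, fy), in the same units (visual degrees) as x and y.
    """
    fx, fy = fixation
    histograms = []
    for xi, yi, _ in edgelets:
        ecc = math.hypot(xi - fx, yi - fy)      # eccentricity of edgelet i
        radius = pooling_radius(ecc)
        counts = [0] * q
        for xj, yj, zj in edgelets:             # brute-force neighborhood N_i
            if math.hypot(xi - xj, yi - yj) <= radius:
                counts[zj - 1] += 1
        total = sum(counts)                     # |N_i| >= 1: i is its own neighbor
        histograms.append([c / total for c in counts])
    return histograms
```

Because every edgelet lies in its own neighborhood, each histogram gives nonzero mass to that edgelet's true orientation, and the histogram sharpens toward a single bin as the fixation moves closer.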
[Footnote 2: $\delta_{x,y}$ is the Kronecker delta function, defined to equal 1 if $x = y$ and 0 if $x \neq y$.] Now we describe the statistical model of edgelet orientations given information obtained from multiple fixations. Ideally we would like to model the exact distribution of orientations conditioned on the histogram data: $P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\})$, where $\{h_{iz}^{(k)}\}$ represents all histogram components $z$ at every edgelet $e_i$ for fixation $\vec{f}^{(k)}$. This exact distribution is intractable, so we will use a simple approximation. We assume the distribution factors over individual edgelets: $P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K)}\}) = \prod_{i=1}^{N} g_i(z_i)$, where $g_i(z_i)$ is the marginal distribution of orientation $z_i$. Determining these marginal distributions is still difficult even with the factorization assumption, so we will make an additional approximation: $g_i(z_i) = \frac{1}{Z_i} \prod_{k=1}^{K} h_{i z_i}^{(k)}$, where $Z_i$ is a suitable normalization factor. This approximation corresponds to treating $h_{iz}^{(k)}$ as a likelihood function over $z$, with independent likelihoods for each fixation $k$. While the approximation has some undesirable properties (such as making the marginal distribution $g_i(z_i)$ more peaked if the same fixation is made repeatedly), it provides a simple mechanism for combining histogram evidence from multiple, distinct fixations. 3.3 Selecting the next fixation Given the past $K$ fixations, the next fixation $\vec{f}^{(K+1)}$ is chosen to minimize the model entropy of the edgelet orientations. In other words, $\vec{f}^{(K+1)}$ is chosen to minimize $H(\vec{f}^{(K+1)}) = \mathrm{entropy}[P(z_1, z_2, \ldots, z_N \mid \{h_{iz}^{(1)}\}, \{h_{iz}^{(2)}\}, \ldots, \{h_{iz}^{(K+1)}\})]$, where the entropy of a distribution $P(x)$ is defined as $-\sum_x P(x) \log P(x)$. 
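The update and selection rules above might be sketched as follows. This is a hedged illustration rather than the authors' code: it relies on the standard fact that the entropy of a factorized distribution is the sum of its marginal entropies, and the names `combine_marginal`, `model_entropy`, `select_next_fixation`, the caller-supplied `histograms_at` callback and the `candidates` list are all hypothetical.

```python
import math


def combine_marginal(hists):
    """g_i(z): normalized product over fixations of one edgelet's histograms."""
    prod = [1.0] * len(hists[0])
    for h in hists:
        prod = [p * hz for p, hz in zip(prod, h)]
    z_norm = sum(prod)                       # normalization factor Z_i
    if z_norm == 0.0:
        raise ValueError("histograms share no orientation with nonzero mass")
    return [p / z_norm for p in prod]


def entropy(p):
    """-sum p log p, skipping zero-probability bins."""
    return -sum(pz * math.log(pz) for pz in p if pz > 0.0)


def model_entropy(hists_per_fixation):
    """Entropy of the factorized model: sum of per-edgelet marginal entropies.

    hists_per_fixation[k][i] is the histogram of edgelet i at fixation k.
    """
    n_edgelets = len(hists_per_fixation[0])
    return sum(
        entropy(combine_marginal([hk[i] for hk in hists_per_fixation]))
        for i in range(n_edgelets)
    )


def select_next_fixation(past_hists, candidates, histograms_at):
    """Pick the candidate whose added histograms give the lowest model entropy.

    histograms_at(candidate) is a hypothetical callback returning the
    per-edgelet histograms that a fixation at `candidate` would produce.
    """
    best, best_h = None, float("inf")
    for cand in candidates:
        h = model_entropy(past_hists + [histograms_at(cand)])
        if h < best_h:
            best, best_h = cand, h
    return best, best_h
```

Note that the sketch evaluates the actual entropy resulting from each candidate, so it needs the true histograms for every candidate location before choosing.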
In practice, we minimize the entropy by evaluating it across a set of candidate locations $\vec{f}^{(K+1)}$ which forms a regularly sampled grid across the image. [Footnote 3: This rule evaluates the entropy resulting from every possible next fixation before making a decision. Although this rule is suitable for our modeling purposes, it would be inefficient to implement in a biological or machine vision system. A practical decision rule would use current knowledge to estimate the expected (rather than actual) entropy.] We note that this selection rule makes decisions that depend, in general, on the full history of previous K fixations. 4 Results Figure 3 shows an example of one observer’s eye movements superimposed over the shape (top row), the prediction from a saliency model (middle row) [3] and the prediction from the information maximization model (bottom row). The information maximization model updates its prediction after each fixation. An ideal sequence of fixations can be generated by both models. The saliency model selects fixations in order of decreasing salience. The information maximization model selects the maximally informative point after incorporating information from the previous fixations. To provide an additional benchmark, we also implemented a model that selects fixations at random. Figure 3. Example eye movement pattern, superimposed over the stimulus (top row), saliency map (middle row) and information maximization map (bottom row). One way to quantify the performance is to map a subject’s fixations onto the closest model predicted fixation locations, ignoring the sequence in which they were made. In this analysis, both the saliency and information maximization models are significantly better than random at predicting candidate locations (p < 0.05; t-test) for three observers (Figure 4, left). The information maximization model performs slightly but significantly better than the saliency model for two observers (lm, kr). 
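One possible reading of the sequence-ignoring location error is sketched below. The text does not say whether the mapping is one-to-one or how errors are aggregated, so this sketch assumes a many-to-one nearest-neighbor match and a simple mean; the function name `location_error` is hypothetical.

```python
import math


def location_error(human_fixations, predicted_fixations):
    """Mean distance from each human fixation to its nearest predicted one.

    Both arguments are lists of (x, y) in visual degrees. Order is ignored,
    and several human fixations may map to the same prediction (an assumption).
    """
    errors = []
    for hx, hy in human_fixations:
        nearest = min(
            math.hypot(hx - px, hy - py) for px, py in predicted_fixations
        )
        errors.append(nearest)
    return sum(errors) / len(errors)
```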
If we match fixation locations while retaining the sequence, errors become quite large, indicating that the models cannot account for the observed behavior (Figure 4, right). Figure 4. Prediction error (visual angle, deg) of three models: random (R), saliency (S) and information maximization (I) for three observers (pv, lm, kr). The left panel (Location Error) shows the error in predicting fixation locations, ignoring sequence. The right panel (Sequence Error) shows the error when sequence is retained before mapping. Error bars are 95% confidence intervals. The information maximization model incorporates resolution limitations, but there are further biological constraints that must be considered if we are to build a model that can fully explain human eye movement patterns. First, saccade amplitudes are typically around 2-4º and rarely exceed 15º [15]. When we move our eyes, the image of the visual world is smeared across the retina and our perception of it is actively suppressed [16]. Shorter saccade lengths may be a mechanism to reduce this cost. This biological constraint would cause a fixation to fall short of the prediction if it is distant from the current fixation (Figure 5). Figure 5. Cost of moving the eyes. Successive fixations may fall short of the maximally salient or informative point if it is very distant from the current fixation. Second, the biological system may increase its sampling efficiency by planning a series of saccades concurrently [17, 18]. Several fixations may therefore be made before sampled information begins to influence target selection. The information maximization model currently updates after each fixation. This would create a discrepancy in the prediction of the eye movement sequence (Figure 6). Figure 6. Three fixations are made to a location that is initially highly informative according to the information maximization model. 
By the fourth fixation, the subject finally moves to the next most informative point. 5 Discussion Our model and the saliency model are using the same image information to determine fixation locations, thus it is not surprising that they are roughly similar in their performance of predicting human fixation locations. The main difference is how we decide to “shift attention” or program the sequence of eye movements to these locations. The saliency model uses a winner-take-all and inhibition-of-return mechanism to shift among the salient regions. We take a completely different approach by saying that observers adopt a strategy of sequential information maximization. In effect, the history of where we have been matters because our model is continually collecting information from the stimulus. We have an implicit “inhibition-of-return” because there is little to be gained by revisiting a point. Second, we attempt to take biological resolution limits into account when determining the quality of information gained with each fixation. By including additional biological constraints such as the cost of making large saccades and the natural time course of information update, we may be able to improve our prediction of eye movement sequences. We have shown that the programming of eye movements can be understood within a framework of sequential information maximization. This framework is portable to any image or task. A remaining challenge is to understand how different tasks constrain the representation of information and to what degree observers are able to utilize the information. Acknowledgments Smith-Kettlewell Eye Research Institute, NIH Ruth L. Kirschstein NRSA, ONR #N0001401-1-0890, NSF #IIS0415310, NIDRR #H133G030080, NASA #NAG 9-1461. References [1] Buswell (1935). How people look at pictures. Chicago: The University of Chicago Press. [2] Yarbus (1967). Eye movements and vision. New York: Plenum Press. [3] Itti & Koch (2000). 
A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 40, 1489-1506. [4] Kadir & Brady (2001). Scale, saliency and image description. International Journal of Computer Vision, 45(2), 83-105. [5] Parkhurst, Law & Niebur (2002). Modeling the role of salience in the allocation of overt visual attention. Vision Research, 42(1), 107-123. [6] Nothdurft (2002). Attention shifts to salient targets. Vision Research, 42, 1287-1306. [7] Oliva, Torralba, Castelhano & Henderson (2003). Top-down control of visual attention in object detection. Proceedings of the IEEE International Conference on Image Processing, Barcelona, Spain. [8] Noton & Stark (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308-311. [9] Lee & Yu (2000). An information-theoretic framework for understanding saccadic behaviors. Advances in Neural Information Processing Systems, 12, 834-840. [10] Brainard (1997). The psychophysics toolbox. Spatial Vision, 10 (4), 433-436. [11] Legge, Hooven, Klitz, Mansfield & Tjan (2002). Mr.Chips 2002: new insights from an ideal-observer model of reading. Vision Research, 42, 2219-2234. [12] Geman & Jedynak (1996). An active testing model for tracking roads in satellite images. IEEE Trans. Pattern Analysis and Machine Intel, 18(1), 1-14. [13] Renninger & Malik (2004). Sequential information maximization can explain eye movements in an object learning task. Journal of Vision, 4(8), 744a. [14] Levi, Klein & Aitsebaomo (1985). Vernier acuity, crowding and cortical magnification. Vision Research, 25(7), 963-977. [15] Bahill, Adler & Stark (1975). Most naturally occurring human saccades have magnitudes of 15 degrees or less. Investigative Ophthalmology, 14, 468-469. [16] Burr, Morrone & Ross (1994). Selective suppression of the magnocellular visual pathway during saccadic eye movements. Nature, 371, 511-513. [17] Caspi, Beutter & Eckstein (2004). 
The time course of visual information accrual guiding eye movement decisions. Proceedings of the Nat’l Academy of Science, 101(35), 13086-90. [18] McPeek, Skavenski & Nakayama (2000). Concurrent processing of saccades in visual search. Vision Research, 40, 2499-2516.
2 0.55646449 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
Author: Dashan Gao, Nuno Vasconcelos
Abstract: Saliency mechanisms play an important role when visual recognition must be performed in cluttered scenes. We propose a computational definition of saliency that deviates from existing models by equating saliency to discrimination. In particular, the salient attributes of a given visual class are defined as the features that enable best discrimination between that class and all other classes of recognition interest. It is shown that this definition leads to saliency algorithms of low complexity, that are scalable to large recognition problems, and is compatible with existing models of early biological vision. Experimental results demonstrating success in the context of challenging recognition problems are also presented.
3 0.55488098 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
Author: Daniela D. Farias, Benjamin V. Roy
Abstract: We introduce a new algorithm based on linear programming that approximates the differential value function of an average-cost Markov decision process via a linear combination of pre-selected basis functions. The algorithm carries out a form of cost shaping and minimizes a version of Bellman error. We establish an error bound that scales gracefully with the number of states without imposing the (strong) Lyapunov condition required by its counterpart in [6]. We propose a path-following method that automates selection of important algorithm parameters which represent counterparts to the “state-relevance weights” studied in [6].
4 0.54827374 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
Author: Zhengdong Lu, Todd K. Leen
Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.
5 0.54744673 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
Author: Máté Lengyel, Peter Dayan
Abstract: Areas of the brain involved in various forms of memory exhibit patterns of neural activity quite unlike those in canonical computational models. We show how to use well-founded Bayesian probabilistic autoassociative recall to derive biologically reasonable neuronal dynamics in recurrently coupled models, together with appropriate values for parameters such as the membrane time constant and inhibition. We explicitly treat two cases. One arises from a standard Hebbian learning rule, and involves activity patterns that are coded by graded firing rates. The other arises from a spike timing dependent learning rule, and involves patterns coded by the phase of spike times relative to a coherent local field potential oscillation. Our model offers a new and more complete understanding of how neural dynamics may support autoassociation.
6 0.5472061 54 nips-2004-Distributed Information Regularization on Graphs
7 0.54695427 77 nips-2004-Hierarchical Clustering of a Mixture Model
8 0.54633796 44 nips-2004-Conditional Random Fields for Object Recognition
9 0.54584122 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
10 0.54582089 179 nips-2004-Surface Reconstruction using Learned Shape Models
11 0.54564047 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
12 0.54287171 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
13 0.54264784 127 nips-2004-Neighbourhood Components Analysis
14 0.54213285 161 nips-2004-Self-Tuning Spectral Clustering
15 0.54128456 99 nips-2004-Learning Hyper-Features for Visual Identification
16 0.54112184 166 nips-2004-Semi-supervised Learning via Gaussian Processes
17 0.54087335 125 nips-2004-Multiple Relational Embedding
18 0.54081655 145 nips-2004-Parametric Embedding for Class Visualization
19 0.54039145 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
20 0.53985143 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data